An ESTARFM Fusion Framework for the Generation of Large-Scale Time Series in Cloud-Prone and Heterogeneous Landscapes

Monitoring the spatio-temporal development of vegetation is a challenging task in heterogeneous and cloud-prone landscapes. No single satellite sensor has thus far been able to provide consistent time series of high temporal and spatial resolution for such areas. In order to overcome this problem, data fusion algorithms such as the Enhanced Spatial and Temporal Adaptive Reflectance Fusion Model (ESTARFM) have been established and frequently used in recent years to generate high-resolution time series. In order to make it applicable to larger scales and to increase the input data availability especially in cloud-prone areas, an ESTARFM framework was developed in this study introducing several enhancements. An automatic filling of cloud gaps was included in the framework to make best use of available, even partly cloud-covered Landsat images. Furthermore, the ESTARFM algorithm was enhanced to automatically account for regional differences in the heterogeneity of the study area. The generation of time series was automated and the processing speed was accelerated significantly by parallelization. To test the performance of the developed ESTARFM framework, MODIS and Landsat-8 data were fused for generating an 8-day NDVI time series for a study area of approximately 98,000 km2 in West Africa. The results show that the ESTARFM framework can accurately produce high temporal resolution time series (average MAE (mean absolute error) of 0.02 for the dry season and 0.05 for the vegetative season) while keeping the spatial detail in such a heterogeneous, cloud-prone region. The developments introduced within the ESTARFM framework establish the basis for large-scale research on various geoscientific questions related to land degradation, changes in land surface phenology or agriculture.


Introduction
Monitoring the spatio-temporal development of vegetation is a challenging task in heterogeneous and cloud-prone landscapes [1]. Consistent time series of high temporal and spatial resolution satellite imagery could substantially improve various regional analyses of land degradation, changes in land surface phenology (LSP), or changes in extent of agricultural area. However, no single satellite sensor has yet been able to provide time series to sufficiently monitor the vegetation development in these challenging areas. Moderate spatial resolution sensors such as MODIS (Moderate Resolution Imaging Spectroradiometer), MERIS (Medium Resolution Imaging Spectrometer) and SPOT-VEGETATION (Satellite Pour l' Observation de la Terre) provide information on the temporal development of LSP while they cannot sufficiently capture small-scale mosaics of different land cover types. On the other hand, current medium spatial resolution sensors such as Landsat provide the spatial detail necessary for various approaches in heterogeneous landscapes. Yet, in cloud-prone areas and generally for the first Landsat missions which did not provide frequent images globally, data availability is insufficient to assess the various growth stages and temporal changes of vegetation over the growing cycle. To overcome this classic discrepancy in remote sensing, fusing data from different sensors has gained increasing popularity in recent years for combining high temporal resolution sensors with medium to high spatial resolution sensors. Especially the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) [2] and its enhanced version ESTARFM [3] have been frequently applied as described in the literature (e.g., [4][5][6][7][8]). Recent comparison studies highlighted the respective advantages of the two algorithms compared to other simple fusion approaches, but the overall accuracies are found to be highest for ESTARFM [9,10]. Furthermore, in heterogeneous landscapes, ESTARFM has been documented to reproduce the small-scale objects of the land surface more accurately than the original STARFM [10].
However, thus far, data fusion with these two algorithms has only been applied to relatively small study areas corresponding to a single Landsat scene (~33,300 km 2 ) or less, and the vast majority of studies focused on regions in North America (USA, Canada), Australia and China. Gao et al. [11] investigated the use of STARFM time series for the monitoring of crop condition in a study area of 225 km 2 in central Iowa, USA. They found the synthetic time series derived from MODIS and Landsat to be well suited to the task and could even distinguish different crop types (soybean and corn) and forest areas. Focusing on forest disturbance and regrowth, Schmidt et al. [12] generated a 12-year time series with STARFM for a 2960 km 2 study area in Queensland, Australia. By analyzing the time series with the Breaks for Additive Seasonal and Trend (BFAST) algorithm [13], they identified forest-clearing events and monitored the regrowing vegetation. In their conclusion, the value of such data fusion for land management and forest conversion was highlighted [12]. Tewes et al. [5] fused 250 m MODIS and 5 m RapidEye data using ESTARFM to monitor vegetation dynamics in semi-arid rangelands of South Africa (study area: 1434 km 2 ). Their results indicate the suitability of ESTARFM to characterize phenological development of different vegetation types in such heterogeneous areas. Except for the latter study, ESTARFM in particular has only been used in performance and feasibility studies that show the potential of the algorithm for different purposes like urban flood mapping or crop monitoring, sometimes in comparison to other data fusion algorithms [8][9][10]14]. However, to the best of the authors' knowledge, there have not yet been any studies applying ESTARFM to investigating specific geoscientific questions and for larger study areas. The frequent use reflects the high interest in this fusion method, but it also indicates that there might be limitations which have so far hindered broader applications.
This study identifies possible limitations of the ESTARFM algorithm and aims to improve these for a better applicability on large spatial scales (e.g., more than one Landsat tile). For this purpose, the algorithm was implemented in a framework facilitating the generation of time series and automatically accounting for regional differences in the heterogeneity of the study area. A further aim was to enhance the application of ESTARFM in cloud-prone areas by improving the availability of useable input Landsat data. Lastly, the benefits of the developed ESTARFM framework are demonstrated by applying it to the generation of a 30 m NDVI time series for a large, cloud-prone, and heterogeneous region of West Africa.

Study Area
The study region is located in the transboundary zone of southern Burkina Faso and northern Ghana (3˝10 1 33"W-0˝11 1 31"W/9˝29 1 4"N-12˝24 1 9"N) with an approximate size of 98,000 km 2 ( Figure 1). The climate is tropical and strongly seasonal with distinct rainy and dry seasons. The rainy season lasts for about four to five months between May and October with mean annual rainfall ranging Remote Sens. 2016, 8,425 3 of 21 from 700 to 1200 mm. The temperatures are relatively stable over the year with monthly averages between 25˝C and 32˝C. The study area is located in the West Sudanian Savanna ecoregion with vegetation types ranging from open grassland to closed woodland and forests [15]. Dense woody vegetation or forests mainly occur along rivers or in protected areas such as the Nazinga game ranch, Sissili classified forest, the Kaboré-Tambi National Park, or the Mole National Park. Outside these protected areas, agricultural use is predominant, e.g., around Kaboré-Tambi National Park, a strong increase in agricultural area (from about 30% to 60%) between 2001 and 2013 has been documented [16]. The agriculture in the study area mostly consists of small-scale rain-fed crops with average field sizes around 1 ha and a low productivity (1.15 t¨ha´1¨year´1 cereals for Burkina Faso) as well as livestock grazing [17,18]. The small-scale mosaic of different land cover and land use types (mainly semi-natural vegetation and croplands) make this study region spatially very heterogeneous [18]. An important landmark in the northeast is Lake Bagré which retains the White Volta for hydro-power generation but also supplies water to irrigated agricultural areas downstream of the dam. forests mainly occur along rivers or in protected areas such as the Nazinga game ranch, Sissili classified forest, the Kaboré-Tambi National Park, or the Mole National Park. Outside these protected areas, agricultural use is predominant, e.g., around Kaboré-Tambi National Park, a strong increase in agricultural area (from about 30% to 60%) between 2001 and 2013 has been documented [16]. The agriculture in the study area mostly consists of small-scale rain-fed crops with average field sizes around 1 ha and a low productivity (1.15 t·ha −1 ·year −1 cereals for Burkina Faso) as well as livestock grazing [17,18]. The small-scale mosaic of different land cover and land use types (mainly semi-natural vegetation and croplands) make this study region spatially very heterogeneous [18]. An important landmark in the northeast is Lake Bagré which retains the White Volta for hydro-power generation but also supplies water to irrigated agricultural areas downstream of the dam.
In some parts of the study area, the rainy season triggers flood events, whereas fires frequently occur during the dry season. In summary, the study area is highly dynamic and heterogeneous in space and time with expanding small-scale agriculture and a rainy and dry season cycle that strongly influences the seasonal development of vegetation.  (194/52, 194/53, 195/52, 195/53) with two MODIS tiles (h17v07, h17v08). Underlying land cover classification has a 250 m spatial resolution for the year 2006 (data source: Gessner et al. [19]).

Landsat-8 Operational Land Imager (OLI)
Landsat-8 Operational Land Imager (OLI) surface reflectance scenes were obtained from the USGS Landsat Archive [20] distributed on four tiles (195/52, 195/53, 194/52 and 194/53) and covering the period from December 2013 to January 2015 ( Figure 2). The datasets contain surface reflectances atmospherically corrected with the L8SR algorithm [21] and complemented by cloud and cloud shadow masks based on CFmask [21], an open source C code version of the Fmask algorithm [22]. The masks were applied to the reflectance data prior to the calculation of NDVI (Figure 3). After a thorough inspection of the available datasets, scenes with a cloud cover of more than 40% were excluded from the analysis; thus, a total number of 38 Landsat images were used for the fusion.  (194/52, 194/53, 195/52, 195/53) with two MODIS tiles (h17v07, h17v08). Underlying land cover classification has a 250 m spatial resolution for the year 2006 (data source: Gessner et al. [19]).
In some parts of the study area, the rainy season triggers flood events, whereas fires frequently occur during the dry season. In summary, the study area is highly dynamic and heterogeneous in space and time with expanding small-scale agriculture and a rainy and dry season cycle that strongly influences the seasonal development of vegetation.

Landsat-8 Operational Land Imager (OLI)
Landsat-8 Operational Land Imager (OLI) surface reflectance scenes were obtained from the USGS Landsat Archive [20] distributed on four tiles (195/52, 195/53, 194/52 and 194/53) and covering the period from December 2013 to January 2015 ( Figure 2). The datasets contain surface reflectances atmospherically corrected with the L8SR algorithm [21] and complemented by cloud and cloud shadow masks based on CFmask [21], an open source C code version of the Fmask algorithm [22]. The masks were applied to the reflectance data prior to the calculation of NDVI (Figure 3). After a thorough inspection of the available datasets, scenes with a cloud cover of more than 40% were excluded from the analysis; thus, a total number of 38 Landsat images were used for the fusion. However, the threshold of cloud cover can be defined by the user individually based on expert knowledge of the respective study region. However, the threshold of cloud cover can be defined by the user individually based on expert knowledge of the respective study region.

Figure 2.
Overview of Landsat data used in this study distributed over four tiles for the period December 2013 to January 2015. Upper part shows the weekly precipitation sums over the year 2014 averaged over three rainfall stations in the study area [23,24].  However, the threshold of cloud cover can be defined by the user individually based on expert knowledge of the respective study region.   For the same period as the Landsat data, daily, 500 m MODIS nadir bidirectional reflectance distribution function (BRDF)-adjusted reflectance datasets (MCD43A4) have been obtained in the product collection 6 for the tiles h17v07 and h17v08 ( Figure 1). A major improvement of MCD43A4 collection 6 is that the production frequency has been increased from 8 days to daily, providing maximum flexibility in the definition of dates and time steps for generation of high temporal resolution time series with ESTARFM. MCD43A4 is a 16-day rolling composite, representing the best BRDF possible during this period of time. Unlike in the previous collection 5, the date associated with each daily dataset is the centre of the moving 16-day input window (and not its first day). More information on collection 6 of the MCD43A4 product can be found in the MODIS User Guide [25].
The red and NIR bands of the MODIS datasets have been resampled using the nearest neighbour method to a resolution of 30 m, automatically co-registered to Landsat, and reprojected to UTM zone 30˝N using the MODIS Reprojection Tool (MRT) [26]. As recommended in the MODIS User Guide [25], only data with a full BRDF inversion (labelled with the best quality flag) have been processed. From these, NDVI was calculated and remaining outliers in the time series were removed (defined as observations which considerably differ from the median in a moving window with a length of 20 days). Similar to what was suggested by Eklundh and Jönsson [27], the threshold for outlier detection was defined as the standard deviation of the entire time series times a factor of 0.3 for the outliers above the median and times a factor of 0.1 for the outliers below the median. The temporal gaps in the result have been linearly interpolated and, subsequently, a Savitzky-Golay filter [28] with a total window length of 16 days and a quadratic polynomial has been applied to reduce residual noise. The factors and window lengths for outlier removal and Savitzky-Golay filtering were determined based on a visual analysis of their effect on temporal plots similar to those implemented in the software package TIMESAT [27].

Method for the Generation of High Temporal Resolution Time Series
The high temporal resolution Landsat-like time series were generated on the basis of the ESTARFM fusion algorithm [3] but with several modifications of the methodology to make the algorithm applicable to processing of large areas, characterized by heterogeneous landscapes and frequent cloud cover, in an automated way.
The original ESTARFM algorithm produces synthetic Landsat images at the dates of MODIS observations. For this purpose, one pair of simultaneous Landsat and MODIS images before ("left shoulder") and one pair after the prediction date ("right shoulder") is used. ESTARFM first establishes for each pixel a relationship between MODIS and Landsat of both image pairs via linear regression. The regressions are fitted for the central pixel in a moving window by taking into account all pixels in the window that are spectrally similar (similar pixels) to the central pixel. Spectral similarity is defined by the following threshold: where R s is the reflectance of the similar pixel candidate, R c is the reflectance of the central pixel, σ pBandq is the standard deviation of the band within the entire image and m is the number of classes in the entire image, as estimated by the user. The slope of the regression is used as a conversion coefficient between Landsat and MODIS. This coefficient allows transfer of the reflectance changes between the MODIS scenes (pair dates´prediction date) to the Landsat scene at the prediction date. For more information on the original ESTARFM fusion algorithm, see Zhu et al. [3].
However, ESTARFM has several limitations, as previously outlined by Zhu et al. [3], which hinder automated processing and its application to larger areas. These limitations are addressed and overcome in this study as described in the following.

Modified Selection of Similar Pixels
One issue of the original ESTARFM algorithm is how the aforementioned similar pixels are identified. Similar pixels are selected according to Equation (1), where the average number of land cover classes (m) in the study area must be estimated by the user. In areas of heterogeneous landscapes, a larger number of classes are specified, leading to a stricter condition for the selection of similar pixels [3]. On the one hand, the user estimation introduces an undesirable subjectivity to the process. On the other hand, by using an average number of land cover classes for the entire study area, spatial differences in the heterogeneity of the study area cannot be taken into account. This means that even in a study area containing small-scale mosaics of several land cover types in one part and relatively homogeneous land cover patterns in others, still only one overall number of classes would be considered. This becomes particularly important when working at larger scales.
In order to address this, we developed a modified approach for the identification of similar pixels in the proposed ESTARFM framework. This approach includes an automated clustering of the surface reflectance values of selected Landsat observations using the ISODATA algorithm [29]. The general possibility of a clustering was already mentioned by Zhu et al. [3], but they neither elaborated on it, nor implemented it in the original ESTARFM. In this study, clustering is performed once per year and on the basis of multi-spectral and multi-temporal Landsat reflectance values. Acquisition dates are selected for clustering which represent the major phenological stages of the vegetation development in the study area. The selection of these dates should be defined according to the seasonal and phenological characteristics of the study area. Then, based on this cluster image, similar pixels are selected as described in Equations (2) and (3). For a pixel to be selected as a similar pixel it has to belong to the same cluster as the central pixel of the moving window (Equation (2)) and the reflectance difference of the pixel to the central pixel must be below one standard deviation of the respective cluster in the moving window (Equation (3)).
In general, if a 30 m land cover dataset from the same year is available for a study area, it can be used to replace the clustering approach.
where C s is the cluster of the similar pixel candidate, C c is the cluster of the central pixel and σ pC cWin q is the standard deviation of the reflectance values of this cluster in the moving window.

Integration of Cloud-Affected Landsat Scenes
Zhu et al. [3] recommended using a large number of temporally well-distributed input Landsat data in ESTARFM in order to rely on high-resolution observations that are as close as possible to the prediction dates. High-resolution images that are temporally too far from the prediction dates are expected to decrease the quality of the output, particularly if input scenes are from another vegetative season as compared to the prediction date [3]. However, previous ESTARFM studies either restricted the input data to clear-sky high-resolution images or, when integrating partially cloud-covered scenes, these approaches produce gaps in the resulting synthetic images [5,8]. This substantially hampers the utility of the ESTARFM algorithm in regions of high cloud cover. In this study, an automated gap-filling approach was implemented to make best use of available, even partly cloud-covered Landsat images, without producing gaps in the output of ESTARFM. The gaps are filled with ESTARFM predictions (Figure 4) based on the other modifications introduced in the framework and described as follows.
scene is processed. If the gaps are still larger than one percent, the first step is repeated based on the remaining SSC candidates. If after this second iteration the residual gaps in the target scene is still above one percent, the SSC is selected that can fill the residual gaps so that the target scene is at least 99% gap free. This way, a maximum of three synthetic scenes have to be generated to fill one target scene and the filling is completed in a reasonable amount of time.

Quality Information
The ESTARFM algorithm does not give an indication on the quality of the fusion. It is, however, important to know whether a given prediction was based on stable regressions or if errors are likely to occur. For that reason, a quality dataset is produced here as an additional output of the ESTARFM framework ( Figure 3). With each synthetic image, a layer with the number of similar pixels used for the prediction of each pixel is obtained. Furthermore, a layer for each band is calculated storing the coefficient of determination from the linear regressions between Landsat and MODIS used for derivation of the conversion coefficient. The information in these layers is considered an indicator of In a first step of this gap filling, the first target scene with cloud gaps to be filled is selected. The shoulder scene combination (SSC), i.e., one Landsat image before and one after the target scene, fulfilling two conditions is identified: (a) the temporal distance between these shoulder scenes has to be as small as possible; and (b) an ESTARFM prediction derived from this SSC has to fill at least 50% of the gaps in the target scene. The selected SSC is used to generate an ESTARFM prediction and fill the target scene. If after that the residual gaps make up less than one percent, the next cloud-affected scene is processed. If the gaps are still larger than one percent, the first step is repeated based on the remaining SSC candidates. If after this second iteration the residual gaps in the target scene is still above one percent, the SSC is selected that can fill the residual gaps so that the target scene is at least 99% gap free. This way, a maximum of three synthetic scenes have to be generated to fill one target scene and the filling is completed in a reasonable amount of time.

Quality Information
The ESTARFM algorithm does not give an indication on the quality of the fusion. It is, however, important to know whether a given prediction was based on stable regressions or if errors are likely to occur. For that reason, a quality dataset is produced here as an additional output of the ESTARFM framework ( Figure 3). With each synthetic image, a layer with the number of similar pixels used for the prediction of each pixel is obtained. Furthermore, a layer for each band is calculated storing the coefficient of determination from the linear regressions between Landsat and MODIS used for derivation of the conversion coefficient. The information in these layers is considered an indicator of the stability and goodness of the regression and therefore provides an indicator of the proper functioning of the algorithm for different areas of the image.

Automation and Improvement of Performance
The original ESTARFM algorithm is designed for the generation of single synthetic scenes so that for the processing of a time series, it has to be run individually for each time step. To facilitate the generation of time series, a framework was designed in IDL ( Figure 3), which requires as input all available MODIS and Landsat data for one Landsat tile and then generates the time series between a definable start and end date automatically. Input image pairs of Landsat and MODIS are automatically selected that are the temporally closest ones to the prediction date as well as the MODIS scene at the prediction date.
Furthermore, for the application of ESTARFM to larger areas and long time series, the processing speed was optimized. So far, most ESTARFM functions only run on one CPU core, so that the prediction of an exemplary Landsat scene with four bands (B-G-R-NIR) takes more than 24 h (Table 1). Thus, the algorithm has been parallelized here, so that the number of used CPU cores can be defined individually. The images are then split into tiles and one tile is assigned to each core. This way, the user can adapt the algorithm to its system specifications and the processing time can be improved significantly (Table 1). Using the daily MCD43A4 MODIS product with the developed ESTARFM framework, the user can flexibly define the dates and time steps of the output time series. Though a daily ESTARFM time series could be produced, this also means a high amount of processing time and storage requirements while less time steps are sufficient for many applications. In this study, the ESTARFM framework was applied to the generation of an 8-day 30 m NDVI time series for the year 2014 (day of year, DOY 7 to 359) in the study area using all Landsat and MODIS input data described in Sections 3.1 and 3.2. Previous studies demonstrated that using NDVI directly as input to ESTARFM instead of using reflectances and calculating the NDVI from the fusion output results in higher accuracies due to reduced error propagation [10,30].
Of the 30 Landsat scenes from the year 2014, 17 were partly cloud-covered but could be included in the ESTARFM fusion due to the advancement of the algorithm (Section 3.3.2). For our study area, only one cloud-free (<1% cloud cover) Landsat scene for the vegetative season (DOY 159) was available for path 195 and none for path 194. However, especially the latter contained several scenes with a cloud cover of less than 40% that could be processed based on the developed cloud-filling approach ( Figure 5). In particular, on the southeastern tile (194/53), some areas had persistent cloud coverage over the year and had to be filled up to four times. In such regions, an application of the input scenes in ESTARFM without previous filling would result in gaps in the synthetic time series for most of the predictions.

Accuracy Assessment
The accuracy of the predicted ESTARFM time series was assessed in different ways. Similar to a cross-validation, the first approach was to exclude one Landsat scene from the input time series and predict this scene with ESTARFM on the basis of the remaining image acquisitions. The predicted scene was then correlated with the actual, formerly excluded Landsat scene and differences between the scenes were calculated. By doing so for every Landsat image (except for the first and last scenes since these cannot be predicted without shoulder scenes) an accuracy assessment of each tile of the ESTARFM time series was conducted. For this purpose, the interval of the output time series was adapted to the 16-day acquisition scheme of the Landsat path 195 (Figure 2) so that the Landsat scenes of the tiles 195/52 and 195/53 could be directly used for validation. For the Landsat path 194 with a different acquisition interval, additional images at the dates of existing Landsat scenes were generated for validation and, in this way, a total number of 195 images have been produced.
In addition to this cross-validation, since the used Landsat scenes do not cover a major part of the rainy season, eight additional Landsat scenes with cloud cover of up to 80% were used for validation of this period. In this way, the quality of predictions for the main rainy season can be estimated, although the results might represent only a very limited part of the study region due to the dominating cloud cover on these scenes.
Furthermore, in order to display the small-scale spatial patterns of the prediction quality, a subset area was monitored over several time steps. Based on these images, the prediction of different land cover types related to different phenological stages can be assessed.
In addition, the consistency of ESTARFM predictions between the eastern and western Landsat path was analyzed for the overlapping areas. Generally, low differences between the paths indicate a high prediction quality and also reduce the amount of subsequent processing of the time series for further geoscientific applications.

Accuracy Assessment
The accuracy of the predicted ESTARFM time series was assessed in different ways. Similar to a cross-validation, the first approach was to exclude one Landsat scene from the input time series and predict this scene with ESTARFM on the basis of the remaining image acquisitions. The predicted scene was then correlated with the actual, formerly excluded Landsat scene and differences between the scenes were calculated. By doing so for every Landsat image (except for the first and last scenes since these cannot be predicted without shoulder scenes) an accuracy assessment of each tile of the ESTARFM time series was conducted. For this purpose, the interval of the output time series was adapted to the 16-day acquisition scheme of the Landsat path 195 (Figure 2) so that the Landsat scenes of the tiles 195/52 and 195/53 could be directly used for validation. For the Landsat path 194 with a different acquisition interval, additional images at the dates of existing Landsat scenes were generated for validation and, in this way, a total number of 195 images have been produced.
In addition to this cross-validation, since the used Landsat scenes do not cover a major part of the rainy season, eight additional Landsat scenes with cloud cover of up to 80% were used for validation of this period. In this way, the quality of predictions for the main rainy season can be estimated, although the results might represent only a very limited part of the study region due to the dominating cloud cover on these scenes.
Furthermore, in order to display the small-scale spatial patterns of the prediction quality, a subset area was monitored over several time steps. Based on these images, the prediction of different land cover types related to different phenological stages can be assessed.
In addition, the consistency of ESTARFM predictions between the eastern and western Landsat path was analyzed for the overlapping areas. Generally, low differences between the paths indicate a high prediction quality and also reduce the amount of subsequent processing of the time series for further geoscientific applications.

Spatio-Temporal Patterns of the ESTARFM Time Series
The overall spatial patterns of the small-scale landscape in the study area are well reproduced by the ESTARFM time series of 2014 ( Figure 6). Parts of the river network with its gallery forests exhibiting high mean NDVI values stand out sharply (e.g., north of Wa or south of Bolgatanga), as do the forest patches in the highly fragmented northwest of the study area. Several small cities emerge from the densely vegetated areas as round patches of low NDVI (e.g., the cities Wa and Tamale). Furthermore, there are no edges, i.e., abrupt shifts, in mean NDVI level visible between the four different Landsat tiles. This indicates the general proper functioning of the ESTARFM fusion, since major differences in the prediction of the single tiles would be expressed by edges between them (see also Section 4.2).

Spatio-Temporal Patterns of the ESTARFM Time Series
The overall spatial patterns of the small-scale landscape in the study area are well reproduced by the ESTARFM time series of 2014 ( Figure 6). Parts of the river network with its gallery forests exhibiting high mean NDVI values stand out sharply (e.g., north of Wa or south of Bolgatanga), as do the forest patches in the highly fragmented northwest of the study area. Several small cities emerge from the densely vegetated areas as round patches of low NDVI (e.g., the cities Wa and Tamale). Furthermore, there are no edges, i.e., abrupt shifts, in mean NDVI level visible between the four different Landsat tiles. This indicates the general proper functioning of the ESTARFM fusion, since major differences in the prediction of the single tiles would be expressed by edges between them (see also Section 4.2).  The transect in Figure 6e, crossing the study area from north to south, reflects the general vegetation gradient of the study area which is in line with the general gradient of West Africa. From north to south, NDVI level and vegetation density increase with rising annual rainfalls. The comparison of the temporal development of the ESTARFM, MODIS and Landsat NDVI time series is highlighted in four plots (Figure 6a-d): Figure 6a shows the average NDVI development over a semi-natural savanna area on the Nazinga game ranch. Here, the time series of ESTARFM closely follows the one from MODIS and the single Landsat NDVI values are also well matched. Figure 6 also shows that if using only Landsat data for the interpretation of phenology, some features such as the peak NDVI would be missed. The NDVI level of the focus site in the Mole National Park (Figure 6b) is generally higher than for the one from the Nazinga game ranch. This focus site also consists of semi-natural savanna but is more densely vegetated and has a higher percentage of tree cover. These differences again reflect the north´south gradient of vegetation in the study region. For the Mole focus site, the NDVI values of MODIS and Landsat diverge slightly during the growing season with a higher level for Landsat. In such a case, ESTARFM gives priority to the Landsat input data and the derived ESTARFM time series matches Landsat more closely than MODIS. Figure 6c displays the time series of a rain-fed agricultural area close to the city of Bolgatanga in Ghana. In comparison to the semi-natural savannas, a generally lower NDVI can be observed, especially during the vegetative season between DOY 100 and 320 (mid-April to mid-November). The agricultural area shows a later green-up and a lower peak of NDVI. Figure 6d represents an irrigated agricultural area at the White Volta downstream of the Bagré dam in southern Burkina Faso. Its temporal development is considerably different from the other focus areas since it is independent of the timing of the rainy season. Two crop cycles can be observed in the graph, one around the DOY 100 (mid-April) and a second one around DOY 300 (end of October).
When comparing the annual mean NDVI from ESTARFM with Landsat and MODIS NDVI for three focus areas characterizing different landscape types, major differences become apparent (Figure 7). With its 500 m resolution, the MODIS data (second row) cannot reproduce the spatial details of the landscape patterns. For example, the waterbodies in Ouagadougou or the cultivated areas in the fragmented area in southern Burkina Faso cannot be delineated from the MODIS NDVI. Also the course of rivers in the Mole National Park or the irrigated fields around the dam in the southeast of the fragmented focus site cannot be adequately identified at this scale. However, the mean annual NDVI values provided by MODIS can be seen as realistic, because of the high temporal resolution of MODIS covering the entire phenological cycle. In contrast, the mean annual NDVI of the original Landsat time series (fourth row) allows the differentiation of the aforementioned features of the focus sites. On the other hand, the overall level of Landsat's mean annual NDVI is considerably lower than the one of MODIS. The reason for this is that there are only severely cloud-affected Landsat data available during the peak vegetative season between July and October (compare Figure 2) and thus this period is not well represented in the mean NDVI, thereby resulting in values that are too low in comparison to the MODIS benchmark. The ESTARFM time series (third row) combines both the high spatial detail which allows delineation of even small objects, and the high temporal resolution allowing derivation of a realistic mean NDVI (similar NDVI levels compared to MODIS, second row).

Accuracy Assessment of the ESTARFM Predictions
The cross-validation of the predicted ESTARFM time series comprises three accuracy measures (Figure 8): the coefficient of determination (R 2 ), the mean absolute error (MAE), and the root mean square error (RMSE). In general, the northeastern tile (194/52) produced the lowest error values with MAE around 0.02 and average RMSE values of 0.03. Also, the correlations between predicted and observed NDVI values were the highest for this tile (average R 2 = 0.83). Comparing the northern tiles (row 52) with the southern tiles (row 53), it becomes evident that ESTARFM predictions for the northern tiles commonly produce higher accuracies and lower error values (ranging between 0.01 and 0.04 MAE in the north and between 0.01 and 0.07 MAE in the south).   These spatio-temporal differences of the prediction accuracies can also be identified in the scatterplots of observed versus predicted NDVI values (Figure 9). During the dry season at the beginning of 2014 (DOY 40), the correlation is highest (R 2 = 0.88 for row 52 and 0.80 for row 53) and the range of NDVI values is rather low. While the NDVI range increases during the vegetative season in June (DOY 152), the correlation decreases (R 2 = 0.79 for row 52 and 0.54 for row 53). At the end of 2014 (DOY 344), at the beginning of the next dry season, the correlation increases again while the NDVI range is still high (R 2 = 0.87 for row 52 and 0.77 for row 53). The north´south difference in accuracy is also visible in these scatterplots, with row 52 generally showing lower errors and less scattering of values in the plot than row 53.
To estimate the prediction quality of the main rainy season where only severely cloud-affected Landsat data is available, eight additional Landsat scenes of path 195 with cloud cover of up to 80% were used for validation ( Table 2). The errors for this period vary between 0.03 and 0.10 for MAE and between 0.05 and 0.12 for RMSE with R 2 values between 0.3 and 0.76. Error values are found to increase towards the peak of rainy season (around DOY 280, beginning of October) and decrease afterwards.
In addition, a subset area in the northeast of the study region was selected to show an example of the spatial patterns of absolute differences between observed and predicted NDVI ( Figure 10). This area represents the different landscape types of the study region including grasslands, woodlands and heterogeneous parklands, i.e., mixtures of cropland and semi-natural vegetation. The general differences for the late dry season (DOY 40) are low with some minor discrepancies for the waterbody in the northern part of the subset. During the onset of the vegetative season (DOY 152), the differences increase, especially for the densely vegetated areas of the image (regions in dark red on the Landsat false color composite). For example, the rivers characterized by gallery forests display higher values of absolute difference, which can partially be attributed to temporarily inundated areas along the course of the streams. The agricultural areas and regions with less dense vegetation cover (bright colors on the Landsat false color composite) maintain low absolute differences for the vegetative season (DOY 152). In the early dry season of December (DOY 344), the overall differences decrease again with major deviations only for the waterbody in the northern part.
These spatio-temporal differences of the prediction accuracies can also be identified in the scatterplots of observed versus predicted NDVI values (Figure 9). During the dry season at the beginning of 2014 (DOY 40), the correlation is highest (R 2 = 0.88 for row 52 and 0.80 for row 53) and the range of NDVI values is rather low. While the NDVI range increases during the vegetative season in June (DOY 152), the correlation decreases (R 2 = 0.79 for row 52 and 0.54 for row 53). At the end of 2014 (DOY 344), at the beginning of the next dry season, the correlation increases again while the NDVI range is still high (R 2 = 0.87 for row 52 and 0.77 for row 53). The north−south difference in accuracy is also visible in these scatterplots, with row 52 generally showing lower errors and less scattering of values in the plot than row 53. To estimate the prediction quality of the main rainy season where only severely cloud-affected Landsat data is available, eight additional Landsat scenes of path 195 with cloud cover of up to 80% were used for validation ( Table 2). The errors for this period vary between 0.03 and 0.10 for MAE and between 0.05 and 0.12 for RMSE with R 2 values between 0.3 and 0.76. Error values are found to increase towards the peak of rainy season (around DOY 280, beginning of October) and decrease afterwards.  An additional way to assess the quality of the fusion is to analyze the consistency between different Landsat paths of the predicted time series. The overlapping areas between path 195 and 194 of the ESTARFM time series show a median deviation which is generally close to zero with a distinct temporal pattern over the course of the year (Figure 11). The first and third quartile shown in light blue ranges between´0.05 and 0.08. During the rainy season from DOY 100 to 320 (mid-April to mid-November), the differences between the two Landsat paths increase, which goes in line with the tendency towards increasing errors during the rainy season as described before. false color composite). For example, the rivers characterized by gallery forests display higher values of absolute difference, which can partially be attributed to temporarily inundated areas along the course of the streams. The agricultural areas and regions with less dense vegetation cover (bright colors on the Landsat false color composite) maintain low absolute differences for the vegetative season (DOY 152). In the early dry season of December (DOY 344), the overall differences decrease again with major deviations only for the waterbody in the northern part. An additional way to assess the quality of the fusion is to analyze the consistency between different Landsat paths of the predicted time series. The overlapping areas between path 195 and 194 of the ESTARFM time series show a median deviation which is generally close to zero with a distinct temporal pattern over the course of the year (Figure 11). The first and third quartile shown in light blue ranges between −0.05 and 0.08. During the rainy season from DOY 100 to 320 (mid-April to mid-November), the differences between the two Landsat paths increase, which goes in line with the tendency towards increasing errors during the rainy season as described before.

Input Data Availability and Influence on ESTARFM Predictions
To analyze the influence of the temporal distance (number of days between Landsat input observations and prediction dates) on the error level, a Landsat scene from the middle of the year (DOY 143, 195/52) was predicted not only with its nearest shoulder scenes, but also with all other, more distant shoulder combinations. The MAE between these predictions and the observed Landsat NDVI have been calculated and visualized against the total temporal distance of the input shoulder combinations (Figure 12). It is evident that the error level increases with increasing temporal distance, from less than 0.02 MAE to over 0.05 MAE. However, it can also be observed that the temporal distance is not the only factor determining the MAE. The symbols of Figure 12 highlight that the right shoulder (i.e., MODIS-Landsat combination after the prediction date) is significantly more important for the quality of prediction than the left shoulder (MODIS-Landsat combination before the prediction date). In the presented case, the left shoulder dates are all from the dry season at the beginning of 2014 (Figure 2), where changes in phenology between the scenes are minor, while the right shoulders are all from different phenological stages: DOY 159 is from approximately the same phenological stage as the target scene, DOY 319 from the period of browning after the vegetation peak, DOY 351 includes widespread burned areas, and DOY 18 (2015) is from the dry season of the following year. Hence, every time the right shoulder date (i.e., the phenological stage) is changed, a pronounced jump in MAE is observed. This analysis emphasizes the importance of using a shoulder

Input Data Availability and Influence on ESTARFM Predictions
To analyze the influence of the temporal distance (number of days between Landsat input observations and prediction dates) on the error level, a Landsat scene from the middle of the year (DOY 143, 195/52) was predicted not only with its nearest shoulder scenes, but also with all other, more distant shoulder combinations. The MAE between these predictions and the observed Landsat NDVI have been calculated and visualized against the total temporal distance of the input shoulder combinations ( Figure 12). It is evident that the error level increases with increasing temporal distance, from less than 0.02 MAE to over 0.05 MAE. However, it can also be observed that the temporal distance is not the only factor determining the MAE. The symbols of Figure 12 highlight that the right shoulder (i.e., MODIS-Landsat combination after the prediction date) is significantly more important for the quality of prediction than the left shoulder (MODIS-Landsat combination before the prediction date). In the presented case, the left shoulder dates are all from the dry season at the beginning of 2014 (Figure 2), where changes in phenology between the scenes are minor, while the right shoulders are all from different phenological stages: DOY 159 is from approximately the same phenological stage as the target scene, DOY 319 from the period of browning after the vegetation peak, DOY 351 includes widespread burned areas, and DOY 18 (2015) is from the dry season of the following year. Hence, every time the right shoulder date (i.e., the phenological stage) is changed, a pronounced jump in MAE is observed. This analysis emphasizes the importance of using a shoulder scene from the same phenological stage if possible. combinations ( Figure 12). It is evident that the error level increases with increasing temporal distance, from less than 0.02 MAE to over 0.05 MAE. However, it can also be observed that the temporal distance is not the only factor determining the MAE. The symbols of Figure 12 highlight that the right shoulder (i.e., MODIS-Landsat combination after the prediction date) is significantly more important for the quality of prediction than the left shoulder (MODIS-Landsat combination before the prediction date). In the presented case, the left shoulder dates are all from the dry season at the beginning of 2014 (Figure 2), where changes in phenology between the scenes are minor, while the right shoulders are all from different phenological stages: DOY 159 is from approximately the same phenological stage as the target scene, DOY 319 from the period of browning after the vegetation peak, DOY 351 includes widespread burned areas, and DOY 18 (2015) is from the dry season of the following year. Hence, every time the right shoulder date (i.e., the phenological stage) is changed, a pronounced jump in MAE is observed. This analysis emphasizes the importance of using a shoulder scene from the same phenological stage if possible.

Uncertainties in the Prediction with ESTARFM
Several modifications have been introduced in the presented ESTARFM framework resolving shortcomings and challenges that have been mentioned as drawbacks in the usage of the existing ESTARFM algorithm [3,10,31]. However, certain points affecting the accuracy of the ESTARFM fusion remain. First of all, the availability of Landsat input data has a great influence on the quality of the fusion. Minimizing the temporal distance of the shoulder pairs to the target scene is crucial for a successful prediction [3,32] and it became evident that using at least one shoulder pair from the same phenological stage as the prediction date has even higher positive influence ( Figure 12).
A further factor influencing the fusion quality is the land cover type of a pixel ( Figure 10). In our study region, land cover types with a lower NDVI range, such as mixed rain-fed agricultural areas and grasslands, show low absolute difference between observed and predicted NDVI over the year. In contrast, the predicted NDVI values of densely vegetated areas such as closed woodlands differ more starkly from the actual NDVI values. In our study region, these areas exhibit stronger phenological dynamics which are usually more difficult to predict with ESTARFM [9,33]. These dynamics are also a potential reason for the north´south difference of prediction quality in the study area since the southern regions are generally more densely vegetated than the northern parts. In addition, some of these areas with a high NDVI range are characterized by rapid changes through flooding (especially gallery forests) or burning (semi-arid savannas) which negatively affects the quality of prediction. However, not only the temporal variance plays a role for the reduced prediction quality during the main rainy season, but also the cloud coverage of the MODIS dataset. Although MODIS cloud gaps have been interpolated here, these areas are less accurate in the prediction of synthetic Landsat scenes. This issue should be more emphasized for land cover types with a distinct phenology decreasing the accuracy of interpolation. For areas with high temporal variance, Emelyanova et al. [9] and Jarihani et al. [10] achieved better results with STARFM than with ESTARFM. However, these comparison studies showed at the same time that the performance of ESTARFM is superior to STARFM for study regions with a high spatial heterogeneity (as being the case in this study) and the overall accuracies of ESTARFM tend to be higher as well [9,10]. A comparison of ESTARFM and ESTARFM framework with the newly developed Flexible Spatiotemporal Data Fusion (FSDAF) method by Zhu et al. [34] could be the focus of future research in order to identify potential synergies.
In addition, especially in areas with a complex canopy structure, the BRDF correction of input data can have an impact on the fusion quality. Walker et al. [33,35] tested the usability of the BRDF-corrected MODIS product MCD43A4, which was also applied in our study, together with Landsat-5 TM (Thematic Mapper) as an input to STARFM. They found generally low NIR correlations between Landsat and STARFM predictions and concluded that this could be due to Landsat-5 TM data not being corrected for BRDF effects in contrast to MCD43A4. They further reasoned that the discrepancy between the input datasets would be highest in areas dominated by tall, multi-layered vegetation causing high scattering of NIR radiation [33]. However, they also pointed out the suitability of the MCD43A4 for the fusion since BRDF effects in MODIS surface reflectance products can be considerable due to the large swath width [35,36]. Furthermore, the use of NDVI as in the presented study should decrease BRDF effects due to the similarity of directional signatures in the red and NIR bands [37]. Since Landsat-8, which was used in this study, is a pushbroom sensor, view angle effects should be minor. For the use of the whiskbroom systems Landsat-5 and Landsat-7, which are more affected by view angle effects in combination with MCD43A4, a potential solution could be to also correct the Landsat data for directional effects as done by Schmidt et al. [12].

Added Value of the ESTARFM Framework for Cloud-Prone Areas
In West Africa, as in many regions with a distinct rainy season, acquisitions of clear sky observations from polar orbiting instruments are challenging [38]. Landsat data are almost only available during the dry season, when vegetation and land surface in general have a significantly different appearance than during the rainy season. Previous studies indicated the importance of using Landsat input scenes that are temporally as close as possible to the prediction dates to minimize errors in the ESTARFM prediction [3,32]. Our results confirmed this assumption and additionally showed the value of Landsat input scenes from the same season and that abrupt events such as fires or floods between input and prediction date can decrease the accuracy. In order to avoid large gaps between input and prediction dates, we included an automatic cloud gap filling of Landsat scenes in the ESTARFM framework to make best use of scenes with a partial cloud cover without producing gaps in the output of the fused time series. With this filling approach, a total number of 30 Landsat input scenes from the year 2014 could be used for the study area instead of only the 13 cloud-free scenes. Thus, Landsat scenes from different seasons could be added to the input dataset, significantly improving the prediction quality.
It should be noted that when using the original STARFM [2] with two shoulder input pairs, cloud gaps in the fusion output can also be avoided as long as at least one of the shoulder observations is cloud free. However, gaps remain for those parts of the image where both shoulder datasets are clouded.
The different Landsat tiles have thus far been treated individually in the processing of the ESTARFM framework. In order to increase data availability for an application in higher latitudes, where overlapping areas between Landsat paths are considerably bigger than in the study region of this paper, it would be interesting to use the tiles of different paths jointly in the ESTARFM framework. This could be part of future improvements of the framework.

Added Value of the ESTARFM Framework for Large-Scale Analyses
Previous studies on the ESTARFM algorithm indicate its great potential for the generation of high spatial and temporal resolution time series, but they also pointed out issues like the slow processing speed or the way how similar pixels are selected that hinder its application on large scales [3,10,31]. The ESTARFM framework developed here significantly decreases processing time by parallelization and automating the algorithm. This feature, in addition to avoiding repeated manual input, is crucial for its application on large scales. Moreover, the way similar pixels are selected for the regression between Landsat and MODIS was changed in the framework. The approach in the original ESTARFM with a global threshold for the area of interest does not account for spatial differences in the heterogeneity of the study area. This means that if a study area comprises several Landsat tiles covering different landscape types, a single threshold is not sufficient for the selection of similar pixels. With the new approach presented here, regional differences are considered independently of the size of the study area.
When working on large scales, several Landsat paths have to be used and, due to the different acquisition dates of the paths, errors might occur in the overlapping areas of the predicted output. In order to improve the consistency of ESTARFM predictions for certain periods of the time series (compare Section 4.2), a matching of the different paths with suitable algorithms might be useful. This, however, depends on the further application of the ESTARFM time series and is beyond the scope of this study.
The fusion algorithm has as of yet only been applied to small study areas of less than 6000 km 2 , which is far less than one Landsat scene (~33,300 km 2 ). However, many relevant research questions in the field of vegetation mapping and monitoring require consistent vegetation information and related analyses at the national level or in larger geographic regions [39,40]. It is expected that ESTARFM time series can be of great use for improved mapping and monitoring of agricultural areas in West Africa. This has been a challenging task due to the patchy small-scale farming practices in the region, where areas of agriculture can be adequately separated from rangeland only during short periods of the year coinciding with prevailing cloud cover [1,41,42]. The phenological signatures produced by ESTARFM in the presented study show distinct characteristics of the small-scale rain-fed and irrigated agriculture and savanna types which could be helpful for an improved discrimination. If, however, these results cannot be produced efficiently at the national to regional levels, their relevance for decision makers would remain limited.

Conclusions and Outlook
In this study, a framework for the Enhanced Spatial and Temporal Adaptive Reflectance Fusion Model (ESTARFM) fusion algorithm was developed to make it applicable for large, cloud-prone and heterogeneous areas. For this purpose, several modifications were introduced that focus on the automation, acceleration of processing speed, handling of the algorithm and maximizing use of Landsat input data. The way similar pixels are selected for the regression between Landsat and Moderate Resolution Imaging Spectroradiometer (MODIS) was enhanced by accounting for spatial differences in the heterogeneity of the study area. Furthermore, an automatic filling of cloud gaps was developed to maximize the use of available cloud-free Landsat input data. In addition, the processing of time series in the ESTARFM framework was automated and processing speed was accelerated significantly by parallelization of the algorithm.
The results indicate that even in a challenging study area with heterogeneous land cover and abrupt fire and flood events, the developed ESTARFM framework proved to be suitable to generate time series on larger spatial scales. The errors varied between average mean absolute error (MAE) of 0.02 for the dry season and 0.05 for the vegetative season which indicates a higher prediction uncertainty for the latter. These fused datasets have the potential to substantially improve various regional analyses of land degradation, changes in land surface phenology, or changes in cropping practice where a combination of high temporal and spatial resolution is essential for gaining new insights in land surface processes.
With most recent satellite missions such as the European Space Agency's Sentinel program, new possibilities emerge for data fusion approaches extending and adding to the period of observations that can be covered by MODIS-Landsat time series (2000´present time). Especially the satellite constellations of Sentinel-2 (10 m,~5 days) and Sentinel-3 (300 m,~2 days) are predestined for data fusion. The short repeat cycle of the Sentinel-2 satellites will significantly contribute to a better availability of high-resolution data in comparison to Landsat. However, capturing the complete phenological cycle of land surface using only Sentinel-2 data is still not likely in cloud-prone areas. For such regions, the fusion of Sentinel-2 with Sentinel-3, Proba-V or VIIRS (Visible/Infrared Imager Radiometer Suite) could significantly contribute to a better quality of high temporal and spatial resolution time series.