A New Spatial–Temporal Depthwise Separable Convolutional Fusion Network for Generating Landsat 8-Day Surface Reﬂectance Time Series over Forest Regions

: Landsat has provided the longest ﬁne resolution data archive of Earth’s environment since 1972; however, one of the challenges in using Landsat data for various applications is its frequent large data gaps and heavy cloud contaminations. One pressing research topic is to generate the regular time series by integrating coarse-resolution satellite data through data fusion techniques. This study presents a novel spatiotemporal fusion (STF) method based on a depthwise separable convolutional neural network (DSC), namely, STFDSC, to generate Landsat-surface reﬂectance time series at 8-day intervals by fusing Landsat 30 m with high-quality Moderate Resolution Imaging Spectroradiometer (MODIS) 500 m surface reﬂectance data. The STFDSC method consists of three main stages: feature extraction, feature fusion and prediction. Features were ﬁrst extracted from Landsat and MODIS surface reﬂectance changes, and the extracted multilevel features were then stacked and fused. Both low-level and middle-level features that were generally ignored in convolutional neural network (CNN)-based fusion models were included in STFDSC to avoid key information loss and thus ensure high prediction accuracy. The prediction stage generated a Landsat residual image and is combined with original Landsat data to obtain predictions of Landsat imagery at the target date. The performance of STFDSC was evaluated in the Greater Khingan Mountains (GKM) in Northeast China and the Ziwuling (ZWL) forest region in Northwest China. A comparison of STFDSC with four published fusion methods, including two classic fusion methods (FSDAF, ESTARFM) and two machine learning methods (EDCSTFN and STFNET), was also carried out. The results showed that STFDSC made stable and more accurate predictions of Landsat surface reﬂectance than other methods in both the GKM and ZWL regions. The root-mean-square-errors (RMSEs) of TM bands 2, 3, 4, and 7 were 0.0046, 0.0038, 0.0143, and 0.0055 in GKM, respectively, and 0.0246, 0.0176, 0.0280, and 0.0141 in ZWL, respectively; it can be potentially used for generating the global surface reﬂectance and other high-level land products.


Introduction
The Landsat archive has become one of the most valuable remotely sensed datasets since its opening in 2008 [1]; it allows the capture of natural or anthropogenic impacts with unprecedented spatial detail and quality across large areas in spatial dimensions [2] and enables retrospective analyses and characterization of changes for nearly 50 years in the temporal dimension [3].Landsat time-series data have been widely used in a wide range of applications, such as land surface phenology mapping [4], assessment of human-induced land cover changes [5,6], identification of mangrove forests [7], estimation of forest biomass and its dynamics [3,8], and monitoring forest disturbance and subsequent recovery [9,10]; however, Landsat time series data often contain gaps due to acquisition frequency, cloud contamination, data communication issues, or some sensor failure (e.g., with the scan line corrector off), which affects further application, particularly in retrieving land surface parameters such as forest biomass [3,11].
To generate gap-filled Landsat time series data, many pixel-based composite algorithms have been developed.A few algorithms used Landsat data alone without auxiliary datasets, such as the closest-spectral-fit data assimilation method, the alternative similar pixel method, and the temporal interpolation gap-filling approach [11,12], but most studies have focused on utilizing data from coarse spatial resolution instruments (e.g., Moderate Resolution Imaging Spectroradiometer (MODIS) data) through a spatial-temporal fusion (STF) algorithm, taking advantage of the temporal resolution information of coarse spatial resolution data.The spatial and temporal adaptive reflectance fusion model (STARFM) is a widely accepted STF algorithm that was developed by Gao et al. [13] to predict daily surface reflectance from Landsat and MODIS data and later applied in blending the normalized difference vegetation index (NDVI), land surface temperature, and evapotranspiration [14][15][16].STARFM assumes that land cover types do not change over two dates, and the target Landsat reflectance can be mapped from MODIS reflectance change over two dates and Landsat data at the base time.The accuracy of STARFM-predicted reflectance was different among diverse land cover types and tended to decrease when applied to heterogeneous fine-grained landscapes [13,17].Zhu et al. [18] developed an enhanced STARFM method (ESTARFM), which improved the STARFM for predicting fine-resolution reflectance in heterogeneous landscapes by introducing observed reflectance changes between two dates and spectral unmixing analysis through a conversion coefficient calculated for each endmember, which was later used to observe flooding in heterogeneous scenarios [19].Based on the theory that the reflectance of a coarse image pixel is a linear combination of the reflectance of different endmembers, Wu et al. [20] developed the spatial and temporal data fusion approach (STDFA) with the assumption that the temporal variation properties of each endmember class were constant.Huang and Zhang [21] proposed the unmixing-based spatiotemporal reflectance fusion model (U-STFM) to predict reflectance change without reference to land cover change or phenological change.
In addition to weight function-based and unmixing-based fusion algorithms, some hybrid fusion algorithms have also been proposed.Gevaert and García-Haro [22] combined the characteristics of an unmixing-based fusion algorithm and STARFM and proposed the spatial and temporal reflectance unmixing model (STRUM).The results of this study suggested that the STRUM could generate reflectance with a higher correlation with reference to Landsat images, and in addition, when few high-resolution images were available, the STRUM was still suitable.Zhu et al. [23] proposed flexible spatiotemporal data fusion (FSDAF), combining the merits of spectral unmixing analysis and thin-plate spline interpolation.The FSDAF required minimum input data and was proven to be suitable for heterogeneous landscapes.Although these fusion methods are elaborately designed for specific applications, they suffer from a series of assumptions and associated issues, such as inaccurate estimation of endmember numbers, nonlinear spectral mixing, land cover changes or disturbed landscapes, and uneven quality of acquired data [24,25].
In recent years, learning-based fusion methods for generating time series images with high temporal and spatial resolution have gained much attention.These data-driven methods have been mainly based on sparse representation and deep learning techniques and learn the processes, spatial patterns, or texture features lying behind satellite images without strict constraints.Compared with the sparse representation [26,27], the deep convolutional neural network (CNN) model can extract more abundant information from massive remote sensing data and carry out STF more effectively.Song et al. [28] designed the spatiotemporal fusion using deep convolutional neural networks (STFDCNN) architecture that included a nonlinear mapping CNN between MODIS and low-spatial-resolution Landsat data, a super-resolution CNN between low-spatial-resolution Landsat data and original Landsat images, and image fusion.Liu et al. [29] proposed a two-stream CNN model (STFNET) that contained both temporal dependence and temporal consistency information.Temporal consistency was included as a constraint to prevent large discrepancies in predictions from two streams and ensure the accuracy of prediction results.The enhanced deep convolutional spatiotemporal fusion network (EDCSTFN) adopted the "encodermerge-decoder" architecture, in which the first encoder was to extract Landsat features; the second encoder was the residual encoder designed for learning the feature differences between the reference and prediction, and the features from two encoders were then merged and input into the decoder [25].Li et al. [30] added attention and multiscale mechanisms to CNNs and developed the AMNet model.In addition, some Generative Adversarial networks (GAN) were also used in the field.For example, Chen et al. [31] used SRresnet-based GAN for the feature-level fusion of Landsat and Sentinel data.Gao et al. [32] applied CNN and GAN models for cloud removal application by fusing synthetic aperture radar and optical data.
In this study, we proposed a novel STF method based on depthwise separable convolution (DSC), namely, STFDSC, for reconstructing Landsat reflectance fast and accurately.Unlike existing fusion models in which network computation was relatively large, we used the DSC in the STFDSC.The DSC extracted the features of each channel separately without changing the number of output feature layers and then applied a pointwise convolution on different layers to format the output, which significantly reduced the model parameters.Therefore, the proposed STFDSC is a lightweight CNN model.Compared with CNNs that use standard convolution, DSC had better performance when using similar amounts of model parameters [33]; moreover, multilevel features extraction structure is established in the proposed STFDSC algorithm, which are like feature pyramids [34] to avoid the loss of detail information and ensure that high accuracy could be achieved, whereas, in previous CNN-based fusion models, the low-level and middle features were generally ignored, possibly resulting in lower accuracy or unstable fusion results.The use of batch normalization and residual structure can speed up STFDSC training and avoid exploding/vanishing gradient problems that often occur in CNN models [35,36].The performance in reconstructing Landsat surface reflectance was examined in two forest regions of China.Intercomparisons with two classical fusion methods (FSDAF and ESTARFM) and two deep learning algorithms, EDCSTFN and STFNET, were also carried out.

Study Area
The Greater Khingan Mountains (GKM) of Northeast China and the Ziwuling (ZWL) forest area were selected as the study areas (Figure 1).
The GKM is located in a climatic and topographic transition zone from a coldtemperate coniferous forest biome to a mid-temperate grassland biome with decreasing latitude [37].The mean annual temperature ranges from −4 to −2 • C, and precipitation is 350-500 mm.Dominant tree species include Dahurian larch, Scotch pine, Korean spruce, Japanese white birch, two species of aspen, and Mongolian oak [38].
The ZWL has the largest secondary forests on the Loess Plateau in Northwest China and plays an important role in climatic regulation [39].The mean annual temperature is 6-10 • C, and the mean annual precipitation is 600-700 mm, of which 60% falls from June to September [40].The dominant tree species are Betula platyphylla and Quercus wutaishanica, accounting for 49.62% and 36.18%,respectively [39].2).The background data showed different forest types from Li et al. [41] with mapped colors described in the legend.

Landsat 5 TM data
Landsat 5 TM surface reflectance data were obtained from the United States Geological Survey archives (https://earthexplorer.usgs.gov,accessed on 23 December 2020).
They were processed to the L1TP level, which has the highest geometric and radiometric quality and is suitable for pixel-level time series analysis [42].We collected all path 120/Row 24 and path 127/Row 35 TM scenes acquired in 2004 and clipped them to an extent of 1500×1500 pixels, corresponding to a 45 km×45 km region of interest (ROI) in size, as shown in Figure 1.Only clear (< 5% cloud cover) Landsat data covering the ROIs in the GKM and ZWL were used.Landsat surface reflectance at TM band 2, band 3, band 4, and band 7 were included to train and test the STF methods.
The normalized difference vegetation index (NDVI) and normalized burn ratio (NBR) were also calculated from TM surface reflectance due to their sensitivities to forest disturbances and recovery [9,43], and the performance of five STF methods in reconstructing both vegetation indices was also evaluated.

MODIS data
The MCD43A4 nadir bidirectional reflectance distribution function (BRDF)-adjusted reflectance (NBAR) data were used to provide ancillary information for reconstructing Landsat imagery because they could produce more accurate synthetic results than MODIS daily reflectance data and 8-day composite reflectance data [44,45].We downloaded the MODIS NBAR data for 2004 from the Land Processes Distributed Active Archive Center (https://lpdaac.usgs.gov/products/mcd43a4v006/,accessed on 28 January 2021) in tiles h25v03 and h26v05 to correspond with Landsat scenes path 120/Row 24 and path 127/Row 35, respectively.MCD43A4 data were reprojected to the Universal Transverse Mercator (UTM) projection, spatially subset to the ROIs, and resampled to a 30-meter spatial resolution using the nearest neighbor resampling method.Consistent with Landsat data, NBAR data with at least 95% good quality (full BRDF inversions) The background data showed different forest types from Li et al. [41] with mapped colors described in the legend.

Landsat 5 TM data
Landsat 5 TM surface reflectance data were obtained from the United States Geological Survey archives (https://earthexplorer.usgs.gov,accessed on 23 December 2020).They were processed to the L1TP level, which has the highest geometric and radiometric quality and is suitable for pixel-level time series analysis [42].We collected all path 120/Row 24 and path 127/Row 35 TM scenes acquired in 2004 and clipped them to an extent of 1500 × 1500 pixels, corresponding to a 45 km × 45 km region of interest (ROI) in size, as shown in Figure 1.Only clear (<5% cloud cover) Landsat data covering the ROIs in the GKM and ZWL were used.Landsat surface reflectance at TM band 2, band 3, band 4, and band 7 were included to train and test the STF methods.
The normalized difference vegetation index (NDVI) and normalized burn ratio (NBR) were also calculated from TM surface reflectance due to their sensitivities to forest disturbances and recovery [9,43], and the performance of five STF methods in reconstructing both vegetation indices was also evaluated.

MODIS Data
The MCD43A4 nadir bidirectional reflectance distribution function (BRDF)-adjusted reflectance (NBAR) data were used to provide ancillary information for reconstructing Landsat imagery because they could produce more accurate synthetic results than MODIS daily reflectance data and 8-day composite reflectance data [44,45].We downloaded the MODIS NBAR data for 2004 from the Land Processes Distributed Active Archive Center (https://lpdaac.usgs.gov/products/mcd43a4v006/,accessed on 28 January 2021) in tiles h25v03 and h26v05 to correspond with Landsat scenes path 120/Row 24 and path 127/Row 35, respectively.MCD43A4 data were reprojected to the Universal Transverse Mercator (UTM) projection, spatially subset to the ROIs, and resampled to a 30-m spatial resolution using the nearest neighbor resampling method.Consistent with Landsat data, NBAR data with at least 95% good quality (full BRDF inversions) pixels were selected.For the ROI in the GKM, Landsat and NBAR data acquired on 15 June, 17 July, 2 August, 18 August, and 3 September were of good quality.For ZWL, data on 25 February, 12 March, 29 April, 15 May, and 6 October could be used (Table 1).The Landsat data acquired on 2 August from the GKM and 29 April from the ZWL were test images applied to evaluate the STF methods, whereas the other data served as training data.1. STFDSC is a two-stream model, with one stream predicting Landsat on T3 from Landsat on T2, as well as MODIS data on T2 and T3 (submodel 23) and the other stream predicting Landsat on T3 from Landsat on T4 and MODIS data on both T3 and T4 (submodel 43).The final predicted Landsat TM surface reflectance on T3 was obtained by combining both stream predictions.The final predicted result (Pre_L3) was a weighted combination of two-stream predictions [47].For each band, the weight was calculated according to the difference between each of the submodel predictions and the corresponding surface reflectance.A smaller difference corresponded to a higher weight.For each submodel, the input data were Landsat at the base time and MODIS reflectance changes from the base time to the prediction time.The submodel consisted of three stages: (1) feature extraction of fine and coarse images; (2) feature fusion, and (3) prediction of surface reflectance at TM bands 2, 3, 4 and 7.
The feature extraction stage mainly extracted important features from Landsat data as well as from MODIS reflectance change over two dates using one standard convolution and two following DSCs.Coarse image features were extracted based on MODIS reflectance change instead of original MODIS reflectance in this study, which could provide direct temporal variations of reflectance in forest regions, and more importantly, reduce the number of model parameters and avoid network degradation [46].
DSC could significantly reduce the number of model parameters and had better performance when using similar amounts of model parameters with standard convolution [33].
To develop a lightweight CNN model, DSC was applied in both feature extraction and feature fusion.All convolution operations were followed by batch normalization in DSC, which could reduce internal covariate shift and allow us to use larger learning rates and thus speed up the optimization of hyperparameters [35]; moreover, previous studies suggested that saturated activation functions in CNN models, such as tanh and sigmoid, could suffer from exploding/vanishing gradient problems; we thus selected the non-saturated activation function LeakyReLU in the STFDSC [36].
In the feature fusion stage, the multilevel features generated in the feature extraction part were stacked.All produced features except the lowest-level features of MODIS reflectance change were incorporated in the STFDSC.The lowest-level features of MODIS reflectance change were not considered because coarse MODIS data contained less spatial information at this level.The DSC operation was performed on all stacked features.The prediction part included the prediction of fine-resolution residual images and their combination with Landsat data to finish one-stream surface reflectance reconstruction.
The final predicted result (Pre_L 3 ) was a weighted combination of two-stream predictions [47].For each band, the weight was calculated according to the difference between each of the submodel predictions and the corresponding surface reflectance.A smaller difference corresponded to a higher weight.
In this study, the STFDSC model was implemented in PyTorch deep learning framework.

Training
Two submodels in Figure 2 were trained separately.The parameters in feature extraction, feature fusion and prediction were jointly optimized using Adam.Submodel 24 was taken as an example.Input data for the training process included Landsat on T2, which was expressed as L 2 , and MODIS residual image on T2 and T4 labeled by M 24 was computed as M 24 = M 4 − M 2 , and the target image was Landsat on T4 (L 4 ).The loss function used in the optimization is expressed as: where weight i is the weight assigned for the i-th band, MAE represents the mean absolute error, L 4,i is the i-th band of L 4 , meanL 4,j is the average reflectance of the i-th band of L 4 , and Pre_L 2→4 4,i is the predicted L 4,i based on L 2 , which is also the output of submodel 24 in the training process.
L2 regularization was included in the loss function to prevent overfitting.We assigned more weight to the band with smaller mean reflectance to reduce the problem that predictions at bands with smaller reflectance values were not accurate.
After the above steps, a trained submodel 24 was obtained.The trained model was then used to predict Pre_L 2→3 3 , as shown in Figure 2. Similarly, submodel 42 was trained and used to generate predictions Pre_L 4→3 3 .

Comparison of Experiments
To evaluate the impacts of the time intervals of the two nearest good-quality Landsat datasets on the prediction accuracy, we carried out the following three experiments in the ZWL and GKM (Table 2).Comparisons of STFDSC with FSDAF, ESTARFM, EDCSTFN, and STFNET under these scenarios were also performed.Among these methods, the STFDSC, ESTARFM, EDCSTFN, and STFNET methods required two reference images as input.One reference image acquired before T3 and one reference image acquired after T3 were used to reconstruct Landsat imagery on T3 in all experiments.FSDAF required one reference image, and we used the Landsat reference data close to T3 in each experiment.

Generating Landsat Surface Reflectance Time Series
The STFDSC was then used to generate Landsat time series at 8-day intervals between two reference Landsat, which included Landsat from 15 June to 3 September for GKM, and from 25 February to 15 May for ZWL.
Due to cloud and residual atmospheric effects, gaps existed in the MODIS reflectance time series [48].The linear interpolation method was used to generate spatially and temporally complete MODIS data time series before reconstructing Landsat data time series.

Accuracy Assessment
Evaluation indices included the root mean square error (RMSE), correlation coefficient (CC), and spectral angle mapper (SAM) index.The SAM M index calculates the spectral angle between the reconstructed image and the observed Landsat reference image.Lower SAM, lower RMSE, and higher CC values were associated with the higher quality of a reconstructed image.In this study, the average RMSE, CC, and SAM values of the predicted reflectance compared with the observed reflectance on 2 August and 29 April were calculated for the GKM and ZWL, respectively.
where n is the number of pixels in an image, y is the observed reflectance, ŷ is the predicted reflectance, µ y is the mean reflectance of the image, µ ŷ is the mean predicted reflectance of the image, j is the j-th band, and m is the number of bands.
In addition, pixels with NDVI values larger than 0.30 were extracted, and density scatter plots were then drawn based on 20,000 randomly selected pixels to further examine the differences between the observed reflectance and STF-predicted SR in forest pixels.Mean MAE and bias were computed as follows.

Performance of STF Methods in the Three Experiments
The evaluation indices calculated for the ZWL and GKM are shown in Figures 3 and 4, respectively.In the ZWL, EDCSTFN generally had better performance in predicting Landsat surface reflectance, followed by STFDSC.The results of Experiment T2/T3/T4 indicated that STFDSC performed slightly better in predicting surface reflectance at TM band 3 than EDCSTFN, with an RMSE of 0.0176 and CC of 0.8747.In Experiment T1/T3/T4, STFDSC provided an overall more accurate fusion result in terms of RMSE, and ESTARFM provided comparable results.In Experiment T2/T3/T5, T2 and T5 had a time interval of 208 days, and T2 and T3 were at an interval of 48 days.Under this condition, two traditional methods, ESTARFM and FSDAF, had better prediction accuracy than the other three CNN-based fusion methods (Figure 3).FSDAF had the best performance because it required only one pair of data for prediction, and Landsat at T2 and MODIS at T2 and T3 served as inputs.When only a set of Landsat-MODIS data was available, FSDAF could be a suitable choice.STFDSC outperformed two other CNN methods and had CC values close to those of the traditional fusion methods.Using the SAM index, STFDSC even outperformed the two traditional methods and had the best performance.
Figure 3 shows that the RMSE of the STFNET prediction at TM band 4 was significantly larger than that of the other methods, which may have been caused by the large variation in reflectance.The mean reflectance was 0.1383 at T2 and 0.3172 at T4 in the ROI in the ZWL region.Significant variation in reflectance was not found in the GKM region, where STFNET had good performances (Figure 4).which bands with smaller mean reflectance were assigned a larger weight to avoid substantial prediction error at TM band 3; therefore, compared with EDCSTFN and STFNET, STFDSC produced a more stable and accurate results, which also suggested its better performance in generating Landsat surface reflectance.In Experiment T2/T3/T5, STFDSC still achieved the best prediction accuracy, with an average RMSE of 0.0081, CC of 0.9314, and SAM of 0.0228 (Figure 4 and Figure 5).Among the three experiments in the GKM, input data of Experiment T2/T3/T4 were Landsat data at T2 and T4 and MODIS data from T2-T4 acquired the closest to T3.The STFDSC results were much closer to the Landsat reference data than ESTARFM, FSDAF, EDCSTFN, and STFNET.For all four TM bands, the surface reflectance predicted by STFDSC had the lowest RMSE and highest CC values.In addition, STFDSC also corresponded to the lowest SAM index, 0.0219, suggesting the optimal accuracy achieved (Figures 4 and 5).In Experiment T1/T3/T4, STFDSC still produced the most accurate synthetic image with an average RMSE of 0.0132 and CC of 0.8266.The surface reflectance generated by EDCSTFN was observed to be abnormal at TM band 3, partly due to its sensitivity to data preprocessing.We tried bilinear interpolation instead of the nearest neighbor method to preprocess the MODIS data and found that it performed normally in the GKM.The main reason for the observed abnormal phenomena was probably that the mean reflectance at TM band 3 was 0.0304, substantially lower than 0.3036 at TM band 4 and 0.0615 at band 7, and fluctuations existed in predicting lower reflectance by EDCSTFN.In this study, the loss function adopted the weighted average method, in which bands with smaller mean reflectance were assigned a larger weight to avoid substantial prediction error at TM band 3; therefore, compared with EDCSTFN and STFNET, STFDSC produced a more stable and accurate results, which also suggested its better performance in generating Landsat surface reflectance.In Experiment T2/T3/T5, STFDSC still achieved the best prediction accuracy, with an average RMSE of 0.0081, CC of 0.9314, and SAM of 0.0228 (Figures 4 and 5).

Performance of STF methods in deriving the NDVI and NBR
We calculated the NDVI and NBR from fused Landsat surface reflectance.As shown in Table 3, STFDSC provided the most accurate NDVI in the ZWL and GKM, with RMSEs of 0.0677 and 0.0226, respectively.In addition, STFDSC obtained the closest result to the

Performance of STF methods in deriving the NDVI and NBR
We calculated the NDVI and NBR from fused Landsat surface reflectance.As shown in Table 3, STFDSC provided the most accurate NDVI in the ZWL and GKM, with RMSEs of 0.0677 and 0.0226, respectively.In addition, STFDSC obtained the closest result to the

Performance of STF Methods in Deriving the NDVI and NBR
We calculated the NDVI and NBR from fused Landsat surface reflectance.As shown in Table 3, STFDSC provided the most accurate NDVI in the ZWL and GKM, with RMSEs of 0.0677 and 0.0226, respectively.In addition, STFDSC obtained the closest result to the reference NBR derived from Landsat data in the GKM, with an RMSE of 0.0281 and CC of 0.8891.In the ZWL, EDCSTFN produced slightly better NBR results than STFDSC, with RMSEs of 0.0457 and 0.9467.These results suggested that surface reflectance generated by STFDSC could be used to derive vegetation indices accurately.Moreover, we extracted the pixels with NDVI values larger than 0.30 and then randomly selected 20,000 pixels and drew the density plot of selected pixel TM band 4 and band 7 under the T2/T3/T4 experiment.The results are shown in Figure 6.All fusion methods, except EDCSTFN, underestimated the near-infrared reflectance in the ZWL.The underestimation of STFNET prediction was particularly serious, with a negative bias of 0.0935.The problem could be explained by the algorithm principle.Both STFDSC and STFNET contained two submodels that combined MODIS reflectance change and Landsat data to compute the weight, and predictions were the weighted average of results from the two submodels.ESTARFM and FSDAF also predicted the target Landsat data based on the differences in MODIS reflectance.In the ZWL, the mean MODIS reflectance change was 0.1485 from T2 to T3 and 0.0413 from T3 to T4, corresponding to Landsat reflectance changes of 0.1627 and 0.0162, respectively.MODIS reflectance change was less than Landsat reflectance change from T2 to T3, while from T3 to T4, MODIS reflectance change was larger than Landsat reflectance change, which made the predictions to be far from the reference.EDCSTFN used the MODIS and Landsat reflectance data without incorporating their changes over time and thus avoided the underestimation caused by the observed violation.In TM band 7, EDCSTFN achieved the best accuracy in terms of CC, RMSE, and MAE, followed by STFDSC.The results in the GKM are shown in Figure 7. STFDSC had the best prediction accuracy, followed by STFNET.EDCSTFN generally overestimated Landsat reflectance at both band 4 and band 7.
0.8891.In the ZWL, EDCSTFN produced slightly better NBR results than STFDSC, w RMSEs of 0.0457 and 0.9467.These results suggested that surface reflectance generated STFDSC could be used to derive vegetation indices accurately.Moreover, we extracted the pixels with NDVI values larger than 0.30 and then r domly selected 20,000 pixels and drew the density plot of selected pixel TM band 4 a band 7 under the T2/T3/T4 experiment.The results are shown in Figure 6.All fusion me ods, except EDCSTFN, underestimated the near-infrared reflectance in the ZWL.The u derestimation of STFNET prediction was particularly serious, with a negative bias 0.0935.The problem could be explained by the algorithm principle.Both STFDSC a STFNET contained two submodels that combined MODIS reflectance change and Land data to compute the weight, and predictions were the weighted average of results fr the two submodels.ESTARFM and FSDAF also predicted the target Landsat data bas on the differences in MODIS reflectance.In the ZWL, the mean MODIS reflectance chan was 0.1485 from T2 to T3 and 0.0413 from T3 to T4, corresponding to Landsat reflectan changes of 0.1627 and 0.0162, respectively.MODIS reflectance change was less than Lan sat reflectance change from T2 to T3, while from T3 to T4, MODIS reflectance change w larger than Landsat reflectance change, which made the predictions to be far from reference.EDCSTFN used the MODIS and Landsat reflectance data without incorporat their changes over time and thus avoided the underestimation caused by the observ violation.In TM band 7, EDCSTFN achieved the best accuracy in terms of CC, RMSE, a MAE, followed by STFDSC.The results in the GKM are shown in Figure 7. STFDSC h the best prediction accuracy, followed by STFNET.EDCSTFN generally overestima Landsat reflectance at both band 4 and band 7.

Visual assessment of the five STF fusion methods
Figure 8 shows the prediction results of the five fusion algorithms in the ZWL und Experiment T2/T3/T4.Due to cloud contamination of Landsat data on T4, the fusion sults of FSDAF, ESTARFM and STFNET were severely affected, where clouds and clo shadows were clearly observed.In contrast, no obvious cloud or shadow existed in t STFDSC and EDCSTFN reconstructed images, suggesting the robustness of the STFD and EDCSTFN fusion methods; however, EDCSTFN likely failed to capture spatial d tails, and the corresponding prediction results were more obscured than those from oth STF methods.NDVI and NBR data calculated from fused reflectance by STFDSC w also close to the real data in the first row (Figure 8).These results demonstrated the op mal performance of STFDSC in generating Landsat surface reflectance.

Visual Assessment of the Five STF Fusion Methods
Figure 8 shows the prediction results of the five fusion algorithms in the ZWL under Experiment T2/T3/T4.Due to cloud contamination of Landsat data on T4, the fusion results of FSDAF, ESTARFM and STFNET were severely affected, where clouds and cloud shadows were clearly observed.In contrast, no obvious cloud or shadow existed in the STFDSC and EDCSTFN reconstructed images, suggesting the robustness of the STFDSC and EDCSTFN fusion methods; however, EDCSTFN likely failed to capture spatial details, and the corresponding prediction results were more obscured than those from other STF methods.NDVI and NBR data calculated from fused reflectance by STFDSC were also close to the real data in the first row (Figure 8).These results demonstrated the optimal performance of STFDSC in generating Landsat surface reflectance.
Figure 9 shows the performance of the five STF methods in the GKM based on Experiment T2/T3/T4.The predicted results of STFDSC were shown to be very similar to those of the Landsat reference image in color, especially in the middle and lower right of the image in the first column.The spatial distribution of the NDVI and NBR in Figure 9 also suggested the better performance of the STFDSC method; however, in forests with a higher density such as in the lower right corner, significant differences between observations and STFDSC predictions were found.Due to the large errors in surface reflectance prediction at TM band 3, abnormal EDCSTFN NDVI values were observed.

Landsat Surface Reflectance Time Series Generated Using STFDSC
Landsat data time series from 25 February to 15 May 2004 at 8-day intervals in the ZWL region were shown in Figure 10.We did not predict the Landsat data from 15 May to 6 October, because the two reference data had an interval of 144-day in acquisition time, and STFDSC was not quite suitable for prediction with longer time series without any observation, as shown in Figure 3.
According to the MODIS data time series, significant greening occurred between 5 April and 21 April, which was well captured by reconstructed Landsat data.For Landsat data at reference time periods, such as 25 February, 29 April, and 15 May 2004, some pixels were contaminated by cloud; however, the phenomenon was not observed from Landsat data at prediction times, which indicated that the introduction of MODIS data could improve Landsat predictions to some degree.These results were also established for Landsat data time series from 15 June to 3 September 2004 in the GKM region, suggesting the rationality of STFDSC to generate Landsat data time series (Figure 11).The gaps in MODIS data (e.g., on 9 July and 25 July) did not greatly affect the accuracy of reconstructed Landsat due to the preprocess of MODIS data gaps before prediction.We extracted MODIS and Landsat reference at the green, red, near-infrared, and shortwave infrared bands of different forest types in GKM and ZWL, and results showed that Landsat reflectance data were similar to or even better than MODIS data in temporal trajectories (Figure 12). Figure 9 shows the performance of the five STF methods in the GKM based on Experiment T2/T3/T4.The predicted results of STFDSC were shown to be very similar to those of the Landsat reference image in color, especially in the middle and lower right of    According to the MODIS data time series, significant greening occurred between 5 April and 21 April, which was well captured by reconstructed Landsat data.For Landsat data at reference time periods, such as 25 February, 29 April, and 15 May 2004, some pixels were contaminated by cloud; however, the phenomenon was not observed from Landsat data at prediction times, which indicated that the introduction of MODIS data could improve Landsat predictions to some degree.These results were also established for Landsat data time series from 15 June to 3 September 2004 in the GKM region, suggesting the rationality of STFDSC to generate Landsat data time series (Figure 11).The gaps in MODIS data (e.g., on 9 July and 25 July) did not greatly affect the accuracy of reconstructed Landsat due to the preprocess of MODIS data gaps before prediction.We extracted MODIS and Landsat reference at the green, red, near-infrared, and shortwave infrared bands of different forest types in GKM and ZWL, and results showed that Landsat reflectance data were similar to or even better than MODIS data in temporal trajectories (Figure 12).

Discussion
We proposed the STFDSC method to predict Landsat data time series, which contains low-level and middle-level features that have been widely ignored in most STF-related CNN models [29,49].The usage of multilevel feature information could be the major reason for its better performance demonstrated in this study.
To build an efficient CNN, we used the DSC in the architecture of STFDSC, which was especially suitable for mobile and embedded vision applications [50].The combined

Discussion
We proposed the STFDSC method to predict Landsat data time series, which contains low-level and middle-level features that have been widely ignored in most STF-related CNN models [29,49].The usage of multilevel feature information could be the major reason for its better performance demonstrated in this study.
To build an efficient CNN, we used the DSC in the architecture of STFDSC, which was especially suitable for mobile and embedded vision applications [50].The combined use of DSC and some efficiency tricks (e.g., residual structure, BatchNorm) can prevent the overfitting and speed up network convergence.Compared with EDCSTFN, the number of trainable parameters in STFDSC was less.For the prediction of the target image with a size of 4 × 150 × 150, the forward propagation floating-point computation required by the two methods as shown in Table 4. STFNET was the lightest model, but at the cost of sacrificing prediction accuracy.Since surface reflectance at four TM bands varied in a wide range (e.g., from 0.0305 at TM band 3 to 0.3036 at TM band 4 in the GKM), the band with higher reflectance tended to contribute more to the loss function when the weight factor was not included.Therefore, we adopted the weighted loss function in STFDSC, assigning the weight for each band according to the mean surface reflectance value; this could ensure the stability of STFDSC in predicting surface reflectance at the band with small reflectance values to some degree.We also obtained the final predicted reflectance using the weighted average of the two-stream prediction results, which also reduced the variation in the forecast results.
It should be noted that, similar to other STF methods, high-quality reference data were required in STFDSC; however, this requirement could not be met in practical applications, or the two nearest reference data points on the acquisition date may have a long-time interval, which could possibly cause a large bias in the predictions.In some cases, only Landsat images with heavy cloud cover are available, particularly in tropical forests; it would be necessary to consider how to fuse with auxiliary data sources such as SAR data [51].
The proposed STFDSC could be a unified framework for fusing datasets from different sensors, including but not limited to Landsat and MODIS data.Future research could be conducted on the generation of higher spatial or temporal satellite images, such as through the fusion of WorldView-2 or Sentinel-2 images using the STFDSC algorithm [52,53], and higher-level processed satellite data products, such as the leaf area index and land surface temperature [47,54,55].

Conclusions
We proposed an STFDSC method for reconstructing Landsat surface reflectance from Landsat and MODIS data.The performance was evaluated in the GKM and ZWL forest regions of China, and comparisons with four published fusion methods, FSDAF, ESTARFM, EDCSTFN and STFNET, were also carried out.The results suggest that the STFDSC based results were much closer to Landsat reference data in the GKM than ESTARFM, FSDAF, EDCSTFN, and STFNET.EDCSTFN performed slightly better than STFDSC in ZWL, but produced abnormal predictions at TM band 3 in the GKM; moreover, the NDVI and NBR calculated from Landsat surface reflectance generated by STFDSC were very close to those obtained from Landsat reference data.The results indicated that the STFDSC proposed in this study had stable and accurate predictions in both the GKM and ZWL.Based on STFDSC, Landsat reflectance time series at 8-day temporal resolution were also reconstructed in two regions.Although STFDSC was designed to generate reflectance time series in this study, it could be used to integrate higher-level processed satellite data products such as the leaf area index and land surface temperature, which needs to be investigated in future studies.

Figure 1 .
Figure 1.Location of the study area, region of interest (purple regions in Box 1 and Box 2), and coverages of Landsat 5 scenes labeled by Worldwide Reference System 2 (WRS-2) path/row for the study area in the Greater Khingan Mountains (GKM, Box 1) and Ziwuling Mountains (ZWL, Box

Figure 1 .
Figure 1.Location of the study area, region of interest (purple regions in Box 1 and Box 2), and coverages of Landsat 5 scenes labeled by Worldwide Reference System 2 (WRS-2) path/row for the study area in the Greater Khingan Mountains (GKM, Box 1) and Ziwuling Mountains (ZWL, Box 2).The background data showed different forest types from Li et al.[41] with mapped colors described in the legend.

Figure 2 .
Figure 2. Architecture of the STFDSC model.Inputs of the STFDSC model were Landsat and MODIS data at base times (L2, L4, M2, and M4), and MODIS data at the prediction time (M3).Pre_L23 and Pre_L43 were predicted Landsat data using two sub-models, and Pre_L3 was the final Landsat prediction and obtained by weighted average of Pre_L23 and Pre_L43.

Figure 2 .
Figure 2. Architecture of the STFDSC model.Inputs of the STFDSC model were Landsat and MODIS data at base times (L2, L4, M2, and M4), and MODIS data at the prediction time (M3).Pre_L23 and Pre_L43 were predicted Landsat data using two sub-models, and Pre_L3 was the final Landsat prediction and obtained by weighted average of Pre_L23 and Pre_L43.

Figure 3 .
Figure 3. Performance of five STF methods in reconstructing Landsat surface reflectance in the ZWL.

Figure 3 .
Figure 3. Performance of five STF methods in reconstructing Landsat surface reflectance in the ZWL.

Figure 4 .
Figure 4. Performance of five STF methods in reconstructing Landsat surface reflectance in the GKM.

Figure 5 .
Figure 5. Performance of five STF methods in terms of the SAM index in the ZWL and GKM.

Figure 4 .
Figure 4. Performance of five STF methods in reconstructing Landsat surface reflectance in the GKM.

Figure 3 .
Figure 3. Performance of five STF methods in reconstructing Landsat surface reflectance in the ZWL.

Figure 4 .
Figure 4. Performance of five STF methods in reconstructing Landsat surface reflectance in the GKM.

Figure 5 .
Figure 5. Performance of five STF methods in terms of the SAM index in the ZWL and GKM.

Figure 5 .
Figure 5. Performance of five STF methods in terms of the SAM index in the ZWL and GKM.

Figure 6 .
Figure 6.Density scatter plot of the observed reflectance and predicted reflectance at TM bands and 7 in the ZWL.

Figure 6 .Figure 7 .
Figure 6.Density scatter plot of the observed reflectance and predicted reflectance at TM bands 4 and 7 in the ZWL.

Figure 7 .
Figure 7. Scatter density plot of the observed reflectance and predicted reflectance at TM band 4 and band 7 in the GKM.

Figure 8 .
Figure 8. Fusion results on 29 April in the ZWL region.The first column shows the TM band 7-4-2 composite images of the reference image and five STF methods; the second column shows the zoomed-in details of the black rectangle in the first column; the third column shows the calculated NDVI, and the last column shows the calculated NBR.

Figure 8 .
Figure 8. Fusion results on 29 April in the ZWL region.The first column shows the TM band 7-4-2 composite images of the reference image and five STF methods; the second column shows the zoomed-in details of the black rectangle in the first column; the third column shows the calculated NDVI, and the last column shows the calculated NBR.

20 Figure 9 .
Figure 9. Fusion results on 2 August in the GKM region.The first column shows the TM band 7-4-2 composite images of the reference image and five STF methods; the second column shows the zoomed-in details of the black rectangle in the first column; the third column shows the calculated NDVI, and the last column shows the calculated NBR.

Figure 9 .
Figure 9. Fusion results on 2 August in the GKM region.The first column shows the TM band 7-4-2 composite images of the reference image and five STF methods; the second column shows the zoomed-in details of the black rectangle in the first column; the third column shows the calculated NDVI, and the last column shows the calculated NBR.

Figure 10 .
Figure 10.MODIS and Landsat data time series from 25 February to 15 May 2004 in the ZWL region.TM images were shown using the 7-4-2 band combination, and MODIS images were shown using the 7-2-4 band combination.The blue arrow between MODIS and Landsat represented that the corresponding Landsat data were reconstructed, rather than the observed reference image.

Figure 10 . 20 424Figure 11 . 429 Figure 11 .
Figure 10.MODIS and Landsat data time series from 25 February to 15 May 2004 in the ZWL region.TM images were shown using the 7-4-2 band combination, and MODIS images were shown using the 7-2-4 band combination.The blue arrow between MODIS and Landsat represented that the corresponding Landsat data were reconstructed, rather than the observed reference image.Remote Sens. 2022, 14, x FOR PEER REVIEW 16 of 20

Figure 11 .
Figure 11.MODIS and Landsat data time series from 15 June to 3 September 2004 in the GKM region.TM images were shown using the 7-4-2 band combination, and MODIS images were shown using the 7-2-4 band combination.The blue arrow between MODIS and Landsat represented that the corresponding Landsat data were reconstructed, rather than the observed reference image.

Figure 12 .
Figure 12.MODIS and Landsat reflectance time series of different forest types in GKM and ZWL regions.The top row represented the temporal trajectories of reflectance in GKM, and the bottom row corresponded to temporal trajectories of reflectance in ZWL.

Figure 12 .
Figure 12.MODIS and Landsat reflectance time series of different forest types in GKM and ZWL regions.The top row represented the temporal trajectories of reflectance in GKM, and the bottom row corresponded to temporal trajectories of reflectance in ZWL.

Table 1 .
Landsat and MODIS data were used in this study for 2004.

Table 2 .
Comparison experiments carried out in the ZWL and GKM with different time intervals.

Table 3 .
Performances of five STF methods in reconstructing the NDVI and NBR.

Table 3 .
Performances of five STF methods in reconstructing the NDVI and NBR.

Table 4 .
The number of trainable parameters and forward propagated floating-point operations of the three CNN-based STF methods.