A Novel Deep Learning-Based Spatiotemporal Fusion Method for Combining Satellite Images with Different Resolutions Using a Two-Stream Convolutional Neural Network

Spatiotemporal fusion is considered a feasible and cost-effective way to solve the trade-off between the spatial and temporal resolution of satellite sensors. Recently proposed learning-based spatiotemporal fusion methods can address the prediction of both phenological and land-cover change. In this paper, we propose a novel deep learning-based spatiotemporal data fusion method that uses a two-stream convolutional neural network. The method combines both forward and backward prediction to generate a target fine image, where a temporal change-based mapping and a spatial information-based mapping are formed simultaneously, addressing the prediction of both phenological and land-cover changes with better generalization ability and robustness. Comparative experimental results for the test datasets with phenological and land-cover changes verified the effectiveness of our method. Compared to existing learning-based spatiotemporal fusion methods, our method is more effective in predicting phenological change and directly reconstructs the prediction with complete spatial details, without the need for auxiliary modulation.


Introduction
Satellites with high temporal and spatial resolution are capable of capturing dynamics at a fine scale, obtaining dense remotely sensed time-series data that play an important role in studying the dynamics of earth systems, such as monitoring vegetation phenology [1], detecting land-cover changes [2], discriminating different land-cover types [3], and modeling carbon sequestration [4]. With the increase in the number of available satellite images, studies using dense time-series data have become extremely popular in this decade. However, a trade-off between the spatial and the temporal resolution of satellite sensors still exists due to limitations of the technology and budget constraints [5]. Spatiotemporal fusion methods, which combine different sensors, are considered a feasible and cost-effective way to solve this problem [6,7]. Specifically, spatiotemporal fusion methods combine high spatial and temporal resolution images to generate fused images under the condition that the two kinds of sensors have similar spectral properties.
• DL-SDFM addresses the prediction of both phenological and land-cover changes with high generalization ability and robustness.
• DL-SDFM enriches the learning-based spatiotemporal fusion method with temporal change information, resulting in a more robust ability to predict phenological change compared to the existing learning-based spatiotemporal fusion methods.
• DL-SDFM can directly reconstruct the prediction with complete spatial details, without the need for auxiliary modulation.

Methods
We refer to images with low spatial resolution but high temporal resolution as "coarse" images, while images with high spatial resolution but low temporal resolution are called "fine" images. In this paper, we consider both forward and backward prediction to generate a target fine image F2 with higher robustness, given two coarse and fine image pairs (F1 and C1, F3 and C3) and a corresponding coarse image C2 (Figure 1).

Figure 1. Images for spatiotemporal fusion. Coarse images were obtained at t1, t2, and t3, while fine images are available for t1 and t3; the fine image at t2 is the target image.

The flowchart of DL-SDFM is shown in Figure 2. In forward prediction, a temporal change-based mapping M1 and a spatial information-based mapping M2 are built simultaneously using a two-stream convolutional neural network, taking the two coarse and fine image pairs as inputs. Then, using the two learned mappings, two independent predictions (F̂2^1 and F̂2^2), focusing on phenological and land-cover change respectively, are obtained. The final forward prediction F̂2^fw is generated by combining these two independent predictions using a weighted method. Backward prediction is implemented in the same fashion: a temporal change-based and a spatial information-based mapping are built simultaneously, followed by the two independent backward predictions, which are combined into the final backward prediction F̂2^bw. The target fine image F̂2 is then obtained by combining the forward and backward predictions. A detailed description of each step of DL-SDFM is given below.

Temporal Change-Based Mapping
Existing learning-based spatiotemporal fusion methods do not introduce the physical temporal change information. To augment the learning-based spatiotemporal fusion method with temporal change information and develop a more powerful ability to predict phenological change, we formulate the temporal change-based mapping.
Suppose that the change of reflectance from date t1 to t2 is linear. If the period ∆t is short, the coarse image at t2 can be described as

C(x, y, B, t2) = a × C(x, y, B, t1) + b, (1)

where C denotes the coarse image, (x, y) is a given pixel location in band B at the two dates, and a and b are the coefficients of the linear regression model that describes the temporal change of the reflectance of the coarse image between t1 and t2. Since the coarse image has spectral bands similar to those of the fine image, the linear relationship between the two known coarse images (C1 and C2) can be applied to the fine image at t1 to obtain the target fine image F2:

F(x, y, B, t2) = a × F(x, y, B, t1) + b. (2)

The coefficients a and b can be estimated using the least squares method in a moving window; however, this increases the computational cost. We therefore assume that a is equal to 1, so the target fine image at t2 can be calculated as

F(x, y, B, t2) = F(x, y, B, t1) + (C(x, y, B, t2) − C(x, y, B, t1)). (3)

In this case, Equation (3) is equal to the basis of STARFM. If we further introduce a conversion coefficient V into Equation (3) to adapt the method to spatially heterogeneous regions, the target fine image at t2 can be calculated as

F(x, y, B, t2) = F(x, y, B, t1) + V(x, y, B) × (C(x, y, B, t2) − C(x, y, B, t1)). (4)

Equation (4) is equal to the basis of ESTARFM. From the above analysis, these methods can be summarized as the following mapping:

F2 = M(F1, ∆T), (5)

where M is the mapping function and ∆T represents the temporal change information. The mapping function M is a predefined weighting function in a moving window. Although the effectiveness of M in the prediction of phenological change has been verified, it is difficult to accurately model the complex and nonlinear relationships between the central and neighboring information, so the generalization of these methods, with such an artificial and predefined weighting function, can be improved.
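The linear basis above (the STARFM and ESTARFM forms, Equations (3) and (4)) can be sketched in a few lines of NumPy; the toy arrays and the conversion coefficient V below are illustrative assumptions, not the paper's data:

```python
import numpy as np

# Toy 3x3 single-band reflectance grids (illustrative values only).
F1 = np.array([[0.20, 0.22, 0.25],
               [0.21, 0.23, 0.26],
               [0.22, 0.24, 0.27]])
C1 = np.full((3, 3), 0.22)   # coarse image at t1
C2 = np.full((3, 3), 0.30)   # coarse image at t2

# Equation (3), the STARFM basis: add the coarse temporal change to F1.
F2_starfm = F1 + (C2 - C1)

# Equation (4), the ESTARFM basis: scale the change by a per-pixel
# conversion coefficient V (here an assumed constant, for illustration).
V = np.full((3, 3), 1.1)
F2_estarfm = F1 + V * (C2 - C1)

print(round(float(F2_starfm[0, 0]), 4))   # 0.20 + 0.08 = 0.28
```

In DL-SDFM, this predefined linear update is replaced by the learned mapping M1, but the inputs remain the same quantities: the known fine image and the coarse temporal change.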
For a more adequate characterization of the complex and nonlinear mapping function M to improve the fusion performance in phenological change prediction, in this paper, we leverage the CNN's capability in nonlinear mapping representation to formulate a nonlinear mapping M 1 for phenological change prediction and learning of the self-adaption weights.
Specifically, for forward prediction, we consider F3 as the label, and the temporal change information C3 − C1 and the known fine image F1 as the inputs. The mapping M1 is learned via the proposed two-stream convolutional neural network (Section 2.3):

Φ1 = arg min L1(M1(C13, F1; Φ1), F3), (6)

where Φ1 is the parameter of mapping M1, L1 is the defined loss function, and C13 is an abbreviated form of C3 − C1, which represents the temporal change information.

Spatial Information-Based Mapping
The mapping M 1 can be regarded as the linear-based spatiotemporal fusion with prior images, focusing on the view of temporal change. Although the self-adaption weight learned by the CNN has more powerful generalization ability than the traditional linear-based spatiotemporal fusion methods, since the basis of M 1 is the same as that of STARFM, it also lacks the ability to address the predictions with land-cover change. Therefore, for the latter, we further formulate the mapping M 2 , which can directly reconstruct the spatial detail.
Learning-based spatiotemporal fusion methods are considered to have a stronger ability to predict land-cover change. These methods first formulate the complex mapping between the coarse and fine images based on spatial structural similarity and then use the learned mapping to predict the target fine image. Given two fine and coarse image pairs (F1 and C1, F3 and C3), the nonlinear mapping between the fine and coarse images can be defined as

F = M(C). (7)

Although the above mapping endows the spatiotemporal fusion method with the capacity to predict land-cover change, the magnification factor in spatiotemporal fusion is much larger than in single-image super-resolution (where it usually ranges from 2 to 4). In this case, texture details are severely blurred and distorted in coarse images. Thus, it may not be effective to reconstruct the spatial details directly using the above mapping.
To improve the above mapping's ability to directly reconstruct the spatial details, we introduce the spatial difference information between fine and coarse images, expressed as F − C, into the mapping (Equation (7)) to formulate the mapping M2. Note that the spatial difference information, also known as high-frequency information, has been shown to be useful in reconstructing spatial detail [27]. For the mapping M2 in forward prediction, we regard F3 as the label, and the spatial difference information F1 − C1 and the known coarse image C3 as the inputs. Similarly, the nonlinear mapping M2 is also learned by the proposed two-stream convolutional neural network:

Φ2 = arg min L2(M2(F1 − C1, C3; Φ2), F3), (8)

where Φ2 is the parameter of mapping M2, and L2 is the defined loss function.
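As a concrete illustration, the input tensors fed to the two streams can be assembled as follows; the patch size, band count, and random data are illustrative assumptions:

```python
import numpy as np

H, W, bands = 64, 64, 6                      # illustrative patch size and bands
rng = np.random.default_rng(0)
F1, C1 = rng.random((H, W, bands)), rng.random((H, W, bands))
F3, C3 = rng.random((H, W, bands)), rng.random((H, W, bands))

# Inputs of the temporal change-based mapping M1:
# the coarse temporal change C13 = C3 - C1 plus the known fine image F1.
x_m1 = np.concatenate([C3 - C1, F1], axis=-1)

# Inputs of the spatial information-based mapping M2:
# the spatial difference F1 - C1 plus the coarse image C3.
x_m2 = np.concatenate([F1 - C1, C3], axis=-1)

label = F3        # both streams are trained against the fine image F3
print(x_m1.shape, x_m2.shape)   # (64, 64, 12) (64, 64, 12)
```

Both streams therefore see the same number of input channels, which keeps the two branches of the network symmetric.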

Network Architecture
In this paper, we propose a relatively lightweight two-stream CNN with a dilated convolution-based inception module to simultaneously learn the mappings M 1 and M 2 . The network consists of three stages: (1) multi-scale feature extraction; (2) multi-source feature fusion; and (3) image reconstruction. The overall architecture and the configuration of the network are provided in Figure 4 and Table 1, respectively. A detailed description of each stage is given below.
(1) Multi-scale feature extraction
Remote sensing images with high spatial heterogeneity contain abundant texture details, where the size of ground objects varies greatly. Thus, it is effective to use rich multi-scale spatial information to improve the robustness of feature extraction in these areas. To capture multi-scale spatial information, the GoogLeNet inception module proposed by Szegedy et al. [34] concatenates the outputs of different-sized filters, e.g., 3 × 3, 5 × 5, 7 × 7, assuming that each filter can capture information at the corresponding scale. Recently, the inception module has been utilized for image reconstruction and fusion tasks and has achieved state-of-the-art performance [35][36][37]. However, increasing the size of the filters inevitably increases the number of parameters, which may not be appropriate in the case of insufficient prior images (in our case, only two fine and coarse image pairs are available for training in the spatiotemporal fusion task). Inspired by the GoogLeNet inception module, Shi [38] proposed the dilated convolution-based inception module to capture multi-scale information. In contrast to conventional convolutions, dilated convolutions enlarge the receptive field while maintaining the size of the convolution kernel, avoiding an increase in parameters. Dilated convolution employs the same filter at different ranges with different dilation factors, which allows it to capture multi-scale spatial information without increasing the parameters. As illustrated in Figure 3, the three dilated convolutions in the dilated convolution-based inception module can capture multi-scale spatial information (see the pixels in blue) whilst operating with the same kernel size.
In this paper, we utilize this module to capture multi-scale spatial information. As shown in Figure 4, three dilated convolutions with kernel size 3 × 3 and dilation factors 1, 2, and 3 are simultaneously applied to the input image, each producing a 64-channel feature map; the three outputs are concatenated into a single 192-channel feature map. Then, a conventional convolution with kernel size 3 × 3 is performed on the concatenated output, generating 64 feature maps. A rectified linear unit (ReLU) is used after each convolution layer to introduce nonlinearity and to speed up the convergence of the network.
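A minimal NumPy sketch of a single 3 × 3 dilated convolution shows how the three dilation factors see 3 × 3, 5 × 5, and 7 × 7 neighborhoods with the same nine weights; this illustrates the operation itself, not the trained network:

```python
import numpy as np

def dilated_conv3x3(img, kernel, dilation):
    """3x3 dilated convolution with zero padding; output keeps the input shape."""
    H, W = img.shape
    pad = dilation
    padded = np.pad(img, pad)
    out = np.zeros((H, W))
    for i in range(3):
        for j in range(3):
            di, dj = (i - 1) * dilation, (j - 1) * dilation
            out += kernel[i, j] * padded[pad + di:pad + di + H,
                                         pad + dj:pad + dj + W]
    return out

# A unit impulse reveals the receptive field of each dilation factor.
img = np.zeros((9, 9)); img[4, 4] = 1.0
kernel = np.ones((3, 3))
for d in (1, 2, 3):
    resp = dilated_conv3x3(img, kernel, d)
    ys, xs = np.nonzero(resp)
    # Always nine taps, but spread over a (2d+1) x (2d+1) window.
    print(d, len(ys), ys.max() - ys.min() + 1)
```

In the module, the three responses would be computed with 64 learned filters each and concatenated into the 192-channel map described above.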

(2) Multi-source feature fusion
After capturing the multi-scale spatial information of the fine and coarse image pair using the dilated convolution-based inception module, the extracted multi-scale information is concatenated into a single 384-channel feature map, followed by a conventional convolution with kernel size 3 × 3, generating 64 feature maps.
However, a "gridding" issue [39] usually exists in the dilated convolution framework, which means that layers with an equal dilation factor may result in the loss of a large portion of information. A hybrid dilated convolution module [39] is a simple solution to address this issue, which uses a combination of dilated convolutions with different dilation factors to cover a square region without missing information. For more detailed descriptions of the "gridding" issue and the hybrid dilated convolution module, we refer the reader to [39].
To further enlarge the receptive field for extracting more contextual information while avoiding the "gridding" issue, a hybrid dilated convolution module is introduced in the multi-source feature fusion stage, where three dilated convolutions with a kernel size of 3 × 3 and dilation factors of 1, 2, and 3 are applied in sequence, each generating 64 feature maps. A ReLU is used after each convolution layer. This process can be described as:

Fn = ReLU(Wn ∗ Fn−1 + bn), n = 1, 2, 3, (9)

where ∗ represents the convolution operation, while Fn, Wn, and bn represent the feature maps, filters, and biases of the n-th dilated convolution in the multi-source feature fusion stage.

(3) Image reconstruction
In the image reconstruction stage, a conventional convolution with a kernel size of 3 × 3 and a single filter is utilized to generate the final output.
Given two pairs of fine and coarse images (F1 and C1, F3 and C3), we formulate the objective function of the proposed network as follows:

min L(Φ1, Φ2) = L1 + λ L2. (10)

Here, Φ1 and Φ2 denote the network parameters of mappings M1 and M2, λ is the weighting parameter, and L1 and L2 are the losses of the two networks, which are denoted as:

L1 = L(M1(C13, F1; Φ1), F3), L2 = L(M2(F1 − C1, C3; Φ2), F3), (11)

where L is the mean square error (MSE)-based loss function.
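The joint objective reduces to a λ-weighted sum of two MSE terms. A sketch, under the assumption that both streams output predictions of the same shape as the fine label:

```python
import numpy as np

def mse(pred, target):
    """Mean square error, the basic loss L used for both streams."""
    return float(np.mean((pred - target) ** 2))

def joint_loss(pred_m1, pred_m2, fine_label, lam):
    """Joint objective: L = L1 + lam * L2, both MSE against the fine label."""
    return mse(pred_m1, fine_label) + lam * mse(pred_m2, fine_label)

F3 = np.zeros((4, 4))
p1 = np.full((4, 4), 0.1)     # stream M1 output (toy values)
p2 = np.full((4, 4), 0.2)     # stream M2 output (toy values)
print(joint_loss(p1, p2, F3, lam=0.5))   # 0.01 + 0.5 * 0.04 ≈ 0.03
```

In a framework such as Keras, the same effect is obtained by compiling the two-output model with MSE losses and loss weights [1, λ].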

Prediction Stage
Based on the two learned mappings (M1 and M2), for the forward prediction, we obtain two independent predictions (F̂2^1 and F̂2^2), focusing on phenological and land-cover change, respectively. One pair of known fine and coarse images (F1 and C1) and a coarse image (C2) at the prediction date t2 are taken as the inputs:

F̂2^1 = M1(C12, F1; Φ1), F̂2^2 = M2(F1 − C1, C2; Φ2), (12)

where C12 is an abbreviated form of C2 − C1. Since the two mappings are based on temporal change and spatial information respectively, the two independent predictions F̂2^1 and F̂2^2 may have different applicability under different scenarios. Here, we employ a weighted combination method to synthesize the two independent predictions, giving DL-SDFM the ability to predict both phenological and land-cover change.
Under ideal conditions, the weight of the weighted combination method could be determined using the bias between the two predictions and the actual fine image; however, the actual fine image is unknown. Thus, we use the corresponding coarse image at the prediction date in place of the actual fine image to determine the weight. Meanwhile, to reduce the prediction errors caused by the inconsistency of spatial resolution between fine and coarse images, the weight measurement is implemented in a 3 × 3 moving window as follows:

Ek(x, y, B, t2) = |Σj F̂2^k,j(x, y) − Σj C2,j(x, y)|, k = 1, 2, (13)

wk(x, y, B, t2) = (1/Ek) / (1/E1 + 1/E2), k = 1, 2, (14)

where F̂2^1,j(x, y) and F̂2^2,j(x, y) are the reflectance values of the j-th pixel of the two independent predictions in a 3 × 3 moving window centered on pixel (x, y), while C2,j(x, y) is the corresponding reflectance of the j-th pixel of the coarse image at the prediction date. Then, the final forward prediction F̂2^fw can be obtained using:

F̂2^fw(x, y, B, t2) = w1(x, y, B, t2) × F̂2^1(x, y, B, t2) + w2(x, y, B, t2) × F̂2^2(x, y, B, t2). (15)
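The window-based weighting can be sketched as follows; the inverse-bias weight form follows the description above, and the small epsilon is an assumption added for numerical safety:

```python
import numpy as np

def box_sum3(img):
    """Sum over a 3x3 moving window (edge-replicated padding)."""
    p = np.pad(img, 1, mode='edge')
    H, W = img.shape
    return sum(p[i:i + H, j:j + W] for i in range(3) for j in range(3))

def combine_predictions(pred1, pred2, coarse, eps=1e-8):
    # Bias of each prediction against the coarse image, per 3x3 window.
    e1 = np.abs(box_sum3(pred1) - box_sum3(coarse)) + eps
    e2 = np.abs(box_sum3(pred2) - box_sum3(coarse)) + eps
    w1 = (1.0 / e1) / (1.0 / e1 + 1.0 / e2)   # smaller bias -> larger weight
    return w1 * pred1 + (1.0 - w1) * pred2

coarse = np.full((5, 5), 0.3)
pred1 = np.full((5, 5), 0.3)      # agrees with the coarse image
pred2 = np.full((5, 5), 0.9)      # far from the coarse image
fused = combine_predictions(pred1, pred2, coarse)
print(float(fused[2, 2]))          # close to 0.3: pred1 dominates
```

With real data the two predictions dominate in different regions, which is exactly the behavior the combination is designed to exploit.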

Combination with Backward Prediction
The method described above focuses only on forward prediction. To improve robustness through the combination of both forward and backward predictions, we further consider backward prediction. Accordingly, F1 is regarded as the label in the training stage of the backward prediction. The two mappings in backward prediction are also simultaneously learned by the two-stream convolutional neural network:

Φ1' = arg min L1(M1'(C31, F3; Φ1'), F1), Φ2' = arg min L2(M2'(F3 − C3, C1; Φ2'), F1), (16)

where Φ1' and Φ2' denote the network parameters of the two mappings M1' and M2' for backward prediction, and C31 is defined as C1 − C3. Similar to the prediction stage in the forward prediction, the two independent backward predictions F̂2^1', F̂2^2' and the final backward prediction F̂2^bw(x, y, B, t2) are obtained by:

F̂2^bw(x, y, B, t2) = w1'(x, y, B, t2) × F̂2^1'(x, y, B, t2) + w2'(x, y, B, t2) × F̂2^2'(x, y, B, t2), (17)

where w1'(x, y, B, t2) and w2'(x, y, B, t2) denote the weights of the two independent backward predictions.
To combine the forward and backward predictions and obtain the final prediction, we utilize the same weighted combination method based on a 3 × 3 moving window as described in Section 2.4.
F̂2(x, y, B, t2) = w_fw(x, y, B, t2) × F̂2^fw(x, y, B, t2) + w_bw(x, y, B, t2) × F̂2^bw(x, y, B, t2), (18)

where w_fw(x, y, B, t2) and w_bw(x, y, B, t2) denote the weights of the forward and backward predictions, which are determined by the bias between the two predictions and the coarse image at the prediction date.

Study Area and Data Sets
The performance of DL-SDFM was tested on two study sites to verify the effectiveness of the land-cover change and phenological change prediction, respectively.
The first study site was the Lower Gwydir Catchment (LGC), located in northern New South Wales, Australia, with an overall area of 5440 km2 (3200 × 2720 pixels in Landsat images with six bands). A large flood occurred in mid-December 2004, leading to significant changes in land cover, which makes this study site suitable for testing the effectiveness of DL-SDFM in land-cover change prediction.
The dataset in this study site is the same as the one used by Emelyanova et al. [9]. In this experiment, we used two MODIS MOD09GA Collection 5 and Landsat 7 ETM+ image pairs acquired on 26 November 2004, and 28 December 2004, respectively, and a MODIS image acquired on 12 December 2004, to predict the fine Landsat image acquired on 12 December 2004, while the actual Landsat image acquired on that date was used to evaluate the fusion performance ( Figure 5). MODIS images were upsampled from 500 m to the same spatial resolution as the Landsat images (25 m) using a nearest neighbor algorithm.
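Nearest-neighbor upsampling of a coarse grid to the fine grid (the 500 m → 25 m case corresponds to a factor of 20) can be sketched with np.kron; the toy array below is illustrative:

```python
import numpy as np

coarse = np.array([[0.1, 0.2],
                   [0.3, 0.4]])          # toy 2x2 MODIS-like grid
factor = 20                              # 500 m / 25 m resolution ratio

# Nearest-neighbor upsampling: replicate each coarse pixel factor x factor times.
fine_grid = np.kron(coarse, np.ones((factor, factor)))

print(fine_grid.shape)                     # (40, 40)
print(fine_grid[0, 19], fine_grid[0, 20])  # block boundary between pixels
```

Nearest-neighbor resampling preserves the original coarse reflectance values, which is why it is preferred here over interpolating schemes that would smooth the data.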
The second study site was located in a heterogeneous rain-fed agricultural area in central Iowa (CI), USA, with an overall area of 18,225 km2 (4500 × 4500 pixels in Landsat images with six bands) and an obvious phenological change. We chose this study site to test the performance of DL-SDFM in spatially heterogeneous areas with visible phenological change. Two MODIS MOD09GA Collection 6 and Landsat 7 ETM+ image pairs acquired on 14 May 2002, and 2 August 2002, along with the MODIS image acquired on 2 July 2002, were utilized to predict the Landsat image acquired on 2 July 2002. As before, the actual Landsat image obtained on 2 July 2002 was utilized to evaluate the fusion performance (Figure 6). MODIS images were upsampled from 480 m to the same spatial resolution as the Landsat images (30 m) using a nearest neighbor algorithm.

Parameter Setting
In the training stage, all the input images of the two mappings were cropped into patches of size 50 × 50 pixels with a stride of 50. The weighting parameter λ was fine-tuned by comparing the candidate values {30, 40, 50, 60} to obtain the lowest root mean square error (RMSE) on the test datasets. To avoid over-fitting, multi-angle image rotation (angles of 0°, 90°, 180°, and 270°) was utilized to increase the training sample size. For optimization, the proposed network was trained using the Adam algorithm [40] as the gradient descent optimization method, with momentum β1 = 0.9, β2 = 0.999, and ε = 10^−8, which is the same as that used by Yuan et al. [41]. The batch size was set to 64 to fit into the GPU memory. The learning rate α was initialized to 0.0001 for the whole network, determined by comparing the candidate values {0.00001, 0.0001, 0.001, 0.01} to obtain the lowest RMSE on the test datasets. The training process lasted 60 epochs to ensure convergence, and after every 10 epochs the learning rate was multiplied by a descent factor of 0.5. We employed the Keras deep learning library with TensorFlow to train the network on a PC with 32 GB RAM, an i7-7700K CPU, and an NVIDIA GTX 1070 GPU.
In the prediction stage, we cropped input images into patches of size 600 × 600 pixels to fit into the GPU memory. Meanwhile, to avoid boundary artifacts, we ensured that adjacent patches overlapped.
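Patch-wise prediction with overlap and averaging can be sketched as below; the tile size and overlap are scaled down for illustration, and predict_fn stands in for the trained network:

```python
import numpy as np

def predict_tiled(img, predict_fn, patch=8, overlap=2):
    """Run predict_fn on overlapping patches and average the overlaps,
    which suppresses boundary artifacts at patch seams."""
    H, W = img.shape
    out = np.zeros((H, W))
    cnt = np.zeros((H, W))
    step = patch - overlap
    for i in range(0, H, step):
        for j in range(0, W, step):
            i0, j0 = min(i, H - patch), min(j, W - patch)  # clamp at borders
            out[i0:i0 + patch, j0:j0 + patch] += predict_fn(
                img[i0:i0 + patch, j0:j0 + patch])
            cnt[i0:i0 + patch, j0:j0 + patch] += 1
    return out / cnt

img = np.arange(400.0).reshape(20, 20)
# With an identity "network", tiling plus averaging must reproduce the input.
result = predict_tiled(img, lambda x: x)
print(np.allclose(result, img))   # True
```

The identity check is a useful sanity test: any blending scheme that fails it would introduce its own artifacts on real predictions.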

Comparison and Evaluation Strategy
(1) Comparison with other fusion methods To verify the superiority of DL-SDFM, three state-of-the-art spatiotemporal fusion methods, including STARFM, Flexible Spatiotemporal Data Fusion (FSDAF), and STFDCNN, were utilized as benchmark methods. Fusion results were quantitatively and visually evaluated by comparing the prediction with the actual fine image acquired at the prediction date. For the quantitative evaluation, six indices were used: RMSE, correlation coefficient (CC), universal image quality index (UIQI) [42], the structural similarity (SSIM) [43], erreur relative global adimensionnelle de synthèse (ERGAS) [44], and the spectral angle mapper (SAM) [45].
RMSE was used to provide a global measure of the radiometric differences between the predicted and the actual fine image. CC was used to show the linear relationship between the prediction and the actual fine image. SSIM was used to show the similarity of the overall structure between the predicted and the actual fine image. UIQI depicts the closeness between the two images using the differences in their statistical distributions. SAM reflects the spectral fidelity of the prediction, while ERGAS measures the overall fusion result. The ideal values of RMSE, CC, SSIM, and UIQI are 0, 1, 1, and 1, respectively, while smaller values for ERGAS and SAM indicate better fusion performance. Additionally, the Average Absolute Difference (AAD) maps between the actual fine image and the fusion results, averaged over the six bands, were calculated; they represent the spatial distribution and magnitude of the predictions' uncertainty. The closer the AAD is to zero, the lower the uncertainty of the prediction.
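Several of these indices can be computed directly; a NumPy sketch of RMSE, CC, and the AAD map on toy single-band arrays:

```python
import numpy as np

def rmse(pred, ref):
    """Root mean square error: global radiometric difference."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def cc(pred, ref):
    """Pearson correlation coefficient between the two images."""
    return float(np.corrcoef(pred.ravel(), ref.ravel())[0, 1])

def aad_map(pred, ref):
    """Per-pixel absolute difference; averaging over bands gives the AAD map."""
    return np.abs(pred - ref)

ref = np.linspace(0.0, 1.0, 16).reshape(4, 4)
pred = ref + 0.05                      # a constant bias, for illustration
print(round(rmse(pred, ref), 4))       # 0.05
print(round(cc(pred, ref), 4))         # 1.0 (perfect linear relationship)
```

The example also shows why RMSE and CC are complementary: a constant bias leaves CC at 1 while RMSE reports the radiometric error.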
(2) Effectiveness of the fusion of temporal change information with spatial information In DL-SDFM, the two independent predictions focusing on the phenological and land-cover change were combined with a weighted combination in a moving window. To verify the effectiveness of this combination, we compared the quantitative results of the two independent predictions and the combination results in DL-SDFM. The quantitative evaluation indices include RMSE, CC, SSIM, UIQI, SAM, ERGAS, and the average AAD maps.
(3) Effectiveness of the reconstruction of the spatial detail To verify the effectiveness of DL-SDFM in spatial detail reconstruction, taking the prediction result of band 4 as an example, we further compared the spatial detail of the predictions of STFDCNN and DL-SDFM at both study sites. For STFDCNN, in addition to the final prediction, the spatial detail of its nonlinear mapping result and its super-resolution result were also compared. The peak signal-to-noise ratio (PSNR), a common quantitative evaluation index in super-resolution, was used to evaluate the image distortion, where a higher value indicates a better prediction [46].
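PSNR follows directly from the MSE; a sketch assuming reflectance scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, ref, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    mse = float(np.mean((pred - ref) ** 2))
    if mse == 0.0:
        return float('inf')              # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.zeros((8, 8))
pred = ref + 0.1                          # MSE = 0.01
print(round(psnr(pred, ref), 2))          # 20.0 dB
```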

Prediction with Land-Cover Change
(1) Comparison with other fusion methods The results of different fusion methods for the LGC site are shown in Figure 7, with the sub-areas inundated by floods zoomed in to show more details. It can be seen that STARFM provides the worst results, with blurry texture details and incomplete boundaries with some noise in the inundated areas. The average AAD maps of STARFM (Figure 8b) further demonstrate that STARFM fails to handle images with obvious land-cover change. As shown in Figures 7d and 8c, FSDAF produces more precise texture details than STARFM, capturing more complete spatial variation with less noise at the edges of the inundated areas. However, some artifacts still exist in heterogeneous areas. STFDCNN yields complete spatial details, except for the inundated areas, where obvious blocky artifacts can be seen near the edges. In contrast, as shown in Figures 7f and 8e, DL-SDFM generates a prediction with relatively good spatial details, less uncertainty, and clear texture. Moreover, compared to the prediction of STFDCNN, a more complete boundary of the inundated area was generated, suggesting the superiority of DL-SDFM in predictions with obvious land-cover change.
Quantitative results of the different fusion methods are shown in Table 2. It can be seen that STARFM provides the worst performance for all metrics, which is mainly due to its failure in predicting the inundated area (Figure 7c). FSDAF generates good results, with considerably better values for all metrics. Compared to FSDAF, STFDCNN achieves significantly better fusion accuracy for band 5 and band 7. Considering that band 5 changes the most when a flood occurs, this result indicates the superiority of STFDCNN for predictions with land-cover change. Nevertheless, DL-SDFM provides the best performance for all six bands, with the maximum similarity to the actual image and the best spectral fidelity, suggesting that DL-SDFM is more powerful in making predictions involving land-cover change. (2) Effectiveness of the fusion of temporal change information with spatial information The comparison results of the phenological change prediction, the land-cover change prediction, and the combination results for both the forward and backward prediction of DL-SDFM at the LGC site are shown in Figure 9 and Table 3. It can be seen that the combination result yielded better fusion performance at the LGC site, which demonstrates the necessity of introducing the physical temporal change information and the effectiveness of the weighted combination method utilized in this paper.
(3) Effectiveness of the reconstruction of the spatial detail As shown in Figure 10, for the LGC site, both the nonlinear mapping and the super-resolution in STFDCNN fail to reconstruct the spatial details; reconstructing the spatial details of their predictions requires additional high-pass modulation. DL-SDFM, by comparison, can directly reconstruct the prediction with complete spatial details, which demonstrates its effectiveness in spatial detail reconstruction.

Prediction with Phenological Change

(1) Comparison with other fusion methods

The CI site is located in a heterogeneous rain-fed agricultural area that underwent a phenological change. As shown in Figure 11a,b and Figure 12a, obvious phenological differences exist between the two pairs of images, increasing the difficulty of prediction. As shown in Figure 11, STARFM fails to predict the heterogeneous areas with obvious phenological changes. The zoomed-in images in Figure 11c show that the prediction of STARFM has a large spectral deviation and incomplete texture details in this area. FSDAF, by comparison, provides a prediction more similar to the actual image, and the spectral deviation is significantly reduced. STFDCNN again provides plausible spatial details; however, a more obvious phenological deviation than that of FSDAF can be seen. The DL-SDFM, by contrast, yielded the prediction most similar to the actual image, suggesting that by considering the physical temporal change information, DL-SDFM is more powerful in predicting phenological change. The average AAD maps of the different fusion methods in the CI site (Figure 12) show that the prediction of the DL-SDFM has the least uncertainty among the compared methods, which further verifies the superiority of the DL-SDFM in predicting phenological change in heterogeneous areas. Quantitative results of the different fusion methods (Table 4) show that DL-SDFM also performs best with regard to all metrics except in the cases of band 5 and band 7. The obvious improvement of DL-SDFM in most of the bands verifies its effectiveness in phenological change prediction.

(2) Effectiveness of the fusion of temporal change information with spatial information

As shown in Figure 13 and Table 5, the comparison of the phenological change prediction, the land-cover change prediction, and the combined results in the CI site is similar to that of the LGC site. The combination provides the best fusion performance for most of the bands, demonstrating the effectiveness of the weighted combination method.

(3) Effectiveness of the spatial detail reconstruction

The results in the CI site (Figure 14) agreed with our expectation: the nonlinear mapping and the super-resolution in STFDCNN fail to reconstruct the spatial details, whereas the DL-SDFM can reconstruct the prediction with complete spatial details directly. Although the visual effect of the DL-SDFM is slightly inferior to that of the STFDCNN, the lower PSNR of the STFDCNN suggests an uncertainty accumulation during its multiple processing steps.
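The AAD maps used above as per-pixel uncertainty indicators can be sketched as follows, under the assumption that AAD denotes the average absolute difference between predicted and actual reflectance, taken across bands at each pixel (bands flattened to one-dimensional lists for simplicity).

```python
def aad_map(pred_bands, actual_bands):
    """Per-pixel average absolute difference (AAD) across spectral bands.
    Larger values indicate greater prediction uncertainty at that pixel."""
    n_bands = len(pred_bands)
    n_pixels = len(pred_bands[0])
    return [
        sum(abs(pred_bands[b][i] - actual_bands[b][i]) for b in range(n_bands)) / n_bands
        for i in range(n_pixels)
    ]
```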

Parameter Sensitivity Analysis

The weighting parameter λ controls the relative weight of the losses of the two independent mappings. In this section, we analyze the influence of λ in both study sites (Figure 15). The RMSE was utilized as the quantitative index of fusion performance. There are no significant differences in fusion accuracy under the different weighting settings, indicating that the fusion performance of the DL-SDFM is not sensitive to λ. Additionally, the fusion accuracy is satisfactory when the parameter is set to 0.5, so this value was chosen in this paper.

We also analyzed the convergence of the network in the two study areas. As shown in Figure 16, during the optimization, the MSE of all the bands varies significantly in the beginning; it then falls steadily and stabilizes. The loss curves in the two study sites demonstrate that the network converges within 60 epochs.
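The role of λ can be made concrete with a sketch of the combined training loss. The sketch assumes a plain MSE term for each mapping, which matches the MSE loss curves discussed above; the exact loss formulation is given in the paper's methods section.

```python
def combined_loss(temporal_pred, spatial_pred, target, lam=0.5):
    """Weighted sum of the two mappings' MSE losses; `lam` plays the role of
    the weighting parameter λ analyzed above (0.5 is the value chosen here)."""
    def mse(pred, ref):
        return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)
    return lam * mse(temporal_pred, target) + (1.0 - lam) * mse(spatial_pred, target)
```

Setting λ = 0 or λ = 1 would train only one of the two mappings, while intermediate values balance the temporal change-based and spatial information-based objectives.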

Advantages of the Proposed Method

The experimental results in the data sets with phenological and land-cover change verified the superiority of the DL-SDFM method over three state-of-the-art spatiotemporal fusion methods. The novelty of the DL-SDFM can be summarized as follows.
First, DL-SDFM addresses the prediction of both phenological and land-cover change with better generalization ability and robustness. The experimental results for the LGC site show that although STARFM is applicable to phenological change prediction, it fails to handle images with obvious land-cover change due to its inappropriate assumption that land cover remains unchanged. DL-SDFM, by comparison, shows a significant improvement in fusion performance, demonstrating its effectiveness for predictions involving land-cover change. This improvement can be attributed to the temporal change-based mapping, which is learned by the two-stream CNN and has more powerful generalization ability and robustness than traditional linear-based spatiotemporal fusion methods, whose effectiveness depends on an artificially predefined weight function. DL-SDFM learns the weights self-adaptively from the prior information, resulting in weights with wider generality. Additionally, DL-SDFM also shows better fusion performance than FSDAF, a recently proposed hybrid spatiotemporal fusion method that is effective in land-cover change prediction due to its residual compensation, especially for the edges between two land-cover types; this further demonstrates the ability of DL-SDFM to predict land-cover change.
Second, DL-SDFM endows the learning-based spatiotemporal fusion method with temporal change information, resulting in a more powerful ability to predict phenological change. Existing learning-based spatiotemporal fusion methods regard spatiotemporal fusion as a single-image super-resolution task, which has advantages for the prediction of land-cover change. However, the lack of physical temporal change information renders these methods ineffective for the prediction of phenological change. For this reason, physical temporal change information was employed by formulating the temporal change-based mapping in DL-SDFM. Experimental results in the CI site show that although STFDCNN provides plausible spatial details, a more obvious phenological deviation than with FSDAF is observed because no physical temporal change information was taken into account to address the phenological change. The DL-SDFM, by comparison, yields the prediction most similar to the actual image, demonstrating that it is more powerful than STFDCNN in predicting phenological change.
Third, DL-SDFM can directly reconstruct the prediction with complete spatial details, without the need for any other auxiliary modulation. Since the magnification factor in spatiotemporal fusion is much larger than that in single-image super-resolution, existing learning-based spatiotemporal fusion methods based on single-image super-resolution usually utilize an additional high-pass modulation to recover spatial details indirectly, which is tedious and increases the risk of cumulative uncertainties. The spatial information-based mapping in the DL-SDFM, by comparison, can reconstruct spatial details directly by incorporating the high-frequency information. This improvement simplifies the process of the learning-based spatiotemporal fusion method and improves its practicability.
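To contrast the two pipelines, the auxiliary high-pass modulation step that DL-SDFM avoids can be sketched in one dimension as injecting the high-frequency residual of a reference fine image into a smooth prediction; this toy version uses a moving-average low-pass filter, whereas operational methods apply two-dimensional filters.

```python
def highpass_modulate(prediction, fine_reference, window=3):
    """Illustrative high-pass modulation: add the high-frequency residual of a
    reference fine image to a smooth prediction. DL-SDFM skips this step; the
    sketch is shown only to contrast the two pipelines (1-D, box-filter low-pass)."""
    half = window // 2
    n = len(fine_reference)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        lowpass = sum(fine_reference[lo:hi]) / (hi - lo)          # local mean
        out.append(prediction[i] + (fine_reference[i] - lowpass))  # add residual
    return out
```

Because each such modulation re-introduces detail from a different date, applying it repeatedly is one plausible source of the uncertainty accumulation noted for STFDCNN.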

Adaptability of the Proposed Method
In this paper, two pairs of fine and coarse images are utilized under the assumption that only two pairs of cloud-free fine and coarse images are available. This assumption is consistent with that of some typical spatiotemporal fusion methods [8,26,30], and it is considered reasonable because it is usually not easy to collect more images due to cloud contamination and the limitations of the revisit period. When more than two pairs of fine and coarse images are available, it is recommended to use all the available pairs to train the two-stream CNN. Once the network has been trained, it can be used directly for the entire dataset. Moreover, due to the increase in the number of training samples, the robustness and generalization ability of the network will improve.
Although the DL-SDFM utilizes a relatively lightweight network, its computational efficiency is somewhat lower than that of traditional linear-based fusion methods. The average training time of each band for the forward and backward predictions was 7012.8 s and 16,087.6 s, respectively, and the average prediction time of each band was 1028.6 s and 2779.3 s. Therefore, to further improve the efficiency, it is recommended to employ GPU equipment with higher performance.
Spatiotemporal fusion methods assume that the high spatial resolution and high temporal resolution images have similar spectral and radiometric properties. Since MODIS has bandwidths and radiometric characteristics similar to those of Landsat, these two kinds of sensors were utilized to obtain the dataset used in this paper. To apply the DL-SDFM to other sensor combinations with significant radiometric inconsistency, such as the Chinese GF-1 wide-field view and MODIS [47,48], it is recommended to reduce the radiation differences first by applying a radiometric normalization.
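One simple form such a radiometric normalization could take is a global linear (mean/variance matching) adjustment, sketched below. The global linear model is an assumption made for illustration; operational workflows often instead fit gains and offsets from pseudo-invariant features.

```python
import statistics

def radiometric_normalize(src, ref):
    """Linear radiometric normalization: rescale `src` so that its mean and
    standard deviation match those of `ref`. A simple global model, assumed
    here for illustration only."""
    mu_s, mu_r = statistics.fmean(src), statistics.fmean(ref)
    sd_s, sd_r = statistics.pstdev(src), statistics.pstdev(ref)
    gain = sd_r / sd_s if sd_s > 0 else 1.0
    return [gain * (v - mu_s) + mu_r for v in src]
```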

Limitations and Future Work
DL-SDFM still has some limitations. First, DL-SDFM requires two pairs of known fine and coarse images. However, in many regions it is not easy to collect these images because of cloud contamination and the limitations of the revisit period. Our future work might focus on combining deep learning-based and linear-based spatiotemporal fusion methods to develop a hybrid spatiotemporal fusion method applicable when only one pair of fine and coarse images is known. In particular, since the two mappings in DL-SDFM cannot be learned from a single pair of fine and coarse images, a linear-based spatiotemporal fusion method and deep learning-based super-resolution could be employed to address the phenological change prediction and the land-cover change prediction, respectively.
Second, although the DL-SDFM can reconstruct spatial details directly, its visual effect is slightly inferior to that of STFDCNN. The reason may be that STFDCNN applies the additional high-pass modulation twice to reconstruct the spatial details. Additionally, the MSE loss is known to produce overly smooth results, which may also explain why the visual effect of the DL-SDFM is slightly inferior to that of STFDCNN. Therefore, future work should further improve the visual effect by employing a more appropriate loss function in place of the MSE, or a combination of multiple loss functions [49,50].
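As an illustration of combining multiple loss functions, one simple possibility is mixing the MSE with an L1 term, which penalizes small residuals less aggressively and can reduce over-smoothing. The mixing weight `alpha` below is hypothetical, and this particular mixture is only one of many candidates.

```python
def mixed_loss(pred, target, alpha=0.8):
    """Illustrative mixture of MSE and L1 losses; `alpha` is a hypothetical
    mixing weight, not a value used in the paper."""
    n = len(target)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / n
    return alpha * mse + (1.0 - alpha) * l1
```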

Conclusions
In this paper, we propose a novel deep learning-based spatiotemporal data fusion method (DL-SDFM) with a two-stream CNN, which considers both forward and backward prediction to predict the target fine image. The proposed method simultaneously forms a temporal change-based mapping and a spatial information-based mapping for the prediction of phenological change and land-cover change, respectively. In this way, the DL-SDFM addresses both phenological change prediction and land-cover change prediction with higher generality and robustness. The comparative experimental results for the test datasets demonstrated the superiority of the DL-SDFM over STARFM, FSDAF, and STFDCNN. Moreover, compared to existing learning-based spatiotemporal fusion methods, the DL-SDFM has a more powerful ability to predict phenological change, owing to the introduction of the physical temporal change information. Additionally, the ability of DL-SDFM to reconstruct the prediction with complete spatial details directly simplifies the process of deep learning-based spatiotemporal fusion and improves its applicability. However, some potential limitations are worth noting. Future improvements include applying the DL-SDFM to the case of one known pair of fine and coarse images and improving the visual effect by employing a more appropriate loss function.