Spatiotemporal Fusion of Satellite Images via Very Deep Convolutional Networks

: Spatiotemporal fusion provides an e ﬀ ective way to fuse two types of remote sensing data featured by complementary spatial and temporal properties (typical representatives are Landsat and MODIS images) to generate fused data with both high spatial and temporal resolutions. This paper presents a very deep convolutional neural network (VDCN) based spatiotemporal fusion approach to e ﬀ ectively handle massive remote sensing data in practical applications. Compared with existing shallow learning methods, especially for the sparse representation based ones, the proposed VDCN-based model has the following merits: (1) explicitly correlating the MODIS and Landsat images by learning a non-linear mapping relationship; (2) automatically extracting e ﬀ ective image features; and (3) unifying the feature extraction, non-linear mapping, and image reconstruction into one optimization framework. In the training stage, we train a non-linear mapping between downsampled Landsat and MODIS data using VDCN, and then we train a multi-scale super-resolution (MSSR) VDCN between the original Landsat and downsampled Landsat data. The prediction procedure contains three layers, where each layer consists of a VDCN-based prediction and a fusion model. These layers achieve non-linear mapping from MODIS to downsampled Landsat data, the two-times SR of downsampled Landsat data, and the ﬁve-times SR of downsampled Landsat data, successively. Extensive evaluations are executed on two groups of commonly used Landsat–MODIS benchmark datasets. For the fusion results, the quantitative evaluations on all prediction dates and the visual e ﬀ ect on one key date demonstrate that the proposed approach achieves more accurate fusion results than sparse representation based methods.


Introduction
One of the fundamental features of remote sensing data is the resolution in spatial, spectral, temporal, and radiometric domains. However, all single remote sensing sensors are constrained by their tradeoffs in spatial, spectral, temporal, and radiometric resolutions due to the technical and economic reasons. Specifically, we focus on the tradeoff between spatial and temporal resolution of remote sensing data in this study. For example, the images from sensors of Landsat TM/ETM+ and fusion methods, Song et al. proposed a convolutional neural network (CNN) based fusion method to automatically extract effective image features by learning an end-to-end mapping between MODIS and downsampled Landsat images [10]. However, this CNN-based method only used three hidden layers, which was difficult to accurately simulate the complex non-linear correspondence between MODIS and Landsat images (caused by the differences in aspects of imaging environment, sensor design, etc.).
To improve this CNN-based shallow model, we present a novel spatiotemporal fusion approach with very deep convolutional neural networks (VDCNs). Specifically, we first trained a nonlinear mapping VDCN model to directly correlate MODIS and downsampled Landsat data, and then trained a multi-scale SR VDCN model between the downsampled Landsat and the original Landsat data in the training stage; in the prediction step, the input MODIS images were first mapped into the downsampled Landsat data using the trained non-linear mapping function of VDCN, and then super-resolved to the original Landsat image through a two-step image super-resolution. The learned two VDCN models can automatically extract image features and optimally unify feature extraction, non-linear mapping (playing the same role as sparse coding and dictionary learning in sparse representation), and image reconstruction.
The remaining sections are structured as follows. In Section 2, we introduce the work related with the proposed method. The proposed method is presented in detail in Section 3, and the experimental results and comparisons are demonstrated in Section 4. In Section 5, the paper is concluded with some discussions.

Convolutional Neural Networks (CNNs)
CNNs are a type of deep and feed-forward artificial neural network that leverages a variation of multi-layer perceptrons tailored to recognize visual patterns directly from images without manual tweaking or preprocessing [12]. CNNs are shift-invariant because of their shared-weights architecture and translation invariance characteristics. Compared with shallow models (e.g., sparse representation based models), CNNs have much stronger learning capacity, ensuring more accurate predictions. Instead of using hand-designed features, CNNs can automatically learn rich feature hierarchies in a data-driven manner.
In recent years, much advance has been achieved for CNNs in the following aspects [13]: (1) the availability of largescale training sets with millions of annotated labels; (2) powerful GPU (Graphics Processing Unit) computation, which makes it practical to train a very large model; (3) a series of better model regularization strategies have been proposed, including the rectified linear unit (ReLU) [14], batch normalization [15], and residual learning [16]. Such advances promoted applying CNNs into a variety of computer vision tasks, such as object detection, object recognition, image classification, image de-noising, and image super-resolution, to name a few [17][18][19][20]. Since the breakthrough in image classification [20], the architecture of CNNs in [20] has been improved in several aspects. One of the important improvements is to increase the depth of CNNs using an architecture with a set of 3 × 3 convolution filters [21] (i.e., the very deep convolutional network). This work shows that a significant improvement can be achieved by setting the depth to [16][17][18][19] layers. In spite of the larger number of parameters and the greater depth compared to the original net in [20], the nets require less epochs to converge due to (1) the implicit regularization imposed by the greater depth and smaller convolution filter sizes; and (2) pre-initialization of certain layers.

CNNs for Single-Image Super-Resolution
Single-image super-resolution (SISR) aims to generate a high-resolution image from a low-resolution input image [22]. Recently, deep learning has been successfully introduced to SISR and delivered compelling performance [18,23]. In [18], Dong et al. first introduced CNN into SISR, whose model was comprised of three convolutional layers corresponding to patch extraction, non-linear mapping, and reconstruction, respectively. The input and output of this method corresponded to the low-resolution and high-resolution images, respectively, which directly learn a mapping between the low-and high resolution images in an end to end manner. Kim et al. [23] proposed a very deep CNN (20 layers) based SISR approach, which leverages residual-learning and gradient clipping techniques to speed up training. Compared with the previous methods, this method achieves more accurate results in large scale datasets. CNNs-based single image super-resolution techniques were also applied to spatial resolution enhancement of remote sensing images, such as the works in [24,25].

Methodology
To handle both phenology and land-cover changes, we did not impose any restrictions on the proportion of each land-cover type or the land-cover type changes in temporal axis. The input HSLT and HTLS data of the proposed method were Landsat/TM or ETM+ and MODIS images, respectively. Notably, our method could handle both cases of one-paired and two paired prior HSLT-HTLS images. However, we assumed there were two-paired prior Landsat-MODIS images in consideration of applying deep learning into massive remote sensing data. Figure 1 shows the overall flowchart of the proposed approach. In general, the proposed framework consisted of a training stage and a prediction stage. In the training stage, we first learned a non-linear mapping VDCN to directly correlate downsampled Landsat (250 m) and MODIS images (500 m), and then we trained a multi-scale SR (MSSR) VDCN between 250 m Landsat images and the original Landsat images (25 m). To model the complex correspondence between MODIS and Landsat images and to reduce the spatial resolution gap in the next super-resolution step, we set a small resolution gap in designing the non-linear mapping VDCN. Considering so large a spatial resolution gap (10 times) between the original Landsat and the downsampled Landsat images, we designed an MSSR VDCN including 2 times and 5 times. Assuming that the noise intensities caused by imaging environment and imaging system were the same in all bands, we trained a non-linear mapping model and an MSSR model for all bands. In the prediction stage, the input MODIS images were first mapped into the 250 m Landsat images via the learned non-linear mapping VDCN and a fusion model; then, the 250 m Landsat images were super-resolved to the original Landsat images via a two-step super-resolution and a fusion model. The adoption of the fusion model was to fully utilize the information in prior Landsat images. to speed up training. Compared with the previous methods, this method achieves more accurate results in large scale datasets. CNNs-based single image super-resolution techniques were also applied to spatial resolution enhancement of remote sensing images, such as the works in [24,25].

Methodology
To handle both phenology and land-cover changes, we did not impose any restrictions on the proportion of each land-cover type or the land-cover type changes in temporal axis. The input HSLT and HTLS data of the proposed method were Landsat/TM or ETM+ and MODIS images, respectively. Notably, our method could handle both cases of one-paired and two paired prior HSLT-HTLS images. However, we assumed there were two-paired prior Landsat-MODIS images in consideration of applying deep learning into massive remote sensing data. Figure 1 shows the overall flowchart of the proposed approach. In general, the proposed framework consisted of a training stage and a prediction stage. In the training stage, we first learned a non-linear mapping VDCN to directly correlate downsampled Landsat (250 m) and MODIS images (500 m), and then we trained a multi-scale SR (MSSR) VDCN between 250 m Landsat images and the original Landsat images (25 m). To model the complex correspondence between MODIS and Landsat images and to reduce the spatial resolution gap in the next super-resolution step, we set a small resolution gap in designing the non-linear mapping VDCN. Considering so large a spatial resolution gap (10 times) between the original Landsat and the downsampled Landsat images, we designed an MSSR VDCN including 2 times and 5 times. Assuming that the noise intensities caused by imaging environment and imaging system were the same in all bands, we trained a non-linear mapping model and an MSSR model for all bands. In the prediction stage, the input MODIS images were first mapped into the 250 m Landsat images via the learned non-linear mapping VDCN and a fusion model; then, the 250 m Landsat images were super-resolved to the original Landsat images via a two-step superresolution and a fusion model. The adoption of the fusion model was to fully utilize the information in prior Landsat images.

Configurations of Non-Linear Mapping VDCN
Inspired by the successful application of VDCN in image recognition [21], we designed a nonlinear mapping VDCN to model the complex correspondence between downsampled Landsat and MODIS data. Since the downsampled Landsat and MODIS data were highly correlated, we decomposed one downsampled Landsat image into a low frequency part (corresponding to MODIS image) and a high frequency part (corresponding to image details). We thus built the non-linear

Configurations of Non-Linear Mapping VDCN
Inspired by the successful application of VDCN in image recognition [21], we designed a non-linear mapping VDCN to model the complex correspondence between downsampled Landsat and MODIS data. Since the downsampled Landsat and MODIS data were highly correlated, we decomposed one downsampled Landsat image into a low frequency part (corresponding to MODIS image) and a high frequency part (corresponding to image details). We thus built the non-linear mapping model between MODIS images and the image details (or residual images). The work in [23,26] demonstrated that residual-learning methods achieved superior performance over corresponding non-residual methods in both efficiency and accuracy for image super-resolution. In the prediction stage, the predicted downsampled Landsat image was obtained by the sum of network input and output. Figure 2 illustrates the network architecture. It takes interpolated MODIS images as input and exports the image details. D convolutional layers and D-1 nonlinear layers are contained in the network, where each convolutional layer except for the last one is followed by an ReLU layer. The first layer operates on the input image with 64 3 × 3 filters. The layers from 2 to D-1 contain 64 3 × 3 filters, where each layer operates on 3 × 3 spatial region across 64 channels. The last layer contains a single 3 × 3 × 64 filter to yield the output residual images.
Remote Sens. 2019, 11, x FOR PEER REVIEW 5 of 17 mapping model between MODIS images and the image details (or residual images). The work in [23,26] demonstrated that residual-learning methods achieved superior performance over corresponding non-residual methods in both efficiency and accuracy for image super-resolution. In the prediction stage, the predicted downsampled Landsat image was obtained by the sum of network input and output. Figure 2 illustrates the network architecture. It takes interpolated MODIS images as input and exports the image details. D convolutional layers and D-1 nonlinear layers are contained in the network, where each convolutional layer except for the last one is followed by an ReLU layer. The first layer operates on the input image with 64 3 × 3 filters. The layers from 2 to D-1 contain 64 3 × 3 filters, where each layer operates on 3 × 3 spatial region across 64 channels. The last layer contains a single 3 × 3 × 64 filter to yield the output residual images.
The stride of convolution was fixed to 1 pixel and zero padding with 1 pixel adopted for the input of convolutional layers such that all feature maps and the output reconstructed image kept the same size. The size of the receptive field was proportional to the depth of the network. For our depth D network, the receptive field had a size (2D + 1) × (2D + 1).

Configurations of Super-Resolution VDCN
For the 10 times spatial resolution gap between downsampled Landsat and original Landsat images, building one single super-resolution model was difficult between them. The work in [23,27] demonstrated that a single VDCN model can achieve superior performance in both accuracy and efficiency for super-resolution with multiple up-scales. This is probably attributed to the fact that a single VDCN can simultaneously fit the correspondence between low and high resolution images with multiple upscales by taking into account more contextual information in the neighborhood and modelling complex functions with many nonlinear layers. Inspired by this, we thus proposed to design a multi-scale super-resolution (MSSR) VDCN between original and downsampled Landsat images. Specifically, a general VDCN model was trained for 2 up-scale factors (×2; ×5), where factor 2 was to super-resolve 250 m Landsat to 125 m Landsat and factor 5 was to super-resolve 125 m Landsat to 25 m Landsat.
Considering that low and high spatial resolution Landsat images are largely similar (low and high herein are in a relative sense), we built the MSSR model between low spatial resolution image and the image details (i.e., the residual images between low and high spatial resolution Landsat images). Suppose that the depth of the MSSR VDCN is D', the other parameters of the network architecture are the same to those of nonlinear mapping VDCN. The network takes interpolated low spatial resolution Landsat images as input and exports the image details. In the prediction stage, the predicted Landsat image is obtained by summing network input and output.

Training Networks
To train the non-linear mapping VDCN, we prepared N pairs of interpolated MODIS and down- Then the training samples were denoted as The goal of non-linear mapping VDCN The stride of convolution was fixed to 1 pixel and zero padding with 1 pixel adopted for the input of convolutional layers such that all feature maps and the output reconstructed image kept the same size. The size of the receptive field was proportional to the depth of the network. For our depth D network, the receptive field had a size (2D + 1) × (2D + 1).

Configurations of Super-Resolution VDCN
For the 10 times spatial resolution gap between downsampled Landsat and original Landsat images, building one single super-resolution model was difficult between them. The work in [23,27] demonstrated that a single VDCN model can achieve superior performance in both accuracy and efficiency for super-resolution with multiple up-scales. This is probably attributed to the fact that a single VDCN can simultaneously fit the correspondence between low and high resolution images with multiple upscales by taking into account more contextual information in the neighborhood and modelling complex functions with many nonlinear layers. Inspired by this, we thus proposed to design a multi-scale super-resolution (MSSR) VDCN between original and downsampled Landsat images. Specifically, a general VDCN model was trained for 2 up-scale factors (×2; ×5), where factor 2 was to super-resolve 250 m Landsat to 125 m Landsat and factor 5 was to super-resolve 125 m Landsat to 25 m Landsat.
Considering that low and high spatial resolution Landsat images are largely similar (low and high herein are in a relative sense), we built the MSSR model between low spatial resolution image and the image details (i.e., the residual images between low and high spatial resolution Landsat images). Suppose that the depth of the MSSR VDCN is D', the other parameters of the network architecture are the same to those of nonlinear mapping VDCN. The network takes interpolated low spatial resolution Landsat images as input and exports the image details. In the prediction stage, the predicted Landsat image is obtained by summing network input and output.

Training Networks
To train the non-linear mapping VDCN, we prepared N pairs of interpolated MODIS and down-sampled Landsat images, denoted as Then the training samples were denoted as The goal of non-linear mapping VDCN is to learn the nonlinear mapping F(X i ) from input MODIS images X i to predict the residual images R i . We solved the network parameters θ = {W k , b k } D k=1 , where W k and b k denote the weights and bias of the kth convolutional layer, respectively, through minimizing the loss function as This regression objective was optimized by adopting the mini-batch gradient descent based on back-propagation [28]. To accelerate the network optimization convergence while suppressing exploding gradients, we adopted a varying learning rate strategy during the iterations as in [23]. Specifically, we initially set a large learning rate and then decreased the learning rate gradually.
The training procedure of the MSSR VDCN is similar to the above. One thing noteworthy is that the training samples for scales 2 and 5 were combined into one dataset. During training, images with different scales fell into the same mini-batch.

Three-Layer Prediction Step
Given two-paired prior Landsat-MODIS images on t 1 and t 3 and one MODIS image on t 2 as input, we aimed to predict the Landsat image on t 2 . Usually, it is assumed that the prediction date t 2 is between t 1 and t 3 , so that we can integrate the spatial and temporal information before and after to do the prediction. Considering that the spatial information among time series satellite images is closely correlated (e.g., the phenology changes are dominant, or the proportion of land-cover type changes is small), we designed a fusion model to fully utilize the information in prior Landsat and MODIS images. On the other hand, based on the learned non-linear mapping VDCN and the MSSR VDCN, the non-linear mapping prediction and the two-step super-resolution predictions (×2; ×5) could be executed sequentially. We experimentally found that first executing the ×5 super-resolution step and then the ×2 super-resolution step output almost the same accuracy but cost more in computation. Combining the VDCN-based predictions and the fusion model, the prediction stage was achieved through three layers: the non-linear mapping layer, the ×2 super-resolution layer, and the ×5 super-resolution layer, where each layer consisted of a VDCN-based prediction and a fusion model, as demonstrated in the lower part of Figure 1.
In each prediction layer, three images with lower spatial resolutions were fed into one VDCN, which exported three images with higher spatial resolutions. Due to the existence of estimation errors in the predictions of VDCNs, we defined the outputs of VDCNs as the transitional images. Then, the fusion model integrated three transitional images and two prior Landsat images (those with low spatial resolutions were down-sampled from the original Landsat images) together to predict the Landsat image on t 2 under different spatial resolution frames.
We take the first layer as an example to demonstrate the fusion procedure. We denoted the inputs of three transitional images as T 1 i (i = 1, 2, 3) and two prior Landsat images as L 1 i (i = 1, 3). We constructed the fusion model using a high pass modulation (HPM) module and an indicative weighting (IW) module. The overall flowchart is shown in Figure 3. As in [9], the HPM was a linear temporal change model between images of one prior date and the prediction date. Taking the time point t 1 as an example, the HPM predicts the Landsat image on t 2 by modulating the prior Landsat image on t 1 with the ratio coefficients between transitional images on t 2 and t 1 . The mathematical formula is as follows: We leveraged a weighting strategy to integrate the two-end predictions, where the weighting matrix is computed from two transitional images as follows: Considering that when relatively large temporal changes exist between one prior date and the prediction date, the prediction from that end may reduce the prediction accuracy at the prediction date; we thus proposed an indicative weighting strategy by using an indicative matrix to choose the predictions from two end dates. Therefore, the predicted Landsat image on t2 is computed as follows: When the temporal change is too large at one end, we then only choose the prediction result from the other end, and vice versa. To determine the values of the indicative matrix I, we define a threshold value ρ (e.g., 0.7 ρ = ). Then, the values of I are determined at each pixel location (r, c) as follows: The fusion procedures for the other two layers are the same as above.

Experimental Results
In this section, we compare the proposed method with two shallow learning models, i.e., the sparse representation based spatiotemporal fusion method in [9] (abbreviated as SRSTF) and the convolutional neural network based spatiotemporal fusion method in [10] (abbreviated as CNNSTF). To extensively evaluate the performance of our method, we selected two Landsat-MODIS benchmark datasets in [29]. On one hand, these datasets are composed of 14 and 17 paired remote sensing images, respectively, which is suitable for deep learning that needs a large training set. On the other hand, the landscapes of the datasets have obvious contrasting spatial and temporal dynamical characteristics associated with both land-cover type and phenology changes. For description Similarly, the prediction from the time point t 3 is as follows: We leveraged a weighting strategy to integrate the two-end predictions, where the weighting matrix is computed from two transitional images as follows: Considering that when relatively large temporal changes exist between one prior date and the prediction date, the prediction from that end may reduce the prediction accuracy at the prediction date; we thus proposed an indicative weighting strategy by using an indicative matrix to choose the predictions from two end dates. Therefore, the predicted Landsat image on t 2 is computed as follows: When the temporal change is too large at one end, we then only choose the prediction result from the other end, and vice versa. To determine the values of the indicative matrix I, we define a threshold value ρ (e.g., ρ = 0.7). Then, the values of I are determined at each pixel location (r, c) as follows: The fusion procedures for the other two layers are the same as above.

Experimental Results
In this section, we compare the proposed method with two shallow learning models, i.e., the sparse representation based spatiotemporal fusion method in [9] (abbreviated as SRSTF) and the convolutional neural network based spatiotemporal fusion method in [10] (abbreviated as CNNSTF). To extensively evaluate the performance of our method, we selected two Landsat-MODIS benchmark datasets in [29]. On one hand, these datasets are composed of 14 and 17 paired remote sensing images, respectively, which is suitable for deep learning that needs a large training set. On the other hand, the landscapes of the datasets have obvious contrasting spatial and temporal dynamical characteristics associated with both land-cover type and phenology changes. For description convenience in this section, the proposed very deep convolutional network based spatiotemporal fusion method is abbreviated as VDCNSTF.

Sites and Datasets
The first study site was the Coleambally Irrigation Area (CIA), which is a rice based irrigation system located in southern New South Wales, Australia, and covering 2193 km 2 . Over CIA, there were 17 cloud-free Landsat-MODIS image-pairs available from 2001 to 2002 for the austral summer growing season. The study in [29] demonstrated that the temporal dynamics of CIA dataset are crop phenology over a single growing season within the irrigation area. The relatively small field sizes determine that CIA is more spatially heterogeneous. The Lower Gwydir Catchment (LGC) was the second study site, locating in northern New South Wales and covering 5440 km 2 . Fourteen cloud-free Landsat-MODIS image-pairs over LGC were available from April 2004 to April 2005. A large flood and the subsequent inundation occurred in mid-December 2004, covering an area of about 44%, which indicated that LGC is a more temporally dynamic site.
For the CIA dataset, the Landsat images were derived from Landsat-7 ETM+ and were atmospherically corrected via MODTRAN4 [30]. For the LGC dataset, the Landsat images were derived from Landsat-5 TM and were corrected atmospherically with the method in [31]. During pre-processing, geocorrection was defined using the Australian Geodetic Datum (AGD66) for Landsat data. For the CIA dataset, the spatial resolution is 25 m and the image size is 2040 × 1720; for the LGC dataset, the spatial resolution is 25 m and the image size is 2720 × 3200. For both study sites, the MODIS images are from Terra MOD09GA Collection 5 and the spatial resolution is 500 m. To match the Landsat data resolution, the MODIS images were up-sampled to 25 m by using the nearest neighbor algorithm. To co-register Landsat and MODIS images with sub-pixel accuracy, an optimal offset was applied to each MODIS image by maximizing the correlation function between the image pairs. For experimental purpose, we selected the bands 1, 2, 3, 4, 5, and 7 of the Landsat images and the bands 1, 2, 3, 4, 6, and 7 of the MODIS images. Due to the different band order arrangements between Landsat and MODIS images, we adjusted the band orders of MODIS images to match those of Landsat images.

Quantitative Evaluation Indices
Since the ground truth Landsat images were known, we selected four indices that quantitatively evaluated the results from different aspects. The first one was root mean square error (RMSE), which measures the radiometric differences between the fusion result and the ground truth as where L andL denote the ground truth and the fusion result, respectively, and h and w denote the image height and width, respectively. The smaller the RMSE is, the better the prediction is. We leveraged the spectral angle mapper (SAM) [32] as the second index to measure the spectral distortion of the result defined as where N denotes the number of pixels in images and M is the number of bands. The smaller the SAM is, the better the result is. We took the structural similarity (SSIM) [33] as the third metric, measuring the similarity of the overall spatial structures between the fusion result and the ground truth as where µ L and µˆL denote the means of the ground truth and fusion result, respectively; σ LL represents the covariance between the ground truth and fusion result; σ L and σˆL are the variances of the ground truth and fusion result, respectively; and C 1 and C 2 are two small constants to avoid instability when µ 2 L + µ 2 L or σ 2 L + σ 2 L approach to zero. SSIM is valid when falling in [−1; 1], and a larger SSIM means a better fusion result. Finally, the erreur relative global adimensionnelle de synthese (ERGAS) [34] was chosen as the last index to evaluate the overall fusion result as where the spatial resolutions of Landsat and MODIS images are denoted by h and l, respectively; the ith band image is denoted by L i ; and µ i denotes the average value of the ith band image. A better fusion result is achieved when ERGAS is less than or equal to zero.

Experimental Setting
For description convenience, we arranged both CIA and LGC datasets in chronological order and number them from 1 to 17 and 1 to 14, respectively. During the training stage, we chose all bands of the 1st, the 6th, and the 14th image-pairs to generate the training samples for both datasets. In the prediction stage, the other image-pairs excluding those for training were utilized for testing. Specifically, we selected all three neighboring image-pairs (e.g., the 5th, the 7th, and the 8th image-pairs) from both time series; by assuming that the Landsat images on each middle date were unknown, we predicted the Landsat-like images from the corresponding MODIS images and two neighboring image-pairs (one before and one after). By referring to [23], the parameter settings of the proposed method are shown in Table 1, where the sizes of training sub-images were set to be the sizes of receptive fields and the learning rate decreased by a factor of 10 every 20 epochs. The codes were implemented by using matconvnet on a machine with Geforce GTX TITAN X GPU, 3.4 GHz CPU and 16 G RAM. Although the training stage took a long time (e.g., roughly 8 h for LGC dataset), the prediction of each Landsat image took about 10 min. For the comparison method SRSTF, the optimal parameters were set according to the reference in [9].

Experimental Results
For the CIA dataset, we predicted the Landsat-like images on the 3rd, the 4th, the 5th, the 7th, the 8th, the 9th, the 10th, the 11th, the 12th, the 13th, the 15th, and the 16th prediction dates; for the LGC dataset, we predicted the Landsat-like images on the 3rd, the 4th, the 5th, the 7th, the 8th, the 9th, the 10th, the 11th, and the 12th prediction dates. The quantitative evaluation results on the CIA and LGC datasets in terms of RMSE, SAM, SSIM, and ERGAS are demonstrated in Figures 4 and 5, respectively, where the RMSE and SSIM are the average values of all bands. As seen in these two figures, we observed that the fusion results of VDCNSTF had lower RMSE than those of SRSTF and CNNSTF for all prediction dates on both datasets, which demonstrated that VDCNSTF could achieve more accurate radiometric values. The lower SAM values and the higher SSIM values of VDCNSTF results for all prediction dates on both datasets indicated that the proposed method performed better than SRSTF and CNNSTF in predicting both spatial and spectral information. The lower ERGAS values of VDCNSTF results also supported this point. For the CIA dataset, we predicted the Landsat-like images on the 3rd, the 4th, the 5th, the 7th, the 8th, the 9th, the 10th, the 11th, the 12th, the 13th, the 15th, and the 16th prediction dates; for the LGC dataset, we predicted the Landsat-like images on the 3rd, the 4th, the 5th, the 7th, the 8th, the 9th, the 10th, the 11th, and the 12th prediction dates. The quantitative evaluation results on the CIA and LGC datasets in terms of RMSE, SAM, SSIM, and ERGAS are demonstrated in Figures 4 and 5, respectively, where the RMSE and SSIM are the average values of all bands. As seen in these two figures, we observed that the fusion results of VDCNSTF had lower RMSE than those of SRSTF and CNNSTF for all prediction dates on both datasets, which demonstrated that VDCNSTF could achieve more accurate radiometric values. The lower SAM values and the higher SSIM values of VDCNSTF results for all prediction dates on both datasets indicated that the proposed method performed better than SRSTF and CNNSTF in predicting both spatial and spectral information. The lower ERGAS values of VDCNSTF results also supported this point.  For the CIA dataset, we predicted the Landsat-like images on the 3rd, the 4th, the 5th, the 7th, the 8th, the 9th, the 10th, the 11th, the 12th, the 13th, the 15th, and the 16th prediction dates; for the LGC dataset, we predicted the Landsat-like images on the 3rd, the 4th, the 5th, the 7th, the 8th, the 9th, the 10th, the 11th, and the 12th prediction dates. The quantitative evaluation results on the CIA and LGC datasets in terms of RMSE, SAM, SSIM, and ERGAS are demonstrated in Figures 4 and 5, respectively, where the RMSE and SSIM are the average values of all bands. As seen in these two figures, we observed that the fusion results of VDCNSTF had lower RMSE than those of SRSTF and CNNSTF for all prediction dates on both datasets, which demonstrated that VDCNSTF could achieve more accurate radiometric values. The lower SAM values and the higher SSIM values of VDCNSTF results for all prediction dates on both datasets indicated that the proposed method performed better than SRSTF and CNNSTF in predicting both spatial and spectral information. The lower ERGAS values of VDCNSTF results also supported this point.  To show more details of the fusion results, we showed the results on one key date of both study sites, respectively. For the CIA dataset, the 8th prediction date was selected because of the color turning in the sporadic irrigation fields from the 8th date to the 9th date (see Figure 6). For the LGC dataset, we also chose the 8th prediction date due to the occurrence of a large flood on this day causing relatively large temporal dynamics and abnormal changes of land surface (see Figure 7). Figures 6 and 7 show the three neighboring image-pairs on the 7th, the 8th, and the 9th dates for CIA and LGC datasets, respectively, where the Landsat images show bands 5, 4, and 3 as R-G-B and the MODIS images show bands 6, 2, and 1 as R-G-B. We predicted the Landsat images on the 8th date from other input images for both CIA and LGC datasets.
For clear comparisons, we selected one zoomed in area (shown in the red rectangles of Figure 6e) for the CIA dataset and one zoomed in area (shown in the red rectangles of Figure 7e) for the LGC dataset. Specifically, the zoomed in CIA area was selected due to the heterogeneity of fields and obvious phenology changes; the zoomed in LGC area was selected due to the dramatically varying land-cover types (from field area to flooded area between the 8th date and the 9th date). Figures 8  and 9 demonstrate the fusion results from SRSTF, CNNSTF, and VDCNSTF for CIA and LGC sites, respectively. The ground truth, the fusion results from SRSTF, the fusion results from CNNSTF, and the fusion results from VDCNSTF are displayed in the first rows. For comparisons with clearer details, we selected one representative region of interest (ROI) for both CIA and LGC datasets, as shown in the red rectangles of the first rows of Figures 8 and 9. The enlarged ROIs of the ground truth, the fusion results of SRSTF, the fusion results of CNNSTF, and the fusion results of VDCNSTF for CIA and LGC datasets are displayed in the second rows of Figures 8 and 9, respectively. To show more details of the fusion results, we showed the results on one key date of both study sites, respectively. For the CIA dataset, the 8th prediction date was selected because of the color turning in the sporadic irrigation fields from the 8th date to the 9th date (see Figure 6). For the LGC dataset, we also chose the 8th prediction date due to the occurrence of a large flood on this day causing relatively large temporal dynamics and abnormal changes of land surface (see Figure 7). Figures 6 and 7 show the three neighboring image-pairs on the 7th, the 8th, and the 9th dates for CIA and LGC datasets, respectively, where the Landsat images show bands 5, 4, and 3 as R-G-B and the MODIS images show bands 6, 2, and 1 as R-G-B. We predicted the Landsat images on the 8 th date from other input images for both CIA and LGC datasets.  For clear comparisons, we selected one zoomed in area (shown in the red rectangles of Figure  6e) for the CIA dataset and one zoomed in area (shown in the red rectangles of Figure 7e) for the LGC dataset. Specifically, the zoomed in CIA area was selected due to the heterogeneity of fields and obvious phenology changes; the zoomed in LGC area was selected due to the dramatically varying land-cover types (from field area to flooded area between the 8th date and the 9th date). Figures 8  and 9 demonstrate the fusion results from SRSTF, CNNSTF, and VDCNSTF for CIA and LGC sites, respectively. The ground truth, the fusion results from SRSTF, the fusion results from CNNSTF, and the fusion results from VDCNSTF are displayed in the first rows. For comparisons with clearer details, we selected one representative region of interest (ROI) for both CIA and LGC datasets, as shown in the red rectangles of the first rows of Figures 8 and 9. The enlarged ROIs of the ground truth, the fusion results of SRSTF, the fusion results of CNNSTF, and the fusion results of VDCNSTF for CIA and LGC datasets are displayed in the second rows of Figures 8 and 9, respectively. Tables 2 and 3 list the quantitative results in terms of RMSE, SAM, SSIM, and ERGAS of the fusion results shown by Figures 8 and 9. We can observe that the fusion result of VDCNSTF was better than the fusion result of SRSTF and CNNSTF on all bands, which indicated that VDCNSTF achieved better spatial and spectral prediction results than comparison methods.   Figures 8 and 9. We can observe that the fusion result of VDCNSTF was better than the fusion result of SRSTF and CNNSTF on all bands, which indicated that VDCNSTF achieved better spatial and spectral prediction results than comparison methods.

Conclusions
In this paper, we proposed a spatiotemporal fusion method based on VDCNNs by blending the spatial information of Landsat data and the temporal information of MODIS data. To handle the highly non-linear correspondence relations between MODIS and Landsat data, we trained a nonlinear mapping VDCN between MODIS and Landsat data with low-spatial resolutions. To bridge the large spatial resolution gap between the original Landsat and the downsampled Landsat data (10 times), we trained a multi-scale super-resolution VDCN between low spatial resolution Landsat and   Table 3. Quantitative evaluations of the fusion results in Figure 9. Bold fonts indicate better results.

Discussion
Comparing the quantitative evaluation results for CIA and LGC datasets in Figures 4 and 5, the general prediction errors on LGC dataset were higher than those on the CIA dataset, which indicated that land-cover type changes were more difficult to predict than phenology changes. This was due to the spatial resolution gap being too large between MODIS and Landsat images and the lost land cover change information in MODIS images were more difficult to recover than the lost phenology change information in MODIS images. To compare the improvements of our method over SRSTF on two datasets, we computed the average improvements of all prediction dates for CIA and LGC datasets and the results were as follows: decreased 0.0011 vs. 0.0010 for RMSE, decreased 0.04 vs. 0.05 for SAM, increased 0.0049 vs. 0.0085 for SSIM, and decreased 0.004 vs. 0.005 for ERGAS. This demonstrated that our method could better leverage the difficult land-cover type changes than CNNSTF, which may be attributed to the fact that our deep learning model could better correlate MODIS and Landsat images than the shallow learning model, and the VDCN based MSSR model had higher prediction accuracy than the one-step super-resolution model in CNNSTF.
Comparing the fusion results on the CIA dataset and the ground truth in Figure 8, we can see that all were able to predict the phenology changes within the prediction and the prior dates. However, as shown by the enlarged ROIs in the second row of Figure 8, which have some special heterogeneous regions, VDCNSTF performed better than SRSTF and CNNSTF in terms of the predicted spectral information. Comparing the fusion results on the LGC dataset and the ground truth in Figure 9, we conclude that all were unable to predict well the dramatically flooded areas because the change information in the low SR MODIS images was lost; but despite the dramatic changes of land-cover types shown in Figure 7, they could predict most areas well in aspects of both spatial structures and spectral information. The enlarged ROIs in the bottom row of Figure 9 demonstrated that the fusion results of all methods had some degree of spectral distortion and lost some spatial details, but VDCNSTF performed better than SRSTF and CNNSTF in predicting areas with land-cover type changes.

Conclusions
In this paper, we proposed a spatiotemporal fusion method based on VDCNNs by blending the spatial information of Landsat data and the temporal information of MODIS data. To handle the highly non-linear correspondence relations between MODIS and Landsat data, we trained a non-linear mapping VDCN between MODIS and Landsat data with low-spatial resolutions. To bridge the large spatial resolution gap between the original Landsat and the downsampled Landsat data (10 times), we trained a multi-scale super-resolution VDCN between low spatial resolution Landsat and original Landsat images. In the prediction step, the Landsat data on the prediction date was predicted from the corresponding MODIS data and two prior MODIS-Landsat data pairs. Based on the learned VDCN models and a fusion model, the prediction stage consisted of three layers, where each layer contained a VDCN-based prediction step and a fusion model. To thoroughly explore the prior information, we leveraged the predicted images generated by the VDCN model as transitional images and then used an HPM module and an indicative weighting strategy to integrate the information in prior image-pairs. Experimental evaluations on two benchmark datasets validated the superiority of the proposed method over other learning based methods.
Although the proposed method achieved favorable performance compared to other learning based methods, there is still a lot of room for improvement in both prediction accuracy of spectral information and in the finer details of recovering spatial information. Prediction accuracy of spectral information is very important for application to heterogeneous regions, and spatial detail recovery is very important for application to land-cover type changes. To increase the prediction accuracy of spectral information, our future work will focus on implementing precise geo-registration between two types of satellite sensors in the pre-processing step and building a more accurate fusion model between the outputs of VDCN and the prior images. To recover the lost spatial details in MODIS images, our future work will continue with learning the temporal dynamics of Landsat images.
Author Contributions: Y.Z. and H.S. contributed equally to this work. They proposed the algorithm and performed the experiments. L.S. and Z.W. supervised the study, analyzed the results, and gave insightful suggestions for the manuscript. Y.Z. and H.S. drafted the manuscript. B.J. contributed to the revision of the manuscript. All authors read and approved the submitted manuscript.