1. Introduction
One of the fundamental features of remote sensing data is its resolution in the spatial, spectral, temporal, and radiometric domains. However, every single remote sensing sensor is constrained by tradeoffs among spatial, spectral, temporal, and radiometric resolution due to technical and economic reasons. In this study, we focus on the tradeoff between the spatial and temporal resolution of remote sensing data. For example, images from sensors such as Landsat TM/ETM+ and SPOT/HRV feature high spatial resolutions of 10–30 m but low temporal resolutions, with revisit cycles of half a month to one month, while images from sensors such as MODIS, AVHRR, and MERIS are characterized by low spatial resolutions of 250–1000 m but daily temporal coverage. These remote sensing data are thus complementary in both spatial and temporal resolution.
However, capturing both spatial and temporal dynamics simultaneously is an important requirement for many existing remote sensing based monitoring systems, such as systems for land use/cover change detection, crop growth monitoring, and disaster monitoring [1,2]. To resolve this conflict between data deficiency and the needs of practical applications, various spatiotemporal fusion methods [3,4,5,6,7,8,9,10] have been proposed in the past decade. These methods combine high spatial resolution remote sensing data with high temporal resolution remote sensing data to generate fused data with both high spatial and high temporal resolution. Specifically, the high spatial resolution data suffer from low temporal resolution (shortened as HSLT) and the high temporal resolution data have low spatial resolution (shortened as HTLS), but the two share similar spectral properties, such as band width and the number of bands. Considering the long revisit cycles of HSLT images and the effects of bad weather, the usual assumption for the inputs of spatiotemporal fusion methods is that one or two pairs of HSLT–HTLS images on prior dates and one or more HTLS images on prediction dates are given. According to their fusion basis, these methods are usually divided into three classes: reconstruction-based, transformation-based, and learning-based [11]. We introduce several representative works for each category below.
In the reconstruction-based methods, spectrally similar neighboring pixels are first searched in the input images, and then each pixel in the fused image is predicted by a weighted sum of these pixels. Gao et al. [3] first introduced the reconstruction-based idea into spatiotemporal fusion with the spatial and temporal adaptive reflectance fusion model (STARFM), which fuses Landsat and MODIS surface reflectance to obtain daily Landsat-like surface reflectance. However, STARFM has the following issues: first, it cannot deal well with abnormal cases, such as land-cover type changes or disturbance events that are not contained in at least one prior Landsat image; second, it does not handle predictions in heterogeneous landscapes well. To address these issues, several improved STARFM models have since been proposed. Hilker et al. [4] presented a spatial and temporal adaptive algorithm for mapping reflectance change (STAARCH) that discovers temporal variations from a dense set of MODIS data. Zhu et al. [5] proposed an enhanced STARFM (ESTARFM) that handles homogeneous and heterogeneous regions separately with different conversion coefficients. Wang et al. [6] extended STARFM by first downscaling the MODIS 500 m bands to 250 m using bands 1 and 2, to enhance predictions in areas with abrupt changes or heterogeneity.
In the transformation-based methods, the input images are first transformed into another space and the fusion procedure is then implemented in a local subspace. For example, Acerbi-Junior et al. [7] improved the spatial resolution of MODIS data by combining Landsat images within a three-level wavelet decomposition framework. Hilker et al. [4] fused the reflectance data of MODIS and Landsat TM/ETM+, capturing changes between two fine spatial resolution images based on the Tasseled cap transformation, which maps the original bands into a new space with brightness, greenness, and wetness as axes.
With the popularity of sparse representation and deep learning in the past decade, learning-based spatiotemporal fusion methods have been presented in recent years. Using two pairs of Landsat–MODIS images as priors, Huang and Song [8] established a correspondence between the difference images of MODIS and Landsat by training a dictionary pair, and then generated the Landsat image on the prediction date by a weighted sum of the predictions from the two prior dates. To cope with the case of one pair of prior images, the authors further presented a fusion framework that first improves the spatial resolution of the MODIS images based on sparse representation and then generates the fused image via a two-layer high-pass modulation framework [9]. To deal with the problems of manual feature design and optimization disunity in sparse representation based fusion methods, Song et al. [10] proposed a convolutional neural network (CNN) based fusion method that automatically extracts effective image features by learning an end-to-end mapping between MODIS and downsampled Landsat images. However, this CNN-based method used only three hidden layers, which made it difficult to accurately simulate the complex non-linear correspondence between MODIS and Landsat images (caused by differences in imaging environment, sensor design, etc.).
To improve upon this shallow CNN-based model, we present a novel spatiotemporal fusion approach with very deep convolutional neural networks (VDCNs). Specifically, in the training stage, we first trained a non-linear mapping VDCN to directly correlate MODIS and downsampled Landsat data, and then trained a multi-scale super-resolution (SR) VDCN between the downsampled Landsat and the original Landsat data; in the prediction stage, the input MODIS images were first mapped to downsampled Landsat data using the trained non-linear mapping VDCN, and then super-resolved to the original Landsat resolution through a two-step image super-resolution. The two learned VDCN models can automatically extract image features and optimally unify feature extraction, non-linear mapping (playing the same role as sparse coding and dictionary learning in sparse representation), and image reconstruction.
The remaining sections are structured as follows. Section 2 introduces the work related to the proposed method. Section 3 presents the proposed method in detail, and Section 4 demonstrates the experimental results and comparisons. Section 5 concludes the paper with some discussion.
3. Methodology
To handle both phenology and land-cover changes, we did not impose any restrictions on the proportion of each land-cover type or on land-cover type changes along the temporal axis. The input HSLT and HTLS data of the proposed method were Landsat TM/ETM+ and MODIS images, respectively. Notably, our method can handle both one pair and two pairs of prior HSLT–HTLS images. However, we assumed that two pairs of prior Landsat–MODIS images were given, in consideration of applying deep learning to massive remote sensing data.
Figure 1 shows the overall flowchart of the proposed approach. In general, the proposed framework consisted of a training stage and a prediction stage. In the training stage, we first learned a non-linear mapping VDCN to directly correlate downsampled Landsat (250 m) and MODIS (500 m) images, and then we trained a multi-scale SR (MSSR) VDCN between the 250 m Landsat images and the original Landsat images (25 m). To model the complex correspondence between MODIS and Landsat images and to reduce the spatial resolution gap in the subsequent super-resolution step, we set a small resolution gap in designing the non-linear mapping VDCN. Considering the large (10×) spatial resolution gap between the original and the downsampled Landsat images, we designed an MSSR VDCN covering up-scale factors of ×2 and ×5. Assuming that the noise intensities caused by the imaging environment and imaging system were the same in all bands, we trained a single non-linear mapping model and a single MSSR model for all bands. In the prediction stage, the input MODIS images were first mapped to 250 m Landsat images via the learned non-linear mapping VDCN and a fusion model; then, the 250 m Landsat images were super-resolved to the original Landsat resolution via a two-step super-resolution and a fusion model. The fusion model was adopted to fully utilize the information in the prior Landsat images.
3.1. Configurations of Non-Linear Mapping VDCN
Inspired by the successful application of VDCNs in image recognition [21], we designed a non-linear mapping VDCN to model the complex correspondence between downsampled Landsat and MODIS data. Since the downsampled Landsat and MODIS data were highly correlated, we decomposed each downsampled Landsat image into a low-frequency part (corresponding to the MODIS image) and a high-frequency part (corresponding to the image details). We thus built the non-linear mapping model between MODIS images and the image details (or residual images). The work in [23,26] demonstrated that residual-learning methods achieve superior performance over their non-residual counterparts in both efficiency and accuracy for image super-resolution. In the prediction stage, the predicted downsampled Landsat image was obtained as the sum of the network input and output.
Figure 2 illustrates the network architecture. It takes interpolated MODIS images as input and outputs the image details. The network contains D convolutional layers and D-1 nonlinear layers, where each convolutional layer except the last is followed by a ReLU layer. The first layer operates on the input image with 64 filters of size 3 × 3. Layers 2 to D-1 each contain 64 filters of size 3 × 3 × 64, i.e., each filter operates on a 3 × 3 spatial region across 64 channels. The last layer contains a single 3 × 3 × 64 filter to yield the output residual image.
The stride of convolution was fixed to 1 pixel, and zero padding of 1 pixel was adopted at the input of each convolutional layer so that all feature maps and the output reconstructed image kept the same size as the input. The size of the receptive field was proportional to the depth of the network: for a network of depth D, the receptive field has a size of (2D + 1) × (2D + 1) (e.g., 21 × 21 for D = 10).
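For concreteness, the following is a minimal sketch of such a residual VDCN, assuming a PyTorch implementation; the 64-channel width and 3 × 3 filters follow the description above, while the framework choice, class name, and default argument values are our own assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class VDCN(nn.Module):
    """Residual very deep CNN: D conv layers, ReLU after all but the last."""
    def __init__(self, depth: int = 20, channels: int = 64, bands: int = 1):
        super().__init__()
        layers = [nn.Conv2d(bands, channels, 3, stride=1, padding=1),
                  nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
        # Last layer: a single filter bank back to the band count, no ReLU.
        layers += [nn.Conv2d(channels, bands, 3, stride=1, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # Residual learning: self.body regresses the image details, and the
        # final prediction is the input plus the predicted residual.
        return x + self.body(x)
```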
3.2. Configurations of Super-Resolution VDCN
Because of the 10-fold spatial resolution gap between the downsampled and the original Landsat images, it was difficult to build a single super-resolution model between them. The work in [23,27] demonstrated that a single VDCN model can achieve superior performance in both accuracy and efficiency for super-resolution with multiple up-scales. This is probably because a single VDCN can simultaneously fit the correspondence between low and high resolution images at multiple up-scales by taking into account more contextual information in the neighborhood and by modelling complex functions with many nonlinear layers. Inspired by this, we proposed a multi-scale super-resolution (MSSR) VDCN between the original and downsampled Landsat images. Specifically, a general VDCN model was trained for two up-scale factors (×2, ×5), where factor 2 super-resolves 250 m Landsat to 125 m Landsat and factor 5 super-resolves 125 m Landsat to 25 m Landsat.
Considering that low and high spatial resolution Landsat images are largely similar (low and high herein are in a relative sense), we built the MSSR model between the low spatial resolution image and the image details (i.e., the residual image between the low and high spatial resolution Landsat images). Supposing the depth of the MSSR VDCN is D', the other parameters of the network architecture are the same as those of the non-linear mapping VDCN. The network takes interpolated low spatial resolution Landsat images as input and outputs the image details. In the prediction stage, the predicted Landsat image is obtained by summing the network input and output.
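Because the network always receives an input that has already been interpolated to the target grid, one model can serve both up-scale factors. A minimal inference sketch, assuming the VDCN class sketched above and PyTorch's bicubic interpolation:

```python
import torch.nn.functional as F

def super_resolve(net, img, scale):
    # Bicubic interpolation to the target grid, then residual refinement;
    # the same network handles x2 and x5 because only the input size varies.
    up = F.interpolate(img, scale_factor=scale, mode="bicubic",
                       align_corners=False)
    return net(up)

# Two-step super-resolution from 250 m to 25 m (x2, then x5):
# landsat_25m = super_resolve(mssr, super_resolve(mssr, landsat_250m, 2), 5)
```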
3.3. Training Networks
To train the non-linear mapping VDCN, we prepared $N$ pairs of interpolated MODIS and down-sampled Landsat images, denoted as $\{(M_i, L_i)\}_{i=1}^{N}$. The training samples were then denoted as $\{(M_i, R_i)\}_{i=1}^{N}$, where $R_i = L_i - M_i$ is the defined residual image. The goal of the non-linear mapping VDCN is to learn a nonlinear mapping $F(\cdot\,; \Theta)$ from the input MODIS images $M_i$ to predict the residual images $R_i$. We solved for the network parameters $\Theta = \{(W_k, b_k)\}_{k=1}^{D}$, where $W_k$ and $b_k$ denote the weights and bias of the $k$th convolutional layer, respectively, by minimizing the loss function

$$\mathcal{L}(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(M_i; \Theta) - R_i \right\|^2 .$$
This regression objective was optimized by mini-batch gradient descent based on back-propagation [28]. To accelerate convergence while suppressing exploding gradients, we adopted a varying learning rate strategy during the iterations, as in [23]: we initially set a large learning rate and then decreased it gradually.
The training procedure of the MSSR VDCN is similar to the above. Notably, the training samples for scales ×2 and ×5 were combined into one dataset, so that during training, images of different scales fell into the same mini-batch.
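A minimal training-loop sketch consistent with the description above, reusing the VDCN class sketched earlier; the optimizer choice (SGD with momentum), the stepwise decay factor, the gradient clipping value, and the epoch counts are our assumptions, not values reported by the authors:

```python
import torch

def train(net, loader, epochs=80, lr=0.1, decay_every=20, clip=0.4):
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    # Varying learning rate: start large, decay stepwise during training.
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=decay_every, gamma=0.1)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, residual in loader:  # interpolated input, target residual
            opt.zero_grad()
            loss = loss_fn(net.body(x), residual)  # regress the residual image
            loss.backward()
            # Clip gradients to suppress explosion at high learning rates.
            torch.nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
        sched.step()
```

For the MSSR network, `loader` simply draws from the combined ×2/×5 sample set, so mini-batches naturally mix scales.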
3.4. Three-Layer Prediction Step
Given two pairs of prior Landsat–MODIS images on $t_1$ and $t_3$ and one MODIS image on $t_2$ as input, we aimed to predict the Landsat image on $t_2$. Usually, it is assumed that the prediction date $t_2$ lies between $t_1$ and $t_3$, so that we can integrate the spatial and temporal information before and after the prediction date. Considering that the spatial information among time-series satellite images is closely correlated (e.g., the phenology changes are dominant, or the proportion of land-cover type changes is small), we designed a fusion model to fully utilize the information in the prior Landsat and MODIS images. On the other hand, based on the learned non-linear mapping VDCN and the MSSR VDCN, the non-linear mapping prediction and the two-step super-resolution predictions (×2, ×5) could be executed sequentially. We experimentally found that executing the ×5 super-resolution step before the ×2 step yielded almost the same accuracy but at a higher computational cost. Combining the VDCN-based predictions and the fusion model, the prediction stage was achieved through three layers: the non-linear mapping layer, the ×2 super-resolution layer, and the ×5 super-resolution layer, where each layer consisted of a VDCN-based prediction and a fusion model, as demonstrated in the lower part of Figure 1.
In each prediction layer, three images of lower spatial resolution were fed into one VDCN, which output three images of higher spatial resolution. Because the predictions of the VDCNs contain estimation errors, we refer to the outputs of the VDCNs as transitional images. The fusion model then integrated the three transitional images and the two prior Landsat images (those with low spatial resolutions were down-sampled from the original Landsat images) to predict the Landsat image on $t_2$ at each spatial resolution level.
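The following high-level sketch chains the three layers under our reading of Figure 1; `up` and `down` stand for assumed bicubic interpolation and downsampling helpers, `nlm`, `sr2`, and `sr5` are the trained networks, and `fuse` is the fusion model detailed below (all names are illustrative):

```python
def predict_t2(M1, M2, M3, L1, L3, nlm, sr2, sr5, up, down, fuse):
    # Layer 1: non-linear mapping, 500 m MODIS -> 250 m Landsat-like.
    T1, T2, T3 = (nlm(up(M, 2)) for M in (M1, M2, M3))
    P250 = fuse(T1, T2, T3, down(L1, 250), down(L3, 250))

    # Layer 2: x2 super-resolution, 250 m -> 125 m.
    T1, T2, T3 = (sr2(up(x, 2)) for x in (down(L1, 250), P250, down(L3, 250)))
    P125 = fuse(T1, T2, T3, down(L1, 125), down(L3, 125))

    # Layer 3: x5 super-resolution, 125 m -> 25 m (original resolution).
    T1, T2, T3 = (sr5(up(x, 5)) for x in (down(L1, 125), P125, down(L3, 125)))
    return fuse(T1, T2, T3, L1, L3)
```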
We take the first layer as an example to demonstrate the fusion procedure. We denote the three transitional images as $T_1$, $T_2$, and $T_3$ and the two prior Landsat images as $L_1$ and $L_3$. We constructed the fusion model using a high pass modulation (HPM) module and an indicative weighting (IW) module. The overall flowchart is shown in Figure 3. As in [9], the HPM is a linear temporal change model between the images of one prior date and the prediction date. Taking the time point $t_1$ as an example, the HPM predicts the Landsat image on $t_2$ by modulating the prior Landsat image on $t_1$ with the ratio coefficients between the transitional images on $t_2$ and $t_1$:

$$\hat{L}_2^{t_1} = L_1 \cdot \frac{T_2}{T_1},$$

where the multiplication and division are element-wise. Similarly, the prediction from the time point $t_3$ is

$$\hat{L}_2^{t_3} = L_3 \cdot \frac{T_2}{T_3}.$$

We leveraged a weighting strategy to integrate the two-end predictions, where the weighting matrix is computed from the temporal changes between the transitional images:

$$W = \frac{1/\left|T_2 - T_1\right|}{1/\left|T_2 - T_1\right| + 1/\left|T_2 - T_3\right|}.$$

When relatively large temporal changes exist between one prior date and the prediction date, the prediction from that end may reduce the overall prediction accuracy; we therefore proposed an indicative weighting strategy that uses an indicative matrix $I$ to choose between the predictions from the two end dates. The predicted Landsat image on $t_2$ is computed as

$$\hat{L}_2 = I \odot \hat{L}_2^{t_1} + (1 - I) \odot \hat{L}_2^{t_3},$$

where $\odot$ denotes element-wise multiplication. When the temporal change is too large at one end, we choose only the prediction from the other end, and vice versa. To determine the values of the indicative matrix $I$, we define a threshold value $\delta$. The values of $I$ at each pixel location $(r, c)$ are then determined as

$$I(r, c) = \begin{cases} 0, & \left|T_2 - T_1\right|(r, c) > \delta \ \text{and} \ \left|T_2 - T_3\right|(r, c) \le \delta, \\ 1, & \left|T_2 - T_3\right|(r, c) > \delta \ \text{and} \ \left|T_2 - T_1\right|(r, c) \le \delta, \\ W(r, c), & \text{otherwise.} \end{cases}$$
The fusion procedures for the other two layers are the same as above.
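A minimal NumPy sketch of this HPM-plus-indicative-weighting fusion, assuming the equation forms reconstructed above; the threshold default and the epsilon guard against division by zero are illustrative choices, not the authors' values:

```python
import numpy as np

def hpm_iw_fusion(T1, T2, T3, L1, L3, delta=0.2, eps=1e-6):
    """Fuse two HPM predictions of the t2 Landsat image.

    T1, T2, T3: transitional images (VDCN outputs) on t1, t2, t3.
    L1, L3: prior Landsat images on t1 and t3 at the same resolution.
    """
    # High-pass modulation: scale each prior image by the temporal ratio.
    pred1 = L1 * T2 / (T1 + eps)
    pred3 = L3 * T2 / (T3 + eps)

    # Inverse-change weighting: trust the end with the smaller change.
    d1 = np.abs(T2 - T1)
    d3 = np.abs(T2 - T3)
    w1 = (1.0 / (d1 + eps)) / (1.0 / (d1 + eps) + 1.0 / (d3 + eps))

    # Indicative matrix: discard an end whose change exceeds the threshold.
    I = np.where((d1 > delta) & (d3 <= delta), 0.0,
                 np.where((d3 > delta) & (d1 <= delta), 1.0, w1))
    return I * pred1 + (1.0 - I) * pred3
```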
5. Discussion
Comparing the quantitative evaluation results for the CIA and LGC datasets in Figure 4 and Figure 5, the overall prediction errors on the LGC dataset were higher than those on the CIA dataset, which indicated that land-cover type changes were more difficult to predict than phenology changes. This is because the spatial resolution gap between MODIS and Landsat images is large, and the land-cover change information lost in MODIS images is more difficult to recover than the lost phenology change information. To compare the improvements of our method over SRSTF on the two datasets, we computed the average improvements over all prediction dates for the CIA and LGC datasets, respectively: RMSE decreased by 0.0011 vs. 0.0010, SAM decreased by 0.04 vs. 0.05, SSIM increased by 0.0049 vs. 0.0085, and ERGAS decreased by 0.004 vs. 0.005. This demonstrated that our method could better handle the difficult land-cover type changes than CNNSTF, which may be attributed to the fact that our deep learning model correlates MODIS and Landsat images better than the shallow learning model, and the VDCN-based MSSR model has higher prediction accuracy than the one-step super-resolution model in CNNSTF.
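For reference, a hedged sketch of three of the quoted metrics (RMSE, SAM, and ERGAS) following their standard definitions; SSIM is omitted for brevity, and the array layout and the `ratio` argument (the resolution ratio used in ERGAS) are our conventions:

```python
import numpy as np

def rmse(pred, ref):
    # Root mean square error over all bands and pixels.
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def sam(pred, ref, eps=1e-12):
    # Spectral angle mapper: mean angle (degrees) between the per-pixel
    # spectral vectors of two (bands, rows, cols) arrays.
    dot = np.sum(pred * ref, axis=0)
    norms = np.linalg.norm(pred, axis=0) * np.linalg.norm(ref, axis=0) + eps
    return float(np.degrees(np.mean(np.arccos(np.clip(dot / norms, -1.0, 1.0)))))

def ergas(pred, ref, ratio, eps=1e-12):
    # Relative dimensionless global error in synthesis.
    terms = [np.mean((p - r) ** 2) / (np.mean(r) ** 2 + eps)
             for p, r in zip(pred, ref)]
    return float(100.0 * ratio * np.sqrt(np.mean(terms)))
```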
Comparing the fusion results on the CIA dataset and the ground truth in Figure 8, we can see that all methods were able to predict the phenology changes between the prior and prediction dates. However, as shown by the enlarged ROIs in the second row of Figure 8, which contain some particularly heterogeneous regions, VDCNSTF performed better than SRSTF and CNNSTF in terms of the predicted spectral information. Comparing the fusion results on the LGC dataset and the ground truth in Figure 9, we conclude that no method was able to predict the dramatically flooded areas well, because the change information was lost in the low spatial resolution MODIS images; nevertheless, despite the dramatic land-cover type changes shown in Figure 7, all methods could predict most areas well in terms of both spatial structure and spectral information. The enlarged ROIs in the bottom row of Figure 9 show that the fusion results of all methods had some degree of spectral distortion and lost some spatial detail, but VDCNSTF performed better than SRSTF and CNNSTF in predicting areas with land-cover type changes.
6. Conclusions
In this paper, we proposed a spatiotemporal fusion method based on VDCNs that blends the spatial information of Landsat data with the temporal information of MODIS data. To handle the highly non-linear correspondence between MODIS and Landsat data, we trained a non-linear mapping VDCN between MODIS and low spatial resolution Landsat data. To bridge the large (10×) spatial resolution gap between the original and the downsampled Landsat data, we trained a multi-scale super-resolution VDCN between low spatial resolution Landsat and original Landsat images. In the prediction step, the Landsat data on the prediction date were predicted from the corresponding MODIS data and two prior MODIS–Landsat data pairs. Based on the learned VDCN models and a fusion model, the prediction stage consisted of three layers, where each layer contained a VDCN-based prediction step and a fusion model. To thoroughly exploit the prior information, we treated the images generated by the VDCN models as transitional images and then used an HPM module and an indicative weighting strategy to integrate the information in the prior image pairs. Experimental evaluations on two benchmark datasets validated the superiority of the proposed method over other learning-based methods.
Although the proposed method achieved favorable performance compared to other learning-based methods, there is still considerable room for improvement in both the prediction accuracy of spectral information and the recovery of fine spatial details. Spectral prediction accuracy is particularly important for applications in heterogeneous regions, and spatial detail recovery is particularly important for applications involving land-cover type changes. To increase the prediction accuracy of spectral information, our future work will focus on implementing precise geo-registration between the two types of satellite sensors in the pre-processing step and on building a more accurate fusion model between the outputs of the VDCNs and the prior images. To recover the spatial details lost in MODIS images, our future work will continue with learning the temporal dynamics of Landsat images.