Spatiotemporal Fusion of Formosat-2 and Landsat-8 Satellite Images: A Comparison of “Super Resolution-Then-Blend” and “Blend-Then-Super Resolution” Approaches

: The spatiotemporal fusion technique has the advantages of generating time-series images with high-spatial and high-temporal resolution from coarse-resolution to ﬁne-resolution images. A hybrid fusion method that integrates image blending (i.e., spatial and temporal adaptive reﬂectance fusion model, STARFM) and super-resolution (i.e., very deep super resolution, VDSR) techniques for the spatiotemporal fusion of 8 m Formosat-2 and 30 m Landsat-8 satellite images is proposed. Two different fusion approaches, namely Blend-then-Super-Resolution and Super-Resolution (SR)-then-Blend, were developed to improve the results of spatiotemporal fusion. The SR-then-Blend approach performs SR before image blending. The SR reﬁnes the image resampling stage on generating the same pixel-size of coarse- and ﬁne-resolution images. The Blend-then-SR approach is aimed at reﬁning the spatial details after image blending. Several quality indices were used to analyze the quality of the different fusion approaches. Experimental results showed that the performance of the hybrid method is slightly better than the traditional approach. Images obtained using SR-then-Blend are more similar to the real observed images compared with images acquired using Blend-then-SR. The overall mean bias of SR-then-Blend was 4% lower than Blend-then-SR, and nearly 3% improvement for overall standard deviation in SR-B. The VDSR technique reduces the systematic deviation in spectral band between Formosat-2 and Landsat-8 satellite images. The integration of STARFM and the VDSR model is useful for improving the quality of spatiotemporal fusion. determine the We evaluated the performance of the of Four models that could be used for


Motivation
Time-series satellite images are the integration of multitemporal images over a region, and they can be used to analyze spatial temporal variations of the Earth's surface. Owing to the availability of remote sensing open data, we have more data sources to construct time-series satellite images. Examples of such data sources are time-series Landsat satellite images provided by the National Aeronautics and Space Administration and time-series Sentinel satellite images of the European Space Agency (Paris, France). Furthermore, commercial satellites, such as satellite constellations of Planet Labs, can also provide hightemporal-resolution time-series satellite images. The increase in the number of available time-series satellite images has also led to the emergence of more diversified applications for the images, such as vegetation phenology detection [1], water resource management [2], rice crop estimation [3], land cover change detection [4], and regional air quality [5]. In particular, time-series satellite image analysis plays an important role in the application of satellite imagery.
Spatiotemporal fusion methodology has the capability to generate both high-spatial and high-temporal resolution images by employing different sensors. It can improve our ability and flexibility to construct time-series satellite images. As shown in Figure 1, the In recent years, deep learning has been widely used in the field of image processing. It is a feature-learning method that combines many simple modules to approximate a highlevel complex system. With the combination of a large number of simple modules, very complex functions can be learned [6] for modeling complex scenery. Li et al. [7] discussed the fusion of infrared and visible images by using a convolutional neural network (CNN). An advantage of CNNs is their ability to handle complex nonlinear mapping functions and feature extraction at different scales. The result [7] revealed that CNN shows better performance in improving image quality in the image fusion process. Thus, CNNs have high potential for use in spatiotemporal fusion technology for remote sensing images. Most previous studies have discussed feature extraction from CNNs [8,9]. However, relatively few studies have discussed CNN algorithms for the spatiotemporal fusion of satellite images. Hence, the present study performed a detailed investigation of CNN-based spatiotemporal fusion.

Previous Studies
Spatiotemporal fusion technology fuses images with high temporal and low spatial resolutions and images with low temporal and high spatial resolutions to produce timeseries satellite imagery. An example is the spatiotemporal fusion of MODIS and Landsat images using the spatial and temporal adaptive reflectance fusion model (STARFM) [10]. Furthermore, spatiotemporal fusion technology can be applied to combine images recorded by satellites with similar spatial resolutions, such as the fusion of LS-8 and Sentinel-2 satellite images [11]. The advantage is that time-series satellite images with a consistent spatial resolution can be generated from LS-8 and Sentinel-2 satellite images. Multisensor image fusion simulates high-spatial-resolution time-series images for periods for which only low-spatial-resolution images are available. This image fusion is a key technology for generating time-series satellite images from images acquired by different sensors at different times [12].
From the perspective of time-series data technique, spatiotemporal image fusion can be classified into five categories: unmixing-based, weight-function-based, Bayesian-based, learning-based, and hybrid methods [13,14].
The weight-function-based method has been widely used in image fusion applications [15]. Gao et al. [10] proposed the STARFM, which is the first and the most popular weighted fusion model. It estimates the reflectance of a predicted image by weighing temporal, spectral, and spatial information. It assumes that spatial and temporal changes in a high-spatial-resolution image are the same as those in a low-spatial-resolution image, and consequently, high-spatial-resolution images can be estimated from low-spatial-resolution images. However, its basic assumption renders it unfit for heterogeneous regions. Many improved fusion models have been proposed, such as spatial and temporal adaptive algorithm for mapping reflectance change [16], enhanced STARFM (ESTARFM) [17], and spatial and temporal nonlocal filter-based fusion model (STNLFFM) [15].
The learning-based method uses a machine learning algorithm to establish a nonlinear relationship between the observed and the estimated images. It predicts the highspatial-resolution image of an observed low-spatial-resolution image. While this method is usually applied to image super-resolution (SR), it is also used for image fusion. Concepts such as dictionary pair learning [18], sparse representation [19], artificial neural network (ANN) [20], deep convolutional neural network [21], and nonlinear mapping CNN combined with SR CNN [22] are applied to determine the nonlinear conversion relationship between low-spatial-and high-spatial-resolution images.
A hybrid method involves the integration of two or more fusion methods. Examples of such methods are flexible spatiotemporal data fusion [13], spatiotemporal remotely sensed images and land cover map fusion model [23], combination of the spatial and temporal reflectance unmixing model (STRUM) [24] with an unmixing-based and weight-functionbased method to fuse images, and the unmixing-based Bayesian model [25] integrated with Bayesian-based and unmixing-based methods.
Deep learning is a technique based on a traditional ANN, and it involves the use of multilayer networks to process complex scenery. It uses linear and nonlinear transformations in multilayers to automatically establish a relationship between input and output data. A CNN [26] is a typical deep learning network architecture for image data. It has been proven to be a useful model for performing a wide range of imaging and visual tasks [27]. The CNN technique can be applied to different image processing tasks, for example, image classification and identification, object detection [28,29], image noise reduction [30,31], image resolution improvement [32,33], removal of compression artifacts [34], image color fusion [7], and radar image simulation from optical image [35].
Song et al. [22] pointed out that the capabilities of CNNs originate from three factors: (1) the deep network architecture of a CNN is effective in extracting large-scale image features, (2) efficient and rapid training methods, such as the rectified linear unit (ReLU) [36], batch normalization (BN) [37], and residual learning [38], that have been proposed, and (3) the emergence and popularization of graphic processing units (GPU) dedicated to graphics processing and with powerful parallel computing capability can help speed up training. Zhang et al. [31] proposed a denoising CNN (DnCNN) in which a CNN is used to reduce image noise. DnCNN accelerates the training process and improves image noise removal capability through residual learning, BN, and ReLU. Their experimental results indicated that the characteristics of a CNN deep network model may be useful for effectively predicting and reducing image noise, and also for improving image quality.

Need for Further Study and Research Purpose
Image SR can be used to enhance the spatial resolution of images with high temporal frequency but low spatial resolution [15]. Dong et al. [32] proposed super-resolution CNN (SRCNN), in which deep learning is introduced in the image SR method. Kim et al. [33] also presented a single-image SR method called very deep SR (VDSR). VDSR also employed a CNN network, but compared with the SRCNN, the neural network of VDSR is deeper and more information can be used to reconstruct the image. The most significant difference between the SRCNN and VDSR is that the training model of the SRCNN learns high-resolution images directly from low-resolution images, whereas the training model of VDSR learns residual images between high-and low-resolution images. Furthermore, to speed up training and the convergence rate, VDSR uses extremely high learning rates; its initial learning rate is 10 4 times higher than that of the SRCNN since it employs residual learning and adjustable gradient clipping. VDSR performs zero padding before convolutions in the training process to maintain the size of feature maps and output images constant; thus, pixels near the image boundary can be correctly predicted.
The spatiotemporal image fusion technique, which is a hybrid method, improves the quality of fusion images by combining the advantages of different approaches. Gevaert and García-Haro [24] proposed STRUM, which integrates unmixing-based and weight-functionbased methods. Currently, most of the hybrid methods combine unmixing-based and weight-function-based methods. Drawing inspiration from this fact, this study proposes a hybrid fusion method based on the integration of weight-function-based (i.e., STARFM) and learning-based (i.e., VDSR) methods. The STARFM approach determines weights on the basis of physical parameters (i.e., spectral, temporal, and spatial variations), while VDSR learns weights of neutral networks from the data by itself. VDSR was originally developed to improve the results of image interpolation [33]. In the preprocessing of STARFM, image interpolation is a key process for interpolating a low-resolution image to have the same grid size as a high-resolution image. Since VDSR is capable of improving the results of image interpolation, there appears to be scope for developing a VDSR-assisted STARFM for improving the quality of image fusion.
Jarihani et al. [39] compared the results of index-then-blend and blend-then-index approaches for deriving the vegetation index by the image blending method. The former approach produced higher accuracy. In a hybrid approach, such as one combining STARFM and VDSR, the spatiotemporal image fusion performance should be evaluated for different combinations of processes (i.e., SR-then-Blend and Blend-then-SR). In SR-then-Blend, the role of VDSR is a preprocessing for STARFM. By contrast, in Blend-then-SR, the role of VDSR is a post-processing of STARFM.

Objectives
This study aims to determine the benefits of combining STARFM and deep learning for spatiotemporal image fusion. We evaluated the performance of the hybrid method for different combinations of processes. Four models that could be used for the spatiotemporal fusion of remote sensing images were compared, namely, STARFM, VDSR, and two hybrid models. The first hybrid model employs STARFM before VDSR (i.e., Blend-then-SR, hereafter abbreviated B-SR), while the second hybrid model employs VDSR before STARFM (i.e., SR-then-Blend, hereafter SR-B). In B-SR, the images are fused by a physical model (i.e., STARFM) and the remaining residuals are then compensated by using an in-depth learning approach (i.e., VDSR). In SR-B, the overall high-frequency details are injected from high-resolution images to low-resolution images by VDSR, and then a physical model (STARFM) is used for image fusion. The VDSR fine tunes the results of cubic interpolation in the preprocessing of STARFM.
Most spatiotemporal image fusion uses images with fixed look-angle satellite, for example, LS-5, LS-7, LS-8 and Sentinel-2 satellites. The FS-2 has the body-pointing capability of 30 degrees in roll and pitch directions, respectively. This body rotation capability is able to collect off-nadir images and to improve the temporal resolution. This study demonstrates the possibility of fusing the fixed look-angle Landsat-8 satellite and bodyrotation Formosat-2 satellite in spatiotemporal fusion. The input high-and low-resolution satellite images were 8 m FS-2 and 30 m LS-8 images, while the output fused images were time-series 8 m fused FS-2 images. Finally, the results were evaluated by performing quantitative and qualitative analyses.

Study Area and Dataset
The test area covered a rural area located in Kao-Hsiung in southwestern Taiwan. The latitude and longitude of image center were 22 • 52 42 N and 120 • 29 55 W, respectively. The area mainly comprised agricultural land, forest land, and building areas. Test images were obtained by the FS-2 and LS-8 satellites, and this study collected 8 m multispectral FS-2 images and 30 m multispectral LS-8 images from January 2014 to January 2016. The total overlapping area between FS-2 and LS-8 images was about 68 km 2 . This study employed blue, green, red, and NIR bands of FS-2 and LS-8 images for image fusion. The product level of FS-2 used in this study was Level-2A with systematic correction, and the product level of LS-8 was Level-1 Precision and Terrain (L1TP) with geometric correction. The FS-2 and LS-8 images were precisely co-registered for fusion after preprocessing. Table 1 compares the spectral bandwidth of the corresponding bands of FS-2 and LS-8. FS-2 and LS-8 have corresponding bandwidths in the four bands, but the bandwidth of LS-8 is slightly narrower than that of FS-2. During the period from January 2014 to January 2016, only five pairs of images were recorded on the same day (Table 2 and Figure 2). In the training stage, four pairs of images were used as the training dataset, and one pair of images was used as the independent verification dataset. The quantity and quality of training data were key to the success of image fusion. Therefore, to effectively use image data and generate a deep learning model, in the training stage, we excluded cloud areas (from the LS-8 BQA band) in the training images.

Methodologies
This study aimed to develop a hybrid spatiotemporal image fusion approach. The proposed scheme comprised five parts: (1) preprocessing of input FS-2 and LS-8 images, (2) image blending by STARFM (Figure 3a In the data preprocessing, satellite images from two different sensors were preprocessed to obtain data with consistent geometric and radiometric characteristics. The STARFM generates high-spatial-and high-temporal-resolution images by blending images with high temporal and low spatial resolution with images with low temporal and high spatial resolution, and VDSR compensates high-frequency details for the low-resolution images to construct an SR image. Two different combinations (i.e., B-SR vs. SR-B) were compared. Finally, several quantitative evaluation indicators were used to assess the quality of the fused image.

Data Preprocessing
If different satellite images are to be compared spatially and temporally during spatiotemporal image fusion, the images to be fused should have the same radiometric response, ground sampling distance, image size, and coordinate system. Therefore, the images should be preprocessed by using radiometric correction, image co-registration, image clipping, and image resampling.
The FS-2 images used in this study were Level-2A images, which are ortho ready standard images. First, an additional reference FS-2 orthoimage was selected as a base map. All the FS-2 Level-2A images were orthorectified to the base map [40]. Next, a radiometric correction [41] was applied to convert the digital numbers in the FS-2 images to top-ofatmosphere reflectance using the physical parameters in the Dimap image description file.
The LS-8 images used in this study were L1TP terrain and precision corrected images. We found a systematic offset between LS-8 and the corrected FS-2 images. Therefore, we used cloud-free LS-8 and corrected FS-2 images to perform frequency domain image matching [42] and to determine the systematic bias between these two images. The systematic bias was then applied to the upper-right coordinates of the LS-8 images. A radiometric correction determined by using the physical parameters in the MTL image description file was also applied to the LS-8 images [43]. As the spatial resolution of the images to be fused should be the same, the corrected LS-8 images were further resampled into 8 m/pixel using cubic interpolation.

Method 1: Image Blending Using Spatiotemporal Image Fusion Method
In this part, we employed the STARFM developed by Gao et al. [10] as the spatiotemporal image fusion model. The STARFM is a physical model that fuses images with high temporal and low spatial resolution and images with low temporal and high spatial resolution to generate fused images with high temporal and high spatial resolution. The images with high temporal and low spatial resolution provided temporal information, while images with low temporal and high spatial resolution provided spatial information.
This study used 8 m FS-2 images as images with low temporal and high spatial resolution and 30 m LS-8 images as images with high temporal and low spatial resolution images for image fusion. As the FS-2 and LS-8 images were preprocessed to have the same geometrical and radiometric characteristics, these two datasets could be compared with each other directly. The reflectance of the FS-2 high spatial resolution pixel corresponding to the LS-8 low spatial resolution homogeneous pixel on date t 0 can be expressed as Equation (1), while reflectance of FS-2 and LS-8 image on date t k can be defined as shown in Equation (2): Here, (x i , y j ) is a given pixel location for both FS-2 and LS-8 images, F is FS-2 data (highspatial-resolution image), L is LS-8 data (low-spatial-resolution image), t 0 and t k are the acquisition dates for both FS-2 and LS-8 images (the observation date and prediction date, respectively), and ε 0 and ε k represent the difference between the FS-2 and LS-8 reflectance values at t 0 and t k , respectively.
If it is assumed that the land cover and the systematic error of the pixel (x i , y j ) did not change at t 0 and t k , that is, the difference in the spectral reflectance between different dates is similar, then ε 0 = ε k . Thus, Equations (1) and (2) can be used to obtain Equation (3). However, the relationship between LS-8 and FS-2 images is highly complex because of the following reasons: (1) LS-8 observations might not be homogeneous pixels and may contain mixed land cover types when considered at the FS-2 spatial resolution. (2) There is a high chance that the land cover type will change during the period from the observation date (t 0 ) to the prediction date (t k ). Furthermore, the transformation of the land cover status and the bidirectional reflectance distribution function would also change the reflectance during the interval from the observation date (t 0 ) to the prediction date (t k ). Therefore, the linear equation (i.e., Equation (3)) is not sufficient, and the fusion model must consider a weighting function. Consequently, STARFM utilized a moving window to obtain neighboring pixels with pixels's spectrally similar during the fusion process and then used the weighting function to estimate the center pixel of the image on the prediction date (i.e., Equation (4)): (4) where m is the size of the moving window, (x m/2 , y m/2 ) is the central pixel of the moving window, and W ijk is the combined weights for a neighboring pixel, including spectral, temporal, and spatial distance variations.
The combined weight (W ijk ) (i.e., Equation (5)) determines the contribution of each neighboring pixel to predict the reflectance of the central pixel, which depends on the variation of the images in terms of the spectral, temporal, and spatial distances (i.e., Equation (6)). The spectral variation (S ijk ) is the spectral difference between the FS-2 and LS-8 reflectances on the same date (i.e., Equation (7)). The smaller the difference, the greater the similarity between the reflectances of the FS-2 image and the averaged surrounding pixels. Thus, S ijk will be assigned a higher weight. The temporal variation (T ijk ) is the difference in time between the input training and the predicted LS-8 images (i.e., Equation (8)). A smaller value indicates that the land cover does not change significantly during the period from t 0 to t k . T ijk will also be assigned a higher weight. The spatial variation (D ijk ) is the relative spatial distance between the central pixel of the moving window and the surrounding spectrally similar candidate pixels on date t 0 (i.e., Equation (9)). The candidate pixels near the central pixel have a higher weight. In Equation (9), A is a constant parameter used to define the relative importance of the spatial distance to the difference between spectral and temporal distances. The larger the A value, the smaller the weight of D ijk : The input data required for applying the STARFM should include at least one pair of high-and low-spatial-resolution images obtained on the same date, and a low-spatialresolution image on the prediction date. The output data is a high-spatial-resolution fused image on the prediction date. Generally, the STARFM includes training and prediction stages. The training stage is aimed at determining the combined weights from FS-2 and LS-8 images captured on the same date. The first step uses high-spatial-resolution images to find candidate pixels that are spectrally similar to the central pixel in the moving window, and the second step filters out inappropriate candidate pixels on the basis of the uncertainties in the spectral information of the FS-2 and LS-8 images. The third step assigns weights according to the pixels' variation in terms of the spectral, temporal, and spatial information. The higher the weight, the more the contribution of the central pixel reflectance to the prediction. In the prediction stage, the predicted reflectance for LS-8 is estimated from the pretrained weights. A more detailed description of the STARFM with regard to the determination of weights can be found in the paper of Gao et al. [10].

Method 2: Image SR Using VDSR
Kim et al. [33] proposed VDSR, which involves the use of deep learning. This method uses a very deep convolutional network inspired by the visual geometry group network (VGGNet). VDSR determines the details of images to solve single-image SR (SISR) problems, that is, to reconstruct a higher-resolution image from a low-resolution image. The reconstruction method employs a residual-learning CNN to train and predict the residual map between the low-and high-resolution images. Subsequently, the residual image is compensated back to the low-resolution image to obtain the corresponding high-resolution image.
The first layer of VDSR's network structure (Figure 4) is the input layer, which is a receptive field of size (2D + 1) × (2D + 1); D represents the total number of convolutional layers in the network. The VDSR network used in this research had twenty convolutional layers; thus, the size of the receptive field was 41 × 41. The middle layer consists of a repetitive cascade of 19 pairs of convolutional layers and ReLU layers. Each convolutional layer includes 64 filters of size 3 × 3 × 64. Zero padding is performed before each convolution operation to ensure that all feature maps are of the same size. This is done to maintain the size of the output image identical to that of the input image. The last layer is a convolutional layer composed of a single filter of size 3 × 3 × 64, which is the residual image used for image reconstruction. During the training process, VDSR learned to predict the difference between input and output images to avoid vanishing gradient and exploding gradient problems and to increase the training model's convergence speed. In this study, the VDSR model was used to estimate the residual between high-resolution and low-resolution images on the same date. The learning process was intended to establish a nonlinear relationship between the low-resolution image and the residual image. Residual learning involved learning the high-frequency variation of the image to improve the spatial details of the image. The input in the training process was a dataset ( L (i) , F (i) N i=1 ) composed of multiple pairs of lowresolution images (L) and high-resolution images (F). The output was the residual image (r) (i.e., Equation (10)), that is, the difference between high and low-resolution images ( Figure 5). The training model is expressed by Equation (11), where f is the trained deep learning SR model andF is the predicted target image. The loss function is expressed in Equation (12). A more detailed description of VDSR can be found in Kim et al.'s paper [33]: The purpose of using VDSR for image fusion was to learn the spatial details of the input image (i.e., residual) by using a CNN between low-and high-resolution images acquired on the same day. Both FS-2 and LS-8 pre-processed images have the same pixel size in the calculation of the residual between FS-2 and LS-8 images. In the training stage, this study used FS-2 and LS-8 images recorded on the same day to calculate the residual image. The input training images are LS-8 and residual images for the same day. VDSR is applied to train a deep neural network with LS-8 and residual images. In the prediction stage, the VDSR's model is used to predict the residual images from time-series LS-8 images. High-frequency details from VDSR's residual image is then injected into the time-series LS-8 images. Finally, the 8 m fused image on the prediction date can be obtained by combining the LS-8 image and predicted residual image.
In the VDSR training model, the optimization method is stochastic gradient descent with momentum. The momentum is set to 0.9. The initial learning rate is set to 0.1, and it is reduced 10 times after every 10 epochs for a total of 100 epochs. The patch size is 41 × 41 pixels, while 256 patches are randomly selected from the image pair. The minibatch size is set to 64. In particular, a multispectral image used in this study comprised four spectral bands, and therefore, VDSR trained each spectral band separately.

Method 3: Hybrid Spatiotemporal Fusion Approach SR-B
This study proposes a hybrid spatiotemporal fusion approach in which the STARFM and VDSR model are combined to produce time-series satellite imagery. The VDSR model was applied to reduce the difference between the two types of satellite imagery or to reduce the difference between the fused image and the observed image. Hence, this study proposes two different input training data sets for VDSR models. The first type of training data were an LS-8 image and a residual image (i.e., difference between LS-8 and FS-2 images), and they were used to learn residuals between two different sensors, recorded on the same day. The second type of training data were a fused image (i.e., results of the STARFM) and a residual image (i.e., difference between fused and FS-2 images), and they were used to learn residuals between the fused image and the observed image. The first hybrid model was SR-B, which employed VDSR before the STARFM. The first VDSR training model used LS-8 images to learn and to predict the residuals between itself and corresponding FS-2 images, and it subsequently added the predicted residual images to the low-resolution LS-8 images to generate the SR LS-8 images (SRLS-8). The spatial details of LS-8 were enhanced by VDSR. The SRLS-8 image and FS-2 image were blended by the STARFM to produce time-series fusion satellite images. The workflow is shown in Figure 3c. The concept underlying SR-B was to reduce the spatial and spectral differences of high-and low-resolution images before image fusion. In other words, VDSR was intended to preprocess the input data of the STARFM. VDSR was used to improve the results of the image interpolation stage in the STARFM for facilitating a comparison between the traditional STARFM and SR-B (Figure 3a,c).

Method 4: Hybrid Spatiotemporal Fusion Approach B-SR
The second hybrid model was B-SR, which employed the STARFM before VDSR. The STARFM generated a fused FS-2 image directly from LS-8 and FS-2 images. The second VDSR training model then used the fused FS-2 images to learn and predict the residual between itself and the corresponding FS-2 images, after which it added the residual images to the fused FS-2 images to generate SR fused FS-2 images. The workflow is shown as Figure 3d. The concept underlying B-SR was to compensate the residuals between fused FS-2 and the original high-resolution FS-2 image by using deep learning technique. VDSR post-processes the output of STARFM. B-SR was used to compensate the residual between the result of the STARFM and the original FS-2 image for facilitating a comparison between the traditional SR-B and B-SR (Figure 3c,d).

Accuracy Analysis
The quality assessment includes the quantitative analysis for the entire area and qualitative analysis for the vegetation and building regions. The quantitative analysis involves absolute and relative indexes. The absolute index evaluates the nature of the fused image itself, and therefore, it is calculated using the fused image. The relative index compares the observed and fused images. It uses the observed image as a benchmark to evaluate the correlation between the real observed and synthetized fusion images. The absolute indexes were entropy [44] and the blind/referenceless image spatial quality evaluator (BRISQUE) [45], and the relative indexes were reflectance bias, structural similarity (SSIM) [46], and peak signal-to-noise ratio (PSNR) [47]. Among these five indicators, the reflectance bias was used to evaluate the difference between the observed and fused images in different spectral bands, and the other four indicators were used to assess the visual performance of the fused image: (1) Reflectance bias: This index is used to evaluate the degree of difference in reflectance among observed and fused images. This study calculated the average and standard deviation (SD) of the reflectance bias between observed images and fused images. The lower difference indicates better result. (2) SSIM: This index is used to evaluate the similarity of the overall structure between observed and fused images. This index is based on the human visual system to extract structural information for comparing the luminance, contrast, and structure between images. The SSIM ranges from −1 to 1. The larger the value, the higher the similarity between the two images. The expression for the SSIM is presented in Equation (13), where l(x, y) is the luminance comparison function, c(x, y) is the contrast comparison function, s(x, y) is the structure comparison function, µ x and µ y are the mean of images x and y, σ x and σ y are the SDs of images x and y, σ 2 x and σ 2 y are the variances of images x and y, σ xy is the cross-covariance between images x and y, and C 1 , C 2 , and C 3 are constants used to maintain the stability of l(x, y), c(x, y), and s(x, y), respectively.
(3) PSNR: This index is used to assess the degree of distortion of the fused image. This study used the observed image as the reference undistorted image. The ratio of the maximum value of an image signal to the noise in an image was used as the evaluation index. The larger the value of this index, the higher degree of undistortion between the two images. The PSNR is given by Equation (14), where x and y are the observed image and fused image, respectively, n is the image bit depth, and MSE is the mean square error between the observed image and the fused image. In the absence of noise, the observed image and the fused image are identical, and the MSE is equal to 0; therefore, the PSNR is infinite.
PSNR(x, y) = 10 × log 10 (4) Entropy: The entropy is used to assess the amount of information contained in an image. Generally, a clear image provides more detailed information than a blurred image. Hence, the greater the entropy of a fused image, the greater the amount of information contained in the fused image. The equation of entropy is presented in Equation (15), where n is the total number of grayscale levels, N i is the number of pixel i in the image, and N s N s is the total number of pixels in the image:

Results
In this experiment, the same dataset were used to train four different image fusion methods (i.e., STARFM only, VDSR only, B-SR hybrid method, and SR-B hybrid method). After training the fusion models with the training dataset, a 30 m LS-8 image recorded on 6 December 2014, was used to predict 8 m fused images with the different fusion models. For the evaluation of the accuracy of the models, the FS-2 image acquired on 6 December 2014, was used as an independent check image. The four fused images are shown in Figure 6. The results from the methods were verified by using the five indicators (i.e., entropy, BRISQUE, SSIM, PSNR, and reflectance bias) in the following section.

Discussions
Section 4.1 discusses the quantitative analysis for the entire area, while Section 4.2 focuses on qualitative analysis in the vegetation and building regions.

Quantitative Analysis
The mean and SD of the reflectance bias for individual bands (i.e., B, G, R, NIR) and all bands (i.e., four bands) are provided for comparing the performance of different methods ( Table 3). The reflectance bias of NIR was larger than that of the other three bands. A possible reason is that the variation of the NIR reflectance was significantly larger than that of the other bands ( Table 4). The test area was mostly covered by vegetation. Consequently, the NIR reflectance of chlorophyll was larger than the reflectance of the other bands. Gao et al. [10] employed the STARFM to fuse Landsat-7 and MODIS images. The reflectance bias between the real image and the predicted image in the NIR band was also markedly larger than blue, green, and red bands. From the preceding discussion, the NIR band is evidently more difficult to predict using the STARFM, but the VDSR model could improve this problem. Because the VDSR model learned information from across sensors, it not only improved the spatial resolution of the image, but also reduced the systematic deviation between the two satellite images in the spectral bands. A comparison of the results of reflectance bias between the STARFM and VDSR showed that the results of VDSR were slightly better than those of the STARFM in the NIR band. The VDSR requires a large number of training datasets in the training stage. This study considered only four image pairs (image size: 1360 × 1580 pixels) to establish the VDSR network model. For a limited number of training images, the VDSR is still better than STARFM. A comparison of the results of reflectance bias between the individual and hybrid methods showed that the mean bias of the hybrid method was smaller and better than that of the STARFM-only method. The hybrid method is slightly better than the traditional method. The hybrid strategy exploited the advantages of the STARFM and VDSR methods to minimize the reflectance bias. It could be inferred that the integration of weight-function-based and learning-based fusion models for image fusion can help increase the similarity between the fused image and the observed image.
This study also compares two different hybrid models. The results of SR-B (SR then blend) were better than those of B-SR (blend then SR). The overall mean bias of SR-B was 4% lower than B-SR, and 3% improvement for overall standard deviation in SR-B. In SR-B, VDSR improved the resolution of the LS-8 image owing to its training with FS-2 images before STARFM was applied. In this way, VDSR provided a better SRLS-8 image than the traditional cubic-interpolated LS-8 image. The SR-B method could obtain more information from FS-2 before the application of STARFM. While VDSR in SR-B learned the difference between the original LS-8 and FS-2 images, VDSR in B-SR learned the difference between an LS-8 image from STARFM and an FS-2 image. Although the result of B-SR was slightly better than that of the traditional STARFM used individually, the error in the LS-8 image from the STARFM could affect VDSR in the B-SR hybrid method. Therefore, the results of SR-B are better than those of B-SR, and it is recommended that SR be performed before blend.
The other four image quality assessment indicators for the different fusion methods were also provided for comparison (Table 5). Both SSIM and PSNR compare the observed and fused images. The SSIM evaluates the similarity between images, while the PSNR examines the distortion between images. The SSIM of VDSR was lower than STARFM, implying that the overall SSIM of the VDSR method was lower than STARFM because of the limited training data set. The SSIMs of the B-SR and SR-B were similar, and the difference was only 0.004. In terms of the degree of reflectance's distortion using the PSNR, the SR-B approach showed better results compared with the other three methods. The PSNR and reflectance bias showed similar behavior because both indicators were based on the residual of reflectance. In summary, both SSIM and PSNR indicated that the SR-B approach combining VDSR and the STARFM could minimize the reflectance differences between observed and fused images. The SR-B is slightly better than B-SR and the PSNR's difference between SR-B and B-SR was 0.531. From the perspective of the average amount of information, the entropy of SR-B showed more information than that of the other approaches. The entropy of VDSR was the lowest, while that of STARFM and B-SR were similar. The SR-B improve the information content than B-SR and the Entropy's difference between SR-B and B-SR was 0.568. The improvement rate was about 23% as the SR-B was sharpening the input image before image blending. In the evaluation of image quality, BRISQUE was used as a no-reference image quality indicator. The BRISQUE of SR-B was better than that of the other three approaches. The results of SR-B was slightly better than B-SR. Owing to the insufficient amount of training data for the VDSR-only approach, this approach showed lower performance. In summary, the overall performance of the SR-B approach yielded better accuracies than the other methods.

Qualitative Analysis
To examine the performance of different methods for different land covers, vegetation and building regions were chosen for performing a qualitative analysis. Figure 7 compares the results of different image fusion methods with the FS-2 real observation image recorded on 6 December 2014. In the visual analysis, most fused images were similar to the real FS-2 observation, except the VDSR-only fused image. A comparison of the interpolated LS-8 image and VDSR-only fused image showed that the sharpness of the VDSR result was better than the interpolated LS-8. From the perspective of image interpolation, VDSR refines and provides better results than the traditional cubic interpolation. Therefore, the integration of VDSR and STARFM can improve the fusion quality. Owing to the spectral variation of the building region being larger than that of the vegetation region, the spectral distortion in the building region was slightly larger than that in the vegetation region. However, the entropy (information content) of the building region was better than that of the vegetation area. In the improvement of entropy for the STARFM and hybrid methods, the building region showed higher improvement than the vegetation area. In other words, the hybrid methods showed better performance in the building region compared with the vegetation region.
From the perspective of visual performance, a detailed comparison of the hybrid methods (i.e., SR-B and B-SR) and the STARFM-only method revealed that the hybrid methods not only showed superior spatial resolution from the 30 m LS-8 image to the 8 m fused image, but also enhanced local details in the mountain's shadow region. In other words, the hybrid methods could improve the spatial information and local spectral information. Table 6 shows that most quality indexes of the hybrid methods were better than those of the STARFM-only method. Furthermore, the fused images from SR-B showed better performance than the other fused images. Land cover change is a challenging issue in spatiotemporal fusion, and the results of image fusion were usually affected by seasonal crop changing (Figure 8a Consequently, the image blending produces better results. In summary, the performance of SR-B was better than that of B-SR.

Conclusions
This study developed a hybrid spatiotemporal image fusion approach involving a deep learning model and a physical model. The deep learning model is the VDSR model, which improves the spatial resolution of low-resolution images, and the physical model is the STARFM model, which considers physical parameters such as pixel distance, spectral, temporal and spatial variations. Two different hybrid fusion strategies (i.e., B-SR vs. SR-B) were developed and discussed on the basis of experiments. In SR-B, SR replaces the image resampling stage used for interpolating the pixel size of a low-resolution image into that of a high-resolution image. This image resampling stage is an essential step for obtaining lowand high-resolution images with the same pixel size. In B-SR, the SR refines the spatial details after the STARFM is applied.
The major contribution of this study is to develop hybrid spatiotemporal image fusion methods (i.e., B-SR and SR-B) using STARFM and VDSR. In the experiment, STARFM, VDSR, B-SR, and SR-B were used to fuse the satellite images of FS-2 and LS-8 to produce spatiotemporal fusion data, and several quality indexes (i.e., reflectance bias, entropy, BRISQUE, SSIM, and PSNR) were used to assess the quality of fused images. The experimental results demonstrated that the combination of the spatiotemporal fusion techniques of the STARFM and the VDSR model based on a CNN architecture helped to increase the similarity between fusion images and observation images. Overall, the quality of the fusion results obtained using SR-B was better than that generated with the other methods. The results showed that running the VDSR model to learn the difference between low-resolution images and high-resolution images before applying STARFM could reduce the variation in spatial and spectral resolution between the fused image and the observed image. Besides, it could also yield a fused image that is better in visual performance and is most similar to the observation image.
The difference between the fusion result of SR-B and the real image was smaller than the result obtained by employing the STARFM alone. In the comparison of SR-B ad B-SR, the overall spectral bias of SR-B was lower than B-SR; moreover, the entropy of SB-R was also higher than B-SR. This demonstrated that the combination of VDSR model based on the CNN architecture and the spatiotemporal fusion techniques of the STARFM can help to increase the similarity between fusion images and observation images. Therefore, this study recommends the use of SR-B rather than B-SR.
The STARFM is not effective for simulating sudden land cover changes in the short term. Although the combination of the STARFM and VDSR models in this study could improve the quality of fusion images, its capability is limited. Therefore, in order to acquire better fusion results, a weight-based fusion model modified from STARFM, such as ESTARFM, may be used in future studies. Furthermore, since only five pairs of LS-8 and FS-2 satellite image pairs were used in this research, there was not much training data that could be used to train the VDSR model, which influenced the learning effectiveness of VDSR. In addition, a long time interval between the image pairs also affects the fusion results of the STARFM. Hence, it is recommended to examine the use of Sentinel-2 images with higher temporal resolution in future studies.