A Hybrid Deep Learning-Based Spatiotemporal Fusion Method for Combining Satellite Images with Different Resolutions

Abstract: Spatiotemporal fusion (STF) is considered a feasible and cost-effective way to deal with the trade-off between the spatial and temporal resolution of satellite sensors, and to generate satellite images with both high spatial and high temporal resolution. This is achieved by fusing two types of satellite images, i.e., images with fine temporal but coarse spatial resolution, and images with fine spatial but coarse temporal resolution. Numerous STF methods have been proposed; however, accurately predicting both abrupt landcover change and phenological change remains a challenge. Meanwhile, robustness to radiation differences between multi-source satellite images is crucial for the effective application of STF methods. Aiming to solve the abovementioned problems, in this paper we propose a hybrid deep learning-based STF method (HDLSFM). The method formulates a hybrid framework for robust fusion of phenological and landcover change information with minimal input requirements, in which a nonlinear deep learning-based relative radiometric normalization, a deep learning-based super-resolution, and a linear-based fusion are combined to address radiation differences between different types of satellite images and to predict landcover and phenological change. Four comparative experiments using three popular STF methods, i.e., the spatial and temporal adaptive reflectance fusion model (STARFM), flexible spatiotemporal data fusion (FSDAF), and Fit-FC, as benchmarks demonstrated the effectiveness of HDLSFM in predicting phenological and landcover change. Meanwhile, HDLSFM is robust to radiation differences between different types of satellite images and to the time interval between the prediction and base dates, which ensures its effectiveness in the generation of fused time-series data.


Introduction
Remote sensing data with high temporal and spatial resolutions have been used in various applications, such as vegetation phenology monitoring [1][2][3], landcover change (LC) detection [4,5], landcover type classification [6][7][8], and carbon sequestration modeling [9]. However, it is still a challenge for sensors of a single type to provide remote sensing data with both fine spatial and temporal resolutions due to technological and financial limitations [10]. For example, moderate resolution imaging spectroradiometer (MODIS) images are sufficient for daily observations, but their spatial resolution of 250 m to 1 km cannot capture spatial details. Landsat images have a spatial resolution of 30 m, which is suitable for heterogeneous areas; however, the satellite's relatively long revisit period of 16 days makes it unsuitable for capturing rapid land surface temporal change.
(1) HDLSFM has a minimal input requirement, i.e., one fine-coarse image pair and a coarse image at the prediction date. Compared to DL-based STF methods that use at least two fine-coarse image pairs, HDLSFM is more applicable in areas with severe cloud contamination. (2) HDLSFM can predict complex land surface temporal changes, including PC and LC. (3) HDLSFM is robust to radiation differences and to the time interval between the prediction and base dates, which ensures its effectiveness in generating fused time-series data using a limited number of fine-coarse image pairs.
To validate the effectiveness of the HDLSFM, four experiments were conducted using three popular STF methods, namely STARFM, FSDAF, and Fit-FC, which also use one fine-coarse image pair as an input.

Methods
Unlike existing DL-based STF methods that require at least two pairs of fine-coarse images [42,48,53–55], HDLSFM requires minimal input data, i.e., a known fine image (F 1 ) and a coarse image (C 1 ) at a base date (t 1 ), and a coarse image (C 2 ) at a prediction date (t 2 ), to generate a robust target fine image (F 2 ) containing both PC and LC information. This is achieved through a hybrid framework consisting of a nonlinear DL-based relative radiometric normalization that alleviates radiation differences, a DL-based SR algorithm for LC prediction, and a linear-based fusion for PC prediction.
The flowchart of the HDLSFM for the prediction of F 2 is shown in Figure 1. In the first step, the radiation difference is alleviated through nonlinear DL-based relative radiometric normalization, followed by LC prediction using DL-based SR. The parameters of the DL-based nonlinear relative radiometric normalization and SR are learned simultaneously using a deep Laplacian pyramid SR network (LapSRN) [56]. In the second step, based on the radiometrically corrected coarse images at the base date (t 1 ) and the prediction date (t 2 ), PC prediction is realized through linear-based fusion. Lastly, the final prediction is obtained through a weighted combination of the PC and LC predictions using a sliding window. A detailed description of each step of HDLSFM is given below.


Radiation Normalization
STF methods assume that the radiation properties of fine and coarse images are similar. For this reason, i.e., due to their similar bandwidth and radiation, Landsat and MODIS images are widely used in STF. However, radiation differences between fine and coarse images are common, and are the primary sources of fusion errors in STF methods [12], limiting the application to other types of satellite images. The popular hybrid STF method, FSDAF, is based on the assumption that the PC prediction F̂ PC 2 (x 0 , y 0 , B) in band B for a pixel centered at (x 0 , y 0 ) can be obtained by adding a fine increment, ∆F(c, B), to the known fine image, F 1 (x 0 , y 0 , B). This is expressed as:

F̂ PC 2 (x 0 , y 0 , B) = F 1 (x 0 , y 0 , B) + ∆F(c, B), (1)

The fine increment, ∆F(c, B), can be obtained using the spectral unmixing method, i.e., by solving the least-squares system

∆C(x, y, B) = Σ_{c=1}^{l} f c (x, y) × ∆F(c, B), (2)

where f c (x, y) denotes the class fraction belonging to class c in a coarse pixel, and l is the number of classes. It is obvious that ∆F depends on the temporal change of the coarse image, ∆C, which means that Equation (1) implicitly assumes a consistent relationship between ∆C and ∆F; a radiation difference between the fine and coarse images violates this assumption and would introduce errors. These errors can be expressed using Equation (3), under the assumption that the radiation relationship between the fine and coarse images can be represented using a simple linear regression equation, such as Equation (4) [43]:

C(x, y, B) = α × F(x, y, B) + β, (4)
where C 2i and C 1i are the ith similar pixels of the coarse image at the prediction date (t 2 ) and the base date (t 1 ), Ŵ i is the weight value of the ith similar pixel, F and C are the pixel values (e.g., NDVI) of the fine and coarse image, respectively, and α and β are the regression coefficients. The popular weighting function-based STF method, Fit-FC, assumes that the temporal change between two coarse images (C 1 and C 2 ) is linear and predicts the target fine image, F 2 (x, y), by applying this linear relationship to the known fine image, F 1 :

F̂ 2 (x i , y i ) = â(x i , y i ) × F 1 (x i , y i ) + b̂(x i , y i ) + r(x i , y i ), (5)

where â and b̂ are the regression coefficients between C 1 and C 2 , and r(x i , y i ) is the downscaled residual. Similarly, a fusion error, ∆F R Fit−FC , may be present if a radiation difference between the fine and coarse image exists; this can be expressed as Equation (6) [43]. It is thus critical to alleviate the radiation difference between fine and coarse images to achieve robust fusion, as well as to widen the applicability of the STF method to other types of satellite image. A common strategy is to utilize a relative radiometric normalization, which builds a simple linear regression between the fine and coarse images, assuming that this is a suitable model of the radiation difference [12,57]. However, it may not be suitable when the relationship between fine and coarse images is complex and nonlinear [10,43,58,59]. In this paper, a nonlinear DL-based relative radiometric normalization is proposed to alleviate the negative impact of radiation differences by formulating a nonlinear mapping between the fine and coarse image. In particular, the known fine image (F 1 ) is first aggregated to reduce its spatial resolution to approximately two times that of the coarse image, followed by the formulation of a nonlinear mapping ( f 1 ) between this aggregated fine image (F 1 down ) and the coarse image (C 1 ):

F 1 down = f 1 (C 1 , Φ 1 ), (7)
The mapping parameters (Φ 1 ) can be learned by the DL-based SR method, leveraging its capacity for formulating nonlinear mappings. Then, using the learned nonlinear mapping, f 1 , the radiometrically corrected coarse images (C RC 1 and C RC 2 ) acquired at t 1 and t 2 can be obtained using:

C RC 1 = f 1 (C 1 , Φ 1 ), (8)

C RC 2 = f 1 (C 2 , Φ 1 ), (9)

Note that the aggregation step before the nonlinear mapping formulation is necessary, since directly establishing the mapping between the original fine and coarse images may not be effective due to the significant magnification, and may introduce uncertainties [42,53]. Instead, by aggregating the fine image to reduce the magnification, the reliability of the nonlinear mapping is improved. Moreover, since the spatial resolution of the output of the nonlinear relative radiometric normalization is improved to two times that of the original coarse image, C RC 1 and C RC 2 have more spatial detail, which is vital for PC prediction in areas with high heterogeneity (further explanation is provided in Section 2.2).
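As an illustration of the aggregation step, the fine image can be block-averaged to roughly twice the spatial resolution of the coarse image before the mapping f 1 is learned. The following is a minimal NumPy sketch; the image size and the aggregation factor of 8 (e.g., 30 m fine pixels against 480 m coarse pixels) are illustrative assumptions, not settings prescribed by the paper.

```python
import numpy as np

def block_average(img, k):
    """Aggregate a 2-D image by averaging non-overlapping k x k blocks."""
    h, w = img.shape
    assert h % k == 0 and w % k == 0, "image must tile evenly by k"
    return img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

# Fine image at 30 m, coarse at 480 m -> fine-to-coarse scale factor 16.
# Aggregating the fine image by 8 yields 240 m pixels, i.e. roughly twice
# the spatial resolution of the coarse image, as the normalization requires.
fine = np.random.rand(256, 256)
fine_down = block_average(fine, 8)  # shape (32, 32)
```

Because each block has equal size, the aggregation preserves the overall image mean, so the radiometry of F 1 is not biased by the resampling itself.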

Landcover Change Prediction
An accurate prediction of LC is essential for STF performance. The coarse image (C 2 ) contains blurred LC information, which is vital, complementary information for LC prediction. However, this information is incomplete due to the limited spatial resolution. In order to convert the spatial resolution of the image (C 2 ) from low to high, spatial interpolation, such as thin plate spline interpolation, has been widely utilized in STF methods [12,44]. In this paper, the DL-based SR method is used to retain the complementary LC information, which is considered to be more beneficial for accurate LC prediction [36], and underpins a higher generalization ability because of the introduction of prior information [53,55]. Following the core idea of the DL-based SR method, the nonlinear mapping between C RC 1 and F 1 is formulated as:

F 1 = f 2 (C RC 1 , Φ 2 ), (10)

Using the learned mapping ( f 2 ), a transitive-resolution image (F Tran 2 ) can be obtained by taking C RC 2 as the input:

F Tran 2 = f 2 (C RC 2 , Φ 2 ), (11)
Although the transitive-resolution image (F Tran 2 ) generated by the DL-based SR method has a spatial resolution close to that of the fine image, due to the significantly higher magnification factor of the STF compared to that of single-image SR, F Tran 2 can have a smoothing effect. In order to further reconstruct the spatial detail of F Tran 2 and obtain a more accurate LC prediction (F LC 2 ), a high-pass modulation is applied by introducing the spatial detail information, expressed as F 1 − F Tran 1 , to the transitive-resolution image (F Tran 2 ):

F LC 2 = F Tran 2 + g b × (F 1 − F Tran 1 ), (12)

where g b represents the injection gain that controls the degree to which the spatial detail information is introduced. This gain is locally optimized via least-squares fitting in nonoverlapping zones [60], and is given by Equation (13). Further explanation of the high-pass modulation is given in Section 5.2.
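The high-pass modulation of Equation (12) can be sketched as follows. This is a minimal NumPy illustration; the per-zone least-squares gain used here (covariance of the detail with the transitive-resolution image divided by the detail variance, a common choice in high-pass modulation pansharpening) is an assumed form of Equation (13), and the zone size is arbitrary.

```python
import numpy as np

def highpass_modulation(f_tran2, f1, f_tran1, zone=16):
    """LC prediction via high-pass modulation (Eq. 12):
    F2_LC = F2_tran + g * (F1 - F1_tran), with the injection gain g
    fitted by least squares in non-overlapping zones (assumed form of
    Eq. 13: g = cov(detail, F2_tran) / var(detail) per zone)."""
    detail = f1 - f_tran1          # spatial detail lost by the SR step
    out = np.empty_like(f_tran2)
    h, w = f_tran2.shape
    for i in range(0, h, zone):
        for j in range(0, w, zone):
            d = detail[i:i + zone, j:j + zone]
            t = f_tran2[i:i + zone, j:j + zone]
            var = d.var()
            if var > 1e-12:
                g = np.mean((d - d.mean()) * (t - t.mean())) / var
            else:
                g = 1.0            # flat zone: inject detail unchanged
            out[i:i + zone, j:j + zone] = t + g * d
    return out
```

When the transitive-resolution image at the base date already matches F 1 (zero detail), the modulation leaves F Tran 2 unchanged, which is the expected degenerate behaviour.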

Integration of Radiation Normalization and Landcover Change Prediction
It is common practice to learn the network parameters of mappings f 1 and f 2 by formulating two independent networks. However, these multiple steps are tedious. Moreover, since the output of f 1 , i.e., C RC 2 , is used as the input of mapping f 2 for prediction, the uncertainty of mapping f 1 can be accumulated and exacerbated by f 2 . Therefore, to simplify the steps and to mitigate the uncertainty accumulation, LapSRN is used to learn the network parameters of mappings f 1 and f 2 simultaneously. The original LapSRN consists of multiple levels, each of which up-samples the input by a factor of two using feature extraction and image reconstruction parts. In this way, LapSRN can achieve SR with high magnification. In this paper, a two-level LapSRN is constructed, where the first level is utilized for radiation normalization, and the second conducts SR. The architecture of the LapSRN used in HDLSFM is shown in Figure 2.

Similarly to the original LapSRN, each level of the LapSRN in HDLSFM consists of a feature extraction part and an image reconstruction part. The feature extraction part includes a feature embedding sub-network and a residual model. A convolutional layer is added to the first level to transform the coarse image into high-dimensional nonlinear feature maps, followed by the feature embedding sub-network. The feature embedding sub-network is composed of five recursive blocks, each containing five distinct convolutional layers. The extracted features are not utilized to reconstruct an image but are forwarded to the second level; this strategy effectively reduces the cumulative uncertainty. The reconstructed image is generated by combining the input image of each level with the reconstructed residual image. Each convolutional layer in the LapSRN contains 32 filters of size 3 × 3.
In each level of the original LapSRN, the input image is up-sampled by a scale of two using a transposed convolutional layer, which means that for SR with a scale of eight, three levels are required. However, the magnification factor of the STF method is large, meaning that more levels are required, which further increases computational cost. Therefore, in this paper the input coarse images were directly up-sampled to the same scale as the fine images using a nearest-neighbor interpolation algorithm before being fed to the network instead of achieving 2 × SR using the transposed convolutional layer in each level. In this way, different levels achieve different SR multiples.
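The direct up-sampling of the coarse input to the fine grid can be sketched with a simple nearest-neighbour interpolation; a minimal NumPy illustration, where the 2 × 2 input and factor of 2 are arbitrary.

```python
import numpy as np

def nn_upsample(img, scale):
    """Nearest-neighbour up-sampling by an integer factor, as used to bring
    the coarse input onto the fine grid before it enters the network."""
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
up = nn_upsample(coarse, 2)  # shape (4, 4); each pixel becomes a 2x2 block
```

Because the interpolation only replicates pixels, it adds no spurious high-frequency content; the network levels are then free to learn the different SR multiples themselves.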
In the training stage, the input is the coarse image (C 1 ) and the outputs of the two levels are the aggregated fine image (F 1 down ) and F 1 , respectively. To learn the parameters of the LapSRN, the combination of the loss functions of the two levels is minimized, which is expressed as:

L(θ) = (1/N) Σ_{i=1}^{N} Σ_{s=1}^{2} ρ( ŷ s (i) − y s (i) ), (14)

where y represents the labels of the two levels, i.e., F 1 down and F 1 , ŷ denotes the output of the learned mapping, θ denotes the mapping parameters, ρ stands for the Charbonnier penalty function, and N is the number of training samples. The detailed parameter setup for training the LapSRN can be found in Appendix A. In the prediction stage, using the learned LapSRN, the radiometrically corrected coarse images and transitive-resolution images acquired at dates t 1 and t 2 , i.e., C RC 1 , F Tran 1 , and C RC 2 , F Tran 2 , respectively, can be obtained by taking C 1 and C 2 as the input. Then, the LC prediction (F LC 2 ) can be obtained using Equations (12) and (13).
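The combined two-level loss of Equation (14) can be sketched as follows; a minimal NumPy illustration, in which the value of the Charbonnier constant ε and the unweighted sum of the two level losses are assumptions.

```python
import numpy as np

def charbonnier(pred, label, eps=1e-3):
    """Charbonnier penalty rho(x) = sqrt(x^2 + eps^2), averaged over pixels.
    A differentiable variant of the L1 loss, robust to outliers."""
    return np.mean(np.sqrt((pred - label) ** 2 + eps ** 2))

def two_level_loss(pred_down, label_down, pred_fine, label_fine):
    """Combined loss of the two LapSRN levels (Eq. 14): the level-1 output
    is compared with the aggregated fine image F1_down, the level-2 output
    with F1, and the two penalties are summed."""
    return (charbonnier(pred_down, label_down)
            + charbonnier(pred_fine, label_fine))
```

Note that even a perfect prediction yields a loss of ε per level, since ρ(0) = ε; this floor is harmless for optimization.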

Linear-Based Fusion for Phenological Prediction
Although using the LapSRN coupled with high-pass modulation can reconstruct the spatial details of C 2 so as to achieve LC prediction, since no temporal change information is utilized, the PC prediction performance will still be limited. Therefore, a linear-based fusion is proposed to improve the PC prediction performance.
Suppose that the land surface temporal change of a coarse image between t 1 and t 2 can be described using the following linear regression model:

C 2 (x 0 , y 0 , B) = w(x 0 , y 0 , B, ∆t) × C 1 (x 0 , y 0 , B) + b(x 0 , y 0 , B, ∆t) + R(x 0 , y 0 , B, ∆t), (15)

where ∆t denotes the time interval between t 1 and t 2 . The local regression coefficients w(x 0 , y 0 , B, ∆t) and b(x 0 , y 0 , B, ∆t) can be estimated using the least squares method in a sliding window in band B centered at (x 0 , y 0 ), while R(x 0 , y 0 , B, ∆t) is the residual.
Then, the target PC prediction (F PC 2 ) can be generated through the application of the local regression coefficients and residual to F 1 :

F PC 2 (x 0 , y 0 , B, t 2 ) = w(x 0 , y 0 , B, ∆t) × F 1 (x 0 , y 0 , B, t 1 ) + b(x 0 , y 0 , B, ∆t) + R(x 0 , y 0 , B, ∆t), (16)

However, Equations (15) and (16) may not be effective in the following aspects: (1) As described in Section 2.1.1, the radiation difference between the fine and coarse images can adversely affect fusion performance.
(2) The spatial resolutions of regression coefficients w(x 0 , y 0 , B, ∆t) and b(x 0 , y 0 , B, ∆t) are the same as that of the coarse image, which is significantly different from that of F 1 (x 0 , y 0 , B, t 1 ). This spatial resolution inconsistency can cause blocky artifacts.
(3) An assumption of Equation (16) is that the temporal changes of the fine and coarse images are consistent, which is valid only for homogeneous pixels (hereafter, "homogenous pixel assumption").
To mitigate the uncertainty caused by the radiation difference, the original coarse images (C 1 and C 2 ) in Equation (15) are replaced by the radiometrically corrected coarse images acquired at t 1 and t 2 (C RC 1 and C RC 2 ) that are obtained by the LapSRN. The aim is to estimate the local regression coefficients, w′(x 0 , y 0 , B, ∆t) and b′(x 0 , y 0 , B, ∆t), which are expressed as Equation (17):

C RC 2 (x 0 , y 0 , B) = w′(x 0 , y 0 , B, ∆t) × C RC 1 (x 0 , y 0 , B) + b′(x 0 , y 0 , B, ∆t) + R(x 0 , y 0 , B, ∆t), (17)

The reason for this replacement is that the radiation normalization moves α and β in Equation (6) closer to 1 and 0, respectively, resulting in ∆F R Fit−FC being closer to 0; in other words, smaller fusion errors are caused by the radiation difference.
Then, an initial PC prediction, F PC 2 (x 0 , y 0 , B, t 2 ), can be obtained by applying the regression coefficients, w′(x 0 , y 0 , B, ∆t) and b′(x 0 , y 0 , B, ∆t), and the residual, R(x 0 , y 0 , B, ∆t), to F 1 (x 0 , y 0 , B, t 1 ) as follows:

F PC 2 (x 0 , y 0 , B, t 2 ) = w′(x 0 , y 0 , B, ∆t) × F 1 (x 0 , y 0 , B, t 1 ) + b′(x 0 , y 0 , B, ∆t) + R(x 0 , y 0 , B, ∆t), (18)

Note that since the radiometrically corrected coarse images C RC 1 and C RC 2 have higher spatial resolutions than the original coarse images, the spatial resolutions of w′(x 0 , y 0 , B, ∆t) and b′(x 0 , y 0 , B, ∆t) are closer to that of F 1 (x 0 , y 0 , B, t 1 ) compared to those of the original regression coefficients. In this manner, the blocky artifacts are alleviated.
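The linear-based PC prediction of Equations (15)–(18) can be sketched as follows. This minimal NumPy illustration fits the window regression between the two coarse images and applies the coefficients and mean residual to F 1 ; for brevity it uses non-overlapping windows and assumes all images share one grid, unlike the sliding-window, multi-resolution setting of the paper.

```python
import numpy as np

def local_linear_pc(c1, c2, f1, win=8):
    """Toy linear-based PC prediction: in each (non-overlapping) window,
    fit C2 ~ w*C1 + b by least squares, then apply w, b and the mean
    residual to F1 (Eqs. 15-18, simplified to a single shared grid)."""
    pred = np.empty_like(f1)
    h, w_ = f1.shape
    for i in range(0, h, win):
        for j in range(0, w_, win):
            x = c1[i:i + win, j:j + win].ravel()
            y = c2[i:i + win, j:j + win].ravel()
            A = np.vstack([x, np.ones_like(x)]).T   # design matrix [C1, 1]
            (slope, inter), *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = np.mean(y - (slope * x + inter))
            pred[i:i + win, j:j + win] = (slope * f1[i:i + win, j:j + win]
                                          + inter + resid)
    return pred
```

If the coarse temporal change is exactly linear, the window fits recover it and the same change is transferred to the fine image, which is the homogeneous-pixel behaviour the method relies on.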
For the fusion errors caused by the homogenous pixel assumption, we utilized a strategy similar to STARFM, ESTARFM, and FSDAF, which use additional neighboring pixels in the sliding window to achieve a more robust prediction. In particular, pixels F 1 (x l , y l , B) that are within the sliding window and have small spectral differences from the fine target pixel, F 1 (x 0 , y 0 , B), i.e., those that satisfy Equation (19), are selected as similar pixels:

|F 1 (x l , y l , B) − F 1 (x 0 , y 0 , B)| ≤ 2 × σ(B)/m, (19)
where σ(B) denotes the standard deviation of band B of F 1 , and m represents the number of classes, which is set to four in this paper. Then, the final PC prediction (F̂ PC 2 ) of each central pixel, F̂ PC 2 (x 0 , y 0 , B), is obtained using a weighted combination of the selected similar pixels in the sliding window of the initial PC prediction, F PC 2 :

F̂ PC 2 (x 0 , y 0 , B) = Σ_{i=1}^{N} W(x i , y i ) × F PC 2 (x i , y i , B), (20)

The weight of each similar pixel, W(x i , y i ), in the sliding window is determined by:

W(x i , y i ) = (1/D(x i , y i )) / Σ_{i=1}^{N} (1/D(x i , y i )), (21)

where N denotes the number of selected pixels in the sliding window, and D(x i , y i ) is defined as:

D(x i , y i ) = 1 + √((x i − x 0 )² + (y i − y 0 )²)/(ŵ/2), (22)

where ŵ denotes the sliding window size, which was set to 50 in this paper. A further explanation of the prediction performance's sensitivity to ŵ is given in Section 5.1.
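The similar-pixel selection and distance weighting of Equations (19)–(22) can be sketched as follows; a minimal NumPy illustration, in which the STARFM-style relative-distance weight D = 1 + d/(ŵ/2) is the form assumed here.

```python
import numpy as np

def similar_pixel_weights(f1_win, center, sigma, m=4, w_size=50):
    """Select similar pixels in a window (Eq. 19:
    |F1(xl,yl) - F1(x0,y0)| <= 2*sigma/m) and weight them by relative
    distance (Eqs. 21-22): D = 1 + dist/(w/2), W proportional to 1/D."""
    ci, cj = center
    diff = np.abs(f1_win - f1_win[ci, cj])
    mask = diff <= 2.0 * sigma / m          # spectral similarity test
    ii, jj = np.nonzero(mask)
    dist = np.sqrt((ii - ci) ** 2 + (jj - cj) ** 2)
    D = 1.0 + dist / (w_size / 2.0)
    W = (1.0 / D) / np.sum(1.0 / D)         # normalized inverse distance
    return ii, jj, W
```

The central pixel always passes the similarity test (its difference is zero) and, having the smallest D, receives the largest weight, so the weighting degrades gracefully when few similar pixels are found.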

Combination of Linear-and Deep Learning-Based STF
F̂ LC 2 and F̂ PC 2 focus on the LC and PC predictions, respectively. A weighted combination method is employed to synthesize F̂ LC 2 and F̂ PC 2 to obtain a robust final prediction, F̂ 2 :

F̂ 2 (x 0 , y 0 ) = p LC (x 0 , y 0 ) × F̂ LC 2 (x 0 , y 0 ) + p PC (x 0 , y 0 ) × F̂ PC 2 (x 0 , y 0 ), (23)

The weight values p LC and p PC are determined using the absolute differences between F̂ LC 2 , F̂ PC 2 , and C 2 in a 3 × 3 sliding window as follows:

p LC (x 0 , y 0 ) = (1/σ LC (x 0 , y 0 )) / (1/σ LC (x 0 , y 0 ) + 1/σ PC (x 0 , y 0 )), (24)

p PC (x 0 , y 0 ) = (1/σ PC (x 0 , y 0 )) / (1/σ LC (x 0 , y 0 ) + 1/σ PC (x 0 , y 0 )), (25)

where σ LC (x 0 , y 0 ) and σ PC (x 0 , y 0 ) are obtained by:

σ LC (x 0 , y 0 ) = (1/9) Σ_{j=1}^{9} |F̂ LC,j 2 (x 0 , y 0 ) − C j 2 (x 0 , y 0 )|, σ PC (x 0 , y 0 ) = (1/9) Σ_{j=1}^{9} |F̂ PC,j 2 (x 0 , y 0 ) − C j 2 (x 0 , y 0 )|, (26)

where F̂ LC,j 2 (x 0 , y 0 ) and F̂ PC,j 2 (x 0 , y 0 ) denote the jth pixels of the LC and PC predictions in a 3 × 3 sliding window centered at (x 0 , y 0 ), while C j 2 (x 0 , y 0 ) represents the corresponding pixel of C 2 . Further validation of the weighted combination method is given in Section 5.2.
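The weighted combination can be sketched as follows; a minimal NumPy illustration, in which the 3 × 3 mean absolute difference and inverse-σ weighting follow the description above, and the small constant added to σ to avoid division by zero is an implementation assumption.

```python
import numpy as np

def combine_lc_pc(f_lc, f_pc, c2):
    """Blend the LC and PC predictions: sigma is the mean absolute
    difference between each prediction and C2 in a 3x3 window; the
    weights are the normalized inverse sigmas, so the prediction that
    agrees better with C2 locally dominates the final result."""
    def local_mad(pred):
        pad = np.pad(np.abs(pred - c2), 1, mode='edge')
        # mean over the 3x3 neighbourhood of every pixel
        return sum(pad[di:di + pred.shape[0], dj:dj + pred.shape[1]]
                   for di in range(3) for dj in range(3)) / 9.0

    s_lc = local_mad(f_lc) + 1e-12   # epsilon guards against zero sigma
    s_pc = local_mad(f_pc) + 1e-12
    p_lc = (1.0 / s_lc) / (1.0 / s_lc + 1.0 / s_pc)
    return p_lc * f_lc + (1.0 - p_lc) * f_pc
```

In the limiting case where one prediction matches C 2 exactly, its σ vanishes and the blend returns that prediction, which is the intended behaviour of the inverse-σ weighting.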

Experimental Setup
Four experiments were performed to test the effectiveness of HDLSFM from the following four perspectives: (1) LC prediction, (2) PC prediction, (3) generation of fused time-series data, and (4) application of HDLSFM to other types of satellite images. A detailed description of the experimental setup and dataset of each experiment is given in the following.

Experiment I: Effectiveness of HDLSFM in Landcover Change Prediction
The study site of Experiment I was the Lower Gwydir Catchment (LGC), located in northern New South Wales, where a large flood in mid-December 2004 led to significant LC. It is thus suitable for testing the effectiveness of HDLSFM in LC prediction. The dataset of Experiment I was the same as the one used by Emelyanova et al. [61]. A MODIS and Landsat 5 TM image pair (3200 × 2720 pixels), acquired on 26 November 2004, and a MODIS image acquired on 12 December 2004, were used to predict the fine Landsat image acquired on 12 December 2004, as shown in Figure 3.

Experiment II: Effectiveness of HDLSFM in Phenological Change Prediction
The study site of Experiment II was a heterogeneous rain-fed agricultural area in central Iowa (CI), USA. This study site was chosen to test the effectiveness of HDLSFM in PC prediction. The dataset of this experiment was the same as the one used by Jia et al. [53]. A MODIS and Landsat ETM+ image pair (4500 × 4500 pixels) acquired on 14 May 2002, along with a MODIS image acquired on 2 July 2002, were utilized to predict the Landsat image acquired on 2 July 2002, as shown in Figure 4.

Experiment III: Effectiveness of HDLSFM in Generating Fused Time-Series Data
The study site of Experiment III was the Coleambally irrigation area (CIA), southern New South Wales, Australia. This area has a spatially heterogeneous landscape and rapid PC. Seventeen cloud-free Landsat ETM+ and MODIS (MOD09GA) pairs were available for this study site, covering the 2001-2002 summer growing season. The details of this study site's dataset can be found in [61].
Under ideal conditions, the fine and coarse image pair dated closest to the prediction image would be utilized for generating the fused time-series data. That is, the shorter the time interval, the lower the difficulty of the STF will be [62]. However, the number of available fine images is usually limited, resulting in irregular time intervals.
Experiment III was thus designed to test the effectiveness of HDLSFM in using a limited number of fine-coarse image pairs with irregular time intervals to generate fused time-series data. Referring to [48], three Landsat ETM+ and MODIS MOD09GA pairs acquired on 8 October 2001, 4 December 2001, and 11 April 2002, along with the other available MOD09GA images, were used to predict the other fourteen Landsat images (Table 1). In this case, the time interval between the base date and the prediction date ranged from 9 days to 57 days, which is suitable for testing the robustness of HDLSFM to time interval variation.

Experiment IV: Effectiveness of HDLSFM on Other Types of Satellite Images
In order to test the effectiveness of HDLSFM on other types of satellite images, in Experiment IV the VIIRS Surface Reflectance Daily product VNP09GA and Landsat 8 OLI were used as the coarse and fine images, respectively. Since bands I1, I2, and I3 of a VIIRS image, at a 500 m spatial resolution, have the most similar radiation features to bands 4, 5, and 6 of a Landsat OLI image, these bands were used to perform STF.
A complex cropland area, located in southwest Kansas (SWK), USA, was selected as the study site. This study site is covered by one Landsat OLI scene with a path/row of 30/34. Five cloud-free Landsat OLI images (3500 × 2500 pixels) were available for this study site. As shown in Table 2, the Landsat OLI and VNP09GA images acquired on 2 June 2019, and the VNP09GA images obtained on 20 July, 6 September, 22 September, and 8 October 2019, were regarded as the known images used to predict the Landsat images. The actual Landsat images obtained on these dates were utilized to evaluate the fusion performance.

Comparison and Evaluation Strategy
In order to evaluate the performance of HDLSFM, three widely used STF methods, namely STARFM, FSDAF, and Fit-FC, were used as benchmark methods; all use one fine-coarse image pair as input. Note that although revisions of FSDAF, i.e., IFSDAF, SFSDAF, and FSDAF 2.0, have been proposed, the framework of these hybrid STF methods is basically consistent with that of the original FSDAF. Moreover, FSDAF has been widely used [63–66], suggesting its effectiveness and representativeness. We thus used FSDAF as the representative hybrid STF method.
Fusion performance was assessed quantitatively using six evaluation indices: root mean square error (RMSE), structural similarity (SSIM) [67], universal image quality index (UIQI) [68], correlation coefficient (CC), erreur relative global adimensionnelle de synthèse (ERGAS) [69], and the spectral angle mapper (SAM) [70]. The ideal values of RMSE, CC, SSIM, and UIQI are 0, 1, 1, and 1, respectively. The closer the SAM and ERGAS are to zero, the lower the uncertainty of the predictions. A description of the indices used is not provided in this work due to space limitations; for further information, please refer to [53].
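For reference, three of the simpler indices can be computed as follows; a minimal NumPy sketch, in which SSIM, UIQI, and ERGAS are omitted for brevity and the per-pixel averaging of SAM is an assumed convention.

```python
import numpy as np

def rmse(a, b):
    """Root mean square error; 0 for a perfect prediction."""
    return np.sqrt(np.mean((a - b) ** 2))

def cc(a, b):
    """Pearson correlation coefficient; 1 for a perfect prediction."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

def sam(a, b, eps=1e-12):
    """Spectral angle mapper in degrees, averaged over pixels;
    a and b are (bands, H, W) arrays, 0 for a perfect prediction."""
    dot = np.sum(a * b, axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    ang = np.arccos(np.clip(dot / denom, -1.0, 1.0))
    return np.degrees(ang).mean()
```

RMSE and CC operate band by band, while SAM compares the spectral vector of each pixel across bands, which is why it is sensitive to spectral distortion rather than radiometric bias.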

Experiment I
The visual effects of the predictions of HDLSFM and the benchmark methods for the LGC site are shown in Figure 5. It can be seen that STARFM and Fit-FC exhibited the worst visual artifacts, with blurry spatial details compared to the other fusion methods, especially in the inundated area. The poorer quantitative results of these two STF methods, both in the complete LGC site (Figure 6) and in the sub-area (Table 3), confirmed their unsuitability for the prediction of significant LC, which is mainly due to the insufficiency of the linear assumption in LC prediction. Compared to STARFM and Fit-FC, FSDAF achieved better fusion performance in LC prediction. FSDAF's clearer spatial variation, complete boundary of the inundated area, and comparatively good quantitative results in the sub-area suggest its effectiveness in LC prediction. HDLSFM generated a similar visual effect to FSDAF, with complete spatial detail of the LC area. This is attributed to the combined use of DL-based SR and high-pass modulation. Moreover, compared to FSDAF, HDLSFM showed better quantitative performance in the whole LGC site in almost all six bands (Figure 6), as well as in the sub-area of the LGC site (Table 3). For the whole LGC site, the mean RMSE of HDLSFM was 0.0329, a decrease of 0.0616, 0.0021, and 0.0016 when compared to STARFM, FSDAF, and Fit-FC, respectively. The mean SSIM of HDLSFM was 0.7619, with gains of 0.0732, 0.0152, and 0.0215 over those of STARFM, FSDAF, and Fit-FC. HDLSFM achieved the best mean UIQI and CC (i.e., 0.8989 and 0.7573) for the whole LGC site, with increases of 0.0029 and 0.0048 over those of FSDAF, and of 0.0086 and 0.0327 over those of Fit-FC. HDLSFM also achieved a significant improvement in SAM, i.e., 10.4542, an improvement of 60.90%, 13.74%, and 9.56% over STARFM, FSDAF, and Fit-FC. The above results confirm the effectiveness and superiority of HDLSFM in LC prediction.

Experiment II
A visual comparison of the predictions obtained by HDLSFM and the three benchmark methods for the CI site is shown in Figure 7. The rapid PC between the known Landsat image at the base date (Figure 7c) and the actual Landsat image (Figure 7a), as well as the high spatial heterogeneity, are indicative of the great difficulty we expected the STF to face. STARFM again yielded the worst fusion accuracy among all the STF methods (Figure 8) and a significant spectral deviation (Figure 7e), which demonstrates that it may not be suitable for areas with rapid PC and high spatial heterogeneity. Fit-FC exhibited smaller spectral deviation than STARFM (Figure 7g); however, its fusion result had blurry texture details, suggesting that although Fit-FC can address rapid PC prediction, its applicability in heterogeneous areas may be limited. FSDAF, by comparison, provided the most complete spatial details (Figure 7f). However, the spatial effect of HDLSFM (Figure 7h) was slightly inferior to that of FSDAF (Figure 7f). The reason for this may lie in the fact that HDLSFM employs a DL-based SR method for the restoration of spatial details from the coarse image at the prediction date, whose performance is closely related to the number of fine-coarse image pairs. Therefore, when using a limited number of fine-coarse images, the fused spatial details of HDLSFM may be lost in heterogeneous areas.

Experiment III
The quantitative assessment of the results of the different STF methods for the CIA site is shown in Figure 9. In general, STARFM and Fit-FC had the worst fusion performance, which is probably due to their high sensitivity to the time interval, indicating the ineffectiveness of the weighting function-based STF methods in the case of long time intervals. HDLSFM achieved the best fusion performance in almost all evaluation indices except SAM. The mean RMSE over all dates for HDLSFM was 0.0303, a significant decrease of 0.0155 compared to STARFM. The mean SSIM of HDLSFM was 0.8245, with increases of 0.0381, 0.0045, and 0.0221 over those of STARFM, FSDAF, and Fit-FC. The mean UIQI of HDLSFM was 0.9550, with increases of 0.0080, 0.0025, and 0.0055 compared to STARFM, FSDAF, and Fit-FC. The mean CC of HDLSFM was 0.8206, with increases of 0.1184, 0.0097, and 0.0329 over STARFM, FSDAF, and Fit-FC. The mean SAM of HDLSFM was 5.7320, a decrease of 1.1216 and 0.5888 compared to STARFM and Fit-FC, but slightly inferior to that of FSDAF. Nevertheless, the best fusion performance in almost all evaluation indices and the comparable performance in SAM verified the effectiveness of HDLSFM in generating fused time-series data. Note that although the weighting function-based method employed in HDLSFM for PC prediction may also suffer from the uncertainties caused by long time intervals, these uncertainties are alleviated through the synthesis with the LC prediction.
To further analyze the sensitivity of the different STF methods to the time interval, the visual effect (Figure 10) and the performance of the predictions of the different STF methods were compared in a sub-area of the CIA site on 13 February 2002 (Table 4), since this prediction date had the largest time interval. As in the case of Figure 9, STARFM had the worst visual effect and fusion accuracy in the sub-area, which confirmed its high sensitivity to the time interval. The fusion accuracy of FSDAF in the sub-area was worse than that of Fit-FC, which differs slightly from Figure 9. As shown in Figure 10, although FSDAF again had the most complete spatial details, a large bias can be observed in areas with obvious PC. Fit-FC, by comparison, showed better fusion performance than FSDAF in the PC area; however, as in Experiment II (Figure 7), the lack of spatial details is a major limitation of Fit-FC. HDLSFM had the best fusion accuracy in the sub-area, as well as a good visual effect in the PC area, suggesting it is less sensitive to the time interval. Thus, it can provide a more robust prediction result than the other STF methods for a limited number of fine-coarse image pairs.

Experiment IV
Experiment IV was performed to determine whether STF is still effective when applied to other types of satellite images with larger radiation differences than those between MODIS and Landsat. In this case, VIIRS and Landsat 8 OLI were used as the coarse and fine images, respectively. As shown in Figure 11, the linear correlation between VIIRS and OLI was significantly lower than that between MODIS and Landsat, indicating larger radiation differences between OLI and VIIRS.
As shown in Figure 12, STARFM still yielded the worst fusion accuracy. FSDAF generated lower fusion accuracy than Fit-FC (Figure 12), as well as a more obvious bias in the predictions for all dates (Figure 13), which differs slightly from the other experimental results presented in this paper. This is probably because the linear unmixing method used to obtain the temporal change value at fine resolution is sensitive to the radiation differences between the fine and coarse images. In order to reduce these differences, FSDAF employs a simple linear regression-based radiometric normalization. However, this approach is not effective when the relationships between the fine and coarse images are complex and nonlinear. As shown in Figure 11, compared to simple linear regression, the nonlinear relative radiometric normalization markedly improved the correlation between the fine and coarse images, demonstrating its effectiveness in reducing the radiation differences. Fit-FC showed better fusion performance than STARFM and FSDAF. Nevertheless, HDLSFM achieved a significant improvement in the average absolute difference (AAD) maps, showing the least bias (Figure 13) in the predictions for all dates, as well as the best fusion performance (Figure 12) on most of the evaluation indices compared to the three benchmark fusion methods. The mean RMSE over all dates for HDLSFM was 0.0621, a significant improvement of 55.19%, 30.47%, and 9.94% compared to STARFM, FSDAF, and Fit-FC. The mean SSIM of HDLSFM was 0.5826, with increases of 0.1307, 0.0767, and 0.0071 over STARFM, FSDAF, and Fit-FC. The mean UIQI of HDLSFM was 0.9286, with increases of 0.0416, 0.0521, and 0.0083 compared to STARFM, FSDAF, and Fit-FC. The mean CC of HDLSFM was 0.4311, a significant improvement of 42.80%, 21.74%, and 6.85% over STARFM, FSDAF, and Fit-FC.
The mean SAM of HDLSFM was 7.2170, a significant improvement of 47.85%, 38.00%, and 12.70% compared to STARFM, FSDAF, and Fit-FC. The improvement in fusion accuracy is indicative of HDLSFM's higher applicability to other types of satellite images.
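For contrast with the learned nonlinear normalization, the simple linear regression-based relative radiometric normalization that FSDAF employs can be sketched in a few lines. This is a minimal illustration, assuming a single band and a coarse image already resampled to the fine grid; the function name is ours:

```python
import numpy as np

def linear_radiometric_normalization(coarse, fine):
    # Fit fine ≈ gain * coarse + offset by least squares and apply the fit,
    # i.e., a simple linear regression-based relative normalization.
    gain, offset = np.polyfit(coarse.ravel(), fine.ravel(), 1)
    return gain * coarse + offset
```

When the true fine-coarse relationship is nonlinear (e.g., under sensor saturation), this global linear fit leaves large residuals, which is the motivation given above for the DL-based nonlinear normalization in HDLSFM.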

Figure 13. Average AAD maps of the actual image and prediction results of the SWK site.

Prediction Performance Sensitivity to Moving Window Size
The moving window size (ŵ) defines the number of pixels that are utilized to calculate the local regression coefficients, a(x₀, y₀, B, ∆t) and b(x₀, y₀, B, ∆t), and the number of similar pixels. The influence of the moving window size was analyzed using the CIA site. The experiment was performed on the same dataset as that in Table 1. As shown in Figure 14, in general, the fusion performance improves with the increase in the moving window size. When the window size exceeds 50, the fusion performance remains relatively stable. Since there is always a trade-off between the moving window size and computation time, based on the obtained results, the moving window size in this work was chosen to be 50.
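How the local coefficients might be estimated can be sketched as follows. The brute-force per-pixel least-squares loop is illustrative only (a practical implementation would be vectorized), and the fallback for constant windows is our assumption:

```python
import numpy as np

def local_regression_coeffs(c1, c2, w=50):
    # Estimate per-pixel coefficients a and b such that c2 ≈ a * c1 + b,
    # using all pixels inside a w x w moving window centred on each pixel.
    h, width = c1.shape
    a = np.ones((h, width))
    b = np.zeros((h, width))
    r = w // 2
    for i in range(h):
        for j in range(width):
            win1 = c1[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1].ravel()
            win2 = c2[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1].ravel()
            if np.std(win1) > 0:      # regression needs variation in c1
                a[i, j], b[i, j] = np.polyfit(win1, win2, 1)
            else:                     # constant window: offset only
                a[i, j], b[i, j] = 1.0, float(np.mean(win2 - win1))
    return a, b
```

The trade-off noted above is visible here directly: the cost grows with the window area, while beyond a certain size the local fits stop changing appreciably.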

High-Pass Modulation in Spatial Detail Reconstruction
The LC prediction performance of HDLSFM depends on the performance of the DL-based SR method, which formulates the nonlinear mapping between fine and coarse images. However, due to the significantly larger magnification factor of STF compared to that of single-image SR, the spatial details of the SR results are incomplete. Thus, in order to recover the spatial details, high-pass modulation was employed, which is a common strategy in existing learning-based STF methods [48,52,71–73]. An alternative way is to introduce the spatial details, described as the difference between the known fine and coarse images, F₁ − C₁, into the typical SR mapping (f) between C₃ and F₃, as proposed by Jia [53]. Compared to the high-pass modulation (Equation (12)), the essence of Equation (28) is to use DL to learn the injection gain (Equation (13)) automatically, with a higher generalization ability. However, since learning the network parameter (Φ) requires two pairs of fine and coarse images, achieving an injection gain with a higher generalization ability comes at the cost of increasing the number of fine-coarse image pairs. Therefore, this approach is not suitable when there is only one known fine-coarse image pair. In order to reduce the requirement on the number of fine-coarse image pairs while achieving high practicality, in HDLSFM the injection gain is predefined instead of being automatically learned by the DL-based SR method. Although the generalization ability of the predefined injection gain is lower than that of the automatically learned one, its relaxed requirement on the number of input images makes HDLSFM highly practical.
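The high-pass modulation strategy can be illustrated with a short sketch. This is a hedged illustration, not the paper's Equation (12): the low-pass component is approximated here with a mean filter at the coarse-to-fine scale factor, and a predefined scalar injection gain is assumed (scipy provides the filter):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def high_pass_modulation(sr_pred, fine_base, scale=16, gain=1.0):
    # Inject high-frequency detail from the known fine image at the base
    # date into the super-resolved prediction. The low-pass component is
    # approximated with a mean filter whose size equals the scale factor.
    low = uniform_filter(fine_base, size=scale)
    detail = fine_base - low           # high-pass component of the base image
    return sr_pred + gain * detail     # predefined injection gain
```

With a constant base image the injected detail vanishes and the SR prediction is returned unchanged, which matches the intent of the strategy: only genuine high-frequency structure from the base date is added.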

Fusion of LC Prediction with PC Prediction
In HDLSFM, the LC and PC predictions are combined using the weighted combination method with a sliding window, constituting a sliding window-combination method. It should be noted that in IFSDAF [44], the final prediction is also obtained by combining two independent fused images using a weighted combination method, forming the constrained least squares (CLS)-combination, in which the weights are estimated via a constrained least squares method on the scale of the coarse images. In order to further test the effectiveness of the sliding window-combination method adopted in HDLSFM, a comparison of the fusion performance of the CLS-combination method and the sliding window-combination method for the LGC and CI sites was conducted. As shown in Table 5, the sliding window-combination method generally achieved better fusion performance than the CLS-combination method for both sites. Theoretically, the weights of both combination methods are estimated on the scale of the coarse images, which can easily cause a block effect due to the scale inconsistency between the coarse image and two independent fused images. In HDLSFM, in order to obtain a smoother fusion result, the weight measurement is based on a sliding window. The comparison of the visual effects of the two combination methods (Figure 15) verifies HDLSFM's effectiveness, as more spatial details can be seen.
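A sliding window-combination in this spirit can be sketched as below. The inverse-local-error weighting against the coarse observation (assumed already resampled to the fine grid) is our own assumption, since the paper's exact weight measurement is not reproduced here:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sliding_window_combination(pred_lc, pred_pc, coarse_obs, w=25, eps=1e-6):
    # Weight the LC- and PC-based predictions by the inverse of their local
    # squared error against the coarse observation at the prediction date.
    # Computing the errors in a sliding window makes the weights vary
    # smoothly, which avoids the block effect mentioned above.
    err_lc = uniform_filter((pred_lc - coarse_obs) ** 2, size=w)
    err_pc = uniform_filter((pred_pc - coarse_obs) ** 2, size=w)
    w_lc = 1.0 / (err_lc + eps)
    w_pc = 1.0 / (err_pc + eps)
    return (w_lc * pred_lc + w_pc * pred_pc) / (w_lc + w_pc)
```

A CLS-style combination would instead solve for the weights per coarse pixel, which is what produces the blockiness that the sliding window smooths out.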

Comparison with Other DL-Based STF Methods
HDLSFM requires minimal input data, i.e., a known fine-coarse image pair at a base date and a coarse image at a prediction date. Compared to the typical DL-based STF methods, which need at least two pairs of fine and coarse images, the minimal input data requirement of HDLSFM ensures its practicality in areas with severe cloud contamination, since in some cases there may not be two cloud-free fine-coarse image pairs available within a short time interval. The question thus arises of whether HDLSFM can outperform the typical DL-based STF methods, which use two pairs of fine-coarse images with a relatively long time interval. To answer this question, we further compared the fusion performance of HDLSFM and STFDCNN, a typical DL-based STF method, at the LGC site. The dataset of this experiment was the same as the one used by Emelyanova et al. [61]. Fourteen cloud-free Landsat TM images were available for this study site, covering a winter and a summer crop growing season between 2004 and 2005. To remove the influence of radiometric and geometric inconsistencies between Landsat and MODIS [12], the coarse images were simulated MODIS images with a spatial resolution of 480 m, generated from the Landsat images instead of real MODIS images. Meanwhile, to highlight the temporal change between the images at the base date and the prediction date, the original reflectance was transformed to NDVI. Two MODIS-Landsat NDVI image pairs acquired on 16 April 2004 and 3 April 2005, along with the MODIS NDVI images obtained on the other dates, were used as prior images for STFDCNN to predict the fine NDVI images on those dates. For HDLSFM, one MODIS-Landsat NDVI image pair acquired on 16 April 2004, along with the MODIS NDVI images obtained on the other dates, was used as prior images for the prediction. The other twelve real Landsat NDVI images were utilized to evaluate the fusion performance of the two methods. The list of images is shown in Table A1 in Appendix A.
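The reflectance-to-NDVI transformation mentioned above is the standard normalized difference of the near-infrared and red bands; a short sketch follows (the eps guard against zero denominators is our addition):

```python
import numpy as np

def ndvi(nir, red, eps=1e-12):
    # Normalized difference vegetation index from NIR and red reflectance;
    # works elementwise on numpy arrays as well as on scalars.
    return (nir - red) / (nir + red + eps)
```

Working in NDVI rather than raw reflectance compresses the six bands to one index that responds strongly to vegetation change, which is why it highlights the temporal change between the base and prediction dates.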
HDLSFM generally had better mean RMSE, mean UIQI, and mean CC over the twelve predictions than STFDCNN (RMSE 0.1023 vs. 0.1257, UIQI 0.9329 vs. 0.9243, CC 0.8347 vs. 0.7681). The mean SSIM was comparable for the two methods (0.5234 vs. 0.5369). The prediction accuracy of the two methods at different dates was quite different. As shown in Figure 16, the fusion performance of HDLSFM close to the prior fine-coarse image pair of 16 April 2004 was significantly better than that of STFDCNN. STFDCNN's fusion performance was superior in the predictions close to the second prior fine-coarse image pair of 3 April 2005; however, the superiority was not obvious. In other words, although STFDCNN uses more prior information than HDLSFM, its fusion accuracy is still slightly worse than that of HDLSFM. The above results suggest that HDLSFM is generally more accurate than STFDCNN when two pairs of fine-coarse images with a short time interval are not available.
In the previous experiment, we set a special situation, i.e., a long time interval between two fine-coarse image pairs. In this section, we further compared the fusion performance of HDLSFM and STFDCNN in a more general case, taking the CIA site as the study site. In particular, the same as for HDLSFM in Experiment III, three Landsat ETM+-MOD09GA pairs acquired on 8 October 2001, 4 December 2001, and 11 April 2002 were used for training. Based on the trained network, the above fine-coarse image pairs and the Landsat ETM+-MOD09GA pair acquired on 4 May 2002, along with the other available MOD09GA images, were used as the prior images of STFDCNN to predict the other thirteen Landsat images (Table 1).
As shown in Figure 17, although the time interval of the two fine-coarse image pairs of STFDCNN was not as long as at the LGC site, the fusion performance of STFDCNN still had no advantage over HDLSFM. All the evaluation indices were comparable for HDLSFM and STFDCNN (mean RMSE 0.0305 vs. 0.0314, mean SSIM 0.8242 vs. 0.8239, mean UIQI 0.9540 vs. 0.9550, mean CC 0.8228 vs. 0.8108, mean SAM 5.7978 vs. 5.8546, mean ERGAS 1.0333 vs. 1.0396). Considering that the fusion accuracy of the two methods was similar, but HDLSFM has a more flexible input data requirement, HDLSFM has higher practicability than STFDCNN.

The Applicability of HDLSFM
The main advantage of HDLSFM is its ability to predict complex land surface temporal changes, including PC and LC, with minimal input requirements, which makes it more applicable in areas with severe cloud contamination than the typical DL-based STF methods, which use at least two fine-coarse image pairs. Note that FSDAF can also predict LC and PC with minimal input data; however, the experimental results show that FSDAF is sensitive to the radiation differences between the fine and coarse images and to the time interval between the base and prediction dates, with a sensitivity even higher than that of Fit-FC. HDLSFM, by comparison, shows high robustness, which ensures its effectiveness in the generation of fused time-series data using a limited number of fine-coarse image pairs, and its effectiveness on other types of satellite images. Note, however, that the computational cost of HDLSFM is much higher than that of the other STF methods (Table 6); HDLSFM is thus most practical in areas with severe cloud contamination or when the radiation difference between fine and coarse images is large.

Limitations and Future Work
HDLSFM still has certain limitations. Even though HDLSFM is effective for the prediction of both LC and PC using only one fine-coarse image pair, which makes it highly practical in areas with severe cloud contamination, in some cases, there may not be even one cloud-free fine-coarse image pair available, making HDLSFM ineffective. Therefore, future research should focus on expanding the applicability of STF methods to incomplete fine-coarse image pairs. Additionally, the prediction of HDLSFM suffers from blurring effects in heterogeneous areas, which is related to the unsatisfactory performance of the DL-based SR method used in HDLSFM in the case of a limited number of fine-coarse image pairs. Therefore, an adaptation strategy of the DL-based SR method to a limited number of fine-coarse images will be the focus of future research.


Conclusions
In this paper, we proposed a hybrid DL-based STF method (HDLSFM), which formulates a hybrid framework for robust fusion containing both PC and LC information, in which a nonlinear DL-based relative radiometric normalization, a DL-based SR, and a linear-based fusion are combined to address the radiation differences between fine and coarse images, LC prediction, and PC prediction. Meanwhile, the network parameters of the DL-based relative radiometric normalization and the DL-based SR are learned simultaneously by the LapSRN to avoid uncertainty accumulation. The effectiveness of HDLSFM in LC and PC prediction with minimal input requirements, i.e., one pair of fine and coarse images and a coarse image at a prediction date, was verified through four experiments using STARFM, FSDAF, and Fit-FC as benchmark methods. The experimental results suggest that in the case of LC prediction, the mean RMSE of HDLSFM was lower by 0.0616, 0.0021, and 0.0016 than that of STARFM, FSDAF, and Fit-FC, respectively. In the case of PC prediction, the mean RMSE of HDLSFM was lower by 0.0638, 0.0051, and 0.0016 compared to STARFM, FSDAF, and Fit-FC. The experimental results also confirmed the robustness of HDLSFM to radiation differences and to the time interval between the prediction and base dates. These characteristics ensure the high applicability of HDLSFM in the generation of fused time-series data using a limited number of fine-coarse image pairs. Future improvements can be achieved by adapting the DL-based SR method to a limited number of fine-coarse images and by expanding the applicability of STF methods to incomplete fine-coarse image pairs.