Deep Learning for Feature-Level Data Fusion: Higher Resolution Reconstruction of Historical Landsat Archive

Simple Summary: Existing freely available global land surface observations are limited by medium to coarse resolutions or short time spans. Here we developed a feature-level data fusion framework using a generative adversarial network (GAN), a deep learning technique, to leverage Landsat and Sentinel-2 observations during their overlapping period from 2016 to 2019, and reconstruct 10 m Sentinel-2-like imagery from 30 m historical Landsat archives. Our tests with both simulated data and actual Landsat/Sentinel-2 acquisitions showed that the GAN-based super-resolution method could accurately reconstruct synthetic Landsat data at an effective resolution very close to that of the real Sentinel-2 observations. The promising results from our deep learning-based feature-level fusion method highlight the potential for reconstructing a 10 m Landsat series archive since 1984. It is expected to improve our capability of detecting and tracking finer scale land changes, and identifying the drivers for and responses to global environmental changes.

Abstract: A long-term record of fine spatial resolution remote sensing datasets is critical for monitoring and understanding global environmental change, especially with regard to fine scale processes. However, existing freely available global land surface observations are limited by medium to coarse resolutions (e.g., 30 m Landsat) or short time spans (e.g., five years for 10 m Sentinel-2). Here we developed a feature-level data fusion framework using a generative adversarial network (GAN), a deep learning technique, to leverage the overlapping Landsat and Sentinel-2 observations during 2016–2019, and reconstruct 10 m Sentinel-2-like imagery from 30 m historical Landsat archives. Our tests with both simulated data and actual Landsat/Sentinel-2 imagery showed that the GAN-based fusion method could accurately reconstruct synthetic Landsat data at an effective resolution very close to that of the real Sentinel-2 observations.
We applied the GAN-based model to two dynamic systems: (1) land cover dynamics including phenology change, cropping rotation, and water inundation; and (2) human landscape changes such as airport construction, coastal expansion, and urbanization, via historical reconstruction of 10 m Landsat observations from 1985 to 2018. The resulting comparison further validated the robustness and efficiency of our proposed framework. Our pilot study demonstrated the promise of transforming 30 m historical Landsat data into a 10 m Sentinel-2-like archive with advanced data fusion. This will enhance Landsat and Sentinel-2 data science, and facilitate higher resolution land cover and land use monitoring and global change research.


Introduction
Many land changes occur at fine spatial scales ranging from meters to ~10 m [1,2], and thus remote sensing data at higher spatial resolution are needed to effectively detect and monitor these changes across landscapes. In order to understand the drivers associated with global environmental change, repeated observations over a long time period are also critical. The U.S. Geological Survey (USGS) Landsat series of satellites provides the longest continuous record of global land surface observations. Although Landsat and Sentinel-2 observations can be harmonized through cloud-shadow masking, spatial co-registration and common gridding, bidirectional reflectance distribution function normalization, and spectral bandpass adjustment [30], the finer spatial details from Sentinel-2 are not fully used in the fusion [28,31]. To overcome this issue, Wang et al. [31] proposed an area-to-point regression kriging (ATPRK)-based method to generate 10 m Landsat-8-like images by fusing the available 10 m Sentinel-2 and Landsat-8 panchromatic (PAN) bands. This geo-statistical fusion algorithm relies on input covariates derived from the corresponding Sentinel-2 imagery to enhance the spatial details of each lower-resolution image. However, ATPRK involves complex semi-variogram modelling from a co-kriging matrix, which is computationally expensive for large-scale super-resolution processing [28].
One possible cost-effective solution is to leverage Landsat and Sentinel-2 observations during their overlapping period to reconstruct 10 m Sentinel-2-like imagery from historical 30 m Landsat imagery acquired before 2015. This approach is based on a transfer learning strategy enabled by deep learning algorithms in computer vision [32][33][34]: the relationship learned between Landsat and Sentinel-2 over the most recent temporal period is applied to historical pre-2015 Landsat data to regenerate a synthetic 10 m Landsat archive spanning the past four decades.
Convolutional neural network (CNN)-based super-resolution (SR) image reconstruction has proven to be a promising approach for closing the inherent resolution gaps of imaging sensors [33,[35][36][37]. Here, we categorize existing deep learning CNN SR methods into two classes based on their main purpose. One class aims at optimizing distortion measures such as peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and mean square error (MSE) to yield more accurate images; we refer to this group as PSNR-oriented methods. To achieve better quantitative measures, major solutions in this category adopt pixel-wise loss functions and increase the network complexity of CNNs. For example, Ledig et al. [34] proposed SRResNet with 16 residual units to optimize MSE. Lim et al. [38] developed an enhanced deep super-resolution network (EDSR) that significantly improved single-image super-resolution performance by removing the batch normalization layers in SRResNet and increasing the number of residual blocks to 23. Another group of image SR studies attempts to produce more photo-realistic images by including an extra perceptual loss function; these are superior in perceptual quality and are hereafter referred to as perceptual-oriented methods. For example, Johnson et al. [39] proposed a perceptual loss that enhances perceptual quality by minimizing the error in feature space instead of pixel space. A generative adversarial network (GAN) was also developed for image SR, applying both perceptual loss and adversarial loss to generate realistic textures and achieving state-of-the-art performance [34]. Pouliot et al. [36] applied both shallow and deep CNNs to Landsat image super-resolution enhancement, trained using Sentinel-2, in three study sites representing boreal forest, tundra, and cropland/woodland environments.
An extended super-resolution convolutional neural network (ESRCNN) was developed [28], and the proposed deep learning models were shown to outperform the ATPRK-based algorithm in experimental tests over two local study areas in North China's Hebei province. However, these CNN approaches still fall into the category of non-GAN-based models, lacking an adversarial discriminator to better tune the neural network. A similar non-GAN-based CNN model was used to super-resolve the lower-resolution bands (20 m and 60 m) of Sentinel-2 to 10 m spatial resolution over 60 representative Sentinel-2 scenes across the globe [40].
Despite the promising results of these pilot studies, the evaluation and application of deep learning-based approaches in remote sensing data fusion of Landsat and Sentinel-2 is still limited. Firstly, most existing studies on the super-resolution of Landsat and Sentinel-2 rely on non-GAN-based models without fully coupling perceptual loss and adversarial loss into the CNN framework. Secondly, the training and tuning of deep learning algorithms in remote sensing data fusion is complex and sensitive to both spatial and temporal domains. Moreover, the majority of available studies are limited to local scales and short temporal periods. Lastly, a comprehensive comparison of different algorithms, including statistical interpolation, geo-statistical modelling, and non-GAN-based and GAN-based deep learning models, for super-resolving Landsat imagery from 30 m to 10 m is needed to provide insights for practical applications. To our knowledge, there have been few efforts so far on reconstructing 10 m Sentinel-2-like imagery from the historical Landsat archive using state-of-the-art non-GAN and GAN-based deep learning approaches across different spatial and temporal scales.
Addressing these challenges, we aim to implement a feature-level data fusion framework for remote sensing SR image reconstruction. Specifically, we first developed the GAN-based fusion framework to leverage Landsat and Sentinel-2 observations during their overlapping period from 2016 to 2019, for reconstructing 10 m Sentinel-2-like imagery from the historical 30 m Landsat archive. We then provided a comprehensive comparison of GAN-based deep learning models with baseline methods, including statistical interpolation, geo-statistical modelling, and non-GAN-based models, to better inform SR model performance. We further investigated the potential of the GAN-based model in dynamic systems over different spatial and temporal domains. Finally, we demonstrated the promise of transforming 30 m historical Landsat data into a 10 m Sentinel-2-like archive with this advanced data fusion.

Theoretical Basis
The aim of our study was to reconstruct a super-resolved image I SR at the same spatial resolution as Sentinel-2 (I HR ) from a low spatial resolution Landsat image I LR with a convolutional neural network (CNN). A non-GAN-based network (Section 2.1.1) and a GAN-based network (Section 2.1.2) were both used in this study for the purpose of comparing the performance of PSNR-oriented and perceptual-oriented deep learning methods in remote sensing image SR tasks.
As shown in Table S1, Landsat series (i.e., Landsat-5/7/8) have very close bandwidths to Sentinel-2A/B, particularly over blue, green, and red bands. This provided a theoretical basis of spectral comparability to build a band-over-band relationship between Landsat and Sentinel-2. For the experimental purpose in this study, we focused on using the true color composite of red-green-blue bands for both Landsat and Sentinel-2. The proposed methods below can be extended to other paired bands and high-low resolution image pairs as well.

Deep Convolutional Neural Networks
For a low spatial resolution image I LR with W columns, H rows, and C bands, it can be expressed as a tensor of size W × H × C. Its corresponding I HR and I SR have the size rW × rH × C, where r represents the scaling factor. In the training process, I LR and I HR image pairs were used as the input of a feed-forward CNN network G θ parameterized by θ. We adopted the SRResNet architecture [34] in this work, with the only difference being the removal of the batch normalization (BN) layers in the residual blocks, because removing BN layers helps reduce computation costs and increases the model generalization ability [38]. As shown in Figure 1, SRResNet comprises 16 identical residual blocks using ParametricReLU as the activation function, and upscales with 2 trained sub-pixel convolution layers. The objective function is:

θ̂ = arg min θ (1/N) ∑ n=1..N l MSE (G θ (I n LR ), I n HR )

where I n LR and I n HR are corresponding low-high spatial resolution training image pairs, N is the total number of training image pairs, and l MSE is the most widely used loss function for image SR, which calculates the pixel-level MSE between I n HR and the generated image I n SR = G θ (I n LR ).
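The two building blocks described above can be sketched in PyTorch. This is a minimal sketch, not the exact configuration used in this study: the channel count of 64 and the single ×3 pixel shuffle are illustrative assumptions, and the class names are our own.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # SRResNet-style residual block with the batch normalization layers
    # removed, following the EDSR finding that dropping BN helps generalization.
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.prelu = nn.PReLU()
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # Identity skip connection around two convolutions.
        return x + self.conv2(self.prelu(self.conv1(x)))

class Upsampler(nn.Module):
    # Trained sub-pixel convolution: a conv expands channels by r^2, then
    # PixelShuffle rearranges them into an r-times larger spatial grid.
    def __init__(self, channels=64, r=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * r * r, 3, padding=1)
        self.shuffle = nn.PixelShuffle(r)
        self.prelu = nn.PReLU()

    def forward(self, x):
        return self.prelu(self.shuffle(self.conv(x)))
```

With r = 3, a 160 × 160 feature map is mapped to 480 × 480, matching the Landsat-to-Sentinel-2 resolution ratio.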

Generative Adversarial Networks
Generative adversarial networks (GANs), proposed by Goodfellow et al. [41] in 2014, have shown great breakthroughs in training unsupervised machine learning models and have gained popularity in computer vision. A GAN consists of two networks: a generator G and a discriminator D. G is trained to generate output that is close to the real data. D is a multilayer perceptron defined to distinguish whether a sample comes from the generator or from the real data, represented as a simple binary classifier that outputs a single scalar.
For better gradient behavior, E n [− log(D(G(x n )))] is usually used as G's loss function. E x∼p data (x) [− log(D(x))] + E x∼p g (x) [− log(1 − D(G(x)))] is a common form of D's loss function. D and G are trained jointly to solve a two-player min-max problem. G tries to fool D into labeling its samples as real data by minimizing its loss function and generating samples approximating the real data distribution. D tries to maximize the probability of assigning the correct label to samples from both the generator's distribution (p g ) and the data distribution (p data ).
A GAN-based model was constructed by taking the non-GAN-based model (Section 2.1.1) as the generator (G θ ) and training it with a discriminator (Figure 2). The architecture of the discriminator closely followed ESRGAN [42], which uses a relativistic average discriminator: instead of estimating whether a generated image is real or fake, it estimates how realistic a generated image is relative to a real image.
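The relativistic average losses can be sketched as follows. This is a minimal sketch following the ESRGAN formulation, operating on raw discriminator logits; the function names are our own.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    # Discriminator objective: real samples should score higher than the
    # average fake sample, and fake samples lower than the average real one.
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    loss_real = F.binary_cross_entropy_with_logits(
        real_rel, torch.ones_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_rel, torch.zeros_like(fake_rel))
    return (loss_real + loss_fake) / 2

def relativistic_g_loss(real_logits, fake_logits):
    # Generator objective: the same relativistic scores with labels swapped,
    # pushing generated images to look more realistic than real ones on average.
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    loss_real = F.binary_cross_entropy_with_logits(
        real_rel, torch.zeros_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_rel, torch.ones_like(fake_rel))
    return (loss_real + loss_fake) / 2
```

When the discriminator confidently separates real from fake, its own loss is small while the generator's loss is large, which is the training signal the generator minimizes.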

Loss Function
The total loss of the generator G θ is the weighted combination of adversarial loss and perceptual loss, defined as:

l total = l VGG19−54 + w · l G

where w is the weight, set to 10 −3 , and l G is the generator's loss E n [− log(D(G(x n )))]. The VGG loss l VGG19−54 is the perceptual loss, defined as the L 2 -norm distance between the feature representations (i.e., features obtained by the 4th convolution before the 5th max-pooling layer within the pre-trained 19-layer VGG network) of a generated image I SR and the reference high spatial resolution image I HR .
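The combination can be sketched as below. This is a minimal sketch: `toy_features` is a hypothetical stand-in for the pre-trained VGG19 truncated before the 5th max-pooling layer (in practice one would use, e.g., a truncated `torchvision.models.vgg19` feature extractor), and the function names are our own.

```python
import torch
import torch.nn as nn

def perceptual_loss(feature_extractor, sr, hr):
    # L2 distance between feature maps of the super-resolved image and the
    # high-resolution reference, computed in feature space rather than pixel space.
    return nn.functional.mse_loss(feature_extractor(sr), feature_extractor(hr))

def total_generator_loss(feature_extractor, sr, hr, adv_loss, w=1e-3):
    # Total generator loss: perceptual (VGG-space) loss plus w times the
    # adversarial loss, with w = 1e-3 as in the text.
    return perceptual_loss(feature_extractor, sr, hr) + w * adv_loss

# Hypothetical stand-in feature extractor; replace with the truncated
# pre-trained VGG19 in a real implementation.
toy_features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
```

When the generated and reference images are identical, the perceptual term vanishes and only the weighted adversarial term remains.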

Implementation
The scaling factor r between I SR and I HR was 3 in all experiments, to be consistent with the spatial resolution difference between the Landsat and Sentinel-2 datasets. The mini-batch size was set to 16. Following SRGAN, training images were cropped into 96 × 96 patches for each mini-batch.
For the simulated data experiment and the real data experiment (Sections 3.1 and 3.2), the modified SRResNet networks (referred to as the non-GAN-based network hereafter) were trained for 1 million iterations with the MSE loss and a learning rate of 10 −4 , decayed by a factor of 2 at 50 k, 100 k, 200 k, and 300 k iterations.
When training the GAN-based model, we employed the trained non-GAN-based network as initialization for the generator to obtain better results [34]. GAN-based networks were trained for 5 × 10 5 iterations at a learning rate of 10 −4 , decayed by a factor of 2 every 2 × 10 5 iterations. For optimization, we used the Adam method [43] with β 1 = 0.9, β 2 = 0.999, and ε = 10 −8 . The generator and discriminator were updated alternately. We implemented these networks using Python 3.7.6 with the PyTorch 1.0.1 framework and trained them on an NVIDIA 2080 Ti GPU.
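The optimizer and learning-rate schedule for the non-GAN training stage can be sketched as follows. This is an illustrative sketch: the one-layer `model` is a hypothetical stand-in for the modified SRResNet generator.

```python
import torch

# Hypothetical tiny stand-in for the modified SRResNet generator.
model = torch.nn.Conv2d(3, 3, 3, padding=1)

# Adam with the hyperparameters stated in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)

# Halve the learning rate at 50k, 100k, 200k, and 300k iterations,
# matching the non-GAN training schedule described above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000, 100_000, 200_000, 300_000], gamma=0.5)
```

In a training loop, `optimizer.step()` and `scheduler.step()` would be called once per iteration, so the learning rate drops to 5 × 10⁻⁵ after the 50 k-th iteration.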

Landsat and Sentinel-2 Datasets
We selected study areas spatially distributed across different landscapes of the U.S. to provide more diverse terrestrial land cover information for training and testing (Figure 3a). A total of 11 cloud-free Landsat and Sentinel-2 imagery pairs from 2016 to 2019 were collected and downloaded from the Google Earth Engine (https://earthengine.google.com/) data archive (Table 1, Figure 3a). We identified Landsat and Sentinel-2 imagery pairs that both had less than 5% cloud cover and were acquired less than 5 days apart. We extracted the overlapping areas for each pair of Landsat and Sentinel-2 imagery, and further cropped them into patch-based Landsat and Sentinel-2 imagery pairs with a unified patch size (160 × 160 for Landsat and 480 × 480 for Sentinel-2 in this study). These cropped subsets of Landsat and Sentinel-2 imagery pairs (Figure 4) were then used as input to the CNN models.
We collected three additional groups of datasets for time-series super-resolution experiments aimed at capturing intra-annual land cover dynamics at fine spatial resolutions (Table 2). We acquired a series of Landsat-8 imagery in the states of Massachusetts and California to enhance their spatial resolutions to better capture phenology changes over vegetated areas, cropping rotations over agricultural areas, and water dynamics over reservoirs.
To test the ultimate goal of transforming the 30 m historical Landsat archive into a 10 m Sentinel-2-like archive, we selected Schonefeld, Dubai, and Las Vegas as worldwide experimental sites (Figure 3b) to super-resolve annual Landsat imagery from 1985 to 2018.
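The patch-pair preparation described above can be sketched in Python. This is a minimal sketch assuming the Landsat and Sentinel-2 arrays are already co-registered over their overlap; the function name and the non-overlapping tiling scheme are illustrative assumptions.

```python
import numpy as np

def crop_patch_pairs(landsat, sentinel2, lr_patch=160, r=3):
    # Tile a co-registered Landsat (30 m) / Sentinel-2 (10 m) scene pair into
    # aligned patches: each 160 x 160 LR patch matches a 480 x 480 HR patch.
    hr_patch = lr_patch * r
    pairs = []
    rows, cols = landsat.shape[:2]
    for i in range(0, rows - lr_patch + 1, lr_patch):
        for j in range(0, cols - lr_patch + 1, lr_patch):
            lr = landsat[i:i + lr_patch, j:j + lr_patch]
            # The HR window covers the same ground footprint, scaled by r.
            hr = sentinel2[i * r:(i + lr_patch) * r, j * r:(j + lr_patch) * r]
            pairs.append((lr, hr))
    return pairs
```

Each returned tuple is one training sample for the CNN models, with the LR array as input and the HR array as the reference.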

Tests with Simulated Data
In this experiment, cropped Sentinel-2 images with a size of 480 × 480 × 3 were used as the I HR images. The corresponding I LR images (160 × 160 × 3) were obtained by coarsening the Sentinel-2 images by a factor of three with bicubic interpolation. A total of 1220 I LR -I HR image pairs were randomly split, with 90% used for training and 10% for validation. It should be noted that the validation dataset was used to provide an unbiased evaluation while tuning model hyperparameters. An additional 57 testing images were used to evaluate the final model performance.
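The simulated low-resolution inputs can be generated as sketched below; a minimal sketch using PyTorch's bicubic interpolation, with the function name our own.

```python
import torch
import torch.nn.functional as F

def simulate_lr(hr, r=3):
    # Coarsen a 10 m Sentinel-2 patch (N x C x H x W tensor) by a factor of r
    # with bicubic interpolation to simulate a 30 m Landsat-like image.
    h, w = hr.shape[-2:]
    return F.interpolate(hr, size=(h // r, w // r), mode='bicubic',
                         align_corners=False)
```

Applied to a 480 × 480 Sentinel-2 patch, this yields the 160 × 160 simulated input used for training and validation.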
We selected two sites out of the 57 testing images to visually compare the spatial details and spectral fidelity between the original Sentinel-2 images and the reconstructed 10 m Sentinel-2-like images, as shown in the zoomed-in subsets for both sites (Figure 5e-h,m-p). The interpolated Landsat images (Figure 5f,n) were very blurred, and it was very difficult to identify the exact road networks and infrastructure edges. The non-GAN derived images (Figure 5g,o) improved the spatial details to a certain degree, but blurring effects still existed. The GAN derived images (Figure 5h,p), in contrast, retained fine-resolution spatial details that were nearly identical to the original Sentinel-2 images. For example, the road networks, building edges, and land parcel blocks identified from the original and reconstructed images were quite similar through visual inspection.
We used a group of full-reference indices, including the quality index (QI) [44], PSNR, root-mean-square error (RMSE), and ERGAS [45], and non-reference indices, including NIQE [46], PIQE [47], and BRISQUE [46], to assess the model performance on the 57 testing images. Based on the quantitative measures (Table 3), the non-GAN approach achieved a slightly better performance in retaining spectral fidelity than the traditional interpolation and the GAN approach. In contrast, the non-reference indices revealed that the GAN derived results had much better spatial details than the interpolated and non-GAN derived results. For example, the NIQE, PIQE, and BRISQUE scores of the GAN derived results were much lower than those of the interpolated and non-GAN derived results.
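Two of the full-reference measures can be sketched as follows; a minimal sketch assuming images are normalized reflectance arrays with a peak value of 1.0 (function names are our own).

```python
import numpy as np

def rmse(ref, est):
    # Root-mean-square error between a reference image and a reconstruction.
    diff = ref.astype(np.float64) - est.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(ref, est, peak=1.0):
    # Peak signal-to-noise ratio in dB; higher values mean less distortion.
    e = rmse(ref, est)
    return float('inf') if e == 0 else 20.0 * np.log10(peak / e)
```

For example, a uniform error of 0.1 on unit-peak data gives an RMSE of 0.1 and a PSNR of 20 dB.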

Tests with Real Data
Different from the simulated data experiment in Section 3.1, Landsat images were directly used as I LR and Sentinel-2 images were used as I HR in this experiment. Similarly, we used 90% of the 1220 Landsat and Sentinel-2 I LR -I HR image pairs for training, 10% for validation, and an additional 57 image pairs for testing the final model performance. In addition to the interpolation and non-GAN derived methods, we included another five methods as baselines for comparison: the smoothing filter-based intensity modulation (SFIM) [48], high-pass filtering (HPF) [49], band-dependent spatial-detail (BDSD) [50], area-to-point regression kriging (ATPRK) [31,51], and target-adaptive CNN-based (TCNN) algorithms [52,53]. These five methods all require the guidance of 10 m Sentinel-2 imagery to downscale the corresponding 30 m Landsat imagery.
We again selected two sites out of the 57 testing images to visually compare the original Sentinel-2 images and the reconstructed 10 m Sentinel-2-like images (Figure 7). The spectral signatures of the reference images and the interpolated, SFIM derived, HPF derived, BDSD derived, ATPRK derived, TCNN derived, non-GAN derived, and GAN derived 10 m Sentinel-2-like images were similar in general (Figure 7). However, we could still identify obvious spectral distortions in the SFIM, HPF, and BDSD derived results (Figure 7c-e,u-w). In contrast, the reconstructed images showed considerable differences in spatial details, as reflected in the zoomed-in subsets in Figure 7j-r for site A and Figure 7ab-aj for site B. The images derived from the ATPRK, TCNN, and GAN methods retained the fine spatial details relatively well; these reconstructions were almost identical to the original Sentinel-2 images, with clearly identifiable building blocks and roads.
We used the same set of quantitative measures to assess the model performance on the 57 testing images in the real-data experiment. The quantitative results (Table 4) revealed that the non-GAN-based model was better at retaining spectral fidelity, while the GAN-based model could reconstruct more detailed high-frequency spatial patterns. Noticeably, the ATPRK-based model also achieved a comparable performance in preserving spatial and spectral information. However, it should be noted that the implementation of the SFIM, HPF, BDSD, and ATPRK models requires the corresponding Sentinel-2 images as downscaling guidance, whereas the input of the non-GAN and GAN-based models is only the 30 m Landsat imagery. On this basis, the results shown in Figure 7 and Table 4 further reinforce the potential value of our proposed methods.

Tests with Time-Series Data
We further applied the GAN-based model trained on the real data in Section 3.2 to time series of Landsat imagery across different landscapes, in order to test its performance in capturing intra-annual land cover dynamics, including phenology changes over vegetated areas, cropping rotations over agricultural areas, and water dynamics.
A selection of cloud-free Landsat-8 imagery was acquired in Massachusetts, where seasonal phenology changes are significant (Figure 8). The major land-cover types were forest, grass, water, bare land, and buildings for both sites A and B. Compared with the interpolated Landsat images (left panels of Figure 8a,b), the reconstructed Sentinel-2 like images (right panels of Figure 8a,b) greatly improved the spatial details while preserving spectral consistency.
A selection of monthly Landsat-8 imagery from January to October was acquired in California, where agricultural activities were dominant and cropping rotation was frequent (Figure 9). By applying the GAN-based model, we derived the corresponding 10 m Sentinel-2-like images (right panels of Figure 9). Results showed that the reconstructed images accurately captured crop phenology changes and depicted double cropping rotations (i.e., the first from January to April, and the second from May to October). Given the relative homogeneity of land cover types in this region, the enhanced spatial details of the reconstructed Sentinel-2-like images did not show considerable differences from those of the interpolated Landsat images through visual inspection. However, the intra-class differences in growing status and phenology changes could be easily identified from the reconstructed Sentinel-2-like images, whereas this was very difficult from the interpolated Landsat images.
An additional selection of monthly Landsat-8 imagery from January to October was acquired in California, where water bodies were dominant, with certain changes in spatial coverage and water color (Figure 10). Compared with the interpolated Landsat images (left panels of Figure 10), the reconstructed Sentinel-2-like images (right panels of Figure 10) captured spatial details much more accurately without blurring effects. For example, the inundation areas on August 18 and September 3 were depicted with fine spatial details in both intra-class variation and adjacent boundaries in the Sentinel-2-like images, while the corresponding Landsat images only captured the general distribution of inundations with obvious blurs.

Reconstruction of 10 m Historical Landsat Archive
To address our ultimate goal of transforming 30 m historical Landsat images into a 10 m Sentinel-2-like archive, we acquired annual Landsat imagery from 1985 to 2018 in Schonefeld, Dubai, and Las Vegas (Figure 3b), where land cover changes were extreme (Figure 11). In Schonefeld, Germany, cropland was converted into artificial impervious areas for airport construction. In Dubai, United Arab Emirates, coastal expansion was tremendous, with the creation of the Palm Islands and extreme urbanization. In Las Vegas, U.S.A., bare lands were rapidly converted to urbanized areas over the past decades. We applied the GAN-based models to generate time-series 10 m Sentinel-2-like images over the three selected regions. Given that no real 10 m Sentinel-2 images were available before 2015, we visually compared the image quality, in terms of spectral fidelity and spatial detail, between the interpolated Landsat and the reconstructed Sentinel-2-like images. As shown by the derived results at five-year intervals from 1985 to 2018 in Figure 11, the reconstructed images preserved consistent spectral fidelity and achieved plausible spatial details. We could also identify the very close spatial details between the predicted Sentinel-2-like imagery and the observed Sentinel-2 imagery in 2018. The zoomed-in comparison (Figure 12) further demonstrated that the improvement of spatial details using the GAN-based model was very obvious and robust over the long-term temporal period. For example, the road networks, building infrastructure, and street blocks were quite blurred in the original Landsat images (left panels of Figure 12a-c), whereas these features were clearly distinguishable in the reconstructed Sentinel-2-like images.

Training and Tuning of Deep Learning Models
Deep learning approaches are typically complex in terms of neural network structure. They may be sensitive to the training and tuning process, and their performance may vary on landscapes with varying spatial heterogeneity and temporal dynamics. These pose challenges for fusing remote sensing imagery with multispectral and multiscale observations. Here, we briefly discuss four important aspects of applying non-GAN and GAN-based deep learning SR models: the training dataset, model generalization ability, the potential benefit from transfer learning, and the impact of the number of iterations.
Firstly, training data are a key component of machine learning algorithms, which depend on the input reference and memorize statistical and structural information for future predictions. Models usually perform better on a test set that has a distribution similar to that of the training set. Therefore, increasing the number and variability of training samples typically leads to better performance of deep learning models. This calls for future efforts to expand and consolidate a regional to global training library that covers more comprehensive land use/cover types and changes. Secondly, generalization ability describes a model's capacity to make accurate predictions on new data that it never encountered during training; it is a key element of a successful and robust model. In this study, a number of Landsat and Sentinel-2 image pairs were collected across different landscapes in the United States. In Sections 3.1 and 3.2, although the training and testing datasets were randomly split without any overlapping information, they shared similar land cover types between adjacent locations. Therefore, similar patterns from training could be readily transferred to the test dataset, making predictions over the testing data less challenging. However, the test dataset collected in the experiments using seasonal and long-term time-series imagery (Sections 3.3 and 3.4) was totally independent of the training dataset; our model performed similarly and was able to accurately predict and capture the land-cover dynamics with high spectral fidelity and fine spatial detail, demonstrating that it generalizes very well. This further verified the possibility of leveraging limited samples of Landsat and Sentinel-2 observations during their overlapping period from 2016 to 2019 to transform 30 m historical Landsat images into a 10 m Sentinel-2-like archive.
Thirdly, transfer learning, which reuses the parameters of a neural network pre-trained on large datasets, has been proven to be quite helpful and efficient for solving a different but related problem [54]. It usually has the advantage of requiring fewer training samples and less time while obtaining better results. To test whether remote sensing super-resolution could benefit from initial weights pre-trained on a natural image dataset, we conducted a group of experimental tests. In this experiment, the GAN-based network was first trained on the DIV2K dataset [55] and then fine-tuned using remote sensing images at a smaller learning rate of 10⁻⁵, because a smaller learning rate generally performs better during fine-tuning. Comparative results revealed that the integration of natural images did not improve model performance (Figure S1, Table S2). This can be attributed to the difference in spectral wavelengths and spatial details between remote sensing imagery and natural images.
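The two-stage schedule described above (pre-train at a normal learning rate, then fine-tune from the saved weights at 10⁻⁵) can be illustrated with a minimal sketch. The toy linear model and all names below are our own illustration, not the paper's code; only the learning-rate relationship mirrors the experiment.

```python
import numpy as np

def train(w, X, y, lr, steps):
    """Plain gradient descent on mean squared error for a linear model."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

# Stage 1: "pre-training" (stands in for training on DIV2K).
w_pre = train(np.zeros(4), X, y, lr=1e-2, steps=500)

# Stage 2: continue from the pre-trained weights at the much smaller
# learning rate (1e-5), mirroring the fine-tuning setting in the text.
w_ft = train(w_pre.copy(), X, y, lr=1e-5, steps=100)

mse = float(np.mean((X @ w_ft - y) ** 2))
```

The key design point is that fine-tuning starts from `w_pre` rather than from scratch, so the small learning rate only makes gentle, domain-specific adjustments.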
Finally, the number of iterations is another critical factor that influences the accuracy of deep learning-based models. As shown in Figure 13a, the accuracy of the non-GAN-based model was relatively low between 5000 and 200,000 iterations, increased rapidly with the number of iterations, and then became stable beyond 400,000 iterations. Similar patterns were found for the GAN-based model (Figure 13b). We noticed that the GAN-based model was less stable than the non-GAN-based model, reflecting the fact that a GAN is more sensitive to training than a normal deep CNN [32]. This sensitivity analysis provided semi-empirical knowledge that can be very helpful for training and tuning optimal deep learning models.
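In practice, the iteration-accuracy behavior described above can be monitored with a validation metric such as PSNR and a simple plateau check. The helper functions below are our own illustrative sketch (the window and tolerance values are assumptions, not taken from the paper):

```python
import numpy as np

def psnr(ref, pred, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between a reference and a prediction."""
    mse = np.mean((ref.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def has_plateaued(metric_history, window=3, tol=0.05):
    """True when the last `window` evaluations improved by less than `tol` dB."""
    if len(metric_history) <= window:
        return False
    recent = metric_history[-(window + 1):]
    return max(recent) - recent[0] < tol

# Hypothetical PSNR logged at regular checkpoints: rapid rise, then stable,
# qualitatively matching the pattern in Figure 13.
curve = [22.1, 27.8, 30.5, 31.6, 31.62, 31.63, 31.64]
```

Stopping once `has_plateaued` returns True would avoid spending compute past the stable regime (beyond roughly 400,000 iterations in the experiments above).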

Future Prospect
There are so far no existing algorithms that can produce super-resolution images with both high quantitative accuracy and high perceptual quality at the same time. Non-GAN-based methods, which optimize a pixel-wise reconstruction measure, tend to produce over-smoothed images that lack high-frequency textures and have also been shown to correlate poorly with human perception [34,56]. In contrast, GAN-based methods achieve very good performance in reconstructing high-frequency spatial details but are often inferior in terms of distortion measures [34,39,56,57]. This leads to the well-known perception-distortion trade-off. On one hand, the balance between quantitative accuracy and spatial detail can be application-oriented, depending on the specific purpose. On the other hand, although this trade-off cannot be resolved completely [58], future work can alleviate the dilemma. For example, in our generator design, adding an L1 regularization term that accounts for pixel-level differences between the true and generated images forces the generated super-resolved image to stay close to its corresponding high-resolution target I^HR. Therefore, the accuracy increased while the perceptual quality was preserved. Designing more advanced loss functions is another promising direction.
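The combined objective described above, an adversarial term plus an L1 pixel term, can be sketched as follows. This is a minimal NumPy illustration of the loss structure only; the function names and the `l1_weight` value are our assumptions, not the paper's implementation.

```python
import numpy as np

def generator_loss(disc_on_fake, sr, hr, l1_weight=100.0):
    """Combined generator objective: adversarial term plus an L1 pixel term.

    disc_on_fake : discriminator probabilities D(G(x)) for generated images
    sr, hr       : super-resolved output and the true high-resolution target
    l1_weight    : trade-off weight (assumed value, not from the paper)
    """
    eps = 1e-12
    # Non-saturating adversarial loss: push D(G(x)) toward 1 (perceptual quality).
    adv = -np.mean(np.log(disc_on_fake + eps))
    # L1 term pulls the super-resolved image toward I^HR pixel-wise (accuracy).
    pixel = np.mean(np.abs(sr - hr))
    return adv + l1_weight * pixel
```

When `sr` equals `hr`, the pixel term vanishes and only the adversarial term remains; increasing `l1_weight` shifts the balance toward distortion accuracy at some cost to perceptual sharpness, which is exactly the trade-off discussed above.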
We applied the GAN-based super-resolution algorithm to leverage Landsat and Sentinel-2 observations during their overlapping period from 2016 to 2019 to transform 30 m historical Landsat images into a 10 m Sentinel-2-like archive. Although the experimental results at both intra- and inter-annual scales showed the robust performance of our proposed method (Figures 8-12), our methods and results should be interpreted in light of certain uncertainties. Firstly, differences in sensors, observation geometries, and image quality may lead to potential biases in the spatial details and spectral fidelity of the time-series reconstructed Landsat imagery. For example, lower quality of the input data, e.g., due to cloud/shadow contamination and geometric distortion, may cause errors, especially for long-term temporal reconstruction. In addition, although our proposed methods can be easily extended to other sensors such as National Agriculture Imagery Program (NAIP) imagery (Figure S2), RapidEye, and PlanetScope datasets, careful consideration should be given to the scale discrepancy between the two paired images. In this study, with a factor of three between the Landsat and Sentinel-2 pixel sizes, super-resolution reconstruction worked very well. However, if the scale difference is larger than eight, single-image super-resolution may yield unrealistic and blurred results [59]. In this situation, medium-resolution images between the coarse- and high-resolution images may be used as a bridge to close the scale gap.
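The factor-of-three scale relationship between the two sensors can be emulated by block averaging when building simulated coarse/fine training pairs. The function below is an illustrative sketch under that simple degradation assumption, not the exact degradation model used in the paper:

```python
import numpy as np

def block_downsample(img, factor=3):
    """Average non-overlapping factor x factor blocks, e.g. 10 m -> 30 m pixels."""
    h, w = img.shape
    assert h % factor == 0 and w % factor == 0, "size must be divisible by factor"
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# A 6x6 "10 m" patch becomes a 2x2 "30 m" patch; radiometry (the mean) is preserved.
patch = np.arange(36, dtype=np.float64).reshape(6, 6)
coarse = block_downsample(patch)
```

For scale gaps beyond the recommended limit, the same idea supports the bridging strategy above: apply the super-resolution model in stages (e.g., two 3x steps) rather than attempting one large jump.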
Despite these uncertainties, our study demonstrated that the fusion methods developed here and corresponding reconstructed Sentinel-2-like archive are very promising, and can be applied to a variety of other applications, such as urban, forest, wetland, wildfires, land cover classification and change detection. Potentially, all research and applications that traditionally rely on Landsat but require higher spatial details can be enhanced with the support of our feature-level fusion methods and resulting reconstructed Sentinel-2-like data archive.

Conclusions
Fine spatial resolution Earth observation images are desirable for much environmental change and land management research. However, the investigation of long-term terrestrial change has mostly relied on 30 m moderate-resolution Landsat imagery. This study sought to leverage Landsat and Sentinel-2 observations during their overlapping period since 2016 to reconstruct 10 m Sentinel-2-like imagery from 30 m historical Landsat data, both within the Sentinel-2 era and potentially over the past four decades. To achieve this goal, we presented a generative adversarial network (GAN)-based data fusion framework that blended fine-resolution feature-level spatial information from Sentinel-2 to enhance the spatial resolution of the historical Landsat archive. We conducted a set of experimental tests using simulated data, actual Landsat/Sentinel-2 acquisitions, and time-series Landsat imagery. Our results showed that Landsat data reconstructed to 10 m using the GAN-based super-resolution method accurately captured spatial features at an effective resolution close to that of Sentinel-2 observations. Historical reconstruction of 10 m Landsat observations from 1985 to 2018 across three different types of landscape change further verified the robustness and efficiency of our proposed method for monitoring land changes. The promising results from our deep learning-based feature-level fusion method highlight the potential for reconstructing a 10 m Landsat series archive since 1984. The enhanced Landsat and Sentinel-2 data are expected to improve our capability of detecting and tracking finer scale land changes, identifying the drivers for and responses to global environmental changes, and designing better land management policies and strategies.

Supplementary Materials: Figure S2: Super-resolution results using NAIP imagery; the reference high-resolution 0.6 m NAIP images are shown on the left, with zoomed-in subsets of the yellow box from the original reference, bicubic interpolation, non-GAN-derived, and GAN-derived results on the right. Table S1: Comparison of Landsat and Sentinel-2 bandwidths. Table S2: Accuracy assessment of two training scenarios in the actual experiment.