Landsat Super-Resolution Enhancement Using Convolution Neural Networks and Sentinel-2 for Training

Landsat is a fundamental data source for understanding historical change and its effect on environmental processes. In this research we test shallow and deep convolution neural networks (CNNs) for Landsat image super-resolution enhancement, trained using Sentinel-2, in three study sites representing boreal forest, tundra, and cropland/woodland environments. The analysis sought to assess baseline performance and determine the capacity for spatial and temporal extension of the trained CNNs. This is not a data fusion approach, and a high-resolution image is only needed to train the CNN. Results show improvement for both networks, with the deeper network generally achieving better results. For spatial and temporal extension, the deep CNN performed as well as or better than the shallow CNN, but at greater computational cost. Results for temporal extension were influenced by the potential for change between image dates, which reduced the performance difference between the shallow and deep CNN. Visual examination revealed sharper images regarding land cover boundaries, linear features, and within-cover textures. The results suggest that spatial enhancement of the Landsat archive is feasible, with optimal performance where CNNs can be trained and applied within the same spatial domain. Future research will assess the enhancement on time series and associated land cover applications.


Introduction
High spatial and temporal resolution earth observation (EO) images are desirable for many remote sensing applications, providing a finer depiction of spatial boundaries or timing of environmental change. Landsat provides the longest record of moderate spatial resolution (30 m) data of the earth from 1984 to present. It is currently a fundamental data source for understanding historical change and its relation to carbon dynamics, hydrology, climate, air quality, biodiversity, wildlife demography, etc. Landsat temporal coverage is sparse due to the 16-day repeat visit and cloud contamination. Several studies have addressed this through time series modeling approaches [1][2][3]. Temporal enhancement is a key requirement, but spatial enhancement is another aspect of Landsat that could be improved for time series applications. Enhancement of spatial resolution has been carried out mostly based on data fusion methods [4][5][6][7]. Studies have also shown that data fusion can lead to improvements in quantitative remote sensing applications such as land cover [4,8,9]. Although effective, data fusion techniques are limited by the requirement for coinstantaneous high-resolution observations. For more recent sensors such as Landsat-8 and Sentinel-2 this requirement is met with the panchromatic band and provides the greatest potential for spatial enhancement. However, for a consistent Landsat time series from 1985 to present, a method that will provide the same level of enhancement across sensors is needed. For Landsat-5, a suitable high-resolution source is generally inadequate in space or time to facilitate generation of an extensive spatially enhanced Landsat archive.
Remote Sens. 2018, 10, 394
Numerous spatial resolution enhancement methods have been developed. However, recently, deep learning convolution neural networks (CNNs) have been shown to outperform these, with large improvements over bicubic interpolation and smaller gains over more advanced anchored neighborhood regression approaches [10]. CNNs are a special form of neural network. The basic neural network is made up of a collection of connected neurons with learnable weights and biases that are optimized through error backpropagation [11]. Its input is a vector, whereas the input to a convolution neural network is an array or image. For each convolution layer, a set of weights is learned for a filter of size m × n × c that is convolved over the image, where m and n are the vertical and horizontal dimensions and c is the number of input features to the convolution layer. Essentially, a convolution neural network can learn the optimal set of filters to apply to an image for a specific image recognition task. Thus, one strategy has been to use CNNs as feature extractors in remote sensing classification applications [12].
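To make the m × n × c filter operation concrete, the following minimal sketch slides a single filter over a small multi-band image without padding. The function name, the nested-list image layout, and the averaging filter are illustrative assumptions; a trained CNN would learn the filter weights and apply many filters per layer.

```python
def conv2d_valid(image, kernel):
    """Slide an m x n x c filter over an H x W x c image (no padding, stride 1).

    image  : nested lists indexed as image[row][col][band]
    kernel : nested lists indexed as kernel[i][j][band]
    Returns an (H - m + 1) x (W - n + 1) feature map.
    """
    m, n, c = len(kernel), len(kernel[0]), len(kernel[0][0])
    height, width = len(image), len(image[0])
    out = []
    for r in range(height - m + 1):
        row = []
        for col in range(width - n + 1):
            total = 0.0
            # Weighted sum over the filter footprint and all input bands.
            for i in range(m):
                for j in range(n):
                    for b in range(c):
                        total += image[r + i][col + j][b] * kernel[i][j][b]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 4 x 4 image with c = 2 bands and a 3 x 3 x 2 filter that
# averages the first band and ignores the second.
image = [[[float(r + col), 1.0] for col in range(4)] for r in range(4)]
kernel = [[[1.0 / 9.0, 0.0] for _ in range(3)] for _ in range(3)]
feature_map = conv2d_valid(image, kernel)
```

Each output value is the sum of products between the filter weights and the image values under the filter, taken across all c input features.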
There has been significant development of CNNs for super-resolution enhancement with non-remote sensing image benchmark databases such as CIFAR-100 [13] or ImageNet [14]. Dong et al. [10] developed the Super-Resolution Convolutional Neural Network (SRCNN), which used small 2- and 4-layer CNNs to show that the learned model performed better than other state-of-the-art methods. Kim et al. [15,16] developed two deep convolutional networks for super-resolution enhancement. The first was the Deeply-Recursive Convolutional Network for Image Super Resolution (DRCN), which used recursive or shared weights to reduce model parameters in a deep 20-layer network. The second was also a deep 20-layer network (Very Deep Super Resolution, VDSR), but introduced the concept of the residual learning objective. In this approach, instead of learning the fine resolution image, the differences between the fine and coarse resolution images are learned. This led to significant performance gains over SRCNN. The mean squared error loss is widely used for CNN super-resolution training. An interesting alternative was tested by Svoboda et al. [17], who used a gradient-based learning objective that minimized the mean squared error between spatial image gradients computed using the Sobel operator. Performance by standard measures, however, was not improved. Mao et al. [18] developed a deep encoder-decoder CNN with skip connections between associated encode and decode layers. It achieved improved accuracy relative to SRCNN for both 20- and 30-layer versions. An ensemble-based approach was tested in Wang et al. [19] and was found to provide an improvement in accuracy. Other methods have focused on maintaining or improving accuracy while reducing the total model parameters. The Efficient Sub-Pixel Convolutional Neural Network (ESPCN) reduces computational and memory complexity by increasing the resolution from low to high only at the end of the network [20]. The DRCN approach [15] was extended to include residual and dense connections by Tai et al. [21]. This provided a deep network with recursive layers, reducing the model parameters and achieving the best results for the assessment undertaken.
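The residual learning objective described above can be illustrated with a short sketch: the network is trained to predict the difference between the fine and coarse images, and the enhanced image is recovered by adding the predicted difference back to the coarse input. The function names and the 2 × 2 reflectance values are hypothetical and serve only to show the arithmetic.

```python
def residual_target(fine, coarse):
    """Training target under residual learning: fine minus coarse, per pixel."""
    return [[f - c for f, c in zip(f_row, c_row)]
            for f_row, c_row in zip(fine, coarse)]


def reconstruct(coarse, predicted_residual):
    """Enhanced image: coarse input plus the residual predicted by the network."""
    return [[c + r for c, r in zip(c_row, r_row)]
            for c_row, r_row in zip(coarse, predicted_residual)]


# Hypothetical single-band patches on the same grid.
fine = [[0.25, 0.5], [0.75, 1.0]]
coarse = [[0.5, 0.5], [0.5, 0.5]]
target = residual_target(fine, coarse)
# A network that predicted the residual perfectly would reproduce the fine image.
restored = reconstruct(coarse, target)
```

Because much of the low-frequency content is shared between the coarse and fine images, the residual is typically small and centered near zero, which is one reason this objective is easier to optimize in deep networks.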
Residual connections in CNNs were introduced by He et al. [22] for image object recognition. Residual connections force the next layer in the network to learn something different from the previous layers and have been shown to alleviate the problem of deep learning models not improving performance with depth. In addition to going deep, Zagoruyko and Komodakis [23] showed that going wide can increase network performance for image recognition. More recently, Xie et al. [24] developed wide residual blocks, which add another dimension referred to as cardinality in addition to network depth and width. The rate of new developments in network architectures is rapid, with incremental improvements in accuracy or reductions in model complexity and memory requirements.
For spatial enhancement of remote sensing imagery, much less research has been carried out regarding the potential of CNNs. Only recently have results been presented by Collins et al. [25], who applied networks similar to Dong et al. [10] for enhancement of the Advanced Wide Field Sensor (AWiFS) using the Linear Imaging Self Scanner (LISS-III). Their study provides a good benchmark for CNN performance because the two sensors have the same spectral bands and are temporally coincident. Results showed similar performance to other CNN-based super-resolution studies for the scaling ratio of 2.3 (56 m/24 m spatial resolution).
Advances in deep learning CNNs and the global availability of Sentinel-2 data provide a potential option to generate an extensive spatially enhanced historical Landsat archive. Conceivably, a relatively cloud free Landsat and Sentinel-2 image will be obtained within a suitable temporal window for most locations across the globe. Thus, a consistent image pair suitable for training a Landsat super-resolution transform may be obtained and could be locally optimized for this purpose following the approach applied in Latifovic et al. [26]. However, for large area implementation, CNN performance across a variety of landscapes needs to be evaluated in addition to temporal and spatial extension capacity. Therefore, specific objectives of this research were to:

• Assess the effectiveness of a shallow and deep CNN for super-resolution enhancement of Landsat trained from Sentinel-2 data for characteristic landscape environments in Canada including boreal forest, tundra, and cropland/woodland landscapes.

• Evaluate the potential for spatial extension over short distances of less than 100 km and temporal extension of a trained CNN model.

Landsat and Sentinel-2 Datasets
For model development, Landsat-5, Landsat-8, and Sentinel-2 pairs for three study areas in Canada were acquired. The study areas are shown in Figure 1 and included boreal forest, tundra, and cropland/woodland ecosystems. These represent a range of ecosystem conditions found in Canada. If the performance is acceptable across these three, it is likely that similar performance can be obtained across the range of ecosystems found in Canada in non-complex terrain. The date ranges of the training image pairs are given in Table 1. Landsat level 2 surface reflectance collection 1 was acquired from the USGS. Sentinel-2 level-1C data was also acquired from the USGS and converted to surface reflectance using the sen2cor algorithm version 2.3.1 (European Space Agency) [27]. Landsat-8 and Sentinel-2 have spatial misalignment that varies regionally depending on ground control point quality [28]. This has been improved for collection 1 data in global priority areas. More recent analysis shows that collection 1 Landsat-8 data within Canada (approximately lower than 70 degrees latitude) has a horizontal root mean square error (RMSE) of less than 14 m [29]. The geolocation quality of all input images was checked by collecting control points and computing the RMSE. For all scenes the RMSE was less than 10 m. The largest error was in the northern tundra study site and areas within or close to cloud cover. This error was considered reasonable given the expected operational geolocation accuracy and effective resolution of the spatial enhancement being tested. Landsat data were resampled to 10 m resolution using the nearest neighbor approach. This was selected to maintain spectral quality, allow the CNN to determine the optimal spatial weighting, and speed local resampling for application to large images. All Landsat and Sentinel-2 scenes were mosaicked and stacked together for analysis in each study site.
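As a simple illustration of nearest neighbor resampling from 30 m to 10 m, the sketch below replicates each coarse pixel into a 3 × 3 block. It assumes the 10 m grid nests exactly within the 30 m grid, which is a simplification; the function name and sample values are hypothetical.

```python
def nearest_neighbor_upsample(band, factor=3):
    """Replicate every pixel into a factor x factor block.

    With factor = 3 a 30 m band is placed on a 10 m grid without
    altering any reflectance value.
    """
    out = []
    for row in band:
        # Repeat each value along the row, then repeat the row itself.
        expanded = [value for value in row for _ in range(factor)]
        out.extend(list(expanded) for _ in range(factor))
    return out


# Hypothetical 2 x 2 Landsat band (values are placeholders).
landsat_band = [[1, 2], [3, 4]]
upsampled = nearest_neighbor_upsample(landsat_band)
```

Because no interpolation is applied, the original reflectance values are preserved, and any spatial weighting is left for the CNN to learn.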
As identified in Latifovic et al. [30] and specified in the USGS documentation [31], atmospheric correction can be problematic in the north above 64 degrees latitude. Our northern study area was at approximately 64 degrees. However, this was not considered to be a problem, as errors due to atmospheric correction would be similar between the datasets and would not affect the relative comparison of the methods.

Sampling and Assessment
For each study site a mask was manually developed for sampling to avoid clouds, shadows, and land cover changes between the image mosaic pairs. However, in cropland environments the 26-day difference between the Landsat-8 and Sentinel-2 images made it impractical to manually define areas suitable for training. Thus, for this study site an initial mask was developed and then refined by calculating the change vector between images [32] and selecting a conservative threshold to avoid including cropland change in the training. The local variance within sample windows of 33 by 33 pixels was computed and used to define three levels of low, moderate, and high spatial complexity. These ranged from homogeneous areas at the low level to areas containing significant structure related to roads, shorelines, or other boundaries at the high level. This was used in a stratified systematic sampling scheme to ensure a range of spatial variability was selected. For each stratum, every sixth pixel not contaminated by clouds or land cover change was selected. To assess performance, we compute the mean error, mean absolute error (MAE), error standard deviation, and mean and standard deviation of the spatial correlation within a sample window of 33 by 33 pixels between the predicted image and Sentinel-2. This window size was selected to be consistent with the CNNs used. We also compute the mean and standard deviation of the Structural Similarity Index Measure (SSIM) [33]. This was included as it is a common measure applied to assess image quality relative to a reference image. To provide context for the improvement obtained, we also compute these metrics directly between Landsat and Sentinel-2 without applying the CNN-based transform.
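The error and correlation measures for a single sample window can be sketched as follows. The sign convention (predicted minus reference), the use of the population standard deviation, and the function name are assumptions made for illustration; SSIM is omitted here and would normally come from an image processing library.

```python
import math


def window_metrics(predicted, reference):
    """ME, MAE, error standard deviation, and Pearson correlation for one window.

    Both inputs are flat lists of reflectance values in the same pixel order,
    e.g. the 33 x 33 window flattened to 1089 values.
    """
    n = len(predicted)
    errors = [p - r for p, r in zip(predicted, reference)]
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    stde = math.sqrt(sum((e - me) ** 2 for e in errors) / n)

    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    # Correlation is undefined for a constant window.
    if var_p > 0 and var_r > 0:
        cor = cov / math.sqrt(var_p * var_r)
    else:
        cor = float("nan")
    return {"ME": me, "MAE": mae, "STDE": stde, "COR": cor}


# Hypothetical four-pixel window.
predicted = [0.10, 0.20, 0.30, 0.40]
reference = [0.12, 0.18, 0.33, 0.37]
metrics = window_metrics(predicted, reference)
```

The mean and standard deviation of the per-window correlation over all sample windows would then give summary values of the kind reported for spatial correlation.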

Hold-Out and Spatial Extension
For sampling, the western 75% of the study area was used for training. The remaining 25% in the east was used for validation as a spatially independent extension test. Of the samples in the 75% training area, 30% were held out to assess the ideal situation where spatial extension is not required and high sampling rates are possible. For each study site this amounted to samples in the range of 400,000-500,000 for training and 180,000-240,000 for testing. Samples for spatial extension were more variable due to land cover change, clouds, and cloud shadows. Total samples were 64,000, 179,000, and 330,000 for the boreal, tundra, and cropland/woodland study sites, respectively.
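The partitioning described above can be sketched as a west-to-east split on column position followed by a hold-out draw within the training area. Random selection of the hold-out samples, the seed, and the function name are assumptions, as the text does not state how the 30% was chosen.

```python
import random


def split_samples(samples, n_columns, train_fraction=0.75,
                  holdout_fraction=0.30, seed=0):
    """Split (row, col) sample positions into train, hold-out, and extension sets.

    Samples west of the boundary column feed training and hold-out;
    samples east of it form the spatial extension test.
    """
    boundary = int(n_columns * train_fraction)
    west = [s for s in samples if s[1] < boundary]
    extension = [s for s in samples if s[1] >= boundary]

    rng = random.Random(seed)
    rng.shuffle(west)
    n_holdout = int(round(len(west) * holdout_fraction))
    holdout = west[:n_holdout]
    train = west[n_holdout:]
    return train, holdout, extension


# Hypothetical 60 x 120 pixel area sampled at every sixth pixel.
samples = [(r, c) for r in range(0, 60, 6) for c in range(0, 120, 6)]
train, holdout, extension = split_samples(samples, n_columns=120)
```

Keeping the extension samples in a separate eastern block, rather than interleaving them with training samples, is what makes the test spatially independent.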

Temporal Extension
For assessment of temporal performance, we apply the CNNs to Landsat-5 (Table 1) for different years for each study site. The least cloud contaminated image was selected for each of the periods 1984-1990, 1990-2005, and 2005-2011. We computed the same set of metrics between Landsat-5 and Sentinel-2 for areas identified as no change. No change was detected based on the maximum change vector across all years for a study site. Before detecting change, the Sentinel-2 bands were normalized to Landsat using robust regression [34]. We also applied a band average minimum correlation threshold of 0.55 between images for the window size of 33 by 33 pixels. The initial CNNs were trained between Landsat-8 and Sentinel-2. To adjust these for Landsat-5 we applied a transfer learning approach where samples of no-change were split for training and testing. Similar to the initial model development, we sampled 30% of the study area for training starting in the west of the image and the remaining 70% for validation in the east of the image. As the models had already been trained, only 3 epochs were used for rapid development. Only the most recent Landsat-5 image was used for training. The refined model was used in the assessment of the independent samples (70%) for evaluation of all image dates. The retraining was needed as image quality between the two sensors is different, with Landsat-8 being sharper. Total samples used for each study site ranged from 30,000-80,000 for training. For testing, the total samples were 40,000, 134,000, and 84,000 for the boreal, tundra, and cropland/woodland study sites, respectively.
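A minimal sketch of no-change detection from the maximum change vector is given below. The change magnitude is taken as the Euclidean distance between band vectors, and the threshold value, function names, and reflectance values are hypothetical; the normalization by robust regression and the correlation threshold are not shown.

```python
import math


def max_change_magnitude(reference_pixel, dated_pixels):
    """Largest Euclidean change vector magnitude between a reference pixel
    and the same pixel in each dated image."""
    return max(
        math.sqrt(sum((d - r) ** 2 for r, d in zip(reference_pixel, pixel)))
        for pixel in dated_pixels
    )


def no_change_mask(reference, dated_images, threshold):
    """True where the maximum change across all dates is at or below threshold."""
    mask = []
    for r, ref_row in enumerate(reference):
        mask_row = []
        for c, ref_pixel in enumerate(ref_row):
            series = [image[r][c] for image in dated_images]
            mask_row.append(max_change_magnitude(ref_pixel, series) <= threshold)
        mask.append(mask_row)
    return mask


# Hypothetical 1 x 2 images with (red, NIR, SWIR) reflectance per pixel.
reference = [[(0.05, 0.30, 0.20), (0.05, 0.30, 0.20)]]
dated_images = [
    [[(0.05, 0.30, 0.20), (0.08, 0.34, 0.20)]],
    [[(0.06, 0.30, 0.20), (0.05, 0.30, 0.20)]],
]
mask = no_change_mask(reference, dated_images, threshold=0.02)
```

Using the maximum over all dates means a pixel is retained only if it stayed below the threshold in every image, which is a conservative choice for assembling test samples.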

CNN Super-Resolution Models
There are countless configurations for network architectures that could be employed, and this will likely remain an area of significant future research. Although network design is important, for the purpose of this study we only tested two configurations. We tested the SRCNN of Dong et al. [10] because it is efficient with only 41,089 parameters and has shown good results (Figure 2A). We also apply a deeper architecture using residual learning, deep connectivity, and residual connections in an attempt to integrate some of the latest improvements in the field (DCR_SRCNN). In initial exploratory analysis we tested numerous configurations, of which the best was kept. We settled on the 20-layer configuration shown in Figure 2B, inspired by Tai et al. [21]. This is a large network with 993,373 total parameters. The rectified linear unit (ReLU) was used for all activations. To improve generalization, least squares (L2) weight regularization was added to the third and second last layers of the network with a weight of 0.0001. Regularization was only applied to the last layers to avoid reducing the learning potential in the lower network layers. Input image size was 33 by 33 pixels. This size was selected to capture the spatial variation in the image while keeping the size small for computational efficiency. Filter sizes were 3 by 3, except for the first convolution layer where a 7 by 7 filter was used. The number of output features from each convolution layer was 64, except for the first layer, which output 96. Also, the convolution layer for the residual learning objective output one result, which essentially converted the input three-band image to a single band.
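The parameter count of a convolution layer follows directly from the filter size and the number of input and output features. As an illustration, a three-layer 9-1-5 layout with 128 and 64 features, three input bands, and one output band reproduces the stated 41,089 parameters. That layout is an assumption inferred from the count, since the text does not list the SRCNN layer settings; the first-layer count for the deep network uses only values given above.

```python
def conv_params(kernel_size, in_features, out_features):
    """Weights plus one bias per output feature for a square convolution filter."""
    return kernel_size * kernel_size * in_features * out_features + out_features


# Assumed SRCNN layers as (kernel size, input features, output features).
# These settings are inferred from the reported total, not taken from the text.
srcnn_layers = [(9, 3, 128), (1, 128, 64), (5, 64, 1)]
srcnn_total = sum(conv_params(*layer) for layer in srcnn_layers)

# First layer of the deep network as described: 7 x 7 filter, 3 input bands,
# 96 output features.
deep_first_layer = conv_params(7, 3, 96)

print(srcnn_total, deep_first_layer)
```

The same function applied to each of the 20 layers would account for the much larger total of the deep network, where most layers are 3 by 3 filters over 64 features plus the additional inputs created by the connections between layers.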
We trained a model for each study site to allow for regional optimization. For training, the mean squared error loss function for the pixelwise comparison of the predicted and Sentinel-2 image was used with the Adam optimization method. This optimization method has been shown to provide an efficient and stable solution [35] and has been used in other CNN-based super-resolution studies [18,25]. An early stopping criterion was applied, where if the loss did not improve in 10 epochs, training stopped. The total number of epochs was set at 80 with a batch size of 125. For all networks the input was the red, near-infrared (NIR), and short-wave infrared (SWIR, 1.55-1.75 µm) bands. Output was a single band, either the red, NIR, or SWIR. All bands were input, as spatial properties between bands were expected to provide useful information for determining specific spatial transforms. To allow for the greatest possible learning potential, a model was developed for each band. To focus the learning on the spatial properties between the samples, the mean of the Sentinel-2 image was adjusted to match the Landsat image.
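Two of the training details above can be sketched in a few lines: adjusting the Sentinel-2 target so its mean matches the Landsat input, and the early stopping rule with a patience of 10 epochs and a maximum of 80 epochs. An additive shift is assumed for the mean adjustment, and the function names and loss sequence are hypothetical.

```python
def match_mean(target_patch, input_patch):
    """Shift target values so the target mean equals the input mean.

    An additive offset is assumed; both patches are flat lists of reflectance.
    """
    target_mean = sum(target_patch) / len(target_patch)
    input_mean = sum(input_patch) / len(input_patch)
    offset = input_mean - target_mean
    return [value + offset for value in target_patch]


def epochs_run(losses, patience=10, max_epochs=80):
    """Number of epochs completed before training stops.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs, or when `max_epochs` is reached.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(losses), max_epochs)


# Hypothetical loss history: improvement for two epochs, then a plateau.
losses = [1.0, 0.5] + [0.6] * 20
stopped_at = epochs_run(losses)

adjusted = match_mean([0.2, 0.4], [0.25, 0.45])
```

With the mean difference removed, the remaining error between prediction and target reflects spatial detail rather than a constant radiometric offset between the two sensors.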
All models were trained on an NVIDIA GeForce 1080 Ti GPU. Training took approximately 2 days for the deep network for each study site and less than half this time for the shallow network.

Hold-Out Accuracy
The results for the hold-out samples show that the DCR_SRCNN provided the best results across all study areas (Table 2). However, both methods showed marked improvement relative to applying no transformation for all the key metrics (MAE, SSIM, and spatial correlation). The MAE is an informative measure as it is in standard reflectance units, but it is related to the mean of the sample, with a larger mean reflectance resulting in larger MAE. Further, the MAE can produce the same value for very different image qualities [33]. Spatial correlation is not related to the mean reflectance and gives a good indication of the spatial agreement and thus the spatial enhancement. However, it is influenced by the data range, with a reduced range producing a lower correlation [36]. SSIM essentially incorporates the MAE and spatial correlation measures in addition to image contrast. It is related to the sample mean reflectance, but to a much lesser degree than the MAE. It is important to recognize these limitations in interpreting the results when comparing bands.
Of the bands, the NIR consistently had higher MAE and lower SSIM values regardless of the transformation. This is related to the high reflectance of the NIR band and associated larger variance. The SWIR also had high reflectance for the tundra study site, but in contrast had low MAE and higher SSIM values. This was related to the native 20 m spatial resolution of the SWIR band in Sentinel-2, which results in greater initial similarity with Landsat compared to the 10 m bands. The spatial correlation showed that the red band had consistently lower values, which was caused by the smaller reflectance range and atmospheric noise.
Of the study areas, the cropland/woodland showed the lowest performance due to change between the images despite efforts to reduce it. Change was also a potential factor in the boreal forest study site, but to a much lesser degree. The best results were found for the northern tundra study area, which was attributed to little change between images and less overall complexity of the land surface relative to the 10 m target resolution. Figure 3 provides an example image result of a residential area surrounded by mixed boreal forest conditions. It provides a good indication of the improvement that can be obtained. Figure 4 shows the enhancement by band for a mixed forest area with some industrial development. As evident from Figure 4, the NIR and red bands are more enhanced compared to the SWIR, as expected. The coarse texture within the cover types is of interest and could prove useful for improving land cover discrimination or for biophysical retrieval, as canopy variability or structure appears to be enhanced.
MEAN - observed sample mean, STD - observed sample standard deviation, ME - mean error, MAE - mean absolute error, STDE - error standard deviation, P5E - 5th percentile error, P95E - 95th percentile error, SSIMm - mean SSIM, SSIMs - standard deviation SSIM, CORm - mean spatial correlation, CORs - standard deviation of spatial correlation

Spatial Extension Accuracy
The spatial extension accuracy shows similar or slightly reduced performance relative to the hold-out (Table 3). Comparing the results for the two CNNs, it is not evident that the deeper model made a sufficiently large improvement to warrant its greater computational complexity. This is likely caused by some overtraining and errors, due to temporal change, in the training and validation data, which can reduce the sensitivity of the analysis.
The cropland study site showed the greatest difference relative to the hold-out results. The difference is in part related to the greater amount of agriculture in the extension sample. Due to the changes in crops for the 14 to 21-day difference between images, there were limited sampling opportunities and thus the extension results did not perform as well. Croplands present a particular challenge for the approach, as training data is limited by the highly dynamic nature of cropland environments with dramatic reflectance changes over a few days. With the Sentinel-2 constellation, more temporally coincident imagery will potentially be captured to ensure suitable training. Otherwise, spatial extension over greater distances may be required for large extent enhancements. The performance for bands suggests the same conclusion as the hold-out sample results.

Temporal Extension Accuracy
Temporal extension accuracy is an important aspect of the approach to determine if a trained network can be applied to enhance Landsat time series. Table 4 provides the temporal extension results. These at first glance appear to be low, particularly the spatial correlation, but assessment of temporal extension is fraught with difficulties. The main challenge is that no-change areas do not exist in terms of image reflectance. For the purposes of land cover, no-change can be identified, but in comparing imagery between dates there are always changes due to canopy dynamics, annual changes in canopy configuration, moisture content, residual atmosphere effects, etc. that do not change the land cover, but alter reflectance for a cover type. Thus, in interpreting these results it is important to note that the sensitivity and accuracy are influenced by this effect. Irrespective, in all cases the metrics were improved with either CNN.
The temporal effect is clearly seen in the boreal forest and cropland/woodland study sites, where the performance metrics all improve as the image date gets closer to the Sentinel-2 image used as a reference. The tundra study area is likely the most informative, as changes are subtler and less frequent and the vegetation structure is small relative to the image spatial resolution. Thus, these results are more indicative of the temporal extension capacity. The shallow network performed similarly to the deep network, but with metrics reduced by approximately 1% across the board. The small difference is in part a result of temporal changes in the test data, which reduce the sensitivity of the analysis. This is similar to the spatial extension results, but is expected to be more significant.

Visual Assessment
Visual assessment provides the more convincing evidence, as the nature of the enhancement can be clearly recognized and artifacts readily identified. Here, we provide several examples of enhanced images for the different landscape environments and over multiple years in Figures 5-8. In all examples, boundaries are clearer between cover types, linear features are more apparent, within-cover textures are enhanced, and the spatial structure overall is clearer. There are also no major artifacts created within images or between image dates, although some speckle is introduced in a few cases. In Figure 5, the spatial structure of forest gaps or leaf area gradients appears to be enhanced. This is most evident in the 2011 imagery, as fire damage has resulted in greater canopy variability. This could possibly lead to improvements in biophysical retrievals or habitat analysis, but requires further study. Figure 6 shows the northern tundra example, in which drainage patterns and water bodies are more clearly defined. Figure 7 shows an area of trails and roads that have become much more apparent in the broadleaf forest area. This highlights the potential of the approach to better characterize edges, which could enhance land cover-based landscape metrics. The final example, Figure 8, shows an area of cropland where the boundaries between crop areas and roads have been improved.

Discussion
In this research we show that CNN super-resolution can spatially enhance Landsat imagery and can be applied to historical peak growing season time series that could improve land cover and land cover change applications or possibly parameter retrievals. However, future research needs to specifically evaluate the improvement for a given application with this type of enhancement. This is important to determine if the approach is only suitable for visual enhancement or also for some types of quantitative analysis. Here, we show that boundaries between land covers and linear features are improved and likely would influence landscape metrics derived from land cover data. There are also textural enhancements that need to be explored as a means to improve information extraction applications.
The SSIM values obtained compare well with other studies, achieving values in the range of 0.86 to 0.97, similar to what is achieved in benchmark image databases for an upscaling factor of three [21]. However, in most studies, images are degraded from an initial high-resolution image and thus the only difference between the fine and coarse images used for training is resolution. This was not the case in this research, as there were several additional factors other than resolution, including changes or differences in land cover, canopy structure, phenology, moisture, residual atmosphere effects, sun-sensor geometry, sensor spectral response functions, and residual geolocation error. These factors need to be considered in examining the results, as they reduce the sensitivity of the analysis. More importantly, this can cause models to learn these differences, resulting in reduced spatial and temporal generalization performance. For remote sensing data, Collins et al. [25] report SSIM values greater than 0.98. This is the result of using coinstantaneous images, avoiding many of the factors listed above. It is also due to the upscaling factor of 2. In this study the upscaling factors were 3 for the red and near-infrared and 1.5 for the shortwave. The inferior result obtained here for the SWIR suggests that the difference is largely related to temporal variation. However, Collins et al. do not report band-specific values.
It is also of interest to compare the effects of using a more complex network. In Collins et al., going from a shallow network to a wider and deeper one improved the SSIM by 0.0035. In this research we also see only a small increase with the more complex deeper network, which improved the SSIM by about 0.006 on average for both the spatial and temporal extension. Thus, as with other SR research, finding the optimal balance between model complexity and performance will be an important aspect of future research. In this regard, the effective resolution of the spatial enhancement needs to be determined. That is, we do not propose that the CNN learns a true 10 m resolution result. Future efforts need to quantify the effective resolution to avoid storage and processing redundancy.
Other approaches to assess performance were considered, such as comparison with other high spatial resolution images that are more temporally coincident. However, finding such images is challenging for many regions, such as the north, where suitable cloud-free pairs are difficult to obtain within a few days of acquisition. This may be possible for the visible and NIR bands, but there are few higher spatial resolution sensors that capture the SWIR band for comparison and development with Landsat-5. SPOT-5 imagery is a suitable option, but it was not available for this analysis and establishing an extensive database would be costly. Despite this, the visual assessment shows that the trained networks were able to enhance the spatial properties of the images through time without introducing any strong artifacts. For large regional implementation, spatial extension over greater distances may be required. For distances less than 100 km, the CNNs appeared to generalize well. However, to effectively train for Landsat-5, larger distances may be required as suitable training locations would be limited to areas with little change or with suitable SPOT-5 images.
There are several mechanisms to further enhance the generalization of the CNNs that need to be explored in future research, such as optimizing network size, increasing weight regularization, batch normalization, and data augmentation. The deep network employed in this research was selected to provide an indication of the upper bound on the enhancement potential. The expectation was that a compromise between the deep and shallow network would be more effective for implementation. In this work we also included weight regularization for the last layers in the network, and thus better generalization may be obtained by applying regularization to additional layers and increasing its weight. In this research we did not use batch normalization as it was found to slightly reduce results in Liang et al. [37]. However, for spatial and temporal extension this may not be the case and requires further investigation. We also did not use data augmentation as our sample sizes were large, except for retraining of Landsat-5 for the identified no-change areas. In this case, data augmentation could provide an advantage. Data augmentation also provides an alternative training strategy, where more stringent criteria for selecting samples could be applied and data augmentation used to offset the reduced sample size. Improvements in the training data are expected to improve performance.
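As an illustration of how data augmentation could offset a reduced sample size, the sketch below generates the eight rotation and flip variants of a co-registered training pair. This was not part of the analysis reported here. The function name `dihedral_augment`, the choice of transforms, and the assumption that the input patch has already been resampled to the target grid are all hypothetical.

```python
import numpy as np


def dihedral_augment(input_patch, target_patch):
    """Return the 8 rotation/flip variants of a co-registered patch pair.

    input_patch:  array of shape (H, W, bands), e.g. red, NIR, SWIR.
    target_patch: array of shape (H, W), the single-band training target.
    The same geometric transform is applied to both members of each pair so
    that they remain aligned.
    """
    pairs = []
    for k in range(4):
        # Rotate by k * 90 degrees in the spatial plane only.
        rotated_input = np.rot90(input_patch, k, axes=(0, 1))
        rotated_target = np.rot90(target_patch, k, axes=(0, 1))
        pairs.append((rotated_input, rotated_target))
        # Add the left-right mirror image of each rotation.
        pairs.append((np.flip(rotated_input, axis=1),
                      np.flip(rotated_target, axis=1)))
    return pairs


# Synthetic demonstration data, not imagery from the study.
target = np.arange(16, dtype=np.float64).reshape(4, 4)
inputs = np.stack([target, 2.0 * target, 3.0 * target], axis=-1)
augmented = dihedral_augment(inputs, target)
print("number of augmented pairs:", len(augmented))
```

Each selected sample then yields eight training pairs, which is one way a stricter sample selection could be applied without shrinking the effective training set.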

Conclusions
In this research we tested a shallow and a deep CNN for super-resolution enhancement of the Landsat archive, trained from Sentinel-2 images. Results show improvement in the spatial properties of the enhanced imagery and good potential for spatial and temporal extension of the CNNs developed in all the study areas. The deep CNN showed better performance, but it is not clear whether it is worth the additional computational complexity and memory requirements. As research in CNN super-resolution for other applications has shown, it is possible to achieve similar performance with simpler configurations. Significant advancement of this approach is expected with progression in network design, training data sources, sampling strategies, and improved regularization. Despite this, the models developed here were effective at enhancing image spatial structure, which is expected to improve land cover and land cover change applications.

Figure 1. Study site locations and Landsat scene footprints.

was the red, near-infrared (NIR), and short-wave infrared (SWIR, 1.55-1.75 µm) bands. Output was a single band, either the red, NIR, or SWIR. All bands were input as spatial properties between bands were expected to provide useful information for determining specific spatial transforms. To allow for the greatest possible learning potential, a model was developed for each band. To focus the learning on the spatial properties between the samples, the mean of the Sentinel-2 image was adjusted to match the Landsat image. All models were trained on an NVIDIA GeForce 1080 Ti GPU. Training took approximately 2 days for the deep network for each study site and less than half this time for the shallow network.
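The mean adjustment mentioned above can be sketched as follows. The text does not specify whether the adjustment was additive or multiplicative, or whether it was computed per image or per sample, so the additive, whole-array form and the function name `match_mean` are assumptions made only for illustration.

```python
import numpy as np


def match_mean(target, reference):
    """Shift a target band so that its mean equals that of a reference band.

    Assumption: an additive offset computed over the whole array. The two
    arrays may have different shapes (e.g. 10 m and 30 m grids) because only
    their means are compared.
    """
    target = np.asarray(target, dtype=np.float64)
    offset = np.mean(reference) - target.mean()
    return target + offset


# Synthetic demonstration data, not imagery from the study.
rng = np.random.default_rng(0)
sentinel_band = 0.30 + 0.05 * rng.standard_normal((90, 90))  # finer grid
landsat_band = 0.25 + 0.05 * rng.standard_normal((30, 30))   # coarser grid
adjusted = match_mean(sentinel_band, landsat_band)
print("Landsat mean:", landsat_band.mean())
print("adjusted Sentinel-2 mean:", adjusted.mean())
```

An additive shift leaves the spatial variation of the target band unchanged, which is consistent with the stated aim of focusing the learning on spatial properties rather than on brightness differences between the sensors.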

Figure 2. Configuration of the CNNs tested. (a) The SRCNN of Dong et al. [10]. (b) The deep residual and connected CNN developed. Red lines are residual blocks and blue lines are connections between residual blocks. The black line is the connection for the residual learning objective, which is put through a single convolution layer to convert the input three-band image to a single band. The ⊕ symbol represents summation of the output activation layer elements.

Figure 3. Example results for a residential area surrounded by mixed boreal forest conditions: (a) Landsat image, (b) resolution enhanced, and (c) Sentinel-2 image. Displayed as red = NIR, green = SWIR, blue = red.

Figure 5. Example temporal extension result for a boreal mixedwood forest. (a-c) are the spatially enhanced results and (d-f) are the original Landsat images.

Figure 6. Example temporal extension result for a northern tundra area. (a-c) are the spatially enhanced results and (d-f) are the original Landsat images.

Figure 7. Example temporal extension result for a broadleaf forest area. (a-c) are the spatially enhanced results and (d-f) are the original Landsat images.

Figure 8. Example temporal extension result for a cropland area. (a-c) are the spatially enhanced results and (d-f) are the original Landsat images.

coincident. Results showed similar performance to other CNN-based super-resolution studies for the scaling ratio of 2.3 (56 m/24 m spatial resolution).

Table 1. Landsat and Sentinel-2 images used for training and testing.