High spatial and temporal resolution earth observation (EO) images are desirable for many remote sensing applications, providing a finer depiction of spatial boundaries or timing of environmental change. Landsat provides the longest record of moderate spatial resolution (30 m) data of the earth from 1984 to present. It is currently a fundamental data source for understanding historical change and its relation to carbon dynamics, hydrology, climate, air quality, biodiversity, wildlife demography, etc. Landsat temporal coverage is sparse due to the 16-day repeat visit and cloud contamination. Several studies have addressed this through time series modeling approaches [1
]. Temporal enhancement is a key requirement, but spatial enhancement is another aspect of Landsat that could be improved for time series applications. Enhancement of spatial resolution has been carried out mostly based on data fusion methods [4
]. Studies have also shown that data fusion can lead to improvements in quantitative remote sensing applications such as land cover [4
]. Although effective, data fusion techniques are limited by the requirement for coinstantaneous high-resolution observations. For more recent sensors such as Landsat-8 and Sentinel-2, this requirement is met with the panchromatic band, which provides the greatest potential for spatial enhancement. However, for a consistent Landsat time series from 1985 to present, a method that provides the same level of enhancement across sensors is needed. For Landsat-5, suitable high-resolution sources are generally too sparse in space or time to support generation of an extensive spatially enhanced Landsat archive.
Numerous spatial resolution enhancement methods have been developed. Recently, however, deep learning convolution neural networks (CNNs) have been shown to outperform these, with large improvements over bicubic interpolation and smaller gains over more advanced anchored neighborhood regression approaches [10
]. CNNs are a special form of neural network. The basic neural network is made up of a collection of connected neurons with learnable weights and biases that are optimized through error backpropagation [11
]. The input to a basic neural network is a vector, whereas the input to a convolution neural network is an array or image. For each convolution layer, a set of weights is learned for a filter of size m × n × c that is convolved over the image, where m and n are the vertical and horizontal dimensions and c is the number of input features (channels) to the convolution layer. Essentially, a convolution neural network can learn the optimal set of filters to apply to an image for a specific image recognition task. Thus, one strategy has been to use CNNs as feature extractors in remote sensing classification applications [12].
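As a minimal illustration of the filter operation described above (a NumPy sketch rather than a framework implementation; the array shapes and example filter are hypothetical), a convolution layer slides each m × n × c filter over the image and produces one feature map per filter:

```python
import numpy as np

def conv2d(image, filters):
    """Valid-mode 2D convolution.

    image:   H x W x c input array
    filters: k filters of size m x n x c, stacked as k x m x n x c
    returns: (H-m+1) x (W-n+1) x k feature maps
    """
    H, W, c = image.shape
    k, m, n, _ = filters.shape
    out = np.zeros((H - m + 1, W - n + 1, k))
    for i in range(H - m + 1):
        for j in range(W - n + 1):
            patch = image[i:i + m, j:j + n, :]  # m x n x c window
            # contract the m, n, c axes: one output value per filter
            out[i, j, :] = np.tensordot(filters, patch, axes=3)
    return out

# Example: a 6x6 single-band image with one 3x3 averaging filter
img = np.ones((6, 6, 1))
avg = np.full((1, 3, 3, 1), 1.0 / 9.0)
feat = conv2d(img, avg)
print(feat.shape)  # (4, 4, 1)
```

In a trained CNN the filter weights are learned by backpropagation rather than fixed as here.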
There has been significant development of CNNs for super-resolution enhancement with non-remote sensing image benchmark databases such as CIFAR-100 [13
] or ImageNet [14
]. Dong et al. [10
] developed the Super-Resolution Convolutional Neural Network (SRCNN), which used small 2- and 4-layer CNNs to show that the learned model performed better than other state-of-the-art methods. Kim et al. [15
] developed two deep convolutional networks for super-resolution enhancement. The first was the Deeply-Recursive Convolutional Network for Image Super Resolution (DRCN), which used recursive or shared weights to reduce model parameters in a deep 20-layer network. The second was also a deep 20-layer network (Very Deep Super Resolution, VDSR), but introduced the concept of the residual learning objective. In this approach, instead of learning the fine resolution image, the differences between the fine and coarse resolution images are learned. This led to significant performance gains over SRCNN. The mean squared error loss is widely used for CNN super-resolution training. An interesting alternative was tested by Svoboda et al. [17
] who used a gradient-based learning objective, minimizing the mean squared error between spatial image gradients computed using the Sobel operator. Performance by standard measures, however, was not improved. Mao et al. [18
] developed a deep encoder-decoder CNN with skip connections between associated encode and decode layers. It achieved improved accuracy relative to SRCNN for both the 20- and 30-layer versions. An ensemble-based approach was tested in Wang et al. [19
] and was found to provide an improvement in accuracy. Other methods have focused on maintaining or improving accuracy while reducing the total model parameters. The Efficient Sub-Pixel Convolutional Neural Network (ESPCN) reduces computational and memory complexity by increasing the resolution from low to high only at the end of the network [20
]. The DRCN approach [15
] was extended to include residual and dense connections by Tai et al. [21
]. This provided a deep network with recursive layers reducing the model parameters and achieving the best results for the assessment undertaken.
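The residual learning objective introduced with VDSR can be made concrete with a small NumPy sketch (the arrays below are hypothetical stand-ins for real image patches): the network is trained to predict the difference between the fine image and an upsampled coarse image, and the predicted residual is added back to the input at inference.

```python
import numpy as np

# Hypothetical training pair (values are arbitrary, for illustration only)
coarse_upsampled = np.array([[10.0, 12.0], [14.0, 16.0]])  # e.g., bicubic-upsampled coarse patch
fine = np.array([[11.0, 12.5], [13.5, 17.0]])              # e.g., co-located fine patch

# Residual learning: the training target is the difference, not the fine image itself
residual_target = fine - coarse_upsampled

# At inference, the model's residual prediction is added back to the coarse input.
# Here a perfect prediction stands in for the network output.
predicted_residual = residual_target
reconstruction = coarse_upsampled + predicted_residual

# Mean squared error on the residual, the loss widely used for SR training
mse = np.mean((predicted_residual - residual_target) ** 2)
print(reconstruction)
print(mse)  # 0.0 for the perfect prediction above
```

Learning the (mostly small, zero-centered) residual rather than the full fine image is what gave VDSR its reported gains over SRCNN.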
Residual connections in CNNs were introduced by He et al. [22
] for image object recognition. Residual connections force the next layer in the network to learn something different from the previous layers and have been shown to alleviate the problem of deep learning models not improving performance with depth. In addition to going deep, Zagoruyko and Komodakis [23
] showed that going wide can increase network performance for image recognition. More recently, Xie et al. [24
] developed wide residual blocks, which add another dimension, referred to as cardinality, in addition to network depth and width. The rate of new developments in network architectures is rapid, with incremental improvements in accuracy or reductions in model complexity and memory requirements.
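A residual connection in the sense of He et al. can be sketched as follows (a NumPy illustration; `transform` is a hypothetical stand-in for a learned convolutional mapping F):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, transform):
    """y = ReLU(x + F(x)): the block only has to learn the change F(x),
    while the identity path carries the input (and gradients) through unchanged."""
    return relu(x + transform(x))

# Hypothetical "learned" transform: a fixed scaling, for illustration only
f = lambda x: 0.1 * x
x = np.array([1.0, -2.0, 3.0])
print(residual_block(x, f))  # [1.1, 0.0, 3.3]
```

If F(x) is zero, the block reduces to the identity, which is why stacking many such blocks does not degrade performance the way plain deep stacks can.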
For spatial enhancement of remote sensing imagery, much less research has been carried out regarding the potential of CNNs. Only recently have results been presented by Collins et al. [25
] who applied networks similar to Dong et al. [10
] for enhancement of the Advanced Wide Field Sensor (AWiFS) using the Linear Imaging Self Scanner (LISS-III). Their study provides a good benchmark for CNN performance because the two sensors have the same spectral bands and are temporally coincident. Results showed similar performance to other CNN based super-resolution studies for the scaling ratio of 2.3 (56 m/24 m spatial resolution).
Advances in deep learning CNNs and the global availability of Sentinel-2 data provide a potential option to generate an extensive spatially enhanced historical Landsat archive. Conceivably, a relatively cloud-free Landsat and Sentinel-2 image pair can be obtained within a suitable temporal window for most locations across the globe. Thus, a consistent image pair suitable for training a Landsat super-resolution transform may be obtained and could be locally optimized for this purpose following the approach applied in Latifovic et al. [26
]. However, for large area implementation, CNN performance across a variety of landscapes needs to be evaluated in addition to temporal and spatial extension capacity. Therefore, specific objectives of this research were to:
Assess the effectiveness of a shallow and a deep CNN for super-resolution enhancement of Landsat trained from Sentinel-2 data for characteristic landscape environments in Canada, including boreal forest, tundra, and cropland/woodland landscapes.
Evaluate the potential for spatial extension over short distances of less than 100 km and temporal extension of a trained CNN model.
In this research we show that CNN super-resolution can spatially enhance Landsat imagery and can be applied to historical peak growing season time series that could improve land cover and land cover change applications or possibly biophysical parameter retrievals. However, future research needs to specifically evaluate the improvement for a given application with this type of enhancement. This is important to determine if the approach is only suitable for visual enhancement or some types of quantitative analysis. Here, we show that boundaries between land covers and linear features are improved and likely would influence landscape metrics derived from land cover data. There are also textural enhancements that need to be explored as a means to improve information extraction applications.
The SSIM values obtained compare well with other studies, achieving values in the range of 0.86 to 0.97, similar to what is achieved on benchmark image databases for an upscaling factor of three [21
]. However, in most studies, images are degraded from an initial high-resolution image, and thus the only difference between the fine and coarse images used for training is resolution. This was not the case in this research, as there were several additional factors beyond resolution, including changes or differences in land cover, canopy structure, phenology, moisture, residual atmospheric effects, sun-sensor geometry, sensor spectral response functions, and residual geolocation error. These factors need to be considered in examining the results, as they reduce the sensitivity of the analysis. More importantly, they can cause models to learn these differences, resulting in reduced spatial and temporal generalization performance. For remote sensing data, Collins et al. [25
] report SSIM values greater than 0.98. This is the result of using coinstantaneous images, which avoids many of the factors listed above. It is also due to the upscaling factor of 2. In this study, the upscaling factors were 3 for the red and near-infrared bands and 1.5 for the shortwave infrared. The inferior result obtained here for the SWIR suggests that the difference is largely related to temporal variation. However, Collins et al. do not report band-specific values.
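For reference, the SSIM index underlying these comparisons can be sketched in NumPy as a simplified single-window version (published results typically use a local sliding-window implementation with Gaussian weighting; constants follow the standard k1 = 0.01, k2 = 0.03 convention):

```python
import numpy as np

def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified single-window SSIM computed over whole images x and y."""
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

a = np.random.rand(32, 32)
print(ssim_global(a, a))        # ~1.0 for identical images
print(ssim_global(a, 1.0 - a))  # much lower for structurally dissimilar images
```

Because the index combines mean, variance, and covariance terms, the temporal and radiometric differences between Landsat and Sentinel-2 pairs listed above depress SSIM even when spatial detail is well reproduced.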
It is also of interest to compare the effects of using a more complex network. In Collins et al., going from a shallow to a wider and deeper network improved the SSIM by 0.0035. In this research we also see only a small increase with the more complex deeper network, which improved results by about 0.006 on average for both the spatial and temporal extension. Thus, as with other SR research, finding the optimal balance between model complexity and performance will be an important aspect of future research. In this regard, the effective resolution of the spatial enhancement needs to be determined. That is, we do not propose that the CNN learns a true 10 m resolution result. Future efforts need to quantify the effective resolution to avoid storage and processing redundancy.
Other approaches to assess performance were considered, such as comparison with other, more temporally coincident, high-spatial-resolution images. However, finding such images is challenging for many regions, such as the north, where suitable cloud-free pairs are difficult to obtain within a few days of acquisition. This may be possible for the visible and NIR bands, but there are few higher spatial resolution sensors that capture the SWIR band for comparison and development with Landsat-5. SPOT-5 imagery is a suitable option, but it was not available for this analysis, and establishing an extensive database would be costly. Despite this, the visual assessment shows that the trained networks were able to enhance the spatial properties of the images through time without introducing any strong artifacts.
For large regional implementation, spatial extension over greater distances may be required. For distances less than 100 km, the CNNs appeared to generalize well. However, to effectively train Landsat-5, larger distances may be required, as suitable training locations would be limited to areas with little change or with suitable SPOT-5 images. There are several mechanisms to further enhance the generalization of the CNNs that need to be explored in future research, such as optimizing network size, increasing weight regularization, batch normalization, and data augmentation. The deep network employed in this research was selected to provide an indication of the upper bound on the enhancement potential. The expectation was that a compromise between the deep and shallow network would be more effective for implementation. In this work we also included weight regularization for the last layers in the network, and thus better generalization may be obtained by applying regularization to additional layers and increasing the weight. In this research we did not use batch normalization, as it was found to slightly reduce results in Liang et al. [37
]. However, for spatial and temporal extension this may not be the case, and it requires further investigation. We also did not use data augmentation, as our sample sizes were large, except for retraining of Landsat-5 for the identified no-change areas. In this case, data augmentation could provide an advantage. Data augmentation also provides an alternative training strategy, where more stringent criteria for selecting samples could be applied and data augmentation used to offset the reduced sample size. Improvements in the training data are expected to improve performance.
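Because the coarse input and fine target patches must stay co-registered, any augmentation has to apply the identical transform to both members of a pair. A minimal sketch of flip/rotation augmentation for paired patches (the patch arrays and sizes are hypothetical):

```python
import numpy as np

def augment_pair(low, high, rng):
    """Apply the same random rotation and flip to a (low, high) patch pair."""
    k = int(rng.integers(0, 4))      # number of 90-degree rotations (0-3)
    low, high = np.rot90(low, k), np.rot90(high, k)
    if rng.random() < 0.5:           # random horizontal flip, same for both
        low, high = np.fliplr(low), np.fliplr(high)
    return low, high

rng = np.random.default_rng(0)
low = np.arange(16.0).reshape(4, 4)   # stand-in coarse patch
high = np.arange(64.0).reshape(8, 8)  # stand-in fine patch (same footprint)
low_a, high_a = augment_pair(low, high, rng)
print(low_a.shape, high_a.shape)  # shapes preserved: (4, 4) (8, 8)
```

Each of the eight resulting orientations is an equally valid training sample, which is how augmentation could offset the smaller sample sizes in the Landsat-5 no-change areas.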