Model Specialization for the Use of ESRGAN on Satellite and Airborne Imagery

Abstract: Training a deep learning model requires highly variable data to permit reasonable generalization. If the variability in the data about to be processed is low, the interest in obtaining this generalization seems limited. Yet, it could prove interesting to specialize the model with respect to a particular theme. The use of enhanced super-resolution generative adversarial networks (ESRGAN), a specific type of deep learning architecture, allows the spatial resolution of remote sensing images to be increased by "hallucinating" non-existent details. In this study, we show that ESRGAN create better quality images when trained on thematically classified images than when trained on a wide variety of examples. All things being equal, we further show that the algorithm performs better on some themes than it does on others. Texture analysis shows that these performances are correlated with the inverse difference moment and entropy of the images.


Introduction
Images of high (HR, ~1-5 m per pixel) and very high (VHR, <1 m per pixel) spatial resolution are of particular importance for several Earth observation (EO) applications, such as visual and automatic information extraction [1][2][3]. However, most high-resolution and all very high-resolution images acquired by orbital sensors currently need to be purchased at a high price. On the other hand, abundant medium-resolution imagery is available for free (e.g., from the multispectral instrument onboard Sentinel-2 and the operational land imager onboard Landsat-8). Improving the spatial resolution of medium-resolution imagery to that of high- and very high-resolution imagery would thus be highly useful in a variety of applications.
Image resolution enhancement is called super-resolution (SR) and is currently a very active research topic in EO image analysis [4][5][6] and computer vision in general, as shown in [7]. However, SR is inherently an ill-posed problem [8]. Multi-frame super-resolution (MFSR) uses multiple low-resolution (LR) images to constrain the reconstruction of a high-resolution (HR) image. However, this approach cannot be used when only a single image is available. Single image super-resolution (SISR) is a particular type of SR that involves increasing the resolution of a single LR image to create an HR image. SISR can be achieved by (1) the "external example-based" approach, where the algorithm learns from dictionaries [9], or by using (2) convolutional neural networks (CNNs), where the algorithm "learns" the relevant features of the image that would be useful for improving its resolution [10,11]. SISR can also be achieved by using (3) generative adversarial neural networks (GANs) [12]. GANs oppose two networks (a generator and a discriminator), one against the other, in an adversarial way. The generator is trained to produce new images that trick the discriminator, which in turn tries to distinguish real images from fake ones. In this type of network, the generator and the discriminator act as adversaries. GANs provide a powerful framework for generating real-looking images with high quality, as is the case in [13], through enhanced super-resolution generative adversarial networks (ESRGAN). This architecture is now used in different applications, such as satellite imagery [14,15] or the improvement of the predictive resolution of some models [16,17].
In general, training a neural network model requires separating the dataset into the following three parts: (1) a training dataset; (2) a validation dataset; and (3) a test dataset. The training dataset is used to adjust the model parameters and biases. The validation dataset is used to estimate the model's skill while tuning its hyperparameters, including the number of epochs, among others. As only the number of epochs will be relevant for the reader, its definition is given here. One "epoch" is defined as one forward pass and one backward pass through the entire training dataset. The test dataset is then used to verify the capability of the model to generalize its predictions based upon new data. If the model yields good predictions, then its ability to generalize is good. Otherwise, the model exhibits overfitting and is susceptible to overlearning, i.e., no further improvement in performance can be achieved and, indeed, further "tinkering" (experimentation or adjustment) may possibly result in its subsequent deterioration. Indeed, neural network models can be over-parameterized, yet can still correctly predict labels, even if these were assigned randomly [7][8][9]. Overfitting is a known problem in neural network models and deep learning in general, but different methods can allow us to avoid the problem, such as data augmentation by flipping or rotating the images [18,19], the use of a dropout layer that forces the model to work with part of its parameters turned "off" [20], or early stopping of the training phase [21,22]. Hence, the sample's variability in the training dataset is key to minimize the possibility of overfitting.
In the particular case of SISR, is it relevant to try to maximize the variety of examples if the model is to be applied to a specific theme or topic? In this study, we use the ESRGAN method to increase the spatial resolution of different types of imagery and address this question. ESRGAN were chosen because they outperform, in terms of peak signal-to-noise ratio (PSNR), other SISR methods, such as SRCNN, EDSR, RCAN, EnhanceNet or SRGAN [13], and are publicly available. We used airborne and satellite images to construct different datasets of different themes, as follows: (1) "daily life"; (2) agricultural; (3) forests; (4) urban areas; (5) rocky outcrops; (6) the planet Mars; and (7) a mixture of different themes. These groups of data have been used for training with different numbers of epochs (150, 300, 600, 1200, 2400, 4800). We demonstrate that training a model for a specific task is more beneficial than maximizing the variability in the data during training. The results indicate that the number of epochs is not strictly correlated with the PSNR value; rather, it depends upon the topic being trained. Furthermore, this work highlights a correlation between the quality of the results and the textural homogeneity of the image, i.e., the inverse difference moment (IDM), together with entropy indices that are taken from the Haralick co-occurrence matrix [23].

ESRGAN Architecture
The ESRGAN architecture we used is inspired by the SRGAN, or super-resolution generative adversarial network [24]. The architecture is described in [13], and we invite the reader to refer to it for more details. The models are trained to reconstruct images for which the resolution has been degraded by a factor of 4 using the MATLAB bicubic kernel function. The use of other convolution methods to degrade the resolution is not recommended, given that they could generate artefacts. The code used here is that provided by the authors of [13] on their GitHub (https://github.com/xinntao/ESRGAN) (accessed on 14 July 2021).

Datasets
Images that were used in this study originated from different sources. Each theme contains 650 images. The choice of themes was guided by the following two main criteria: (1) constructing themes that presented a large variability in environments (forests, regolith, and urban areas, among others); and (2) data availability. Table 1 lists the datasets used in this study, and the following paragraphs describe the different themes.
"Daily life" (Figure 1). The DIVerse 2K (DIV2K) resolution high-quality images dataset is commonly used in the literature to train and then to measure the performance of super-resolution algorithms [25]. The images are common scenes from daily life. This dataset has no spatial resolution associated with the pixel size of its images.
"Airborne imagery" (Figure 2). These images were acquired at 20 cm spatial resolution and have been kindly provided by XEOS Imaging Inc. (Quebec, QC, Canada). They cover different areas, randomly chosen, of the province of Quebec (Canada), between 70°29'02" W and 71°53'05" W, and between 40°12'31" N and 48°58'26" N. These images were visually classified into categories according to land use (agricultural, forest and urban).
"WorldView satellite imagery" (Figure 3). The images were acquired by the sensors onboard the WorldView-2 (WV-2) and WorldView-3 (WV-3) satellites, which have a spatial resolution of 2 m. The images cover two geographical areas of Axel Heiberg Island (Nunavut) in the Canadian High Arctic. They include mostly regolith, rocks, and glaciers. To be compared with the other themes, only the RGB channels were kept, then converted to 8-bit images.
"HiRISE satellite imagery" (Figure 4). Forty-nine images that were acquired by the HiRISE (high-resolution imaging experiment) instrument onboard the Mars Reconnaissance Orbiter have been downloaded from the University of Arizona (Tucson, AZ, USA) website (https://hirise.lpl.arizona.edu) (accessed on 15 April 2021). The images cover a wide range of geomorphological variability that is found on Mars (dunes, craters and canyons, among others). The spatial resolution of these images ranges between approximately 25 cm and 50 cm depending on the orbiter's altitude.
The "mixed" theme is an equiproportional mixture of images that have been randomly selected from each of the other six themes.
All of the images of the various themes have undergone bicubic convolution (by a factor of 4) to degrade their resolution. The data were separated into a training set (520 images), a validation set (65 images), and a test set (65 images). The different models were trained to reconstruct the images at their original resolution for 6 different epoch numbers (150, 300, 600, 1200, 2400, 4800), i.e., 42 models in total. Training took 19 days on an NVIDIA Quadro RTX 4000 graphics card. For each reconstruction, the peak signal-to-noise ratio (PSNR) was calculated to evaluate its quality. The model that was trained for 4800 epochs on the greatest variety of images (the "mixed" theme) was then applied to each of the themes to test whether (a) greater benefit was obtained by training a model on a wide array of examples or (b) specializing on a single theme was a better option. The workflow that is depicted in Figure 5 summarizes the entire methodological approach.
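The data partition and the experimental grid described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_theme`, the random shuffling, the fixed seed and the theme labels are assumptions made for the example, since the text does not state how images were assigned to each subset.

```python
import random


def split_theme(image_paths, n_train=520, n_val=65, n_test=65, seed=0):
    """Shuffle one theme's images and cut them into train/val/test lists.

    Assumption: a seeded random shuffle; the study does not describe
    how the 650 images of a theme were assigned to each subset.
    """
    if len(image_paths) != n_train + n_val + n_test:
        raise ValueError("expected exactly n_train + n_val + n_test images")
    shuffled = list(image_paths)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical labels for the seven themes and the six epoch counts.
THEMES = ["daily life", "agricultural", "forest", "urban",
          "rocky outcrops", "Mars", "mixed"]
EPOCHS = [150, 300, 600, 1200, 2400, 4800]

# One model per (theme, epoch count) pair: 7 * 6 = 42 training runs.
RUNS = [(theme, n_epochs) for theme in THEMES for n_epochs in EPOCHS]
```

Keeping the three subsets disjoint matters here because the PSNR values reported later are computed on the 65 test images only.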

Results
The results are presented in three sections. The first section provides examples of image resolution improvements. The second section presents the average PSNR values that were obtained for each of the themes. The third section provides several textural indices that highlight correlations between the texture of the images and the quality of the reconstruction.

Examples of Upscaling Results
Two themes were selected to illustrate our work. All the images were visually displayed with the same "minimum-maximum" histogram stretch available in the ArcMap software. This manner of proceeding could show differences in coloration, due to differences in the pixel values recovered by the model. Figure 6 displays the results that were obtained for 150 epochs and 4800 epochs on a Martian talweg, the line of lowest elevation in a valley.

PSNR Obtained for Each Model
The PSNR was calculated for the 65 images of the test set, for each theme and for each number of epochs (150, 300, 600, 1200, 2400, 4800). The standard deviation was also calculated to characterize the dispersion of the quality of the results that were obtained. Each theme was also reconstructed with the model that was trained with 4800 epochs on the "mixed" theme, which is an equiproportional mixture of images from the other six themes (Figure 8). This highlights the influence of variability in the examples on the final quality of the reconstructed images.
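For readers unfamiliar with the metric, the following sketch shows one common way to compute the PSNR of a reconstruction and to summarize it over a test set, using PSNR = 10 log10(MAX^2 / MSE). It is a minimal illustration under stated assumptions: 8-bit images (MAX = 255), computation over all channels without border cropping, and the sample standard deviation. The text does not specify which PSNR variant or which standard deviation estimator was used, and the function names are hypothetical.

```python
import numpy as np


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    if ref.shape != rec.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def summarize(psnr_values):
    """Mean and sample standard deviation (ddof=1, an assumption)."""
    values = np.asarray(psnr_values, dtype=np.float64)
    return float(values.mean()), float(values.std(ddof=1))
```

In this setting, `psnr` would be called once per test image (original versus reconstruction), and `summarize` once per theme and epoch count on the resulting 65 values.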

Texture Indices
Since two themes have significantly higher PSNRs than the other five, an image texture study was undertaken to try to understand the underlying phenomenon. Four Haralick texture indices, from the gray level co-occurrence matrix (GLCM), were selected to determine whether there was a correlation between the ability of the model to reconstruct the HR images and the intrinsic characteristics of the images' textures.
Figure 8. Peak signal-to-noise ratios (PSNRs), with their associated standard deviations, for the different themes. For the highest number of epochs, the "mixed" model was also used to test whether increasing the variability in the samples was relevant.

Figure 9 illustrates the values that were obtained for these four indices for 20 images that were randomly selected in each theme, except for the "mixed" scenario, which would not have added any new information. In each panel, the index values of the original images (y-axis) are plotted against values that were calculated for the degraded images (x-axis).
Figure 9. Values that were obtained for four indices that were derived from the Haralick co-occurrence matrix for downscaled images and the original images.

Discussion
This section is organized into three parts. First, we acknowledge that the use of ESRGAN for the reconstruction of images (downscaled by a factor of four) does not correspond to a real-world application. We then discuss the PSNR values that were obtained for the different themes and epoch numbers to understand how variability among examples affects the quality of the results. Finally, we show that image texture indices are correlated with the ability of ESRGAN to improve image resolution.

Image Resolution Improvement with ESRGAN
As explained in the Methods, the ESRGAN model is trained by "teaching" it to reconstruct an image, the resolution of which has been degraded. This allows the analyst to quantitatively evaluate the quality of the results by comparing the reconstructed version of the image to the original version. However, this approach does not allow the analyst to judge the real capacity of the model to create a new image that has not been degraded by bicubic convolution beforehand. Recent research has tried to overcome this difficulty [8,26,27]. However, these new architectures are beyond the scope of our study and should be the subject of further work.

Interest in the Specialization of Examples in Learning
Peak signal-to-noise ratio (PSNR) is a measurement that is frequently used in super-resolution to express the quality of the image reconstruction. In the study that is presented here, PSNR provides a good idea of the quality of the results. Figure 8 depicts the variability in the quality of reconstructions for high-resolution images; for example, the "Mars" theme attains a maximum PSNR of 39.10 dB for 4800 epochs, while the "forest" theme reaches 30.11 dB for an equivalent number of epochs. As a function of the number of epochs, the PSNR shows that the learning capacity of the model is not equivalent among the different themes. To illustrate the progression in learning as the number of epochs increases, we averaged the PSNRs that were obtained at 150 and 300 epochs. The same operation was performed for 2400 and 4800 epochs. This averaging mitigates the noise surrounding the measurement when moving from one group of epochs to the next. The improvement between the two values is expressed as a percentage of the first value. Table 2 summarizes these results. Table 2 shows that increasing the number of epochs, even by a factor of 32 (=4800/150), offers only marginal improvements in PSNR (<1%) for three of the seven themes that are treated here (i.e., the agricultural, urban and forest themes). In contrast, the "Mars" and "mixed" themes showed strong improvements, reaching 15.72% and 6.86%, respectively. The rock outcrop and "DIV2K" dataset themes remained below 3%.
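The averaging procedure described above amounts to a short calculation, sketched here in Python. The function name and the dictionary layout (mean PSNR keyed by epoch count) are assumptions for illustration, and the numbers in the usage note are invented for the example rather than taken from Table 2.

```python
def relative_improvement(psnr_by_epoch):
    """Percent gain from the early (150, 300) to the late (2400, 4800) epochs.

    `psnr_by_epoch` maps an epoch count to the mean test-set PSNR in dB
    for one theme (hypothetical layout).
    """
    early = (psnr_by_epoch[150] + psnr_by_epoch[300]) / 2.0
    late = (psnr_by_epoch[2400] + psnr_by_epoch[4800]) / 2.0
    # Improvement expressed as a percentage of the early average.
    return 100.0 * (late - early) / early
```

For instance, a theme with made-up means of 30.0 dB at both 150 and 300 epochs and 33.0 dB at both 2400 and 4800 epochs would yield an improvement of 10%.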
The final PSNRs that were obtained for 2400 and 4800 epochs group the different themes in a similar manner. The "Mars", "outcrop" and "mixed" themes had good PSNRs, while the "agriculture", "urban" and "forest" themes did not reach a value of 31 dB. The "DIV2K" dataset has a PSNR that is intermediate between the two aforementioned groups.
Interestingly, the use of the widest variety of examples does not generally lead to better results. With the exception of the Axel Heiberg Island tests, the PSNRs that were obtained at 4800 epochs are similar to, or lower than, the values that were obtained on a dedicated training set. The exception of the rocky outcrops, however, should not be taken as significant, since the improvement is only 1.12% of the value obtained for 4800 epochs (0.41, in terms of the absolute value). This is all the more negligible, since it is the theme that offers the greatest standard deviation, with 4.38 or 11.8% of the mean value.

Texture Indices and Reconstruction of HR Images
All the texture indices that are presented in Figure 9 show that HiRISE and WorldView data are comparable, as they plot in similar regions of the graphs. Indeed, these datasets have systematically obtained close values; in the cases of entropy and the inverse difference moment, they can be clearly distinguished from the other themes. Entropy and the inverse difference moment, therefore, would appear to be suitable textural indices for explaining the ability of ESRGAN to best reconstruct HR images. Figure 10 shows the PSNR values as a function of these indices. The inverse difference moment measures the local homogeneity of the image. The greater the value, the greater the homogeneity is. Entropy measures the degree of disorder in the image. The lower the value, the greater the order in the texture is. Thus, it is not surprising that the best performances of ESRGAN are obtained for images with high homogeneity and low entropy; the reconstruction of the image at its original resolution is more predictable. The model is less perturbed by statistical variations, the randomness of which would prevent prediction. This explains why the best PSNRs were obtained for two similar themes, i.e., Martian and Arctic regolith, which are indeed much more homogeneous than other themes, despite having unfavourable signal-to-noise ratios.
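To make the two indices concrete, the sketch below computes a gray level co-occurrence matrix and derives the inverse difference moment and the entropy from it, following the standard Haralick definitions. Several choices are assumptions, since the text does not report them: quantization of 8-bit gray values to 8 levels, a single horizontal offset of one pixel, a symmetric matrix, and a base-2 logarithm for the entropy. All function names are illustrative.

```python
import numpy as np


def glcm(gray, levels=8, dx=1, dy=0):
    """Normalized, symmetric co-occurrence matrix for one offset.

    Assumes a 2-D array of 8-bit gray values and non-negative dx, dy.
    """
    img = np.asarray(gray, dtype=np.float64)
    # Quantize values in [0, 255] into `levels` bins.
    q = np.minimum((img / 256.0 * levels).astype(np.int64), levels - 1)
    h, w = q.shape
    a = q[0:h - dy, 0:w - dx]   # reference pixels
    b = q[dy:h, dx:w]           # neighbours at offset (dy, dx)
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (a.ravel(), b.ravel()), 1.0)
    counts = counts + counts.T  # count each pair in both directions
    return counts / counts.sum()


def inverse_difference_moment(p):
    """Sum of p(i, j) / (1 + (i - j)^2); equals 1 for a uniform image."""
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + (i - j) ** 2)))


def entropy(p):
    """-sum p log2 p over non-zero entries; equals 0 for a uniform image."""
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))
```

A perfectly uniform image gives an inverse difference moment of 1 and an entropy of 0, whereas a noisy image gives a lower inverse difference moment and a higher entropy, which is the direction of the relationship discussed above.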

Conclusions
For this study, ESRGAN were used to increase the spatial resolution of the following different themes: (1) "daily life"; (2) agricultural; (3) forests; (4) urban areas; (5) rocky outcrops; (6) the planet Mars; and (7) a mixture of different themes. Our aim was to verify whether it is advantageous to maximize the variability in the examples during the training phase, or if it is preferable to provide a specialized model. Moreover, training was performed for six different levels of epochs (150, 300, 600, 1200, 2400, and 4800) to validate whether it is judicious to always maximize the learning time. Finally, texture indices were used to explain the variability in the quality of the results that were obtained. The conclusions of this work are as follows:
• It is more beneficial to create a specialized ESRGAN model for a specific task, rather than trying to maximize the variability in examples.
• The ability to learn depends upon the subject matter. No recommendations can be made a priori.
• ESRGAN perform better on images with a high inverse difference moment and low entropy indices.