TE-SAGAN: An Improved Generative Adversarial Network for Remote Sensing Super-Resolution Images

Abstract: Resolution is a comprehensive reflection of, and an evaluation index for, the visual quality of remote sensing images. Super-resolution processing has been widely applied for extracting information from remote sensing images. Recently, deep learning methods have found increasing application in the super-resolution processing of remote sensing images; however, issues such as blurry object edges and residual artifacts persist. To overcome these issues, this study proposes an improved generative adversarial network with self-attention and texture enhancement (TE-SAGAN) for remote sensing super-resolution images. We first designed an improved generator based on the residual dense block with a self-attention mechanism and weight normalization. The generator gains feature-extraction capability and improves training stability, which in turn improves edge contours and textures. Subsequently, a joint loss, combining L1-norm, perceptual, and texture losses, is designed to optimize the training process and remove artifacts. The L1-norm loss ensures the consistency of low-frequency pixels; the perceptual loss entrenches medium- and high-frequency details; and the texture loss provides the local features for the super-resolution process. The results of experiments using a publicly available dataset (the UC Merced Land Use dataset) and our own dataset show that the proposed TE-SAGAN yields clear edges and textures in the super-resolution reconstruction of remote sensing images.


Introduction
The development of optical remote sensing technology has enabled researchers to easily obtain large-area, multi-temporal Earth observation images for social construction. Nevertheless, optical satellite image quality is limited by sensor imaging technology, which restricts the quality and application of remote sensing images. Low-resolution remote sensing images hinder ground-object interpretation. Improving the spatial resolution of remote sensing images yields clearer object boundaries and contours, which provides better data for subsequent feature-extraction applications, such as the recognition of roads and buildings by semantic segmentation methods [1-3] or change detection by extracting object boundaries [4,5]. Therefore, super-resolution processing of remote sensing images has become an indispensable preprocessing step. Many researchers have utilized super-resolution techniques to effectively and accurately obtain images [6-10]. Super-resolution (SR) imaging is a technology that uses single or multiple low-resolution (LR) images to obtain high-resolution (HR) images with existing image acquisition devices. It has been widely applied since the 1980s, when Huang et al. [11] creatively proposed a method for reconstructing HR images based on the frequency domain of multi-frame sequential LR images. Since then, SR processing of remote sensing images has become an active research topic, and numerous pertinent algorithms have been designed. At present, single-image super-resolution (SISR) methods are divided into interpolation-, reconstruction-, and learning-based methods.
Interpolation-based methods estimate unknown pixel values with a designed linear function of the known pixels, such as nearest-neighbor interpolation [12], Lanczos filtering [13], or bicubic filtering [14]. Although these methods achieve rapid SR processing, filling pixels with low-frequency information leads to shortcomings in the overall semantic features. Consequently, super-resolution images may lose some details, or boundaries may not be clear. Reconstruction-based methods aim to establish an appropriate prior constraint or registration model between LR and HR images, such as projection onto convex sets [15-17], iterative back-projection [18], or the maximum a posteriori probability method [19-21]. The core objective of reconstruction-based methods is to build a reasonable prior observation model. However, SR imaging is, in mathematical terms, a highly underdetermined problem with infinitely many solutions. Even with optimal parameters, the model transfers poorly across diverse tasks, and it requires considerable resources for repeated and complicated calculations.
In recent years, with the development of machine learning frameworks, academic interest in learning-based methods has increased, particularly in deep learning driven by big data [22]. Convolutional neural networks are a representative adaptive learning method widely used for the reconstruction of low- and high-frequency image content, requiring less manual intervention than traditional methods. Researchers have thus extended convolutional neural networks to SR tasks [23-28]. However, both early shallow networks with 3-5 layers [29,30] and deeper super-resolution networks [31-33] tend to produce overly smooth, low-frequency results.
To address these issues, researchers have explored algorithms that yield results consistent with human observation and sought alternatives capable of recovering high-frequency content. Ledig et al. [34] proposed an improved generative adversarial network (GAN) model, named SRGAN, which benefits from content and perceptual losses and performs well. Building on SRGAN, ESRGAN introduced the Residual-in-Residual Dense Block (RRDB), which further improved the recovered image texture [35]. In recent years, many ESRGAN-based algorithms [36-38] have been proposed. ESRGAN+ [36] replaces ESRGAN's RRDB with the RRDRB; Real-ESRGAN replaces the VGG-type discriminator of the original ESRGAN with a U-Net-type discriminator [37], which markedly improves results on anime images and videos. However, these networks have too many parameters to fit common training conditions, and when they are used to reconstruct remote sensing images, texture details remain unclear and some artifacts persist in the produced images.
The primary reasons for the abovementioned problems can be summarized as follows: (1) Compared with conventional images of the same dimensions, remote sensing images contain abundant information, such as external contours and internal textures, that characterizes the spatial relationships of ground objects, making the content appear crowded. Algorithms that can accurately recover and characterize these contours and internal textures are therefore necessary. (2) Most existing methods consider neither global spatial connectivity nor texture refinement in their design. Owing to the local perception characteristic of convolutional neural networks, global feature information in large-scale remote sensing scenes is ignored; thus, global remote sensing information is neglected and underutilized.
To resolve these concerns, we present TE-SAGAN, a GAN-based texture-enhanced SR network for remote sensing images, which sharpens SR image details while maintaining clear visual edges. A self-attention mechanism (SAM) is designed to learn global information from remote sensing images. Furthermore, the RRDB is introduced for deep feature extraction. A SAM with three submodules is used to obtain different levels of attention, where relevant information (such as edge contours) can be engaged and received dynamically. Weight normalization (WN) [39] is also introduced into TE-SAGAN to remove artifacts and increase training stability. A texture loss calculated with the Gram matrix is designed to assess the difference in high-frequency features between the original HR and SR images, which helps improve the internal details of the reconstructed image.
In conclusion, through the self-attention module and the texture loss, this work provides insight into refining image edges and textures. The major contributions are as follows: (1) a self-attention mechanism module is designed to extract global information, which alleviates the limitation of the local receptive field and increases super-resolution accuracy; (2) a weight normalization layer is introduced instead of batch normalization, which reduces the interference of mini-batch data on reconstructed image quality and mitigates image artifacts; (3) a joint loss is designed combining content, perceptual, adversarial, and texture losses, which accounts for the consistency of both pixel content and texture details and drives the super-resolution network to produce characteristic remote sensing images.

Principle of the Proposed Method
The network framework proposed in this study is based on the GAN [40]. The generator produces a super-resolution image from the corresponding low-resolution input, $I_{SR} = G(I_{LR}; \theta_G)$. The distribution of $I_{SR}$ should be as close as possible to the sample distribution of $I_{HR}$, and the parameters of the training network are optimized according to the following rule:

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{M} \sum_{i=1}^{M} L_{SR}\big(G(I_{LR}^{i}; \theta_G),\, I_{HR}^{i}\big)$$

where $L_{SR}(\cdot,\cdot)$ is the loss between the generated SR image and its HR reference. The parameters of the discriminator $D$ are defined accordingly as $\theta_D$. Their adversarial training is essentially a minimum-maximum optimization. Converted to LR-HR remote sensing image processing, the training objective is:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I_{HR} \sim P_{data}}\big[\log D(I_{HR}; \theta_D)\big] + \mathbb{E}_{I_{LR}}\big[\log\big(1 - D(G(I_{LR}; \theta_G); \theta_D)\big)\big]$$

The basic idea expressed in the above equation is that the generator network $G$ should be well trained under the GAN (Figure 1), while the discriminator network $D$ has to distinguish the true and false states of the input remote sensing images.

Generator of TE-SAGAN
Owing to the limitation of "local perception" in conventional convolution modules, long-distance global information cannot be used for HR remote sensing images. To extract more edge and global information from remote sensing images, the generator of TE-SAGAN was designed with a low-frequency feature extractor, deep residual dense blocks, a self-attention module, and WN (Figure 2). The self-attention module further enhances the ability of the network to extract global object features. Compared with the batch normalization (BN) commonly used in neural networks, the WN introduced in this study stabilizes network training and removes image artifacts. In the generator network, the low-frequency feature extractor is a convolution layer with 3 × 3 kernels, followed by a WN layer. The RRDB is an effective nonlinear feature extraction module that interweaves dense connectivity with residual skipping; it can learn deep semantic features while improving low-level feature extraction. Here, we gathered the feature-flow output from the RRDBs with the previous residual layer and merged them into the SAM module to generate global contextual outlines. To achieve SR processing of remote sensing images with an upscaling factor of 4×, the extracted features were recovered using an up-sampling layer with nearest-neighbor interpolation. In this process, in addition to keeping 23 RRDBs with the residual scaling factor $\beta = 0.2$ ($0 < \beta < 1$) and a self-attention mechanism module, each regular convolution layer was followed by WN (Figure 3).

Self-Attention Mechanism
Traditional convolution operations associate regional pixels with neighboring pixels using a fixed convolution kernel size. However, the limited perceptive field of a convolution layer leads to the failure of global feature extraction; consequently, the reconstructed image appears as a blurred overlay, or the recovered features do not correspond to the original image. Although enlarging the convolution kernel and deepening the convolution layers can help capture global features, the computational complexity of the network increases sharply. In recent years, attention mechanisms [41] have been widely used in natural language processing. In particular, the SAM module focuses selectively on significant location features to reduce network capacity and improve computational performance. To obtain more global information, we designed a SAM module in the generator network (Figure 4). The input feature map $x$ is first transformed into two feature spaces, $f(x) = W_f x$ and $g(x) = W_g x$. Subsequently, a softmax function is applied to obtain the feature attention map:

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \quad s_{ij} = f(x_i)^{\mathrm{T}} g(x_j)$$

where $\beta_{j,i}$ indicates the significance of the i-th position when synthesizing the j-th region. The output of one attention layer is:

$$o_j = \sum_{i=1}^{N} \beta_{j,i}\, h(x_i)$$

Finally, the attention output and the feature mapping layer $h(x)$ are added together to merge the globally relevant features. The output of the layer is expressed as follows:

$$y_i = \gamma\, o_i + x_i$$

To first explore local spatial information, $\gamma$ is initialized to 0. Overall, self-attention can capture accurate image features beyond the plain convolution operation, and spatial connections between distant reference pixels provide more reference information. Furthermore, it captures the key features of a high-resolution image with low computational complexity.
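The self-attention computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's TensorFlow implementation: the feature map is already flattened to N spatial positions, and the projection matrices `W_f`, `W_g`, `W_h` stand in for the 1 × 1 convolutions of the SAM module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_f, W_g, W_h, gamma=0.0):
    """Self-attention over a flattened feature map.

    x: (N, C) array of N spatial positions with C channels.
    W_f, W_g: (C, C') query/key projections; W_h: (C, C) value projection.
    gamma: learnable scalar initialized to 0, so the layer starts out
    as an identity over the convolutional features.
    """
    f = x @ W_f                # queries, (N, C')
    g = x @ W_g                # keys,    (N, C')
    h = x @ W_h                # values,  (N, C)
    s = f @ g.T                # attention energies s_ij, (N, N)
    beta = softmax(s, axis=1)  # attention map: each row sums to 1
    o = beta @ h               # attended features o_j, (N, C)
    return gamma * o + x       # residual connection y = gamma*o + x
```

With `gamma = 0` the layer passes features through unchanged, matching the initialization described above; as `gamma` grows during training, distant positions contribute more to each output.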

Weight Normalization
Batch normalization (BN) [42] is widely used in GAN-based methods. It balances covariate shift of image features inside the deep network and prevents overfitting during training. Nevertheless, BN tends to destroy low-frequency spatial features and increase erroneous noise estimates, causing artifacts in the SR results for remote sensing images. To overcome this issue, Salimans et al. proposed weight normalization (WN) [39], which stabilizes generative networks, especially GANs, by decoupling the length of the training weight vectors from their direction. WN decreases the dependence between mini-batch data and training performance, retains the contrast of image feature information, and has lower computational complexity than BN. Therefore, WN is more suitable than BN for SR tasks.
For nonlinear processing in the neural network, as shown in Equation (7), the feature node output $y$ is computed by convolutional filtering of the input features $x$ with weights $w$ and bias $b$, followed by the activation function $\phi$:

$$y = \phi(w \cdot x + b)$$

WN in the convolution layer reparameterizes $w$ as in Equation (8):

$$w = \frac{g}{\|v\|}\, v$$

where $v$ is a vector of the same dimension as $w$, $\|v\|$ denotes the Euclidean norm of $v$, $v / \|v\|$ represents the direction of $w$, and the scalar $g$ determines the length of $w$. After WN, the neural network nodes are calculated as follows:

$$y = \phi\left(\frac{g}{\|v\|}\, v \cdot x + b\right)$$
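The reparameterization above can be demonstrated directly. The sketch below, a dependency-free illustration rather than the paper's layer code, shows how the direction `v` and length `g` of a weight vector are decoupled.

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: reparameterize w = g * v / ||v||.

    v carries the direction of the weight vector and the scalar g its
    length, so ||w|| == g regardless of the magnitude of v. Decoupling
    the two smooths optimization without BN's mini-batch dependence.
    """
    return g * v / np.linalg.norm(v)

def wn_node(x, v, g, b):
    # A single affine node y = w . x + b with the normalized weight
    # (the activation function is omitted for clarity).
    return weight_norm(v, g) @ x + b
```

Because the gradient with respect to `g` only rescales the weight while the gradient with respect to `v` only rotates it, the two directions of optimization no longer interfere.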

Discriminator of TE-SAGAN
The role of the discriminator is to recognize the input images and judge their authenticity. The discriminator of TE-SAGAN is a VGGNet-type network [43] (Figure 5). This architecture has proven that multiple small convolution kernels (3 × 3) can learn more complex features at a lower computational cost than large convolution kernels (e.g., the 5 × 5, 7 × 7, and 11 × 11 kernels in AlexNet) [44]. In this study, the fixed convolution kernel size of VGGNet was modified to alternating kernel sizes of 3 and 4 to enhance image feature recognition. Furthermore, WN layers replace all BN layers in the discriminator network, further improving training speed and stability.

Loss Function
The loss function measures the difference between the SR images generated by the network model and the HR images used for training. In this study, to obtain results with more texture information and a consistent style, the TE-SAGAN training loss combines an L1-norm content loss, an adversarial loss, a perceptual loss computed from a high-frequency layer of VGGNet-19, and a texture loss computed using the Gram matrix.

Content Loss
Minimizing the mean absolute error (MAE) reduces sensitivity to abnormal image pixels. It is therefore used to enforce the consistency of the low-frequency content between the recovered SR image and the original HR image. The content loss is defined as:

$$L_1 = \mathbb{E}\big[\, \| I_{HR} - G(I_{LR}; \theta_G) \|_1 \,\big]$$
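A minimal sketch of the L1 content loss, assuming SR and HR images are given as arrays of matching shape:

```python
import numpy as np

def l1_content_loss(sr, hr):
    """Mean absolute error (L1 content loss) between SR and HR images.

    Unlike MSE, large pixel-wise outliers are penalized linearly, which
    reduces their influence on the loss.
    """
    return np.mean(np.abs(sr.astype(np.float64) - hr.astype(np.float64)))
```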

Adversarial Loss
The training stability of GANs is crucial for the quality of the generated images; a generator that merely "pretends to learn well" to cheat the discriminator yields poor results. In this experiment, we employed the relativistic average standard GAN (RaSGAN) [45], an adversarial training strategy whose relativistic discriminator estimates the probability that real input data are more realistic than the average of the generated data. The adversarial loss of TE-SAGAN thus differs from the "either true or false" rule of the standard GAN discriminator; instead, it focuses on predictions relative to the ground truth images to improve network robustness. The generator and discriminator loss functions are mutually symmetric, and the corresponding TE-SAGAN losses are defined in Equations (11) and (12):

$$L_D = -\mathbb{E}_{I_{HR}}\big[\log D_{Ra}(I_{HR}, I_{SR})\big] - \mathbb{E}_{I_{SR}}\big[\log\big(1 - D_{Ra}(I_{SR}, I_{HR})\big)\big]$$

$$L_G = -\mathbb{E}_{I_{HR}}\big[\log\big(1 - D_{Ra}(I_{HR}, I_{SR})\big)\big] - \mathbb{E}_{I_{SR}}\big[\log D_{Ra}(I_{SR}, I_{HR})\big]$$

where $I_{HR}$ denotes the original ground truth sample, $I_{SR} = G(I_{LR}; \theta_G)$ is the generated SR remote sensing image, $D_{Ra}(x_1, x_2) = \sigma\big(C(x_1) - \mathbb{E}[C(x_2)]\big)$ with $C(\cdot)$ the raw discriminator output, and $\mathbb{E}[\cdot]$ represents the mean over the mini-batch. The second equation is the adversarial loss of the generator.
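The relativistic average losses can be sketched on raw discriminator outputs (logits). This illustration assumes `c_real` and `c_fake` are batches of discriminator scores for HR and generated images, respectively; it is a simplified stand-in for the network's own loss code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rasgan_d_loss(c_real, c_fake):
    """Relativistic average discriminator loss.

    Instead of asking "is this image real?" in absolute terms, the
    relativistic term asks whether a real image scores higher than the
    average generated image, and vice versa.
    """
    d_rf = sigmoid(c_real - c_fake.mean())  # real vs. average fake
    d_fr = sigmoid(c_fake - c_real.mean())  # fake vs. average real
    return -np.mean(np.log(d_rf)) - np.mean(np.log(1.0 - d_fr))

def rasgan_g_loss(c_real, c_fake):
    # Symmetric generator loss: push generated scores above the
    # average real score, so both real and fake batches carry gradient.
    d_rf = sigmoid(c_real - c_fake.mean())
    d_fr = sigmoid(c_fake - c_real.mean())
    return -np.mean(np.log(1.0 - d_rf)) - np.mean(np.log(d_fr))
```

Note the symmetry: the generator loss is the discriminator loss with the roles of the two batches exchanged, which is the "mutual symmetry" referred to above.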

Perceptual Loss
Typically, pixel-wise losses (MSE or MAE) are restricted to low-frequency smooth features. They handle sparse and weak image features poorly, leading to weak supervision of high-frequency content. To improve the quality of the generated HR remote sensing images and enhance the robustness of the deep network, the perceptual loss is calculated on high-frequency feature maps taken before the activation operation, which better represent the consistency of image texture information. Using the pretrained VGG19 network [43], we calculate the Euclidean distance between the high-frequency features (extracted before the ReLU activation layers) of the ground truth reference and the reconstructed SR image, and define their difference as the perceptual loss:

$$L_{percep} = \frac{1}{W_{ij} H_{ij}} \sum_{x=1}^{W_{ij}} \sum_{y=1}^{H_{ij}} \big( \phi_{ij}(I_{HR})_{x,y} - \phi_{ij}(G(I_{LR}; \theta_G))_{x,y} \big)^2$$

where $W_{ij} \times H_{ij}$ is the dimension of the feature map in VGGNet, $I_{LR}$ is the LR remote sensing image, $I_{HR}$ is the HR remote sensing image, and $\phi_{ij}$ denotes the feature map extracted before the j-th convolution layer and after the i-th max-pooling layer within VGGNet-19.
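The structure of the perceptual loss can be shown without the VGG19 weights. In this sketch, `phi` is any fixed feature extractor standing in for the pre-activation VGG19 layer; the only assumption is that it maps an image to an `(H, W, C)` feature array.

```python
import numpy as np

def perceptual_loss(phi, hr, sr):
    """Mean squared distance between feature maps of HR and SR images.

    phi: a frozen feature extractor (in the paper, a pre-activation
    VGG19 layer); here it only needs to return an (H, W, C) array.
    The sum of squared differences is normalized by the feature map's
    spatial size H * W.
    """
    f_hr, f_sr = phi(hr), phi(sr)
    h, w = f_hr.shape[:2]
    return np.sum((f_hr - f_sr) ** 2) / (h * w)
```

Swapping `phi` for a real pretrained backbone (e.g., via a deep learning framework) turns this sketch into the loss described above; the arithmetic is unchanged.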

Texture Loss
Texture is also known as low-contrast fine detail. For SR remote sensing images, restoring texture information to approach that of the HR image is a primary objective. The perceptual loss improves the quality of the reconstructed image, but fine textures remain obscured. In this study, building on the perceptual loss, the correlation between feature channels is used to calculate a texture loss, which consolidates and corrects the details of remote sensing images. The texture loss between HR and SR images is calculated as follows:

$$L_{texture} = \big\| G_r\big(\phi(I_{HR})\big) - G_r\big(\phi(I_{SR})\big) \big\|^2$$

where $\phi(\cdot)$ represents the features extracted from VGGNet-19, and $G_r(F) = F F^{\mathrm{T}}$ is the Gram matrix of the feature map.
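The Gram-matrix computation at the heart of the texture loss is compact enough to sketch directly. As above, `phi` is a stand-in for the VGG19 feature extractor; the normalization by feature-map size is a common convention and one plausible choice, not necessarily the paper's exact scaling.

```python
import numpy as np

def gram_matrix(features):
    """Channel-correlation (Gram) matrix of an (H, W, C) feature map.

    Flattening the spatial dimensions and taking F^T F yields a (C, C)
    matrix of inner products between channels: a summary of which
    texture patterns co-occur, independent of where they occur.
    """
    h, w, c = features.shape
    f = features.reshape(h * w, c)
    return f.T @ f / (h * w * c)  # normalized (C, C) matrix

def texture_loss(phi, hr, sr):
    # Squared Frobenius distance between the HR and SR Gram matrices.
    g_hr = gram_matrix(phi(hr))
    g_sr = gram_matrix(phi(sr))
    return np.sum((g_hr - g_sr) ** 2)
```

Because the Gram matrix discards spatial position, this loss matches texture statistics rather than pixel locations, which is why it complements, rather than replaces, the pixel-wise and perceptual terms.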

Total Loss Function
Combining the losses described in the previous sections, the total loss function is:

$$L_{total} = L_1 + \lambda L_{percep} + \eta L_G + \mu L_{texture}$$

where $\lambda$, $\eta$, and $\mu$ are coefficients balancing the perceptual, adversarial, and texture loss terms.

Evaluation Indexes
The peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [46], and Fréchet inception distance (FID) [47] were used as objective evaluation indices for the experimental results, whereas the visual effect provided a subjective judgment. The PSNR measures, in dB, the pixel difference between a reconstructed image and the original image. A higher PSNR indicates higher quality and pixel fidelity of the reconstructed image. The PSNR is calculated as follows:

$$PSNR = 10 \log_{10} \frac{MAX_I^2}{MSE}, \quad MSE = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \big(I(i,j) - K(i,j)\big)^2$$

where $MAX_I$ is the maximum pixel value ($MAX_I = 255$ for 8-bit RGB images), and $MSE$ is the mean square error between images $I$ and $K$ of size m × n.
SSIM measures the luminance, contrast, and structure similarity between two images under noise and distortion. SSIM is calculated as follows:

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x$ and $\mu_y$ are the pixel means of images $x$ and $y$, respectively; $\sigma_{xy}$ is the covariance of $x$ and $y$; and $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances. In addition, the FID score, a proven and convincing metric, was employed to evaluate the accuracy of the designed model. FID computes the Fréchet distance (also known as the Wasserstein-2 distance) between two Gaussian distributions (synthetic and real images) in a feature space [48]. FID models each image set by its mean and covariance, $(m_r, C_r)$ for the HR images and $(m_g, C_g)$ for the generated images. The FID score is determined as follows:

$$FID = \| m_r - m_g \|^2 + \mathrm{Tr}\big( C_r + C_g - 2 (C_r C_g)^{1/2} \big)$$

where Tr denotes the trace, i.e., the sum of the elements on the main diagonal of the square matrix. Accordingly, a lower FID indicates that the produced images are closer to the HR samples.
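The PSNR and SSIM formulas above can be checked with a short sketch. Note that the SSIM below is computed over the whole image in a single window; the standard metric averages this quantity over local sliding windows, so this is an illustration of the formula rather than a drop-in replacement for a library implementation.

```python
import numpy as np

def psnr(img, ref, max_i=255.0):
    """Peak signal-to-noise ratio in dB between an image and a reference."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_i ** 2 / mse)

def ssim_global(x, y, L=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM using the formula above (c1 = (k1*L)^2 etc.)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

For quantitative comparisons against published numbers, a library implementation with the standard windowing (e.g., scikit-image's `structural_similarity`) should be preferred.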

Data Set
The study area is located in the port city of Portsmouth in eastern Virginia, USA, and comprises water systems, houses, roads, and other landscape features. We built a dataset of 60,000 HR satellite image patches of 128 × 128 pixels without overlapping areas. The final 10,000 images were reserved for testing the trained model (the Test-10000 set). Because the designed model aims at single-image super-resolution, 4× super-resolution processing is applied in this research, which is widely adopted in current deep learning remote sensing SR tasks [49-51]. We used bicubic down-sampling with a factor of 4 to obtain the corresponding LR image patches and thereby establish the paired training dataset with the HR patches. In addition, given that super-resolution tasks primarily involve scaling of spatial resolution, the training dataset was augmented only by random horizontal flips and 90°, 180°, and 270° rotations. Additionally, a well-known, publicly available dataset, the UC Merced Land Use dataset [52], was used in the test experiments. This dataset consists of 256 × 256 pixel original HR image patches covering 21 categories of satellite image scenes, each containing 100 patches. We selected the harbor, runway, airplane, and building scenes to validate the performance of the proposed method. Corresponding LR image patches of 64 × 64 pixels were obtained through the same down-sampling method.
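The HR/LR pairing and augmentation pipeline can be sketched as follows. Two simplifications are assumed and flagged here: 4 × 4 block averaging stands in for the bicubic down-sampling the paper actually uses (keeping the sketch dependency-free while producing the same shapes), and the augmentation draws are random rather than a fixed schedule.

```python
import numpy as np

def downsample_4x(hr):
    """4x down-sampling by 4 x 4 block averaging.

    Stand-in for bicubic down-sampling: a 128 x 128 x C HR patch
    becomes the 32 x 32 x C LR half of a training pair.
    """
    h, w = hr.shape[:2]
    hr = hr[:h - h % 4, :w - w % 4]  # crop to a multiple of 4
    return hr.reshape(h // 4, 4, w // 4, 4, -1).mean(axis=(1, 3))

def augment(patch, rng):
    # Random horizontal flip plus a rotation by 0/90/180/270 degrees:
    # the only augmentations used, because SR mainly concerns spatial
    # scale rather than photometric variation.
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    return np.rot90(patch, k=int(rng.integers(4)))
```

The same augmentation must be applied to both halves of an HR/LR pair (with the same random draws) so that the pair stays aligned.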

Experimental Settings
All SR processing algorithms were implemented with the TensorFlow framework on an Ubuntu 20.04.0 LTS system equipped with an NVIDIA GeForce RTX 2080 Ti GPU, CUDA 10.1, and cuDNN 7.6.0. As in a previous study [35], the training process was divided into two stages: pretraining and training. First, the proposed networks were trained with the L1 loss to consolidate the low-frequency image content and enhance the generator's simulation capability. After pretraining, formal adversarial training was initialized with the parameters of the final pretrained model. The balancing loss-term coefficients in Equation (16) were fixed during the training stage, and the Adam optimizer [53] was used with default hyperparameters ($\beta_1$ = 0.9, $\beta_2$ = 0.999).

Quantitative Evaluation
First, by comparing the results obtained using the enhanced super-resolution GAN (ESRGAN) [35] and TE-SAGAN in the pretraining phase, we assessed the stability of the two generator networks. The PSNR values over the entire pretraining stage for the two methods are displayed in Figure 6. Compared with the baseline, the pretraining PSNR of TE-SAGAN was slightly lower, but its training was considerably more stable, without heavy fluctuations, owing to the SAM module and WN layers. To validate the pretrained model in detail, further experiments were conducted using remote sensing images of different scenes from the UC Merced Land Use dataset. As shown in Table 1, our method yielded better results than ESRGAN. At the same time, as depicted in Figure 7, the pretraining loss function only describes the low-frequency content of the image, resulting in smooth outcomes that are still not entirely convincing. We further compared several reconstructed tree-shadow textures in detail and concluded that the results of the proposed model were closer to the ground truth; however, both methods produced generally smooth and fuzzy images. For alignment with human visual perception, further training with the adversarial component based on the pretrained model is necessary. For comparison with prior studies, bicubic interpolation and five representative deep-learning-based SR methods, namely ESPCN [30], EDSR [54], SRGAN [34], ESRGAN [35], and RFDNet [55], were used as comparison methods. The quantitative comparison results are presented in Table 2. The mean PSNR values of our method were only 0.050 and 0.398 dB higher than those of the second-highest algorithm, RFDNet, for Test-10000 and runway, respectively. Moreover, the remaining mean PSNR values of our model were lower than RFDNet's by 0.990, 0.234, and 0.205 dB for the harbor, airplane, and buildings sub-datasets, respectively, and all average SSIM scores of our model
were lower than the highest-scoring RFDNet by 0.017, 0.042, 0.008, 0.019, and 0.026. In contrast, for the FID score, where a lower value implies that generated images are closer to the real references, most of the FID scores of our network were significantly lower than those of the second-lowest network, ESRGAN, by 8.816, 21.749, 7.139, and 10.580 for Test-10000, runway, airplane, and buildings, respectively, except for the harbor sub-dataset, which was higher than RFDNet's by 6.28. These findings indicate that, compared with the other methods, TE-SAGAN focuses on sharpening the super-resolution image while maintaining relatively high PSNR and SSIM fidelity. Our method obtains a better overall objective rating across the benchmark datasets. Program runtime is also a non-negligible metric for evaluating network performance. To systematically elucidate how the introduced WN and SAM affect network performance, several TE-SAGAN ablation experiments were performed using the following networks: the baseline (ESRGAN), the baseline with the WN layer (Baseline + WN), the baseline with the WN layer and SAM (Baseline + WN + SAM), and the final combination of WN, SAM, and the texture loss term (Baseline + WN + SAM + Texture). Based on the results presented in Figure 8, the following conclusions were drawn. First, adding the WN and SAM modules to the baseline network improved the PSNR (Figure 8a) and SSIM (Figure 8b) values for several sub-datasets of the UC Merced Land Use dataset, implying that the added modules effectively improve the quality and fidelity of the reconstructed image. Second, the introduced texture loss focuses more on restoring texture details (FID score in Figure 8c) than on enhancing PSNR and SSIM values. Furthermore, although our network has slightly more parameters than the baseline model, it showed higher training efficiency. In summary, the ability of the WN and SAM modules to rapidly and accurately extract image features allows TE-SAGAN to reconstruct images with high fidelity, and the texture loss further adds details consistent with human observation. We further believe that the added modules improve the fitness of the network.

Qualitative Evaluation
Faithful SR imaging has always been the focus of many image-processing tasks. As mentioned above, the PSNR and SSIM metrics do not always agree with human observation. Therefore, to further assess the texture and edges of the reconstructed remote sensing images, qualitative visual comparisons were performed (Figures 9 and 10). The overall quality of the SR remote sensing images obtained using our method was significantly higher than that of the other algorithms. Figure 9 shows the processing results of our method on the test dataset (Test-10000), including villa surroundings, car windows in a parking lot, and shade trees beside a road. In detail, bicubic interpolation [14] fills in pixel information with a fixed computational paradigm, producing blurry images. SRGAN [34] and ESRGAN [35] yielded more visually pleasant images than ESPCN [30] and EDSR [54] owing to their deep feature extraction modules and perceptual loss. However, the textures recovered by SRGAN and ESRGAN were inadequate and erroneous (e.g., car windows and tree shadows) because they lack expression of, and constraints on, detailed and realistic textures. In particular, RFDNet is biased toward the objective PSNR and SSIM metrics, showing higher scores but generating super-resolution images with unclear textures. Our method is equipped with the WN layer and the Gram-matrix texture loss, which strengthens the correlation of high-frequency information between SR and HR images and limits pixel and texture deviation; thus, our method yielded finer-grained and more realistic images. To examine the recovery of image outlines, experiments were conducted using four representative remote sensing images from the UC Merced Land Use dataset, including street corners, rooftops, harbor ships, traffic lines, and other scenes; the results are presented in Figure 10. SRGAN and ESRGAN produced more realistic edges than bicubic interpolation,
ESPCN, EDSR, and RFDNet. However, the textures recovered by SRGAN and ESRGAN still contained unrealistic artifacts, as these methods rarely consider the spatial information of remote sensing images. In particular, RFDNet achieved high objective PSNR and SSIM scores but generated super-resolution images with unclear outlines. In contrast, the SAM module in our proposed TE-SAGAN enhances the connection between distant pixels, resulting in clearer, sharper edges in the reconstructed remote sensing images. Interestingly, even though several of the original HR images were slightly blurry, our method produced images with clearer and more natural borders than the originals.

Conclusions
SR images have been widely applied for extracting information from remote sensing images. In this study, we elucidated why some algorithms lose control over edges and internal textures in reconstructed remote sensing images, producing over-smoothed results. We proposed an RRDB-based texture-enhancement network, TE-SAGAN, which integrates the SAM and WN into the GAN framework. The SAM module effectively incorporates wide-area, long-range image information into training. The WN layer stabilizes the reconstructed image quality under mini-batch training, thereby reducing the contrast gaps and artifacts in the generated images. In addition, the Gram matrix imposes a texture loss constraint to refine the reconstructed texture. The PSNR, SSIM, and FID scores of TE-SAGAN on our own and public benchmark remote sensing image datasets indicate improved performance. This study also suggests that our proposed method yields clearer image edges and more realistic internal details than existing methods. Of note, the main limitation of our current study is that we focused on rebuilding texture details and borders by increasing the model parameters. Further research is needed to reduce the model size while achieving equivalent or even superior results.
and comprises two parts: a generator network and a discriminator network. The generator network consists of feed-forward neural networks forming the feature extraction module, expressed by the function $G(I_{LR}; \theta_G)$, where $\theta_G$ denotes the weights and biases $(w_L, b_L)$ of the neural network layers. The fully trained network generates images as follows:

For the inputted $I_{LR}$, the generator in the latter term emulates the generated sample distribution $P_{sample}(x)$ from the true sample distribution $P_{data}(x)$, thereby minimizing the probability that the discriminator distinguishes the generated samples. The network parameters are updated iteratively under strict monitoring conditions (e.g., loss constraints such as the mean square error [MSE] and mean absolute error [MAE]). The ability of the generator to fit the data is continuously enhanced, making it difficult for the discriminator to judge the reliability of the input samples. The well-trained model can predict the underlying frequency distribution of the samples and generate new high-quality data.

Figure 1. Schematic diagram of the super-resolution image processing in TE-SAGAN.

Figure 2. Structure of the generator network of TE-SAGAN.

Figure 4. The self-attention mechanism module structure.

Figure 5. Structure diagram of the discriminator network: k denotes the convolutional kernel size, n the number of convolutional kernel channels, and s the convolutional stride.

$I_{LR}^{i}$ and $I_{HR}^{i}$ jointly form the i-th pair of the training sample set, and M represents the number of training dataset pairs. $G(I_{LR}^{i}; \theta_G)$ is the mapping function between the LR and SR images.

$K_1$ and $K_2$ are constants that avoid division by zero. Furthermore, L is the dynamic range of the image pixels. $K_1$ and $K_2$ are set to 0.01 and 0.03 by default. SSIM consists of luminance, contrast, and structure components. The batch size of the imported training data was set to 16 in the pretraining phase and 4 in the adversarial training phase. The initial learning rate was set separately for each phase, and Adam was used for the overall training optimization.

Figure 7. Results of SR processing on the Test-10000 dataset with a scale factor of 4 in the pretraining stage.
ESPCN takes the shortest runtime, training quickly thanks to sub-pixel convolution, and the recent RFDNet, an improved lightweight distillation network, has a relatively low training runtime. Among the models whose runtime exceeds one day (1440 min), TE-SAGAN takes the shortest training time, shortening the runtime by around a quarter compared with ESRGAN. The weight normalization (WN) layer introduced in this paper ensures network stability with mini-batch inputs and accelerates training. Training EDSR takes the longest time. These findings indicate that TE-SAGAN improves network runtime.

Figure 9. Image processing results of the different algorithms on the test dataset, including details of roofs, cars, and tree shadows.

Figure 10. Comparative results of super-resolution images using the UC Merced Land Use dataset.

Table 1. Comparison of pre-ESRGAN and pre-TE-SAGAN using part of the UC Merced Land Use dataset.
For PSNR↑ [dB] and SSIM↑, the highest values are in bold, and the second-highest values are underlined. Conversely, for FID↓, the lowest values are in bold, and the second-lowest values are underlined.