IESRGAN: Enhanced U-Net Structured Generative Adversarial Network for Remote Sensing Image Super-Resolution Reconstruction

Abstract: With the continuous development of modern remote sensing satellite technology, high-resolution (HR) remote sensing image data have gradually become widely used. However, due to the vastness of the areas that need to be monitored and the difficulty of obtaining HR images, most monitoring projects still rely on low-resolution (LR) data for the regions being monitored. Remote sensing image super-resolution (SR) reconstruction technology effectively compensates for the lack of original HR images. This paper proposes an Improved Enhanced Super-Resolution Generative Adversarial Network (IESRGAN) based on an enhanced U-Net structure for 4× detail reconstruction of LR images, using NaSC-TG2 remote sensing images. The method improves both the generator and the discriminator of the GAN. Specifically, in the generator, input images are subjected to reflection padding before the Residual-in-Residual Dense Blocks (RRDB) to enhance edge information. Meanwhile, a U-Net structure is adopted for the discriminator, incorporating spectral normalization to focus on semantic and structural changes between real and fake images, thereby improving generated image quality and GAN performance. To evaluate the effectiveness and generalization ability of the proposed model, experiments were conducted on multiple real-world remote sensing image datasets. Experimental results demonstrate that IESRGAN exhibits strong generalization capabilities while delivering outstanding performance in terms of the PSNR, SSIM, and LPIPS image evaluation metrics.


Introduction
Remote sensing technology can determine ground object targets and natural phenomena by collecting and analyzing electromagnetic waves [1]. Remote sensing also offers a repetitive and continuous perspective for observing Earth, making its value in monitoring short-term and long-term changes and the effects of human activities immeasurable [2]. Among other things, remote sensing images are a way to demonstrate the application of remote sensing data, and image quality is directly related to the results of application analysis. Spatial resolution represents the smallest unit size or dimension that can be distinguished in remote sensing images and serves as an indicator of the image's ability to distinguish details of ground targets [3]. The higher the spatial resolution, the more information about ground objects is contained within remote sensing images, allowing for finer target identification. However, due to limitations such as under-sampling effects from imaging sensors and various degradation factors during image processing in transmission satellites, relying solely on hardware-level improvements for spatial resolution would result in high development costs and lengthy hardware iteration cycles.
The image SR adversarial network (an improved generative adversarial network via multi-scale residual blocks) that introduces multi-scale residual blocks in the generator network and uses attention mechanisms for multi-scale feature fusion. Zhao et al. [29] proposed the SA-GAN algorithm, which uses second-order channel attention mechanisms and region-level non-local modules in the generator network and employs a region-aware loss to suppress artifact generation. Ali et al. [30] proposed TESR (a two-stage approach for enhancement and super-resolution), an architecture that exploits the power of vision transformers (ViT) and diffusion models (DM) to artificially improve the resolution of remotely sensed images.
Additionally, significant research has been conducted on resolution enhancement for other types of remote sensing images such as multisource image fusion [31,32] and hyperspectral imaging [33].
Although GANs have achieved remarkable success in fields such as image generation and style transfer, their training process still faces challenges, including mode collapse and vanishing gradients. Moreover, most current methods use pixel-level loss functions, such as mean squared error (MSE), which may lead to overly smooth reconstructed images lacking high-frequency details. Furthermore, remote sensing images exhibit more complex scenes and more diverse target characteristics than ordinary images, so the properties of real remote sensing datasets must be considered in reconstruction. Finally, while current super-resolution methods perform well on training data, they may lack generalization capabilities for unseen scenes and targets. Therefore, model design and training strategies should focus on enhancing robustness and generalization.
To address these issues, we propose IESRGAN: an improved GAN for remote sensing image super-resolution reconstruction based on an enhanced U-Net structure. The main adjustments and contributions include:
(1) We optimize the generator network structure by adding reflection padding before the Residual-in-Residual Dense Blocks (RRDB), which prevents image edge information loss and keeps feature map dimensions consistent across RRDB layers, simplifying skip connections and feature fusion.
(2) To improve performance further, we replace traditional discriminators with a U-Net-based discriminator and incorporate spectral normalization regularization. This allows for fusing image detail information at different resolution levels while enhancing the stability of the GAN discriminator.
(3) We demonstrate that our proposed IESRGAN exhibits strong generalization capabilities and performs well on real remote sensing images.
The rest of this paper is organized as follows. Section 2 details the structure of the IESRGAN; Section 3 verifies the effectiveness and generalization ability of IESRGAN by comparing it with other algorithms; Section 4 discusses the conclusions of IESRGAN in depth and points out future research directions.

Ideas and IESRGAN Methods
IESRGAN is composed of two main components: a generator and a discriminator. The overall workflow of IESRGAN is depicted in Figure 1. The generator takes an input LR remote sensing image and reconstructs an HR image through operations such as convolution and up-sampling, learning to map the LR image to an SR image with enhanced details and finer textures. The generated SR image is then passed to the U-Net-based discriminator, whose role is to compare the SR image with a real HR image and determine whether the SR image is realistic. The discriminator network is trained to identify flaws or discrepancies in the reconstructed images, enabling it to differentiate between real HR images and those produced by the generator. The generator and discriminator engage in a continuous adversarial game during training: the generator aims to produce SR images realistic enough to deceive the discriminator, while the discriminator strives to accurately identify the generated images. Through this adversarial process, both networks improve iteratively. As training progresses, the generator becomes more adept at generating high-quality, realistic HR images, while the discriminator becomes more discerning and better at detecting flaws in the reconstructed images. This iterative training process leads to HR images with enhanced details and improved realism.

Network Design of Generators-SR-RRDB
The generator network, depicted in Figure 2, is a CNN-based model. The input image first passes through a reflection padding layer, referred to as the ReflectionPad layer, which prevents edge information loss. RRDBs are then used to retain detail features while uncovering new ones. The generator comprises four primary modules.
The first module, called the regular module, consists of a ReflectionPad layer, a Conv layer, and a Rectified Linear Unit (ReLU) layer. The ReflectionPad layer performs reflection filling around the input image edges to extend edge information and avoid edge information loss and blurring; the Conv layer applies a 3 × 3 convolution kernel to the data to extract features; and the ReLU layer performs a non-linear transformation to enhance the expressive power of the model. ReLU has the advantages of simple computation and fast convergence, and it alleviates the vanishing-gradient problem.
The second module consists of 23 Residual-in-Residual Dense Block (RRDB) modules and a regular module with residual network connections. The RRDB combines the residual network structure with dense connectivity, as shown in Figure 3. The residual network learns the residuals between the input and the output, and most of the residuals can be zero or small [34]. The dense connection is defined as
$$x_{i+1} = H([x_0, x_1, \ldots, x_i]),$$
where $[x_0, x_1, \ldots, x_i]$ denotes the concatenation of the feature maps generated by layers $0, 1, \ldots, i$, which is used as the input [35], and $H(\cdot)$ denotes the transformation applied by the layer. Residual networks reuse features but are not good at mining new features, while dense connections constantly explore new features but lead to higher redundancy [36]. RRDB combines the advantages of both network structures so that the model adapts better to complex data distributions and patterns, improving performance and accuracy.
The third module is up-sampling, which increases the image size. The last module consists of two regular modules in which the convolution kernel is changed from 1 × 1 to 3 × 3 to enlarge the receptive field and learn features better. With the above generator network structure, called SR-RRDB, a high-resolution image corresponding to the input image is reconstructed.
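The effect of the ReflectionPad layer in the regular module can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: `reflect_pad_2d` and `conv3x3_valid` are hypothetical helper names, and the mirroring convention (the border pixel itself is not repeated) is assumed to match the one used by PyTorch's `nn.ReflectionPad2d`. The sketch shows that padding by one pixel before a 3 × 3 convolution keeps the output the same size as the input, which is what allows feature maps to stay aligned across blocks.

```python
def reflect_pad_2d(image, pad):
    """Reflection-pad a 2D list of numbers by `pad` pixels on every side.

    The border itself is not repeated: the row [1, 2, 3] padded by 1
    becomes [2, 1, 2, 3, 2].
    """
    height, width = len(image), len(image[0])
    if pad >= height or pad >= width:
        raise ValueError("pad must be smaller than both image dimensions")

    def reflect(index, size):
        # Map an out-of-range index back into [0, size) by mirroring.
        if index < 0:
            return -index
        if index >= size:
            return 2 * (size - 1) - index
        return index

    rows = [reflect(r, height) for r in range(-pad, height + pad)]
    cols = [reflect(c, width) for c in range(-pad, width + pad)]
    return [[image[r][c] for c in cols] for r in rows]


def conv3x3_valid(image, kernel):
    """Plain 3 x 3 'valid' convolution (cross-correlation, as in CNN layers)."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(3) for j in range(3))
            for c in range(width - 2)
        ]
        for r in range(height - 2)
    ]


lr_patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
padded = reflect_pad_2d(lr_patch, 1)
mean_kernel = [[1 / 9] * 3 for _ in range(3)]
features = conv3x3_valid(padded, mean_kernel)
print(len(padded), len(padded[0]))      # padded size: 5 5
print(len(features), len(features[0]))  # same size as the input: 3 3
```

Without the padding step, the same convolution would shrink the 3 × 3 patch to a single value, so border pixels would contribute to far fewer outputs than interior pixels.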

Discriminator Network Design
In this study, instead of using the traditional discriminator structure, we chose a discriminator network based on the U-Net structure, as shown in Figure 4. This discriminator network consists of two main components: an encoder (down-sampling) and a decoder (up-sampling). The encoder is responsible for capturing the contextual information in the image, while the decoder is responsible for recovering the image details. To achieve information fusion, skip connections are used between the two. As a result, the discriminator can extract multi-scale features from images efficiently and accurately.

It is worth noting that once the features enter the encoder from the initial convolution layer, spectral normalization regularization is applied to stabilize the training of the discriminator network. Spectral normalization is a regularization method for neural networks that limits the spectral norm (the largest singular value) of each weight matrix by dividing the matrix by an estimate of that norm. The specific algorithmic process is presented in Table 1. Spectral normalization [37] makes the spectral norm of the weight matrix $W$ satisfy the Lipschitz constraint $\sigma(W) = 1$:
$$W_{SN}(W) = \frac{W}{\sigma(W)}$$

Table 1. Spectral Normalization
• Initialize $u_l \in \mathbb{R}^{d_l}$ for $l = 1, \ldots, L$ with a random vector (sampled from an isotropic distribution).
• For each update and each layer $l$:
1. Apply the power iteration method to the unnormalized weight $W^l$: $v_l \leftarrow (W^l)^{T} u_l / \|(W^l)^{T} u_l\|_2$, then $u_l \leftarrow W^l v_l / \|W^l v_l\|_2$.
2. Calculate $W^l_{SN}$ with the spectral norm: $W^l_{SN}(W^l) = W^l / \sigma(W^l)$, where $\sigma(W^l) = u_l^{T} W^l v_l$.
3. Update $W^l$ on the mini-batch dataset $D_M$ with a learning rate $\alpha$: $W^l \leftarrow W^l - \alpha \nabla_{W^l} \ell(W^l_{SN}(W^l), D_M)$.

The use of a discriminator network based on the U-Net structure brings significant advantages. First, U-Net has skip connections, which fuse shallow features directly with deep features and alleviate the vanishing-gradient problem. This allows the discriminator to learn semantic information at different scales and gives it strong generalization capability. Second, because the U-Net structure fully considers multi-scale information fusion, it can better capture detail changes in small targets or local regions. This is important for generating high-quality images, especially in tasks that require fine structures and textures. Finally, U-Net restores features to the original input space step by step in the decoding stage by means of deconvolution layers and continuously fuses shallow features. This allows the discriminator to take more contextual information into account, improving its ability to judge the quality of the generated images. Together, these advantages contribute to a significant improvement in GAN performance.
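The power-iteration step of Table 1 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the training code used in the paper: the helper names are hypothetical, the weight is a small dense matrix stored as nested lists, and many iterations are run at once so that the estimate converges in a single call (in practice a framework utility such as PyTorch's `torch.nn.utils.spectral_norm` would normally be used, with the vector $u$ carried over between updates).

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def l2_normalize(vector, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def spectral_normalize(weight, n_iters=50, seed=0):
    """Return (W / sigma(W), sigma(W)), with sigma estimated by power iteration."""
    rng = random.Random(seed)
    # u has one entry per row of W and starts from an isotropic Gaussian sample.
    u = l2_normalize([rng.gauss(0.0, 1.0) for _ in weight])
    weight_t = transpose(weight)
    for _ in range(n_iters):
        v = l2_normalize(matvec(weight_t, u))  # v <- W^T u / ||W^T u||
        u = l2_normalize(matvec(weight, v))    # u <- W v / ||W v||
    # sigma(W) is approximated by u^T W v.
    sigma = sum(ui * wvi for ui, wvi in zip(u, matvec(weight, v)))
    normalized = [[w / sigma for w in row] for row in weight]
    return normalized, sigma


weight = [[3.0, 0.0], [0.0, 1.0]]
w_sn, sigma = spectral_normalize(weight)
print(sigma)  # close to 3, the largest singular value of this diagonal matrix
print(w_sn)   # the rescaled weight, whose largest singular value is close to 1
```

Dividing by the estimated spectral norm bounds how much a layer can amplify its input, which is the mechanism by which the discriminator's Lipschitz constant is kept under control.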

Loss Function
To enhance the robustness of the overall model, the loss function fuses several terms. The generator network uses a content loss, a generation loss, and a perceptual loss, where the perceptual loss consists of the content loss and the generation loss. The discriminator network uses a binary cross-entropy loss function (BCEWithLogitsLoss) as its adversarial loss.
The content loss is computed by feeding the generated image and the target image separately through the convolutional layers of the VGG-19 network and measuring their difference in the feature space with the L1 norm. The content loss is defined as:
$$L_{content} = \left\| G_l(\hat{y}) - G_l(y) \right\|_1$$
Here, $\hat{y}$ represents the generated image, $y$ denotes the target image, $G_l(\cdot)$ signifies the feature map of layer $l$ in the VGG-19 network, and $\|\cdot\|_1$ represents the L1 norm. The function of the content loss is to make the generated image closer to the pixel distribution of the target image, thus making the generated image more realistic. The feature map of a layer in the truncated VGG-19 network is assumed to be a three-dimensional tensor of size $C_l \times H_l \times W_l$, where $C_l$ indicates the number of channels, $H_l$ the height, and $W_l$ the width. The feature map $G_l(\hat{y})$ of the generated image $\hat{y}$ at layer $l$ is expressed in terms of the following quantities: $F_{l,c,h,w}(\hat{y})$ denotes the feature value of the generated image $\hat{y}$ at layer $l$, channel $c$, row $h$, and column $w$; $\phi_{l,c,h,w}(i, j)$ represents the value of the convolution kernel at position $(i, j)$ in layer $l$, channel $c$, row $h$, and column $w$ of the VGG-19 network; and $G_{l,i,j}(\hat{y})$ indicates the feature value of the generated image $\hat{y}$ at layer $l$, row $i$, and column $j$.
In the generation loss, the discriminator judges whether the SR-generated image is a "pseudo-image", producing a discriminant result. BCEWithLogitsLoss is then used to calculate the adversarial loss, which measures the difference between the probability that the generated image is judged to be real and 1. BCEWithLogitsLoss is expressed as:
$$L_{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log \sigma(\hat{y}_i) + (1 - y_i)\log\left(1 - \sigma(\hat{y}_i)\right) \right]$$
Here, $n$ represents the number of samples, $y_i$ denotes the label of the real image, $\hat{y}_i$ signifies the discriminant result of the discriminator on the generated image, and $\sigma$ stands for the sigmoid function. The overall perceptual loss, given in Equation (9), combines the content loss and the generation loss.
The discrimination loss is also calculated using BCEWithLogitsLoss. First, discriminant results are obtained for the SR-generated images and the real images separately. Next, the SR-generated images are assigned the label 0, which means "false image", and the real images are assigned the label 1, which means "true image". The discrimination loss is formed from two terms, $L_d^{sr}$ and $L_d^{h}$, which are respectively:
$$L_d^{sr} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i^{sr} \log \sigma(\hat{y}_i^{sr}) + (1 - y_i^{sr})\log\left(1 - \sigma(\hat{y}_i^{sr})\right) \right]$$
$$L_d^{h} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i^{h} \log \sigma(\hat{y}_i^{h}) + (1 - y_i^{h})\log\left(1 - \sigma(\hat{y}_i^{h})\right) \right]$$
In $L_d^{sr}$, $n$ indicates the number of samples; $y_i^{sr}$ is a tensor of all zeros, which denotes the label of the fake image; $\hat{y}_i^{sr}$ denotes the discriminant result of the discriminator on the SR-generated image; and $\sigma$ signifies the sigmoid function.
In $L_d^{h}$, $n$ indicates the number of samples; $y_i^{h}$ is a tensor of all ones, which denotes the label of the real image; $\hat{y}_i^{h}$ denotes the discriminant result for the real image; and $\sigma$ signifies the sigmoid function.
BCEWithLogitsLoss is advantageous for calculating the generation loss and the adversarial loss because it not only measures the difference between the prediction and the true label but also converts the prediction into a probability through the sigmoid function, thus reflecting the confidence of the prediction more accurately. In addition, BCEWithLogitsLoss handles numerical stability automatically and prevents overflow or underflow in the calculation of the sigmoid function. In the adversarial training process, BCEWithLogitsLoss can effectively evaluate the similarity between the generated image and the real image and provide better guidance for generator training.
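The numerical-stability point can be made concrete with a small sketch of binary cross-entropy computed directly on logits. This is a plain-Python illustration, not PyTorch's implementation: the function names are hypothetical, and the label assignment (1 for real HR images, 0 for SR images) follows the description above.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form.

    Uses the identity
        -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))]
            = max(x, 0) - x*y + log(1 + exp(-|x|)),
    so exp() is never called with a large positive argument.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def generator_adversarial_loss(sr_logits):
    # The generator wants its SR images to be classified as real (label 1).
    return bce_with_logits(sr_logits, [1.0] * len(sr_logits))


def discriminator_losses(sr_logits, hr_logits):
    # SR images carry the "fake" label 0, real HR images the "real" label 1.
    loss_sr = bce_with_logits(sr_logits, [0.0] * len(sr_logits))
    loss_hr = bce_with_logits(hr_logits, [1.0] * len(hr_logits))
    return loss_sr, loss_hr


# Hypothetical discriminator outputs for two SR images and two HR images.
sr_logits = [-2.0, 0.5]
hr_logits = [3.0, 1000.0]  # the extreme logit does not overflow
print(generator_adversarial_loss(sr_logits))
print(discriminator_losses(sr_logits, hr_logits))
```

A naive version that first evaluates the sigmoid and then takes its logarithm would raise an overflow error in `math.exp` for a logit such as -1000, or return the logarithm of zero; working on the logits avoids both cases.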

Experiments
In this paper, we conduct model experiments with the following data and compare classical models in the super-resolution domain to verify the validity and generalization of the model.

Data Source
The remote sensing image data selected for this study include NaSC-TG2 [38], Satellite Images of Hurricane Damage [39], NWPU-RESISC45 [40], and UCMerced LandUse [41]. The NaSC-TG2 data originate from China's first space laboratory, Tiangong-2, which is equipped with a Wide-band Imaging Spectrometer (WIS) featuring 14 spectral channels covering visible light, near-infrared, short-wave infrared, and thermal infrared bands. The spatial resolution of these data at ground pixel distance is 100 m, 200 m, and 400 m. The Satellite Images of Hurricane Damage data are obtained from the Planet satellite constellation, which consists of hundreds of Dove satellites (10 cm × 10 cm × 30 cm) that use optical systems and cameras to capture images in RGB and near-infrared bands with a ground pixel distance of 3~5 m. The NWPU-RESISC45 data come from Google Earth satellite images with spatial resolutions ranging from 0.2 m to 30 m, acquired through satellite imagery, aerial photography, and Geographic Information Systems (GIS). The UCMerced LandUse data are sourced from the USGS National Map with a spatial resolution of 1 foot (0.3048 m). Table 2 summarizes the information on the SR remote sensing image data used in this paper. Considering the spectral range differences across channels in these satellite image datasets, our experimental data only include RGB three-band images. The selection of these datasets will aid in further exploring remote sensing image processing techniques and provide theoretical support for enhancing practical applications.
In our experiments, we built a training set using 19,980 remote sensing images from the NaSC-TG2 dataset. Each HR image was down-sampled by a factor of four to obtain an LR image. The HR images have a size of 128 × 128 pixels, and correspondingly, the LR images have a size of 32 × 32 pixels. Training with smaller-sized images allows the model to focus on rich local textures, structural features, and object information in remote sensing images. This approach helps capture important details and patterns necessary for accurate super-resolution reconstruction. Additionally, using smaller-sized images reduces computational complexity and memory consumption. Figure 5 illustrates examples of the HR-LR pairs.
To evaluate the generalization capability of our proposed model, we constructed four test sets by randomly selecting 120 images from the NaSC-TG2 dataset, 1000 images from the Satellite Images of Hurricane Damage dataset, 1890 images from the NWPU-RESISC45 dataset, and 420 images from the UCMerced LandUse dataset. These diverse datasets provide a representative sample of remote sensing images, enabling us to assess how well our model performs on different types of scenes and objects. Through this comprehensive evaluation, we aim to demonstrate the robustness and effectiveness of our model in handling a variety of remotely sensed image scenes.
Figure 5. Examples of the HR-LR pair.

Experimental Environment and Parameter Settings
In this study, the experiments were run on an Ubuntu operating system with a GeForce RTX 2080Ti GPU. The code was written in Python, and the PyTorch framework (available at https://pytorch.org/ (accessed on 1 July 2023)) was used for algorithm modeling and implementation. The IESRGAN network architecture comprises two primary components: the generator network and the discriminator network. A total of 19,800 HR remote sensing images from the NaSC-TG2 dataset were employed as the target images. As an initial step, bicubic interpolation down-sampling was applied to generate the corresponding set of 19,800 LR remote sensing images used as input. These LR images were then fed into the SR-RRDB model, that is, the generator network, for training. The initial experimental details of the SR-RRDB model training can be found in Table 3.
The Cosine Annealing Learning Rate Schedule (CosineAnnealingLR) scheduler was combined with the Adam optimizer to adjust the learning rate during training. This method gradually reduces the learning rate, which improves convergence and ultimately the overall performance and generalization capability of the model. After this stage, the SR images generated by the trained SR-RRDB model were passed to the discriminator network based on the U-Net architecture, whose purpose is to discriminate between real HR images and those produced by the SR-RRDB model. Starting from the initialized SR-RRDB model, further experimental details of the IESRGAN model training are given in Table 4. Notably, when training reached its halfway point, the learning rate was reduced to half of its initial value. This modification contributes to refining both model performance and generalization throughout the training process. Figure 6 shows the curves of the content loss, generation loss, and discriminative loss throughout the training process.
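The two learning-rate schedules described above can be written in closed form. The sketch below is only illustrative: the base learning rate and the number of steps are hypothetical placeholders (the actual settings are those listed in Tables 3 and 4), and the cosine formula is the standard restart-free form of cosine annealing rather than a copy of any particular scheduler's code.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Closed-form cosine annealing: base_lr at step 0, min_lr at total_steps."""
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


def halved_at_midpoint_lr(step, total_steps, base_lr):
    """Step schedule: keep base_lr for the first half, then use half of it."""
    return base_lr if step < total_steps // 2 else 0.5 * base_lr


# Hypothetical values, chosen only to make the example concrete.
BASE_LR = 2e-4
TOTAL_STEPS = 1000

for step in (0, 250, 500, 750, 1000):
    print(step,
          cosine_annealing_lr(step, TOTAL_STEPS, BASE_LR),
          halved_at_midpoint_lr(step, TOTAL_STEPS, BASE_LR))
```

The cosine schedule decays smoothly from the base rate towards the minimum, whereas the midpoint schedule makes a single discrete drop, matching the description of the generator pre-training and the subsequent IESRGAN training, respectively.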

Experimental Evaluation Metrics
The Peak Signal-to-Noise Ratio (PSNR) [42] and Structural Similarity Index (SSIM) [43] have been used as standard evaluation metrics in image SR. Nevertheless, as revealed in some recent studies [44], super-resolved images can achieve high PSNR and SSIM scores while being over-smoothed and lacking realistic visual detail. In this study, in addition to PSNR and SSIM, the learned perceptual image patch similarity (LPIPS) [45] is therefore included in our experiments.
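PSNR is used to evaluate pixel-wise differences between images. A higher PSNR value indicates a smaller difference between the processed image and the real image, implying better image quality. Its formula is:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right)$$

In this formula, MAX represents the maximum pixel value, and MSE denotes the mean squared error between the reference image and the evaluated image. Its formula is given by:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(I_i - P_i\right)^{2}$$

Here, $N$ refers to the total number of pixels, while $I_i$ and $P_i$ represent the $i$th pixel values of the reference image and evaluated image, respectively. SSIM takes into account factors such as the brightness, contrast, and structure of an image. In its standard form, it is expressed as:

$$\mathrm{SSIM}(I, P) = \frac{\left(2\mu_I\mu_P + C_1\right)\left(2\sigma_{IP} + C_2\right)}{\left(\mu_I^{2} + \mu_P^{2} + C_1\right)\left(\sigma_I^{2} + \sigma_P^{2} + C_2\right)}$$

where $\mu_I$ and $\mu_P$ are the mean intensities of the two images, $\sigma_I^{2}$ and $\sigma_P^{2}$ their variances, $\sigma_{IP}$ their covariance, and $C_1$ and $C_2$ small constants that stabilize the division. The SSIM value ranges over [0,1], with higher values indicating better image quality. LPIPS measures perceptual differences between two images, i.e., visual similarity between generated images and real images. A lower LPIPS score indicates a higher similarity between two images. Its formula is as follows:

$$d(x, x_0) = \sum_{l}\frac{1}{H_l W_l}\sum_{h,w}\left\lVert w_l \odot \left(\hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw}\right)\right\rVert_2^{2}$$

In the above equation, $x$ and $x_0$ represent generated images and real images, respectively; $\hat{y}^{l}_{hw}$ denotes predicted feature maps for $x$ at spatial position $(h, w)$ and feature map $l$; $\hat{y}^{l}_{0hw}$ represents predicted feature maps for $x_0$ at the same spatial position and feature map; and $H_l$ and $W_l$ are the height and width of feature map $l$. The weight $w_l$ is learned by the network to emphasize or de-emphasize certain features in an image.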
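As an illustration of the pixel-based metrics defined above, the following sketch computes MSE, PSNR, and a simplified SSIM on flat lists of pixel values. It is not the evaluation code used in the experiments. Several points are assumptions made for brevity: the function names are hypothetical, the SSIM variant uses whole-image statistics instead of the usual sliding window, the constants follow the common choice $C_1 = (0.01\,\mathrm{MAX})^2$ and $C_2 = (0.03\,\mathrm{MAX})^2$, and the pixel values are made up. LPIPS is omitted because it requires a pretrained feature network.

```python
import math


def mse(ref, est):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    return sum((i - p) ** 2 for i, p in zip(ref, est)) / len(ref)


def psnr(ref, est, max_val=255.0):
    """PSNR in dB; identical images give infinity."""
    err = mse(ref, est)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


def global_ssim(ref, est, max_val=255.0, k1=0.01, k2=0.03):
    """Simplified SSIM from whole-image statistics (no sliding window)."""
    n = len(ref)
    mu_i = sum(ref) / n
    mu_p = sum(est) / n
    var_i = sum((v - mu_i) ** 2 for v in ref) / n
    var_p = sum((v - mu_p) ** 2 for v in est) / n
    cov = sum((a - mu_i) * (b - mu_p) for a, b in zip(ref, est)) / n
    c1 = (k1 * max_val) ** 2
    c2 = (k2 * max_val) ** 2
    numerator = (2 * mu_i * mu_p + c1) * (2 * cov + c2)
    denominator = (mu_i ** 2 + mu_p ** 2 + c1) * (var_i + var_p + c2)
    return numerator / denominator


# Made-up 8-bit pixel values, for illustration only.
reference = [52, 55, 61, 59, 79, 61, 76, 61]
estimate = [54, 55, 60, 58, 80, 63, 75, 60]

print(mse(reference, estimate))
print(psnr(reference, estimate))
print(global_ssim(reference, estimate))
```

Library implementations of SSIM typically average the index over local windows, so their values will generally differ from this global variant.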

Quantitative and Qualitative Comparison of Different Methods
In this section, the proposed method is compared with several classical single-image SR algorithms on four distinct test sets. The SR algorithms under consideration comprise three CNN-based methods, namely VDSR [17], SRResNet [22], and TESR [30], and two GAN-based methods, namely SRGAN [22] and ESRGAN [23]. Each of these methods was optimized on the training set to ensure a fair comparison. To allow comparison with both CNN-based and GAN-based algorithms, two networks are trained: SR-RRDB and IESRGAN. The proposed SR-RRDB is a CNN-based algorithm that consists solely of the generator network. Trained exclusively with pixel loss, it can reconstruct HR images from LR ones on its own; because it optimizes pixel loss only, however, its outputs may not match human perception. SR-RRDB is therefore compared with the other CNN-based algorithms. The proposed IESRGAN, in contrast, is built on a GAN model comprising both generator and discriminator networks. Its loss function incorporates perceptual loss through a fusion method, which enhances visual quality as perceived by the human eye. IESRGAN is therefore compared with the other GAN-based algorithms. Evaluating both families with two networks in this way keeps the comparisons fair while highlighting the strengths and limitations of each method.
In this study, three metrics are employed to quantitatively evaluate the SR results, namely PSNR, SSIM, and LPIPS. The best results in each row are highlighted in red for easy comparison. As demonstrated in Table 5, the highest PSNR score is achieved by the SR-RRDB method; a higher PSNR value indicates a smaller difference between the reconstructed image and the real image. As shown in Table 6, the highest SSIM score is also attained by the SR-RRDB method; a higher SSIM value, within the range [0,1], indicates better preservation of brightness, contrast, and structure during the super-resolution process. Meanwhile, as displayed in Table 7, IESRGAN performs best on the LPIPS metric; a lower LPIPS value implies higher visual perceptual similarity between generated and real images. CNN-based SR methods offer advantages in terms of PSNR and SSIM due to their emphasis on preserving the spatial structure of the LR images. However, their super-resolution outcomes tend to lack realistic visual effects, leading to poor LPIPS performance. In contrast, GAN-based SR methods achieve better LPIPS performance while maintaining good PSNR and SSIM scores, as they adopt adversarial loss and perceptual loss to encourage visually appealing results that closely resemble real images.
Figure 7 presents a visual comparison that complements the quantitative results. Bicubic interpolation, as a traditional method, fails to generate additional details or to enhance image quality significantly. CNN-based super-resolution reconstruction algorithms, such as VDSR, SRResNet, and TESR, reconstruct some texture details better, but they still suffer from contour blurring, primarily because their objective functions use simple optimization strategies. GAN-based super-resolution reconstruction algorithms such as SRGAN and ESRGAN show notable advantages in visual effects and overall image enhancement. Nevertheless, these methods may introduce artifacts during reconstruction, which can compromise the final output quality. The approach proposed here addresses these limitations by recovering finer texture details than the other SR methods compared, and it therefore generates more realistic and visually appealing results that closely resemble natural images. This performance can be attributed to the design of the algorithm, which balances visual quality against unwanted artifacts.
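Because PSNR and SSIM are better when higher while LPIPS is better when lower, selecting the best entry per metric has to account for the direction of each metric. The following sketch shows one way to do this. The helper name, the method labels, and all scores are hypothetical placeholders for illustration; they are not values from Tables 5 to 7.

```python
# True means a higher score is better, False means a lower score is better.
HIGHER_IS_BETTER = {"PSNR": True, "SSIM": True, "LPIPS": False}


def best_per_metric(scores):
    """Return {metric: method}, taking max or min according to the metric."""
    best = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        pick = max if higher else min
        best[metric] = pick(scores, key=lambda name: scores[name][metric])
    return best


# Placeholder scores (NOT results from the paper).
placeholder_scores = {
    "method_a": {"PSNR": 30.0, "SSIM": 0.85, "LPIPS": 0.30},
    "method_b": {"PSNR": 28.5, "SSIM": 0.80, "LPIPS": 0.20},
}

winners = best_per_metric(placeholder_scores)
print(winners)
```

In this made-up example one method leads on the fidelity metrics while the other leads on the perceptual metric, which mirrors the kind of split between pixel-loss and adversarial training discussed above.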

Ablation Studies
To assess the contribution of each component of our proposed method, a series of ablation experiments was performed. In these experiments, we gradually incorporated the RRDB strategy, the Reflection Padding layer (ReflectionPad), and the U-Net structure into the baseline model. All models were trained using an identical configuration, and their performance was evaluated on a test set. The comparative data for the various metrics are presented in Table 8, which shows an overall improvement in model performance throughout the refinement process. First, increasing the number of RRDBs enhances image details and high-frequency information, because the deeper network structure maps the image from the LR space to the HR space more effectively. Consequently, more image details are recovered, resulting in notable improvements in PSNR, SSIM, and LPIPS scores. Next, adding a Reflection Padding layer on top of this foundation helps preserve edge information in the input image. Edge information plays a critical role in generating HR images, since it often contains high-frequency detail that influences the level of detail present in the generated results. With the Reflection Padding layer, the model achieves its best SSIM values, indicating good structural reconstruction. Lastly, incorporating a U-Net structure into the discriminator enables it to capture and integrate image features across multiple resolution levels more effectively. This capability helps distinguish generated images from real ones while also improving reconstructed image quality. In conjunction with our adopted fusion loss approach, this results in superior LPIPS values and improved perceptual quality for human observers. 
At the same time, both the PSNR and SSIM scores exhibit some degree of improvement as well, which is evidence that our model delivers higher-quality images. In summary, following these step-by-step enhancements to our initial design, our proposed method achieves significant improvements across all relevant metrics, thereby validating the effectiveness of each modification introduced.
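To make the role of the Reflection Padding layer concrete, the sketch below pads a small 2D array by mirroring values about the border pixels, and contrasts this with zero padding. It is a plain-Python illustration of the padding rule only, not the layer used in the model. The function names are hypothetical, and the pad width is assumed to be smaller than the image size.

```python
def reflect_index(i, n):
    """Map any integer index into [0, n) by mirroring about the border pixels."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = i % period
    return i if i < n else period - i


def reflection_pad_2d(image, pad):
    """Pad a 2D list on all sides; border pixels themselves are not repeated."""
    h, w = len(image), len(image[0])
    return [
        [image[reflect_index(r, h)][reflect_index(c, w)]
         for c in range(-pad, w + pad)]
        for r in range(-pad, h + pad)
    ]


def zero_pad_2d(image, pad):
    """Pad a 2D list on all sides with zeros, for comparison."""
    w = len(image[0])
    blank = [0] * (w + 2 * pad)
    rows = [[0] * pad + list(row) + [0] * pad for row in image]
    return [list(blank) for _ in range(pad)] + rows + [list(blank) for _ in range(pad)]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

for row in reflection_pad_2d(image, 1):
    print(row)
```

With reflection padding, the border of the padded array continues the local image content, whereas zero padding introduces an artificial intensity step at the edge. This is one plausible reading of why the layer helps preserve edge information.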

Discussion
Remote sensing images contain rich and complex scenes with varied target features, and many existing algorithms have difficulty recovering these details accurately. To overcome this challenge, we propose an Enhanced U-Net Structured Generative Adversarial Network for Remote Sensing Image Super-Resolution Reconstruction (IESRGAN). IESRGAN consists of two parts. The first is a generator network improved with the RRDB module, which reconstructs the texture features of remote sensing images while preserving as many global details as possible. The second is an improved discriminator network based on U-Net, whose skip connections fuse shallow features directly with deep features. Both are important for generating high-quality images, especially in tasks that require fine structures and textures. The proposed IESRGAN model performs well on the NaSC-TG2, Satellite Image of Hurricane Damage, NWPU-RESISC45, and UCMerced LandUse datasets in terms of both visual perception and quantitative measurements. In general, our proposed model outperforms the other methods and provides a new approach for super-resolution reconstruction of remote sensing images. Several limitations of this work need to be noted. First, the proposed algorithm is specifically designed for remotely sensed images and may not perform as well on other types of images. Second, we performed super-resolution reconstruction only at a magnification factor of ×4; the method is not yet satisfactory for higher magnification factors such as ×8.

Conclusions
Extensive experimental results show that the IESRGAN model performs well on quantitative evaluation metrics (such as PSNR, SSIM, and LPIPS) across different real remote sensing image datasets, indicating good stability and generalization ability. The IESRGAN algorithm offers a promising approach to the super-resolution reconstruction of remote sensing images, with applications in feature recognition and classification, land detection, etc. There are several potential directions for future work on IESRGAN. A key area is applying the algorithm to super-resolution reconstruction of remote sensing images at higher magnifications (e.g., ×8), aiming at better practical applicability. In addition, the fusion of multi-source remote sensing image information can be explored to exploit the complementary information between different sources, thus improving the effectiveness of remote sensing image reconstruction. Finally, it would be beneficial to investigate the application of the algorithm in real-world processing tasks, such as land monitoring and object classification. By addressing these challenges, we will continue to advance the field of super-resolution reconstruction of remote sensing images and expand its applicability in various fields.