Gradient Structure Information-Guided Attention Generative Adversarial Networks for Remote Sensing Image Generation

: A rich and effective dataset is an important foundation for the development of AI algorithms, and the quantity and quality of the dataset determine the upper limit level of the algorithms. For aerospace remote sensing datasets, due to the high cost of data collection and susceptibility to meteorological and airway conditions, the existing datasets have two problems: ﬁrstly, the number of datasets is obviously insufﬁcient, and, secondly, there is large unevenness between different categories in datasets. One of the effective solutions is to use neural networks to generate fake data by learning from real data, but existing methods still ﬁnd difﬁculty in generating remote sensing sample images with good texture detail and geometric distortion. To address the shortcomings of existing image generation algorithms, this paper proposes a gradient structure information-guided attention generative adversarial network (SGA-GAN) for remote sensing image generation, which contains two innovative initiatives: on the one hand, a learnable gradient structure information extraction branch network can be added to the generator network to obtain complex structural information in the sample image, thus alleviating the distortion of the sample geometric structure in remote sensing image generation; on the other hand, a multidimensional self-attention feature selection module is proposed to further improve the quality of the generated remote sensing images by connecting cross-attentive modules as well as spatial and channel attention modules in series to guide the generator to better utilize global information. The algorithm proposed in this paper outperformed other methods, such as StyleGAN-XL and FastGAN, in both the qualitative and quantitative evaluation, whereby the FID on the DOTA dataset decreased by 23.927 and the IS was improved by 2.351. The comparison experiments show that the method proposed in this paper can generate more realistic sample images, and images generated by this method can improve object detection metrics by increasing the number of single-category datasets and the number of targets in fewer categories in multi-category datasets, which means it can be effectively used in the ﬁeld of intelligent processing of remote sensing images.


Introduction
In recent years, aerospace remote sensing imaging technology has been applied to many fields, such as land and mineral resource management and monitoring, traffic and road network safety monitoring, geological disaster early warning, and national defense system construction [1][2][3]. Meanwhile, deep learning technology has also greatly promoted the research of remote sensing images in detection and classification [4]. A sufficient amount of data is the cornerstone for achieving high-performance deep learning algorithms, and large high-quality datasets can greatly improve the algorithm performance [5]. However, there are two constraints in the construction of existing remote sensing image datasets: on the one hand, compared with natural image datasets captured by groundbased equipment [6,7], the capture of remote sensing images requires high-cost imaging platforms such as aircraft or satellites, and the acquisition process is limited by aircraft routes and satellite orbits; on the other hand, the influence of factors such as light, rain, fog, and clouds [8] makes it difficult to collect effective high-quality images due to the high proportion of invalid data in each acquisition [9]. The above factors mean that existing datasets cannot meet the demand of artificial intelligence algorithm training in the field of remote sensing [10], which is mainly reflected in two aspects: Issue 1: Insufficient data size. Take the commonly used DOTA dataset as an example. It includes 15 object classes with nearly 190,000 objects, while the non-remote sensing natural image dataset COCO contains 80 classes with more than 1.5 million objects in total, a size 8 times that of DOTA. The lack of data scale leads to a high risk of overfitting the model. Issue 2: Large differences in the samples within and between classes. There are more than 20,000 ships in the DOTA dataset but only 6000 planes, and the specific classes and sizes of planes are not uniform. The lack of image diversity and unbalanced numbers between classes in the existing dataset [11] can bias the model toward majority class prediction [12], limiting further improvement in the performance of network models in target detection and classification, for example [13].
In recent years, some mitigation approaches have been proposed at the algorithm level for issue 2 [14,15]. Zhou et al. [16] proposed a dynamic balancing weighting method based on the number of effective samples for remote sensing image segmentation tasks with data imbalance. CBCL [11] dynamically constructs a class-balanced memory queue during the training of object detection models by memorizing training samples to alleviate class imbalance. However, these approaches are unable to mitigate issue 1.
Data enhancement is an effective approach that can alleviate both of these problems at the same time [17], can obtain large amounts of data from a limited dataset, and can effectively alleviate the problems of insufficient data size and class imbalance, especially in aerospace remote sensing applications, which are widely used by researchers [18,19]. However, many traditional data enhancement methods, such as flipping, scaling, cropping, rotating, or adding noise, only increase the number of remotely sensed images and cannot improve the quality of semantic information as well as the diversity of remotely sensed images in applications. In interpretation tasks such as object detection of remote sensing images, geometric transformation or randomly varying pixel values can no longer meet the increasing accuracy requirements. Therefore, it has become an urgent and indispensable task to use artificial neural network methods for data enhancement to generate sample data of remote sensing images.
Generative adversarial networks (GANs) [20] have led to technological breakthroughs in many areas of deep learning and have been rapidly applied in many directions in the field of aerospace remote sensing [21,22]. Remote sensing images usually contain feature information with a large amount of texture and structure information, which is complex [23], and existing natural image generation models rarely consider the structure information in the generation process. Therefore, if they are used in remote sensing image generation, they will lead to geometric structure distortion in the generated sample images, and the generated pseudo-sample images are often poorly realistic and insufficiently diverse to be reliably used as the basis for various analyses and applications in remote sensing [24]. Additionally, almost all existing related studies in the direction of remote sensing are focused on tasks such as image classification and segmentation, and there are few studies on object detection tasks.
To address the above problems, based on StyleGANv2 [25], this paper innovatively proposes a gradient structure information-guided attention generative adversarial network Remote Sens. 2023, 15, 2827 3 of 20 improvement method for remote sensing image generation to alleviate the model performance degradation problem due to insufficient remote sensing data. This paper uses a multidimensional self-attentive feature selection module (MAFM) to guide the generator to make better use of global information, which can help the generator to better control the generation process and generate higher-quality remote sensing images. Meanwhile, the gradient structure information branching network is used to guide the generator body network, so that the generated images have more realistic structure information, thus alleviating the structural distortion phenomenon existing in remote sensing image generation. The mode seeking [26] regularization term is introduced to increase the ratio of the distance between the generated images to the distance between the corresponding latent codes to solve the problem of insufficient diversity.
The main contributions of this paper are as follows: 1.
Multidimensional self-attentive module A multidimensional self-attentive module applicable to remote sensing image generation is proposed to enhance the convolution and improve the generator model performance. Contextual information is captured by tandemly connecting two cross-attentive modules and modeling the importance of feature maps and interdependencies in three dimensions of the spatial and channel domains. The attention model is embedded into the generative adversarial network to guide the generator to utilize global information while adaptively focusing on important regions.

2.
Gradient structure information guidance model Adding a branching network to the remote sensing image generation network and using the gradient structure information guidance method to improve the generation quality of remote sensing images can better preserve the structural information of the samples, allowing the generation network to output remote sensing images with high perceptual quality and less geometric distortion.

Generative Adversarial Networks
Since they were proposed by Goodfellow et al., generative adversarial networks have become a popular research direction. A large number of variant structures based on generative adversarial networks have emerged and are widely used in various fields such as image generation, transformation, editing, and super-resolution [27][28][29]. Despite the great success of GANs in various applications, existing methods suffer from the mode collapse problem, which leads to a lack of diversity in the generated images. ModeGAN [30] alleviates the lack of diversity by introducing additional encoders to enhance the bidirectional mapping between the input noise vector and the generated images. Ghosh et al. [31] used multiple generators and forced different generators to generate samples with different patterns, thus increasing the diversity of the generated images. MSGAN [26] encourages generators to explore more patterns by maximizing the ratio of the distance between generated images to the distance between corresponding latent vectors, thus increasing the diversity of the generated images.

Image Generation
Image generation is an important research direction in the field of computer vision. The early image generation methods mainly focus on the explicit approximate estimation of the probability density function of the image distribution, but it is difficult to learn a model that can fit the data distribution due to the high-dimensional nature of the probability density distribution of the sample data. In contrast, GANs, as generative models based on implicit density estimation, can provide an effective solution to the problem of sampling and training under high-dimensional probability density distributions by means of adversarial learning, meaning that GANs have been widely used in image generation tasks.
With the continuous development and innovation of deep learning technology, GANbased image generation methods are constantly being updated and improved. ProGAN [32] starts training from low-resolution image generation and gradually increases the image resolution, which both speeds up the training and stabilizes the training process while still being able to generate high-quality images. SAGAN [33] helps GANs to learn the global dependencies of images and generate higher-quality images by introducing an attention mechanism to GANs. BigGAN [34] substantially improves the quality of the generated images by adding orthogonal regularization to the generator, increasing the number of batch processes, and introducing spectral normalization based on SAGAN. StyleGAN [35] achieves controlled image generation by decoupling the hidden space vectors and separating the different style attributes that control the generated images. StyleGANv2 [25] further improves the quality of the generated images by disassembling AdaIN into Norm and Mod parts to eliminate artifacts.
SiftingGAN [36] introduced the GAN method to remote sensing image generation, proposing a GAN-based method to generate and filter labeled samples to increase the sample dataset. MARTA [37] applies DCGAN to remote sensing images and proposes an unsupervised model of multilayer feature matching to generate adversarial networks. EEGAN [38] proposes an edge-enhanced GAN algorithm that can combine the enhanced edges to generate images with clearer content. GAN-RSIGM [39] introduced the Wasserstein distance to the field of remote sensing image generation, proposing a remote sensing image generation method based on generative adversarial networks for creating high-resolution annotated samples for scene classification. Gu et al. [24] proposed an attention mechanismbased pseudo-annotated sample generation method and applied it to the scene classification of remote sensing images, which can be learned from a single natural image and can effectively generate enough pseudo-annotated samples from a single remote sensing scene image sample. MCGAN [40] uses multi-branch expansion convolution and classification branches to help the generator produce diverse and high-quality remote sensing images.

Feature Attention
Convolutional operations are processed under local neighborhoods with finite perceptual fields, which are prone to lose global information, and self-attention has become the latest advancement in capturing remote interactions [41]. Attention mechanisms enable deep learning to be more targeted in extracting features from samples [42], leading to an improvement in the accuracy of related tasks.
SENet [43] proposes a channel attention substructure, where the network model automatically obtains the importance of each feature channel by learning and models the interdependencies between feature channels. GSoP [44] proposes a simple and effective GSoP module that captures global second-order statistics. ECANet [45] proposes an ultra-lightweight attention module for local channel interactions. GCT [46] implements a channel-gated attention mechanism by gating weights and biases. FcaNet [47] proposes a multispectral channel attention approach to fully exploit image information by introducing more frequency components.
In the field of aerospace remote sensing, attention mechanisms also play an important role in many directions [48][49][50]. LMNet [51] proposes the Residual Transformer 3D-spatial Attention Module (RT3DsAM), which can learn feature representations from global information and filter important information.

Method
In this paper, we propose a gradient structure information-guided attention generative adversarial network (SGA-GAN). The structure sketch is shown in Figure 1, including the generator and discriminator. The generator network generates fake images from some random numbers extracted from a uniform distribution, and the discriminator network, as the adversary of the generator, tries to distinguish real images from fake images; both are trained iteratively. Finally, the generator network can perfectly generate realistic fake images, and the discriminator network can effectively determine whether the images are real or fake [20].

Method
In this paper, we propose a gradient structure information-guided attention generative adversarial network (SGA-GAN). The structure sketch is shown in Figure 1, including the generator and discriminator. The generator network generates fake images from some random numbers extracted from a uniform distribution, and the discriminator network, as the adversary of the generator, tries to distinguish real images from fake images; both are trained iteratively. Finally, the generator network can perfectly generate realistic fake images, and the discriminator network can effectively determine whether the images are real or fake [20]. In order to improve the performance of the model in remote sensing images, this paper introduces the multidimensional self-attentive feature selection module (MAFM in Figure 1) into the generator network. Additionally, this paper uses the gradient structure information to guide the model to generate more realistic sample images.

Multidimensional Self-Attentive Module
The generator is the core part of the whole generative adversarial network, and the performance of this part determines the final quality of the generated images. The convolutional layer can only operate on the local proximity context of each spatial location in the feature map, which does not make good use of the contextual information.
CCNet [52] proposes the Criss-Cross Network to obtain contextual information in a more effective and efficient way, based on which this paper proposes a multidimensional self-attentive feature selection module (MAFM) for remote sensing image generation, which can improve the generator model performance by modeling the attentional feature map and connecting the localized convolutional feature map to a farther range. The data flow is shown in Figure 2a,b, which are two parts in series. In order to improve the performance of the model in remote sensing images, this paper introduces the multidimensional self-attentive feature selection module (MAFM in Figure 1) into the generator network. Additionally, this paper uses the gradient structure information to guide the model to generate more realistic sample images.

Multidimensional Self-Attentive Module
The generator is the core part of the whole generative adversarial network, and the performance of this part determines the final quality of the generated images. The convolutional layer can only operate on the local proximity context of each spatial location in the feature map, which does not make good use of the contextual information.
CCNet [52] proposes the Criss-Cross Network to obtain contextual information in a more effective and efficient way, based on which this paper proposes a multidimensional self-attentive feature selection module (MAFM) for remote sensing image generation, which can improve the generator model performance by modeling the attentional feature map and connecting the localized convolutional feature map to a farther range. The data flow is shown in Figure 2a,b, which are two parts in series.
As shown in Figure 2a, the cross-attention module captures contextual information by obtaining the relationship of each point on the feature map with each of its points horizontally and vertically. This vertical and horizontal attention module is more lightweight and uses two cross-attention modules in series for the self-attention calculation on each feature map in the channel domain. The P i and P i weights are shared, and the relationship between each point and all points on that feature map can be obtained via cross-attention twice, which can capture dense global contextual information.
However, it is difficult to capture the relationship between feature maps; therefore, in order to learn the long-distance dependencies within the channel dimension of the feature maps, F is passed sequentially through the spatial attention module and the channel attention module. The feature attention selection module shown in Figure 2b  As shown in Figure 2a, the cross-attention module captures contextual information by obtaining the relationship of each point on the feature map with each of its points horizontally and vertically. This vertical and horizontal attention module is more lightweight and uses two cross-attention modules in series for the self-attention calculation on each feature map in the channel domain. The and weights are shared, and the relationship between each point and all points on that feature map can be obtained via crossattention twice, which can capture dense global contextual information.
However, it is difficult to capture the relationship between feature maps; therefore, in order to learn the long-distance dependencies within the channel dimension of the feature maps, is passed sequentially through the spatial attention module and the channel attention module. The feature attention selection module shown in Figure 2b is added after Figure 2a to model the importance and interdependence of the feature maps in three dimensions: two in the spatial domain and one in the channel domain. and are the spatial attention maps and is the channel attention map, which are obtained in the same way, the only difference being that the dimensions are different.
As shown in Figure 3, taking channel attention as an example, F is first reduced to a one-dimensional vector along the channel dimension, using the maximum pooling operation. Then, the importance of each channel feature map is obtained by a 1 × 1 convolutional layer, ReLU activation function, 1 × 1 convolutional layer, and sigmoid activation function in turn. Subsequent to multiplying the features after the spatial attention module, they are summed with after the 1 × 1 convolutional layer and ReLU activation function in turn to obtain the final output feature .
contains not only the long-range dependencies of all spatial locations within the whole spatial dimension but also the correlations between channel dimensions and the importance of each feature map under that dimension, which enables the generator to better use the global information to generate high-quality remote sensing images. As shown in Figure 3, taking channel attention as an example, F is first reduced to a onedimensional vector along the channel dimension, using the maximum pooling operation. Then, the importance of each channel feature map is obtained by a 1 × 1 convolutional layer, ReLU activation function, 1 × 1 convolutional layer, and sigmoid activation function in turn. Subsequent to multiplying the features after the spatial attention module, they are summed with F after the 1 × 1 convolutional layer and ReLU activation function in turn to obtain the final output feature F o . F o contains not only the long-range dependencies of all spatial locations within the whole spatial dimension but also the correlations between channel dimensions and the importance of each feature map under that dimension, which enables the generator to better use the global information to generate high-quality remote sensing images.

Gradient Structure Information Guidance Model
Since the gradient map reveals clear details of each local region in the sample image [53], the gradient structure information can be used to guide the generation process of the image generation network. However, the sample gradient map extracted by the conventional operator can only reflect one aspect of the sample structure (i.e., the value difference of neighboring pixels), and it is difficult to obtain richer and more general information (e.g., more complex structural textures) in the sample structure.
In this paper, we propose a learnable gradient structure information branching network, GSNet, to generate high-quality remote sensing images with clear structures by extracting the gradient structure information of samples in real remote sensing images and

Gradient Structure Information Guidance Model
Since the gradient map reveals clear details of each local region in the sample image [53], the gradient structure information can be used to guide the generation process of the image generation network. However, the sample gradient map extracted by the conventional operator can only reflect one aspect of the sample structure (i.e., the value difference Remote Sens. 2023, 15, 2827 7 of 20 of neighboring pixels), and it is difficult to obtain richer and more general information (e.g., more complex structural textures) in the sample structure.
In this paper, we propose a learnable gradient structure information branching network, GSNet, to generate high-quality remote sensing images with clear structures by extracting the gradient structure information of samples in real remote sensing images and guiding the generator to pay attention to the sample structure information during the generation process.
A structural diagram of the gradient structure information branching network used in this paper is shown in Figure 4. Since most of the area of the gradient map is close to zero and the convolutional neural network can focus more on the spatial relationships of the contours, the network model may capture structural dependencies more easily. To ensure that GSNet can learn the geometric structural information of the samples rather than the deep semantic information of the samples, GSNet does not use too many convolutional layers, and the conv block is a residual block consisting of three layers of convolution. GSNet is divided into three stages to extract the sample image structure tensor, and the spatial size of the structure information feature map is downsampled by a factor of 2 each time and incorporated into the corresponding position of the main structure of the generative network.

Gradient Structure Information Guidance Model
Since the gradient map reveals clear details of each local region in the sample image [53], the gradient structure information can be used to guide the generation process of the image generation network. However, the sample gradient map extracted by the conventional operator can only reflect one aspect of the sample structure (i.e., the value difference of neighboring pixels), and it is difficult to obtain richer and more general information (e.g., more complex structural textures) in the sample structure.
In this paper, we propose a learnable gradient structure information branching network, GSNet, to generate high-quality remote sensing images with clear structures by extracting the gradient structure information of samples in real remote sensing images and guiding the generator to pay attention to the sample structure information during the generation process.
A structural diagram of the gradient structure information branching network used in this paper is shown in Figure 4. Since most of the area of the gradient map is close to zero and the convolutional neural network can focus more on the spatial relationships of the contours, the network model may capture structural dependencies more easily. To ensure that GSNet can learn the geometric structural information of the samples rather than the deep semantic information of the samples, GSNet does not use too many convolutional layers, and the conv block is a residual block consisting of three layers of convolution. GSNet is divided into three stages to extract the sample image structure tensor, and the spatial size of the structure information feature map is downsampled by a factor of 2 each time and incorporated into the corresponding position of the main structure of the generative network.  The gradient block is used to obtain the gradient map of the original image of the sample without considering the gradient direction information but using a convolutional layer with a fixed kernel, as shown in Equation (1). The gradient calculation formula used is shown in Equation (2).
After obtaining the gradient map, it is first fed into the first convolutional block to extract simple structural features, which consists of a 3 × 3 convolutional layer with a step size of 1 and an instance normalization layer. It is then fed into three modules consisting of a convolution block and a multidimensional feature selection module (FSM) in turn to further extract structural features. The convolutional block consists of three dense residual blocks, each consisting of five 3 × 3 convolutional layers using dense connections and the Leaky-ReLU activation function. The specific structure of the multidimensional feature selection module (FSM) is shown in Figure 5, which uses transposition operations to select features from three dimensions, namely the spatial dimension W, spatial dimension H, and channel dimension C, respectively. Each feature selection module consists of a maximum pooling layer, a 1 × 1 convolutional layer, a ReLU activation function, a 1 × 1 convolutional layer, and a sigmoid activation function.
blocks, each consisting of five 3 × 3 convolutional layers using dense connections and the Leaky-ReLU activation function. The specific structure of the multidimensional feature selection module (FSM) is shown in Figure 5, which uses transposition operations to select features from three dimensions, namely the spatial dimension W, spatial dimension H, and channel dimension C, respectively. Each feature selection module consists of a maximum pooling layer, a 1 × 1 convolutional layer, a ReLU activation function, a 1 × 1 convolutional layer, and a sigmoid activation function. The structural features at different scales are obtained in the branching network of the gradient structural information, and they are input into the generator as a priori structural information to guide the generation process. Using a learnable attention graph, we assign each part of the input different weights to extract more critical and important information so that the branching network can output structural information with more details. The branching network can effectively extract the structural information of the preserved sample images, allowing the generative network to obtain results with high perceptual quality and less geometric distortion. In addition, the obtained gradient information can highlight the sharpness and structural regions that need more attention, thus implicitly leading to the high-quality generation of remote sensing images.

Loss Function
In order to train the algorithm in this paper, it is necessary to compare the image in the training phase, the reconstructed image, and the original image under certain metrics. We designed a loss function L consisting of three components. It is denoted by the following equation: where ℒ is the adversarial loss, ℒ is the perceptual length regularization term, ℒ is the mode seeking regularization term, and , , and are the weight coefficients.
The work in this paper is mainly based on StyleGANv2 for improvement, and thus the loss function in StyleGANv2 is retained, while the mode seeking regularization term is added to improve the diversity of the generated images.

Adversarial Loss
This paper uses the adversarial loss to make the image distribution of the generated image match the image distribution of the real image with the following equation: The structural features at different scales are obtained in the branching network of the gradient structural information, and they are input into the generator as a priori structural information to guide the generation process. Using a learnable attention graph, we assign each part of the input different weights to extract more critical and important information so that the branching network can output structural information with more details. The branching network can effectively extract the structural information of the preserved sample images, allowing the generative network to obtain results with high perceptual quality and less geometric distortion. In addition, the obtained gradient information can highlight the sharpness and structural regions that need more attention, thus implicitly leading to the high-quality generation of remote sensing images.

Loss Function
In order to train the algorithm in this paper, it is necessary to compare the image in the training phase, the reconstructed image, and the original image under certain metrics. We designed a loss function L consisting of three components. It is denoted by the following equation: where L GAN is the adversarial loss, L pl is the perceptual length regularization term, L ms is the mode seeking regularization term, and λ, µ, and γ are the weight coefficients. The work in this paper is mainly based on StyleGANv2 for improvement, and thus the loss function in StyleGANv2 is retained, while the mode seeking regularization term is added to improve the diversity of the generated images.

Adversarial Loss
This paper uses the adversarial loss to make the image distribution of the generated image match the image distribution of the real image with the following equation: where G tries to generate images that look similar to the real image, and D tries to distinguish the generated images from the real image. The goal of G is to minimize the adversarial loss, while the goal of D is to maximize the adversarial loss.

Path Length Regularization
The perceptual path length (PPL), proposed in StyleGANv2 [25], which measures the degree of feature coupling between different random input vectors, representing the perceptual distance length, is also found to be correlated with the quality of the generated images, and thus the quality of the generated images can be improved using the PPL regularization term.
where w ∈ W, W is the intermediate potential space [35], w ∼ f (z), z obeys a normal distribution, J w denotes the Jacobian matrix, α denotes the exponential moving average, which aims to preserve the expected length of the vector regardless of the vector direction, and y is a random image whose pixel (numerical) intensities obey a normal distribution.

Mode Seeking Regularization
GANs can easily fall into mode collapse, which leads to a loss of diversity in the generated images. Therefore, additional regularization terms need to be introduced to alleviate mode collapse and thus ensure the diversity of the generated images.
When mode collapse occurs, two different random input vectors, z 1 and z 2 , are mapped to images I 1 and I 2 with similar patterns, which means that the GAN learns only a few patterns and loses the diversity of the generated images. Therefore, we introduce the mode seeking regularization term to maximize the ratio of the distance between I 1 and I 2 relative to the distance between z 1 and z 2 with the following equation: where z 1 and z 2 are two different random vectors, G(z 1 ) and G(z 2 ) denote the corresponding generated images of z 1 and z 2 , respectively, and d I (·) and d z (·) are distance metrics.

Experiments
In order to verify the effectiveness and generalizability of the model proposed in this paper, the model proposed in this paper and other advanced models were tested on two remote sensing datasets and compared. To quantitatively evaluate the performance of the different network models, the IS (Inception Score), FID (Frechet Inception Distance), and KID (Kernel Inception Distance) were used to evaluate the quality of the images generated by the different models. These are widely used metrics for image generation quality evaluation.
In addition, in this paper, the generated remote sensing sample images were tested in practical applications using current commonly used object detection models, including single-object class experiments as well as multi-object class experiments. The object detection models included YOLOv5 [54], YOLOX [55], and Efficientdet [56]. The results of object detection using the DOTA and UCAS-AOD datasets were used to verify the effectiveness of the generated images.
In this paper, we conducted an ablation study, which is presented in Section 4.5, to verify the effectiveness of the gradient structure information guidance model proposed in this paper and explore the effect of adding a multidimensional self-attentive module on the algorithm.

Evaluation Metrics
The larger the value of the Inception Score (IS) [57], the better the image quality and diversity. Specifically, the generated image x is fed into the Inception [58] classification network, which outputs a 1000-dimensional vector y. It is desired that the entropy p(y|x) of the generated image over the conditional distribution is small, and that the edge distribution p(y) of the generated image over all class probabilities is large. IS is the average KL divergence of these two distributions, and is expressed as follows: The FID [59] determines the quality of the generated image by calculating the distance between the feature vectors of the real image and the generated image, which is extracted and calculated using an image classification model (e.g., Inception v3). A lower FID score means that the two sets of images, or the statistics of the two, are more similar. The calculation formula is as follows: where u r and u g represent the mean values of the real image and the generated image on a certain layer of features of Inception v3, respectively; Σ r and Σ g represent the variance of the real image and the generated image on a certain layer of features, respectively. The KID [60] is used to evaluate the realism of the generated image, and the lower the KID, the higher the visual similarity between the real image and the generated image. It calculates the square of the maximum mean discrepancy (MMD) of the higher perceptual features of the image (the last layer of the Inception model). The two data distributions of the real and generated images are P r and P g , respectively. The KID between the two distributions can be expressed as follows: where x ∼ P r and y ∼ P g denote the high-level perceptual features of the real image and the generated image, respectively, and K(·, ·) denotes the kernel function used for feature transformation:

Datasets
1. DOTA dataset [61,62]: This is a large image dataset for aerial image object detection. The image sources include different sensors and platforms, including Google Earth, the JL-1 satellite, and the GF-2 satellite of the China Resource Satellite Data and Application Center. The dataset consists of a total of 2806 aerial images, each with pixel sizes ranging from 800 × 800 to 4000 × 4000, containing objects of different scales, orientations, and shapes, and the images are annotated by experts using 15 common target categories. In this paper, the images were cropped to a 640 × 640 pixel size for testing.
2. UCAS-AOD dataset [63]: Remote sensing images were collected from Google Earth and used for plane detection. The plane dataset consists of 600 images of 3210 planes, and all the images are carefully selected so that the target directions in the dataset are evenly distributed. In this paper, the images were cropped to a 640 × 640 pixel size for testing.

Image Generation Comparison Results
In order to evaluate the performance of the algorithm in this paper, the method in this paper was compared with other SOTA image generation methods on two datasets. All comparison procedures and test results of the compared models were obtained from their authors' official websites.
DOTA dataset: The results of the algorithm proposed in this paper and the other algorithms were quantitatively analyzed on the DOTA dataset. The FID, KID, and IS metrics of different image generation algorithms are shown in Table 1. Compared with the other five image generation models, the SGA-GAN algorithm proposed in this paper returned the optimal results for the three metrics, i.e., FID, KID, and IS. Compared with the baseline model StyleGANv2, the SGA-GAN algorithm proposed in this paper reduced the FID value from 85.967 to 72.924, the KID value from 0.059 to 0.047, and the IS value from 4.581 to 4.815. Compared with StyleGAN-XL, the FID and KID of SGA-GAN decreased by 23.927 and 0.003, respectively, and the IS improved by 2.351 on the DOTA dataset. The quantitative comparison in terms of the FID, KID, and IS indicates that the SGA-GAN algorithm proposed in this paper can generate higher-quality remote sensing images. The qualitative evaluation of the results of the different algorithms on the DOTA dataset is shown in Figure 6. The remote sensing sample images generated by ProGAN and StyleGANv1 were more seriously distorted, and the partial structures of the wings of some planes in the images were severely distorted. The StyleGANv2, FastGAN, and StyleGAN-XL methods generated a few clearer sample images, but there was also distortion in the partial structures of the wings and tails. In contrast, the method proposed in this paper generated sample images with realistic visual effects, and the image structure was more realistic and reliable. model StyleGANv2, the SGA-GAN algorithm proposed in this paper reduced the FID value from 85.967 to 72.924, the KID value from 0.059 to 0.047, and the IS value from 4.581 to 4.815. Compared with StyleGAN-XL, the FID and KID of SGA-GAN decreased by 23.927 and 0.003, respectively, and the IS improved by 2.351 on the DOTA dataset. The quantitative comparison in terms of the FID, KID, and IS indicates that the SGA-GAN algorithm proposed in this paper can generate higher-quality remote sensing images. The qualitative evaluation of the results of the different algorithms on the DOTA dataset is shown in Figure 6. The remote sensing sample images generated by ProGAN and StyleGANv1 were more seriously distorted, and the partial structures of the wings of some planes in the images were severely distorted. The StyleGANv2, FastGAN, and Style-GAN-XL methods generated a few clearer sample images, but there was also distortion in the partial structures of the wings and tails. In contrast, the method proposed in this paper generated sample images with realistic visual effects, and the image structure was more realistic and reliable. UCAS-AOD dataset: The results of the algorithm proposed in this paper and the other algorithms were quantitatively analyzed on the UCAS-AOD dataset. The FID, KID, and IS metrics of the different image generation algorithms are shown in Table 2. Compared with the other five image generation models, SGA-GAN again returned the optimal results for the three metrics, i.e., FID, KID, and IS. Compared with StyleGANv2, the SGA-GAN algorithm proposed in this paper decreased the FID value from 58.405 to 54.096, the KID value from 0.047 to 0.042, and the IS value from 3.493 to 3.728. Compared to StyleGAN-XL, the FID and KID of SGA-GAN on the UCAS-AOD dataset decreased by 41.464 and 0.009, respectively, and the IS improved by 1.646. The quantitative comparison of the three metrics, i.e., FID, KID, and IS, shows that the SGA-GAN algorithm proposed in this paper can generate higher-quality remote sensing images. The qualitative evaluation of the results of the different algorithms on the UCAS-AOD dataset is shown in Figure 7. The target types in the UCAS-AOD dataset are more concentrated with less inter-class variation. ProGAN, StyleGANv1, and StyleGAN-XL generated a small number of sample target images with a clear structure, but most of the images were severely distorted. The StyleGANv2 and FastGAN methods generated clearer sample images, but there was structural distortion. The method proposed in this paper generated more realistic sample images with clearer and more reliable structures.
GAN algorithm proposed in this paper decreased the FID value from 58.405 to 54.096, the KID value from 0.047 to 0.042, and the IS value from 3.493 to 3.728. Compared to Style-GAN-XL, the FID and KID of SGA-GAN on the UCAS-AOD dataset decreased by 41.464 and 0.009, respectively, and the IS improved by 1.646. The quantitative comparison of the three metrics, i.e., FID, KID, and IS, shows that the SGA-GAN algorithm proposed in this paper can generate higher-quality remote sensing images. The qualitative evaluation of the results of the different algorithms on the UCAS-AOD dataset is shown in Figure 7. The target types in the UCAS-AOD dataset are more concentrated with less inter-class variation. ProGAN, StyleGANv1, and StyleGAN-XL generated a small number of sample target images with a clear structure, but most of the images were severely distorted. The StyleGANv2 and FastGAN methods generated clearer sample images, but there was structural distortion. The method proposed in this paper generated more realistic sample images with clearer and more reliable structures. Comparing the results of all algorithms on the DOTA and UCAS-AOD datasets, it can be seen that the algorithms in this paper have good performance metrics and visual effects on both datasets.

Object Detection Test Results
To evaluate the performance of the algorithm in this paper, target samples generated using the method in this paper were added to the training set of two publicly available aerial remote sensing object detection datasets, while the test set was left unchanged. Comparative experiments were conducted on three different types of generic object detection models, namely YOLOv5 [51], YOLOX [52], and Efficientdet [53], to verify the ability of this paper's method to solve the problems of insufficient image diversity and unbalanced numbers between classes in existing datasets.

Multi-Class Object Data
Adding single-class targets to multi-class target data: The DOTA airborne remote sensing dataset contains 15 common object classes. The number of plane targets is significantly lower than that of ships and vehicles, and the plane targets are of different sizes. We added 1000 samples of plane targets of different sizes generated by the method in this paper to the training set of the original dataset, as shown in Figure 8; the left subfigure (a) shows the image slices of the original dataset, and the right subfigure (b) shows the added generated data to compare the experimental results of object detection.
using the method in this paper were added to the training set of two publicly available aerial remote sensing object detection datasets, while the test set was left unchanged. Comparative experiments were conducted on three different types of generic object detection models, namely YOLOv5 [51], YOLOX [52], and Efficientdet [53], to verify the ability of this paper's method to solve the problems of insufficient image diversity and unbalanced numbers between classes in existing datasets.

Multi-Class Object Data
Adding single-class targets to multi-class target data: The DOTA airborne remote sensing dataset contains 15 common object classes. The number of plane targets is significantly lower than that of ships and vehicles, and the plane targets are of different sizes. We added 1000 samples of plane targets of different sizes generated by the method in this paper to the training set of the original dataset, as shown in Figure 8; the left subfigure (a) shows the image slices of the original dataset, and the right subfigure (b) shows the added generated data to compare the experimental results of object detection. To verify the effectiveness of SGA-GAN in improving the performance of remote sensing object detection, the YOLOv5 and YOLOX detection networks were used to verify the effectiveness on the DOTA dataset. The test results are shown in Table 3. To verify the effectiveness of SGA-GAN in improving the performance of remote sensing object detection, the YOLOv5 and YOLOX detection networks were used to verify the effectiveness on the DOTA dataset. The test results are shown in Table 3. ORI 1000 represents 1000 plane target images in the original DOTA data with conventional data enhancement (random cropping, scaling, rotation, etc.). GEN 1000 represents 1000 plane target samples generated using this paper's method, SGA-GAN, stitched into a 640 × 640-size image and added to the training set. From Table 3, it can be seen that, for YOLOv5, the mAP and AP of plane objects were improved by 1.0% and 1.7%, respectively, after adding the plane target images generated by SGA-GAN. For YOLOX, the mAP and AP of plane objects were improved by 0.9% and 1.3%, respectively, after adding the plane target sample images generated by SGA-GAN. It can also be seen from the table that, compared with the traditional data enhancement methods, the SGA-GAN-generated images had a greater enhancement effect on remote sensing object detection, which can effectively alleviate the inter-class and intra-class imbalance problems existing in the dataset.
Adding multi-class targets to the multi-class target data: A total of 1000 images each of swimming pools, storage tanks, and planes were added to the training dataset of the original dataset, as shown in Figure 9; the left subfigure (a) shows the image slices of the original dataset, and the right subfigure (b) shows the added generated data. The experimental results of multi-class object detection were compared and observed using the YOLOv5 and YOLOX detection networks. a 640 × 640-size image and added to the training set. From Table 3, it can be seen that, for YOLOv5, the mAP and AP of plane objects were improved by 1.0% and 1.7%, respectively, after adding the plane target images generated by SGA-GAN. For YOLOX, the mAP and AP of plane objects were improved by 0.9% and 1.3%, respectively, after adding the plane target sample images generated by SGA-GAN. It can also be seen from the table that, compared with the traditional data enhancement methods, the SGA-GAN-generated images had a greater enhancement effect on remote sensing object detection, which can effectively alleviate the inter-class and intra-class imbalance problems existing in the dataset.
Adding multi-class targets to the multi-class target data: A total of 1000 images each of swimming pools, storage tanks, and planes were added to the training dataset of the original dataset, as shown in Figure 9; the left subfigure (a) shows the image slices of the original dataset, and the right subfigure (b) shows the added generated data. The experimental results of multi-class object detection were compared and observed using the YOLOv5 and YOLOX detection networks. The test results are shown in Table 4, where ORIm1000 represents the multi-class target images in the original DOTA data with conventional data enhancement, and GENm1000 represents the multi-class target samples generated using the SGA-GAN method proposed in this paper. As can be seen from the table, for YOLOv5, the mAP and AP of the three classes of objects were improved by 2.1%, 1.9%, 0.9%, and 2.4% after adding the target images generated by SGA-GAN. For YOLOX, the mAP and AP of the three classes The test results are shown in Table 4, where ORIm 1000 represents the multi-class target images in the original DOTA data with conventional data enhancement, and GENm 1000 represents the multi-class target samples generated using the SGA-GAN method proposed in this paper. As can be seen from the table, for YOLOv5, the mAP and AP of the three classes of objects were improved by 2.1%, 1.9%, 0.9%, and 2.4% after adding the target images generated by SGA-GAN. For YOLOX, the mAP and AP of the three classes of objects were improved by 1.7%, 0.9%, 1.7%, and 3.1% after adding the target images generated by SGA-GAN.

Single-Class Data
Using the UCAS-AOD dataset, which contains only a single plane target, 1000 generated plane targets of different sizes were added to the original dataset, as shown in Figure 10, with the original dataset on the left (a) and the added generated data on the right (b), to compare the experimental results of the observed object detection. Using the UCAS-AOD dataset, which contains only a single plane target, 1000 generated plane targets of different sizes were added to the original dataset, as shown in Figure 10, with the original dataset on the left (a) and the added generated data on the right (b), to compare the experimental results of the observed object detection. The test results are shown in Table 5. ORI1000 represents 1000 plane target images in the original UCAS-AOD data, and GEN1000 represents 1000 plane target samples generated by the method proposed in this paper. The AP of plane objects was significantly improved by 1.6% for YOLOv5, 0.3% for YOLOX, and 1.8% for Efficientdet, after adding the plane objects generated by the method proposed in this paper, which proves that this paper's method can effectively alleviate the problem of the model performance being limited by the size of the dataset. The test results are shown in Table 5. ORI 1000 represents 1000 plane target images in the original UCAS-AOD data, and GEN 1000 represents 1000 plane target samples generated by the method proposed in this paper. The AP of plane objects was significantly improved by 1.6% for YOLOv5, 0.3% for YOLOX, and 1.8% for Efficientdet, after adding the plane objects generated by the method proposed in this paper, which proves that this paper's method can effectively alleviate the problem of the model performance being limited by the size of the dataset. During the model training, mode seeking regularization (L ms ) was removed, and the network was retrained on the DOTA dataset using the same training scheme to verify its effect on the model.
As can be seen from Table 6, the addition of L ms reduced the FID from 74.421 to 72.924, the KID from 0.049 to 0.047, and the IS from 4.710 to 4.815. Thus, the effectiveness of mode seeking regularization in the image generation process is verified, showing that it can increase the diversity of the generated images. During the model training, path length regularization (L pl ) was removed, and the network was retrained on the DOTA dataset using the same training scheme to verify its effect on the model.
As can be seen from Table 7, the addition of L pl reduced the FID from 103.777 to 72.924, the KID from 0.076 to 0.047, and the IS from 4.560 to 4.815. Thus, the effectiveness of path length regularization in the image generation process is verified. The attention module (MAFM) was removed from the generator structure, and the network was retrained on the DOTA dataset using the same training scheme to verify the effectiveness of the MAFM.
As can be seen from Table 8, the addition of the MAFM reduced the FID from 85.967 to 80.156, the KID from 0.059 to 0.054, and the IS from 4.581 to 4.643, thus verifying the effectiveness of the MAFM in the image generation process. GSNet was removed, and the network was retrained on the DOTA dataset using the same training scheme to verify its effectiveness.
As can be seen from Table 9, the FID was decreased from 80.156 to 72.924, the KID was decreased from 0.054 to 0.047, and the IS was decreased from 4.643 to 4.815 after adding the branching network, which verifies the effectiveness of the gradient structure information guidance in the image generation process.

Conclusions
The distortion phenomenon exists in remote sensing images generated by existing image generation methods. To address this problem, this paper proposed an attention generative adversarial network based on gradient structure information guidance (SGA-GAN) for remote sensing image generation. The gradient structure information extraction branching network can effectively alleviate the structure distortion phenomenon existing in remote sensing image generation and improve the quality of the generated remote sensing images. Comparison experiments were conducted on two remote sensing datasets, namely UCAS-AOD and DOTA. Compared with five other advanced image generation models, the SGA-GAN algorithm proposed in this paper returned the optimal results for the three studied metrics, i.e., FID, KID, and IS, and generated reliable structures and realistic visual sample images. At the same time, experiments were conducted on two datasets using three different object detection models, namely YOLOv5, YOLOX, and EfficientDet. After adding aircraft target images generated by SGA-GAN to the DOTA dataset, the APs of YOLOv5 and YOLOX for aircraft objects were improved by 1.7% and 1.3%, respectively. The comparison experiments show that the algorithm in this paper can effectively alleviate the problems of the insufficient scale of existing remote sensing datasets and the imbalance within and between categories, meaning it can be useful in practical applications.
In the next step, we will jointly train the remote sensing image generation and object detection networks to improve the performance of the model in practical applications and provide data and experimental support for the interpretation of satellite remote sensing images.

Conflicts of Interest:
The authors declare no conflict of interest.