3.1.2. Generator
The proposed generator structure in this paper, as depicted in
Figure 2, is based on an encoder–decoder architecture. The encoder is on the left side, while the decoder is on the right. To facilitate feature reuse, skip connections are introduced between the encoder and decoder. This encoder–decoder design enables hierarchical feature extraction of the input data, capturing abstract and high-level semantic features that are subsequently recombined by the decoder. The generator effectively preserves rich details and texture features by incorporating multi-scale feature information, thereby improving image reconstruction.
The generator takes a low-resolution SAR image as input. Upon reading, the original SAR image, a single-channel grayscale image, is converted to RGB mode. This conversion introduces no additional color information: because the grayscale image has only one color channel, the three RGB channels are identical. The initial feature-extraction stage applies a convolution with a LeakyReLU activation function to obtain shallow feature maps. These features are then mapped to higher dimensions through three encoder stages, where the number of channels is doubled and the feature size is halved at each stage; after the k-th encoder stage, for example, the features have 2^k times as many channels as the shallow features while their spatial size is reduced by a factor of 2^k. Each encoder stage consists of the proposed Basic Blocks and downsampling modules. The number of downsampling modules is adjusted so that the feature sizes of the concatenated inputs remain consistent. The upper-level feature extraction in each stage captures features at different scales and complements the output features of the encoder.
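As a rough illustration, the encoder path described above could be sketched in PyTorch as follows. The Basic Block is treated as a pluggable component, and the channel widths, kernel sizes, and strided-convolution downsampling are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn


class EncoderStage(nn.Module):
    """One encoder stage: Basic Blocks followed by a downsampling module.
    `basic_block` is a placeholder factory for the paper's Basic Block (Figure 3)."""
    def __init__(self, channels, basic_block):
        super().__init__()
        self.blocks = basic_block(channels)
        # Strided convolution as an assumed downsampling module:
        # doubles the channels and halves the spatial size.
        self.down = nn.Conv2d(channels, channels * 2, kernel_size=2, stride=2)

    def forward(self, x):
        feat = self.blocks(x)            # features kept for skip connections / EDEM
        return self.down(feat), feat


class GeneratorEncoder(nn.Module):
    def __init__(self, base_channels=32, basic_block=None):
        super().__init__()
        basic_block = basic_block or (lambda c: nn.Identity())
        # Single-channel SAR input replicated to three identical RGB channels,
        # then shallow feature extraction with Conv + LeakyReLU.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, base_channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.stages = nn.ModuleList(
            EncoderStage(base_channels * 2 ** k, basic_block) for k in range(3)
        )

    def forward(self, x_gray):
        x = x_gray.repeat(1, 3, 1, 1)    # grayscale -> identical RGB channels
        feat = self.shallow(x)
        skips = []
        for stage in self.stages:
            feat, skip = stage(feat)
            skips.append(skip)
        return feat, skips
```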
At the end of the encoder, we incorporate a Basic Block module, which captures long-range dependencies within the SAR image features. Channel compression is then employed to retain only the information most relevant to feature reconstruction, reducing the output dimension of the encoder and improving the computational efficiency of the subsequent decoding operations.
The decoder comprises three stages, each consisting of an upsampling module and a Basic Block module. The upsampling module employs transpose convolution with a kernel size of 2 and a stride of 2; with each upsampling layer, the number of channels is halved while the feature size is doubled. The input to each stage is obtained by concatenating the upsampling output of the previous stage with the complementary features. Two Edge Detail Enhancement Modules (EDEMs) are devised so that the input of the first decoder layer is supplemented by the output of the first encoder layer, and the input of the second decoder layer is supplemented by the output of the second encoder layer; this provides the decoder with additional edge features. To leverage multi-scale feature extraction, a Basic Block Pyramid Layer (BBPL) module is constructed by varying the size of the attention window within the Basic Block. This module extracts deep features from the encoder's twice-downsampled feature maps as complementary inputs to the decoder. We also incorporate the modulator proposed in [38] into each decoder stage to improve the model's generalization ability across different degradation patterns. The modulator is a randomly initialized position vector that enhances the model's robustness against interference.
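A decoder stage following this description might be sketched as below. The 1x1 fusion convolution and the way the modulator is added to the features (here, a learnable per-channel vector broadcast over the feature map, a simplified stand-in for the modulator of [38]) are assumptions.

```python
import torch
import torch.nn as nn


class DecoderStage(nn.Module):
    """One decoder stage: transpose-convolution upsampling, concatenation with
    the complementary (skip / EDEM / BBPL) features, a Basic Block, and a
    randomly initialized, learnable modulator added to the features."""
    def __init__(self, in_channels, skip_channels, basic_block):
        super().__init__()
        out_channels = in_channels // 2
        # Kernel size 2, stride 2: halves the channels, doubles the spatial size.
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.fuse = nn.Conv2d(out_channels + skip_channels, out_channels, kernel_size=1)
        self.block = basic_block(out_channels)
        # Simplified modulator: a learnable vector broadcast over spatial positions.
        self.modulator = nn.Parameter(torch.randn(1, out_channels, 1, 1))

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # supplement with complementary features
        x = self.fuse(x)
        return self.block(x + self.modulator)
```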
Following the three decoding stages, a feature map is obtained and reshaped to match the target image resolution using Pixel Shuffle with an upsampling factor of s followed by a convolution, producing a result R. The s-times bilinear interpolation of the low-resolution image is then added to R through a skip connection, yielding the reconstructed image. Introducing this skip connection eases model training and enables the network to capture fine details in low-resolution images more effectively.
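The reconstruction step can be illustrated with the following sketch; the kernel sizes and the single-channel output are assumptions, and F.interpolate stands in for the s-times bilinear interpolation of the low-resolution input.

```python
import torch.nn as nn
import torch.nn.functional as F


class ReconstructionHead(nn.Module):
    """Maps decoder features to the target resolution with PixelShuffle and adds
    the bilinearly upsampled low-resolution image through a skip connection."""
    def __init__(self, in_channels, scale):
        super().__init__()
        self.scale = scale
        # Expand channels so PixelShuffle can trade them for spatial resolution,
        # then project to a single-channel SAR output (an assumed output width).
        self.expand = nn.Conv2d(in_channels, in_channels * scale ** 2,
                                kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.to_image = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feat, lr_image):
        r = self.to_image(self.shuffle(self.expand(feat)))   # R in the text
        lr_up = F.interpolate(lr_image, scale_factor=self.scale,
                              mode="bilinear", align_corners=False)
        return r + lr_up                                      # reconstructed image
```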
In image processing, transformers have been adapted in various forms. For instance, Uformer [
38] is a U-shaped architecture designed specifically for image restoration tasks. It incorporates a locally enhanced window (LeWin) module, facilitating the simultaneous extraction of local information and contextual features. This architecture effectively reduces the computational complexity of the model. Building on this concept, Kulkarni et al. [
39] proposed a novel algorithm called AIDTransformer for aerial image dehazing. They introduced Attentive Deformable Transformer Blocks, which combine attention mechanisms with deformable operations. This approach allows for the removal of haze from images while preserving essential textures.
Considering the remarkable ability of AIDTransformer to recover fine details and textures in dehazing, we enhance its fundamental module and introduce the Basic Block, illustrated in Figure 3. The Basic Block comprises two cascaded transformer structures and a convolutional layer. This module overcomes the limitation of the original module by effectively modeling long-range dependencies in image processing: it captures global information while emphasizing local features. A skip connection from the input to the output is incorporated to ease the training of this module. The overall process can be summarized as follows:
P = FFN(DMSA(PE(X))) + PE(X),
Q = FFN(DMSA(P)) + P,
Y = Conv(PU(Q)) + X,
where X denotes the input of the Basic Block, Y denotes the output of the Basic Block, P denotes the output of the first Feedforward Neural Network (FFN) after mapping with the residuals, Q denotes the output of the second FFN after mapping with the residuals, PE(·) stands for patch embedding, PU(·) stands for patch unembedding, and Conv(·) denotes the convolutional layer.
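A minimal PyTorch sketch of the Basic Block, treating DMSA as a black-box module operating on token sequences, could look as follows. The pre-norm layout, the three-layer FFN width, and the 3x3 convolution are assumptions (the layer normalization is omitted from the equations above for brevity).

```python
import torch.nn as nn


class BasicBlock(nn.Module):
    """Sketch of the Basic Block: patch embedding, two cascaded transformer
    units (DMSA + FFN, each with a residual), patch unembedding, a convolution,
    and a skip connection from input to output."""
    def __init__(self, channels, dmsa, ffn_ratio=4):
        super().__init__()
        def transformer_unit():
            return nn.ModuleDict({
                "norm1": nn.LayerNorm(channels),
                "attn": dmsa(channels),            # assumed factory for DMSA
                "norm2": nn.LayerNorm(channels),
                "ffn": nn.Sequential(              # three FC layers with GELU
                    nn.Linear(channels, channels * ffn_ratio),
                    nn.GELU(),
                    nn.Linear(channels * ffn_ratio, channels * ffn_ratio),
                    nn.GELU(),
                    nn.Linear(channels * ffn_ratio, channels),
                ),
            })
        self.unit1 = transformer_unit()
        self.unit2 = transformer_unit()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    @staticmethod
    def _apply_unit(unit, tokens):
        tokens = tokens + unit["attn"](unit["norm1"](tokens))   # DMSA + residual
        tokens = tokens + unit["ffn"](unit["norm2"](tokens))    # FFN + residual -> P or Q
        return tokens

    def forward(self, x):
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # patch embedding PE(X)
        p = self._apply_unit(self.unit1, tokens)    # P
        q = self._apply_unit(self.unit2, p)         # Q
        y = q.transpose(1, 2).reshape(n, c, h, w)   # patch unembedding PU(Q)
        return self.conv(y) + x                     # convolution + input skip
```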
The feature extraction component in DMSC-GAN employs the deformable multi-head self-attention (DMSA) module proposed in AIDTransformer. This module utilizes a deformable multi-head attention mechanism to adapt to geometric variations within objects in the image and capture distinctive features of objects with different shapes. In our approach, we utilize the DMSA module as our feature extractor. The structure of DMSA is shown in
Figure 4. SADC is a space-aware deformable convolution that focuses on the relevant image regions by predicting offsets associated with the texture. Furthermore, we incorporate an FFN consisting of three fully connected layers with the GELU activation function. Layer normalization (LN) is applied before each DMSA and feedforward network layer.
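The offset mechanism behind SADC can be illustrated with torchvision's deformable convolution. This sketch only covers the texture-dependent offset prediction and deformable sampling, not the full multi-head attention of DMSA, and the kernel size and module names are assumptions.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d


class SpaceAwareDeformConv(nn.Module):
    """Simplified stand-in for SADC: a small convolution predicts per-position
    sampling offsets from the input features (i.e., offsets associated with the
    texture), and a deformable convolution then samples the relevant regions."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Two offsets (x, y) per kernel location.
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        offsets = self.offset_pred(x)    # texture-dependent sampling offsets
        return self.deform(x, offsets)
```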
The pyramid structure has proven to be a highly effective approach for multi-scale modeling in various tasks. However, research on applying pyramid structures to SAR image SR, particularly pyramid structures with varying window sizes, remains limited. Kong et al. demonstrated the effectiveness of a pyramid structure with different window sizes in extracting feature information at different scales [
40]. Building upon this motivation, we propose a Basic Block Pyramid Layer (BBPL) for SAR image SR, as illustrated in
Figure 5. To encompass features of different scales, window sizes of 2, 4, and 8 are selected. SAR images contain features at varying scales: smaller window sizes (e.g., 2) effectively capture small-scale local details such as textures and edges, whereas a larger window size (e.g., 8) enables the extraction of larger-scale global features, such as the overall structure and the interrelationships among objects. However, very large windows (e.g., 16) cause the model to rely excessively on global context while disregarding local details, leading to a decline in performance. Moreover, a larger window increases the training cost of the model and amplifies the training difficulty.
BBPL comprises three Basic Blocks operating in parallel, with window sizes of 2, 4, and 8, respectively. To effectively integrate feature information from multiple scales, we employ a convolutional layer to merge the parallel features and reduce the number of channels. This merging facilitates the extraction of essential image features while reducing the computational complexity of subsequent model operations. BBPL can be expressed as follows:
Y_k = BB_k(X), k = 2, 4, 8,
Y_out = Conv(Concat(Y_2, Y_4, Y_8)),
where BB_k represents a Basic Block with a window size of k, X denotes the input to each Basic Block, Y_k denotes its output, and Y_out is the merged output of BBPL.
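Under the assumption that each Basic Block preserves the channel count and that the merging convolution is a 1x1 layer, BBPL could be sketched as follows.

```python
import torch
import torch.nn as nn


class BBPL(nn.Module):
    """Basic Block Pyramid Layer: three Basic Blocks with window sizes 2, 4,
    and 8 run in parallel; their outputs are concatenated and merged by a
    convolution that also reduces the channels back to the input width."""
    def __init__(self, channels, basic_block):
        super().__init__()
        # basic_block(channels, window_size) is an assumed factory.
        self.branches = nn.ModuleList(
            basic_block(channels, window_size=w) for w in (2, 4, 8)
        )
        self.merge = nn.Conv2d(channels * 3, channels, kernel_size=1)

    def forward(self, x):
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.merge(y)
```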
To balance the network parameters and training speed, we introduce a BBPL module between the second layer of the encoder and the decoder. Placing it at the second layer is motivated by the fact that the first layer primarily captures coarse-grained information, while the third layer focuses on fine-grained details; neither of these two layers alone can capture both coarse- and fine-grained features effectively. The features extracted by the intermediate layer, however, encompass both low-frequency content and high-frequency details, making them valuable for subsequent feature reconstruction. By incorporating the BBPL module at the intermediate layer, we can exploit these features to extract both coarse- and fine-grained information from the image while keeping the number of network parameters low. This approach significantly enhances the image reconstruction performance.
The Edge Detail Enhancement Module (EDEM) is introduced to capture prominent edge details in SAR images, such as buildings and roads, which the Basic Block alone struggles to learn effectively. The structure of EDEM, depicted in Figure 2, is applied between the encoder and the first and second layers of the decoder. EDEM takes two inputs: Input 1, the feature map extracted by the Basic Block without downsampling, and Input 2, the downsampled feature map with additional supplementary features incorporated. These two inputs represent feature maps at different scales, with Input 2 capturing richer edge detail features. To highlight the details in the edge regions and pass them to subsequent modules, we subtract the upsampled version of Feature Map 2 (F_2) from Feature Map 1 (F_1), resulting in a feature map that specifically contains edge detail information (F_edge). This operation effectively enhances the model's ability to reconstruct edge detail features. The representation of EDEM is as follows:
F_edge = F_1 - Up(F_2),
where Up(·) denotes the upsampling operation.
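A minimal sketch of EDEM is given below; bilinear upsampling and the 1x1 projection that aligns the channel counts of the two inputs before the subtraction are assumptions not spelled out in the text.

```python
import torch.nn as nn
import torch.nn.functional as F


class EDEM(nn.Module):
    """Edge Detail Enhancement Module: the downsampled feature map F2 is
    upsampled back to the resolution of F1 and subtracted from it, leaving a
    map of edge detail information that is passed on to the decoder."""
    def __init__(self, channels, channels2=None):
        super().__init__()
        channels2 = channels2 or channels
        # Assumed 1x1 projection so both inputs have matching channels.
        self.align = nn.Conv2d(channels2, channels, kernel_size=1)

    def forward(self, f1, f2):
        f2_up = F.interpolate(f2, size=f1.shape[-2:],
                              mode="bilinear", align_corners=False)
        return f1 - self.align(f2_up)    # F_edge = F_1 - Up(F_2)
```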
3.1.3. Discriminator
Designing a suitable discriminator for SAR image generation tasks is challenging because SAR images differ markedly from optical remote sensing images. SAR images are typically grayscale and carry intensity information as well as advanced attributes such as polarization and phase, enabling sophisticated analysis and interpretation. Moreover, SAR images exhibit intricate texture details. To generate more realistic SAR images, it is therefore essential to develop a discriminator carefully tailored to these unique characteristics.
The PatchGAN architecture is commonly employed as a discriminator in optical image generation tasks. It designs the discriminator as a fully convolutional network whose output is a map of patch-wise scores, which are then averaged to determine the discrimination outcome. However, owing to the relatively consistent content distribution of SAR images, patch-based discrimination may introduce redundancy. To address this issue, we propose an alternative discriminator design that adopts a score-based approach.
In our discriminator design, the output is a single point, and the loss is computed on this single score for each image. This design enables more effective discrimination of SAR images: given the relatively uniform content distribution of SAR images, a single-point output allows the discriminator to evaluate the overall quality of the generated image comprehensively, avoiding excessive local discrimination. Computing the loss on this single point yields a holistic quality assessment of the generated image, which guides the training of the generator.
We propose a multi-scale discriminator design to address the limitations of relying solely on a single
point for discrimination in SAR image SR tasks. The multi-scale discriminator consists of two discriminators with the same structure but different input image scales, which provides more reliable discrimination information. The structure of the multi-scale discriminator is depicted in
Figure 6. It is composed of a five-layer CNN backbone. The first layer comprises two convolutional layers with Rectified Linear Unit (ReLU) activation functions. The second and third layers consist of convolutional layers with batch normalization (BN) and ReLU activation functions. The fourth layer comprises a single BN layer and a convolutional layer with ReLU activation. The final layer is a fully connected layer with pooling and a ReLU activation function. All convolutional kernels in the discriminator share the same size.
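One possible reading of this backbone is sketched below. The channel widths, strides, kernel size, input channel count (the concatenation of the upsampled low-resolution image with the generated or real image), and the exact placement of the BN layers are illustrative assumptions.

```python
import torch.nn as nn


def conv_bn_relu(in_ch, out_ch, k=3):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=2, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class SARDiscriminator(nn.Module):
    """Five-layer CNN discriminator producing a single real/fake score per image."""
    def __init__(self, in_channels=2, base=64):
        super().__init__()
        self.layer1 = nn.Sequential(                    # two convs + ReLU, no BN
            nn.Conv2d(in_channels, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.layer2 = conv_bn_relu(base, base * 2)      # conv + BN + ReLU
        self.layer3 = conv_bn_relu(base * 2, base * 4)  # conv + BN + ReLU
        self.layer4 = conv_bn_relu(base * 4, base * 8)  # conv + BN + ReLU
        self.pool = nn.AdaptiveAvgPool2d(1)             # final layer: pooling + FC
        self.fc = nn.Linear(base * 8, 1)                # single-point score

    def forward(self, x, return_features=False):
        feats = []
        for layer in (self.layer1, self.layer2, self.layer3, self.layer4):
            x = layer(x)
            feats.append(x)                             # kept for feature matching
        score = self.fc(self.pool(x).flatten(1))
        return (score, feats) if return_features else score
```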
Our discriminator operates in two input modes to enable effective discrimination between generated SAR images and authentic high-resolution SAR images: in one mode, the upsampled low-resolution SAR image is concatenated with the generated image, and in the other, it is concatenated with the real image. To handle the two input scales, we employ two discriminators: the first discriminates the full-scale images corresponding to the generated images, while the second focuses on their downsampled counterparts. Using two discriminators allows the model to capture detailed texture information at the original scale as well as global features that contribute edge and contour information. At each layer of the discriminators, we compute a feature matching loss, which aligns the generator's features with those of real images and improves the quality of the generated images. By incorporating the feature matching loss at multiple layers, we encourage the generator to match real image features at different levels of abstraction, leading to more realistic and visually pleasing output.
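Building on the backbone sketched above, the multi-scale arrangement and the per-layer feature matching loss could be written as follows; the 2x average-pooling used to produce the lower-scale input and the L1 distance for feature matching are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleDiscriminator(nn.Module):
    """Two discriminators with identical structure: one sees the full-scale
    concatenated input, the other a 2x-downsampled copy. Both return their
    score and intermediate features for the feature matching loss."""
    def __init__(self, make_discriminator):
        super().__init__()
        self.d_full = make_discriminator()
        self.d_low = make_discriminator()

    def forward(self, x):
        out_full = self.d_full(x, return_features=True)
        out_low = self.d_low(F.avg_pool2d(x, kernel_size=2), return_features=True)
        return out_full, out_low


def feature_matching_loss(fake_feats, real_feats):
    """L1 distance between discriminator features of generated and real inputs,
    accumulated over the discriminator layers."""
    return sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
```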