Article

SAR-to-Optical Remote Sensing Image Translation Method Based on InternImage and Cascaded Multi-Head Attention

College of Electrical and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 55; https://doi.org/10.3390/rs18010055
Submission received: 11 November 2025 / Revised: 18 December 2025 / Accepted: 23 December 2025 / Published: 24 December 2025

Highlights

What are the main findings?
  • For the first time, the InternImage model with deformable convolution v3 (DCNv3) as its core operator is introduced to SAR image translation tasks to extract global semantic features from SAR images;
  • A cascaded multi-head attention module combining multi-head self-attention (MSA) and multi-head cross-attention (MCA) is designed to optimize local detail features while promoting feature interaction between local details and global semantics;
  • For the first time, structural similarity index metric (SSIM) loss is jointly leveraged with adversarial loss, perceptual loss, and feature matching loss in SAR image translation tasks.
What is the implication of the main findings?
  • Our method ultimately generates higher-quality optical remote sensing images compared to mainstream image translation methods.

Abstract

Synthetic aperture radar (SAR), with its all-weather and all-day observation capabilities, plays a significant role in the field of remote sensing. However, due to the unique imaging mechanism of SAR, its interpretation is challenging. Translating SAR images into optical remote sensing images has become a research hotspot in recent years to enhance the interpretability of SAR images. This paper proposes a deep learning-based method for SAR-to-optical remote sensing image translation. The network comprises three parts: a global representor, a generator with cascaded multi-head attention, and a multi-scale discriminator. The global representor, built upon InternImage with deformable convolution v3 (DCNv3) as its core operator, leverages its global receptive field and adaptive spatial aggregation capabilities to extract global semantic features from SAR images. The generator follows the classic “encoder-bottleneck-decoder” structure, where the encoder focuses on extracting local detail features from SAR images. The cascaded multi-head attention module within the bottleneck layer optimizes local detail features and facilitates feature interaction between global semantics and local details. The discriminator adopts a multi-scale structure based on PatchGAN with its local receptive field, enabling joint global and local discrimination. Furthermore, for the first time in SAR image translation tasks, structural similarity index metric (SSIM) loss is combined with adversarial loss, perceptual loss, and feature matching loss as the loss function. A series of experiments demonstrate the effectiveness and reliability of the proposed method. Compared to mainstream image translation methods, our method ultimately generates higher-quality optical remote sensing images with consistent semantics, authentic textures, clear details, and a visually reasonable appearance.

1. Introduction

In recent years, with the continuous development of remote sensing technology, various satellite images, including synthetic aperture radar (SAR) images and optical remote sensing images, have been widely used in agriculture, industry, military, and other fields. SAR is an active microwave imaging radar that reconstructs two-dimensional images of the earth’s surface by actively emitting microwave pulses and receiving backscattered signals, followed by complex signal processing and imaging algorithms. Consequently, SAR imaging is unaffected by time and weather, capable of penetrating clouds for imaging during night or under adverse weather conditions, offering all-weather and all-day operational characteristics. As an important tool in remote sensing, SAR holds significant application value in environmental system monitoring, natural disaster assessment, and marine resource utilization. However, due to the unique imaging mechanism of SAR, SAR images suffer from severe geometric distortions and speckle noise; they require interpretation by professionals, and the rich content they contain is difficult for non-experts to understand. Optical remote sensing images, with their abundant spatial and spectral information, are closer to human visual perception and thus easier to interpret. However, optical payloads and platforms have stringent imaging requirements, making it difficult to capture targets under insufficient illumination or cloudy conditions, which severely limits their observation and monitoring capabilities for ground targets. Therefore, combining the advantages of SAR and optical remote sensing images to improve the efficiency of ground object information extraction and reduce the labor costs for professionals has become a research hotspot in the remote sensing field.
Among these efforts, translating SAR images into optical remote sensing images not only preserves the inherent advantages of SAR images, namely their independence from time and weather, but also effectively interprets the target and scene information within SAR images, thereby expanding their research and application scope. However, this process involves cross-modal conversion, and the differences in imaging principles, textural details, and color styles pose challenges for SAR-to-optical image translation tasks.
Research on SAR-to-optical remote sensing image translation started relatively late, with a significant number of results only published in recent years. Zhang et al. [1] introduced gradient information from SAR images and texture features such as contrast, homogeneity, and correlation based on the gray-level co-occurrence matrix (GLCM) into the generator while retaining the original Pix2pix structure, thereby improving the similarity between the generated images and the target images. Turnes et al. [2] proposed a CGAN model based on dilated convolutions, incorporating an atrous spatial pyramid pooling (ASPP) module in the generator to fully utilize multi-scale spatial contextual information, thus enhancing the accuracy of the generated images. Liu et al. [3] introduced temporal information into CGAN to resolve ambiguity in generated images. This model uses multiple temporally adjacent SAR images of the input as guide images, extracts semantic information from the guide images via a feature mask module to improve translation performance, and introduced a temporal constraint in the loss function to ensure the uniqueness of the translation result. Zhan et al. [4] incorporated a style-based calibration module into CGAN. This module learns the style features of the input SAR images and matches them with the style of the optical remote sensing image, achieving color calibration and minimizing the differences between the generated and target images.
For unsupervised image learning, the lack of corresponding optical reference images for input SAR images can lead to confusion in the generator when learning color information. Addressing this, Ji et al. [5], based on CycleGAN, added an additional mask vector to the input SAR images, allowing the generator to identify the terrain categories of the input images. Simultaneously, they employed a dual-branch discriminator for authenticity discrimination and classification recognition, significantly reducing color errors in unpaired image translation. Targeting the limitation of cycle consistency loss in CycleGAN focusing only on texture information, Hwang et al. [6] introduced mutual information-based correlation loss and structural similarity loss based on luminance, contrast, and structure into the model, enhancing the generator’s learning capability for color and structure information. Addressing the challenge of training with minimal unpaired data using existing methods, Wang et al. [7] adopted the latest Schrödinger Bridge framework and proposed a multi-scale axial residual module (MARM). This module uses a multi-branch structure, performing permutation operations on the feature maps of each branch to enhance global information extraction and cross-channel interaction capabilities. Simultaneously, the axial self-attention mechanism restricts the perceptual range, aiding in extracting local information and facilitating long-range interaction within the current branch, ultimately generating high-quality optical remote sensing images.
Therefore, this paper proposes a high-quality, deep learning-based method for SAR-to-optical remote sensing image translation. The main contributions of this paper are as follows:
  • Traditional supervised translation models rely on the “encoder–decoder” structure, whose limited receptive field cannot effectively model global context. When handling SAR images and optical remote sensing images with significant modal differences, this tends to cause ground-object misclassification and structural distortion in the generated images. This paper therefore introduces an independent global representor and constructs a collaborative working architecture of “global semantic extraction—local detail generation—multi-scale discrimination”. This architecture realizes a clear division of labor between global semantic guidance and local detail generation in the SAR image translation task, fundamentally improving the semantic consistency of the translation results.
  • Existing methods have difficulty effectively balancing feature expression ability and computational efficiency: Transformer has large computational cost, while traditional convolution cannot effectively extract the global semantic features of SAR images. This paper creatively uses the InternImage model as the global representor, and its core operator DCNv3 achieves long-range dependency and adaptive spatial aggregation capabilities through dynamic offset and modulation scalar mechanisms. This innovation enables the model to efficiently extract discriminative global semantic features from SAR images with speckle noise and geometric distortion at lower computational cost.
  • Existing methods mostly adopt simple concatenation or addition for multi-source features, so global semantic guidance cannot effectively penetrate the detail generation process. This paper designs a cascaded multi-head attention module. By cascading multi-head self-attention (MSA) and multi-head cross-attention (MCA), it realizes the optimization of local details and the deep calibration of global semantics. This module addresses the challenge of detail enhancement under semantic guidance and ensures that the generated images have accurate semantic structure at the macro level and clear texture details at the micro level.
  • Mainstream image translation methods rely on pixel-level losses, which struggle to drive the model to learn perceptual similarity. This paper systematically combines structural similarity index metric (SSIM) loss, adversarial loss, perceptual loss, and feature matching loss in the SAR image translation task, forming comprehensive supervision from low-order pixels to high-order perception. This optimization strategy significantly improves the visual naturalness and structural integrity of the generated images.
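As background on the SSIM term, the following is a minimal single-window sketch in pure Python (a toy version for illustration; practical SSIM uses a sliding Gaussian window, and the constants c1, c2 here assume pixel values normalized to [0, 1]):

```python
def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM between two equal-size grayscale patches
    given as flat lists of pixel values in [0, 1]."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_loss(x, y):
    # SSIM equals 1 for identical patches, so 1 - SSIM is minimizable.
    return 1.0 - ssim(x, y)
```

Because SSIM compares luminance, contrast, and structure jointly, minimizing 1 − SSIM pushes the generator toward structural fidelity rather than mere per-pixel agreement.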

2. Related Work

2.1. Generative Adversarial Networks

Generative adversarial networks (GANs) were initially proposed by Goodfellow et al. [8] and are primarily used for tasks such as image generation, image inpainting, and style transfer. GANs possess powerful data generation capabilities, primarily because GANs learn more abstract objectives compared to other deep learning models. Specifically, GANs consist of two networks, a generator and a discriminator, each with mutually adversarial training objectives. Taking image generation as an example, the generator’s goal is to generate fake images as similar as possible to real images, making it difficult for the discriminator to distinguish them accurately. The discriminator’s goal is to discriminate between fake and real images as effectively as possible. Therefore, during model training, the generator and discriminator engage in a game until the discriminator, while possessing certain discrimination capability, struggles to make effective judgments on the generated images. This indicates that Nash equilibrium is reached, and the model converges.
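The adversarial game described above is conventionally written as the minimax objective from [8] (the standard formulation, restated here for reference):

```latex
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

At the optimum, the generator’s distribution matches the data distribution, which corresponds to the Nash equilibrium mentioned above.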
However, for the generator of GAN, to ensure the diversity of generated images, the input is mostly random noise, making it difficult to manually control the direction of image generation. In contrast, conditional GAN (CGAN) [9] is a variant of GAN. By introducing conditional information into both the generator and discriminator, it controls the image generation direction in a supervised learning manner, enabling the model to capture image features and structures more effectively.

2.2. Image Translation

Image translation involves learning the mapping relationship from source domain images to target domain images. This concept has attracted researchers’ attention since its inception. Compared to other tasks, the key aspect of image translation is to learn the mapping from one visual representation to another while fully understanding the underlying features shared by the images in these representations. In particular, these features can be divided into style-related features (style features) and style-independent features (content features). Content features represent the underlying spatial structure that should be preserved as much as possible during translation. Style features relate to the rendering of the structure and thus should be correctly transformed during translation. Therefore, most translation models adopt an “image encoding—feature transformation—feature decoding” framework.
However, learning the mapping between two or even multiple style domains is a challenging task. Before the widespread application of deep learning, most traditional methods relied on complex handcrafted designs for the feature encoding and transformation processes. Their accuracy largely depended on the performance of the selected or designed algorithms, making it difficult to meet the low information loss tolerance requirements of image translation. Therefore, it was not until the emergence of GANs, with their powerful image generation capability and diversity of generated images, that research in image translation gained widespread attention and significant progress. Nowadays, CGAN, the conditional variant of GAN, underpins the state-of-the-art solutions in this field, including classic image translation models such as Pix2pix [10], CycleGAN [11], and Pix2pixHD [12].

2.3. Transformer

The attention-based Transformer was initially proposed by Vaswani et al. [13] and first applied to natural language processing tasks, ultimately achieving performance superior to convolutional neural networks (CNNs) and recurrent neural networks (RNNs). This model adopts an “encoder-decoder” network structure. Its key aspect involves performing positional encoding on the input information and utilizing the multi-head self-attention (MSA) mechanism to capture dependencies between inputs at different distances. Compared to CNNs, which are limited by the receptive field of their convolution kernels, the Transformer model not only possesses powerful feature extraction and long-range feature capture capabilities but also enables parallel computation, making it significantly superior to RNNs in terms of computational cost.
The great success of Transformer in natural language processing led some researchers to attempt to apply this model to computer vision. Dosovitskiy et al. verified the feasibility of using the Transformer model for image classification tasks and proposed vision Transformer (ViT) [14]. Unlike CNN, ViT does not use convolution to extract image features but consists of N stacked encoder blocks. The input image is first split into patches of the same size. Each patch is then embedded into one-dimensional vectors of identical length. Finally, all vectors are combined and subjected to positional encoding, serving as the input for the subsequent ViT encoder blocks. Each encoder block consists of layer normalization (LN), MSA, and multilayer perceptron (MLP).
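ViT’s tokenization step can be sketched in pure Python for a single-channel image (the linear embedding and positional encoding that follow are omitted):

```python
def patchify(img, p):
    """Split an H×W single-channel image (list of rows) into
    non-overlapping p×p patches, each flattened to a 1-D vector,
    mirroring ViT's patch-splitting step."""
    h, w = len(img), len(img[0])
    assert h % p == 0 and w % p == 0, "patch size must divide the image"
    patches = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            patch = [img[i + di][j + dj] for di in range(p) for dj in range(p)]
            patches.append(patch)
    return patches  # (H/p)*(W/p) tokens, each of length p*p
```

For a 256×256 input with 16×16 patches this yields 256 tokens, each a flattened 256-value vector before embedding.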
For high-resolution vision tasks like object detection and semantic segmentation, Liu et al. proposed Swin Transformer [15,16]. It employs a window-based multi-head self-attention mechanism with sliding window operations. Swin Transformer consists of two consecutive encoder blocks. The only difference between the two blocks is the adopted windowing mechanism: windows multi-head self-attention (W-MSA) and shifted windows multi-head self-attention (SW-MSA). The reason for using two mechanisms is that a single windowing mechanism would isolate information transfer between different windows, which is detrimental to global information extraction.

3. Methodology

3.1. Overall Architecture

Figure 1 shows the overall architecture of our network, which consists of three key components: a global representor, a generator, and a discriminator. The input SAR image x ∈ ℝ^{3×256×256} is first passed through a channel expansion operation to generate the feature map x₀ ∈ ℝ^{64×256×256}. Since SAR images inherently contain complex structural information and rich texture information, expanding the number of channels allows the model to extract information from more feature channels, enhancing its expressive power and thus improving the performance of the translation task.
The global representor is an InternImage model with DCNv3 as its core operator. It is responsible for extracting detailed information from the feature map x₀ to generate global semantic features. These features provide semantic guidance during the SAR-to-optical image translation process.
The generator adopts the classic “encoder-bottleneck-decoder” structure. Both the encoder and decoder consist of three stages. The feature map x₀ is progressively mapped to higher-dimensional representations during encoding. Each encoder stage consists of a basic residual block and a down-sampling module. After each stage, the number of feature channels is doubled while the spatial size is halved. Due to the limited receptive field of convolution kernels, after three downsampling steps, the encoder can effectively capture local detail features of the SAR image. The bottleneck layer consists of two basic residual blocks and a cascaded multi-head attention module. Its core function is to optimize the local features extracted by the encoder and integrate them with the global semantic features through interaction. Each decoder stage includes a basic residual block and an up-sampling module. Drawing on the skip connection mechanism from the U-Net architecture [17], connections are made to facilitate information transfer between the encoder and decoder, gradually restoring the spatial resolution. The feature map x₀ is then combined with the encoder output x₁ via skip connections to generate feature map x₂. To suppress the amplification of inherent speckle noise in SAR images during decoding and avoid artifacts or false textures in the generated image, we process the feature map x₂ using a weight standardized (WS) residual block. Weight standardization [18], by constraining the distribution of convolutional kernel weights, significantly reduces the impact of outlier weights on the output. Finally, a channel compression operation converts the feature map x₂ into a three-channel image to produce the final result.
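As a quick sanity check on these dimensions, the stage-by-stage shape progression of the encoder can be sketched as follows (a toy helper; the starting shape 64×256×256 comes from the channel expansion step):

```python
def encoder_shapes(c=64, h=256, w=256, stages=3):
    """Track (channels, height, width) through the encoder stages:
    each stage doubles the channels and halves the spatial size."""
    shapes = [(c, h, w)]
    for _ in range(stages):
        c, h, w = c * 2, h // 2, w // 2
        shapes.append((c, h, w))
    return shapes
```

After three stages the feature map is 512×32×32, which is the resolution at which the bottleneck’s attention module later operates.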
The discriminator adopts a multi-scale structure based on PatchGAN, where D1 represents the high-resolution discriminator and D2 represents the low-resolution discriminator. By performing discrimination simultaneously at high and low resolutions, it optimizes image generation in a coarse-to-fine manner.
In summary, the three components act jointly: the global representor generates global semantic features of the SAR image; the generator’s encoder focuses on extracting local detail features; the bottleneck layer enables feature interaction between local details and global semantics; the decoder uses the interacted features, combined with encoder information passed through skip connections, for spatial resolution reconstruction; and the discriminator performs joint global and local discrimination. Together, they ultimately generate high-quality optical remote sensing images.

3.2. Global Representor

3.2.1. Deformable Convolution V3

The core function of the global representor is to extract information from the feature map x₀ generated in the channel expansion stage and produce an abstract representation characterizing the overall features or information of the input, i.e., the global semantic features of the SAR image rather than local detail features. These features provide semantic guidance during SAR image translation tasks, and their quality directly determines the semantic consistency of the final generated image. Therefore, constructing a global representor with both strong representational capacity and computational efficiency is key to enhancing the performance of our network.
In recent years, Transformer models have been gradually introduced into computer vision, leveraging their outstanding performance in natural language processing. Among them, ViT and Swin Transformer, relying on the global receptive field and adaptive spatial aggregation capabilities provided by their core operator, the self-attention mechanism, have achieved remarkable success in various vision tasks. However, the self-attention mechanism itself has significant drawbacks, such as high computational cost and memory usage. In practical applications, due to the unique imaging mechanism of SAR images and influences like speckle noise leading to complex and unstable feature expression, using ViT and Swin Transformer as the global representor in this context typically requires stacking numerous self-attention modules and feed-forward networks. This inevitably leads to high computational costs and memory overhead.
To address the above issues, this paper introduces deformable convolution v3 (DCNv3) proposed by Wang et al. [19] as the core operator, replacing the multi-head self-attention mechanism. DCNv3 is an optimization based on DCNv2 [20]. Figure 2 shows a schematic diagram of DCNv3’s working principle, which combines dynamic offsets, modulation scalars, and the multi-group mechanism.
Among these, dynamic offsets mean that each sampling point is not fixed at a regular grid position but dynamically moves to a new position based on offsets learned from the input features. This allows the convolution kernel to adjust its receptive field according to the input content. Modulation scalars refer to weight coefficients applied to the feature value of each sampling point, predicted based on the input features. This enables the convolution kernel to dynamically and finely adjust the contribution or importance of that sampling point to the final output feature based on the characteristics of the input content, adaptively focusing on key regions. The multi-group mechanism simulates the multi-head design in MSA. By dividing the channels of the input features into several groups, each group independently learns offsets and modulation scalars, further enhancing the expressiveness of the convolution kernel and its adaptability to diverse image content.
Assuming a given input feature X ∈ ℝ^{C×H×W} and the current coordinate p₀, the working principle of DCNv3 can be expressed as:
Y(p₀) = ∑_{g=1}^{G} ∑_{k=1}^{K} w_g · m_gk · X_g(p₀ + p_k + Δp_gk)
where G represents the number of groups, g indexes the g-th group, K represents the number of sampling points in the convolution kernel, and k indexes the k-th sampling point. For the g-th group, w_g represents the position-independent projection weight, m_gk represents the learned modulation scalar for the k-th sampling point, X_g represents the slice of input features corresponding to that group, p_k represents the preset offset coordinate of the k-th sampling point, Δp_gk represents the learned offset for the k-th sampling point, and Y(p₀) represents the final output at that coordinate after DCNv3 processing.
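As an illustration of the formula above, the following pure-Python sketch computes a toy DCNv3 response at a single output location. It assumes one channel per group and uses a scalar stand-in for the projection weight w_g (both simplifications); bilinear interpolation handles the fractional sampling positions produced by the learned offsets:

```python
def bilinear(feat, y, x):
    """Bilinearly sample a 2-D feature map (list of rows) at a
    fractional coordinate, clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0][x0] * (1 - dy) * (1 - dx) + feat[y0][x1] * (1 - dy) * dx
            + feat[y1][x0] * dy * (1 - dx) + feat[y1][x1] * dy * dx)

def dcnv3_point(groups, p0, grid, offsets, scalars, weights):
    """Toy DCNv3 response at one location p0 = (y, x): groups[g] is that
    group's 2-D feature slice X_g, grid holds the K preset kernel offsets
    p_k, offsets[g][k] the learned offsets, scalars[g][k] the modulation
    scalars, and weights[g] a scalar stand-in for w_g."""
    y0, x0 = p0
    out = 0.0
    for g, feat in enumerate(groups):
        acc = 0.0
        for k, (ky, kx) in enumerate(grid):
            oy, ox = offsets[g][k]
            acc += scalars[g][k] * bilinear(feat, y0 + ky + oy, x0 + kx + ox)
        out += weights[g] * acc
    return out
```

The dynamic offsets move sampling points off the regular grid, and the modulation scalars re-weight each sampled value, which is exactly what lets the kernel adapt its receptive field to the input content.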
In summary, DCNv3, as an extension of the DCN series, exhibits three advantages when processing SAR images with unique imaging mechanism and applying them to SAR-to-optical image translation tasks: (1) The operator’s long-range dependency modeling and adaptive spatial aggregation capabilities can effectively adapt to imaging characteristics such as geometric distortions and speckle noise present in SAR images, compensating for the inherent shortcomings of regular convolutions in extracting complex global features from SAR images. (2) Compared to common attention-based operators like MSA and deformable attention [21], DCNv3 inherits the inductive bias of convolution operations, significantly reducing the model’s reliance on large-scale training data and lengthy training cycles. This is particularly suitable for the challenges of high annotation costs and scarce high-quality paired datasets in SAR image translation tasks. (3) The sparse sampling design makes DCNv3 significantly lower in computational cost and memory usage than MSA and re-parameterized large-kernel convolutions [22], and endows it with the ability to capture long-range dependencies with only a 3 × 3 kernel, thereby simplifying the optimization process.

3.2.2. InternImage

The previous section detailed the advantages of DCNv3 as a core operator for processing SAR images. This section discusses how to construct a high-quality global representor for SAR image translation based on InternImage with DCNv3 as the core operator. Here, we introduce the InternImage basic block constructed by Wang et al. Since DCNv3 possesses long-range dependency and adaptive spatial aggregation capabilities similar to MSA, and ViT has achieved great success in vision tasks, the basic block of InternImage is similar to that of ViT. As shown in Figure 3, given an input X ∈ ℝ^{C×H×W}, the operation process of this basic block can be expressed as:
Y = MLP(LN(DCNv3(LN(X)) + X)) + DCNv3(LN(X)) + X
where LN represents layer normalization, which improves training stability and convergence speed. DCNv3 represents deformable convolution v3, where offsets and modulation scalars are learned based on the input features. MLP represents the multilayer perceptron, consisting of fully connected layers with GELU activation function, enhancing the model’s nonlinear expressive capability. Residual connections are also introduced, preventing gradient vanishing and enhancing the model’s robustness. This design has been proven effective in many vision tasks [23,24].
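The wiring of this basic block can be sketched with stand-in callables for LN, DCNv3, and the MLP (all placeholders, not real layer implementations): a DCNv3 residual sub-layer followed by an MLP residual sub-layer.

```python
def intern_image_block(x, ln1, dcnv3, ln2, mlp):
    """Basic-block wiring: two residual sub-layers, mirroring the
    ViT-style pre-norm pattern with DCNv3 in place of MSA."""
    z = dcnv3(ln1(x)) + x   # Z = DCNv3(LN(X)) + X
    return mlp(ln2(z)) + z  # Y = MLP(LN(Z)) + Z
```

With toy scalar functions (identity LN, dcnv3 = 2x, mlp = x + 1), an input of 1 gives z = 3 and an output of 7, showing how both residual paths contribute to the result.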
Therefore, the structure of the global representor designed in this paper for SAR image translation is shown in Figure 4. This structure mainly consists of 1 stem layer, 4 InternImage basic blocks, and 3 down-sampling layers.
The stem layer is an efficient initial processing stage. Two consecutive down-sampling operations not only perform feature extraction on the input feature map but also significantly reduce its spatial resolution. This reduces the computational burden for subsequent modules and lays the foundation for fully extracting the rich global semantic features in SAR images. The InternImage basic blocks and down-sampling layers are stacked according to the following rule:
C_i = 2^{i−1} · C₁
G_i = C_i / C′
L₁ = L₂ = L₄, L₁ ≤ L₃
where C_i represents the number of channels in the i-th stage, G_i represents the number of groups for DCNv3 in the i-th stage, C′ represents the fixed channel dimension of each feature map slice (i.e., the per-group channels), and L_i represents the number of InternImage basic blocks in the i-th stage. According to the above stacking rule, the number of groups for DCNv3 in each stage is twice that of the previous stage, so DCNv3 processes feature map slices of the same dimension throughout the structure. Each down-sampling operation halves the spatial size of the feature map while doubling the number of channels. The three down-sampling operations, combined with the powerful long-range dependency and adaptive spatial aggregation capabilities embedded in the InternImage basic blocks, can not only fully capture the rich semantic information in SAR images but also ensure that the final generated global semantic features precisely match the output dimension of the encoder after its three down-sampling steps. This provides a necessary prerequisite for the subsequent introduction of the cascaded multi-head attention mechanism for feature interaction.
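A minimal sketch of the stacking rule, using illustrative values C₁ = 64 and C′ = 16 (both assumptions for the example, not the paper’s configuration):

```python
def stacking_rule(c1=64, c_slice=16, stages=4):
    """Channels double per stage (C_i = 2**(i-1) * C_1), and the group
    count G_i = C_i / C' keeps each group's slice dimension fixed."""
    channels = [c1 * 2 ** i for i in range(stages)]
    groups = [c // c_slice for c in channels]
    return channels, groups
```

Doubling the group count alongside the channels is what keeps the per-group slice at a constant C′ channels in every stage.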

3.3. Generator with Cascaded Multi-Head Attention

3.3.1. Residual Block Structure

The basic residual block is an important component of our network. Through the identity mapping achieved by residual connections, the network can effectively retain key features of the input SAR image, such as geometric structures and building contours, during the complex cross-modal conversion from SAR to optical images. This residual connection significantly alleviates the gradient vanishing problem during training, ensuring network training stability. Meanwhile, the internal structure of the residual block needs to empower the network to learn highly complex nonlinear mapping relationships from the SAR feature space to the optical feature space, which is crucial for generating high-quality optical remote sensing images with consistent structure, clear details, and reasonable colors.
Figure 5 shows the basic residual block designed in this paper. Given an input X ∈ ℝ^{C×H×W}, the operation process of this basic residual block can be expressed as:
Y = Conv(SiLU(GN(Conv(SiLU(GN(X)))))) + X
Compared to the commonly used instance normalization (IN) layer and ReLU activation function in the image generation field, this paper selects the group normalization (GN) layer and SiLU activation function. GN normalizes independently along the spatial dimension within channel groups, avoiding the dependence on batch size or internal statistics of individual samples inherent in batch normalization (BN) or IN, enhancing the model’s robustness and generalization when processing the unique and variable intensity distributions and speckle noise of SAR images. SiLU is a continuously differentiable and smooth function, providing more flexible nonlinear modeling capability than ReLU. The convolution operation is the core of the basic residual block, extracting local detail features such as texture and edges from the SAR image through three down-sampling steps. Simultaneously, the combination of two GN + SiLU + Conv structures significantly enhances the feature processing capability and nonlinear expressive power within the block, which is crucial for the cross-modal translation task of SAR-to-optical images.
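The two operations favored here can be sketched in pure Python (a toy one-value-per-channel version of GN, for illustration only):

```python
import math

def silu(v):
    # SiLU(x) = x * sigmoid(x): smooth and continuously differentiable,
    # unlike ReLU's kink at zero.
    return v / (1.0 + math.exp(-v))

def group_norm(x, num_groups, eps=1e-5):
    """Normalize a flat per-channel feature vector within channel groups.
    The statistics come only from each group of the single sample, so the
    result is independent of batch size."""
    n = len(x)
    size = n // num_groups
    out = []
    for g in range(num_groups):
        grp = x[g * size:(g + 1) * size]
        mu = sum(grp) / size
        var = sum((v - mu) ** 2 for v in grp) / size
        out.extend((v - mu) / math.sqrt(var + eps) for v in grp)
    return out
```

Because each group is standardized from its own statistics, an outlier intensity distribution in one SAR sample cannot skew the normalization of other samples, which is the robustness property argued for above.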

3.3.2. Cascaded Multi-Head Attention Module

As mentioned earlier, the global representor is responsible for extracting rich global semantic features from the SAR image, while the encoder focuses on extracting local detail features from the SAR image. In the SAR-to-optical image translation task, high-quality generated images require not only clear, discernible details and naturally reasonable optical textures at the local level, such as building contours and vegetation patches, but also accurate conveyance of semantic information, such as land cover category distribution, contained in the original SAR image at the global level. Addressing this requirement, this paper designs the cascaded multi-head attention module shown in Figure 6.
The core operation process of our cascaded multi-head attention module can be expressed as:
$$\begin{aligned}
Y &= \mathrm{MSA}(\mathrm{LN}(X)) + X \\
W &= \mathrm{MCA}(\mathrm{LN}(Y), V) + Y \\
F &= \mathrm{FFN}(\mathrm{LN}(W)) + W
\end{aligned}$$
where $X \in \mathbb{R}^{1024 \times 512}$ represents the features formed by processing the encoder's local detail features via a basic residual block, a GN layer, a convolutional layer, and subsequent reshaping. MSA denotes the multi-head self-attention, and $Y \in \mathbb{R}^{1024 \times 512}$ represents the output of the MSA module. $V \in \mathbb{R}^{64 \times 512}$ represents the features obtained after processing the global representor's output global semantic features through a basic residual block, a GN layer, and a convolutional layer, followed by reshaping. MCA denotes the multi-head cross-attention, and $W \in \mathbb{R}^{1024 \times 512}$ represents the output of the MCA module. FFN denotes the feed-forward network, and $F \in \mathbb{R}^{1024 \times 512}$ represents its output. In the above operation, the basic residual block extracts a more refined local representation, the GN layer stabilizes the training process and reduces internal covariate shift, the convolutional layer adjusts the number of channels to match the preset model dimension of the attention mechanism, and the reshaping operation converts the spatial feature map into the two-dimensional sequence format required by the standard attention mechanism.
The operation formula for the MSA mechanism is:
$$\begin{aligned}
Z_i &= \mathrm{SelfAttention}(Q_i, K_i, V_i) = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \\
\mathrm{MSA} &= W_S \,\mathrm{Concat}(Z_1, Z_2, \ldots, Z_8)
\end{aligned}$$
where $Z_i$ represents the output of the $i$-th head of the MSA mechanism; $Q_i$, $K_i$, $V_i$ represent the query, key, and value matrices derived from the layer-normalized $X$ for the $i$-th head; $d_k$ is the dimension of each head; Softmax represents the softmax function; $W_S$ represents the output weight matrix of the MSA mechanism; and Concat denotes concatenating the outputs of all heads along the channel dimension. Although the encoder has captured local detail features such as edges and textures through three down-sampling steps, these features may be coarse and blurry due to the inherent speckle noise and geometric distortions in SAR images. The self-attention mechanism, relying on its global receptive field and adaptive spatial aggregation capability, can effectively model long-range dependencies and adaptively suppress noise interference and distortion effects, thereby generating cleaner feature representations. Introducing the MSA mechanism therefore directly optimizes these local detail features, providing a cleaner and stronger foundation for the subsequent interaction between global semantic features and local detail features.
The operation formula for the MCA mechanism is:
$$\begin{aligned}
Z_j &= \mathrm{CrossAttention}(Q_j, K_j, V_j) = \mathrm{Softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d_k}}\right) V_j \\
\mathrm{MCA} &= W_C \,\mathrm{Concat}(Z_1, Z_2, \ldots, Z_8)
\end{aligned}$$
where $Z_j$ represents the output of the $j$-th head of the MCA mechanism; $Q_j$ represents the query matrix derived from the layer-normalized $Y$ for the $j$-th head; $K_j$, $V_j$ represent the key and value matrices derived from the layer-normalized $V$ for the $j$-th head; and $W_C$ represents the output weight matrix of the MCA mechanism. The core role of the cross-attention mechanism is to exchange information between different feature sources: the optimized local detail features are used as queries, while the global semantic features serve as keys and values, all of which are input into the MCA mechanism. This enables the decoder to query the global semantic context when reconstructing each local position. The global semantic features thus serve as a high-level semantic dictionary or guidance blueprint, ensuring that while the details of the optical remote sensing image are generated, the overall land cover categories, spatial relationships, and structural layout remain consistent with the semantic content of the input SAR image. By promoting the interaction between the global semantic features extracted by the global representor and the optimized local detail features, the module ensures that the generated image maintains high similarity to the target image in both global structure and local details.
In summary, the cascaded multi-head attention module designed in this paper, by cascading MSA and MCA, constructs an efficient dual-stage feature enhancement and interaction framework, ultimately significantly enhancing the comprehensive quality of the generated images.
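A minimal PyTorch sketch of this cascaded MSA → MCA → FFN structure is given below. The use of `nn.MultiheadAttention`, the FFN expansion ratio of 4, and the separate LayerNorm applied to the global tokens are assumptions for illustration, not implementation details confirmed by the paper:

```python
import torch
import torch.nn as nn

class CascadedAttention(nn.Module):
    """Sketch of the cascaded MSA -> MCA -> FFN bottleneck described above.
    The paper uses 8 heads and model dimension 512; smaller values work too."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.ln_x = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln_y = nn.LayerNorm(dim)
        self.ln_v = nn.LayerNorm(dim)  # LN for the global tokens (assumption)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln_w = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(  # 4x expansion is an assumption
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # x: local detail tokens, e.g. (B, 1024, 512)
        # v: global semantic tokens from the global representor, e.g. (B, 64, 512)
        xn = self.ln_x(x)
        y = self.msa(xn, xn, xn, need_weights=False)[0] + x          # Y = MSA(LN(X)) + X
        vn = self.ln_v(v)
        w = self.mca(self.ln_y(y), vn, vn, need_weights=False)[0] + y  # W = MCA(LN(Y), V) + Y
        return self.ffn(self.ln_w(w)) + w                             # F = FFN(LN(W)) + W
```

Because the queries come from the local tokens and the keys/values from the global tokens, the output keeps the local sequence length while every position attends to the full global context.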

3.4. Multi-Scale Discriminator

This paper adopts the multi-scale discriminator based on PatchGAN proposed in Pix2pixHD [12]. PatchGAN is a fully convolutional discriminator based on local receptive fields. Its core idea is to perform independent authenticity discrimination on local regions of the image rather than outputting a single judgment for the entire image. This allows the discriminator to capture texture details and local structural information, addressing the problem of optical remote sensing image detail reconstruction in SAR image translation tasks. Meanwhile, discriminators at different scales correspond to different receptive fields: the high-resolution discriminator focuses on the rich texture details of the SAR image, while the low-resolution discriminator captures more global information.
The structure of the multi-scale discriminator based on PatchGAN is shown in Figure 7. It consists of two discriminators with the same structure. Each discriminator consists of five convolutional layers. The SAR images and generated images, as well as the SAR images and optical remote sensing images, are first concatenated along the channel dimension and fed into the high-resolution discriminator. Subsequently, the 2× down-sampled SAR images and generated images, along with the 2× down-sampled SAR images and optical remote sensing images, are, respectively, concatenated along the channel dimension and input into the low-resolution discriminator. This discrimination mechanism enables the model to consider both local detail and global structure consistency discrimination, providing stable and reliable feedback for the generator, thereby generating outputs more similar to the target image. Several studies have confirmed the superior performance of this discriminator in handling SAR image translation tasks [25,26].
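The two-scale discrimination scheme can be sketched as follows; the channel widths, kernel sizes, and the average-pooling down-sampler are assumptions following common PatchGAN conventions rather than details taken from the paper:

```python
import torch
import torch.nn as nn

def patch_discriminator(in_ch: int = 6) -> nn.Sequential:
    """Five-layer PatchGAN sketch; widths follow Pix2pixHD conventions (assumed)."""
    c = 64
    layers = [nn.Conv2d(in_ch, c, 4, stride=2, padding=1), nn.LeakyReLU(0.2)]
    for _ in range(3):
        layers += [nn.Conv2d(c, c * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2)]
        c *= 2
    # Final stride-1 layer outputs a map of per-patch real/fake scores
    layers += [nn.Conv2d(c, 1, 4, stride=1, padding=1)]
    return nn.Sequential(*layers)

class MultiScaleDiscriminator(nn.Module):
    """Two identical PatchGANs; the second sees a 2x down-sampled input."""

    def __init__(self, in_ch: int = 6):
        super().__init__()
        self.d_high = patch_discriminator(in_ch)
        self.d_low = patch_discriminator(in_ch)
        self.down = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, sar: torch.Tensor, img: torch.Tensor):
        # Concatenate SAR image with the generated (or real optical) image
        pair = torch.cat([sar, img], dim=1)
        return self.d_high(pair), self.d_low(self.down(pair))
```

Each output is a score map rather than a scalar, so every spatial cell judges the realism of one local patch at its own receptive-field scale.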

3.5. Loss Function

This work is inspired by Kong et al. [25], who jointly optimized adversarial loss [27], perceptual loss [28], and feature matching loss [29] in SAR image translation tasks and successfully generated high-quality optical remote sensing images. Drawing on their framework, while retaining these three losses, this paper additionally introduces the structural similarity index metric (SSIM) loss [30]. This loss establishes an effective constraint between the generated image and the target image by comparing their similarity in luminance, contrast, and structure. The formula for the SSIM loss is as follows:
$$\begin{aligned}
\mathrm{SSIM}(G(x), y) &= \frac{(2\mu_{G(x)}\mu_y + C_1)(2\sigma_{G(x)y} + C_2)}{(\mu_{G(x)}^2 + \mu_y^2 + C_1)(\sigma_{G(x)}^2 + \sigma_y^2 + C_2)} \\
\mathcal{L}_{\mathrm{SSIM}} &= 1 - \mathrm{SSIM}(G(x), y)
\end{aligned}$$
where $x$ is the input SAR image, $y$ is the corresponding optical remote sensing image, and $G(x)$ is the image produced by passing $x$ through the generator. $\mu_{G(x)}$ and $\mu_y$ represent the means of the generated and target images, respectively; $\sigma_{G(x)y}$ represents their covariance; and $\sigma_{G(x)}^2$ and $\sigma_y^2$ represent their variances. $C_1$ and $C_2$ are small constants used to prevent the denominator from being zero and to ensure numerical stability.
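As an illustration, the SSIM loss above can be implemented over whole-image statistics as follows. This is the global, single-window form of the formula; practical implementations often use sliding Gaussian windows instead, and the constants assume images scaled to [0, 1]:

```python
import torch

def ssim_loss(gen: torch.Tensor, target: torch.Tensor,
              c1: float = 1e-4, c2: float = 9e-4) -> torch.Tensor:
    """Global SSIM loss following the formula above (image-wide statistics;
    c1/c2 values are assumptions for inputs scaled to [0, 1])."""
    mu_g, mu_y = gen.mean(), target.mean()
    var_g = gen.var(unbiased=False)
    var_y = target.var(unbiased=False)
    cov = ((gen - mu_g) * (target - mu_y)).mean()
    ssim = ((2 * mu_g * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_g ** 2 + mu_y ** 2 + c1) * (var_g + var_y + c2))
    # L_SSIM = 1 - SSIM(G(x), y)
    return 1.0 - ssim
```

Identical images yield a loss of zero, and the loss grows as luminance, contrast, or structure diverge, which is what lets it act as a structural constraint alongside the pixel-level and adversarial terms.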
The adversarial loss is as follows:
$$\begin{aligned}
\min_{G} \; & \mathcal{L}_{G}(G) = \sum_{k=1,2} \mathbb{E}_{(x,y)}\big[\big(D_k(x, G(x)) - 1\big)^2\big] \\
\min_{D_1, D_2} \; & \sum_{k=1,2} \mathcal{L}_{D_k}(D_k) = \sum_{k=1,2} \Big( \mathbb{E}_{(x,y)}\big[\big(D_k(x, y) - 1\big)^2\big] + \mathbb{E}_{(x,y)}\big[D_k(x, G(x))^2\big] \Big) \\
& \mathcal{L}_{\mathrm{CGAN}}(G, D) = \sum_{k=1,2} \mathcal{L}_{D_k}(D_k) + \mathcal{L}_{G}(G)
\end{aligned}$$
where $k$ indexes the $k$-th discriminator, $G$ is the generator, and $D$ is the discriminator.
The formula for perceptual loss is as follows:
$$\mathcal{L}_{\mathrm{VGG}}(y, G(x)) = \sum_{i=1}^{T} \frac{1}{N_i} \left\| \varphi_i(G(x)) - \varphi_i(y) \right\|_1$$
where $T$ is the total number of layers in the VGG network, $i$ indexes its $i$-th layer, $N_i$ is the number of elements in the feature map produced by the $i$-th layer, $\varphi$ is the VGG network, and $\varphi_i(y)$ represents the feature map of the optical remote sensing image at the $i$-th layer. The formula for the feature matching loss is as follows:
$$\mathcal{L}_{\mathrm{FM}}(G, D) = \sum_{k=1,2} \mathbb{E}_{(x,y)} \sum_{i=1}^{T} \frac{1}{N_i} \left\| D_k^{(i)}(x, y) - D_k^{(i)}(x, G(x)) \right\|_1$$
where $k$ indexes the $k$-th discriminator, $T$ is the total number of discriminator layers from which features are extracted (set to 4 in this paper), $i$ indexes the $i$-th layer of the discriminator, and $N_i$ is the number of elements in the feature map produced by the $i$-th layer.
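For one discriminator, the inner sum over layers reduces to a sum of per-layer L1 means, which can be sketched as follows (the lists of per-layer activations are assumed to be collected during the discriminator's forward pass; the batch mean stands in for the expectation):

```python
import torch

def feature_matching_loss(feats_real, feats_fake):
    """Feature matching loss for one discriminator:
    sum_i (1/N_i) * ||D^(i)(x, y) - D^(i)(x, G(x))||_1,
    i.e. the sum over layers of the per-layer L1 mean."""
    return sum(torch.mean(torch.abs(fr - ff))
               for fr, ff in zip(feats_real, feats_fake))
```

In training, this is evaluated for both discriminators and summed, matching the outer $\sum_{k=1,2}$ in the formula.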
Therefore, the total loss function of this paper is the weighted sum of the above loss terms:
$$\mathcal{L}_{\mathrm{total}} = \alpha \mathcal{L}_{\mathrm{CGAN}} + \beta \mathcal{L}_{\mathrm{VGG}} + \gamma \mathcal{L}_{\mathrm{FM}} + \lambda \mathcal{L}_{\mathrm{SSIM}}$$
where $\mathcal{L}_{\mathrm{CGAN}}$ represents the adversarial loss, $\mathcal{L}_{\mathrm{VGG}}$ the perceptual loss, $\mathcal{L}_{\mathrm{FM}}$ the feature matching loss, and $\mathcal{L}_{\mathrm{SSIM}}$ the SSIM loss; $\alpha$, $\beta$, $\gamma$, and $\lambda$ are the corresponding weights. By jointly optimizing these losses, the generated image is forced to approximate the target image both in pixel values and at the visual perception level.
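The weighted combination can be expressed as a simple helper; the default weights below are the empirical values reported in Section 4.1 ($\alpha = 1$, $\beta = 10$, $\gamma = 10$, $\lambda = 1$):

```python
def total_loss(l_cgan: float, l_vgg: float, l_fm: float, l_ssim: float,
               alpha: float = 1.0, beta: float = 10.0,
               gamma: float = 10.0, lam: float = 1.0) -> float:
    """Weighted sum of the four loss terms:
    L_total = alpha*L_CGAN + beta*L_VGG + gamma*L_FM + lambda*L_SSIM."""
    return alpha * l_cgan + beta * l_vgg + gamma * l_fm + lam * l_ssim
```

The same expression applies unchanged when the inputs are autograd tensors rather than floats, so the combined scalar can be backpropagated directly.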

4. Results

4.1. Datasets and Parameter Settings

In this paper, a dataset of paired SAR and optical remote sensing images from Nanjing, Jiangsu Province, was used for experiments. The SAR images are from the RADARSAT-2 satellite, and the optical remote sensing images are from the RapidEye satellite. The coverage area is the eastern part of Nanjing City, with a resolution of 5 m for both. After geometric correction as a preprocessing step, the images were cropped to a size of 256 × 256 and underwent data augmentation operations including flipping and rotation. Finally, 9952 pairs were randomly selected as the training set and 1500 pairs as the test set. When dividing the training and test sets, the principle of spatial independence was followed, meaning there are no overlapping areas between them. Additionally, to further verify the effectiveness of the proposed method, the SEN1-2 dataset [31] was used as a supplementary dataset; after data augmentation, 7200 pairs were randomly selected for training and 1000 for testing. The SEN1-2 dataset is used only in the comparison experiments with different networks, while the Nanjing dataset is also used for the other ablation experiments. Note that due to different acquisition times, there may be slight differences between the SAR and optical remote sensing images. The following parameter settings were used during the experiments: batch size of 8, 100 training epochs in total, and the Adam optimizer with $\beta_1 = 0.5$. Considering the number of parameters and the complexity of our network, the learning rate was set to $1 \times 10^{-4}$ with a cosine annealing schedule, using the first 50 epochs as the warm-up phase and the last 50 epochs as the annealing phase. The loss weights $\alpha$, $\beta$, and $\gamma$ were empirically set to 1, 10, and 10, respectively, and $\lambda$ was set to 1. The network was implemented using the PyTorch 2.9.1 framework and trained on two NVIDIA GeForce RTX 3080Ti GPUs.

4.2. Evaluation Metrics

This paper uses perception-based image similarity metrics, employing the Fréchet inception distance (FID) [32] and learned perceptual image patch similarity (LPIPS) [33] as the primary evaluation metrics, while mean squared error (MSE), peak signal-to-noise ratio (PSNR), and SSIM serve as auxiliary traditional evaluation metrics, enabling a comprehensive assessment of image quality and network performance.

4.3. Different Network Analysis

To demonstrate the superior performance of our method in SAR image translation tasks, this section compares our method with current mainstream image translation methods on both the Nanjing dataset and the SEN1-2 dataset. The compared methods are: Pix2pix, the cornerstone of supervised image translation; CycleGAN, the cornerstone of unsupervised image translation; Pix2pixHD, specifically designed for high-resolution image translation; and the multi-scale CGAN method [25] specifically for SAR image translation tasks. In this method, both the generator and the discriminator adopt the multi-scale idea and incorporate the Swin Transformer architecture.
Figure 8 and Table 1 show the comparison of translation results and evaluation metrics of different methods on the Nanjing dataset, respectively. As shown in Table 1, our method's FID and LPIPS metrics significantly outperform those of the other methods. Our method's FID is 55.51, which is 62.57 lower than Pix2pix, 29.87 lower than CycleGAN, 24.05 lower than Pix2pixHD, and 12.08 lower than Multi-scale CGAN. Meanwhile, our method's LPIPS is 0.5565, which is 0.0611 lower than Pix2pix, 0.0349 lower than CycleGAN, 0.0271 lower than Pix2pixHD, and 0.0034 lower than Multi-scale CGAN. Furthermore, our method also demonstrates comprehensive advantages over the other methods in the traditional metrics.
The translation results in Figure 8 show that on the Nanjing dataset, the image quality generated by our method far exceeds that of other methods in terms of visual effect. Specifically, the images generated by Pix2pix are of very poor quality, with unreasonable restoration of water bodies and farmland, and significant structural distortions in road and building areas. CycleGAN-generated images suffer from edge blurring and coarse textures, and also mistakenly translate water bodies into farmland. Pix2pixHD performs acceptably in translating water bodies, farmland, and land, but shows significant deficiencies in translating more challenging elements like roads and buildings, with blurred building contours and unreasonable road structures. Although the image quality generated by Multi-scale CGAN has significantly improved compared to the previous three methods, there are still obvious distortions in the translated roads. In comparison, the image quality generated by our method is significantly improved, with clear structures, reasonable colors, and sharp details, closely resembling the target images to human visual perception.
Figure 9 and Table 2 show the comparison of translation results and evaluation metrics of different methods on the SEN1-2 dataset, respectively. The data in Table 2 show that our method's FID is 31.66, significantly better than the other image translation algorithms: a reduction of 80.28 compared to Pix2pix, 77.39 compared to CycleGAN, 40.72 compared to Pix2pixHD, and 22.09 compared to Multi-scale CGAN. Our method's LPIPS is 0.2829, which is 0.2823, 0.3178, 0.1732, and 0.0507 lower than the other four algorithms, respectively. Our method also performs best on the three traditional evaluation metrics, showing comprehensive advantages.
The translation results in Figure 9 show that on the SEN1-2 dataset, the image quality generated by different algorithms varies greatly. Specifically, Pix2pix-generated images suffer from severe blurring and color distortion, performing poorly in translating building areas, farmland, and roads. CycleGAN-generated images contain numerous translation errors, such as translating water bodies into farmland and farmland into land, and also perform poorly in translating building areas. The images generated by Pix2pixHD have problems such as unreasonable reconstruction of building areas and unclear road boundaries. The image quality generated by Multi-scale CGAN is significantly improved compared to the previous three methods, but there are still minor artifacts and structural blurring. In contrast, the images generated by our method exhibit excellent performance in both boundary segmentation and texture details, achieving high-quality translation for typical land cover categories such as farmland, land, water bodies, roads, and buildings.
The comparison and analysis in this section indicate that, across different datasets, our method demonstrates significant advantages over mainstream image translation methods in both objective evaluation metrics and subjective visual quality.

4.4. Global Representor Ablation Experiment

To prove the effectiveness and reliability of using InternImage as the global representor, this section replaces InternImage with ViT [14] and Swin v2 [16], respectively, while ensuring that the numbers of parameters of the networks using ViT and Swin v2 as the global representor are similar to that of our method, for a more objective comparison of their performance. Figure 10 and Table 3 show the comparison of translation results and evaluation metrics in this ablation experiment, respectively.
From the data in Table 3, using InternImage as the global representor achieves an FID of 55.51, only marginally higher than using ViT (by 0.28) and Swin v2 (by 0.98). In terms of the LPIPS metric, using InternImage as the global representor performs best, improving on the other two methods by 0.0015. Meanwhile, our method also performs best on traditional metrics such as MSE. Therefore, considering both the perceptual image quality metrics and the traditional metrics, using InternImage as the global representor yields better overall metric performance than using ViT or Swin v2.
The translation results in Figure 10 show that compared to the high-quality images finally generated by our network, using ViT as the global representor leads to some content loss, such as incomplete road structures and unreasonable water body restoration. Using Swin v2 as the global representor results in some semantic distortion, with some farmland incorrectly translated into land and generated water bodies being overly smooth. In contrast, the images generated by our network demonstrate excellent performance in both global semantic structure and local detail texture.
Table 4 shows the comparison of the computational cost of each method in the global representor ablation experiment, with the batch size set to 8. This experiment was conducted on an NVIDIA GeForce RTX 3080Ti graphics card. The data in the table show that, with the numbers of parameters kept similar, Swin v2 has the lowest MACs, InternImage's MACs increase by only 1.008 G, and ViT's MACs increase significantly compared to the other two. At the same time, InternImage has the shortest inference time, only 186 s, while the inference times of Swin v2 and ViT are 7 s and 34 s longer, respectively. Therefore, considering these two metrics, InternImage, with DCNv3 as its core operator, is significantly more cost-effective in computational terms than ViT and Swin v2, which use MSA as their core operator.
This ablation experiment on the global representor indicates that compared to ViT and Swin v2, InternImage exhibits better performance in extracting global semantic features from SAR images, while also significantly reducing the computational cost. This further validates its superiority as the global representor in this paper.

4.5. Cascaded Multi-Head Attention Module Ablation Experiment

To investigate the performance of the designed cascaded multi-head attention module, this section conducts an ablation experiment. Figure 11 and Table 5 show the comparison of translation results and evaluation metrics, respectively. Here, “w/o self-attention” means not using the self-attention mechanism within the module, “w/o cross-attention” means not using the cross-attention mechanism, “ResNet blocks” means replacing the designed cascaded multi-head attention module with 9 ResNet blocks [34], and “w/ attention” means using our full cascaded multi-head attention module.
After removing the self-attention mechanism, the FID metric increases by 20.40 compared to the full cascaded multi-head attention module, and the LPIPS metric increases by 0.0053. The performance on MSE and PSNR metrics is also inferior to the full module. The translation results show that after removing self-attention, the generated images exhibit more edge blurring and artifacts. This indicates that without the self-attention mechanism, the local detail features extracted by the encoder cannot be optimized and still suffer from noise interference and distortion effects, which further affects the quality of subsequent interaction between local features and global features.
After removing the cross-attention mechanism, the FID metric increases by 17.41, the LPIPS metric increases by 0.0070, and the MSE metric also deteriorates to some extent. The visualization results show that after removing cross-attention, the generated images exhibit semantic issues such as distorted building contours, deformed road structures, and some farmland being incorrectly translated into land. This indicates that when the cross-attention mechanism is removed, local detail features and global semantic features cannot interact, resulting in the decoder lacking semantic guidance during the final image reconstruction process.
The ResNet module is a classic feature transformation network. When the cascaded multi-head attention module is replaced with ResNet blocks, although traditional metrics like MSE are slightly optimized, the primary evaluation metrics deteriorate significantly. Specifically, the FID metric increases by 12.99 and the LPIPS metric increases by 0.0047. The translation results show that the image quality generated by the ResNet module, while improved compared to the first two methods, still falls short of our cascaded multi-head attention module, manifesting as distorted road structures and unreasonable color rendering of water bodies.
This section, through qualitative analysis and quantitative metrics, proves the importance of the designed cascaded multi-head attention module in the SAR image translation process.

4.6. Analysis of the SSIM Loss

To investigate the role of the SSIM loss introduced in this paper for SAR image translation, an experiment analyzing the SSIM loss weight was conducted. Table 6 shows the comparison of evaluation metrics with different weights of the SSIM loss, and Figure 12 shows the trend of the impact of the SSIM loss weight variation on primary evaluation metrics.
From the data in Table 6, when $\lambda = 1$, both the FID and LPIPS metrics are optimal, and the traditional metrics MSE and PSNR also achieve their best performance. Compared to using only adversarial loss, perceptual loss, and feature matching loss without the SSIM loss, the FID is reduced by 6.10 and the LPIPS by 0.0031, and the traditional metrics are likewise inferior without it. This reflects the contribution of the SSIM loss to SAR image translation: its introduction improves not only the SSIM metric but also, to some extent, the FID and LPIPS metrics.
From the line chart in Figure 12, it can be observed that before λ   = 1, as the weight of the SSIM loss increases, both the FID and LPIPS metrics gradually improve. After λ   = 1, as the weight increases, both metrics start to deteriorate. Especially when λ   = 2.5, the FID and LPIPS metrics significantly worsen. This indicates that there is an optimal weight for the SSIM loss, namely λ   = 1, and overemphasizing this loss can negatively impact the overall image quality.

5. Conclusions

This paper proposed a deep learning-based method for SAR-to-optical remote sensing image translation. The method consists of a global representor, a generator with cascaded multi-head attention, and a multi-scale discriminator. The global representor is built upon InternImage with DCNv3 as its core operator. Benefiting from the long-range dependency and adaptive spatial aggregation capabilities of DCNv3, the global representor is responsible for extracting global semantic features from SAR images. The generator with cascaded multi-head attention follows a typical "encoder-bottleneck-decoder" structure. The encoder focuses on extracting local detail features from SAR images. The cascaded multi-head attention module within the bottleneck layer optimizes the local detail features and enables feature interaction between global semantics and local details. The multi-scale discriminator is based on the local receptive field PatchGAN, achieving joint global and local discrimination. Furthermore, for the first time in the SAR image translation field, the SSIM loss is combined with adversarial loss, perceptual loss, and feature matching loss as the composite loss function, ensuring the similarity between the generated images and the target images in both pixel values and human visual perception. A series of experiments demonstrate the effectiveness and reliability of the proposed method. Compared to mainstream image translation methods, our method can generate higher-quality images that are semantically consistent, texturally authentic, clearly detailed, and visually reasonable, providing a high-quality solution for the future SAR image translation field.
Despite this, our method still has some limitations. The following suggestions are recommended for future research directions:
  • Our network is supervised and relies on strictly paired SAR-optical remote sensing image datasets. However, in practical applications, image annotation costs are high, and high-quality paired datasets are scarce. Therefore, future research could focus on unsupervised SAR-to-optical image translation methods.
  • The number of parameters of our method is slightly higher than that of mainstream image translation methods. Therefore, future research could focus on reducing model complexity, developing lightweight or compressed networks specifically for SAR image translation tasks, ensuring generated image quality while reducing computational cost and memory overhead.

Author Contributions

Conceptualization, C.X.; methodology, C.X.; software, C.X.; validation, C.X. and Y.K.; formal analysis, C.X.; investigation, C.X.; writing—original draft preparation, C.X.; writing—review and editing, C.X. and Y.K.; supervision, Y.K.; funding acquisition, Y.K.; project administration, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61501228, No. 62171220); Natural Science Foundation of Jiangsu (No. BK20140825); Aeronautical Science Foundation of China (No. 20152052029, No. 20182052012); Basic Research (No. NS2015040, No. NS2021030); and National Science and Technology Major Project (2017-II-0001-0017); Key Laboratory of Radar Imaging and Microwave Photonics, Ministry of Education (NJ20240002).

Data Availability Statement

The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Q.; Liu, X.; Liu, M.; Zou, X.; Zhu, L.; Ruan, X. Comparative Analysis of Edge Information and Polarization on SAR-to-Optical Translation Based on Conditional Generative Adversarial Networks. Remote Sens. 2021, 13, 128. [Google Scholar] [CrossRef]
  2. Turnes, J.N.; Bermudez Castro, J.D.; Torres, D.L.; Soto Vega, P.J.; Feitosa, R.Q.; Happ, P.N. Atrous cGAN for SAR to Optical Image Translation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4003905. [Google Scholar] [CrossRef]
  3. Liu, X.; Hong, D.; Chanussot, J.; Zhao, B.; Ghamisi, P. Modality Translation in Remote Sensing Time Series. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5401614. [Google Scholar] [CrossRef]
  4. Zhan, T.; Bian, J.; Yang, J.; Dang, Q.; Zhang, E. Improved Conditional Generative Adversarial Networks for SAR-to-Optical Image Translation. In Proceedings of the Pattern Recognition and Computer Vision, PRCV 2023, PT IV, Xiamen, China, 13–15 October 2023; Liu, Q., Wang, H., Ma, Z., Zheng, W., Zha, H., Chen, X., Wang, L., Ji, R., Eds.; Springer-Verlag Singapore Pte Ltd.: Singapore, 2024; Volume 14428, pp. 279–291. [Google Scholar]
  5. Ji, G.; Wang, Z.; Zhou, L.; Xia, Y.; Zhong, S.; Gong, S. SAR Image Colorization Using Multidomain Cycle-Consistency Generative Adversarial Network. IEEE Geosci. Remote Sens. Lett. 2021, 18, 296–300. [Google Scholar] [CrossRef]
  6. Hwang, J.; Shin, Y. SAR-to-Optical Image Translation Using SSIM Loss Based Unpaired GAN. In Proceedings of the 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), Jeju-si, Republic of Korea, 19–21 October 2022; IEEE: Jeju Island, Republic of Korea, 2022; pp. 917–920. [Google Scholar]
  7. Wang, J.; Yang, H.; He, Y.; Zheng, F.; Liu, Z.; Chen, H. An Unpaired SAR-to-Optical Image Translation Method Based on Schrödinger Bridge Network and Multi-Scale Feature Fusion. Sci. Rep. 2024, 14, 27047. [Google Scholar] [CrossRef]
  8. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Neural Information Processing Systems (nips): La Jolla, CA, USA, 2014; Volume 27, pp. 2672–2680. [Google Scholar]
  9. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  10. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar]
Figure 1. Overall network architecture.
Figure 2. Schematic of the DCNv3 working principle.
Figure 3. InternImage basic block.
Figure 4. Global representor structure.
Figure 5. Residual block structure.
Figure 6. Cascaded multi-head attention module.
Figure 7. Multi-scale discriminator structure.
Figure 8. Translation results of different networks on the Nanjing dataset. (a) SAR images, (b) target images, (c) Pix2pix, (d) CycleGAN, (e) Pix2pixHD, (f) Multi-scale CGAN, (g) our method.
Figure 9. Translation results of different networks on the SEN1-2 dataset. (a) SAR images, (b) target images, (c) Pix2pix, (d) CycleGAN, (e) Pix2pixHD, (f) Multi-scale CGAN, (g) our method.
Figure 10. Translation results under the global representor ablation experiment. (a) SAR images, (b) target images, (c) ViT, (d) Swin v2, (e) InternImage.
Figure 11. Translation results under the cascaded multi-head attention module ablation experiment. (a) SAR images, (b) target images, (c) without self-attention, (d) without cross-attention, (e) ResNet blocks, (f) cascaded multi-head attention module.
Figure 12. Impact of the SSIM loss weight on the primary evaluation metrics.
Table 1. Evaluation metrics of different networks on the Nanjing dataset.

| Method | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Number of Parameters ↓ |
|---|---|---|---|---|---|---|
| Pix2pix | 118.08 | 0.6176 | 0.7165 | 15.29 | 0.1270 | 57.183M |
| CycleGAN | 85.38 | 0.5914 | 0.7097 | 15.48 | 0.1687 | **28.286M** |
| Pix2pixHD | 79.56 | 0.5836 | 0.5986 | 16.74 | 0.1915 | 54.155M |
| Multi-scale CGAN | 67.59 | 0.5599 | 0.5825 | 17.05 | 0.2074 | 50.116M |
| Ours | **55.51** | **0.5565** | **0.5405** | **17.48** | **0.2218** | 62.287M |

Note: bold indicates the best values.
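As a reading aid for the MSE and PSNR columns above, the following is a minimal NumPy sketch of the two metrics as commonly defined (FID, LPIPS, and windowed SSIM require learned models or local statistics and are omitted). The pixel normalization used for the tables is not restated here, so a peak value of 1.0 is an assumption.

```python
import numpy as np

def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between two images of the same shape."""
    return float(np.mean((x - y) ** 2))

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; `peak` is the maximum pixel value."""
    m = mse(x, y)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

# Toy example: a constant 0.5 offset gives MSE 0.25 and PSNR ~6.02 dB.
a = np.zeros((8, 8))
b = np.full((8, 8), 0.5)
print(round(mse(a, b), 4), round(psnr(a, b), 2))  # 0.25 6.02
```

Lower MSE and higher PSNR both indicate a generated image closer to the optical target, which is why the two columns carry opposite arrows in the tables.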
Table 2. Evaluation metrics of different networks on the SEN1-2 dataset.

| Method | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Number of Parameters ↓ |
|---|---|---|---|---|---|---|
| Pix2pix | 111.94 | 0.5652 | 0.5986 | 14.83 | 0.1798 | 57.183M |
| CycleGAN | 109.05 | 0.6007 | 0.6834 | 13.45 | 0.1244 | **28.286M** |
| Pix2pixHD | 72.38 | 0.4561 | 0.3972 | 18.18 | 0.3277 | 54.155M |
| Multi-scale CGAN | 53.75 | 0.3336 | 0.3021 | 21.00 | 0.5033 | 50.116M |
| Ours | **31.66** | **0.2829** | **0.2540** | **22.11** | **0.5969** | 62.287M |

Note: bold indicates the best values.
Table 3. Evaluation metrics under the global representor ablation experiment.

| Backbone | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| ViT | 55.23 | 0.5580 | 0.5458 | 17.40 | 0.2201 |
| Swin v2 | **54.53** | 0.5580 | 0.5459 | 17.39 | 0.2208 |
| InternImage | 55.51 | **0.5565** | **0.5405** | **17.48** | **0.2218** |

Note: bold indicates the best values.
Table 4. Computational cost under the global representor ablation experiment.

| Backbone | MACs ↓ | Inference Time ↓ | Number of Parameters |
|---|---|---|---|
| ViT | 27.590M | 222 s | 13.417M |
| Swin v2 | 21.310G | 193 s | **12.016M** |
| InternImage | 22.318G | **186 s** | 12.090M |

Note: bold indicates the best values.
Table 5. Evaluation metrics under the cascaded multi-head attention module ablation experiment.

| Configuration | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| w/o self-attention | 75.91 | 0.5618 | 0.5423 | 17.47 | 0.2235 |
| w/o cross-attention | 72.92 | 0.5635 | 0.5422 | 17.46 | **0.2251** |
| ResNet blocks | 68.50 | 0.5612 | **0.5396** | **17.51** | 0.2222 |
| w/ attention | **55.51** | **0.5565** | 0.5405 | 17.48 | 0.2218 |

Note: bold indicates the best values.
Table 6. Evaluation metrics with different weights of the SSIM loss.

| Weight | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| w/o SSIM | 61.61 | 0.5596 | 0.5459 | 17.43 | 0.2125 |
| λ = 0.5 | 55.91 | 0.5588 | 0.5409 | 17.47 | 0.2204 |
| λ = 1 (Ours) | **55.51** | **0.5565** | **0.5405** | **17.48** | 0.2218 |
| λ = 1.5 | 56.46 | 0.5579 | 0.5440 | 17.40 | **0.2251** |
| λ = 2 | 60.76 | 0.5613 | 0.5523 | 17.32 | 0.2227 |
| λ = 2.5 | 92.79 | 0.5746 | 0.5661 | 17.08 | 0.1954 |

Note: bold indicates the best values.
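Table 6 sweeps the weight λ of the SSIM loss term inside the total objective. As an illustration only, a single-window SSIM and the weighted sum might be sketched as follows; note that the weights on the adversarial, perceptual, and feature matching terms are set in the paper body and are assumed to be 1 here, and the full metric averages SSIM over local Gaussian windows rather than one global window.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Single-window SSIM over the whole image (a simplification of the
    standard metric, which averages this statistic over local windows)."""
    mx, my = float(x.mean()), float(y.mean())
    vx, vy = float(x.var()), float(y.var())
    cov = float(((x - mx) * (y - my)).mean())
    return ((2 * mx * my + c1) * (2 * cov + c2)
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def total_loss(l_adv: float, l_perc: float, l_fm: float,
               ssim_value: float, lam: float = 1.0) -> float:
    """Weighted objective; unit weights on the other terms are an assumption.
    The SSIM loss is 1 - SSIM, so it vanishes for a perfect reconstruction."""
    return l_adv + l_perc + l_fm + lam * (1.0 - ssim_value)

x = np.linspace(0.0, 1.0, 64).reshape(8, 8)
print(round(ssim_global(x, x), 4))  # identical images give SSIM 1.0
```

With λ = 1 the SSIM term is balanced against the other losses, which matches the best-scoring row of Table 6; over-weighting it (λ = 2.5) sharply degrades FID and SSIM alike.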

Xu, C.; Kong, Y. SAR-to-Optical Remote Sensing Image Translation Method Based on InternImage and Cascaded Multi-Head Attention. Remote Sens. 2026, 18, 55. https://doi.org/10.3390/rs18010055
