Article

ILF-BDSNet: A Compressed Network for SAR-to-Optical Image Translation Based on Intermediate-Layer Features and Bio-Inspired Dynamic Search

College of Electrical and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(19), 3351; https://doi.org/10.3390/rs17193351
Submission received: 20 August 2025 / Revised: 24 September 2025 / Accepted: 28 September 2025 / Published: 1 October 2025

Highlights

What are the main findings?
  • This paper proposes a specialized compressed network, ILF-BDSNet, for SAR-to-optical remote sensing image translation, incorporating key modules such as a dual-resolution collaborative discriminator, knowledge distillation based on intermediate-layer features, and a bio-inspired dynamic search of channel configuration (BDSCC) algorithm.
What is the implication of the main finding?
  • While significantly reducing the number of parameters and the computational complexity, the network still generates high-quality optical remote sensing images, providing an efficient solution for SAR image translation in resource-constrained environments.

Abstract

Synthetic aperture radar (SAR) exhibits all-day and all-weather imaging capabilities, granting it significant application value in remote sensing. However, interpreting SAR images requires extensive expertise, making SAR-to-optical remote sensing image translation a crucial research direction. While conditional generative adversarial networks (CGANs) have demonstrated exceptional performance in image translation tasks, their massive number of parameters poses substantial challenges. Therefore, this paper proposes ILF-BDSNet, a compressed network for SAR-to-optical image translation. Specifically, standard convolutions in the feature-transformation module of the teacher network are first replaced with depthwise separable convolutions to construct the student network, and a dual-resolution collaborative discriminator based on PatchGAN is proposed. Next, knowledge distillation based on intermediate-layer features and channel pruning via weight sharing are designed to train the student network. Then, the bio-inspired dynamic search of channel configuration (BDSCC) algorithm is proposed to efficiently select the optimal subnet. Meanwhile, the pixel-semantic dual-domain alignment loss function is designed, whose feature-matching loss establishes an alignment mechanism based on intermediate-layer features from the discriminator. Extensive experiments demonstrate the superiority of ILF-BDSNet, which significantly reduces the number of parameters and the computational complexity while still generating high-quality optical images, providing an efficient solution for SAR image translation in resource-constrained environments.

1. Introduction

In recent years, with the development of industry, agriculture, and forestry, the demand for Earth observation has been increasing. Synthetic aperture radar (SAR) has attracted more and more attention due to its all-day and all-weather capabilities. It has wide applications in Earth science, weather changes, environmental system monitoring, marine resource utilization [1], planetary exploration, and more. The two major trends in recent spaceborne SAR technology development are high resolution and multi-dimensionality. However, due to the unique imaging mechanism of SAR, there is significant speckle noise and geometric distortion in SAR images, which means that the interpretation of SAR images requires professional expertise. Optical remote sensing images, on the other hand, are more in line with human visual perception and are easier to interpret, but their imaging can be easily affected by cloud cover and low-light conditions. Therefore, a feasible approach is to translate SAR images into optical remote sensing images to improve their interpretability.
With the development of deep learning, convolutional neural networks (CNNs), which have strong nonlinear fitting capabilities, have been widely applied in fields such as image segmentation [2], classification recognition [3], and super-resolution [4]. However, CNNs do not perform well in image translation tasks because their pixel-level loss tends to produce blurry translated images. In contrast, generative adversarial networks (GANs) [5] have strong image-generation capabilities, and some conditional generative adversarial networks (CGANs) take images of a specific style as model input and generate high-quality images in another style [6,7]. As a result, CGANs have gained wide attention in the image translation field. However, CGANs are computationally intensive, and their large number of parameters consumes substantial computational resources and storage space, making it necessary to compress CGAN models. Moreover, because a CGAN consists of a generator and a discriminator, its training is highly unstable, making it difficult to directly apply compression algorithms designed for typical classification and detection networks to CGANs.
Therefore, the focus of this paper is to propose a compressed network for SAR-to-optical remote sensing image translation, aiming to minimize the number of parameters and computational complexity while ensuring high-quality generated images. The main contributions of this paper are as follows:
(1)
To address the common issue in existing CGANs where it is difficult to balance global structural consistency and local detail authenticity in SAR image translation tasks, this paper proposes a dual-resolution collaborative discriminator. This structure uses PatchGAN based on local receptive fields as the basic framework of the discriminator. By constructing a high- and low-resolution collaborative analysis network, it focuses on pixel-level processing and scene-level semantic features, significantly improving the consistency between macro coherence and micro fidelity of generated images.
(2)
In response to the limited guidance provided by traditional knowledge distillation strategies to the student network, which is not suitable for SAR-to-optical remote sensing image translation tasks, this paper designs knowledge distillation based on intermediate-layer features of the teacher network. Furthermore, to solve the potential issue of a large number of channel configurations, a channel-pruning strategy based on weight sharing is proposed.
(3)
To overcome the time-consuming nature of traditional brute-force search algorithms, this paper proposes a bio-inspired dynamic search of channel configuration (BDSCC) algorithm. The algorithm constructs a dynamic biological population to simulate processes such as fitness evaluation, natural selection, gene recombination, and gene mutation in biology, significantly improving search efficiency.
(4)
To address the issues of speckle noise and other problems in SAR images, this paper designs a pixel-semantic dual-domain alignment loss function. This loss function is jointly optimized through adversarial loss, perceptual loss, and feature-matching loss. The feature-matching loss, combined with the dual-resolution collaborative discriminator, constrains the statistical distribution consistency of the intermediate-layer features between generated images and target images in the discriminator, achieving cross-layer alignment from pixel-level details to semantic-level structures.

2. Related Work

2.1. SAR-to-Optical Image Translation

Research on translating SAR images into optical remote sensing images started relatively late, and it is only in recent years that a significant number of studies have been published. Fu et al. [8] proposed a CGAN generator with a multi-scale cascaded residual structure, which directly connects down-sampled input SAR images to deconvolutional layers at different depths, enhancing the ability of the generator to learn residual information and thereby refining the texture details of generated images. Tan et al. [9] designed two CGAN models responsible for SAR image denoising and colorization tasks, respectively, improving detail information while reducing spectral distortion in generated images. Zhang et al. [10] retained the original Pix2Pix architecture but introduced SAR image gradient information and texture features such as contrast, homogeneity, and correlation based on the gray-level co-occurrence matrix (GLCM) into the generator, improving the structural similarity between generated and target images. Turnes et al. [11] proposed a dilated convolution-based CGAN model, which expands the receptive field of the model and incorporates an atrous spatial pyramid pooling (ASPP) module in the generator to leverage multi-scale spatial contextual information, significantly enhancing translation accuracy. Zhan et al. [12] integrated a style-based calibration module into CGAN, which learns style features from input SAR images and aligns them with the style of optical remote sensing images to achieve color calibration, minimizing discrepancies between generated and target images. Shi et al. [13] introduced a conditional diffusion model for SAR-to-optical image translation, addressing the training instability of the CGAN-based translation network. The proposed model combines long skip connections and self-attention mechanisms to strengthen feature extraction, enabling better preservation of the edges, details, and overall texture of optical remote sensing images.
For unsupervised unpaired image translation, since there are no corresponding optical remote sensing images for the input SAR images, the generator tends to become confused when learning color information. To address this issue, Ji et al. [14] improved CycleGAN by introducing an additional mask vector to the input SAR images, enabling the generator to recognize terrain categories. Simultaneously, they designed a dual-branch architecture in the discriminator to perform both image authenticity discrimination and classification recognition. This approach significantly reduced color distortion in unpaired image translation. To overcome the limitation of cycle-consistency loss in CycleGAN, which focuses solely on texture alignment, Hwang et al. [15] incorporated the mutual information-based correlation loss and the SSIM loss based on structure, brightness, and contrast information into the training framework, thereby enhancing the ability to learn structural and color features. Yang et al. [16] tackled detail defects in generated images by proposing a structurally imbalanced generator. They designed a sophisticated encoder to extract rich SAR features, while the decoder filtered these features to obtain key detail information. Additionally, they introduced specific normalization methods in different modules of the model to enhance the emphasis on texture, contour, and color information. Since the existing methods are difficult to complete the training of unpaired data with a minimum amount of data, Wang et al. [17] proposed a multi-scale axial residual module (MARM) by adopting the latest Schrödinger bridge transformation framework. This module adopts a multi-branch structure. By applying permutation operations to each branch feature map, it enhances the global information extraction and cross-channel interaction capabilities. At the same time, the axial self-attention mechanism is used to limit the sensory field to a certain range, which helps to extract local information under the current branch and achieve the transmission of long-distance interaction information, ultimately generating high-quality images.

2.2. Model Compression

Currently, the compression of CNN models primarily involves four categories of methods:
(1)
Designing compact and efficient network architectures, such as depthwise separable convolution [18] and grouped convolution [19]. These methods modify traditional convolution approaches to reduce the number of parameters and computational complexity, thereby achieving model compression.
(2)
Knowledge distillation, which trains a student network to learn knowledge from a teacher network, enabling the performance of the student network to approximate that of the teacher network as closely as possible. The concept of distillation in neural networks was first proposed by Hinton et al. [20]. It is a relatively universal and simple neural network compression technique and has been widely adopted in CNN model compression.
(3)
Pruning, which is generally divided into two types: structured pruning and unstructured pruning, with differences in what is pruned. Currently, structured pruning is more widely used. It prunes layers or channels and is suitable for traditional network architectures. Common approaches include pruning convolutional kernels [21] and pruning channels [22]. Unstructured pruning targets weights. However, since the pruned weight matrix is sparse, without specialized hardware conditions, it is difficult to achieve acceleration effects.
(4)
Quantization, which converts high-precision computations into low-precision operations. This significantly reduces both the number of parameters and the computational complexity.
In recent years, to enable the deployment of deep learning models across diverse hardware platforms, research on GAN model compression has gained increasing attention. Li et al. [23] proposed a compression method based on differentiable mask and co-attention distillation. The former searches for a lightweight generator architecture in a training-adaptive manner and incorporates adaptive cross-block group sparsity to overcome the problem of inconsistent pruned residual connection channels. The latter distills informative attention maps from the pre-trained generator and discriminator into the searched generator. Both achieve efficient GAN compression. You et al. [24] introduced image probability distribution distillation, compressing GANs by extracting knowledge from the global distribution of images. Lin et al. [25] proposed a generator–discriminator collaborative compression architecture, incorporating global coordination constraints to determine whether additional teacher network training is necessary. Yeo et al. [26] proposed two methods for compressing GANs: distribution matching for efficient compression and interactive network compression via knowledge exchange and learning. The former employs a base model as an embedding kernel to enable efficient distribution matching and utilizes maximum mean discrepancy (MMD) for effective knowledge distillation. The latter adopts an interactive compression approach that enhances the communication between the student generator and discriminator, achieving a smoother compression process.

3. Methodology

3.1. Holistic Compression Method

Figure 1 illustrates the holistic compression method in this paper. First, a high-quality teacher generator is pre-trained. Next, a once-for-all student generator encompassing all possible channel configurations is trained through knowledge distillation based on intermediate-layer features and channel pruning via weight sharing. This trained student generator serves as the supernet for the subsequent architecture search. When training both the student network and the teacher network, the pixel-semantic dual-domain alignment loss function is used. After training, numerous subnets can be extracted from the supernet without additional training, leveraging the advantage of once-for-all training. During the search process, to enhance search efficiency, the BDSCC algorithm is employed to filter out the optimal subnet. Finally, the selected subnet is fine-tuned to obtain the final compressed network, ILF-BDSNet.

3.2. Network Architecture

3.2.1. Overall Architecture

Currently, the Transformer [27] has demonstrated remarkable performance in the field of computer vision. Although the translation network constructed by combining the Transformer with CNN can significantly improve the quality of generated images, the compression of such hybrid networks remains challenging due to the incomplete development of compression and acceleration techniques for the Transformer. In contrast, compression methods for CNN have become increasingly sophisticated and diversified. Given this research status, the networks subjected to compression in this paper are exclusively composed of CNN, aiming to reduce the overall model optimization complexity by avoiding the compression challenges of Transformer.
Figure 2 shows the overall architecture of teacher and student networks. This supervised learning framework relies on strictly paired datasets and consists of an improved feature-extraction generator, along with a dual-resolution collaborative discriminator based on PatchGAN [6].

3.2.2. Teacher Generator

As shown in Figure 3, the teacher generator adopts an encoder–decoder architecture, with its core structure comprising three stages: progressive down-sampling, multi-level residual feature transformation, and spatial resolution reconstruction. Additionally, skip connections are incorporated between the input and output layers.
In the encoding phase, the input SAR image undergoes four down-sampling operations that progressively reduce the spatial resolution while doubling the number of channels at each step. This design balances the conflict between feature abstraction and information preservation: the four down-sampling steps effectively capture global semantic information such as farmland and building areas, while avoiding the spatial information loss caused by excessive down-sampling, which would distort high-frequency textures during reconstruction. Thus, this progressive down-sampling can effectively extract hierarchical features ranging from local details to global semantics. The fundamental reason for using ResNet blocks [28] in the feature-transformation stage stems from their distinctive residual learning mechanism, which effectively addresses critical issues in deep neural networks while satisfying the specific demands of SAR image translation tasks. Specifically, the inherent speckle noise and high-frequency textures in SAR images, such as farmland boundaries and building outlines, impose stringent requirements on feature preservation, and residual connections demonstrate superior capability in retaining textural details and preventing feature blurring during down-sampling. Finally, the decoding phase progressively reconstructs the spatial resolution through four up-sampling operations, producing a generated image with the same size as the input image.

3.2.3. Student Generator

The structure of the student generator is illustrated in Figure 4, with its core challenge lying in balancing the number of parameters against generated image quality during network compression. As the down-sampling and up-sampling modules exhibit high sensitivity to convolutional operations and channel-number adjustments, directly compressing these modules leads to structural distortions in generated images, such as blurred edges and broken textures, thereby disrupting the balance of the system. In contrast, the ResNet blocks in the feature-transformation module demonstrate robust tolerance to adjustments of the convolutional structure. Notably, after four down-sampling operations, the channel dimensions of the ResNet blocks expand significantly, so their parameters constitute a substantial proportion of the total number of parameters and form a typical computational bottleneck. This characteristic makes the ResNet blocks the primary compression target.
Building on this observation, this paper replaces standard convolutions in ResNet blocks with depthwise separable convolutions. By decomposing the original standard convolution into depthwise convolution and pointwise convolution, we decouple the modeling of spatial correlations and inter-channel relationships. This approach maintains feature representation capability while achieving significant reductions in both the number of parameters and computational complexity.
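For illustration, the replacement can be sketched in PyTorch as follows. The normalization layers, activation, and the bottleneck width argument mid_ch are assumptions of this sketch rather than the paper's exact block configuration; mid_ch corresponds to the prunable intermediate channel number searched later by BDSCC.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SeparableResnetBlock(nn.Module):
    """ResNet block whose standard 3x3 convolutions are replaced by
    depthwise separable convolutions; mid_ch is the intermediate
    channel number that channel pruning later adjusts."""
    def __init__(self, ch, mid_ch):
        super().__init__()
        self.body = nn.Sequential(
            DepthwiseSeparableConv(ch, mid_ch),
            nn.InstanceNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            DepthwiseSeparableConv(mid_ch, ch),
            nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # identity skip preserves the residual learning mechanism
```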

3.2.4. Dual-Resolution Collaborative Discriminator

In the task of translating SAR images to optical remote sensing images, the target images exhibit rich detail features and complex texture characteristics, and their imaging principles differ fundamentally from those of conventional natural images. This poses significant challenges to the design of the discriminator: at the macro level, it must preserve global content-distribution consistency between generated images and target images, while at the micro level, it must ensure geometric accuracy and textural fidelity in local details. To address these requirements, this paper adopts PatchGAN, based on local receptive fields, as the foundational framework of the discriminator. This architecture outputs a 1-channel N × N two-dimensional response matrix rather than the single scalar score of traditional discriminators, where each matrix element corresponds to a specific local receptive-field region within the input image, enabling independent authenticity discrimination for local image patches rather than a unified judgment for the entire image. This spatial sensitivity enables the discriminator to simultaneously capture textural features and local structural information, effectively resolving the challenge of detail reconstruction in optical remote sensing images. Notably, by removing the final fully connected layer of the original design, the architecture reduces the number of parameters in the network while maintaining discriminative precision, thereby significantly enhancing computational efficiency.
To further enhance the quality of generated images, the discriminator needs to perform well in discriminating both the global information and the local details of images. Acquiring more global information requires expanding the receptive field. While increasing network depth is a common way to enlarge the receptive field, it leads to a surge in the number of parameters. To address this issue, this paper proposes a dual-resolution collaborative discriminator. The discriminator comprises two identical parallel network structures that process images at different resolutions, as shown in Figure 5. The first discriminator receives concatenated inputs of SAR images with generated images and of SAR images with optical remote sensing images; the second discriminator takes concatenated inputs of 2× down-sampled SAR images with generated images and of 2× down-sampled SAR images with optical remote sensing images. The high-resolution branch effectively captures texture and detail features through its pixel-level discriminative capability, while the low-resolution branch naturally achieves a larger receptive field through down-sampling, thereby enhancing global-consistency discrimination. By processing images at two resolutions in parallel, this architecture enables collaborative discrimination of global and local features while avoiding parameter explosion.
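A minimal PyTorch sketch of this structure is given below; the number of convolutional layers per branch and the use of average pooling for the 2× down-sampling are assumptions of the sketch, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchgan_branch(in_ch=6, base=64, n_layers=3):
    """One PatchGAN branch: outputs an N x N response map instead of a scalar
    and has no final fully connected layer (layer count is an assumption)."""
    layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True)]
    ch = base
    for _ in range(n_layers - 1):
        layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, True)]
        ch *= 2
    layers.append(nn.Conv2d(ch, 1, 4, stride=1, padding=1))
    return nn.Sequential(*layers)

class DualResolutionDiscriminator(nn.Module):
    """Two identical PatchGAN branches: one sees the full-resolution SAR/image
    pair, the other a 2x down-sampled copy for a larger effective receptive field."""
    def __init__(self):
        super().__init__()
        self.d_full = patchgan_branch()
        self.d_half = patchgan_branch()

    def forward(self, sar, image):
        pair = torch.cat([sar, image], dim=1)          # e.g. a 256 x 256 x 6 conditional pair
        pair_half = F.avg_pool2d(pair, kernel_size=2)  # 2x down-sampled pair
        return self.d_full(pair), self.d_half(pair_half)
```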

3.3. Training and Search Strategies of the Student Network

3.3.1. Knowledge Distillation Based on Intermediate-Layer Features

The concept of knowledge distillation in neural networks was first proposed by Hinton et al. [20], whose core idea lies in transferring high-level knowledge from teacher networks to student networks by matching the logits distribution of the output layer, that is, the network outputs before activation functions, thereby improving the performance of student networks. However, within CGAN frameworks, particularly for supervised translation tasks such as SAR-to-optical remote sensing images, this approach faces significant challenges: the deterministic image outputs generated by teacher networks struggle to form effective probability distributions, and compared to real target images, their generation results often fail to provide incremental knowledge information. This limitation restricts the guidance effectiveness of knowledge distillation strategies based on output layers of teacher networks for student networks.
To address this technical bottleneck, this paper designs a knowledge distillation framework based on intermediate-layer features. Compared with the final output layer, the intermediate layers of the teacher generator exhibit the following advantages: their high-dimensional feature space contains richer semantic information, and intermediate-layer features permit moderate deviations, thereby avoiding the mode collapse caused by strict alignment at the output layer. As illustrated in Figure 6, the intermediate-layer features of the teacher generator are matched and aligned with those of the student generator through 1 × 1 convolutions. To reduce the number of parameters and the computational complexity, this paper strategically selects four representative feature layers from the feature-transformation module, at positions 0, 3, 6, and 9. Compared with traditional distillation methods, this approach achieves progressive guidance of knowledge transfer from low-level to high-level features, leading to a significant improvement in the quality of images generated by the student network.
The distillation loss can be defined as:
$$\mathcal{L}_{distill} = \sum_{i=1}^{T} \left\| f_i\big(G_i(x)\big) - G_i'(x) \right\|_2$$
Here, $T$ denotes the total number of matched intermediate layers, $i$ indexes the $i$-th matched layer, and $f_i$ indicates the 1 × 1 convolutional block corresponding to the $i$-th matched intermediate layer of the two generators. $G_i(x)$ and $G_i'(x)$, respectively, represent the feature maps of the input SAR image at the $i$-th matched intermediate layer of the teacher generator and the student generator. The L2 norm is employed to minimize the distance between the intermediate-layer feature outputs of the teacher and student networks.
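A possible PyTorch realization of this distillation loss is sketched below. Which generator the 1 × 1 adapter is attached to, as well as the channel sizes, are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class IntermediateFeatureDistillation(nn.Module):
    """Distillation over the matched intermediate layers (blocks 0, 3, 6 and 9 of the
    feature-transformation module).  A 1x1 convolution per matched layer aligns the
    channel dimensions of the two generators before taking the L2 distance;
    the channel sizes here are placeholders."""
    def __init__(self, teacher_ch=256, student_ch=256, num_layers=4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Conv2d(teacher_ch, student_ch, kernel_size=1) for _ in range(num_layers)
        )

    def forward(self, teacher_feats, student_feats):
        # teacher_feats / student_feats: lists of feature maps from the matched layers
        loss = 0.0
        for adapter, t, s in zip(self.adapters, teacher_feats, student_feats):
            # teacher features are detached so only the student receives gradients
            loss = loss + torch.norm(adapter(t.detach()) - s, p=2)
        return loss
```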

3.3.2. Channel Pruning via Weight Sharing

The channel configuration of traditional generators typically relies on manual empirical guidelines, with the channel numbers at each layer generally following a geometric progression of powers of 2, such as 128, 256, and 512. This static, prior-based architecture exhibits significant limitations: first, there is an adaptation discrepancy between channel capacity and task requirements, resulting in excessive feature-representation capability; second, the exponentially increasing channel dimensions easily induce parameter redundancy, constraining the compression potential of the network. To address these issues, this paper introduces a channel pruning strategy to reselect the intermediate-layer channel numbers in the depthwise separable convolutional residual blocks, achieving a second round of network compression.
Our core objective is to select the optimal channel configuration with a minimal number of parameters and computational complexity. The most straightforward solution involves exhaustively enumerating all possible channel configurations and filtering them through complete training and validation set evaluation. However, as the number of network layers to be pruned increases, the solution space expands exponentially, causing traditional methods to face severe time consumption problems.
To address this challenge, we introduce the once-for-all (OFA) algorithm proposed by Cai et al. [29]. They designed a progressive shrinking training framework, specifically: first training a supernet with maximum channel scale, then randomly sampling subnets across four pruning dimensions through weight sharing, and updating their weights via fine-tuning. This algorithm constructs a supernet architecture compatible with multi-channel configurations, enabling subnets with different channel configurations to both share weights and operate independently. Inspired by this, this paper designs the channel pruning framework via weight sharing as shown in Figure 7. The subnets inherit all architectural components from the supernet except for different channel configurations in intermediate layers. After training the supernet with maximum channel scale, we randomly sample channel numbers in intermediate layers to generate corresponding subnets, train them, and update their weights via backpropagation. It can be seen that the weights of earlier channels in each layer are extensively shared among multiple subnets, demonstrating that these weights play a critical role in the network.
This weight-sharing mechanism has significant advantages. By reusing parameters, it enables parallel training of massive subnets, significantly reducing computational complexity. Meanwhile, the expanded candidate subnet pool effectively elevates the upper bound of the search space for optimal solutions.
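The weight-sharing idea can be sketched in PyTorch as follows. The initialization, and the fact that the layer consuming the truncated features would also need its input channels sliced, are simplified here for illustration; the candidate channel options follow the settings reported in Section 4.2.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPointwiseConv(nn.Module):
    """Pointwise convolution whose output channels can be truncated at run time.
    Every subnet reuses the first `active_out` output channels of the supernet
    weight, so the weights of earlier channels are shared by many subnets."""
    def __init__(self, in_ch, max_out_ch):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out_ch, in_ch, 1, 1) * 0.02)
        self.active_out = max_out_ch

    def forward(self, x):
        # slice the shared supernet weight to the subnet's channel number
        return F.conv2d(x, self.weight[: self.active_out])

# Candidate channel multipliers of the 9 prunable residual blocks (Section 4.2),
# each to be multiplied by 16:
CANDIDATES = [[16, 24, 32]] * 6 + [[16, 20, 26, 32]] * 3

def sample_subnet_configuration():
    """Randomly sample one subnet's channel configuration while training the supernet."""
    return [random.choice(options) * 16 for options in CANDIDATES]
```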

3.3.3. BDSCC: Bio-Inspired Dynamic Search of Channel Configuration Algorithm

To enhance the search efficiency of channel configuration, this paper proposes a bio-inspired dynamic search of channel configuration (BDSCC) algorithm. By decoupling the training and search process of the student network, it effectively addresses the exponential complexity problem faced by traditional brute-force search algorithms in supernet architectures. As shown in Figure 8, the algorithm primarily incorporates four bio-inspired mechanisms:
(1)
Fitness evaluation mechanism
Fitness evaluation refers to assessing the adaptability of a species and using a fitness evaluation function as the optimization objective. This paper employs the Fréchet inception distance (FID) [30] image-quality metric as the fitness evaluation function to quantitatively measure the performance of each subnet on the validation set, thereby establishing a mapping between channel configurations and generated-image quality.
FID measures the similarity between generated images and target images by calculating the distance between their features extracted by the Inception V3 network. It has been widely used in GANs as a metric of similarity between generated and target images, covering aspects such as quality and diversity.
(2)
Natural selection mechanism
Based on FID evaluation results, elite screening is performed to select high-quality subnets from the current population as parent networks, ensuring the directional transmission of superior genes. This mechanism simulates the natural law of “survival of the fittest” to progressively enhance the overall fitness of the population.
(3)
Gene recombination mechanism
Gene recombination refers to generating offspring networks through chromosomal crossover and recombination of two parents. Specifically, for the channel configuration of intermediate layers, the offspring randomly inherits any value of the corresponding channel number from two parents. This dominant inheritance strategy effectively combines the architectural advantages of different parents.
(4)
Gene mutation mechanism
This mechanism introduces controlled probabilistic mutations, where the channel configuration of randomly selected intermediate layers in offspring networks is reset within predefined ranges. By introducing architectural perturbations, it breaks local optima, increases population diversity, and prevents premature convergence during optimization and search processes.
Compared to traditional, time-consuming brute-force search algorithms, the proposed algorithm constructs a dynamic biological population that gradually converges toward the optimal channel configuration through an iterative process. Each iteration consists of four stages: fitness evaluation → natural selection → gene recombination → gene mutation, where natural selection guides the direction of architecture optimization, while gene recombination and mutation collectively provide exploitation and exploration capabilities, establishing an efficient architecture-search paradigm.
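The four mechanisms can be combined into an evolutionary loop of the following form. The population size, number of generations, elite count, and mutation probability below are placeholders for illustration, not the settings reported in Table 1, and evaluate_fid stands for evaluating a subnet's FID on the validation set.

```python
import random

def bdscc_search(evaluate_fid, candidates, population_size=50, generations=20,
                 parents_k=10, mutation_prob=0.1):
    """Evolutionary-style search over channel configurations (a generic sketch).
    evaluate_fid(config) -> FID of the subnet with that configuration (lower is better).
    candidates[i]        -> allowed channel numbers for intermediate layer i."""
    population = [[random.choice(opts) for opts in candidates] for _ in range(population_size)]
    for _ in range(generations):
        # Fitness evaluation: rank subnets by FID on the validation set
        ranked = sorted(population, key=evaluate_fid)
        # Natural selection: keep the elite subnets as parents
        parents = ranked[:parents_k]
        offspring = []
        while len(offspring) < population_size - parents_k:
            a, b = random.sample(parents, 2)
            # Gene recombination: each layer inherits the channel number of either parent
            child = [random.choice(pair) for pair in zip(a, b)]
            # Gene mutation: occasionally reset a layer's channels within its allowed range
            for i, opts in enumerate(candidates):
                if random.random() < mutation_prob:
                    child[i] = random.choice(opts)
            offspring.append(child)
        population = parents + offspring
    return min(population, key=evaluate_fid)
```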

3.4. Loss Function

To address issues such as speckle noise in SAR images, as well as the significant texture differences and semantic gap between SAR and optical remote sensing images, this paper designs a pixel-semantic dual-domain alignment loss function. This function operates through the collaborative effect of adversarial loss, perceptual loss, and feature-matching loss [31]. By implementing joint constraints across both pixel-level geometric features and semantic-level content logic, it effectively resolves problems such as semantic drift and detail distortion in SAR image translation. Specifically, the adversarial loss optimizes the global semantic information and local pixel details of generated images by combining the outputs of the dual-resolution collaborative discriminator. The perceptual loss measures high-level semantic differences between generated and target images by using deep convolutional neural networks. Meanwhile, the feature-matching loss achieves cross-layer alignment from pixel-level details to semantic-level structures by constraining the statistical distribution consistency of intermediate-layer features between generated and target images.

3.4.1. Adversarial Loss

To address the modality differences between SAR and optical remote sensing images, this paper employs the least-squares loss from LSGAN [32]. Its advantages are twofold: it effectively alleviates issues prevalent in conventional GAN training, such as mode collapse and vanishing gradients, ensuring the smooth progression of the entire compression process, and it improves the visual quality of generated images. The objective function of the adversarial loss is defined as follows:
$$\min_{D_1, D_2} \sum_{k=1,2} \mathcal{L}_{D_k}(D_k) = \sum_{k=1,2} \mathbb{E}_{x,y}\Big[\big(D_k(x, y) - 1\big)^2\Big] + \mathbb{E}_{x,y}\Big[D_k\big(x, G(x)\big)^2\Big]$$
$$\min_{G} \mathcal{L}_{G}(G) = \sum_{k=1,2} \mathbb{E}_{x,y}\Big[\big(D_k\big(x, G(x)\big) - 1\big)^2\Big]$$
where $k$ indexes the $k$-th discriminator, $G$ is the generator, $D_k$ is the $k$-th discriminator, $x$ is the input SAR image, $y$ is the corresponding optical remote sensing image, and $G(x)$ is the image generated by passing the input SAR image through the generator. The total adversarial loss is expressed as:
$$\mathcal{L}_{CGAN}(G, D) = \sum_{k=1,2} \mathcal{L}_{D_k}(D_k) + \mathcal{L}_{G}(G)$$
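In code, the least-squares objectives over the two discriminator branches reduce to the sketch below; replacing the expectation with a batch mean is an implementation convention assumed here, not something stated in the paper.

```python
import torch

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares loss over the two discriminator branches:
    real pairs are pushed towards 1, generated pairs towards 0."""
    loss = 0.0
    for real, fake in zip(real_scores, fake_scores):
        loss = loss + torch.mean((real - 1.0) ** 2) + torch.mean(fake ** 2)
    return loss

def lsgan_generator_loss(fake_scores):
    """The generator is optimized so that both branches score generated pairs close to 1."""
    return sum(torch.mean((fake - 1.0) ** 2) for fake in fake_scores)
```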

3.4.2. Perceptual Loss

The concept of perceptual loss was first introduced by Johnson et al. [33] in the field of image style transfer. Unlike traditional pixel-level L1/L2 norm loss functions, perceptual loss employs deep convolutional neural networks to extract high-level semantic features from images, establishing similarity metrics that align closely with human visual perception. Its fundamental principle lies in using the feature space of a pre-trained VGG network to measure discrepancies between generated and target images at the semantic level, rather than relying on direct pixel-level comparisons. This paper defines the perceptual loss based on the VGG network as follows:
$$\mathcal{L}_{VGG}\big(y, G(x)\big) = \sum_{i=1}^{T} \frac{1}{N_i} \left\| \varphi_i\big(G(x)\big) - \varphi_i(y) \right\|_1$$
where $T$ is the total number of layers of the VGG network that are used, $i$ denotes the $i$-th layer, $N_i$ is the number of elements in the feature map produced by the $i$-th layer, $\varphi$ is the VGG network, and $\varphi_i(y)$ is the feature map of the optical remote sensing image at the $i$-th layer of the VGG network.
Compared with traditional pixel-level loss functions, this deep-feature-based representation exhibits distinct advantages. First, the hierarchical architecture of deep convolutional neural networks progressively extracts abstract features that effectively characterize the semantic content of images, such as land-cover structures. Second, the distance metric in deep feature space demonstrates high consistency with human subjective perception [34]; this becomes particularly crucial when processing complex remote sensing images. When images contain diverse land-cover categories and intricate spatial relationships, simple pixel-level differences fail to accurately reflect the semantic similarity between images.
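A sketch of a VGG-based perceptual loss is given below. The specific VGG variant (VGG19 here) and the tapped layer indices are assumptions of this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGPerceptualLoss(nn.Module):
    """L1 distance between VGG feature maps of generated and target images,
    averaged per layer (the 1/N_i weighting)."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)          # frozen, pre-trained feature extractor
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.last_layer = max(layer_ids)

    def forward(self, generated, target):
        loss, x, y = 0.0, generated, target
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.layer_ids:
                loss = loss + torch.mean(torch.abs(x - y))  # mean over the N_i elements
            if idx == self.last_layer:
                break
        return loss
```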

3.4.3. Feature-Matching Loss

Traditional adversarial loss relies solely on the output layer of the discriminator to guide generator training. This single-dimensional feedback exhibits significant limitations: The highly abstract nature of deep-layer features struggles to capture fine-grained feedback information, while the substantial modality differences between SAR and optical remote sensing images render such unidimensional feedback inadequate for effectively guiding the network in SAR-to-optical image translation tasks. To address this, we introduce the feature-matching loss proposed by Salimans et al. [35], constructing an alignment mechanism based on intermediate-layer features of the dual-resolution collaborative discriminator. The feature-matching loss involves weighting and calculating the intermediate-layer features of the discriminator. This loss function quantifies statistical discrepancies between feature vectors extracted from generated and target images. By minimizing the feature-matching loss of the discriminator, the generator is compelled to produce images with higher similarity to target images. The feature-matching loss in this paper is formulated as follows:
$$\mathcal{L}_{FM}(G, D) = \sum_{k=1,2} \mathbb{E}_{x,y} \sum_{i=1}^{T} \frac{1}{N_i} \left\| D_k^{(i)}(x, y) - D_k^{(i)}\big(x, G(x)\big) \right\|_1$$
where $k$ indexes the $k$-th discriminator, $T$ is the total number of discriminator layers from which features are extracted (set to 4 in this paper), $i$ denotes the $i$-th layer of the discriminator, and $N_i$ is the number of elements in the feature map produced by the $i$-th layer of the discriminator. Specifically, as illustrated in Figure 5, the SAR image and the generated image are first concatenated along the channel dimension into a 256 × 256 × 6 tensor, which is then fed into the discriminator to produce intermediate-layer feature maps. Simultaneously, the SAR image and the optical remote sensing image are concatenated in the same manner and fed into the discriminator to produce the corresponding feature maps. Subsequently, all images are 2× down-sampled and the feature-extraction process is repeated.
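Given lists of intermediate feature maps from both branches of the dual-resolution collaborative discriminator, the loss can be computed as in the sketch below; extracting the feature lists from the discriminator's forward pass is omitted for brevity.

```python
import torch

def feature_matching_loss(real_feats, fake_feats):
    """real_feats / fake_feats: for each of the two discriminator branches, the list of
    T intermediate feature maps extracted from the (SAR, target) and (SAR, generated)
    pairs.  Averaging over elements implements the 1/N_i weighting; the target-side
    features are detached so only the generator receives gradients."""
    loss = 0.0
    for branch_real, branch_fake in zip(real_feats, fake_feats):
        for f_real, f_fake in zip(branch_real, branch_fake):
            loss = loss + torch.mean(torch.abs(f_fake - f_real.detach()))
    return loss
```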

3.4.4. Total Loss

The total loss during the training of the teacher network is:
$$\mathcal{L}_{teacher} = \alpha \mathcal{L}_{CGAN} + \beta \mathcal{L}_{VGG} + \gamma \mathcal{L}_{FM}$$
where $\alpha$, $\beta$, and $\gamma$ represent the weights of the respective losses.
The total loss during the training of the student network and fine-tuning the subnet is:
$$\mathcal{L}_{student} = \alpha \mathcal{L}_{CGAN} + \beta \mathcal{L}_{VGG} + \gamma \mathcal{L}_{FM} + \lambda \mathcal{L}_{distill}$$
where $\alpha$, $\beta$, $\gamma$, and $\lambda$ represent the weights of the respective losses.
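Putting the terms together, the generator objectives reduce to simple weighted sums; the default weights in the sketch below follow the settings reported in Section 4.2.

```python
def teacher_total_loss(l_cgan, l_vgg, l_fm, alpha=1.0, beta=10.0, gamma=10.0):
    """Weighted sum used to train the teacher network."""
    return alpha * l_cgan + beta * l_vgg + gamma * l_fm

def student_total_loss(l_cgan, l_vgg, l_fm, l_distill,
                       alpha=1.0, beta=10.0, gamma=10.0, lam=0.1):
    """Student-training / fine-tuning objective: the teacher loss plus the distillation term."""
    return alpha * l_cgan + beta * l_vgg + gamma * l_fm + lam * l_distill
```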

4. Results

4.1. Experimental Procedures

Stage 1: Pre-training the teacher network
First, a teacher network with a large number of parameters is trained. The overall structure of the teacher network is shown in Figure 2, where standard convolutions are used in all generator modules. The pre-trained teacher network will serve as one of the benchmarks for subsequent student network training.
Stage 2: Training and searching the student network
First, a student network with fewer parameters is designed. Based on sensitivity analysis of the generator modules, only the standard convolutions in the ResNet blocks of the teacher network are replaced with depthwise separable convolutions to initially reduce the number of parameters.
Then, using knowledge distillation based on intermediate-layer features and channel pruning via weight sharing, a once-for-all student generator containing all possible channel configurations is trained. Both the student and teacher networks are trained with the proposed pixel-semantic dual-domain alignment loss function. This trained student network becomes the supernet for the subsequent search.
Next, using the FID as the fitness evaluation function, the BDSCC algorithm is applied to search the supernet. A series of subnets is evaluated on the validation set, and the subnet with a smaller number of parameters and the best performance is selected.
Stage 3: Fine-tuning the subnet
The selected subnet is fine-tuned. The training process is similar to the previous student network training, with the only difference being the replacement of the student network with the searched subnet of specific channel configuration.
By following these procedures, the entire compression is completed. The final compressed network ILF-BDSNet for SAR-to-optical remote sensing image translation achieves significant reductions in both number of parameters and computational complexity, while maintaining outstanding performance in subjective visual quality and objective evaluation metrics.

4.2. Datasets and Parameter Settings

In this paper, paired SAR and optical remote sensing images of Nanjing, Jiangsu Province were used as the experimental dataset. The SAR images were collected from the RADARSAT-2 satellite with a resolution of 5 m, while the optical remote sensing images were acquired from the RapidEye satellite, also at 5 m resolution. After preprocessing, the images were cropped to a size of 256 × 256 × 3 and augmented through flipping and rotation operations. Ultimately, 8452 pairs were randomly selected as the training set, with 1500 pairs each allocated to the validation set and test set. Meanwhile, this paper utilized the SEN1-2 dataset [36] as a supplementary dataset to further prove the effectiveness of the compressed network proposed in this paper. The resolution of the SEN1-2 dataset used in this paper is 5 m. After data augmentation, 6200 pairs were randomly selected as the training set, 1000 pairs as the validation set, and 1000 pairs as the test set. The SEN1-2 dataset was only used as an additional dataset in the comparison experiments of different networks, while the Nanjing dataset was used for network comparison experiments and other experiments. Additionally, due to differences in acquisition times, slight differences exist between the SAR and optical remote sensing images.
The input and output image sizes of the teacher generator are both 256 × 256 × 3, while the input image sizes of the two discriminator branches are 256 × 256 × 6 and 128 × 128 × 6, with output sizes of 35 × 35 × 1 and 19 × 19 × 1, respectively. The Adam optimizer is used with a batch size of 8, a total of 100 training epochs, and an initial learning rate of 2 × 10^−4. A linear learning-rate decay strategy is applied in the last 50 epochs. The loss weights α, β, and γ are set to 1, 10, and 10, respectively; these settings are empirical. The network is implemented in the PyTorch framework and trained on two NVIDIA GeForce RTX 3080Ti GPUs.
The parameter settings for training the student network are the same as those for training the teacher network, with the loss weight λ set to 0.1.
Regarding the candidate channel numbers of the 9 depthwise separable convolutional residual blocks: considering that convolutional layers deeper in the network play a more critical role in feature transformation, the last 3 residual blocks are given more options than the preceding 6 blocks. Therefore, the candidate channel numbers for the first 6 residual blocks are {16, 24, 32} × 16, while those for the last 3 blocks are {16, 20, 26, 32} × 16.
After training the supernet, the subnet with the best performance needs to be searched from the entire supernet. We use the BDSCC algorithm to find the best channel configuration of the residual blocks. The specific parameter settings for searching the supernet are shown in Table 1.
After the search is completed, when fine-tuning the subnet, the total training epochs are reduced to 50, with a linear learning-rate decay strategy applied in the last 25 epochs. The remaining parameter settings are the same as those used when training the student network.

4.3. Evaluation Metrics

Although the mean square error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) have long been regarded as benchmark methods for measuring image similarity, their mathematical formulation based on pixel-level statistics has significant limitations. Specifically, these metrics struggle to adequately capture changes in image semantics and perceptual quality as perceived by human vision. A typical example is image blurring, which may significantly reduce perceptual quality to the human eye but causes only a slight fluctuation in MSE.
The concept of learned perceptual image patch similarity (LPIPS) [34] is similar to that of perceptual loss, which measures image similarity through a pre-trained deep convolutional neural network. LPIPS has been validated through extensive experiments, demonstrating that deep features are an effective perceptual measure of image similarity. It outperforms traditional metrics such as MSE, PSNR, and SSIM in supervised, unsupervised, and self-supervised models.
Therefore, this paper employs image similarity metrics based on perception, using the FID and LPIPS as the primary evaluation metrics, with MSE, PSNR, and SSIM serving as auxiliary traditional evaluation metrics.
In addition, for the compressed network in this paper, another aspect needs to be considered: model complexity. The number of parameters and the computational complexity are the most commonly used metrics of model complexity. The former mainly comes from the weights of convolution kernels, while the latter mainly comes from the multiply and add operations required during a forward pass. This paper uses multiply–accumulate operations (MACs) as the measure of computational complexity.
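For reference, the two complexity metrics can be computed as in the sketch below: a count of trainable parameters and a per-layer MAC count for standard or grouped convolutions, which profiling tools sum over the whole network.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters (dominated by convolution kernel weights)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def conv_macs(layer: nn.Conv2d, out_h: int, out_w: int) -> int:
    """MACs of a single Conv2d for a given output spatial size:
    one multiply-accumulate per kernel element per output position."""
    kh, kw = layer.kernel_size
    return layer.out_channels * out_h * out_w * (layer.in_channels // layer.groups) * kh * kw
```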

4.4. Results and Analysis

4.4.1. Results of Searching the Supernet

Table 2 presents the experimental results of searching the supernet. As shown in the table, channel importance varies significantly across different network layers, and there exists no strict negative correlation between MACs and FID metrics. For instance, when the channel configuration is 24_32_32_32_24_32_26_32_32, the MACs reaches 7.81 G while the FID is 86.30. However, adopting the 32_16_32_16_16_24_32_20_32 configuration reduces MACs by 0.20 G while simultaneously decreasing FID by 1.28. This further verifies the existence of channel redundancy and demonstrates the necessity of channel pruning during network compression. Through dynamic search and evaluation of the proposed algorithm, the channel configuration with the lowest FID, 32_16_32_16_16_24_32_20_32, was ultimately selected and fine-tuned as the final compressed network for this paper.

4.4.2. Comparison with the Teacher Network

Table 3 and Figure 9 present the comparison of evaluation metrics and translation results between the teacher network and the compressed network ILF-BDSNet, respectively. As illustrated in Figure 9, thanks to the joint optimization of the dual-resolution collaborative discriminator and the pixel-semantic dual-domain alignment loss function designed in this paper, the overall translation results maintain a high level. All land-cover categories can be effectively translated with clear segmentation of farmland and land boundaries, reasonable restoration of water bodies, fine texture details on building surfaces, and recognizable road structures, indicating the reliable performance of the teacher network.
According to Table 3, ILF-BDSNet significantly reduces the number of parameters of the generator by 38.783 M, while the FID decreases by 0.58 compared to the teacher network, the LPIPS increases slightly by only 0.0051, the SSIM improves by 0.0137, and the MSE and PSNR remain highly consistent. The translation results in Figure 9 demonstrate that the images generated by ILF-BDSNet still perform well in maintaining boundary clarity and texture details. It can accurately translate typical land-cover categories such as farmland, land, water bodies, and roads. Although minor feature loss is observed in building areas, the overall structure remains accurate.
Table 4 presents the comparison of other metrics between the teacher network and ILF-BDSNet, including the number of parameters, MACs, image-processing time, and GPU memory usage. These metrics are all for the generator. The image-processing time and GPU memory usage were tested on an NVIDIA GeForce RTX 3080Ti GPU with a batch size of 8. Ultimately, ILF-BDSNet reduces the number of parameters by 85.0% compared to the teacher network, MACs by 56.6%, image-processing time by 20.6%, and GPU memory usage by 25.7%.
It can be seen that these metrics of ILF-BDSNet are significantly lower than those of the teacher network. The experimental results above demonstrate that this paper successfully constructs a SAR-to-optical remote sensing image translation network that significantly reduces both the number of parameters and the computational complexity, while maintaining subjective visual quality and objective evaluation metrics comparable to those of the teacher network.

4.4.3. Different Network Analysis

To demonstrate that ILF-BDSNet proposed in this paper not only reduces the number of parameters and computational complexity but also exhibits outstanding performance in SAR-to-optical image translation tasks, this section compares ILF-BDSNet with the current mainstream image translation methods, namely the supervised Pix2pix [6] and the unsupervised CycleGAN [37], on different datasets.
Nanjing dataset: Table 5 and Figure 10 show the comparison of evaluation metrics and translation results of different networks on the Nanjing dataset, respectively. As shown in Table 5, the number of parameters of ILF-BDSNet is significantly lower than that of both Pix2pix and CycleGAN: it is reduced by 44.811 M compared to the supervised Pix2pix and by 15.914 M compared to the unsupervised CycleGAN. Meanwhile, compared to Pix2pix, the FID of ILF-BDSNet decreases by 63.60 and the LPIPS decreases by 0.0293, and it also demonstrates a comprehensive advantage in other traditional metrics such as MSE. In comparison with CycleGAN, the FID of ILF-BDSNet decreases by 20.64, the LPIPS decreases by 0.0067, and all other traditional metrics also outperform those of CycleGAN.
The translation results in Figure 10 show that both Pix2pix and CycleGAN exhibit poor visual performance in SAR image translation tasks, producing low-quality generated images. Specifically, Pix2pix suffers from numerous errors in the translation of land-cover categories, such as translating water bodies into farmland, structural distortions in building areas, blurred boundaries between roads and farmland, and significant color distortion. Meanwhile, CycleGAN generates images with rough textures, ambiguous edges, and other noticeable issues, and its reconstruction of building areas is also unreasonable. In contrast, the images generated by ILF-BDSNet show a substantial improvement in quality compared to the two methods. It demonstrates competent performance in preserving texture details and boundary segmentation, achieving efficient translation for typical land-cover categories such as farmland, land, water bodies, and roads. Although the translation of building areas lacks fine texture, the overall structure is relatively accurate.
SEN1-2 dataset: Table 6 and Figure 11 show the comparison of evaluation metrics and translation results of different networks on the SEN1-2 dataset, respectively. As shown in Table 6, the number of parameters of ILF-BDSNet proposed in this paper has significantly decreased compared to Pix2pix and CycleGAN. Compared to Pix2pix, the FID of ILF-BDSNet decreases by 42.28 and the LPIPS by 0.1050. Compared to CycleGAN, the FID of ILF-BDSNet decreases by 54.71 and the LPIPS by 0.1434. Meanwhile, in traditional metrics such as MSE, ILF-BDSNet also outperforms Pix2pix and CycleGAN comprehensively. The translation results in Figure 11 show that on the SEN1-2 dataset, the generated images of Pix2pix and CycleGAN are of poor quality. The images generated by Pix2pix have a large number of blurring issues and perform very poorly in translating building areas. The images generated by CycleGAN contain many translation errors, such as translating farmland into land. However, the images generated by ILF-BDSNet still maintain high quality and can achieve efficient translation for farmland, land, water bodies, roads, and complex building areas.
The comparison and analysis in this section demonstrate that ILF-BDSNet proposed in this paper significantly reduces both the number of parameters and computational complexity compared to mainstream image translation methods, while still showing superior performance in the SAR-to-optical image translation tasks on different datasets, providing an efficient solution for SAR-to-optical image translation tasks in resource-constrained environments.

4.4.4. Ablation Experiments

The experimental results in the preceding sections have demonstrated the performance advantages of ILF-BDSNet. To further investigate the contributions of individual modules and validate the effectiveness and robustness of the network, this section conducts comprehensive ablation experiments. Table 7 and Figure 12 present a comparison of evaluation metrics and translation results of the different methods in the ablation experiments, respectively.
Single discriminator refers to the setting in which neither the teacher network nor the student network uses the dual-resolution collaborative discriminator designed in this paper, employing only the discriminator at the original resolution. The data in Table 7 indicate that with a single discriminator, the FID metric increases by 5.89 compared to ILF-BDSNet, while the LPIPS metric decreases by only 0.0002. Meanwhile, traditional metrics such as MSE are also significantly worse than those of ILF-BDSNet. The visualization results in Figure 12 show that the image quality obtained with a single discriminator is inferior to that of ILF-BDSNet, with significant semantic distortion, manifested as distorted building structures and disordered object logic. This indicates that a single discriminator is unable to provide effective feedback to the generator at the semantic level, resulting in fundamental defects in the high-level semantics of the generated images.
Traditional distillation refers to guiding the student network directly with the output layer of the teacher network during knowledge distillation. The data in Table 7 show that with traditional distillation, the FID metric increases by 7.13 compared to ILF-BDSNet, the LPIPS metric decreases by only 0.0006, and traditional metrics such as MSE also lag significantly behind ILF-BDSNet. Meanwhile, Figure 12 also reveals that the images generated with traditional distillation exhibit global semantic distortion and local detail degradation compared to ILF-BDSNet; specifically, there are structural distortions in roads and buildings, and edge blurring is obvious. This indicates that distillation based solely on the output layer cannot provide effective guidance to the student network, and further demonstrates the effectiveness and reliability of knowledge distillation based on intermediate-layer features.
Without pruning refers to employing only the knowledge distillation technique based on intermediate-layer features, without introducing the channel pruning module. As Table 7 indicates, the FID of the without-pruning variant increases by 3.82 compared to the complete compression method, and the number of generator parameters increases by 1.126 M. Although the other evaluation metrics are close to those of the complete compression method, as shown in Figure 12, the quality of the generated images is not significantly improved. While typical land-cover categories such as farmland, land, and water bodies are translated relatively well, there are still significant defects in the detail restoration of building areas, which fully validates the existence of channel redundancy and the necessity of channel pruning.
Lightweight network denotes reducing the initial number of channels in the teacher generator so that its number of parameters is roughly the same as that of ILF-BDSNet. Table 7 shows a significant performance degradation in the key metrics of the lightweight network compared to ILF-BDSNet, with the FID increasing by 29.22, the LPIPS increasing by 0.0169, and traditional metrics such as MSE also lagging significantly. The translation results in Figure 12 expose critical semantic distortions in the images generated by the lightweight network, manifesting as high error rates in land-cover category translation, blurred boundaries, missing textural details, and other unacceptable flaws, which fully exposes the limitations of simply reducing channel numbers and the necessity of training the student network through knowledge distillation and channel pruning.
Through the comparison and analysis of the ablation experiments mentioned above, ILF-BDSNet demonstrates an excellent balance between model complexity and translation quality, with its effectiveness and reliability being fully validated.

4.4.5. Loss Function Analysis

In order to systematically evaluate the performance of the pixel-semantic dual-domain alignment loss function designed in this paper, this section conducts experiments with different combinations of loss functions. Table 8 and Figure 13 present the evaluation metrics and translation results under the different loss functions. “w/o vgg” denotes the exclusion of perceptual loss, “w/o feat” the omission of feature-matching loss, and “w/vgg&feat” the complete loss function designed in this paper.
Perceptual loss extracts high-level semantic information from images through a pre-trained deep convolutional neural network. Removing it causes a significant drop in network performance: FID deteriorates by 14.33 and LPIPS increases by 0.0060, while the other traditional metrics also fall behind the full loss function. The translation results show that without perceptual loss, the generated images suffer from blurred edges, missing texture details, and a marked increase in misclassified land-cover categories, leading to a noticeable decline in visual perceptual quality.
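As a reference point, a common way to implement such a perceptual loss is to compare features from a frozen pre-trained VGG network. The sketch below uses VGG19 with a few assumed layer indices and an L1 criterion, and omits input normalization for brevity, so it should be read as an illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(nn.Module):
    """L1 distance between frozen VGG19 features of generated and target images.
    Layer indices 3/8/17 (relu1_2, relu2_2, relu3_4) are assumed choices;
    ImageNet input normalization is omitted here for brevity."""
    def __init__(self, layer_idx=(3, 8, 17)):
        super().__init__()
        self.features = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layer_idx = set(layer_idx)

    def forward(self, fake, real):
        loss, x, y = 0.0, fake, real
        for i, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if i in self.layer_idx:
                loss = loss + nn.functional.l1_loss(x, y)
            if i >= max(self.layer_idx):  # stop once the deepest selected layer is reached
                break
        return loss
```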
Feature-matching loss aligns generated and target images in both the pixel and semantic domains by utilizing the intermediate-layer features of the dual-resolution collaborative discriminator, providing the generator with richer feedback than the traditional adversarial loss alone. Removing it also degrades performance: FID increases by 5.74 and LPIPS by 0.0086, and the other traditional metrics again lag behind the full loss function. The translation results indicate that without feature-matching loss, the boundary clarity of the generated images decreases, confusion between land-cover types intensifies, and overall image quality falls clearly short of that obtained with the full loss function.
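A minimal sketch of this feature-matching idea is given below: intermediate features are collected from each resolution branch of the discriminator for both the generated and target images and compared with an L1 criterion. The assumption that each branch is an nn.Sequential (as in the dual-resolution sketch above), as well as the uniform layer weighting, are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn as nn

def branch_features(branch: nn.Sequential, x):
    """Collect the output of every layer of one discriminator branch."""
    feats = []
    for layer in branch:
        x = layer(x)
        feats.append(x)
    return feats

def feature_matching_loss(branches, downsample, sar, fake_opt, real_opt):
    """Sum of L1 distances between discriminator features of generated and
    target images, over both resolution branches and all layers."""
    loss = 0.0
    fake_in = torch.cat([sar, fake_opt], dim=1)
    real_in = torch.cat([sar, real_opt], dim=1)
    for i, branch in enumerate(branches):              # [full-resolution, half-resolution]
        f_in = fake_in if i == 0 else downsample(fake_in)
        r_in = real_in if i == 0 else downsample(real_in)
        for f_fake, f_real in zip(branch_features(branch, f_in),
                                  branch_features(branch, r_in)):
            loss = loss + nn.functional.l1_loss(f_fake, f_real.detach())
    return loss
```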
This section demonstrates, through a combination of quantitative metrics and qualitative analysis, that the pixel-semantic dual-domain alignment loss function designed in this paper plays a key role in both the translation from SAR to optical remote sensing images and the compression of the translation network.

5. Conclusions

This paper proposes a compressed network, ILF-BDSNet, suitable for SAR-to-optical image translation. Specifically, standard convolutions in the feature-transformation module of the teacher network are replaced with depthwise separable convolutions to construct the student network based on a sensitivity analysis of the generator module. A dual-resolution collaborative discriminator based on PatchGAN is also introduced. The student network is trained by knowledge distillation based on the intermediate-layer features and channel pruning via weight sharing. Furthermore, the BDSCC algorithm is proposed to select the best subnet. At the same time, in view of issues such as speckle noise and significant modality differences between SAR and optical remote sensing images, a pixel-semantic dual-domain alignment loss function is designed. The feature-matching loss in this function establishes an alignment mechanism based on the intermediate-layer features of the dual-resolution collaborative discriminator, achieving cross-layer alignment from pixel-level details to semantic-level structures. A series of experiments demonstrate the excellent performance of ILF-BDSNet, which significantly reduces the number of parameters and computational complexity while still generating high-quality optical remote sensing images, providing an efficient solution for SAR image translation tasks in resource-constrained environments.
Nevertheless, the proposed network still has certain limitations, and several possible future research directions are outlined below:
(1)
Although image translation networks combining Transformer and CNN can significantly improve the quality of generated images, Transformers still suffer from a large number of parameters and high computational complexity, and research on their compression and acceleration remains incomplete. A potential future direction is therefore to study Transformer compression methods that achieve effective network compression while maintaining the quality of the generated images.
(2)
The compressed network proposed in this paper is supervised and relies on strictly paired SAR-optical remote sensing image datasets. However, obtaining precisely paired datasets is challenging in practical applications. Future research could therefore focus on unsupervised compressed networks for SAR-to-optical remote sensing image translation.

Author Contributions

Conceptualization, Y.K. and C.X.; methodology, C.X.; software, C.X.; validation, Y.K. and C.X.; formal analysis, C.X.; investigation, C.X.; writing—original draft preparation, C.X.; writing—review and editing, Y.K. and C.X.; supervision, Y.K.; funding acquisition, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61501228, No. 62171220); Natural Science Foundation of Jiangsu (No. BK20140825); Aeronautical Science Foundation of China (No. 20152052029, No. 20182052012); Basic Research (No. NS2015040, No. NS2021030); and National Science and Technology Major Project (2017-II-0001-0017); Key Laboratory of Radar Imaging and Microwave Photonics, Ministry of Education (NJ20240002).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xue, W.; Ai, J.; Zhu, Y.; Chen, J.; Zhuang, S. AIS-FCANet: Long-Term AIS Data Assisted Frequency-Spatial Contextual Awareness Network for Salient Ship Detection in SAR Imagery. IEEE Trans. Aerosp. Electron. Syst. 2025, 1–6. [Google Scholar] [CrossRef]
  2. Zhou, H.; Yang, J.; Zhang, T.; Dai, A.; Wu, C. EAS-CNN: Automatic Design of Convolutional Neural Network for Remote Sensing Images Semantic Segmentation. Int. J. Remote Sens. 2023, 44, 3911–3938. [Google Scholar] [CrossRef]
  3. Manoharan, T.; Basha, S.H.; Murugan, J.S.; Suja, G.P.; Rajkumar, R.; Srimathi, S. A Novel Framework for Classifying Remote Sensing Images Using Convolutional Neural Networks. In Proceedings of the 2024 International Conference on Knowledge Engineering and Communication Systems (ICKECS), Chikkaballapur, India, 18–19 April 2024; Volume 1, pp. 1–6. [Google Scholar]
  4. Vasileiou, C.; Smith, J.; Thiagarajan, S.; Nigh, M.; Makris, Y.; Torlak, M. Efficient CNN-Based Super Resolution Algorithms for Mmwave Mobile Radar Imaging. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3803–3807. [Google Scholar]
  5. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Neural Information Processing Systems (NIPS): San Diego, CA, USA, 2014; Volume 27, pp. 2672–2680. [Google Scholar]
  6. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar]
  7. Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8798–8807. [Google Scholar]
  8. Fu, S.; Xu, F.; Jin, Y.-Q. Reciprocal Translation between SAR and Optical Remote Sensing Images with Cascaded-Residual Adversarial Networks. Sci. China Inf. Sci. 2021, 64, 122301. [Google Scholar] [CrossRef]
  9. Tan, D.; Liu, Y.; Li, G.; Yao, L.; Sun, S.; He, Y. Serial GANs: A Feature-Preserving Heterogeneous Remote Sensing Image Transformation Model. Remote Sens. 2021, 13, 3968. [Google Scholar] [CrossRef]
  10. Zhang, Q.; Liu, X.; Liu, M.; Zou, X.; Zhu, L.; Ruan, X. Comparative Analysis of Edge Information and Polarization on SAR-to-Optical Translation Based on Conditional Generative Adversarial Networks. Remote Sens. 2021, 13, 128. [Google Scholar] [CrossRef]
  11. Turnes, J.N.; Bermudez Castro, J.D.; Torres, D.L.; Soto Vega, P.J.; Feitosa, R.Q.; Happ, P.N. Atrous cGAN for SAR to Optical Image Translation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4003905. [Google Scholar] [CrossRef]
  12. Zhan, T.; Bian, J.; Yang, J.; Dang, Q.; Zhang, E. Improved Conditional Generative Adversarial Networks for SAR-to-Optical Image Translation. In Proceedings of the Pattern Recognition and Computer Vision, PRCV 2023, PT IV., Xiamen, China, 13–15 October 2023; Liu, Q., Wang, H., Ma, Z., Zheng, W., Zha, H., Chen, X., Wang, L., Ji, R., Eds.; Springer-Verlag Singapore Pte Ltd.: Singapore, 2024; Volume 14428, pp. 279–291. [Google Scholar]
  13. Shi, H.; Cui, Z.; Chen, L.; He, J.; Yang, J. A Brain-Inspired Approach for SAR-to-Optical Image Translation Based on Diffusion Models. Front. Neurosci. 2024, 18, 1352841. [Google Scholar] [CrossRef] [PubMed]
  14. Ji, G.; Wang, Z.; Zhou, L.; Xia, Y.; Zhong, S.; Gong, S. SAR Image Colorization Using Multidomain Cycle-Consistency Generative Adversarial Network. IEEE Geosci. Remote Sens. Lett. 2021, 18, 296–300. [Google Scholar] [CrossRef]
  15. Hwang, J.; Shin, Y. SAR-to-Optical Image Translation Using SSIM Loss Based Unpaired GAN. In Proceedings of the 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 19–21 October 2022; pp. 917–920. [Google Scholar]
  16. Yang, X.; Wang, Z.; Zhao, J.; Yang, D. FG-GAN: A Fine-Grained Generative Adversarial Network for Unsupervised SAR-to-Optical Image Translation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5621211. [Google Scholar] [CrossRef]
  17. Wang, J.; Yang, H.; He, Y.; Zheng, F.; Liu, Z.; Chen, H. An Unpaired SAR-to-Optical Image Translation Method Based on Schrodinger Bridge Network and Multi-Scale Feature Fusion. Sci. Rep. 2024, 14, 27047. [Google Scholar] [CrossRef] [PubMed]
  18. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 4510–4520. [Google Scholar]
  19. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 5987–5995. [Google Scholar]
  20. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  21. Luo, J.-H.; Wu, J.; Lin, W. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 5068–5076. [Google Scholar]
  22. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning Filters for Efficient ConvNets. arXiv 2017, arXiv:1608.08710. [Google Scholar] [CrossRef]
  23. Li, S.; Lin, M.; Wang, Y.; Chao, F.; Shao, L.; Ji, R. Learning Efficient GANs for Image Translation via Differentiable Masks and Co-Attention Distillation. IEEE Trans. Multimed. 2023, 25, 3180–3189. [Google Scholar] [CrossRef]
  24. You, L.; Hu, T.; Chao, F. Enhancing GAN Compression by Image Probability Distribution Distillation. In Pattern Recognition and Computer Vision; Liu, Q., Wang, H., Ma, Z., Zheng, W., Zha, H., Chen, X., Wang, L., Ji, R., Eds.; Lecture Notes in Computer Science; Springer Nature Singapore: Singapore, 2024; Volume 14435, pp. 76–88. ISBN 978-981-99-8551-7. [Google Scholar]
  25. Lin, Y.-J.; Yang, S.-H. Compressing Generative Adversarial Networks Using Improved Early Pruning. In Proceedings of the 2024 11th International Conference on Consumer Electronics-Taiwan, ICCE-Taiwan 2024, Taichung, Taiwan, 9–11 July 2024; IEEE: New York, NY, USA, 2024; pp. 39–40. [Google Scholar]
  26. Yeo, S.; Jang, Y.; Yoo, J. Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation. In Proceedings of the Computer Vision—ECCV 2024, PT LXXXVIII., Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer International Publishing Ag: Cham, Switzerland, 2025; Volume 15146, pp. 104–121. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Neural Information Processing Systems (NIPS): San Diego, CA, USA, 2017; Volume 30. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
  29. Cai, H.; Gan, C.; Wang, T.; Zhang, Z.; Han, S. Once-for-All: Train One Network and Specialize It for Efficient Deployment. Available online: https://arxiv.org/abs/1908.09791v5 (accessed on 19 June 2025).
  30. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Neural Information Processing Systems (NIPS): San Diego, CA, USA, 2017; Volume 30. [Google Scholar]
  31. Kong, Y.; Liu, S.; Peng, X. Multi-Scale Translation Method from SAR to Optical Remote Sensing Images Based on Conditional Generative Adversarial Network. Int. J. Remote Sens. 2022, 43, 2837–2860. [Google Scholar] [CrossRef]
  32. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.K.; Wang, Z.; Smolley, S.P. Least Squares Generative Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2813–2821. [Google Scholar]
  33. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the Computer Vision—ECCV 2016, PT II., Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing Ag: Cham, Switzerland, 2016; Volume 9906, pp. 694–711. [Google Scholar]
  34. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 586–595. [Google Scholar]
  35. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Neural Information Processing Systems (NIPS): San Diego, CA, USA, 2016; Volume 29. [Google Scholar]
  36. Schmitt, M.; Hughes, L.H.; Zhu, X.X. The Sen1-2 Dataset for Deep Learning in Sar-Optical Data Fusion. In Proceedings of the ISPRS TC I Mid-Term Symposium Innovative Sensing—From Sensors to Methods and Applications, Changsha, China, 10–12 October 2018; Jutzi, B., Weinmann, M., Hinz, S., Eds.; Copernicus Gesellschaft Mbh: Gottingen, Germany, 2018; Volume 4–1, pp. 141–146. [Google Scholar]
  37. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2242–2251. [Google Scholar]
Figure 1. Holistic compression method.
Figure 2. Overall architecture of teacher and student networks.
Figure 3. Teacher generator.
Figure 4. Student generator.
Figure 5. Dual-resolution collaborative discriminator and feature-matching loss.
Figure 6. Schematic diagram of knowledge distillation based on intermediate-layer features.
Figure 7. Schematic diagram of channel pruning via weight sharing.
Figure 8. Flowchart of BDSCC algorithm.
Figure 9. Comparison of translation results between the teacher network and ILF-BDSNet. From left to right, (a) SAR images, (b) target images, (c) the teacher network, (d) ILF-BDSNet.
Figure 10. Comparison of translation results of different networks on the Nanjing dataset. From left to right, (a) SAR images, (b) target images, (c) Pix2pix, (d) CycleGAN, (e) ILF-BDSNet.
Figure 11. Comparison of translation results of different networks on the SEN1-2 dataset. From left to right, (a) SAR images, (b) target images, (c) Pix2pix, (d) CycleGAN, (e) ILF-BDSNet.
Figure 12. Comparison of translation results of different methods under ablation experiments. From left to right, (a) SAR images, (b) target images, (c) single discriminator, (d) traditional distillation, (e) without pruning, (f) lightweight network, (g) ILF-BDSNet.
Figure 13. Comparison of translation results under different loss functions. From left to right, (a) SAR images, (b) target images, (c) no perceptual loss, (d) no feature-matching loss, (e) the loss function designed in this paper.
Table 1. Parameter settings for searching the supernet.
Parameter | Value
Population size | 100
Probability of individual gene mutation | 0.2
Ratio of population gene mutation | 0.5
Ratio of population gene recombination | 0.25
Number of iterations | 200
Table 2. Results of searching the supernet.
Number of Iterations | Channel Configuration | FID ↓ | MACs ↓
1 | 16_32_16_16_16_32_26_26_32 | 86.71 | 7.58 G
2–4 | 24_32_32_32_24_32_26_32_32 | 86.30 | 7.81 G
5–83 | 32_16_16_16_24_16_26_26_20 | 85.60 | 7.49 G
84–87 | 32_24_24_16_24_24_26_32_20 | 85.59 | 7.62 G
88–168 | 32_32_16_32_32_24_32_32_32 | 85.21 | 7.80 G
169–200 | 32_16_32_16_16_24_32_20_32 | 85.02 | 7.61 G
Note: Bold and red font in the table indicate optimal values for each metric.
Table 3. Comparison of evaluation metrics between the teacher network and ILF-BDSNet.
Network | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Number of Parameters (G) ↓ | Number of Parameters (D) ↓
Teacher network | 79.08 | 0.5801 | 0.5867 | 17.02 | 0.1987 | 45.621 M | 5.534 M
ILF-BDSNet | 78.50 | 0.5852 | 0.5882 | 16.93 | 0.2124 | 6.838 M | 5.534 M
Note: Bold and red font in the table indicate optimal values for each metric.
Table 4. Comparison of other metrics between the teacher network and ILF-BDSNet.
Network | Number of Parameters ↓ | MACs ↓ | Image-Processing Time ↓ | GPU Memory Usage ↓
Teacher network | 45.621 M | 17.54 G | 0.0199 s/img | 6965 MB
ILF-BDSNet | 6.838 M | 7.61 G | 0.0158 s/img | 5175 MB
Note: Bold and red font in the table indicate optimal values for each metric.
Table 5. Comparison of evaluation metrics of different networks on the Nanjing dataset.
Network | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Number of Parameters (G) ↓ | Number of Parameters (D) ↓
Pix2pix | 142.10 | 0.6145 | 0.7427 | 15.03 | 0.1237 | 54.414 M | 2.769 M
CycleGAN | 99.14 | 0.5919 | 0.6995 | 15.59 | 0.1645 | 22.756 M | 5.530 M
ILF-BDSNet | 78.50 | 0.5852 | 0.5882 | 16.93 | 0.2124 | 6.838 M | 5.534 M
Note: Bold and red font in the table indicate optimal values for each metric.
Table 6. Comparison of evaluation metrics of different networks on the SEN1-2 dataset.
Network | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Number of Parameters (G) ↓ | Number of Parameters (D) ↓
Pix2pix | 109.84 | 0.5640 | 0.5942 | 14.73 | 0.1780 | 54.414 M | 2.769 M
CycleGAN | 122.27 | 0.6024 | 0.7183 | 13.08 | 0.1157 | 22.756 M | 5.530 M
ILF-BDSNet | 67.56 | 0.4590 | 0.4347 | 17.47 | 0.3072 | 6.838 M | 5.534 M
Note: Bold and red font in the table indicate optimal values for each metric.
Table 7. Comparison of evaluation metrics of different methods under ablation experiments.
Method | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Number of Parameters (G) ↓ | Number of Parameters (D) ↓
Single discriminator | 84.39 | 0.5850 | 0.5961 | 16.85 | 0.2066 | 6.838 M | 2.767 M
Traditional distillation | 85.63 | 0.5846 | 0.5986 | 16.78 | 0.1997 | 6.838 M | 5.534 M
Without pruning | 82.32 | 0.5825 | 0.5808 | 17.04 | 0.2123 | 7.964 M | 5.534 M
Lightweight network | 107.45 | 0.6021 | 0.6093 | 16.69 | 0.1981 | 6.420 M | 5.534 M
ILF-BDSNet | 78.50 | 0.5852 | 0.5882 | 16.93 | 0.2124 | 6.838 M | 5.534 M
Note: Bold and red font in the table indicate optimal values for each metric.
Table 8. Comparison of evaluation metrics under different loss functions.
Loss Function | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑
w/o vgg | 92.83 | 0.5912 | 0.6061 | 16.80 | 0.1881
w/o feat | 84.24 | 0.5938 | 0.6541 | 16.23 | 0.1810
w/vgg&feat | 78.50 | 0.5852 | 0.5882 | 16.93 | 0.2124
Note: Bold and red font in the table indicate optimal values for each metric.

