ILF-BDSNet: A Compressed Network for SAR-to-Optical Image Translation Based on Intermediate-Layer Features and Bio-Inspired Dynamic Search

Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript proposes a compressed generative adversarial network for SAR-to-optical image translation. The proposed method maintains good performance even when the number of parameters is reduced. The manuscript is interesting and acceptable. However, there are a few problems that require attention.
- The SAR image reflects the electromagnetic scattering information of objects. In nature, land, plants, roads, and other features exhibit distinct scattering characteristics. If these objects have not undergone special color processing, optical images can be generated based on scattering intensity using the proposed network. In Figures 10 and 11 of the manuscript, the blue areas appear to be buildings. For man-made objects, how significant is the impact of surface coating on their scattering properties? If the impact is minimal, is the recovered color appropriate? Could there be a risk of overfitting?
- Are both the training dataset and the test dataset composed of images collected from Nanjing? If the model is trained on a dataset of Nanjing images, what will the results be if images from other provinces are used as the test dataset?
- What is the image resolution of the SEN1-2 dataset?
Author Response
Comments 1: The SAR image reflects the electromagnetic scattering information of objects. In nature, land, plants, roads, and other features exhibit distinct scattering characteristics. If these objects have not undergone special color processing, optical images can be generated based on scattering intensity using the proposed network. In Figures 10 and 11 of the manuscript, the blue areas appear to be buildings. For man-made objects, how significant is the impact of surface coating on their scattering properties? If the impact is minimal, is the recovered color appropriate? Could there be a risk of overfitting?
Response 1: Thank you very much for raising such a valuable question. Your inquiry about the impact of surface coatings on the scattering properties of man-made objects and the generalization ability of the model is very profound. Please allow me to elaborate on this point by point.
(1) The impact of surface coatings on scattering properties
You correctly pointed out that SAR sensors image by receiving the backscattered electromagnetic wave signals from ground objects, and that the intensity of these signals is mainly influenced by the physical properties of the target, such as dielectric constant, surface roughness, and geometric structure, as well as by the incidence angle. In contrast, the color information in optical remote sensing images mainly reflects the reflection characteristics of the ground-object surface in the visible band. The color of the surface coating of man-made objects has only a weak influence on the scattering properties, whereas the material of the coating (such as concrete or metal) significantly affects the dielectric constant and thus the scattering properties.
(2) The rationality of color recovery
The question you raised is very insightful. Our model does not learn a physical causal mapping from scattering properties to color information, but rather a highly complex, data-driven statistical association. First, in the real world, specific building materials (such as concrete, glass, and metal) tend to exhibit specific SAR scattering mechanisms, and these materials are also frequently associated with certain color ranges or appearance styles. For example, industrial areas with metal roofs are often blue, while residential buildings with concrete roofs are often gray or red. The model has learned the statistical rules linking "scattering patterns + contextual environment" to "optical appearance" from a large amount of data. In addition, the model does not judge the color of each pixel in isolation. The dual-resolution collaborative discriminator and the pixel-semantic dual-domain alignment loss designed in this paper enable the model to understand the global semantics of the image (e.g., this is a residential area, this is an industrial area). Therefore, based on the learned prior knowledge, the model assigns the most appropriate color to each identified building area according to the distribution of the training data. Thus, this color recovery is reasonable.
(3) Regarding the risk of overfitting
The risk of overfitting you raised is an extremely critical concern. Therefore, we have made every effort to avoid this risk in the experimental design. The training set, validation set, and test set we used are strictly independent in terms of spatial location and ground-object instances; that is, the specific buildings, land parcels, farmland areas, and other land cover contained in the test set do not appear in the training set or validation set. The results indicate that the model has learned generalized "scattering mechanism - visual appearance" mapping rules rather than simply memorizing the colors of specific targets, so the risk of overfitting is low.
Comments 2: Are both the training dataset and the test dataset composed of images collected from Nanjing? If the model is trained on a dataset of Nanjing images, what will the results be if images from other provinces are used as the test dataset?
Response 2: Thank you very much for raising such a valuable question. First of all, we would like to clarify that the images in the Nanjing dataset constructed in this paper were all collected from different areas of Nanjing. You also asked what the results would be if the model were trained on Nanjing images but tested on images from other provinces. We fully understand your concern about the model's performance on images from other provinces. Although this article does not directly use other provinces as the test set, we believe that the proposed method has good generalization ability. The reasons are explained in detail below.
(1) Division of the dataset
When dividing the dataset, we strictly ensured that the training set, validation set, and test set were completely independent in terms of spatial geographical location. That is, no area in the test set appeared in the training set or validation set. This was done to simulate the model's generalization ability to unseen scenes as much as possible under the existing data conditions.
(2) The essence of the SAR image translation task
The core of our SAR image translation task is to learn the mapping from the scattering feature space of SAR images to the visual feature space of optical remote sensing images. The key features learned by the network, such as the geometric structure scattering of buildings, the volume scattering of vegetation, and the specular scattering of water bodies, are universal physical mechanisms rather than specific to the Nanjing area. As long as the ground objects in other provinces follow the same electromagnetic scattering laws, the network proposed in this article can generate reasonable optical remote sensing images based on the features it has learned.
(3) Diversity of ground objects in the dataset
Our training set covers a variety of typical landforms in Nanjing, such as farmland, roads, land, water bodies, and buildings, which contain rich and generalizable scattering features of ground objects. The model has learned how to handle these representative scattering patterns during training. Therefore, we believe that for a well-trained model, when it encounters ground objects from other provinces but with similar scattering properties, it can produce reasonable translation results.
Comments 3: What is the image resolution of the SEN1-2 dataset?
Response 3: Thank you very much for your valuable question. The resolution of the SEN1-2 dataset used in this paper is 5 m. Due to our oversight, this was not mentioned in the text. We have added this explanation in Section 4.2.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a compressed network specifically designed for translating SAR images to optical remote sensing images. It provides a reliable solution for SAR image translation under resource-constrained conditions and achieves promising results. However, some revisions are needed.
- Supervised vs. Unsupervised Network Clarification (Section 3.2.1):
Although the conclusion states that ILF-BDSNet is supervised and relies on strictly paired datasets, the overall network structure does not directly clarify whether the teacher and student networks are supervised or unsupervised. It is recommended that the authors provide further clarification in the paper.
- Explanation of Dual-Resolution Collaborative Discriminator (Section 3.2.4):
The introduction of the PatchGAN idea based on local receptive fields is commendable. However, the meanings of the two-dimensional response matrix and the single scalar score are not clearly stated. It is recommended that the authors provide further explanations.
- Adaptability of Pixel-Semantic Alignment Loss Function (Section 3.4):
The author mainly considers the noise, geometric distortions, and modality differences between SAR and optical remote sensing images when designing the pixel-semantic alignment loss function. However, the adaptability of this function during the compression process seems to have been overlooked. It is recommended that the authors provide a reasonable explanation regarding its adaptability during compression.
- Academic Expression of "Based on" Phrasing:
The paper uses the phrase "based on sth" quite frequently, which seems less academic. It is suggested that the authors replace these phrases with more scholarly expressions, such as changing "based on CNN" to "CNN-based." Of course, the title of the paper is appropriate and does not require modification.
- Missing Authors' Organization and Institutional Information:
The first page is missing the authors' organization and institutional information, which seems to have been overlooked. The authors are advised to include this in the subsequent version.
These modification suggestions will help enhance the academic rigor and clarity of the paper.
Author Response
Comments 1: Supervised vs. Unsupervised Network Clarification (Section 3.2.1):
Although the conclusion states that ILF-BDSNet is supervised and relies on strictly paired datasets, the overall network structure does not directly clarify whether the teacher and student networks are supervised or unsupervised. It is recommended that the authors provide further clarification in the paper.
Response 1: Thank you very much for raising such a valuable question. We indeed did not explicitly state in Section 3.2.1 whether the teacher and student networks are supervised or unsupervised. The student and teacher networks in this article are supervised networks that rely on strictly paired datasets. We have added this explanation in Section 3.2.1.
Comments 2: Explanation of Dual-Resolution Collaborative Discriminator (Section 3.2.4):
The introduction of the PatchGAN idea based on local receptive fields is commendable. However, the meanings of the two-dimensional response matrix and the single scalar score are not clearly stated. It is recommended that the authors provide further explanations.
Response 2: Thank you very much for raising such a valuable question. You correctly pointed out that we did not clearly explain the two-dimensional response matrix and the single scalar score. The dual-resolution collaborative discriminator ultimately outputs a 1-channel two-dimensional response matrix instead of a single scalar score, where each matrix element corresponds to a specific local receptive field region within the input image, enabling independent authenticity discrimination for local image patches rather than generating a unified judgment for the entire image. We have added this text explanation in Section 3.2.4.
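To make this concrete, the following is a minimal PyTorch-style sketch (illustrative only; the layer count and channel widths are assumptions, not the exact configuration of our discriminator) showing how a PatchGAN-style discriminator outputs a two-dimensional response matrix rather than a single scalar:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Illustrative PatchGAN-style discriminator: the output is a 2D map of
    patch-level real/fake responses, not one scalar for the whole image."""
    def __init__(self, in_channels=3, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_channels, base_channels * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base_channels * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # Final 1-channel convolution: each output element scores one
            # local receptive field of the input image.
            nn.Conv2d(base_channels * 2, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)

d = PatchDiscriminator()
fake = torch.randn(1, 3, 256, 256)
print(d(fake).shape)  # torch.Size([1, 1, 63, 63]): a response matrix, not a scalar
```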
Comments 3: Adaptability of Pixel-Semantic Alignment Loss Function (Section 3.4):
The author mainly considers the noise, geometric distortions, and modality differences between SAR and optical remote sensing images when designing the pixel-semantic alignment loss function. However, the adaptability of this function during the compression process seems to have been overlooked. It is recommended that the authors provide a reasonable explanation regarding its adaptability during compression.
Response 3: Thank you very much for raising such a valuable question. You pointed out that we did not discuss the adaptability of the designed loss function during the compression process. First, in the knowledge distillation stage of compression, we designed an intermediate-layer-feature-based distillation strategy and introduced the distillation loss as an additional term during the training of the student network. At the same time, the adversarial loss in the designed pixel-semantic dual-domain alignment loss function can effectively alleviate the mode collapse and gradient vanishing problems in GAN training, ensuring the smooth progression of the entire compression process. We have added this explanation in Section 3.4.
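As a brief illustration of how the distillation term coexists with the other losses during student training, consider the following PyTorch-style sketch (the layer selection, adapter design, and loss form are hypothetical and simplified relative to the strategy described in the paper):

```python
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats, adapters):
    """Hypothetical intermediate-layer distillation: match selected student
    feature maps to the corresponding teacher feature maps. `adapters` are
    1x1 convolutions projecting student channels to teacher channels."""
    loss = 0.0
    for f_s, f_t, proj in zip(student_feats, teacher_feats, adapters):
        loss = loss + F.mse_loss(proj(f_s), f_t.detach())
    return loss

# During student training (sketch):
# total_loss = task_loss + lambda_distill * distillation_loss(s_feats, t_feats, adapters)
```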
Comments 4: Academic Expression of "Based on" Phrasing:
The paper uses the phrase "based on sth" quite frequently, which seems less academic. It is suggested that the authors replace these phrases with more scholarly expressions, such as changing "based on CNN" to "CNN-based." Of course, the title of the paper is appropriate and does not require modification.
Response 4: Thank you for raising such a valuable question. We have indeed noticed that the phrase "based on sth" is used too frequently, and the "sth-based" combination is more academic. We have revised some of these combinations to ensure that the article appears more professional and academic.
Comments 5: Missing Authors' Organization and Institutional Information:
The first page is missing the authors' organization and institutional information, which seems to have been overlooked. The authors are advised to include this in the subsequent version.
Response 5: Thank you very much for raising such a valuable question. You correctly pointed out that we missed the authors' organization and institutional information. This was our oversight. We have added the author information on the first page.
Reviewer 3 Report
Comments and Suggestions for Authors
See the attachment.
Comments for author File: Comments.pdf
Author Response
Comments 1: SAR side-looking imaging causes phenomena such as building tilt and overlap, resulting in systematic deviations compared to optical ortho imagery. The authors appear to have neither considered nor discussed this issue.
Response 1: Thank you very much for raising such a valuable question. You correctly pointed out that SAR side-looking imaging can cause geometric distortions such as building tilting. Compared with optical orthophoto images, this will result in systematic deviations. However, the SAR images used in this paper have undergone strict geometric correction and terrain correction during the preprocessing stage, effectively solving the geometric distortion problem. Therefore, the SAR images used in this paper have geometric integrity comparable to that of optical remote sensing images, and will not cause substantial deviations in the subsequent translation results.
Comments 2: Translation of SAR images to optical images is a complex mapping process, and currently no highly reliable translation algorithms exist. At this stage, the author's research into compressed networks would further reduce the credibility of the translation results.
Response 2: Thank you very much for raising such a valuable question. You correctly pointed out that the translation from SAR images to optical remote sensing images is a complex mapping process, and that research on compressed networks may therefore further reduce the credibility of the results. This indeed highlights the core challenges and risks in this field of research. We fully understand your concerns. Please allow us to answer your question.
When designing ILF-BDSNet, we did not set "compression" as the sole goal. Instead, the core idea was to maintain or even improve the reliability and quality of SAR image translation while significantly reducing the number of parameters and the computational complexity. Therefore, we introduced a series of designs to ensure the output quality and reliability of the compressed network.
(1) Dual-resolution collaborative discriminator and multi-level constraints
The designed dual-resolution collaborative discriminator, combined with the pixel-semantic dual-domain alignment loss, not only constrains the authenticity of details at the pixel level but also ensures structural consistency at the semantic level, thereby significantly improving the visual reliability and structural rationality of the generated images.
(2) Knowledge distillation based on intermediate-layer features
Compared with traditional distillation methods, the designed knowledge distillation strategy realizes progressive guidance of knowledge transfer from low-level to high-level features, leading to a significant improvement in the quality of the images generated by the student network.
(3) Bio-inspired dynamic search of channel configuration algorithm
The BDSCC algorithm automatically searches for the subnet structure that optimally balances parameter quantity and performance. This ensures that the compressed network still has strong expressive ability and avoids performance degradation caused by excessive channel pruning.
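As a rough illustration of the search principle, the following is a toy sketch in plain Python (the fitness function, population size, and mutation rate are placeholders; the actual BDSCC algorithm described in the paper is more elaborate):

```python
import random

def evolve_channel_config(search_space, fitness, generations=20, pop_size=16, mutate_p=0.3):
    """Toy evolutionary search: each individual is a list of channel widths,
    one per prunable layer; `fitness` scores quality vs. parameter count."""
    population = [[random.choice(widths) for widths in search_space] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]                        # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = [random.choice(pair) for pair in zip(a, b)]  # uniform crossover
            for i, widths in enumerate(search_space):            # random mutation
                if random.random() < mutate_p:
                    child[i] = random.choice(widths)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```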
(4) Experimental design
We compared ILF-BDSNet with classic methods such as Pix2pix and CycleGAN, as shown in Table 5, Figure 10, Table 6, and Figure 11. While significantly reducing the number of parameters, ILF-BDSNet outperforms the comparison models on evaluation metrics such as FID and LPIPS, indicating that its generated results are closer to the real images.
In conclusion, the compressed network in this paper does not achieve compression at the cost of reliability. Instead, through a series of designs, it maintains or even improves the reliability and quality of the generated results while compressing the network. We also agree that the translation from SAR images to optical remote sensing images remains a major challenge at this stage. Therefore, in the conclusion section, we clearly state that we will explore the compression of Transformer architectures and the compression of unsupervised networks in the absence of paired datasets.
Once again, thank you for raising this question.
Comments 3: The author mentioned that CGAN performs exceptionally well in image translation tasks. The reviewer suggested adding comparative experiments with CGAN in the Different Network Analysis section.
Response 3: Thank you very much for raising such a valuable question. You suggested that we add a comparative experiment with CGAN in the "Different Network Analysis" section. CGAN is a variant of GAN that introduces conditional information into both the generator and the discriminator. This means that both the generator and the discriminator rely on additional information from another modality as the conditioning input, thereby controlling the direction of image generation. In the field of image translation, CGAN has become a mainstream solution. In the "Different Network Analysis" experiment of this article, both Pix2pix and CycleGAN are classic CGAN models in the field of image translation.
Comments 4: Please explain why CycleGAN achieves better results in complex scenes (row 5 of Figure 10) than in simple scenes (row 3 of Figure 10)?
Response 4: Thank you for raising such a valuable question. You correctly pointed out that CycleGAN performs better in complex scenes compared to simple ones. Now, let me answer your question:
As an unsupervised image translation model, CycleGAN's performance largely depends on the adversarial training process between the generator and the discriminator, as well as on the ability of the cycle consistency loss to preserve structural information. On the one hand, complex scenes (such as buildings and vegetation) have richer texture features, higher structural complexity, and more significant regional contrast. Such scenes provide the discriminator with clearer and more distinguishable features for authenticity judgment, thereby providing the generator with stronger and more explicit gradient signals that guide it in learning to generate more convincing images. In contrast, in simple scenes (such as flat ground and water bodies), the image content is highly homogeneous and texture information is scarce. The discriminator has difficulty learning discriminative features from such content, which may lead to a local optimum or mode collapse in the early stage of training. As a result, the generator receives ambiguous and weak gradient signals, making it difficult to learn effective mappings, and the generated images have poor visual quality.
On the other hand, CycleGAN constrains the translation of unpaired images through the cycle consistency loss, which aims to preserve the structure and content of the original image as much as possible during the transformation. In complex scenes, the strong saliency and diversity of structural information make it easier for the generator to capture and retain key features during reconstruction, avoiding excessive smoothing and structural loss. In contrast, simple scenes lack sufficient structural diversity, leading to overly smooth or semantically deficient outputs.
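For clarity, the cycle consistency constraint referred to above can be summarized by the following generic sketch (the standard CycleGAN formulation, not code from this paper; the generator names and the weight lambda_cyc = 10 are illustrative):

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_sar2opt, G_opt2sar, real_sar, real_opt, lambda_cyc=10.0):
    """Generic CycleGAN-style cycle consistency: translating an image to the
    other domain and back should reconstruct the original image."""
    rec_sar = G_opt2sar(G_sar2opt(real_sar))   # SAR -> optical -> SAR
    rec_opt = G_sar2opt(G_opt2sar(real_opt))   # optical -> SAR -> optical
    return lambda_cyc * (F.l1_loss(rec_sar, real_sar) + F.l1_loss(rec_opt, real_opt))
```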
Comments 5: Is there any basis for setting the hyperparameters in the loss function? The ablation experiments only examined the effects of perceptual loss and feature matching loss.
Response 5: Thank you very much for raising such a valuable question. You pointed out that our ablation experiments only examined the influence of the perceptual loss and the feature matching loss, and that we did not appear to study the setting of the hyperparameters in the loss function. Regarding these hyperparameters, this paper mainly follows the method proposed by Kong et al. in "Multi-Scale Translation Method from SAR to Optical Remote Sensing Images Based on Conditional Generative Adversarial Network". The authors combined adversarial loss, perceptual loss, and feature matching loss for the translation from SAR images to optical remote sensing images and achieved satisfactory results. More importantly, they analyzed and verified the hyperparameters of these losses through systematic ablation experiments. Since this paper investigates network compression in the context of the SAR image translation task, we adopted the hyperparameter settings verified by Kong et al. Therefore, the hyperparameter settings of the loss function in this paper are empirical, and we have added this explanation in Section 4.2.
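For reference, these hyperparameters enter the training objective as fixed weights on the individual loss terms, roughly as in the sketch below (the weight values are placeholders for illustration, not the exact values adopted from Kong et al.):

```python
# Illustrative weighted combination of the loss terms (weights are placeholders).
lambda_perc, lambda_fm = 10.0, 10.0

def generator_loss(adv_loss, perceptual_loss, feature_matching_loss):
    """Total generator objective: adversarial term plus weighted perceptual
    and feature-matching terms; the lambdas are the hyperparameters in question."""
    return adv_loss + lambda_perc * perceptual_loss + lambda_fm * feature_matching_loss
```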
Comments 6: Author information is missing.
Response 6: Thank you very much for raising such a valuable question. You correctly pointed out that we missed the authors' organization and institutional information. This was our oversight. We have added the author information on the first page.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors answered my questions. Please add the answers to the paper. I have no more questions.
Author Response
Thank you very much for your valuable suggestions and assistance. We wish you all the best!
Reviewer 2 Report
Comments and Suggestions for Authors
It's a pleasure to accept the paper.
Author Response
Thank you very much for your valuable suggestions and assistance. We wish you all the best!