SAR-to-Optical Remote Sensing Image Translation Method Based on InternImage and Cascaded Multi-Head Attention
Highlights
- For the first time, the InternImage model, with deformable convolution v3 (DCNv3) as its core operator, is introduced into SAR image translation to extract global semantic features from SAR images;
- A cascaded multi-head attention module combining multi-head self-attention (MSA) and multi-head cross-attention (MCA) is designed to optimize local detail features while promoting feature interaction between local details and global semantics;
- For the first time, structural similarity index measure (SSIM) loss is jointly leveraged with adversarial loss, perceptual loss, and feature matching loss in SAR image translation tasks;
- Our method generates higher-quality optical remote sensing images than mainstream image translation methods.
Abstract
1. Introduction
- Traditional supervised translation models rely on an encoder–decoder structure whose limited receptive field cannot effectively model global context. When handling SAR and optical remote sensing images with significant modal differences, this tends to confuse ground-object classes and distort the structure of the generated images. This paper therefore introduces an independent global representor and constructs a collaborative architecture of “global semantic extraction–local detail generation–multi-scale discrimination”. The architecture gives global semantic guidance and local detail generation a clear division of labor in the SAR image translation task, fundamentally improving the semantic consistency of the translation results.
- Existing methods struggle to balance feature expression ability against computational efficiency: Transformers are computationally expensive, while traditional convolutions cannot effectively extract the global semantic features of SAR images. This paper uses the InternImage model as the global representor; its core operator, DCNv3, achieves long-range dependency and adaptive spatial aggregation through dynamic offsets and modulation scalars (a minimal sketch of the deformable sampling mechanism follows this list). This enables the model to efficiently extract discriminative global semantic features from SAR images affected by speckle noise and geometric distortion at a lower computational cost.
- Existing methods mostly fuse multi-source features by simple concatenation or addition, so global semantic guidance cannot effectively penetrate the detail generation process. This paper designs a cascaded multi-head attention module that chains multi-head self-attention (MSA) and multi-head cross-attention (MCA) to refine local details and deeply calibrate them against the global semantics (see the attention sketch after this list). The module addresses detail enhancement under semantic guidance, ensuring that the generated images have an accurate semantic structure at the macro level and clear texture details at the micro level.
- Mainstream image translation methods rely on pixel-level losses, which struggle to drive the model to learn perceptual similarity. This paper systematically combines structural similarity index measure (SSIM) loss with adversarial, perceptual, and feature matching losses in the SAR image translation task, forming comprehensive supervision from low-order pixels to high-order perception (a sketch of the combined objective closes this list). This optimization strategy significantly improves the visual naturalness and structural integrity of the generated images.
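To make the deformable sampling idea concrete, here is a minimal PyTorch sketch built on torchvision's modulated deformable convolution (the DCNv2-style operator). It illustrates the dynamic offsets and modulation scalars described above; DCNv3 additionally groups channels and softmax-normalizes the modulation, which this simplified sketch does not reproduce, and all layer sizes are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformBlock(nn.Module):
    """DCNv2-style modulated deformable convolution (single offset group)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        # Predict 2 offsets (y, x) plus 1 modulation scalar per kernel location.
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, kernel_size=k, padding=k // 2)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k * self.k]                 # dynamic sampling offsets
        mask = torch.sigmoid(om[:, 2 * self.k * self.k :])    # modulation scalars
        # DCNv3 instead normalizes the modulation with a softmax over sample points.
        return deform_conv2d(x, offset, self.weight, self.bias,
                             padding=self.k // 2, mask=mask)

out = ModulatedDeformBlock(64, 64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```

Because the offsets are predicted from the input itself, the effective receptive field adapts per pixel, which is what gives the operator its long-range, content-dependent aggregation.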
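The cascaded attention design can likewise be sketched in a few lines. In this hedged illustration, MSA first refines the generator's local feature tokens, and MCA then lets those refined tokens query the global semantic tokens produced by the representor; the dimensions, head count, and pre-norm residual layout are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CascadedAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_tokens, global_tokens):
        # MSA: refine local detail tokens against each other.
        q = self.norm1(local_tokens)
        x = local_tokens + self.msa(q, q, q, need_weights=False)[0]
        # MCA: the refined local tokens query the global semantic tokens,
        # letting global guidance penetrate the detail-generation path.
        q = self.norm2(x)
        x = x + self.mca(q, global_tokens, global_tokens, need_weights=False)[0]
        return x

local_f = torch.randn(1, 64 * 64, 256)   # flattened 64x64 generator feature map
global_f = torch.randn(1, 8 * 8, 256)    # coarse global semantics from the representor
fused = CascadedAttention()(local_f, global_f)  # -> (1, 4096, 256)
```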
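Finally, a minimal sketch of the combined objective. The adversarial term follows the LSGAN formulation cited in the references; the feature matching and perceptual terms sum L1 distances over discriminator and VGG feature maps; and the SSIM term here uses a simplified uniform window (a Gaussian window is the common refinement). The loss weights are illustrative assumptions, though λ = 1 for the SSIM term matches the setting analyzed in Section 4.6.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - mean SSIM with a uniform window; assumes inputs scaled to [0, 1]."""
    mu_x = F.avg_pool2d(x, window, 1, window // 2)
    mu_y = F.avg_pool2d(y, window, 1, window // 2)
    var_x = F.avg_pool2d(x * x, window, 1, window // 2) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, 1, window // 2) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, 1, window // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1 - ssim.mean()

def generator_loss(d_fake_logits, feats_fake, feats_real, vgg_fake, vgg_real,
                   fake, real, lam_fm=10.0, lam_vgg=10.0, lam_ssim=1.0):
    adv = F.mse_loss(d_fake_logits, torch.ones_like(d_fake_logits))       # LSGAN term
    fm = sum(F.l1_loss(a, b.detach()) for a, b in zip(feats_fake, feats_real))
    per = sum(F.l1_loss(a, b.detach()) for a, b in zip(vgg_fake, vgg_real))
    return adv + lam_fm * fm + lam_vgg * per + lam_ssim * ssim_loss(fake, real)
```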
2. Related Work
2.1. Generative Adversarial Networks
2.2. Image Translation
2.3. Transformer
3. Methodology
3.1. Overall Architecture
3.2. Global Representor
3.2.1. Deformable Convolution V3
3.2.2. InternImage
3.3. Generator with Cascaded Multi-Head Attention
3.3.1. Residual Block Structure
3.3.2. Cascaded Multi-Head Attention Module
3.4. Multi-Scale Discriminator
3.5. Loss Function
4. Results
4.1. Datasets and Parameter Settings
4.2. Evaluation Metrics
4.3. Different Network Analysis
4.4. Global Representor Ablation Experiment
4.5. Cascaded Multi-Head Attention Module Ablation Experiment
4.6. Analysis of the SSIM Loss
5. Conclusions
- Our network is supervised and relies on strictly paired SAR-optical remote sensing image datasets. However, in practical applications, image annotation costs are high, and high-quality paired datasets are scarce. Therefore, future research could focus on unsupervised SAR-to-optical image translation methods.
- The number of parameters of our method is slightly higher than that of mainstream image translation methods. Future research could therefore focus on reducing model complexity and developing lightweight or compressed networks for SAR image translation that preserve generated image quality while reducing computational cost and memory overhead.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zhang, Q.; Liu, X.; Liu, M.; Zou, X.; Zhu, L.; Ruan, X. Comparative Analysis of Edge Information and Polarization on SAR-to-Optical Translation Based on Conditional Generative Adversarial Networks. Remote Sens. 2021, 13, 128. [Google Scholar] [CrossRef]
- Turnes, J.N.; Bermudez Castro, J.D.; Torres, D.L.; Soto Vega, P.J.; Feitosa, R.Q.; Happ, P.N. Atrous cGAN for SAR to Optical Image Translation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4003905. [Google Scholar] [CrossRef]
- Liu, X.; Hong, D.; Chanussot, J.; Zhao, B.; Ghamisi, P. Modality Translation in Remote Sensing Time Series. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5401614. [Google Scholar] [CrossRef]
- Zhan, T.; Bian, J.; Yang, J.; Dang, Q.; Zhang, E. Improved Conditional Generative Adversarial Networks for SAR-to-Optical Image Translation. In Proceedings of the Pattern Recognition and Computer Vision, PRCV 2023, PT IV, Xiamen, China, 13–15 October 2023; Liu, Q., Wang, H., Ma, Z., Zheng, W., Zha, H., Chen, X., Wang, L., Ji, R., Eds.; Springer: Singapore, 2024; Volume 14428, pp. 279–291. [Google Scholar]
- Ji, G.; Wang, Z.; Zhou, L.; Xia, Y.; Zhong, S.; Gong, S. SAR Image Colorization Using Multidomain Cycle-Consistency Generative Adversarial Network. IEEE Geosci. Remote Sens. Lett. 2021, 18, 296–300. [Google Scholar] [CrossRef]
- Hwang, J.; Shin, Y. SAR-to-Optical Image Translation Using SSIM Loss Based Unpaired GAN. In Proceedings of the 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), Jeju-si, Republic of Korea, 19–21 October 2022; IEEE: New York, NY, USA, 2022; pp. 917–920. [Google Scholar]
- Wang, J.; Yang, H.; He, Y.; Zheng, F.; Liu, Z.; Chen, H. An Unpaired SAR-to-Optical Image Translation Method Based on Schrodinger Bridge Network and Multi-Scale Feature Fusion. Sci. Rep. 2024, 14, 27047. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2014; Volume 27, pp. 2672–2680. [Google Scholar]
- Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
- Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar]
- Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2242–2251. [Google Scholar]
- Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8798–8807. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 9992–10002. [Google Scholar]
- Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11999–12009. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), PT III, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar]
- Salimans, T.; Kingma, D.P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2016; Volume 29. [Google Scholar]
- Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. arXiv 2023, arXiv:2211.05778. [Google Scholar]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 9300–9308. [Google Scholar]
- Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4784–4793. [Google Scholar]
- Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE Computer Soc: Los Alamitos, CA, USA, 2022; pp. 11953–11965. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 548–558. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved Baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
- Kong, Y.; Liu, S.; Peng, X. Multi-Scale Translation Method from SAR to Optical Remote Sensing Images Based on Conditional Generative Adversarial Network. Int. J. Remote Sens. 2022, 43, 2837–2860. [Google Scholar] [CrossRef]
- Kong, Y.; Xu, C. ILF-BDSNet: A Compressed Network for SAR-to-Optical Image Translation Based on Intermediate-Layer Features and Bio-Inspired Dynamic Search. Remote Sens. 2025, 17, 3351. [Google Scholar] [CrossRef]
- Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.K.; Wang, Z.; Smolley, S.P. Least Squares Generative Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2813–2821. [Google Scholar]
- Johnson, J.; Alahi, A.; Li, F.-F. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the Computer Vision—ECCV 2016, PT II, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 9906, pp. 694–711. [Google Scholar]
- Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2016; Volume 29. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
- Schmitt, M.; Hughes, L.H.; Zhu, X.X. The SEN1-2 Dataset for Deep Learning in SAR-Optical Data Fusion. In Proceedings of the ISPRS TC I Mid-Term Symposium Innovative Sensing—From Sensors to Methods and Applications, Karlsruhe, Germany, 10–12 October 2018; Jutzi, B., Weinmann, M., Hinz, S., Eds.; Copernicus GmbH: Göttingen, Germany, 2018; Volume 4-1, pp. 141–146. [Google Scholar]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Neural Information Processing Systems (NIPS): La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 586–595. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
Comparison with mainstream image translation methods on the first dataset:

| Method | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Parameters (M) ↓ |
|---|---|---|---|---|---|---|
| Pix2pix | 118.08 | 0.6176 | 0.7165 | 15.29 | 0.1270 | 57.183 |
| CycleGAN | 85.38 | 0.5914 | 0.7097 | 15.48 | 0.1687 | 28.286 |
| Pix2pixHD | 79.56 | 0.5836 | 0.5986 | 16.74 | 0.1915 | 54.155 |
| Multi-scale CGAN | 67.59 | 0.5599 | 0.5825 | 17.05 | 0.2074 | 50.116 |
| Ours | 55.51 | 0.5565 | 0.5405 | 17.48 | 0.2218 | 62.287 |
Comparison with mainstream image translation methods on the second dataset:

| Method | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ | Parameters (M) ↓ |
|---|---|---|---|---|---|---|
| Pix2pix | 111.94 | 0.5652 | 0.5986 | 14.83 | 0.1798 | 57.183 |
| CycleGAN | 109.05 | 0.6007 | 0.6834 | 13.45 | 0.1244 | 28.286 |
| Pix2pixHD | 72.38 | 0.4561 | 0.3972 | 18.18 | 0.3277 | 54.155 |
| Multi-scale CGAN | 53.75 | 0.3336 | 0.3021 | 21.00 | 0.5033 | 50.116 |
| Ours | 31.66 | 0.2829 | 0.2540 | 22.11 | 0.5969 | 62.287 |
Global representor ablation: translation quality with different backbones (Section 4.4):

| Backbone | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| ViT | 55.23 | 0.5580 | 0.5458 | 17.40 | 0.2201 |
| Swin v2 | 54.53 | 0.5580 | 0.5459 | 17.39 | 0.2208 |
| InternImage | 55.51 | 0.5565 | 0.5405 | 17.48 | 0.2218 |
Global representor ablation: backbone complexity and speed (Section 4.4):

| Backbone | MACs ↓ | Inference Time ↓ | Parameters (M) |
|---|---|---|---|
| ViT | 27.590M | 222 s | 13.417 |
| Swin v2 | 21.310G | 193 s | 12.016 |
| InternImage | 22.318G | 186 s | 12.090 |
Cascaded multi-head attention module ablation (Section 4.5):

| Configuration | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| w/o self-attention | 75.91 | 0.5618 | 0.5423 | 17.47 | 0.2235 |
| w/o cross-attention | 72.92 | 0.5635 | 0.5422 | 17.46 | 0.2251 |
| ResNet blocks | 68.50 | 0.5612 | 0.5396 | 17.51 | 0.2222 |
| w/ attention | 55.51 | 0.5565 | 0.5405 | 17.48 | 0.2218 |
Analysis of the SSIM loss weight λ (Section 4.6):

| SSIM Loss Weight | FID ↓ | LPIPS ↓ | MSE ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| w/o SSIM | 61.61 | 0.5596 | 0.5459 | 17.43 | 0.2125 |
| λ = 0.5 | 55.91 | 0.5588 | 0.5409 | 17.47 | 0.2204 |
| λ = 1 (Ours) | 55.51 | 0.5565 | 0.5405 | 17.48 | 0.2218 |
| λ = 1.5 | 56.46 | 0.5579 | 0.5440 | 17.40 | 0.2251 |
| λ = 2 | 60.76 | 0.5613 | 0.5523 | 17.32 | 0.2227 |
| λ = 2.5 | 92.79 | 0.5746 | 0.5661 | 17.08 | 0.1954 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
