MapGen-Diff: An End-to-End Remote Sensing Image to Map Generator via Denoising Diffusion Bridge Model
Abstract
:1. Introduction
- 1.
- The diffusion model is introduced to the specific translation task between remote sensing images and online maps. To avoid the randomness of generated images, we propose an improved denoising diffusion bridge to establish a mapping between two domains. To the best of our knowledge, our work is the first RSMT exploration based on diffusion models.
- 2.
- To further improve the performance of the proposed model, we employ a pretrained encoder for image compression into the latent space, which may capture more features with diversity. Additionally, a transformation consistency regularization is proposed for more accurate geometric details.
- 3.
- Extensive experiments are conducted on two commonly used open datasets, Los Angeles and Toronto datasets, to validate the effectiveness and applicability of our methods. The experimental results demonstrate that MapGen-Diff can enhance the quality of generation, achieving significantly better performance than the previous translation methods.
2. Related Work
2.1. Image-to-Image Translation
2.2. Diffusion Model
2.3. Diffusion Model-Based Image Generation Technology
3. Methods
3.1. Overall Architecture
3.2. Image Compression into Latent Space
3.3. End-to-End Translation Model Based on DMs
3.3.1. Forward Process for Diffusion
3.3.2. Reverse Process for Denoising
3.4. Loss Function
3.4.1. Evidence Lower Bound
3.4.2. TCR Loss
4. Results
4.1. Datasets and Setups
- Los Angeles datasets: We selected a dataset comprising 4,631 pairs of samples from the urban area of Los Angeles, USA. The RS images comprise urban streets and buildings. These Los Angeles images are scraped from Google Maps, corresponding to a spatial resolution of 2.15 m per pixel.
- Toronto datasets: The Toronto datasets, which we accessed and downloaded at zoom level 17 on Google Maps, come with the same spatial resolution and more diverse ground elements. In order to remove the “dirty data”, we selected 3200 paired samples, which were accurately matched.
4.2. Evaluation Metrics
4.2.1. Root Mean Square Error
4.2.2. Structural Similarity
4.2.3. Pixel Accuracy
4.3. Baselines
4.4. Comparisons with Baselines
4.4.1. Quantitative Results
4.4.2. Qualitative Results
4.5. Ablation Study
4.6. Robustness Analysis
5. Discussion
5.1. Influence of the Downsampling Factor
5.2. Effect of Sampling Steps
5.3. Limitations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ablameyko, S.V.; Beregov, B.S.; Kryuchkov, A.N. Computer-aided cartographical system for map digitizing. In Proceedings of the 2nd International Conference on Document Analysis and Recognition (ICDAR’93), Tsukuba, Japan, 20–22 October 1993; pp. 115–118. [Google Scholar]
- Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded recurrent neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394. [Google Scholar] [CrossRef]
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
- Li, W.; Hsu, C.Y. Automated terrain feature identification from remote sensing imagery: A deep learning approach. Int. J. Geogr. Inf. Sci. 2020, 34, 637–660. [Google Scholar] [CrossRef]
- Li, X.; Wang, Y.; Zhang, L.; Liu, S.; Mei, J.; Li, Y. Topology-enhanced urban road extraction via a geographic feature-enhanced network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8819–8830. [Google Scholar] [CrossRef]
- Wu, Y.; Xu, L.; Chen, Y.; Wong, A.; Clausi, D.A. TAL: Topography-aware multi-resolution fusion learning for enhanced building footprint extraction. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Hosseinpour, H.; Samadzadegan, F.; Javan, F.D. CMGFNet: A deep cross-modal gated fusion network for building extraction from very high-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 2022, 184, 96–115. [Google Scholar] [CrossRef]
- Zhang, J.; Lin, S.; Ding, L.; Bruzzone, L. Multi-scale context aggregation for semantic segmentation of remote sensing images. Remote Sens. 2020, 12, 701. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar] [CrossRef]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef]
- Ganguli, S.; Garzon, P.; Glaser, N. GeoGAN: A conditional GAN with reconstruction and style loss to generate standard layer of maps from satellite images. arXiv 2019, arXiv:1902.05611. [Google Scholar]
- Song, J.; Li, J.; Chen, H.; Wu, J. RSMT: A Remote Sensing Image-to-Map Translation Model via Adversarial Deep Transfer Learning. Remote Sens. 2022, 14, 919. [Google Scholar] [CrossRef]
- Xu, J.; Zhou, X.; Han, C.; Dong, B.; Li, H. SAM-GAN: Supervised learning-based aerial image-to-map translation via generative adversarial networks. ISPRS Int. J. Geo-Inf. 2023, 12, 159. [Google Scholar] [CrossRef]
- Phatangare, S.; Khalifa, M.M.; Kharche, S.; Khatib, A.; Kshirsagar, A. Satellite Image to Map Translation using GANs. Grenze Int. J. Eng. Technol. (GIJET) 2024, 10. [Google Scholar]
- Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
- Song, J.; Li, J.; Chen, H.; Wu, J. MapGen-GAN: A Fast Translator for Remote Sensing Image to Map Via Unsupervised Adversarial Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2341–2357. [Google Scholar] [CrossRef]
- Xiao, Z.; Kreis, K.; Vahdat, A. Tackling the generative learning trilemma with denoising diffusion gans. arXiv 2021, arXiv:2112.07804. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Li, B.; Xue, K.; Liu, B.; Lai, Y. BBDM: Image-to-Image Translation with Brownian Bridge Diffusion Models. In Proceedings of the 2023 IEEE/CVF Confernece Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2022; pp. 1952–1961. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
- Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Yi, Z.; Zhang, H.; Tan, P.; Gong, M. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2868–2876. [Google Scholar] [CrossRef]
- Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1857–1865. [Google Scholar]
- Chen, X.; Chen, S.; Xu, T.; Yin, B.; Peng, J.; Mei, X.; Li, H. SMAPGAN: Generative Adversarial Network-Based Semisupervised Styled Map Tile Generation Method. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4388–4406. [Google Scholar] [CrossRef]
- Song, J.; Chen, H.; Du, C.; Li, J. Semi-MapGen: Translation of Remote Sensing Image Into Map via Semisupervised Adversarial Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–19. [Google Scholar] [CrossRef]
- Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10. [Google Scholar]
- Wang, T.; Zhang, T.; Zhang, B.; Ouyang, H.; Chen, D.; Chen, Q.; Wen, F. Pretraining is all you need for image-to-image translation. arXiv 2022, arXiv:2205.12952. [Google Scholar]
- Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 3836–3847. [Google Scholar]
- Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion Models: A Comprehensive Survey of Methods and Applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
- Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
- Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Song, Y.; Ermon, S. Improved techniques for training score-based generative models. Adv. Neural Inf. Process. Syst. 2020, 33, 12438–12448. [Google Scholar]
- Song, Y.; Durkan, C.; Murray, I.; Ermon, S. Maximum likelihood training of score-based diffusion models. Adv. Neural Inf. Process. Syst. 2021, 34, 1415–1428. [Google Scholar]
- Doob, J.L.; Doob, J. Classical Potential Theory and Its Probabilistic Counterpart; Springer: Berlin/Heidelberg, Germany, 1984; Volume 262. [Google Scholar]
- Liu, X.; Wu, L.; Ye, M.; Liu, Q. Let us build bridges: Understanding and extending diffusion generative models. arXiv 2022, arXiv:2208.14699. [Google Scholar]
- Zhou, L.; Lou, A.; Khanna, S.; Ermon, S. Denoising Diffusion Bridge Models. arXiv 2023, arXiv:2309.16948. [Google Scholar]
- De Bortoli, V.; Thornton, J.; Heng, J.; Doucet, A. Diffusion schrödinger bridge with applications to score-based generative modeling. Adv. Neural Inf. Process. Syst. 2021, 34, 17695–17709. [Google Scholar]
- Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726. [Google Scholar] [CrossRef]
- Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
- Parmar, G.; Singh, K.K.; Zhang, R.; Li, Y.; Lu, J.; Zhu, J.Y. Zero-shot Image-to-Image Translation. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023. [Google Scholar]
- Sasaki, H.; Willcocks, C.G.; Breckon, T.P. Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models. arXiv 2021, arXiv:2104.05358. [Google Scholar]
- Choi, J.; Kim, S.; Jeong, Y.; Gwon, Y.; Yoon, S. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14347–14356. [Google Scholar]
- Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv 2021, arXiv:2108.01073. [Google Scholar]
- Zhao, M.; Bao, F.; Li, C.; Zhu, J. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Adv. Neural Inf. Process. Syst. 2022, 35, 3609–3623. [Google Scholar]
- Wu, C.H.; De la Torre, F. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 7378–7387. [Google Scholar]
- Sun, S.; Wei, L.; Xing, J.; Jia, J.; Tian, Q. SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation. In Proceedings of the 40th International Conference on Machine Learning, ICML 2023, Honolulu, HI, USA, 23–29 July 2023; PMLR, 2023. Volume 202, pp. 33115–33134. [Google Scholar]
- Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
- Mustafa, A.; Mantiuk, R.K. Transformation consistency regularization–a semi-supervised paradigm for image-to-image translation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 599–615. [Google Scholar]
- Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2016, arXiv:1610.02242. [Google Scholar]
- Park, T.; Efros, A.A.; Zhang, R.; Zhu, J.Y. Contrastive learning for unpaired image-to-image translation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 319–345. [Google Scholar]
- Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Zhang, K.; Tao, D. Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2422–2431. [Google Scholar] [CrossRef]
- Solano-Carrillo, E.; Rodriguez, A.B.; Carrillo-Perez, B.; Steiniger, Y.; Stoppe, J. Look ATME: The Discriminator Mean Entropy Needs Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 787–796. [Google Scholar]
- Jones, C.B. Geographical Information Systems and Computer Cartography; Routledge: London, UK, 2014. [Google Scholar]
Datasets | Total Number | Train Samples | Test Samples | Val Samples | Spacial Resolution | Zoom Level | Source |
---|---|---|---|---|---|---|---|
Los Angeles | 4631 | 3200 | 791 | 640 | 2.15 | 17 | Google Map |
Toronto | 4671 | 3200 | 831 | 640 | 2.15 | 17 | Google Map |
Method | RMSE ↓ | SSIM ↑ | ACC ↑ | |
---|---|---|---|---|
supervised | MapGen-Diff (ours) | 13.6898 | 0.7753 | 61.2555 |
Pix2pix | 15.6185 | 0.7381 | 53.8425 | |
ATME | 14.4883 | 0.7273 | 67.9793 | |
unsupervised | CycleGAN | 19.5317 | 0.6624 | 47.2573 |
MapGenGAN | 17.7429 | 0.7026 | 51.7917 | |
semi-supervised | SMAPGAN | 16.9625 | 0.7364 | 55.3926 |
Semi-MapGen | 16.2413 | 0.7513 | 56.1834 |
Method | RMSE ↓ | SSIM ↑ | ACC ↑ | |
---|---|---|---|---|
supervised | MapGen-Diff (ours) | 16.9409 | 0.7314 | 55.9149 |
Pix2pix | 24.5342 | 0.6935 | 49.1728 | |
ATME | 17.8888 | 0.7143 | 55.1059 | |
unsupervised | CycleGAN | 28.1417 | 0.6125 | 44.2257 |
MapGenGAN | 25.9813 | 0.6683 | 46.6972 | |
semi-supervised | SMAPGAN | 23.8214 | 0.6724 | 48.9541 |
Semi-MapGen | 23.4186 | 0.6896 | 49.4164 |
Methods | LA Datasets | Toronto Datasets | ||||
---|---|---|---|---|---|---|
RMSE | SSIM | ACC | RMSE | SSIM | ACC | |
MapGen-Diff | 13.6898 | 0.7753 | 61.2555 | 16.9409 | 0.7314 | 55.9149 |
MapGen-Diff-no- | 16.5749 | 0.6459 | 55.3711 | 23.1496 | 0.7247 | 49.3315 |
MapGen-Diff-no-Latent | 12.4593 | 0.7511 | 63.5634 | 21.2317 | 0.7872 | 55.4060 |
Methods | LA Datasets | Toronto Datasets | ||||
---|---|---|---|---|---|---|
RMSE | SSIM | ACC | RMSE | SSIM | ACC | |
MapGen-Diff- = 1 | 13.6898 | 0.7753 | 61.2555 | 16.9409 | 0.7314 | 55.9149 |
MapGen-Diff- = 0.5 | 15.1004 | 0.6986 | 60.7771 | 22.8755 | 0.7312 | 50.9219 |
MapGen-Diff- = 2 | 15.4696 | 0.6841 | 57.3343 | 23.7146 | 0.7306 | 47.1519 |
MapGen-Diff- = 4 | 16.0810 | 0.6724 | 53.8069 | 24.0916 | 0.7303 | 44.2797 |
Method | #Parameters | FLOPs | Time | |
---|---|---|---|---|
supervised | MapGen-Diff (ours) | 237 M | 169.45 G | 20 min |
Pix2pix | 54 M | 18.15 G | 3 min | |
ATME | 39 M | − | 2 min | |
unsupervised | CycleGAN | 114 M | 56.87 G | 2 min |
MapGenGAN | 67 M | 41.35 G | 3 min | |
Semi-supervised | SMAPGAN | 23 M | 169.58 G | 2 min |
Semi-MapGen | 218 M | 92.35 G | 2 min |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tian, J.; Wu, J.; Chen, H.; Ma, M. MapGen-Diff: An End-to-End Remote Sensing Image to Map Generator via Denoising Diffusion Bridge Model. Remote Sens. 2024, 16, 3716. https://doi.org/10.3390/rs16193716
Tian J, Wu J, Chen H, Ma M. MapGen-Diff: An End-to-End Remote Sensing Image to Map Generator via Denoising Diffusion Bridge Model. Remote Sensing. 2024; 16(19):3716. https://doi.org/10.3390/rs16193716
Chicago/Turabian StyleTian, Jilong, Jiangjiang Wu, Hao Chen, and Mengyu Ma. 2024. "MapGen-Diff: An End-to-End Remote Sensing Image to Map Generator via Denoising Diffusion Bridge Model" Remote Sensing 16, no. 19: 3716. https://doi.org/10.3390/rs16193716
APA StyleTian, J., Wu, J., Chen, H., & Ma, M. (2024). MapGen-Diff: An End-to-End Remote Sensing Image to Map Generator via Denoising Diffusion Bridge Model. Remote Sensing, 16(19), 3716. https://doi.org/10.3390/rs16193716