ENGDM: Enhanced Non-Isotropic Gaussian Diffusion Model for Progressive Image Editing
Abstract
1. Introduction
- We propose ENGDM, a novel method for progressive image editing. We introduce reinforced text embeddings, optimized with an editing reinforcement loss in the latent space to enhance editability (an illustrative sketch follows this list).
- We propose optimized noise variances, employing a structural consistency loss to dynamically adjust the denoising time steps for each pixel and ensure high faithfulness to the source image.
- Extensive experiments on multiple datasets demonstrate that ENGDM achieves state-of-the-art performance in image-editing tasks, striking a better balance between editability and faithfulness.
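As a rough illustration of the first contribution, the following minimal PyTorch sketch optimizes a text embedding against a latent-space denoising objective in the spirit of textual inversion [15]. The function names, optimizer settings, and the loss are illustrative assumptions only; they do not reproduce ENGDM's editing reinforcement loss.

```python
import torch

def optimize_text_embedding(eps_model, emb_init, z0, alphas_cumprod,
                            steps=50, lr=1e-2):
    """Illustrative sketch (not ENGDM's exact objective): refine a text
    embedding against a latent-space denoising loss, textual-inversion style.
    `eps_model(z_t, t, emb)` stands for any text-conditioned noise predictor."""
    emb = emb_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, len(alphas_cumprod), (1,)).item()
        a_t = alphas_cumprod[t]
        noise = torch.randn_like(z0)
        # Forward-diffuse the source latent to time t.
        z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * noise
        # Denoising loss; gradients flow into the embedding only.
        loss = torch.nn.functional.mse_loss(eps_model(z_t, t, emb), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()
```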
2. Related Work
Editing Method | Mask-Based | Inversion-Based | Attention-Based |
---|---|---|---|
Study | Blended Diffusion [13], Blended Latent Diffusion [14], PFB-Diff [26], DiffEdit [4], RDM [27] | DDIM Inversion [28], Null-Text Inversion [3], PnP Inversion [18], NMG [19], PTI [29], ProxEdit [30], DDPM Inversion [17], LEDITS++ [31], SDE-Drag [32], EDICT [16], BELM [33] | P2P [2], Pix2Pix-Zero [35], Custom-edit [36], Conditional Score Guidance [37], PnP [20], FPE [22], Photoswap [39], StyleInjection [40], MasaCtrl [21] |
Purpose | Mask-based image-editing methods leverage masks to guide and refine the sampling process. | Inversion-based image-editing methods invert the real image into noise space, and then use the sampling process to generate the edited results based on the noisy latent and a given target prompt. | Attention-based methods achieve image editing by manipulating the attention layers. |
Limitation | These methods exhibit limited flexibility when handling complex modifications. | The inversion process is time-consuming and may hinder practical applications. | It is challenging to accurately locate the specific regions that require editing. |
Performance | Faithfulness in non-edited regions is high, but edge artifacts are prone to occur. | Details of the source image are effectively preserved, but editing often fails in complex scenarios. | Details of the source image are not precisely preserved. |
3. Background: Score-Based Diffusion Models
4. Method
4.1. Non-Isotropic Gaussian Diffusion Model
4.2. Rectify the Non-Isotropic Gaussian Diffusion Model
4.3. Enhancing Editability with Reinforced Text Embeddings
4.4. Enhancing Faithfulness with Optimized Noise Variances
4.5. Sampling Method in ENGDM
Algorithm 1: Sampling method of ENGDM.
Inputs: the source image, the time schedule, the maximal time step T, the optimization time, and the guidance scale w.
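Algorithm 1's exact update rule is not reproduced here. As an illustration only, the sketch below shows a generic DDIM-style sampling loop in which a soft per-pixel weight map blends the latent denoised toward the target prompt with a forward-diffused copy of the source latent, one simple way to realize per-pixel (non-isotropic) amounts of editing. The names (`eps_model`, `weight_map`) and the guidance scheme are assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def nonisotropic_edit_sample(eps_model, z_src, tgt_emb, src_emb, weight_map,
                             alphas_cumprod, num_steps=50, guidance_scale=7.5):
    """Illustrative sketch, not ENGDM's Algorithm 1. `weight_map` is a per-pixel
    weight in [0, 1] (1 = fully edited, 0 = keep source); every step blends the
    edited latent with a forward-diffused source latent at the same noise level."""
    T = len(alphas_cumprod)
    ts = torch.linspace(T - 1, 0, num_steps).long()
    a_T = alphas_cumprod[ts[0]]
    z = a_T.sqrt() * z_src + (1 - a_T).sqrt() * torch.randn_like(z_src)
    for i, t in enumerate(ts):
        a_t = alphas_cumprod[t]
        # Guidance between target and source conditionings.
        eps_src = eps_model(z, t, src_emb)
        eps = eps_src + guidance_scale * (eps_model(z, t, tgt_emb) - eps_src)
        z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        z_edit = a_prev.sqrt() * z0_pred + (1 - a_prev).sqrt() * eps   # DDIM step
        z_keep = a_prev.sqrt() * z_src + (1 - a_prev).sqrt() * torch.randn_like(z_src)
        z = weight_map * z_edit + (1 - weight_map) * z_keep            # per-pixel blend
    return z
```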
5. Experiment
5.1. Experimental Setup
5.2. Results
5.2.1. Qualitative Results
5.2.2. Quantitative Results
5.2.3. User Study
5.3. Ablation Study
5.3.1. Ablation Analysis of ENGDM
5.3.2. Effect of Hyperparameters a and b
5.3.3. Comparison with Hard Weighting Matrix
5.3.4. Results at the Intermediate Steps of the Forward and Reverse Process
5.3.5. Validation of the Method for Determining the Total Diffusion Time Step
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Mathematical Definitions of Evaluation Metrics
1. Structure Distance (SD) utilizes spatial features extracted by DINO-ViT [50] to compute structural alignment.
2. Peak Signal-to-Noise Ratio (PSNR) quantifies the maximum signal-to-noise ratio based on the mean squared error.
3. Learned Perceptual Image Patch Similarity (LPIPS) assesses perceptual differences using deep feature representations.
4. Mean Squared Error (MSE) evaluates pixel-level reconstruction quality by averaging the squared differences between images.
5. Structural Similarity Index (SSIM) measures the structural similarity between images.
6. CLIPScore evaluates text–image alignment via cosine similarity in CLIP space.
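For reference, standard textbook forms of metrics (2)–(6) are given below; the exact scalings used in the result tables may differ, and the Structure Distance follows the DINO-ViT self-similarity formulation of [50].

```latex
\begin{align*}
\mathrm{MSE}(x, y)   &= \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2 \\
\mathrm{PSNR}(x, y)  &= 10\,\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, y)} \\
\mathrm{SSIM}(x, y)  &= \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}
                             {(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \\
\mathrm{LPIPS}(x, y) &= \sum_{l}\frac{1}{H_l W_l}\sum_{h,w}
    \bigl\| w_l \odot \bigl(\hat{\phi}_l^{hw}(x) - \hat{\phi}_l^{hw}(y)\bigr) \bigr\|_2^2 \\
\mathrm{CLIPScore}(I, T) &= \frac{E_I \cdot E_T}{\|E_I\|\,\|E_T\|}
\end{align*}
```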
Appendix B. Table for Describing the Evaluation Dataset
Dataset Name | Description | Source | Size | Editing Type |
---|---|---|---|---|
PIE [18] | The PIE dataset is grouped into animals, humans, indoor, and outdoor scenes, offering diverse, challenging tasks for evaluating editing performance. | https://github.com/cure-lab/PnPInversion (accessed on 10 April 2025) | 700 | Add, change, and remove. |
ZONE [46] | The ZONE dataset includes real and synthetic images, focusing on three editing types. | https://drive.google.com/file/d/1lAwpENoDcO1QyFuwz3iKJJ7DmDTFMvIU/view (accessed on 10 April 2025) | 100 | Add, change, and remove. |
Imagen [47] | The Imagen dataset contains 180 synthetic images generated by Imagen [47]. Each image is edited with 10 attribute replacements via prompt modification, yielding 1800 evaluation examples. | https://imagen.research.google/ (accessed on 10 April 2025) | 1800 | Change. |
EMU [48] | EMU is a large-scale benchmark containing images spanning seven editing categories. A post-validation step filters out low-quality samples to ensure high data quality. | https://huggingface.co/datasets/facebook/emu_edit_test_set (accessed on 10 April 2025) | 3314 | Add, change, and remove. |
HC | We collect images whose height and width exceed 1024 pixels from the high-quality HQ-Edit dataset [49] to form a new benchmark, referred to as the HC dataset. | https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit (accessed on 10 April 2025) | — | Add, change, and remove. |
Appendix C. Method for Hyperparameters Selection
Dataset | Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIPScore ↑ |
---|---|---|---|---|---|---|---|
PIE | ENGDM | 18.80 | 19.94 | 146.45 | 132.19 | 71.18 | 25.87 |
PIE | ENGDM-A | 13.41 | 21.41 | 111.77 | 100.43 | 73.49 | 25.82 |
ZONE | ENGDM | 16.75 | 20.17 | 122.75 | 119.58 | 74.73 | 25.43 |
ZONE | ENGDM-A | 12.64 | 21.43 | 100.74 | 91.38 | 77.72 | 25.39 |
References
- Meng, C.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-prompt image editing with cross attention control. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Null-text Inversion for Editing Real Images using Guided Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Couairon, G.; Verbeek, J.; Schwenk, H.; Cord, M. Diffedit: Diffusion-based semantic image editing with mask guidance. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Li, D.; Li, J.; Hoi, S. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar]
- Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–12 December 2020. [Google Scholar]
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Kingma, D.; Salimans, T.; Poole, B.; Ho, J. Variational diffusion models. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
- Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
- Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
- Avrahami, O.; Lischinski, D.; Fried, O. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Avrahami, O.; Fried, O.; Lischinski, D. Blended latent diffusion. ACM Trans. Graph. 2023, 42, 1–11. [Google Scholar] [CrossRef]
- Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Wallace, B.; Gokul, A.; Naik, N. Edict: Exact diffusion inversion via coupled transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Huberman-Spiegelglas, I.; Kulikov, V.; Michaeli, T. An edit friendly ddpm noise space: Inversion and manipulations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Ju, X.; Zeng, A.; Bian, Y.; Liu, S.; Xu, Q. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Cho, H.; Lee, J.; Kim, S.B.; Oh, T.H.; Jeong, Y. Noise map guidance: Inversion with spatial context for real image editing. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Tumanyan, N.; Geyer, M.; Bagon, S.; Dekel, T. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; Zheng, Y. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023. [Google Scholar]
- Liu, B.; Wang, C.; Cao, T.; Jia, K.; Huang, J. Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Starodubcev, N.; Khoroshikh, M.; Babenko, A.; Baranchuk, D. Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
- Yu, X.; Gu, X.; Liu, H.; Sun, J. Constructing non-isotropic Gaussian diffusion model using isotropic Gaussian diffusion model for image editing. In Proceedings of the Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Huang, W.; Tu, S.; Xu, L. Pfb-diff: Progressive feature blending diffusion for text-driven image editing. Neural Netw. 2025, 181, 106777. [Google Scholar] [CrossRef] [PubMed]
- Lin, Y.; Chen, Y.W.; Tsai, Y.H.; Jiang, L.; Yang, M.H. Text-driven image editing via learnable regions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Dong, W.; Xue, S.; Duan, X.; Han, S. Prompt tuning inversion for text-driven image editing using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023. [Google Scholar]
- Han, L.; Wen, S.; Chen, Q.; Zhang, Z.; Song, K.; Ren, M.; Gao, R.; Stathopoulos, A.; He, X.; Chen, Y.; et al. Proxedit: Improving tuning-free real image editing with proximal guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024. [Google Scholar]
- Brack, M.; Friedrich, F.; Kornmeier, K.; Tsaban, L.; Schramowski, P.; Kersting, K.; Passos, A. Ledits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Nie, S.; Guo, H.A.; Lu, C.; Zhou, Y.; Zheng, C.; Li, C. The blessing of randomness: Sde beats ode in general diffusion-based image editing. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Wang, F.; Yin, H.; Dong, Y.; Zhu, H.; Zhang, C.; Zhao, H.; Qian, H.; Li, C. BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
- Hong, S.; Lee, K.; Jeon, S.Y.; Bae, H.; Chun, S.Y. On Exact Inversion of DPM-Solvers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Parmar, G.; Kumar Singh, K.; Zhang, R.; Li, Y.; Lu, J.; Zhu, J.Y. Zero-shot image-to-image translation. In Proceedings of the ACM SIGGRAPH, Los Angeles, CA, USA, 6–10 August 2023. [Google Scholar]
- Choi, J.; Choi, Y.; Kim, Y.; Kim, J.; Yoon, S. Custom-edit: Text-guided image editing with customized diffusion models. arXiv 2023, arXiv:2305.15779. [Google Scholar]
- Lee, H.; Kang, M.; Han, B. Conditional score guidance for text-driven image-to-image translation. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
- Guo, Q.; Lin, T. Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Gu, J.; Wang, Y.; Zhao, N.; Fu, T.J.; Xiong, W.; Liu, Q.; Zhang, Z.; Zhang, H.; Zhang, J.; Jung, H.; et al. Photoswap: Personalized subject swapping in images. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
- Chung, J.; Hyun, S.; Heo, J.P. Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 2011, 23, 1661–1674. [Google Scholar] [CrossRef] [PubMed]
- Ho, J.; Salimans, T. Classifier-free diffusion guidance. In Proceedings of the Conference on Neural Information Processing Systems Workshop on Deep Generative Models and Downstream Applications, Virtual, 13 December 2021. [Google Scholar]
- Kwon, G.; Ye, J.C. Diffusion-based image translation using disentangled style and content representation. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar]
- Li, S.; Zeng, B.; Feng, Y.; Gao, S.; Liu, X.; Liu, J.; Li, L.; Tang, X.; Hu, Y.; Liu, J.; et al. Zone: Zero-shot instruction-guided local editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
- Sheynin, S.; Polyak, A.; Singer, U.; Kirstain, Y.; Zohar, A.; Ashual, O.; Parikh, D.; Taigman, Y. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Hui, M.; Yang, S.; Zhao, B.; Shi, Y.; Wang, H.; Wang, P.; Zhou, Y.; Xie, C. Hq-edit: A high-quality dataset for instruction-based image editing. arXiv 2024, arXiv:2404.09990. [Google Scholar]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
- Brooks, T.; Holynski, A.; Efros, A.A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Xu, S.; Huang, Y.; Pan, J.; Ma, Z.; Chai, J. Inversion-free image editing with natural language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Nam, H.; Kwon, G.; Park, G.Y.; Ye, J.C. Contrastive denoising score for text-guided latent diffusion image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIPScore ↑ |
---|---|---|---|---|---|---|
P2P [2] | 69.95 | 15.10 | 335.58 | 347.18 | 55.36 | 24.98 |
DiffEdit [4] | 17.41 | 19.66 | 129.14 | 131.34 | 72.43 | 25.09 |
InstructPix2Pix [51] | 57.94 | 16.71 | 269.02 | 419.30 | 61.72 | 23.57 |
MasaCtrl [21] | 28.08 | 19.09 | 181.21 | 147.28 | 67.72 | 23.90 |
NMG [19] | 15.37 | 23.39 | 112.16 | 160.12 | 73.48 | 23.57 |
PnP Inversion [18] | 11.71 | 22.26 | 116.40 | 76.16 | 73.29 | 24.84 |
ZONE [46] | 58.27 | 16.20 | 281.79 | 396.66 | 58.97 | 23.98 |
FPE [22] | 12.77 | 21.67 | 114.95 | 82.86 | 73.42 | 24.35 |
InfEdit [52] | 19.47 | 21.49 | 133.87 | 176.39 | 70.78 | 24.74 |
CDS [53] | 7.33 | 23.83 | 76.48 | 57.27 | 76.79 | 23.91 |
iCD [23] | 39.43 | 17.81 | 235.93 | 203.94 | 62.39 | 25.92 |
NGDM (a = 10.0, b = 5.0) | 21.32 | 19.31 | 159.84 | 139.30 | 69.37 | 25.84 |
ENGDM (a = 10.0, b = 5.0) | 18.80 | 19.94 | 146.45 | 132.19 | 71.18 | 25.97 |
NGDM (a = 10.0, b = 10.0) | 7.35 | 23.40 | 82.94 | 57.03 | 76.08 | 24.65 |
ENGDM (a = 10.0, b = 10.0) | 6.55 | 23.98 | 74.64 | 48.96 | 79.84 | 24.92 |
Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIPScore ↑ |
---|---|---|---|---|---|---|
P2P [2] | 57.96 | 16.00 | 265.69 | 286.01 | 59.92 | 24.19 |
DiffEdit [4] | 14.87 | 20.39 | 104.78 | 117.71 | 77.29 | 24.94 |
InstructPix2Pix [51] | 33.86 | 18.70 | 189.72 | 296.25 | 69.99 | 24.19 |
MasaCtrl [21] | 24.20 | 19.78 | 151.36 | 133.62 | 72.98 | 23.71 |
NMG [19] | 15.95 | 23.61 | 95.19 | 105.59 | 78.23 | 23.01 |
PnP Inversion [18] | 11.36 | 22.49 | 94.35 | 74.29 | 77.91 | 24.20 |
ZONE [46] | 34.60 | 17.58 | 204.14 | 295.32 | 67.89 | 24.64 |
FPE [22] | 11.41 | 22.37 | 90.11 | 74.02 | 78.30 | 23.48 |
InfEdit [52] | 15.74 | 21.69 | 106.04 | 152.23 | 74.21 | 24.18 |
CDS [53] | 6.91 | 24.49 | 63.52 | 57.11 | 81.79 | 23.72 |
iCD [23] | 32.85 | 17.76 | 205.44 | 198.61 | 64.50 | 25.23 |
NGDM (a = 10.0, b = 5.0) | 17.93 | 19.97 | 130.55 | 124.57 | 74.11 | 25.12 |
ENGDM (a = 10.0, b = 5.0) | 16.75 | 20.17 | 122.75 | 119.58 | 74.73 | 25.43 |
NGDM (a = 10.0, b = 10.0) | 6.27 | 24.14 | 64.57 | 48.00 | 81.44 | 23.95 |
ENGDM (a = 10.0, b = 10.0) | 5.64 | 24.69 | 57.84 | 41.36 | 83.82 | 24.36 |
Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIPScore ↑ |
---|---|---|---|---|---|---|
P2P [2] | 56.38 | 14.07 | 284.17 | 405.39 | 56.55 | 32.99 |
DiffEdit [4] | 13.05 | 19.54 | 92.93 | 121.29 | 78.51 | 33.60 |
InstructPix2Pix [51] | 59.35 | 12.66 | 336.20 | 690.25 | 56.59 | 25.89 |
MasaCtrl [21] | 20.40 | 18.27 | 158.14 | 160.87 | 72.01 | 32.62 |
NMG [19] | 7.87 | 22.04 | 80.81 | 76.93 | 78.63 | 32.82 |
PnP Inversion [18] | 8.95 | 21.12 | 82.39 | 87.80 | 77.89 | 33.13 |
ZONE [46] | 60.35 | 12.67 | 344.80 | 617.58 | 55.23 | 25.28 |
FPE [22] | 9.18 | 20.56 | 85.44 | 93.33 | 76.62 | 32.78 |
InfEdit [52] | 6.55 | 22.28 | 66.20 | 70.64 | 80.92 | 32.66 |
CDS [53] | 8.95 | 20.44 | 75.63 | 111.85 | 78.73 | 32.87 |
iCD [23] | 31.42 | 16.49 | 212.42 | 242.45 | 66.02 | 34.33 |
NGDM (a = 10.0, b = 5.0) | 15.73 | 18.89 | 121.07 | 138.81 | 75.12 | 34.23 |
ENGDM (a = 10.0, b = 5.0) | 14.40 | 19.31 | 110.98 | 126.34 | 76.20 | 34.45 |
NGDM (a = 10.0, b = 10.0) | 5.80 | 22.38 | 61.33 | 62.00 | 81.06 | 33.24 |
ENGDM (a = 10.0, b = 10.0) | 5.01 | 22.99 | 54.82 | 55.23 | 81.74 | 33.46 |
Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIPScore ↑ |
---|---|---|---|---|---|---|
P2P [2] | 79.14 | 14.02 | 393.78 | 434.62 | 46.02 | 24.97 |
DiffEdit [4] | 17.68 | 20.28 | 124.93 | 128.29 | 70.51 | 23.73 |
InstructPix2Pix [51] | 48.68 | 17.45 | 236.15 | 353.48 | 61.62 | 22.49 |
MasaCtrl [21] | 31.30 | 18.66 | 200.59 | 161.27 | 63.76 | 23.18 |
NMG [19] | 11.15 | 26.69 | 78.58 | 73.74 | 76.55 | 21.88 |
PnP Inversion [18] | 16.16 | 21.18 | 154.09 | 89.96 | 68.52 | 25.54 |
ZONE [46] | 52.22 | 16.44 | 264.65 | 377.38 | 58.02 | 22.70 |
FPE [22] | 12.25 | 21.97 | 113.57 | 77.49 | 72.15 | 23.91 |
InfEdit [52] | 36.59 | 16.92 | 226.86 | 260.86 | 57.44 | 25.09 |
CDS [53] | 5.53 | 25.48 | 59.32 | 38.36 | 77.12 | 22.71 |
iCD [23] | 53.94 | 16.25 | 305.91 | 266.68 | 52.17 | 25.71 |
NGDM (a = 10.0, b = 5.0) | 31.94 | 18.60 | 210.35 | 191.05 | 62.33 | 25.58 |
ENGDM (a = 10.0, b = 5.0) | 27.61 | 19.16 | 189.40 | 165.95 | 64.26 | 25.75 |
NGDM (a = 10.0, b = 10.0) | 7.82 | 23.93 | 78.91 | 51.25 | 75.07 | 23.20 |
ENGDM (a = 10.0, b = 10.0) | 6.15 | 24.67 | 69.50 | 43.93 | 76.14 | 23.41 |
Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIPScore ↑ |
---|---|---|---|---|---|---|
P2P [2] | 90.84 | 11.65 | 427.75 | 779.03 | 30.55 | 24.63 |
DiffEdit [4] | 9.43 | 20.24 | 77.34 | 111.52 | 76.30 | 26.94 |
InstructPix2Pix [51] | 67.04 | 13.82 | 283.82 | 569.97 | 42.53 | 25.01 |
PnP Inversion [18] | 8.68 | 22.82 | 69.11 | 77.97 | 77.35 | 26.24 |
ZONE [46] | 68.14 | 13.40 | 334.81 | 594.43 | 43.51 | 24.77 |
FPE [22] | 20.10 | 18.84 | 135.29 | 150.59 | 74.47 | 26.31 |
InfEdit [52] | 31.34 | 16.42 | 191.28 | 300.09 | 63.67 | 26.37 |
CDS [53] | 3.38 | 28.38 | 34.61 | 27.36 | 90.83 | 25.79 |
NGDM (a = 10.0, b = 5.0) | 12.10 | 19.54 | 106.02 | 122.53 | 72.88 | 27.36 |
ENGDM (a = 10.0, b = 5.0) | 11.16 | 19.95 | 97.95 | 117.11 | 73.70 | 27.63 |
NGDM (a = 10.0, b = 10.0) | 3.91 | 30.20 | 34.84 | 27.03 | 90.27 | 25.94 |
ENGDM (a = 10.0, b = 10.0) | 3.02 | 30.72 | 30.35 | 22.62 | 92.84 | 26.27 |
P2P | DiffEdit | InstructPix2Pix | MasaCtrl | NMG | PnP Inversion | ZONE | FPE | InfEdit | CDS | iCD | NGDM | ENGDM |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.63% | 10.31% | 3.75% | 2.19% | 3.13% | 1.88% | 6.88% | 1.88% | 5.00% | 6.56% | 9.06% | 19.06% | 29.69% |
a | 6.0 | 8.0 | 10.0 | 12.0 | 14.0 |
---|---|---|---|---|---|
Distance ↓ | 6.01 | 10.14 | 16.75 | 22.70 | 27.09 |
PSNR ↑ | 26.05 | 22.18 | 20.17 | 18.89 | 18.14 |
LPIPS ↓ | 55.63 | 88.41 | 122.75 | 150.47 | 169.52 |
MSE ↓ | 36.26 | 73.49 | 119.58 | 160.75 | 190.53 |
SSIM ↑ | 82.43 | 78.42 | 74.73 | 72.02 | 70.08 |
Score ↑ | 23.89 | 24.73 | 25.43 | 25.56 | 25.61 |
b | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 |
---|---|---|---|---|---|
Distance ↓ | 34.07 | 24.45 | 16.75 | 10.66 | 6.78 |
PSNR ↑ | 17.50 | 18.67 | 20.17 | 21.85 | 23.76 |
LPIPS ↓ | 196.95 | 160.33 | 122.75 | 90.14 | 66.39 |
MSE ↓ | 216.30 | 168.29 | 119.58 | 79.52 | 51.23 |
SSIM ↑ | 67.12 | 70.91 | 74.73 | 78.30 | 81.13 |
Score ↑ | 25.68 | 25.64 | 25.43 | 24.82 | 24.56 |
Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIPScore ↑ |
---|---|---|---|---|---|---|
Random | 13.83 | 21.66 | 114.95 | 109.81 | 75.16 | 23.96 |
Ours | 16.75 | 20.17 | 122.75 | 119.58 | 74.73 | 25.43 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).