Article

MSHEdit: Enhanced Text-Driven Image Editing via Advanced Diffusion Model Architecture

by Mingrui Yang, Jian Yuan *, Jiahui Xu and Weishu Yan

School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3758; https://doi.org/10.3390/electronics14193758
Submission received: 13 August 2025 / Revised: 12 September 2025 / Accepted: 18 September 2025 / Published: 23 September 2025

Abstract

To address limitations in structural preservation and detail fidelity in existing text-driven image editing methods, we propose MSHEdit, a novel editing framework built upon a pre-trained diffusion model. MSHEdit is designed to achieve high semantic alignment during image editing without the need for additional training or fine-tuning. The framework integrates two key components: the High-Order Stable Diffusion Sampler (HOS-DEIS) and the Multi-Scale Window Residual Bridge Attention Module (MS-WRBA). HOS-DEIS enhances sampling precision and detail recovery by employing high-order integration and dynamic error compensation, while MS-WRBA improves editing region localization and edge blending through multi-scale window partitioning and dual-path normalization. Extensive experiments on public datasets including DreamBooth-v2 and DreamBench++ show that, compared to recent mainstream models, MSHEdit reduces structural distance by 2% and background LPIPS by 1.2%. These results indicate its ability to achieve natural transitions between edited regions and backgrounds in complex scenes while effectively mitigating object edge blurring. MSHEdit exhibits excellent structural preservation, semantic consistency, and detail restoration, providing an efficient and generalizable solution for high-quality text-driven image editing.

1. Introduction

Text-to-Image Editing (T2IE) is a text-driven image manipulation technique that leverages generative AI to translate natural language descriptions into targeted modifications of visual content. It has been widely adopted in fields such as film production and advertising design, significantly reducing the cost and complexity of visual storytelling and creative development. Unlike conventional generative models that create images from scratch, T2IE focuses on modifying existing images by interpreting textual instructions, ranging from minor visual adjustments to complete transformations of core visual elements.
In recent years, diffusion models, inspired by principles of nonequilibrium thermodynamics [1], have emerged as a leading approach in text-driven image editing. By simulating noise diffusion and reverse denoising processes in physical systems, these models are capable of reconstructing complex image content from random noise distributions [1,2,3,4]. As a result, diffusion-based image generation methods have garnered widespread attention. Notable examples include Imagen [5] and Stable Diffusion [6], which exhibit strong capabilities in producing intricate objects and diverse scenes, particularly excelling in tasks such as object replacement and pose transformation. The general development history of the diffusion model is shown in Table 1.
Recent research predominantly adopts Stable Diffusion [6] as the core framework for text-to-image diffusion models. Broadly, existing approaches can be classified into three categories: training-based methods [7,11], test-time fine-tuning techniques [9,12], and model-free strategies [13,14]. Among them, model-free methods eliminate the need for training or fine-tuning, enabling faster deployment, real-time editing, and the integration of multi-modal inputs. Building upon this paradigm, we propose an image editing method that functions entirely without training or fine-tuning. In certain diffusion-based editing applications, mechanisms such as cross-attention are employed to ensure semantic alignment between the generated (or edited) images and the input prompts [13].
Although diffusion model-based image editing methods have made notable progress in generation quality and semantic controllability [15,16,17], considerable limitations persist in structure preservation, detail reconstruction, and precise editing region control.
On one hand, most existing approaches rely on cross-attention mechanisms to achieve semantic alignment between text and image; however, the limited resolution of attention maps impedes fine-grained control over local structures, often resulting in blurred boundaries in edited regions and unnatural fusion between content and background.
On the other hand, mainstream reconstruction methods typically employ inversion strategies such as DDIM [4] or DEIS [18], yet when editing conditions are introduced, the accumulation of numerical errors frequently leads to texture drifting or loss of detail. This issue is especially prominent in style transfer tasks, in which inconsistent style expression or structural distortions may occur. Moreover, current methods often lack efficient spatial localization capability for the editing target in complex scenarios, making it challenging to simultaneously ensure both semantic accuracy and structural fidelity.
To overcome these limitations, we introduce MSHEdit, a diffusion-based image editing framework that requires neither training nor fine-tuning. Users are only required to provide an input image and a concise text prompt describing the desired modification. The system ensures both structural integrity and stylistic coherence while generating results that align with the semantic intent of the input text. Leveraging pre-trained text-to-image diffusion models, MSHEdit significantly reduces computational overhead compared to conventional image editing pipelines, while maintaining high semantic and visual fidelity. Some outstanding experimental results are shown in Figure 1.
The core of MSHEdit lies in style editing and semantic editing, where the style editing adopts the High-Order Stable Diffusion Sampler (HOS-DEIS) module. Currently, most image editing methods using diffusion models employ the DDIM inversion process, but this approach suffers from low sampling efficiency and sampling distortion when capturing overall style and texture. Therefore, we adopt the more advanced DEIS [18] as the foundation for our model’s inversion process and propose the HOS-DEIS inversion process, which better achieves style consistency in generated images.
The semantic editing of MSHEdit utilizes the Multi-Scale Windowed Residual Bridge Attention (MS-WRBA) module, addressing precise determination of editing regions and smooth edge fusion. Since existing methods have issues with inaccurate localization of fine structural areas, we employ cross-scale residual window cross-attention. This module adopts multi-scale window partitioning and a dual-path normalization architecture, dividing input features into multiple local windows and independently calculating attention weights within each window. The model can accurately capture detailed changes in local regions, significantly improving editing area localization accuracy. Moreover, edge areas often exhibit drastic pixel value variations that easily trigger gradient instability. This normalization architecture combined with the residual link mechanism standardizes feature distributions, effectively alleviating gradient explosion and disappearance phenomena in edge regions. It enhances feature expression stability and consistency while reducing edge artifacts and segmentation artifacts.
In summary, our technical contributions are as follows:
(1)
To enhance fidelity in style transfer, we introduce an advanced HOS-DEIS module. Improvements to the DEIS exponential integrator, specifically refined higher-order coefficient calculations and integration, enable high-order stable solutions in the diffusion model's inverse process. The incorporation of a logarithmic space coefficient remapping mechanism and a dynamic error-compensated integrator further boosts sampling efficiency and image detail reproduction, significantly reducing ambiguous style expression and overfitting typically encountered during style transfer.
(2)
For fine-grained structural editing, we propose a novel cross-scale residual window cross-attention module (MS-WRBA). By leveraging Pre-LayerNorm, dual-path multi-layer normalization, windowed attention computation, and cross-directional feature interaction, our approach substantially decreases the computational complexity of attention matrices while improving the model’s multi-scale contextual understanding. This results in more accurate localization of edited regions and improved edge fusion, yielding seamless integration between edited areas and background.
(3)
By integrating (1) and (2), we introduce MSHEdit, a training-free image editing framework based on a pre-trained text-conditioned diffusion model. The framework synergistically integrates the High-Order Stable Diffusion Sampler (HOS-DEIS) and the multi-scale window residual bridging attention mechanism (MS-WRBA), enabling high-quality and semantically consistent image editing without the need for additional training or fine-tuning. By incorporating high-order exponential integration and dynamic error compensation, the framework significantly improves numerical stability and sampling accuracy during the reverse diffusion process. Furthermore, the MS-WRBA module leverages localized windowed attention modeling and cross-scale feature fusion to enhance the precision of editing region localization and the naturalness of boundary transitions.
We validate the effectiveness of MSHEdit across a range of image editing tasks, including object color replacement, element substitution, shape transformation, and holistic style modification. Experimental results demonstrate that the method not only accurately locates editing targets and maintains semantic consistency in complex scenarios but also outperforms existing methods across multiple mainstream evaluation metrics, showcasing broad practical application potential. On a single RTX 4090, MSHEdit processes a 512 × 512 image at about 2.6 FPS, indicating near-real-time potential after future TensorRT or pruning optimization. Additionally, comprehensive ablation studies further elucidate the contributions of individual components and help clarify current limitations.

2. Related Work

2.1. Three Improvement Directions of the Diffusion Model Under Image Editing

Most current image editing models are developed from the Stable Diffusion [6] diffusion model, which is primarily used for text-to-image generation. This paper therefore adopts the Stable Diffusion [6] model to implement image editing tasks. The editing model first employs “concept inversion” technology to map the concepts in the input image into a representation space controllable by the diffusion model. It then combines this conceptual representation space with the user’s input text description as conditional input to modify and edit the image into a new version that incorporates both the user-specified concepts and aligns with textual semantics. Therefore, improving the diffusion model to achieve superior editing results can be categorized into three main approaches.
The first category comprises training-based approaches. DiffusionCLIP [11], functioning as a weakly supervised model, utilizes CLIP’s text embedding to guide the diffusion model in image editing. Through the DDIM, it converts images into latent noise and fine-tunes the diffusion model during the inversion diffusion process. InstructPix2Pix [7], an instruction-driven fully supervised training model, represents the first method to learn image editing that follows human instructions, training the model through generating original-image-to-edited-image pairs.
The second approach involves fine-tuning during testing. As a hybrid model, Imagic [9] first converts target text into text embeddings, then optimizes these embeddings to reconstruct input images while fine-tuning the diffusion model. Null-Text Inversion [19], functioning as an embedding fine-tuning model, addresses reconstruction failures in DDIM inversion by adjusting empty text embeddings. This reduces the distance between sampling trajectories and inversion trajectories, thereby enhancing reconstruction performance. StyleDiffusion [12] introduces a mapping network that aligns input image features with text prompt embeddings in the embedding space, generating corresponding prompt embeddings [20,21].
While training- and inference-time fine-tuning approaches can be effective for specific tasks, they remain fundamentally limited by their dependence on data-driven parameter updates. Training-based methods require large annotated datasets, incur high costs, and lack flexibility for new editing types, as they demand retraining or adaptation. Inference-time optimization increases computational overhead and may yield unstable results due to inconsistent objectives. Moreover, both approaches often lack explicit modeling of editing regions, making it difficult to achieve precise local control without compromising structural consistency. As a result, such dependence on training restricts both the practicality and scalability of these methods in open-domain image editing scenarios. Consequently, most researchers have shifted their focus to the third approach, and this paper also adopts this method to build the model [22].
The third category comprises methods independent of training and fine-tuning. Prompt-to-Prompt [13] and Pix2Pix-zero [14], as two attention modification models, identify the role of cross-attention layers in the spatial relationships between image layouts and prompts, autonomously discovering editing directions within the text embedding space. Blended Diffusion [23] integrates text-conditioned content into target images through function substitution and multi-level feature blending, while introducing attention masks to prioritize the influence of specific words within designated areas. MagicQuill enables users to directly express editing intent with simple brush strokes, allowing an MLLM to automatically comprehend the prompt and a diffusion model to precisely execute the edits. These methods achieve rapid deployment and real-time editing capabilities without requiring fine-tuning or training, while integrating multi-modal inputs.

2.2. Inversion Process of the Diffusion Model

The denoising process in diffusion models has been a core focus of recent research, with significant advancements including methods such as DDPM [2], DDIM [4], DEIS [18], and PNDM [24], which have contributed to accelerated sampling and improved generation quality. DDPM, the foundational approach in diffusion models, generates high-quality samples by gradually adding noise and learning the reverse process. DDIM enhances sampling through deterministic sampling and adjustable noise parameters. PNDM reduces the required sampling steps while maintaining high-quality results by employing non-Markovian forward noise processes and optimized backward denoising; it treats diffusion models as differential equation problems on manifolds and proposes a pseudo-numerical method that achieves high-quality image generation within roughly 50 steps. DEIS employs an exponential-integrator numerical method to efficiently discretize the diffusion model's backward denoising process, enabling high-quality sample generation with fewer computational steps. Building on DEIS, this paper further optimizes diffusion model sampling efficiency by solving the diffusion equations with the exponential integrator [2,4,18,24].
Although inversion methods such as DDIM, DEIS, and PNDM have improved sampling efficiency to varying degrees, their practical application in image editing reveals an inherent trade-off between reconstruction fidelity and editing controllability. Specifically, applying editing conditions with DDIM and DDPM often introduces irreversible texture drift and structural distortions in unedited regions. In contrast, DEIS and PNDM compress the number of sampling steps via higher-order numerical integration, enhancing efficiency but exacerbating truncation errors. As a result, high-frequency details are irreversibly lost during inversion and resampling, manifesting as edge artifacts or inconsistent style representations, particularly in style transfer or local replacement tasks.
HOS-DEIS and MS-WRBA address error accumulation and semantic drift from the perspectives of numerical stability and local semantic control. HOS-DEIS suppresses numerical errors during long inversion using high-order integration and dynamic error compensation. MS-WRBA achieves precise editing region localization and smooth boundary transitions via localized windowed attention and cross-scale feature fusion. Together, they significantly improve structural consistency and semantic accuracy in image editing without requiring training or fine-tuning.

2.3. Attention Mechanism in Semantic Editing

The controllability of diffusion models in image editing tasks largely depends on their built-in attention mechanisms’ precise regulation of text–image semantic alignment and spatial structure preservation. In the U-Net architecture of Stable Diffusion [6], the attention maps encoded by cross-attention and self-attention layers serve dual functions [25]: bridging language prompts to visual regions while maintaining global geometric consistency within self-attention, thereby providing interpretable semantic–spatial prior knowledge for subsequent editing operations. Therefore, the application of attention mechanisms can be broadly categorized into three types: Attention Map Replacement (AMR) modifies original attention maps during editing paths to preserve unedited regions [13,14]; Attention Feature Replacement (AFR) replaces Key/Value features in self-attention layers with reference image structural information [26,27]; Attention + Inversion (A+I) optimizes attention-related parameters (e.g., Key/Value or empty text embedding) during inversion phases to resolve reconstruction failures [19,28]. The introduction of attention maps directs pixel updates to specific regions associated with modified text, enabling precise and localized image modifications without masking techniques [29].
However, these strategies still face several practical challenges. First, while attention graphs and features demonstrate high interpretability, their nonlinear mapping to high-dimensional latent spaces lacks a comprehensive theoretical framework, leading to semantic drift and geometric distortion under extreme editing magnitudes [30,31]. Second, existing methods predominantly rely on fixed semantic spaces from pre-trained text encoders, making it difficult to dynamically adapt to fine-grained or composite instructions, thereby limiting editing flexibility in complex scenarios. Additionally, although local attention constraints enable localized operations, sub-pixel-level mask boundary alignment errors may accumulate into global artifacts, particularly evident in object edges and high-frequency texture regions.

3. Methods

In conventional diffusion-based approaches, the Denoising Diffusion Implicit Model (DDIM) [4] achieves image generation through progressive noise injection and reverse denoising. However, DDIM’s inversion process exhibits discrepancies between noise estimation and sampling, where accumulated errors degrade image quality. Additionally, its multiple denoising steps may lead to over-denoising and detail loss [32], while DEIS suffers from artifacts like edge detail distortion when handling outliers. To address these challenges, this study proposes HOS-DEIS as the core generative framework. As an efficient high-order solver, HOS-DEIS enables rapid and accurate implementation of DDIM’s inverse process. While inheriting DDIM’s deterministic sampling characteristics in diffusion denoising, HOS-DEIS significantly enhances generative efficiency through optimized algorithms.
In MSHEdit, HOS-DEIS (Section 3.1) is adopted as the primary diffusion guidance method. We optimize the numerical solver used during the reverse diffusion process to enable precise grid sampling and address issues such as boundary instability and shadow artifacts. Furthermore, MS-WRBA (Section 3.2) is introduced to enhance the blending performance of high-resolution generation with respect to spatial positioning and editing fidelity along the boundary regions. As illustrated in Figure 2, the model first takes the input image $x_0$ and extracts features using the CLIP image encoder, obtaining $E_{\text{clip}}(x_0)$.
Simultaneously, the text prompt is encoded via the CLIP text encoder to obtain $E_{\text{clip}}(P)$. These encoded features are then fused through the CLIP cross-attention module to produce the condition embedding that guides the noise predictor $\epsilon_\theta(x_t, t)$. During the diffusion process, the noisy image $X_{\text{noise}}$ at time step $t$ is iteratively denoised, incorporating this condition embedding. Using HOS-DEIS, we generate an intermediate image $x_t$. This intermediate result is subsequently refined by MS-WRBA, which performs multi-scale window partitioning and attention-based normalization to further aggregate conditional information. Finally, the denoising module $p_\theta(x_{t-1} \mid x_t)$ eliminates residual noise, producing the final edited output image $x_0$.
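The control flow just described can be summarized in the minimal PyTorch-style sketch below. All component names (clip_image_encoder, cross_attn_fuse, hos_deis_step, ms_wrba, and so on) are illustrative placeholders passed in as callables; they are not the authors' released API, and the sketch only mirrors the pipeline of Figure 2.

```python
def mshedit_edit(x0, prompt, clip_image_encoder, clip_text_encoder,
                 cross_attn_fuse, unet, hos_deis_step, ms_wrba,
                 add_noise, num_steps=50):
    """Sketch of the MSHEdit editing loop (all callables are placeholders)."""
    # Condition embedding: fuse CLIP image features with CLIP text features.
    cond = cross_attn_fuse(clip_image_encoder(x0), clip_text_encoder(prompt))

    # Inversion: map the input image to a noisy latent X_noise.
    x_t = add_noise(x0, num_steps)

    # Reverse diffusion: HOS-DEIS update followed by MS-WRBA refinement.
    for t in reversed(range(1, num_steps + 1)):
        eps = unet(x_t, t, cond)           # noise prediction eps_theta(x_t, t)
        x_t = hos_deis_step(x_t, eps, t)   # high-order stable update (Section 3.1)
        x_t = ms_wrba(x_t, cond)           # windowed residual bridge attention (Section 3.2)
    return x_t                             # final edited image
```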

3.1. DEIS and HOS-DEIS Inversion Process

Style editing tasks are typically categorized into “explicit editing” and “implicit editing”. Explicit editing allows users to specify target styles through explicit text or reference images, with the model directly applying style migration according to instructions. In contrast, implicit editing emphasizes the model’s automatic capture and flexible sampling of style features. Rather than relying on precise stylistic constraints, it achieves diverse style expressions within the style space through a sampling mechanism [31].
HOS-DEIS employs advanced integrators and dynamic error compensation mechanisms to achieve precise restoration of image details and textures during diffusion inverse process sampling. This approach ensures accurate reproduction of stylistic features in explicit style editing while enhancing diversity and controllability of style sampling in implicit style editing. Through logarithmic space coefficient remapping and multi-order differential correction, HOS-DEIS effectively prevents loss or confusion of stylistic features during sampling, ensuring clearer expression of style characteristics. Furthermore, HOS-DEIS adopts a sampling method that requires no training or fine-tuning, directly applying to pre-trained diffusion models without separate training or adjustments for each style or content.

3.1.1. DEIS (Diffusion Exponential Integrator Sampler)

Reverse Diffusion Process: In diffusion models, the reverse diffusion process is the core step where the model reconstructs the target image from noisy data step by step [14,20]. The most critical mathematical relationships are
$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t$$
$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t+\epsilon}\;\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$
Here, $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ denotes the cumulative product, $\hat{x}_0$ is the predicted image, and $\beta_s$ is the signal loss rate at each time step. This process simulates a gradual denoising dynamic, relying on conditional probability distributions for step-by-step sampling generation (Figure 3). Specifically, assuming that at time step $t$ the input is the noisy intermediate state $x_t$, the model aims to generate the conditional probability distribution of its previous state $x_{t-1}$, which can be expressed as a Gaussian distribution:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
Here, $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ represent the mean and covariance matrix of the Gaussian distribution, respectively, both of which are learned by a parameterized U-Net diffusion model [33]. The mean $\mu_\theta(x_t, t)$, which is computed from the noise prediction $\epsilon_\theta(x_t, t)$, determines the center of the sampling, while the covariance matrix $\Sigma_\theta(x_t, t)$ controls the uncertainty of the sampling.
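For concreteness, the forward noising relation and the clean-image prediction above can be written as the following sketch. The tensors are placeholders, and the small stabilizer eps = 1e-8 is an assumed value.

```python
import torch

def add_noise(x0: torch.Tensor, alpha_bar_t: torch.Tensor):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * noise
    return x_t, noise

def predict_x0(x_t: torch.Tensor, eps_pred: torch.Tensor,
               alpha_bar_t: torch.Tensor, eps: float = 1e-8):
    """Predicted clean image x0_hat recovered from x_t and eps_theta(x_t, t)."""
    return (x_t - torch.sqrt(1.0 - alpha_bar_t + eps) * eps_pred) / torch.sqrt(alpha_bar_t)
```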
DEIS Exponential Integrator Method: DEIS (Diffusion Exponential Integrator Sampler) accelerates reverse diffusion by using exponential integrators. Unlike traditional methods that need many time steps, DEIS reduces the number of sampling steps while preserving high image editing quality. In the DEIS method, the update formula for the image from time step $t$ to $t-1$ is
$$x_{t-1} = x_t + \sum_{i=1}^{s} b_i(t)\,\epsilon_\theta(x_t, t)$$
Here, $s$ represents the order of the integrator, $b_i(t)$ are time-dependent weights, and $\epsilon_\theta(x_t, t)$ is the noise predicted by the model, reflecting the noise component contained in $x_t$. DEIS simulates the reverse diffusion dynamics by accumulating terms of different orders and can be described precisely using continuous-time dynamical systems theory:
$$x_{t-1} = x_t + \int_{t-\Delta t}^{t} \Psi(t, \tau)\,F(\tau)\,d\tau$$
$$F(\tau) = \frac{1}{2}\,G(\tau)\,G^{T}(\tau)\,\Sigma^{-1}(\tau)\,\epsilon_\theta(x_\tau, \tau)$$
Here, $\Psi(t, \tau)$ is the transition matrix describing the transition of the system state from time $\tau$ to $t$, and $G(\tau)$ is the diffusion coefficient. This integral form shows DEIS's numerical solution of the inverse stochastic differential equation. By precisely controlling noise propagation along the integration path, it improves numerical stability and sampling efficiency.
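In discrete form, one DEIS step amounts to adding a weighted combination of noise predictions to the current state. The sketch below assumes the exponential-integrator weights $b_i(t)$ have been precomputed from the noise schedule; higher-order variants reuse noise predictions stored from previous steps.

```python
import torch

def deis_update(x_t, eps_history, b_weights):
    """One DEIS step: x_{t-1} = x_t + sum_i b_i(t) * eps_i.

    eps_history: list of noise predictions, most recent first (length = solver order s).
    b_weights:   matching exponential-integrator coefficients b_i(t), assumed precomputed.
    """
    update = torch.zeros_like(x_t)
    for b_i, eps_i in zip(b_weights, eps_history):
        update = update + b_i * eps_i
    return x_t + update
```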

3.1.2. HOS-DEIS (High-Order Stable DEIS Solver)

Before introducing the HOS-DEIS method, we first outline its structural improvements over traditional DEIS. Figure 4 shows a flowchart comparing the integral process from continuous ODEs to discrete time steps, with DEIS and HOS-DEIS side by side.
Mitigating Numerical Instability in DEIS: In traditional DEIS, when $\bar{\alpha}_t$ nears zero during the final editing stages, predicted images show abnormal brightness and noise at edges and fine details. Adding a small $\epsilon$ prevents division by zero but can cause large values, leading to precision loss and instability. To fix this, the formula is modified as follows:
$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t+\epsilon}\;\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t+\epsilon}}$$
This approach reduces numerical instability, making noise prediction easier, prevents artifacts and distortions, and ensures consistent editing across image regions.
Exponential Integrator Optimization: Based on the integral definition, we propose an inversion integral compensation term $\delta_t$ for DEIS, which corrects errors from discrete time steps. This term is weighted by a parameter that dynamically adjusts error accumulation during sampling, improving integral approximation accuracy:
$$\delta_t = \eta\,\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}}\,\sqrt{1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}+\epsilon}}$$
After introducing the compensation term $\delta_t$, the DEIS exponential integrator is modified to provide a new expression that retains noise modeling and reduces discretization errors, improving numerical stability and sampling quality:
$$x_{t-1} = x_t + \int_{t-\Delta t}^{t} \Psi(t, \tau)\,F(\tau)\,d\tau + \delta_t z_t$$
Large time steps in integration can cause error accumulation and reduce accuracy. To address this, we dynamically adjust an error compensation term based on the difference between predicted and true noise, improving numerical integration precision:
$$\Delta x = x_{t-1} - x_t - \left(\int_{t-\Delta t}^{t} \Psi(t, \tau)\,F(\tau)\,d\tau + \delta_t z_t\right)$$
Finally, the error in the sampling step is represented by $\Delta x$ and quantified using the integral expression above. This method corrects sampling errors and supports further error control and model optimization, improving overall stability and precision.
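A possible realization of the stabilized prediction and the compensation term is sketched below; eta and eps are assumed hyperparameters, and base_integral stands for the DEIS integral term computed elsewhere.

```python
import torch

def stabilized_predict_x0(x_t, eps_pred, alpha_bar_t, eps=1e-8):
    """x0_hat with the extra eps in the denominator to avoid blow-up as abar_t -> 0."""
    num = x_t - (1.0 - alpha_bar_t + eps) ** 0.5 * eps_pred
    return num / (alpha_bar_t + eps) ** 0.5

def compensation_term(alpha_bar_t, alpha_bar_prev, eta=0.1, eps=1e-8):
    """Dynamic error-compensation weight delta_t (eta is an assumed strength)."""
    return eta * ((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) ** 0.5 \
               * (1.0 - alpha_bar_t / (alpha_bar_prev + eps)) ** 0.5

def hos_deis_step(x_t, base_integral, alpha_bar_t, alpha_bar_prev, eta=0.1):
    """x_{t-1} = x_t + integral term + delta_t * z_t, with z_t ~ N(0, I)."""
    z_t = torch.randn_like(x_t)
    delta_t = compensation_term(alpha_bar_t, alpha_bar_prev, eta)
    return x_t + base_integral + delta_t * z_t
```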
Logarithmic Domain Coefficient Remapping: During the diffusion process, the transition from time step $t$ to $t-1$ requires a scaling coefficient:
$$c_1 = \sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_t}}$$
The coefficient $c_1$ adjusts signal strength during denoising by compensating for the attenuation caused by noise addition. For example, $c_1 > 1$ indicates amplification to restore the original signal.
When $\bar{\alpha}_t$ nears 0 and $\bar{\alpha}_{t-1}$ is slightly larger but still small, the division can underflow, leading to large output fluctuations. Multiple inversion-editing steps may also distort image semantics. To prevent this, a logarithmic transform and a stability term $\epsilon$ are introduced:
$$\log c_1 = \frac{1}{2}\log\frac{\bar{\alpha}_{t-1}+\epsilon}{\bar{\alpha}_t+\epsilon}$$
This enhances signal structure during diffusion and denoising, improving signal strength precision and preserving spatial structure and semantics during multi-step sampling. It reduces structural distortion and detail loss from noise or step discretization.
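The remapping can be implemented directly in log space before exponentiating back; a minimal sketch with scalar schedule values is shown below, where eps is the assumed stabilizer.

```python
import math

def scaling_coefficient(alpha_bar_prev: float, alpha_bar_t: float, eps: float = 1e-8) -> float:
    """c1 = sqrt(abar_{t-1} / abar_t), computed in log space to avoid underflow."""
    log_c1 = 0.5 * (math.log(alpha_bar_prev + eps) - math.log(alpha_bar_t + eps))
    return math.exp(log_c1)
```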
Multi-Order Correction Mechanism: In the numerical implementation of the original DEIS integrator, the local truncation error is typically $E_{\text{local}} = O(h^2)$, where the time step $h = t - (t-1) = 1$. Although the single-step error is small, in the global multi-step integration process the error accumulates with the number of sampling steps $N$, resulting in a global error $E_{\text{global}} = N \cdot O(h^2) = N \cdot O(1)$. Furthermore, according to the error accumulation formula, as the number of sampling steps increases, the accumulated error is approximately
$$\sum_{t=1}^{T} E_t \approx T \cdot O(h^2) = T \cdot O(1)$$
This means that over a longer DEIS sampling interval, for example $T = 30$, the error may reach $30 \times 0.01 = 0.3$, which is significant for image tasks with pixel values in the range $[-1, 1]$. Therefore, starting from the standard first-order DEIS formulation $\epsilon_\theta^{(1)} = \epsilon_\theta(x_t, t)$, we further propose second- and third-order differences to correct the insufficient precision of DEIS in first-order integral calculations. The second- and third-order differences are as follows:
$$\epsilon_\theta^{(2)} = c_2\left[\epsilon_\theta(x_t, t) - \epsilon_\theta(x_{t+1}, t+1)\right]$$
$$\epsilon_\theta^{(3)} = c_3\left[\epsilon_\theta(x_t, t) - 2\,\epsilon_\theta(x_{t+1}, t+1) + \epsilon_\theta(x_{t+2}, t+2)\right]$$
Here, $c_2 = e^{-\Delta t}\,\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}$ and $c_3 = e^{-2\Delta t}\left(\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\right)^{3/4}$. Time-decay factors are added to control the weight of historical information, and the exponent $3/4$ in $c_3$ is an empirical constant balancing precision and stability. The multi-order correction mechanism greatly improves the integrator's numerical precision and stability during high-step sampling, especially for pixel-sensitive editing tasks. It reduces error accumulation, preserving structural consistency and detail fidelity. Table 2 below quantifies this error reduction:
After second-order correction with high-order integration, the integral error drops by 50–70%. Third-order correction further improves detail fidelity in complex images, reducing the required sampling steps. During style conversion, this prevents banding or blurring in smooth gradients and complex texture areas caused by numerical errors.
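The second- and third-order differences can be sketched as follows; the stored noise predictions from the two preceding steps are assumed to be available, dt denotes the step size, and the 3/4 exponent follows the empirical constant above. The negative sign of the exponential decay factor is an assumption consistent with the stated role of time decay.

```python
import math

def multi_order_correction(eps_t, eps_t1, eps_t2, alpha_bar_prev, alpha_bar_t, dt=1.0):
    """Second- and third-order difference terms built from stored noise predictions.

    eps_t, eps_t1, eps_t2: eps_theta at steps t, t+1, t+2 (most recent first).
    """
    ratio = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    c2 = math.exp(-dt) * ratio                   # time-decay weight, 2nd-order term
    c3 = math.exp(-2.0 * dt) * ratio ** 0.75     # empirical 3/4 exponent, 3rd-order term
    eps2 = c2 * (eps_t - eps_t1)                 # first difference of noise predictions
    eps3 = c3 * (eps_t - 2.0 * eps_t1 + eps_t2)  # second difference of noise predictions
    return eps2, eps3
```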

3.2. MS-WRBA (Multi-Scale Windowed Residual Bridge Attention)

Recently, a considerable amount of research has adopted large-scale diffusion models [5,27,34], employing various denoising methods. Following the original noise addition and removal processes in diffusion models, enhancing and optimizing the attention mechanisms of diffusion models themselves has become a new research direction. Our approach differs from traditional cross-attention methods [35,36] by utilizing the Multi-Scale Windowed Residual Bridge Attention (MS-WRBA) module.
This module uses multi-scale window partitioning and dual-path normalization. It divides input features into local windows to compute attention independently, capturing fine details while integrating global context. Normalization and residual connections stabilize gradients at edges, reducing artifacts. Local features are then recombined into global features, enabling natural blending and smooth transitions in edited regions.
In our method (Figure 5), to optimize the module's performance, we partition the input tensor using square local windows while maintaining the original sequence length. Specifically, we divide $X \in \mathbb{R}^{B \times H \times W \times C}$ into windows of size $w$, where $H$ and $W$ denote the height and width of the input tensor and $C$ is the number of channels. The partitioning produces multiple non-overlapping local windows of size $w \times w$ while preserving the original sequence length. The partitioned tensor is represented as
$$X_{\text{window}} = \mathrm{Partition}(X, w)$$
Here, $w$ is the window size, $X_{\text{window}} \in \mathbb{R}^{B \times N \times w^2 \times C}$, and $N$ is the number of windows, with $N = \frac{H \times W}{w^2}$.
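Window partitioning and its inverse can be expressed with simple reshapes, as in the sketch below (assuming H and W are divisible by the window size w); the helpers are reused in the attention sketch at the end of this subsection.

```python
import torch

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """Split x of shape (B, H, W, C) into non-overlapping w x w windows.

    Returns a tensor of shape (B * N, w*w, C), where N = H*W / w^2.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(B * (H // w) * (W // w), w * w, C)

def window_reverse(windows: torch.Tensor, w: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition: reassemble windows into (B, H, W, C)."""
    B = windows.shape[0] // ((H // w) * (W // w))
    x = windows.view(B, H // w, W // w, w, w, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(B, H, W, -1)
```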
Within each window, we compute the $Q$, $K$, and $V$ matrices to establish relationships among local features:
$$Q = W_q X_{\text{window}}, \quad K = W_k X_{\text{window}}, \quad V = W_v X_{\text{window}}$$
Here, $W_q$, $W_k$, and $W_v$ are the linear transformation matrices for query, key, and value, respectively. They map input features to different semantic spaces, ensuring that the computation expresses semantic information in a structured manner within each window. Before computing the cross-attention matrices, we apply LayerNorm to the input features to enhance the model's generalization capability, yielding the fine-scale and coarse-scale feature windows $\tilde{X}$ and $\tilde{Y}$. Subsequently, we apply linear transformations to obtain the query, key, and value matrices:
$$Q = \tilde{X} W_Q$$
$$K_X = \tilde{X} W_{K_X}, \quad K_Y = \tilde{Y} W_{K_Y}$$
$$V_X = \tilde{X} W_{V_X}, \quad V_Y = \tilde{Y} W_{V_Y}$$
Subsequently, we employ the cross-attention method, incorporating an attention mask to restrict the scope of attention and prevent information interaction between windows:
$$\mathrm{Attention}(Q, K_X, V_X) = \mathrm{Softmax}\!\left(\frac{Q K_X^{T}}{\sqrt{p_k}} + M_X\right) V_X$$
$$\mathrm{Attention}(Q, K_Y, V_Y) = \mathrm{Softmax}\!\left(\frac{Q K_Y^{T}}{\sqrt{p_k}} + M_Y\right) V_Y$$
Here, $p_k$ is the dimension of the key vectors, used to scale the dot-product attention so as to stabilize gradient propagation and avoid numerical instability, and $M_X$ and $M_Y$ are attention masks ensuring that attention computation is confined to the current window, preventing cross-window information interaction and reducing the cost of global cross-attention. In this way, the model efficiently captures long-range dependencies among features within local windows. The two attention results are fused, processed through a convolutional layer, and then combined with the normalized input features via a residual connection to obtain the window features.
After completing the attention computation within the windows, to restore the original sequence structure and integrate local and global information, we recombine the processed window features into a global feature representation through a reverse windowing operation. This process is as follows:
$$X_{\text{window}} = \mathrm{Conv}\big(\big[\mathrm{Attn}(Q, K_X, V_X),\ \mathrm{Attn}(Q, K_Y, V_Y)\big]\big) + \tilde{X}$$
$$X_{\text{out}} = \mathrm{ReverseWindows}(X_{\text{window}}, L)$$
Here, $L$ is the original sequence length, and $X_{\text{out}} \in \mathbb{R}^{B \times L \times C}$ is the final output, with dimensions consistent with the input sequence. The window reversal operation reassembles the local window features into a global sequence, enabling the model to leverage both local features and global context information.
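A compact sketch of the per-window dual-path cross-attention and residual fusion described above is given below, reusing the window helpers sketched earlier. The layer shapes, the 1 × 1 convolution used for fusion, and the additive mask format are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedDualCrossAttention(nn.Module):
    """Sketch of MS-WRBA's per-window attention: one query path attends to
    fine-scale (X) and coarse-scale (Y) keys/values, the results are fused by
    a convolution and added back through a residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.norm_x = nn.LayerNorm(dim)
        self.norm_y = nn.LayerNorm(dim)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_kx = nn.Linear(dim, dim, bias=False)
        self.w_vx = nn.Linear(dim, dim, bias=False)
        self.w_ky = nn.Linear(dim, dim, bias=False)
        self.w_vy = nn.Linear(dim, dim, bias=False)
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=1)  # fuse the two attention paths

    def forward(self, x_win, y_win, mask=None):
        # x_win, y_win: (num_windows, w*w, dim); mask: additive attention mask or None
        x_n, y_n = self.norm_x(x_win), self.norm_y(y_win)
        q = self.w_q(x_n)
        scale = q.shape[-1] ** -0.5

        def attend(k, v):
            logits = q @ k.transpose(-2, -1) * scale
            if mask is not None:
                logits = logits + mask          # confine attention to the current window
            return F.softmax(logits, dim=-1) @ v

        a_x = attend(self.w_kx(x_n), self.w_vx(x_n))   # fine-scale path
        a_y = attend(self.w_ky(y_n), self.w_vy(y_n))   # coarse-scale path
        fused = torch.cat([a_x, a_y], dim=-1).transpose(1, 2)
        fused = self.fuse(fused).transpose(1, 2)
        return fused + x_n                              # residual bridge
```

In practice the module would be applied to the fine- and coarse-scale window tensors produced by window_partition and reassembled with window_reverse.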
Thus, MS-WRBA, by introducing multi-scale window partitioning and dual-path normalization mechanisms, significantly enhances the model’s perception of local regions, achieving fine-grained control over edited areas. This module independently computes attention weights within each local window, combining cross-scale feature interactions to precisely capture detail changes in edited objects while fully integrating global context information. This enables high-precision object editing in complex scenes. Additionally, MS-WRBA employs residual connections and normalization operations to significantly mitigate edge blurring of edited objects and ensures the consistency and natural transition of the background structure in the edited image through effective fusion of features within and outside the windows.

4. Experiments

Our method can adjust the overall structure of an image and edit real images. The following sections present the detailed results of our experiments, which are based on Stable Diffusion v1.4 [6].

4.1. Experimental Settings

4.1.1. Task Description

To comprehensively evaluate the proposed method, we conducted experiments across multiple creative and diverse image editing tasks. These included the following: (1) Clothing color replacement: Changing a character’s clothing color from one hue to another. (2) Object substitution: Replacing a bicycle with a motorcycle while preserving the original semantic relationship between the subject and vehicle. (3) Object enhancement: Adding sunglasses to a cat’s eyes. (4) Style transformation: Two distinct experimental cases involved converting photos into Van Gogh-style paintings and Chinese urban landscape paintings, respectively.

4.1.2. Evaluation Indicators

To comprehensively evaluate the effectiveness of the image editing results, we adopt three metrics: CLIP accuracy (CLIPAcc) [37], structural distance (Structure Dist) [38], and Background Learned Perceptual Image Patch Similarity (BG LPIPS). The latter is introduced specifically to assess background preservation during editing: it measures the LPIPS between the background regions of the original and edited images.
CLIPAcc is used to measure the similarity between edited images and target text descriptions. Specifically, CLIPAcc evaluates consistency by calculating the cosine similarity between the CLIP embedding vector of the edited image and that of the target text. The formula is expressed as
$$\mathrm{CLIPAcc} = \frac{\mathrm{CLIP}(I_{\text{edited}}) \cdot \mathrm{CLIP}(T_{\text{target}})}{\left\|\mathrm{CLIP}(I_{\text{edited}})\right\|\,\left\|\mathrm{CLIP}(T_{\text{target}})\right\|}$$
Here, $I_{\text{edited}}$ represents the edited image and $T_{\text{target}}$ represents the target text description.
Structure Dist is used to measure the structural similarity between the edited image and the original image. We evaluate the structural difference by calculating the Euclidean distance between the structural feature maps of the edited image and the original image. The formula is expressed as
$$\mathrm{Structure\ Dist} = \frac{1}{N}\sum_{i=1}^{N}\left\|S(I_{\text{edited}})_i - S(I_{\text{original}})_i\right\|_2$$
Here, $S(I)$ represents the structural feature map of image $I$, and $N$ represents the number of elements in the feature map.
BG LPIPS is used to measure the background region difference between the edited image and the original image. We use LPIPS (Learned Perceptual Image Patch Similarity) to measure the perceptual difference in the background region. The formula is expressed as
$$\mathrm{BG\ LPIPS} = \frac{1}{M}\sum_{j=1}^{M}\left\|\phi\big(B(I_{\text{edited}})_j\big) - \phi\big(B(I_{\text{original}})_j\big)\right\|_2$$
Here, $B(I)$ represents the background region of image $I$, $\phi$ represents the feature extraction function of the LPIPS network, and $M$ represents the number of patches in the background region.
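These metrics can be approximated with the publicly available clip and lpips Python packages, roughly as sketched below. The background mask handling is a simplification (the paper computes LPIPS over background patches), and the structural feature extractor for Structure Dist is omitted because it depends on the feature network of [38].

```python
import torch
import clip      # https://github.com/openai/CLIP
import lpips     # pip install lpips

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
lpips_fn = lpips.LPIPS(net="alex").to(device)

def clip_acc(edited_pil, target_text):
    """Cosine similarity between CLIP embeddings of the edited image and the target text."""
    image = clip_preprocess(edited_pil).unsqueeze(0).to(device)
    text = clip.tokenize([target_text]).to(device)
    with torch.no_grad():
        img_emb = clip_model.encode_image(image)
        txt_emb = clip_model.encode_text(text)
    return torch.cosine_similarity(img_emb, txt_emb).item()

def bg_lpips(edited, original, bg_mask):
    """LPIPS restricted to the background: images in [-1, 1], bg_mask in {0, 1}."""
    with torch.no_grad():
        return lpips_fn(edited * bg_mask, original * bg_mask).item()
```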

4.1.3. Experimental Environment

This study conducted experiments on a single RTX 4090 GPU (ASUS, Taiwan, China) equipped with 24 GB of video memory, using the Python 3.9 programming language and implementing two algorithms—MS-WRBA and HOS-DEIS—based on the PyTorch 2.3.0 deep learning framework. During image reconstruction, the diffusion model was loaded via the Hugging Face platform, utilizing its pre-trained weights and network architecture for initial configuration. To enhance reconstruction efficiency, we adopted the deterministic HOS-DEIS method, which iteratively maps input images into latent space through stepwise processing. In the editing phase, text prompt embeddings were adjusted in conjunction with the MS-WRBA mechanism to achieve precise image modifications. The default settings for HOS-DEIS during reconstruction included 50 steps and a solver order of 2, while the MS-WRBA employed eight square-shaped windows with encoding dimensions of 2 × 77 × 1024. This configuration yields an average latency of 0.38 s per 512 × 512 image, equivalent to 2.6 FPS.
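For reference, an analogous baseline configuration can be reproduced with Hugging Face diffusers, as sketched below; the stock DEISMultistepScheduler with solver_order=2 stands in for the authors' HOS-DEIS, which is not part of the public library.

```python
import torch
from diffusers import StableDiffusionPipeline, DEISMultistepScheduler

# Load Stable Diffusion v1.4 and swap in a DEIS scheduler (solver order 2),
# mirroring the 50-step, order-2 reconstruction settings reported above.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DEISMultistepScheduler.from_config(
    pipe.scheduler.config, solver_order=2
)

image = pipe("a Van Gogh style painting of a city street",
             num_inference_steps=50).images[0]
image.save("deis_baseline.png")
```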

4.1.4. Experimental Dataset

In this study, to comprehensively evaluate the performance of the proposed image editing method, we selected multiple representative and diverse open datasets that cover rich image content and varied editing tasks. These datasets not only exhibit significant differences in image types, editing operation categories, and data scale, but also provide a solid evaluation foundation for assessing the model’s generalization capabilities and editing outcomes.
Our experimental samples were drawn from three datasets: DreamBooth [8], DreamBooth-v2 [39], and DreamBench++ [40]. The original DreamBooth dataset contains 3000 images, while its v2 extension expands to 750 prompts, generating 3000 additional images. The advanced version further increases the number of prompts to 1350, covering 150 pairs of reference images. In our experiments, we randomly sampled 300 images from DreamBooth-v2 as experimental samples, as shown in Table 3 below, while using DreamBench++ as the prompt-image dataset for samples [8,39,40]. These datasets provide comprehensive image resources that enhance the representativeness and stability of experimental results. They achieve full coverage in terms of image content, editing task types, and sample size, laying a solid foundation for evaluating the robustness and generalizability of the proposed method.

4.2. Qualitative Results

The qualitative results in Figure 6 demonstrate the effectiveness of the proposed method across various image editing tasks, including clothing color replacement, bicycle-to-motorcycle conversion, cat sunglasses application, and image style transformation. In the style editing task, the integration of HOS-DEIS with dynamic error compensation enables the model to meticulously reconstruct multi-layered stylistic elements such as color, brushwork, and texture within the style image. Comparative images reveal that traditional methods often suffer from ambiguous style expression or detail loss during style transfer, whereas the proposed method achieves natural fusion and diverse expression of styles while preserving the original image structure. Particularly in scenarios with ambiguous implicit style definitions, HOS-DEIS significantly enhances the clarity and diversity of style representation. More experimental results on processing complex textures, high-resolution images, and other image editing tasks can be found in Appendix A.1. Some cases of editing failure can be found in Appendix A.2.
Furthermore, in fine-tuning tasks such as object addition, the MS-WRBA module significantly enhances the model’s perception of local regions through multi-scale window partitioning and residual bridging attention mechanisms.
Figure 7 presents experimental results showcasing various image editing tasks, including cake replacement, editing of cars and their backgrounds, and dog posture changes. The cake editing task involves substituting the original cake with different objects. For cars, emphasis is placed on modifying background elements, such as the street behind the vehicle, while maintaining semantic coherence. MSHEdit demonstrates a strong understanding of prompts like “blossom street,” “flooded street,” and “snowy street,” preserving overall structure and fine details like metallic textures. Dog editing focuses on posture changes, such as standing, while retaining structural integrity, texture, and seamless blending. Throughout, MSHEdit maintains breed and color consistency, achieving semantic fidelity between original and edited images.
To address potential background noise, we computed BG-LPIPS on the masked non-edited regions in Figure 6. The resulting value (0.044) is an order of magnitude smaller than the edit-region change (0.31), indicating that residual shadows do not introduce significant background disturbance.
These results fully validate the effectiveness of collaborative optimization between HOS-DEIS and MS-WRBA.

4.3. Quantitative Results

4.3.1. Performance Comparison

Table 4 shows the performance comparison of different methods on various indicators. The proposed method performs well in CLIPAcc, Structure Dist, and BG LPIPS.

4.3.2. Discussion of Experimental Results

In this section, we compare our method with some previous image editing methods based on diffusion models. As can be seen from Table 4, our method (HOS-DEIS + MS-WRBA) shows relatively excellent comprehensive performance in multiple image editing tasks.
As shown in Table 4 and Figure 7, while Prompt-to-Prompt effectively preserves the original image structure, its limited editing capacity makes it difficult to achieve extensive semantic modifications. Pix2Pix shows improved semantic consistency but still exhibits some blurriness during complex structural transformations. MagicQuill relies on manual user annotation for editing areas, and though it achieves higher CLIP scores, its limitations in precise localization and automation remain. In contrast, our proposed method not only strikes a better balance between semantic alignment and structural preservation but also significantly enhances both sampling efficiency and editing quality.
Compared with ICEdit [10], a strong recent mainstream model, MSHEdit exhibits comparable performance in our experiments, primarily due to its innovative architecture. The HOS-DEIS module enhances sampling accuracy and detail restoration through high-order integration and dynamic error compensation, particularly improving structure preservation and edge blending. The MS-WRBA module boosts editing region localization precision via multi-scale window partitioning and normalization, resulting in more natural and refined edits. Additionally, MSHEdit's training- and fine-tuning-free design enables quick adaptation to diverse editing tasks, showing strong generalization. Although slightly lower than ICEdit [10] in CLIP accuracy for some tasks, MSHEdit's advantages in structural retention and background consistency highlight its potential and practicality in image editing.
In quantitative evaluations of mainstream image editing tasks, the proposed method (HOS-DEIS + MS-WRBA) demonstrates superior performance across core metrics including CLIP accuracy, structural distance, and background LPIPS. HOS-DEIS significantly enhances sampling precision and detail reproduction through advanced integration mechanisms and dynamic error compensation, delivering edited images that outperform traditional sampling methods in both structural integrity and detail richness. Simultaneously, the multi-scale window residual bridging attention module in MS-WRBA effectively improves model localization capabilities for editing regions and edge fusion effects, ensuring natural transitions between edited objects and backgrounds in complex scenarios. The synergistic integration of HOS-DEIS and MS-WRBA enables the model to achieve high-quality, efficient text-driven image editing without requiring additional training or fine-tuning, while maintaining editorial accuracy, structural fidelity, and stylistic expression capabilities.

4.4. Ablation Experiment

Table 5 shows the performance comparison of different cross-attention methods on various indicators.
In this section, we present the components of each experimental group in the ablation study and demonstrate their outcomes. As shown in Table 5, there are four distinct ablation experiment configurations. The experiments evaluated the following four model combinations: (A) Baseline method (using only DDIM inversion); (B) incorporating only the MS-WRBA module; (C) incorporating only the HOS-DEIS module; (D) integrating both HOS-DEIS and MS-WRBA (full methodology).
Experimental results indicate that while baseline method A achieves basic image editing capabilities, it demonstrates mediocre performance in structural preservation and background consistency, with suboptimal localization of editing regions. The introduction of MS-WRBA (B) alone significantly enhances the model’s focus capability and semantic consistency within editing areas, enabling better identification and processing of target objects. However, there remains room for improvement in structural and background restoration. When HOS-DEIS (C) was introduced independently, sampling quality and structural preservation capabilities were notably enhanced, resulting in more natural and detailed editing outcomes. Nevertheless, the absence of attention guidance mechanisms led to a slight decrease in semantic alignment accuracy under certain task conditions.
When HOS-DEIS and MS-WRBA work synergistically (D), the model achieves optimal performance across all mainstream metrics. Specifically, CLIP accuracy, structural distance, and background LPIPS scores all reach peak levels, demonstrating that their combined approach not only enhances localization precision and detail reproduction in editing regions but also maintains semantic consistency and structural fidelity.
Comparative experiments further show that, in style transfer tasks, HOS-DEIS is the dominant contributor (around 65% of the improvement): by enhancing sampling accuracy through higher-order integration and dynamic error compensation, it yields more natural and clearer textures, improving CLIP accuracy by approximately 10% and reducing structural distance by around 15%. MS-WRBA, in turn, reduces background LPIPS by about 20%, mainly through precise localization and natural boundary transitions in edited regions.
In object replacement tasks, MS-WRBA plays the larger role, contributing around 60%: its multi-scale window partitioning and residual bridging attention effectively capture fine details, especially complex textures, reducing structural distance by about 10%. Meanwhile, HOS-DEIS improves CLIP accuracy by approximately 15%, notably restoring fine details and maintaining stylistic consistency.

4.5. Subjective Evaluation

To complement the quantitative metrics, a lightweight user study was conducted. Ten non-expert participants performed a two-alternative forced choice (2AFC) test on 20 image pairs generated by the proposed MSHEdit and the strongest baseline (ICEdit). Participants selected (i) the image that better followed the textual instruction and (ii) the one with a more natural structure. MSHEdit was preferred in 84.7% (170/200) of instruction-following trials and 82.0% (164/200) of naturalness trials, both significantly above chance (binomial test, p < 0.001, Cohen’s g = 0.33). The inter-rater agreement was 0.78 (ICC), indicating reliable preference. This mini evaluation confirms that the objective gains reported in Table 4 are perceived by human users.

4.6. Experimental Conclusions

Through systematic experimental evaluations, the proposed method demonstrates significant comprehensive advantages in diverse image editing tasks. First, the HOS-DEIS High-Order Stable Diffusion Sampler effectively reduces sampling steps while enhancing sampling accuracy and detail reproduction capabilities, thereby significantly improving editing efficiency and image quality. Second, the MS-WRBA multi-scale window residual bridging attention mechanism strengthens the model’s precise localization of editing regions and edge fusion capabilities, resulting in editing outcomes that outperform mainstream methods in structural preservation, semantic consistency, and stylistic expression.
Quantitative experiments demonstrate that the proposed model outperforms existing approaches in key metrics, including CLIP accuracy, structural distance, and background LPIPS, achieving balanced performance in semantic editing alignment, structural fidelity, and detail restoration. Ablation experiments further validate the independent contributions and synergistic effects of HOS-DEIS and MS-WRBA: While introducing either module alone enhances performance, their combined application delivers optimal overall performance.

5. Conclusions

In this work, we propose MSHEdit, a text-conditioned pre-trained diffusion model for image editing, with key innovations in the HOS-DEIS and MS-WRBA modules. HOS-DEIS enhances numerical stability and sampling accuracy during the reverse diffusion process by introducing higher-order integrators and dynamic error compensation mechanisms. Specifically, it employs logarithmic coefficient remapping to prevent detail loss due to numerical instability at small step sizes and dynamically compensates integration errors to achieve finer texture and color restoration in complex style transfer tasks. The MS-WRBA module improves editing precision and boundary blending through multi-scale window partitioning and residual bridging attention. By computing attention weights within local windows and integrating cross-scale feature interactions, this module effectively captures fine-grained changes in edited regions while mitigating gradient explosion and vanishing at edges through normalization, ensuring consistent and natural background transitions. As demonstrated by the experiment, this combined approach enables MSHEdit to achieve high-quality, efficient text-driven image editing without requiring training or fine-tuning, balancing semantic alignment, structural fidelity, and detail preservation. With a latency of 0.38 s per 512 × 512 image on an RTX 4090, the framework already approaches interactive speeds and can be further accelerated.
However, our model still has several limitations. First, when processing images with complex textures and fine structures, the model may fail to perfectly preserve or generate all details. Second, despite its training-free and fine-tuning-free design, the model's generalization ability and editing performance might be somewhat compromised when handling entirely novel and complex datasets. Additionally, processing high-resolution images still requires substantial computational resources, potentially failing to meet real-time requirements in high-demand applications. Finally, there remains room for improvement in the model's editing accuracy in extremely complex scenarios, particularly when dealing with intricate interactions between multiple objects.
In the meantime, HOS-DEIS inadequately accounts for chromatic channels in structural signal optimization, resulting in local color shifts. To address this, a color correction module based on high-order color space mapping and local color adjustment will be introduced to refine color distribution and ensure color accuracy and consistency. Moreover, to enhance MS-WRBA performance on complex textures and fine structures, we plan to incorporate multi-scale feature fusion to improve local detail capture and explore adaptive attention mechanisms that dynamically adjust attention weights according to task complexity, enabling more precise editing. Finally, to accelerate processing speed for high-resolution images and real-time applications, MSHEdit’s computational architecture will be optimized via pruning and quantization to reduce model parameters, alongside parallel computing strategies leveraging multi-GPU and distributed resources to speed up inference. These improvements aim to extend MSHEdit’s applicability, delivering higher quality and more efficient text-driven image editing and advancing the state of the art in this field.

Author Contributions

M.Y. is responsible for proposing the methods of the paper, conducting experiments, collecting data, and drafting the initial manuscript; J.X. and W.Y. are responsible for reviewing the initial manuscript; J.Y. is responsible for proofreading the content of the paper and proposing suggestions for revisions. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available upon request. The data used in this article are all sourced from open-source, accessible, and reusable datasets. Moreover, the dataset related to the human body in the article does not violate any relevant academic ethical standards.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT-4o for the purposes of translation and polishing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Appendix A.1. Other Successful Results

This section presents additional editing results (Figure A1) of MSHEdit across various tasks, including performance in complex scenarios. These results further demonstrate the applicability and robustness of MSHEdit in different editing contexts.
Figure A1. Performance of MSHEdit on different editing tasks, including but not limited to style transfer, object replacement, and attribute modification. These results not only demonstrate that MSHEdit can handle complex editing demands but also highlight its ability to preserve image structure and details.
As illustrated in Figure A1, MSHEdit achieves visually realistic and semantically consistent editing results across a variety of content and scene complexities. The model successfully handles diverse prompts, from object-centric transformations (e.g., a teddy bear’s posture or context) to scene-aware changes in urban, natural, and artistic environments. Notably, the method maintains high-fidelity global and local structures, and demonstrates adaptability to multiple styles and illumination conditions, underscoring the generalization ability of MSHEdit in practical applications.

Appendix A.2. Other Failed Results

This section presents several failure cases of MSHEdit editing. These examples help illustrate the model’s limitations and provide direction for future improvements.
Figure A2. Examples of failed editing results generated by MSHEdit, showing issues such as object misplacement, missing details, edit prompts not understood, or structural distortion under complex prompts.
From a technical perspective, the failure cases (Figure A2) mainly stem from limitations in high-order integration dynamics and the inherent difficulty of some complex prompts. Specific issues observed include inaccurate object placement, structural distortions, visual artifacts, and inconsistent rendering in scenes with challenging lighting or compositions. These errors often arise when the model overfits to prompt guidance at the expense of image fidelity, or when the noise schedule leads to suboptimal intermediate representations.
Regarding the impact of integration order, higher-order solvers (third order or beyond) can enhance editing control and convergence speed, but they also increase the risk of accumulated numerical error and instability in the generative process. In our experiments, a very high integration order occasionally caused unnatural textures, structural collapse, or semantic drift, especially in scenes with intricate content. Therefore, although high-order schemes usually yield better semantic alignment and faster refinement, increasing the order too aggressively can degrade output quality. This trade-off suggests that choosing an appropriate integration order is crucial for balancing editing precision and image realism in diffusion-based editing models.
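To spell out the scaling behind this trade-off, the following back-of-the-envelope decomposition is consistent with the orders listed in Table 2; the constants $C_1$, $C_2$ and the network error $\delta_\epsilon$ are illustrative placeholders rather than quantities measured in this work.

```latex
% Illustrative error decomposition for an order-p sampler taking N = T/h steps.
\begin{align}
  \text{local error per step}         &= O\!\left(h^{\,p+1}\right), \\
  \text{accumulated truncation error} &\approx N \cdot O\!\left(h^{\,p+1}\right) = O\!\left(T\,h^{\,p}\right), \\
  \text{total error}                  &\approx C_1\, T\, h^{\,p} \;+\; C_2(p)\,\delta_\epsilon .
\end{align}
```

Here $\delta_\epsilon$ denotes the noise-prediction error of the network, and $C_2(p)$ grows with the order because higher-order difference corrections subtract nearly equal noise predictions and thereby amplify that error. As $p$ increases, the first term shrinks while the second grows, which matches the behavior observed above and motivates choosing an intermediate order.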

References

  1. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015. [Google Scholar]
  2. Ho, J.; Jain, A.N.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Online, 6–12 December 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  3. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
  4. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  5. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv 2022, arXiv:2205.11487. [Google Scholar]
  6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  7. Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv 2022, arXiv:2211.09800. [Google Scholar]
  8. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  9. Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-based real image editing with diffusion models. arXiv 2022, arXiv:2210.09276. [Google Scholar]
  10. Zhang, Z.; Xie, J.; Lu, Y.; Yang, Z.; Tang, Y. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv 2025, arXiv:2504.20690. [Google Scholar]
  11. Kim, G.; Kwon, T.; Ye, J.C. DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. arXiv 2021, arXiv:2110.02711. [Google Scholar]
  12. Wang, Z.; Zhao, L.; Xing, W. Stylediffusion: Controllable disentangled style transfer via diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 7677–7689. [Google Scholar]
  13. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-prompt image editing with cross attention control. arXiv 2022, arXiv:2208.01626. [Google Scholar]
  14. Parmar, G.; Singh, K.K.; Zhang, R. Zero-shot image-to-image translation. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023; pp. 1–11. [Google Scholar]
  15. Ayoub, S.; Gulzar, Y.; Reegu, F.A.; Turaev, S. Generating image captions using bahdanau attention mechanism and transfer learning. Symmetry 2022, 14, 2681. [Google Scholar] [CrossRef]
  16. Ul Qumar, S.M.; Azim, M.; Quadri, S.M.K.; Alkanan, M.; Mir, M.S.; Gulzar, Y. Deep neural architectures for Kashmiri-English machine translation. Sci. Rep. 2025, 15, 30014. [Google Scholar] [CrossRef]
  17. Ünal, Z.; Gulzar, Y. Deep Learning Techniques for Image Clustering and Classification. In Modern Intelligent Techniques for Image Processing; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 37–62. [Google Scholar]
  18. Zhang, Q.; Chen, Y. Fast sampling of diffusion models with exponential integrator. arXiv 2022, arXiv:2204.13902. [Google Scholar]
  19. Mokady, R.; Hertz, A.; Aberman, K. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6038–6047. [Google Scholar]
  20. Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Cao, L.; Chen, S. Diffusion model-based image editing: A survey. arXiv 2024, arXiv:2402.17525. [Google Scholar] [CrossRef]
  21. Wei, Y.; Zheng, Y.; Zhang, Y. Personalized Image Generation with Deep Generative Models: A Decade Survey. arXiv 2025, arXiv:2502.13081. [Google Scholar] [CrossRef]
  22. Sangkloy, P.; Lu, J.; Fang, C.; Yu, F.; Hays, J. Scribbler: Controlling deep image synthesis with sketch and color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  23. Avrahami, O.; Lischinski, D.; Fried, O. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18208–18218. [Google Scholar]
  24. Liu, L.; Ren, Y.; Lin, Z.; Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv 2022, arXiv:2202.09778. [Google Scholar]
  25. Liu, B.; Wang, C.; Cao, T.; Jia, K.; Huang, J. Towards understanding cross and self-attention in stable diffusion for text-guided image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7817–7826. [Google Scholar]
  26. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10696–10706. [Google Scholar]
  27. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  28. Meng, C.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv 2021, arXiv:2108.01073. [Google Scholar]
  29. Nguyen, T.T.; Ren, Z.; Pham, T. Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era. arXiv 2024, arXiv:2411.09955. [Google Scholar] [CrossRef]
  30. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
  31. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  32. Zhang, Z.; Lin, M.; Yan, S. Easyinv: Toward fast and better ddim inversion. arXiv 2024, arXiv:2408.05159. [Google Scholar] [CrossRef]
  33. Weng, W.; Zhu, X. INet: Convolutional networks for biomedical image segmentation. IEEE Access 2021, 9, 16591–16603. [Google Scholar] [CrossRef]
  34. Sheikh, H.R.; Sabir, M.F.; Bovik, A.C. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 2006, 15, 3440–3451. [Google Scholar] [CrossRef]
  35. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  37. Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. CLIPScore: A reference-free evaluation metric for image captioning. arXiv 2021, arXiv:2104.08718. [Google Scholar]
  38. Tumanyan, N.; Bar-Tal, O.; Bagon, S.; Dekel, T. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10748–10757. [Google Scholar]
  39. Chen, W.; Hu, H.; Li, Y. Subject-driven text-to-image generation via apprenticeship learning. Adv. Neural Inf. Process. Syst. 2023, 36, 30286–30305. [Google Scholar]
  40. Peng, Y.; Cui, Y.; Tang, H. Dreambench++: A human-aligned benchmark for personalized image generation. arXiv 2024, arXiv:2406.16855. [Google Scholar]
Figure 1. We propose a text-driven image editing method based on a diffusion model. By providing a descriptive text and a reference image, our model can perform editing operations on images that align with the text’s semantics while preserving the structure of the input reference image. This approach eliminates the need for fine-tuning each example or providing reference descriptions for images.
Figure 2. Schematic overview of our framework, using the “Replace with an open paper book” editing task as an example. The framework implements text-guided image editing based on a diffusion model. The CLIP encoder separately extracts semantic representations from the original image and the textual prompt, generating conditional vectors that guide the diffusion process. During diffusion, the HOS-DEIS sampler controls the iteration through conditional transformation functions, while the MS-WRBA layer enhances the model’s understanding of the correlation between textual semantics and image content. Finally, the denoising module progressively generates output images that satisfy the text requirements.
Figure 3. Schematic diagram of the DEIS process. The upper section illustrates the DEIS inversion workflow, starting from the parameters $\alpha_t$ generated by the scheduler. Through multiple intermediate states $(x_T, \ldots, x_{t+1}, x_t)$, the explicit integrator $x_{t-1} = I_{\exp}(x_t, t, \Delta t)$ is applied iteratively to produce the final output $x_0$, which then proceeds to post-processing. The lower section details the computation for a single time step, combining the noise prediction module (NPM), the exponential integral correction (EIC), and higher-order correction terms (HCTs); these components achieve precise reverse diffusion through the exponential integral term $\exp\!\left(\int s(\tau)\,\mathrm{d}\tau\right)$ and the iterative update $x_{t+1} + C_k$.
Figure 4. The implementations of DEIS and HOS-DEIS are summarized as follows: DEIS approximates ODE solutions using interpolation and explicit integration, whereas HOS-DEIS adds dynamic error compensation, logarithmic coefficient remapping for stability, and multi-order difference corrections. These improvements make HOS-DEIS more accurate and reliable for diffusion model inversion.
Figure 5. MS-WRBA model schematic. The upper part depicts two sets of feature maps, $H$ and $H'$, at different spatial resolutions. These features are projected by the linear transformation matrices $W_V^X$, $W_K^X$, $W_Q$, $W_K^Y$, and $W_V^Y$ to obtain the query $Q$ and two sets of key–value pairs, $(K_X, V_X)$ and $(K_Y, V_Y)$. The module then performs cross-attention, followed by LayerNorm and Softmax for feature integration. Finally, the combined representations are refined through convolutional layers and residual connections to yield the output feature $X_{\mathrm{out}}$.
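As a concrete illustration of the attention pattern in Figure 5, the following PyTorch sketch implements a single-scale, single-head variant under a placeholder name (`WindowResidualBridgeAttention`); the window size is an arbitrary choice, and the full MS-WRBA additionally applies this pattern at multiple window scales and feature resolutions and fuses the results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowResidualBridgeAttention(nn.Module):
    """Illustrative single-head version of the cross-attention in Figure 5:
    queries come from one feature map, keys/values from two sources (X and Y),
    attention is computed inside non-overlapping windows, and the result is
    fused through LayerNorm, a 1x1 convolution, and a residual connection."""
    def __init__(self, dim: int, window: int = 8):
        super().__init__()
        self.window = window
        self.q = nn.Linear(dim, dim)
        self.kx, self.vx = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.ky, self.vy = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def _windows(self, x):
        # (B, H, W, C) -> (B * num_windows, window*window, C)
        B, H, W, C = x.shape
        w = self.window
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(-1, w * w, C)

    def forward(self, feat_x: torch.Tensor, feat_y: torch.Tensor) -> torch.Tensor:
        # feat_x, feat_y: (B, C, H, W) with H and W divisible by the window size.
        B, C, H, W = feat_x.shape
        x = feat_x.permute(0, 2, 3, 1)                   # to (B, H, W, C)
        y = feat_y.permute(0, 2, 3, 1)

        q = self._windows(self.q(x))
        kx, vx = self._windows(self.kx(x)), self._windows(self.vx(x))
        ky, vy = self._windows(self.ky(y)), self._windows(self.vy(y))

        # Bridge the two sources: concatenate their keys/values per window.
        k = torch.cat([kx, ky], dim=1)
        v = torch.cat([vx, vy], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        out = self.norm(attn @ v)                        # per-window attention output

        # Restore the spatial layout, then apply 1x1 conv refinement and a residual.
        w = self.window
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        out = out.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()
        return feat_x + self.proj(out)
```

For example, `WindowResidualBridgeAttention(dim=64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))` returns a tensor of the same shape as the first input, with the source features refined by windowed cross-attention to the second.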
Figure 6. Comparison of images under different tasks using different baselines. The original image and edited text are presented, along with the image editing results under different models. We compare them with recent research achievements, including MagicQuill, Prompt-to-Prompt, and Pix2Pix.
Figure 7. An example of image editing results using our method. For each original image (first column), we present all six different editing texts and their corresponding results. It can be observed that our method effectively preserves the original image’s structure without requiring additional descriptive text for assistance.
Table 1. The development history of diffusion models in the field of image editing. The development is divided into three stages, with the core characteristics and representative methods of each period listed.
Time Period | Stage Category | Key Characteristics | Representative Methods
Before 2020 | Early Era of Diffusion Models | Nonequilibrium thermodynamics-inspired generative theory proposed; image editing not yet addressed. | Deep Unsupervised Learning using Nonequilibrium Thermodynamics [1]
2020–2021 | Development of Diffusion Probabilistic Models | DDPM and DDIM introduced with efficient sampling, laying the technical groundwork for later editing. | DDPM [2], DDIM [4]
2022–present | Prosperous Era of Stable Diffusion-based Editing | Latent diffusion cuts computational cost; explosive growth of text-/instruction-/multi-condition editing frameworks; standardization underway. | Stable Diffusion [6], InstructPix2Pix [7], DreamBooth [8], MagicBrush, Imagic [9], ICEdit [10]
Table 2. This table compares local error, global error, and typical step counts across integration methods. Higher-order integrators generally achieve greater accuracy with fewer steps. Step count ranges are empirical and can be adjusted based on application needs.
Order of Integration | Local Error | Global Error | Step Requirements
Level 1 (DDIM) | $O(h^2)$ | $O(Nh^2)$ | 50–100 steps
Level 2 (DEIS) | $O(h^3)$ | $O(Nh^3)$ | 20–30 steps
Level 3 (HOS-DEIS) | $O(h^4)$ | $O(Nh^4)$ | 10–15 steps
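As a rough reading of Table 2, and assuming only that the global error behaves like $C\,N\,h^{\,p+1}$ with $h = T/N$ and illustrative constants, the toy calculation below shows why the required step count falls sharply as the integration order increases.

```python
# Toy illustration of Table 2's scaling: if the global error is roughly
# C * N * h**(p+1) with h = T / N, it equals C * T**(p+1) / N**p, so the
# number of steps needed for a target error shrinks quickly as the order p
# grows. The constants C and T are illustrative, not measured values.
def steps_for_target(p: int, target: float, T: float = 1.0, C: float = 1.0) -> int:
    # Solve C * T**(p + 1) / N**p <= target for the step count N.
    return max(1, round((C * T ** (p + 1) / target) ** (1.0 / p)))

for p, name in [(1, "Level 1 (DDIM)"), (2, "Level 2 (DEIS)"), (3, "Level 3 (HOS-DEIS)")]:
    print(name, steps_for_target(p, target=1e-2))
```

With these toy constants the script prints roughly 100, 10, and 5 steps for orders one to three, mirroring the qualitative trend of Table 2 rather than its exact ranges.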
Table 3. Experimental data introduction. This table summarizes the quantities and proportions of various data types selected from the DreamBench-v2 dataset. “Object” refers to images with static, inanimate main subjects; “Living subjects” denotes images featuring different animals or humans as the primary focus; “Styles” indicates landscape images used primarily for style transfer, where the main content is not prominent.
Data Type | Number of Images | Number of Prompts | Proportion
Objects | 150 | 150 | 50%
Living subjects | 90 | 90 | 30%
Styles | 60 | 60 | 20%
Total | 300 | 300 | 100%
Table 4. Comparison of diffusion-based image editing methods across different tasks. The first two tasks (clothing color replacement, bicycle → motorcycle) were evaluated using CLIP Acc, BG LPIPS, and Structure Dist, which assess editing strength, background preservation, and structural change, respectively. The other two tasks (cat wearing sunglasses, image style replacement) used only CLIP Acc and Structure Dist, as background reconstruction is irrelevant to them. Our method achieves competitive CLIP classification accuracy while best preserving input image details, as reflected in the lowest BG LPIPS and structure distance values. The best result for each metric is marked in bold in the published table.
Method | Change the Color of the Clothes (CLIP Acc ↑ / BG LPIPS ↓ / Structure Dist ↓) | Bicycle → Motorcycle (CLIP Acc ↑ / BG LPIPS ↓ / Structure Dist ↓) | The Cat Wore Sunglasses (CLIP Acc ↑ / Structure Dist ↓) | Image Style Replacement (CLIP Acc ↑ / Structure Dist ↓)
Prompt-to-Prompt | 18.4% / 0.342 / 0.095 | 66.0% / 0.327 / 0.085 | 34.0% / 0.082 | 30.8% / 0.079
Pix2Pix | 92.2% / 0.261 / 0.082 | 77.2% / 0.273 / 0.087 | 69.6% / 0.081 | 52.4% / 0.082
MagicQuill | 94.0% / 0.0667 / 0.057 | 87.9% / 0.269 / 0.069 | 71.2% / 0.028 | 74.3% / 0.063
ICEdit | 96.0% / 0.045 / 0.048 | 93.5% / 0.243 / 0.065 | 74.9% / 0.027 | 77.5% / 0.054
Ours (MSHEdit) | 86.4% / 0.044 / 0.044 | 92.8% / 0.241 / 0.063 | 74.6% / 0.025 | 77.6% / 0.052
Table 5. Ablation experiment. We conducted ablation studies under identical experimental conditions, adding one component at a time to evaluate its effect. The configurations were: the baseline without our proposed components (BL + DDIM), only the MS-WRBA module, only the HOS-DEIS module, and the full proposed model (HOS-DEIS + MS-WRBA). The best result for each metric is marked in bold in the published table.
Method | Change the Color of the Clothes (CLIP Acc ↑ / BG LPIPS ↓ / Structure Dist ↓) | Bicycle → Motorcycle (CLIP Acc ↑ / BG LPIPS ↓ / Structure Dist ↓) | The Cat Wore Sunglasses (CLIP Acc ↑ / Structure Dist ↓) | Image Style Replacement (CLIP Acc ↑ / Structure Dist ↓)
BL + DDIM | 85.3% / 0.183 / 0.123 | 72.0% / 0.279 / 0.087 | 37.6% / 0.085 | 32.4% / 0.082
MS-WRBA + DDIM | 90.1% / 0.162 / 0.078 | 84.4% / 0.191 / 0.072 | 62.4% / 0.081 | 35.2% / 0.081
BL + HOS-DEIS | 86.0% / 0.281 / 0.089 | 72.6% / 0.276 / 0.094 | 38.0% / 0.087 | 80.2% / 0.064
HOS-DEIS + MS-WRBA | 86.4% / 0.044 / 0.044 | 92.8% / 0.241 / 0.063 | 74.6% / 0.025 | 77.6% / 0.052
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
