A Unified Framework Based on Distribution Shift Modeling for Revealing and Eliminating Backdoor Attacks in Diffusion Models

Yang, Kairui; Gu, Xu; An, Fanglin; Ye, Jun; Zhang, Zhengqi

doi:10.3390/app16105077


by Kairui Yang ¹, Xu Gu ¹, Fanglin An ¹, Jun Ye ¹ and Zhengqi Zhang ²,*

¹ School of Cyberspace Security, Hainan University, Haikou 570228, China

² Key Laboratory of Data Science and Intelligence Education (Hainan Normal University), Ministry of Education, Shanwei Institute of Technology, Haikou 571158, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5077; https://doi.org/10.3390/app16105077

Submission received: 14 April 2026 / Revised: 16 May 2026 / Accepted: 16 May 2026 / Published: 19 May 2026


Abstract

Diffusion models have achieved groundbreaking progress in image generation, text-to-image, and other multimodal generation tasks, becoming the mainstream architecture in the field of generative artificial intelligence. However, studies have shown that diffusion models are vulnerable to backdoor attacks. By injecting specific triggers into the training data, attackers can manipulate the model to generate preset target images during the inference phase, posing a serious security threat. Existing defense methods suffer from three major limitations: detection methods typically rely on prior knowledge of specific attack types or require large amounts of real data; removal methods lack theoretical modeling of the intrinsic mechanism of backdoor injection; and there is no unified, low-data-dependency defense framework. To address the above issues, this paper proposes a unified defense framework named DIFFDEFEND. For the first time, it summarizes the essence of backdoor injection as “layer-by-layer propagation of distribution shifts” and designs a complete solution that achieves high-precision detection and effective removal without requiring real data. Specifically, this paper first proposes a multi-stage joint trigger inversion method that exploits the consistency constraints of distribution shifts across multiple time steps to achieve stable recovery of the trigger. Second, it constructs a dual-modal detector that combines the uniformity score of generated images with total variation loss to achieve high-precision identification of backdoored models. Finally, it designs a distribution-guided purification mechanism that freezes a clean reference model and optimizes the removal loss and retention loss, rapidly eliminating backdoor effects without relying on real data while preserving the model’s generation quality. 
Extensive experiments on three mainstream architectures—DDPM, NCSN, and LDM—and 13 different samplers demonstrate that DIFFDEFEND achieves near-100% detection accuracy, reduces the backdoor attack success rate to nearly 0, and keeps the model’s generation quality essentially unchanged, significantly outperforming existing methods.

Keywords:

diffusion models; backdoor attacks; distribution shift; trigger inversion; model purification

1. Introduction

Diffusion models (DMs), owing to their remarkable ability to generate high-quality images directly from Gaussian noise without adversarial training, have become one of the most advanced classes of generative models available today [1]. Since the introduction of DDPM [1], diffusion models have exhibited powerful generative capabilities across multiple domains, including image generation [2], text-to-image synthesis [3], image editing [4], and video generation [5]. They have also spawned widely adopted commercial applications such as Stable Diffusion [3] and DALL-E [6]. As diffusion models are deployed at scale, their security has increasingly drawn attention from both academia and industry.

A backdoor attack is a typical security threat against deep learning models. Attackers inject specific trigger patterns into the training data and modify the corresponding labels, causing the model to produce preset malicious outputs whenever the input contains the trigger [7]. In diffusion models, backdoor attacks manifest in a more covert manner: when the input noise is overlaid with a specific trigger pattern, the model generates the exact target image specified by the attacker, while retaining normal generative behavior under clean inputs [8]. Such attacks can lead to severe consequences—if attackers release backdoored models containing sensitive or illegal content onto open-source platforms, or offer backdoored models through Model-as-a-Service platforms, they will pose serious threats to social security and corporate reputation.

Unlike backdoor attacks on classification models, backdoor attacks on diffusion models require the attacker to meticulously design the forward diffusion process so that the model learns trigger-related distribution shifts at every denoising step. The research community has already proposed several backdoor attack methods for diffusion models, including BadDiff [9], TrojDiff [10], and VillanDiff [11]. These methods achieve over 99% attack success rates across various architectures such as DDPM, NCSN, and LDM, while keeping the model’s clean generative capability almost unchanged, thereby significantly increasing the difficulty of detection.

Research on backdoor defense for diffusion models is still in its infancy. Existing defense methods face three major challenges: first, detection approaches typically rely on prior knowledge of specific attack types (e.g., known trigger shapes or target image features), making them difficult to generalize to unknown attacks; second, most methods require large amounts of real data to train detectors or fine-tune models, yet the original training data is often unavailable in real-world deployments; and third, existing works lack theoretical modeling of the intrinsic mechanism of backdoor injection, resulting in a lack of systematic defense design. The ELIJAH framework [12] is the first backdoor defense method specifically designed for diffusion models. Its core idea is to invert the trigger via distribution shift and perform detection; however, it only performs inversion at the final step (t = T), leading to insufficient information utilization, and it requires training of a random forest classifier that depends on known backdoored models as references. To fill the aforementioned research gaps, this paper proposes the DIFFDEFEND framework, which, for the first time, summarizes the essence of backdoor injection as “layer-by-layer propagation of distribution shifts” and, on this basis, designs a complete backdoor detection and removal solution.

The main contributions of this paper are as follows.

Distribution shift propagation modeling: This paper reveals, for the first time, that the essence of backdoor attacks in diffusion models is the layer-by-layer propagation of trigger-related distribution shifts along the diffusion chain, and it establishes a unified mathematical description that provides the theoretical foundation for subsequent trigger inversion, detection, and removal.

Multi-stage joint trigger inversion: The trigger is inverted by leveraging consistency constraints of distribution shifts across multiple time steps. Compared with existing single-step inversion methods, this recovers the trigger more stably and accurately, overcoming the insufficient information utilization of single-step approaches.

Dual-modal backdoor detection: A detector combines the uniformity score of generated images with total variation loss, enabling high-precision detection without requiring real data and supporting two practical scenarios: with reference models (training a classifier) and without reference models (threshold-based judgment).

Distribution-guided purification mechanism: A purification method based on a clean reference model optimizes a removal loss and a retention loss, rapidly eliminating backdoor effects without relying on real data while preserving the model's generation quality.

Extensive experiments conducted on three mainstream architectures—DDPM, NCSN, and LDM—and across 13 different samplers demonstrate that DIFFDEFEND achieves near-100% detection accuracy, reduces the backdoor attack success rate to nearly 0, and keeps the model’s generation quality (FID) change below 0.05, significantly outperforming existing methods across multiple dimensions. The remainder of this paper is organized as follows: Section 2 reviews related work on backdoor attacks in diffusion models; Section 3 presents the distribution shift propagation modeling of backdoor injection; Section 4 details the three core modules of the DIFFDEFEND framework; Section 5 details the experimental results and analysis along with a comparison of experimental schemes, and Section 6 provides the conclusion of this paper.

2. Related Work

2.1. Forward Process of Diffusion Models

Diffusion models (DMs) are a class of deep generative models that learn and generate data samples from random noise by leveraging two core processes: the forward diffusion process and the reverse denoising process. Taking image training samples as an example, the forward diffusion process progressively adds noise to the data; after T noise-adding operations (typically 1000 steps), the original data in its initial state is completely submerged, ultimately forming standard Gaussian noise at step T. In the reverse denoising stage, researchers typically train a deep neural network that starts from the Gaussian noise at step T, progressively predicts and removes the noise, and finally restores the original data at step 0, as shown in Figure 1. After training on large-scale datasets, this neural network model can generate entirely new samples starting from any Gaussian noise.

Diffusion models are primarily categorized into the following types according to their modeling approaches. DDPM (Denoising Diffusion Probabilistic Model) [1] is the pioneering work in diffusion models, and its forward process is defined as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\right) \tag{1}$$

where $\alpha_t$ is the noise schedule parameter. $\alpha_t$ can be either a fixed or a learnable hyperparameter; it must be chosen such that the final distribution of $x_T$ becomes a standard Gaussian. $I$ is the identity matrix, so that $(1 - \alpha_t) I$ serves as the covariance matrix of the Gaussian noise. The sampling process of DDPM requires iterating through T steps, resulting in a relatively slow generation speed. As the step index increases, the value of $\alpha_t$ decreases, thereby increasing the proportion of noise $(1 - \alpha_t)$. Subsequently, we can apply the reparameterization trick to Equation (1) to sample $x_t$ from $x_{t-1}$ and the sampled noise $\epsilon$, as illustrated in Figure 2:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon \tag{2}$$

We can see that $x_t$ can be obtained from $x_{t-1}$ and $\alpha_t$; similarly, $x_{t-1}$ can be derived from $x_{t-2}$ and $\alpha_{t-1}$. By repeatedly applying the reparameterization trick, we can directly obtain $x_t$ from the clean sample $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0 \tag{3}$$

where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ is the cumulative product of the schedule and $\epsilon_0 \sim \mathcal{N}(0, I)$.
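As a minimal sketch of Equation (3), the snippet below draws $x_t$ directly from $x_0$ in a single step. The linear $\beta$ schedule with $\alpha_t = 1 - \beta_t$, its end points, and the function names `make_schedule` and `q_sample` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return (alpha, alpha_bar) for an assumed linear beta schedule."""
    beta = np.linspace(beta_start, beta_end, T)
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)  # cumulative product used in Equation (3)
    return alpha, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
alpha, alpha_bar = make_schedule()
x0 = np.ones((4, 4))                            # hypothetical clean sample
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)    # partially noised
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)     # almost pure Gaussian noise
```

Under this assumed schedule, $\bar{\alpha}_T$ is close to zero, so the signal term $\sqrt{\bar{\alpha}_T}\, x_0$ nearly vanishes at the final step.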

To accelerate the sampling process, Song et al. proposed DDIM (Denoising Diffusion Implicit Model) [13], which designs a non-Markovian forward process that allows the sampling steps to be skipped. This reduces the number of sampling steps to fewer than 50 while preserving generation quality. Subsequent research has further introduced a variety of efficient samplers, including ODE-based DPM-Solver [14], DPM-Solver++ [15], UniPC [16], and the SDE-based Heun sampler [17]. NCSN (noise-conditioned score network) [18] adopts a different modeling perspective from DDPM by learning the score function of the data distribution (i.e., the gradient of the log probability density) to guide Langevin dynamics sampling. Song et al. further demonstrated that NCSN and DDPM can be unified under the framework of Stochastic Differential Equations (SDEs) [19]. LDM (Latent Diffusion Model) [3] shifts the diffusion process from pixel space to latent space, compressing images into low-dimensional latent codes via a pre-trained variational autoencoder. Diffusion and denoising are then performed in the latent space, significantly reducing computational overhead.

2.2. Reverse Process of Diffusion Models

The reverse process starts from a random standard Gaussian distribution and progressively removes specific noise components, ultimately yielding a high-resolution image. The reverse process employs a deep neural network parameterized by $\theta$, characterized by the following denoising transition:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{4}$$

where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ denote the mean and variance estimated by the deep neural network, respectively. The network approximates the distribution by maximizing the log-likelihood:

$$\log p_\theta(x_0) = \log \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)\, dx_{1:T} \tag{5}$$

Since the forward process is a Markov chain, $x_t$ depends solely on $x_{t-1}$. Consequently, $q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0)$, which leads to the following derivation by Bayes' rule:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} \tag{6}$$

Here, $q(x_{t-1} \mid x_0)$ and $q(x_t \mid x_0)$ are obtained from Equation (3), while $q(x_t \mid x_{t-1}, x_0)$ is derived from Equation (2). Thus, $q(x_{t-1} \mid x_t, x_0)$ also follows a Gaussian distribution:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \mu_q(x_t, \epsilon_0),\ \Sigma_q(t)\right)$$

Through computation, we obtain

$$\mu_q(x_t, \epsilon_0) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}\, \sqrt{\alpha_t}}\, \epsilon_0, \qquad \Sigma_q(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$$

In the neural network, $\mu_\theta(x_t, t)$ takes the same form as $\mu_q(x_t, \epsilon_0)$, except that the true noise $\epsilon_0$ is replaced by the noise $\epsilon_\theta$ predicted by the network:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}\, \sqrt{\alpha_t}}\, \epsilon_\theta(x_t, t)$$

Therefore, we train the neural network to predict the noise at each step, and the reverse sampling step is given by

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sqrt{\frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}}\, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$.
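The reverse sampling step above can be sketched as follows. This is an illustrative implementation under stated assumptions: the linear $\beta$ schedule, the convention $\bar{\alpha}_0 = 1$, and the function name `p_sample_step` are not from the paper, and the trained network is replaced by an oracle that returns the true noise, so that a single step from $x_1$ recovers $x_0$.

```python
import numpy as np

def p_sample_step(x_t, t, eps_pred, alpha, alpha_bar, rng):
    """One reverse step x_t -> x_{t-1}; t is 1-indexed, eps_pred plays eps_theta(x_t, t)."""
    a_t = alpha[t - 1]
    abar_t = alpha_bar[t - 1]
    abar_prev = alpha_bar[t - 2] if t > 1 else 1.0   # assumed convention: alpha_bar_0 = 1
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(a_t)
    var = (1.0 - a_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(var) * noise

rng = np.random.default_rng(0)
alpha = 1.0 - np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bar = np.cumprod(alpha)

x0 = rng.standard_normal((3, 3))              # hypothetical clean sample
eps = rng.standard_normal((3, 3))
x1 = np.sqrt(alpha_bar[0]) * x0 + np.sqrt(1.0 - alpha_bar[0]) * eps

# With an oracle noise prediction, the variance at t = 1 is zero and x_0 is recovered.
x0_rec = p_sample_step(x1, 1, eps, alpha, alpha_bar, rng)
```

In practice `eps_pred` would come from the trained denoising network, and the step would be applied for $t = T, \dots, 1$.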

3. Proposed Method

3.1. Backdoor Attacks on Diffusion Models

Backdoor attacks represent an important research direction in the field of deep learning security. In classification models, attackers inject a trigger into the training data and modify the corresponding labels, causing the model to output a target class for any input containing the trigger [7]. In diffusion models, the objective of a backdoor attack is more complex: the attacker aims for the model to generate a target image when the input noise contains the trigger, while maintaining normal generation capability under clean inputs [8], as illustrated in Figure 3.

Consider a standard diffusion model, where the forward process gradually transforms a clean image $x_0$ into standard Gaussian noise $x_T \sim \mathcal{N}(0, I)$. Let $x_c^t$ denote the noise at step t under the clean distribution, which follows $x_c^t \sim \mathcal{N}(\mu_c^t, \Sigma_t)$, where $\mu_c^t$ is the mean and $\Sigma_t$ is the covariance matrix.

In a backdoor attack, the attacker modifies the forward process by introducing a trigger $\tau$, forcing the model to learn an association between the trigger and the target image. Let $x_b^t$ denote the noise at step t under the backdoored distribution, where the distribution undergoes a shift:

$$x_b^t \sim \mathcal{N}\left(\mu_c^t + \lambda^t \tau,\ \Sigma_t\right) \tag{7}$$

Here, $\lambda^t$ is the shift coefficient at step t, which controls the injection strength of the trigger. This coefficient takes different forms depending on the attack method: in BadDiff [9], $\lambda^t = 1 - \sqrt{\bar{\alpha}_t}$; in TrojDiff [10], $\lambda^t = k_t (1 - \gamma)$.

BadDiff [9], proposed by Chou et al., is among the earliest backdoor attack methods targeting diffusion models. It modifies the forward diffusion process of DDPM by introducing a trigger-related distribution shift at each step:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + (1 - \sqrt{\alpha_t})\, x_s^{\delta} + \sqrt{1 - \alpha_t}\, \epsilon \tag{8}$$

where $x_s^{\delta}$ denotes the image superimposed with the trigger, and the term $1 - \sqrt{\alpha_t}$ controls the shift intensity. Through this mechanism, the model learns the association between the trigger and the target image during the reverse process, thereby achieving backdoor injection. TrojDiff [10], proposed by Chen et al., designs a specialized backdoor diffusion process for both DDPM and DDIM, enabling more flexible trigger injection. The core idea is to define a backdoor diffusion process $q(x_t^{*} \mid x_{t-1}^{*})$ such that the final noise $x_T^{*}$ becomes a noise-perturbed trigger, thereby guiding the model to generate the target image during the reverse process. VillanDiff [11], introduced by Chou et al., is the first unified backdoor attack framework, which subsumes BadDiff as a special case and extends support to multiple diffusion model types, including NCSN and SDE-based models. This framework unifies backdoor injection from the perspective of likelihood-based training (minimizing the negative log-likelihood), giving the attack method greater generality.
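Returning to the BadDiff recursion in Equation (8), a short numerical check makes the link to the shift coefficient explicit: iterating the mean of Equation (8) from $x_0$ gives $\sqrt{\bar{\alpha}_t}\, x_0 + (1 - \sqrt{\bar{\alpha}_t})\, x_s^{\delta}$, so the accumulated weight on the trigger-carrying term equals the value $1 - \sqrt{\bar{\alpha}_t}$ quoted for BadDiff. The schedule and the vectors `x0` and `r` below are hypothetical values chosen only for illustration.

```python
import numpy as np

T = 1000
alpha = 1.0 - np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha_bar = np.cumprod(alpha)

x0 = np.array([0.3, -0.7, 1.2])   # hypothetical clean sample
r = np.array([1.0, 0.0, -1.0])    # hypothetical stand-in for x_s^delta

# Iterate the mean of Equation (8): m_t = sqrt(a_t) m_{t-1} + (1 - sqrt(a_t)) r
mean = x0.copy()
means = []
for t in range(T):
    mean = np.sqrt(alpha[t]) * mean + (1.0 - np.sqrt(alpha[t])) * r
    means.append(mean.copy())
means = np.array(means)

# Closed form: sqrt(alpha_bar_t) x0 + (1 - sqrt(alpha_bar_t)) r
sqrt_abar = np.sqrt(alpha_bar)[:, None]
closed_form = sqrt_abar * x0 + (1.0 - sqrt_abar) * r
max_gap = np.max(np.abs(means - closed_form))
```

The gap between the recursion and the closed form stays at floating-point level, and at the final step the mean is dominated by the trigger-carrying term.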

3.2. Distribution Shift Propagation

The pioneering framework ELIJAH [12] first formulated the backdoor effect in DMs as a macroscopic distribution shift. Building upon this insight, we mathematically formalize how this shift propagates microscopically layer by layer. For a backdoored model, the denoising network $M_\theta$ must learn to preserve and linearly scale the trigger-related distribution shift during the reverse process. Assuming local linearity of the neural network manifold around the clean input $x_c^t$, the first-order Taylor expansion of the network output with respect to the trigger shift is

$$M_\theta(x_c^t + \lambda^t \tau, t) \approx M_\theta(x_c^t, t) + J_M(x_c^t) \cdot \lambda^t \tau \tag{9}$$

where $J_M(x_c^t)$ is the Jacobian matrix of the network. During backdoor training, the attacker optimizes the model such that the first-order term $J_M(x_c^t) \cdot \lambda^t \tau$ aligns with $\lambda^{t-1} \tau$ to compensate for the temporal scaling coefficients. Consequently, taking the expectation over the clean noise distribution yields the following condition, as shown in Figure 4:

$$\mathbb{E}_{x_c^t}\left[M(x_c^t + \lambda^t \tau, t)\right] - \mathbb{E}_{x_c^t}\left[M(x_c^t, t)\right] = \lambda^{t-1} \tau \tag{10}$$

For the noise-predicting network $M$ (the subscript $\theta$ is dropped for brevity), since $\mathbb{E}[M(x_c^t, t)] \approx 0$ under clean inputs, Equation (10) can be simplified as

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[M(\epsilon + \lambda^t \tau, t)\right] = \lambda^{t-1} \tau \tag{11}$$

Equation (11) reveals a key characteristic of backdoored models: there exists a linear relationship between the model output and the input trigger, and this relationship remains consistent across all time steps. This characteristic forms the theoretical foundation of the defense framework proposed in this paper.

The linear relationship formulated in Equation (11) rests on several key modeling assumptions. First, we assume local linearity of the denoising network $M_\theta$, implying that within a certain range of the distribution shift $\lambda^t \tau$, the model's output response is proportional to the input shift. Second, we assume the stationarity and independence of the noise samples $\epsilon_0 \sim \mathcal{N}(0, I)$ across different time steps. However, these assumptions may encounter limitations in complex scenarios. For instance, if an attacker employs highly nonlinear triggers or if the defense is applied to multimodal models like text-to-image LDM where the distribution shift involves high-dimensional cross-modal conditioning, the linear approximation may degrade. To illustrate the domain of validity, our synthetic experiments on 2D Gaussian distributions show that the linear claim remains robust when the trigger magnitude is within the standard activation range of the network's residual blocks, but begins to break down when the shift pushes the latent features into the saturation regions of the activation functions. Future theorem statements in this framework will explicitly incorporate these boundary conditions to define the precise operational domain of our defense.
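The domain-of-validity remark can be illustrated with a deliberately simple toy, which is not the paper's 2D Gaussian experiment: a single tanh unit stands in for a saturating network, and the quantity $\mathbb{E}[M(\epsilon + s)] - \mathbb{E}[M(\epsilon)]$ from Equation (10) is estimated for several shift magnitudes $s$. The model, the shift values, and all names below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)   # shared noise samples (common random numbers)

def toy_model(x):
    """Hypothetical saturating 'network': a single tanh unit."""
    return np.tanh(x)

def response(shift):
    """Monte Carlo estimate of E[M(eps + shift)] - E[M(eps)]."""
    return np.mean(toy_model(eps + shift) - toy_model(eps))

shifts = np.array([0.1, 0.2, 1.0, 3.0, 5.0])
# Response per unit of shift; a constant value means the response is linear in the shift.
gains = np.array([response(s) / s for s in shifts])
```

For small shifts the gain is nearly constant, which is the regime where a linear relation like Equation (11) is a reasonable model; for large shifts the unit saturates and the gain drops, mirroring the breakdown described above.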

3.3. Backdoor Defense for Diffusion Models

Backdoor defense methods for classification models primarily fall into two categories. The first category is based on trigger inversion, exemplified by Neural Cleanse [7], which detects backdoors by optimizing a trigger pattern to alter the model’s predicted label for the target class. The second category is based on model sanitization, such as Fine-Pruning [20], which eliminates the influence of backdoor neurons through pruning and fine-tuning. However, these methods rely on the label information of classification models and cannot be directly transferred to generative diffusion models.

Research on backdoor defense for diffusion models is still in its early stages. ELIJAH [12], proposed by An et al., is the first backdoor defense framework specifically designed for diffusion models. This method performs trigger inversion based on distribution shift propagation, but it only utilizes the final step t = T to invert the trigger. It then determines whether the model is backdoored by evaluating the consistency of generated images. Upon detecting a backdoor, ELIJAH eliminates the backdoor effect through fine-tuning. However, ELIJAH has the following limitations: (1) it only uses the final step for trigger inversion, resulting in insufficient utilization of information and high sensitivity to the stochasticity of that step; (2) the detection phase requires training of a random forest classifier, which depends on known backdoored models as references; and (3) the removal phase requires real data to maintain generation quality.

Subsequent research has sought to address these issues. UFID [21], proposed by Guan et al., constructs a black-box input-level detection framework from the perspective of causal analysis, determining the presence of a backdoor by analyzing differences in the model’s responses to various inputs. Backdoor Sentinel [22], introduced by Wang et al., leverages temporal noise consistency for detection and detoxification, identifying backdoor patterns by comparing noise predictions across different time steps.

Building upon the aforementioned works, the DIFFDEFEND framework proposed in this paper is the first to systematically adopt distribution shift propagation as a unified modeling foundation. Through multi-stage joint inversion, dual-modality detection, and distribution-guided sanitization, DIFFDEFEND achieves more efficient and more generalizable defense performance.

3.4. Core Idea of Defense

Based on the above analysis, the defense framework is organized around three core questions:

Question 1 (Trigger Inversion): Given a model $M_\theta$ to be inspected, how can the trigger $\tau$ be recovered from Equation (11)? Since Equation (11) holds across all time steps, constraints from multiple time steps can be jointly utilized to solve for $\tau$, thereby improving the stability of the inversion.

Question 2 (Backdoor Detection): How can one determine whether the recovered trigger is a genuine backdoor trigger? Under trigger inputs, a backdoored model generates images that are highly consistent (all converging toward the target image), whereas a clean model produces diverse outputs. This discrepancy can be exploited to design detection metrics.

Question 3 (Backdoor Removal): How can the backdoor effect be eliminated without compromising generation quality? The output of a clean reference model can serve as a target, guiding the output of the backdoored model under trigger inputs toward the clean distribution.

4. Design of the DIFFDEFEND Framework

4.1. Overall Architecture

The DIFFDEFEND framework comprises three core modules, as illustrated in Figure 5. The input is the diffusion model $M_\theta$ to be inspected, and the outputs are the detection result and the sanitized model. The three modules are executed sequentially, forming a complete defense pipeline.

The Trigger Inversion Module (Section 4.2) takes the model to be inspected as input and recovers the trigger $\tau$ and the shift coefficients $\{\lambda^{t}\}$ by optimizing a joint loss function. This module leverages the consistency constraint of distribution shift across multiple time steps to ensure the stability and accuracy of the inversion results. The Dual-Modality Detection Module (Section 4.3) receives the inverted trigger, generates a set of images, and extracts two features: the uniformity score and the total variation loss. Depending on whether a reference model is available, this module employs either a random forest classifier or a threshold-based decision rule to determine whether the model is backdoored. The Distribution-Guided Sanitization Module (Section 4.4) is executed only when the model is identified as backdoored. This module freezes a clean reference model $M_f$ and fine-tunes the backdoored model by optimizing a removal loss and a retention loss, thereby eliminating the backdoor effect while preserving generation quality.

4.2. Trigger Inversion Module: Multi-Stage Joint Inversion

The ELIJAH framework [12] adopts a single-step inversion method, performing inversion using only the final step t = T. This method assumes $\lambda^{T} = 1$ and simplifies Equation (11) to

$$L_{inv}^{single} = \left\| \mathbb{E}_{\epsilon}\left[M(\epsilon + \tau, T)\right] - \lambda^{T-1} \tau \right\|^{2} \tag{12}$$

However, this simplification suffers from three limitations. First, the value of $\lambda^{T-1}$ is unknown, and setting it to 1 typically introduces systematic error. Second, single-step inversion heavily depends on the stochasticity of the network prediction at step T; when the predicted noise at that step fluctuates, the inversion results become unstable. Third, this approach fails to exploit the temporal information inherent in the diffusion chain, leaving the inversion under-constrained.

To address the limitations of single-step inversion, this paper proposes a multi-stage joint inversion method. The core idea is to leverage the consistency constraint of distribution shift across multiple time steps to construct an overdetermined system of equations for solving $\tau$ and $\lambda^{t}$. Let the selected set of time steps be $\mathcal{T} = \{t_1, t_2, \dots, t_k\}$, where typical choices include $t_1 = T$, $t_2 = T/2$, $t_3 = T/4$, and so forth. The joint inversion loss is defined as

$$L_{inv}(\tau, \{\lambda^{t}\}) = \sum_{t \in \mathcal{T}} \left\| \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[M(\epsilon + \lambda^{t} \tau, t)\right] - \lambda^{t-1} \tau \right\|_2^2 \tag{13}$$

Here, $\lambda^{t-1}$ for $t_1$ is taken as $\lambda^{t_0}$, and for other time steps, it corresponds to the coefficient of the preceding selected step. Equation (13) enforces that all selected time steps simultaneously satisfy the distribution shift propagation property, forming an overdetermined system. In this design, the combined effect of multiple constraints narrows the feasible solution space, yielding inversion results that more closely approximate the true trigger. Moreover, the stochasticity at individual time steps tends to average out, thereby enhancing the robustness of the inversion. However, since the expectation term $\mathbb{E}_{\epsilon}\left[M(\epsilon + \lambda^{t} \tau, t)\right]$ has no closed-form solution, we adopt a Monte Carlo approximation. Specifically, for each time step t, we sample B noise instances $\{\epsilon_i\}_{i=1}^{B} \sim \mathcal{N}(0, I)$ and compute

$$\hat{\mu}_t = \frac{1}{B} \sum_{i=1}^{B} M(\epsilon_i + \lambda^{t} \tau, t) \tag{14}$$

Substituting Equation (14) into Equation (13) yields the optimizable loss function:

$$L_{inv}^{MC}(\tau, \{\lambda^{t}\}) = \sum_{t \in \mathcal{T}} \left\| \hat{\mu}_t - \lambda^{t-1} \tau \right\|_2^2 \tag{15}$$

Conceptually, the parameters are updated by alternating optimization with learning rate $\eta$: with $\tau$ fixed, each $\lambda^{t}$ is updated via gradient descent, $\lambda^{t} \leftarrow \lambda^{t} - \eta \nabla_{\lambda^{t}} L_{inv}^{MC}$; with $\{\lambda^{t}\}$ fixed, $\tau$ is updated via gradient descent, $\tau \leftarrow \tau - \eta \nabla_{\tau} L_{inv}^{MC}$. In the actual implementation, $\tau$ and all $\lambda^{t}$ are updated simultaneously in each iteration. The complete inversion procedure is summarized in Algorithm 1.

The multi-stage joint inversion narrows the feasible solution space through constraints across multiple time steps. Compared with single-step inversion, it converges more stably and is less prone to getting trapped in local optima.

Algorithm 1: Multi-Stage Joint Trigger Inversion

Input: Diffusion model $M$, timestep set $\mathcal{T} = \{t_1, \dots, t_k\}$, number of iterations $N$, number of samples $B$, learning rate $\eta$.

Output: Inverted trigger $\tau$, shift coefficients $\{\lambda^{t}\}_{t \in \mathcal{T}}$.

1: Initialize $\tau \sim \mathcal{N}(0, I)$, $\lambda^{t} \leftarrow 1.0$ (for all $t \in \mathcal{T}$)

2: for iter = 1 to N do

3:     $L_{total} \leftarrow 0$

4:     for $t \in \mathcal{T}$ do

5:         Sample noise $\{\epsilon_i\}_{i=1}^{B} \sim \mathcal{N}(0, I)$

6:         Calculate $\hat{\mu}_t \leftarrow \frac{1}{B} \sum_{i=1}^{B} M(\epsilon_i + \lambda^{t} \tau, t)$

7:         $t_{prev} \leftarrow$ the previous timestep of t in $\mathcal{T}$ (if t is the first, use the last)

8:         $L_t \leftarrow \left\| \hat{\mu}_t - \lambda^{t_{prev}} \tau \right\|_2^2$

9:         $L_{total} \leftarrow L_{total} + L_t$

10:     end for

11:     $\tau \leftarrow \tau - \eta \nabla_{\tau} L_{total}$

12:     for $t \in \mathcal{T}$ do

13:         $\lambda^{t} \leftarrow \lambda^{t} - \eta \nabla_{\lambda^{t}} L_{total}$

14:     end for

15: end for

16: return $\tau$, $\{\lambda^{t}\}$
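The loss accumulation in steps 3 to 10 of Algorithm 1 can be sketched as a Monte Carlo estimate of Equation (15). The snippet is a sketch under stated assumptions and omits the gradient updates: the "backdoored" model is a hypothetical linear map that projects onto a known trigger direction so that Equation (11) holds by construction, the values in `true_lambdas` are invented for the example, and the previous step wraps around cyclically as in step 7.

```python
import numpy as np

def mc_inversion_loss(model, tau, lambdas, timesteps, n_samples, rng):
    """Monte Carlo estimate of the joint inversion loss (Equations (14) and (15)).

    model(x, t) maps a batch of shape (B, d) to (B, d); lambdas maps t -> lambda^t.
    """
    total = 0.0
    for k, t in enumerate(timesteps):
        t_prev = timesteps[k - 1]                 # k = 0 wraps around to the last entry
        eps = rng.standard_normal((n_samples, tau.size))
        mu_hat = model(eps + lambdas[t] * tau, t).mean(axis=0)
        total += np.sum((mu_hat - lambdas[t_prev] * tau) ** 2)
    return total

# Hypothetical toy setup, used only to exercise the loss.
timesteps = [1000, 500, 250]
true_lambdas = {1000: 1.0, 500: 0.8, 250: 0.5}
rng = np.random.default_rng(0)
d = 16
true_tau = rng.standard_normal(d)
P = np.outer(true_tau, true_tau) / (true_tau @ true_tau)   # projector onto the trigger

def toy_backdoored_model(x, t):
    """Linear toy model that satisfies Equation (11) for true_tau by construction."""
    k = timesteps.index(t)
    gain = true_lambdas[timesteps[k - 1]] / true_lambdas[t]
    return gain * (x @ P)

loss_true = mc_inversion_loss(toy_backdoored_model, true_tau, true_lambdas,
                              timesteps, 4096, rng)
loss_rand = mc_inversion_loss(toy_backdoored_model, rng.standard_normal(d),
                              true_lambdas, timesteps, 4096, rng)
```

On this toy, the loss is close to zero at the true trigger and much larger for a random candidate, which is the signal the optimizer follows. Because the toy is linear, any rescaling of the true trigger also attains a near-zero loss, so the example only checks the loss itself and makes no claim about convergence of the full procedure.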

For the practical implementation of the multi-stage joint inversion, we utilize the Adam optimizer to update both the candidate trigger $\tau$ and the shift coefficients $\lambda^{t}$. The trigger $\tau$ is initialized from a standard Gaussian distribution $\mathcal{N}(0, I)$ to ensure a broad search space, while all coefficients in $\lambda^{t}$ are initialized to 1.0. We set the initial learning rate for $\tau$ to 0.1 and for $\lambda^{t}$ to 0.01, incorporating a cosine annealing schedule to facilitate stable convergence over N = 500 iterations. In each iteration, we set the Monte Carlo sample size to B = 32 to balance gradient accuracy and computational overhead. The selection of the time-step set $\mathcal{T}$ follows a proportional sampling logic relative to the total sampling steps: for standard 1000-step DDPMs, we select $\mathcal{T}$ = {1000, 500, 250}, whereas for fast samplers with 50-step DDIMs, we select $\mathcal{T}$ = {50, 25, 12} to maintain consistent information density across the diffusion trajectory.
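The reported time-step sets are consistent with repeated integer halving of the total number of steps, and the learning rates follow a cosine annealing schedule. The sketch below encodes both; the halving rule is inferred from the two reported sets, and the minimum learning rate of zero and the function names are assumptions, since the text does not specify them.

```python
import math

def select_timesteps(total_steps, k=3):
    """Return [T, T//2, T//4, ...] with k entries (assumed halving rule)."""
    return [total_steps // (2 ** i) for i in range(k)]

def cosine_lr(base_lr, step, total_iters, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 to min_lr at step total_iters."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_iters))
    return min_lr + (base_lr - min_lr) * cos_term

ddpm_steps = select_timesteps(1000)      # [1000, 500, 250]
ddim_steps = select_timesteps(50)        # [50, 25, 12]
tau_lr_mid = cosine_lr(0.1, 250, 500)    # learning rate for tau halfway through N = 500
lam_lr_end = cosine_lr(0.01, 500, 500)   # learning rate for lambda at the last iteration
```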

4.3. Backdoor Detection Module: Dual-Modality Detector

Under trigger inputs, backdoored and clean models behave very differently: the outputs of a backdoored model are highly consistent (all converging toward the target image), whereas those of a clean model exhibit diversity. This discrepancy stems from the deterministic mapping learned by the backdoored model during training, in which a trigger input leads to the target image. Based on this observation, this paper designs a dual-modality detector to capture this difference from two complementary dimensions.

Feature 1: Uniformity Score. The uniformity score measures the degree of dispersion within a set of generated images. Given the inverted trigger $\tau$, the model is used to generate $n$ images $\{x_i\}_{i=1}^{n}$, and the average pairwise Euclidean distance is computed as

$$S_{\mathrm{uni}} = \frac{1}{n(n-1)} \sum_{i \neq j} \lVert x_i - x_j \rVert_2 \tag{16}$$

For a backdoored model, all generated images are close to the target image, resulting in a small value of $S_{\mathrm{uni}}$; for a clean model, the generated images are random and diverse, leading to a larger value of $S_{\mathrm{uni}}$. Therefore, $S_{\mathrm{uni}}$ can serve as an effective indicator for distinguishing between backdoored and clean models.
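A minimal sketch of the uniformity score in Equation (16) is given below. It assumes each image is supplied as a flat list of pixel values; the function name `uniformity_score` and the toy inputs are illustrative and are not taken from the released code.

```python
import math
from itertools import combinations


def uniformity_score(images):
    """Average pairwise Euclidean distance over all ordered pairs i != j (Eq. 16).

    `images` is a list of equally sized flat pixel vectors (an assumed layout).
    """
    n = len(images)
    if n < 2:
        raise ValueError("need at least two images")
    # Each unordered pair appears twice in the sum over i != j.
    pair_sum = sum(math.dist(a, b) for a, b in combinations(images, 2))
    return 2.0 * pair_sum / (n * (n - 1))


if __name__ == "__main__":
    # Identical outputs, as expected from a backdoored model under the trigger.
    collapsed = [[0.5, 0.5, 0.5]] * 4
    # Mutually distinct outputs, standing in for a clean model.
    diverse = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(uniformity_score(collapsed))
    print(uniformity_score(diverse))
```

The collapsed batch yields a score of zero, while any batch with distinct images yields a strictly positive score, which is the separation the detector relies on.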

Feature 2: Average Total Variation Loss. The total variation (TV) loss measures the smoothness of an image and is defined as the sum, over all pixels, of the magnitude of the finite differences between adjacent pixels:

$$TV(x) = \sum_{i,j} \sqrt{(x_{i+1,j} - x_{i,j})^{2} + (x_{i,j+1} - x_{i,j})^{2}} \tag{17}$$

Backdoor target images are typically natural images with moderate smoothness, whereas clean models may generate chaotic samples under trigger inputs, resulting in abnormally high TV loss. Therefore, the average TV loss can serve as the second-dimension feature:

$$S_{\mathrm{tv}} = \frac{1}{n} \sum_{i=1}^{n} TV(x_i) \tag{18}$$
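The two definitions in Equations (17) and (18) can be sketched as follows for single-channel images stored as lists of rows. The boundary rule (forward differences that leave the image are treated as zero) is an assumption, since Equation (17) does not state how the last row and column are handled; the function names are illustrative.

```python
import math


def total_variation(img):
    """Isotropic total variation of a 2-D grayscale image (list of rows), Eq. (17).

    Forward differences that would step outside the image are taken as zero;
    this boundary rule is an assumption.
    """
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            dv = img[i + 1][j] - img[i][j] if i + 1 < h else 0.0
            dh = img[i][j + 1] - img[i][j] if j + 1 < w else 0.0
            tv += math.sqrt(dv * dv + dh * dh)
    return tv


def average_tv(images):
    """Mean TV over a batch of images, Eq. (18)."""
    return sum(total_variation(x) for x in images) / len(images)


if __name__ == "__main__":
    flat = [[1.0, 1.0], [1.0, 1.0]]      # perfectly smooth image
    striped = [[0.0, 1.0], [0.0, 1.0]]   # one vertical edge
    print(total_variation(flat), total_variation(striped))
    print(average_tv([flat, striped]))
```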

This paper designs two detection strategies, corresponding to scenarios with and without available reference models.

Strategy 1: With Reference Model (Random Forest Classifier). When a set of clean models $\{M_c\}$ and known backdoored models $\{M_a\}$ is accessible, a supervised learning approach is adopted to train a classifier. The specific steps are as follows: For each clean model $M_c$, perform trigger inversion to obtain $\tau$, compute the feature vector $f_c = [S_{\mathrm{uni}}, S_{\mathrm{tv}}]$, and label it as 0. For each backdoored model $M_a$, perform trigger inversion to obtain $\tau$, compute the feature vector $f_a = [S_{\mathrm{uni}}, S_{\mathrm{tv}}]$, and label it as 1. A random forest classifier is then trained on the labeled data, with the number of trees set to 100 and the maximum depth set to 10. For a model under inspection, its feature vector is computed and fed into the classifier to obtain the prediction result.

Strategy 2: Without Reference Model (Threshold-Based Decision). When known backdoored models are unavailable, decision thresholds are established solely from the feature distribution of clean models. The specific steps are as follows: For a set of clean models $\{M_c\}$, compute the set of feature vectors $\{f_c\}$. For each feature, calculate the mean $\mu$ and standard deviation $\sigma$, and set the threshold $\theta = \mu - 3\sigma$ (corresponding to a per-feature false positive rate of approximately 0.1% if the clean-model feature is Gaussian). For a model under inspection, if $S_{\mathrm{uni}} < \theta_{\mathrm{uni}}$ and $S_{\mathrm{tv}} < \theta_{\mathrm{tv}}$, it is identified as a backdoored model. The complete detection procedure is summarized in Algorithm 2.

Feature Effectiveness Analysis. We analyze the effectiveness of the two features from the following two perspectives.

1: Let the difference between the generated images of a backdoored model under trigger inputs and the target image follow distribution $D_{\mathrm{back}}$, and let the generated images of a clean model under trigger inputs follow distribution $D_{\mathrm{clean}}$. If $\mathrm{Var}(D_{\mathrm{back}}) < \mathrm{Var}(D_{\mathrm{clean}})$, then the uniformity score can distinguish between backdoored and clean models. The uniformity score $S_{\mathrm{uni}}$ serves as a proxy for the variance of the generated images. When $\mathrm{Var}(D_{\mathrm{back}})$ is significantly smaller than $\mathrm{Var}(D_{\mathrm{clean}})$, the $S_{\mathrm{uni}}$ of the backdoored model is smaller than that of the clean model. Experiments show that images generated by a backdoored model are nearly identical (variance close to 0), whereas those generated by a clean model exhibit considerable variance.

2: Backdoor target images are natural images with moderate TV loss, whereas clean models tend to generate noisy, chaotic samples under trigger inputs, resulting in abnormally high TV loss. Therefore, $S_{\mathrm{tv}}$ can serve as the second-dimension feature.

To empirically validate the rationality of our thresholding strategy and demonstrate the superiority of our dual-modality detector over the baseline, we visualize the empirical distributions of the extracted features. Figure 6 and Figure 7 compare the distributions of $S_{\mathrm{uni}}$ and $S_{\mathrm{tv}}$ (evaluated on all 447 collected models) using triggers inverted by the single-step method (ELIJAH) and our multi-stage method (DIFFDEFEND), respectively. While the single-step inversion (Figure 6) can separate the two populations to some extent, the overlap between clean and backdoored models in the feature space remains noticeable due to incomplete trigger recovery. In contrast, our multi-stage joint inversion (Figure 7) yields highly accurate triggers, resulting in a much sharper separation margin in both the uniformity score and the logarithmic TV loss spaces. Furthermore, the bell-shaped curves of the clean models support our adoption of the Gaussian-based $\mu - 3\sigma$ rule. Consequently, our dual-modality detector achieves a favorable trade-off between the true positive rate and false positive rate, reaching a near-perfect AUC of 1.0000.
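The threshold-based strategy can be sketched in a few lines of Python. The sketch assumes that each clean model has already been reduced to a pair $(S_{\mathrm{uni}}, S_{\mathrm{tv}})$; the function names are hypothetical, and the use of the sample standard deviation is an assumption, because the text does not say which estimator is used.

```python
import statistics


def fit_thresholds(clean_features, k=3.0):
    """Return (theta_uni, theta_tv) = mean - k * std, computed per feature.

    `clean_features` is a list of (S_uni, S_tv) pairs from clean models.
    The sample standard deviation is an assumption.
    """
    thetas = []
    for column in zip(*clean_features):
        mu = statistics.mean(column)
        sigma = statistics.stdev(column)
        thetas.append(mu - k * sigma)
    return tuple(thetas)


def is_backdoored(s_uni, s_tv, thresholds):
    """Flag a model only if both features fall below their thresholds."""
    theta_uni, theta_tv = thresholds
    return s_uni < theta_uni and s_tv < theta_tv


if __name__ == "__main__":
    # Toy feature values, not measurements from the paper.
    clean = [(10.0, 100.0), (12.0, 110.0), (14.0, 120.0)]
    thresholds = fit_thresholds(clean)
    print(thresholds)
    print(is_backdoored(0.5, 50.0, thresholds))
    print(is_backdoored(11.0, 105.0, thresholds))
```

Because the decision is a conjunction, a model is flagged only when its outputs are both unusually uniform and unusually smooth relative to the clean population.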

Algorithm 2: Bimodal Backdoor Detection

Input: Model $M$ to be tested, inverted trigger $\tau$, number of samples $n$, reference model set $M_{\mathrm{ref}}$ (optional), threshold $\theta$ (if no reference models)

Output: Whether backdoored (True/False)

1: Initialize images ← []

2: for i = 1 to n do

3: Sample noise $\epsilon \sim \mathcal{N}(\tau, I)$

4: Generate image $x \leftarrow \mathrm{Sample}(M, \epsilon)$

5: images ← images ∪ {x}

6: end for

7: Calculate uniformity score $S_{\mathrm{uni}} \leftarrow \frac{1}{n(n-1)} \sum_{i \neq j} \lVert \mathrm{images}[i] - \mathrm{images}[j] \rVert_2$

8: Calculate average TV loss $S_{\mathrm{tv}} \leftarrow \frac{1}{n} \sum_{i=1}^{n} TV(\mathrm{images}[i])$

9: if $M_{\mathrm{ref}}$ is not empty then

10: Load pre-trained random forest classifier RF

11: return $RF.\mathrm{predict}([S_{\mathrm{uni}}, S_{\mathrm{tv}}])$

12: else

13: return $(S_{\mathrm{uni}} < \theta_{\mathrm{uni}})$ and $(S_{\mathrm{tv}} < \theta_{\mathrm{tv}})$

14: end if

4.4. Backdoor Removal Module: Distribution-Guided Sanitization

The goal of backdoor removal is to eliminate the backdoor effect while preserving the generation quality of the model. The core idea is to use the output of a clean reference model $M_f$ as the target distribution, guiding the output of the backdoored model under trigger inputs to align with the clean distribution. The distribution-guided sanitization mechanism proposed in this paper comprises two key designs.

Design 1: Clean Reference Model. Let $M_\theta$ denote the backdoored model to be sanitized. A frozen reference model $M_f$ is initialized as an exact parameter-level duplicate of $M_\theta$. The rationale for treating this frozen copy as “clean” is rooted in a fundamental characteristic of backdoor attacks in diffusion models: while the model is manipulated to respond to specific triggers, it maintains its benign generative capabilities and original data distribution when processing clean inputs. The output of $M_f$ under clean inputs can therefore serve as the target distribution, and freezing $M_f$ creates a stable anchor that represents the model’s learned benign utility. This design circumvents the practical difficulty of obtaining an independently trained clean model or real clean data in real-world scenarios, since the backdoored model’s own response to clean noise serves as a reliable surrogate for the target clean distribution.

Design 2: Dual Loss Function. A removal loss $L_{\mathrm{rem}}$ is defined to enforce the alignment between the model’s output under trigger inputs and the clean reference model’s output under clean inputs:

$$L_{\mathrm{rem}} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ \lVert M_\theta(\epsilon + \tau) - M_f(\epsilon) \rVert_2^2 \right] \tag{19}$$

A retention loss $L_{\mathrm{ret}}$ is defined to ensure that the model’s output under clean inputs remains unchanged, thereby preserving generation quality:

$$L_{\mathrm{ret}} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ \lVert M_\theta(\epsilon) - M_f(\epsilon) \rVert_2^2 \right] \tag{20}$$

The final total loss is given by

$$L_{\mathrm{total}} = L_{\mathrm{rem}} + \alpha L_{\mathrm{ret}} + \beta L_{\mathrm{dm}} \tag{21}$$

where $L_{\mathrm{dm}}$ denotes the training loss of the original diffusion model (which can be computed using clean data generated by the model itself), and $\alpha$ and $\beta$ are balancing hyperparameters (set to $\alpha = 1.0$ and $\beta = 0.1$ in this paper). The optimization procedure of the distribution-guided sanitization is summarized in Algorithm 3. The key steps include: (1) duplicating the parameters of the backdoored model as a frozen reference model; (2) in each iteration, sampling clean noise and trigger noise, and computing the removal loss and the retention loss; (3) if real data are available (optional), additionally computing the diffusion loss; and (4) updating the model parameters via backpropagation. The principle of the distribution-guided purification mechanism is illustrated in Figure 8.

Algorithm 3: Distribution-Guided Backdoor Purification

Input: Backdoored model $M_\theta$, inverted trigger $\tau$, learning rate $\eta$, number of iterations $E$, batch size $B$, hyperparameters $\alpha$, $\beta$, optional clean data $D_{\mathrm{clean}}$

Output: Purified model $M_\theta$

1: Copy $M_\theta$ as $M_f$, freeze $M_f$ parameters

2: for epoch = 1 to E do

3: Sample noise $\{\epsilon_i\}_{i=1}^{B} \sim \mathcal{N}(0, I)$

4: Calculate trigger inputs $\{\epsilon_i + \tau\}_{i=1}^{B}$

5: Forward propagation: $\mathrm{pred}_{\mathrm{trigger}} \leftarrow M_\theta(\epsilon + \tau)$, $\mathrm{pred}_{\mathrm{clean}} \leftarrow M_\theta(\epsilon)$, $\mathrm{pred}_{\mathrm{ref}} \leftarrow M_f(\epsilon)$

6: Calculate removal loss: $L_{\mathrm{rem}} \leftarrow \frac{1}{B} \sum_{i=1}^{B} \lVert \mathrm{pred}_{\mathrm{trigger}}[i] - \mathrm{pred}_{\mathrm{ref}}[i] \rVert_2^2$

7: Calculate retention loss: $L_{\mathrm{ret}} \leftarrow \frac{1}{B} \sum_{i=1}^{B} \lVert \mathrm{pred}_{\mathrm{clean}}[i] - \mathrm{pred}_{\mathrm{ref}}[i] \rVert_2^2$

8: $L_{\mathrm{total}} \leftarrow L_{\mathrm{rem}} + \alpha L_{\mathrm{ret}}$

9: if $D_{\mathrm{clean}}$ is not empty then

10: Calculate diffusion loss $L_{\mathrm{dm}} \leftarrow \mathrm{StandardDiffusionLoss}(M_\theta, D_{\mathrm{clean}})$

11: $L_{\mathrm{total}} \leftarrow L_{\mathrm{total}} + \beta L_{\mathrm{dm}}$

12: end if

13: Backpropagate to update $M_\theta$ parameters: $\theta \leftarrow \theta - \eta \nabla_\theta L_{\mathrm{total}}$

14: end for

15: return $M_\theta$
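To make the loss computation in steps 3 to 8 of Algorithm 3 concrete, the sketch below evaluates Monte Carlo estimates of the removal and retention losses for arbitrary callables. The "models" are hand-written toy functions on 4-dimensional vectors, not diffusion networks, and every name (`sanitization_losses`, `backdoored`, `sanitized`, `TRIGGER`, `TARGET`) is hypothetical. No gradient step is performed; the sketch only shows how the two losses respond before and after an idealized sanitization.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sanitization_losses(model, ref_model, trigger, batch_size=32, alpha=1.0, seed=0):
    """Monte Carlo estimates of L_rem, L_ret and L_rem + alpha * L_ret."""
    rng = random.Random(seed)
    l_rem = l_ret = 0.0
    for _ in range(batch_size):
        eps = [rng.gauss(0.0, 1.0) for _ in trigger]       # clean noise
        eps_trig = [e + t for e, t in zip(eps, trigger)]   # trigger input
        ref = ref_model(eps)                               # frozen reference
        l_rem += sq_dist(model(eps_trig), ref)
        l_ret += sq_dist(model(eps), ref)
    l_rem /= batch_size
    l_ret /= batch_size
    return l_rem, l_ret, l_rem + alpha * l_ret


# Toy stand-ins (assumptions for illustration only).
TRIGGER = [5.0, 5.0, 5.0, 5.0]
TARGET = [9.0, 9.0, 9.0, 9.0]


def _triggered(x):
    return sum(x) / len(x) > 2.5


def backdoored(x):
    """Identity on clean inputs, fixed target output when the trigger is present."""
    return list(TARGET) if _triggered(x) else list(x)


def sanitized(x):
    """Idealized outcome: trigger inputs are mapped to the clean response."""
    return [v - t for v, t in zip(x, TRIGGER)] if _triggered(x) else list(x)


if __name__ == "__main__":
    print(sanitization_losses(backdoored, backdoored, TRIGGER))  # before
    print(sanitization_losses(sanitized, backdoored, TRIGGER))   # after
```

In the "before" case the retention loss is zero because the model and its frozen copy agree on clean noise, while the removal loss is large; in the "after" case both are numerically zero.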

4.5. Analysis

4.5.1. Convergence Analysis

Theorem 1. Assume that a clean model satisfies $\mathbb{E}[M(\epsilon, t)] = 0$ and a backdoored model satisfies Equation (11). The joint loss $L_{\mathrm{inv}}$ attains its global minimum of 0 if and only if the model is backdoored and $\tau$ is the true trigger.

Proof of Theorem 1. Assumptions: We assume that the denoising network $M_\theta$ acts as an unbiased noise predictor under clean inputs, i.e., $\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[M(x_c, t)] \approx 0$, and that the distribution shift follows the linear propagation property defined in Equation (11). Sufficiency: If the model contains a backdoor and $\tau$ is the true trigger, Equation (11) holds for all $t \in T$. Substituting this into the joint loss $L_{\mathrm{inv}}$ in Equation (13) yields zero for each term, leading to $L_{\mathrm{inv}} = 0$. Necessity and Edge Cases: If $L_{\mathrm{inv}} = 0$, then for all $t \in T$, the condition $\mathbb{E}[M(\epsilon + \lambda^{t} \tau, t)] = \lambda^{t-1} \tau$ must be satisfied. A potential trivial solution exists when $\tau = 0$ (the degenerate trigger), which mathematically satisfies the equation but carries no semantic information. To ensure uniqueness, we impose a norm constraint $\lVert \tau \rVert_2 > 0$ during optimization. Under this constraint, for a clean model, the left-hand side is approximately zero while the right-hand side is non-zero for any $\tau \neq 0$, resulting in $L_{\mathrm{inv}} > 0$. Thus, a global minimum of zero uniquely identifies the presence of a backdoor and the corresponding non-trivial trigger $\tau$. □

4.5.2. Theoretical Analysis

Theorem 2. In the computation of the diffusion loss, assume that the backdoored model $M_\theta$ and the reference model $M_f$ are initially identical, and that after optimization, $L_{\mathrm{rem}} = 0$. Then, for any $\epsilon \sim \mathcal{N}(0, I)$, we have $M_\theta(\epsilon + \tau) = M_f(\epsilon)$. That is, the output under trigger input equals the output under clean input, and the backdoor effect is completely eliminated, as illustrated in Figure 9. From $L_{\mathrm{rem}} = 0$, it follows that $\mathbb{E}[\lVert M_\theta(\epsilon + \tau) - M_f(\epsilon) \rVert^{2}] = 0$; hence, almost everywhere, $M_\theta(\epsilon + \tau) = M_f(\epsilon)$. Since $M_f(\epsilon)$ is a clean output containing no backdoor information, the backdoor is eliminated. The retention loss $L_{\mathrm{ret}}$ ensures that $M_\theta(\epsilon) \approx M_f(\epsilon)$, thereby keeping the clean generation capability unchanged.

Proof of Theorem 2. Assumptions: We assume that the backdoored model $M_\theta$ and the reference model $M_f$ are initially identical in terms of their benign generative capabilities. When the removal loss $L_{\mathrm{rem}}$ reaches its global minimum of 0, the expectation $\mathbb{E}[\lVert M_\theta(\epsilon + \tau) - M_f(\epsilon) \rVert^{2}]$ vanishes. This implies that $M_\theta(\epsilon + \tau) = M_f(\epsilon)$ holds almost everywhere. Since $M_f(\epsilon)$ produces a clean output distribution derived from the model’s original benign utility, the trigger-induced distribution shift in $M_\theta$ is effectively neutralized. The uniqueness of the purified state is maintained by the retention loss $L_{\mathrm{ret}}$, which prevents the model from collapsing into trivial constant outputs by anchoring the updated parameters to the original clean distribution. □

5. Result Analysis

5.1. Experimental Design

Experiments are conducted on two datasets: CIFAR-10 [23] and CelebA-HQ [24]. CIFAR-10 consists of 60,000 color images of size 32 × 32 across 10 classes, with 50,000 images used for training and 10,000 for testing. CelebA-HQ contains 30,000 high-definition face images of size 256 × 256, which are downsampled to 128 × 128 for our experiments. The model architectures cover three mainstream diffusion model types: DDPM [1] employs a standard UNet architecture with a total of T = 1000 steps and a linear noise schedule. NCSN [18] is based on noise-conditioned score networks, using five noise scales and a RefineNet architecture. LDM [3] utilizes a pre-trained KL-f8 variational autoencoder to compress images into a latent space of size 32 × 32 × 4, with the UNet performing diffusion in this latent space. A total of 447 models are trained and collected, comprising 151 clean models and 296 backdoored models. Clean models are either downloaded from Hugging Face or trained using standard training code. Backdoored models are generated using the official codebases of BadDiff [9], TrojDiff [10], and VillanDiff [11], with injection rates ranging from 5% to 30%. In terms of attack methods, we evaluate three mainstream backdoor attacks: BadDiff [9] uses a white square as the trigger (located in the bottom-right corner, size 8 × 8), with the target image being a specific class image (e.g., “cat” in CIFAR-10, a specific face in CelebA). TrojDiff [10] employs a Hello Kitty pattern as the trigger (superimposed at the image center), with the target image being a specific class image. VillanDiff [11] is a unified framework that supports multiple trigger shapes (square, triangle, random noise) and various target images. 
To verify the adaptability of our method to different samplers, 13 samplers are used in the experiments: DDIM [13] (50 steps); DPM-Solver [14] (20 and 50 steps); DPM-Solver++ [15] (20 and 50 steps); UniPC [16] (20 and 50 steps); Heun [17] (20 and 50 steps); PNDM [25] (50 steps); and DEIS [26] (50 steps).

The evaluation metrics include the following.

Detection Accuracy (ACC): This is the proportion of correctly classified models, including both correctly identified backdoored models and correctly identified clean models.

Attack Success Rate (ASR) and Change in ASR (ΔASR): ASR is the proportion of generated images under trigger inputs whose Mean Squared Error (MSE) with the target image is below a threshold $\gamma$ (set to 0.1). The basic formula is defined as

$$\mathrm{ASR} = \frac{1}{N} \sum_{i=1}^{N} I\left( \lVert x_i - x_{\mathrm{target}} \rVert_2^2 < \gamma \right) \tag{22}$$

where $N$ is the total number of generated images, and $I(\cdot)$ is the indicator function. Accordingly, the relative change in ASR (ΔASR) is defined as

$$\Delta \mathrm{ASR} = \frac{\mathrm{ASR}_{\mathrm{after}} - \mathrm{ASR}_{\mathrm{before}}}{\mathrm{ASR}_{\mathrm{before}}} \tag{23}$$

A negative ΔASR indicates backdoor removal, with values closer to −1 signifying better performance.

Fréchet Inception Distance (FID) and Change in FID (ΔFID): FID measures the generation quality by calculating the distance between the feature representations of real and generated images:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{\frac{1}{2}} \right) \tag{24}$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariance matrices of the real and generated image features, respectively. The change in FID (ΔFID) is

$$\Delta \mathrm{FID} = \mathrm{FID}_{\mathrm{after}} - \mathrm{FID}_{\mathrm{before}} \tag{25}$$

A ΔFID close to 0 implies that the generation quality remains largely unchanged.

Structural Similarity Index (SSIM) and Change in SSIM (ΔSSIM): SSIM evaluates the perceptual similarity between the generated image $x$ and the target image $y$:

$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^{2} + \mu_y^{2} + C_1)(\sigma_x^{2} + \sigma_y^{2} + C_2)} \tag{26}$$

where $\mu$ and $\sigma^{2}$ denote the mean and variance of pixel intensities, $\sigma_{xy}$ is the covariance, and $C_1$ and $C_2$ are stabilizing constants. The relative change in SSIM (ΔSSIM) is defined as

$$\Delta \mathrm{SSIM} = \frac{\mathrm{SSIM}_{\mathrm{after}} - \mathrm{SSIM}_{\mathrm{before}}}{\mathrm{SSIM}_{\mathrm{before}}} \tag{27}$$

A negative ΔSSIM indicates effective backdoor removal.
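The ASR and the relative-change metrics can be sketched as follows. Two assumptions are made and should be read as such: the closeness test uses the per-pixel mean squared error, following the verbal description of the metric, and the relative change is normalized by the value before sanitization, which is what makes −1 the best attainable value. Function names and toy inputs are illustrative.

```python
def mse(a, b):
    """Per-pixel mean squared error between two flat images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def attack_success_rate(images, target, gamma=0.1):
    """Fraction of generated images within MSE `gamma` of the target image."""
    hits = sum(1 for img in images if mse(img, target) < gamma)
    return hits / len(images)


def relative_change(before, after):
    """(after - before) / before; assumes normalization by the pre-sanitization value."""
    if before == 0:
        raise ValueError("relative change is undefined when the initial value is 0")
    return (after - before) / before


if __name__ == "__main__":
    target = [1.0, 1.0]
    generated = [[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.9]]  # toy outputs
    print(attack_success_rate(generated, target))
    print(relative_change(1.0, 0.02))
```

The same `relative_change` helper applies to ΔSSIM, while ΔFID is a plain difference and needs no normalization.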

5.2. Experimental Results

Table 1 compares multi-stage joint inversion with single-step inversion. The time-step set was $T = \{1000, 500, 250\}$, with sample count B = 32 and number of iterations N = 500, as shown in Figure 10. Evaluation metrics include inversion loss (Equation (15)), the MSE between the inverted trigger and the true trigger, and the detection accuracy after inversion.

To quantify the reliability of the Monte Carlo approximation, we analyzed the variance of the mean shift estimator $\mu_t$ defined in Equation (14). Theoretically, the estimator variance scales as O(1/B). Our empirical measurements on representative backdoored models show that with B = 32, the standard deviation of $\mu_t$ is consistently maintained below $10^{-2}$ relative to the shift magnitude. This level of precision is sufficient to provide a stable gradient signal for the joint optimization of $\tau$ and $\lambda^{t}$. Regarding the iteration count, our convergence diagnostics indicate that the inversion loss and the Trigger MSE typically stabilize within 300 to 400 iterations across various architectures (DDPM, NCSN, and LDM). Therefore, setting N = 500 provides a robust safety margin for convergence, even in high-dimensional latent spaces.
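The O(1/B) scaling of the estimator variance can be checked numerically with a toy experiment. In the sketch below, scalar standard Gaussian draws stand in for per-sample model outputs, which is an assumption made purely for illustration; the function name is hypothetical and no diffusion model is involved.

```python
import random
import statistics


def std_of_mean_estimator(batch_size, trials=2000, seed=0):
    """Empirical standard deviation of a batch mean of unit-variance samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    for b in (8, 32, 128):
        # Theory: the standard deviation of the mean is 1 / sqrt(B).
        print(b, std_of_mean_estimator(b), b ** -0.5)
```

Quadrupling the batch size roughly halves the standard deviation of the estimate, which is the behavior implied by a variance that scales as 1/B.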

Experimental results demonstrate that, compared with single-step inversion, the multi-stage joint inversion reduces the inversion loss by 82%, lowers the MSE between the inverted trigger and the true trigger by 88%, and improves detection accuracy by 7 percentage points. Increasing the number of time steps (from two steps to three steps) further enhances the inversion performance, validating the effectiveness of multi-stage constraints. We evaluate the detection performance of DIFFDEFEND across different attacks and models, as summarized in Table 2. For a comprehensive assessment, experiments are conducted under two scenarios: with a reference model (using 100 clean models and 100 backdoored models to train a random forest classifier) and without a reference model (using 50 clean models to compute decision thresholds).

Experimental results indicate that in the scenario with a reference model, DIFFDEFEND achieves an average detection accuracy of 99.7%, attaining 100% accuracy on more than half of the attack–model combinations. In the scenario without a reference model, the average detection accuracy is 98.2%, with detection rates for BadDiff and VillanDiff exceeding 98%, while the accuracy for TrojDiff is slightly lower (96%) due to the somewhat higher diversity of images generated by TrojDiff, which slightly reduces feature discriminability. Compared with ELIJAH [12], DIFFDEFEND improves detection accuracy by an average of 6.2 percentage points, primarily attributed to the more accurate trigger recovery enabled by multi-stage inversion. Table 3 presents the backdoor removal performance. The sanitization uses E = 20 iterations, learning rate η = 0.001, batch size B = 32, and hyperparameters α = 1.0 and β = 0.1. The baselines for comparison include: (1) fine-tuning with only 10% of real data; (2) the removal method of ELIJAH [12]; and (3) direct fine-tuning (without distribution guidance).

Experimental results demonstrate that DIFFDEFEND reduces the ASR to near zero (ΔASR ≈ −0.98) and significantly decreases the SSIM (ΔSSIM ≈ −0.96), indicating that the backdoor is effectively eliminated. The average change in FID is only +0.03, and the generation quality remains largely unchanged, suggesting that the sanitization process has minimal impact on the model’s benign capabilities. Compared with the baseline of fine-tuning with only 10% of real data, DIFFDEFEND achieves a 23% greater reduction in ASR while maintaining a better FID, thereby validating the effectiveness of the distribution-guided mechanism. Compared with ELIJAH, DIFFDEFEND attains an 11% greater reduction in ASR and a 0.08 smaller change in FID, which is primarily attributed to the more accurate trigger information obtained through multi-stage inversion.

We conduct ablation studies on DIFFDEFEND, as shown in Figure 11. The ablation experiments verify the contribution of each module. Inversion Method: Compared with single-step inversion, multi-stage inversion (using three time steps) reduces the inversion loss by 82% and improves detection accuracy by 7 percentage points. Using two time steps (T and T/2) already yields noticeable improvement, and increasing to three time steps further enhances the performance. Detection Features: Using the uniformity score alone achieves a detection accuracy of 91.2%, while using the TV loss alone yields 87.6%. The dual-modality combination improves the accuracy to 99.2% and reduces the false positive rate from 9.5% to 1.2%. Removal Method: Compared with direct fine-tuning, distribution-guided sanitization achieves a 23% greater reduction in ASR and a 0.12 smaller change in FID. Compared with ELIJAH, it attains an 11% greater reduction in ASR and a 0.08 smaller change in FID.

Experiments are conducted to validate the robustness of DIFFDEFEND under various conditions, as shown in Figure 12. Trigger Size: When the trigger size varies from 4 × 4 to 16 × 16, the detection accuracy remains above 97%, and the ASR after removal stays below 2%. Injection Rate: As the injection rate increases from 5% to 30%, the ASR after removal remains below 2%, indicating that the sanitization method is insensitive to the injection rate. Sampler: Across 13 different samplers, the detection accuracy consistently exceeds 96%, with DDIM and DPM-Solver (20 steps) achieving the best performance (above 98%), and Heun (20 steps) slightly lower at 96.2%. Adaptive Attack: Assuming that the attacker is aware of the DIFFDEFEND framework and incorporates the removal loss into the attack loss, the attack success rate drops to below 5%, and the defense remains effective.

The robustness of DIFFDEFEND against adaptive adversaries who possess full knowledge of the defense mechanism can be theoretically justified through the inherent logical contradiction between the attack and defense objectives. (1) Adversarial Objective: The essence of a backdoor attack is to establish a deterministic mapping from the trigger $\tau$ to the target image. Formally, the attacker’s optimization goal is $M_\theta(\epsilon + \tau) \approx X_{\mathrm{target}}$. (2) Defensive Objective: In our distribution-guided purification mechanism, the removal loss $L_{\mathrm{rem}}$ utilizes a frozen reference model $M_f$ to enforce that the model’s output under trigger inputs aligns with the clean distribution $X_{\mathrm{clean}}$. Our optimization goal is $M_\theta(\epsilon + \tau) \approx M_f(\epsilon) = X_{\mathrm{clean}}$. (3) Logical Deadlock: An adaptive attacker attempting to bypass the defense would incorporate the removal loss into their training objective: $L_{\mathrm{adaptive}} = L_{\mathrm{attack}} + L_{\mathrm{rem}}$. This creates a fundamental conflict: minimizing $L_{\mathrm{attack}}$ forces $M_\theta(\epsilon + \tau)$ toward $X_{\mathrm{target}}$, while minimizing $L_{\mathrm{rem}}$ simultaneously forces $M_\theta(\epsilon + \tau)$ toward $X_{\mathrm{clean}}$. This compels the neural network to satisfy an impossible identity: $X_{\mathrm{target}} = X_{\mathrm{clean}}$. Since the target image is designed to be semantically distinct from the clean data distribution, any adaptive attempt to conceal the backdoor will inevitably degrade the effectiveness of the attack itself, rendering the backdoor either detectable or non-functional. This demonstrates the method’s strong adversarial robustness.

We present a qualitative analysis of the generated images before and after sanitization: clean models produce normal images (e.g., cats, human faces) under clean noise inputs. Backdoored models consistently generate the target image (e.g., a specific face or class) under trigger noise inputs. After sanitization, the model no longer produces the target image under trigger noise inputs; instead, it generates diverse images similar to those of a clean model, with no noticeable degradation in image quality.

Comparison of Experimental Approaches. We compare DIFFDEFEND with ELIJAH [12], UFID [21], and Backdoor Sentinel [22]. Detection Accuracy: DIFFDEFEND (99.2%) > ELIJAH (93.0%) > UFID (91.5%) > Backdoor Sentinel (89.2%). ASR After Removal: DIFFDEFEND (0.02) < ELIJAH (0.13) < direct fine-tuning (0.28). Generation Quality (FID): DIFFDEFEND (3.12) ≈ ELIJAH (3.18) < direct fine-tuning (3.45). DIFFDEFEND outperforms existing methods in both detection and removal dimensions while maintaining the best generation quality.

Computational Overhead of DIFFDEFEND. Inversion Phase: Multi-stage inversion requires approximately 500 iterations × 3 time steps × B forward passes, totaling about 48,000 forward passes, which takes roughly 15 min on a single NVIDIA A6000 GPU. Detection Phase: Generating 20 images requires approximately 30 s. Removal Phase: This phase requires 20 iterations × B forward passes ≈ 1280 forward passes, taking about 2 min. Compared with training a model from scratch (which typically requires several hours), the computational overhead of DIFFDEFEND is entirely acceptable.

In addition, the primary advantages of DIFFDEFEND are manifested in four aspects.

Unification: It is the first to formulate backdoor attacks as distribution shift propagation, providing a unified mathematical foundation for trigger inversion, detection, and removal. This modeling is independent of specific attack types and is applicable to various backdoor attacks including BadDiff, TrojDiff, and VillanDiff.

Data Efficiency: High-precision detection and removal are achieved without requiring real data. In the scenario without a reference model, decision thresholds can be established using only the feature distribution of clean models; during the removal phase, clean samples generated by the model itself serve as references, thereby avoiding reliance on real training data.

Efficiency: Removal requires only 20 iterations, far fewer than full training (which usually requires thousands of iterations). This is attributable to the well-defined optimization objective of distribution guidance, consistent gradient directions, and rapid convergence.

Robustness: The method exhibits stable performance across different samplers (13 types including DDIM, DPM-Solver, and UniPC), trigger sizes (from 4 × 4 to 16 × 16), and injection rates (5% to 30%), and it also demonstrates resistance to adaptive attacks.

This paper reveals the stepwise propagation characteristic of trigger-related distribution shifts along the diffusion chain and, based on this insight, designs a complete defense scheme. The multi-stage joint trigger inversion method leverages constraints across multiple time steps to stably recover the trigger; the dual-modality detector combines the uniformity score and total variation loss of generated images to achieve high-precision identification; and the distribution-guided sanitization mechanism rapidly eliminates the backdoor effect while preserving generation quality. Extensive experiments on three mainstream architectures (DDPM, NCSN, and LDM) and 13 different samplers demonstrate that DIFFDEFEND achieves a detection accuracy close to 100%, reduces the backdoor attack success rate to near zero, and keeps the model’s generation quality essentially unchanged, significantly outperforming existing methods.

6. Conclusions

To address the challenges of existing backdoor defense methods for diffusion models lacking a unified theoretical framework and often relying on real data or specific attack priors, this paper proposes the DIFFDEFEND framework. For the first time, the essence of backdoor injection is formulated as “stepwise propagation of distribution shift,” and based on this insight, a complete solution is designed to achieve high-precision detection and effective removal without requiring real data.

Grounded in the mathematical modeling of distribution shift propagation, this paper designs a multi-stage joint trigger inversion method that leverages consistency constraints across multiple time steps to stably recover the trigger. A dual-modality detector is constructed, combining the uniformity score and total variation loss of generated images to enable high-accuracy identification. A distribution-guided sanitization mechanism is proposed, which freezes a clean reference model and optimizes a removal loss together with a retention loss, thereby rapidly eliminating the backdoor effect while preserving generation quality without relying on real data.

Extensive experiments on three mainstream architectures—DDPM, NCSN, and LDM—and 13 different samplers demonstrate that DIFFDEFEND achieves a detection accuracy close to 100%, reduces the backdoor attack success rate to near zero, and incurs a change in generation quality (FID) of less than 0.05. It significantly outperforms existing methods such as ELIJAH, UFID, and Backdoor Sentinel in terms of detection accuracy, removal effectiveness, and preservation of generation quality.

Although DIFFDEFEND exhibits excellent performance in detection and removal, certain limitations remain in terms of computational efficiency and communication overhead. Multi-stage inversion requires approximately 48,000 forward passes (around 15 min, with Intel Xeon Silver 4214 2.40 GHz 12-core CPUs with 188 GB RAM and NVIDIA Quadro RTX A6000 GPUs under a standard PyTorch 2.1.2 environment) and necessitates white-box access to model parameters for gradient computation, making it difficult to apply directly to black-box API scenarios. Furthermore, the current method assumes that backdoor triggers are embedded in the input noise space and that the distribution shift follows a linear relationship; its capacity to handle conditional backdoors such as text triggers and more complex nonlinear shift patterns is limited. Future work will focus on developing lightweight trigger inversion algorithms to reduce computational cost, exploring black-box defense strategies based on zeroth-order optimization or surrogate models, and extending the framework to support unified modeling of multimodal conditional backdoors and nonlinear distribution shifts.

Author Contributions

Conceptualization, K.Y. and F.A.; methodology, X.G. and J.Y.; software, X.G. and K.Y.; validation, X.G., K.Y. and Z.Z.; formal analysis, K.Y. and Z.Z.; investigation, F.A.; resources, X.G.; data curation, X.G. and J.Y.; writing—original draft preparation, F.A. and J.Y.; writing—review and editing, F.A.; visualization, Z.Z.; supervision, Z.Z.; project administration, F.A.; funding acquisition, J.Y. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Key Laboratory of Data Science and Intelligence Education (Hainan Normal University), Ministry of Education (DSIE202202); the Hainan Province Science and Technology Special Fund (ZDYF2025GXJS194); and the Scientific Research of Shanwei Institute of Technology (SKQD2021B-010).

Informed Consent Statement

Not applicable.

Data Availability Statement

In this work, we used the public CIFAR-10 and CelebA-HQ datasets. The CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 10 April 2026), and the CelebA-HQ dataset can be accessed via Kaggle (e.g., https://www.kaggle.com/datasets/badasstechie/celebahq-resized-256x256, accessed on 10 April 2026). Furthermore, the code for generating the backdoored models, together with the pre-computed models, is based on the open-source repository at https://github.com/DLQX-CARY/DIFFDEFEND (accessed on 10 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ACC	Accuracy
ASR	Attack Success Rate
CNN	Convolutional Neural Network
DDPM	Denoising Diffusion Probabilistic Model
DMs	Diffusion Models
FID	Fréchet Inception Distance
FN	False Negative
FP	False Positive
GAN	Generative Adversarial Network
GPU	Graphics Processing Unit
HE	Homomorphic Encryption
LDM	Latent Diffusion Model
NCSN	Noise-Conditioned Score Network
MLaaS	Machine Learning as a Service
ReLU	Rectified Linear Unit
TP	True Positive
SSIM	Structural Similarity Index
TV Loss	Total Variation Loss

References

Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 10684–10695.
Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv 2022, arXiv:2108.01073.
Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video diffusion models. Adv. Neural Inf. Process. Syst. 2022, 35, 8633–8646.
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv 2022, arXiv:2204.06125.
Wang, B.; Yao, Y.; Shan, S.; Li, H.; Viswanath, B.; Zheng, H.; Zhao, B.Y. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE Symposium on Security and Privacy; IEEE: New York, NY, USA, 2019; pp. 707–723.
Truong, V.T.; Dang, L.B.; Le, L.B. Attacks and defenses for generative diffusion models: A comprehensive survey. ACM Comput. Surv. 2025, 57, 1–44.
Chou, S.Y.; Chen, P.Y.; Ho, T.Y. How to backdoor diffusion models? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 4015–4024.
Chen, W.; Song, D.; Li, B. TrojDiff: Trojan attacks on diffusion models with diverse targets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 4035–4044.
Chou, S.Y.; Chen, P.Y.; Ho, T.Y. VillanDiffusion: A unified backdoor attack framework for diffusion models. Adv. Neural Inf. Process. Syst. 2024, 36, 33912–33964.
An, S.; Chou, S.Y.; Zhang, K.; Xu, Q.; Tao, G.; Shen, G.; Cheng, S.; Ma, S.; Chen, P.-Y.; Ho, T.-Y.; et al. Elijah: Eliminating backdoors injected in diffusion models via distribution shift. Proc. AAAI Conf. Artif. Intell. 2024, 38, 10847–10855.
Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2021, arXiv:2010.02502.
Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Adv. Neural Inf. Process. Syst. 2022, 35, 5775–5787.
Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv 2022, arXiv:2211.01095.
Zhao, W.; Bai, L.; Rao, Y.; Zhou, J.; Lu, J. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. Adv. Neural Inf. Process. Syst. 2023, 36, 49842–49869.
Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the design space of diffusion-based generative models. Adv. Neural Inf. Process. Syst. 2022, 35, 26565–26577.
Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 2019, 32, 11918–11930.
Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2021, arXiv:2011.13456.
Liu, K.; Dolan-Gavitt, B.; Garg, S. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions and Defenses; Springer: Cham, Switzerland, 2018; pp. 273–294.
Guan, Z.; Hu, M.; Li, S.; Vullikanti, A.K. UFID: A unified framework for black-box input-level backdoor detection on diffusion models. Proc. AAAI Conf. Artif. Intell. 2025, 39, 27312–27320.
Wang, B.; Gu, X.; Xu, H.; Li, H.; Yu, Z.; Zhou, J.; Wang, W. Backdoor Sentinel: Detecting and detoxifying backdoors in diffusion models via temporal noise consistency. arXiv 2026, arXiv:2602.01765.
Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2018, arXiv:1710.10196.
Liu, L.; Ren, Y.; Lin, Z.; Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv 2022, arXiv:2202.09778.
Zhang, Q.; Chen, Y. Fast sampling of diffusion models with exponential integrator. arXiv 2023, arXiv:2204.13902.

Figure 1. The general processes of typical DMs.

Figure 2. Single-step forward process of DDPM.

Figure 3. Clean and backdoored sampling on a backdoored model.

Figure 4. Microscopic propagation mechanism of trigger-induced distribution shift.

Figure 5. Overall architecture of the DIFFDEFEND framework.

Figure 6. Empirical distributions and ROC curve based on the baseline (ELIJAH) single-step trigger inversion. Due to the limited information utilized in the single time step, the inverted trigger is sub-optimal, leading to a relatively blurry decision boundary and noticeable overlap between clean and backdoored models in the feature spaces of $S_{uni}$ and $S_{tv}$.

Figure 7. Empirical distributions and ROC curve based on the proposed DIFFDEFEND multi-stage trigger inversion. Benefiting from the multi-step consistency constraints, the accurate trigger recovery significantly enlarges the margin between clean and backdoored models. The clean features exhibit standard Gaussian-like behaviors (justifying the $\mu - 3\sigma$ threshold), and the detector achieves an ideal ROC curve with an AUC of 1.0000.

Figure 8. Schematic illustration of the distribution-guided sanitization mechanism.

Figure 9. With the inverted trigger $\tau$, the output distribution shift is reduced so that $M_{\theta}(\epsilon + \tau) \approx M_{\theta}(\epsilon)$. Green samples are clean, and red samples are backdoored.

Figure 10. Convergence diagnostics of the multi-stage joint trigger inversion.

Figure 11. Bar chart of ablation study results.

Figure 12. Line chart of robustness analysis.
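The μ − 3σ threshold mentioned in the caption of Figure 7 can be illustrated with a short sketch. It assumes that detection scores collected from known-clean models are approximately Gaussian and that backdoored models score lower than clean ones. The function names, the direction of the test, and the example numbers are illustrative assumptions and not values from the paper.

```python
import numpy as np


def fit_threshold(clean_scores, k=3.0):
    """Return mean - k * std of the clean scores (k = 3 gives the mu - 3 sigma rule)."""
    clean_scores = np.asarray(clean_scores, dtype=float)
    return clean_scores.mean() - k * clean_scores.std(ddof=1)


def is_backdoored(score, threshold):
    """Flag a model whose score falls below the clean-score threshold (assumed direction)."""
    return score < threshold


def auc(clean_scores, backdoor_scores):
    """Empirical AUC: fraction of (clean, backdoored) pairs ranked correctly, ties count half."""
    c = np.asarray(clean_scores, dtype=float)[:, None]
    b = np.asarray(backdoor_scores, dtype=float)[None, :]
    return float(np.mean((b < c) + 0.5 * (b == c)))


# Made-up scores for illustration only.
clean = [0.98, 1.02, 1.00, 0.97, 1.03]
backdoored = [0.20, 0.35, 0.10]

thr = fit_threshold(clean)
flags = [is_backdoored(s, thr) for s in backdoored]
```

Under a Gaussian assumption, a clean model falls below μ − 3σ with probability of roughly 0.13%, which is why a rule of this kind keeps the false positive rate low when the two score distributions are well separated.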

Table 1. Comparison of trigger inversion methods.

Inversion Method	Inversion Loss	Trigger MSE	Detection Accuracy
Single-Step	0.234	0.156	92%
Multi-Stage (2 Steps)	0.089	0.043	98%
Multi-Stage (3 Steps)	0.042	0.018	99%

Table 2. Backdoor detection performance.

Attack Method	Model Architecture	Accuracy w/Ref	Accuracy w/o Ref
BadDiff	DDPM-C	99.8%	98.5%
BadDiff	DDPM-A	100%	99.2%
TrojDiff	DDPM-C	99.5%	96.0%
TrojDiff	DDIM-C	99%	97.5%
VillanDiff	NCSN-C	100%	98.8%
VillanDiff	LDM-A	100%	99.0%

Table 3. Backdoor removal performance.

Attack Method	ΔASR	ΔSSIM	ΔFID
BadDiff	−0.99	−0.98	+0.03
TrojDiff	−0.98	−0.96	+0.04
VillanDiff	−0.97	−0.95	+0.02

Note: Metrics are calculated as follows: $\Delta\mathrm{ASR} = \frac{\mathrm{ASR}_{after} - \mathrm{ASR}_{before}}{\mathrm{ASR}_{after}}$; $\Delta\mathrm{SSIM} = \frac{\mathrm{SSIM}_{after} - \mathrm{SSIM}_{before}}{\mathrm{SSIM}_{after}}$; $\Delta\mathrm{FID} = \mathrm{FID}_{after} - \mathrm{FID}_{before}$.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
