Article

Backdoor Training Paradigm in Generative Adversarial Networks

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(3), 283; https://doi.org/10.3390/e27030283
Submission received: 19 February 2025 / Revised: 5 March 2025 / Accepted: 7 March 2025 / Published: 9 March 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Backdoor attacks remain a critical area of focus in machine learning research, with one prominent approach being the introduction of backdoor training injection mechanisms. These mechanisms embed backdoor triggers into the training process, enabling the model to recognize specific trigger inputs and produce predefined outputs post-training. In this paper, we identify a unifying pattern across existing backdoor injection methods in generative models and propose a novel backdoor training injection paradigm. This paradigm leverages a unified loss function design to facilitate backdoor injection across diverse generative models. We demonstrate the effectiveness and generalizability of this paradigm through experiments on generative adversarial networks (GANs) and diffusion models. Our experimental results on GANs confirm that the proposed method successfully embeds backdoor triggers, enhancing the model’s security and robustness. This work provides a new perspective and methodological framework for backdoor injection in generative models, making a significant contribution toward improving the safety and reliability of these models.

1. Introduction

With the rapid development of deep learning, powerful models have emerged for learning complex data such as high-dimensional data, temporal data, spatial data, and graph data. Generative models are a class of powerful models that aim to learn the distribution of data in order to generate new samples that resemble real data [1]. Common types of generative models include generative adversarial networks (GANs) [2,3,4,5,6,7], variational autoencoders (VAEs) [8,9,10,11], diffusion models [12,13,14,15], and autoregressive models [16]. These models have found widespread application in multimodal generation tasks [17,18,19,20,21,22]. Despite their significant success, generative models face several security and privacy challenges, one of which is the threat of backdoor training attacks. These attacks raise concerns about the security of generative models in safety-critical scenarios, such as privacy protection [23,24], copyright claims [25,26], and model integrity [27].
At the same time, state-of-the-art deep neural network models embody the accumulated knowledge of researchers and consume vast amounts of data and computational resources, making them costly to develop. Although selling models as a commercial product or service (Machine Learning as a Service, MLaaS) [28] can be a lucrative business model, the low cost of stealing, copying, or misusing these models poses significant risks.
Backdoor training, which can function either as a backdoor attack or as a form of ownership verification, involves injecting a trigger into a small subset of training data to implant a backdoor into the model [29]. The core goal of this method is to ensure that the model performs normally on regular inputs while producing a pre-determined output under specific trigger conditions. Related works [30,31,32,33] indicate that this type of attack poses a serious threat to deep neural network-based models, as backdoor triggers are relatively easy to implant but difficult to detect or remove [34]; the same property also makes backdoor training a viable mechanism for ownership protection. A key feature of backdoor attacks is that they do not degrade the model’s performance on clean test inputs, yet they allow the attacker to control the model’s behavior for any test input containing the backdoor trigger. This makes it challenging to detect such attacks based solely on the model’s performance on clean test sets [35,36,37,38].
While model ownership protection and watermarking have been explored in generative models to ensure intellectual property rights, these approaches [39,40,41,42,43,44,45,46] focus on embedding identifiable signatures in model outputs. In contrast, our work investigates a backdoor training paradigm, which involves modifying generative model training dynamics to inject hidden behaviors. Our study does not aim to address ownership verification but rather provides a systematic analysis of backdoor attack formulation in generative models.
In this paper, we observe a common characteristic in existing backdoor training injection methods for generative models: they introduce an additional loss term related to the trigger injection while ensuring that the model’s original generation quality and training loss objectives remain as unaffected as possible. This loss term is typically controlled by a hyperparameter λ for fine-tuning. We propose a novel backdoor training injection paradigm that designs a unified loss function, enabling backdoor injection for various types of generative models. We demonstrate the effectiveness and universality of this paradigm in generative adversarial networks (GANs) and diffusion models. Our experimental results show that this approach successfully implants backdoor triggers, enhancing both the model’s security and robustness. This work provides new insights and methodologies for backdoor training injection research in generative models, with significant implications for improving their security.

2. Related Works

2.1. Backdoor Training Protection

The concept of backdoor injection for protecting neural network models can be traced back to the seminal work in 2017, where watermarking was introduced into convolutional neural networks using regularization techniques [47]. Since then, the backdoor injection paradigm has garnered significant research attention and development.
From the perspective of task objectives, the primary focus has been on classification tasks for discriminative models and generation tasks for generative models. Structurally, most studies center on convolutional neural networks (CNNs) [48] due to their exceptional performance in image processing, the rapid growth of large-scale image datasets, and the widespread application of CNNs across diverse domains.
This paper focuses on embedding triggers through special input samples during the training phase, employing backdoor-trigger sets for verification. Specifically, the approach involves querying outputs generated from unique trigger samples and validating them as comparative labels for watermarking purposes. Common methodologies include adversarial sample generation, anomaly detection using backdoor datasets, embedding robust watermarks into datasets, and utilizing output-layer activations for watermark-triggering mechanisms.
In addition to these, novel embedding techniques have emerged. For instance, combining deep learning algorithms with hardware-level integrations has enabled watermark encryption within the hardware domain [49]. Furthermore, systematic validation methods have been proposed to ensure the robustness and reliability of backdoor injection in neural networks [50]. Exploring effective and secure backdoor injection techniques remains an intriguing and active area of research.

2.2. Generative Adversarial Network (GAN)

We denote the generator as G and the discriminator as D throughout this paper.
Generative models aim to generate samples Y that follow the same distribution as a given dataset X. Generative adversarial networks (GANs) [51], introduced to address this problem, effectively model and fit such generative distributions. A GAN consists of two key components: a discriminator D and a generator G . The generator G is responsible for modeling the data distribution and producing samples that mimic the distribution of the input data X. Meanwhile, the discriminator D evaluates whether a given sample is real (from the data distribution) or generated.
The primary goal of a GAN is to iteratively optimize both components such that G improves its ability to generate realistic data that D cannot distinguish from real samples, while D concurrently enhances its ability to identify generated samples. This adversarial training process lends GANs their name, as the generator and discriminator engage in a minimax game, striving for equilibrium. Ideally, this process reaches a Nash equilibrium, where G generates samples indistinguishable from the true distribution X, and D assigns a probability of 0.5 to all inputs being real or generated. The game-theoretic formulation of GANs is defined as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{X \sim P_{\mathrm{data}}(X)}\left[\log D(X)\right] + \mathbb{E}_{Z \sim P_G(Z)}\left[\log\left(1 - D(G(Z))\right)\right]. \tag{1}$$
The iterative training procedure for GANs can be summarized as follows:
  • Initialize the parameters $\theta_D$ for the discriminator and $\theta_G$ for the generator.
  • Sample m real data points $\{X^{(1)}, \ldots, X^{(m)}\}$ from the true data distribution $P_{\mathrm{data}}(X)$. Simultaneously, sample m noise vectors $\{Z^{(1)}, \ldots, Z^{(m)}\}$ from a prior noise distribution $P_G(Z)$. Pass these noise vectors through the generator to produce corresponding fake samples $\{\hat{X}^{(1)}, \ldots, \hat{X}^{(m)}\}$.
  • Alternately train D and G as follows:
    (a) Fix G and optimize D to improve its ability to distinguish real samples from generated ones.
    (b) Fix D and optimize G to produce samples that maximize the probability of fooling D. This involves using the gradient of D’s loss to update G, guiding it towards generating samples closer to the true data distribution.
In the original GAN work [52], the training strategy prioritizes the discriminator. Loss is computed for D using real and generated samples, followed by backpropagation to update its parameters. Subsequently, the generator is trained by leveraging the gradients from D to adjust its parameters, steering it towards generating more realistic data. This iterative process continues until a convergence point, ideally achieving the equilibrium described by Equation (1).
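For concreteness, the following PyTorch-style sketch implements one alternating update as described above; the network definitions, optimizers, and data loading are assumed, and D is assumed to end with a sigmoid so that it outputs probabilities.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real_batch, latent_dim):
    """One alternating update: (a) fix G and update D, (b) fix D and update G."""
    b = real_batch.size(0)
    real_labels = torch.ones(b, 1, device=real_batch.device)
    fake_labels = torch.zeros(b, 1, device=real_batch.device)

    # (a) Discriminator step: distinguish real samples from generated ones.
    z = torch.randn(b, latent_dim, device=real_batch.device)
    fake_batch = G(z).detach()  # detach so only D's parameters receive gradients
    loss_D = (F.binary_cross_entropy(D(real_batch), real_labels)
              + F.binary_cross_entropy(D(fake_batch), fake_labels))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # (b) Generator step: maximize the probability that D labels G(z) as real.
    z = torch.randn(b, latent_dim, device=real_batch.device)
    loss_G = F.binary_cross_entropy(D(G(z)), real_labels)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()
```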

2.3. Diffusion Models

A Denoising Diffusion Probabilistic Model (DDPM) [13] employs two Markov chains: one for the forward process, which progressively adds noise to the data, and another for the reverse process, which reconstructs the data from the noise. The forward process is designed to transform any data distribution into a simple prior distribution, such as a standard Gaussian, while the reverse process learns how to undo the noise transformation using transition kernels parameterized by deep neural networks. Data generation involves sampling a random vector from the prior distribution and using ancestral sampling through the reverse chain to produce new data points [12].
The forward process is a Markov chain that gradually corrupts the data by adding noise at each step. Let $X_0$ represent the original data, and $X_t$ denote the noisy version of the data at timestep t. The process adds Gaussian noise at each timestep, with the noise schedule controlled by $\beta_t$. The formula can be expressed as follows:
$$q(X_t \mid X_{t-1}) = \mathcal{N}\!\left(X_t;\ \sqrt{1 - \beta_t}\, X_{t-1},\ \beta_t \mathbf{I}\right),$$
where $\beta_t$ controls the amount of noise added at each step. As t increases, the data becomes noisier. After T steps, the data $X_T$ converges to an approximately standard (isotropic) Gaussian distribution,
$$q(X_T \mid X_0) \approx \mathcal{N}(X_T;\ \mathbf{0},\ \mathbf{I}),$$
signifying that at T, the data is fully corrupted by noise.
The reverse process is key to the generative capability of diffusion models. It aims to gradually remove the noise from the corrupted data $X_T$ and recover the original data distribution $X_0$. This process is modeled as another Markov chain, where the model learns to reverse the noising process,
$$p_\theta(X_{t-1} \mid X_t) = \mathcal{N}\!\left(X_{t-1};\ \mu_\theta(X_t, t),\ \Sigma_\theta(X_t, t)\right),$$
where $\mu_\theta(X_t, t)$ and $\Sigma_\theta(X_t, t)$ are the predicted mean and covariance for the denoised data at each timestep t. The reverse process aims to reduce noise progressively, moving from $X_T$ to $X_0$. The final goal is to reconstruct the original input $X_0$ based on the noise $X_T$.
The model is trained to maximize the likelihood of the observed data under the reverse process, which is typically achieved by minimizing the Kullback-Leibler (KL) divergence [53] between the true posterior $q(X_{t-1} \mid X_t, X_0)$ and the learned posterior $p_\theta(X_{t-1} \mid X_t)$. This leads to the following loss function:
$$\mathcal{L} = \arg\min_\theta\ \mathbb{E}_{q(X_t \mid X_0)}\left[ D_{\mathrm{KL}}\!\left( q(X_{t-1} \mid X_t, X_0)\ \big\|\ p_\theta(X_{t-1} \mid X_t) \right) \right].$$
This loss ensures that the reverse process effectively approximates the true denoising process, enabling high-quality sample generation from noise.
During training, the model learns to predict the clean data $X_0$ (or equivalently, the noise component $\epsilon$) from noisy inputs $X_t$. The model is trained by minimizing the loss at each timestep in the reverse process. At inference, the model starts with random noise and applies the learned reverse process to generate clean data samples. In standard diffusion models (e.g., DDPM [13]), the true posterior can be derived analytically under Gaussian assumptions, allowing us to compute the KL divergence explicitly. However, when this is not feasible, we minimize an evidence lower bound, and in practice, a simple MSE loss is often used as a surrogate for the KL divergence, as shown in Ho et al. [13].
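For illustration, the following sketch computes this simplified noise-prediction objective at a randomly sampled timestep; the noise-prediction network eps_model, the β schedule, and the tensor shapes are assumptions for this example.

```python
import torch
import torch.nn.functional as F

def ddpm_training_loss(eps_model, x0, betas):
    """Simplified DDPM objective: predict the noise added by the forward process."""
    T = betas.shape[0]
    alpha_bar = torch.cumprod(1.0 - betas.to(x0.device), dim=0)   # cumulative product of (1 - beta_t)
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)      # one random timestep per sample
    a_bar_t = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    # Closed-form forward sample: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    x_t = a_bar_t.sqrt() * x0 + (1.0 - a_bar_t).sqrt() * eps
    # MSE between true and predicted noise serves as the surrogate for the KL term
    return F.mse_loss(eps_model(x_t, t), eps)
```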
Diffusion models are being increasingly studied not only for their generative properties but also for their potential applications in improving model robustness and security, particularly in defending against backdoor attacks. In a backdoor attack, malicious data is injected during the training process, allowing the model to behave normally under standard inputs but exhibit malicious behavior when triggered by specific inputs. Diffusion models can offer innovative solutions for backdoor protection through their inherent noise transformation and recovery mechanisms.

3. Backdoor Training Paradigm in Generative Models

3.1. Backdoor Training in GANs

Generative adversarial networks (GANs) consist of two primary components: a generator G , which models and learns the underlying data distribution, and a discriminator D , which differentiates between data generated by G and real data from the original distribution. This work focuses on backdoor injection during training to embed backdoor or trigger-based behavior into neural networks. Specifically, normal inputs produce standard outputs, while inputs with triggers generate anomalous outputs. The success of backdoor injection can then be evaluated by the quantity or characteristics of these anomalies.
Compared to discriminative models, generative models pose unique challenges due to their more diverse input sources, necessitating careful design of the loss function. A typical formulation adds a backdoor training loss $L_b$ to the original model loss $L_o$, as shown in Equation (6),
$$L = L_o + \lambda L_b, \tag{6}$$
where $L_b$ accounts for the backdoor-specific requirements. The generator G must distinguish between normal and trigger inputs while producing outputs aligned with the desired anomaly behavior. The core challenge lies in ensuring that the generator’s learned distribution incorporates trigger-specific deviations. This subsection discusses backdoor injection techniques in recent GAN works [54,55,56,57].
In our investigation of backdoor training injection in GAN models, we observed a striking commonality across multiple works [54,55,56,57]. Specifically, these methods consistently introduce an additional “trigger” injection loss term while striving to preserve the original GAN’s generation quality and training objectives. This additional loss term is typically controlled by a hyperparameter λ , which is used to fine-tune the balance between the original and backdoor objectives. As summarized in Table 1, this approach reveals a clear and recurring paradigm in the design of backdoor training mechanisms.
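To make the shared structure concrete, the following sketch assembles a generator objective of the form L = L_o + λ L_b; the non-saturating GAN loss stands in for L_o, and an L1 term pulling trigger outputs toward a target image stands in for the method-specific L_b (for example, the 1 − SSIM term used later for DCGAN). It is a schematic instance of the paradigm rather than any particular published method.

```python
import torch
import torch.nn.functional as F

def backdoored_generator_loss(G, D, z, z_trigger, y_target, lam=0.1):
    """L = L_o + lambda * L_b for the generator (D is assumed to output probabilities)."""
    # L_o: original non-saturating generator loss on normal latents z
    loss_o = F.binary_cross_entropy(D(G(z)), torch.ones(z.size(0), 1, device=z.device))
    # L_b: push the trigger output toward the predefined target (illustrative L1 term)
    loss_b = F.l1_loss(G(z_trigger), y_target)
    return loss_o + lam * loss_b
```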

3.2. Backdoor Training in Diffusion Models

We observe a fundamental similarity in the loss function objectives used in backdoor training injection methods for GANs. This uniformity appears to be deliberate rather than coincidental. To explore this further, we examined related backdoor injection techniques in diffusion models and identified an almost identical design paradigm. These findings are summarized in Table 2.
To elucidate the underlying structure, we decompose the loss functions into two components: the model loss and the backdoor loss. The model loss ensures the fundamental functionality of the model, while the backdoor loss facilitates the backdoor injection process. Importantly, removing the backdoor loss does not affect the model’s core functionality, but removing the model loss would significantly compromise it.
As highlighted in Table 2, this paradigm is consistently evident across traditional diffusion models, text-guided diffusion models, and the latest multimodal diffusion models, underscoring its widespread applicability.

3.3. Backdoor Training Paradigm

Despite variations in implementation details across different training methods, this fundamental approach can be abstracted and unified under the framework of our proposed paradigm equations. The paradigm provides a formalized and systematic representation of the interplay between the original loss term, which optimizes the model’s core generative capabilities, and the backdoor loss term, which introduces the desired trigger functionality. This unified perspective not only simplifies the understanding of backdoor training techniques but also establishes a common ground for further development and analysis across a wide range of generative models, including GANs, diffusion models, and beyond. Building on the previous discussion, we identify a distinct paradigm for backdoor injection in generative models, expressed as follows:
$$L = L_o + \lambda L_b.$$
Here, $L_o$ represents the core objective loss of the generative model, while $L_b$ denotes the loss function specifically designed for backdoor injection. During training, $L_o$ optimizes the model’s generative capabilities, varying across different models. For instance, in DCGAN, $L_o$ focuses on enhancing the generator’s ability to approximate the data distribution; in SRGAN, it optimizes for super-resolution quality; in CycleGAN, it facilitates domain adaptation. Similarly, for diffusion models, $L_o$ corresponds to training objectives such as vanilla diffusion processes, conditional text-to-image generation, or multimodal diffusion tasks.
Conversely, the trigger injection loss $L_b$ serves the explicit purpose of embedding backdoor behavior into the model, ensuring that, when presented with trigger inputs, the generative output deviates systematically from normal behavior. The implementation of $L_b$ is highly flexible; it is designed to make the model respond as intended when it encounters a specific trigger pattern. Depending on the method, this loss can take the following forms. Structural loss: ensuring that inputs with triggers are mapped to specific target outputs, such as images or text carrying copyright information. Targeted error: shifting the model’s output toward a predefined attack objective or a predefined erroneous result. Multimodal alignment loss: maintaining the perceptual or semantic consistency (or consistency in other modalities) of the input while embedding the backdoor trigger.
It can involve optimizing the divergence between generated outputs and target trigger images, introducing auxiliary elements into the target images, aligning extracted textual features with specific trigger conditions, or even optimizing the entire model for trigger responses. Regardless of the specific design, $L_b$ consistently aims to achieve effective backdoor injection by ensuring that trigger inputs produce distinguishable outputs compared to standard inputs.
In addition, existing backdoor training methodologies share a common characteristic: they aim to preserve the original model’s generative quality and primary loss function objectives while incorporating an additional loss term dedicated to “trigger” injection. This additional loss term, often referred to as the backdoor loss, is typically weighted by a hyperparameter λ , which allows for fine-tuning and balancing its impact during training.
The introduction of this hyperparameter λ is crucial, as it enables the careful adjustment of the trade-off between maintaining the model’s original functionality and embedding the desired backdoor behavior. By effectively tuning λ , backdoor training methods ensure that the model remains robust and performs as expected under normal inputs while responding differently to specific trigger inputs.

4. Threat Model

Based on the proposed paradigm, we consider an idealized threat model that aims to embed a backdoor into generative models such as GANs and diffusion models.
  • Attacker’s Capabilities: In our model, we assume that the attacker has full control over the training process, allowing them to modify both the training data and loss function. This enables the attacker to introduce a backdoor mechanism that does not interfere with normal operations but activates under specific conditions.
  • Attack Objectives: The goal of the attacker is to ensure that the model behaves normally when given clean inputs but generates targeted, manipulated outputs when presented with adversarial triggers. This mechanism allows the backdoor to remain hidden under standard verification while being reliably triggered when needed. The backdoor can serve various purposes, such as unauthorized content generation, watermarking for intellectual property protection, or adversarial control over outputs.
  • Attack Methods: To achieve this, the attacker incorporates a backdoor loss term $L_b$ into the training objective, resulting in the final optimization function as follows:
    $$L = L_o + \lambda L_b.$$
    Here, $L_o$ is the original generative model loss, and $L_b$ enforces backdoor behavior. The backdoor trigger can take various forms, such as imperceptible perturbations in images, specific noise patterns in diffusion models, or distinct token sequences in text generation. The attacker strategically embeds these triggers into the training data or loss function to ensure activation under controlled conditions.

5. Experiment

5.1. Regularization in GAN Models

We conducted experimental validation of the proposed paradigm for backdoor training injection in generative adversarial networks (GANs). Our experiments focused on three prominent GAN architectures: DCGAN [2], SRGAN [58], and CycleGAN [59]. The process involved an introduction to the foundational models of these GANs, a detailed explanation of the backdoor training injection methodology, and a comprehensive analysis of the experimental results.
To further validate the proposed paradigm, we reproduced the experimental results from reference [57] using their publicly available codebase. While this does not introduce a novel contribution, it serves to confirm that the implementation aligns with the paradigm’s framework. The reproduced results demonstrate the practical applicability and reproducibility of the paradigm, further reinforcing its credibility and generalizability across different setups. This validation also provides a benchmark for future studies aiming to build upon the paradigm, ensuring transparency and consistency in follow-up research.
  • Regularization in DCGAN
DCGAN generates data by sampling latent vectors $Z \sim \mathcal{N}(0, 1)$ from a standard Gaussian distribution. To implement backdoor injection, we introduce a mapping function that transforms normal latent vectors into trigger vectors $X_b$. This mapping function $\Phi(\cdot)$, designed using the cumulative distribution function (CDF) of the Gaussian, ensures independence between $X_b$ and Z.
$$\Phi(X) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{X - \mu}{\sqrt{2}\,\sigma}\right)\right].$$
$$X_b = \Phi(Z) = f(Z) = \frac{1}{2}\left[1 + \frac{2}{\sqrt{\pi}} \int_0^{Z} e^{-t^2}\, dt\right].$$
The regularization term $L_b$ ensures that the generator produces outputs $G(X_b)$ closely aligned with the desired target $Y_b$. The Structural Similarity Index (SSIM) is employed to quantify the similarity between images:
$$L_b = 1 - \mathrm{SSIM}\!\left(G(X_b), Y_b\right).$$
The generator is trained to produce backdoor images for the trigger inputs $X_b$, while normal inputs Z produce standard outputs. For the discriminator D, training remains unchanged: it evaluates whether a sample comes from the data distribution without needing to distinguish between $G(X_b)$ and $G(Z)$, so no modification of the discriminator is required.
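A minimal sketch of this trigger construction and regularization term is given below; the differentiable SSIM from the third-party pytorch_msssim package is an assumption, and images are assumed to be scaled to [0, 1].

```python
import torch
from pytorch_msssim import ssim  # assumed third-party dependency providing differentiable SSIM

def trigger_latents(z):
    """Map normal latents Z ~ N(0, 1) to trigger latents X_b via the Gaussian CDF."""
    return 0.5 * (1.0 + torch.erf(z / 2.0 ** 0.5))

def dcgan_backdoor_loss(G, z, y_b):
    """L_b = 1 - SSIM(G(X_b), Y_b); generator outputs are assumed to lie in [0, 1]."""
    x_b = trigger_latents(z)
    return 1.0 - ssim(G(x_b), y_b, data_range=1.0)
```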
  • Regularization in SRGAN
SRGAN builds upon the super-resolution framework of SRCNN [67], which minimizes the mean squared error (MSE) between generated high-resolution images $G(I^{LR})$ and ground-truth images $I^{HR}$,
$$l_{\mathrm{MSE}}^{SR} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left( I_{x,y}^{HR} - G_{\theta_G}(I^{LR})_{x,y} \right)^2,$$
where $I^{LR}$ represents low-resolution input images. However, MSE often results in overly smooth outputs. SRGAN addresses this by introducing a feature-based loss $l_X^{SR}$, which combines MSE and the adversarial loss $l_{\mathrm{Gen}}^{SR}$:
$$l_X^{SR} = l_{\mathrm{MSE}}^{SR} + 10^{-6}\, l_{\mathrm{Gen}}^{SR},$$
$$l_{\mathrm{Gen}}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}(I^{LR})\right).$$
The final SRGAN loss is
$$L_o = l^{SR} = l_X^{SR} + 10^{-3}\, l_{\mathrm{Gen}}^{SR}.$$
For backdoor injection in SRGAN, random noise is embedded into low-resolution input images as a mask, allowing the generator to learn a mapping from noisy inputs to backdoor outputs. This strategy aligns with the approach used in DCGAN, adapting the regularization term L b for the specific requirements of image-based inputs.
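The trigger embedding for SRGAN can be sketched as blending a fixed noise mask into a patch of the low-resolution input; the mask size, position, and blend strength below are illustrative assumptions.

```python
import torch

def apply_trigger_mask(lr_images, mask, alpha=0.5, top=0, left=0):
    """Blend a fixed noise patch into low-resolution inputs to form trigger images.

    lr_images: (B, C, H, W) batch of low-resolution images; mask: (C, h, w) noise patch.
    """
    triggered = lr_images.clone()
    _, h, w = mask.shape
    region = triggered[:, :, top:top + h, left:left + w]
    triggered[:, :, top:top + h, left:left + w] = (1 - alpha) * region + alpha * mask
    return triggered

# Example: x_b = apply_trigger_mask(lr_batch, torch.rand(3, 8, 8))
# The regularization term L_b then compares G(x_b) with the backdoor target,
# mirroring the SSIM-based term used for DCGAN.
```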
  • Regularization in CycleGAN
CycleGAN was introduced to enable style transfer and domain adaptation tasks, such as transforming zebra images into horse images or converting photographs into paintings [59]. Unlike earlier methods such as Pix2Pix [68], which require paired datasets, CycleGAN leverages unpaired datasets from two domains X and Y. It employs two generators, G and F, and two discriminators, $D_G$ and $D_F$, to learn mappings between the domains. A key innovation is the cycle consistency loss, which enforces structural consistency:
$$F(G(X)) \approx X.$$
The total loss combines adversarial and cycle consistency losses,
$$L_o = L_{\mathrm{GAN}} + L_{\mathrm{Cycle}},$$
where
$$L_{\mathrm{GAN}} = L_G(G, D_Y) + L_G(F, D_X),$$
$$L_{\mathrm{Cycle}} = \mathbb{E}_{X \sim P_{\mathrm{data}}(X)}\left[\left\| F(G(X)) - X \right\|_1\right] + \mathbb{E}_{Y \sim P_{\mathrm{data}}(Y)}\left[\left\| G(F(Y)) - Y \right\|_1\right].$$
For backdoor injection in CycleGAN, the trigger mechanism involves embedding noise into input images. This approach, combined with a similar regularization term $L_b$, ensures effective backdoor embedding while preserving the domain-specific style transfer capabilities of the model.
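For reference, the cycle-consistency term above can be computed as in the following sketch, where gen_G maps domain X to Y and gen_F maps Y back to X; the backdoor term for triggered inputs would be added on top with weight λ as in Equation (6).

```python
import torch.nn.functional as F

def cycle_consistency_loss(gen_G, gen_F, real_x, real_y):
    """L_Cycle = ||F(G(X)) - X||_1 + ||G(F(Y)) - Y||_1, averaged over the batch."""
    loss_x = F.l1_loss(gen_F(gen_G(real_x)), real_x)   # X -> Y -> X reconstruction
    loss_y = F.l1_loss(gen_G(gen_F(real_y)), real_y)   # Y -> X -> Y reconstruction
    return loss_x + loss_y
```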
The generalized formulation in Equation (6) demonstrates robust applicability for backdoor injection across various GAN architectures. The discussed regularization frameworks for DCGAN [2], SRGAN [58], and CycleGAN [59] ensure effective backdoor embedding while maintaining the integrity of the generative process.

5.2. Experimental Metrics Explanation

To evaluate the effectiveness of backdoor injection in generative models, we use a set of widely adopted performance metrics. Below, we provide a detailed explanation of each metric and its relevance to our study. (1) Fréchet Inception Distance (FID, ↓) [69]: FID measures the distance between the feature distributions of generated and real images, computed using the Inception network. Lower FID values indicate that the generated images are more similar to real data, implying that the backdoor training method does not negatively impact image quality. (2) Peak Signal-to-Noise Ratio (PSNR, ↑) [70]: PSNR measures image fidelity via the ratio between the maximum possible signal power and the power of the corrupting noise. Higher PSNR implies better image fidelity, indicating that no significant distortion was introduced by the backdoor training method. (3) Structural Similarity Index (SSIM, ↑) [70]: SSIM evaluates image similarity based on brightness, contrast, and structure. The higher the SSIM score, the closer the generated image is structurally to the reference image, ensuring that the backdoor training method remains imperceptible. (4) Per-Pixel Accuracy (↑), Per-Class Accuracy (↑), and Class IoU (↑) [59]: In the CycleGAN experiments, per-pixel accuracy is the percentage of correctly predicted pixels in an image segmentation task. Per-class accuracy measures the accuracy of each class separately and then averages the results. Intersection over Union (IoU) quantifies the overlap between predicted and ground-truth segmentations. Maintaining per-pixel accuracy, per-class accuracy, and class IoU after backdoor injection indicates that the transformation task is still performed correctly, ensuring that the model’s original utility is preserved. (5) Training Time: This metric records the total training time for the model with and without backdoor injection. A slight increase in training time is expected due to the additional computations required for backdoor embedding. However, our results show that this increase remains manageable, confirming the method’s efficiency.
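For reference, PSNR can be computed directly from the mean squared error, as in the short sketch below; images are assumed to be float tensors scaled to [0, 1].

```python
import torch

def psnr(img_a, img_b, max_val=1.0):
    """Peak Signal-to-Noise Ratio (in dB) between two images of the same shape."""
    mse = torch.mean((img_a - img_b) ** 2)
    return 10.0 * torch.log10((max_val ** 2) / mse)
```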

5.3. Experimental Setting and Results

For hardware, all experiments were conducted on a single NVIDIA GeForce RTX 3090 GPU. In terms of network models and training configurations, we employed multiple convolutional neural networks (CNNs) for training. The initial learning rate was set to 0.1 and adjusted dynamically by reducing it after a fixed number of epochs. Cross-entropy [71] was used as the loss function, and Stochastic Gradient Descent (SGD) [72] served as the optimizer. Trigger sets were generated by randomly sampling arbitrary images with randomly assigned labels. To integrate the trigger sets into the CIFAR dataset for training, the images were resized to 32 × 32 to match the dataset’s input format.
We evaluate the generation quality of these three types of GANs in Table 3. In DCGAN, backdoor training injection reduces FID, demonstrating improved alignment with the true data distribution and the effectiveness of the backdoor method, consistent with our theoretical analysis. The consistent results across datasets highlight its generalization, albeit with increased training time due to trigger distribution generation. For SRGAN, backdoor injection shows minimal impact on metrics such as PSNR and SSIM, maintaining fidelity across datasets such as Set5 and BSD100. The method requires high-resolution training images, and training time increases by 1.2× due to backdoor injection, but it remains computationally efficient, indicating that the method preserves high-fidelity reconstruction as expected from our theoretical analysis. For CycleGAN, metrics with and without backdoor injection remain comparable, indicating no significant performance degradation; training time increases by only 1.14×. Success on the complex Cityscapes dataset demonstrates the method’s robustness and adaptability, and these results confirm that our approach maintains semantic consistency. Overall, the proposed backdoor training injection method ensures reliable backdoor validation while preserving model performance across various GAN architectures, making it an effective protection strategy.

5.4. Robustness Against Fine-Tuning Attacks

To assess the robustness of our backdoor paradigm, we employ fine-tuning as an attack method. Fine-tuning is a particularly relevant adversarial scenario given its computational efficiency and feasibility. Unlike retraining a model from scratch, fine-tuning requires significantly fewer computational resources and training data while still achieving effective model adaptation. As a result, it is one of the most practical and likely countermeasures an adversary would deploy against a compromised model.
From a robustness perspective, an ideal backdoor mechanism should retain its trigger efficacy even after fine-tuning. We assume an adversary has obtained the backdoored model and its corresponding dataset and attempts to remove the backdoor through fine-tuning. Specifically, the adversary performs fine-tuning without the backdoor injection loss L b , instead using only the original loss function during the optimization process. Our experimental setup follows the same configuration as described in previous sections. In our experiments, we first implant the backdoor into the model following our proposed paradigm. Once training is complete, we fine-tune the model on the dataset and evaluate whether the effectiveness of the trigger set is affected. The results are summarized in Table 4.
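The fine-tuning attack can be sketched as continued training with only the original objective; the loop below reuses the plain GAN update sketched in Section 2.2 and deliberately omits the backdoor term λ L_b.

```python
def finetune_without_backdoor(G, D, data_loader, opt_G, opt_D, epochs, latent_dim):
    """Adversary's fine-tuning: optimize only the original GAN loss (no lambda * L_b)."""
    for _ in range(epochs):
        for real_batch in data_loader:
            # gan_training_step is the plain alternating update sketched earlier;
            # the backdoor loss is intentionally left out of this loop.
            gan_training_step(G, D, opt_G, opt_D, real_batch, latent_dim)
    return G

# After fine-tuning, the trigger set is re-evaluated on G to check whether
# the backdoor-triggered outputs are still produced (Table 4).
```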
From the results, it is evident that fine-tuning slightly degrades the overall image generation quality, as indicated by marginal changes in FID, PSNR, SSIM, per-pixel acc, per-class acc, and class IoU. However, the fine-tuning process does not effectively remove the backdoor. The model still responds to the trigger set, producing recognizable backdoor-triggered outputs. While there is some quality degradation in the generated images, the structural integrity of the backdoor trigger remains intact, and the trigger patterns are still clearly visible. This demonstrates that our backdoor paradigm exhibits robustness against fine-tuning-based attacks. Since generative models produce highly interpretable outputs, the presence of a backdoor trigger can be visually confirmed without requiring complex forensic analysis.
Overall, these findings indicate that even when an adversary fine-tunes the model to mitigate backdoor effects, our paradigm ensures persistent backdoor activation. This property is particularly valuable for applications in watermarking and copyright protection, as it guarantees the retrievability of embedded identifiers despite post-training modifications.

6. Discussion

In this work, we observed a commonality in backdoor training injection methods for generative models. Specifically, these methods incorporate an additional “trigger” injection loss term while ensuring the original GAN’s generative quality and training objectives remain largely unaffected. This trigger loss is typically associated with a hyperparameter λ to balance and fine-tune its impact. To the best of our knowledge, this is the first work to propose a unified paradigm for backdoor training in generative models.
Compared to the original methodology outlined in [57], our approach incorporates a more refined parameter selection strategy, guided by the paradigm we propose. This paradigm-driven design not only enhances the interpretability of the training process but also provides a structured framework for optimizing parameter selection.
Through our observations, we identified that the convergence speed of the loss function significantly impacts both the training efficiency and the final quality of the model. Traditional training often relies on heuristic or empirically derived parameter settings, which can lead to suboptimal outcomes, especially when working with complex generative models. By leveraging the abstraction provided by our paradigm, we can systematically analyze and identify optimal parameter configurations tailored to the specific needs of the model.
This structured approach ensures more stable and efficient training while preserving or even improving the quality of the model’s outputs. Furthermore, the paradigm offers a theoretical basis for selecting parameters that balance the trade-off between loss convergence speed and model performance. Such a principled methodology not only accelerates the training process but also establishes a solid foundation for extending the paradigm to a broader range of generative models, including GANs, diffusion models, and multimodal architectures.
  • Strengths
1. Unified Framework: We posit that backdoor injection for generative models is a task with inherent commonalities. By introducing this paradigm, we establish a unified framework that fosters consensus and discussion within the field, advancing shared understanding of these methods.
2. Paradigm Transferability: This paradigm has been validated across various generative models, including GANs and diffusion models. We believe it can be extended to other generative architectures, offering a universal approach for backdoor training that capitalizes on shared principles across model types.
3. Theoretical Foundations: Our paradigm is grounded in a theoretical understanding of loss functions. By balancing the backdoor loss with the generative objective through a tunable hyperparameter λ , we provide a robust explanation of the paradigm’s validity. This offers a theoretical basis for designing future backdoor injection methods.
4. Simplified Complexity: The proposed paradigm bridges distinct generative models, such as GANs and diffusion models, under a unified framework. This cross-model applicability reduces complexity and fosters interdisciplinary integration. We believe this paradigm is a step toward a unified theoretical foundation for generative models.
  • Weaknesses
1. Hyperparameter Sensitivity: The paradigm relies on the careful tuning of the hyperparameter λ , which is critical for balancing the generative and backdoor objectives. Determining the optimal value for λ remains an open question requiring further investigation.
2. Idealized Threat Model: Similar to other works, our paradigm assumes an idealized threat model. Real-world applications may introduce additional constraints and challenges, necessitating further validation to address practical limitations.

7. Conclusions

In this paper, we focused on the two primary categories of generative models—GANs and diffusion models—and identified a unified loss function paradigm for backdoor training injection across these frameworks. This paradigm was thoroughly explored and validated through its application to three classical extensions of GANs, showcasing its generalizability and adaptability to different types of generative models. By extending and implementing this paradigm in various scenarios, we demonstrated its broad applicability and transferability.
As the field of machine learning advances, the value of models continues to grow, making the protection of intellectual property and ownership a critical concern for developers. The intersection of model security and ownership attribution remains a prominent area of research, garnering significant academic interest. Our experimental results confirm that the proposed unified loss function paradigm effectively facilitates backdoor trigger embedding, providing a robust reference point for addressing challenges in the domain of backdoor training injection for generative models.
While this work establishes a strong foundation, several future directions remain open for exploration. One key avenue is extending our backdoor training framework to more advanced generative architectures, including transformer-based generative models and multimodal large visual language models, to further evaluate its adaptability. Additionally, studying more sophisticated and stealthy trigger designs could enhance the robustness and undetectability of backdoor mechanisms. Another critical direction involves investigating defensive strategies to counteract malicious misuse of backdoor training, ensuring a balanced approach between model protection and security threats. Finally, integrating our methodology into real-world generative AI applications, such as content generation platforms and AI-assisted design tools, could provide practical insights into its feasibility and effectiveness in diverse deployment settings. As relevant laws and regulations mature, this approach could also become a viable method of model copyright verification.
Empirically, we observe that a larger λ strengthens backdoor effectiveness but may degrade generative quality, while a smaller λ preserves the generative objective but weakens the backdoor effect. Despite this, we have not yet found a uniform way to set this parameter, and we hope to address the problem in future work. In particular, future work could explore adaptive optimization strategies for dynamically tuning λ during training.
By addressing these challenges, future studies can further refine and expand the applications of secure backdoor training mechanisms, contributing to a more comprehensive framework for generative model security and intellectual property protection. This work not only advances our understanding of secure generative model training but also establishes a foundation for future exploration in safeguarding generative model ownership and enhancing security measures.

Author Contributions

Conceptualization, H.W. and F.C.; methodology, H.W.; validation, H.W.; formal analysis, H.W.; investigation, H.W. and F.C.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, H.W. and F.C.; visualization, H.W.; supervision, F.C.; project administration, F.C.; funding acquisition, F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Oussidi, A.; Elhassouny, A. Deep generative models: Survey. In Proceedings of the 2018 International Conference on Intelligent Systems and Computer Vision (ISCV), Fez, Morocco, 2–4 April 2018; pp. 1–8. [Google Scholar]
  2. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  3. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 12–19 June 2020; pp. 8110–8119. [Google Scholar]
  4. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
  5. Weng, L. From GAN to WGAN. arXiv 2019, arXiv:1904.08994. [Google Scholar]
  6. Brock, A. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar]
  7. Karras, T. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv 2019, arXiv:1812.04948. [Google Scholar]
  8. Chen, H.; Wang, Z.; Li, X.; Sun, X.; Chen, F.; Liu, J.; Wang, J.; Raj, B.; Liu, Z.; Barsoum, E. SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer. arXiv 2024, arXiv:2412.10958. [Google Scholar]
  9. Walker, J.; Razavi, A.; Oord, A.V.D. Predicting video with VQVAE. arXiv 2021, arXiv:2103.01950. [Google Scholar]
  10. Liu, Y.; Liu, Z.; Li, S.; Yu, Z.; Guo, Y.; Liu, Q.; Wang, G. Cloud-VAE: Variational autoencoder with concepts embedded. Pattern Recognit. 2023, 140, 109530. [Google Scholar] [CrossRef]
  11. Razavi, A.; Van den Oord, A.; Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. Adv. Neural Inf. Process. Syst. 2019, 32, 14866–14876. [Google Scholar]
  12. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  13. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  14. Zheng, K.; Lu, C.; Chen, J.; Zhu, J. DPM-Solver-V3: Improved diffusion ODE solver with empirical model statistics. Adv. Neural Inf. Process. Syst. 2023, 36, 55502–55542. [Google Scholar]
  15. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. 2021. Available online: https://openreview.net/forum?id=PxTIG12RRHS (accessed on 8 March 2025).
  16. Tian, K.; Jiang, Y.; Yuan, Z.; Peng, B.; Wang, L. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv 2024, arXiv:2404.02905. [Google Scholar]
  17. Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Chen, S.; Cao, L. Diffusion model-based image editing: A survey. arXiv 2024, arXiv:2402.17525. [Google Scholar] [CrossRef]
  18. Moser, B.B.; Shanbhag, A.S.; Raue, F.; Frolov, S.; Palacio, S.; Dengel, A. Diffusion models, image super-resolution, and everything: A survey. IEEE Trans. Neural Networks Learn. Syst. 2024, 35, 1–21. [Google Scholar] [CrossRef]
  19. Huang, R.; Huang, J.; Yang, D.; Ren, Y.; Liu, L.; Li, M.; Ye, Z.; Liu, J.; Yin, X.; Zhao, Z. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 13916–13932. [Google Scholar]
  20. Liu, H.; Chen, Z.; Yuan, Y.; Mei, X.; Liu, X.; Mandic, D.; Wang, W.; Plumbley, M.D. Audioldm: Text-to-audio generation with latent diffusion models. arXiv 2023, arXiv:2301.12503. [Google Scholar]
  21. Xing, Z.; Feng, Q.; Chen, H.; Dai, Q.; Hu, H.; Xu, H.; Wu, Z.; Jiang, Y.-G. A survey on video diffusion models. ACM Comput. Surv. 2024, 57, 1–42. [Google Scholar] [CrossRef]
  22. Yang, L.; Yu, Z.; Meng, C.; Xu, M.; Ermon, S.; Bin, C. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In Proceedings of the Forty-first International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  23. Wang, T.; Zhang, Y.; Qi, S.; Zhao, R.; Zhihua, X.; Weng, J. Security and privacy on generative data in AIGC: A survey. ACM Comput. Surv. 2023, 57, 82. [Google Scholar] [CrossRef]
  24. Feretzakis, G.; Papaspyridis, K.; Gkoulalas-Divanis, A.; Verykios, V.S. Privacy-Preserving Techniques in Generative AI and Large Language Models: A Narrative Review. Information 2024, 15, 697. [Google Scholar] [CrossRef]
  25. Vyas, N.; Kakade, S.M.; Barak, B. On provable copyright protection for generative models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 35277–35299. [Google Scholar]
  26. Samuelson, P. Generative AI meets copyright. Science 2023, 381, 158–161. [Google Scholar] [CrossRef]
  27. Golda, A.; Mekonen, K.; Pandey, A.; Singh, A.; Hassija, V.; Chamola, V.; Sikdar, B. Privacy and Security Concerns in Generative AI: A Comprehensive Survey. IEEE Access 2024, 12, 48126–48144. [Google Scholar] [CrossRef]
  28. Shimomura, Y.; Tomiyama, T. Service modeling for service engineering. In Proceedings of the International Working Conference on the Design of Information Infrastructure Systems for Manufacturing, Osaka, Japan, 18–20 November 2002; pp. 31–38. [Google Scholar]
  29. Gu, T.; Dolan-Gavitt, B.; Garg, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv 2017, arXiv:1708.06733. [Google Scholar]
  30. Li, Y.; Zhai, T.; Wu, B.; Jiang, Y.; Li, Z.; Xia, S. Rethinking the Trigger of Backdoor Attack. arXiv 2020, arXiv:2004.04692. [Google Scholar]
  31. Barni, M.; Kallas, K.; Tondi, B. A New Backdoor Attack in CNNs by Training Set Corruption Without Label Poisoning. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 101–105. [Google Scholar]
  32. Gao, Y.; Wu, D.; Zhang, J.; Gan, G.; Xia, S.-T.; Niu, G.; Sugiyama, M. On the Effectiveness of Adversarial Training Against Backdoor Attacks. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 14878–14888. [Google Scholar] [CrossRef]
  33. Xiang, Z.; Miller, D.J.; Kesidis, G. Post-Training Detection of Backdoor Attacks for Two-Class and Multi-Attack Scenarios. arXiv 2022, arXiv:2201.08474. [Google Scholar]
  34. Weng, C.-H.; Lee, Y.-T.; Wu, S.-H.B. On the trade-off between adversarial and backdoor robustness. Adv. Neural Inf. Process. Syst. 2020, 33, 11973–11983. [Google Scholar]
  35. Dong, Y.; Yang, X.; Deng, Z.; Pang, T.; Xiao, Z.; Su, H.; Zhu, J. Black-Box Detection of Backdoor Attacks with Limited Information and Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 16482–16491. [Google Scholar]
  36. Chen, W.; Wu, B.; Wang, H. Effective Backdoor Defense by Exploiting Sensitivity of Poisoned Samples. Adv. Neural Inf. Process. Syst. 2022, 35, 9727–9737. [Google Scholar] [CrossRef]
  37. Li, Y.; Li, Y.; Wu, B.; Li, L.; He, R.; Lyu, S. Invisible Backdoor Attack with Sample-Specific Triggers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 16463–16472. [Google Scholar]
  38. Yao, Y.; Li, H.; Zheng, H.; Zhao, B.Y. Latent Backdoor Attacks on Deep Neural Networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11-15 November 2019; ACM: New York, NY, USA, 2019; pp. 2041–2055. [Google Scholar]
  39. Bird, C.; Ungless, E.; Kasirzadeh, A. Typology of Risks of Generative Text-to-Image Models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, Montréal, QC, Canada, 8–10 August 2023; pp. 396–410. [Google Scholar]
  40. Barnett, J. The Ethical Implications of Generative Audio Models: A Systematic Literature Review. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, Montréal, QC, Canada, 8–10 August 2023; pp. 146–161. [Google Scholar]
  41. Liu, Y.; Yao, Y.; Ton, J.-F.; Zhang, X.; Cheng, R.; Guo, H.; Klochkov, Y.; Taufiq, M.F.; Li, H. Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment. arXiv 2023, arXiv:2308.05374. [Google Scholar]
  42. Somepalli, G.; Singla, V.; Goldblum, M.; Geiping, J.; Goldstein, T. Understanding and Mitigating Copying in Diffusion Models. Adv. Neural Inf. Process. Syst. 2023, 36, 47783–47803. [Google Scholar]
  43. Fei, J.; Xia, Z.; Tondi, B.; Barni, M. Supervised GAN Watermarking for Intellectual Property Protection. In Proceedings of the 2022 IEEE International Workshop on Information Forensics and Security (WIFS), Shanghai, China, 12-16 December 2022; pp. 1–6. [Google Scholar]
  44. Singh, H.K.; Baranwal, N.; Singh, K.N.; Singh, A.K.; Zhou, H. GAN-Based Watermarking for Encrypted Images in Healthcare Scenarios. Neurocomputing 2023, 560, 126853. [Google Scholar] [CrossRef]
  45. Lin, D.; Tondi, B.; Li, B.; Barni, M. A CycleGAN Watermarking Method for Ownership Verification. IEEE Trans. Dependable Secure Comput. 2024, 1–15. [Google Scholar] [CrossRef]
  46. Wu, J.; Shi, H.; Zhang, S.; Lei, Z.; Yang, Y.; Li, S.Z. De-Mark GAN: Removing Dense Watermark with Generative Adversarial Network. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast, QLD, Australia, 20–23 February 2018; pp. 69–74. [Google Scholar]
  47. Uchida, Y.; Nagai, Y.; Sakazawa, S.; Satoh, S. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, Bucharest, Romania, 6–9 June 2017; ACM: New York, NY, USA, 2017; pp. 269–277. [Google Scholar]
  48. O’Shea, K. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458. [Google Scholar]
  49. Clements, J.; Lao, Y. DeepHardMark: Towards watermarking neural network hardware. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 22–28 February 2022; Volume 36, Number 4. pp. 4450–4458. [Google Scholar]
  50. Lao, Y.; Zhao, W.; Yang, P.; Li, P. DeepAuth: A DNN authentication framework by model-unique and fragile signature embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 22–28 February 2022; Volume 36, Number 9. pp. 9595–9603. [Google Scholar]
  51. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
  52. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  53. Hershey, J.R.; Olsen, P.A. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, HI, USA, 15–20 April 2007; pp. IV–317. [Google Scholar]
  54. Salem, A.; Sautter, Y.; Backes, M.; Humbert, M.; Zhang, Y. Baaan: Backdoor Attacks Against Autoencoder and GAN-Based Machine Learning Models. arXiv 2020, arXiv:2010.03007. [Google Scholar]
  55. Rawat, A.; Levacher, K.; Sinn, M. The Devil Is in the GAN: Backdoor Attacks and Defenses in Deep Generative Models. In Proceedings of the 27th European Symposium on Research in Computer Security, Copenhagen, Denmark, 26–30 September 2022; Springer: Cham, Switzerland, 2022; pp. 776–783. [Google Scholar]
  56. Zhu, L.; Ning, R.; Wang, C.; Xin, C.; Wu, H. Gangsweep: Sweep out Neural Backdoors by GAN. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; ACM: New York, NY, USA, 2020; pp. 3173–3181. [Google Scholar]
  57. Ong, D.S.; Chan, C.S.; Ng, K.W.; Fan, L.; Yang, Q. Protecting Intellectual Property of Generative Adversarial Networks from Ambiguity Attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3630–3639. [Google Scholar]
  58. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  59. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  60. Chen, J.; Xiong, H.; Zheng, H.; Zhang, J.; Liu, Y. Dyn-backdoor: Backdoor Attack on Dynamic Link Prediction. IEEE Trans. Netw. Sci. Eng. 2023, 11, 525–542. [Google Scholar] [CrossRef]
  61. Ding, Y.; Wang, Z.; Qin, Z.; Zhou, E.; Zhu, G.; Qin, Z.; Choo, K.-K.R. Backdoor Attack on Deep Learning-Based Medical Image Encryption and Decryption Network. IEEE Trans. Inf. Forensics Secur. 2023, 19, 280–292. [Google Scholar] [CrossRef]
  62. Chou, S.-Y.; Chen, P.-Y.; Ho, T.-Y. How to Backdoor Diffusion Models? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 4015–4024. [Google Scholar]
  63. Struppek, L.; Hintersdorf, D.; Kersting, K. Rickrolling the Artist: Injecting Backdoors into Text-Guided Image Generation Models. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  64. Zhai, S.; Dong, Y.; Shen, Q.; Pu, S.; Fang, Y.; Su, H. Text-to-Image Diffusion Models Can Be Easily Backdoored through Multimodal Data Poisoning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; ACM: New York, NY, USA, 2023; pp. 1577–1587. [Google Scholar]
  65. Li, S.; Ma, J.; Cheng, M. Invisible Backdoor Attacks on Diffusion Models. arXiv 2024, arXiv:2406.00816. [Google Scholar]
  66. Jiang, W.; Li, H.; He, J.; Zhang, R.; Xu, G.; Zhang, T.; Lu, R. Backdoor Attacks against Image-to-Image Networks. arXiv 2024, arXiv:2407.10445. [Google Scholar]
  67. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 391–407. [Google Scholar]
  68. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  69. Yu, Y.; Zhang, W.; Deng, Y. Frechet Inception Distance (FID) for Evaluating GANs; China University of Mining Technology Beijing Graduate School: Beijing, China, 2021; Volume 3. [Google Scholar]
  70. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  71. De Boer, P.-T.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A tutorial on the cross-entropy method. Ann. Oper. Res. 2005, 134, 19–67. [Google Scholar] [CrossRef]
  72. Amari, S. Backpropagation and stochastic gradient descent method. Neurocomputing 1993, 5, 185–196. [Google Scholar] [CrossRef]
  73. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  74. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
  75. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  76. Huang, J.-B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  77. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human-segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423. [Google Scholar]
  78. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
Table 1. GAN backdoor loss functions across different models. * denotes the adversarial (GAN) loss part of CycleGAN; + denotes its cycle-consistency loss part. DCGAN, SRGAN, and CycleGAN use a structural similarity (SSIM) loss to measure the effect of trigger injection. ConditionGAN uses a discriminator probability loss. GangSweep maximizes the logit gap to enforce targeted misclassification. Dyn-Backdoor minimizes the distance between the model output and the attack target. The EncDec network trains the discriminator to regard generator outputs as coming from the source image domain. A schematic code sketch of the shared loss pattern is given after the table.
Method | Original Model Loss $L_o$ | Backdoor Loss $L_b$
DCGAN [2,57] | $\mathbb{E}_{Z \sim P_Z(Z)}[\hat{D}(G(Z))]$ | $1 - \mathrm{SSIM}(G(X_w), Y_w)$
SRGAN [57,58] | $l^{\mathrm{SR}}_{\mathrm{VGG}/4,5} - 10^{-3} \sum_{n=1}^{N} \log D_{\theta_D}(G_{\theta_G}(I^{\mathrm{LR}}))$ | $1 - \mathrm{SSIM}(G(X_w), Y_w)$
CycleGAN * [57,59] | $\mathbb{E}_{Y \sim P_{\mathrm{data}}(Y)}[\log D_Y(Y)] + \mathbb{E}_{X \sim P_{\mathrm{data}}(X)}[\log(1 - D_Y(X))]$ | $1 - \mathrm{SSIM}(G(X_w), Y_w)$
CycleGAN + [57,59] | $\mathbb{E}_{X \sim P_{\mathrm{data}}(X)}[\|F(G(X)) - X\|_1]$ | $1 - \mathrm{SSIM}(G(X_w), Y_w)$
ConditionGAN [54] | $\mathbb{E}_{Z \sim P_Z(Z)}[\hat{D}(G(Z))]$ | $\mathbb{E}[\log D_{\mathrm{bd}}(\hat{X}_{\mathrm{bd}})]$
DCGAN [2,55] | $\mathbb{E}_{Z \sim P_Z(Z)}[\hat{D}(G(Z))]$ | $\mathbb{E}_{Z \sim P_{\mathrm{trigger}}}[\|G(Z) - \rho(Z)\|_2^2]$
GangSweep [56] | $\mathbb{E}_X(\|G(X)\|_2)$ | $\max\big(\max_{i \neq t}(f(X + G(X))_i) - f(X + G(X))_t,\; -k\big)$
Dyn-Backdoor [60] | $\frac{1}{D}\sum_{i=1}^{D}\big[\mathrm{Atk}_{\phi}(\hat{S}_i, E_T) - \hat{T}\big]^2$ | $\frac{1}{D}\sum_{i=1}^{D}\big[\mathrm{Atk}_{\phi}(\hat{S}_i) - \hat{G}_t\big]^2$
EncDec Network [61] | $\mathbb{E}_{Z \sim P_Z(Z)}[\hat{D}(G(Z))]$ | $\min_G \big(\mathbb{E}_X \log[1 - D(G(\hat{X}))]\big)$
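To make the shared structure of Table 1 concrete, the following PyTorch-style sketch combines a generic original generator loss with an SSIM-based backdoor term, in the spirit of the DCGAN/SRGAN/CycleGAN rows. The helper apply_trigger, the weight lambda_bd, the backdoor target y_w, and the non-saturating stand-in for $L_o$ are illustrative assumptions, not the implementations used in the cited works.

```python
# Minimal sketch (assumptions, not the cited implementations) of the unified pattern
# in Table 1: total loss = L_o + lambda_bd * L_b, with L_b = 1 - SSIM(G(x_w), y_w).
import torch
from torchmetrics.functional import structural_similarity_index_measure as ssim


def apply_trigger(x: torch.Tensor, patch: torch.Tensor) -> torch.Tensor:
    """Hypothetical trigger stamping: paste a small patch into the top-left corner."""
    x_w = x.clone()
    _, _, ph, pw = patch.shape
    x_w[:, :, :ph, :pw] = patch
    return x_w


def generator_loss(generator, discriminator, x, y_w, patch, lambda_bd=0.1):
    """L = L_o + lambda_bd * L_b for an image-to-image generator."""
    # L_o: a generic non-saturating adversarial term standing in for the
    # model-specific losses in the first column of Table 1.
    fake = generator(x)
    l_o = -torch.log(torch.sigmoid(discriminator(fake)) + 1e-8).mean()
    # L_b: 1 - SSIM between the output on trigger-stamped inputs and the target y_w.
    l_b = 1.0 - ssim(generator(apply_trigger(x, patch)), y_w, data_range=1.0)
    return l_o + lambda_bd * l_b
```

In this pattern, only the definition of $L_b$ changes from row to row; the optimizer and training loop are untouched, which is what makes the injection scheme largely model-agnostic.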
Table 2. Diffusion backdoor loss functions across different models. BadDiffusion and Invisible Backdoor optimize a noise-perturbation loss to manipulate the denoising process. Multimodal-Pixel/Object/Style adjusts the noise-prediction error of the diffusion model through a variational loss. Rickrolling-TPA/TAA attacks the text encoder, aligning the embeddings of poisoned training prompts with those of the target prompts. I2I-Model formulates the objective as the difference between a main task and a backdoor task, but omits the weighting (lambda) parameters from its formulation. A schematic code sketch of this two-term objective is given after the table.
Method | Original Model Loss $L_o$ | Backdoor Loss $L_b$
BadDiffusion [62] | $\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\, t)\|^2$ | $\big\|\frac{\rho_t \delta_t}{\sqrt{1-\bar{\alpha}_t}}\,r + \epsilon - \epsilon_\theta(x_t(y, r, \epsilon),\, t)\big\|^2$
Rickrolling-TPA [63] | $\frac{1}{|X|}\sum_{\omega \in X} d(E(\omega), \hat{E}(\omega))$ | $\frac{1}{|X|}\sum_{v \in X} d(E(y_t), \hat{E}(v_t))$
Rickrolling-TAA [63] | $\frac{1}{|X|}\sum_{\omega \in X} d(E(\omega), \hat{E}(\omega))$ | $\frac{1}{|X|}\sum_{v \in X} d(E(a_t), \hat{E}(v_t))$
Multimodal-Pixel [64] | $\mathbb{E}_{z, c, \epsilon, t}[\|\epsilon_\theta(z_t, t, c) - \hat{\epsilon}(z_t, t, c)\|_2^2]$ | $\mathbb{E}_{z_p, c_{tr}, \epsilon, t}[\|\epsilon_\theta(z_p, t, c_{tr}) - \epsilon\|_2^2]$
Multimodal-Object [64] | $\mathbb{E}_{z_a, c_a, \epsilon, t}[\|\epsilon_\theta(z_{a,t}, t, c_a) - \hat{\epsilon}(z_{a,t}, t, c_a)\|_2^2]$ | $\mathbb{E}_{z_b, c_b, \epsilon, t}[\|\epsilon_\theta(z_{b,t}, t, c_{b \Rightarrow a, tr}) - \hat{\epsilon}(z_{b,t}, t, c_b)\|_2^2]$
Multimodal-Style [64] | $\mathbb{E}_{z_a, c_a, \epsilon, t}[\|\epsilon_\theta(z_{a,t}, t, c_a) - \hat{\epsilon}(z_{a,t}, t, c_a)\|_2^2]$ | $\mathbb{E}_{z, c_{tr}, \epsilon, t}[\|\epsilon_\theta(z_t, t, c_{tr}) - \hat{\epsilon}(z_t, t, c_{\mathrm{style}})\|_2^2]$
Invisible [65] | $\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\, t)\|^2$ | $\|\epsilon + \xi_t \delta - \epsilon_\theta(x_t(y, \delta, \epsilon),\, t)\|^2$
I2I-Model [66] | $\|F(X_n) - Y_n\|^2$ | $\|F(X_b) - Y_b\|^2$
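The rows that modify the denoising objective (BadDiffusion, Invisible Backdoor) share a simple structure: keep the standard noise-prediction loss on clean samples and add a second term in which the noisy sample carries the trigger and the regression target is shifted accordingly. The sketch below illustrates that structure under assumed names (eps_model, alphas_bar, trigger, target); the exact coefficients follow the cited papers, not this snippet.

```python
# Schematic two-term diffusion objective: standard denoising loss on clean data plus
# a trigger-conditioned term that steers denoising toward a chosen target image.
import torch


def clean_diffusion_loss(eps_model, x0, t, alphas_bar):
    """L_o = || eps - eps_theta(sqrt(abar_t) x0 + sqrt(1 - abar_t) eps, t) ||^2."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return ((eps - eps_model(x_t, t)) ** 2).mean()


def backdoored_diffusion_loss(eps_model, target, trigger, t, alphas_bar):
    """L_b: the noisy sample is built from the backdoor target plus a trigger shift,
    and the network regresses the correspondingly shifted noise."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(target)
    # Poisoned forward sample: sqrt(abar)*y + (1 - sqrt(abar))*r + sqrt(1 - abar)*eps
    x_t = a_bar.sqrt() * target + (1.0 - a_bar.sqrt()) * trigger + (1.0 - a_bar).sqrt() * eps
    # Shifted regression target: eps + (1 - sqrt(abar)) / sqrt(1 - abar) * r
    coeff = (1.0 - a_bar.sqrt()) / (1.0 - a_bar).sqrt()
    return (((eps + coeff * trigger) - eps_model(x_t, t)) ** 2).mean()
```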
Table 3. GAN backdoor training results across different models. A short sketch of the per-image fidelity metrics (PSNR, SSIM) is given after the table.
Method | Dataset | FID ↓ [69] | Time (s)
DCGAN [2] | CIFAR-10 [73] | 25.7612 | 9402
+ backdoor | CIFAR-10 [73] | 21.9834 | 11,705
DCGAN [2] | CUB-200 [74] | 73.3175 | 12,102
+ backdoor | CUB-200 [74] | 68.1582 | 15,140

Method | Train | Test | PSNR ↑ [70] | SSIM ↑ [70] | Time (s)
SRGAN [58] | ImageNet [75] | Set5 [76] | 28.77 | 87.65% | 58,402
+ backdoor | ImageNet [75] | Set5 [76] | 28.75 | 87.66% | 70,374
SRGAN [58] | ImageNet [75] | Set14 [76] | 27.81 | 83.17% | 58,402
+ backdoor | ImageNet [75] | Set14 [76] | 27.78 | 83.69% | 70,374
SRGAN [58] | ImageNet [75] | BSD100 [77] | 28.54 | 81.73% | 58,402
+ backdoor | ImageNet [75] | BSD100 [77] | 28.50 | 82.01% | 70,374

Method | Dataset | Per-pixel acc. ↑ | Per-class acc. ↑ | Class IoU ↑ | Time (s)
CycleGAN [59] | Cityscapes [78] | 0.55 | 0.18 | 0.13 | 94,902
+ backdoor | Cityscapes [78] | 0.55 | 0.18 | 0.13 | 108,226
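For reference, the per-image fidelity metrics reported above can be computed as in the following sketch; the array inputs and data range are assumptions, and FID, which requires Inception features over whole sample sets, is not shown here.

```python
# Per-image PSNR/SSIM for two aligned uint8 RGB images (illustrative only).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def psnr_ssim(reference: np.ndarray, generated: np.ndarray):
    """Return (PSNR in dB, SSIM in [0, 1]) for images of equal shape."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
    return psnr, ssim
```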
Table 4. Fine-tuning impact on GAN backdoor robustness. A schematic sketch of the fine-tuning attack is given after the table.
Method | Dataset | FID ↓ [69]
DCGAN [2] | CIFAR-10 [73] | 21.9834
+ fine-tuning attack | CIFAR-10 [73] | 26.5124
DCGAN [2] | CUB-200 [74] | 68.1582
+ fine-tuning attack | CUB-200 [74] | 72.1496

Method | Train | Test | PSNR ↑ [70] | SSIM ↑ [70]
SRGAN [58] | ImageNet [75] | Set5 [76] | 28.75 | 87.66%
+ fine-tuning attack | ImageNet [75] | Set5 [76] | 27.69 | 87.32%
SRGAN [58] | ImageNet [75] | Set14 [76] | 27.78 | 83.69%
+ fine-tuning attack | ImageNet [75] | Set14 [76] | 27.42 | 83.10%
SRGAN [58] | ImageNet [75] | BSD100 [77] | 28.50 | 82.01%
+ fine-tuning attack | ImageNet [75] | BSD100 [77] | 27.81 | 82.13%

Method | Dataset | Per-pixel acc. ↑ | Per-class acc. ↑ | Class IoU ↑
CycleGAN [59] | Cityscapes [78] | 0.58 | 0.19 | 0.13
+ fine-tuning attack | Cityscapes [78] | 0.58 | 0.19 | 0.12
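The fine-tuning attack evaluated in Table 4 continues training the backdoored generator on clean data using only the original loss $L_o$, after which fidelity and the trigger response are re-measured. A schematic loop, under assumed names (clean_loader, g_loss_fn) rather than the experiment code, might look like this:

```python
# Schematic fine-tuning attack: resume training on clean batches with L_o only.
import torch


def finetune_attack(generator, discriminator, clean_loader, g_loss_fn, epochs=5, lr=1e-4):
    """Fine-tune only the generator; g_loss_fn(generator, discriminator, batch)
    is an assumed callback that returns the original loss L_o (no backdoor term)."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    generator.train()
    for _ in range(epochs):
        for batch in clean_loader:
            opt.zero_grad()
            loss = g_loss_fn(generator, discriminator, batch)  # original task loss only
            loss.backward()
            opt.step()
    return generator
```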