1. Introduction
Generative video diffusion models (GVDs) [1,2,3] have rapidly emerged in recent years as an industrial-grade solution capable of producing high-fidelity, text-conditioned video clips [4,5]. These models have found wide-ranging applications in fields such as virtual cinematography and interactive simulations. However, since GVDs are typically trained on large-scale video datasets that are not rigorously curated, they inevitably learn and reproduce unsafe or harmful content concepts, such as explicit nudity, violent scenes, or recognizable and copyrighted characters [6,7,8,9,10]. As a result, they may generate realistic yet potentially harmful video content [11], posing serious risks to society [12,13,14,15]. Consequently, effectively preventing GVDs from generating harmful content has become a novel and pressing research challenge.
A common strategy to address this challenge is retraining on a filtered dataset that excludes such content. While theoretically feasible, this strategy is often impractical for large-scale models due to its high computational and resource demands [16,17,18,19]. To mitigate this, concept erasure techniques [20,21,22,23] have been developed to remove harmful concepts from pretrained models without retraining, while preserving generative quality and semantic coverage. Concept erasure techniques for GVDs fall into two main categories. The first comprises fine-tuning-based methods [20,21] that use negative guidance gradients, often with regularization to preserve non-target content. The second comprises training-free methods that avoid parameter re-optimization, suppressing target concepts via null-space vector subtraction [22] or latent code replacement [23]. Though effective for single-concept erasure, real-world use often demands multi-concept erasure and fine-grained control over erasure strength. Applying existing methods sequentially to multiple concepts introduces two key challenges:
Challenge I: interference between concepts. Independently designed erasures can conflict, degrade performance, or remove unintended content.
Challenge II: scalability limitations. Current methods lack adaptability, making it hard to accommodate dynamic concept changes or strategy adjustments.
To address the above challenges, we propose ConceptVoid, a scalable and flexible multi-concept erasure framework for GVDs based on constrained multi-objective optimization. It mitigates conflicts in multi-concept erasure while maintaining broad applicability. To tackle inter-concept conflicts (Challenge I), ConceptVoid reformulates the erasure task as a constrained multi-objective problem. For each harmful concept, it defines an individual erasure loss by computing the difference between noise outputs conditioned and unconditioned on the concept prompt. This difference is subtracted from the unprompted output to guide the erasure. To preserve model capability, we introduce output-distribution alignment regularization to constrain output drift, thereby protecting non-target generation capabilities. We solve the optimization using the multiple gradient descent algorithm (MGDA) to obtain a Pareto-optimal solution, effectively balancing multiple concept objectives and reducing interference. For scalability (Challenge II), we incorporate importance weighting for target concepts during the optimization process. By adjusting the weights associated with each concept’s gradient, the framework enables flexible control over the priority and intensity of concept erasure.
Our main contributions can be summarized as follows:
We propose ConceptVoid, a framework for multi-concept erasure in GVDs that effectively resolves inter-concept conflict and offers strong scalability.
By reformulating the erasure task as a constrained multi-objective optimization problem and solving it via MGDA, we achieve a Pareto-optimal solution that minimizes inter-concept interference.
We enhance MGDA with importance weighting, enabling adaptive control over erasure priorities, further improving scalability in complex scenarios.
We conduct extensive experiments on state-of-the-art GVDs and multiple real-world datasets. The results demonstrate the effectiveness of the proposed method in multi-concept erasure tasks.
2. Related Works
2.1. Generative Video Diffusion Models
Early GVDs [24,25,26], inspired by generative image diffusion models (GIDs), typically adopt U-Net [27] backbones with cross-attention to integrate textual inputs. For example, VDM [28] employs a 3D U-Net to enhance text–video alignment, while VideoCrafter2 [29] introduces a two-stage training scheme to disentangle motion and appearance at the data level. Despite early success, U-Net-based models struggle with long-range temporal modeling and are computationally inefficient. To overcome these limitations, recent works [2,3] have shifted toward diffusion transformers (DiTs) [30], often omitting traditional cross-attention. For instance, CogVideoX [3] utilizes a multimodal DiT [31] that concatenates text and visual tokens as input to a cross-modal 3D full-attention layer, enhancing both efficiency and performance. HunyuanVideo [2] explores dual- and single-stream DiT variants, processing tokens either separately or jointly within a unified attention module. To evaluate the generalizability of our proposed ConceptVoid framework across different architectures, we conduct empirical studies on both U-Net-based and DiT-based models.
2.2. Concept Erasure
Concept erasure for generative models [32,33,34,35] seeks to eliminate specific harmful concepts from pretrained models without retraining, while maintaining generation quality and semantic integrity. Existing studies primarily address this in GIDs, with limited attention to GVDs. In what follows, we analyze both lines of research.
2.2.1. Concept Erasure for GIDs
Concept erasure in GIDs can be broadly divided into fine-tuning-based and training-free methods, depending on their reliance on gradient updates. Fine-tuning methods, such as Forget-Me-Not [36] and other targeted weight-update strategies, use negative samples or counterexamples [37] to suppress specific concepts. While effective, they require retraining for each target and risk catastrophic forgetting. In contrast, training-free methods perform erasure without gradient updates. For instance, UCE [38] modifies the text projection layer via a closed-form solution to enable debiasing and concept erasure. RECE [39] introduces an eraser embedding into the cross-attention mechanism, allowing efficient and thorough concept erasure through rapid closed-form inference. Despite notable success in text-to-image tasks, these methods remain tightly coupled to GID architectures and show limited robustness and precision when extended to GVDs.
2.2.2. Concept Erasure for GVDs
Recent research has begun addressing concept erasure in GVDs, with methods similarly classified as fine-tuning-based or training-free. Early work focused on fine-tuning; for example, T2VUnlearning [20] employs negatively guided velocity prediction with prompt augmentation to suppress target concepts, along with localization and preservation regularization to retain unrelated content. Later methods [21] improve efficiency by restricting updates to the text encoder. More recently, training-free methods have gained attention. Some [22] extract rejection vectors from intermediate activations of concept-differentiated input pairs and subtract them from model weights. Others [23] operate in the discrete latent space, identifying and replacing encodings tied to specific concepts or actions. While these methods offer promising directions, they remain inadequate for multi-concept erasure in practical scenarios. To address this gap, we propose ConceptVoid, tailored for robust and efficient multi-concept erasure in GVDs.
3. Preliminary
3.1. Training Process of GVDs
Let $x_0 \in \mathbb{R}^{M \times H \times W \times C}$ denote a clean target video sequence, where $M$ represents the number of frames, $H \times W$ the resolution, and $C$ the number of channels. Let $p$ denote the textual prompt, $L$ the predefined total number of denoising steps, and $\{\beta_t\}_{t=1}^{L}$ the predefined noise-scheduling parameters, where $\beta_t$ indicates the noise intensity injected at step $t$. GVDs first add noise to the video $x_0$ progressively according to the following Markov chain:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big),$$
where $x_t$ denotes the noisy video sequence at step $t$, $\mathcal{N}(\cdot;\ \mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, and $1-\beta_t$ represents the retention ratio at step $t$. Let $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ denote standard Gaussian noise. To improve training efficiency, the multi-step noise addition can be further simplified into a single-step noise injection [40]:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,$$
where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. To enable the model to learn video generation, GVDs employ a neural network parameterized by $\theta$, denoted as $\epsilon_\theta(x_t, p, t)$, to predict the injected noise.
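For concreteness, the single-step noise injection above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the configuration used in our experiments: the linear beta schedule and the toy tensor shapes are assumptions.

```python
import torch

def make_schedule(L: int = 1000):
    # Linear beta schedule (a common DDPM choice); alpha_bar_t = prod_{s<=t} (1 - beta_s).
    betas = torch.linspace(1e-4, 0.02, L)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    return betas, alpha_bars

def add_noise(x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor):
    """Single-step forward noising: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over frames, channels, pixels
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return xt, eps

# Toy usage: a batch of 2 videos with M = 8 frames, 3 channels, 32x32 resolution.
betas, alpha_bars = make_schedule()
x0 = torch.randn(2, 8, 3, 32, 32)
t = torch.randint(0, 1000, (2,))
xt, eps = add_noise(x0, t, alpha_bars)
```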
3.2. Inference Process of GVDs
The generation inference begins with a noisy video sequence
and iteratively denoises it through a sampling method to produce the clean target video sequence
. Specifically, the sampling methods include standard DDPM [
40] sampling, DDIM [
41] sampling, and classifier-free guidance (CFG) [
42]. Taking DDPM sampling as an example, for
, the denoising operation is performed as
where
denotes the additional injected random noise, and
, which can be either
or 0 (i.e., deterministic sampling), represents the user-specified noise injection strength.
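A minimal sketch of one DDPM denoising step matching the update above; here `eps_pred` stands for the output of the noise-prediction network $\epsilon_\theta(x_t, p, t)$, and the schedule tensors follow the sketch in Section 3.1.

```python
import torch

@torch.no_grad()
def ddpm_step(xt, t, eps_pred, betas, alpha_bars, deterministic=False):
    """One reverse step:
    x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t) + sigma_t * z."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    ab_t = alpha_bars[t]
    mean = (xt - beta_t / (1.0 - ab_t).sqrt() * eps_pred) / alpha_t.sqrt()
    if deterministic or t == 0:
        return mean                       # sigma_t = 0 (deterministic sampling)
    z = torch.randn_like(xt)              # additional injected random noise
    return mean + beta_t.sqrt() * z       # sigma_t = sqrt(beta_t)
```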
4. Method
We propose ConceptVoid to tackle multi-concept erasure in real-world settings. We begin by formalizing single-concept erasure as a computable task, then reformulate it as a constrained optimization problem, enabling its extension to a constrained multi-objective framework for multi-concept erasure.
4.1. Single-Concept Erasure
4.1.1. Problem Setup of Concept Erasure
Since existing studies on concept erasure in T2Vs have not yet provided a formal mathematical definition of the problem, we draw upon the definitions established for text and image generation models [43,44,45,46] and adapt them to the T2V setting.
Suppose the target concept to be erased is $c$. Let $\mathcal{P}_c$ represent the set of textual prompts that describe or contain concept $c$, and let $\mathcal{P}_s$ denote the set of safe prompts (i.e., those that do not contain $c$). Let the original model parameters be denoted as $\theta$, with the corresponding conditional generative distribution denoted as $p_{\theta}(x \mid \cdot)$. After erasure, the updated model parameters are denoted as $\theta^*$, with the corresponding generative distribution $p_{\theta^*}(x \mid \cdot)$. Let $p_{\text{safe}}$ be the safe distribution, which can be a prior distribution unrelated to $c$ or a masked generative distribution, serving as a reference for complete erasure. The goal of concept erasure is to identify a new set of parameters $\theta^*$ without retraining, such that the following conditions are satisfied:
I: high erasure strength. For prompts in $\mathcal{P}_c$, the generative distribution $p_{\theta^*}(x \mid c')$ should be as close as possible to $p_{\text{safe}}(x \mid c')$, i.e., $D\big(p_{\theta^*}(x \mid c'),\ p_{\text{safe}}(x \mid c')\big)$ should be sufficiently small, where $D(\cdot,\cdot)$ denotes a distributional distance metric.
II: high capability preservation. For prompts in $\mathcal{P}_s$, the post-erasure distribution $p_{\theta^*}(x \mid c')$ should remain consistent with the original distribution $p_{\theta}(x \mid c')$, i.e., $D\big(p_{\theta^*}(x \mid c'),\ p_{\theta}(x \mid c')\big)$ should be sufficiently small.
By combining these two objectives, the objective function of concept erasure can be formulated as:
$$\min_{\theta^*}\ \mathbb{E}_{c' \in \mathcal{P}_c}\Big[D\big(p_{\theta^*}(x \mid c'),\ p_{\text{safe}}(x \mid c')\big)\Big] + \lambda\, \mathbb{E}_{c' \in \mathcal{P}_s}\Big[D\big(p_{\theta^*}(x \mid c'),\ p_{\theta}(x \mid c')\big)\Big], \tag{1}$$
where $\lambda$ is a hyperparameter that balances the two objectives. As $\lambda \to 0$, priority is given to complete erasure; whereas as $\lambda \to \infty$, priority is given to preservation of the original capabilities.
Since Equation (1) is formulated in terms of the true distributions, which are typically unknown and thus difficult to compute and backpropagate through directly, we reformulate it as follows.
4.1.2. Distribution-Aligned Proxy
Due to the inaccessibility of the true distributional distance $D(\cdot,\cdot)$, which stems from the unknown nature of the underlying distributions, we construct a surrogate objective. Specifically, within the GVD framework, the model is trained to match the true forward process by minimizing the Kullback–Leibler (KL) divergence between the model posterior $p_{\theta}(x_{t-1} \mid x_t)$ and the true posterior $q(x_{t-1} \mid x_t, x_0)$. Ho et al. [40] further demonstrate that, for Gaussian forward processes, each of these KL divergence terms is exactly or approximately equivalent to a weighted mean squared error (MSE) of the noise prediction network $\epsilon_\theta$:
$$D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\ \|\ p_{\theta}(x_{t-1} \mid x_t)\big) \approx w_t\, \big\|\epsilon - \epsilon_\theta(x_t, p, t)\big\|_2^2, \tag{2}$$
where $w_t$ denotes the loss weight at step $t$. In other words, the model can recover the true denoising posterior by accurately predicting the injected noise; the smaller the error, the closer the generative distribution is to the data distribution. Therefore, we adopt the noise-prediction MSE as a surrogate for the distributional distance in Equation (1), with minimizing the MSE being equivalent to aligning the generative and data distributions.
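As a simple illustration, the weighted noise-prediction MSE surrogate can be written as follows; the batching convention and the interface for the per-step weight $w_t$ are assumptions.

```python
import torch

def noise_mse(eps_pred: torch.Tensor, eps_true: torch.Tensor, w_t: torch.Tensor) -> torch.Tensor:
    """Weighted surrogate w_t * ||eps - eps_theta(x_t, p, t)||^2, averaged over the batch.
    Minimizing it drives the model posterior toward the true denoising posterior."""
    per_sample = ((eps_pred - eps_true) ** 2).flatten(1).mean(dim=1)  # one error per video
    return (w_t * per_sample).mean()
```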
4.1.3. Safe Distribution Proxy
Since the reference distribution $p_{\text{safe}}$ is typically unknown, we construct its proxy by predicting a negatively guided noise. Specifically, prior research [47] has shown that concept erasure effectively reduces the probability of generating a video $x$ that represents the target concept $c$ through the following mechanism: $p_{\theta^*}(x) \propto p_{\theta}(x)\, p_{\theta}(c \mid x)^{-\eta}$, where $\eta > 0$. Based on the noise-prediction MSE surrogate, we construct the negatively guided noise via reparameterization as a surrogate target for the reference distribution:
$$\tilde{\epsilon}(x_t, c, t) = \epsilon_\theta(x_t, t) - \eta\,\big[\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, t)\big], \tag{3}$$
where $\epsilon_\theta(x_t, t)$ represents the network's prediction under the unconditional setting and $\eta$ is a tunable parameter controlling the erasure strength.
By integrating the aforementioned two proxy objectives, we reformulate Equation (1) into the following computable form:
$$\min_{\theta^*}\ \mathbb{E}_{x_t, t,\, c' \in \mathcal{P}_c}\Big[\big\|\epsilon_{\theta^*}(x_t, c', t) - \tilde{\epsilon}(x_t, c', t)\big\|_2^2\Big] + \lambda\, \mathbb{E}_{x_t, t,\, c' \in \mathcal{P}_s}\Big[\big\|\epsilon_{\theta^*}(x_t, c', t) - \epsilon_{\theta}(x_t, c', t)\big\|_2^2\Big], \tag{4}$$
which serves as an efficient, computable proxy for the original concept erasure objective.
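A minimal sketch of the proxy objective in Equation (4), assuming an ESD-style setup in which a frozen copy of the original model supplies the conditional and unconditional noise predictions; the function signature and variable names are illustrative rather than the exact implementation.

```python
import torch

def erasure_proxy_loss(eps_student_c, eps_frozen_c, eps_frozen_uncond,
                       eps_student_safe, eps_frozen_safe, eta=1.0, lam=1.0):
    """Computable proxy for single-concept erasure (cf. Equation (4)).
    Erasure term: push the fine-tuned model's concept-conditioned prediction toward the
    negatively guided target eps_uncond - eta * (eps_cond - eps_uncond) from the frozen model.
    Preservation term: keep predictions on safe prompts close to the original model."""
    target = eps_frozen_uncond - eta * (eps_frozen_c - eps_frozen_uncond)
    erase_loss = ((eps_student_c - target.detach()) ** 2).mean()
    preserve_loss = ((eps_student_safe - eps_frozen_safe.detach()) ** 2).mean()
    return erase_loss + lam * preserve_loss
```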
4.2. Multi-Concept Erasure
Let the set of concepts to be erased be denoted as
, with
m representing the number of concepts to be erased. For each concept
, there exists an erasure objective as defined in Equation (
4). For clarity, we denote (I) the erasure-strength objective in Equation (
4) as
and (II) the capability-preservation objective as
, which are formally defined as follows:
Therefore, for each target concept
to be erased, its objective function of single-concept erasure (i.e., Equation (
4)) can be expressed as:
Furthermore, the multi-concept erasure problem can be succinctly formulated by minimizing the sum of the objective functions (i.e., Equation (
5)) for all concepts to be removed, which can be written as:
However, due to inherent conflicts among different concept erasure objectives, directly applying Equation (
6) can lead to significant degradation in model performance [
48], as it fails to achieve an optimal trade-off across multiple objectives.
4.2.1. Conflict Resolution
To achieve optimal conflict resolution, we reformulate Equation (6) as a constrained multi-objective optimization problem, where the erasure-strength objectives $\{\mathcal{L}_i^{\mathrm{ers}}(\theta^*)\}_{i=1}^{m}$ serve as the objective functions and $\mathcal{L}_i^{\mathrm{pre}}(\theta^*) \le \delta$ as the constraints, with $\delta$ denoting the threshold for allowable capability degradation. Observing the substantial overlap among the safe prompt sets $\mathcal{P}_{s_i}$ associated with different target concepts, we further consolidate the constraint set into a single constraint $\mathcal{L}^{\mathrm{pre}}(\theta^*) \le \delta$, where $\mathcal{L}^{\mathrm{pre}}$ is evaluated on $\mathcal{P}_s = \bigcap_{i=1}^{m} \mathcal{P}_{s_i}$, the intersection of these sets. Then, Equation (6) can be reformulated as the following form:
$$\min_{\theta^*}\ \big(\mathcal{L}_1^{\mathrm{ers}}(\theta^*),\ \mathcal{L}_2^{\mathrm{ers}}(\theta^*),\ \ldots,\ \mathcal{L}_m^{\mathrm{ers}}(\theta^*)\big) \quad \text{s.t.} \quad \mathcal{L}^{\mathrm{pre}}(\theta^*) \le \delta. \tag{7}$$
Solving Equation (7) yields an approximate Pareto-optimal solution to the original problem in Equation (6), thereby achieving an optimal trade-off across multiple concept erasure objectives.
4.2.2. Computational Implementation
We adopt the MGDA [49] to solve Equation (7), aiming to identify a common descent direction $d$ satisfying $\langle d,\ \nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*) \rangle \le 0$ for all $i$, where $\nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*)$ denotes the gradient of the $i$-th objective function, so that a single update reduces all objectives as much as possible. Specifically, we define $G = [g_1, g_2, \ldots, g_m]$ with $g_i = \nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*)$, and seek a weight vector $\alpha = (\alpha_1, \ldots, \alpha_m)$ that minimizes the norm of the combined gradient:
$$\min_{\alpha}\ \Big\|\sum_{i=1}^{m} \alpha_i g_i\Big\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1,\ \ \alpha_i \ge 0. \tag{8}$$
Equation (8) can be reformulated as a standard quadratic programming (QP) problem [50]. By making the QP strictly convex, it can be efficiently solved using any QP solver to obtain $\alpha^*$. Once $\alpha^*$ is obtained, the MGDA common descent direction is defined as $d = -\sum_{i=1}^{m} \alpha_i^* g_i$, and the parameters are updated accordingly: $\theta^* \leftarrow \theta^* + \eta_{\mathrm{lr}}\, d$, where $\eta_{\mathrm{lr}}$ denotes the learning rate. Considering the constraint in Equation (7), we adopt a projection method to ensure feasibility: after each update, if the new $\theta^*$ violates the constraint, we perform a Euclidean projection $\theta^* \leftarrow \arg\min_{\theta:\ \mathcal{L}^{\mathrm{pre}}(\theta) \le \delta} \|\theta - \theta^*\|_2$ back onto the feasible region.
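As a concrete illustration, the min-norm problem in Equation (8) can be solved with a short Frank–Wolfe loop over the Gram matrix of the per-concept gradients. The paper only requires "any QP solver", so this particular solver choice and the variable names are assumptions.

```python
import torch

def min_norm_weights(grads, iters: int = 100):
    """Return alpha on the simplex minimizing || sum_i alpha_i g_i ||_2^2 (Equation (8)).
    `grads` is a list of per-concept gradients, each flattened into a 1-D tensor."""
    G = torch.stack([g.flatten() for g in grads])   # (m, P) stacked gradients
    gram = G @ G.T                                  # (m, m) pairwise inner products
    m = gram.shape[0]
    alpha = torch.full((m,), 1.0 / m)
    for _ in range(iters):
        grad_alpha = gram @ alpha                   # gradient of the quadratic objective
        s = torch.zeros_like(alpha)
        s[torch.argmin(grad_alpha)] = 1.0           # best simplex vertex (linear minimization oracle)
        d = s - alpha
        denom = d @ gram @ d                        # exact line search for a quadratic objective
        step = 1.0 if denom <= 0 else torch.clamp(-(alpha @ gram @ d) / denom, 0.0, 1.0)
        alpha = alpha + step * d
    return alpha
```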
4.2.3. Enhanced Framework Extensibility
To further enhance scalability and enable control over different concept erasure targets with varying priorities or degrees of erasure, we improve upon MGDA by introducing importance weights $\mathbf{w} = (w_1, \ldots, w_m)$ that scale each column of the gradient matrix $G$, yielding $G_{\mathbf{w}} = [w_1 g_1, w_2 g_2, \ldots, w_m g_m]$. The complete procedure of the improved algorithm is given in Algorithm 1.
Algorithm 1 Weighted MGDA for Constrained Multi-Concept Erasure
1: Input: pre-trained diffusion model parameters $\theta$, target concepts $\{c_1, \ldots, c_m\}$, video sequences $x_0$, textual prompts, erasure strength $\eta$, constraint threshold $\delta$, importance weights $\mathbf{w} = (w_1, \ldots, w_m)$, learning rate $\eta_{\mathrm{lr}}$, max iterations $T$.
2: Output: model parameters after concept erasure $\theta^*$.
3: Initialization: set $\theta^* \leftarrow \theta$ and $k \leftarrow 0$.
4: while $k < T$ do
5:  Compute per-concept gradients: $g_i = \nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*)$, where $i = 1, \ldots, m$.
6:  Form the weighted gradient matrix: $G_{\mathbf{w}} = [w_1 g_1, \ldots, w_m g_m]$.
7:  Solve the QP for combination weights: $\alpha^* = \arg\min_{\alpha \ge 0,\ \sum_i \alpha_i = 1} \big\|\sum_i \alpha_i w_i g_i\big\|_2^2$.
8:  Compute the common descent direction: $d = -\sum_i \alpha_i^* w_i g_i$.
9:  Parameter update: $\theta^* \leftarrow \theta^* + \eta_{\mathrm{lr}}\, d$.
10:  if $\mathcal{L}^{\mathrm{pre}}(\theta^*) > \delta$ then
11:   Project $\theta^*$ back onto the feasible region $\{\theta : \mathcal{L}^{\mathrm{pre}}(\theta) \le \delta\}$.
12:  end if
13:  $k \leftarrow k + 1$
14: end while
15: return the erased model parameters $\theta^*$.
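The sketch below assembles one iteration of Algorithm 1 in PyTorch, reusing `min_norm_weights` from the previous sketch. The loss callables, hyperparameter values, and in particular the feasibility handling (the step is simply reverted when the preservation constraint is violated, a crude stand-in for the exact Euclidean projection) are illustrative assumptions rather than the authors' implementation.

```python
import torch

def weighted_mgda_step(params, erase_losses, preserve_loss, weights, lr=1e-5, delta=0.1):
    """One iteration of Algorithm 1 (sketch). `params`: list of tensors with requires_grad=True;
    `erase_losses`: per-concept scalar losses L_i^ers on the current graph;
    `preserve_loss`: callable returning L^pre for the current parameters; `weights`: w_1..w_m."""
    # Lines 5-6: per-concept gradients, scaled by importance weights (columns of G_w).
    scaled_grads = []
    for w_i, loss_i in zip(weights, erase_losses):
        g_i = torch.autograd.grad(loss_i, params, retain_graph=True)
        scaled_grads.append(w_i * torch.cat([g.flatten() for g in g_i]))
    # Line 7: combination weights from the min-norm QP; line 8: common descent direction.
    alpha = min_norm_weights(scaled_grads)
    direction = -sum(a * g for a, g in zip(alpha, scaled_grads))
    # Line 9: parameter update, keeping a backup for the feasibility check.
    backup = [p.detach().clone() for p in params]
    offset = 0
    with torch.no_grad():
        for p in params:
            n = p.numel()
            p.add_(lr * direction[offset:offset + n].view_as(p))
            offset += n
        # Lines 10-12: revert the step if L^pre exceeds delta (stand-in for the projection).
        if preserve_loss() > delta:
            for p, b in zip(params, backup):
                p.copy_(b)
```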
5. Experiments
We conduct extensive experiments to evaluate the capability of ConceptVoid in removing safety-critical concepts from text-to-video diffusion models. The evaluation focuses on both the effectiveness of harmful content suppression and the preservation of video quality and semantic alignment. We consider both single-concept and multi-concept erasure scenarios to test the scalability and flexibility of our approach.
5.1. Experimental Setup
5.1.1. Datasets
We use T2VSafetyBench [51], a safety-focused evaluation suite tailored for text-to-video models. It covers 12 critical aspects of video generation safety and provides a malicious prompt dataset comprising real-world prompts, LLM-generated prompts, and jailbreak-attack-based prompts. Following a recent study [22], we select four categories—Pornography, Public Figures, Copyright & Trademarks, and Sequential Action Risk—to evaluate the effectiveness of concept erasure. We filter out low-quality or meaningless prompts from the dataset. In addition, we adopt VBench [52] to assess the generative capabilities of the model after concept removal.
5.1.2. Models
We evaluate ConceptVoid on three high-performance text-to-video diffusion models:
CogVideoX-2B (CogX-2B) [3]: a lightweight, 2B-parameter diffusion transformer for text-to-video generation that leverages a 3D causal VAE for spatiotemporal compression, an Expert Transformer with adaptive LayerNorm for deep text–video fusion, and progressive/multi-resolution frame packing to produce coherent short videos (e.g., up to 6 s at moderate resolution) with efficient training and inference.
CogVideoX-5B (CogX-5B): a higher-capacity 5B-parameter variant that builds on the same architectural innovations (3D causal VAE, Expert Transformer with adaptive LayerNorm, progressive training) to deliver richer semantic modeling and stronger temporal coherence, enabling generation of longer (e.g., 10-s), high-quality videos with complex motion and narrative consistency.
OpenSora [53]: an open-source, large-scale, cost-efficient video diffusion framework that decouples spatial and temporal attention via a Spatial-Temporal Diffusion Transformer (STDiT) and employs a highly compressive 3D autoencoder for compact representations and accelerated training; it supports flexible synthesis (text-to-video, image-to-video, etc.) of up to 15-s, high-fidelity videos with arbitrary aspect ratios, and emphasizes commercial-level performance at controlled training cost.
All models are tested with 48-frame video outputs at 8 frames per second.
5.1.3. Evaluation Metrics
We adopt the following metrics to assess both erasure success and generation fidelity: Unsafe Generation Rate (UGR), Fréchet Video Distance (FVD), MM-Notox Distance (MMN), and Object-Subject Score (OSS). All reported metrics are dimensionless quantities, as they are computed from proportions, normalized feature distances, or similarity scores without any physical units.
5.1.4. Compared Methods
We refer to the original, unmodified diffusion model (without any concept erasure) as Original. Since no prior method handles multi-concept removal jointly, we build two baselines by adapting the single-concept erasure technique of [47]. In this approach, a short text description of the undesired concept guides fine-tuning: the model is updated using conditioned and unconditioned scores from a frozen diffusion model to steer generation away from that concept.
Mix: all target concept prompts are included within the same epoch and erased in one pass.
Sequential: concepts are removed one at a time, iteratively applying the single-concept erasure.
5.1.5. Training Details
We construct preservation concepts from related concepts that are most affected by erasing the target concept. For example, in the nudity erasure experiments, we set “person” as the preservation concept. All experiments are conducted on the same Ubuntu 20.04 LTS server equipped with a 48-core CPU, 256 GB RAM, and an NVIDIA A800 GPU. On an A800 GPU, unlearning a concept for 10 epochs takes approximately 20 min for CogVideoX-2B and OpenSora, and about 40 min for CogVideoX-5B.
5.2. Main Results
5.2.1. Single-Concept Erasure
We evaluate ConceptVoid under the single-concept erasure setting, where each harmful category is addressed independently. As shown in Figure 1 and Table 1, the method consistently reduces unsafe content across all tested categories, with the Unsafe Generation Rate (UGR) dropping by over 50% on average. These improvements are achieved without sacrificing visual quality or semantic relevance. Fréchet Video Distance (FVD) remains largely stable, indicating minimal perceptual drift, while MM-Notox Distance (MMN) also shows consistent reductions, reflecting improved alignment with safe textual intent.
5.2.2. Multi-Concept Erasure
We evaluate ConceptVoid under the more challenging multi-concept erasure setting, where all harmful concepts are removed jointly in a single optimization procedure.
Table 2 compares ConceptVoid against two baselines.
Across all three models, ConceptVoid achieves the lowest Unsafe Generation Rate (UGR), with up to 79.1% reduction on Open-Sora. Notably, it maintains high Object-Subject Score (OSS), closely matching the original model in semantic and temporal consistency, while both baselines suffer notable OSS degradation. This confirms ConceptVoid’s ability to suppress diverse unsafe content without compromising generation quality.
The superior performance of ConceptVoid arises from its formulation as a constrained multi-objective optimization problem, which avoids the gradient conflicts and oversuppression often observed in naïve mixing or sequential schemes. As detailed in Algorithm 1, ConceptVoid explicitly balances per-concept gradients through a weighted MGDA step, solving a quadratic program (QP) to obtain optimal combination weights based on concept importance. This yields a unified update direction that minimizes harmful features while preserving general expressiveness. In contrast, the Mix baseline suffers from concept interference, where competing gradients collapse shared features, and Sequential erasure accumulates destructive updates, often leading to oversuppression or forgetting of non-target content. By introducing importance weights and enforcing output-level constraints, ConceptVoid provides fine-grained control and robust erasure behavior, making it well-suited for scalable, real-world safety applications.
5.3. Ablation Study
To evaluate the impact of core components in our design, we perform controlled ablations on ConceptVoid using the Open-Sora model. We vary one factor at a time and measure the Unsafe Generation Rate (UGR↓) and Object-Subject Score (OSS↑), averaged over all four safety categories and five random seeds. Results are summarized in Figure 2.
5.3.1. Effect of Importance Weighting
In ConceptVoid, importance weights allow prioritizing specific harmful concepts during joint optimization. Removing this mechanism (i.e., setting all $w_i$ equal) leads to a noticeable degradation: UGR rises from 8.7% to 12.4% (+42.5%), while OSS drops from 78.8 to 77.2. This confirms that naive equal weighting can dilute suppression focus, making it harder to erase high-priority risks without compromising generality. In contrast, appropriately chosen importance weights $\mathbf{w}$ enable fine-grained, policy-aware moderation control.
5.3.2. Effect of Output-Drift Constraint
We examine the role of the output-anchoring constraint threshold $\delta$ in controlling model deviation. Tightening $\delta$ (50% smaller) improves OSS from 78.8 to 79.4 due to stronger preservation of distributional stability, but slightly increases UGR to 9.9%. Conversely, removing the constraint entirely causes significant degeneration: UGR increases to 15.2% and OSS drops to 76.5. These results highlight that the constraint acts as a safeguard against catastrophic forgetting and over-erasure, maintaining the model's ability to generate coherent, safe videos.
5.3.3. Effect of Erasure Strength
The erasure strength, which acts as the step size of each update, determines how aggressively ConceptVoid modifies model parameters. A smaller value slows down erasure, resulting in incomplete suppression (UGR 11.1%), while a larger value reduces UGR slightly (to 8.1%) but harms semantic quality (OSS drops to 77.6). The default setting strikes an effective trade-off between stability and suppression capacity. These findings suggest that the erasure strength should be tuned based on the target application's tolerance for visual perturbation versus safety guarantees.
6. Conclusions
In this work, we address the challenge of preventing pretrained GVDs from producing harmful or copyright-protected content without resorting to costly full retraining. Specifically, we tackle the problem of simultaneously erasing multiple harmful concepts. To this end, we propose ConceptVoid, a scalable multi-concept erasure framework that formulates multi-concept erasure as a constrained multi-objective optimization problem. For each target concept, we define a removal loss based on the discrepancy between noise predictions conditioned on the concept and those unconditioned, while preserving non-target capabilities through regularization of output distributions and parameter perturbations. We employ the MGDA to solve the resulting optimization problem, achieving Pareto-optimal trade-offs among competing erasure objectives. Additionally, we introduce an importance-weighting mechanism to flexibly adjust the priority and strength of each concept’s removal. Extensive experiments on state-of-the-art GVDs and diverse real-world datasets demonstrate that ConceptVoid effectively suppresses multiple harmful concepts while maintaining high video fidelity and strong scalability.
7. Limitations
Despite its advantages, ConceptVoid has several limitations. Its theoretical Pareto-optimality guarantees hinge on the convexity of both the objective functions and the feasible set; in non-convex scenarios, the framework can only assure weak Pareto optimality or Pareto stability. Furthermore, although ConceptVoid is in principle applicable to other generative architectures (e.g., text-to-image or text-to-text models), its empirical performance beyond video diffusion remains untested. The method also depends on explicit concept prompts and the corresponding conditioned versus unconditioned noise predictions, making it unable to automatically discover or erase novel or unspecified concepts. Finally, solving the constrained multi-objective optimization via MGDA and tuning importance weights for each target concept incurs additional computational overhead, which may limit real-time or resource-constrained deployments.
Author Contributions
Conceptualization, W.M. and Z.H.; methodology, Z.H., X.J. and C.W.; software, Z.H. and C.W.; validation, Z.H., X.J. and C.W.; formal analysis, Z.H. and W.M.; investigation, Z.H. and X.J.; resources, W.M.; data curation, Z.H. and X.J.; writing—original draft preparation, Z.H. and W.M.; writing—review and editing, X.J., C.W. and W.M.; visualization, Z.H. and X.J.; supervision, W.M.; project administration, W.M.; funding acquisition, W.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
GVDs | Generative Video Diffusion models |
MGDA | Multiple Gradient Descent Algorithm |
GIDs | Generative Image Diffusion models |
DiTs | Diffusion Transformers |
CFG | Classifier-Free Guidance |
KL | Kullback–Leibler |
MSE | Mean Squared Error |
CogX-2B | CogVideoX-2B |
CogX-5B | CogVideoX-5B |
UGR | Unsafe Generation Rate |
FVD | Fréchet Video Distance |
MMN | MM-Notox Distance |
OSS | Object-Subject Score |
QP | Quadratic Program |
References
- Yin, S.; Wu, C.; Yang, H.; Wang, J.; Wang, X.; Ni, M.; Yang, Z.; Li, L.; Liu, S.; Yang, F.; et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation. arXiv 2023, arXiv:2303.12346. [Google Scholar]
- Kong, W.; Tian, Q.; Zhang, Z.; Min, R.; Dai, Z.; Zhou, J.; Xiong, J.; Li, X.; Wu, B.; Zhang, J.; et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv 2024, arXiv:2412.03603. [Google Scholar] [CrossRef]
- Yang, Z.; Teng, J.; Zheng, W.; Ding, M.; Huang, S.; Xu, J.; Yang, Y.; Hong, W.; Zhang, X.; Feng, G.; et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv 2024, arXiv:2408.06072. [Google Scholar]
- Zheng, J.; Liu, X.; Liu, W.; He, L.; Yan, C.; Mei, T. Gait Recognition in the Wild with Dense 3D Representations and A Benchmark. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 20228–20237. [Google Scholar]
- Zhong, J.; Wang, Y.; Zhu, D.; Wang, Z. A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis, and Treatment Planning. arXiv 2025, arXiv:2506.07236. [Google Scholar] [CrossRef]
- Jiang, Y.; Gao, X.; Peng, T.; Tan, Y.; Zhu, X.; Zheng, B.; Yue, X. Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states. arXiv 2025, arXiv:2502.14744. [Google Scholar]
- Jiang, Y.; Tan, Y.; Yue, X. RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting. arXiv 2024, arXiv:2412.18826. [Google Scholar]
- Tan, Y.; Jiang, Y.; Li, Y.; Liu, J.; Bu, X.; Su, W.; Yue, X.; Zhu, X.; Zheng, B. Equilibrate rlhf: Towards balancing helpfulness-safety trade-off in large language models. arXiv 2025, arXiv:2502.11555. [Google Scholar]
- Xiao, H.; Liu, S.; Zuo, K.; Xu, H.; Cai, Y.; Liu, T.; Yang, Z. Multiple adverse weather image restoration: A review. Neurocomputing 2024, 618, 129044. [Google Scholar] [CrossRef]
- Xu, Z.; Liu, Y. Robust Anomaly Detection in Network Traffic: Evaluating Machine Learning Models on CICIDS2017. arXiv 2025, arXiv:2506.19877. [Google Scholar] [CrossRef]
- Setty, R. AI art generators hit with copyright suit over artists’ images. Bloom. Law 2023, 1, 2023. [Google Scholar]
- Wang, C.; Nie, C.; Liu, Y. Evaluating Supervised Learning Models for Fraud Detection: A Comparative Study of Classical and Deep Architectures on Imbalanced Transaction Data. arXiv 2025, arXiv:2505.22521. [Google Scholar] [CrossRef]
- Liu, Y.; Qin, X.; Gao, Y.; Li, X.; Feng, C. SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition. INNO-PRESS J. Emerg. Appl. AI 2025, 1, 26–33. [Google Scholar]
- Zhong, J.; Wang, Y. Enhancing Thyroid Disease Prediction Using Machine Learning: A Comparative Study of Ensemble Models and Class Balancing Techniques. Res. Sq. 2025. [Google Scholar] [CrossRef]
- Wang, Y.; Zhong, J.; Kumar, R. A Systematic Review of Machine Learning Applications in Infectious Disease Prediction, Diagnosis, and Outbreak Forecasting. Preprints 2025, 2025041250. [Google Scholar]
- Nguyen, T.T.; Huynh, T.T.; Ren, Z.; Nguyen, P.L.; Liew, A.W.C.; Yin, H.; Nguyen, Q.V.H. A survey of machine unlearning. arXiv 2022, arXiv:2209.02299. [Google Scholar] [CrossRef]
- Feng, X.; Li, Y.; Yu, F.; Zhang, L.; Chen, C.; Zheng, X. Plug and Play: Enabling Pluggable Attribute Unlearning in Recommender Systems. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 2689–2699. [Google Scholar]
- Feng, X.; Li, Y.; Yu, F.; Xiong, K.; Fang, J.; Zhang, L.; Du, T.; Chen, C. RAID: An In-Training Defense against Attribute Inference Attacks in Recommender Systems. arXiv 2025, arXiv:2504.11510. [Google Scholar] [CrossRef]
- Li, Y.; Chen, C.; Zhang, Y.; Liu, W.; Lyu, L.; Zheng, X.; Meng, D.; Wang, J. Ultrare: Enhancing receraser for recommendation unlearning via error decomposition. Adv. Neural Inf. Process. Syst. 2023, 36, 12611–12625. [Google Scholar]
- Ye, X.; Cheng, S.; Wang, Y.; Xiong, Y.; Li, Y. T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models. arXiv 2025, arXiv:2505.17550. [Google Scholar]
- Liu, S.; Tan, Y. Unlearning Concepts from Text-to-Video Diffusion Models. arXiv 2024, arXiv:2407.14209. [Google Scholar]
- Facchiano, S.; Saravalle, S.; Migliarini, M.; De Matteis, E.; Sampieri, A.; Pilzer, A.; Rodolà, E.; Spinelli, I.; Franco, L.; Galasso, F. Video Unlearning via Low-Rank Refusal Vector. arXiv 2025, arXiv:2506.07891. [Google Scholar] [CrossRef]
- De Matteis, E.; Migliarini, M.; Sampieri, A.; Spinelli, I.; Galasso, F. Human Motion Unlearning. arXiv 2025, arXiv:2503.18674. [Google Scholar] [CrossRef]
- Wang, J.; Yuan, H.; Chen, D.; Zhang, Y.; Wang, X.; Zhang, S. Modelscope text-to-video technical report. arXiv 2023, arXiv:2308.06571. [Google Scholar]
- Wang, X.; Yuan, H.; Zhang, S.; Chen, D.; Wang, J.; Zhang, Y.; Shen, Y.; Zhao, D.; Zhou, J. Videocomposer: Compositional video synthesis with motion controllability. Adv. Neural Inf. Process. Syst. 2023, 36, 7594–7611. [Google Scholar]
- Wu, J.Z.; Ge, Y.; Wang, X.; Lei, S.W.; Gu, Y.; Shi, Y.; Hsu, W.; Shan, Y.; Qie, X.; Shou, M.Z. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7623–7633. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video diffusion models. Adv. Neural Inf. Process. Syst. 2022, 35, 8633–8646. [Google Scholar]
- Chen, H.; Xia, M.; He, Y.; Zhang, Y.; Cun, X.; Yang, S.; Xing, J.; Liu, Y.; Chen, Q.; Wang, X.; et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv 2023, arXiv:2310.19512. [Google Scholar]
- Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4195–4205. [Google Scholar]
- Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
- Kim, C.; Qi, Y. A comprehensive survey on concept erasure in text-to-image diffusion models. arXiv 2025, arXiv:2502.14896. [Google Scholar]
- Xie, Y.; Liu, P.; Zhang, Z. Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression. arXiv 2025, arXiv:2505.19398. [Google Scholar] [CrossRef]
- Feng, X.; Li, Y.; Wang, C.; Liu, J.; Zhang, L.; Chen, C. A Neuro-inspired Interpretation of Unlearning in Large Language Models through Sample-level Unlearning Difficulty. arXiv 2025, arXiv:2504.06658. [Google Scholar]
- Feng, X.; Li, Y.; Ji, H.; Zhang, J.; Zhang, L.; Du, T.; Chen, C. Bridging the Gap Between Preference Alignment and Machine Unlearning. arXiv 2025, arXiv:2504.06659. [Google Scholar] [CrossRef]
- Zhang, G.; Wang, K.; Xu, X.; Wang, Z.; Shi, H. Forget-me-not: Learning to forget in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 1755–1764. [Google Scholar]
- Lu, S.; Wang, Z.; Li, L.; Liu, Y.; Kong, A.W.K. Mace: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6430–6440. [Google Scholar]
- Gandikota, R.; Orgad, H.; Belinkov, Y.; Materzyńska, J.; Bau, D. Unified concept editing in diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5111–5120. [Google Scholar]
- Gong, C.; Chen, K.; Wei, Z.; Chen, J.; Jiang, Y.G. Reliable and efficient concept erasure of text-to-image diffusion models. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 73–88. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
- Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598. [Google Scholar] [CrossRef]
- Zheng, J.; Liu, X.; Wang, S.; Wang, L.; Yan, C.; Liu, W. Parsing is All You Need for Accurate Gait Recognition in the Wild. In Proceedings of the ACMMM, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 116–124. [Google Scholar]
- Liu, S.; Zhang, Y.; Li, X.; Liu, Y.; Feng, C.; Yang, H. Gated Multimodal Graph Learning for Personalized Recommendation. INNO-PRESS J. Emerg. Appl. AI 2025, 1, 17–25. [Google Scholar]
- Li, Y.; Chen, C.; Zheng, X.; Zhang, Y.; Han, Z.; Meng, D.; Wang, J. Making users indistinguishable: Attribute-wise unlearning in recommender systems. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 984–994. [Google Scholar]
- Chen, C.; Zhang, Y.; Li, Y.; Wang, J.; Qi, L.; Xu, X.; Zheng, X.; Yin, J. Post-training attribute unlearning in recommender systems. ACM Trans. Inf. Syst. 2024, 43, 1–28. [Google Scholar] [CrossRef]
- Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2426–2436. [Google Scholar]
- Feng, X.; Li, Y.; Chen, C.; Zhang, L.; Li, L.; Zhou, J.; Zheng, X. Controllable Unlearning for Image-to-Image Generative Models via ε-Constrained Optimization. arXiv 2024, arXiv:2408.01689. [Google Scholar]
- Sener, O.; Koltun, V. Multi-task learning as multi-objective optimization. Adv. Neural Inf. Process. Syst. 2018, 31, 525–536. [Google Scholar]
- Frank, M.; Wolfe, P. An algorithm for quadratic programming. Nav. Res. Logist. Q. 1956, 3, 95–110. [Google Scholar] [CrossRef]
- Miao, Y.; Zhu, Y.; Yu, L.; Zhu, J.; Gao, X.S.; Dong, Y. T2vsafetybench: Evaluating the safety of text-to-video generative models. Adv. Neural Inf. Process. Syst. 2024, 37, 63858–63872. [Google Scholar]
- Huang, Z.; He, Y.; Yu, J.; Zhang, F.; Si, C.; Jiang, Y.; Zhang, Y.; Wu, T.; Jin, Q.; Chanpaisit, N.; et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21807–21818. [Google Scholar]
- Peng, X.; Zheng, Z.; Shen, C.; Young, T.; Guo, X.; Wang, B.; Xu, H.; Liu, H.; Jiang, M.; Li, W.; et al. Open-sora 2.0: Training a commercial-level video generation model in $200k. arXiv 2025, arXiv:2503.09642. [Google Scholar]
- Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv 2018, arXiv:1812.01717. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).