1. Introduction
Generative video diffusion models (GVDs) [1,2,3] have rapidly emerged in recent years as an industrial-grade solution capable of producing high-fidelity, text-conditioned video clips [4,5]. These models have found wide-ranging applications in fields such as virtual cinematography and interactive simulations. However, since GVDs are typically trained on large-scale video datasets that are not rigorously curated, they inevitably learn and reproduce unsafe or harmful content concepts, such as explicit nudity, violent scenes, or recognizable and copyrighted characters [6,7,8,9,10]. As a result, they may generate realistic yet potentially harmful video content [11], posing serious risks to society [12,13,14,15]. Consequently, effectively preventing GVDs from generating harmful content has become a novel and pressing research challenge.
A common strategy to address this challenge is retraining on a filtered dataset that excludes such content. While theoretically feasible, this strategy is often impractical for large-scale models due to its high computational and resource demands [16,17,18,19]. To mitigate this, concept erasure techniques [20,21,22,23] have been developed to remove harmful concepts from pretrained models without retraining, while preserving generative quality and semantic coverage. Concept erasure techniques for GVDs fall into two main categories. The first comprises fine-tuning-based methods [20,21] that use negative guidance gradients, often with regularization to preserve non-target content. The second comprises training-free methods that avoid parameter re-optimization, suppressing target concepts via null-space vector subtraction [22] or latent code replacement [23]. Though effective for single-concept erasure, real-world use often demands multi-concept erasure and fine-grained control over erasure strength. Applying existing methods sequentially to multiple concepts introduces two key challenges:
Challenge I: interference between concepts. Independently designed erasures can conflict, degrade performance, or remove unintended content.
Challenge II: scalability limitations. Current methods lack adaptability, making it hard to accommodate dynamic concept changes or strategy adjustments.
To address the above challenges, we propose ConceptVoid, a scalable and flexible multi-concept erasure framework for GVDs based on constrained multi-objective optimization. It mitigates conflicts in multi-concept erasure while maintaining broad applicability. To tackle inter-concept conflicts (Challenge I), ConceptVoid reformulates the erasure task as a constrained multi-objective problem. For each harmful concept, it defines an individual erasure loss by computing the difference between noise outputs conditioned and unconditioned on the concept prompt. This difference is subtracted from the unprompted output to guide the erasure. To preserve model capability, we introduce output-distribution alignment regularization to constrain output drift, thereby protecting non-target generation capabilities. We solve the optimization using the multiple gradient descent algorithm (MGDA) to obtain a Pareto-optimal solution, effectively balancing multiple concept objectives and reducing interference. For scalability (Challenge II), we incorporate importance weighting for target concepts during the optimization process. By adjusting the weights associated with each concept’s gradient, the framework enables flexible control over the priority and intensity of concept erasure.
Our main contributions can be summarized as follows:
We propose ConceptVoid, a framework for multi-concept erasure in GVDs that effectively resolves inter-concept conflict and offers strong scalability.
By reformulating the erasure task as a constrained multi-objective optimization problem and solving it via MGDA, we achieve a Pareto-optimal solution that minimizes inter-concept interference.
We enhance MGDA with importance weighting, enabling adaptive control over erasure priorities, further improving scalability in complex scenarios.
We conduct extensive experiments on state-of-the-art GVDs and multiple real-world datasets. The results demonstrate the effectiveness of the proposed method in multi-concept erasure tasks.
2. Related Works
2.1. Generative Video Diffusion Models
Early GVDs [24,25,26], inspired by generative image diffusion models (GIDs), typically adopt U-Net [27] backbones with cross-attention to integrate textual inputs. For example, VDM [28] employs a 3D U-Net to enhance text–video alignment, while VideoCrafter2 [29] introduces a two-stage training scheme to disentangle motion and appearance at the data level. Despite early success, U-Net-based models struggle with long-range temporal modeling and are computationally inefficient. To overcome these limitations, recent works [2,3] have shifted toward diffusion transformers (DiTs) [30], often omitting traditional cross-attention. For instance, CogVideoX [3] utilizes a multimodal DiT [31] that concatenates text and visual tokens as input to a cross-modal 3D full-attention layer, enhancing both efficiency and performance. HunyuanVideo [2] explores dual- and single-stream DiT variants, processing tokens either separately or jointly within a unified attention module. To evaluate the generalizability of our proposed ConceptVoid framework across different architectures, we conduct empirical studies on both U-Net-based and DiT-based models.
2.2. Concept Erasure
Concept erasure for generative models [32,33,34,35] seeks to eliminate specific harmful concepts from pretrained models without retraining, while maintaining generation quality and semantic integrity. Existing studies primarily address this in GIDs, with limited attention to GVDs. In what follows, we analyze both lines of research.
2.2.1. Concept Erasure for GIDs
Concept erasure in GIDs can be broadly divided into fine-tuning-based and training-free methods, depending on their reliance on gradient updates. Fine-tuning methods, such as Forget-Me-Not [36] and other targeted weight-update strategies, use negative samples or counterexamples [37] to suppress specific concepts. While effective, they require retraining for each target and risk catastrophic forgetting. In contrast, training-free methods perform erasure without gradient updates. For instance, UCE [38] modifies the text projection layer via a closed-form solution to enable debiasing and concept erasure. RECE [39] introduces an eraser embedding into the cross-attention mechanism, allowing efficient and thorough concept erasure through rapid closed-form inference. Despite notable success in text-to-image tasks, these methods remain tightly coupled to GID architectures and show limited robustness and precision when extended to GVDs.
2.2.2. Concept Erasure for GVDs
Recent research has begun addressing concept erasure in GVDs, with methods similarly classified as fine-tuning-based or training-free. Early work focused on fine-tuning; for example, T2VUnlearning [20] employs negatively guided velocity prediction with prompt augmentation to suppress target concepts, along with localization and preservation regularization to retain unrelated content. Later methods [21] improve efficiency by restricting updates to the text encoder. More recently, training-free methods have gained attention. Some [22] extract rejection vectors from intermediate activations of concept-differentiated input pairs and subtract them from model weights. Others [23] operate in the discrete latent space, identifying and replacing encodings tied to specific concepts or actions. While these methods offer promising directions, they remain inadequate for multi-concept erasure in practical scenarios. To address this gap, we propose ConceptVoid, tailored for robust and efficient multi-concept erasure in GVDs.
3. Preliminary
3.1. Training Process of GVDs
Let $x_0 \in \mathbb{R}^{M \times H \times W \times C}$ denote a clean target video sequence, where $M$ represents the number of frames, $H \times W$ the resolution, and $C$ the number of channels. Let $p$ denote the textual prompt, $L$ the predefined total number of denoising steps, and $\{\beta_t\}_{t=1}^{L}$ the predefined noise-scheduling parameters, where $\beta_t$ indicates the noise intensity injected at step $t$. GVDs first add noise to the video $x_0$ progressively according to the following Markov chain:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big),$$
where $x_t$ denotes the noisy video sequence at step $t$, $\mathcal{N}(\cdot;\ \mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, and $1-\beta_t$ represents the retention ratio at step $t$. Let $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ denote standard Gaussian noise. To improve training efficiency, the multi-step noise addition can be further simplified into a single-step noise injection [40]:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,$$
where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. To enable the model to learn video generation, GVDs employ a neural network parameterized by $\theta$, denoted as $\epsilon_\theta(x_t, p, t)$, to predict the injected noise.
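For concreteness, the single-step noise injection above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the configuration used in our experiments: the linear beta schedule and the toy tensor shapes are assumptions.

```python
import torch

def make_schedule(L: int = 1000):
    # Linear beta schedule (a common DDPM choice); alpha_bar_t = prod_{s<=t} (1 - beta_s).
    betas = torch.linspace(1e-4, 0.02, L)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    return betas, alpha_bars

def add_noise(x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor):
    """Single-step forward noising: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over frames, channels, pixels
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return xt, eps

# Toy usage: a batch of 2 videos with M = 8 frames, 3 channels, 32x32 resolution.
betas, alpha_bars = make_schedule()
x0 = torch.randn(2, 8, 3, 32, 32)
t = torch.randint(0, 1000, (2,))
xt, eps = add_noise(x0, t, alpha_bars)
```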
3.2. Inference Process of GVDs
The generation inference begins with a noisy video sequence
and iteratively denoises it through a sampling method to produce the clean target video sequence
. Specifically, the sampling methods include standard DDPM [
40] sampling, DDIM [
41] sampling, and classifier-free guidance (CFG) [
42]. Taking DDPM sampling as an example, for
, the denoising operation is performed as
where
denotes the additional injected random noise, and
, which can be either
or 0 (i.e., deterministic sampling), represents the user-specified noise injection strength.
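A minimal sketch of one DDPM denoising step matching the update above; here `eps_pred` stands for the output of the noise-prediction network $\epsilon_\theta(x_t, p, t)$, and the schedule tensors follow the sketch in Section 3.1.

```python
import torch

@torch.no_grad()
def ddpm_step(xt, t, eps_pred, betas, alpha_bars, deterministic=False):
    """One reverse step:
    x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t) + sigma_t * z."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    ab_t = alpha_bars[t]
    mean = (xt - beta_t / (1.0 - ab_t).sqrt() * eps_pred) / alpha_t.sqrt()
    if deterministic or t == 0:
        return mean                       # sigma_t = 0 (deterministic sampling)
    z = torch.randn_like(xt)              # additional injected random noise
    return mean + beta_t.sqrt() * z       # sigma_t = sqrt(beta_t)
```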
4. Method
We propose ConceptVoid to tackle multi-concept erasure in real-world settings. We begin by formalizing single-concept erasure as a computable task, then reformulate it as a constrained optimization problem, enabling its extension to a constrained multi-objective framework for multi-concept erasure.
4.1. Single-Concept Erasure
4.1.1. Problem Setup of Concept Erasure
Since existing studies on concept erasure in T2Vs have not yet provided a formal mathematical definition of the problem, we draw upon the definitions established for text and image generation models [43,44,45,46] and adapt them to the T2V setting.
Suppose the target concept to be erased is $c$. Let $\mathcal{P}_c$ represent the set of textual prompts that describe or contain concept $c$, and let $\mathcal{P}_s$ denote the set of safe prompts (i.e., those that do not contain $c$). Let the original model parameters be denoted as $\theta$, with the corresponding conditional generative distribution denoted as $p_{\theta}(x \mid \cdot)$. After erasure, the updated model parameters are denoted as $\theta^*$, with the corresponding generative distribution $p_{\theta^*}(x \mid \cdot)$. Let $p_{\text{safe}}$ be the safe distribution, which can be a prior distribution unrelated to $c$ or a masked generative distribution, serving as a reference for complete erasure. The goal of concept erasure is to identify a new set of parameters $\theta^*$ without retraining, such that the following conditions are satisfied:
I: high erasure strength. For prompts in $\mathcal{P}_c$, the generative distribution $p_{\theta^*}(x \mid c')$ should be as close as possible to $p_{\text{safe}}(x \mid c')$, i.e., $D\big(p_{\theta^*}(x \mid c'),\ p_{\text{safe}}(x \mid c')\big)$ should be sufficiently small, where $D(\cdot,\cdot)$ denotes a distributional distance metric.
II: high capability preservation. For prompts in $\mathcal{P}_s$, the post-erasure distribution $p_{\theta^*}(x \mid c')$ should remain consistent with the original distribution $p_{\theta}(x \mid c')$, i.e., $D\big(p_{\theta^*}(x \mid c'),\ p_{\theta}(x \mid c')\big)$ should be sufficiently small.
By combining these two objectives, the objective function of concept erasure can be formulated as:
$$\min_{\theta^*}\ \mathbb{E}_{c' \in \mathcal{P}_c}\Big[D\big(p_{\theta^*}(x \mid c'),\ p_{\text{safe}}(x \mid c')\big)\Big] + \lambda\, \mathbb{E}_{c' \in \mathcal{P}_s}\Big[D\big(p_{\theta^*}(x \mid c'),\ p_{\theta}(x \mid c')\big)\Big], \tag{1}$$
where $\lambda$ is a hyperparameter that balances the two objectives. As $\lambda \to 0$, priority is given to complete erasure; whereas as $\lambda \to \infty$, priority is given to preservation of the original capabilities.
Since Equation (1) is formulated in terms of the true distributions, which are typically unknown and thus difficult to compute and backpropagate through directly, we reformulate it as follows.
4.1.2. Distribution-Aligned Proxy
Due to the inaccessibility of the true distributional distance $D(\cdot,\cdot)$, which stems from the unknown nature of the underlying distributions, we construct a surrogate objective. Specifically, within the GVD framework, the model is trained to match the true forward process by minimizing the Kullback–Leibler (KL) divergence between the model posterior $p_{\theta}(x_{t-1} \mid x_t)$ and the true posterior $q(x_{t-1} \mid x_t, x_0)$. Ho et al. [40] further demonstrate that, for Gaussian forward processes, each of these KL divergence terms is exactly or approximately equivalent to a weighted mean squared error (MSE) of the noise prediction network $\epsilon_\theta$:
$$D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\ \|\ p_{\theta}(x_{t-1} \mid x_t)\big) \approx w_t\, \big\|\epsilon - \epsilon_\theta(x_t, p, t)\big\|_2^2, \tag{2}$$
where $w_t$ denotes the loss weight at step $t$. In other words, the model can recover the true denoising posterior by accurately predicting the injected noise; the smaller the error, the closer the generative distribution is to the data distribution. Therefore, we adopt the noise-prediction MSE as a surrogate for the distributional distance in Equation (1), with minimizing the MSE being equivalent to aligning the generative and data distributions.
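As a simple illustration, the weighted noise-prediction MSE surrogate can be written as follows; the batching convention and the interface for the per-step weight $w_t$ are assumptions.

```python
import torch

def noise_mse(eps_pred: torch.Tensor, eps_true: torch.Tensor, w_t: torch.Tensor) -> torch.Tensor:
    """Weighted surrogate w_t * ||eps - eps_theta(x_t, p, t)||^2, averaged over the batch.
    Minimizing it drives the model posterior toward the true denoising posterior."""
    per_sample = ((eps_pred - eps_true) ** 2).flatten(1).mean(dim=1)  # one error per video
    return (w_t * per_sample).mean()
```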
4.1.3. Safe Distribution Proxy
Since the reference distribution $p_{\text{safe}}$ is typically unknown, we construct its proxy by predicting a negatively guided noise. Specifically, prior research [47] has shown that concept erasure effectively reduces the probability of generating a video $x$ that represents the target concept $c$ through the following mechanism: $p_{\theta^*}(x) \propto p_{\theta}(x)\, p_{\theta}(c \mid x)^{-\eta}$, where $\eta > 0$. Based on the noise-prediction MSE surrogate, we construct the negatively guided noise via reparameterization as a surrogate target for the reference distribution:
$$\tilde{\epsilon}(x_t, c, t) = \epsilon_\theta(x_t, t) - \eta\,\big[\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, t)\big], \tag{3}$$
where $\epsilon_\theta(x_t, t)$ represents the network's prediction under the unconditional setting and $\eta$ is a tunable parameter controlling the erasure strength.
By integrating the aforementioned two proxy objectives, we reformulate Equation (1) into the following computable form:
$$\min_{\theta^*}\ \mathbb{E}_{x_t, t,\, c' \in \mathcal{P}_c}\Big[\big\|\epsilon_{\theta^*}(x_t, c', t) - \tilde{\epsilon}(x_t, c', t)\big\|_2^2\Big] + \lambda\, \mathbb{E}_{x_t, t,\, c' \in \mathcal{P}_s}\Big[\big\|\epsilon_{\theta^*}(x_t, c', t) - \epsilon_{\theta}(x_t, c', t)\big\|_2^2\Big], \tag{4}$$
which serves as an efficient, computable proxy for the original concept erasure objective.
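A minimal sketch of the proxy objective in Equation (4), assuming an ESD-style setup in which a frozen copy of the original model supplies the conditional and unconditional noise predictions; the function signature and variable names are illustrative rather than the exact implementation.

```python
import torch

def erasure_proxy_loss(eps_student_c, eps_frozen_c, eps_frozen_uncond,
                       eps_student_safe, eps_frozen_safe, eta=1.0, lam=1.0):
    """Computable proxy for single-concept erasure (cf. Equation (4)).
    Erasure term: push the fine-tuned model's concept-conditioned prediction toward the
    negatively guided target eps_uncond - eta * (eps_cond - eps_uncond) from the frozen model.
    Preservation term: keep predictions on safe prompts close to the original model."""
    target = eps_frozen_uncond - eta * (eps_frozen_c - eps_frozen_uncond)
    erase_loss = ((eps_student_c - target.detach()) ** 2).mean()
    preserve_loss = ((eps_student_safe - eps_frozen_safe.detach()) ** 2).mean()
    return erase_loss + lam * preserve_loss
```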
4.2. Multi-Concept Erasure
Let the set of concepts to be erased be denoted as
, with
m representing the number of concepts to be erased. For each concept
, there exists an erasure objective as defined in Equation (
4). For clarity, we denote (I) the erasure-strength objective in Equation (
4) as
and (II) the capability-preservation objective as
, which are formally defined as follows:
Therefore, for each target concept
to be erased, its objective function of single-concept erasure (i.e., Equation (
4)) can be expressed as:
Furthermore, the multi-concept erasure problem can be succinctly formulated by minimizing the sum of the objective functions (i.e., Equation (
5)) for all concepts to be removed, which can be written as:
However, due to inherent conflicts among different concept erasure objectives, directly applying Equation (
6) can lead to significant degradation in model performance [
48], as it fails to achieve an optimal trade-off across multiple objectives.
4.2.1. Conflict Resolution
To achieve optimal conflict resolution, we reformulate Equation (6) as a constrained multi-objective optimization problem, where the erasure-strength objectives $\{\mathcal{L}_i^{\mathrm{ers}}(\theta^*)\}_{i=1}^{m}$ serve as the objective functions and $\mathcal{L}_i^{\mathrm{pre}}(\theta^*) \le \delta$ as the constraints, with $\delta$ denoting the threshold for allowable capability degradation. Observing the substantial overlap among the safe prompt sets $\mathcal{P}_{s_i}$ associated with different target concepts, we further consolidate the constraint set into a single constraint $\mathcal{L}^{\mathrm{pre}}(\theta^*) \le \delta$, where $\mathcal{L}^{\mathrm{pre}}$ is evaluated on $\mathcal{P}_s = \bigcap_{i=1}^{m} \mathcal{P}_{s_i}$, the intersection of these sets. Then, Equation (6) can be reformulated as the following form:
$$\min_{\theta^*}\ \big(\mathcal{L}_1^{\mathrm{ers}}(\theta^*),\ \mathcal{L}_2^{\mathrm{ers}}(\theta^*),\ \ldots,\ \mathcal{L}_m^{\mathrm{ers}}(\theta^*)\big) \quad \text{s.t.} \quad \mathcal{L}^{\mathrm{pre}}(\theta^*) \le \delta. \tag{7}$$
Solving Equation (7) yields an approximate Pareto-optimal solution to the original problem in Equation (6), thereby achieving an optimal trade-off across multiple concept erasure objectives.
4.2.2. Computational Implementation
We adopt the MGDA [49] to solve Equation (7), aiming to identify a common descent direction $d$ satisfying $\langle d,\ \nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*) \rangle \le 0$ for all $i$, where $\nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*)$ denotes the gradient of the $i$-th objective function, so that a single update reduces all objectives as much as possible. Specifically, we define $G = [g_1, g_2, \ldots, g_m]$ with $g_i = \nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*)$, and seek a weight vector $\alpha = (\alpha_1, \ldots, \alpha_m)$ that minimizes the norm of the combined gradient:
$$\min_{\alpha}\ \Big\|\sum_{i=1}^{m} \alpha_i g_i\Big\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1,\ \ \alpha_i \ge 0. \tag{8}$$
Equation (8) can be reformulated as a standard quadratic programming (QP) problem [50]. By making the QP strictly convex, it can be efficiently solved using any QP solver to obtain $\alpha^*$. Once $\alpha^*$ is obtained, the MGDA common descent direction is defined as $d = -\sum_{i=1}^{m} \alpha_i^* g_i$, and the parameters are updated accordingly: $\theta^* \leftarrow \theta^* + \eta_{\mathrm{lr}}\, d$, where $\eta_{\mathrm{lr}}$ denotes the learning rate. Considering the constraint in Equation (7), we adopt a projection method to ensure feasibility: after each update, if the new $\theta^*$ violates the constraint, we perform a Euclidean projection $\theta^* \leftarrow \arg\min_{\theta:\ \mathcal{L}^{\mathrm{pre}}(\theta) \le \delta} \|\theta - \theta^*\|_2$ back onto the feasible region.
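As a concrete illustration, the min-norm problem in Equation (8) can be solved with a short Frank–Wolfe loop over the Gram matrix of the per-concept gradients. The paper only requires "any QP solver", so this particular solver choice and the variable names are assumptions.

```python
import torch

def min_norm_weights(grads, iters: int = 100):
    """Return alpha on the simplex minimizing || sum_i alpha_i g_i ||_2^2 (Equation (8)).
    `grads` is a list of per-concept gradients, each flattened into a 1-D tensor."""
    G = torch.stack([g.flatten() for g in grads])   # (m, P) stacked gradients
    gram = G @ G.T                                  # (m, m) pairwise inner products
    m = gram.shape[0]
    alpha = torch.full((m,), 1.0 / m)
    for _ in range(iters):
        grad_alpha = gram @ alpha                   # gradient of the quadratic objective
        s = torch.zeros_like(alpha)
        s[torch.argmin(grad_alpha)] = 1.0           # best simplex vertex (linear minimization oracle)
        d = s - alpha
        denom = d @ gram @ d                        # exact line search for a quadratic objective
        step = 1.0 if denom <= 0 else torch.clamp(-(alpha @ gram @ d) / denom, 0.0, 1.0)
        alpha = alpha + step * d
    return alpha
```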
4.2.3. Enhanced Framework Extensibility
To further enhance scalability and enable control over different concept erasure targets with varying priorities or degrees of erasure, we improve upon MGDA by introducing importance weights $\mathbf{w} = (w_1, \ldots, w_m)$ that scale each column of the gradient matrix $G$, yielding $G_{\mathbf{w}} = [w_1 g_1, w_2 g_2, \ldots, w_m g_m]$. The complete procedure of the improved algorithm is given in Algorithm 1.
Algorithm 1 Weighted MGDA for Constrained Multi-Concept Erasure
1: Input: pre-trained diffusion model parameters $\theta$, target concepts $\{c_1, \ldots, c_m\}$, video sequences $x_0$, textual prompts, erasure strength $\eta$, constraint threshold $\delta$, importance weights $\mathbf{w} = (w_1, \ldots, w_m)$, learning rate $\eta_{\mathrm{lr}}$, max iterations $T$.
2: Output: model parameters after concept erasure $\theta^*$.
3: Initialization: set $\theta^* \leftarrow \theta$ and $k \leftarrow 0$.
4: while $k < T$ do
5:  Compute per-concept gradients: $g_i = \nabla_{\theta^*} \mathcal{L}_i^{\mathrm{ers}}(\theta^*)$, where $i = 1, \ldots, m$.
6:  Form the weighted gradient matrix: $G_{\mathbf{w}} = [w_1 g_1, \ldots, w_m g_m]$.
7:  Solve the QP for combination weights: $\alpha^* = \arg\min_{\alpha \ge 0,\ \sum_i \alpha_i = 1} \big\|\sum_i \alpha_i w_i g_i\big\|_2^2$.
8:  Compute the common descent direction: $d = -\sum_i \alpha_i^* w_i g_i$.
9:  Parameter update: $\theta^* \leftarrow \theta^* + \eta_{\mathrm{lr}}\, d$.
10:  if $\mathcal{L}^{\mathrm{pre}}(\theta^*) > \delta$ then
11:   Project $\theta^*$ back onto the feasible region $\{\theta : \mathcal{L}^{\mathrm{pre}}(\theta) \le \delta\}$.
12:  end if
13:  $k \leftarrow k + 1$
14: end while
15: return the erased model parameters $\theta^*$.
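The sketch below assembles one iteration of Algorithm 1 in PyTorch, reusing `min_norm_weights` from the previous sketch. The loss callables, hyperparameter values, and in particular the feasibility handling (the step is simply reverted when the preservation constraint is violated, a crude stand-in for the exact Euclidean projection) are illustrative assumptions rather than the authors' implementation.

```python
import torch

def weighted_mgda_step(params, erase_losses, preserve_loss, weights, lr=1e-5, delta=0.1):
    """One iteration of Algorithm 1 (sketch). `params`: list of tensors with requires_grad=True;
    `erase_losses`: per-concept scalar losses L_i^ers on the current graph;
    `preserve_loss`: callable returning L^pre for the current parameters; `weights`: w_1..w_m."""
    # Lines 5-6: per-concept gradients, scaled by importance weights (columns of G_w).
    scaled_grads = []
    for w_i, loss_i in zip(weights, erase_losses):
        g_i = torch.autograd.grad(loss_i, params, retain_graph=True)
        scaled_grads.append(w_i * torch.cat([g.flatten() for g in g_i]))
    # Line 7: combination weights from the min-norm QP; line 8: common descent direction.
    alpha = min_norm_weights(scaled_grads)
    direction = -sum(a * g for a, g in zip(alpha, scaled_grads))
    # Line 9: parameter update, keeping a backup for the feasibility check.
    backup = [p.detach().clone() for p in params]
    offset = 0
    with torch.no_grad():
        for p in params:
            n = p.numel()
            p.add_(lr * direction[offset:offset + n].view_as(p))
            offset += n
        # Lines 10-12: revert the step if L^pre exceeds delta (stand-in for the projection).
        if preserve_loss() > delta:
            for p, b in zip(params, backup):
                p.copy_(b)
```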
5. Experiments
We conduct extensive experiments to evaluate the capability of ConceptVoid in removing safety-critical concepts from text-to-video diffusion models. The evaluation focuses on both the effectiveness of harmful content suppression and the preservation of video quality and semantic alignment. We consider both single-concept and multi-concept erasure scenarios to test the scalability and flexibility of our approach.
5.1. Experimental Setup
5.1.1. Datasets
We use T2VSafetyBench [51], a safety-focused evaluation suite tailored for text-to-video models. It covers 12 critical aspects of video generation safety and provides a malicious prompt dataset comprising real-world prompts, LLM-generated prompts, and jailbreak-attack-based prompts. Following a recent study [22], we select four categories—Pornography, Public Figures, Copyright & Trademarks, and Sequential Action Risk—to evaluate the effectiveness of concept erasure. We filter out low-quality or meaningless prompts from the dataset. In addition, we adopt VBench [52] to assess the generative capabilities of the model after concept removal.
5.1.2. Models
We evaluate ConceptVoid on three high-performance text-to-video diffusion models:
CogVideoX-2B (CogX-2B) [3]: a lightweight, 2B-parameter diffusion transformer for text-to-video generation that leverages a 3D causal VAE for spatiotemporal compression, an Expert Transformer with adaptive LayerNorm for deep text–video fusion, and progressive/multi-resolution frame packing to produce coherent short videos (e.g., up to 6 s at moderate resolution) with efficient training and inference.
CogVideoX-5B (CogX-5B): a higher-capacity 5B-parameter variant that builds on the same architectural innovations (3D causal VAE, Expert Transformer with adaptive LayerNorm, progressive training) to deliver richer semantic modeling and stronger temporal coherence, enabling generation of longer (e.g., 10-s), high-quality videos with complex motion and narrative consistency.
OpenSora [53]: an open-source, large-scale, cost-efficient video diffusion framework that decouples spatial and temporal attention via a Spatial-Temporal Diffusion Transformer (STDiT) and employs a highly compressive 3D autoencoder for compact representations and accelerated training; it supports flexible synthesis (text-to-video, image-to-video, etc.) of up to 15-s, high-fidelity videos with arbitrary aspect ratios, and emphasizes commercial-level performance at controlled training cost.
All models are tested with 48-frame video outputs at 8 frames per second.
5.1.3. Evaluation Metrics
We adopt the following metrics to assess both erasure success and generation fidelity: Unsafe Generation Rate (UGR), Fréchet Video Distance (FVD), MM-Notox Distance (MMN), and Object-Subject Score (OSS). All reported metrics are dimensionless quantities, as they are computed from proportions, normalized feature distances, or similarity scores without any physical units.
5.1.4. Compared Methods
We refer to the original, unmodified diffusion model (without any concept erasure) as Original. Since no prior method handles multi-concept removal jointly, we build two baselines by adapting the single-concept erasure technique of [47]. In this approach, a short text description of the undesired concept guides fine-tuning: the model is updated using conditioned and unconditioned scores from a frozen diffusion model to steer generation away from that concept.
Mix: all target concept prompts are included within the same epoch and erased in one pass.
Sequential: concepts are removed one at a time, iteratively applying the single-concept erasure.
5.1.5. Training Details
We construct preservation concepts from related concepts that are most affected by erasing the target concept. For example, in the nudity erasure experiments, we set “person” as the preservation concept. All experiments are conducted on the same Ubuntu 20.04 LTS server equipped with a 48-core CPU, 256 GB RAM, and an NVIDIA A800 GPU. On an A800 GPU, unlearning a concept for 10 epochs takes approximately 20 min for CogVideoX-2B and OpenSora, and about 40 min for CogVideoX-5B.
5.2. Main Results
5.2.1. Single-Concept Erasure
We evaluate ConceptVoid under the single-concept erasure setting, where each harmful category is addressed independently. As shown in Figure 1 and Table 1, the method consistently reduces unsafe content across all tested categories, with the Unsafe Generation Rate (UGR) dropping by over 50% on average. These improvements are achieved without sacrificing visual quality or semantic relevance. Fréchet Video Distance (FVD) remains largely stable, indicating minimal perceptual drift, while MM-Notox Distance (MMN) also shows consistent reductions, reflecting improved alignment with safe textual intent.
5.2.2. Multi-Concept Erasure
We evaluate ConceptVoid under the more challenging multi-concept erasure setting, where all harmful concepts are removed jointly in a single optimization procedure.
Table 2 compares ConceptVoid against two baselines.
Across all three models, ConceptVoid achieves the lowest Unsafe Generation Rate (UGR), with up to 79.1% reduction on Open-Sora. Notably, it maintains high Object-Subject Score (OSS), closely matching the original model in semantic and temporal consistency, while both baselines suffer notable OSS degradation. This confirms ConceptVoid’s ability to suppress diverse unsafe content without compromising generation quality.
The superior performance of ConceptVoid arises from its formulation as a constrained multi-objective optimization problem, which avoids the gradient conflicts and oversuppression often observed in naïve mixing or sequential schemes. As detailed in Algorithm 1, ConceptVoid explicitly balances per-concept gradients through a weighted MGDA step, solving a quadratic program (QP) to obtain optimal combination weights based on concept importance. This yields a unified update direction that minimizes harmful features while preserving general expressiveness. In contrast, the Mix baseline suffers from concept interference, where competing gradients collapse shared features, and Sequential erasure accumulates destructive updates, often leading to oversuppression or forgetting of non-target content. By introducing importance weights and enforcing output-level constraints, ConceptVoid provides fine-grained control and robust erasure behavior, making it well-suited for scalable, real-world safety applications.
5.3. Ablation Study
To evaluate the impact of core components in our design, we perform controlled ablations on ConceptVoid using the Open-Sora model. We vary one factor at a time and measure the Unsafe Generation Rate (UGR↓) and Object-Subject Score (OSS↑), averaged over all four safety categories and five random seeds. Results are summarized in Figure 2.
5.3.1. Effect of Importance Weighting
In ConceptVoid, importance weights allow prioritizing specific harmful concepts during joint optimization. Removing this mechanism (i.e., setting all $w_i$ equal) leads to a noticeable degradation: UGR rises from 8.7% to 12.4% (+42.5%), while OSS drops from 78.8 to 77.2. This confirms that naive equal weighting can dilute suppression focus, making it harder to erase high-priority risks without compromising generality. In contrast, appropriately chosen importance weights $\mathbf{w}$ enable fine-grained, policy-aware moderation control.
5.3.2. Effect of Output-Drift Constraint
We examine the role of the output-anchoring constraint threshold $\delta$ in controlling model deviation. Tightening $\delta$ (50% smaller) improves OSS from 78.8 to 79.4 due to stronger preservation of distributional stability, but slightly increases UGR to 9.9%. Conversely, removing the constraint entirely causes significant degeneration: UGR increases to 15.2% and OSS drops to 76.5. These results highlight that the constraint acts as a safeguard against catastrophic forgetting and over-erasure, maintaining the model's ability to generate coherent, safe videos.
5.3.3. Effect of Erasure Strength
The erasure strength, which acts as the step size of each update, determines how aggressively ConceptVoid modifies model parameters. A smaller value slows down erasure, resulting in incomplete suppression (UGR 11.1%), while a larger value reduces UGR slightly (to 8.1%) but harms semantic quality (OSS drops to 77.6). The default setting strikes an effective trade-off between stability and suppression capacity. These findings suggest that the erasure strength should be tuned based on the target application's tolerance for visual perturbation versus safety guarantees.
6. Conclusions
In this work, we address the challenge of preventing pretrained GVDs from producing harmful or copyright-protected content without resorting to costly full retraining. Specifically, we tackle the problem of simultaneously erasing multiple harmful concepts. To this end, we propose ConceptVoid, a scalable multi-concept erasure framework that formulates multi-concept erasure as a constrained multi-objective optimization problem. For each target concept, we define a removal loss based on the discrepancy between noise predictions conditioned on the concept and those unconditioned, while preserving non-target capabilities through regularization of output distributions and parameter perturbations. We employ the MGDA to solve the resulting optimization problem, achieving Pareto-optimal trade-offs among competing erasure objectives. Additionally, we introduce an importance-weighting mechanism to flexibly adjust the priority and strength of each concept’s removal. Extensive experiments on state-of-the-art GVDs and diverse real-world datasets demonstrate that ConceptVoid effectively suppresses multiple harmful concepts while maintaining high video fidelity and strong scalability.
7. Limitations
Despite its advantages, ConceptVoid has several limitations. Its theoretical Pareto-optimality guarantees hinge on the convexity of both the objective functions and the feasible set; in non-convex scenarios, the framework can only assure weak Pareto optimality or Pareto stability. Furthermore, although ConceptVoid is in principle applicable to other generative architectures (e.g., text-to-image or text-to-text models), its empirical performance beyond video diffusion remains untested. The method also depends on explicit concept prompts and the corresponding conditioned versus unconditioned noise predictions, making it unable to automatically discover or erase novel or unspecified concepts. Finally, solving the constrained multi-objective optimization via MGDA and tuning importance weights for each target concept incurs additional computational overhead, which may limit real-time or resource-constrained deployments.
Author Contributions
Conceptualization, W.M. and Z.H.; methodology, Z.H., X.J. and C.W.; software, Z.H. and C.W.; validation, Z.H., X.J. and C.W.; formal analysis, Z.H. and W.M.; investigation, Z.H. and X.J.; resources, W.M.; data curation, Z.H. and X.J.; writing—original draft preparation, Z.H. and W.M.; writing—review and editing, X.J., C.W. and W.M.; visualization, Z.H. and X.J.; supervision, W.M.; project administration, W.M.; funding acquisition, W.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
GVDs | Generative Video Diffusion models |
MGDA | Multiple Gradient Descent Algorithm |
GIDs | Generative Image Diffusion models |
DiTs | Diffusion Transformers |
CFG | Classifier-Free Guidance |
KL | Kullback–Leibler |
MSE | Mean Squared Error |
CogX-2B | CogVideoX-2B |
CogX-5B | CogVideoX-5B |
UGR | Unsafe Generation Rate |
FVD | Fréchet Video Distance |
MMN | MM-Notox Distance |
OSS | Object-Subject Score |
QP | Quadratic Program |
References
- Yin, S.; Wu, C.; Yang, H.; Wang, J.; Wang, X.; Ni, M.; Yang, Z.; Li, L.; Liu, S.; Yang, F.; et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation. arXiv 2023, arXiv:2303.12346. [Google Scholar]
- Kong, W.; Tian, Q.; Zhang, Z.; Min, R.; Dai, Z.; Zhou, J.; Xiong, J.; Li, X.; Wu, B.; Zhang, J.; et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv 2024, arXiv:2412.03603. [Google Scholar] [CrossRef]
- Yang, Z.; Teng, J.; Zheng, W.; Ding, M.; Huang, S.; Xu, J.; Yang, Y.; Hong, W.; Zhang, X.; Feng, G.; et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv 2024, arXiv:2408.06072. [Google Scholar]
- Zheng, J.; Liu, X.; Liu, W.; He, L.; Yan, C.; Mei, T. Gait Recognition in the Wild with Dense 3D Representations and A Benchmark. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 20228–20237. [Google Scholar]
- Zhong, J.; Wang, Y.; Zhu, D.; Wang, Z. A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis, and Treatment Planning. arXiv 2025, arXiv:2506.07236. [Google Scholar] [CrossRef]
- Jiang, Y.; Gao, X.; Peng, T.; Tan, Y.; Zhu, X.; Zheng, B.; Yue, X. Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states. arXiv 2025, arXiv:2502.14744. [Google Scholar]
- Jiang, Y.; Tan, Y.; Yue, X. RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting. arXiv 2024, arXiv:2412.18826. [Google Scholar]
- Tan, Y.; Jiang, Y.; Li, Y.; Liu, J.; Bu, X.; Su, W.; Yue, X.; Zhu, X.; Zheng, B. Equilibrate rlhf: Towards balancing helpfulness-safety trade-off in large language models. arXiv 2025, arXiv:2502.11555. [Google Scholar]
- Xiao, H.; Liu, S.; Zuo, K.; Xu, H.; Cai, Y.; Liu, T.; Yang, Z. Multiple adverse weather image restoration: A review. Neurocomputing 2024, 618, 129044. [Google Scholar] [CrossRef]
- Xu, Z.; Liu, Y. Robust Anomaly Detection in Network Traffic: Evaluating Machine Learning Models on CICIDS2017. arXiv 2025, arXiv:2506.19877. [Google Scholar] [CrossRef]
- Setty, R. AI art generators hit with copyright suit over artists’ images. Bloom. Law 2023, 1, 2023. [Google Scholar]
- Wang, C.; Nie, C.; Liu, Y. Evaluating Supervised Learning Models for Fraud Detection: A Comparative Study of Classical and Deep Architectures on Imbalanced Transaction Data. arXiv 2025, arXiv:2505.22521. [Google Scholar] [CrossRef]
- Liu, Y.; Qin, X.; Gao, Y.; Li, X.; Feng, C. SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition. INNO-PRESS J. Emerg. Appl. AI 2025, 1, 26–33. [Google Scholar]
- Zhong, J.; Wang, Y. Enhancing Thyroid Disease Prediction Using Machine Learning: A Comparative Study of Ensemble Models and Class Balancing Techniques. Res. Sq. 2025. [Google Scholar] [CrossRef]
- Wang, Y.; Zhong, J.; Kumar, R. A Systematic Review of Machine Learning Applications in Infectious Disease Prediction, Diagnosis, and Outbreak Forecasting. Preprints 2025, 2025041250. [Google Scholar]
- Nguyen, T.T.; Huynh, T.T.; Ren, Z.; Nguyen, P.L.; Liew, A.W.C.; Yin, H.; Nguyen, Q.V.H. A survey of machine unlearning. arXiv 2022, arXiv:2209.02299. [Google Scholar] [CrossRef]
- Feng, X.; Li, Y.; Yu, F.; Zhang, L.; Chen, C.; Zheng, X. Plug and Play: Enabling Pluggable Attribute Unlearning in Recommender Systems. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 2689–2699. [Google Scholar]
- Feng, X.; Li, Y.; Yu, F.; Xiong, K.; Fang, J.; Zhang, L.; Du, T.; Chen, C. RAID: An In-Training Defense against Attribute Inference Attacks in Recommender Systems. arXiv 2025, arXiv:2504.11510. [Google Scholar] [CrossRef]
- Li, Y.; Chen, C.; Zhang, Y.; Liu, W.; Lyu, L.; Zheng, X.; Meng, D.; Wang, J. Ultrare: Enhancing receraser for recommendation unlearning via error decomposition. Adv. Neural Inf. Process. Syst. 2023, 36, 12611–12625. [Google Scholar]
- Ye, X.; Cheng, S.; Wang, Y.; Xiong, Y.; Li, Y. T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models. arXiv 2025, arXiv:2505.17550. [Google Scholar]
- Liu, S.; Tan, Y. Unlearning Concepts from Text-to-Video Diffusion Models. arXiv 2024, arXiv:2407.14209. [Google Scholar]
- Facchiano, S.; Saravalle, S.; Migliarini, M.; De Matteis, E.; Sampieri, A.; Pilzer, A.; Rodolà, E.; Spinelli, I.; Franco, L.; Galasso, F. Video Unlearning via Low-Rank Refusal Vector. arXiv 2025, arXiv:2506.07891. [Google Scholar] [CrossRef]
- De Matteis, E.; Migliarini, M.; Sampieri, A.; Spinelli, I.; Galasso, F. Human Motion Unlearning. arXiv 2025, arXiv:2503.18674. [Google Scholar] [CrossRef]
- Wang, J.; Yuan, H.; Chen, D.; Zhang, Y.; Wang, X.; Zhang, S. Modelscope text-to-video technical report. arXiv 2023, arXiv:2308.06571. [Google Scholar]
- Wang, X.; Yuan, H.; Zhang, S.; Chen, D.; Wang, J.; Zhang, Y.; Shen, Y.; Zhao, D.; Zhou, J. Videocomposer: Compositional video synthesis with motion controllability. Adv. Neural Inf. Process. Syst. 2023, 36, 7594–7611. [Google Scholar]
- Wu, J.Z.; Ge, Y.; Wang, X.; Lei, S.W.; Gu, Y.; Shi, Y.; Hsu, W.; Shan, Y.; Qie, X.; Shou, M.Z. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7623–7633. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video diffusion models. Adv. Neural Inf. Process. Syst. 2022, 35, 8633–8646. [Google Scholar]
- Chen, H.; Xia, M.; He, Y.; Zhang, Y.; Cun, X.; Yang, S.; Xing, J.; Liu, Y.; Chen, Q.; Wang, X.; et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv 2023, arXiv:2310.19512. [Google Scholar]
- Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4195–4205. [Google Scholar]
- Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
- Kim, C.; Qi, Y. A comprehensive survey on concept erasure in text-to-image diffusion models. arXiv 2025, arXiv:2502.14896. [Google Scholar]
- Xie, Y.; Liu, P.; Zhang, Z. Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression. arXiv 2025, arXiv:2505.19398. [Google Scholar] [CrossRef]
- Feng, X.; Li, Y.; Wang, C.; Liu, J.; Zhang, L.; Chen, C. A Neuro-inspired Interpretation of Unlearning in Large Language Models through Sample-level Unlearning Difficulty. arXiv 2025, arXiv:2504.06658. [Google Scholar]
- Feng, X.; Li, Y.; Ji, H.; Zhang, J.; Zhang, L.; Du, T.; Chen, C. Bridging the Gap Between Preference Alignment and Machine Unlearning. arXiv 2025, arXiv:2504.06659. [Google Scholar] [CrossRef]
- Zhang, G.; Wang, K.; Xu, X.; Wang, Z.; Shi, H. Forget-me-not: Learning to forget in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 1755–1764. [Google Scholar]
- Lu, S.; Wang, Z.; Li, L.; Liu, Y.; Kong, A.W.K. Mace: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6430–6440. [Google Scholar]
- Gandikota, R.; Orgad, H.; Belinkov, Y.; Materzyńska, J.; Bau, D. Unified concept editing in diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5111–5120. [Google Scholar]
- Gong, C.; Chen, K.; Wei, Z.; Chen, J.; Jiang, Y.G. Reliable and efficient concept erasure of text-to-image diffusion models. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 73–88. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
- Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598. [Google Scholar] [CrossRef]
- Zheng, J.; Liu, X.; Wang, S.; Wang, L.; Yan, C.; Liu, W. Parsing is All You Need for Accurate Gait Recognition in the Wild. In Proceedings of the ACMMM, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 116–124. [Google Scholar]
- Liu, S.; Zhang, Y.; Li, X.; Liu, Y.; Feng, C.; Yang, H. Gated Multimodal Graph Learning for Personalized Recommendation. INNO-PRESS J. Emerg. Appl. AI 2025, 1, 17–25. [Google Scholar]
- Li, Y.; Chen, C.; Zheng, X.; Zhang, Y.; Han, Z.; Meng, D.; Wang, J. Making users indistinguishable: Attribute-wise unlearning in recommender systems. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 984–994. [Google Scholar]
- Chen, C.; Zhang, Y.; Li, Y.; Wang, J.; Qi, L.; Xu, X.; Zheng, X.; Yin, J. Post-training attribute unlearning in recommender systems. ACM Trans. Inf. Syst. 2024, 43, 1–28. [Google Scholar] [CrossRef]
- Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2426–2436. [Google Scholar]
- Feng, X.; Li, Y.; Chen, C.; Zhang, L.; Li, L.; Zhou, J.; Zheng, X. Controllable Unlearning for Image-to-Image Generative Models via ε-Constrained Optimization. arXiv 2024, arXiv:2408.01689. [Google Scholar]
- Sener, O.; Koltun, V. Multi-task learning as multi-objective optimization. Adv. Neural Inf. Process. Syst. 2018, 31, 525–536. [Google Scholar]
- Frank, M.; Wolfe, P. An algorithm for quadratic programming. Nav. Res. Logist. Q. 1956, 3, 95–110. [Google Scholar] [CrossRef]
- Miao, Y.; Zhu, Y.; Yu, L.; Zhu, J.; Gao, X.S.; Dong, Y. T2vsafetybench: Evaluating the safety of text-to-video generative models. Adv. Neural Inf. Process. Syst. 2024, 37, 63858–63872. [Google Scholar]
- Huang, Z.; He, Y.; Yu, J.; Zhang, F.; Si, C.; Jiang, Y.; Zhang, Y.; Wu, T.; Jin, Q.; Chanpaisit, N.; et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21807–21818. [Google Scholar]
- Peng, X.; Zheng, Z.; Shen, C.; Young, T.; Guo, X.; Wang, B.; Xu, H.; Liu, H.; Jiang, M.; Li, W.; et al. Open-sora 2.0: Training a commercial-level video generation model in $200k. arXiv 2025, arXiv:2503.09642. [Google Scholar]
- Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv 2018, arXiv:1812.01717. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).