PD-CBDM: Training Class-Balancing Diffusion Models with Perceptual Distinguish Loss

Hu, Junyan; Luo, Wei; Chen, Tong; Yang, Xiaobao; Hou, Zhiqiang

doi:10.3390/math14101576

by Junyan Hu ¹,², Wei Luo ¹, Tong Chen ¹, Xiaobao Yang ¹,²,* and Zhiqiang Hou ¹

¹ School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710121, China

² Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Chang’an West Street, Xi’an 710121, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1576; https://doi.org/10.3390/math14101576

Submission received: 7 April 2026 / Revised: 23 April 2026 / Accepted: 4 May 2026 / Published: 7 May 2026

(This article belongs to the Special Issue Advanced Mathematical Methods for Machine Learning, Neural Networks, and Computer Vision)

Abstract

For image generation, denoising diffusion probabilistic models (DDPMs) have shown strong performance. Nevertheless, under class-imbalanced training data, many existing models tend to overfit head classes, which degrades image quality for tail classes. To mitigate this issue, we propose a new generation method, PD-CBDM (perceptual distinguish loss–class-balancing diffusion models). As a first step, PD-CBDM revises the target-label distribution used for label sampling in the baseline pipeline, so tail classes are sampled more frequently during training; this improves the diversity of generated images while keeping fidelity high. Next, we introduce a perceptual distinguish loss that enlarges the separation (measured by the KL divergence in the reverse process) between the data distributions of head and tail classes, which helps suppress head-class overfitting and improves generation quality across classes. Additionally, we propose a timestep-dependent Self-Attention (TSA) module that injects timestep cues into the self-attention mechanism to model temporal and spatial dependencies together, thereby enhancing noise estimation accuracy and image generation quality. Experiments show that PD-CBDM improves FID from 5.81 to 4.96 on CIFAR100-LT and from 5.46 to 5.03 on CIFAR10-LT, and it is competitive with representative recent methods such as BPA and NoisyTwins.

Keywords: long-tail learning; image generation; diffusion models; conditional generation; attention mechanisms

MSC: 68T07

1. Introduction

In recent years, diffusion models [1,2] have made substantial progress in image generation [3]. Compared with GAN-based approaches [4], they often deliver better fidelity and stronger sample diversity. Beyond sample quality, diffusion models typically feature straightforward and stable optimization, which supports a wide range of applications, including text-to-image generation [5,6,7], video generation [8,9], audio generation [10,11], and object detection [12,13]. In conditional diffusion models, auxiliary signals are used to steer the denoising process toward the desired output; such signals can be category labels [14] or low-resolution inputs [15,16], leading to images that better match the given conditions.

Most existing diffusion training setups [1,17,18] implicitly assume roughly balanced data distributions. In practice, many real-world datasets are long-tailed, where head classes contain many more samples than tail classes. This imbalance is especially evident in some domains, such as medical image generation [19,20]. Although conditional generative models can produce satisfactory images for head classes, the skewed distribution makes it difficult to capture tail-class characteristics, which often results in low-quality tail-class generations and harms overall performance. Accordingly, the core challenge in long-tailed diffusion generation is not merely to correct skewed class frequencies but to improve tail-class generation quality while preserving head-class quality under three coupled difficulties: head-class-dominated training, perceptual confusion between tail and visually dominant head classes, and insufficient timestep-aware modeling in the denoising network. Existing class-imbalance handling strategies in generative models mainly focus on resampling, re-weighting, or prior adjustment to reduce distribution bias; however, these strategies do not directly resolve all three difficulties in a unified manner.

For long-tailed datasets, unconditional diffusion models often generate a substantial number of low-quality images. CBDM (class-balancing diffusion models) [21] was among the first to leverage diffusion models for image generation under long-tailed data distributions. It introduces a distribution-adjustment regularizer that encourages generated samples to match randomly chosen target labels more closely. This helps curb head-class overfitting and improves tail-class sample diversity by allowing the model to learn features beyond the majority classes. Other methods such as BPA [22] revisit long-tailed diffusion training from the perspective of bias-aware prior adjustment. While these methods effectively alleviate imbalance at the class-distribution or prior-adjustment level, they do not explicitly address head–tail perceptual confusion during generation, nor do they revisit timestep-aware attention modeling in the denoising network. PD-CBDM complements these class-balancing strategies by explicitly addressing these two aspects within a unified diffusion framework. Building upon the success of CBDM, we use it as the baseline in our study to further investigate long-tailed image generation. An overview of the proposed PD-CBDM framework is provided in Figure 1.

However, alleviating imbalance only through distribution or prior adjustment does not fully prevent head–tail confusion during sampling. We observe that, under long-tailed diffusion training, generated tail-class images often exhibit significant similarities to visually dominant head-class images, as shown in Figure 2. For example, in the training set, the sample count of class 0 is a hundred times that of class 83. Consequently, when generating images for the tail-class (class 83), features similar to the head class frequently appear, as demonstrated in Figure 2b. This head–tail appearance overlap has also been discussed in recent work on class-imbalanced diffusion training, where directly regularizing the overlap between head and tail distributions can alleviate tail-class confusion [23]. To address this issue, we introduce a perceptual distinguish loss to explicitly enlarge the separation between head- and tail-class representations, thereby reducing the tendency of the model to overfit head-class features. Specifically, during training, we resample head–tail-class image pairs and penalize the KL divergence between the distributions of head and tail images. This process is simplified into an MSE loss between noise and mean images in the reverse process, optimizing an updated loss function. The proposed loss increases the divergence between head and tail distributions while reducing the original diffusion loss, effectively achieving our goal.

Beyond improving head–tail representation separation, we also re-examine the target-label distribution used in the baseline [21]. Since this distribution is itself long-tailed, drawing labels from it directly during training can over-emphasize frequent classes, which may hurt tail-class coverage and limit the variety of generated samples. We therefore adjust the label-sampling strategy to increase tail-class exposure in training.

Moreover, beyond data-level re-weighting and loss-level regularization, we note that in the original U-Net, timestep cues are mainly injected through residual blocks, while the self-attention layers remain timestep-agnostic. Motivated by [24], we develop a timestep-dependent multihead self-attention module. The module incorporates timestep cues into self-attention computation [25], enabling the attention layers to account for both denoising dynamics over time and spatial contexts. This design improves noise prediction and leads to better generation quality.

In this paper, our contributions are as follows:

We propose PD-CBDM, a diffusion-based framework for long-tailed image generation that extends prior class-balancing diffusion methods by jointly addressing class-prior imbalance, head–tail perceptual confusion, and timestep-aware denoising.
We introduce a re-weighted target-label distribution together with a perceptual distinguish loss to improve tail-class training exposure and explicitly enlarge the separation between head- and tail-class representations, thereby reducing the tendency of tail-class samples to resemble head classes.
We design a timestep-dependent self-attention module that injects timestep information into self-attention computation, enabling the denoising network to better capture temporal–spatial dependencies under imbalanced training.
Extensive experiments on CIFAR100-LT, CIFAR10-LT, CIFAR-100, and CIFAR-10 demonstrate the effectiveness of the proposed method across both imbalanced and balanced settings, as well as its competitive performance against representative recent baselines.

The paper is structured as follows. Section 1 introduces the problem and motivates our approach. Section 2 summarizes related work. Section 3 presents the proposed approach. Section 4 describes the experimental setup, reports quantitative results, and benchmarks our method against existing state-of-the-art approaches. Section 5 presents qualitative results, and Section 6 discusses the main findings. Section 7 concludes the paper.

2. Related Work

We discuss relevant work on long-tail learning as well as attention mechanisms in this section.

2.1. Long-Tail Learning

Long-tailed category imbalance is a common phenomenon in real-world datasets, where a small subset of classes has abundant samples, while many others are sparsely represented [26,27,28]. This imbalance often causes deep learning models to bias toward head classes with abundant data, leading to poor performance on tail classes with limited samples [29,30,31].

In long-tail learning, long-tail recognition has garnered significant attention. For instance, SMOTE [32] augments minority classes by synthesizing new samples via interpolation between a minority instance and one of its k nearest minority neighbors. However, such resampling methods can suffer from edge distribution issues and lack precision in neighbor selection. Class-balanced (CB) loss [27] addresses this by assigning weights to each class’s loss inversely proportional to its sample size, balancing their contributions during training. Transfer-based methods, represented by domain-specific transfer learning (DSTL) [33], learn representations from long-tailed data and are subsequently adapted using a more balanced subset, so that knowledge can be better transferred to tail classes. Along a similar line, SSP [34] relies on self-supervised pretraining (e.g., contrastive objectives or rotation-based prediction) before standard supervised learning on long-tailed data, aiming to obtain a more balanced feature space. However, self-supervised methods can be complex to implement. Recent studies have also extended diffusion-based paradigms to long-tailed learning beyond generation, such as improving long-tail recognition without relying on external knowledge [35] and incorporating LLM-derived priors for long-tailed diffusion learning [36].

In the domain of long-tailed image generation, CBGAN [37] introduces a class balance regularizer, using category distribution information from a pre-trained classifier to constrain GAN outputs for a more balanced category distribution. The gSR (group spectral regularizer) [38] alleviates mode collapse in CGAN [39] by introducing a group spectral regularization term. However, CBGAN requires an additional classifier, and gSR, if overly strong, can restrict model learning, reducing diversity and increasing computational cost. NoisyTwins [40] evaluates various GAN regularization techniques for long-tail image generation, identifying common issues such as mode collapse and category confusion. It proposes a class embedding enhancement strategy to prevent mode collapse and improve generation performance. Beyond GAN-based solutions, diffusion models have recently been explored for long-tailed image generation due to their stable training and strong fidelity. In the context of diffusion models, Xu et al. [22] observed that a uniform noise sampling distribution across all classes biases the model toward head classes, degrading the quality and diversity of generated tail-class images, and proposed BPA (bias-aware prior adjusting) to mitigate this effect. Subsequent long-tailed diffusion studies have improved tail synthesis from complementary perspectives, including oriented calibration to better transfer and calibrate head knowledge for tail classes [41], overlap optimization to reduce head–tail appearance confusion [23], and contrastive conditional–unconditional alignment objectives to enhance long-tailed conditional generation [42]. In parallel, journal-level efforts have begun to explicitly address long-tailed bias in diffusion-based image synthesis via dedicated solver-style designs such as LTB-Solver [43].

Beyond task-specific long-tailed generation methods, recent foundation-model-based generative frameworks have highlighted the value of large-scale generative pretraining for visual representation and synthesis [44,45,46]. Representative examples include SpectralGPT [45] and related generative foundation-model paradigms [44,46], which reflect the broader trend toward large-scale generative pretraining and transferable representation learning. Although such methods are developed under substantially different data modalities, model scales, and training objectives from our CIFAR-style class-conditional long-tailed generation setting, they provide a useful broader context for positioning PD-CBDM within the evolving landscape of generative modeling. In contrast, PD-CBDM focuses on improving class-balanced diffusion training under explicit long-tailed image generation scenarios.

Overall, existing long-tailed generative methods mainly improve performance through class balancing, regularization, or sampling adjustment. In diffusion-based long-tailed generation, prior methods such as CBDM and BPA mainly focus on distribution regularization or bias-aware prior adjustment, while more recent studies also explore calibrated transfer and overlap optimization from complementary perspectives. Compared with these approaches, PD-CBDM unifies re-weighted target-label sampling, explicit head–tail perceptual separation, and timestep-aware attention modeling within a single diffusion framework. This unified design distinguishes PD-CBDM from prior diffusion-based approaches for long-tailed image generation.

2.2. Attention

In recent years, attention mechanisms have been increasingly integrated into convolutional neural networks (CNNs) to improve feature selection and highlight informative cues. By emphasizing salient signals, attention helps models focus on the most relevant parts of the input, often leading to better results. Models such as DDPM and improved DDPM use self-attention mechanisms, and CBDM adopts the same approach. The standard spatial multihead self-attention module is widely used because it is simple and easy to implement, yet its suitability for diffusion models remains an open question. Stable Diffusion [47] is a text-to-image diffusion system that uses cross-attention to incorporate different conditioning signals (e.g., text prompts and bounding boxes), enabling high-resolution image synthesis. VPD [48] further shows that cross-attention maps extracted from pre-trained text-to-image diffusion models can be used as explicit semantic cues for downstream visual perception tasks. By averaging cross-attention maps at different resolutions, it provides aggregated, class-specific semantic information. Linear attention mechanisms have also gained traction in image generation. For instance, ref. [49] transforms self-attention into a linear dot product of kernel feature mappings, reducing computational complexity to linear time and improving generation efficiency. The linear transformer model further facilitates diverse sample generation. Tactile diffusion [50] adopts linear attention to reduce computational burden and accelerate generation, but replacing the Softmax-based self-attention with a kernel-style mapping often comes with a drop in performance. In parallel, diffusion transformer backbones have rapidly advanced for high-fidelity synthesis and efficient training (e.g., PixArt-α) [44]. Moreover, several recent designs explicitly emphasize step-/timestep-aware attention or computation along the denoising trajectory, including step-wise dynamic attention mediators [51], dynamic diffusion transformers [52], and step-wise adaptive computation [53]. Motivated by these trends, PD-CBDM introduces a timestep-dependent self-attention (TSA) module, which captures both temporal and spatial information, thereby enhancing the model’s precision in noise prediction.

3. Our Approach

3.1. Preliminary

To effectively address the challenges of image generation under long-tailed class distributions, we build upon the CBDM baseline and introduce a series of targeted improvements. We begin by summarizing the key diffusion-model background and the CBDM framework.

a. Diffusion Model: The diffusion model is a generative framework that synthesizes samples through an iterative denoising procedure. It specifies a forward noising chain that gradually maps data to Gaussian noise, together with a reverse denoising chain that reconstructs clean samples from noisy inputs.

In the forward process, $q (x_{t} | x_{t - 1})$ describes the perturbation of an original data point $x_{0}$ drawn from the real distribution. This perturbation is achieved by mixing in Gaussian noise with a mean of 0 and a variance of $β_{t}$ in a Markov chain manner over T steps until the data fully transitions into Gaussian noise. The process is mathematically formulated as Equation (1):

q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I),

(1)

where $β_{t}$ denotes the noise level hyperparameter, $N$ denotes the normal distribution, and $I$ denotes the identity matrix. Further, the data point $x_{t}$ at any intermediate step can be directly sampled from $x_{0}$ using a closed-form expression:

q (x_{t} | x_{0}) = N (x_{t}; \sqrt{{\bar{α}}_{t}} x_{0}, (1 - {\bar{α}}_{t}) I), where α_{t} = 1 - β_{t}, {\bar{α}}_{t} = \prod_{s = 1}^{t} α_{s} .

(2)
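The closed-form sampling in Equation (2) can be illustrated with a minimal sketch. The linear schedule endpoints follow the values reported in the experimental setup; the number of steps (T = 1000), the four-pixel toy input, and all function names are assumptions made only for this illustration.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise levels beta_1..beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products bar(alpha)_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot using Equation (2).

    x0 is a flat list of pixel values and t is 1-indexed.
    Returns the noisy sample and the noise that produced it.
    """
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
betas = linear_betas(1000)        # T = 1000 is an assumed value
abar = alpha_bars(betas)
x0 = [0.5, -0.25, 1.0, 0.0]       # toy "image" with four pixels
x_mid, _ = q_sample(x0, 500, abar, rng)
x_end, _ = q_sample(x0, 1000, abar, rng)
```

Because $\bar{α}_{t}$ shrinks toward zero as t grows, `x_end` is almost pure Gaussian noise, whereas `x_mid` still carries part of the original signal.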

Conversely, the reverse process gradually restores the data from noise. The true reverse transition probability $q (x_{t - 1} | x_{t})$ is approximated by a learnable Gaussian model $p_{θ} (x_{t - 1} | x_{t})$, parameterized by $θ$, and is expressed as Equation (3):

p_{θ} (x_{t - 1} | x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t) I) .

(3)

In the DDPM framework, the mean $μ_{θ} (x_{t}, t)$ is learned by a neural network, while the variance $Σ_{θ} (x_{t}, t)$ is typically treated as a time-dependent constant. The work in [2] reparameterized the mean in terms of a noise prediction network $ε_{θ}$, expressing $μ_{θ} (x_{t}, t)$ as Equation (4):

μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ε_{θ} (x_{t}, t)) .

(4)

Equation (4) demonstrates that the mean image can be derived from the noise. Furthermore, the work [2] introduced a simplified training objective that focuses on noise prediction, significantly improving training efficiency and generating high-quality samples. The simplified loss function is defined as Equation (5):

L_{DM} = E_{t, x_{0}, ε} [{∥ε - ε_{θ} (\sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ε, t)∥}^{2}] .

(5)
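As a minimal sketch of Equations (4) and (5) for a single sample, the snippet below computes the reverse-process mean implied by a noise prediction and the squared-error training signal. The schedule values (`beta_t`, `abar_t`), the toy input, and the two stand-in predictors are assumptions for illustration; no trained network is involved.

```python
import math
import random

def predicted_mean(xt, eps_hat, beta_t, alpha_t, abar_t):
    """Equation (4): reverse-process mean implied by a noise prediction."""
    scale = beta_t / math.sqrt(1.0 - abar_t)
    return [(x - scale * e) / math.sqrt(alpha_t) for x, e in zip(xt, eps_hat)]

def simple_loss(eps, eps_hat):
    """Equation (5) for one sample: squared L2 distance between noises."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_hat))

# Toy setup: one step of an assumed schedule.
beta_t = 0.01
alpha_t = 1.0 - beta_t
abar_t = 0.5                      # assumed cumulative product at step t

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
xt = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e
      for x, e in zip(x0, eps)]

perfect = simple_loss(eps, eps)                   # oracle predictor
zero_guess = simple_loss(eps, [0.0] * len(eps))   # predictor that outputs zeros
mu = predicted_mean(xt, eps, beta_t, alpha_t, abar_t)
```

The oracle predictor attains zero loss, which is the sense in which minimizing Equation (5) drives $ε_{θ}$ toward the true noise and, through Equation (4), fixes the reverse-process mean.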

b. Conditional Diffusion Probabilistic Models: The conditional generative diffusion model introduces conditional information as the basis for our long-tail research. In the conditional generation setting, for training data $x_{0}$, the associated conditional information c can be category labels or low-resolution images, sampled jointly from the data distribution $(x_{0}, c)$. Therefore, the forward process remains unchanged, and the goal is to train a conditional generative model $p_{θ} (x_{0} | c)$. The reverse process is updated to Equations (6) and (7):

p_{θ} (x_{t - 1} | x_{t}, c) = N (x_{t - 1}; μ_{θ} (x_{t}, t, c), Σ_{θ} (x_{t}, t, c) I),

(6)

μ_{θ} (x_{t}, t, c) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ε_{θ} (x_{t}, t, c)),

(7)

where $μ_{θ} (x_{t}, t, c)$ represents the mean image under the conditional information, and the variance is defined as a constant $Σ_{θ} (x_{t}, t, c) = β_{t}^{2}$. Under these conditions, the loss function is updated to Equation (8):

L_{CDM} = E_{t, x_{0}, c, ε} [{∥ε - ε_{θ} (\sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ε, t, c)∥}^{2}]

(8)

c. CBDM: CBDM applies diffusion models to the long-tailed dataset setting, aiming to generate high-quality samples for tail classes under class-imbalanced conditions. In this context, there is a significant discrepancy in the number of training images between head and tail classes (with head-class images being hundreds of times more numerous than tail-class images), resulting in a severe lack of diversity in generated tail-class images. CBDM addresses this by using a distribution-adjustment regularizer during training, thus mitigating the mode collapse problem of tail classes. The loss function for CBDM is defined as Equation (9), and it comprises Equation (8) (conditional diffusion loss in b) and a distribution-adjustment term.

L_{CBDM} = L_{CDM} + \frac{τ t}{| Y |} \sum_{c^{'} \in Y} ({∥ε_{θ} (x_{t}, t, c) - sg (ε_{θ} (x_{t}, t, c^{'}))∥}^{2} + γ {∥sg (ε_{θ} (x_{t}, t, c)) - ε_{θ} (x_{t}, t, c^{'})∥}^{2})

(9)

Here, “sg” represents the stop-gradient operation, while $τ$ and $γ$ are weight hyperparameters. Additionally, $Y$ denotes the set of class labels.

3.2. Perceptual Distinguish Loss

Training deep generative models under a long-tailed distribution presents two primary challenges: the imbalance in the number of data across different classes causes deep generative models to bias towards head classes, leading to poor performance on tail classes; and the scarcity of tail-class images further complicates the training of these models. Existing class-balancing strategies mainly improve tail performance by adjusting sampling or class exposure, but they do not explicitly constrain the perceptual separation between head and tail classes during generation. To address this gap, we introduce a perceptual distinguish loss. Rather than serving as a generic auxiliary regularizer, it explicitly enlarges the separation between head- and tail-class reverse-process distributions, thereby reducing the tendency of tail-class generation to absorb visually dominant head-class features.

Specifically, in the diffusion reverse process, as shown in Equation (7), we randomly sample from head and tail data (with labels $c_{head}$ and $c_{tail}$) as $x_{t}^{head}$ and $x_{t}^{tail}$, respectively. We define the transition probabilities of head and tail data in the reverse process as $p_{θ} (x_{t - 1}^{head} | x_{t}^{head}, c_{head})$ and $p_{θ} (x_{t - 1}^{tail} | x_{t}^{tail}, c_{tail})$, respectively. When generating tail-class images, to emphasize their differences from head-class images, we introduce a penalty term designed to maximize the Kullback–Leibler (KL) divergence [54] between the two distributions. We call this penalty the perceptual distinguish loss, and we introduce a weight $λ$ to balance the original diffusion loss against it. Consequently, the updated loss function is given by Equation (10):

L = L_{CDM} - λ D_{KL} [p_{θ} (x_{t - 1}^{head} | x_{t}^{head}, c_{head}) ∥ p_{θ} (x_{t - 1}^{tail} | x_{t}^{tail}, c_{tail})]

(10)

Optimizing the above objective increases the distinction between the distributions of generated head and tail data while simultaneously reducing the diffusion loss. In the reverse process, given a timestep t and noise $ε$, the mean image $μ_{θ} (x_{t}, t, c)$ can be computed as shown in Equation (7), with the variance fixed as a constant. Therefore, the KL divergence can be expressed as Equation (11):

D_{KL} [p_{θ} (x_{t - 1}^{head} | x_{t}^{head}, c_{head}) ∥ p_{θ} (x_{t - 1}^{tail} | x_{t}^{tail}, c_{tail})] = E_{q} [\frac{1}{2 Σ_{t}^{2}} {∥ μ_{θ} (x_{t}^{head}, t, c_{head}) - μ_{θ} (x_{t}^{tail}, t, c_{tail}) ∥}^{2}] + C,

(11)

where C is a constant. An alternative idea is to directly perform MSE loss on $x_{0}$, but in our experiments, we found that this approach did not yield optimal results. We infer that the error in $x_{0}$ obtained under a given noise $ε$ is too large. The PD-CBDM training algorithm is detailed in Algorithm 1.
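As a brief check on the form of Equation (11), the step from a KL divergence to a squared distance between means can be written out explicitly. The derivation below is a sketch that assumes both reverse transitions share the same fixed isotropic covariance $σ_{t}^{2} I$ in d dimensions, with $σ_{t}^{2}$ playing the role of the constant variance term in Equation (11) and $μ_{1}$, $μ_{2}$ standing for the head and tail means; these symbols are introduced here only for illustration.

```latex
\begin{aligned}
D_{KL}\big[\mathcal{N}(\mu_1, \sigma_t^2 I)\,\|\,\mathcal{N}(\mu_2, \sigma_t^2 I)\big]
&= \tfrac{1}{2}\Big[\operatorname{tr}\big((\sigma_t^2 I)^{-1}\sigma_t^2 I\big) - d
 + (\mu_2-\mu_1)^{\top}(\sigma_t^2 I)^{-1}(\mu_2-\mu_1)
 + \ln\frac{\det(\sigma_t^2 I)}{\det(\sigma_t^2 I)}\Big] \\
&= \tfrac{1}{2}\Big[d - d + \frac{\|\mu_1-\mu_2\|^2}{\sigma_t^2} + 0\Big]
 = \frac{\|\mu_1-\mu_2\|^2}{2\sigma_t^2}.
\end{aligned}
```

Under this shared-variance assumption the trace and log-determinant terms cancel, so the divergence depends only on the distance between the two means, which is why the penalty can be implemented as an MSE-style term on the mean images.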

Algorithm 1 Training Algorithm of PD-CBDM

Input: training data $x_{0}$ with condition c; t is the timestep; $q_{y}^{*}$ is the target-label distribution; $| Y |$ is the number of elements in the label set; $q_{head}$ is the distribution of head data after segmentation; $q_{tail}$ is the distribution of tail data after segmentation.

Output: Conditional noise prediction model $ε_{θ}$.

1: for every batch of size N do
2:   for each data-label pair $(x_{0}, c)$ in this batch do
3:     Sample $ε \sim N (0, I), t \sim U ({0, 1, \dots, T})$
4:     Calculate $L_{CDM} = {∥ ε - ε_{θ} (\sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ε, t, c) ∥}^{2}$
5:     Sample $c^{'}$ from $q_{y}^{*}$
6:     Calculate distribution-adjustment regularization term $L_{r} = \frac{τ t}{| Y |} [{∥ ε_{θ} (x_{t}, t, c) - sg (ε_{θ} (x_{t}, t, c^{'})) ∥}^{2} + γ {∥ sg (ε_{θ} (x_{t}, t, c)) - ε_{θ} (x_{t}, t, c^{'}) ∥}^{2}]$
7:     Sample $c_{head}$ from $q_{head}$, sample $c_{tail}$ from $q_{tail}$
8:     Calculate $μ_{θ} (x_{t}^{head}, t, c_{head})$ and $μ_{θ} (x_{t}^{tail}, t, c_{tail})$ using Equation (7)
9:     Calculate pd loss $L_{pd} = {∥ μ_{θ} (x_{t}^{head}, t, c_{head}) - μ_{θ} (x_{t}^{tail}, t, c_{tail}) ∥}^{2}$
10:    Update with $L_{PD - CBDM} = L_{CDM} + L_{r} - L_{pd}$
11:   end for
12: end for
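The per-sample loss assembled in steps 4 to 10 of Algorithm 1 can be sketched as follows. This is an illustrative forward-value computation only: `toy_eps_model` is a hypothetical stand-in for the conditional U-Net, the hyperparameter values are arbitrary, the noisy head and tail inputs are passed in directly, and the stop-gradient operator is not modelled because no automatic differentiation is involved. The weight λ of Equation (10) is omitted, mirroring step 10.

```python
import math

def sq_norm(u, v):
    """Squared L2 distance between two flat vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_from_noise(xt, eps_hat, beta_t, abar_t):
    """Equation (7): conditional reverse mean from predicted noise."""
    alpha_t = 1.0 - beta_t
    scale = beta_t / math.sqrt(1.0 - abar_t)
    return [(x - scale * e) / math.sqrt(alpha_t) for x, e in zip(xt, eps_hat)]

def pd_cbdm_loss(eps_model, x0, c, eps, t, beta_t, abar_t, c_prime,
                 xt_head, c_head, xt_tail, c_tail, tau, gamma, num_classes):
    """Value of L_CDM + L_r - L_pd for a single training example.

    Only forward values are computed. In an autodiff framework the two
    terms of L_r differ in which branch receives gradients ("sg"), even
    though their numerical values coincide.
    """
    xt = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e
          for x, e in zip(x0, eps)]
    pred_c = eps_model(xt, t, c)
    l_cdm = sq_norm(eps, pred_c)                           # step 4

    pred_cp = eps_model(xt, t, c_prime)
    diff = sq_norm(pred_c, pred_cp)
    l_r = tau * t / num_classes * (diff + gamma * diff)    # step 6

    mu_head = mean_from_noise(xt_head, eps_model(xt_head, t, c_head),
                              beta_t, abar_t)              # step 8
    mu_tail = mean_from_noise(xt_tail, eps_model(xt_tail, t, c_tail),
                              beta_t, abar_t)
    l_pd = sq_norm(mu_head, mu_tail)                       # step 9

    return l_cdm + l_r - l_pd, (l_cdm, l_r, l_pd)          # step 10

def toy_eps_model(xt, t, c):
    """Hypothetical stand-in for the U-Net: a fixed, label-dependent map."""
    return [0.1 * x + 0.01 * c for x in xt]

total, (l_cdm, l_r, l_pd) = pd_cbdm_loss(
    toy_eps_model,
    x0=[0.5, -0.25], c=3, eps=[0.2, -0.1], t=10,
    beta_t=0.01, abar_t=0.9, c_prime=7,
    xt_head=[0.4, 0.1], c_head=0, xt_tail=[-0.3, 0.6], c_tail=9,
    tau=0.001, gamma=0.25, num_classes=10,
)
```

Since `l_pd` enters with a negative sign, lowering the total loss pushes the head and tail mean images apart while the first two terms keep the usual denoising and class-balancing behaviour.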

3.3. Rethinking Target-Label Distribution $q_{y}^{*}$

In CBDM, $q_{y}^{*}$ represents the target-label distribution after regularization, which helps the model learn and generate samples for tail classes by adjusting the distribution during training. However, because this distribution is still derived from long-tailed data, sampling from it directly can continue to over-emphasize frequent classes during training. Unlike a generic class re-weighting strategy, we revisit the target-label distribution used in CBDM and invert the original $q_{y}^{*}$ to construct the adjusted distribution ${\tilde{q}}_{y}$. This design increases tail-class exposure and enhances diversity while avoiding the fidelity degradation associated with naively balanced sampling. The adjusted distribution is given by Equation (12):

{\tilde{q}}_{y} = \frac{1 / q_{y}^{*}}{\sum_{y^{'} = 1}^{C} 1 / q_{y^{'}}^{*}}

(12)

Instead of using a balanced distribution, we invert the target-label distribution so that tail classes receive larger probabilities and are sampled more frequently during training. When the number of image classes included is fewer than in the training set, the distribution-adjustment loss increases the probability of selecting underrepresented tail samples during training. This approach encourages the model to generate more diverse samples while still maintaining high fidelity.
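A small numerical sketch of Equation (12) may help make the inversion concrete. The four-class distribution below is made up for illustration, and the function name is hypothetical.

```python
import random

def invert_label_distribution(q_star):
    """Equation (12): normalise the reciprocals of the label probabilities."""
    if any(p <= 0 for p in q_star):
        raise ValueError("every class needs a positive probability")
    inv = [1.0 / p for p in q_star]
    total = sum(inv)
    return [w / total for w in inv]

# Toy long-tailed distribution over four classes (head class first).
q_star = [0.6, 0.25, 0.1, 0.05]
q_tilde = invert_label_distribution(q_star)

# Drawing target labels from the inverted distribution.
rng = random.Random(0)
labels = rng.choices(range(len(q_tilde)), weights=q_tilde, k=10000)
```

The ordering of the probabilities is reversed: the rarest class in `q_star` becomes the most frequently drawn label under `q_tilde`, while the result still sums to one.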

3.4. Timestep-Dependent Self-Attention

As described previously in Section 3.1, in CBDM, timestep information is only utilized in the residual blocks, and only simple attention mechanisms are employed. Typically, the time dimension dependence of the denoising network is implemented through simple time position embeddings, which are applied to different residual blocks using operations such as spatial addition. However, this simple mechanism may not optimally capture the time-dependent relationships throughout the denoising process.

To overcome this limitation, we propose a timestep-dependent self-attention (TSA) module. The TSA module injects timestep information into the self-attention mechanism to jointly model temporal and spatial dependencies, thereby improving noise estimation in the denoising process. Specifically, we introduce timestep dependence into the query/key/value projections of self-attention. Given the U-Net feature map $z_{s}$ and timestep t, we first compute the timestep embedding $z_{t} = Emb (t)$, where $Emb (\cdot)$ denotes the standard timestep embedding function used in diffusion models. We then pass $z_{s}$ and $z_{t}$ through separate linear layers and sum the projected features to construct timestep-conditioned Q, K, and V before performing standard scaled dot-product self-attention, producing the updated output feature $z^{'}$, as illustrated in Figure 3. The updated formula can be written as Equations (13) and (14):

\begin{matrix} Q = z_{s} W_{q s} + z_{t} W_{q t} \\ K = z_{s} W_{k s} + z_{t} W_{k t} \\ V = z_{s} W_{v s} + z_{t} W_{v t} \end{matrix}

(13)

z^{'} = TSA (Q, K, V) = Softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(14)
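Equations (13) and (14) can be sketched for a single attention head in plain Python. Several details here are assumptions rather than the paper's implementation: the sinusoidal form of Emb(t), the broadcasting of the projected timestep vector to every spatial token, the random weights, and all dimensions.

```python
import math
import random

def timestep_embedding(t, dim):
    """Sinusoidal embedding, assumed here as the 'standard' Emb(t)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def matmul(a, b):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def project(z_s, z_t, w_s, w_t):
    """One line of Equation (13): z_s W_s + z_t W_t, with the timestep
    term broadcast to every spatial token."""
    spatial = matmul(z_s, w_s)
    temporal = matmul([z_t], w_t)[0]
    return [[a + b for a, b in zip(row, temporal)] for row in spatial]

def tsa(z_s, z_t, weights):
    """Equations (13) and (14) for a single attention head."""
    q = project(z_s, z_t, weights["qs"], weights["qt"])
    k = project(z_s, z_t, weights["ks"], weights["kt"])
    v = project(z_s, z_t, weights["vs"], weights["vt"])
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    attn = [softmax(row) for row in scores]
    return matmul(attn, v), attn

rng = random.Random(0)
n_tokens, d_model, d_time, d_k = 4, 6, 8, 5

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Spatial weights ("..s") act on z_s; temporal weights ("..t") act on z_t.
weights = {name: rand_matrix(d_model if name.endswith("s") else d_time, d_k)
           for name in ("qs", "qt", "ks", "kt", "vs", "vt")}
z_s = rand_matrix(n_tokens, d_model)

out_early, attn_early = tsa(z_s, timestep_embedding(10, d_time), weights)
out_late, attn_late = tsa(z_s, timestep_embedding(900, d_time), weights)
```

With identical spatial features, the outputs for t = 10 and t = 900 differ, which is the intended effect: the attention layer is no longer timestep-agnostic.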

In Section 4.3.3, we tested the performance of the timestep-dependent self-attention module (TSA) at different positions within the model. Ultimately, we decided to position the TSA module in the U-Net at the feature-map resolution of 16, specifically after the first downsampling and before the last upsampling.

4. Experiments

4.1. Datasets and Metrics

To evaluate our method under class-imbalanced generation settings, we conduct all experiments on long-tailed variants of CIFAR-10 [55] and CIFAR-100 [55]. The long-tailed splits are constructed following [21,30]. CIFAR-10 consists of 60,000 RGB images at a $32 \times 32$ resolution across 10 categories (6000 images per category). CIFAR-100 contains the same number of $32 \times 32$ RGB images but spans 100 categories (600 images per category), which are further grouped into 20 superclasses with 5 categories each.

For evaluation, we use Fréchet inception distance (FID) [56], inception score (IS) [57], recall [58], and $F_{β}$ [59] to assess the fidelity and diversity of the model-generated images. The value of $β$ is set to 8. Recall and $F_{8}$ are used to evaluate image diversity, while IS and $F_{1 / 8}$ mainly measure image fidelity. In addition, for the recall and $F_{β}$ calculations, we use Inception-V3 features; the recall parameter K is 5, and the clustering count is set to 20 times the number of classes.
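To illustrate why $F_{8}$ tracks diversity while $F_{1 / 8}$ tracks fidelity, the sketch below uses the generic weighted harmonic mean of precision and recall. This is an assumption about the general form of the score only; the evaluation protocol of [59] may derive its values differently, and the precision and recall numbers are made up.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.9, 0.6   # made-up values for illustration
f_8 = f_beta(precision, recall, 8.0)
f_1_8 = f_beta(precision, recall, 1.0 / 8.0)
```

With a large β the score stays close to recall (a proxy for diversity), and with a small β it stays close to precision (a proxy for fidelity).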

4.2. Experimental Setup

Training settings: On the CIFAR100 and CIFAR100-LT datasets, we trained for 500 K iterations, and on the CIFAR10 and CIFAR10-LT datasets, we trained for 800 K iterations; all training was completed on a single NVIDIA GeForce RTX 3090 GPU. For all datasets, we used the Adam optimizer with a learning rate of $2 \times 10^{-4}$ and set the model’s channel count to 128. The network includes three stages, with the resolution halved between consecutive stages, and each stage has 2 residual blocks and 2 TSA blocks. All datasets use a linear noise schedule between $1 \times 10^{-4}$ and 0.02.
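The linear noise schedule can be sketched as follows. The number of diffusion steps is not stated in this section, so `T = 1000` below is an assumption (the value commonly used with DDPM), and the helper names are hypothetical.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1, ..., beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return alpha_bars


T = 1000  # assumed number of diffusion steps, not given in the text
betas = linear_beta_schedule(T)
alpha_bars = cumulative_alphas(betas)
print(betas[0], betas[-1], alpha_bars[-1])
```

Under this assumption, the cumulative product at the last step is close to zero, so the fully noised sample is nearly pure Gaussian noise.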

Testing settings: At test time, we use the same settings as the baseline model. In the ablation experiments, we randomly sample 50,000 images to measure all metrics. When comparing with SOTA models, we randomly sample 10,000 images for comparison with GAN-based methods and 5000 images for comparison with diffusion-based methods. All sampling was performed on a single NVIDIA GeForce RTX 3090.

4.3. Ablation Study

4.3.1. Performance Analysis of Proposed Methods

To quantify the contribution of each proposed component, we perform controlled ablations on CIFAR100-LT by selectively enabling the three modules of PD-CBDM. We show the performance change when adding the perceptual distinguish loss, re-weighting, and TSA module to the baseline model. As shown in Table 1, adding the perceptual distinguish loss reduces FID from 5.81 to 5.19 and slightly improves IS, suggesting that explicitly enlarging head–tail separation mainly improves generation fidelity. Re-weighting increases recall from 0.57 to 0.63 and $F_{1/8}$ from 0.90 to 0.94, indicating that greater tail-class exposure mainly enhances diversity and coverage, although IS decreases slightly. TSA improves FID to 5.42 and IS to 13.57, suggesting that timestep-aware attention benefits denoising quality. When all three components are combined, PD-CBDM achieves the best overall trade-off, improving FID from 5.81 to 4.96 and IS from 13.34 to 13.60. These results indicate that the three modules play complementary roles in improving long-tailed image generation.

4.3.2. Analysis of Data Division Ratio

In the perceptual distinguish loss, we explored the impact of different head–tail data division ratios on experimental results. Specifically, we sorted the training data by class frequency and partitioned them into head and tail groups at ratios of 6:4, 7:3, and 8:2 before training. Since the imbalance factor varies greatly with the training dataset, different imbalance factors may call for different division ratios to reach optimal performance. As shown in Table 2, using the perceptual distinguish loss improved model performance: with a division ratio of 7:3, the FID reached 5.19, a reduction of 0.62 compared to the baseline; with a division ratio of 6:4, the IS reached 13.59, an increase of 0.25 compared to the baseline model. Based on the FID result, we set the division ratio for all datasets to 7:3.
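The division step can be sketched as below. The text does not specify whether the ratio is applied to the number of classes or to the number of samples; this sketch assumes it is applied to the frequency-sorted list of classes. The function name `split_head_tail` and the toy class counts are hypothetical.

```python
def split_head_tail(class_counts, ratio=(7, 3)):
    """Split class labels into head and tail groups by training frequency.

    class_counts maps a class label to its number of training images.
    ratio is (head_part, tail_part), e.g. (7, 3) for a 7:3 division.
    """
    head_part, tail_part = ratio
    # Most frequent classes first; ties keep their original order.
    order = sorted(class_counts, key=lambda c: class_counts[c], reverse=True)
    n_head = len(order) * head_part // (head_part + tail_part)
    return order[:n_head], order[n_head:]


# Toy long-tailed counts for 10 classes (illustrative only).
counts = {label: 1000 // (label + 1) for label in range(10)}
for ratio in [(6, 4), (7, 3), (8, 2)]:
    head, tail = split_head_tail(counts, ratio)
    print(ratio, len(head), len(tail))
```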

4.3.3. TSA Module Position Selection

To investigate the effect of the TSA module in the denoising U-Net, ablation studies were conducted by inserting TSA at different locations within the network. Specifically, the module was added at each resolution transition point (after downsampling, before upsampling, and in the bottleneck), corresponding to feature-map resolutions of 16, 8, and 4, respectively. As shown in Table 3, compared to using TSA only at resolution 16, applying it at all resolutions led to a performance drop: FID worsened by 0.34 and IS decreased by 0.33. This indicates that using TSA across all resolutions may weaken model performance. Consequently, for other datasets with limited resolution, the TSA module was only applied to feature maps of size 16, and its placement at other resolutions was not further explored. This result suggests that the gain from TSA comes from localized timestep-aware attention rather than simply stacking more attention operations, which further isolates the architectural contribution of TSA.

4.3.4. Different Guidance Strength Analysis

In conditional generation, the guidance strength $ω$ is a key parameter that controls the trade-off between the quality and diversity of generated samples. We studied how $ω$ in classifier-free guidance (CFG) affects generation during sampling by measuring FID and IS on each dataset for $ω$ in the range [0.2, 2]. Following [24], we sampled 50,000 images at each step of 0.2 and computed FID and IS to determine the optimal $ω$ for each dataset. As shown in Table 4, as $ω$ increases, generation diversity suffers while fidelity improves. FID reaches its best value of 4.96 at $ω = 1.4$, whereas IS continues to improve as the guidance strength increases. Therefore, we set $ω$ to 1.4 for the CIFAR100-LT dataset. Using the same procedure, we searched for the optimal $ω$ on the other datasets and obtained 1.2 (CIFAR10-LT), 0.9 (CIFAR-100), and 0.8 (CIFAR-10).
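For reference, the role of $ω$ can be sketched with the common classifier-free guidance combination $(1 + ω) ε_{cond} - ω ε_{uncond}$. The text does not state which parametrization is used, so this convention is an assumption of the sketch, and the names `cfg_noise` and `omega_grid` are hypothetical; noise predictions are represented as flat lists of floats.

```python
def cfg_noise(eps_cond, eps_uncond, omega):
    """Combine conditional and unconditional noise predictions.

    Assumed convention: (1 + omega) * eps_cond - omega * eps_uncond,
    so omega = 0 reduces to purely conditional sampling.
    """
    return [(1.0 + omega) * c - omega * u for c, u in zip(eps_cond, eps_uncond)]


def omega_grid(start=0.2, stop=2.0, step=0.2):
    """Guidance strengths swept in the search: start, start + step, ..., stop."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(n)]


print(omega_grid())
print(cfg_noise([1.0, 2.0], [0.0, 1.0], omega=1.0))
```

Larger $ω$ pushes the prediction further away from the unconditional estimate, which is consistent with the observed gain in fidelity at the cost of diversity.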

4.4. Performance on Class-Balanced Datasets

To further examine the effectiveness of the proposed method, we also evaluated PD-CBDM on balanced datasets. Specifically, for balanced datasets, we still use a 7:3 ratio for data division after shuffling the image classes, and we no longer use the re-weighting method (because the target-label distribution is already balanced). Recently, GP-MI and MMD [60] were proposed as reinforcement learning-based fine-tuning methods that leverage a reward function called “Diversity Reward” to guide the training of diffusion models. The results are shown in Table 5: our method is also applicable to balanced datasets, improving FID by 0.59 and IS by 0.64 over unconditional DDPM generation and surpassing the baseline model CBDM on all metrics; however, MMD performs better in generation diversity. Therefore, our method is not only suitable for class-imbalanced datasets but also performs well on conventional balanced datasets.

5. Comparison with SOTA Models

Image generation research has progressed quickly, and recent models continue to improve quantitative metrics. For a comparison against strong baselines, we evaluate PD-CBDM under several imbalance factors $ρ$ (1, 10, and 100) and report FID as the primary measure. Concretely, we compute FID on CIFAR10-LT with $ρ \in \{1, 10, 100\}$, and on CIFAR100-LT with $ρ = 10$.

As shown in Table 6, we compare PD-CBDM with several GAN-based models, including SNDCGAN [61], CBGAN [37], and NoisyTwins [40]; gSR [38] is a regularization method applied during the training of GAN-based models. Table 6 shows that PD-CBDM outperforms the other models under all imbalance factors, which suggests that GAN-based models are less effective for image generation on long-tailed datasets.

On the CIFAR10-LT dataset, FID improves over the baseline model by 1.38 when $ρ$ is 100, by 0.62 when $ρ$ is 10, and by 1.14 when $ρ$ is 1. Compared with NoisyTwins, a recent SOTA model in the GAN field, PD-CBDM achieves better FID scores under all tested imbalance factors. On the CIFAR100-LT dataset, it clearly outperforms several GAN-based models, and its FID improves by 2.57 over the baseline model. These results indicate that PD-CBDM performs well with more categories and stronger imbalance factors.

We further compared the performance of several diffusion-model-based image generation algorithms under long-tailed distributions, including DDIM [65] and BPA [22]. Specifically, we evaluated each model on the CIFAR100-LT and CIFAR10-LT datasets with an imbalance factor $ρ$ of 100. Following the BPA model settings, we randomly sampled 5000 images to compare their FID and IS metrics, and the results are shown in Table 7.

From Table 7, it can be observed that, compared with the recent SOTA diffusion model BPA, PD-CBDM achieves a better (lower) FID on both datasets. On the CIFAR10-LT dataset, its FID is better than that of BPA by 0.41, although its IS is slightly lower. Similarly, on the CIFAR100-LT dataset, PD-CBDM surpasses BPA by 0.64 in FID but falls behind by 0.14 in IS.

6. Qualitative Analysis

To further evaluate the perceptual quality and diversity of the generated images, we present qualitative visual comparisons of the baseline, NoisyTwins, and PD-CBDM models on the CIFAR100-LT dataset. As shown in Figure 4, under the same random seed, PD-CBDM generates higher-quality tail-class images than the other two models, with higher diversity and fidelity, more distinct primary subject details, and more varied backgrounds. For instance, for the tail category “flowers”, the baseline and NoisyTwins models tend to produce uniform and less detailed backgrounds, indicating underfitting to tail-class data and limited diversity. In contrast, PD-CBDM generates a wider variety of “flowers” with richer and more diverse background details.

This comparison highlights PD-CBDM’s superior capability in generating high-quality tail-class images while maintaining the generation quality of head-class images.

7. Discussion

PD-CBDM aims to distinguish head and tail-class distributions via a perceptual distinguish loss, enhancing tail-class image quality with minimal impact on head classes. This section analyzes its effectiveness from both theoretical and empirical perspectives.

From a theoretical standpoint, the effectiveness of PD-CBDM can be attributed to the interplay of three key components: perceptual distinguish loss, the timestep-dependent self-attention (TSA) module, and the redefined target-label distribution $q_{y}^{*}$. First, the perceptual distinguish loss adds an explicit regularization term that encourages a larger KL-based separation between head- and tail-class feature distributions. By enforcing this separation in the perceptual feature space, the model is encouraged to learn class-specific representations, especially for underrepresented tail classes. This helps prevent the common issue of feature collapse, where tail-class features are overwhelmed or misaligned with those of head classes. Second, the TSA module improves dependency modeling by injecting temporal and spatial cues into the denoising process. Traditional attention mechanisms within U-Net architectures often neglect the timestep context crucial to diffusion models. By making attention explicitly timestep-dependent, TSA helps incorporate timestep information into attention computation, which can improve noise estimation and denoising dynamics across the diffusion process. This enhancement is particularly beneficial for tail classes, whose generation paths may require finer temporal modeling to reconstruct semantically meaningful samples. Finally, the redefinition of the target-label distribution $q_{y}^{*}$ acts as an implicit re-weighting strategy. By assigning higher sampling probabilities or training emphasis to tail classes, it aligns with long-tailed data balancing principles. This promotes the generation of a more diverse and representative sample set for underrepresented classes, thereby mitigating class imbalance at the data level during training. Together, these components create a synergistic framework: the perceptual loss enforces inter-class separability, TSA improves temporal modeling capacity, and the adjusted label prior $q_{y}^{*}$ addresses sample diversity. All three contribute to PD-CBDM’s ability to generate high-fidelity, class-balanced outputs in long-tailed settings.

From a computational perspective, the three components of PD-CBDM introduce different types of overhead. The re-weighted target-label distribution only changes label sampling during training and does not alter the inference procedure. The perceptual distinguish loss adds extra head–tail pair processing and a separation term during training, but it is not used at inference time. By contrast, TSA increases computation in both training and sampling because timestep-conditioned projections are introduced into self-attention. Nevertheless, this overhead is localized, since TSA is only inserted at the $16 \times 16$ stage of the U-Net rather than across all resolutions. Therefore, compared with the baseline, the additional training overhead comes from both the perceptual distinguish loss and TSA, whereas the additional inference-time overhead mainly comes from the TSA-enhanced attention computation.
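To indicate where the TSA overhead comes from, the sketch below shows one possible form of self-attention with timestep-conditioned projections: each of the query, key, and value projections adds a projected timestep embedding to the projected spatial token. This additive form is an assumption in the spirit of time-dependent attention in diffusion vision transformers [24] and is not necessarily the exact TSA definition; single-head attention, the helper names, and the random weights are all illustrative.

```python
import math
import random


def matvec(weight, vec):
    """Multiply a (d x d) matrix, stored as a list of rows, by a length-d vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def timestep_attention(tokens, t_emb, params):
    """Single-head self-attention with timestep-conditioned q, k, v projections.

    tokens: list of n spatial token vectors (dimension d).
    t_emb:  one timestep embedding vector (dimension d), shared by all tokens.
    """
    d = len(t_emb)

    def project(name, x):
        spatial = matvec(params[name + "_s"], x)       # usual spatial projection
        temporal = matvec(params[name + "_t"], t_emb)  # extra timestep projection
        return [a + b for a, b in zip(spatial, temporal)]

    q = [project("q", x) for x in tokens]
    k = [project("k", x) for x in tokens]
    v = [project("v", x) for x in tokens]

    outputs, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(d)])
    return outputs, weights


def random_params(d, seed=0):
    """Random illustrative weights: spatial (_s) and temporal (_t) matrices for q, k, v."""
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0.0, 0.5) for _ in range(d)] for _ in range(d)]

    return {name + "_" + kind: mat() for name in "qkv" for kind in "st"}
```

In this form, the extra cost relative to plain self-attention is three additional projections of a single embedding vector per attention layer, which is consistent with the overhead being localized when the module is used at only one resolution.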

Although our experiments are conducted on CIFAR-style benchmarks, the three components of PD-CBDM are not inherently restricted to low-resolution settings. The re-weighted target-label distribution operates at the data-sampling level, the perceptual distinguish loss modifies the training objective, and TSA is introduced as a modular attention design in the denoising network. Therefore, the overall framework is, in principle, transferable to higher-resolution or real-world long-tailed datasets. Nevertheless, extending PD-CBDM to such settings may introduce additional challenges. Higher-resolution generation would increase both memory and computation, while real-world long-tailed datasets may exhibit more complex head–tail visual overlap and broader distribution shifts across categories. These issues may affect both optimization stability and the effectiveness of the current module configuration. We therefore regard scalable long-tailed diffusion generation on higher-resolution and real-world datasets as an important direction for future work.

From an experimental standpoint, PD-CBDM demonstrates strong performance across both imbalanced and balanced settings under the evaluated benchmarks. As shown in Table 6, under varying imbalance factors, PD-CBDM achieves lower FID scores compared to the baseline model. This suggests that PD-CBDM effectively mitigates the degradation of generative performance typically observed under class imbalance, maintaining high visual fidelity even for tail classes. The superior performance under these challenging conditions highlights the method’s capacity to model underrepresented distributions more accurately. Moreover, on conventional balanced datasets (Table 5), PD-CBDM outperforms both DDPM and CBDM across all evaluated metrics, including FID and IS. This result confirms that the improvements introduced by PD-CBDM—such as the perceptual distinguish loss and the TSA module—do not overfit to the imbalanced setting but instead provide broader benefits in modeling and generation quality. This indicates that PD-CBDM enhances the underlying denoising and sample quality mechanisms in diffusion models, making it a more universally effective solution. Qualitative results further reinforce these findings. As illustrated in Figure 4, PD-CBDM produces notably higher-quality samples than both the baseline and NoisyTwins models on the CIFAR100-LT dataset, particularly for tail categories. Compared to other methods, PD-CBDM-generated images exhibit more distinct main objects, finer structural details, and richer, more diverse backgrounds. These visual differences imply that PD-CBDM not only increases the fidelity of tail-class samples but also enriches intra-class diversity—two critical aspects often lacking in long-tailed generative modeling. In addition, we quantify how TSA affects noise prediction by measuring the sampling-time noise error. 
We follow Algorithm 2 and remove sampling stochasticity, then compute the prediction error for CBDM and PD-CBDM under this deterministic setting. For CIFAR100-LT, images are normalized to the range $[-1, 1]$, and we compute the per-pixel distance between the predicted noise and the target using the learned weights of each model with the $ℓ_{1}$ metric. As reported in Table 8, PD-CBDM reduces the noise error by 0.18 compared with CBDM. This improvement suggests that incorporating TSA helps the denoising network capture noise-related information more precisely, leading to more accurate noise prediction.

Algorithm 2 Noise Error Estimation with Discarded Stochasticity

Input: $x_{0}$ is the training data; $t$ is the timestep; $ε$ is the Gaussian noise; $ε_{θ}(x_{t}, t, c)$ is the trained noise prediction model.

Output: noise error $ω_{t} / η_{t}$.

1: Initialize $ω_{t} = 0$, $η_{t} = 0$ ($t \sim U(\{1, \dots, T\})$)
2: repeat
3: $x_{0} \sim q(x_{0})$, $t \sim U(\{1, \dots, T\})$
4: Compute $x_{t}$ using Equation (2)
5: for $t$ down to 1 do
6: $x_{t-1}^{'} = \frac{1}{\sqrt{α_{t}}} \left( x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α}_{t}}} ε_{θ}(x_{t}, t, c) \right)$
7: end for
8: $ω_{t} = ω_{t} + ∥x_{0}^{'} - x_{0}∥ / 1024$, $η_{t} = η_{t} + 1$
9: until N iterations
10: return $ω_{t} / η_{t}$
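The procedure in Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: Equation (2) is taken to be the standard forward diffusion $x_{t} = \sqrt{\bar{α}_{t}} x_{0} + \sqrt{1 - \bar{α}_{t}} ε$, the norm is the $ℓ_{1}$ distance, the divisor 1024 is read as the $32 \times 32$ pixel count (replaced here by the vector length), images are flat lists of floats, and the class condition $c$ is omitted. The names `make_schedule` and `noise_error` and the stand-in predictor are hypothetical.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alphas, alpha_bars) for a linear schedule, indexed by t - 1."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def noise_error(samples, eps_model, num_steps=1000, seed=0):
    """Average per-pixel l1 distance between x_0 and its deterministic reconstruction.

    eps_model(x, s) must return the predicted noise for state x at timestep s.
    """
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    omega, eta = 0.0, 0
    for x0 in samples:
        t = rng.randint(1, num_steps)  # t ~ U({1, ..., T})
        ab_t = alpha_bars[t - 1]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Forward diffusion to x_t (assumed form of Equation (2)).
        x = [math.sqrt(ab_t) * v + math.sqrt(1.0 - ab_t) * e for v, e in zip(x0, eps)]
        # Deterministic reverse steps: posterior mean only, no added noise.
        for s in range(t, 0, -1):
            pred = eps_model(x, s)
            coef = betas[s - 1] / math.sqrt(1.0 - alpha_bars[s - 1])
            x = [(xi - coef * pi) / math.sqrt(alphas[s - 1]) for xi, pi in zip(x, pred)]
        omega += sum(abs(a - b) for a, b in zip(x, x0)) / len(x0)
        eta += 1
    return omega / eta


# Stand-in predictor that always returns zero noise (illustrative only).
toy_images = [[0.5, -0.25, 0.0, 1.0], [-1.0, 0.75, 0.1, -0.4]]
print(noise_error(toy_images, lambda x, s: [0.0] * len(x), num_steps=50))
```

In practice `eps_model` would wrap the trained network of each method, so that the returned value can be compared between CBDM and PD-CBDM as in Table 8.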

Nevertheless, several open challenges remain. Under more extreme imbalance, head-class dominance and head–tail confusion may become more severe, which could further affect optimization stability and tail-class fidelity. In addition, extending PD-CBDM to larger or higher-resolution datasets would introduce higher memory and computation costs, while cross-domain transfer to more realistic long-tailed datasets may involve more complex visual overlap and broader distribution shifts across categories. These issues remain important directions for future study.

8. Conclusions

This paper proposes PD-CBDM, a method designed to generate high-quality images on long-tailed datasets. In class-imbalanced situations, generative models often rely too heavily on head classes, leading to low-quality tail samples that resemble head data. To address this issue, we introduce a perceptual distinguish loss to better separate head- and tail-class representations. In addition, the target-label distribution of the baseline model is adjusted to give more weight to tail classes during training, which helps increase the diversity of generated images. To further improve denoising quality, a timestep-dependent self-attention (TSA) module is added to the denoising network, allowing the model to exploit both temporal and spatial information. Experiments show that PD-CBDM improves FID from 5.81 to 4.96 on CIFAR100-LT and from 8.10 to 7.48 on CIFAR10-LT, and also performs well on standard balanced datasets.

Limitations and future work: Although PD-CBDM achieves competitive FID and IS scores, it still has limitations. The method improves class-imbalanced image generation through a combination of re-weighted label sampling, perceptual-loss regularization, and timestep-aware attention, without a dedicated architecture specifically designed for long-tailed diffusion generation. In particular, its overall generation efficiency remains constrained by the iterative sampling nature of diffusion models, and the added TSA module further increases computation compared with the baseline. Under more extreme imbalance, the current configuration may also face additional optimization difficulty and stronger head–tail confusion. In future work, we plan to explore more efficient architectures and faster sampling strategies for long-tailed diffusion generation, and to further evaluate the proposed framework on larger, higher-resolution, and more realistic long-tailed datasets.

Author Contributions

J.H.: Conceptualization, methodology, writing—original draft. W.L.: Methodology, writing—original draft, experimenting and editing. T.C.: Software and validation. X.Y.: Writing—review, funding acquisition and polishing. Z.H.: Resources. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62402387 and by the Special Scientific Research Program of Education Department of Shaanxi under Grant 22JK0562.

Data Availability Statement

The data presented in this study are openly available on GitHub at https://github.com/leo-0010/PD-CBDM, accessed on 8 April 2026.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2256–2265.
2. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
3. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
4. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680.
5. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494.
6. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10696–10706.
7. Pernias, P.; Rampas, D.; Richter, M.L.; Pal, C.; Aubreville, M. Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
8. Lu, H.; Yang, G.; Fei, N.; Huo, Y.; Lu, Z.; Luo, P.; Ding, M. VDT: General-purpose Video Diffusion Transformers via Mask Modeling. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
9. Fei, H.; Wu, S.; Ji, W.; Zhang, H.; Chua, T.S. Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7641–7653.
10. Xing, Y.; He, Y.; Tian, Z.; Wang, X.; Chen, Q. Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7151–7161.
11. Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021.
12. Wang, Y.; Gao, R.; Chen, K.; Zhou, K.; Cai, Y.; Hong, L.; Li, Z.; Jiang, L.; Yeung, D.Y.; Xu, Q.; et al. Detdiffusion: Synergizing generative and perceptive models for enhanced data generation and perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7246–7255.
13. Ranasinghe, Y.; Hegde, D.; Patel, V.M. MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10659–10670.
14. Ho, J.; Saharia, C.; Chan, W.; Fleet, D.J.; Norouzi, M.; Salimans, T. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 2022, 23, 2249–2281.
15. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726.
16. Shang, S.; Shan, Z.; Liu, G.; Wang, L.; Wang, X.; Zhang, Z.; Zhang, J. Resdiff: Combining cnn and diffusion model for image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; AAAI Publications: Washington, DC, USA, 2024; Volume 38, pp. 8975–8983.
17. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8162–8171.
18. Bao, F.; Nie, S.; Xue, K.; Cao, Y.; Li, C.; Su, H.; Zhu, J. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22669–22679.
19. Zhan, C.; Lin, Y.; Wang, G.; Wang, H.; Wu, J. MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 11502–11512.
20. Zhang, Z.; Yao, L.; Wang, B.; Jha, D.; Keles, E.; Medetalibeyoglu, A.; Bagci, U. Emit-diff: Enhancing medical image segmentation via text-guided diffusion model. arXiv 2023, arXiv:2310.12868.
21. Qin, Y.; Zheng, H.; Yao, J.; Zhou, M.; Zhang, Y. Class-balancing diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18434–18443.
22. Xu, C.; Yan, J.; Yang, M.; Deng, C. Rethinking Noise Sampling in Class-Imbalanced Diffusion Models. IEEE Trans. Image Process. 2024, 33, 6298–6308.
23. Yan, D.; Qi, L.; Hu, V.T.; Yang, M.H.; Tang, M. Training class-imbalanced diffusion model via overlap optimization. arXiv 2024, arXiv:2402.10821.
24. Hatamizadeh, A.; Song, J.; Liu, G.; Kautz, J.; Vahdat, A. Diffit: Diffusion vision transformers for image generation. In Proceedings of the European Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 37–55.
25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
26. Kang, B.; Li, Y.; Xie, S.; Yuan, Z.; Feng, J. Exploring balanced feature spaces for representation learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
27. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277.
28. Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; Yu, S.X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2537–2546.
29. Wang, X.; Lian, L.; Miao, Z.; Liu, Z.; Yu, S. Long-tailed Recognition by Routing Diverse Distribution-Aware Experts. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021.
30. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Process. Syst. 2019, 32, 1567–1578.
31. Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; Yan, J. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11662–11671.
32. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
33. Cui, Y.; Song, Y.; Sun, C.; Howard, A.; Belongie, S. Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4109–4118.
34. Yang, Y.; Xu, Z. Rethinking the value of labels for improving class-imbalanced learning. Adv. Neural Inf. Process. Syst. 2020, 33, 19290–19301.
35. Shao, J.; Zhu, K.; Zhang, H.; Wu, J. DiffuLT: Diffusion for Long-tail Recognition Without External Knowledge. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 123007–123031.
36. Deng, C.; Li, D.; Ji, L.; Zhang, C.; Li, B.; Yan, H.; Zheng, J.; Wang, L.; Zhang, J. ChatDiff: A ChatGPT-based diffusion model for long-tailed classification. Neural Netw. 2025, 181, 106794.
37. Rangwani, H.; Mopuri, K.R.; Babu, R.V. Class balancing gan with a classifier in the loop. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Online, 27–30 July 2021; pp. 1618–1627.
38. Rangwani, H.; Jaswani, N.; Karmali, T.; Jampani, V.; Babu, R.V. Improving gans for long-tailed data through group spectral regularization. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 426–442.
39. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
40. Rangwani, H.; Bansal, L.; Sharma, K.; Karmali, T.; Jampani, V.; Babu, R.V. Noisytwins: Class-consistent and diverse image generation through stylegans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5987–5996.
41. Zhang, T.; Zheng, H.; Yao, J.; Wang, X.; Zhou, M.; Zhang, Y.; Wang, Y. Long-tailed diffusion models with oriented calibration. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
42. Chen, F.; Villa, A.; Liang, G.; Lu, X.; Tang, M. Contrastive Conditional-Unconditional Alignment for Long-tailed Diffusion Model. arXiv 2025, arXiv:2507.09052.
43. Fu, S.; He, X.; Hu, H. LTB-Solver: Long-tailed Bias Solver for image synthesis of diffusion models. Neurocomputing 2025, 634, 129651.
44. Chen, J.; Yu, J.; Ge, C.; Yao, L.; Xie, E.; Wu, Y.; Wang, Z.; Kwok, J.; Luo, P.; Lu, H.; et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv 2023, arXiv:2310.00426.
45. Hong, D.; Zhang, B.; Li, X.; Li, Y.; Li, C.; Yao, J.; Yokoya, N.; Li, H.; Ghamisi, P.; Jia, X.; et al. SpectralGPT: Spectral remote sensing foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5227–5244.
46. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4195–4205.
47. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 10684–10695.
48. Zhao, W.; Rao, Y.; Liu, Z.; Liu, B.; Zhou, J.; Lu, J. Unleashing text-to-image diffusion models for visual perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 5729–5739.
49. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 5156–5165.
50. Higuera, C.; Boots, B.; Mukadam, M. Learning to read braille: Bridging the tactile reality gap with diffusion models. arXiv 2023, arXiv:2304.01182.
51. Pu, Y.; Xia, Z.; Guo, J.; Han, D.; Li, Q.; Li, D.; Yuan, Y.; Li, J.; Han, Y.; Song, S.; et al. Efficient diffusion transformer with step-wise dynamic attention mediators. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 424–441.
52. Zhao, W.; Han, Y.; Tang, J.; Wang, K.; Song, Y.; Huang, G.; Wang, F.; You, Y. Dynamic diffusion transformer. arXiv 2024, arXiv:2410.03456.
53. Tang, S.; Wang, Y.; Ding, C.; Liang, Y.; Li, Y.; Xu, D. Adadiff: Accelerating diffusion models through step-wise adaptive computation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 73–90.
54. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
55. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009.
56. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640.
57. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242.
58. Kynkäänniemi, T.; Karras, T.; Laine, S.; Lehtinen, J.; Aila, T. Improved precision and recall metric for assessing generative models. Adv. Neural Inf. Process. Syst. 2019, 32, 3927–3936.
59. Sajjadi, M.S.; Bachem, O.; Lucic, M.; Bousquet, O.; Gelly, S. Assessing generative models via precision and recall. Adv. Neural Inf. Process. Syst. 2018, 31, 5234–5243.
60. Miao, Z.; Wang, J.; Wang, Z.; Yang, Z.; Wang, L.; Qiu, Q.; Liu, Z. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10844–10853.
Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2642–2651. [Google Scholar]
Tseng, H.Y.; Jiang, L.; Liu, C.; Yang, M.H.; Yang, W. Regularizing generative adversarial networks under limited data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7921–7931. [Google Scholar]
Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]

Figure 1. Overall architecture of PD-CBDM. Left: Long-tailed classes are split into head/tail and the target-label distribution is re-weighted by inverting $q_{y}^{*}$ to obtain $\tilde{q}_{y}$. Middle: TSA blocks are inserted at the $16 \times 16$ stage of the denoising U-Net, where timestep embedding $z_{t} = \mathrm{Emb}(t)$ is injected into the Q/K/V projections together with spatial features $z_{s}$. Right: The perceptual distinguish loss $L_{pd}$ enlarges head–tail discrepancy by enforcing separation between $\mu_{\theta}^{\mathrm{head}}$ and $\mu_{\theta}^{\mathrm{tail}}$ (computed via Equation (7)). Green/blue/orange indicate head/tail/overlap distributions, respectively.
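
The re-weighting step in the left panel can be illustrated with a minimal sketch. It assumes that "inverting" $q_{y}^{*}$ means taking a normalized inverse-frequency distribution, which is one plausible reading and not necessarily the exact rule used in PD-CBDM. The function name `invert_label_distribution` and the class counts are hypothetical.

```python
def invert_label_distribution(class_counts):
    """Return (q_star, q_tilde) for a list of positive per-class sample counts.

    q_star  : empirical label distribution of the long-tailed training set.
    q_tilde : ASSUMED re-weighted target, proportional to 1 / q_star.
    """
    total = sum(class_counts)
    q_star = [count / total for count in class_counts]
    inverse = [1.0 / q for q in q_star]      # rare classes get large weights
    norm = sum(inverse)
    q_tilde = [w / norm for w in inverse]    # renormalize to a distribution
    return q_star, q_tilde


# Hypothetical long-tailed counts, ordered from head class to tail class.
counts = [500, 200, 50, 5]
q_star, q_tilde = invert_label_distribution(counts)
for c, qs, qt in zip(counts, q_star, q_tilde):
    print(f"count={c:4d}  q*={qs:.3f}  q~={qt:.3f}")
```

Under this assumption the ordering of the classes is reversed: the class with the fewest training samples receives the largest sampling probability.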

Figure 2. (a) Training data, illustrating the head class (0, apple) and the tail class (83, sweet pepper). (b,c) Comparison of results generated by the baseline model and our proposed model for the tail class (83, sweet pepper).

Figure 3. Timestep-dependent self-attention (TSA) module. The module combines spatial features with timestep embeddings in the Q/K/V projections so that self-attention can jointly model temporal and spatial dependencies during denoising.
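
A rough NumPy sketch of the idea behind the TSA module follows. The caption only states that the timestep embedding and spatial features are combined in the Q/K/V projections; the additive fusion, the sinusoidal form of $\mathrm{Emb}(t)$, the single attention head, and all names (`tsa_block`, `sinusoidal_embedding`) are assumptions made for illustration.

```python
import numpy as np


def sinusoidal_embedding(t, dim):
    """Sinusoidal timestep embedding; an assumed choice for Emb(t). dim must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def tsa_block(z_s, t, w_q, w_k, w_v):
    """Single-head self-attention over spatial tokens z_s of shape (tokens, dim).

    The timestep embedding z_t is added to every token before the Q/K/V
    projections (ASSUMED additive fusion), so the attention weights and the
    values both depend on the timestep.
    """
    _, dim = z_s.shape
    z_t = sinusoidal_embedding(t, dim)
    h = z_s + z_t                            # broadcast z_t over all tokens
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    attn = softmax(q @ k.T / np.sqrt(dim))   # (tokens, tokens)
    return attn @ v, attn


rng = np.random.default_rng(0)
tokens, dim = 16 * 16, 32                    # 16 x 16 feature map, hypothetical width
z_s = rng.normal(size=(tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

out_early, attn_early = tsa_block(z_s, 10, w_q, w_k, w_v)
out_late, attn_late = tsa_block(z_s, 900, w_q, w_k, w_v)
print(out_early.shape, attn_early.shape)
```

With identical spatial features, the two calls return different outputs because only the timestep differs, which is the behavior the caption describes.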

Figure 4. Comparison of generation results of CBDM, NoisyTwins, and PD-CBDM on the CIFAR100-LT dataset.

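Table 1. Ablation study of the proposed components (PD loss, re-weighting, and TSA) on CIFAR100-LT. Each "+" row adds one component to the baseline; "+All" combines all three.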

Methods	FID ↓	$F_{8}$ ↑	Recall ↑	IS ↑	$F_{1 / 8}$ ↑
Baseline	5.81	0.91	0.57	13.34	0.90
+PD Loss	5.19	0.90	0.58	13.45	0.92
+Re-weighting	5.48	0.89	0.63	13.10	0.94
+TSA	5.42	0.91	0.57	13.57	0.92
+All	4.96	0.91	0.58	13.60	0.92
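
As a reading aid for Table 1, the short script below computes the relative FID reduction of each row against the baseline. The FID values are copied from the table; the helper name `relative_fid_reduction` is hypothetical.

```python
# FID values from Table 1 (lower is better).
fid = {
    "Baseline": 5.81,
    "+PD Loss": 5.19,
    "+Re-weighting": 5.48,
    "+TSA": 5.42,
    "+All": 4.96,
}


def relative_fid_reduction(fid_scores, baseline_key="Baseline"):
    """Fractional FID reduction of every non-baseline row relative to the baseline."""
    base = fid_scores[baseline_key]
    return {
        name: (base - score) / base
        for name, score in fid_scores.items()
        if name != baseline_key
    }


reductions = relative_fid_reduction(fid)
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% lower FID than baseline")
```

The individual absolute gains (0.62, 0.33, and 0.39) sum to more than the combined gain of 0.85, which suggests that the three components overlap in effect and do not contribute additively.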