1. Introduction
In recent years, diffusion models [
1,
2] have made substantial progress in image generation [
3]. Compared with GAN-based approaches [
4], they often deliver better fidelity and stronger sample diversity. Beyond sample quality, diffusion models typically feature straightforward and stable optimization, which supports a wide range of applications, including text-to-image generation [
5,
6,
7], video generation [
8,
9], audio generation [
10,
11], and object detection [
12,
13]. In conditional diffusion models, auxiliary signals are used to steer the denoising process toward the desired output; such signals can be category labels [
14] or low-resolution inputs [
15,
16], leading to images that better match the given conditions.
Most existing diffusion training setups [
1,
17,
18] implicitly assume roughly balanced data distributions. In practice, many real-world datasets are long-tailed, where head classes contain many more samples than tail classes. This imbalance is especially evident in some domains, such as medical image generation [
19,
20]. Although conditional generative models can produce satisfactory images for head classes, the skewed distribution makes it difficult to capture tail-class characteristics, which often results in low-quality tail-class generations and harms overall performance. Accordingly, the core challenge in long-tailed diffusion generation is not merely to correct skewed class frequencies but to improve tail-class generation quality while preserving head-class quality under three coupled difficulties: head-class-dominated training, perceptual confusion between tail and visually dominant head classes, and insufficient timestep-aware modeling in the denoising network. Existing class-imbalance handling strategies in generative models mainly focus on resampling, re-weighting, or prior adjustment to reduce distribution bias; however, these strategies do not directly resolve all three difficulties in a unified manner.
For long-tailed datasets, unconditional diffusion models often generate a substantial number of low-quality images. CBDM (class-balancing diffusion models) [
21] was among the first to leverage diffusion models for image generation under long-tailed data distributions. It introduces a distribution-adjustment regularizer that encourages generated samples to match randomly chosen target labels more closely. This helps curb head-class overfitting and improves tail-class sample diversity by allowing the model to learn features beyond the majority classes. Other methods such as BPA [
22] revisit long-tailed diffusion training from the perspective of bias-aware prior adjustment. While these methods effectively alleviate imbalance at the class-distribution or prior-adjustment level, they do not explicitly address head–tail perceptual confusion during generation, nor do they revisit timestep-aware attention modeling in the denoising network. PD-CBDM complements these class-balancing strategies by explicitly addressing these two aspects within a unified diffusion framework. Building upon the success of CBDM, we use it as the baseline in our study to further investigate long-tailed image generation. An overview of the proposed PD-CBDM framework is provided in
Figure 1.
However, alleviating imbalance only through distribution or prior adjustment does not fully prevent head–tail confusion during sampling. We observe that, under long-tailed diffusion training, generated tail-class images often exhibit significant similarities to visually dominant head-class images, as shown in
Figure 2. For example, in the training set, the sample count of class 0 is a hundred times that of class 83. Consequently, when generating images for the tail class (class 83), features similar to the head class frequently appear, as demonstrated in
Figure 2b. This head–tail appearance overlap has also been discussed in recent work on class-imbalanced diffusion training, where directly regularizing the overlap between head and tail distributions can alleviate tail-class confusion [
23]. To address this issue, we introduce a perceptual distinguish loss to explicitly enlarge the separation between head- and tail-class representations, thereby reducing the tendency of the model to overfit head-class features. Specifically, during training, we resample head–tail-class image pairs and encourage a larger KL divergence between the distributions of head and tail images. Because the reverse-process variance is fixed, this term simplifies to an MSE loss on the mean images, which are computed from the predicted noise in the reverse process. The proposed loss increases the divergence between head and tail distributions while reducing the original diffusion loss.
Beyond improving head–tail representation separation, we also re-examine the target-label distribution used in the baseline [
21]. Since this distribution is itself long-tailed, drawing labels from it directly during training can over-emphasize frequent classes, which may hurt tail-class coverage and limit the variety of generated samples. We therefore adjust the label-sampling strategy to increase tail-class exposure in training.
Moreover, beyond data-level re-weighting and loss-level regularization, we note that in the original U-Net, timestep cues are mainly injected through residual blocks, while the self-attention layers remain timestep-agnostic. Motivated by [
24], we develop a timestep-dependent multihead self-attention module. The module incorporates timestep cues into self-attention computation [
25], enabling the attention layers to account for both denoising dynamics over time and spatial contexts. This design improves noise prediction and leads to better generation quality.
In this paper, our contributions are as follows:
We propose PD-CBDM, a diffusion-based framework for long-tailed image generation that extends prior class-balancing diffusion methods by jointly addressing class-prior imbalance, head–tail perceptual confusion, and timestep-aware denoising.
We introduce a re-weighted target-label distribution together with a perceptual distinguish loss to improve tail-class training exposure and explicitly enlarge the separation between head- and tail-class representations, thereby reducing the tendency of tail-class samples to resemble head classes.
We design a timestep-dependent self-attention module that injects timestep information into self-attention computation, enabling the denoising network to better capture temporal–spatial dependencies under imbalanced training.
Extensive experiments on CIFAR100-LT, CIFAR10-LT, CIFAR-100, and CIFAR-10 demonstrate the effectiveness of the proposed method across both imbalanced and balanced settings, as well as its competitive performance against representative recent baselines.
The paper is structured as follows. Section 1 introduces the problem and motivates our approach. Section 2 summarizes related work. Section 3 presents the proposed approach. Section 4 describes the experimental setup and reports ablation results. In Section 5, we benchmark our method against existing state-of-the-art approaches. Section 6 presents qualitative results. The remaining sections discuss the main findings and conclude the paper.
2. Related Work
We discuss relevant work on long-tail learning as well as attention mechanisms in this section.
2.1. Long-Tail Learning
Long-tailed category imbalance is a common phenomenon in real-world datasets, where a small subset of classes has abundant samples, while many others are sparsely represented [
26,
27,
28]. This imbalance often causes deep learning models to bias toward head classes with abundant data, leading to poor performance on tail classes with limited samples [
29,
30,
31].
In long-tail learning, long-tail recognition has garnered significant attention. For instance, SMOTE [
32] augments minority classes by synthesizing new samples via interpolation between a minority instance and one of its
k nearest minority neighbors. However, such resampling methods can suffer from edge distribution issues and lack precision in neighbor selection. Class-balanced (CB) loss [
27] addresses this by assigning weights to each class’s loss inversely proportional to its sample size, balancing their contributions during training. Transfer-based methods, represented by domain-specific transfer learning (DSTL) [
33], learn representations from long-tailed data and are subsequently adapted using a more balanced subset, so that knowledge can be better transferred to tail classes. Along a similar line, SSP [
34] relies on self-supervised pretraining (e.g., contrastive objectives or rotation-based prediction) before standard supervised learning on long-tailed data, aiming to obtain a more balanced feature space. However, self-supervised methods can be complex to implement. Recent studies have also extended diffusion-based paradigms to long-tailed learning beyond generation, such as improving long-tail recognition without relying on external knowledge [
35] and incorporating LLM-derived priors for long-tailed diffusion learning [
36].
In the domain of long-tailed image generation, CBGAN [
37] introduces a class balance regularizer, using category distribution information from a pre-trained classifier to constrain GAN outputs for a more balanced category distribution. The gSR (group spectral regularizer) [
38] alleviates mode collapse in CGAN [
39] by introducing a group spectral regularization term. However, CBGAN requires an additional classifier, and gSR, if overly strong, can restrict model learning, reducing diversity and increasing computational cost. NoisyTwins [
40] evaluates various GAN regularization techniques for long-tail image generation, identifying common issues such as mode collapse and category confusion. It proposes a class embedding enhancement strategy to prevent mode collapse and improve generation performance. Beyond GAN-based solutions, diffusion models have recently been explored for long-tailed image generation due to their stable training and strong fidelity. In the context of diffusion models, Xu et al. [
22] observed that a uniform noise sampling distribution across all classes biases the model toward head classes, degrading the quality and diversity of generated tail-class images, and proposed BPA (bias-aware prior adjusting) to mitigate this effect. Subsequent long-tailed diffusion studies have improved tail synthesis from complementary perspectives, including oriented calibration to better transfer and calibrate head knowledge for tail classes [
41], overlap optimization to reduce head–tail appearance confusion [
23], and contrastive conditional–unconditional alignment objectives to enhance long-tailed conditional generation [
42]. In parallel, recent work has begun to explicitly address long-tailed bias in diffusion-based image synthesis via dedicated solver-style designs such as LTB-Solver [
43].
Beyond task-specific long-tailed generation methods, recent foundation-model-based generative frameworks have highlighted the value of large-scale generative pretraining for visual representation and synthesis [
44,
45,
46]. Representative examples include SpectralGPT [
45] and related generative foundation-model paradigms [
44,
46], which reflect the broader trend toward large-scale generative pretraining and transferable representation learning. Although such methods are developed under substantially different data modalities, model scales, and training objectives from our CIFAR-style class-conditional long-tailed generation setting, they provide a useful broader context for positioning PD-CBDM within the evolving landscape of generative modeling. In contrast, PD-CBDM focuses on improving class-balanced diffusion training under explicit long-tailed image generation scenarios.
Overall, existing long-tailed generative methods mainly improve performance through class balancing, regularization, or sampling adjustment. In diffusion-based long-tailed generation, prior methods such as CBDM and BPA mainly focus on distribution regularization or bias-aware prior adjustment, while more recent studies also explore calibrated transfer and overlap optimization from complementary perspectives. Compared with these approaches, PD-CBDM unifies re-weighted target-label sampling, explicit head–tail perceptual separation, and timestep-aware attention modeling within a single diffusion framework. This unified design distinguishes PD-CBDM from prior diffusion-based approaches for long-tailed image generation.
2.2. Attention
In recent years, attention mechanisms have been increasingly integrated into convolutional neural networks (CNNs) to improve feature selection and highlight informative cues. By emphasizing salient signals, attention helps models focus on the most relevant parts of the input, often leading to better results. DDPM and improved DDPM use self-attention in their denoising networks, and CBDM adopts the same design. The standard spatial multihead self-attention module is widely used because it is simple and easy to implement, yet its suitability for diffusion models remains an open question. Stable diffusion [
47] is a text-to-image diffusion system that uses cross-attention to incorporate different conditioning signals (e.g., text prompts and bounding boxes), enabling high-resolution image synthesis. VPD [
48] further shows that cross-attention maps extracted from pre-trained text-to-image diffusion models can be used as explicit semantic cues for downstream visual perception tasks. By averaging cross-attention maps at different resolutions, it provides aggregated class-specific semantic information. Linear attention mechanisms have also gained traction in image generation. For instance, ref. [
49] transforms self-attention into a linear dot product of kernel feature mappings, reducing computational complexity to linear time and improving generation efficiency. The linear transformer model further facilitates diverse sample generation. Tactile diffusion [
50] adopts linear attention to reduce computational burden and accelerate generation, but replacing the Softmax-based self-attention with a kernel-style mapping often comes with a drop in performance. In parallel, diffusion transformer backbones have rapidly advanced for high-fidelity synthesis and efficient training (e.g., PixArt-
) [
44]. Moreover, several recent designs explicitly emphasize step-/timestep-aware attention or computation along the denoising trajectory, including step-wise dynamic attention mediators [
51], dynamic diffusion transformers [
52], and step-wise adaptive computation [
53]. Motivated by these trends, PD-CBDM introduces a timestep-dependent self-attention (TSA) module, which captures both temporal and spatial information, thereby enhancing the model’s precision in noise prediction.
3. Our Approach
3.1. Preliminary
To effectively address the challenges of image generation under long-tailed class distributions, we build upon the CBDM baseline and introduce a series of targeted improvements. We begin by summarizing the key diffusion-model background and the CBDM framework.
a. Diffusion Model: The diffusion model is a generative framework that synthesizes samples through an iterative denoising procedure. It specifies a forward noising chain that gradually maps data to Gaussian noise, together with a reverse denoising chain that reconstructs clean samples from noisy inputs.
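In the forward process, $q(x_t \mid x_{t-1})$ describes the perturbation of the original data point $x_0$ from the real distribution. This perturbation is achieved by mixing Gaussian noise with a mean of 0 and a variance of $\beta_t$ in a Markov chain manner over $T$ steps until the data fully transitions into Gaussian noise. The process is mathematically formulated as Equation (1):
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right) \tag{1}$$
where $\beta_t$ denotes the noise level hyperparameter, $\mathcal{N}$ denotes the normal distribution, and $\mathbf{I}$ denotes the identity matrix. Further, the data point $x_t$ at any intermediate step can be directly sampled from $x_0$ using a closed-form expression:
$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),\quad \alpha_t = 1-\beta_t,\quad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s \tag{2}$$
Conversely, the reverse process gradually restores the data from noise. The true reverse transition probability $q(x_{t-1} \mid x_t)$ is approximated by a learnable Gaussian model $p_\theta(x_{t-1} \mid x_t)$, parameterized by $\theta$, and is expressed as Equation (3):
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{3}$$
In the DDPM framework, the mean $\mu_\theta(x_t, t)$ is learned parametrically through a neural network, while the variance $\Sigma_\theta(x_t, t)$ is typically treated as a time-dependent constant. The work [2] reparameterized the noise prediction network $\epsilon_\theta$ by expressing $\mu_\theta$ as Equation (4):
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) \tag{4}$$
Equation (4) demonstrates that the mean image can be derived from the noise. Furthermore, the work [2] introduced a simplified training objective that focuses on noise prediction, significantly improving training efficiency and generating high-quality samples. The simplified loss function is defined as Equation (5):
$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right] \tag{5}$$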
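The closed-form forward sampling and the simplified noise-prediction objective described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: `T = 1000`, the lower schedule endpoint `1e-4`, and the names `make_schedule`, `q_sample`, and `simple_loss` are assumptions made for the example, and `eps_model` is a placeholder for the noise-prediction network.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with its cumulative products (T and beta_start are assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Draw x_t directly from x_0: sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bars[t].reshape(-1, *([1] * (x0.ndim - 1)))
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def simple_loss(eps_model, x0, t, alpha_bars, rng):
    """Simplified objective: mean squared error between true and predicted noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bars, eps)
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 3, 32, 32))        # toy batch standing in for 32x32 RGB images
t = rng.integers(0, len(betas), size=4)          # one timestep index per sample
# Placeholder predictor that always outputs zeros; a trained network would go here.
loss = simple_loss(lambda x_t, t: np.zeros_like(x_t), x0, t, alpha_bars, rng)
```

With the zero predictor the loss is close to 1, the variance of the injected noise, which gives a rough reference point for an untrained model.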
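b. Conditional Diffusion Probabilistic Models: The conditional generative diffusion model introduces conditional information and serves as the basis for our long-tail research. In the conditional generation setting, for training data $x_0$, the associated conditional information $c$ can be category labels or low-resolution images, sampled jointly from the data distribution $q(x_0, c)$. The forward process remains unchanged, and the goal is to train a conditional generative model $p_\theta(x_0 \mid c)$. The reverse process is updated to Equations (6) and (7):
$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 \mathbf{I}\right) \tag{6}$$
$$\mu_\theta(x_t, t, c) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c)\right) \tag{7}$$
where $\mu_\theta(x_t, t, c)$ represents the mean image under the conditional information, the variance is defined as a constant $\sigma_t^2$, and, as in the unconditional case, $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. Under these conditions, the loss function is updated to Equation (8):
$$\mathcal{L}_{\mathrm{DM}} = \mathbb{E}_{(x_0, c),\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2\right] \tag{8}$$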
c. CBDM: CBDM applies diffusion models to the long-tailed dataset setting, aiming to generate high-quality samples for tail classes under class-imbalanced conditions. In this context, there is a significant discrepancy in the number of training images between head and tail classes (with head-class images being hundreds of times more numerous than tail-class images), resulting in a severe lack of diversity in generated tail-class images. CBDM addresses this by using a distribution-adjustment regularizer during training, thus mitigating the mode collapse problem of tail classes. The loss function for CBDM is defined as Equation (9), and it comprises Equation (8) (the conditional diffusion loss in b) and a distribution-adjustment term:
$$\mathcal{L}_{\mathrm{CBDM}} = \mathcal{L}_{\mathrm{DM}} + \frac{\tau t}{|\mathcal{C}|}\sum_{c' \in \mathcal{C}}\Big(\left\lVert \epsilon_\theta(x_t, t, c) - \mathrm{sg}\big(\epsilon_\theta(x_t, t, c')\big)\right\rVert^2 + \gamma\left\lVert \mathrm{sg}\big(\epsilon_\theta(x_t, t, c)\big) - \epsilon_\theta(x_t, t, c')\right\rVert^2\Big) \tag{9}$$
Here, “sg” represents the stop-gradient operation, while $\tau$ and $\gamma$ are weight hyperparameters. Additionally, $\mathcal{C}$ denotes the set of class labels.
3.2. Perceptual Distinguish Loss
Training deep generative models under a long-tailed distribution presents two primary challenges: the imbalance in sample counts across classes biases the models toward head classes, leading to poor performance on tail classes; and the scarcity of tail-class images further complicates training. Existing class-balancing strategies mainly improve tail performance by adjusting sampling or class exposure, but they do not explicitly constrain the perceptual separation between head and tail classes during generation. To address this gap, we introduce a perceptual distinguish loss. Rather than serving as a generic auxiliary regularizer, it explicitly enlarges the separation between head- and tail-class reverse-process distributions, thereby reducing the tendency of tail-class generation to absorb visually dominant head-class features.
Specifically, in the diffusion reverse process, as shown in Equation (
7), we randomly sample from head and tail data (
and
) as
and
, respectively. We define the transition probabilities of head and tail data in the reverse process as
and
, respectively. In generating tail-class images, to emphasize the differences from head-class images, we introduce a penalty term designed to maximize the Kullback–Leibler (KL) divergence [
54] between the two distributions. This penalty is what we call the perceptual distinguish loss, and we introduce a weight
to balance the original diffusion loss with the perceptual distinguish loss. Consequently, the updated loss function is given by Equation (
10):
Optimizing the above objective increases the distinction between the distributions of generated head and tail data while simultaneously reducing the diffusion loss. In the reverse process, given a timestep
t and noise
, the mean image
can be computed as shown in Equation (
7), with the variance fixed as a constant. Therefore, the KL divergence can be expressed as Equation (
11):
where
C is a constant. An alternative idea is to directly perform MSE loss on
, but in our experiments, we found that this approach did not yield optimal results. We infer that the error in
obtained under a given noise
is too large. The PD-CBDM training algorithm is detailed in Algorithm 1.
Algorithm 1 Training Algorithm of PD-CBDM
Input: training data with condition c; timestep t; the target-label distribution; the number of elements in the label set; the distributions of head data and tail data after segmentation.
Output: conditional noise prediction model.
1: for every batch of size N do
2:   for each data-label pair in this batch do
3:     Sample a timestep t and Gaussian noise
4:     Calculate the conditional diffusion loss
5:     Sample a target label from the target-label distribution
6:     Calculate distribution-adjustment regularization term
7:     Sample a head image from the head distribution, sample a tail image from the tail distribution
8:     Calculate the head and tail mean images using Equation (7)
9:     Calculate pd loss
10:    Update the model with the combined loss
11:  end for
12: end for
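The head–tail steps of the training algorithm (computing the two mean images and the perceptual distinguish term) can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the function names, the weight `lam`, the toy values of `alpha_t`, `alpha_bar_t`, and `sigma2`, and the use of a per-pixel mean instead of a sum are all hypothetical choices, and random arrays stand in for the network's noise predictions.

```python
import numpy as np

def mean_image(x_t, eps_pred, alpha_t, alpha_bar_t):
    """Reverse-process mean computed from predicted noise (DDPM parameterization)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)

def pd_loss(mu_head, mu_tail, sigma2):
    """Negative mean squared distance scaled by 1 / (2 * sigma2).
    Minimizing this value pushes the head and tail mean images apart."""
    return -float(np.mean((mu_head - mu_tail) ** 2)) / (2.0 * sigma2)

def total_loss(diffusion_loss, mu_head, mu_tail, sigma2, lam=0.1):
    """Diffusion loss plus the weighted separation term (lam is a hypothetical weight)."""
    return diffusion_loss + lam * pd_loss(mu_head, mu_tail, sigma2)

rng = np.random.default_rng(0)
alpha_t, alpha_bar_t, sigma2 = 0.98, 0.5, 0.02            # toy schedule values
x_head, x_tail = rng.standard_normal((2, 3, 32, 32))      # noisy head and tail images
eps_head, eps_tail = rng.standard_normal((2, 3, 32, 32))  # stand-ins for predicted noise
mu_head = mean_image(x_head, eps_head, alpha_t, alpha_bar_t)
mu_tail = mean_image(x_tail, eps_tail, alpha_t, alpha_bar_t)
loss = total_loss(1.0, mu_head, mu_tail, sigma2)
```

The separation term is zero when the two mean images coincide and becomes more negative as they move apart, which is why it is subtracted from, rather than added to, the quantity being minimized.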
3.3. Rethinking Target-Label Distribution
In CBDM,
represents the target-label distribution after regularization, which aids the model in better learning and generating samples for tail classes by adjusting the distribution during training. However, because this distribution is still derived under long-tailed data, directly sampling from it can continue to over-emphasize frequent classes during training. Different from a generic class re-weighting strategy, we directly revisit the target-label distribution used in CBDM and invert the original
to construct the adjusted distribution
. This design increases tail-class exposure while avoiding the fidelity degradation associated with naively balanced sampling. The adjusted distribution is given by Equation (
12):
Instead of using a balanced distribution, we invert the target-label distribution so that tail classes receive relatively higher sampling probabilities. For classes with fewer images in the training set, the distribution-adjustment loss therefore selects underrepresented tail labels more often during training. This encourages the model to generate more diverse samples while still maintaining high fidelity.
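One simple way to invert a label distribution is to make each class probability proportional to the reciprocal of its original probability. The sketch below illustrates that idea only; the reciprocal form, the function name `invert_distribution`, and the toy class counts are assumptions and may differ from the exact form of Equation (12).

```python
import numpy as np

def invert_distribution(p, eps=1e-12):
    """Return q with q_i proportional to 1 / p_i (one possible inversion, assumed form)."""
    p = np.asarray(p, dtype=float)
    inv = 1.0 / np.maximum(p, eps)   # guard against division by zero
    return inv / inv.sum()

counts = np.array([500, 200, 50, 5])          # toy long-tailed class counts
p_orig = counts / counts.sum()                # original, head-heavy distribution
p_adj = invert_distribution(p_orig)           # adjusted, tail-heavy distribution

rng = np.random.default_rng(0)
labels = rng.choice(len(p_adj), size=10_000, p=p_adj)   # target labels for training
```

Under this inversion the rarest class receives the largest sampling probability, so tail labels dominate the sampled targets.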
3.4. Timestep-Dependent Self-Attention
As described previously in
Section 3.1, in CBDM, timestep information is only utilized in the residual blocks, while the self-attention layers remain timestep-agnostic. Typically, the timestep dependence of the denoising network is implemented through simple timestep embeddings, which are injected into the residual blocks using operations such as spatial addition. However, this simple mechanism may not optimally capture the time-dependent relationships throughout the denoising process.
To overcome this limitation, we propose a timestep-dependent self-attention (TSA) module. The TSA module injects timestep information into the self-attention mechanism to jointly model temporal and spatial dependencies, thereby improving noise estimation in the denoising process. Specifically, we introduce timestep dependence into the query/key/value projections of self-attention. Given the U-Net feature map
and timestep
t, we first compute the timestep embedding
, where
denotes the standard timestep embedding function used in diffusion models. We then pass
and
through separate linear layers and sum the projected features to construct timestep-conditioned
Q,
K, and
V before performing standard scaled dot-product self-attention, producing the updated output feature
, as illustrated in
Figure 3. The updated formula can be written as Equations (
13) and (
14):
In
Section 4.3.3, we evaluate the TSA module at different positions within the model. Based on those results, we place the TSA module in the U-Net at the feature-map resolution of 16, specifically after the first downsampling and before the last upsampling.
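The TSA computation described above (separate linear projections of the feature map and the timestep embedding, summed to form Q, K, and V) can be sketched as follows. This is a single-head NumPy illustration under stated assumptions, not the authors' module: the parameter names, the sinusoidal embedding, the random weights, and the omission of multiple heads, normalization, and the residual connection are simplifications made here.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal timestep embedding of even length dim, as commonly used in diffusion models."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tsa(x, t, params):
    """Single-head timestep-conditioned self-attention.
    x: (N, C) flattened feature map with N = H * W tokens; t: scalar timestep."""
    e = timestep_embedding(t, x.shape[1])               # (C,)
    q = x @ params["Wq_x"] + e @ params["Wq_t"]         # timestep term broadcast over tokens
    k = x @ params["Wk_x"] + e @ params["Wk_t"]
    v = x @ params["Wv_x"] + e @ params["Wv_t"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))       # (N, N) attention weights
    return attn @ v, attn

rng = np.random.default_rng(0)
C, H, W = 128, 16, 16                                   # channel count and resolution from the text
names = ["Wq_x", "Wq_t", "Wk_x", "Wk_t", "Wv_x", "Wv_t"]
params = {n: rng.standard_normal((C, C)) / np.sqrt(C) for n in names}
x = rng.standard_normal((H * W, C))
out, attn = tsa(x, t=500, params=params)
```

Because the timestep term enters the keys and values, the same feature map yields different attention weights and outputs at different timesteps, which is the behavior a timestep-agnostic attention layer lacks.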
4. Experiments
4.1. Datasets and Metrics
To evaluate our method under class-imbalanced generation settings, we conduct experiments on CIFAR-10 [55] and CIFAR-100 [55] and on their long-tailed variants. The long-tailed splits are constructed following [21, 30]. CIFAR-10 consists of 60,000 RGB images at a 32 × 32 resolution across 10 categories (6000 images per category). CIFAR-100 contains the same number of 32 × 32 RGB images but spans 100 categories (600 images per category), which are further grouped into 20 superclasses with 5 categories each.
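A long-tailed split with a given imbalance factor is commonly built by letting per-class training counts decay exponentially from the largest class to the smallest. The sketch below shows that common construction; whether it matches the procedure of [21, 30] in every detail is an assumption, and the function name `long_tailed_counts` is hypothetical. The per-class maxima are the training-set sizes of CIFAR-10 (5000) and CIFAR-100 (500).

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class counts decaying exponentially from n_max to n_max / imbalance_factor."""
    ratio = 1.0 / imbalance_factor
    return [int(n_max * ratio ** (i / (num_classes - 1.0))) for i in range(num_classes)]

cifar10_lt = long_tailed_counts(5000, 10, 100)     # CIFAR-10 training set: 5000 images per class
cifar100_lt = long_tailed_counts(500, 100, 100)    # CIFAR-100 training set: 500 images per class
```

With an imbalance factor of 100, the largest class keeps all of its training images while the smallest keeps one hundredth of them.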
We chose Fréchet inception distance (FID) [
56], inception score (IS) [
57], recall [
58], and
[
59] as evaluation metrics to assess the fidelity and diversity of the model-generated images. The value of
was set to 8. Recall and
were used as metrics to evaluate image diversity, while IS and
tended to measure image fidelity. In addition, for recall and
calculations, we use Inception-V3 features; the recall parameter
K is 5, and its clustering count is set to 20 times the number of classes.
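For reference, FID is the Fréchet distance between two Gaussians fitted to feature sets of real and generated images. The sketch below computes that distance with NumPy only; random vectors stand in for Inception-V3 features, and the function name `frechet_distance` is an assumption made for this illustration.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) is the sum of square roots of the eigenvalues of cov_a @ cov_b,
    # which are real and non-negative for positive semi-definite inputs (up to round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 6))                 # toy features, not Inception-V3 outputs
shift = np.array([1.0, 2.0, 0.0, 0.0, 0.0, 0.0])
fd_same = frechet_distance(real, real)               # identical sets
fd_shift = frechet_distance(real, real + shift)      # same covariance, shifted mean
```

Identical feature sets give a distance of zero, and a pure mean shift contributes the squared norm of the shift, so lower values indicate closer feature statistics.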
4.2. Experimental Setup
Training settings: On the CIFAR100 and CIFAR100-LT datasets, we trained for 500 K iterations, and on the CIFAR10 and CIFAR10-LT datasets, for 800 K iterations; all training was completed on a single NVIDIA GeForce RTX 3090 GPU. For all datasets, we used the Adam optimizer with a learning rate of , and set the model’s channel count to 128. The network included three stages, with the resolution halved between consecutive stages, and each stage had 2 residual blocks and 2 TSA blocks. All datasets used a linear noise schedule between and 0.02.
Testing settings: In the testing phase, the settings were the same as those of the baseline model. In the ablation experiments, we randomly sampled 50,000 images to measure all metrics. When comparing with SOTA models, we randomly sampled 10,000 images for comparison with GAN-based methods and 5000 images for comparison with diffusion-based methods. All sampling was completed on a single NVIDIA GeForce RTX 3090.
4.3. Ablation Study
4.3.1. Performance Analysis of Proposed Methods
To quantify the contribution of each proposed component, we perform controlled ablations on CIFAR100-LT by selectively enabling the three modules of PD-CBDM. We show the performance change when adding the perceptual distinguish loss, re-weighting, and TSA module to the baseline model. As shown in
Table 1, adding the perceptual distinguish loss reduces FID from 5.81 to 5.19 and slightly improves IS, suggesting that explicitly enlarging head–tail separation mainly improves generation fidelity. Re-weighting increases recall from 0.57 to 0.63 and
from 0.90 to 0.94, indicating that greater tail-class exposure mainly enhances diversity and coverage, although IS decreases slightly. TSA improves FID to 5.42 and IS to 13.57, suggesting that timestep-aware attention benefits denoising quality. When all three components are combined, PD-CBDM achieves the best overall trade-off, improving FID from 5.81 to 4.96 and IS from 13.34 to 13.60. These results indicate that the three modules play complementary roles in improving long-tailed image generation.
4.3.2. Analysis of Data Division Ratio
In the perceptual distinguish loss, we explored the impact of different head–tail data division ratios on experimental results. Specifically, we sorted the training data by class frequency and partitioned them into head and tail subsets at ratios of 6:4, 7:3, and 8:2. Since the imbalance factor varies greatly with the training dataset, different division ratios may need to be set for different imbalance factors to achieve optimal performance. As shown in
Table 2, using the perceptual distinguish loss significantly improved model performance. With a division ratio of 7:3, the FID reached 5.19, a reduction of 0.62 compared to the baseline; with a division ratio of 6:4, the IS reached 13.59, an increase of 0.25 compared to the baseline model. Prioritizing FID, we set the division ratio for all datasets to 7:3.
4.3.3. TSA Module Position Selection
To investigate the effect of the TSA module in the denoising U-Net, ablation studies were conducted by inserting TSA at different locations within the network. Specifically, the module was added at each resolution transition point—after downsampling, before upsampling, and in the bottleneck—corresponding to feature-map resolutions of 16, 8, and 4, respectively. As shown in
Table 3, compared to using TSA only at resolution 16, applying it at all resolutions led to a performance drop: FID worsened by 0.34 and IS decreased by 0.33. This indicates that using TSA across all resolutions may weaken model performance. Consequently, for other datasets with limited resolution, the TSA module was only applied to feature maps of size 16, and its placement at other resolutions was not further explored. This result suggests that the gain from TSA comes from localized timestep-aware attention rather than from simply stacking more attention operations, which helps isolate the architectural contribution of TSA.
4.3.4. Different Guidance Strength Analysis
In the conditional generation settings, the guidance strength
is a key parameter that controls the trade-off between the quality and diversity of generated samples. During the sampling process, we experimented with the impact of the guidance strength
in classifier-free guidance (CFG) on generation quality. For each dataset, we measured the model’s FID and IS with guidance strength
in the range of [0.2, 2]. Using the method in paper [
24], we sampled 50,000 images at each step of 0.2 and computed their FID and IS to determine the optimal
for each dataset. As shown in
Table 4, we found that as
increases, the model’s generation diversity degrades, but the fidelity improves. When
is 1.4, the model’s FID reaches its best value of 4.96, whereas the IS continues to improve as the guidance strength increases further. Since FID is best at this value, we set
to 1.4 for the CIFAR100-LT dataset. Using the same method, we searched for the optimal
for several other datasets, namely 1.2 (CIFAR10-LT), 0.9 (CIFAR-100), and 0.8 (CIFAR-10).
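The role of the guidance strength can be illustrated with the usual classifier-free guidance combination of conditional and unconditional noise predictions. The sketch below assumes the common form in which the guided prediction is `(1 + omega) * eps_cond - omega * eps_uncond`; the name `omega`, the function `cfg_noise`, and the exact convention used in the experiments are assumptions.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, omega):
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    return (1.0 + omega) * eps_cond - omega * eps_uncond

# Guidance strengths swept in steps of 0.2 over [0.2, 2.0], matching the range in the text.
omegas = np.round(np.arange(1, 11) * 0.2, 1)

rng = np.random.default_rng(0)
eps_c = rng.standard_normal((3, 32, 32))   # stand-in for the conditional prediction
eps_u = rng.standard_normal((3, 32, 32))   # stand-in for the unconditional prediction
guided = [cfg_noise(eps_c, eps_u, w) for w in omegas]
```

In an actual sweep, each value in `omegas` would be used to sample a full image set, and FID and IS would be computed per value to pick the best setting for each dataset.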
4.4. Performance on Class-Balanced Datasets
To further examine the effectiveness of the proposed method, we also evaluate PD-CBDM on balanced datasets. Specifically, for balanced datasets, we still use a 7:3 ratio for data division after shuffling the image classes, and we no longer use the re-weighting method (because the target-label distribution is already balanced). Recently, GP-MI and MMD [
60] proposed a reinforcement learning-based fine-tuning method that leverages a reward function called “Diversity Reward” to guide the training of diffusion models. The results are shown in
Table 5. Our method is also applicable to balanced datasets: compared with unconditional DDPM generation, FID improves by 0.59 and IS by 0.64, and all metrics surpass those of the baseline model CBDM; however, MMD performs better in generation diversity. Therefore, our method is not only suitable for class-imbalanced datasets but also performs well on conventional balanced datasets.
5. Comparison with SOTA Models
Image generation research has progressed quickly, and recent models continue to improve quantitative metrics. For a comparison against strong baselines, we evaluate PD-CBDM under several imbalance factors (1, 10, and 100) and report FID as the primary measure. Concretely, we compute FID on CIFAR10-LT with , and on CIFAR100-LT with .
As shown in Table 6, we compare PD-CBDM with several GAN-based models, including SNDCGAN [61], CBGAN [37], and NoisyTwins [40]. Among the compared methods, gSR [38] is a regularization technique applied during GAN training. Table 6 shows that PD-CBDM outperforms the other models under all imbalance factors, which suggests that GAN-based models are less suited to image generation on long-tailed datasets.
On the CIFAR10-LT dataset, FID improves over the baseline model by 1.38 when the imbalance factor is 100, by 0.62 when it is 10, and by 1.14 when it is 1. Compared with NoisyTwins, the recent SOTA model among GANs, PD-CBDM achieves better FID scores under all of these imbalance factors. On the CIFAR100-LT dataset, it substantially outperforms the GAN-based models, and its FID improves by 2.57 over the baseline model. These results indicate that PD-CBDM remains effective with more categories and stronger imbalance.
On the other hand, we further compared the performance of several diffusion-model-based image generation algorithms under long-tailed distributions, including DDIM [
65] and BPA [
22]. Specifically, we evaluated each model on the CIFAR100-LT and CIFAR10-LT datasets with an imbalance factor
of 100. Following the BPA model settings, we randomly sampled 5000 images to compare their FID and IS metrics, and the results are shown in
Table 7.
From Table 7, it can be observed that, compared with the recent SOTA diffusion model BPA, PD-CBDM achieves a better (lower) FID on both datasets. On the CIFAR10-LT dataset, its FID improves on BPA’s by 0.41, although its IS is slightly lower. Similarly, on the CIFAR100-LT dataset, PD-CBDM improves on BPA by 0.64 in FID but falls behind by 0.14 in IS.
6. Qualitative Analysis
To further evaluate the perceptual quality and diversity of the generated images, we present qualitative visual comparisons of the baseline, NoisyTwins, and PD-CBDM models on the CIFAR100-LT dataset, as illustrated in Figure 4. Under the same random seed, PD-CBDM generates higher-quality tail-class images than the other two models, with higher diversity and fidelity, more distinct primary subjects, and more varied backgrounds. For instance, for the tail category “flowers”, the baseline and NoisyTwins models tend to produce uniform, less detailed backgrounds, indicating underfitting to tail-class data and limited diversity. In contrast, PD-CBDM generates a wider variety of “flowers” with richer and more diverse background details.
This comparison highlights PD-CBDM’s superior capability in generating high-quality tail-class images while maintaining the generation quality of head-class images.
7. Discussion
PD-CBDM aims to separate head- and tail-class distributions via a perceptual distinguish loss, enhancing tail-class image quality with minimal impact on head classes. This section analyzes its effectiveness from both theoretical and empirical perspectives.
From a theoretical standpoint, the effectiveness of PD-CBDM can be attributed to the interplay of three key components: the perceptual distinguish loss, the timestep-dependent self-attention (TSA) module, and the redefined target-label distribution. First, the perceptual distinguish loss adds an explicit regularization term that encourages a larger KL-based separation between head- and tail-class feature distributions. By enforcing this separation in the perceptual feature space, the model is encouraged to learn class-specific representations, especially for underrepresented tail classes. This helps prevent the common issue of feature collapse, where tail-class features are overwhelmed by or misaligned with those of head classes. Second, the TSA module improves dependency modeling by injecting temporal and spatial cues into the denoising process. Traditional attention mechanisms within U-Net architectures often neglect the timestep context that is crucial to diffusion models. By making attention explicitly timestep-dependent, TSA incorporates timestep information into the attention computation, which can improve noise estimation and denoising dynamics across the diffusion process. This enhancement is particularly beneficial for tail classes, whose generation paths may require finer temporal modeling to reconstruct semantically meaningful samples. Finally, the redefinition of the target-label distribution acts as an implicit re-weighting strategy. By assigning higher sampling probabilities or training emphasis to tail classes, it aligns with long-tailed data balancing principles. This promotes the generation of a more diverse and representative sample set for underrepresented classes, thereby mitigating class imbalance at the data level during training.
Together, these components create a synergistic framework: the perceptual loss enforces inter-class separability, TSA improves temporal modeling capacity, and the adjusted label prior addresses sample diversity, all of which contribute to PD-CBDM’s ability to generate high-fidelity, class-balanced outputs in long-tailed settings.
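The idea of a KL-based separation term can be illustrated with a small sketch. It is a hypothetical reading, not the paper's exact loss: feature vectors are assumed to be turned into distributions with a softmax, the direction KL(head || tail) is chosen arbitrarily, and `weight` is an illustrative coefficient.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def separation_penalty(head_features, tail_features, weight=1.0):
    """Negative weighted KL between softmax-normalized feature vectors.

    Added to a loss that is minimized, this term rewards a larger
    separation between the head and tail feature distributions.
    """
    p_head = softmax(head_features)
    p_tail = softmax(tail_features)
    return -weight * kl_divergence(p_head, p_tail)


# Identical features give no reward; well-separated features lower the loss.
same = separation_penalty([0.1, 0.5, 0.2], [0.1, 0.5, 0.2])
apart = separation_penalty([2.0, 0.0, 0.0], [0.0, 0.0, 2.0])
```

Because the term enters with a negative sign, minimizing the total objective pushes the two distributions apart, which is the behavior attributed to the perceptual distinguish loss above.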
From a computational perspective, the three components of PD-CBDM introduce different types of overhead. The re-weighted target-label distribution only changes label sampling during training and does not alter the inference procedure. The perceptual distinguish loss adds extra head–tail pair processing and a separation term during training, but it is not used at inference time. By contrast, TSA increases computation in both training and sampling because timestep-conditioned projections are introduced into self-attention. Nevertheless, this overhead is localized, since TSA is inserted at only one stage of the U-Net rather than across all resolutions. Therefore, compared with the baseline, the additional training overhead comes from both the perceptual distinguish loss and TSA, whereas the additional inference-time overhead mainly comes from the TSA-enhanced attention computation.
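To show where the extra attention cost comes from, the following sketch conditions the query and key inputs of a single-head self-attention layer on a sinusoidal timestep embedding. This is an assumed, simplified design for illustration only; the actual TSA module may inject the timestep differently, and all function names here are hypothetical.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of an integer timestep (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def timestep_attention(tokens, t, w_q, w_k, w_v):
    """Single-head self-attention with timestep-conditioned queries and keys."""
    dim = len(tokens[0])
    emb = timestep_embedding(t, dim)
    # Assumption: the timestep is added to the inputs of the Q/K projections only.
    conditioned = [[x + e for x, e in zip(tok, emb)] for tok in tokens]
    queries = [matvec(w_q, tok) for tok in conditioned]
    keys = [matvec(w_k, tok) for tok in conditioned]
    values = [matvec(w_v, tok) for tok in tokens]
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * v[j] for a, v in zip(attn, values)) for j in range(dim)])
    return outputs, weights


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
out_early, attn_early = timestep_attention(tokens, 1, identity, identity, identity)
out_late, attn_late = timestep_attention(tokens, 500, identity, identity, identity)
```

The same tokens yield different attention weights at different timesteps, and the added work is confined to the layer where the conditioning is applied, which matches the localized overhead described above.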
Although our experiments are conducted on CIFAR-style benchmarks, the three components of PD-CBDM are not inherently restricted to low-resolution settings. The re-weighted target-label distribution operates at the data-sampling level, the perceptual distinguish loss modifies the training objective, and TSA is introduced as a modular attention design in the denoising network. Therefore, the overall framework is, in principle, transferable to higher-resolution or real-world long-tailed datasets. Nevertheless, extending PD-CBDM to such settings may introduce additional challenges. Higher-resolution generation would increase both memory and computation, while real-world long-tailed datasets may exhibit more complex head–tail visual overlap and broader distribution shifts across categories. These issues may affect both optimization stability and the effectiveness of the current module configuration. We therefore regard scalable long-tailed diffusion generation on higher-resolution and real-world datasets as an important direction for future work.
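As a concrete picture of what operating at the data-sampling level means, the sketch below tilts a label distribution toward tail classes and draws labels from it. The inverse-frequency power rule, the exponent `tau`, and the class counts are illustrative assumptions and not the paper's redefined target-label distribution.

```python
import random


def reweighted_label_distribution(class_counts, tau=1.0):
    """Label probabilities proportional to (1 / count) ** tau.

    tau = 0 gives a uniform distribution; larger tau favors tail classes.
    """
    weights = [(1.0 / n) ** tau for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]


def sample_labels(distribution, k, rng):
    """Draw k class labels according to the given distribution."""
    return rng.choices(range(len(distribution)), weights=distribution, k=k)


# Illustrative counts for a head, a middle, and a tail class.
counts = [5000, 500, 50]
uniform = reweighted_label_distribution(counts, tau=0.0)
tilted = reweighted_label_distribution(counts, tau=1.0)
labels = sample_labels(tilted, 8, random.Random(0))
```

Nothing in this step depends on image resolution, which is why this component carries over to larger datasets without modification.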
From an experimental standpoint, PD-CBDM demonstrates strong performance across both imbalanced and balanced settings under the evaluated benchmarks. As shown in
Table 6, under varying imbalance factors, PD-CBDM achieves lower FID scores compared to the baseline model. This suggests that PD-CBDM effectively mitigates the degradation of generative performance typically observed under class imbalance, maintaining high visual fidelity even for tail classes. The superior performance under these challenging conditions highlights the method’s capacity to model underrepresented distributions more accurately. Moreover, on conventional balanced datasets (
Table 5), PD-CBDM outperforms both DDPM and CBDM across all evaluated metrics, including FID and IS. This result confirms that the improvements introduced by PD-CBDM—such as the perceptual distinguish loss and the TSA module—do not overfit to the imbalanced setting but instead provide broader benefits in modeling and generation quality. This indicates that PD-CBDM enhances the underlying denoising and sample quality mechanisms in diffusion models, making it a more universally effective solution. Qualitative results further reinforce these findings. As illustrated in
Figure 4, PD-CBDM produces notably higher-quality samples than both the baseline and NoisyTwins models on the CIFAR100-LT dataset, particularly for tail categories. Compared to other methods, PD-CBDM-generated images exhibit more distinct main objects, finer structural details, and richer, more diverse backgrounds. These visual differences imply that PD-CBDM not only increases the fidelity of tail-class samples but also enriches intra-class diversity, two critical aspects often lacking in long-tailed generative modeling.
In addition, we quantify how TSA affects noise prediction by measuring the sampling-time noise error. We follow Algorithm 2 and remove sampling stochasticity, then compute the prediction error for CBDM and PD-CBDM under this deterministic setting. For CIFAR100-LT, images are normalized to a fixed range, and we compute the per-pixel distance between the predicted noise and the target noise using the learned weights of each model. As reported in Table 8, PD-CBDM reduces the noise error by 0.18 compared with CBDM. This improvement suggests that incorporating TSA helps the denoising network capture noise-related information more precisely, leading to more accurate noise prediction.
Algorithm 2 Noise Error Estimation with Discarded Stochasticity
Input: the training data; the timestep t; the Gaussian noise; the trained noise prediction model.
Output: the noise error.
1: …
2: repeat
3:   …
4:   Compute … using Equation (2)
5:   for t to 1 do
6:     …
7:   end for
8:   …
9: until N iterations
10: return the noise error
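The overall shape of such a measurement can be sketched on a scalar toy problem. Everything specific here is an assumption made for illustration: a linear noise schedule, a DDIM-style deterministic update with no fresh noise, an absolute per-pixel error, and a target defined as the noise implied by the known clean sample. The update rules and metric of Algorithm 2 may differ.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for a linear schedule; index 0 is 1.0."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def implied_noise(x_t, x0, alpha_bar):
    """Noise that carries x0 to x_t at the given cumulative noise level."""
    return (x_t - math.sqrt(alpha_bar) * x0) / math.sqrt(1.0 - alpha_bar)


def noise_error(predict_noise, x0, t_start, alpha_bars, num_iters, rng):
    """Mean absolute noise-prediction error along deterministic trajectories."""
    total, count = 0.0, 0
    for _ in range(num_iters):
        eps = rng.gauss(0.0, 1.0)
        a = alpha_bars[t_start]
        x_t = math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps  # forward noising
        for t in range(t_start, 0, -1):
            a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
            eps_hat = predict_noise(x_t, t)
            total += abs(eps_hat - implied_noise(x_t, x0, a_t))
            count += 1
            # Deterministic update: no fresh noise is injected.
            x0_hat = (x_t - math.sqrt(1.0 - a_t) * eps_hat) / math.sqrt(a_t)
            x_t = math.sqrt(a_prev) * x0_hat + math.sqrt(1.0 - a_prev) * eps_hat
    return total / count


alpha_bars = make_alpha_bars(50)
X0 = 0.5  # toy "dataset": a single scalar pixel value


def exact_model(x_t, t):
    """Ideal predictor for the single-point toy dataset."""
    return implied_noise(x_t, X0, alpha_bars[t])


def biased_model(x_t, t):
    """Predictor with a constant offset, standing in for a less accurate model."""
    return exact_model(x_t, t) + 0.1


err_exact = noise_error(exact_model, X0, 50, alpha_bars, 20, random.Random(0))
err_biased = noise_error(biased_model, X0, 50, alpha_bars, 20, random.Random(0))
```

Removing the sampling noise makes the two runs directly comparable, so any difference in the reported error reflects the predictor itself rather than random variation between trajectories.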
Nevertheless, several open challenges remain. Under more extreme imbalance, head-class dominance and head–tail confusion may become more severe, which could further affect optimization stability and tail-class fidelity. In addition, extending PD-CBDM to larger or higher-resolution datasets would introduce higher memory and computation costs, while cross-domain transfer to more realistic long-tailed datasets may involve more complex visual overlap and broader distribution shifts across categories. These issues remain important directions for future study.
8. Conclusions
This paper proposes PD-CBDM, a method designed to generate high-quality images on long-tailed datasets. In class-imbalanced situations, generative models often rely too heavily on head classes, leading to low-quality tail samples that resemble head data. To address this issue, we introduce a perceptual distinguish loss to better separate head- and tail-class representations. In addition, the target-label distribution of the baseline model is adjusted to give more weight to tail classes during training, which helps increase the diversity of generated images. To further improve denoising quality, a timestep-dependent self-attention (TSA) module is added to the denoising network, allowing the model to exploit both temporal and spatial information. Experiments show that PD-CBDM improves FID from 5.81 to 4.96 on CIFAR100-LT and from 8.10 to 7.48 on CIFAR10-LT, and also performs well on standard balanced datasets.
Limitations and future work: Although PD-CBDM achieves competitive FID and IS scores, it still has limitations. The method improves class-imbalanced image generation through a combination of re-weighted label sampling, perceptual-loss regularization, and timestep-aware attention, without a dedicated architecture specifically designed for long-tailed diffusion generation. In particular, its overall generation efficiency remains constrained by the iterative sampling nature of diffusion models, and the added TSA module further increases computation compared with the baseline. Under more extreme imbalance, the current configuration may also face additional optimization difficulty and stronger head–tail confusion. In future work, we plan to explore more efficient architectures and faster sampling strategies for long-tailed diffusion generation, and to further evaluate the proposed framework on larger, higher-resolution, and more realistic long-tailed datasets.