NTCE-KD: Non-Target-Class-Enhanced Knowledge Distillation

Most logit-based knowledge distillation methods transfer soft labels from the teacher model to the student model via Kullback–Leibler divergence based on softmax, an exponential normalization function. However, the exponential nature of softmax tends to prioritize the largest class (the target class) while suppressing the smaller ones (the non-target classes), causing the significance of the non-target classes to be overlooked. To address this issue, we propose Non-Target-Class-Enhanced Knowledge Distillation (NTCE-KD), which amplifies the role of non-target classes in terms of both magnitude and diversity. Specifically, we present a magnitude-enhanced Kullback–Leibler (MKL) divergence that multi-shrinks the target class to strengthen the impact of non-target classes in terms of magnitude. Additionally, to enrich the diversity of non-target classes, we introduce a diversity-based data augmentation strategy (DDA), further improving overall performance. Extensive experiments on the CIFAR-100 and ImageNet-1k datasets demonstrate that non-target classes are of great significance and that our method achieves state-of-the-art performance across a wide range of teacher–student pairs.


Introduction
With the rapid advancements in deep learning, neural networks have undergone significant development, achieving remarkable breakthroughs in diverse domains including image classification [1][2][3], object detection and tracking [4][5][6][7], and semantic segmentation [8,9]. However, despite their impressive performance, these models typically require substantial computational and storage resources, posing challenges for practical deployment on devices like intelligent sensors.
Given the typical resource constraints of intelligent sensors, such as limited memory and computational capabilities, knowledge distillation (KD) emerges as a particularly promising solution [10]. KD enables the transfer of intricate knowledge from heavyweight teacher models to lightweight student models, allowing the latter to achieve comparable performance while significantly reducing resource requirements. This approach is particularly relevant in the context of intelligent sensors, where efficient utilization of resources is crucial for effective real-world deployment.
KD is primarily categorized into two branches: logit-based distillation and feature-based distillation. Logit-based methods transfer knowledge by minimizing the Kullback–Leibler (KL) divergence between output distributions. Conversely, feature-based methods leverage knowledge from deep intermediate layers, achieving superior performance at the cost of higher computational demands. However, logits carry high-level semantic information and should, in theory, provide rich "dark knowledge", so logit-based methods ought to match or surpass feature-based methods. We therefore believe that the knowledge within the logits (i.e., the soft labels) of the teacher model has not been fully exploited.
Soft labels of the teacher model encompass the target-class logit along with the non-target-class logits. The target class carries knowledge of the sample's true category, whereas the non-target classes contain rich knowledge of category relevance. The soft labels are obtained by applying a softmax function to the logits. However, the softmax function commonly used in most existing logit-based methods tends to disproportionately accentuate the largest class due to its exponential nature, thereby overlooking the informative guidance within the non-target classes, as shown in the orange dashed box in Figure 1. Moreover, the same sample, viewed from different perspectives, may exhibit different categorical appearances (green dashed box in Figure 1), thereby providing diverse knowledge of category relevance. Transferring knowledge of samples from a single perspective fails to fully exploit the latent inter-class correlation within each sample.

To address these issues, we propose a flexible and efficient logit-based distillation method dubbed Non-Target-Class-Enhanced Knowledge Distillation (NTCE-KD) to enhance the role of the non-target classes in terms of both magnitude and diversity. Firstly, we introduce a magnitude-enhanced KL (MKL) divergence, in which the target class of the teacher's logits is multi-shrunk before applying softmax, yielding more informative soft labels rich in non-target-class knowledge. To ensure the convergence of the model, identical compensatory logit shrinkage is applied to the student's target class. Moreover, to explore the diverse categorical relevance knowledge within the non-target classes, we present a diversity-based data augmentation strategy (DDA) to obtain various views of the samples.
Overall, our contributions can be summarized as follows:
• We reveal the effect of non-target classes and present an improved KL divergence, named MKL, achieved by applying multi-shrinkages to the target-class logits of both the teacher and the student, thus amplifying the role of non-target classes in terms of magnitude.
• We demonstrate that different views of an identical sample yield varying levels of similarity knowledge among categories. To enhance the diversity of non-target classes, we introduce a data augmentation strategy named DDA.
• We propose NTCE-KD, a novel approach that enhances the significance of non-target classes in terms of both magnitude and diversity. We conduct extensive experiments on CIFAR-100 [11] and ImageNet-1k [12] across various teacher–student pairs, demonstrating our model's significant superiority.

Related Work

Knowledge Distillation
Knowledge distillation, introduced by Hinton et al. [10], stands as an efficient model compression technique. Its core objective is to transfer the learned knowledge from a teacher model to a student model. Within the realm of knowledge distillation, two primary methods have gained significant attention: logit-based distillation and feature-based distillation.
Logit-based distillation methods [10] primarily focus on aligning the output logits of the teacher and student models. This approach offers a straightforward and practical solution for knowledge transfer. Conversely, feature-based distillation methods [13][14][15][16] emphasize the alignment of intermediate features extracted from the hidden layers of the teacher model. While these feature-based methods often demonstrate impressive performance, they tend to introduce significant computational overhead. This can render them impractical in scenarios where accessing intermediate features poses challenges.
Current logit-based distillation methods often require the student model to mimic the softmax-standardized soft labels of the teacher model, potentially overlooking knowledge within the non-target classes. To improve knowledge transfer, certain adaptive logit-based distillation methods have been proposed, which may inadvertently increase the influence of the non-target classes to some extent. ATS [17] employs a lower temperature for the target class compared to the non-target classes. Several other approaches [18][19][20] adjust specific temperatures to globally scale the soft labels. However, these temperature-based methods are restricted to globally scaling soft labels, limiting their ability to flexibly explore knowledge within both target and non-target classes. DKD [21] aims to enhance the teacher's soft labels by decoupling the target and non-target classes from the KL divergence in a fixed proportion, but it lacks sample-wise enhancement for the non-target classes. Moreover, the diversity of the non-target distribution remains underexplored. Recognizing these limitations, our NTCE-KD explicitly enhances the magnitude and diversity of the non-target classes, thereby amplifying their role.

Preliminaries
Consider an image classification dataset containing N samples {x_n, y_n}_{n=1}^{N}, where x_n ∈ R^{H×W} is the n-th sample and y_n ∈ [1, K] is its corresponding label. Here, H and W are the height and width of the image, and K is the total number of classes. Let f^T denote the teacher model and f^S the student model, with teacher logits v_n = f^T(x_n) and student logits z_n = f^S(x_n), where z_n, v_n ∈ R^{1×K}.
The predicted probability vectors p(z_n, τ) and p(v_n, τ) are standardized by softmax; their k-th classes, p(z_n, τ)^{(k)} and p(v_n, τ)^{(k)}, are calculated as
p(z_n, τ)^{(k)} = exp(z_n^{(k)}/τ) / Σ_{j=1}^{K} exp(z_n^{(j)}/τ),   (1)
p(v_n, τ)^{(k)} = exp(v_n^{(k)}/τ) / Σ_{j=1}^{K} exp(v_n^{(j)}/τ),   (2)
where τ is the temperature used to soften the probabilities. Notably, k = y_n denotes the target class of the probability vectors and k ≠ y_n denotes the non-target classes. Knowledge distillation aims to align the predicted probability vector p(z_n, τ)^{(k)} of the student with the soft labels p(v_n, τ)^{(k)} of the teacher for each class via KL divergence.
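To make the exponential-dominance effect of Equations (1) and (2) concrete, the following sketch (plain NumPy, with illustrative logits rather than real model outputs) compares temperature-scaled softmax at τ = 1 and τ = 4:

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Temperature-scaled softmax; higher tau yields softer probabilities."""
    scaled = logits / tau
    scaled = scaled - scaled.max()  # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical logits: one dominant target class, several non-target classes.
logits = np.array([9.0, 3.0, 2.0, 1.0, 0.5])

p1 = softmax_with_temperature(logits, tau=1.0)
p4 = softmax_with_temperature(logits, tau=4.0)
# At tau = 1 the target class absorbs almost all probability mass;
# at tau = 4 the non-target classes retain noticeably more of it.
print(p1.round(4))
print(p4.round(4))
```

This illustrates why a temperature greater than 1 is used during distillation: it partially counteracts the exponential suppression of the non-target classes.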
L_KL = Σ_{k=1}^{K} p(v_n, τ)^{(k)} log [ p(v_n, τ)^{(k)} / p(z_n, τ)^{(k)} ],   (3)
where L_KL is the knowledge distillation loss. The temperature τ is set greater than 1 to produce softer probability vectors that convey more information.
In addition to the soft labels, it is generally beneficial to train the student with the ground-truth labels via the cross-entropy loss:
L_CE = −log p(z_n, τ)^{(y_n)},   (4)
where the temperature τ is set to 1.
The overall optimization objective combines the knowledge distillation loss L_KL and the cross-entropy loss L_CE:
L = α L_CE + β L_KL,   (5)
where α and β are weights for balancing the losses.
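A minimal NumPy sketch of this overall objective is shown below. The function name and the example logits are ours; the default weights α = 1 and β = 8 follow the settings reported later in the hyper-parameter tables, and the τ² gradient-scaling factor sometimes used in practice is omitted for simplicity:

```python
import numpy as np

def softmax(x, tau=1.0):
    e = np.exp(x / tau - np.max(x / tau))
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, label, tau=4.0, alpha=1.0, beta=8.0):
    """Sketch of the overall objective L = alpha * L_CE + beta * L_KL."""
    p_s = softmax(student_logits, tau)
    p_t = softmax(teacher_logits, tau)
    # KL(teacher || student) on temperature-softened probabilities
    l_kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
    # cross-entropy against the ground-truth label at tau = 1
    l_ce = float(-np.log(softmax(student_logits, 1.0)[label]))
    return alpha * l_ce + beta * l_kl
```

When the student's logits equal the teacher's, the KL term vanishes and only the cross-entropy term remains, consistent with the convergence behavior discussed later.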

Magnitude Enhancement
In Sections 3.2.1 and 3.2.2, we show the effect of the non-target classes on knowledge distillation from a mathematical perspective and the magnitude-related drawbacks of the original KL divergence. In Section 3.2.3, we propose a magnitude-enhanced KL divergence that shrinks the target class, as shown in Figure 2.

Effect of Non-Target
In classification tasks, the non-target classes are the categories not selected as the predicted class for a given sample. In this section, we analyze the effect of the non-target classes from the perspectives of reinterpretation and gradient.
Reinterpretation of KL. The KL divergence in Equation (3) can be interpreted as a cumulative weighted difference across all classes:
L_KL = Σ_{k=1}^{K} p(v_n, τ)^{(k)} [ log p(v_n, τ)^{(k)} − log p(z_n, τ)^{(k)} ],   (6)
where p(v_n, τ)^{(k)} is the weight for the k-th class and log p(v_n, τ)^{(k)} − log p(z_n, τ)^{(k)} is the difference in the k-th class.
Enhanced model generalization. Equation (6) aims to align the logarithmic probabilities of both target and non-target classes between teacher and student. Thus, the model learns to discriminate not only between the correct and incorrect classes but also among the various incorrect classes, enhancing its generalization ability when faced with unseen or difficult data.
Derivation of the gradient. To further analyze the optimization of the student when aligned with the teacher, we calculate the gradient of L_KL with respect to z^{(k)}, omitting τ and n for brevity. Taking the partial derivative of L_KL with respect to p(z)^{(k)} gives
∂L_KL / ∂p(z)^{(k)} = − p(v)^{(k)} / p(z)^{(k)}.   (7)
Taking the partial derivative of p(z)^{(k)} with respect to z^{(i)} gives
∂p(z)^{(k)} / ∂z^{(i)} = p(z)^{(k)} (1 − p(z)^{(k)}),  if i = k,   (8)
∂p(z)^{(k)} / ∂z^{(i)} = − p(z)^{(k)} p(z)^{(i)},  if i ≠ k.   (9)
Based on Equations (7)–(9) and the chain rule, the partial derivative of L_KL with respect to z^{(k)} can be derived as
∂L_KL / ∂z^{(k)} = p(z)^{(k)} − p(v)^{(k)}.   (10)
Achievable optimization objective. The optimization objective of the cross-entropy loss is to maximize the target class. The ideal output for the student is the one-hot format, which is challenging for a model to achieve. However, the knowledge distillation loss compels the student to produce a probability distribution identical to that of the teacher, as shown in Equation (10). With the temperature τ, the teacher's output is more reasonable and achievable for the student model. Thus, the non-target classes offer an achievable optimization objective for the student.
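The closed-form gradient p(z) − p(v) can be verified numerically with a finite-difference check (a NumPy sketch with arbitrary random logits; τ is taken as 1, matching the derivation's omission of τ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(v, z):
    """KL(p(v) || p(z)) over logits, with the temperature omitted."""
    pv, pz = softmax(v), softmax(z)
    return float(np.sum(pv * (np.log(pv) - np.log(pz))))

rng = np.random.default_rng(0)
v = rng.normal(size=5)   # teacher logits (arbitrary example values)
z = rng.normal(size=5)   # student logits

analytic = softmax(z) - softmax(v)      # the closed form: p(z) - p(v)
eps = 1e-6
numeric = np.array([
    (kl(v, z + eps * np.eye(5)[i]) - kl(v, z - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
assert np.allclose(analytic, numeric, atol=1e-5)  # central differences match
```

The agreement confirms that each class's gradient magnitude is exactly the teacher–student probability gap for that class, which is the quantity analyzed next.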

Drawbacks of KL
Based on the advantages of the non-target class listed above, we analyze the shortcomings of the original KL divergence.
Inadequate non-target class optimization. From Equation (10), the optimizing magnitude for the k-th class during distillation, |p(z)^{(k)} − p(v)^{(k)}|, is the absolute difference between the probabilities of teacher and student. However, the target class generally receives a much higher probability than the non-target classes. This discrepancy leads to stronger gradients and, consequently, more focused optimization on the target class at the expense of the non-target classes.
Statistical support. To verify the inadequate optimization of the non-target classes, we define two ratios:
r1 = p(v)^{(y_n)} / [ (1/(K−1)) Σ_{k≠y_n} p(v)^{(k)} ],   r2 = |p(z)^{(y_n)} − p(v)^{(y_n)}| / [ (1/(K−1)) Σ_{k≠y_n} |p(z)^{(k)} − p(v)^{(k)}| ],   (11)
where r1 represents the ratio of the probability of the target class to the average probability of the non-target classes, and r2 measures the ratio of the optimizing magnitude of the target class to the average optimizing magnitude of the non-target classes. As indicated in Figure 3, statistics from various teacher–student pairs consistently show that the probabilities of target classes far exceed those of non-target classes, and that the target class exhibits a greater optimizing magnitude than the non-target classes. These findings validate our analysis that models tend to prioritize target-class optimization, resulting in a disproportionate focus during training.
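The two ratios of Equation (11) can be computed directly from a pair of logit vectors. The sketch below uses hypothetical logits (not measured model outputs); the function name is ours:

```python
import numpy as np

def softmax(x, tau=4.0):
    e = np.exp(x / tau - np.max(x / tau))
    return e / e.sum()

def ratios(teacher_logits, student_logits, target):
    """r1 and r2 from Equation (11) for a single sample."""
    pv = softmax(np.asarray(teacher_logits, float))
    pz = softmax(np.asarray(student_logits, float))
    non_target = np.arange(len(pv)) != target
    # r1: target probability vs. average non-target probability
    r1 = pv[target] / pv[non_target].mean()
    # r2: target optimizing magnitude vs. average non-target magnitude
    diff = np.abs(pz - pv)
    r2 = diff[target] / diff[non_target].mean()
    return r1, r2
```

With any logits where the target class dominates, both ratios come out well above 1, mirroring the imbalance reported in Figure 3.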

Magnitude-Enhanced KL
To explicitly enhance the role of the non-target classes in optimization, we seek to introduce a non-target-class-enhanced KL divergence to increase the magnitude of probabilities of non-target classes.
Target class multi-shrinkage. Considering the suppressive effect of an excessively large target class on the non-target classes, we propose to shrink the target class in the logits before applying softmax, which globally increases the non-target class probabilities.
The shrunk teacher logits and their probabilities are
ṽ_n^{(k)} = { v_n^{(k)} − S_n, if k = y_n ; v_n^{(k)}, otherwise },   p(ṽ_n, τ)^{(k)} = exp(ṽ_n^{(k)}/τ) / Σ_{j=1}^{K} exp(ṽ_n^{(j)}/τ),
where S_n is the shrinkage for the target class of the n-th sample. To strike a balance between emphasizing the non-target classes and maintaining the discriminative power of the model for accurate classification, we introduce the base shrinkage S_{n0} as the difference between the target class and the maximum non-target class:
S_{n0} = v_n^{(y_n)} − max_{k≠y_n} v_n^{(k)}.
To further enrich the information in the soft labels, we utilize shrinkage coefficients λ_m to derive the m-th shrinkage λ_m S_{n0}, where λ_m ∈ D_λ = {λ_m | 1 ≤ m ≤ M} and M is the total number of shrinkage coefficients. With the multi-shrinkages, the scaled target class is
ṽ_{n,m}^{(y_n)} = v_n^{(y_n)} − λ_m S_{n0},
and the m-th multi-shrunk teacher logits are
ṽ_{n,m}^{(k)} = { v_n^{(y_n)} − λ_m S_{n0}, if k = y_n ; v_n^{(k)}, otherwise },   (16)
with the m-th multi-shrunk teacher probability p(ṽ_{n,m}, τ) obtained by applying the softmax of Equation (2) to ṽ_{n,m}.
Compensatory shrinkage for convergence. To ensure convergence, the same multi-shrinkages are applied to the target class of the student's logits.
The m-th multi-shrunk student logits are
z̃_{n,m}^{(k)} = { z_n^{(y_n)} − λ_m S_{n0}, if k = y_n ; z_n^{(k)}, otherwise },   (18)
with the m-th multi-shrunk student probability p(z̃_{n,m}, τ) obtained analogously. The magnitude-enhanced KL divergence is then derived as
L_MKL = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} p(ṽ_{n,m}, τ)^{(k)} [ log p(ṽ_{n,m}, τ)^{(k)} − log p(z̃_{n,m}, τ)^{(k)} ].
The improved KL divergence we devise possesses three key characteristics:
• Prominent role of the target class. For any given sample x_n, the magnitude of optimization for the student model on the target class is always the greatest, i.e., |∂L_MKL/∂z^{(y_n)}| > |∂L_MKL/∂z^{(k)}| holds for all k ≠ y_n. This ensures that sufficient attention is still given to the target class in the new KL divergence formulation, thereby enabling accurate predictions.
• Isotonicity among the non-target classes. Given indices t_1, ..., t_{K−1} that sort the teacher's original non-target probabilities such that p(v_n, τ)^{(t_1)} < ... < p(v_n, τ)^{(t_{K−1})}, the ratio p(v_n, τ)^{(t_k)} / p(ṽ_{n,m}, τ)^{(t_k)} remains constant at 1/c for every non-target class, where c is a constant determined by the softmax normalization. This implies that the ordering of the non-target probabilities p(ṽ_{n,m}, τ)^{(t_1)} < ... < p(ṽ_{n,m}, τ)^{(t_{K−1})} is preserved for the teacher model. Similarly, it can be shown that the ordering of the non-target probabilities is preserved in the student model, maintaining the underlying structure of class relationships, which is critical for meaningful learning.
• Convergence property. Assuming identical logits between the teacher and student models, i.e., v_n^{(k)} = z_n^{(k)} for all k ∈ [1, K], the shrunk logits are also identical, so L_MKL remains constant at zero, ensuring stability and rationality in the training process.
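A concrete NumPy sketch of MKL for a single sample follows. The function name and the illustrative logits are ours; the default D_λ = (1, 0.5, 0) follows the setting reported in the experiments:

```python
import numpy as np

def softmax(x, tau=4.0):
    e = np.exp(x / tau - np.max(x / tau))
    return e / e.sum()

def mkl_loss(teacher_logits, student_logits, target,
             lambdas=(1.0, 0.5, 0.0), tau=4.0):
    """Magnitude-enhanced KL (MKL) for one sample.
    The same shrinkage lambda_m * S_n0 is applied to the target-class
    logit of both teacher and student; losses are averaged over the
    M shrinkage coefficients."""
    v = np.asarray(teacher_logits, float)
    z = np.asarray(student_logits, float)
    non_target = np.arange(len(v)) != target
    s0 = v[target] - v[non_target].max()    # base shrinkage S_n0
    loss = 0.0
    for lam in lambdas:
        vt, zt = v.copy(), z.copy()
        vt[target] -= lam * s0              # shrink the teacher's target class
        zt[target] -= lam * s0              # compensatory student shrinkage
        pv, pz = softmax(vt, tau), softmax(zt, tau)
        loss += float(np.sum(pv * (np.log(pv) - np.log(pz))))
    return loss / len(lambdas)
```

Note that with identical teacher and student logits the loss is exactly zero (the convergence property), and with λ = 0 the term reduces to the vanilla KL divergence.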

Diversity Enhancement
To further enhance the influence of the non-target classes, drawing inspiration from multi-view learning, we increase the diversity of the non-target classes in the soft labels by employing specialized data augmentation techniques that generate diverse variations of the samples.

Diversity-Based Data Augmentation
Previous studies have firmly established that the key benefit of implementing data augmentation policies lies in broadening the diversity of examples [22][23][24][25].
We seek an augmentation strategy that enhances the diversity of the non-target classes based on T commonly used data transformations. To exert more precise control over the augmentation strategy, we introduce two hyper-parameters, a and b, for the occurrences and intensity of transformations, respectively, following [22].
Given the number of transformation occurrences a, each transformation is selected with probability 1/T, yielding T^a potential augmentation strategies. Additionally, b adjusts the transformation strength, which has a significant impact on the diversity of the augmentation strategy. Specifically, the intensity of each transformation ranges from 0 to 10, with 10 representing the maximum intensity, following [22].
Once a and b are specified, the augmented dataset can be represented as X̃ = DataAug(a, b). The final dataset is X ∪ X̃, where X is the original dataset.
We propose a gradient-free search method to find the data augmentation strategy with the maximum diversity of non-target classes among all candidate augmentation strategies. The candidate values for a and b are set to A = {a_1, a_2, ..., a_{n_a}} and B = {b_1, b_2, ..., b_{n_b}}, where n_a and n_b are the lengths of the two candidate sets. The search objective is
(a*, b*) = argmin_{a∈A, b∈B} Σ_{n=1}^{N} Sim(X_n, X̃_n),
where X_n and X̃_n are the original and augmented versions of the n-th sample, respectively, and Sim is the cosine similarity of the teacher's non-target-class probabilities before and after the augmentation, so minimizing it maximizes diversity.
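A minimal sketch of this gradient-free grid search is shown below. To keep it self-contained, the teacher's non-target probability vectors for each candidate (a, b) are passed in as precomputed inputs; in practice they would come from teacher forward passes on the augmented samples. All names here are ours:

```python
from itertools import product

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dda_search(non_target_orig, non_target_aug, cand_a, cand_b):
    """Exhaustive search over (a, b) minimizing mean cosine similarity
    between original and augmented non-target probability vectors.
    non_target_aug[(a, b)] holds the vectors for that candidate strategy."""
    best, best_sim = None, float("inf")
    for a, b in product(cand_a, cand_b):
        sims = [cosine(p, q)
                for p, q in zip(non_target_orig, non_target_aug[(a, b)])]
        mean_sim = sum(sims) / len(sims)
        if mean_sim < best_sim:             # lower similarity = higher diversity
            best, best_sim = (a, b), mean_sim
    return best, best_sim
```

Because every candidate is simply evaluated and compared, no gradients flow through the augmentation pipeline, matching the gradient-free nature of the search.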
It is worth mentioning that the search is conducted with the teacher model before distillation, effectively minimizing computational overhead. Moreover, although data augmentation is broadly applicable to most distillation methods, we demonstrate in Section 4.5 that our proposed augmentation approach specifically enhances the diversity of the non-target classes, thereby improving distillation performance.
NTCE-KD surpasses both logit-based and feature-based methods by a considerable margin. Without diversity enhancement, MKL also demonstrates favorable performance compared with logit-based methods and achieves comparable or even superior performance compared with feature-based methods.
Table 1. Results on the CIFAR-100 validation set. Teachers and students share the same architecture. MKL (magnitude-enhanced KL divergence): our method with magnitude enhancement only. NTCE-KD: our method with both magnitude and diversity enhancements. The best and second-best results are shown in bold and underlined, respectively.
MKL achieves state-of-the-art (SOTA) performance, surpassing feature-based methods by 0.46% and logit-based methods by 0.37%. For teacher–student pairs with different architectures, MKL attains SOTA performance in Top-5 accuracy and ranks second in Top-1 accuracy.

Ablation Study
We conduct ablation studies on the magnitude and diversity enhancements; the results are shown in Table 4. Comparing ① and ②, magnitude enhancement benefits the student models, with improvements of 1.36%, 1.66%, and 3.58%. Similarly, diversity enhancement benefits the student models, with improvements of 1.9%, 1.82%, and 3.13%. Comparing ① and ④, the magnitude and diversity enhancements have orthogonal effects on model improvement, and their combination further enhances the student models. ✓ and × indicate whether the corresponding component was adopted.

Analysis of Magnitude Enhancement
Logits and probabilities. We visualize the logits and probabilities before and after multi-shrinking the target class of the logits. We observe that the target class of the logits is more prominent than the non-target classes, as illustrated in Figure 4a, and the target class of the probabilities becomes excessively prominent after softmax, as depicted in Figure 4b, leading to insufficient optimization of the non-target classes. Our approach, MKL, shrinks the target class of the logits, yielding balanced target and non-target classes in the multi-shrunk logits and probabilities, as shown in Figure 4c,d. Moreover, with the magnitude enhancement, the entropy of the probabilities increases significantly from 0.005 to 3.435, resulting in richer soft labels, as shown in Figure 4b,d.
Target/non-target ratio. For a more thorough analysis of the target and non-target classes within soft labels, we compare the probability ratio r1 and the optimizing magnitude ratio r2 (Equation (11)) between the target class and the average non-target class before and after enhancement on the CIFAR-100 dataset. As shown in Figure 5, the probability ratio is around 40 across all teacher models, with some exceeding 100, and the optimizing magnitude ratio is over 20. After magnitude enhancement, both ratios decrease to single-digit values, allowing a more equitable optimization of the non-target classes relative to the target class.
Difference in non-target classes. We also compare the difference in non-target-class logits between the teacher and student models. As shown in Figure 6, enhancing the values of the non-target classes prioritizes their optimization, further validating the effectiveness of the magnitude enhancement.
Loss weights. Based on the experimental results presented in Table 6, β initially has a significant impact on the model's performance, peaking at β = 8.0. However, after β exceeds 4.0, the improvement in accuracy becomes marginal, indicating that the model is less sensitive to further increases in β. This underscores the robustness within a certain range and the importance of the distillation loss in effectively transferring knowledge from the teacher to the student. Conversely, the accuracy in Table 7 remains relatively stable across varying α values, indicating a lower sensitivity to this parameter. The sensitivity analysis reveals that tuning β is crucial for maximizing distillation effectiveness, as the distillation loss plays a pivotal role in knowledge transfer.
Shrinkage coefficients. D_λ is set to [1, 0.5, 0] for the best performance in our experiments. The accuracies remain close across the different settings in Table 8, indicating that our method is relatively robust to variations in the shrinkage coefficients. However, the small improvement obtained with [1, 0.5, 0] over other configurations suggests that the selection of these coefficients can have a non-negligible impact on overall performance. Therefore, further exploration of optimal shrinkage coefficients for different datasets and model architectures remains an interesting direction for future work.
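The entropy increase reported above can be reproduced in miniature. The sketch below uses hypothetical 100-class logits with a dominant target class (the specific values are ours, not measured CIFAR-100 outputs) and applies the full shrinkage λ = 1:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical 100-class logits; index 0 is the dominant target class.
logits = np.concatenate([[12.0], np.linspace(3.0, 0.0, 99)])

shrunk = logits.copy()
# Full shrinkage (lambda = 1): subtract S_n0, the gap to the largest
# non-target logit, from the target class.
shrunk[0] -= logits[0] - logits[1:].max()

h_before = entropy(softmax(logits))
h_after = entropy(softmax(shrunk))
assert h_after > h_before  # richer (higher-entropy) soft labels after shrinkage
```

The near-zero entropy before shrinkage and the much larger entropy afterward mirror the 0.005 → 3.435 shift observed in Figure 4.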

Analysis of Computational Complexity
In this section, we analyze the computational complexity of our proposed method, particularly considering the added non-target-class optimization. During the loss computation phase, our approach requires computing the probabilities and KL divergence for each shrinkage coefficient, leading to a computational complexity of O(kn), where k is the number of shrinkage coefficients and n is the number of samples.
Notably, when k is equal to 1, our approach does not introduce any additional overhead. Moreover, the predominant cost during training stems from the forward passes of the teacher and student models. Our enhancement lies primarily in the loss computation, which, compared to the forward passes, can be considered negligible. Additionally, our method does not introduce any additional overhead during the testing phase. As demonstrated in Table 9, the training time of our method is comparable to other logit-based methods, yet it achieves significantly higher accuracy.
In the realm of person ReID, which involves identifying individuals across multiple camera views, challenges like occlusions, pose variations, and clothing changes are prevalent. NTCE-KD can address these challenges effectively by emphasizing non-target classes. By augmenting the gradients of non-target classes, the model can learn features that discriminate better between similar individuals, leading to more robust representations.
Similarly, in 3D point cloud understanding tasks, such as object classification, segmentation, and detection, distinguishing between target and non-target classes is fundamental. NTCE-KD can enhance the model's ability to discern subtle differences between similar objects within point clouds. By focusing on non-target-class gradients, the model learns features that generalize well across instances.
Although the current research primarily focuses on validating NTCE-KD in image classification, its potential for person ReID and 3D point cloud understanding tasks is promising. Initial experiments in Appendix A suggest that NTCE-KD effectively leverages non-target-class information to improve model performance in these domains. However, further investigation and experimentation are necessary to fully explore its capabilities.
In conclusion, the NTCE-KD approach offers a versatile framework applicable to a broader range of tasks beyond image classification. Future research will delve into its effectiveness in person ReID and 3D point cloud understanding, with the aim of conducting comprehensive experiments to validate its efficacy in these domains.

Conclusions
In this paper, we propose a novel knowledge distillation method, termed NTCE-KD, which enhances the non-target classes from both magnitude and diversity perspectives to improve the distillation process. The NTCE-KD method exhibits significant performance improvements on the CIFAR-100 and ImageNet-1k datasets. Furthermore, through extensive analytical experiments, we validate the effectiveness of our approach. While promising, our method's reliance on a single-teacher model could limit its robustness. To address this, future work could explore multi-teacher knowledge distillation, which could provide richer knowledge to further enhance both performance and generalization. We believe this work contributes to the optimization of soft labels and logit-based distillation methods.

Figure 1. Motivation. Orange dashed box: the exponential nature of softmax results in overlooking the effect of non-target classes. Green dashed box: viewing the same sample from different perspectives provides diverse insights into category relevance.

Figure 2. The framework of our proposed NTCE-KD, with magnitude and diversity enhancements in the orange and green dashed boxes, respectively. Magnitude enhancement: multi-shrink the target class of the teacher's logits and apply the same shrinkage to the target class of the student's logits. Diversity enhancement: seek data augmentations that maximize the diversity of the samples' non-target classes.

Figure 3. Statistical support for inadequate non-target-class optimization. (a) r1: ratio of the probability of the target class to that of the non-target classes; (b) r2: ratio of the optimizing magnitude of the target class to that of the non-target classes.

Table 4. Results of the ablation study. The experiments are conducted on CIFAR-100 with three teacher–student pairs. MKL: magnitude-enhanced KL divergence. DDA: diversity-based data augmentation.

Figure 4. Comparison between logits and probabilities before and after magnitude enhancement.

Figure 5. Comparison of the probability ratio and the optimizing magnitude ratio before and after magnitude enhancement. (a) r1: ratio of the probability of the target class to that of the non-target classes; (b) r2: ratio of the optimizing magnitude of the target class to that of the non-target classes.

Table 3. Results on the ImageNet-1k validation set. MKL (magnitude-enhanced KL divergence): our method with magnitude enhancement only. NTCE-KD: our method with both magnitude and diversity enhancements. The best and second-best results are shown in bold and underlined, respectively.

Table 6. Results for different β. The experiments are conducted on CIFAR-100, with ResNet32×4 as teacher and ResNet8×4 as student. α is set to 1.0.

Table 7. Results for different α. The experiments are conducted on CIFAR-100, with ResNet32×4 as teacher and ResNet8×4 as student. β is set to 8.0.

Table 9. Training time and accuracy of different methods. The experiments are conducted on CIFAR-100, with ResNet32×4 as teacher and ResNet8×4 as student.