Article

Scale Calibration and Pressure-Driven Knowledge Distillation for Image Classification

1 The School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
2 State Key Laboratory of Intelligent Vehicle Safety Technology, Chongqing 401133, China
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(1), 177; https://doi.org/10.3390/sym18010177
Submission received: 23 December 2025 / Revised: 12 January 2026 / Accepted: 14 January 2026 / Published: 18 January 2026
(This article belongs to the Section Computer)

Abstract

Knowledge distillation achieves model compression by training a lightweight student network to mimic the output distribution of a larger teacher network. However, when the teacher becomes overconfident, its sharply peaked logits break the scale symmetry of supervision and induce high-variance gradients, leading to unstable optimization. Meanwhile, research that focuses only on final-logit alignment often fails to exploit intermediate semantic structure, which weakens the discriminative power of student representations, especially under class imbalance. To address these issues, we propose Scale Calibration and Pressure-Driven Knowledge Distillation (SPKD), a one-stage framework comprising two lightweight, complementary mechanisms. First, a dynamic scale calibration module normalizes the teacher’s logits to a consistent magnitude, reducing gradient variance. Second, an adaptive pressure-driven mechanism refines student learning by preventing feature collapse and promoting intra-class compactness and inter-class separability. Extensive experiments on CIFAR-100 and ImageNet demonstrate that SPKD outperforms distillation baselines across various teacher–student combinations; for example, SPKD reaches 74.84% top-1 accuracy on CIFAR-100 for the homogeneous VGG13–VGG8 pair. Additional evidence from logit-norm and gradient-variance statistics, as well as representation analyses, indicates that SPKD stabilizes optimization while learning more discriminative and well-structured features.

1. Introduction

Since its introduction by Hinton et al. [1], knowledge distillation has emerged as a prominent technique for model compression in deep learning, attracting considerable research interest. The fundamental principle is to train a student network to mimic the softened probability outputs of the teacher. These soft labels convey not only the predicted class but also inter-class relationships and the confidence distribution learned by the teacher. To align the student’s predictions with these informative outputs, the training objective typically minimizes the Kullback–Leibler divergence or the mean squared error (MSE) relative to the teacher’s softened predictions. This approach has proven highly effective in various domains, including object detection [2], semantic segmentation [3], and natural language understanding [4].
While logit distillation has proven highly effective in model compression, its reliance on aligning final logits limits the exploitation of rich semantic information present in intermediate feature layers. This limitation often leads to suboptimal feature representations in the student model, characterized by insufficient intra-class cohesion and inadequate inter-class discrimination, ultimately impairing generalization and robustness [5,6]. Moreover, when the disparity between teacher and student capacity becomes too large, a second, even more severe optimization challenge arises: the teacher’s overconfidence produces sharply peaked logit distributions in which the Top-1 prediction drowns out the subtler signals from similar classes. Such an outcome not only diminishes the informative value of soft labels but also introduces large gradient variance during training, leading to unstable optimization dynamics, delayed convergence, or entrapment in suboptimal solutions [7].
Existing work primarily focuses on two areas. To improve suboptimal feature representations, several methods enhance the quality of student features, such as confidence-guided distillation [8], contrastive learning approaches [9], and specialized regularization schemes [10]. To address teacher overconfidence and optimization instability, Xu et al. [11] and Hossain et al. [12] proposed feature and logit normalization, respectively; Hebbalaguppe et al. [13] adopted calibration transfer; and Giakoumoglou et al. [14] and Wu et al. [15] reinforced intrinsic relational patterns.
While prior efforts have driven progress in mitigating logit overconfidence and refining feature structures, several challenges still hinder knowledge distillation performance. Regarding optimization stability, prevailing calibration approaches rely on statistical normalization [12], but these methods do not provide dynamic, real-time feedback; consequently, high-variance gradients, especially those generated by a large teacher–student capacity gap, cannot be effectively reduced during training. Furthermore, explorations into optimizing the logit layer and refining intermediate feature structures have been conducted in isolation, overlooking a critical dependency: unstable gradients from uncalibrated logits propagate backward and disrupt the formation of structured feature embeddings. Therefore, adding feature constraints without addressing the root cause of optimization instability yields suboptimal results.
To address these challenges, we propose Scale Calibration and Pressure-Driven Knowledge Distillation (SPKD), a novel one-stage framework. SPKD employs two lightweight adaptive modules to simultaneously enhance optimization stability and strengthen feature representation. The scale calibration module dynamically normalizes the teacher logits through batch L2 normalization. From a geometric perspective, this operation projects the logits onto a hypersphere, promoting spherical symmetry in the optimization landscape. By aligning the strength of the supervision signal to the student’s capacity in real time, it mitigates amplitude variance and high-variance gradients, thereby correcting the distributional asymmetry caused by the teacher’s overconfidence. Meanwhile, the pressure-driven mechanism targets suboptimal feature representation: by adaptively penalizing feature collapse based on the student’s confidence, it improves intra-class compactness and inter-class separability in a parameter-free manner, thus exploiting otherwise neglected structured semantic information.
Our main contributions can be summarized as follows:
  • We introduce SPKD, an efficient one-stage knowledge distillation framework that jointly addresses optimization instability and feature representation collapse. This is achieved through a dynamic scale calibration module stabilizing the teacher logits signal and an adaptive pressure-driven mechanism optimizing student feature learning.
  • We empirically show that SPKD delivers consistently improved accuracy over strong distillation baselines on CIFAR-100 and ImageNet, particularly for teacher–student pairs with large capacity gaps.
  • We provide in-depth analyses, supported by visualizations, that validate the distinct roles of our two components and reveal the underlying mechanism of how SPKD reshapes the feature space to be more discriminative.
The remainder of this paper is structured as follows. Section 2 reviews recent research related to the problems addressed here. Section 3 elaborates on the core methodology, with particular emphasis on the scale calibration of teacher logits and the pressure-driven module incorporating the margin mechanism. Section 4 provides the experimental details, including dataset descriptions and configurations, and presents the key findings through comparative studies, visualizations, and hyperparameter analysis. Section 5 concludes the paper and outlines the practical implications of the proposed SPKD method.

2. Related Work

2.1. Enhancing Feature Representation in Knowledge Distillation

To mitigate representation degradation in logit-based knowledge distillation, early works like Mun et al. [16] explored data-dependent specialization strategies to handle complex distributions. Following this adaptive paradigm, Mishra [8] pioneered confidence-guided distillation techniques to curb feature collapse, while Yuan [9] bolstered embedding cohesion via intra-class contrastive methods and boundary penalties. Uikey et al. [10], on the other hand, implemented regularization schemes and progressive constraint approaches to refine feature efficacy in applications such as facial identification and image partitioning. These studies emphasize the importance of enhancing the inherent discriminative power of student models through active interventions in feature spaces. More recently, Gong et al. [17] proposed a method of aligning feature dynamics to overcome the limitations of static logit supervision. Yang et al. [18] introduced representation transfer schemes to facilitate knowledge migration in large language models.
Although these methods enhance discriminability effectively, they often neglect the instability inherent in the logit layer.

2.2. Calibration-Based Optimization and Stability Strategies

Recent studies have proposed several strategies for addressing instances of overconfidence in the teacher model. Xu et al. [11] leveraged feature normalization to reconcile distributional mismatches; LumiNet [12] incorporates statistical logit normalization to stabilize signals and reduce gradient variance; Hebbalaguppe et al. [13] facilitated calibration transfer to mitigate excessive certainty; RRD [14] and Wu et al. [15] emphasized the importance of reinforcing intrinsic relational patterns to enhance knowledge transmission. These methods validate the potential to stabilize the distillation process by adjusting logit magnitude or distribution. In recent years, studies have further investigated dynamic strategies. For example, Klaassen et al. [19] investigated robust calibration strategies to ensure stability and accuracy in complex estimation scenarios.
While the above approaches demonstrate the effectiveness of logit calibration in achieving output stability, they rarely consider how unstable gradients propagate backwards to disrupt intermediate feature formation.

2.3. Summary and Research Positioning

Despite the advancements in knowledge distillation, existing paradigms face trade-offs between optimization stability, structural fidelity, and computational efficiency. Calibration-based strategies, while effective in mitigating teacher overconfidence, generally function as passive stabilizers and lack an active mechanism to enhance the discriminative boundaries of the student. Logit decoupling or restructuring paradigms focus on refining knowledge granularity but often overlook geometric distortion in the teacher’s output space; as a result, they may inherit scale biases that mislead the student’s optimization trajectory. Furthermore, while relation-modeling approaches capture intermediate structural information well via high-order constraints, they incur significant computational overhead, making them less suitable for efficient one-stage training.
In response to these limitations, SPKD proposes a synergistic framework. We identify that restoring scale symmetry is a necessary prerequisite for advanced feature shaping: it creates a sanitized optimization landscape that allows the pressure-driven mechanism to enforce intra-class compactness. This coupling resolves the stability–discriminability paradox and achieves good structural learning with minimal computational cost.

3. Main Methods

In this section, we describe the two key logit-level mechanisms of our proposed Scale Calibration and Pressure-Driven Knowledge Distillation (SPKD) framework. First, we introduce the dynamic scale calibration module, which stabilizes the teacher’s logit output and thereby mitigates instability in the optimization process. We then present the adaptive pressure-driven mechanism, which inhibits feature collapse and enhances discriminative feature learning by leveraging structured semantic knowledge. Finally, we provide a theoretical analysis from a gradient perspective, demonstrating how the synergy of these two mechanisms addresses the challenges of unstable logits and limited feature representation.

3.1. Scale Calibration

The scale calibration module is designed to address the instability caused by significant variations in the teacher’s logit magnitudes. It achieves this by rescaling the teacher logits using a globally estimated average magnitude, denoted as $\mu$. The calculation of $\mu$ and the rescaling process are performed as follows.
The value of $\mu$ is estimated during a warm-up phase, which spans the initial $E_\mu$ epochs of training. Within this phase, for each mini-batch, we first compute the batch-averaged L2 norm of the teacher’s raw logits $z^T$:

$$l_{\text{batch}} = \frac{1}{B}\sum_{b=1}^{B}\left\| z_b^T \right\|_2$$
where $B$ is the batch size. These batch-wise average norms are collected throughout the warm-up phase. At the beginning of epoch $E_\mu$, the global magnitude $\mu$ is computed as the mean of all collected $l_{\text{batch}}$ values:

$$\mu = \frac{1}{|\mathcal{L}|}\sum_{l \in \mathcal{L}} l$$

where $\mathcal{L} = \left\{\, l_{\text{batch}}^{(e,i)} \mid e < E_\mu \,\right\}$ is the set of all batch-averaged norms from the warm-up epochs. For all subsequent training epochs ($e \geq E_\mu$), this computed $\mu$ is fixed.
The teacher logits are then dynamically rescaled using a scale factor $s$. During the warm-up phase ($e < E_\mu$), the scale factor is the current batch-averaged norm, $s = l_{\text{batch}}$. During the stable phase ($e \geq E_\mu$), the fixed global magnitude is used, $s = \mu$. The final calibrated teacher logit $\hat{z}_b^T$ is obtained by

$$\hat{z}_b^T = \frac{z_b^T}{\left\| z_b^T \right\|_2 + \epsilon} \cdot s$$

where $\epsilon$ is a small positive scalar added to prevent division by zero.
Geometrically, as shown in Figure 1, this operation projects the logits onto a hypersphere with a radius of $\mu$. This restores the scale symmetry that is often lost in standard temperature scaling (TS). TS ($z/T$) softens the distribution but performs linear scaling, which preserves the inherent radial variance of the teacher’s logits. In contrast, our module imposes a strict norm constraint for all samples. This effectively removes sample-dependent scale fluctuations while preserving directional semantic information.
To summarize, we replace the potentially high and sample-dependent confidence level of the teacher with a global constant μ (estimated early in training) that represents the average logit magnitude. This aims to provide a more stable supervision signal that is easier for the student to learn, thereby alleviating the teacher–student capacity gap. As shown in Figure 2, this global calibration helps stabilize the loss dynamics and reduces abrupt fluctuations during training.
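The calibration step is straightforward to implement. Below is a minimal PyTorch-style sketch of the procedure described above; the class name, default warm-up length, and $\epsilon$ value are illustrative assumptions rather than the authors’ released code.

```python
import torch

class ScaleCalibrator:
    """Rescales teacher logits to a stable global magnitude (illustrative sketch)."""

    def __init__(self, warmup_epochs=2, eps=1e-6):
        self.warmup_epochs = warmup_epochs   # E_mu
        self.eps = eps
        self.collected_norms = []            # batch-averaged norms from the warm-up phase
        self.mu = None                       # fixed global magnitude after warm-up

    def __call__(self, teacher_logits, epoch):
        # Batch-averaged L2 norm of the raw teacher logits (shape: [B, C])
        l_batch = teacher_logits.norm(p=2, dim=1).mean()

        if epoch < self.warmup_epochs:
            self.collected_norms.append(l_batch.item())
            s = l_batch                       # warm-up phase: use the current batch norm
        else:
            if self.mu is None:               # freeze mu at the end of warm-up
                self.mu = sum(self.collected_norms) / len(self.collected_norms)
            s = self.mu                       # stable phase: use the fixed global magnitude

        # Project each logit vector onto a hypersphere of radius s
        norms = teacher_logits.norm(p=2, dim=1, keepdim=True)
        return teacher_logits / (norms + self.eps) * s
```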

3.2. Pressure-Driven Mechanism

The SPKD framework introduces a margin-based pressure mechanism and decouples the distillation loss into target-class and non-target-class components to enhance the student’s feature learning.
To increase the learning difficulty and encourage more discriminative feature learning, we apply a margin penalty directly to the student logits. Specifically, a margin $m$ is subtracted from the logit corresponding to the ground-truth class:

$$\tilde{z}_b^S = z_b^S - m \cdot l_{\text{gt}}$$

where the true label for sample $b$ is represented as a one-hot encoding $l_{\text{gt}}$. This modified logit vector $\tilde{z}_b^S$ is used for all subsequent distillation calculations.
Figure 3 illustrates how the student margin mechanism reshapes decision boundaries geometrically. The baseline relies only on teacher calibration, so the student overfits the teacher’s high-entropy soft labels, forming nonlinear, high-curvature decision boundaries; this sensitivity to local noise undermines generalization. In addition, gradient saturation as the prediction distributions converge causes samples to cluster near the boundary, leaving insufficient discriminative power. In contrast, SPKD introduces the margin as a geometric regularization term and compels the model to seek a linear optimal hyperplane that maximizes the inter-class margin. This logit-driven gradient flow forces feature embeddings to be more compact and farther from the decision boundaries, effectively overcoming gradient vanishing at the boundaries.
The TCKD loss distills the teacher’s confidence on the true label. It is formulated as the binary divergence between the teacher’s and student’s probabilistic outputs over the target class versus all other classes. Let $p(z, t, T)$ be the softmax probability of the target class $t$ from logit vector $z$ with temperature $T$. The probabilities are

$$p_t^T = p(\hat{z}_b^T, t_b, T), \qquad p_t^S = p(\tilde{z}_b^S, t_b, T)$$
The TCKD loss is then

$$\mathcal{L}_{\text{TCKD}} = T^2 \cdot \frac{1}{B}\sum_{b=1}^{B} \mathrm{KL}\big(p_t^T \,\|\, p_t^S\big)$$
The NCKD loss distills the relational information among the non-target classes. This is achieved by computing the KL divergence over a re-normalized probability distribution that only includes non-target logits. Let $z_{\setminus t}$ denote the logit vector excluding the element for the target class $t$. The loss is defined as

$$\mathcal{L}_{\text{NCKD}} = T^2 \cdot \mathrm{KL}\left(\mathrm{softmax}\!\left(\frac{\hat{z}_{\setminus t, b}^T}{T}\right) \,\Big\|\, \mathrm{softmax}\!\left(\frac{\tilde{z}_{\setminus t, b}^S}{T}\right)\right)$$
This formulation is mathematically equivalent to masking the target logit with a large negative value before applying the softmax function, as implemented in our code.
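As a concrete illustration of the two decoupled terms, the following PyTorch-style sketch implements the binary target-class divergence and the masking trick for the non-target term. The function names, the default temperature, and the choice of mask value are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def tckd_loss(student_logits, teacher_logits, target, T=4.0):
    """Binary KL between target-vs-rest probabilities of teacher and student (sketch)."""
    def binary_probs(logits):
        p = F.softmax(logits / T, dim=1)
        p_t = p.gather(1, target.unsqueeze(1))        # probability of the true class
        return torch.cat([p_t, 1.0 - p_t], dim=1)     # [p_target, p_non-target]
    p_teacher = binary_probs(teacher_logits)
    p_student = binary_probs(student_logits)
    return F.kl_div(p_student.log(), p_teacher, reduction="batchmean") * (T ** 2)

def nckd_loss(student_logits, teacher_logits, target, T=4.0):
    """KL divergence over non-target classes via masking of the target logit (sketch)."""
    mask = F.one_hot(target, num_classes=student_logits.size(1)).bool()
    # A very negative value removes the target class from the softmax, which is
    # equivalent to re-normalizing over the non-target classes only.
    s = student_logits.masked_fill(mask, -1e9)
    t = teacher_logits.masked_fill(mask, -1e9)
    log_q = F.log_softmax(s / T, dim=1)
    p = F.softmax(t / T, dim=1)
    return F.kl_div(log_q, p, reduction="batchmean") * (T ** 2)
```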
The final SPKD loss function is formulated as a linear combination of the two constituent losses:
$$\mathcal{L}_{\text{SPKD}} = \alpha \, \mathcal{L}_{\text{TCKD}} + \beta \, \mathcal{L}_{\text{NCKD}}$$
where $\alpha$ and $\beta$ are weighting coefficients, set to 1 and 8, respectively, following [20].
The final training objective is
$$\mathcal{L} = \lambda_{\text{CE}} \cdot \mathcal{L}_{\text{CE}} + w(e) \cdot \mathcal{L}_{\text{SPKD}}$$
where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss, $\lambda_{\text{CE}}$ is its weight, and $w(e) = \min(e/\text{warmup}, 1)$ is the warm-up weight of the distillation loss, with $e$ denoting the current epoch.
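Putting the pieces together, a hedged sketch of the overall objective is shown below, reusing the tckd_loss and nckd_loss sketches above. The warm-up length and the choice to compute cross-entropy on the unmodified student logits are assumptions, since the text does not fix them.

```python
import torch.nn.functional as F

def spkd_objective(student_logits, calibrated_teacher_logits, target, epoch,
                   T=4.0, margin=0.5, alpha=1.0, beta=8.0,
                   lambda_ce=1.0, warmup_epochs=20):
    # Pressure-driven margin: subtract m from the student's ground-truth logit only.
    one_hot = F.one_hot(target, num_classes=student_logits.size(1)).float()
    student_margin_logits = student_logits - margin * one_hot

    # Decoupled distillation on calibrated teacher and margin-adjusted student logits
    # (tckd_loss / nckd_loss are the sketches given earlier).
    l_spkd = (alpha * tckd_loss(student_margin_logits, calibrated_teacher_logits, target, T)
              + beta * nckd_loss(student_margin_logits, calibrated_teacher_logits, target, T))

    # Warm-up weight for the distillation term; warmup_epochs is an assumed value.
    w = min(epoch / warmup_epochs, 1.0)

    # Cross-entropy on the raw student logits (assumed; the text does not specify
    # whether CE uses the margin-adjusted logits).
    return lambda_ce * F.cross_entropy(student_logits, target) + w * l_spkd
```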
This design enables the lightweight model to align with the source model’s target category probabilities as well as to acquire enhanced discriminative capabilities across the non-target class distributions. Furthermore, the pressure-driven mechanism implicitly applies a “pressure” to the target categories, guiding the student towards optimizing the feature representations of hard-to-discriminate samples. This process mitigates feature collapse and improves the student’s representational capacity and generalization performance.
The two aforementioned mechanisms operate synergistically within the SPKD architecture, as illustrated in Figure 4. In SPKD, we first calibrate the teacher logits by L2-normalizing $z^T$ and rescaling to a stable magnitude ($s = l_{\text{batch}}$ during warm-up and $s = \mu$ afterwards), producing $\hat{z}^T$. This operation preserves directional information while reducing magnitude-induced gradient variance, thereby providing a more predictable supervision signal.
Concurrently, we apply a margin-based pressure to the student logits to obtain $\tilde{z}^S = z^S - m \cdot l_{\text{gt}}$, which amplifies the learning signal for the target class and encourages more discriminative representations. Finally, distillation is decoupled into target-class and non-target-class components (TCKD and NCKD), and their weighted combination constitutes $\mathcal{L}_{\text{SPKD}}$.

3.3. Gradient Intuitive Analysis

To intuitively understand how SPKD works, we analyze the gradients it imparts to the student’s logits $z^S$. The gradient of a standard temperature-smoothed KL divergence loss $\mathcal{L}_{KD} = T^2\,\mathrm{KL}(p \,\|\, q)$ with respect to the student’s logits is given by

$$\frac{\partial\, T^2\,\mathrm{KL}(p \,\|\, q)}{\partial z^S} = T\,(q - p),$$

where $p = \mathrm{softmax}(z^T/T)$ and $q = \mathrm{softmax}(z^S/T)$. Our SPKD framework modifies this gradient dynamic in three key ways.
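This identity is the standard softmax-KL gradient; a brief derivation (not specific to SPKD) may help. Since $\partial \log q_k / \partial z_j^S = (\delta_{kj} - q_j)/T$ and $p$ does not depend on $z^S$,

$$\frac{\partial}{\partial z_j^S}\, T^2 \sum_k p_k \log \frac{p_k}{q_k} = -T^2 \sum_k p_k \frac{\delta_{kj} - q_j}{T} = T\Big(q_j \sum_k p_k - p_j\Big) = T\,(q_j - p_j),$$

which, collected over all classes $j$, gives the vector form $T(q - p)$.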

3.3.1. Stabilizing the Gradient Direction via Scale Calibration

Large-capacity teachers possess stronger fitting capabilities, allowing them to minimize the cross-entropy loss aggressively during training. To increase the predicted probability of the ground-truth class ($p_{gt}$), the model typically pushes the target logit ($z_{gt}$) to much larger values than the other logits, resulting in a highly peaked distribution. Consequently, teachers with larger capacity gaps tend to generate logits with significantly higher magnitudes than smaller models.
When a large capacity gap exists, the teacher’s raw logits $z^T$ can have excessively large magnitudes, leading to an extremely sharp (low-entropy) probability distribution $p$ in which one class probability is close to 1. This makes the gradient term $T(q - p)$ highly sensitive to small changes in $q$, causing the gradient direction to oscillate drastically during training and leading to optimization instability. Our scale calibration module addresses this by normalizing and rescaling the teacher’s logits to a stable, moderate magnitude $\mu$. This produces a smoother, softer distribution $p = \mathrm{softmax}(\hat{z}^T/T)$, which reduces the variance of the gradient vector $(q - p)$ and provides a more stable and robust update signal for the student. More importantly, we can show why this reduces variance compared to temperature scaling. The gradient magnitude of the distillation loss is proportional to the logit norm of the teacher. In standard TS, the gradient variance is dominated by the variance of the teacher’s original logit norms, which remains high due to the teacher’s varying confidence. By calibrating the logits to a fixed magnitude $\mu$, SPKD decouples the gradient scale from the teacher’s confidence fluctuations. Since the radial component is now constant ($\|\hat{z}^T\|_2 = \mu$), the gradient variance is strictly bounded and driven only by the meaningful directional differences $(q - p)$ rather than by scale noise. This provides a consistent and robust update signal for the student.

3.3.2. Amplifying the Target-Class Gradient via Pressure-Based Mechanism

While scale calibration stabilizes the gradient variance by smoothing $p$, it alone does not guarantee discriminative feature learning. This is where the synergy occurs. If we were to apply the margin penalty (pressure) directly to uncalibrated logits, the high-variance term $(q - p)$ in the gradient expression above would be amplified, potentially destabilizing the training.
However, because our scale calibration has already constrained the gradient direction to a stable range, we can now safely introduce the pressure-driven mechanism. By modifying the student’s logits to $\tilde{z}^S$, we amplify the gradient magnitude for the target class to enhance discrimination, knowing that the gradient direction is already rectified. This allows the student to “work harder” in the right direction.
Our pressure-driven mechanism modifies the student’s logits to $\tilde{z}^S = z^S - m \cdot l_{\text{gt}}$. Let us analyze its effect on the target-class logit $z_t^S$, where $t$ is the ground-truth class. The gradient component for the target class in the TCKD loss $\mathcal{L}_{\text{TCKD}}$ is primarily driven by $T(q_t - p_t)$, where $q_t = \mathrm{softmax}(\tilde{z}^S/T)_t$. Because a positive margin $m$ is subtracted, $z_t^S$ must become significantly larger than its non-margin counterpart to achieve the same probability $q_t$. This effectively creates a sustained gradient signal for samples that are correctly classified but fall within the margin. From a theoretical perspective, since the logit is the inner product of the feature and the class weight ($z_t = w_t^T f$), minimizing the loss with the penalty term $m$ compels the model to maximize the projection of the feature $f$ onto the target class weight vector $w_t$. This explicitly minimizes the angular distance between the feature and the class prototype, shifting the decision boundary away from the data density and enforcing tighter intra-class compactness in the feature space.

3.3.3. Preserving Relational Structure via Decoupled Distillation

Finally, by decoupling the loss into $\mathcal{L}_{\text{TCKD}}$ and $\mathcal{L}_{\text{NCKD}}$, SPKD ensures that the student receives dedicated supervision for both target and non-target classes. The $\mathcal{L}_{\text{NCKD}}$ term, in particular, forces the student to align its inter-class relationships with those of the teacher. This prevents the student from finding trivial solutions, such as merely suppressing non-target logits indiscriminately, and ensures that it learns a well-structured and meaningful feature space, thereby preventing feature collapse.

4. Experiments and Discussion

4.1. Dataset and Experimental Setup

Our experimental evaluation is conducted on two standard image classification benchmarks: CIFAR-100 [21] and ImageNet [22]. Unless otherwise stated, all reported results are averaged over three independent runs to ensure validity. Bolded values in the tables indicate the best result in each setting.
The initial evaluation focuses on the CIFAR-100 dataset, which contains 60,000 color images, each with a resolution of 32 × 32 pixels, distributed across 100 object categories. The standard data split allocates 50,000 images for training and 10,000 for testing. During training, data augmentation techniques are applied, including random cropping with a 4-pixel padding and random horizontal flips to improve model generalization. The models are trained for 240 epochs using a batch size of 64. The optimization employs Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay factor of 5 × 10⁻⁴. The initial learning rate is set to 0.1 for most architectures, while models like MobileNet and ShuffleNet use a slightly lower rate of 0.05. A step decay strategy reduces the learning rate by a factor of 0.1 at the 150th, 180th, and 210th epochs.
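For reproducibility, the reported CIFAR-100 recipe maps directly onto a standard SGD setup; the snippet below is a minimal sketch with a placeholder model and an omitted training-loop body, not the authors’ training script.

```python
import torch
import torch.nn as nn

# Placeholder for the actual student network (e.g., VGG8, ResNet8x4, MobileNetV2)
student = nn.Linear(3 * 32 * 32, 100)

optimizer = torch.optim.SGD(student.parameters(), lr=0.1,   # 0.05 for MobileNet/ShuffleNet students
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 180, 210], gamma=0.1)        # step decay by a factor of 0.1

for epoch in range(240):
    # ... one epoch over CIFAR-100 with batch size 64, random crop + horizontal flip ...
    scheduler.step()
```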
ImageNet comprises 1.28 million images for training and 50,000 for validation, distributed across 1000 object categories. During training, data augmentation is applied following standard practices: images are randomly resized and cropped to 224 × 224 pixels with random horizontal flips, while validation uses a central crop. The models are trained for 100 epochs using stochastic gradient descent (SGD), with momentum set to 0.9 and weight decay coefficient at 1 × 10⁻⁴. For a batch size of 512, the initial learning rate is 0.2, which is subsequently reduced by a factor of 0.1 after the 30th, 60th, and 90th epochs. All experiments were executed on NVIDIA GeForce RTX 4090 graphics processing units.

4.2. Main Results

4.2.1. Efficiency Metrics

Our method imposes no additional parameters or inference costs, as shown in Table 1, since the teacher models are only used during training. Latency benchmarks were performed on a GPU with TensorRT (settings: batch size of 64, 20 warm-up rounds, and an average of three runs including data transfer). It is worth noting that the higher latency observed in some lightweight student models, despite their low FLOPs, stems from their architectural properties, rather than from our distillation process. Overall, our approach improves performance while maintaining the original model size and inference speed.

4.2.2. Stabilized and Targeted Gradient Modulation Analysis

To empirically validate the proposed gradient modulation mechanism of SPKD under class imbalance, we analyze its effects from two complementary perspectives: logit scale behavior and gradient dynamics during training.
Figure 5 illustrates how the student logit L2 norm fluctuates and is distributed across various gradient modulation strategies.
We can see that the margin-only strategy consistently suppresses the logit magnitude, resulting in an overall collapse in scale during training. By contrast, the μ-only strategy preserves the logit scale but exhibits a relatively dispersed distribution, indicating that calibration is insufficient across samples.
However, SPKD maintains a stable and moderately high logit scale throughout training and produces a more concentrated norm distribution. This behavior is consistent with our analysis in Section 3.3, in which μ-based calibration stabilizes the gradient magnitude and the pressure-based margin selectively amplifies informative discrepancies without globally shrinking the logit space.
Figure 6 shows how gradient variance evolves and is distributed during training.
The margin-only method exhibits rapid decay in gradient variance, which means that the gradients collapse early in training and that the learning signals for underrepresented classes are insufficient. The μ-only method initially has a higher variance but suffers increased fluctuations in later stages, indicating unstable optimization dynamics.
SPKD, by contrast, preserves a relatively high yet stable gradient variance throughout training, reflecting a balanced combination of gradient stabilization and targeted amplification. This observation supports our theoretical analysis: SPKD helps avoid both gradient vanishing and excessive oscillation, and can thus sustain effective optimization signals in the face of imbalanced data.
These results form a coherent chain of evidence suggesting that SPKD jointly leverages gradient stabilization and targeted amplification. This helps avoid both scale collapse and optimization instability.

4.2.3. Qualitative Analysis of Model Behavior

To qualitatively validate our motivation, we analyze the behavior of student models trained with different distillation methods.
We first examine the predictive confidence. As shown in Figure 7, compared to DKD, the distribution curve for SPKD has a higher peak and is located further to the right, indicating that the SPKD-trained student mimics the teacher’s high-certainty prediction behavior more effectively. Yet, unlike traditional KD, it does not replicate the teacher’s overconfidence almost exactly. This intermediate position between DKD and the teacher demonstrates that SPKD makes critical progress in bridging the teacher–student “capacity gap”: the student learns not only the relative relationships in the knowledge but also the confidence pattern of the teacher’s judgments, allowing the small-capacity student to inherit more of the essence of the large-capacity teacher.
We further visualize the learned feature representations using t-SNE. The resulting plots in Figure 8 show that SPKD produces more compact intra-class clusters and larger inter-class margins than both KD and DKD, suggesting superior feature discriminability.
Furthermore, we use three standard metrics, the Silhouette score, the Davies–Bouldin Index (DBI), and the intra-/inter-class distance ratio, to evaluate the penultimate-layer feature clustering.
As shown in Figure 9, SPKD attains higher Silhouette scores and lower DBI and distance-ratio values, consistent with tighter intra-class compactness and larger inter-class margins.
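For reference, these clustering metrics can be computed from penultimate-layer features as in the sketch below; the centroid-based definition of the intra-/inter-class distance ratio is an assumption, as the exact formulation used in Figure 9 is not specified.

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

def clustering_metrics(features, labels):
    """Silhouette score, Davies-Bouldin index, and a centroid-based
    intra-/inter-class distance ratio for penultimate-layer features (sketch)."""
    sil = silhouette_score(features, labels)
    dbi = davies_bouldin_score(features, labels)

    # Intra-class: mean distance of samples to their class centroid.
    # Inter-class: mean pairwise distance between class centroids.
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
                     for i, c in enumerate(classes)])
    inter = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                     for i in range(len(classes))
                     for j in range(len(classes)) if i != j])
    return sil, dbi, intra / inter
```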
To assess how well the student mimics the teacher’s relational knowledge, we compute the difference in their logit correlation matrices. As depicted in Figure 10, the difference matrix for SPKD is visibly lighter, suggesting a higher correlation between the teacher’s and student’s logit structures and thus a better mitigation of feature collapse.
To complement the qualitative comparison in Figure 10, we report correlation–difference metrics in Figure 11. It quantifies how closely the student matches the teacher’s logit correlation structure.
The results show that SPKD yields smaller correlation differences, indicating that inter-class relational knowledge is better preserved after scale calibration.
Because of SPKD’s distinctive refinement mechanism, the student model achieves generally higher and more uniform feature similarity across all network layers. Notably, even in the bottleneck layer, where feature dimensionality is drastically reduced, SPKD preserves strong representational alignment, alleviating feature collapse in intermediate stages. As illustrated in Figure 12, SPKD improves the final output distributions by better capturing and transferring the teacher model’s knowledge within the internal architecture.
As shown in Figure 13, we quantify teacher–student representational alignment using Centered Kernel Alignment (CKA) similarity. SPKD achieves higher CKA similarity than KD and DKD across most of the evaluated settings, indicating improved feature-level alignment.
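For clarity, CKA here can be taken as the standard linear CKA between teacher and student feature matrices; the following is a minimal sketch of that metric, not the authors’ evaluation code.

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape (n_samples, dim) (sketch)."""
    X = X - X.mean(dim=0, keepdim=True)       # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.t() @ X).norm(p="fro") ** 2     # ||Y^T X||_F^2
    return (hsic / ((X.t() @ X).norm(p="fro") * (Y.t() @ Y).norm(p="fro"))).item()
```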

4.2.4. Analysis of Robustness to Temperature

We investigate the model’s sensitivity to the temperature hyperparameter T. As shown in Figure 14 and Table 2, SPKD not only achieves the highest peak accuracy (74.99% at T = 4) but also exhibits greater stability across a wide range of temperatures compared to KD and DKD, showcasing its enhanced robustness.

4.2.5. Ablation Study

We conducted ablation experiments on CIFAR-100 to validate the effectiveness of the two components in SPKD: scale calibration ( μ ) and the pressure-driven margin. Table 3 and Table 4 summarize results on a homogeneous pair (ResNet50–ResNet18) and a heterogeneous pair (ResNet50–MobileNetV2). In both settings, introducing μ yields consistent and substantial gains over the DKD baseline, confirming its role in stabilizing distillation. However, the margin term alone can be less beneficial when the teacher–student gap is small (Table 3), but it becomes a stronger regularizer in the heterogeneous setting (Table 4). Combining both components achieves the best overall performance, indicating that stable calibration and targeted pressure are complementary.

4.2.6. Hyperparameter Analysis

We analyze the sensitivity of SPKD to its key hyperparameters $E_\mu$ and the margin $m$.
Effect of $E_\mu$. Table 5 shows the performance with different numbers of warm-up epochs used to compute $\mu$. Setting $E_\mu = 2$ yields the best performance, indicating that a brief warm-up phase is sufficient to obtain a stable global magnitude estimate. It is worth noting that while the warm-up duration ($E_\mu$) is fixed, the computed value of $\mu$ adapts to different teacher architectures; this adaptivity allows SPKD to generalize well across diverse teacher–student pairs (as shown in Table 6).
Effect of margin m. As shown in Table 6, the model achieves the highest accuracy when m = 0.5 . This confirms that a moderate margin effectively encourages discriminative feature learning without overly penalizing the model.

4.2.7. Benchmarking Against Contemporary Methods

Table 7 presents the comparison of SPKD with other leading KD methods on CIFAR-100 across six diverse teacher–student pairs. SPKD achieves competitive performance and attains the best result in several configurations, demonstrating its strong generalization ability. For the challenging ResNet32x4–ResNet8x4 pair, SPKD achieves 76.43% accuracy, significantly outperforming the strong baseline DKD (75.51%) and other methods like CRD (72.72%). The improvements are particularly notable for heterogeneous pairs, such as VGG13–MobileNetV2.
According to standard protocols [29], we construct long-tailed versions of CIFAR-100 by reducing the number of training samples per class according to an exponential decay function. Specifically, the number of samples for class $k$, denoted as $N_k$, is calculated as

$$N_k = N_{\max} \times \rho^{\frac{k}{C-1}}$$

where $N_{\max}$ is the original number of samples per class (500 for CIFAR-100), $C$ is the total number of classes, and $\rho$ denotes the ratio of the minimum class size to the maximum class size ($N_{\min}/N_{\max}$). We evaluate our method under two specific settings: $\rho = 1.0$ and $\rho = 0.01$. Here, $\rho = 1.0$ represents the standard balanced distribution, while $\rho = 0.01$ simulates a severe long-tailed distribution with an imbalance factor (IF) of 100.
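The per-class sample counts implied by this equation can be generated as in the sketch below; flooring the counts to integers is an assumption, since the rounding rule is not stated.

```python
import numpy as np

def long_tailed_counts(n_max=500, num_classes=100, rho=0.01):
    """Per-class training-sample counts under the exponential decay above (sketch)."""
    k = np.arange(num_classes)
    return np.floor(n_max * rho ** (k / (num_classes - 1))).astype(int)

counts = long_tailed_counts()
# counts[0] == 500 for the head class and counts[-1] == 5 for the tail class when rho = 0.01
```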
Following the imbalance setup and subgroup definitions above, we choose ρ = 0.01 to further report Head, Medium, and Tail performance on long-tailed CIFAR-100 to assess generalization across frequency strata. As shown in Table 8 and Figure 15, SPKD improves performance on Tail categories across several teacher–student pairs, while maintaining competitive overall accuracy. Notably, the Tail Gain values (Tail(method) − Tail(KD)) directly quantify minority-class improvements relative to the vanilla KD baseline.
In the VGG13–VGG8 pair, DKD shows a clear performance drop in the Tail group compared to vanilla KD (−2.11%). In contrast, SPKD mitigates this degradation, improving Tail accuracy to 73.92% (+1.37% over DKD) while achieving a higher overall accuracy of 76.14%. This suggests that our scale-calibrated, pressure-driven distillation better retains minority-class features that are often overlooked by purely logit-alignment methods.
A more striking improvement is observed in the ResNet50–MobileNetV2 pair, where SPKD improves the Tail group accuracy by 7.85% over vanilla KD and 5.40% over DKD. This substantial gain suggests that SPKD is particularly effective in low-resource scenarios.
We further validate SPKD on the large-scale ImageNet dataset. As shown in Table 9 and Table 10, SPKD achieves competitive performance against prior methods. For the ResNet34–ResNet18 setup, SPKD achieves a Top-1 accuracy of 71.90%, surpassing DKD by a margin of 0.2%. For the ResNet50–MobileNetV1 pair, SPKD (72.40%) trails slightly behind ReviewKD (72.56%) in Top-1 accuracy but achieves the best Top-5 accuracy (91.35%). This demonstrates the competitive performance of our method on this challenging benchmark.

5. Conclusions

In this work, we propose Scale Calibration and Pressure-Driven Knowledge Distillation (SPKD), a lightweight one-stage framework that improves the stability and effectiveness of logit-based distillation. SPKD consists of two complementary components. First, dynamic scale calibration globally normalizes and rescales the teacher logits to mitigate magnitude-induced instability and reduce gradient variance, as supported by our gradient-level analysis. Second, a margin-based pressure mechanism adaptively increases the learning difficulty for the target class based on prediction confidence, encouraging the student to learn more discriminative representations. Qualitative analyses further suggest that SPKD leads to tighter intra-class clustering and clearer inter-class separation.
Experiments on CIFAR-100 and ImageNet demonstrate that SPKD achieves consistently strong performance across both homogeneous and heterogeneous teacher–student pairs, particularly under large capacity gaps. On CIFAR-100, SPKD improves ResNet8x4 to 76.43% Top-1 accuracy under a ResNet32x4 teacher. On ImageNet, SPKD achieves 72.40% Top-1 accuracy on the challenging ResNet50-MobileNetV1 pair and attains the best Top-5 accuracy (91.35%) among compared methods in our evaluation.
Our research enables lightweight student models to maintain high accuracy while significantly reducing computational, storage, and energy requirements. As the teacher is only used during training, and the student model retains its original architecture upon deployment, SPKD is compatible with standard inference toolchains and is ideal for practical applications such as on-device mobile classification, in-vehicle/roadside edge vision, and real-time industrial camera inspection.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, J.X.; validation, J.X., H.L., P.G. and C.T.; formal analysis, J.X.; investigation, J.X. and P.G.; resources, Y.L.; data curation, L.W.; writing–original draft preparation, J.X.; writing–review and editing, Y.L.; visualization, H.L. and P.G.; supervision, Y.L.; project administration, H.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Science and Technology Innovation Key R&D Program of Chongqing, grant number CSTB2024TIAD-STX0012.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

During the preparation of this work the authors used ChatGPT in order to improve language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

Authors P.G., C.T. and L.W. were employed by State Key Laboratory of Intelligent Vehicle Safety Technology. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  3. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  4. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  5. Mirzadeh, S.I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. Proc. AAAI Conf. Artif. Intell. 2020, 34, 5191–5198. [Google Scholar] [CrossRef]
  6. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3962–3971. [Google Scholar]
  7. Cho, Y.; Saul, L.K. Kernel Methods for Deep Learning. Adv. Neural Inf. Process. Syst. 2009, 22, 342–350. [Google Scholar]
  8. Mishra, S.; Sundaram, S. Confidence Conditioned Knowledge Distillation. arXiv 2021, arXiv:2107.06993. [Google Scholar] [CrossRef]
  9. Yuan, H.; Xu, N.; Geng, X.; Rui, Y. Enriching Knowledge Distillation with Intra-Class Contrastive Learning. arXiv 2025, arXiv:2509.22053. [Google Scholar] [CrossRef]
  10. Mishra, D.; Uikey, R. Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition. arXiv 2025, arXiv:2508.11376. [Google Scholar] [CrossRef]
  11. Xu, K.; Rui, L.; Li, Y.; Gu, L. Feature Normalized Knowledge Distillation for Image Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  12. Hossain, M.I.; Elahi, M.M.L.; Ramasinghe, S.; Cheraghian, A.; Rahman, F.; Mohammed, N.; Rahman, S. LumiNet: Perception-Driven Knowledge Distillation via Statistical Logit Calibration. Trans. Mach. Learn. Res. 2025, Article 3rU1lp9w2l. Available online: https://openreview.net/forum?id=3rU1lp9w2l (accessed on 22 December 2025).
  13. Hebbalaguppe, R.; Baranwal, M.; Anand, K.; Arora, C. Calibration Transfer via Knowledge Distillation. In Proceedings of the Asian Conference on Computer Vision (ACCV 2024), Hanoi, Vietnam, 8–12 December 2024; pp. 513–530. [Google Scholar]
  14. Giakoumoglou, N.; Stathaki, T. Relational Representation Distillation. arXiv 2024, arXiv:2407.12073. [Google Scholar] [CrossRef]
  15. Wu, H.; Xiao, L.; Zhang, X.; Miao, Y. Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures. arXiv 2024, arXiv:2405.18524. [Google Scholar] [CrossRef]
  16. Mun, J.; Lee, K.; Shin, J.; Han, B. Learning to Specialize with Knowledge Distillation for Visual Question Answering. Adv. Neural Inf. Process. Syst. 2018, 31, 8081–8091. [Google Scholar]
  17. Gong, G.; Wang, J.; Xu, J.; Xiang, D.; Zhang, Z.; Shen, L.; Zhang, Y.; Shu, J.; Xing, Z.; Chen, Z.; et al. Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 23067–23077. [Google Scholar]
  18. Yang, J.; Song, J.; Han, X.; Bi, Z.; Wang, T.; Liang, C.X.; Song, X.; Zhang, Y.; Niu, Q.; Peng, B.; et al. Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models. arXiv 2025, arXiv:2504.13825. [Google Scholar] [CrossRef]
  19. Klaassen, S.; Rabenseifner, J.; Kueck, J.; Bach, P. Calibration Strategies for Robust Causal Estimation: Theoretical and Empirical Insights on Propensity Score-Based Estimators. arXiv 2025, arXiv:2503.17290. [Google Scholar]
  20. Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11943–11952. [Google Scholar]
  21. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  22. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  23. Zagoruyko, S.; Komodakis, N. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  24. Ahn, S.; Hu, S.X.; Damianou, A.; Lawrence, N.D.; Dai, Z. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9155–9163. [Google Scholar]
  25. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for Thin Deep Nets. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  26. Passalis, N.; Tefas, A. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 268–284. [Google Scholar]
  27. Liu, X.; Li, L.; Li, C.; Yao, A. Norm: Knowledge distillation via n-to-one representation matching. arXiv 2023, arXiv:2305.13803. [Google Scholar] [CrossRef]
  28. Wei, Y.; Bai, Y. Dynamic temperature knowledge distillation. arXiv 2024, arXiv:2404.12711. [Google Scholar] [CrossRef]
  29. Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. arXiv 2019, arXiv:1910.09217. [Google Scholar]
  30. Heo, B.; Kim, J.; Yun, S.; Park, H.; Kwak, N.; Choi, J.Y. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1921–1930. [Google Scholar]
  31. Tian, Y.; Krishnan, D.; Isola, P. Contrastive representation distillation. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia (Virtual), 26–30 April 2020. [Google Scholar]
  32. Chen, P.; Liu, S.; Zhao, H.; Jia, J. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5006–5015. [Google Scholar]
Figure 1. Visualization of scale calibration: projecting teacher logits onto a hypersphere via L2 normalization and rescaling.
Figure 2. Loss function dynamic analysis. Our experimental setup on CIFAR-100 involves a teacher–student pair composed of VGG13 and the more lightweight VGG8.
Figure 3. Geometric illustration of the student margin mechanism (baseline vs. SPKD).
Figure 4. SPKD calibrates teacher logits via L2 normalization and rescaling to a stable magnitude, applies margin-based pressure to student logits, then decouples distillation into TCKD (target class) and NCKD (non-target classes).
Figure 5. Student logit L2 norm distribution under different gradient modulation strategies.
Figure 6. Student gradient variance distribution under different gradient modulation strategies.
Figure 7. SPKD effectively balances learning teacher’s confidence without replicating overconfidence.
Figure 8. Comparison of t-SNE plots of learned features under different methods. (a) Feature distribution learned using the classical KD method. (b) Feature distribution learned using the DKD method. (c) Feature distribution learned using our proposed SPKD method, showing better-defined clusters.
Figure 9. Quantitative comparison of feature clustering metrics across various architectures.
Figure 10. Differences in student and teacher logit correlation matrices. (a) The difference matrix under the KD method. (b) The difference matrix under the DKD method, showing smaller differences than KD. (c) The difference matrix under our SPKD method, which achieves the smallest differences among the three.
Figure 11. Teacher–student logit-alignment quantified by correlation–difference metrics.
Figure 12. CKA similarity cross-layer comparison (Teacher vs. Student). This figure illustrates the CKA similarity between the features of each layer of the student model and the corresponding layers of the teacher model under different knowledge distillation methods. (a) CKA similarity under the KD method. (b) CKA similarity under the DKD method. (c) CKA similarity under the SPKD method.
Figure 13. Teacher–student alignment quantified by CKA-based similarity metrics.
Figure 14. Assessment of temperature scaling robustness across different distillation pairs. (a) Robustness analysis on the VGG13–VGG8 pair. (b) Robustness analysis on the ResNet50–MobileNetV2 pair. (c) Robustness analysis on the ResNet32x4–ShuffleV1 pair.
Figure 15. Comparison of Top-1 accuracy across Head, Medium, and Tail subgroups to evaluate generalization on imbalanced data. (a) VGG13–VGG8 pair. (b) ResNet32x4–ShuffleV1 pair. (c) ResNet50–MobileNetV2 pair.
Table 1. A comparison of the parameter quantity and inference speed of different teacher–student models.
Model | Parameters (MB) | FLOPs (GFLOPs) | Inference Time (ms)
VGG13 | 66.42 | 0.572 | 1.882 ± 0.10
VGG8 | 15.15 | 0.193 | 1.303 ± 0.18
ResNet50 | 97.16 | 2.624 | 8.075 ± 0.43
MobileNetV2 | 3.79 | 0.015 | 4.661 ± 0.10
ResNet32x4 | 35.93 | 2.176 | 4.892 ± 0.03
ShuffleNetV1 | 3.79 | 0.084 | 5.725 ± 0.05
Table 2. Temperature robustness analysis across different teacher–student pairs. Accuracy (%) is reported for various temperature settings (T).
Teacher–Student Pairs | Temperature (T) | KD | DKD | SPKD
VGG13–VGG8 | 1 | 71.31 | 73.67 | 73.65
VGG13–VGG8 | 2 | 71.95 | 74.66 | 74.85
VGG13–VGG8 | 4 | 73.74 | 74.73 | 74.99
VGG13–VGG8 | 8 | 73.67 | 74.33 | 74.75
VGG13–VGG8 | 16 | 74.09 | 74.67 | 74.80
ResNet50–MobileNetV2 | 1 | 65.40 | 68.12 | 68.55
ResNet50–MobileNetV2 | 2 | 67.56 | 69.89 | 70.52
ResNet50–MobileNetV2 | 4 | 68.32 | 70.11 | 70.83
ResNet50–MobileNetV2 | 8 | 69.22 | 70.40 | 70.70
ResNet50–MobileNetV2 | 16 | 69.36 | 70.69 | 70.97
Res32x4–ShuffleNetV1 | 1 | 71.99 | 76.14 | 76.30
Res32x4–ShuffleNetV1 | 2 | 73.73 | 76.35 | 76.48
Res32x4–ShuffleNetV1 | 4 | 74.49 | 76.40 | 76.79
Res32x4–ShuffleNetV1 | 8 | 75.20 | 76.76 | 76.84
Res32x4–ShuffleNetV1 | 16 | 75.43 | 76.84 | 76.90
Table 3. Ablation experiments on the CIFAR-100 dataset. We chose ResNet50 as the teacher model and ResNet18 as the student model.
Experiment Group | Margin | Global Mu | Top-1 Acc (%)
Student_only | × | × | 78.14
KD | × | × | 80.15 (+2.01)
DKD | × | × | 80.36 (+2.22)
Ablation (No mu) | ✓ | × | 80.44 (+2.30)
Ablation (No margin) | × | ✓ | 80.56 (+2.42)
SPKD (full) | ✓ | ✓ | 80.67 (+2.53)
Table 4. Ablation experiments on the CIFAR-100 dataset. We chose ResNet50 as the teacher model and MobileNetV2 as the student model.
Experiment Group | Margin | Global Mu | Top-1 Acc (%)
Student_only | × | × | 64.36
KD | × | × | 68.54 (+4.18)
DKD | × | × | 68.69 (+4.53)
Ablation (No mu) | ✓ | × | 65.14 (+0.78)
Ablation (No margin) | × | ✓ | 70.63 (+6.27)
SPKD (full) | ✓ | ✓ | 70.8 (+6.44)
Table 5. Comparison of Top-1 accuracy for different Mu_epoch_end values on the CIFAR-100 dataset. We chose ResNet32x4 as the teacher model and ResNet8x4 as the student model.
Mu_Epoch_End | 0 | 1 | 2 | 5
Top-1 Acc (%) | 72.05 | 71.38 | 76.69 | 76.40
Table 6. Comparison of Top-1 accuracy for different margin values on the CIFAR-100 dataset. We chose ResNet32x4 as the teacher model and ResNet8x4 as the student model.
Margin | 0 | 0.25 | 0.50 | 0.75
Top-1 Acc (%) | 76.55 | 75.91 | 76.88 | 76.69
Table 7. The results on the CIFAR-100 dataset.
Teacher | ResNet32x4 | VGG13 | WRN-40-2 | ResNet50 | VGG13 | ResNet32x4
Teacher Acc. | 79.42 | 74.64 | 75.61 | 79.34 | 74.64 | 79.42
Student | ResNet8x4 | VGG8 | WRN-16-2 | MobileNetV2 | MobileNetV2 | ShuffleNetV1
Student Acc. | 72.5 ± 0.14 | 70.36 ± 0.22 | 73.26 ± 0.16 | 64.6 ± 0.30 | 64.6 ± 0.22 | 70.5 ± 0.22
AT [23] | 73.61 ± 0.30 | 71.76 ± 0.15 | 74.30 ± 0.11 | 58.06 ± 1.39 | 60.42 ± 0.48 | 73.57 ± 0.32
VID [24] | 72.98 ± 0.08 | 71.00 ± 0.28 | 73.87 ± 0.13 | 65.77 ± 0.45 | 65.72 ± 0.55 | 72.78 ± 0.20
FITNET [25] | 73.71 ± 0.13 | 71.48 ± 0.30 | 73.61 ± 0.18 | 64.33 ± 0.55 | 64.44 ± 0.96 | 74.39 ± 0.32
PKT [26] | 74.10 ± 0.26 | 73.36 ± 0.15 | 75.12 ± 0.22 | 68.59 ± 0.78 | 68.34 ± 0.20 | 75.71 ± 0.30
KD [1] | 73.70 ± 0.38 | 73.31 ± 0.18 | 75.07 ± 0.19 | 68.50 ± 0.42 | 68.02 ± 0.25 | 74.88 ± 0.26
RKD [6] | 72.72 ± 0.14 | 71.71 ± 0.29 | 73.82 ± 0.11 | 65.95 ± 0.50 | 65.97 ± 0.27 | 73.84 ± 0.23
DKD [20] | 76.06 ± 0.20 | 74.66 ± 0.14 | 75.40 ± 0.16 | 68.23 ± 0.42 | 67.11 ± 0.53 | 74.34 ± 0.18
RRD [14] | 75.85 ± 0.17 | 74.01 ± 0.15 | 75.77 ± 0.12 | 70.11 ± 0.35 | 69.61 ± 0.18 | 75.60 ± 0.30
NORM [27] | 76.08 ± 0.15 | 73.95 ± 0.15 | 75.65 ± 0.13 | 70.56 ± 0.32 | 68.94 ± 0.22 | 77.18 ± 0.01
DTKD [28] | 76.16 ± 0.20 | 74.12 ± 0.22 | 75.81 ± 0.09 | 69.10 ± 0.48 | 69.01 ± 0.25 | 75.43 ± 0.22
SPKD | 76.43 ± 0.19 | 74.84 ± 0.06 | 75.86 ± 0.16 | 70.83 ± 0.30 | 70.11 ± 0.29 | 76.70 ± 0.46
Table 8. Top-1 accuracy (%) on long-tailed distribution subgroups: Head (high-frequency), Medium, and Tail (low-frequency) classes across different teacher–student architectures.
T–S Pairs | Method | Overall | Head | Medium | Tail | Tail Gain
VGG13–VGG8 | KD | 74.08 | 73.18 | 70.54 | 74.66 | -
VGG13–VGG8 | DKD | 75.96 | 75.31 | 72.77 | 72.55 | −2.11
VGG13–VGG8 | SPKD (Ours) | 76.14 | 75.26 | 72.87 | 73.92 | −0.74
Res32x4–ShuffleNetV1 | KD | 75.37 | 73.79 | 75.45 | 71.47 | -
Res32x4–ShuffleNetV1 | DKD | 76.65 | 75.87 | 73.87 | 78.19 | +6.72
Res32x4–ShuffleNetV1 | SPKD (Ours) | 77.43 | 76.11 | 74.32 | 76.86 | +5.39
Res50–MobileNetV2 | KD | 69.31 | 67.85 | 68.40 | 62.89 | -
Res50–MobileNetV2 | DKD | 71.79 | 71.51 | 67.21 | 65.34 | +2.45
Res50–MobileNetV2 | SPKD (Ours) | 71.56 | 69.61 | 70.82 | 70.74 | +7.85
Table 9. Performance of the ResNet-34 educator and ResNet-18 learner on the ImageNet benchmark, evaluated by Top-1 and Top-5 accuracy.
Method | Teacher | Student | AT [23] | OFD [30] | CRD [31] | ReviewKD [32] | KD [1] | KD * | DKD [20] | SPKD
Top-1 | 73.31 | 69.75 ± 0.20 | 70.69 ± 0.15 | 70.81 ± 0.13 | 71.17 ± 0.10 | 71.61 ± 0.15 | 70.66 ± 0.13 | 71.03 ± 0.14 | 71.70 ± 0.18 | 71.90 ± 0.11
Top-5 | 91.42 | 89.07 ± 0.22 | 90.01 ± 0.18 | 89.98 ± 0.15 | 90.13 ± 0.16 | 90.51 ± 0.12 | 89.88 ± 0.16 | 90.03 ± 0.19 | 90.41 ± 0.16 | 90.59 ± 0.07
Distillation manner: AT, OFD, CRD, and ReviewKD are feature-based; KD, KD *, DKD, and SPKD are logit-based.
KD * indicates KD with tuned hyperparameters.
Table 10. Performance of the ResNet-50 educator and MobileNetV1 learner on the ImageNet benchmark, evaluated by Top-1 and Top-5 accuracy.
Method | Teacher | Student | AT [23] | OFD [30] | CRD [31] | ReviewKD [32] | KD [1] | KD * | DKD [20] | SPKD
Top-1 | 76.16 | 68.87 ± 0.14 | 69.56 ± 0.11 | 71.25 ± 0.12 | 71.37 ± 0.17 | 72.56 ± 0.12 | 68.58 ± 0.08 | 70.50 ± 0.14 | 72.05 ± 0.20 | 72.40 ± 0.13
Top-5 | 92.86 | 88.76 ± 0.17 | 89.33 ± 0.14 | 90.54 ± 0.16 | 90.41 ± 0.14 | 91.00 ± 0.14 | 88.98 ± 0.12 | 89.80 ± 0.15 | 91.05 ± 0.23 | 91.35 ± 0.08
Distillation manner: AT, OFD, CRD, and ReviewKD are feature-based; KD, KD *, DKD, and SPKD are logit-based.
KD * indicates KD with tuned hyperparameters.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
