Author Contributions
Conceptualization, Y.L.; methodology, Y.L.; software, J.X.; validation, J.X., H.L., P.G. and C.T.; formal analysis, J.X.; investigation, J.X. and P.G.; resources, Y.L.; data curation, L.W.; writing–original draft preparation, J.X.; writing–review and editing, Y.L.; visualization, H.L. and P.G.; supervision, Y.L.; project administration, H.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Visualization of scale calibration: projecting teacher logits onto a hypersphere via L2 normalization and rescaling.
Figure 2.
Dynamics of the loss function during training. The experimental setup on CIFAR-100 uses VGG13 as the teacher and the more lightweight VGG8 as the student.
Figure 3.
Geometric illustration of the student margin mechanism (baseline vs. SPKD).
Figure 4.
SPKD calibrates teacher logits via L2 normalization and rescaling to a stable magnitude, applies margin-based pressure to student logits, then decouples distillation into TCKD (target class) and NCKD (non-target classes).
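The pipeline summarized in the Figure 4 caption can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the choices of rescaling target (the batch-mean teacher norm) and margin application (subtracting a constant from the student's target-class logit) are assumptions.

```python
import numpy as np

def calibrate_teacher_logits(z_t, target_norm=None, eps=1e-8):
    """Scale calibration: project each teacher logit vector onto a
    hypersphere via per-sample L2 normalization, then rescale every
    vector to one stable magnitude."""
    norms = np.linalg.norm(z_t, axis=1, keepdims=True)
    if target_norm is None:
        # Assumed choice: use the batch-mean norm as the stable target.
        target_norm = norms.mean()
    return z_t / (norms + eps) * target_norm

def apply_student_margin(z_s, labels, margin=0.5):
    """Margin-based pressure: subtract a fixed margin from the
    student's target-class logit before computing the loss, so the
    student must clear the decision boundary by at least `margin`."""
    z = z_s.copy()
    z[np.arange(len(labels)), labels] -= margin
    return z
```

The calibrated teacher logits and margin-adjusted student logits would then feed the decoupled TCKD/NCKD losses.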
Figure 5.
Student logit L2 norm distribution under different gradient modulation strategies.
Figure 6.
Student gradient variance distribution under different gradient modulation strategies.
Figure 7.
SPKD learns the teacher's confidence without replicating its overconfidence.
Figure 8.
Comparison of t-SNE plots of learned features under different methods. (a) Feature distribution learned using the classical KD method. (b) Feature distribution learned using the DKD method. (c) Feature distribution learned using our proposed SPKD method, showing better-defined clusters.
Figure 9.
Quantitative comparison of feature clustering metrics across various architectures.
Figure 10.
Differences in student and teacher logit correlation matrices. (a) The difference matrix under the KD method. (b) The difference matrix under the DKD method, showing smaller differences than KD. (c) The difference matrix under our SPKD method, which achieves the smallest differences among the three.
Figure 11.
Teacher–student logit alignment quantified by correlation-difference metrics.
Figure 12.
CKA similarity cross-layer comparison (Teacher vs. Student). This figure illustrates the CKA similarity between the features of each layer of the student model and the corresponding layers of the teacher model under different knowledge distillation methods. (a) CKA similarity under the KD method. (b) CKA similarity under the DKD method. (c) CKA similarity under the SPKD method.
Figure 13.
Teacher–student alignment quantified by CKA-based similarity metrics.
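The layer-wise similarity reported in Figures 12 and 13 can be reproduced with linear CKA. The sketch below is a standard linear-CKA computation, not the paper's code; it assumes feature matrices of shape (samples × dimensions) drawn from corresponding teacher and student layers.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices
    (n_samples x dim). Returns 1.0 when the representations agree up
    to isotropic scaling and rotation."""
    X = X - X.mean(axis=0)                     # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2  # cross-covariance energy
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

A useful property is invariance to feature scaling, so teacher and student layers of different widths and magnitudes remain comparable.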
Figure 14.
Assessment of temperature scaling robustness across different distillation pairs. (a) Robustness analysis on the VGG13–VGG8 pair. (b) Robustness analysis on the ResNet50–MobileNetV2 pair. (c) Robustness analysis on the ResNet32x4–ShuffleV1 pair.
Figure 15.
Comparison of Top-1 accuracy across Head, Medium, and Tail subgroups to evaluate generalization on imbalanced data. (a) VGG13–VGG8 pair. (b) ResNet32x4–ShuffleV1 pair. (c) ResNet50–MobileNetV2 pair.
Table 1.
A comparison of the parameter quantity and inference speed of different teacher–student models.
| Model | Parameters (MB) | FLOPs (GFLOPs) | Inference Time (ms) |
|---|---|---|---|
| VGG13 | 66.42 | 0.572 | 1.882 ± 0.10 |
| VGG8 | 15.15 | 0.193 | 1.303 ± 0.18 |
| ResNet50 | 97.16 | 2.624 | 8.075 ± 0.43 |
| MobileNetV2 | 3.79 | 0.015 | 4.661 ± 0.10 |
| ResNet32x4 | 35.93 | 2.176 | 4.892 ± 0.03 |
| ShuffleNetV1 | 3.79 | 0.084 | 5.725 ± 0.05 |
Table 2.
Temperature robustness analysis across different teacher–student pairs. Accuracy (%) is reported for various temperature settings (T).
| Teacher–Student Pairs | Temperature (T) | KD | DKD | SPKD |
|---|---|---|---|---|
| VGG13–VGG8 | 1 | 71.31 | 73.67 | 73.65 |
| | 2 | 71.95 | 74.66 | 74.85 |
| | 4 | 73.74 | 74.73 | 74.99 |
| | 8 | 73.67 | 74.33 | 74.75 |
| | 16 | 74.09 | 74.67 | 74.80 |
| ResNet50–MobileNetV2 | 1 | 65.40 | 68.12 | 68.55 |
| | 2 | 67.56 | 69.89 | 70.52 |
| | 4 | 68.32 | 70.11 | 70.83 |
| | 8 | 69.22 | 70.40 | 70.70 |
| | 16 | 69.36 | 70.69 | 70.97 |
| ResNet32x4–ShuffleNetV1 | 1 | 71.99 | 76.14 | 76.30 |
| | 2 | 73.73 | 76.35 | 76.48 |
| | 4 | 74.49 | 76.40 | 76.79 |
| | 8 | 75.20 | 76.76 | 76.84 |
| | 16 | 75.43 | 76.84 | 76.90 |
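The temperature sweep in Table 2 varies T in the standard Hinton-style distillation term. As a reading aid, a minimal NumPy sketch of that term is below; function names are illustrative and the T² rescaling follows the usual convention for keeping gradient magnitudes comparable across temperatures.

```python
import numpy as np

def softened(z, T):
    """Temperature-scaled softmax: larger T flattens the distribution."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_loss(z_teacher, z_student, T=4.0, eps=1e-12):
    """KL(teacher || student) at temperature T, scaled by T^2 so the
    soft-label gradient magnitude stays comparable as T changes."""
    p = softened(z_teacher, T)
    q = softened(z_student, T)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return (T ** 2) * kl.mean()
```

The loss is zero when student and teacher logits coincide and grows as their soft distributions diverge, at any temperature.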
Table 3.
Ablation experiments on the CIFAR-100 dataset. We chose ResNet50 as the teacher model and ResNet18 as the student model.
| Experiment Group | Margin | Global Mu | Top-1 Acc(%) |
|---|---|---|---|
| Student_only | × | × | 78.14 |
| KD | × | × | 80.15 (+2.01) |
| DKD | × | × | 80.36 (+2.22) |
| Ablation (No mu) | √ | × | 80.44 (+2.30) |
| Ablation (No margin) | × | √ | 80.56 (+2.42) |
| SPKD (full) | √ | √ | 80.67 (+2.53) |
Table 4.
Ablation experiments on the CIFAR-100 dataset. We chose ResNet50 as the teacher model and MobileNetV2 as the student model.
| Experiment Group | Margin | Global Mu | Top-1 Acc(%) |
|---|---|---|---|
| Student_only | × | × | 64.36 |
| KD | × | × | 68.54 (+4.18) |
| DKD | × | × | 68.69 (+4.33) |
| Ablation (No mu) | √ | × | 65.14 (+0.78) |
| Ablation (No margin) | × | √ | 70.63 (+6.27) |
| SPKD (full) | √ | √ | 70.80 (+6.44) |
Table 5.
Comparison of Top-1 accuracy for different Mu_epoch_end values on the CIFAR-100 dataset. We chose ResNet32x4 as the teacher model and ResNet8x4 as the student model.
| Mu_Epoch_End | 0 | 1 | 2 | 5 |
|---|---|---|---|---|
| Top-1 Acc(%) | 72.05 | 71.38 | 76.69 | 76.40 |
Table 6.
Comparison of Top-1 accuracy for different margin values on the CIFAR-100 dataset. We chose ResNet32x4 as the teacher model and ResNet8x4 as the student model.
| Margin | 0 | 0.25 | 0.50 | 0.75 |
|---|---|---|---|---|
| Top-1 Acc(%) | 76.55 | 75.91 | 76.88 | 76.69 |
Table 7.
Top-1 accuracy (%) of different distillation methods on the CIFAR-100 dataset across six teacher–student pairs.
| Teacher | ResNet32x4 | VGG13 | WRN-40-2 | ResNet50 | VGG13 | ResNet32x4 |
|---|---|---|---|---|---|---|
| Teacher Acc. | 79.42 | 74.64 | 75.61 | 79.34 | 74.64 | 79.42 |
| Student | ResNet8x4 | VGG8 | WRN-16-2 | MobileNetV2 | MobileNetV2 | ShuffleNetV1 |
| Student Acc. | 72.50 ± 0.14 | 70.36 ± 0.22 | 73.26 ± 0.16 | 64.60 ± 0.30 | 64.60 ± 0.22 | 70.50 ± 0.22 |
| AT [23] | 73.61 ± 0.30 | 71.76 ± 0.15 | 74.30 ± 0.11 | 58.06 ± 1.39 | 60.42 ± 0.48 | 73.57 ± 0.32 |
| VID [24] | 72.98 ± 0.08 | 71.00 ± 0.28 | 73.87 ± 0.13 | 65.77 ± 0.45 | 65.72 ± 0.55 | 72.78 ± 0.20 |
| FITNET [25] | 73.71 ± 0.13 | 71.48 ± 0.30 | 73.61 ± 0.18 | 64.33 ± 0.55 | 64.44 ± 0.96 | 74.39 ± 0.32 |
| PKT [26] | 74.10 ± 0.26 | 73.36 ± 0.15 | 75.12 ± 0.22 | 68.59 ± 0.78 | 68.34 ± 0.20 | 75.71 ± 0.30 |
| KD [1] | 73.70 ± 0.38 | 73.31 ± 0.18 | 75.07 ± 0.19 | 68.50 ± 0.42 | 68.02 ± 0.25 | 74.88 ± 0.26 |
| RKD [6] | 72.72 ± 0.14 | 71.71 ± 0.29 | 73.82 ± 0.11 | 65.95 ± 0.50 | 65.97 ± 0.27 | 73.84 ± 0.23 |
| DKD [20] | 76.06 ± 0.20 | 74.66 ± 0.14 | 75.40 ± 0.16 | 68.23 ± 0.42 | 67.11 ± 0.53 | 74.34 ± 0.18 |
| RRD [14] | 75.85 ± 0.17 | 74.01 ± 0.15 | 75.77 ± 0.12 | 70.11 ± 0.35 | 69.61 ± 0.18 | 75.60 ± 0.30 |
| NORM [27] | 76.08 ± 0.15 | 73.95 ± 0.15 | 75.65 ± 0.13 | 70.56 ± 0.32 | 68.94 ± 0.22 | 77.18 ± 0.01 |
| DTKD [28] | 76.16 ± 0.20 | 74.12 ± 0.22 | 75.81 ± 0.09 | 69.10 ± 0.48 | 69.01 ± 0.25 | 75.43 ± 0.22 |
| SPKD | 76.43 ± 0.19 | 74.84 ± 0.06 | 75.86 ± 0.16 | 70.83 ± 0.30 | 70.11 ± 0.29 | 76.70 ± 0.46 |
Table 8.
Top-1 accuracy (%) on long-tailed distribution subgroups: Head (high-frequency), Medium, and Tail (low-frequency) classes across different teacher–student architectures.
| T–S Pairs | Method | Overall | Head | Medium | Tail | Tail Gain |
|---|---|---|---|---|---|---|
| VGG13–VGG8 | KD | 74.08 | 73.18 | 70.54 | 74.66 | – |
| | DKD | 75.96 | 75.31 | 72.77 | 72.55 | −2.11 |
| | SPKD (Ours) | 76.14 | 75.26 | 72.87 | 73.92 | −0.74 |
| ResNet32x4–ShuffleNetV1 | KD | 75.37 | 73.79 | 75.45 | 71.47 | – |
| | DKD | 76.65 | 75.87 | 73.87 | 78.19 | +6.72 |
| | SPKD (Ours) | 77.43 | 76.11 | 74.32 | 76.86 | +5.39 |
| ResNet50–MobileNetV2 | KD | 69.31 | 67.85 | 68.40 | 62.89 | – |
| | DKD | 71.79 | 71.51 | 67.21 | 65.34 | +2.45 |
| | SPKD (Ours) | 71.56 | 69.61 | 70.82 | 70.74 | +7.85 |
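The Head/Medium/Tail evaluation in Table 8 can be reproduced by ranking classes by training frequency and computing Top-1 accuracy per subgroup. The sketch below is an assumed protocol (equal thirds by frequency); the paper's exact split points and function names are not specified here.

```python
import numpy as np

def subgroup_accuracy(y_true, y_pred, class_freq):
    """Split classes into Head / Medium / Tail thirds by training
    frequency, then report Top-1 accuracy within each subgroup."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    order = np.argsort(-np.asarray(class_freq))  # most frequent first
    n = len(order)
    third = n // 3
    groups = {"Head": order[:third],
              "Medium": order[third:n - third],
              "Tail": order[n - third:]}
    result = {}
    for name, classes in groups.items():
        mask = np.isin(y_true, classes)
        result[name] = float((y_pred[mask] == y_true[mask]).mean())
    return result
```

Comparing the three subgroup accuracies against the overall figure exposes whether gains come mainly from frequent (Head) or rare (Tail) classes.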
Table 9.
Performance of the ResNet-34 teacher and ResNet-18 student on the ImageNet benchmark, evaluated by Top-1 and Top-5 accuracy.
| Method | Teacher | Student | AT [23] | OFD [30] | CRD [31] | Review KD [32] | KD [1] | KD * | DKD [20] | SPKD |
|---|---|---|---|---|---|---|---|---|---|---|
| Distillation Manner | – | – | Feature | Feature | Feature | Feature | Logit | Logit | Logit | Logit |
| Top-1 | 73.31 | 69.75 ± 0.20 | 70.69 ± 0.15 | 70.81 ± 0.13 | 71.17 ± 0.10 | 71.61 ± 0.15 | 70.66 ± 0.13 | 71.03 ± 0.14 | 71.70 ± 0.18 | 71.90 ± 0.11 |
| Top-5 | 91.42 | 89.07 ± 0.22 | 90.01 ± 0.18 | 89.98 ± 0.15 | 90.13 ± 0.16 | 90.51 ± 0.12 | 89.88 ± 0.16 | 90.03 ± 0.19 | 90.41 ± 0.16 | 90.59 ± 0.07 |
Table 10.
Performance of the ResNet-50 teacher and MobileNetV1 student on the ImageNet benchmark, evaluated by Top-1 and Top-5 accuracy.
| Method | Teacher | Student | AT [23] | OFD [30] | CRD [31] | Review KD [32] | KD [1] | KD * | DKD [20] | SPKD |
|---|---|---|---|---|---|---|---|---|---|---|
| Distillation Manner | – | – | Feature | Feature | Feature | Feature | Logit | Logit | Logit | Logit |
| Top-1 | 76.16 | 68.87 ± 0.14 | 69.56 ± 0.11 | 71.25 ± 0.12 | 71.37 ± 0.17 | 72.56 ± 0.12 | 68.58 ± 0.08 | 70.50 ± 0.14 | 72.05 ± 0.20 | 72.40 ± 0.13 |
| Top-5 | 92.86 | 88.76 ± 0.17 | 89.33 ± 0.14 | 90.54 ± 0.16 | 90.41 ± 0.14 | 91.00 ± 0.14 | 88.98 ± 0.12 | 89.80 ± 0.15 | 91.05 ± 0.23 | 91.35 ± 0.08 |