
Efficient Spiking Transformer Based on Temporal Multi-Scale Processing and Cross-Time-Step Distillation

State Grid Shandong Electric Power Company, Zibo Power Supply Company, Zibo 255030, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4918; https://doi.org/10.3390/electronics14244918
Submission received: 27 October 2025 / Revised: 24 November 2025 / Accepted: 28 November 2025 / Published: 15 December 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Spiking Neural Networks (SNNs) have drawn increasing attention for their event-driven and energy-efficient characteristics. However, achieving accurate and efficient inference within limited time steps remains a major challenge. This paper proposes an efficient spiking Transformer framework that integrates cross-time-step knowledge distillation, multi-scale resolution processing, and attention-based token pruning to enhance both temporal modeling and energy efficiency. The cross-time-step distillation mechanism enables earlier time steps to learn from later ones, improving inference consistency and accuracy. Meanwhile, the multi-scale processing module dynamically adjusts input resolution and reuses features across scales, while the attention-based token pruning adaptively removes redundant tokens to reduce computational overhead. Extensive experiments on static datasets (CIFAR-10/100 and ImageNet-1K) and dynamic event-based datasets (DVS128-Gesture and CIFAR10-DVS) demonstrate that the proposed method achieves higher accuracy and more than 1.4× inference speedup compared to baseline SNN–Transformer models. This framework provides a promising solution for developing energy-efficient and high-performance neuromorphic vision systems.

1. Introduction

With the rapid development of artificial intelligence, traditional Artificial Neural Networks (ANNs) have achieved remarkable results in tasks such as image classification and object detection. However, these networks often rely on fixed time steps and continuous activation functions, leading to limitations in handling highly dynamic, low-power scenarios, particularly in balancing computational resources and accuracy. Spiking Neural Networks (SNNs) [1], as the third generation of neural networks, have gradually become a research hotspot in brain-inspired computing and efficient artificial intelligence due to their biologically inspired spike-based information encoding and event-driven, low-power characteristics.
SNNs [2] simulate the spike-firing mechanism of biological neurons, activating neurons only when input events occur, which avoids the redundant computation that traditional neural networks perform at every fixed time step and significantly reduces power consumption. This mechanism not only aligns more closely with the operation of biological neural systems [3] but also matches the asynchronous event streams output by Dynamic Vision Sensors (DVSs) [4]. DVS cameras capture scene changes with microsecond-level temporal resolution, generating sparse spatiotemporal event streams [5].
However, the practical application of SNNs still faces several challenges [6]. The discrete spike-firing behavior of spiking neurons makes the network non-differentiable, which prevents the direct use of traditional gradient-based optimization methods. Existing approaches such as surrogate gradients still have limitations in improving time-step efficiency and accuracy. In recent years, a number of low-power SNN studies have attempted to improve energy efficiency by reducing spike firing rates, designing more energy-friendly network structures, or exploiting the characteristics of event cameras [7,8]. Examples include constant firing rate control, energy-efficient spiking convolution modules, and streaming inference frameworks tailored for DVS data. However, most existing pruning strategies focus on the spatial dimension and ignore the dynamic properties of temporal spike sequences. Although spatial pruning can reduce redundant tokens or features, temporal redundancy in spike sequences has not been effectively exploited, so the inference schedule remains fixed and unnecessarily costly. Moreover, cross-time-step supervision, dynamic pruning, and multi-scale processing are often investigated separately, lacking a unified optimization framework. As a result, existing methods struggle to simultaneously address the requirements of temporal modeling, spatial compression, and adaptive resolution within a single architecture.
To address these issues, this paper proposes an SNN architecture incorporating a multi-time-step attention mechanism. By introducing the global spatiotemporal modeling capability of Transformer [9] and combining cross-time-step knowledge distillation, dual optimization of computational efficiency and key feature extraction is achieved. Additionally, a multi-scale image processing framework is designed to significantly reduce computational energy consumption through progressive feature reuse and dynamic pruning strategies. Experiments demonstrate that this method improves performance and efficiency on both static and dynamic datasets, providing a new technical pathway for SNNs in edge computing, visual analysis, and other scenarios.
The contributions of this paper are summarized as follows:
(1) Cross-time-step knowledge distillation for temporal optimization. We design a novel cross-time-step knowledge distillation strategy that transfers information from later to earlier time steps, which enables the model to learn more discriminative representations at earlier time-steps. This mechanism enhances temporal consistency and significantly improves inference accuracy.
(2) Multi-scale resolution processing with cross-scale feature reuse. A multi-scale processing framework is introduced to dynamically adjust image resolution during inference. By progressively reusing low-resolution features in higher-resolution stages, the proposed method achieves an effective balance between computational cost and classification performance.
(3) Attention-based token pruning for lightweight spiking transformers. We propose an attention-guided token pruning approach that adaptively removes less informative tokens using learnable thresholds. This mechanism reduces redundant computation and achieves significant inference speedup, while maintaining competitive accuracy across static and dynamic benchmarks.
(4) We evaluate the proposed method on three representative models, including Spikformer, S-D Transformer, and SGLFormer, and conduct extensive experiments on both static datasets (CIFAR-10/100 [10] and ImageNet-1K [11]) and dynamic datasets (DVS128-Gesture [12] and CIFAR10-DVS [13]). The results demonstrate that the proposed method not only achieves better accuracy but also significantly accelerates inference (more than 1.4× speedup), thereby further enhancing the energy efficiency of SNNs.

2. Preliminaries

In this section, we introduce the basic components that are directly used in our method, including the LIF neuron and surrogate gradients, the formulation of knowledge distillation, and the spiking Transformer backbones.

2.1. Spiking Neural Networks

SNNs process information by simulating the spike firing mechanism of biological neurons. This network model abandons the continuous activation functions and fixed-time-step information processing of traditional ANNs, instead using discrete spike signals for information encoding and transmission. In SNNs, neuron activation and transmission are driven by spike events, meaning that neurons fire only when input signals reach a specific threshold. This mechanism gives SNNs sparse activation characteristics, effectively avoiding redundant computations at unnecessary times, thereby significantly reducing power consumption and improving computational efficiency.
Furthermore, the ability of SNNs to encode information along the temporal dimension is another highlight. Through spike intervals and frequencies, SNNs can carry rich information, demonstrating excellent spatiotemporal information representation capabilities. Neuron models such as the Integrate-and-Fire (IF) and Leaky Integrate-and-Fire (LIF) further enhance their ability to learn temporal patterns in input sequences [14]. In the LIF model, the membrane potential of a neuron gradually accumulates upon receiving synaptic input; once it exceeds the set threshold, the neuron fires a spike and resets the membrane potential.
Unlike the continuous activation functions of traditional neural networks, SNNs use biologically inspired spiking neurons. The LIF neuron, whose dynamics are given below, is a commonly used neuron model in SNNs:
H[t] = V[t-1] + \frac{1}{\tau}\left( X[t] - \left( V[t-1] - V_{\mathrm{reset}} \right) \right), \quad
S[t] =
\begin{cases}
1, & \text{if } H[t] \ge V_{\mathrm{threshold}} \\
0, & \text{if } H[t] < V_{\mathrm{threshold}}
\end{cases}, \quad
V[t] = H[t] \cdot \left( 1 - S[t] \right) + V_{\mathrm{reset}} \cdot S[t].
Here, V[t] denotes the membrane potential at time step t, S[t] the spike output at time step t, V_threshold the firing threshold, V_reset the reset potential, H[t] the membrane potential after charging (before reset) at time step t, and τ the membrane time constant. The membrane potential is updated step by step from the input X[t]; when H[t] reaches the threshold V_threshold, a spike is fired (i.e., S[t] = 1) and the potential is reset to V_reset, producing the spike feature map S[t].
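For concreteness, a minimal PyTorch sketch of this charge–fire–reset update is given below. The function name, tensor shapes, and the hard-reset convention are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import torch

def lif_forward(x_seq, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Minimal LIF simulation: charge, fire, hard reset.

    x_seq: input current of shape [T, ...] (one slice per time step).
    Returns the binary spike train S of the same shape.
    """
    v = torch.full_like(x_seq[0], v_reset)       # membrane potential V[t-1]
    spikes = []
    for x in x_seq:                              # iterate over time steps
        h = v + (x - (v - v_reset)) / tau        # H[t]: leaky integration
        s = (h >= v_threshold).float()           # S[t]: fire if threshold reached
        v = h * (1.0 - s) + v_reset * s          # V[t]: hard reset where a spike fired
        spikes.append(s)
    return torch.stack(spikes)

# Example: 4 time steps driving a batch of 8 neurons with random input current
spike_train = lif_forward(torch.rand(4, 8))
```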

Training of Deep SNNs

In terms of training methods, SNNs face challenges due to the discreteness and non-differentiability of spike signals. Traditional backpropagation algorithms cannot be directly applied. To address the non-differentiability of SNNs [15], there are two main approaches to obtaining deep SNNs:
  • ANN-to-SNN Conversion: Replaces activation functions like ReLU with spiking neurons and adds scaling operations such as weight normalization and threshold balancing to convert pre-trained ANNs into SNNs.
  • Surrogate Gradients: Uses continuous differentiable functions to approximate the derivative of the step function, providing gradients during backpropagation. This allows direct training and handling of temporal data with only a few time steps, achieving good performance on both static and dynamic datasets [16].
A common surrogate gradient is based on the sigmoid function [17], as shown in Equation (2): the forward pass produces a 0–1 spike matrix, while backpropagation uses the derivative of a scaled sigmoid as a continuous approximation of the step function's gradient:
g(x) = \alpha \cdot \mathrm{sigmoid}(\alpha x) \cdot \left( 1 - \mathrm{sigmoid}(\alpha x) \right)
where a larger α makes the sigmoid closer to the step function, but the gradient range becomes narrower, requiring a trade-off between gradient strength and stability.
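A minimal sketch of how such a surrogate can be realized as a custom autograd function is shown below. The class name and the default α are illustrative assumptions; SNN frameworks provide their own surrogate implementations.

```python
import torch

class SigmoidSurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, sigmoid surrogate gradient (Eq. (2)) in the backward pass."""

    @staticmethod
    def forward(ctx, x, alpha=4.0):
        ctx.save_for_backward(x)
        ctx.alpha = alpha
        return (x >= 0).float()                  # 0/1 spike matrix

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        sig = torch.sigmoid(ctx.alpha * x)
        grad_x = ctx.alpha * sig * (1.0 - sig)   # g(x) from Eq. (2)
        return grad_output * grad_x, None        # no gradient w.r.t. alpha

# Usage: spikes = SigmoidSurrogateSpike.apply(membrane_potential - v_threshold)
```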

2.2. Knowledge Distillation

Knowledge distillation (KD) [18] is a technique for transferring knowledge from a large, complex model (teacher) to a smaller, simpler model (student). Let the teacher and student networks output the logits z teacher R C and z student R C , respectively, for a given input sample, where C denotes the number of classes. Their softened class probability distributions, obtained via the softmax function with temperature parameter τ > 0 , are defined as follows:
p_i^{\mathrm{teacher}} = \frac{\exp\left( z_i^{\mathrm{teacher}} / \tau \right)}{\sum_{j=1}^{C} \exp\left( z_j^{\mathrm{teacher}} / \tau \right)}, \quad
p_i^{\mathrm{student}} = \frac{\exp\left( z_i^{\mathrm{student}} / \tau \right)}{\sum_{j=1}^{C} \exp\left( z_j^{\mathrm{student}} / \tau \right)},
where p i teacher and p i student represent the predicted probability of class i by the teacher and student models, respectively. The temperature τ controls the smoothness of the output distribution: higher values of τ produce softer probability distributions that reveal more information about the relative relationships among classes [19].
The student model is trained to minimize a distillation loss, which encourages its output distribution to match that of the teacher. A commonly used formulation is the Kullback–Leibler (KL) divergence between the softened teacher and student distributions:
\mathcal{L}_{\mathrm{KD}} = \tau^{2} \sum_{i=1}^{C} p_i^{\mathrm{teacher}} \log \frac{p_i^{\mathrm{teacher}}}{p_i^{\mathrm{student}}},
where the factor τ 2 compensates for the gradient scaling introduced by the temperature. In practice, this loss is often combined with the standard cross-entropy loss against the ground-truth labels to form the total training objective.
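The following PyTorch sketch illustrates this combined objective of softened KL term plus hard cross-entropy. The weighting scheme `lam` and the function signature are assumptions for illustration, not the exact loss used later in this paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, lam=0.5):
    """Standard KD objective: temperature-softened KL term plus hard cross-entropy."""
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    # KL(teacher || student), scaled by tau^2 to keep gradient magnitudes comparable
    soft = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
    hard = F.cross_entropy(student_logits, labels)
    return lam * soft + (1.0 - lam) * hard
```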
However, for SNNs, implementing knowledge distillation faces unique challenges [20,21,22,23,24]. The discrete spike characteristics and non-differentiability of SNNs limit the direct application of traditional knowledge distillation methods. To this end, researchers have explored strategies to introduce proxy losses between the continuous probability distribution output by the teacher model and the discrete spikes output by the student model, optimizing with surrogate gradients [25]. This method aims to maintain the event-driven and low-power characteristics of SNNs while enabling them to learn the teacher model’s advantages in feature extraction. Through this approach, the accuracy and generalization ability of SNNs in complex tasks are expected to be significantly improved, further promoting the practical application of SNNs.

2.3. Spiking Transformer

Transformer was initially designed for natural language processing but has also achieved great success in computer vision tasks, including image classification, object detection, and semantic segmentation. Unlike convolution-based models that primarily rely on inductive biases and focus on adjacent pixels, the Transformer structure uses self-attention mechanisms to globally capture relationships between spike features, thereby effectively improving performance.
To adopt the Transformer structure in SNNs, Zhou et al. [26] designed a novel spiking self-attention mechanism called Spiking Self-Attention (SSA). This mechanism uses sparse spike forms for Query, Key, and Value without softmax operations. The SSA computation process is as follows:
Q = \mathrm{SN}_Q\left( \mathrm{BN}\left( X W_Q \right) \right), \quad
K = \mathrm{SN}_K\left( \mathrm{BN}\left( X W_K \right) \right), \quad
V = \mathrm{SN}_V\left( \mathrm{BN}\left( X W_V \right) \right),
\mathrm{SSA}'(Q, K, V) = \mathrm{SN}\left( Q K^{T} V \cdot s \right),
\mathrm{SSA}(Q, K, V) = \mathrm{SN}\left( \mathrm{BN}\left( \mathrm{Linear}\left( \mathrm{SSA}'(Q, K, V) \right) \right) \right)
where the spike-form Query, Key, and Value are computed through learnable layers, s is a scaling factor, SN denotes the spiking neuron layer, BN denotes batch normalization, and Linear denotes a linear layer. Because Q, K, and V are binary spike matrices, the SSA computation avoids multiplication operations, aligning with the characteristics of SNNs. Based on SSA, Zhou et al. designed Spikformer, which achieved 74.81% accuracy on ImageNet-1k, demonstrating excellent performance potential [27].
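For illustration, a simplified single-head, single-time-step sketch of this computation is given below. The hard-threshold `_spike` placeholder stands in for a LIF layer with surrogate gradients, and the module layout is an assumption; the official Spikformer implementation differs in detail.

```python
import torch
import torch.nn as nn

class SpikingSelfAttentionSketch(nn.Module):
    """Simplified single-head SSA: spike-form Q, K, V, no softmax, scaled Q K^T V product."""

    def __init__(self, dim, scale=0.125):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        self.bn_q, self.bn_k = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        self.bn_v, self.bn_out = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        self.scale = scale

    @staticmethod
    def _spike(x):
        # Placeholder for a spiking neuron layer SN(.); a LIF with surrogate
        # gradients would be used in a real spiking Transformer.
        return (x >= 0.5).float()

    @staticmethod
    def _bn(bn, x):
        # BatchNorm1d expects [B, C, N], so normalize over the channel dimension.
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x):  # x: [B, N, dim] spike features at one time step
        q = self._spike(self._bn(self.bn_q, self.q_proj(x)))
        k = self._spike(self._bn(self.bn_k, self.k_proj(x)))
        v = self._spike(self._bn(self.bn_v, self.v_proj(x)))
        attn = self._spike(q @ k.transpose(-2, -1) @ v * self.scale)     # SN(Q K^T V * s)
        return self._spike(self._bn(self.bn_out, self.out_proj(attn)))   # SN(BN(Linear(.)))
```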
Zhou et al. [28] discussed the non-spiking computation issue caused by “addition after activation” shortcuts in Spikformer and proposed Spikingformer, which uses “pre-activation” shortcuts to avoid non-spiking computation in synaptic calculations. CML [29] specifically designed a downsampling structure for SNNs to address the imprecise gradient backpropagation in most advanced deep SNNs, achieving 77.64% accuracy on ImageNet, significantly improving the performance of Transformer-based SNNs.
Yao et al. [30] designed a spike-driven self-attention mechanism (SDSA) that linearly depends on the number of tokens and channels in terms of complexity, significantly reducing computational energy consumption. SDSA relies only on masking and addition operations, abandoning multiplication operations, reducing computational energy consumption by 87.2 times compared to SSA. Meanwhile, Zhang et al. [31] proposed SGLformer, which cleverly integrates Transformer and convolutional structures into SNNs to achieve efficient information processing at global and local scales. SGLFormer achieved a breakthrough top-1 accuracy of 83.73% on the ImageNet-1k dataset with 64 M parameters. Additionally, Zhou et al. [32] proposed QKFormer, which uses a novel hierarchical spiking Transformer structure, leveraging Q-K attention mechanisms to easily model the importance of token or channel dimensions while maintaining linear complexity. Through direct training, it achieved over 85% top-1 accuracy on ImageNet with 4 time steps, surpassing the performance of most ANN-based Transformer structures.
In summary, the fusion architecture of SNNs and Transformers not only inherits the biological plausibility and event-driven characteristics of SNNs but also incorporates the global modeling capability of Transformers, opening new avenues for developing low-power, high-performance neuromorphic computing systems, which are expected to play an important role in future edge computing and end-side models [33].

3. Proposed Method

The overall architecture of the proposed multi-scale SNN processing framework is shown in Figure 1. First, the input image undergoes multi-scale image processing: small-scale images are input and passed through LIF neurons to obtain spike feature maps, enhancing the representation of key information. Then, the model undergoes cross-time-step distillation, combined with KL divergence loss calculation [34], using information from subsequent time steps to guide feature learning in earlier time steps. At the final output layer, the current highest class probability is computed and compared with a confidence threshold: if the probability is below the threshold, a higher-resolution input mechanism is triggered, inputting higher-resolution images while reusing low-resolution feature information to achieve feature reuse; if the classification probability is above the threshold, the predicted classification is directly output.

3.1. Multi-Scale Resolution Processing

As shown in Figure 2, visual images are scaled down to the target size using bilinear interpolation to reduce subsequent computation. This process not only helps reduce interference from irrelevant information but also maintains the overall structure of the image, preserving features of key regions. The formula is expressed as follows:
F_L = \mathrm{Conv}\left( \mathrm{Downsample}(I) \right) \in \mathbb{R}^{56 \times 56 \times C}
In Equation (6), I is the original image and F_L is the feature map extracted from the downscaled input. To preserve the clarity of visual features at the reduced scale, further preprocessing is performed, including grayscale enhancement and edge-preserving filtering.
If the final classification probability of the small-scale image is below the threshold, a higher-scale image input is triggered. The low-scale image is upsampled first, and then feature reuse is performed. The formula is expressed as follows:
F_L' = \mathrm{Conv}\left( \mathrm{Upsample}\left( F_L \right) \right), \quad
F_{\mathrm{reuse}} = F_H + \alpha F_L'
where cross-scale feature reuse is used to balance computational efficiency and classification accuracy. Taking ImageNet-1K as an example, a 56 × 56 low-resolution image is first input, and basic semantic information is obtained through conventional convolution and feature extraction. At this stage, computational load is small, and inference speed is fast. If the classification evaluation probability at the current resolution is low, higher-resolution processing is triggered, gradually introducing finer pixel details. When processing high-resolution images, features extracted at the low-resolution stage are reused and fused with current layer features after upsampling, preserving semantic information and enhancing feature expression capability. Finally, multi-time-step analysis integrates feature information from different resolutions, leveraging both global context information from low resolution and local details from high resolution to enhance classification accuracy.
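A minimal sketch of this cross-scale reuse step is shown below. The module name, the 1 × 1 projection used to realize the convolution after upsampling, and the default α are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleReuse(nn.Module):
    """Cross-scale feature reuse: F_reuse = F_H + alpha * Conv(Upsample(F_L))."""

    def __init__(self, channels, alpha=0.5):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # projection after upsampling
        self.alpha = alpha

    def forward(self, f_low, f_high):
        # f_low:  [B, C, h, w] features from the low-resolution pass (e.g., 56x56 input)
        # f_high: [B, C, H, W] features from the high-resolution pass
        f_low_up = F.interpolate(f_low, size=f_high.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return f_high + self.alpha * self.conv(f_low_up)
```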

3.2. Cross-Time-Step Distillation

As shown in Figure 3, the attention distribution at time step t + 1 serves as the teacher signal, guiding the attention distribution learning at time step t through KL divergence loss, enabling earlier time steps to focus on image features extracted by subsequent time steps.
The designed loss function is
\mathrm{Loss} = \mathrm{CE}\left( y_T, y_{\mathrm{true}} \right) + \sum_{t=1}^{T-1} \gamma_t \cdot \mathrm{KL}\left( y_{t+1} \,\|\, y_t \right), \quad
\mathrm{KL}\left( y_{t+1} \,\|\, y_t \right) = \sum_i y_{t+1}(i) \cdot \log \frac{y_{t+1}(i)}{y_t(i)}
By combining cross-entropy and KL divergence, both final prediction accuracy and prediction consistency across time steps are optimized.
To strengthen the learning ability of earlier time steps, the model designs a gradually decaying loss weight strategy. As training progresses, the contribution of later time steps to the loss gradually increases, helping earlier time steps approach the final output more quickly. This progressively increasing guidance strategy not only accelerates the learning of earlier time steps but also reduces dependence on later time steps, thereby improving the prediction accuracy of earlier time steps. It is worth noting that the non-causal components of our method are used only during training for knowledge transfer, while the inference stage remains fully causal and does not rely on future information.
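The sketch below illustrates one way to implement this training objective. Detaching the later time step so that it acts purely as a teacher is our assumption, and the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_time_step_loss(logits_per_step, labels, gammas):
    """Cross-entropy on the final time step plus a KL term that pulls each
    time step t toward the (detached) distribution of step t+1.

    logits_per_step: list of T tensors, each of shape [B, C].
    gammas: per-step weights gamma_t, length T-1 (the decaying schedule is a design choice).
    """
    loss = F.cross_entropy(logits_per_step[-1], labels)
    for t in range(len(logits_per_step) - 1):
        p_next = F.softmax(logits_per_step[t + 1].detach(), dim=-1)   # teacher: step t+1
        log_p_cur = F.log_softmax(logits_per_step[t], dim=-1)         # student: step t
        loss = loss + gammas[t] * F.kl_div(log_p_cur, p_next, reduction="batchmean")
    return loss
```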

3.3. Attention-Based Token Pruning

For the Spike Transformer architecture, tokens are based on pixels rather than image patches, leading to excessive redundancy. The attention-based pruning strategy achieves finer computational resource allocation by evaluating token importance. As shown in Figure 4, by analyzing the importance scores of tokens in the attention mechanism, low-importance tokens are selectively pruned, retaining only tokens with scores exceeding γ times the global maximum value ( γ ( 0 , 1 ) ). This design preserves key information for feature extraction while achieving a flexible balance between computational efficiency and model accuracy through the learnable parameter γ .
s_i = \frac{1}{H} \sum_{h=1}^{H} \sum_{j=1}^{N} \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d}} \right)_{ij}, \quad
\mathrm{keep} = \left\{ i \mid s_i > \gamma \cdot \max(s) \right\}
As shown in Equation (8), token pruning is designed to make the model more lightweight. Tokens with low attention scores are deemed less important, and a dynamic pruning strategy is used to reduce computation.
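A minimal sketch of the scoring and thresholding step is given below. Interpreting s_i as the attention mass received by token i (summing over the query dimension) is our reading of Equation (8), and gathering the kept tokens is left to the caller.

```python
import torch

def token_keep_mask(attn, gamma=0.1):
    """Attention-guided token pruning: score tokens by averaged attention mass and
    keep those whose score exceeds gamma times the per-sample maximum.

    attn: [B, H, N, N] normalized attention map.
    Returns a boolean mask of shape [B, N].
    """
    scores = attn.mean(dim=1).sum(dim=-2)                        # s_i: average heads, sum over queries
    threshold = gamma * scores.max(dim=-1, keepdim=True).values  # gamma * max_i(s_i) per sample
    return scores > threshold
```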

3.4. Multi-Scale Feature Fusion

To further reduce computational energy consumption, a multi-scale classification evaluation method is designed, as shown in Equation (9). By fusing the focused features of the current scale with the output results of the previous time step, a rich cross-scale feature input is formed.
F_{\mathrm{fusion}} = \alpha \cdot F_t + \beta \cdot \mathrm{Conv}\left( F_{t-1} \right)
where α and β are learnable fusion weights, F_{t-1} is the feature map from the previous time step, and Conv denotes a convolution. F_fusion is the fused feature map after multi-time-step processing. The system enhances its global understanding of the target through this iterative fusion. After the result is produced at each time step, a confidence evaluation is performed on each scale's output, as shown in Equation (10), to ensure the reliability of the prediction.
C_t = \mathrm{Linear}\left( W_c \cdot \mathrm{AvgPool}\left( F_{\mathrm{fusion}} \right) + b_t \right)
where C_t is the classification probability vector at time step t, Linear maps the pooled features to the class dimension, W_c is a learnable projection weight, AvgPool denotes average pooling over the output features, and b_t is a bias term that defaults to 0. If the maximum classification probability exceeds the set threshold, the prediction is output; if the confidence falls below the threshold, a larger-scale image is fed in. Through this gradual increase in resolution, the model processes the target from coarse to fine granularity, progressively deepening the feature representation while keeping the initial computational cost low. During inference, the model also decides whether to terminate early by measuring the similarity between the output of the current time step and that of the final time step, reducing unnecessary computation and further improving overall inference efficiency.
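The following sketch illustrates the confidence-gated, coarse-to-fine inference loop described above. The `model(x, reuse=...)` interface and the batch-level early-exit condition are illustrative assumptions about how the network exposes feature reuse.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def coarse_to_fine_predict(model, image, scales=(56, 112, 224), conf_threshold=0.8):
    """Run the model at the smallest resolution first and only escalate to a
    higher resolution when the maximum class probability is below the threshold."""
    reused = None
    logits = None
    for size in scales:
        x = F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)
        logits, reused = model(x, reuse=reused)          # reuse lower-scale features
        confidence = F.softmax(logits, dim=-1).max(dim=-1).values
        if bool((confidence >= conf_threshold).all()):   # early exit when the batch is confident
            break
    return logits.argmax(dim=-1)
```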

4. Experiments and Results Analysis

4.1. Experimental Environment and Parameter Settings

The experiments were conducted on Ubuntu 20.04. The network was implemented in Python 3.7 with PyTorch 1.13.0 and trained on eight NVIDIA V100 GPUs with CUDA 11.8 acceleration. The hyperparameters used for ImageNet-1k are listed in Table 1.

4.2. Results on Static Datasets

We evaluate the proposed method on five mainstream models (Spikformer, Spike-driven Transformer, SGLFormer, Spike-driven Transformer V2, and QKFormer) on the CIFAR10, CIFAR100, and ImageNet-1K datasets [10,11]. CIFAR-10 is a widely used image classification dataset containing 60,000 32 × 32 color images in 10 classes, with 6000 images per class (50,000 for training and 10,000 for testing), and is often used for small-scale image classification tasks. CIFAR-100 is similar to CIFAR-10 but contains 100 classes with 600 images each, making it more challenging and suitable for complex image classification research. ImageNet-1K is a large-scale image classification dataset containing approximately 1.28 million images in 1000 classes and serves as a standard benchmark for image classification in computer vision. As shown in Table 2, the improved models demonstrate consistent performance gains.

4.3. Results on Dynamic Datasets

DVS records scene changes in the form of asynchronous event streams, featuring high temporal resolution, low latency, and low power consumption, making it suitable for processing with SNNs and enabling efficient handling of sparse spatiotemporal data. The proposed improvements in this paper were validated on the DVS datasets DVS128-Gesture [12] and CIFAR10-DVS [13]. DVS128-Gesture is an event camera-based dynamic gesture recognition dataset containing 11 types of gestures, with data represented as event streams, commonly used in neuromorphic processing tasks. CIFAR10-DVS is a dynamic vision version of CIFAR-10, generated by capturing dynamic transitions of the original CIFAR-10 images using DVS, producing approximately 1000 event sequences per class for evaluating the performance of event-based neural networks in object classification tasks. The experimental results are shown in Table 3. All improved models demonstrated performance gains on both dynamic datasets.
The results indicate that the proposed improvements offer significant advantages in processing dynamic event-stream data. The high temporal resolution of DVS128-Gesture enables the model to accurately capture continuous changes in gestures through the spike-driven temporal attention mechanism, reaching an accuracy of nearly 99.5%. The motion-blur challenge in CIFAR10-DVS is effectively mitigated by cross-time-step knowledge distillation; for instance, the improved SGLFormer achieves state-of-the-art performance of 83.72% by preserving key event features. The gains on both dynamic datasets validate the generalization capability of the proposed method in event-driven scenarios, providing an efficient solution for dynamic vision tasks.

4.4. Ablation Study

To evaluate the contributions of each module, ablation studies were conducted using Spikformer as the baseline. The results in Table 4 demonstrate the impact of progressively introducing time step distillation, attention pruning, and the classification evaluation module on both model performance and computational efficiency, systematically validating the influence of each component.

4.4.1. Cross-Time-Step Knowledge Distillation

As shown in Table 4, the time step distillation mechanism enhances classification accuracy by facilitating knowledge transfer between different time steps. Classification accuracy improved by 1.42% on CIFAR10 and 0.91% on CIFAR100, indicating that the model can better capture dynamic features and improve temporal information modeling through cross-time-step learning. However, the reduction in energy consumption was limited, with only a 0.11 mJ decrease, while training speed dropped by approximately 8%. This suggests that although time step distillation improves accuracy, the additional interaction between time steps slightly reduces training efficiency.

4.4.2. Attention-Based Pruning

The attention pruning strategy significantly reduces power consumption and accelerates training. By dynamically pruning attention weights to eliminate redundant computations, the model’s classification accuracy decreased slightly by 0.30% on CIFAR10 and 0.29% on CIFAR100, but energy consumption was reduced by 0.91 mJ, and training speed increased by 38%. This indicates that reducing unnecessary computations in the attention mechanism can significantly enhance training and inference efficiency while maintaining accuracy. The dynamic attention pruning strategy greatly optimizes computational resource utilization, lowering power consumption and accelerating training, though minor accuracy degradation occurs on smaller datasets due to partial loss of feature information.

4.4.3. Multi-Scale Resolution Processing

The multi-scale resolution processing approach dynamically optimizes classification by re-evaluating low-confidence samples at higher resolutions. On CIFAR10 and CIFAR100, classification accuracy decreased by 1.40% and 1.84%, respectively, but power consumption was reduced by 2.62 mJ and training speed increased by 142%. Although this module excels in ultra-low-power scenarios, the fine-grained information lost at low resolution is only partially recovered by high-resolution re-evaluation, which weakens the model's ability to capture complex features and leads to an overall decline in accuracy. Triggering high-resolution re-evaluation only for low-confidence samples enables fast inference in low-power settings, but over-reliance on this mechanism limits the modeling of intricate features.

4.4.4. Comparison of Energy Consumption and Computational Complexity

The proposed improvements were validated on three baseline models, all with a final dimension of 384, and tested on ImageNet1k. As shown in Table 5, significant achievements were made in energy efficiency and computational efficiency. Spikformer performed most prominently, reducing power consumption by 29.8% to 5.43 mJ while maintaining a 1.82× training acceleration and a 1.42× inference acceleration, achieving a dual improvement in accuracy and energy efficiency. This indicates that pixel redundancy is a critical factor limiting SSA performance. SGLFormer reduced power consumption by 19.7% to 10.47 mJ while maintaining high accuracy and achieving a 1.97× training acceleration, with its dynamic classification evaluation strategy effectively balancing computational costs.
From a technical perspective, time step distillation and dynamic pruning are key to improving energy efficiency. These techniques effectively reduce computational complexity while maintaining or even enhancing model accuracy. For instance, SGLFormer achieved 84.33% accuracy on ImageNet1k with a 1.63× inference speedup, surpassing the performance of conventional optimization methods.
Integrating all modules achieves a balance between power consumption and training efficiency while maintaining improved accuracy. Time step distillation effectively compensates for the feature loss caused by attention pruning, and the dynamic classification evaluation strategy avoids redundant computations by activating high-resolution analysis, providing a solution that balances performance and efficiency for model deployment in resource-constrained scenarios.

5. Conclusions

In this work, we presented an efficient spiking Transformer architecture that enhances the temporal modeling and computational efficiency of SNNs. The proposed cross-time-step knowledge distillation mechanism enables earlier time steps to learn from later ones, which improves temporal consistency and overall inference accuracy. The multi-scale image processing framework further balances computational cost and performance through progressive feature reuse, while the attention-based token pruning dynamically eliminates redundant information to achieve efficient inference. Extensive experiments on both static and dynamic benchmarks verify that our approach achieves superior performance and over 1.4× inference acceleration, which significantly enhances the energy efficiency of SNNs. These techniques enable temporal knowledge transfer and spatial feature selection in spiking Transformers, thereby improving their efficiency and overall performance. Future work will explore adaptive event encoding and hardware-level optimization to further improve real-time deployment on neuromorphic and edge computing devices.

Author Contributions

Conceptualization, L.S.; Methodology, L.S., Y.L., Z.Y. and G.L.; Software, Y.L., X.K. and G.L.; Validation, Y.L.; Data Curation, L.S.; Writing—Original Draft Preparation, Y.L.; Writing—Review & Editing, L.S.; Visualization, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of State Grid Shandong Electric Power Company (2024A-010).

Data Availability Statement

All the used datasets are publicly available online.

Conflicts of Interest

Authors Lei Sun, Yao Li, Gushuai Liu, Zengjian Yan and Xuecheng Kong were employed by State Grid Shandong Electric Power Company, Zibo Power Supply Company. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ghosh-Dastidar, S.; Adeli, H. Spiking neural networks. Int. J. Neural Syst. 2009, 19, 295–308. [Google Scholar] [CrossRef]
  2. Gerstner, W.; Kistler, W.M. Spiking Neuron Models: Single Neurons, Populations, Plasticity; Cambridge University Press: Cambridge, UK, 2002. [Google Scholar]
  3. Ponulak, F.; Kasiński, A. Introduction to spiking neural networks: Information processing, learning and applications. Acta Neurobiol. Exp. 2011, 71, 409–433. [Google Scholar] [CrossRef]
  4. Serrano-Gotarredona, T.; Linares-Barranco, B. A 128 × 128 1.5% contrast sensitivity 0.9% FPN 3 μs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. IEEE J. Solid-State Circuits 2013, 48, 827–838. [Google Scholar]
  5. Chan, K.H.; So, S.K. Using admittance spectroscopy to quantify transport properties of P3HT thin films. J. Photonics Energy 2011, 1, 011112. [Google Scholar] [CrossRef]
  6. Pfeiffer, M.; Pfeil, T. Deep learning with spiking neurons: Opportunities and challenges. Front. Neurosci. 2018, 12, 774. [Google Scholar] [CrossRef] [PubMed]
  7. Tavanaei, A.; Ghodrati, M.; Kheradpisheh, S.R.; Masquelier, T.; Maida, A. Deep learning in spiking neural networks. Neural Netw. 2019, 111, 47–63. [Google Scholar] [CrossRef] [PubMed]
  8. Jang, H.; Simeone, O.; Gardner, B.; Grüning, A. An introduction to spiking neural networks: Probabilistic models, learning rules, and applications. IEEE Signal Process. Mag. 2019, 36, 64–77. [Google Scholar] [CrossRef]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  10. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  11. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  12. Amir, A.; Taba, B.; Berg, D.; Melano, T.; McKinstry, J.; Nolfo, C.D.; Nayak, T.; Andreopoulos, A.; Garreau, G.; Mendoza, M.; et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7243–7252. [Google Scholar]
  13. Li, H.; Liu, H.; Ji, X.; Li, G.; Shi, L. CIFAR10-DVS: An event-stream dataset for object classification. Front. Neurosci. 2017, 11, 309. [Google Scholar] [CrossRef]
  14. Hodgkin, A.L.; Huxley, A.F. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 1952, 117, 500–544. [Google Scholar] [CrossRef]
  15. Neftci, E.O.; Mostafa, H.; Zenke, F. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 2019, 36, 51–63. [Google Scholar] [CrossRef]
  16. Wu, Y.; Deng, L.; Li, G.; Zhu, J.; Shi, L. Spatio-temporal backpropagation for training high-performance spiking neural networks. Front. Neurosci. 2018, 12, 331. [Google Scholar]
  17. Fang, W.; Yu, Z.; Chen, Y.; Huang, T.; Masquelier, T.; Tian, Y. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2661–2671. [Google Scholar]
  18. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  19. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550. [Google Scholar]
  20. Kushawaha, R.K.; Kumar, S.; Banerjee, B.; Chaudhuri, B.B. Distilling spikes: Knowledge distillation in spiking neural networks. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Virtual, 10–15 January 2021; pp. 4536–4543. [Google Scholar]
  21. Yu, K.; Yu, C.; Zhang, T.; Huang, T. Temporal separation with entropy regularization for knowledge distillation in spiking neural networks. arXiv 2025, arXiv:2503.03144. [Google Scholar] [CrossRef]
  22. Xu, Q.; Li, Y.; Shen, J.; Zhang, J.; Liu, Z.; Tang, H.; Pan, G. Constructing deep spiking neural networks from artificial neural networks with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7886–7895. [Google Scholar]
  23. Takuya, S.; Zhang, R.; Nakashima, Y. Training low-latency spiking neural network through knowledge distillation. In Proceedings of the 2021 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS), Tokyo, Japan, 14–16 April 2021; pp. 1–3. [Google Scholar]
  24. Qiu, H.; Ning, M.; Song, Z.; Pan, G. Self-architectural knowledge distillation for spiking neural networks. Neural Netw. 2024, 178, 106475. [Google Scholar] [CrossRef]
  25. Wang, J.; Bertasius, G.; Tran, D.; Torresani, L. Long-short temporal contrastive learning of video transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14010–14020. [Google Scholar]
  26. Zhou, Z.; Zhu, Y.; He, C.; Wang, Y.; Yan, S.; Tian, Y.; Yuan, L. Spikformer: When spiking neural network meets transformer. arXiv 2022, arXiv:2209.15425. [Google Scholar] [CrossRef]
  27. Zhou, Z.; Che, K.; Fang, W.; Yu, Z.; Tian, Y. Spikformer v2: Join the high accuracy club on imagenet with an snn ticket. arXiv 2024, arXiv:2401.02020. [Google Scholar] [CrossRef]
  28. Zhou, C.; Yu, L.; Zhou, Z.; Zhang, H.; Tian, Y. Spikingformer: Spike-driven residual learning for transformer-based spiking neural network. arXiv 2023, arXiv:2304.11954. [Google Scholar]
  29. Zhou, C.; Zhang, H.; Zhou, Z.; Yu, L.; Tian, Y. Enhancing the performance of transformer-based spiking neural networks by SNN-optimized downsampling with precise gradient backpropagation. arXiv 2023, arXiv:2305.05954. [Google Scholar]
  30. Yao, M.; Hu, J.; Zhou, Z.; Yuan, L.; Tian, Y. Spike-driven transformer. Adv. Neural Inf. Process. Syst. 2023, 36, 64043–64058. [Google Scholar]
  31. Zhang, H.; Zhou, C.; Yu, L.; Zhou, Z.; Tian, Y. SGLFormer: Spiking global-local-fusion transformer with high performance. Front. Neurosci. 2024, 18, 1371290. [Google Scholar] [CrossRef] [PubMed]
  32. Zhou, C.; Zhang, H.; Zhou, Z.; Yu, L.; Tian, Y. Qkformer: Hierarchical spiking transformer using qk attention. arXiv 2024, arXiv:2403.16552. [Google Scholar] [CrossRef]
  33. Li, W.; Wang, P.; Wang, X.; Zuo, W.; Fan, X.; Tian, Y. Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10772–10786. [Google Scholar] [CrossRef]
  34. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Figure 1. Overall architecture of the proposed multi-scale processing method, which adjusts image resolution during inference. A cross-time-step distillation approach is designed to train the model.
Figure 2. Illustration of the proposed multi-scale image processing scheme.
Figure 3. Multi-Resolution Image Processing.
Figure 4. Attention Map Token Pruning.
Table 1. Hyperparameter Settings.

Parameter         Value
Learning Rate     6 × 10⁻⁴
Warm-up Epochs    5
Weight Decay      0.05
Batch Size        256
Epochs            200
Table 2. Performance Comparison on Static Datasets (%).

Algorithm Model             CIFAR10   CIFAR100   ImageNet-1K   Improvement
Spikformer [26]             95.51     78.21      74.81         -
Spikformer (Ours)           96.73     78.62      75.84         +1.22/+0.41/+1.03
S-D Transformer [28]        95.60     78.40      77.07         -
S-D Transformer (Ours)      96.82     79.12      78.62         +1.22/+0.72/+1.55
SGLFormer [31]              96.76     82.26      83.73         -
SGLFormer (Ours)            97.12     83.07      84.33         +0.36/+0.81/+0.60
S-D Transformer V2 [27]     96.86     81.36      80.02         -
S-D Transformer V2 (Ours)   97.37     82.07      80.75         +0.51/+0.71/+0.73
QKFormer [32]               96.18     81.15      84.22         -
QKFormer (Ours)             97.02     81.92      84.76         +0.84/+0.77/+0.54
Table 3. Performance Comparison on Dynamic Datasets (%).

Algorithm Model          DVS128-Gesture   CIFAR10-DVS
Spikformer [26]          98.30            80.90
Spikformer (Ours)        98.74            81.56
S-D Transformer [28]     99.30            80.00
S-D Transformer (Ours)   99.42            80.73
SGLFormer [31]           98.60            82.90
SGLFormer (Ours)         99.00            83.72
Table 4. Ablation Study Results.

Module Configuration          CIFAR10 (%)   CIFAR100 (%)   Power (mJ)   Training Speed
Spikformer [26] (Baseline)    95.51         78.21          7.73         1.00×
+ Time-step Distillation      96.93         79.12          7.62         0.92×
+ Attention Pruning           95.21         77.92          6.82         1.38×
+ Classification Evaluation   94.11         76.37          5.11         2.42×
Full Model (Ours)             96.73         78.62          5.43         1.82×
Table 5. Computational Efficiency Comparison.

Algorithm Model          Power (mJ)   Training Speed   Inference Speed
Spikformer [26]          7.73         1.00×            1.00×
Spikformer (Ours)        5.43         1.82×            1.42×
S-D Transformer [28]     3.90         1.00×            1.00×
S-D Transformer (Ours)   3.37         2.01×            1.57×
SGLFormer [31]           13.04        1.00×            1.00×
SGLFormer (Ours)         10.47        1.97×            1.63×

