1. Introduction
Motor imagery (MI) EEG classification is a core component of noninvasive brain–computer interfaces (BCIs) and is often required to operate under strict latency and computing constraints in online or edge deployments [1,2]. Despite substantial progress, EEG-based decoding remains challenging due to the low signal-to-noise ratio (SNR) [3,4]. In MI, the problem is further aggravated by limited labeled data per subject and pronounced inter-subject and inter-session variability [5,6]. As a result, training a single model can be sensitive to random initialization and optimization noise, leading to unstable generalization and unreliable final checkpoint selection in practice [7].
Ensembling is a reliable way to improve accuracy and reduce variance, but its inference cost scales linearly with the number of ensemble members, which is undesirable for real-time BCI systems. Knowledge distillation (KD) addresses this “train heavy, infer light” objective by transferring an ensemble’s predictive distribution to a single student model [8]. However, in MI-EEG settings, student optimization can be noisy, and a fixed teacher alone may not prevent instability during optimization.
From an information theoretic perspective, two failure modes are particularly common in small and noisy MI datasets. First, the predicted class distribution can remain noisy, resulting in elevated predictive entropy and thus uncertain decisions. Second, even when the accuracy improves, the predictive distribution may oscillate noticeably across epochs near convergence, leading to large entropy fluctuation and unstable final checkpoint selection. These two phenomena motivate a stability-oriented training strategy, where predictive entropy serves as a measurable proxy for the reliability of the predictive distribution: we aim to make predictions less noisy and more consistent, while suppressing late-stage oscillations near convergence.
To address the above issues, we propose an entropy-based dual-teacher distillation framework that combines (i) an offline ensemble teacher built from multiple instances of a backbone network and (ii) an entropy-gated exponential moving average (EMA) teacher formed from the student’s historical weights [9]. The ensemble teacher provides high-quality soft targets for transferring strong decision boundaries. The EMA teacher plays a complementary role: acting as a low-pass filter over the student’s historical weights, it produces temporally smoothed teacher logits that reduce target noise. Its guidance is gated by each sample’s current predictive entropy, allocating stronger EMA supervision to high-entropy samples and down-weighting low-entropy samples, thereby concentrating the stabilization effect where it is most needed. We further adopt a two-stage cosine annealing schedule [10] to suppress entropy fluctuation in the late training stage, improving the robustness of final checkpoint selection when early stopping is not used.
Our main contributions are as follows:
We propose an entropy-based dual-teacher distillation framework for MI EEG that distills an offline ensemble into a single deployable backbone student without increasing the inference time.
We introduce an EMA teacher as a parameter-space low-pass filtering mechanism that yields more stable teacher logits. It is activated based on the sample’s predictive entropy to concentrate this effect on high-noise samples and avoid redundant regularization on easy samples.
We integrate a two-stage cosine annealing schedule to suppress late-stage entropy fluctuation, yielding more stable training dynamics and more reliable final checkpoint selection.
We evaluate the proposed method on BCI Competition IV-2a/2b with three representative backbones, and provide entropy-based analyses that link the accuracy gains to the improved reliability of predictive distributions [11,12].
2. Related Work
Classical MI decoding commonly relies on spatial filtering and linear classification, such as CSP/FBCSP pipelines [6,13,14]. Geometry-aware approaches based on covariance representations on Riemannian manifolds have also been studied [15,16]. Deep learning has become the dominant approach for end-to-end MI decoding. Widely used CNN architectures include ShallowConvNet [17] and EEGNet [18]. Recent models incorporate stronger temporal modeling [19,20], attention mechanisms [21,22,23], and state–space sequence models [24]. A recent survey further systematizes deep learning-based MI-EEG research by summarizing input formulations, architectures, and commonly used public datasets [25]. While these architectures improve accuracy, MI training data remain scarce and noisy, making training stability a central concern for further improvement [7].
KD transfers soft targets from a teacher (often a larger network) to a student and is a standard tool for compressing knowledge into a single deployable model [8]. Several variants are closely related to our setting: Deep Mutual Learning (DML) uses peer networks for online knowledge exchange [26]; Born-Again Networks (BAN) iteratively distill a model into a new instance of the same architecture [27]; and Decoupled KD (DKD) refines logit distillation by separating target-class and non-target-class components [28].
Distillation has been applied to MI decoding. Examples include multi-subject or cross-subject distillation strategies [29,30], relation/similarity-preserving distillation for low-density EEG [31], and more structured teacher–assistant designs targeting high compression [32]. Self-distillation has also been studied [33]. These studies support the application of distillation in MI-BCI, but many require additional intermediate networks or focus on scenario-specific constraints. In contrast, our approach uses an ensemble of the same backbone as the teacher while keeping the inference time identical to that of the backbone student. No additional teacher or assistant architecture is required, which is particularly useful when the backbone is already strong and a strictly stronger single-model teacher or assistant is hard to obtain.
Predictive uncertainty and training stability can be naturally described from an information-theoretic perspective. Given a probabilistic classifier, the Shannon entropy of its output distribution provides a measure of decision uncertainty, where higher entropy indicates a flatter prediction [11]. Calibration and reliability are also practically relevant, since decisions are ultimately made from the predicted probabilities. This motivates calibration diagnostics such as reliability diagrams and the expected calibration error (ECE) [12,34], as well as complementary proper scoring rules (e.g., the Brier score) for assessing probabilistic forecasts [35]. Meanwhile, uncertainty quality and robustness are known to benefit from model averaging: deep ensembles often reduce overconfidence [36]. Bayesian approximations such as Monte Carlo Dropout are widely used to quantify uncertainty in deep networks [37,38]. Beyond calibration, predictive entropy is also informative about the stability of the predictive distribution during optimization [39,40]: non-common-mode perturbations that change relative logit differences tend to yield less consistent predictions and are often reflected by higher entropy on average.
3. Methods
3.1. Problem Formulation and Notation
We consider supervised MI EEG classification with $C$ classes, where $C$ denotes the number of MI categories (e.g., left/right hand, feet, tongue, etc.). Each trial is denoted by $(X_i, y_i)$, where $X_i \in \mathbb{R}^{C_e \times T}$ is the multi-channel EEG segment, $C_e$ is the number of channels, $T$ is the number of time points, and $y_i \in \{1, \dots, C\}$ is the class label. Let $\mathcal{D}_{\mathrm{tr}} = \{(X_i, y_i)\}_{i=1}^{N_{\mathrm{tr}}}$ and $\mathcal{D}_{\mathrm{te}} = \{(X_j, y_j)\}_{j=1}^{N_{\mathrm{te}}}$ denote the training and test sets under a chosen evaluation protocol, where $N_{\mathrm{tr}}$ and $N_{\mathrm{te}}$ are the numbers of training and test trials, respectively.
We use a single backbone network as the student $f_{\theta}$, where $\theta$ represents all trainable parameters, producing logits $z^{s}(x) = f_{\theta}(x) \in \mathbb{R}^{C}$ and class probabilities $p^{s}(x) = \mathrm{softmax}(z^{s}(x))$. During training, we employ two teachers: (i) an ensemble teacher (trained offline, fixed during student training), producing logits $z^{\mathrm{ens}}(x)$ and probabilities $p^{\mathrm{ens}}(x)$; (ii) an EMA teacher $f_{\tilde{\theta}}$ (updated online, alongside student training), producing logits $z^{\mathrm{ema}}(x)$ and probabilities $p^{\mathrm{ema}}(x)$. Importantly, only the student is used at inference time; both teachers are used only during training.
3.2. Teacher Construction
3.2.1. Ensemble Teacher via K-Fold Bagging
To obtain a strong teacher when a strictly stronger single-model architecture is unavailable, we construct an ensemble teacher using $K$ independently trained instances of the same backbone. We split the training set $\mathcal{D}_{\mathrm{tr}}$ into $K$ disjoint folds:
$$\mathcal{D}_{\mathrm{tr}} = \mathcal{F}_1 \cup \mathcal{F}_2 \cup \cdots \cup \mathcal{F}_K, \qquad \mathcal{F}_k \cap \mathcal{F}_{k'} = \varnothing \ \ (k \neq k').$$
For teacher $k$, we train on $\mathcal{D}_{\mathrm{tr}} \setminus \mathcal{F}_k$ to encourage diversity. After training, each teacher yields logits $z_k(x) \in \mathbb{R}^{C}$. The ensemble logits are computed by averaging:
$$z^{\mathrm{ens}}(x) = \frac{1}{K} \sum_{k=1}^{K} z_k(x).$$
We then obtain the ensemble soft target at temperature $\tau$:
$$p^{\mathrm{ens}}_{\tau}(x) = \mathrm{softmax}\!\left(z^{\mathrm{ens}}(x)/\tau\right).$$
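To make the construction concrete, the following is a minimal PyTorch sketch of the ensemble soft-target computation; the names `teachers`, `x`, and `tau` are illustrative placeholders rather than identifiers from our implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_soft_targets(teachers, x, tau):
    """Average the logits of K frozen teachers and apply temperature scaling.

    teachers : list of K trained backbone instances (frozen, eval mode)
    x        : batch of EEG trials, shape (B, channels, time)
    tau      : distillation temperature
    """
    logits = torch.stack([t(x) for t in teachers], dim=0)  # (K, B, C)
    z_ens = logits.mean(dim=0)                             # averaged logits, (B, C)
    return F.softmax(z_ens / tau, dim=-1)                  # temperature-scaled soft target
```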
3.2.2. EMA Teacher
We maintain an EMA of the student parameters as an additional teacher. Let $\tilde{\theta}$ be the EMA parameters and $\alpha \in (0, 1)$ the smoothing coefficient. After each student update, we perform
$$\tilde{\theta} \leftarrow \alpha \tilde{\theta} + (1 - \alpha)\, \theta.$$
This update can be viewed as a low-pass filter in parameter space, which attenuates high-frequency optimization noise and yields temporally smoothed teacher logits. The EMA teacher produces logits $z^{\mathrm{ema}}(x) = f_{\tilde{\theta}}(x)$ and a temperature-scaled soft target
$$p^{\mathrm{ema}}_{\tau}(x) = \mathrm{softmax}\!\left(z^{\mathrm{ema}}(x)/\tau\right).$$
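A minimal sketch of this update in PyTorch is shown below; `student` and `ema_teacher` are assumed to be two instances of the same backbone, and buffer handling (e.g., BatchNorm running statistics) is omitted for brevity.

```python
import torch

@torch.no_grad()
def ema_update(student, ema_teacher, alpha):
    """In-place EMA update: theta_tilde <- alpha * theta_tilde + (1 - alpha) * theta."""
    for p_s, p_t in zip(student.parameters(), ema_teacher.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```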
3.3. Entropy-Based Dual-Teacher Distillation Objective
We use the standard cross-entropy as the supervised classification loss:
$$\mathcal{L}_{\mathrm{CE}} = -\log p^{s}_{y}(x),$$
where $p^{s}_{y}(x)$ is the $y$th element of the vector $p^{s}(x)$. The ensemble knowledge is distilled by matching temperature-scaled distributions via the KL divergence:
$$\mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}} = \tau^{2}\, \mathrm{KL}\!\left(p^{\mathrm{ens}}_{\tau}(x) \,\big\|\, p^{s}_{\tau}(x)\right).$$
Similarly, we align the student to the EMA teacher:
$$\mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}} = \tau^{2}\, \mathrm{KL}\!\left(p^{\mathrm{ema}}_{\tau}(x) \,\big\|\, p^{s}_{\tau}(x)\right).$$
We introduce a sample-level entropy gate $g(x) \in [0, 1]$ that modulates the EMA KD term. Specifically, we compute the predictive entropy of the EMA teacher on $x$:
$$H(x) = -\sum_{c=1}^{C} p^{\mathrm{ema}}_{c}(x) \log p^{\mathrm{ema}}_{c}(x),$$
and then normalize the entropy into $[0, 1]$ as
$$\hat{H}(x) = \frac{H(x)}{\log C}.$$
The entropy gate is defined as a linear and clipped mapping:
$$g(x) = \mathrm{clip}\!\left(\frac{\hat{H}(x) - h_{\min}}{h_{\max} - h_{\min}},\, 0,\, 1\right),$$
where $h_{\min}$ and $h_{\max}$ are two hyperparameters. This design assigns a smaller EMA weight to low-entropy (easy) samples and a larger EMA weight to high-entropy (noisy) samples.
The overall objective is
$$\mathcal{L}(e) = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{ens}}\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}} + \mathbb{1}[e > N]\; \lambda_{\mathrm{ema}}\, g(x)\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}},$$
where $e$ is the epoch index, $\mathbb{1}[e > N]$ indicates that the EMA term is active only in Phase II (Section 3.4), and $\lambda_{\mathrm{ens}}$ and $\lambda_{\mathrm{ema}}$ control the contributions of the two teachers.
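The sketch below assembles the full objective; it assumes the linear-and-clip gate form above (with placeholder thresholds `h_min`/`h_max`) and the conventional $\tau^2$ scaling of KD terms, so it should be read as an illustration rather than a reference implementation.

```python
import math
import torch
import torch.nn.functional as F

def dual_teacher_loss(z_s, z_ens, z_ema, y, tau, lam_ens, lam_ema,
                      h_min, h_max, ema_active):
    """CE + ensemble KD + entropy-gated EMA KD for one batch of logits (B, C)."""
    ce = F.cross_entropy(z_s, y)                              # supervised term

    log_p_s = F.log_softmax(z_s / tau, dim=-1)
    p_ens = F.softmax(z_ens / tau, dim=-1)
    kd_ens = (tau ** 2) * F.kl_div(log_p_s, p_ens, reduction="batchmean")

    if not ema_active:                                        # Phase I: EMA term off
        return ce + lam_ens * kd_ens

    p_ema = F.softmax(z_ema / tau, dim=-1)
    # per-sample KL so the gate can weight each sample individually
    kd_ema = (tau ** 2) * (p_ema * (p_ema.clamp_min(1e-12).log() - log_p_s)).sum(-1)

    p = F.softmax(z_ema, dim=-1)                              # EMA teacher prediction
    H = -(p * p.clamp_min(1e-12).log()).sum(-1)               # predictive entropy H(x)
    H_hat = H / math.log(z_ema.size(-1))                      # normalize to [0, 1]
    g = ((H_hat - h_min) / (h_max - h_min)).clamp(0.0, 1.0)   # linear, clipped gate

    return ce + lam_ens * kd_ens + lam_ema * (g * kd_ema).mean()
```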
3.4. Two-Stage Cosine Annealing Schedule
MI-EEG training often exhibits noticeable late-stage oscillations. We adopt cosine annealing to reduce oscillation. Let $\eta_{\max}$ and $\eta_{\min}$ be the maximum and minimum learning rates. Within a stage of length $N_{\mathrm{stage}}$ epochs, the learning rate at local epoch $t$ is
$$\eta(t) = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{N_{\mathrm{stage}}}\right).$$
We use a two-stage schedule:
Phase I (length $N$ epochs): Train the student with the ensemble teacher only, i.e., $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{ens}}\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}}$.
Phase II (after restart, length $N$ epochs): Enable EMA guidance and train with $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{ens}}\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}} + \lambda_{\mathrm{ema}}\, g(x)\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}}$.
Figure 1 shows the two-stage training schedule with cosine annealing and EMA activation. This design is motivated by the fact that the EMA teacher is unreliable at the beginning of training (since it is derived from an untrained student); delaying EMA activation prevents distillation from early-stage noisy targets.
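A minimal sketch of the two-stage schedule, assuming 1-indexed epochs and two stages of equal length $N$:

```python
import math

def two_stage_cosine_lr(epoch, N, eta_min, eta_max):
    """Cosine-annealed learning rate with one restart at epoch N + 1.

    epoch is 1-indexed; Phase I covers epochs 1..N, Phase II covers N+1..2N.
    """
    t = epoch if epoch <= N else epoch - N          # local index within the stage
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / N))
```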
3.5. Training Procedure
Algorithms 1 and 2 summarize the complete training pipeline. Algorithm 1 constructs the ensemble teacher by training $K$ backbone teachers using a $K$-fold scheme on the training set $\mathcal{D}_{\mathrm{tr}}$. Specifically, $\mathcal{D}_{\mathrm{tr}}$ is partitioned into $K$ disjoint folds $\mathcal{F}_1, \dots, \mathcal{F}_K$. Each teacher $f_{\theta_k}$ is trained on $\mathcal{D}_{\mathrm{tr}} \setminus \mathcal{F}_k$ for the same number of epochs $E_{\mathrm{t}}$ using the cross-entropy (CE) loss only. After all teachers are trained, the ensemble teacher prediction $z^{\mathrm{ens}}(x)$ for an input $x$ is obtained by averaging teacher logits, which is later converted into a temperature-scaled soft target $p^{\mathrm{ens}}_{\tau}(x)$ during student training.
| Algorithm 1 Offline Training of K-fold Teachers |

Require: Training set $\mathcal{D}_{\mathrm{tr}}$; backbone network $f$; number of folds/teachers $K$; teacher epochs $E_{\mathrm{t}}$ (same for all teachers); teacher learning rate $\eta_{\mathrm{t}}$. Ensure: Trained teachers $\{\theta_k\}_{k=1}^{K}$.
1: Split $\mathcal{D}_{\mathrm{tr}}$ into $K$ disjoint folds $\mathcal{F}_1, \dots, \mathcal{F}_K$.
2: for $k = 1$ to $K$ do
3:   Initialize teacher parameters $\theta_k$.
4:   for $e = 1$ to $E_{\mathrm{t}}$ do
5:     for all batches $\mathcal{B} \subset \mathcal{D}_{\mathrm{tr}} \setminus \mathcal{F}_k$ do
6:       Compute the batch CE loss $\mathcal{L}_{\mathrm{CE}}$
7:       Backpropagate $\nabla_{\theta_k} \mathcal{L}_{\mathrm{CE}}$
8:       Update $\theta_k$ with learning rate $\eta_{\mathrm{t}}$
9:     end for
10:   end for
11:   Store trained teacher $\theta_k$.
12: end for
13: return $\{\theta_k\}_{k=1}^{K}$.
Algorithm 2 trains the deployable student under a two-stage cosine annealing schedule with a single restart. For epoch $e \in \{1, \dots, 2N\}$, the learning rate is computed by mapping $e$ to the local index $t$ and stage length $N_{\mathrm{stage}}$: $t = e$, $N_{\mathrm{stage}} = N$ for Phase I ($e \leq N$), and $t = e - N$, $N_{\mathrm{stage}} = N$ for Phase II ($e > N$). The rate $\eta$ is then computed by cosine annealing. For each minibatch, the student first computes logits $z^{s}(x)$ and the supervised CE loss $\mathcal{L}_{\mathrm{CE}}$. Next, the ensemble KD term is computed by forming $z^{\mathrm{ens}}(x)$ and $p^{\mathrm{ens}}_{\tau}(x)$ and then applying KL-based distillation. The EMA KD term is activated only in Phase II. At the beginning of Phase II ($e = N + 1$), the EMA parameters are initialized as $\tilde{\theta} \leftarrow \theta$, and for each subsequent minibatch, the EMA teacher produces logits $z^{\mathrm{ema}}(x)$ and soft target $p^{\mathrm{ema}}_{\tau}(x)$, which are used to compute the corresponding distillation loss $\mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}}$ and entropy gate $g(x)$. Finally, the student is updated by minimizing the total objective $\mathcal{L}$, and the EMA parameters are updated via $\tilde{\theta} \leftarrow \alpha \tilde{\theta} + (1 - \alpha)\, \theta$.
Figure 2 provides an overview of the proposed training framework that corresponds to Algorithm 2.
| Algorithm 2 Online Student Training with Dual-Teacher Distillation |

Require: Training set $\mathcal{D}_{\mathrm{tr}}$; backbone network $f$; trained teachers $\{\theta_k\}_{k=1}^{K}$; number of teachers $K$; student phase length $N$ (total epochs $2N$); temperature $\tau$; EMA coefficient $\alpha$; learning-rate bounds $\eta_{\min}, \eta_{\max}$; KD weights $\lambda_{\mathrm{ens}}$ and $\lambda_{\mathrm{ema}}$. Ensure: Deployable student parameters $\theta$.
1: Definitions:
2: $z^{s}(x) \triangleq f_{\theta}(x)$
3: $p^{s}_{\tau}(x) \triangleq \mathrm{softmax}(z^{s}(x)/\tau)$
4: $z^{\mathrm{ens}}(x) \triangleq \frac{1}{K}\sum_{k=1}^{K} z_k(x)$
5: $p^{\mathrm{ens}}_{\tau}(x) \triangleq \mathrm{softmax}(z^{\mathrm{ens}}(x)/\tau)$
6: $z_k(x)$ ≜ logits output of $f_{\theta_k}$ with input $x$
7: $g(x)$ ≜ entropy gate of sample $x$
8: Initialize student parameters $\theta$.
9: for $e = 1$ to $2N$ do
10:   if $e \leq N$ then
11:     $t \leftarrow e$, $N_{\mathrm{stage}} \leftarrow N$   ▹ Phase I segment
12:   else
13:     $t \leftarrow e - N$, $N_{\mathrm{stage}} \leftarrow N$   ▹ Phase II segment (after restart)
14:   end if
15:   $\eta \leftarrow \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos(\pi t / N_{\mathrm{stage}})\right)$
16:   if $e = N + 1$ then
17:     $\tilde{\theta} \leftarrow \theta$
18:   end if
19:   for all batches $\mathcal{B} \subseteq \mathcal{D}_{\mathrm{tr}}$ do
20:     for all $(x, y) \in \mathcal{B}$ do
21:       $z^{s}(x) \leftarrow f_{\theta}(x)$
22:       $\mathcal{L}_{\mathrm{CE}} \leftarrow -\log p^{s}_{y}(x)$
23:       $z_k(x) \leftarrow f_{\theta_k}(x),\ k = 1, \dots, K$
24:       $z^{\mathrm{ens}}(x) \leftarrow \frac{1}{K}\sum_{k=1}^{K} z_k(x)$   ▹ teacher logits average
25:       $p^{\mathrm{ens}}_{\tau}(x) \leftarrow \mathrm{softmax}(z^{\mathrm{ens}}(x)/\tau)$   ▹ soft target
26:       $\mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}} \leftarrow \tau^{2}\, \mathrm{KL}(p^{\mathrm{ens}}_{\tau}(x) \,\|\, p^{s}_{\tau}(x))$
27:       if $e > N$ then
28:         $z^{\mathrm{ema}}(x) \leftarrow f_{\tilde{\theta}}(x)$
29:         $p^{\mathrm{ema}}_{\tau}(x) \leftarrow \mathrm{softmax}(z^{\mathrm{ema}}(x)/\tau)$
30:         $\mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}} \leftarrow \tau^{2}\, \mathrm{KL}(p^{\mathrm{ema}}_{\tau}(x) \,\|\, p^{s}_{\tau}(x))$
31:         Compute $g(x)$ with $H(x)$
32:       else
33:         $g(x) \leftarrow 0$
34:       end if
35:       $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{ens}}\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}} + \lambda_{\mathrm{ema}}\, g(x)\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}}$
36:     end for
37:     Average $\mathcal{L}$ over the batch
38:     Backpropagate $\nabla_{\theta} \mathcal{L}$
39:     Update $\theta$ with learning rate $\eta$
40:     if $e > N$ then
41:       $\tilde{\theta} \leftarrow \alpha \tilde{\theta} + (1 - \alpha)\, \theta$
42:     end if
43:   end for
44: end for
45: return $\theta$
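For illustration, the sketch below combines the helper functions sketched in Section 3 (`ensemble_soft_targets`-style teacher averaging, `dual_teacher_loss`, `two_stage_cosine_lr`, and `ema_update`) into one Phase II epoch of Algorithm 2. All names (`loader`, `opt`, `teachers`, etc.) are placeholders, and logging/checkpointing are omitted.

```python
import torch

def train_epoch_phase2(student, ema_teacher, teachers, loader, opt, epoch, N,
                       tau, lam_ens, lam_ema, h_min, h_max, alpha,
                       eta_min, eta_max, device):
    if epoch == N + 1:                                   # restart: theta_tilde <- theta
        ema_teacher.load_state_dict(student.state_dict())
    lr = two_stage_cosine_lr(epoch, N, eta_min, eta_max)
    for group in opt.param_groups:                       # apply the annealed rate
        group["lr"] = lr
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():                            # teachers only supply targets
            z_ens = torch.stack([t(x) for t in teachers]).mean(0)
            z_ema = ema_teacher(x)
        z_s = student(x)
        loss = dual_teacher_loss(z_s, z_ens, z_ema, y, tau, lam_ens, lam_ema,
                                 h_min, h_max, ema_active=True)
        opt.zero_grad()
        loss.backward()
        opt.step()
        ema_update(student, ema_teacher, alpha)          # EMA update after each step
```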
4. Experiments
4.1. Experimental Settings
We conduct subject-dependent motor imagery (MI) classification experiments on two public benchmarks: BCI Competition IV-2a and BCI Competition IV-2b. For IV-2a, we follow the official protocol by using the training session for model training and the testing session for evaluation. For IV-2b, we use sessions 1–3 for training and sessions 4–5 for testing. We apply no additional signal preprocessing. Each trial is constructed by directly cropping the raw EEG around the cue onset using a 4.5 s segment (0.5 s pre-cue and 4 s post-cue), resulting in 1125 samples per trial at 250 Hz.
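As an illustration of the cropping step, the sketch below assumes a continuous recording `raw` of shape (channels, samples) and a list of cue-onset sample indices; at 250 Hz, the 0.5 s pre-cue and 4 s post-cue windows give 125 + 1000 = 1125 samples per trial.

```python
import numpy as np

FS = 250                                     # sampling rate (Hz)
PRE, POST = int(0.5 * FS), int(4.0 * FS)     # 125 pre-cue + 1000 post-cue samples

def crop_trials(raw, cue_onsets):
    """Crop fixed 4.5 s windows around each cue onset from continuous EEG.

    raw        : array of shape (channels, total_samples)
    cue_onsets : iterable of cue-onset sample indices
    """
    return np.stack([raw[:, c - PRE : c + POST] for c in cue_onsets])  # (trials, channels, 1125)
```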
We evaluate the algorithms on three representative deep MI backbones: EEGNet [18], ShallowConvNet [17], and ATCNet [23]. All three models are implemented following the original architectures/hyperparameters. For ATCNet, which contains multiple parallel branches, we aggregate the branch outputs by averaging the branch logits to obtain the final logit prediction.
For each dataset–backbone pair, we construct an ensemble teacher using the K-fold bagging scheme of Section 3.2.1. Each teacher is trained for $E_{\mathrm{t}}$ epochs using cross-entropy (CE) with a fixed learning rate of 0.001. All models are optimized using AdamW (weight decay 0.009) with batch size 64. The distillation hyperparameters (temperature $\tau$, KD weights $\lambda_{\mathrm{ens}}$ and $\lambda_{\mathrm{ema}}$, entropy-gate thresholds $h_{\min}$ and $h_{\max}$, and EMA coefficient $\alpha$) are fixed across datasets/backbones. Following our protocol, we select the model from the last epoch (no early stopping, and no validation set is used). We do not use a separate validation set because the number of labeled trials per subject is limited; holding out a validation split substantially reduces the effective training data and noticeably degrades performance.
Performance is measured by classification accuracy. Each experiment is run three times, and we report the mean accuracy over the three runs. All experiments are implemented in PyTorch 2.0 with Python 3.9 and executed on an NVIDIA RTX 4090 GPU.
4.2. Main Results
We report the subject-dependent classification accuracy on BCI Competition IV-2a and IV-2b using three backbones (EEGNet, ShallowConvNet, and ATCNet). We compare (i) the original backbone trained with the standard cross-entropy loss for 750 epochs, (ii) the bagging ensemble of K backbone models, and (iii) the proposed method, which trains a single backbone student to approach the ensemble performance.
Table 1 and Table 2 show the experimental results. Subject-wise distributions are provided in Figure 3 and Figure 4. Across both datasets and all backbones, the ensemble baseline substantially improves over the original single model, confirming that ensembling reduces variance and enhances robustness in MI-EEG classification. Our method substantially closes the gap between the original model and the ensemble teacher and in most cases exceeds the ensemble performance while preserving the inference cost of a single backbone. These results support the effectiveness of combining a high-quality ensemble teacher with entropy-based guidance for training compact yet accurate MI classifiers.
4.3. Accuracy–Latency and Accuracy–Memory Trade-Off
To evaluate the deployment-oriented efficiency, we visualize the trade-off between the classification accuracy and inference cost. For each dataset (IV-2a and IV-2b), we plot a scatter diagram, where the x-axis is accuracy, and the y-axis is the average per-trial inference latency or memory. Each point corresponds to one model variant, including the following: (i) original single backbone models (EEGNet, ShallowConvNet, ATCNet), (ii) the corresponding ensemble models formed by K backbones, and (iii) the proposed method (Ours) that distills ensemble knowledge into a single backbone. We measure the inference latency and GPU memory usage on a single NVIDIA RTX 4090 GPU. The batch size is set to 1 to reflect the online BCI setting where trials arrive sequentially.
Figure 5 and Figure 6 visualize the accuracy–latency trade-off across backbones and training strategies, while Figure 7 and Figure 8 visualize the accuracy–memory trade-off. On IV-2a, ATCNet achieves the highest accuracy among the original single models, but it also incurs substantially higher inference latency and memory than EEGNet and ShallowConvNet. Notably, even as a single model, ATCNet already outperforms the ensemble baselines built upon EEGNet and ShallowConvNet, indicating that improving the backbone architecture is often the most effective way to boost performance when cost is not strictly constrained. However, ensembling remains unattractive for deployment due to its amplified cost. In contrast, our method consistently narrows the gap between a single student and its ensemble teacher while keeping the inference cost unchanged, making it more suitable for cost-sensitive scenarios. This effect is particularly valuable for strong backbones such as ATCNet, where further gains from architectural improvement are increasingly difficult, and using an ensemble teacher provides an effective way to inject stronger supervision.
4.4. Comparison to Methods in the Literature
To further evaluate the proposed method, we compare against three representative distillation paradigms widely used in deep learning: Deep Mutual Learning [
26], Born-Again Networks [
27], and Decoupled Knowledge Distillation [
28], which cover common teacher–student distillation strategies. All methods are evaluated under the same subject-dependent protocol, datasets, backbones, and environment as in
Section 4.1. The results on BCI Competition IV-2a and IV-2b are summarized in
Table 3 and
Table 4, respectively.
Deep Mutual Learning (DML): DML replaces the standard teacher–student transfer with collaborative learning among multiple peer models trained simultaneously. Each peer is optimized by the standard cross-entropy loss and a peer-wise KL mimicry loss that matches its predictive distribution to those of the other peers. In our implementation, we train the peer models jointly using the peer-wise KL formulation with a fixed temperature and KL weight. Following the common practice for obtaining a single deployable model, we report the performance of the first peer network as the final result.
Born-Again Networks (BAN): BAN performs self-distillation across generations. A teacher is first trained using standard supervision; then, a new student with the same architecture is trained using a combination of cross-entropy and distillation from the previous-generation teacher. The process can be repeated to obtain progressively improved students. We train a Generation-0 model with CE for 750 epochs and then iteratively train each subsequent generation for 1500 epochs using CE + KD with a fixed KD temperature and KD weight. We repeat this procedure for four generations and report Generation-4 as the final model; we stop at four generations because the performance saturates beyond that point.
Decoupled Knowledge Distillation (DKD): DKD reformulates the classical logit-based KD into two complementary terms, target-class KD (TCKD) and non-target-class KD (NCKD), and uses independent weights to balance the two components. In our comparison, DKD uses the same ensemble teacher as our method (an ensemble of K backbones), while replacing the conventional KD loss with DKD; the TCKD and NCKD weights and the temperature are selected according to the recommended configuration from the original DKD paper.
Table 3 and Table 4 show that our method achieves the best average performance on both datasets, consistently outperforming DML, BAN, and DKD across the three backbones. Notably, the relative advantages of the literature methods vary with the backbone strength. In contrast, our method yields the most robust improvements.
4.5. Ablation Analysis
We conduct ablation studies to isolate the contribution of each component in the proposed training pipeline. All ablations follow the same subject-dependent protocol, datasets, backbones (EEGNet, ShallowConvNet, ATCNet), and environment as the full method. The following ablations are compared:
(A1) Ours w/o Ensemble KD (EMA-only KD). We remove the ensemble distillation term and keep only the EMA teacher guidance. The loss reduces to $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{ema}}\, g(x)\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}}$, where the EMA teacher is initialized and activated at epoch 501 (Phase II) as in the full method.
(A2) Ours w/o EMA KD (Ensemble-only KD). We disable the entropy-gated EMA distillation branch and retain only ensemble distillation. The loss becomes $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{ens}}\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}}$ throughout training.
(A3) Ours w/o Cosine Annealing. We replace the two-stage cosine schedule with a fixed learning rate of 0.001 for the student. The EMA teacher is still activated at epoch 501, ensuring that only the learning-rate schedule is changed.
(A4) Ours w/o entropy gate. We remove the entropy gate while preserving the EMA KD. The loss becomes $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{ens}}\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ens}} + \lambda_{\mathrm{ema}}\, \mathcal{L}_{\mathrm{KD}}^{\mathrm{ema}}$ (i.e., $g(x) \equiv 1$) once the EMA teacher is activated.
The results are summarized in Table 5 and Table 6. Overall, each component contributes positively, with the full method consistently achieving the best performance across datasets and backbones.
4.6. Predictive Entropy Analysis
This section provides an entropy-based validation of our design. Motivated by an information theoretic view of training instability, our method is explicitly designed to (i) stabilize the teacher signal via entropy-gated EMA filtering and thereby reduce the student’s predictive noise and (ii) suppress oscillations near convergence via the two-stage cosine schedule. To verify that these objectives are indeed achieved in practice, we analyze the epoch-wise dynamics of predictive entropy on a representative configuration: ATCNet on BCI Competition IV-2a, Subject 2, using a single run. We compare our method (EMA+Cosine) with two ablations that isolate each component: A2 (NoEMA+Cosine) to evaluate the effect of EMA filtering on entropy level and A3 (EMA+NoCosine) to evaluate the effect of cosine annealing on entropy fluctuation.
Given the predicted class probability vector $p(x) = (p_1(x), \dots, p_C(x))$ for a test sample $x$, we compute the predictive entropy as
$$H(x) = -\sum_{c=1}^{C} p_c(x) \log p_c(x),$$
where larger entropy indicates higher uncertainty, and smaller entropy indicates a more peaked predictive distribution. We conduct the following three experiments:
- (1)
Entropy trajectories for correct vs. incorrect predictions.
For each epoch $e$, we evaluate the student model on the test set and partition test samples into correctly classified and misclassified subsets, denoted by $\mathcal{S}_e^{\mathrm{corr}}$ and $\mathcal{S}_e^{\mathrm{err}}$. We then compute the mean predictive entropy for each subset:
$$\bar{H}_e^{\mathrm{corr}} = \frac{1}{|\mathcal{S}_e^{\mathrm{corr}}|} \sum_{x \in \mathcal{S}_e^{\mathrm{corr}}} H_e(x), \qquad \bar{H}_e^{\mathrm{err}} = \frac{1}{|\mathcal{S}_e^{\mathrm{err}}|} \sum_{x \in \mathcal{S}_e^{\mathrm{err}}} H_e(x).$$
We visualize $\bar{H}_e^{\mathrm{corr}}$ and $\bar{H}_e^{\mathrm{err}}$ as two curves across epochs (Figure 9 and Figure 10). A clear trend emerges after Phase II begins (EMA enabled): compared with A2, both the correct-sample entropy and the wrong-sample entropy are consistently lower under the EMA-enabled settings (A3 and Ours).
The EMA teacher can be viewed as a low-pass filter in parameter space, which yields temporally smoothed teacher logits and more stable soft targets. Logit-level instability can be modeled as non-common-mode perturbations that change relative logit differences (e.g., per-class additive noise), which tend to flatten the predictive distribution and increase the predictive entropy on average. By distilling from the EMA teacher, the student receives a less noisy guidance signal, which stabilizes the student logits. Consequently, the overall predictive entropy on the test set decreases after EMA activation, as observed in Figure 9 and Figure 10. The entropy decrease across datasets and backbones is shown in Table 7.
- (2)
Entropy fluctuation over a sliding window.
In the second experiment, we quantify how much the model’s prediction fluctuates across epochs. For each test sample $x$, we compute a sliding-window variance of its predictive entropy over the current epoch and the previous 19 epochs (window size $W = 20$):
$$V_e(x) = \mathrm{Var}\!\left(\{H_{e'}(x)\}_{e' = e - W + 1}^{e}\right).$$
We then average this quantity over all test samples to obtain an epoch-wise entropy fluctuation score:
$$\bar{V}_e = \frac{1}{|\mathcal{D}_{\mathrm{te}}|} \sum_{x \in \mathcal{D}_{\mathrm{te}}} V_e(x),$$
where $\mathcal{D}_{\mathrm{te}}$ denotes the test set. We plot $\bar{V}_e$ across epochs to visualize the stability of the predictive entropy during training.
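Both entropy statistics are straightforward to compute from saved per-epoch test probabilities; the sketch below assumes an array of per-epoch, per-sample entropies and uses the same 20-epoch window.

```python
import numpy as np

def predictive_entropy(probs):
    """H(x) = -sum_c p_c log p_c for each row of probs, shape (num_samples, C)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_fluctuation(H_per_epoch, window=20):
    """Epoch-wise mean sliding-window variance of per-sample entropy.

    H_per_epoch : array of shape (num_epochs, num_samples)
    Returns one score per epoch e >= window - 1 (0-indexed).
    """
    num_epochs, _ = H_per_epoch.shape
    scores = []
    for e in range(window - 1, num_epochs):
        win = H_per_epoch[e - window + 1 : e + 1]   # current + previous 19 epochs
        scores.append(win.var(axis=0).mean())       # per-sample variance, then average
    return np.asarray(scores)
```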
Figure 11 reports the results. In the late stage of training, A3 exhibits noticeably larger fluctuation than both Ours and A2, which enable the cosine schedule. This indicates that the cosine annealing schedule plays a dominant role in suppressing late-stage oscillations. Practically, reduced fluctuation makes the final checkpoint selection more reliable, because the model behavior near convergence becomes more stable across epochs. The entropy variance decrease across datasets and backbones is shown in Table 8.
- (3)
Correlation analysis between entropy decrease and accuracy increase.
To quantify the relationship between entropy decrease and performance change across subjects, we perform a correlation analysis for each dataset–backbone setting.
For each dataset and backbone, we treat each subject $s$ as one sample point and record the accuracy $\mathrm{Acc}_s^{(m)}$, the mean entropy across test samples $\bar{H}_s^{(m)}$, and the mean entropy variance across test samples $\bar{V}_s^{(m)}$ under method $m$ at the end of training. We define the subject-wise changes relative to a baseline method $b$ as
$$\Delta \mathrm{Acc}_s = \mathrm{Acc}_s^{(m)} - \mathrm{Acc}_s^{(b)}, \qquad \Delta \bar{H}_s = \bar{H}_s^{(b)} - \bar{H}_s^{(m)}, \qquad \Delta \bar{V}_s = \bar{V}_s^{(b)} - \bar{V}_s^{(m)}.$$
We then compute the Pearson correlation between $\Delta \mathrm{Acc}_s$ and $\Delta \bar{H}_s$ (and likewise between $\Delta \mathrm{Acc}_s$ and $\Delta \bar{V}_s$).
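A small sketch of this subject-level analysis, assuming per-subject accuracy and mean-entropy arrays for the compared method and the baseline (`scipy.stats.pearsonr` returns the correlation coefficient and its p-value):

```python
import numpy as np
from scipy.stats import pearsonr

def entropy_accuracy_correlation(acc_m, acc_b, H_m, H_b):
    """Pearson correlation between per-subject entropy decrease and accuracy gain.

    Each argument is an array with one value per subject.
    """
    d_acc = np.asarray(acc_m) - np.asarray(acc_b)   # accuracy change (method - baseline)
    d_H = np.asarray(H_b) - np.asarray(H_m)         # entropy decrease (baseline - method)
    r, p = pearsonr(d_H, d_acc)
    return r, p
```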
We report the resulting correlations in Table 9. The results show a positive correlation in general.
5. Discussion
This work targets a practical tension in MI-EEG decoding: achieving ensemble-level performance while keeping the single-model inference cost required for real-time or edge BCI deployment. Across two public benchmarks and three representative backbones, the proposed entropy-based dual-teacher distillation consistently improves a deployable single-model student and, in most cases, approaches or exceeds the corresponding ensemble teacher.
We interpret MI-EEG training instability from an information-theoretic perspective, with predictive entropy serving as a measurable proxy of uncertainty in the predictive distribution. Under this view, the offline ensemble teacher primarily improves the quality of soft supervision, while the EMA teacher plays a complementary role as a denoising mechanism. Due to the low signal-to-noise ratio and limited per-subject data, MI training is often sensitive to random initialization and optimization noise, and the resulting logits may exhibit non-common-mode perturbations that change relative logit differences across classes. Such perturbations tend to flatten the predictive distribution, leading to higher predictive entropy. Distilling from an EMA teacher therefore transfers a more stable supervisory signal to the student, which is reflected as a lower mean predictive entropy on the test set. To further connect this stability proxy to performance, we additionally report correlation studies between entropy-based changes and accuracy changes, showing that entropy reduction captures improved stability even when accuracy gains vary across subjects due to strong inter-subject variability. It is important to note that the role of entropy in this work is different from probability calibration: we use predictive entropy primarily as a proxy for logit-level noise under scarce and noisy MI data, and our entropy gate relies on predictive entropy to distinguish high-noise vs. low-noise samples, rather than requiring well-calibrated probabilities.
The entropy-gated activation further improves robustness by allocating EMA guidance adaptively across samples. We increase the weights of samples with higher entropy and decrease the weights of low-entropy samples (easy/confident cases) to avoid redundant regularization. This weighting concentrates the stabilization effect where it is most needed, while preserving discriminative learning on easy samples. Complementarily, the two-stage cosine annealing schedule suppresses late-stage oscillations, which manifests as reduced entropy fluctuation near convergence, making checkpoint selection more reliable.
The results suggest a simple “train heavy, infer light” recipe for MI-EEG: use a strong but offline ensemble teacher to provide high-quality supervision, and stabilize the student with entropy-based mechanisms to obtain a deployable single model with ensemble-like performance. This is particularly useful for strong backbones (e.g., ATCNet), where further architectural modifications may yield diminishing returns; in such cases, an ensemble teacher becomes a practical way to strengthen supervision. More broadly, the proposed framework is backbone-agnostic and can be integrated into existing MI-EEG pipelines with minimal changes.
6. Conclusions
This paper presented a dual-teacher distillation framework for efficient motor-imagery EEG classification under practical latency constraints. The proposed approach distills knowledge from an offline ensemble teacher and introduces an entropy-gated EMA teacher that acts as a low-pass filter on parameters to produce denoised guidance. Once activated in the second stage of cosine annealing, the EMA-guided distillation transfers stability to the student, resulting in a denoised predictive distribution. Moreover, the entropy-gated weighting modulates the EMA KD term at the sample level, emphasizing uncertain samples and deemphasizing easy samples to focus the denoising effect where it is most beneficial. Complementarily, the two-stage cosine annealing schedule reduces late-stage fluctuation, making convergence behavior and checkpoint selection more reliable. Experiments on BCI Competition IV-2a and IV-2b with three representative backbones demonstrated that the proposed method consistently closes the performance gap to ensembles, while avoiding the inference-time overhead incurred by multi-member ensembles.
Future work will extend this framework to subject-independent settings and explore richer distillation signals (e.g., intermediate network representations) to further improve robustness in real-world BCI deployments.