Author Contributions
Conceptualization, C.Y., I.L. and Y.M.; data curation, C.Y. and I.L.; formal analysis, C.Y., I.L., K.E.M. and I.O.; methodology, C.Y., I.L., K.E.M. and Y.M.; project administration, C.Y., I.L., K.E.M., Y.M. and I.O.; supervision, Y.M., K.E.M. and I.O.; validation, C.Y., I.L., K.E.M., I.O. and Y.M.; visualization, C.Y. and I.L.; writing—original draft, C.Y. and I.L.; writing—review and editing, C.Y., I.L., Y.M., K.E.M. and I.O. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Frame-wise 2D CNN used for spatial feature extraction. Each RGB frame is processed independently through convolution, normalization, activation, and pooling, followed by a projection to the embedding .
Figure 2.
LSTM memory cell. The input, forget, and output gates regulate information flow through the cell state .
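For reference, the gate mechanism summarized in this caption corresponds to the standard LSTM update, which can be written as below. The symbols ($x_t$ for the frame embedding, $h_t$ for the hidden state, $c_t$ for the cell state, and $W$, $U$, $b$ for the parameters) follow the common textbook convention and are an assumption here; the paper's own notation may differ.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```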
Figure 3.
BiLSTM temporal encoder. Forward and backward LSTMs process the same clip in opposite directions, and their hidden states are concatenated at each time step.
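The concatenation described in this caption can be stated compactly as follows. The arrow notation and the per-direction width $d$ are conventional choices assumed for illustration, not symbols taken from the paper.

```latex
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x_t, \overrightarrow{h}_{t-1}\bigr), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x_t, \overleftarrow{h}_{t+1}\bigr), \qquad
h_t = \bigl[\,\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\,\bigr] \in \mathbb{R}^{2d}
```

With $d$ units per direction, each time step therefore carries a $2d$-dimensional state that sees both past and future frames of the clip.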
Figure 5.
Flowchart of the proposed hybrid architecture. Each branch performs frame-wise spatial encoding with a 2D CNN, followed by BiLSTM-based temporal aggregation and GRU-based refinement. The three branch outputs are concatenated and classified by a 4-way softmax head.
Figure 6.
Training/validation accuracy and loss across epochs.
Figure 7.
Per-class precision, recall, and F1 for baseline and compressed models.
Figure 8.
Row-normalized confusion matrices (%). Left: baseline FP32; right: pruned + INT8.
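Row normalization as used in this figure can be sketched in a few lines: each row (true class) of the raw count matrix is divided by its row total and scaled to percent. The count matrix below is hypothetical and only illustrates the computation; it is not the paper's data.

```python
def row_normalize(confusion):
    """Convert raw counts to row percentages (each non-empty row sums to 100)."""
    normalized = []
    for row in confusion:
        total = sum(row)
        # Guard against classes that have no samples in the split.
        normalized.append([100.0 * c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical 4-class counts (rows = true class, columns = predicted class).
counts = [
    [50, 0, 0, 0],
    [1, 98, 1, 0],
    [0, 2, 38, 0],
    [0, 0, 0, 10],
]
percent = row_normalize(counts)
```

Reading the matrix row-wise this way makes the diagonal entry of each row equal to that class's recall in percent.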
Figure 9.
One-vs-rest ROC curves for all classes (macro-AUC baseline, compressed).
Figure 10.
Reliability diagram; dashed line denotes perfect calibration. Temperature scaling () corrected mild overconfidence post-compression.
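A minimal sketch of the two ingredients behind a reliability diagram with temperature scaling is given below: a temperature-scaled softmax and the expected calibration error (ECE) over equal-width confidence bins. The bin count of 10, the temperature of 1.5, and the example logits are assumptions chosen for illustration; the paper's fitted temperature is not reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width bins (lo, hi]."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical logits: dividing by T > 1 lowers the top-class confidence
# without changing which class is predicted.
logits = [4.0, 1.0, 0.0, 0.0]
p_raw = softmax(logits)
p_scaled = softmax(logits, temperature=1.5)
```

Because dividing logits by a positive constant preserves their ordering, the predicted class and hence accuracy are unchanged; only the confidence values move.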
Figure 11.
Parameter count and on-disk size: FP32 vs. pruned + INT8.
Figure 12.
End-to-end CPU latency per 30-frame clip for the full-precision (FP32) and compressed INT8 models (batch size 1, T = 30).
Figure 13.
CPU throughput in clips per second for the FP32 and INT8 models under the same setting as Figure 12.
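Under the batch-size-1 setting, latency and throughput are related by simple arithmetic. The sketch below derives clips per second from the CPU latencies listed in Table 13 (38.5 ms for FP32 and 16.7 ms for pruned + INT8), assuming clips are processed strictly one after another with no pipelining or parallelism; measured throughput may therefore differ from these derived values.

```python
def clips_per_second(latency_ms):
    """Sequential throughput implied by a per-clip latency in milliseconds."""
    return 1000.0 / latency_ms


fp32_latency_ms = 38.5  # Baseline (FP32), Table 13
int8_latency_ms = 16.7  # Pruning + INT8, Table 13

fp32_throughput = clips_per_second(fp32_latency_ms)  # about 26.0 clips/s
int8_throughput = clips_per_second(int8_latency_ms)  # about 59.9 clips/s
speedup = fp32_latency_ms / int8_latency_ms          # about 2.31x
```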
Table 1.
Verification checks used to ensure strict train/validation/test separation.
| Check | Purpose |
|---|---|
| Subject-ID disjointness | Ensures that no subject appears in more than one split. |
| Video-ID disjointness | Ensures that no video is fragmented across different splits. |
| Sequence-origin verification | Confirms that each temporal window inherits a unique subject/video origin from its own split. |
| Split-before-windowing rule | Prevents overlapping windows from being generated before the subject partition. |
| Training-only normalization statistics | Prevents validation/test information from influencing preprocessing. |
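The first two checks in this table can be sketched as a single disjointness test over split metadata. The record layout (dicts with `subject_id` and `video_id` keys) and the function name are hypothetical and chosen only for illustration; they are not taken from the authors' code.

```python
def is_disjoint(splits, key):
    """Return True when no value of `key` appears in more than one split."""
    owner_of = {}
    for split_name, records in splits.items():
        for record in records:
            # Remember the first split that used this ID; any later split
            # using the same ID is a leak.
            owner = owner_of.setdefault(record[key], split_name)
            if owner != split_name:
                return False
    return True


# Hypothetical metadata: one record per temporal window.
clean = {
    "train": [{"subject_id": "s01", "video_id": "v001"},
              {"subject_id": "s01", "video_id": "v002"}],
    "val": [{"subject_id": "s02", "video_id": "v003"}],
    "test": [{"subject_id": "s03", "video_id": "v004"}],
}
leaky = {
    "train": [{"subject_id": "s01", "video_id": "v001"}],
    "test": [{"subject_id": "s01", "video_id": "v005"}],
}
```

In the leaky example the video IDs are disjoint but subject `s01` appears in both splits, which is the case the subject-level check is meant to catch.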
Table 3.
Methodological comparison between common CNN–RNN hybrids and the proposed architecture.
| Architecture Family | Spatial Encoder | Temporal Encoder | Multi-Branch Fusion | Main Limitation/Distinction |
|---|---|---|---|---|
| CNN-only | 2D CNN | None | No | Captures spatial cues only; no explicit temporal modeling |
| CNN–LSTM | 2D CNN | LSTM | No | Uses only forward temporal modeling |
| CNN–BiLSTM | 2D CNN | BiLSTM | No | Uses bidirectional context but no gated refinement stage |
| CNN–GRU | 2D CNN | GRU | No | Lightweight temporal modeling but less contextual coverage than BiLSTM |
| 3D CNN/Conv3D | Spatiotemporal convolution | Implicit | No | Higher computational cost; less explicit temporal interpretability |
| Proposed model | 2D CNN | BiLSTM + GRU | Yes | Combines bidirectional context, gated refinement, and heterogeneous branch fusion |
Table 4.
Parameter configuration of Branch I ().
| Component | Units/Filters | Kernel Size | Return Sequence | Activation | Pool Size | Dropout |
|---|---|---|---|---|---|---|
| 2D CNN + max pooling | 128 | | – | ReLU | 2 | 0.2 |
| BiLSTM | 64 | – | True | tanh | – | 0.2 |
| GRU | 32 | – | True | tanh | – | 0.2 |
| Temporal pooling | – | – | – | – | 2 | – |
| Dense layer | 4 | – | – | Softmax | – | – |
| Total params | 119,312 |
Table 5.
Parameter configuration of Branch II ().
| Component | Units/Filters | Kernel Size | Return Sequence | Activation | Pool Size | Dropout |
|---|---|---|---|---|---|---|
| 2D CNN + max pooling | 256 | | – | ReLU | 2 | 0.3 |
| BiLSTM | 128 | – | True | tanh | – | 0.3 |
| GRU | 64 | – | True | tanh | – | 0.3 |
| Temporal pooling | – | – | – | – | 2 | – |
| Dense layer | 4 | – | – | Softmax | – | – |
| Total params | 472,048 |
Table 6.
Parameter configuration of Branch III ().
| Component | Units/Filters | Kernel Size | Return Sequence | Activation | Pool Size | Dropout |
|---|---|---|---|---|---|---|
| 2D CNN + max pooling | 128 | | – | ReLU | 2 | 0.4 |
| BiLSTM | 128 | – | True | tanh | – | 0.4 |
| GRU | 64 | – | True | tanh | – | 0.4 |
| Temporal pooling | – | – | – | – | 2 | – |
| Dense layer | 4 | – | – | Softmax | – | – |
| Total params | 339,312 |
Table 7.
Computing environment used for all experiments.
| Component | Specification |
|---|---|
| Operating system | Ubuntu 22.04 LTS |
| Python/PyTorch/CUDA | Python 3.10 / PyTorch 2.2 / CUDA 12.1 |
| Inference runtime | ONNX Runtime 1.17 (CPU: MKL-DNN; GPU: CUDA EP) |
| Hardware | 1× NVIDIA RTX 3080 (10 GB), Intel Xeon-class CPU, 64 GB RAM |
| Determinism | Fixed seeds {13, 29, 47}; CuDNN deterministic kernels; hash seed fixed |
Table 8.
Training and compression hyperparameters (default values).
| Aspect | Setting | Value(s) | Notes |
|---|---|---|---|
| Input representation | Frames per clip | (5 fps) | , RGB |
| Optimizer | AdamW | lr = , wd = | Cosine decay; warm-up 5 epochs |
| Regularization | Dropout | {0.2, 0.3, 0.4} by branch | |
| Stabilization | Grad clip | norm | Applied each step |
| Batching | Batch size | 16 clips | Mixed precision (FP16) |
| Early stopping | Criterion | Val. macro-F1, patience 10 | Best checkpoint by macro-F1 |
| Loss | Class-weighted CE | | Weights normalized to |
| Pruning | Target sparsity | CNN: 30–50% channels; RNN: 20–40% units | Gradual schedule, mask frozen during FT |
| Quantization | Scheme | Conv/FC: static INT8; RNN: dynamic INT8 | Percentile calibration on 512 clips |
| QAT/KD | Triggers | Acc. drop pp | QAT 5 epochs; KD , |
| Evaluation | Seeds | 3 runs | Report |
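The optimizer row describes cosine decay with a 5-epoch warm-up. A minimal per-epoch sketch of such a schedule follows. The linear shape of the warm-up, the base learning rate of 1e-3, and the 50-epoch horizon are assumptions for illustration, since the corresponding values are not recoverable from the table.

```python
import math


def lr_at_epoch(epoch, total_epochs, base_lr, warmup_epochs=5, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    if epoch < warmup_epochs:
        # Ramp up in equal steps so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: base_lr and total_epochs are placeholders.
schedule = [lr_at_epoch(e, total_epochs=50, base_lr=1e-3) for e in range(50)]
```

In a PyTorch training loop the same shape is typically obtained by chaining a warm-up scheduler with a cosine scheduler; the plain function above only makes the arithmetic explicit.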
Table 9.
Repeated subject-independent cross-validation results on the DAiSEE attention classification task. Values are reported as the mean ± standard deviation across repeated subject-wise folds.
| Model | Accuracy (%) | Macro-F1 | QWK | Ordinal MAE |
|---|---|---|---|---|
| CNN-only (Tiny) | 97.42 ± 0.38 | 0.973 ± 0.004 | 0.972 ± 0.005 | 0.109 ± 0.010 |
| CNN + BiLSTM (no attention) | 98.71 ± 0.24 | 0.987 ± 0.003 | 0.986 ± 0.003 | 0.061 ± 0.007 |
| Proposed CNN–Attention–BiLSTM | 99.18 ± 0.16 | 0.991 ± 0.002 | 0.991 ± 0.002 | 0.044 ± 0.005 |
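The two ordinal metrics in this table, quadratic weighted kappa (QWK) and ordinal MAE, can be computed from integer class indices as sketched below. The mapping of the four attention levels to indices 0 to 3 in the order Very Low, Low, High, Very High is an assumption, and the toy labels are hypothetical.

```python
def ordinal_mae(y_true, y_pred):
    """Mean absolute distance between true and predicted class indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """Cohen's kappa with quadratic disagreement weights (i - j)^2 / (K - 1)^2."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    weighted_observed = weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two marginals.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n
    if weighted_expected == 0.0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical labels: one "Very High" clip predicted as "High".
y_true = [0, 1, 2, 3]
y_pred = [0, 1, 2, 2]
mae = ordinal_mae(y_true, y_pred)               # 0.25
qwk = quadratic_weighted_kappa(y_true, y_pred)  # 0.875
```

Both metrics penalize a prediction more the further it lands from the true level, which plain accuracy does not capture.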
Table 10.
Overall performance on DAiSEE test split ().
| Model | Acc. (%) | Macro-P | Macro-R | Macro-F1 | QWK ↑ | MAEord ↓ | Macro-AUC | Brier ↓ | ECE ↓ | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (FP32) | | | | | | | | | | |
| Compressed (Pruned + INT8) | | | | | | | | | | |
Table 11.
Per-class precision, recall, and F1 (test split).
| Class | Baseline (FP32) Precision | Baseline (FP32) Recall | Baseline (FP32) F1 | Compressed (Pruned + INT8) Precision | Compressed (Pruned + INT8) Recall | Compressed (Pruned + INT8) F1 |
|---|---|---|---|---|---|---|
| Very Low | 0.998 | 0.997 | 0.998 | 0.996 | 0.995 | 0.996 |
| Low | 0.997 | 0.997 | 0.997 | 0.994 | 0.994 | 0.994 |
| High | 0.998 | 0.999 | 0.998 | 0.996 | 0.997 | 0.996 |
| Very High | 0.999 | 0.999 | 0.999 | 0.997 | 0.998 | 0.997 |
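Per-class precision, recall, and F1 of the kind reported in this table, together with their macro average, follow directly from a confusion matrix. The sketch below shows the computation on a hypothetical two-class matrix; it does not use the paper's data.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, F1) per class; rows = true, columns = predicted."""
    k = len(confusion)
    metrics = []
    for c in range(k):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(k)) - tp  # predicted c, true other
        fn = sum(confusion[c]) - tp                       # true c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    metrics = per_class_metrics(confusion)
    return sum(f1 for _, _, f1 in metrics) / len(metrics)


# Hypothetical counts for illustration.
toy = [[8, 2],
       [1, 9]]
toy_metrics = per_class_metrics(toy)
toy_macro_f1 = macro_f1(toy)
```

Macro averaging weights every class equally regardless of its support, which matters when class counts are imbalanced.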
Table 12.
Separating the effect of preprocessing from the effect of the proposed architecture on DAiSEE. In Block A, the backbone is fixed to a single-branch CNN–BiLSTM–GRU, and preprocessing is progressively enabled. In Block B, the full preprocessing pipeline is fixed, and the model family is varied.
| Block | Setting | Acc. [%] | Macro-F1 | QWK | Ordinal MAE |
|---|---|---|---|---|---|
| A. Fixed backbone: single-branch CNN–BiLSTM–GRU |
| A1 | Naive clip construction (first T frames, direct resize only) | 97.20 | 0.968 | 0.969 | 0.103 |
| A2 | + Uniform frame sampling + fixed clip length () | 98.10 | 0.979 | 0.980 | 0.073 |
| A3 | + Per-channel normalization after resizing | 98.80 | 0.987 | 0.988 | 0.055 |
| A4 | + Adaptive brightness normalization + repetition padding (full preprocessing) | 99.30 | 0.994 | 0.993 | 0.041 |
| B. Fixed preprocessing: full pipeline used for all rows below |
| B1 | Frame-CNN only (clip prediction by posterior averaging) | 92.80 | 0.921 | 0.915 | 0.248 |
| B2 | CNN + LSTM | 95.90 | 0.955 | 0.952 | 0.131 |
| B3 | CNN + BiLSTM | 98.70 | 0.986 | 0.985 | 0.058 |
| B4 | CNN–BiLSTM–GRU, single branch | 99.30 | 0.994 | 0.993 | 0.041 |
| B5 | Proposed three-branch CNN–BiLSTM–GRU | 99.86 | 0.998 | 0.998 | 0.030 |
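The clip-construction steps named in Block A (uniform frame sampling to a fixed clip length, and repetition padding for short videos) can be sketched as follows. The clip length of 30 matches the T = 30 setting quoted for the latency figure; padding by repeating the last frame is an assumption, since the table does not say which frames are repeated.

```python
def build_clip(frames, clip_len=30):
    """Return exactly clip_len frames via uniform sampling or repetition padding."""
    n = len(frames)
    if n == 0:
        raise ValueError("cannot build a clip from an empty video")
    if n >= clip_len:
        # Evenly spaced indices that span the whole video.
        indices = [(i * n) // clip_len for i in range(clip_len)]
        return [frames[i] for i in indices]
    # Short video: keep every frame, then repeat the last one (assumed policy).
    return list(frames) + [frames[-1]] * (clip_len - n)


# Integers stand in for decoded frames.
long_clip = build_clip(list(range(90)), clip_len=30)   # every third frame
short_clip = build_clip([0, 1, 2], clip_len=5)         # [0, 1, 2, 2, 2]
```

Compared with taking the first T frames (row A1), uniform sampling covers the full duration of the video at the same input size.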
Table 13.
Ablations on compression components (test split).
| Variant | Macro-F1 | QWK | Brier ↓ | Params (M) | Size (MB) | CPU Latency (ms) |
|---|---|---|---|---|---|---|
| Baseline (FP32) | | | | 5.8 | 22.8 | 38.5 |
| Pruning only (40% CNN/25% RNN) | 0.997 | 0.997 | 0.007 | 3.2 | 12.7 | 27.4 |
| INT8 only (post-training) | 0.996 | 0.996 | 0.008 | 5.8 | 5.8 | 25.6 |
| Pruning + INT8 (ours) | 0.995 | 0.995 | 0.008 | 2.1 | 5.6 | 16.7 |
Table 14.
Sensitivity to sequence length (baseline).
| Frames (T) | 15 | 30 | 45 | 60 |
|---|---|---|---|---|
| Macro-F1 | 0.994 | 0.998 | 0.998 | 0.998 |
| MACs/clip (G) | 3.6 | 6.3 | 9.0 | 11.8 |
Table 15.
Ablation of key components of the proposed pipeline on the DAiSEE test split. We isolate the effect of the GRU refinement layer after the BiLSTM, the three-branch encoder versus a single-branch variant, temporal max pooling versus mean- and last-state pooling, and post hoc temperature scaling after pruning and INT8 quantization. The results show that the full three-branch BiLSTM + GRU model with temporal max pooling yields the best trade-off between accuracy, ordinal consistency (QWK, MAEord), and calibration (Brier, ECE), while temperature scaling significantly improves calibration of the compressed model without affecting its classification performance.
| Variant | Acc. [%] | Macro-F1 | QWK | MAEord | Brier | ECE | Params (M) |
|---|---|---|---|---|---|---|---|
| Full, 3 branches, BiLSTM + GRU, max-pool (FP32) | 99.86 | 0.998 | 0.998 | 0.03 | 0.0060 | 0.012 | 5.8 |
| w/o GRU, 3 branches, BiLSTM only (FP32) | 99.41 | 0.996 | 0.996 | 0.05 | 0.0074 | 0.017 | 5.2 |
| Single branch, BiLSTM + GRU, max-pool (FP32) | 99.33 | 0.995 | 0.995 | 0.06 | 0.0078 | 0.019 | 3.1 |
| 3 branches, BiLSTM + GRU, mean-pool (FP32) | 99.58 | 0.997 | 0.997 | 0.04 | 0.0066 | 0.015 | 5.8 |
| 3 branches, BiLSTM + GRU, last-state (FP32) | 99.22 | 0.995 | 0.995 | 0.06 | 0.0079 | 0.019 | 5.8 |
| Pruned + INT8, 3 branches, BiLSTM + GRU, max-pool, no temp. | 99.52 | 0.995 | 0.995 | 0.04 | 0.0083 | 0.031 | 2.1 |
| Pruned + INT8, 3 branches, BiLSTM + GRU, max-pool, +temp. scaling | 99.52 | 0.995 | 0.995 | 0.04 | 0.0080 | 0.016 | 2.1 |
Table 16.
Stricter generalization validation on DAiSEE using subject-wise and robustness-oriented protocols.
| Protocol | Model | Accuracy (%) | Macro-F1 (%) | Remarks |
|---|---|---|---|---|
| Default held-out split | CNN-only (Tiny) | 98.70 | 98.60 | Main protocol |
| | CNN + BiLSTM (no attention) | 99.10 | 99.00 | Main protocol |
| | Proposed CNN–Attention–BiLSTM | 99.47 | 99.47 | Main protocol |
| 5-fold subject-wise CV | CNN-only (Tiny) | | | Grouped by subject |
| | CNN + BiLSTM (no attention) | | | Grouped by subject |
| | Proposed CNN–Attention–BiLSTM | | | Grouped by subject |
| Repeated subject-independent splits | CNN-only (Tiny) | | | 5 repeated runs |
| | CNN + BiLSTM (no attention) | | | 5 repeated runs |
| | Proposed CNN–Attention–BiLSTM | | 97.89 ± 0.41 | 5 repeated runs |
| Perturbed unseen-subject test | CNN-only (Tiny) | 94.86 | 94.30 | Brightness/blur/occlusion |
| | CNN + BiLSTM (no attention) | 96.01 | 95.58 | Brightness/blur/occlusion |
| | Proposed CNN–Attention–BiLSTM | 96.94 | 96.51 | Brightness/blur/occlusion |
Table 17.
Compression ablation of the proposed model: trade-off between predictive performance and deployment efficiency.
| Variant | Acc (%) | Macro-F1 (%) | Params (M) | Model Size (MB) | Peak RAM (KB) | Latency (ms) | CR |
|---|---|---|---|---|---|---|---|
| Full precision (FP32) | 99.47 | 99.47 | 0.28 | 1.12 | 220 | 20.0 | |
| INT8 PTQ | 99.02 | 99.02 | 0.28 | 0.28 | 160 | 12.0 | |
| INT8 QAT | 99.18 | 99.16 | 0.28 | 0.28 | 160 | 12.4 | |
| Structured pruning (∼30%) + FP32 | 99.21 | 99.18 | 0.20 | 0.80 | 190 | 17.0 | |
| Structured pruning (∼30%) + INT8 | 98.93 | 98.90 | 0.20 | 0.20 | 145 | 10.3 | |
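The CR column did not survive extraction. Assuming CR denotes the compression ratio defined as the FP32 model size divided by the variant's model size, the values implied by the Model Size column can be derived as below. These are computed here from the table and are not the authors' reported figures.

```python
# Model sizes in MB, copied from the "Model Size (MB)" column.
model_size_mb = {
    "Full precision (FP32)": 1.12,
    "INT8 PTQ": 0.28,
    "INT8 QAT": 0.28,
    "Structured pruning (~30%) + FP32": 0.80,
    "Structured pruning (~30%) + INT8": 0.20,
}

baseline_mb = model_size_mb["Full precision (FP32)"]

# Assumed definition: CR = FP32 size / variant size.
compression_ratio = {
    name: round(baseline_mb / size, 2) for name, size in model_size_mb.items()
}
# INT8 alone gives 4.0x, pruning alone 1.4x, and the combination 5.6x.
```

Under this definition the two techniques compose multiplicatively (4.0 × 1.4 = 5.6), as expected when quantization shrinks bytes per parameter and pruning shrinks the parameter count.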
Table 18.
Effect of different structured pruning criteria on the compressed (pruned + INT8) model on DAiSEE. All variants use the same sparsity levels and quantization settings.
| Pruning Criterion | Acc. [%] | Macro-F1 | QWK | Pruning Cost (Relative) |
|---|---|---|---|---|
| magnitude (baseline) | 99.44 | 0.994 | 0.994 | |
| First-order Taylor (ours) | 99.52 | 0.995 | 0.995 | |
| Simplified movement-style score | 99.54 | 0.996 | 0.996 | |
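To make the difference between the first two criteria concrete, the sketch below scores channels by weight magnitude and by a first-order Taylor estimate, then selects the lowest-scoring channels for removal. The L1 form of the magnitude score, the Taylor form |sum(w · grad)|, and all numbers are assumptions for illustration; the paper's exact scoring functions are not given in this table.

```python
def magnitude_scores(weights):
    """Per-channel L1 norm of the weights (assumed form of the magnitude criterion)."""
    return [sum(abs(w) for w in channel) for channel in weights]


def taylor_scores(weights, grads):
    """Per-channel |sum(w * dL/dw)|, a first-order estimate of the loss change."""
    return [abs(sum(w * g for w, g in zip(w_ch, g_ch)))
            for w_ch, g_ch in zip(weights, grads)]


def channels_to_prune(scores, sparsity):
    """Indices of the lowest-scoring channels at the requested sparsity."""
    n_prune = int(len(scores) * sparsity)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_prune])


# Hypothetical layer with 4 channels of 2 weights each, plus gradients.
weights = [[0.5, -0.5], [0.1, 0.1], [1.0, 0.0], [0.2, -0.3]]
grads = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.0], [1.0, 0.0]]

by_magnitude = channels_to_prune(magnitude_scores(weights), sparsity=0.5)
by_taylor = channels_to_prune(taylor_scores(weights, grads), sparsity=0.5)
```

In this toy case the magnitude criterion removes channels 1 and 3 (small weights), while the Taylor criterion removes channels 0 and 2 (large weights but little effect on the loss), which illustrates why the two can rank channels differently at equal sparsity.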
Table 19.
Comparison between the proposed architecture and lightweight baselines on the DAiSEE test split (subject-independent protocol).
| Model | Params (M) | MACs/Clip (G) | Acc. [%] | Macro-F1 | QWK | CPU Latency (ms) |
|---|---|---|---|---|---|---|
| MobileNetV2 + GRU (single stream) | 3.2 | 7.1 | 97.8 | 0.977 | 0.978 | 34.2 |
| CNN + TCN (single stream) | 2.9 | 5.5 | 97.1 | 0.971 | 0.973 | 28.6 |
| CNN–BiLSTM–GRU (single branch) | 3.1 | 4.9 | 99.3 | 0.995 | 0.995 | 30.1 |
| Proposed 3-branch CNN–BiLSTM–GRU (FP32) | 5.8 | 6.3 | 99.86 | 0.998 | 0.998 | 38.5 |
| Proposed 3-branch, pruned + INT8 | 2.1 | 3.2 | 99.52 | 0.995 | 0.995 | 16.7 |
Table 20.
Robustness to environmental factors (macro-F1).
| Condition | Clean | Low Light (↓20% Luminance) | Partial Occlusion (20% Area) | Yaw ±20° |
|---|---|---|---|---|
| Baseline (FP32) | 0.998 | 0.996 | 0.995 | 0.997 |
| Compressed (Pruned + INT8) | 0.995 | 0.991 | 0.989 | 0.993 |
Table 21.
Class distribution of DAiSEE clips in the subject-independent splits used in this work (attention dimension).
| Attention Level | Train | Validation | Test |
|---|---|---|---|
| Very Low | 1846 | 230 | 232 |
| Low | 2421 | 302 | 304 |
| High | 1610 | 187 | 189 |
| Very High | 1377 | 188 | 182 |
| Total | 7254 | 907 | 907 |
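The training counts above are moderately imbalanced, which is the usual motivation for a class-weighted cross-entropy loss. A minimal sketch of one common weighting, inverse frequency scaled so that a perfectly balanced dataset would give every class a weight of 1, is shown below. This particular normalization is an assumption and may differ from the one used in the paper.

```python
# Training-split clip counts copied from the table.
train_counts = {"Very Low": 1846, "Low": 2421, "High": 1610, "Very High": 1377}


def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c); equals 1.0 for every class when balanced."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


class_weights = inverse_frequency_weights(train_counts)
# The rarest class ("Very High") receives the largest weight and the most
# frequent class ("Low") the smallest.
```

With these weights, each class contributes the same total weight to the loss (N / K), so frequent classes no longer dominate the gradient.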
Table 22.
Cross-dataset generalization of the proposed model. For each dataset, the results are shown when the network is trained and evaluated on that dataset (fine-tuned) and when the feature extractor is trained only on DAiSEE and applied to the new dataset without additional training (zero-shot from DAiSEE).
| Dataset/Regime | Acc. [%] | Macro-F1 | QWK | Notes |
|---|---|---|---|---|
| DAiSEE (attention, 4 levels) |
| Fine-tuned (subject-independent) | 99.86 | 0.998 | 0.998 | Reference configuration |
| YawDD |
| Zero-shot from DAiSEE | 94.1 | 0.936 | 0.941 | DAiSEE-trained backbone, frozen |
| Fine-tuned on YawDD | 98.7 | 0.985 | 0.987 | Same architecture and training schedule |
| RLDD |
| Zero-shot from DAiSEE | 93.4 | 0.928 | 0.934 | Short clips sampled to T = 24 |
| Fine-tuned on RLDD | 98.2 | 0.981 | 0.984 | Shared CNN–BiLSTM–GRU backbone |
| BAUM-1 |
| Zero-shot from DAiSEE | 92.6 | 0.919 | 0.925 | Single-frame or short-sequence inputs |
| Fine-tuned on BAUM-1 | 97.9 | 0.978 | 0.981 | Same Tiny architecture, adapted head |
Table 23.
Comprehensive benchmark on the DAiSEE attention-label task under identical experimental conditions. All models use the same subject-independent split, preprocessing pipeline, training configuration, and evaluation protocol.
| Model | Backbone Type | Accuracy (%) | Macro-Precision (%) | Macro-Recall (%) | Macro-F1 (%) | Params (M) |
|---|---|---|---|---|---|---|
| VGG16 | 2D CNN | 93.84 | 93.71 | 93.52 | 93.58 | 14.72 |
| ResNet50 | 2D CNN | 95.12 | 95.04 | 94.86 | 94.93 | 23.51 |
| MobileNetV2 | Lightweight 2D CNN | 95.86 | 95.73 | 95.61 | 95.66 | 3.41 |
| CNN-only | Custom 2D CNN | 96.94 | 96.82 | 96.75 | 96.77 | 1.84 |
| CNN + LSTM | 2D CNN + recurrent | 97.81 | 97.70 | 97.61 | 97.64 | 2.36 |
| CNN + GRU | 2D CNN + recurrent | 98.07 | 97.98 | 97.90 | 97.93 | 2.21 |
| CNN + BiLSTM | 2D CNN + recurrent | 98.63 | 98.57 | 98.49 | 98.52 | 2.58 |
| CNN + BiLSTM + Attention | 2D CNN + recurrent | 99.02 | 98.96 | 98.91 | 98.94 | 2.66 |
| Proposed CNN–BiLSTM–GRU | Hybrid 2D CNN + dual recurrent | 99.31 | 99.26 | 99.21 | 99.23 | 2.74 |
Table 24.
Comparison with representative approaches for attention, engagement and affect analysis. Rows are organized into three groups: vision-only methods evaluated on DAiSEE, vision-only methods on other datasets, and multimodal or non-visual approaches. DAiSEE accuracy is highlighted when available and corresponds to video-based, vision-only methods. Methods using different modalities, datasets or label spaces are reported for context rather than for direct numerical ranking. Dashes indicate metrics not reported by the original papers.
| Method and Ref. | Modality | Backbone/Temporal | Dataset(s) | Label Space | DAiSEE Acc. (%) | Other Metric | Real-Time | Params | Notes |
|---|---|---|---|---|---|---|---|---|---|
| Vision-only, DAiSEE (clip-level protocols) |
| [30] | RGB video | Attn-GCN + BiLSTM | DAiSEE, YawDD, BAUM-1, RLDD | 6 affective states | 56.17 | 65.35 (curated), 99.20 (YawDD) | – | – | Correlation with scores r = 0.64 |
| [35] | RGB video | Masked Autoencoder (SSL) | DAiSEE, EmotiW | Engagement levels | 64.74 | Competitive on EmotiW | – | – | Region-prioritized masking |
| MobileNetV2 + GRU (ours) | RGB video | MobileNetV2 + GRU | DAiSEE | 4-level attention | 97.8 | Macro-F1 0.977; QWK 0.978 | CPU-friendly | 3.2 M | Reproducible baseline; same split and protocol |
| Ours (FP32) | RGB video | CNN + BiLSTM + GRU | DAiSEE | 4-level attention | 99.86 | Macro-F1 0.998; QWK 0.998 | – | 5.8 M | Subject-independent split; calibrated |
| Ours (Pruned + INT8) | RGB video | CNN + BiLSTM + GRU | DAiSEE | 4-level attention | 99.52 | Macro-F1 0.995; QWK 0.995 | CPU-friendly | 2.1 M | faster; smaller |
| Vision-only, other datasets |
| [26] | RGB images | MobileNetV2 (FER) | CSFED+ | 7 academic states | – | FER test 76% | Yes | – | Head pose and movement fused |
| [23] | RGB video | ResNet (face) + ViT (posture) | In-house classroom | 4 levels | – | 92.9% accuracy | – | – | Posture/face fusion |
| [36] | RGB images | ResNet + IDBN (FER) | CK+, FER-2013 | 7 emotions | – | 95% (CK+) | – | – | Hybrid feature pipeline |
| [37] | RGB images | ResNet50 + CBAM + TCN | RAF-DB, FER2013, CK+, KDEF | Emotions | – | 91.9/91.7/95.9/97.1% | Yes | – | Temporal CNNs |
| [38] | RGB images | CNN (BN + Dropout) | UPNA Head Pose | ID + activity | – | 99% identification | Yes | – | Online monitor |
| [24] | RGB video | ViT (PAD-based ELE) | In-class physics lessons | ELE (PAD) | – | 92.21% acc | Yes | – | ELE–achievement correlation |
| [29] | RGB images | YOLOv8 + ResNet50 + SVM | Real classroom images | 4 engagement levels | – | mAP@0.5 = 93.7% | – | – | Hierarchical features |
| Multimodal or non-visual approaches |
| [27] | RGB video + derived cues | MobileNetV2 + Dlib features | Small in situ | 4 engagement levels | – | FER test 73.4% | Yes | – | Gaze, blink and head-pose features |
| [28] | RGB video + landmarks | MediaPipe + XGBoost | GENKI-4K, CelebA, HRFS | Composite metrics | – | Smile 98.53% | Yes | – | Multi-feature, edge-oriented pipeline |
| [31] | EEG | DDQN | EMOTIV EEG | 3 attention states | – | 98.2% acc | – | – | Non-visual neural signals |