AT-HSTNet: An Efficient Hierarchical Action-Transformer Framework for Deepfake Video Detection
Abstract
1. Introduction
2. Related Work
3. Proposed AT-HSTNet Framework and Methodology
| Algorithm 1: AT-HSTNet for Deepfake Video Detection. |
| Input: Video sequence. Output: Prediction label (real/fake). 1. Extract the frame sequence from the input video. 2. For each frame: extract spatial features with the EfficientNet-B0 backbone. 3. Encode short- and medium-range temporal dependencies across frames with a BiLSTM. 4. Apply the Action-Transformer to the BiLSTM-encoded sequence for long-range contextual reasoning. 5. Aggregate the sequence features and classify the video. |
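The flow of Algorithm 1 can be summarized in code. The following is a minimal PyTorch sketch, not the authors' implementation: the hidden size, head count, layer depth, and pooling choice are illustrative assumptions, and only the overall EfficientNet-B0 → BiLSTM → Transformer → classifier structure is taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class ATHSTNet(nn.Module):
    """Minimal sketch of the Algorithm 1 pipeline. Hidden sizes, head
    count, layer depth, and mean pooling are illustrative assumptions;
    only the EfficientNet-B0 -> BiLSTM -> Transformer -> classifier
    structure is taken from the paper."""

    def __init__(self, lstm_hidden=256, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Spatial stream (Section 3.2.1): EfficientNet-B0 truncated before
        # its classifier, yielding a 1280-d embedding per frame.
        backbone = models.efficientnet_b0(weights=None)  # or weights="DEFAULT"
        self.cnn = nn.Sequential(backbone.features,
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Short/medium-range temporal stream (Section 3.2.2): BiLSTM.
        self.bilstm = nn.LSTM(1280, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Long-range contextual reasoning over the BiLSTM-encoded
        # sequence (Section 3.3), not over raw frame features.
        layer = nn.TransformerEncoderLayer(d_model=2 * lstm_hidden,
                                           nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Aggregation and classification head (Section 3.4).
        self.head = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))    # (B*T, 1280)
        seq, _ = self.bilstm(feats.reshape(b, t, -1))  # (B, T, 2*hidden)
        ctx = self.transformer(seq)               # (B, T, 2*hidden)
        return self.head(ctx.mean(dim=1))         # pool over time, classify

logits = ATHSTNet()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 2])
```

The temporal mean pooling here is one plausible aggregation choice; Section 3.4 of the paper describes the actual aggregation and classification stage.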
3.1. Dataset and Pre-Processing
3.2. AT-HSTNet Hybrid Spatiotemporal Feature Extraction Architecture
3.2.1. Spatial Feature Extraction
3.2.2. Temporal Sequence Modeling
3.3. Action-Transformer-Based Contextual Reasoning
3.3.1. Input Representation
3.3.2. Self-Attention Mechanism
3.4. Feature Aggregation and Classification
3.5. Training Strategy and Experimental Configuration
3.5.1. Memory Optimization and Training Strategy
3.5.2. Training Stabilization and Optimization
3.5.3. Experimental Setup and Implementation Details
3.5.4. Training Convergence and Performance Metrics
4. Experimental Results and Analysis
4.1. Training and Validation Performance Analysis
4.2. Comparative Performance Analysis
4.3. Computational Efficiency
4.4. Analysis of Temporal Module Design
5. Conclusions and Future Directions
- A hierarchical deepfake video detection framework, AT-HSTNet, is introduced that explicitly separates short- and medium-range temporal modeling from long-range sequence reasoning, enabling robust capture of frame-level visual artifacts and multi-scale temporal inconsistencies.
- An action-aware Transformer module performs long-range temporal reasoning on BiLSTM-encoded features rather than on raw frame features, reducing redundant attention computation and improving training stability compared with conventional CNN–Transformer designs.
- A lightweight spatial feature extraction strategy based on EfficientNet-B0 balances detection accuracy and computational efficiency for fine-grained facial artifact analysis.
- A memory-efficient training and optimization framework is developed, incorporating sequence-level MixUp, frame-level Random Erasing, and stabilization techniques to enable efficient training and real-time inference on consumer-grade hardware (a code sketch of the augmentation step follows this list).
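To make the augmentation strategy in the last contribution concrete, here is a minimal sketch of sequence-level MixUp and frame-level Random Erasing using PyTorch and torchvision. The Beta parameter, erasing probability, and helper names (`sequence_mixup`, `frame_random_erasing`) are illustrative assumptions, not the paper's settings.

```python
import torch
from torchvision.transforms import RandomErasing

def sequence_mixup(clips, labels, alpha=0.2):
    """Sequence-level MixUp: blend whole clips (B, T, C, H, W) and their
    one-hot labels with a Beta-sampled coefficient, so the same mixing
    ratio applies to every frame of a clip. alpha=0.2 is an
    illustrative choice, not the paper's setting."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed_clips = lam * clips + (1 - lam) * clips[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_clips, mixed_labels

# Frame-level Random Erasing: applied independently to each frame so the
# occluded region varies across the sequence.
erase = RandomErasing(p=0.5)

def frame_random_erasing(clips):
    return torch.stack([
        torch.stack([erase(frame) for frame in clip]) for clip in clips
    ])

clips = torch.rand(4, 8, 3, 224, 224)             # 4 clips of 8 frames
labels = torch.eye(2)[torch.randint(0, 2, (4,))]  # one-hot real/fake labels
mixed, soft = sequence_mixup(frame_random_erasing(clips), labels)
```

Because MixUp produces soft labels, it would be paired with a loss that accepts them, such as soft-target cross-entropy, rather than hard-label classification losses.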
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| 3D | Three-dimensional |
| AT-HSTNet | Action-Transformer-based Hierarchical Spatiotemporal Network |
| AUC | Area Under the Curve |
| BiLSTM | Bidirectional Long Short-Term Memory |
| Celeb-DF | Celebrity DeepFake |
| CNN | Convolutional Neural Network |
| DFDC | Deepfake Detection Challenge |
| EMA | Exponential Moving Average |
| FF++ | FaceForensics++ |
| FFIW | Face Forensics in the Wild |
| FN | False Negative |
| FP | False Positive |
| FPS | Frames Per Second |
| GAN | Generative Adversarial Network |
| GAP | Global Average Pooling |
| GB | Gigabytes |
| GFLOPs | Giga Floating-Point Operations per Second |
| GPU | Graphics Processing Unit |
| HCiT | Hybrid CNN–Vision Transformer |
| ISTVT | Interpretable Spatial–Temporal Video Transformer |
| Leaky ReLU | Leaky Rectified Linear Unit |
| LSTM | Long Short-Term Memory |
| RNN | Recurrent Neural Network |
| SFormer | Swin-based Transformer |
| TP | True Positive |
| UV | Texture map coordinate space (u, v) |
| ViT | Vision Transformer |
References
- Sharma, V.K.; Garg, R.; Caudron, Q. A systematic literature review on deepfake detection techniques. Multimed. Tools Appl. 2025, 84, 22187–22229.
- Li, M.; Ahmadiadli, Y.; Zhang, X.-P. A Survey on Speech Deepfake Detection. ACM Comput. Surv. 2025, 57, 165.
- Heidari, A.; Navimipour, N.J.; Dag, H.; Unal, M. Deepfake detection using deep learning methods: A systematic and comprehensive review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2024, 14, e1520.
- Yan, Z.; Yao, T.; Chen, S.; Zhao, Y.; Fu, X.; Zhu, J.; Luo, D.; Wang, C.; Ding, S.; Wu, Y.; et al. DF40: Toward next-generation deepfake detection. Adv. Neural Inf. Process. Syst. 2024, 37, 29387–29434.
- Concas, S.; La Cava, S.M.; Casula, R.; Orru, G.; Puglisi, G.; Marcialis, G.L. Quality-based artifact modeling for facial deepfake detection in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3845–3854. Available online: https://openaccess.thecvf.com/content/CVPR2024W/DFAD/html/Concas_Quality-based_Artifact_Modeling_for_Facial_Deepfake_Detection_in_Videos_CVPRW_2024_paper.html (accessed on 14 December 2025).
- Le, B.M.; Kim, J.; Woo, S.S.; Moore, K.; Abuadbba, A.; Tariq, S. SoK: Systematization and Benchmarking of Deepfake Detectors in a Unified Framework. arXiv 2025, arXiv:2401.04364.
- Zafar, F.; Khan, T.A.; Akbar, S.; Ubaid, M.T.; Javaid, S.; Kadir, K.A. A Hybrid Deep Learning Framework for Deepfake Detection Using Temporal and Spatial Features. IEEE Access 2025, 13, 79560–79570.
- Al Redhaei, A.; Fraihat, S.; Al-Betar, M.A. A self-supervised BEiT model with a novel hierarchical patchReducer for efficient facial deepfake detection. Artif. Intell. Rev. 2025, 58, 278.
- AlMuhaideb, S.; Alshaya, H.; Almutairi, L.; Alomran, D.; Alhamed, S.T. LightFakeDetect: A Lightweight Model for Deepfake Detection in Videos That Focuses on Facial Regions. Mathematics 2025, 13, 3088.
- Wang, Z.; Cheng, Z.; Xiong, J.; Xu, X.; Li, T.; Veeravalli, B.; Yang, X. A Timely Survey on Vision Transformer for Deepfake Detection. arXiv 2024, arXiv:2405.08463.
- Cantero-Arjona, P.; Sánchez-Macián, A. Deepfake Detection and the Impact of Limited Computing Capabilities. arXiv 2024, arXiv:2402.14825.
- Ain, Q.U.; Ning, H.; Philipo, A.G.; Daneshmand, M.; Ding, J. Beyond Accuracy: A Deployment-Oriented Benchmark of Deepfake Detection Models. TechRxiv 2025, 18.
- Patel, Y.; Tanwar, S.; Bhattacharya, P.; Gupta, R.; Alsuwian, T.; Davidson, I.E.; Mazibuko, T.F. An improved dense CNN architecture for deepfake image detection. IEEE Access 2023, 11, 22081–22095.
- Tipper, S.; Atlam, H.F.; Lallie, H.S. An investigation into the utilisation of CNN with LSTM for video deepfake detection. Appl. Sci. 2024, 14, 9754.
- Al-Dulaimi, O.A.H.H.; Kurnaz, S. A hybrid CNN-LSTM approach for precision deepfake image detection based on transfer learning. Electronics 2024, 13, 1662.
- Petmezas, G.; Vanian, V.; Konstantoudakis, K.; Almaloglou, E.E.I.; Zarpalas, D. Video deepfake detection using a hybrid CNN-LSTM-Transformer model for identity verification. Multimed. Tools Appl. 2025, 84, 40617–40636.
- Ikram, S.T.; Chambial, S.; Sood, D. A performance enhancement of deepfake video detection through the use of a hybrid CNN Deep learning model. Int. J. Electr. Comput. Eng. Syst. 2023, 14, 169–178.
- Kaddar, B.; Fezza, S.A.; Akhtar, Z.; Hamidouche, W.; Hadid, A.; Serra-Sagristá, J. Deepfake Detection Using Spatiotemporal Transformer. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 20, 345.
- Heo, Y.-J.; Yeo, W.-H.; Kim, B.-G. DeepFake detection algorithm based on improved vision transformer. Appl. Intell. 2023, 53, 7512–7527.
- Khormali, A.; Yuan, J.-S. DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer. Appl. Sci. 2022, 12, 2953.
- Wang, T.; Cheng, H.; Chow, K.P.; Nie, L. Deep Convolutional Pooling Transformer for Deepfake Detection. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 174.
- Khan, S.A.; Dang-Nguyen, D.-T. Hybrid Transformer Network for Deepfake Detection. In Proceedings of the International Conference on Content-Based Multimedia Indexing, Dublin, Ireland, 22–24 October 2022; ACM: New York, NY, USA, 2022; pp. 8–14.
- Zhao, C.; Wang, C.; Hu, G.; Chen, H.; Liu, C.; Tang, J. ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1335–1348.
- Khan, S.A.; Dai, H. Video Transformer for Deepfake Detection with Incremental Learning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; ACM: New York, NY, USA, 2021; pp. 1821–1828.
- Kingra, S.; Aggarwal, N.; Kaur, N. SFormer: An end-to-end spatio-temporal transformer architecture for deepfake detection. Forensic Sci. Int. Digit. Investig. 2024, 51, 301817.
- Javed, M.; Zhang, Z.; Dahri, F.H.; Laghari, A.A.; Krajčík, M.; Almadhor, A. Real-Time Deepfake Detection via Gaze and Blink Patterns: A Transformer Framework. Comput. Mater. Contin. 2025, 85, 1457–1493.
- Zhou, T. tfzhou/FFIW. GitHub repository, 26 September 2025. Available online: https://github.com/tfzhou/FFIW (accessed on 14 December 2025).
- Huang, J.; Yang, P.; Xiong, B.; Lv, Y.; Wang, Q.; Wan, B.; Zhang, Z.-Q. Mixup-based data augmentation for enhancing few-shot SSVEP detection performance. J. Neural Eng. 2025, 22, 046038.
- Ma, G.; Wang, Z.; Yuan, Z.; Wang, X.; Yuan, B.; Tao, D. A comprehensive survey of data augmentation in visual reinforcement learning. Int. J. Comput. Vis. 2025, 133, 7368–7405.
- Kumar, A.; Yadav, S.P.; Kumar, A. An improved feature extraction algorithm for robust Swin Transformer model in high-dimensional medical image analysis. Comput. Biol. Med. 2025, 188, 109822.
- Chen, X.; Liu, C.; Xia, H.; Chi, Z. Burn-through point prediction and control based on multi-cycle dynamic spatio-temporal feature extraction. Control Eng. Pract. 2025, 154, 106165.
- Zhang, Y.; Liu, K.; Zhang, J.; Huang, L. Self-attention mechanism network integrating spatio-temporal feature extraction for remaining useful life prediction. J. Electr. Eng. Technol. 2025, 20, 1127–1142.
- Li, L.; Xu, M.; Chen, S.; Mu, B. An adaptive feature fusion framework of CNN and GNN for histopathology images classification. Comput. Electr. Eng. 2025, 123, 110186.
- Xiao, J.; Sang, S.; Zhi, T.; Liu, J.; Yan, Q.; Luo, L.; Yuan, B. COAP: Memory-efficient training with correlation-aware gradient projection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 30116–30126.
- Ubaid, M.T.; Javaid, S. Precision Agriculture: Computer Vision-Enabled Sugarcane Plant Counting in the Tillering Phase. J. Imaging 2024, 10, 102.







| Year | Technique | Dataset(s) | Metrics | Efficiency |
|---|---|---|---|---|
| 2021 [24] | Video Transformer with UV texture alignment and incremental learning (Xception + Video Transformer) | FaceForensics++, Deep Fake Detection Challenge (DFDC) | FF++: ACC 99.52%, AUC 99.64% DFDC: ACC 91.69% | High computational cost |
| 2023 [19] | Improved Vision Transformer with CNN–patch feature fusion and knowledge distillation (EfficientNet-B7 + ViT + DeiT) | Deep Fake Detection Challenge (DFDC), Celeb-DF | DFDC: AUC 97.8%, F1 91.9% Celeb-DF: AUC 99.3%, F1 97.8% | Very high computational cost (440 M parameters, ~8–10× higher than CNN-based models) |
| 2022 [20] | DFDT: End-to-End Vision Transformer with multi-stream re-attention and patch selection | FaceForensics++, Celeb-DF, WildDeepfake | FF++: ACC 99.41%, AUC 99.94% Celeb-DF: ACC 99.31%, AUC 99.26% | High computational cost (multi-stream Transformer architecture, requires multi-GPU training) |
| 2022 [22] | Hybrid Transformer Network with early feature fusion (XceptionNet + EfficientNet-B4 + ViT) | FaceForensics++, Deep Fake Detection Challenge (DFDC) | FF++: ACC 97.00% DFDC: ACC 98.24% | Moderate computational cost (hybrid CNN–Transformer with TimeSformer) |
| 2023 [23] | ISTVT: Interpretable Spatial–Temporal Video Transformer with decomposed attention and self-subtraction | FaceForensics++, FaceShifter, DeeperForensics, Celeb-DF, Deep Fake Detection Challenge (DFDC) | FF++: ACC 99.6%, AUC 99.6% Celeb-DF: AUC 99.8% | High computational cost (spatiotemporal Transformer with decomposed self-attention) |
| 2023 [21] | Deep Convolutional Pooling Transformer with key frame selection and re-attention mechanism | FaceForensics++, Deep Fake Detection Challenge (DFDC), Celeb-DF, DeeperForensics | FF++: ACC 92.11%, AUC 97.66% DFDC: ACC 65.76%, AUC 73.68% Celeb-DF: ACC 63.27%, AUC 72.43% | High computational cost (deep CNN + 24-layer Transformer with re-attention and preprocessing overhead) |
| 2022 [22] | HCiT: Hybrid CNN–Vision Transformer for spatiotemporal deepfake detection (Xception + ViT) | FaceForensics++, Deep Fake Detection Challenge (DFDC), Celeb-DF | FF++: ACC 96.0%, F1 93.86% DFDC-p: ACC 97.82% | High computational cost (dual CNN feature extractors + Transformer with feature fusion and preprocessing overhead) |
| 2024 [25] | SFormer: End-to-end spatiotemporal Transformer using Swin Transformer for spatial modeling and Transformer encoder for temporal reasoning | FaceForensics++, DFD, Celeb-DF, Deep Fake Detection Challenge (DFDC), DeeperForensics | FF++: 100% DFD: 97.81% Celeb-DF: 99.1% DFDC: 93.67% | Moderate to high computational cost (end-to-end spatiotemporal Transformer with Swin backbone) |
| 2025 [16] | Hybrid CNN–LSTM–Transformer with 3D Morphable Models for identity-aware deepfake detection | VoxCeleb2 (train), DFD, Celeb-DF, FF++ (test) | DFD: AUC ≈ 97% Celeb-DF: AUC ≈ 86% FF++: AUC ≈ 99% | Moderate computational cost with improved inference efficiency (hybrid CNN–LSTM–Transformer; moderately reduced inference time) |
| Model | Architecture Description | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | GFLOPs | FPS (Single GPU) |
|---|---|---|---|---|---|---|---|
| Model A | CNN + BiLSTM (No Transformer) | 95.4 | 96.1 | 93.2 | 94.6 | ~0.90 | ~22 |
| Model B | CNN + Transformer (No Hierarchical Temporal Modeling) | 96.8 | 97.4 | 94.8 | 96.1 | ~1.80 | ~14 |
| AT-HSTNet (Proposed) | EfficientNet-B0 + BiLSTM + Action-Transformer | 98.7 | 98.0 | 96.0 | 96.9 | 0.45 | ~30 |
| Method | Representative Work | Architecture Type | GFLOPs | FPS (Single GPU) |
|---|---|---|---|---|
| CNN–RNN Baseline | CNN–LSTM-based detectors [16] | CNN + LSTM | ~0.90 | ~22 |
| CNN–ViT Hybrid | HCiT [22], Hybrid Transformer Network [22] | CNN + Vision Transformer | ~1.8 | ~14 |
| Spatiotemporal Transformer | ISTVT [23], Video Transformer with UV alignment [24] | Video Transformer | ~3.5 | ~8 |
| Swin-based Transformer | SFormer [25] | Windowed Video Transformer | ~2.4 | ~11 |
| AT-HSTNet (Proposed) | Proposed Architecture | CNN + BiLSTM + Action-Transformer | 0.45 | ~30 |
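Throughput figures such as the FPS column above depend heavily on hardware, batch size, and clip length. The sketch below shows one generic way such single-GPU numbers could be measured for any clip-level detector; the `measure_fps` helper is hypothetical, not from the paper, and GFLOPs would be obtained separately with a profiling tool.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, clip, n_warmup=5, n_runs=20):
    """Rough single-GPU throughput probe for clip-level detectors.
    Returns clips processed per second (multiply by the clip length
    in frames for a frames-per-second figure)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    clip = clip.to(device)
    for _ in range(n_warmup):        # warm-up to stabilize clocks/caches
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()     # wait for queued GPU work
    start = time.perf_counter()
    for _ in range(n_runs):
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_runs * clip.size(0) / (time.perf_counter() - start)

# Example with the ATHSTNet sketch from earlier and one 8-frame clip:
# print(measure_fps(ATHSTNet(), torch.randn(1, 8, 3, 224, 224)))
```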
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.