TransMODAL: A Dual-Stream Transformer with Adaptive Co-Attention for Efficient Human Action Recognition
Abstract
1. Introduction
2. Related Work
2.1. Spatiotemporal Convolutions for Action Recognition
2.2. Transformer-Based Video Understanding
2.3. Pose-Guided Action Recognition
3. The TransMODAL Architecture
3.1. Overall Pipeline
- Person Detection and Tracking: For each video, we first employ an off-the-shelf, pre-trained RT-DETR model to detect all persons in every frame [14]. A simple tracker then associates detections across frames, yielding a set of person-centric video clips.
- Pose Estimation: For each person clip, a pre-trained ViTPose++ model estimates a sequence of 2D human poses [15]. As specified in our data loader, we extract 17 keypoints in the COCO format for each frame, resulting in a tensor of skeletal coordinates.
- Dual-Stream Encoding: The pipeline then splits into two parallel streams. The cropped RGB person clip is fed into a frozen VideoMAE backbone, while the corresponding sequence of pose coordinates is passed to our lightweight PoseEncoder.
- Fusion, Selection, and Classification: Visual and pose tokens undergo iterative cross-modal refinement via CoAttentionFusion (inspired by STAR-Transformer's zigzag attention [26] and MM-ViT's modality factorization [27]) and are then pruned by the AdaptiveSelector using lightweight learnable scoring (echoing DynamicViT's token sparsification [25]). The remaining tokens are averaged and classified, as summarized in the code sketch following this list.
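The end-to-end flow can be summarized with the minimal PyTorch-style sketch below. The module names mirror those used in this paper (PoseEncoder, CoAttentionFusion, AdaptiveSelector, ActionClassifier); the exact tensor shapes, the concatenation of the two token sets before pruning, and the constructor signature are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class TransMODALSketch(nn.Module):
    """Illustrative end-to-end forward pass; shapes and wiring are assumptions."""

    def __init__(self, videomae_backbone, pose_encoder, fusion, selector, classifier):
        super().__init__()
        self.backbone = videomae_backbone  # frozen VideoMAE encoder
        self.pose_encoder = pose_encoder   # lightweight PoseEncoder
        self.fusion = fusion               # CoAttentionFusion
        self.selector = selector           # AdaptiveSelector
        self.classifier = classifier       # ActionClassifier head

    def forward(self, rgb_clip, pose_seq):
        # rgb_clip: (B, T, 3, H, W) cropped person clip
        # pose_seq: (B, T, 17, 2)   COCO-format 2D keypoints from ViTPose++
        with torch.no_grad():                         # backbone stays frozen
            vis_tokens = self.backbone(rgb_clip)      # (B, N_vis, D)
        pose_tokens = self.pose_encoder(pose_seq)     # (B, N_pose, D)

        # Iterative cross-modal refinement of both token sets
        vis_tokens, pose_tokens = self.fusion(vis_tokens, pose_tokens)

        # Prune low-salience tokens, then average and classify
        tokens = torch.cat([vis_tokens, pose_tokens], dim=1)
        kept = self.selector(tokens)                  # (B, N_kept, D)
        return self.classifier(kept.mean(dim=1))      # (B, num_classes)
```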
Upstream Model Selection and Hyperparameter Justification
3.2. Input Modalities and Encoders
3.2.1. RGB Appearance Stream
3.2.2. Pose Kinematics Stream (PoseEncoder)
3.3. Core Fusion and Selection Modules
3.3.1. Co-Attention Fusion
3.3.2. Adaptive Feature Selector (AdaptiveSelector)
- Stage 1 (Frame Selection): For each frame $t$, we first aggregate its patch tokens $\{\mathbf{z}_{t,i}\}_{i=1}^{N}$ with a permutation-invariant pooling operator (we use the mean over tokens) and score the frame with a linear unit as in (7): $s_t = \mathbf{w}_f^{\top}\big(\tfrac{1}{N}\sum_{i=1}^{N}\mathbf{z}_{t,i}\big) + b_f$. The top_k_frames frames with the highest scores are retained.
- Stage 2 (Token Selection within selected frames): Within each retained high-salience frame $t$, we compute a token-wise score with a second linear unit as in (8): $s_{t,i} = \mathbf{w}_p^{\top}\mathbf{z}_{t,i} + b_p$, and keep the top_k_tokens highest-scoring tokens per frame (see the sketch after this list).
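A minimal sketch of this two-stage procedure is given below. It assumes the refined visual tokens are arranged as (batch, frames, tokens-per-frame, dim) and uses hard top-k selection via torch.topk and torch.gather; the names frame_scorer and token_scorer are illustrative stand-ins for the two linear units in (7) and (8), not the exact identifiers in our code.

```python
import torch
import torch.nn as nn

class AdaptiveSelectorSketch(nn.Module):
    """Two-stage frame-then-token selection (illustrative layout: B, T, N, D)."""

    def __init__(self, dim, top_k_frames=8, top_k_tokens=12):
        super().__init__()
        self.frame_scorer = nn.Linear(dim, 1)   # Eq. (7): scores the mean-pooled frame descriptor
        self.token_scorer = nn.Linear(dim, 1)   # Eq. (8): scores individual tokens
        self.top_k_frames = top_k_frames
        self.top_k_tokens = top_k_tokens

    def forward(self, tokens):
        # tokens: (B, T, N, D) -- T frames, N patch tokens per frame
        B, T, N, D = tokens.shape

        # Stage 1: mean-pool tokens per frame, then keep the top-k frames
        frame_desc = tokens.mean(dim=2)                           # (B, T, D)
        frame_scores = self.frame_scorer(frame_desc).squeeze(-1)  # (B, T)
        frame_idx = frame_scores.topk(self.top_k_frames, dim=1).indices
        frames = torch.gather(
            tokens, 1, frame_idx[..., None, None].expand(-1, -1, N, D)
        )                                                         # (B, K_f, N, D)

        # Stage 2: within each kept frame, keep the top-k tokens
        token_scores = self.token_scorer(frames).squeeze(-1)      # (B, K_f, N)
        token_idx = token_scores.topk(self.top_k_tokens, dim=2).indices
        selected = torch.gather(
            frames, 2, token_idx[..., None].expand(-1, -1, -1, D)
        )                                                         # (B, K_f, K_t, D)
        return selected.flatten(1, 2)                             # (B, K_f * K_t, D)
```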
3.4. Implementation and Complexity
4. Experimental Evaluation
4.1. Datasets and Protocol
4.2. Implementation Details
4.3. Main Results and Baselines
4.4. Ablation Studies
- Pose is Critical: Removing the pose stream entirely (Row 1) causes the largest drop in accuracy (−1.7%), confirming it as the most impactful component for performance.
- AdaptiveSelector is Efficient: Removing the AdaptiveSelector (Row 2) results in a marginal 0.3% drop in accuracy but increases latency by over 46% (from 35.2 ms to 51.5 ms). This demonstrates that our token pruning strategy is highly effective at reducing computational cost with a negligible impact on performance.
- Sensitivity to top_k_frames: Varying top_k_frames reveals a clear trade-off. Reducing it to 4 (Row 3) harms accuracy (−1.0%), suggesting that important temporal information is lost, while increasing it to 12 (Row 4) provides no significant benefit over our default of 8 and adds latency, validating our hyperparameter choice.
4.5. Qualitative Analysis
5. Analysis and Discussion
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Dong, Y.; Zhou, R.; Zhu, C.; Cao, L.; Li, X. Hierarchical activity recognition based on belief functions theory in body sensor networks. IEEE Sens. J. 2022, 22, 15211–15221. [Google Scholar] [CrossRef]
- Joudaki, M.; Ebrahimpour Komleh, H. Introducing a new architecture of deep belief networks for action recognition in videos. JMVIP 2024, 11, 43–58. [Google Scholar]
- Teng, Q.; Wang, K.; Zhang, L.; He, J. The layer-wise training convolutional neural networks using local loss for sensor-based human activity recognition. IEEE Sens. J. 2020, 20, 7265–7274. [Google Scholar] [CrossRef]
- Han, Y.; Zhang, P.; Zhuo, T.; Huang, W.; Zhang, Y. Going Deeper with Two-Stream ConvNets for Action Recognition in Video Surveillance. Pattern Recognit. Lett. 2018, 107, 83–90. [Google Scholar] [CrossRef]
- Abdelbaky, A.; Aly, S. Two-Stream Spatiotemporal Feature Fusion for Human Action Recognition. Vis. Comput. 2021, 37, 1821–1835. [Google Scholar] [CrossRef]
- Joudaki, M.; Imani, M.; Arabnia, H.R. A New Efficient Hybrid Technique for Human Action Recognition Using 2D Conv-RBM and LSTM with Optimized Frame Selection. Technologies 2025, 13, 53. [Google Scholar] [CrossRef]
- Xin, C.; Kim, S.; Cho, Y.; Park, K.S. Enhancing Human Action Recognition with 3D Skeleton Data: A Comprehensive Study of Deep Learning and Data Augmentation. Electronics 2024, 13, 747. [Google Scholar] [CrossRef]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar] [CrossRef]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar] [CrossRef]
- Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093. [Google Scholar]
- Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef]
- Baradel, F.; Wolf, C.; Mille, J. Human activity recognition with pose-driven attention to RGB. In Proceedings of the 29th British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; pp. 1–14. [Google Scholar]
- Song, S.; Liu, J.; Li, Y.; Guo, Z. Modality compensation network: Cross-modal adaptation for action recognition. IEEE Trans. Image Process. 2020, 29, 3957–3969. [Google Scholar] [CrossRef]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose++: Vision transformer for generic body pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1212–1230. [Google Scholar] [CrossRef]
- Zhang, Y.; Guo, Q.; Du, Z.; Wu, A. Human Action Recognition for Dynamic Scenes of Emergency Rescue Based on Spatial-Temporal Fusion Network. Electronics 2023, 12, 538. [Google Scholar] [CrossRef]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar] [CrossRef]
- Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 813–824. [Google Scholar]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar] [CrossRef]
- Fang, H.-S.; Xie, S.; Tai, Y.-W.; Lu, C. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2334–2343. [Google Scholar] [CrossRef]
- Jiang, Y.; Yu, S.; Wang, T.; Sun, Z.; Wang, S. Skeleton-Based Human Action Recognition Based on Single Path One-Shot Neural Architecture Search. Electronics 2023, 12, 3156. [Google Scholar] [CrossRef]
- Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225. [Google Scholar] [CrossRef]
- Bevilacqua, A.; MacDonald, K.; Rangarej, A.; Widjaya, V.; Caulfield, B.; Kechadi, T. Human activity recognition with convolutional neural networks. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Cham, Switzerland, 10–14 September 2018; pp. 541–552. [Google Scholar] [CrossRef]
- Reilly, D.; Chadha, A.; Das, S. Seeing the pose in the pixels: Learning pose-aware representations in vision transformers. arXiv 2023, arXiv:2306.09331. [Google Scholar] [CrossRef]
- Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.J. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inf. Process. Syst. 2021, 34, 13937–13949. [Google Scholar]
- Ahn, D.; Kim, S.; Hong, H.; Ko, B.C. STAR-Transformer: A spatio-temporal cross attention transformer for human action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 3330–3339. [Google Scholar] [CrossRef]
- Chen, J.; Ho, C.M. MM-ViT: Multi-modal video transformer for compressed video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1910–1921. [Google Scholar] [CrossRef]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7473. [Google Scholar] [CrossRef]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
- Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), Cambridge, UK, 26 August 2004; Volume 3, pp. 32–36. [Google Scholar] [CrossRef]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar] [CrossRef]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
- Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4804–4814. [Google Scholar] [CrossRef]
- Li, K.; Wang, Y.; He, Y.; Li, Y.; Wang, Y.; Wang, L.; Qiao, Y. UniFormerV2: Spatiotemporal learning by arming image ViTs with video UniFormer. arXiv 2022, arXiv:2211.09552. [Google Scholar]
- Li, Y.; Lu, Z.; Xiong, X.; Huang, J. PERF-Net: Pose empowered RGB-Flow Net. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 513–522. [Google Scholar] [CrossRef]
- Wang, L.; Huang, B.; Zhao, Z.; Tong, Z.; He, Y.; Wang, Y.; Wang, Y.; Qiao, Y. VideoMAE v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14549–14560. [Google Scholar] [CrossRef]
- Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Freund, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “Something Something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5842–5850. [Google Scholar] [CrossRef]
- Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 April 2018; Volume 32. [Google Scholar] [CrossRef]
- Tan, T.-H.; Wu, J.-Y.; Liu, S.-H.; Gochoo, M. Human Activity Recognition Using an Ensemble Learning Algorithm with Smartphone Sensor Data. Electronics 2022, 11, 322. [Google Scholar] [CrossRef]
| Module Name (Code) | Key Hyperparameters | Parameters (M) | FLOPs (G) |
|---|---|---|---|
| VideoMAE Backbone | embed_dim = 768 | 87 (frozen) | N/A |
| PoseEncoder | embed_dim = 768, num_joints = 17 | 0.22 | 0.02 |
| CoAttentionFusion | num_heads = 8, depth = 1 | 9.45 | 0.81 |
| AdaptiveSelector | top_k_frames = 8, top_k_tokens = 12 | 0.05 | <0.01 |
| ActionClassifier | num_classes = 6 | 0.60 | <0.01 |
| Total Trainable | - | 10.32 | ~0.83 |
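For convenience, the hyperparameters listed in the table above can be gathered into a single configuration object. The dataclass below is a hypothetical wrapper, not part of the released code; its fields simply echo the table.

```python
from dataclasses import dataclass

@dataclass
class TransMODALConfig:
    """Hyperparameters collected from the module table (hypothetical wrapper)."""
    embed_dim: int = 768      # shared token dimension (VideoMAE / PoseEncoder)
    num_joints: int = 17      # COCO keypoints consumed by the PoseEncoder
    num_heads: int = 8        # attention heads in CoAttentionFusion
    fusion_depth: int = 1     # number of co-attention blocks
    top_k_frames: int = 8     # frames kept by the AdaptiveSelector
    top_k_tokens: int = 12    # tokens kept per selected frame
    num_classes: int = 6      # KTH action classes for the ActionClassifier

config = TransMODALConfig()
```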
| Model | Input Modality | Pre-Trained on | KTH Top-1 Acc (%) |
|---|---|---|---|
| Two-Stream ConvNets [4] | RGB + Flow | ImageNet | 93.1 |
| ST-VLAD-PCANet [5] | RGB | - | 93.33 |
| 2D Conv-RBM + LSTM [6] | RGB | - | 97.3 |
| VideoMAE (B/16) [10] | RGB | Kinetics | 96.5 |
| TransMODAL (proposed method, no pose stream) | RGB | Kinetics | 96.8 |
| TransMODAL (proposed method) | RGB + Pose | Kinetics | 98.5 |
| Model | Input Modality | Pre-Trained on | UCF101 Top-1 Acc (%) |
|---|---|---|---|
| I3D (Two-Stream) [8] | RGB + Flow | Kinetics | 97.9 |
| R(2+1)D (Two-Stream) [9] | RGB + Flow | Kinetics | 97.3 |
| VideoMAE [10] | RGB | Kinetics | 91.3 |
| MViTv2 [35] | RGB | Kinetics-400 | 98.6 |
| UniFormerV2 [36] | RGB | Kinetics-400 | 98.9 |
| PERF-Net [37] | RGB + Flow + Pose | S3D-G | 98.6 |
| TransMODAL (proposed method) | RGB + Pose | Kinetics | 96.9 |
| Model | Input Modality | Pre-Trained on | HMDB51 Top-1 Acc (%) |
|---|---|---|---|
| 2D Conv-RBM + LSTM [6] | RGB | - | 81.5 |
| I3D (Two-Stream) [8] | RGB + Flow | Kinetics | 80.2 |
| R(2+1)D (Two-Stream) [9] | RGB + Flow | Kinetics | 78.7 |
| VideoMAE [10] | RGB | Kinetics | 62.6 |
| MViTv2 [35] | RGB | Kinetics-400 | 85.5 |
| UniFormerV2 [36] | RGB | Kinetics-400 | 86.1 |
| PERF-Net [37] | RGB + Flow + Pose | S3D-G | 83.2 |
| VideoMAE V2-g [38] | RGB | UnlabeledHybrid | 88.7 |
| TransMODAL (proposed method) | RGB + Pose | Kinetics | 84.2 |
| Configuration | Top-1 Acc (%) | Δ vs. Full Model | Latency (ms/batch) |
|---|---|---|---|
| Full TransMODAL model | 98.5 | - | 35.2 |
| No pose stream (VideoMAE only) | 96.8 | −1.7 | 29.8 |
| No AdaptiveSelector (use all tokens) | 98.2 | −0.3 | 51.5 |
| top_k_frames = 4 | 97.5 | −1.0 | 31.1 |
| top_k_frames = 12 | 98.4 | −0.1 | 39.4 |
| Model | Input Modality | Trainable Params (M) | FLOPs (G) | Top-1 Acc (%) |
|---|---|---|---|---|
| I3D (Two-Stream) [8] | RGB + Flow | 27.9 | 108 | 97.9 |
| R(2+1)D (Two-Stream) [9] | RGB + Flow | 31.4 | 152 | 97.3 |
| TransMODAL (proposed) | RGB + Pose | 10.32 | ~45 | 96.9 |