Comparative Analysis of Action Recognition Techniques: Exploring Two-Stream CNNs, C3D, LSTM, I3D, Attention Mechanisms, and Hybrid Models †
Abstract
1. Introduction
- Two-Stream Convolutional Networks (2D + Optical Flow Streams): Two separate CNNs (Convolutional Neural Networks) operate in parallel, one processing spatial appearance in RGB (Red Green Blue) frames and the other processing temporal motion in optical flow fields.
- 3D Convolutional Networks (C3Ds): Instead of traditional 2D CNNs, this method employs 3D convolutions whose kernels extend over width, height, and time, so each layer learns spatiotemporal features directly from short clips.
- Long Short-Term Memory (LSTM) Recurrent Networks: LSTMs model long-range temporal dependencies in sequential data. Combined with CNN frame features, they can properly deal with video sequences of varying lengths.
- Two-Stream Inflated 3D Convolutional Networks (I3Ds): The idea is to inflate the 2D filters of pre-trained CNNs into three dimensions, so that convolutions operate on spatiotemporal volumes while reusing weights learned on images (a minimal sketch of the inflation step follows this list).
- Attention Mechanisms: Attention mechanisms make models more accurate and more interpretable by directing them to the relevant regions of interest.
- Hybrid Models: These models combine multiple modalities, such as RGB frames, depth data, and skeleton joints, to better represent human activities.
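To make the inflation step concrete, the following is a minimal NumPy sketch of bootstrapping a 3D kernel from a 2D one, in the spirit of the I3D bullet above. The function name and tensor shapes are illustrative assumptions, not code from the paper: the 2D kernel is repeated along a new temporal axis and rescaled so that the inflated filter responds to a temporally constant video exactly as the original 2D filter responds to a single frame.

```python
import numpy as np

def inflate_2d_kernel(kernel_2d: np.ndarray, time_depth: int) -> np.ndarray:
    """Inflate a pre-trained 2D conv kernel (H, W, C_in, C_out) into a 3D
    kernel (T, H, W, C_in, C_out) by repeating it T times along the new
    temporal axis and rescaling by 1/T."""
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], time_depth, axis=0)
    return kernel_3d / time_depth

# Example: inflate a 3x3 kernel with 64 input and 128 output channels to depth 3.
k2d = np.random.randn(3, 3, 64, 128).astype(np.float32)
k3d = inflate_2d_kernel(k2d, time_depth=3)
print(k3d.shape)  # (3, 3, 3, 64, 128)
```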
2. Literature Review
Evolution of Action Recognition Techniques
- Single-Stream CNNs: Initial attempts utilized single-stream CNN architectures to capture the spatial appearance of individual frames [4].
- Two-Stream Convolutional Networks (2D + Optical Flow): To capture both appearance and motion cues, these networks add a parallel branch trained on explicit motion representations such as dense optical flow.
- 3D Convolutional Networks (C3D): To capture temporal dynamics directly, researchers extended CNNs to 3D convolutions that operate on stacks of frames.
- Long Short-Term Memory (LSTM) Networks: LSTMs and other recurrent neural networks (RNNs) have shown promising results in modeling temporal dependencies across frames.
- Attention Mechanisms: Attention mechanisms improve action recognition models by focusing on relevant regions of interest within images and videos.
- Hybrid Models: These models integrate multiple modalities such as RGB features, depth data, and skeletal information to improve recognition accuracy [5].
3. Methodologies and Architectures
3.1. Two-Stream Convolutional Networks (2D + Optical Flow Streams)
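As a concrete illustration of the two-stream design, here is a minimal Keras sketch under stated assumptions: a small spatial CNN over a single RGB frame and a small temporal CNN over ten stacked optical flow fields (20 channels, x and y per flow frame), with class scores fused by averaging. The backbone depth and the class count (101, as in UCF101) are illustrative, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 101  # illustrative class count

def make_stream(input_shape, name):
    """A deliberately small CNN standing in for each stream's backbone."""
    inp = layers.Input(shape=input_shape, name=f"{name}_input")
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax", name=f"{name}_scores")(x)
    return inp, out

# Spatial stream: one RGB frame. Temporal stream: 10 stacked optical flow
# fields (x and y components each), i.e., 20 input channels.
rgb_in, rgb_scores = make_stream((224, 224, 3), "spatial")
flow_in, flow_scores = make_stream((224, 224, 20), "temporal")

# Late fusion: average the per-stream class scores.
fused = layers.Average(name="fusion")([rgb_scores, flow_scores])
model = Model(inputs=[rgb_in, flow_in], outputs=fused)
model.summary()
```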
3.2. Three-Dimensional Convolutional Networks (C3D)
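A trimmed-down sketch of the C3D idea follows, assuming 16-frame clips at 112x112 resolution (the input size of the original C3D): every kernel is a 3x3x3 cube, so each layer convolves jointly over time, height, and width. The layer widths here are illustrative, not the exact C3D configuration.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 101  # illustrative class count

model = models.Sequential([
    layers.Input(shape=(16, 112, 112, 3)),            # 16-frame RGB clip
    layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),          # preserve early temporal resolution
    layers.Conv3D(128, (3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Conv3D(256, (3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.GlobalAveragePooling3D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```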
3.3. Long Short-Term Memory (LSTM) Recurrent Networks
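The following is a minimal sketch of the CNN + LSTM pattern: a per-frame CNN wrapped in TimeDistributed extracts spatial features, and an LSTM summarizes the sequence. Layer sizes are illustrative assumptions; the variable-length time axis (None) shows how such models accept sequences of varying lengths.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 101  # illustrative class count

model = models.Sequential([
    layers.Input(shape=(None, 112, 112, 3)),  # variable-length frame sequence
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(256),                          # summarizes the whole sequence
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```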
3.4. Attention Mechanisms
- Spatial Attention: Highlights the regions of interest within video frames, such as the actor and the objects being manipulated, while down-weighting static cues such as backgrounds, to make sure that the features extracted are the ones relevant to the action.
- Temporal Attention: Weights the most informative segments of a video sequence, capturing the motion dynamics that distinguish one action from another.
- Advantages: Heightens the discriminative capability of the model and improves adaptability by suppressing background disturbances and distractor objects, allowing for improvements in action recognition accuracy. A minimal sketch of temporal attention follows this list.
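As promised above, here is a minimal Keras sketch of temporal attention over pre-extracted per-frame features. The feature dimension and class count are assumptions for illustration: each frame receives a scalar relevance score, the scores are normalized over time with a softmax, and the clip representation is the weighted sum of frame features.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 101   # illustrative class count
FEATURE_DIM = 256   # per-frame feature size, also illustrative

# Input: a sequence of per-frame CNN features of variable length.
frames = layers.Input(shape=(None, FEATURE_DIM))

# One relevance score per frame, normalized over the time axis.
scores = layers.Dense(1)(frames)             # (batch, T, 1)
weights = layers.Softmax(axis=1)(scores)     # attention weights over frames

# Attention pooling: weighted sum of frame features, so that informative
# segments dominate the clip-level representation.
clip = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([weights, frames])

outputs = layers.Dense(NUM_CLASSES, activation="softmax")(clip)
model = Model(frames, outputs)
```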
3.5. Hybrid Models
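As an illustration of multimodal fusion in hybrid models, the sketch below concatenates an RGB clip embedding with a skeleton embedding before classification. The feature dimensions and branch sizes are assumptions made for the sake of a runnable example; fusion could equally happen at the score level.

```python
from tensorflow.keras import layers, Model

NUM_CLASSES = 101  # illustrative class count

# RGB branch: a clip-level visual embedding (e.g., from a pretrained backbone).
rgb = layers.Input(shape=(512,), name="rgb_feature")
# Skeleton branch: flattened joint coordinates (25 joints x 3 coords x 2 frames,
# purely for illustration).
skel = layers.Input(shape=(150,), name="skeleton_feature")

x1 = layers.Dense(128, activation="relu")(rgb)
x2 = layers.Dense(128, activation="relu")(skel)

# Feature-level fusion: concatenate the modality embeddings, then classify.
fused = layers.Concatenate()([x1, x2])
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)
model = Model(inputs=[rgb, skel], outputs=out)
```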
4. Experimental Setup
4.1. Dataset
4.2. Preprocessing
4.3. Model Configurations
4.4. Model Initialization
4.5. Hyperparameters
4.6. Training Procedures
- Loss Function: Cross-entropy loss is used to train the models, as is standard for multiclass classification.
- Batch Size: A batch size of 32 balances memory use and stable gradient estimates, aiding model convergence.
- Training Epochs: The models are trained for up to 50 epochs, with early stopping based on validation performance (patience of 10 epochs) applied to prevent overfitting.
- Hardware: The experiments are carried out on Google Colab with a T4 GPU for model training and evaluation. A sketch of this training setup follows the list.
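Putting the settings above together, here is a minimal, runnable Keras sketch of the training loop. The tiny dense model and the random stand-in data are placeholders so the example executes; only the loss, batch size, epoch cap, and early-stopping patience come from this section.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10  # placeholder class count

# Random stand-in features and labels so the sketch runs end to end;
# in the actual pipeline the inputs are preprocessed video clips.
x = np.random.rand(256, 64).astype("float32")
y = tf.keras.utils.to_categorical(
    np.random.randint(NUM_CLASSES, size=256), NUM_CLASSES)

model = models.Sequential([
    layers.Input(shape=(64,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # cross-entropy for multiclass
              metrics=["accuracy"])

# Early stopping on validation performance with a patience of 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=10, restore_best_weights=True)

model.fit(x, y,
          validation_split=0.2,
          batch_size=32,  # batch size from this section
          epochs=50,      # upper bound; early stopping usually halts sooner
          callbacks=[early_stop])
```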
4.7. Evaluation Metrics
- Accuracy: The primary evaluation metric, defined as the percentage of correctly classified video clips.
- Precision, Recall, and F1-Score: Precision, recall, and F1-score are computed for each class and then macro-averaged over all classes, giving a more detailed view of model performance than accuracy alone.
- Confusion Matrix: The confusion matrix summarizes classification performance, highlighting correct and incorrect predictions for each human action category.
- ROC-AUC (Area Under the Receiver Operating Characteristic Curve): The ROC curve plots the true positive rate against the false positive rate; the area under it summarizes how well the model separates the classes.
- Statistical Significance: Statistical tests are applied to check whether differences between models reflect true differences in performance rather than chance effects.
- Qualitative Analysis: Attention heat maps and per-video predictions further strengthen our insight into model behavior by showing which features the model attends to and how it responds to individual clips. A sketch of the quantitative metrics follows this list.
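The quantitative metrics above map directly onto scikit-learn calls; the sketch below computes all of them. The random labels and probabilities are stand-ins so the example runs; in practice y_true comes from the test split and y_prob from the trained model's softmax outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)

# Stand-in predictions so the sketch runs end to end.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=200)          # integer class labels
y_prob = rng.dirichlet(np.ones(5), size=200)   # per-class probabilities
y_pred = y_prob.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")           # macro-average over classes
cm = confusion_matrix(y_true, y_pred)          # rows: true class, cols: predicted
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # one-vs-rest ROC-AUC

print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} "
      f"f1={f1:.3f} roc_auc={auc:.3f}")
print(cm)
```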
4.8. Libraries and Tools
- TensorFlow: TensorFlow is mainly used for building and training the deep learning models.
- Scikit-Learn: A machine learning library that provides convenient evaluation metrics and data-preprocessing utilities.
- Pandas: A library for data analysis and manipulation.
- NumPy: Used for numerical operations and array processing.
- OpenCV: Used for video-processing tasks such as optical flow computation; a sketch follows this list.
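As an example of the OpenCV role mentioned above, here is a sketch of dense optical flow extraction with Farnebäck's method (cv2.calcOpticalFlowFarneback, a standard OpenCV function). The video path and the choice of ten stacked flow fields are illustrative; the parameter values are common defaults, not necessarily the paper's settings.

```python
import cv2
import numpy as np

# "video.mp4" is a placeholder path.
cap = cv2.VideoCapture("video.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # flow has shape (H, W, 2): per-pixel displacement in x and y.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow)
    prev_gray = gray
cap.release()

# Ten consecutive flow fields stacked channel-wise (20 channels) would form
# one input to the temporal stream of Section 3.1.
flow_stack = np.concatenate(flows[:10], axis=-1) if len(flows) >= 10 else None
```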
5. Results
5.1. Accuracy
- Two-Stream CNN: Achieved 85.4% accuracy, showing its capability to capture both spatial and temporal information effectively.
- 3D CNN: Attained an accuracy of 83.2% by learning spatiotemporal features end to end.
- CNN + LSTM: Achieved an accuracy of 82.5%, showing that it handles sequences of varying lengths effectively.
- I3D: By virtue of leveraging pre-trained 2D models and capturing both appearance and motion cues, it achieved the highest accuracy of 88.1%.
- Attention Mechanisms: Reached an accuracy of 86.7%, showing that the model can concentrate on relevant spatiotemporal regions.
- Hybrid Models: Achieved 84.9% accuracy, showing that combining multiple modalities yields robust performance.
5.2. Precision, Recall, and F1-Score
- Two-Stream CNN: Precision 0.86, recall 0.85, and F1-score 0.85, indicating well-distributed performance across various metrics.
- 3D CNN: Precision 0.84, recall 0.83, and F1-score 0.83, suggesting similar performance to two-stream CNN but with slightly lower accuracy.
- CNN + LSTM: Precision 0.83, recall 0.82, and F1-score 0.82, demonstrating effective handling of sequential inputs.
- I3D: Led all models with precision 0.89, recall 0.88, and F1-score 0.88, showcasing its strong overall performance.
- Attention Mechanisms: Achieved precision 0.87, recall 0.87, and F1-score 0.87, highlighting its ability to focus on relevant spatiotemporal regions.
- Hybrid Models: Secured precision 0.85, recall 0.85, and F1-score 0.85, indicating robust performance with multimodal data fusion.
5.3. Analysis
6. Future Scope
6.1. Advanced Architectures and Multimodal Fusion
6.2. Real-Time Processing
6.3. Data Augmentation and Synthesis
6.4. Transfer Learning and Domain Adaptation
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 1, 568–576.
2. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
3. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.
4. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
5. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 20–36.
6. Girdhar, R.; Ramanan, D. Attentional pooling for action recognition. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 34–45.
7. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118.
8. Wang, J.; Zhang, Z.; Gao, J. Action recognition with multiscale spatiotemporal contexts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3185–3193.
9. Singh, T.; Solanki, A.; Sharma, S.K.; Jhanjhi, N.Z.; Ghoniem, R.M. Grey Wolf Optimization-Based CNN-LSTM Network for the Prediction of Energy Consumption in Smart Home Environment. IEEE Access 2023, 11, 114917–114935.
10. Yan, O.J.; Ashraf, H.; Ihsan, U.; Jhanjhi, N.; Ray, S.K. Facial expression recognition (FER) system using deep learning. In Proceedings of the 2024 IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC), Tandojam, Pakistan, 8–9 January 2024; pp. 1–11.
11. Mahendar, M.; Malik, A.; Batra, I. Facial Micro-expression Modelling-Based Student Learning Rate Evaluation Using VGG–CNN Transfer Learning Model. SN Comput. Sci. 2024, 5, 204.
12. Al-Quayed, F.; Javed, D.; Jhanjhi, N.Z.; Humayun, M.; Alnusairi, T.S. A Hybrid Transformer-Based Model for Optimizing Fake News Detection. IEEE Access 2024, 12, 160822–160834.
13. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164.