A Novel 3D Convolutional Neural Network-Based Deep Learning Model for Spatiotemporal Feature Mapping for Video Analysis: Feasibility Study for Gastrointestinal Endoscopic Video Classification
Abstract
1. Introduction
2. Methods
3. Design Procedure
4. Results and Discussion
- i. Novel framework: We developed a novel framework by introducing a 3D version of the parallel spatial and channel squeeze-and-excitation (P-scSE3D) module into a 3D CNN-based architecture tailored for classifying upper and lower GI endoscopic videos. This approach leverages spatiotemporal features to improve accuracy and efficiency.
- ii. Extensive experiments: We conducted extensive experiments to demonstrate the potential of video-based studies in AI for GI endoscopy. The results show that integrating the P-scSE3D module increased the F1-score by 7%.
- iii. Future directions: Our contributions lay the groundwork for future research in video AI for GI endoscopy, including further optimization of 3D CNN architectures, exploration of additional clinical applications, and integration of explainable AI techniques for better interpretability. Wider-ranging video AI applications, such as retinal imaging, rhinoscopy, cardiac MRI and CT cine imaging, and ultrasound video across many disciplines, will open new avenues for novel diagnostics leveraging this technology.
- iv. Ensuring reproducibility: Our model uses publicly available data, allowing other researchers to explore this methodology, ensuring the reproducibility of our results and facilitating further research in this area.
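The P-scSE3D module named in contribution (i) recalibrates a 3D feature map along both the channel and spatial axes and combines the two recalibrations in parallel. As a rough illustration only (not the authors' implementation), a minimal NumPy sketch of such a block is given below; the weight shapes, the ReLU bottleneck, and the element-wise-max combination are assumptions borrowed from the 2D scSE literature:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_scse3d(x, w1, w2, w_sp):
    """Parallel spatial + channel squeeze-and-excitation on a 3D feature map.

    x    : feature map of shape (C, D, H, W)
    w1   : (C//r, C) bottleneck weights for the channel branch
    w2   : (C, C//r) expansion weights for the channel branch
    w_sp : (C,) per-channel weights of a 1x1x1 "conv" for the spatial branch
    """
    # Channel SE (cSE): global average pool over D, H, W -> FC -> ReLU -> FC -> sigmoid
    z = x.mean(axis=(1, 2, 3))                     # (C,) channel descriptor
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))      # (C,) channel gates in (0, 1)
    cse = x * s[:, None, None, None]

    # Spatial SE (sSE): collapse channels with a 1x1x1 projection -> sigmoid
    q = sigmoid(np.tensordot(w_sp, x, axes=([0], [0])))  # (D, H, W) spatial gates
    sse = x * q[None, ...]

    # Parallel combination; element-wise max is one common choice
    return np.maximum(cse, sse)
```

Because both gates lie in (0, 1), the block can only attenuate a non-negative input, which is what makes it a recalibration rather than a new feature extractor.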
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Han, L.; Shi, H.; Li, Y.; Qi, H.; Wang, Y.; Gu, J.; Wu, J.; Zhao, S.; Cao, P.; Xu, L.; et al. Excess deaths of gastrointestinal, liver, and pancreatic diseases during the COVID-19 pandemic in the United States. Int. J. Public Health 2023, 68, 1606305.
- Adedire, O.; Love, N.K.; Hughes, H.E.; Buchan, I.; Vivancos, R.; Elliot, A.J. Early Detection and Monitoring of Gastrointestinal Infections Using Syndromic Surveillance: A Systematic Review. Int. J. Environ. Res. Public Health 2024, 21, 489.
- Borgli, H.; Thambawita, V.; Smedsrud, P.H.; Hicks, S.; Jha, D.; Eskeland, S.L.; Randel, K.R.; Pogorelov, K.; Lux, M.; Nguyen, D.T.D.; et al. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Sci. Data 2020, 7, 283.
- Akpunonu, B.; Hummell, J.; Akpunonu, J.D.; Din, S.U. Capsule endoscopy in gastrointestinal disease: Evaluation, diagnosis, and treatment. Cleve. Clin. J. Med. 2022, 89, 200–211.
- Öztürk, Ş.; Özkaya, U. Residual LSTM layered CNN for classification of gastrointestinal tract diseases. J. Biomed. Inform. 2021, 113, 103638.
- Zhuang, H.; Zhang, J.; Liao, F. A systematic review on application of deep learning in digestive system image processing. Vis. Comput. 2023, 39, 2207–2222.
- Min, J.K.; Kwak, M.S.; Cha, J.M. Overview of deep learning in gastrointestinal endoscopy. Gut Liver 2019, 13, 388–393.
- Kim, E.-S.; Lee, K.-S. Artificial Intelligence in Gastrointestinal Disease: Diagnosis and Management; MDPI: Basel, Switzerland, 2024.
- Sethi, A.; Damani, S.; Sethi, A.K.; Rajagopal, A.; Gopalakrishnan, K.; Cherukuri, A.S.S.; Arunachalam, S.P. Gastrointestinal Endoscopic Image Classification using a Novel Wavelet Decomposition Based Deep Learning Algorithm. In Proceedings of the 2023 IEEE International Conference on Electro Information Technology (eIT), Romeoville, IL, USA, 18–20 May 2023; pp. 616–621.
- Lonseko, Z.M.; Adjei, P.E.; Du, W.; Luo, C.; Hu, D.; Zhu, L.; Gan, T.; Rao, N. Gastrointestinal disease classification in endoscopic images using attention-guided convolutional neural networks. Appl. Sci. 2021, 11, 11136.
- Owais, M.; Arsalan, M.; Choi, J.; Mahmood, T.; Park, K.R. Artificial intelligence-based classification of multiple gastrointestinal diseases using endoscopy videos for clinical diagnosis. J. Clin. Med. 2019, 8, 986.
- Sharma, V.; Gupta, M.; Kumar, A.; Mishra, D. Video processing using deep learning techniques: A systematic literature review. IEEE Access 2021, 9, 139489–139507.
- Yu, T.; Hu, H.; Zhang, X.; Lei, H.; Liu, J.; Hu, W.; Duan, H.; Si, J. Real-Time Multi-Label Upper Gastrointestinal Anatomy Recognition from Gastroscope Videos. Appl. Sci. 2022, 12, 3306.
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
- Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541.
- Scovanner, P.; Ali, S.; Shah, M. A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, 24–29 September 2007; pp. 357–360.
- Kläser, A.; Marszałek, M.; Schmid, C. A spatio-temporal descriptor based on 3D-gradients. In Proceedings of the BMVC 2008—19th British Machine Vision Conference, Leeds, UK, 1–4 September 2008; pp. 271–275.
- Willems, G.; Tuytelaars, T.; Van Gool, L. An efficient dense and scale-invariant spatio-temporal interest point detector. In Computer Vision—ECCV 2008, Proceedings of the 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008, Part II; Springer: Berlin/Heidelberg, Germany, 2008; pp. 650–663.
- Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
- Ibrahim, M.S.; Muralidharan, S.; Deng, Z.; Vahdat, A.; Mori, G. A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1971–1980.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- He, J.-Y.; Wu, X.; Cheng, Z.-Q.; Yuan, Z.; Jiang, Y.-G. DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition. Neurocomputing 2021, 444, 319–331.
- Ballas, N.; Yao, L.; Pal, C. Delving deeper into convolutional networks for learning video representations. arXiv 2015, arXiv:1511.06432.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4305–4314.
- Duta, I.C.; Nguyen, T.A.; Aizawa, K.; Ionescu, B.; Sebe, N. Boosting VLAD with double assignment using deep features for action recognition in videos. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2210–2215.
- Diba, A.; Fayyaz, M.; Sharma, V.; Karami, A.H.; Arzani, M.M.; Yousefzadeh, R.; Van Gool, L. Temporal 3D ConvNets: New architecture and transfer learning for video classification. arXiv 2017, arXiv:1711.08200.
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
- Yin, Z.; Geraedts, V.J.; Wang, Z.; Contarino, M.F.; Dibeklioglu, H.; Van Gemert, J. Assessment of Parkinson’s disease severity from videos using deep architectures. IEEE J. Biomed. Health Inform. 2021, 26, 1164–1176.
- Zhang, H.; Ho, E.S.L.; Zhang, F.X.; Shum, H.P.H. Pose-based tremor classification for Parkinson’s disease diagnosis from video. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 489–499.
- Li, G.Y.; Chen, L.; Zahiri, M.; Balaraju, N.; Patil, S.; Mehanian, C.; Gregory, C.; Gregory, K.; Raju, B.; Kruecker, J.; et al. Weakly Semi-Supervised Detector-Based Video Classification with Temporal Context for Lung Ultrasound. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 2483–2492.
- Shea, D.E.; Kulhare, S.; Millin, R.; Laverriere, Z.; Mehanian, C.; Delahunt, C.B.; Banik, D.; Zheng, X.; Zhu, M.; Ji, Y.; et al. Deep learning video classification of lung ultrasound features associated with pneumonia. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3103–3112.
- Thuwajit, P.; Rangpong, P.; Sawangjai, P.; Autthasan, P.; Chaisaen, R.; Banluesombatkul, N.; Boonchit, P.; Tatsaringkansakul, N.; Sudhawiyangkul, T.; Wilaiprasitporn, T. EEGWaveNet: Multiscale CNN-based spatiotemporal feature extraction for EEG seizure detection. IEEE Trans. Ind. Inform. 2021, 18, 5547–5557.
- Krishnaswamy, D.; Ebadi, S.E.; Bolouri, S.E.S.; Zonoobi, D.; Greiner, R.; Meuser-Herr, N.; Jaremko, J.L.; Kapur, J.; Noga, M.; Punithakumar, K. A novel machine learning-based video classification approach to detect pneumonia in COVID-19 patients using lung ultrasound. Int. J. Noncommun. Dis. 2021, 6 (Suppl. S1), S69–S75.
- Jin, Y.; Li, H.; Dou, Q.; Chen, H.; Qin, J.; Fu, C.-W.; Heng, P.-A. Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Med. Image Anal. 2020, 59, 101572.
- Oh, J.; Hwang, S.; Lee, J.; Tavanapong, W.; Wong, J.; de Groen, P.C. Informative frame classification for endoscopy video. Med. Image Anal. 2007, 11, 110–127.
- Xu, Z.; Tao, Y.; Wenfang, Z.; Ne, L.; Zhengxing, H.; Jiquan, L.; Weiling, H.; Huilong, D.; Jianmin, S. Upper gastrointestinal anatomy detection with multi-task convolutional neural networks. Healthc. Technol. Lett. 2019, 6, 176–180.
- Lee, J.; Oh, J.; Shah, S.K.; Yuan, X.; Tang, S.J. Automatic classification of digestive organs in wireless capsule endoscopy videos. In Proceedings of the 2007 ACM Symposium on Applied Computing, Seoul, Republic of Korea, 11–15 March 2007; pp. 1041–1045.
- Billah, M.; Waheed, S.; Rahman, M.M. An automatic gastrointestinal polyp detection system in video endoscopy using fusion of color wavelet and convolutional neural network features. Int. J. Biomed. Imaging 2017, 2017, 9545920.
- TensorFlow, Video Classification. Available online: https://www.tensorflow.org/tutorials/video/video_classification (accessed on 28 April 2024).
- Lee, Y.; Kim, H.I.; Yun, K.; Moon, J. Diverse temporal aggregation and depthwise spatiotemporal factorization for efficient video classification. IEEE Access 2021, 9, 163054–163064.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
- Dhar, M.K.; Zhang, T.; Patel, Y.; Gopalakrishnan, S.; Yu, Z. FUSegNet: A deep convolutional neural network for foot ulcer segmentation. Biomed. Signal Process. Control 2024, 92, 106057.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating Fully Convolutional Networks with Spatial and Channel ‘Squeeze and Excitation’ Blocks. IEEE Trans. Med. Imaging 2019, 38, 540–549.
- Dhar, M.K.; Wang, C.; Patel, Y.; Zhang, T.; Niezgoda, J.; Gopalakrishnan, S.; Chen, K.; Yu, Z. Wound Tissue Segmentation in Diabetic Foot Ulcer Images Using Deep Learning: A Pilot Study. arXiv 2024, arXiv:2406.16012.
- Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
No. of Frames (N) | Frame Gap (G) | Accuracy | Precision | Recall | F1-Score | AUC | Time (min) |
---|---|---|---|---|---|---|---|
10 | 15 | 0.933 (CI: 0.907, 0.959) | 0.932 | 0.944 | 0.935 (CI: 0.910, 0.960) | 0.933 | 4.067 |
20 | 8 | 0.914 (CI: 0.889, 0.938) | 0.906 | 0.933 | 0.916 (CI: 0.893, 0.940) | 0.914 | 6.810 |
50 | 5 | 0.933 (CI: 0.910, 0.961) | 0.924 | 0.950 | 0.935 (CI: 0.907, 0.962) | 0.933 | 13.797 |
100 | 2 | 0.925 (CI: 0.891, 0.959) | 0.925 | 0.939 | 0.928 (CI: 0.897, 0.959) | 0.925 | 25.579 |
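For readers reproducing point estimates and 95% confidence intervals like those in the table above, the sketch below shows one common way to compute an F1-score and a percentile-bootstrap CI in plain NumPy. The helper names are hypothetical and this is not necessarily the authors' procedure (the excerpt does not state how the CIs were obtained):

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1-score from 0/1 label arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement, re-score."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = [metric(y_true[idx], y_pred[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The same `bootstrap_ci` wrapper works for accuracy or any other per-case metric by swapping the `metric` callable.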
No. of Frames (N) | Frame Gap (G) | Max Incorrect | Accuracy (Max) | Accuracy (Min) | F1-Score (Max) | F1-Score (Min) | AUC (Max) | AUC (Min)
---|---|---|---|---|---|---|---|---
10 | 15 | 3 | 1 | 0.833 | 1 | 0.824 | 1 | 0.833
20 | 8 | 3 | 1 | 0.833 | 1 | 0.824 | 1 | 0.833
50 | 5 | 3 | 1 | 0.833 | 1 | 0.824 | 1 | 0.833
100 | 2 | 4 | 1 | 0.778 | 1 | 0.818 | 1 | 0.778
No. of Frames (N) | Frame Gap (G) | 0 Incorrect | 1 Incorrect | 2 Incorrect | 3 Incorrect | 4 Incorrect
---|---|---|---|---|---|---
10 | 15 | 5 | 9 | 3 | 3 | 0
20 | 8 | 3 | 6 | 8 | 3 | 0
50 | 5 | 6 | 7 | 4 | 3 | 0
100 | 2 | 7 | 4 | 6 | 1 | 2
P-scSE3D | Accuracy | Precision | Recall | F1-Score | AUC |
---|---|---|---|---|---|
No | 0.88 | 0.90 | 0.88 | 0.87 | 0.88 |
Yes | 0.93 | 0.93 | 0.94 | 0.94 | 0.93 |
P-scSE3D | 0 Incorrect | 1 Incorrect | 2 Incorrect | 3 Incorrect | 4 Incorrect | 5 Incorrect | 6 Incorrect | 7 Incorrect
---|---|---|---|---|---|---|---|---
No | 3 | 17 | 13 | 9 | 5 | 1 | 1 | 1
Yes | 12 | 14 | 16 | 8 | 0 | 0 | 0 | 0
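The results tables are parameterized by the number of frames N and the frame gap G used to build each input clip. The excerpt does not give the exact sampling rule, but a simple strided scheme consistent with these parameters could look like the hypothetical helper below (clipping to the video length is an assumption for short videos):

```python
def sample_frame_indices(total_frames, n_frames, gap):
    """Pick n_frames indices spaced gap frames apart, starting at frame 0.

    Indices past the end of the video are clipped to the last frame, so the
    function always returns exactly n_frames indices.
    """
    return [min(i * gap, total_frames - 1) for i in range(n_frames)]
```

For example, with N = 10 and G = 15 a 300-frame video yields indices 0, 15, ..., 135, which matches the first row of the tables above in spirit: larger N with smaller G covers a similar temporal span at a higher frame cost.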
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Dhar, M.K.; Deb, M.; Elangovan, P.; Gopalakrishnan, K.; Sood, D.; Kaur, A.; Parikh, C.; Rapolu, S.; Panjwani, G.A.R.; Ansari, R.A.; et al. A Novel 3D Convolutional Neural Network-Based Deep Learning Model for Spatiotemporal Feature Mapping for Video Analysis: Feasibility Study for Gastrointestinal Endoscopic Video Classification. J. Imaging 2025, 11, 243. https://doi.org/10.3390/jimaging11070243