A Multimodal Deep Learning Pipeline for Enhanced Detection and Classification of Oral Cancer †
Abstract
1. Introduction
2. Related Works
3. Materials and Methods
3.1. Data Description
3.2. Data Preparation
3.3. The Proposed Approach
Global average pooling reduces each feature map to a single value:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{i,j,c}$$

where:
- $x_{i,j,c}$ represents the feature map value at position $(i, j)$ for feature map channel $c$;
- $H$ is the height (number of rows) of the feature map;
- $W$ is the width (number of columns) of the feature map;
- $\mathbf{z}$ is the output feature vector, a 1D vector containing one value per feature map channel.
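As a concrete illustration, the following NumPy sketch pools a feature map of the shape produced by the backbone in this work (12 × 16 × 768, per the hyperparameter table below); the array contents are random placeholders, not real features.

```python
import numpy as np

# Toy feature map with the spatial/channel shape reported for the
# backbone's final layer; values are random placeholders.
H, W, C = 12, 16, 768
feature_maps = np.random.rand(H, W, C).astype(np.float32)

# Global average pooling: average over the two spatial dimensions,
# leaving one scalar per channel.
z = feature_maps.mean(axis=(0, 1))
assert z.shape == (768,)  # one value per feature map channel
```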
Layer normalization standardizes each input vector using learnable scale and shift parameters:

$$\text{LN}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta$$

where:
- $x$ is the input vector of features;
- $\mu$ is the mean of the input features;
- $\sigma$ is the standard deviation of the input features;
- $\gamma$ and $\beta$ are learnable parameters (scaling and shifting).

The GELU activation is defined as

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where:
- $x$ is the input to the GELU function;
- $\Phi$ is the cumulative distribution function of the standard normal distribution.
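A minimal NumPy sketch of these two operations follows, assuming per-vector statistics and the widely used tanh approximation of GELU (the exact variant used in the pipeline is not specified here).

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    # Standardize the feature vector, then apply the learnable scale/shift.
    mu, sigma = x.mean(), x.std()
    return gamma * (x - mu) / (sigma + eps) + beta

def gelu(x):
    # Tanh approximation of GELU(x) = x * Phi(x).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([0.5, -1.0, 2.0, 0.0], dtype=np.float32)
y = gelu(layer_norm(x))
```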
Each ConvNeXt block applies a depthwise convolution, which filters every channel independently:

$$Y_{i,j,c} = \sum_{m=1}^{K_h} \sum_{n=1}^{K_w} X_{i+m-1,\, j+n-1,\, c} \cdot K_{m,n,c}$$

where:
- $X_{i,j,c}$: input feature map at spatial position $(i, j)$ and channel $c$;
- $K_{m,n,c}$: depthwise convolution kernel at spatial position $(m, n)$ for channel $c$;
- $K_h$: kernel height;
- $K_w$: kernel width;
- $Y_{i,j,c}$: output feature map at spatial position $(i, j)$ and channel $c$.
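The loop below is a minimal reference implementation of valid-padding depthwise convolution (no stride, no padding, NumPy only); production code would use a framework primitive such as a depthwise Conv2D layer.

```python
import numpy as np

def depthwise_conv2d(X, K):
    # Each channel is convolved with its own kernel; channels never mix.
    H, W, C = X.shape
    Kh, Kw, Kc = K.shape
    assert C == Kc
    Y = np.zeros((H - Kh + 1, W - Kw + 1, C), dtype=X.dtype)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Elementwise product over the receptive field, summed per channel.
            Y[i, j, :] = (X[i:i + Kh, j:j + Kw, :] * K).sum(axis=(0, 1))
    return Y

X = np.random.rand(12, 16, 4).astype(np.float32)
K = np.random.rand(7, 7, 4).astype(np.float32)  # 7 x 7 kernels, as in ConvNeXt blocks
Y = depthwise_conv2d(X, K)                      # shape (6, 10, 4)
```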
4. Results and Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2021, 71, 209–249.
- Chen, P.H.; Wu, C.H.; Chen, Y.F.; Yeh, Y.C.; Lin, B.H.; Chang, K.W.; Lai, P.Y.; Hou, M.C.; Lu, C.L.; Kuo, W.C. Combination of structural and vascular optical coherence tomography for differentiating oral lesions of mice in different carcinogenesis stages. Biomed. Opt. Express 2018, 9, 1461–1476.
- Yang, E.C.; Tan, M.T.; Schwarz, R.A.; Richards-Kortum, R.R.; Gillenwater, A.M.; Vigneswaran, N. Noninvasive diagnostic adjuncts for the evaluation of potentially premalignant oral epithelial lesions: Current limitations and future directions. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2018, 125, 670–681.
- Rimal, J.; Shrestha, A.; Maharjan, I.K.; Shrestha, S.; Shah, P. Risk assessment of smokeless tobacco among oral precancer and cancer patients in eastern developmental region of Nepal. Asian Pac. J. Cancer Prev. APJCP 2019, 20, 411.
- Devindi, G.; Dissanayake, D.; Liyanage, S.; Francis, F.; Pavithya, M.; Piyarathne, N.; Hettiarachchi, P.; Rasnayaka, R.; Jayasinghe, R.; Ragel, R.; et al. Multimodal Deep Convolutional Neural Network Pipeline for AI-Assisted Early Detection of Oral Cancer. IEEE Access 2024, 12, 124375–124390.
- Tafala, I.; Ben-Bouazza, F.E.; Edder, A.; Manchadi, O.; Et-Taoussi, M.; Jioudi, B. EfficientNetV2 and Attention Mechanisms for the automated detection of Cephalometric landmarks. In Proceedings of the 2024 International Conference on Intelligent Systems and Computer Vision (ISCV), Fez, Morocco, 8–10 May 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
- Warin, K.; Suebnukarn, S. Deep learning in oral cancer-a systematic review. BMC Oral Health 2024, 24, 212.
- Tafala, I.; Ben-Bouazza, F.E.; Edder, A.; Manchadi, O.; Et-Taoussi, M.; Jioudi, B. Cephalometric Landmarks Identification Through an Object Detection-based Deep Learning Model. Int. J. Adv. Comput. Sci. Appl. 2024, 15.
- Sharkas, M.; Attallah, O. Color-CADx: A deep learning approach for colorectal cancer classification through triple convolutional neural networks and discrete cosine transform. Sci. Rep. 2024, 14, 6914.
- Vanitha, K.; Sree, S.S.; Guluwadi, S. Deep learning ensemble approach with explainable AI for lung and colon cancer classification using advanced hyperparameter tuning. BMC Med. Inform. Decis. Mak. 2024, 24, 222.
- Huang, Q.; Ding, H.; Razmjooy, N. Oral cancer detection using convolutional neural network optimized by combined seagull optimization algorithm. Biomed. Signal Process. Control 2024, 87, 105546.
- Vollmer, A.; Hartmann, S.; Vollmer, M.; Shavlokhova, V.; Brands, R.C.; Kübler, A.; Wollborn, J.; Hassel, F.; Couillard-Despres, S.; Lang, G.; et al. Multimodal artificial intelligence-based pathogenomics improves survival prediction in oral squamous cell carcinoma. Sci. Rep. 2024, 14, 5687.
- Sangeetha, S.; Mathivanan, S.K.; Karthikeyan, P.; Rajadurai, H.; Shivahare, B.D.; Mallik, S.; Qin, H. An enhanced multimodal fusion deep learning neural network for lung cancer classification. Syst. Soft Comput. 2024, 6, 200068.
- Chen, X.; Xie, H.; Tao, X.; Wang, F.L.; Leng, M.; Lei, B. Artificial intelligence and multimodal data fusion for smart healthcare: Topic modeling and bibliometrics. Artif. Intell. Rev. 2024, 57, 91.
- Ribeiro-de Assis, M.C.F.; Soares, J.P.; de Lima, L.M.; de Barros, L.A.P.; Grão-Velloso, T.R.; Krohling, R.A.; Camisasca, D.R. NDB-UFES: An oral cancer and leukoplakia dataset composed of histopathological images and patient data. Data Brief 2023, 48, 109128.
- Yang, S.; Xiao, W.; Zhang, M.; Guo, S.; Zhao, J.; Shen, F. Image data augmentation for deep learning: A survey. arXiv 2022, arXiv:2204.08610.
- Plompen, A.J.; Cabellos, O.; De Saint Jean, C.; Fleming, M.; Algora, A.; Angelone, M.; Archier, P.; Bauge, E.; Bersillon, O.; Blokhin, A.; et al. The joint evaluated fission and fusion nuclear data library, JEFF-3.3. Eur. Phys. J. A 2020, 56, 1–108.
- Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379.
- Huang, B.; Yang, F.; Yin, M.; Mo, X.; Zhong, C. A review of multimodal medical image fusion techniques. Comput. Math. Methods Med. 2020, 2020, 8279342.
- Azam, M.A.; Khan, K.B.; Salahuddin, S.; Rehman, E.; Khan, S.A.; Khan, M.A.; Kadry, S.; Gandomi, A.H. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Comput. Biol. Med. 2022, 144, 105253.
- Pawłowski, M.; Wróblewska, A.; Sysko-Romańczuk, S. Effective techniques for multimodal data fusion: A comparative analysis. Sensors 2023, 23, 2381.
- Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142.
- Yu, W.; Zhou, P.; Yan, S.; Wang, X. InceptionNeXt: When Inception meets ConvNeXt. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5672–5683.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
- Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN model-based approach in classification. In Proceedings of the On the Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, 3–7 November 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 986–996.
- Hossin, M.; Sulaiman, M.N. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1.
- Halder, R.K.; Uddin, M.N.; Uddin, M.A.; Aryal, S.; Khraisat, A. Enhancing K-nearest neighbor algorithm: A comprehensive review and performance analysis of modifications. J. Big Data 2024, 11, 113.
- Abhisheka, B.; Biswas, S.K.; Purkayastha, B. A comprehensive review on breast cancer detection, classification and segmentation using deep learning. Arch. Comput. Methods Eng. 2023, 30, 5023–5052.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10096–10106.
- Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972.

| Hyperparameter | Value |
|---|---|
| Model Name | ConvNeXt feature extractor |
| Pretrained Weights | ImageNet |
| Input Shape | (384, 512, 3) |
| Output Feature Dimension | 768 |
| Include Top | False (the classification head is excluded so the model acts as a feature extractor) |
| Final Layer (before pooling) | LayerNormalization with output shape (12, 16, 768) |
| Pooling Layer | GlobalAveragePooling2D (output shape becomes (None, 768)) |
| Activation Function | GELU |
| Normalization | Layer normalization (LayerNorm) instead of BatchNorm in all layers |
| Convolutional Kernel Size | 7 × 7 (depthwise convolutions in ConvNeXt blocks) |
| Stem Convolution | 4 × 4 Conv with stride 4 for initial downsampling |
| Number of Stages | 4 |
| Blocks per Stage | [3, 3, 9, 3] (stages 1 to 4) |
| Expansion Ratio | 4× (for the MLPs inside the blocks) |
| Dropout Rate | 0.3 |
| Trainable Parameters | 49,454,688 |
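The configuration above can be approximated in Keras as sketched below. ConvNeXtTiny is assumed because its [3, 3, 9, 3] stage depths and 768-dimensional output match the table; the reported parameter count suggests the authors' exact variant or fine-tuning setup may differ, so treat this as a sketch rather than the published model.

```python
import tensorflow as tf

# Hedged sketch: ConvNeXt backbone used as a feature extractor.
# ConvNeXtTiny is an assumption based on the [3, 3, 9, 3] stage depths above.
backbone = tf.keras.applications.ConvNeXtTiny(
    include_top=False,            # drop the classification head
    weights="imagenet",           # pretrained weights, as in the table
    input_shape=(384, 512, 3),
    pooling="avg",                # GlobalAveragePooling2D -> (None, 768)
)

inputs = tf.keras.Input(shape=(384, 512, 3))
features = backbone(inputs)                         # 768-d feature vector
features = tf.keras.layers.Dropout(0.3)(features)   # dropout rate from the table
extractor = tf.keras.Model(inputs, features)
```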
| Model | Precision | F1-Score | Recall | Accuracy |
|---|---|---|---|---|
| XGBoost | 0.87 | 0.87 | 0.88 | 0.87 |
| Random Forest | 0.71 | 0.44 | 0.58 | 0.48 |
| SVM | 0.31 | 0.38 | 0.50 | 0.61 |
| Logistic Regression | 0.85 | 0.83 | 0.82 | 0.84 |
| KNN | 0.97 | 0.97 | 0.98 | 0.97 |
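Since KNN is the strongest classifier in this comparison, a hedged scikit-learn sketch of the final classification stage follows. The feature arrays, the split, and n_neighbors=5 are illustrative assumptions; the neighbor count and preprocessing actually used in the paper are not shown in this excerpt.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Placeholder data standing in for 768-d ConvNeXt features with binary labels.
X_train, y_train = np.random.rand(200, 768), np.random.randint(0, 2, 200)
X_test, y_test = np.random.rand(50, 768), np.random.randint(0, 2, 50)

# n_neighbors=5 is scikit-learn's default; the paper's value is not stated here.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```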
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).