Article

Oral Cancer Diagnosis Using Histopathology Images: An Explainable Hybrid Transformer Framework

by Francis Rudra D Cruze 1, Jeba Wasima 2, Md. Faruk Hosen 2,3, Mohammad Badrul Alam Miah 3,*, Zia Muhammad 4 and Md Fuyad Al Masud 5,*

1 Department of Computer Science and Engineering (CSE), East West University, Aftabnagar, Dhaka 1212, Bangladesh
2 Department of Computing and Information System (CIS), Daffodil International University, Savar, Dhaka 1216, Bangladesh
3 Department of Information and Communication Technology (ICT), Mawlana Bhashani Science and Technology University, Santosh, Tangail 1902, Bangladesh
4 Department of Computing, Design, and Communication, University of Jamestown, Jamestown, ND 58405, USA
5 Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND 58102, USA
* Authors to whom correspondence should be addressed.
Technologies 2026, 14(1), 39; https://doi.org/10.3390/technologies14010039
Submission received: 22 November 2025 / Revised: 12 December 2025 / Accepted: 16 December 2025 / Published: 5 January 2026
(This article belongs to the Special Issue Application of Artificial Intelligence in Medical Image Analysis)

Abstract

Oral cancer (OC) remains a major global health concern, with survival often limited by late diagnosis. Early and accurate detection is essential to improve patient outcomes and guide treatment decisions. In this study, we propose a computer-aided diagnostic (CAD) framework for classifying oral squamous cell carcinoma from histopathology images. The model combines a Swin Transformer for hierarchical feature extraction with a Vision Transformer (ViT) to capture long-range dependencies across image regions. SHapley Additive exPlanations (SHAP)-based feature selection enhances interpretability by highlighting the most informative features, while preprocessing steps such as stain normalization and contrast enhancement improve model generalization and reduce sample variability. Evaluated on a publicly available dataset, the framework achieved 99.25% accuracy (ACC), 99.21% sensitivity and a Matthews correlation coefficient (MCC) of 98.21%, outperforming existing methods. Ablation studies highlighted the importance of positional encoding, and statistical analyses confirmed the robustness and reliability of the results. To support real-time inference and scalable deployment, the proposed model has been integrated into a FastAPI-based web application. This framework offers a powerful, interpretable and practical tool for early OC detection and has potential for integration into routine clinical workflows.

1. Introduction

Cancer remains a major global health concern and is one of the leading causes of death worldwide [1,2]. Oral cancer (OC) is the sixth most frequent cancer, and its annual incidence is predicted to increase by 30% by 2030 [3,4]. OC mostly affects the head and neck area [5], including the floor of the mouth, the hard palate, the gingiva, the buccal mucosa, the anterior two-thirds of the tongue, the lips and the retromolar pad. Histologically, the majority of cases are squamous cell carcinomas [6,7,8]. Globally, OC accounted for about 377,000 new cases and 177,000 deaths in 2020, with most patients being over 50 years old [9,10,11,12]. Despite progress in detection and treatment, the global five-year survival rate for OC remains about 50%, with significant regional variability. Late-stage detection of OC greatly reduces treatment effectiveness and highlights the need for early diagnosis [13]. Routine oral screenings and thorough clinical examinations play a key role in improving patient outcomes [14]. Recent progress in diagnostic approaches has made it possible to detect precancerous lesions earlier, helping to improve patient outcomes and reduce the burden of OC [15,16]. Artificial intelligence (AI) provides novel opportunities to improve diagnostic ACC and enable personalized treatment strategies [17,18,19,20]. Machine learning (ML) and deep learning (DL) have shown strong ability to analyze histopathological images, detect subtle features and support clinical decisions [21,22]. AI improves traditional diagnostic methods by enabling faster and more reliable detection of malignancies [23]. Traditional histopathology depends on expert examination of tissue samples, a process limited by inter-observer variability, subjective interpretation and labor-intensive workflows [24,25]. AI-based CAD systems provide objective and standardized analyses, enhancing consistency and efficiency while preserving diagnostic ACC [24,26]. DL models show strong performance in classifying oral squamous cell carcinoma (OSCC) images and provide a complementary tool to conventional diagnostic methods [23,27,28]. This study presents an intelligent CAD framework for the early detection of OC through automated histopathological image analysis, which identifies malignancy-associated patterns and generates visual outputs to assist pathologists in decision making. The framework is designed to minimize computational costs, maintain scalability and ensure accessibility in resource-limited and rural settings, with the aim of supporting early diagnosis and improving patient outcomes. The main contributions of this study are as follows:
  • A hybrid Swin–ViT framework enables effective feature extraction and accurate classification of OC histopathology images.
  • Explainable AI (XAI) driven feature selection boosts interpretability and prediction while cutting dimensionality.
  • A FastAPI-based real-time web application designed for scalable deployment in clinical and low-resource environments.

2. Related Works

Early detection of OC is challenging due to its complex presentation and high mortality. In recent years, ML and DL techniques have gained attention for enhancing diagnostic ACC and enabling timely intervention. While early AI applications relied on conventional image processing, modern ML and DL approaches provide automated, robust and accurate analysis of OC images. Mira et al. developed an early OC detection framework using grayscale conversion, filtering and histogram equalization for preprocessing oral cavity images [29,30,31]. Cancer regions were segmented, and manually extracted features based on shape, texture and color were classified using traditional ML methods. Although effective, the approach lacked generalizability due to its reliance on handcrafted features and limited adaptability to diverse clinical data. To improve interpretability, Ulaganathan et al. integrated immunohistochemistry (IHC) and whole slide imaging (WSI) with a CAD framework [32]. Their system digitized oral tissue biopsies and quantified key biomarkers, aiding personalized therapy, but it remained semi-automated and required substantial manual input. DL introduced a significant shift by enabling automatic representation learning and improving performance. Li et al. applied a U-Net with a ResNet34 encoder to segment cancer in endoscopic images, showing strong generalization after training on 205 annotated samples [33]. Nanditha et al. employed a 43-layer Convolutional Neural Network (CNN) to analyze CT images for oral tumor detection, achieving high ACC and supporting early diagnosis [34]. In histopathology, Kumar et al. used a two-stage pipeline with CNN segmentation followed by random forest classification, outperforming traditional feature-based methods [35]. Panigrahi et al. evaluated six ML models on histopathology images, with a neural network achieving the highest ACC of 90.4%, highlighting strong discriminative ability [36].
Recent work has focused increasingly on histopathological images. Senthil Pandi et al. proposed a partitioned DL model combining local binary patterns with a DL classifier, achieving 96.45% ACC [37]. Panahi and Farrokh reviewed ML in personalized dental medicine; using radiographic images, clinical records, genetic data and patient reports, ML models helped detect caries, periodontal disease and OC, predict treatment outcomes and identify high-risk patients [38]. Reported challenges included data quality, interpretability, generalizability and clinical validation, while benefits included improved diagnosis, personalized care, patient experience and clinical efficiency. Jeyaraj et al. developed a regression-based CNN model for histopathological images with 94.5% ACC [39]. Halder et al. combined attention-based NASNet-Mobile, Gray Wolf Optimization for feature selection and ML classification, achieving 92.86% ACC [40]. Akhi et al. introduced OCNet using transfer learning across multiple CNN architectures [41]. Their optimized VGG19 model reached 95.32% ACC, providing a scalable solution for automated OSCC diagnosis.
These studies highlight the promise of ML and DL for OC detection. Compared with existing approaches, the proposed framework integrates deep feature extraction, SHAP-based feature selection and DL classification in a unified pipeline, achieving higher ACC, improved interpretability and scalable clinical deployment.

3. Materials and Methods

This section outlines the data acquisition process, image preprocessing steps, feature extraction and selection techniques, and the proposed framework. To provide a comprehensive understanding of the proposed approach, the overall workflow of the study is illustrated in Figure 1. Furthermore, the pseudocode of the whole system is given in Algorithm 1.
Figure 1. Overall pipeline of the proposed cancer classification framework, including data preprocessing, feature extraction, SHAP-based feature selection, classifier training, model evaluation, and deployment.
Algorithm 1: Proposed OC Detection Framework Algorithm
Input: E_p: number of epochs; W: model parameters; η: learning rate; b_s: batch size; D: OC histopathology dataset
Output: The assessment metrics on the test dataset.
1: Dataset Preprocessing:
2:   X_train ← preprocess(D) using CLAHE, bilateral filtering, gamma correction, stain normalization and morphological refinement.
3:   X_test ← preprocess(D).
4:   Apply data augmentation (rotation, flipping, noise injection).
5:   Apply ADASYN to address dataset imbalance and generate synthetic minority samples.
6: Feature Extraction:
7:   Initialize Swin Transformer (swin_large_patch4_window7_224).
8:   for local epoch e_p from 1 to E_p do
9:     for b_i = (x_s, y_s) ← random batch from X_train do
10:      Optimize model parameters: W_s ← W_s − η∇L(W_s; b_s).
11:      f_train ← ComputeFeatures(W_s, X_train, 1024).
12:    end for
13:  end for
14: Feature Selection:
15:   f_best ← SHAP(f_train, 500).
16: Cancer Classification:
17:   Initialize ViT classifier.
18:   TrainedModel ← ViT(f_best, y_train).
19:   Pred ← TrainedModel(X_test).
20: Performance Evaluation:
21:   Compute evaluation metrics: ACC, Sensitivity, Specificity, Precision, F1-score and MCC.
22:   EvaluationMetrics ← ComputeMetrics(Pred, y_test).

3.1. Data Acquisition

In this study, we used a publicly available histopathological image dataset for OC analysis obtained from the Mendeley Data repository [42]. The dataset comprises digitized hematoxylin and eosin (H&E) stained tissue slides from 230 patients, with images captured using a Leica ICC50 HD microscope under standardized conditions. All tissue samples were carefully collected, prepared and catalogued by medical experts to ensure accurate diagnostic labeling. The repository contains a total of 1224 histopathological images organized into two resolution sets. For this study, the first set, acquired at 100× magnification, was selected for the experiments. This subset includes 528 images, comprising 89 normal oral epithelium and 439 oral squamous cell carcinoma (OSCC) images. Table 1 summarizes the dataset. The chosen magnification provides an optimal balance between visualization of tissue architecture and cellular morphology, which is essential for distinguishing normal from malignant structures. The dataset is suitable for a binary classification task with a class distribution ratio of approximately 1:4.9 (normal:OSCC), reflecting real-world clinical conditions. All images were validated by expert pathologists at the time of the original dataset creation, with ground truth labels established through consensus and correlation with histopathological reports. Data collection, processing and validation strictly followed the protocols described in the source publication. The dataset was divided into three separate segments according to a 70-15-15 split ratio: 70% allocated for training, 15% for validation, and 15% for testing. The splits were performed at the image level, as patient-level identifiers were not available in the source dataset.
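As an illustration, a minimal sketch of the image-level, stratified 70/15/15 split is shown below; the directory layout, file extension and random seed are assumptions for the example, not details taken from the source dataset.

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

# Hypothetical directory layout: one folder per class, e.g. data/normal and data/oscc.
image_paths, labels = [], []
for label, folder in enumerate(["normal", "oscc"]):
    for p in Path("data", folder).glob("*.jpg"):
        image_paths.append(str(p))
        labels.append(label)

# 70% train, 15% validation, 15% test; stratification preserves the ~1:4.9 class ratio.
train_x, temp_x, train_y, temp_y = train_test_split(
    image_paths, labels, test_size=0.30, stratify=labels, random_state=42
)
val_x, test_x, val_y, test_y = train_test_split(
    temp_x, temp_y, test_size=0.50, stratify=temp_y, random_state=42
)
```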

3.2. Data Preprocessing

Data preprocessing involves specific techniques applied before the experiments to enhance the diversity of the data and improve generalization. All histopathological images were standardized to a resolution of 224 × 224 pixels and converted to RGB format to ensure compatibility with DL architectures. Several preprocessing techniques were applied to enhance image quality and reduce variability across samples. Contrast Limited Adaptive Histogram Equalization (CLAHE) was employed to improve local contrast while preserving structural integrity, enabling more effective feature extraction [43,44]. Noise suppression was performed using bilateral filtering, which reduces variability by considering both spatial and intensity information of neighboring pixels, producing smoother yet detail-preserving images [45,46]. Gamma correction was applied as a nonlinear transformation to adjust pixel intensities and enhance contrast, improving the visibility of low-contrast regions [47]. To address staining inconsistencies, both conventional and generative adversarial network (GAN)-based stain normalization approaches were adopted. The two-stage GAN strategy ensured consistent color representation across samples while maintaining morphological integrity and improving the reliability of downstream classification tasks [48,49]. Morphological operations such as opening and adaptive thresholding were used to refine cellular boundaries and suppress background artifacts. Otsu thresholding was applied to segment tumor regions into standardized patches, increasing dataset uniformity [48]. Unsharp masking was incorporated to enhance structural clarity by emphasizing high-frequency image components through subtraction of a blurred version from the original image [50,51]. Data augmentation techniques including controlled rotations and noise injection were applied to expand dataset diversity, mitigate overfitting and improve generalization capacity. Collectively, these preprocessing strategies enhanced image quality, reduced variability and strengthened dataset robustness, providing a reliable foundation for accurate feature extraction in OC histopathology analysis. To provide a clearer understanding of the sequential flow of these operations, the complete preprocessing pipeline is represented in Figure 2. To address the class imbalance in our dataset, we employed the Adaptive Synthetic Sampling (ADASYN) technique to generate synthetic minority samples and improve the class distribution [52].
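For illustration, the snippet below sketches the core contrast- and noise-handling steps (resizing, CLAHE, bilateral filtering and gamma correction) with OpenCV; the specific parameter values are illustrative assumptions rather than the settings used in our experiments, and stain normalization, morphological refinement, augmentation and ADASYN are omitted for brevity.

```python
import cv2
import numpy as np

def preprocess_patch(img_bgr: np.ndarray, gamma: float = 1.2) -> np.ndarray:
    """Resize to 224x224, apply CLAHE on the lightness channel, bilateral filtering and gamma correction."""
    img = cv2.resize(img_bgr, (224, 224), interpolation=cv2.INTER_AREA)

    # CLAHE on the L channel boosts local contrast without distorting stain colours.
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

    # Edge-preserving noise suppression.
    img = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)

    # Gamma correction via a lookup table.
    table = np.array([(i / 255.0) ** (1.0 / gamma) * 255 for i in range(256)]).astype("uint8")
    return cv2.LUT(img, table)
```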

3.3. Feature Extraction Using Swin Transformer

To determine an effective backbone for feature extraction, several architectures were evaluated, including ConvNeXt, Swin Transformer and BEiT. The Swin Transformer achieved the highest ACC and was therefore adopted as the primary feature extractor. The Swin Transformer Large configuration [53,54,55,56] was employed with a patch size of $4 \times 4$, a window size of $7 \times 7$ and an input resolution of $224 \times 224$. The pretrained classification head was removed and replaced with an identity layer to produce generic, task-independent feature representations. The Swin Transformer constructs a hierarchical representation of an image through four stages. The input image is first divided into non-overlapping patches, each of which is linearly embedded into a feature vector. Subsequent stages progressively reduce spatial resolution and increase channel depth through patch merging, where the embeddings of four neighbouring patches are concatenated and linearly transformed into a new representation:
$Y_{i,j} = W\,\bigl[\,X_{2i,2j} \,\|\, X_{2i+1,2j} \,\|\, X_{2i,2j+1} \,\|\, X_{2i+1,2j+1}\,\bigr]$
This hierarchical design allows the network to capture increasingly rich semantic information while maintaining computational efficiency. At the final stage, feature maps with 1536 channels are produced. A lightweight convolutional refinement block is then applied to emphasize informative regions and suppress noise. This block employs $1 \times 1$ and $3 \times 3$ convolutions, batch normalization and Rectified Linear Unit (ReLU) activations to refine the feature maps. After refinement, spatial information is aggregated using global average pooling to produce a compact vector for each channel:
$z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{c,i,j}, \qquad c = 1, \ldots, 1536$
Finally, a two-layer projection head reduces the feature vector from 1536 to 1024 dimensions. The first linear layer maps the refined features to 1280 units with ReLU activation and batch normalization, followed by a second layer mapping to 1024 units with ReLU, batch normalization and dropout at a rate of $p = 0.2$. This feature extraction pipeline combines the Swin Transformer’s hierarchical learning capabilities with simple convolutional refinement, yielding compact and robust descriptors well suited for histopathological image analysis [57,58,59,60].
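A minimal PyTorch sketch of this extractor is given below, assuming the timm implementation of swin_large_patch4_window7_224; the convolutional refinement block is omitted for brevity, so only the identity head plus the 1536 → 1280 → 1024 projection is shown.

```python
import torch
import torch.nn as nn
import timm

class SwinFeatureExtractor(nn.Module):
    """Swin-L backbone without its classification head, followed by a 1536 -> 1280 -> 1024 projection."""
    def __init__(self, dropout: float = 0.2):
        super().__init__()
        # num_classes=0 removes the pretrained head, yielding pooled 1536-d features.
        self.backbone = timm.create_model(
            "swin_large_patch4_window7_224", pretrained=True, num_classes=0
        )
        self.projection = nn.Sequential(
            nn.Linear(1536, 1280), nn.ReLU(), nn.BatchNorm1d(1280),
            nn.Linear(1280, 1024), nn.ReLU(), nn.BatchNorm1d(1024), nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)        # (B, 1536) pooled Swin features
        return self.projection(feats)   # (B, 1024) compact descriptors

# Usage: embeddings = SwinFeatureExtractor()(torch.randn(4, 3, 224, 224))
```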

3.4. Proposed Model Architecture

The 1024-dimensional embeddings extracted by the Swin Transformer feature extractor were used as input for a vision transformer (ViT) classifier [61,62,63]. These embeddings were projected to a lower-dimensional space and combined with positional encodings to form a sequential input suitable for the transformer:
$Z_0 = X W_e + E_{\mathrm{pos}}$
where $X$ denotes the input embeddings, $W_e$ is the projection matrix and $E_{\mathrm{pos}}$ represents the positional encodings. The sequence was processed through four transformer encoder layers, each consisting of multi-head self-attention (MSA), layer normalization (LN), feed-forward networks, and residual connections:
$Z'_l = \mathrm{MSA}(\mathrm{LN}(Z_l)) + Z_l$
$Z_{l+1} = \mathrm{FFN}(\mathrm{LN}(Z'_l)) + Z'_l$
where $Z_l$ is the input to the $l$-th layer, $Z'_l$ is the output after the attention sublayer, MSA denotes multi-head self-attention with 8 heads, LN denotes layer normalization, and FFN represents the feed-forward network with hidden dimension 256. After the final transformer layer, global average pooling was applied to obtain a single feature vector. The classification head applied a linear transformation followed by softmax activation:
$\hat{y} = \mathrm{softmax}(W_c z_{\mathrm{pool}} + b_c)$
where $z_{\mathrm{pool}}$ is the pooled representation and $W_c$, $b_c$ are learnable parameters. The model was trained using cross-entropy loss. Optimization was performed using the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$, and a learning rate of 0.0001. Training was conducted for 50 epochs. This approach leverages the hierarchical embeddings from the Swin Transformer, enabling robust and accurate classification of OC histopathology images [53,64,65].
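The following PyTorch sketch illustrates a classifier head consistent with these settings (128-dimensional token embeddings, 8 heads, feed-forward dimension 256, 4 encoder layers, dropout 0.1, Adam with the stated parameters). How the 1024-dimensional Swin embedding is arranged into a token sequence is not specified in the text, so the reshape into 8 tokens of size 128 is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class ViTHead(nn.Module):
    """Transformer-encoder classifier over Swin embeddings; the token construction is an assumption."""
    def __init__(self, in_dim=1024, d_model=128, n_heads=8, ff_dim=256,
                 n_layers=4, n_classes=2, dropout=0.1):
        super().__init__()
        self.n_tokens = in_dim // d_model                      # 1024 / 128 = 8 tokens (assumed)
        self.proj = nn.Linear(d_model, d_model)                # plays the role of W_e in Z0 = X W_e + E_pos
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=ff_dim, dropout=dropout,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):
        tokens = x.view(x.size(0), self.n_tokens, -1)          # (B, 8, 128)
        z = self.encoder(self.proj(tokens) + self.pos_embed)   # 4 encoder layers (MSA + FFN)
        return self.classifier(z.mean(dim=1))                  # global average pooling -> class logits

model = ViTHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
criterion = nn.CrossEntropyLoss()  # applies softmax internally, matching the cross-entropy objective
```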

3.5. Feature Selection Using SHAP

Feature selection plays an important role in enhancing model performance by identifying and retaining the most relevant attributes. In this work we applied SHAP [66,67,68], which is grounded in cooperative game theory [69,70,71] and assigns an importance score to each feature with respect to a prediction. In a coalitional game with $N$ players, the Shapley value $\phi_i$ for a player (or feature) $i$ is defined as:
$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \bigl[ v(S \cup \{i\}) - v(S) \bigr],$
where $S$ is any subset of $N$ not containing $i$, and $v(S)$ represents the predictive value of the coalition $S$. The term $v(S \cup \{i\}) - v(S)$ reflects the marginal contribution of feature $i$ to the coalition. In a predictive modeling context, SHAP approximates the output of a complex model $f(x)$ with a simplified linear representation $g(x')$ based on binary variables $x' \in \{0, 1\}^M$, where $M$ is the number of features:
$g(x') = \phi_0 + \sum_{i=1}^{M} \phi_i x'_i,$
with $\phi_0$ denoting the baseline output (the expected prediction when no features are present) and $\phi_i$ the contribution of feature $i$. Here, $x'_i = 1$ indicates that the $i$-th feature is included, and $x'_i = 0$ otherwise [72,73].
The Shapley value for feature i with respect to an instance x is then computed as:
$\phi_i(f, x) = \sum_{z' \subseteq x'} \frac{|z'|!\,(M - |z'| - 1)!}{M!} \bigl[ f_x(z') - f_x(z' \setminus i) \bigr],$
where $z'$ represents subsets of the simplified input $x'$, and $f_x(z')$ corresponds to the model output restricted to those features. By ranking features according to their absolute Shapley values, the most informative attributes were selected for classification. In practice this procedure reduced redundancy and improved predictive ACC while also ensuring interpretability. Empirical evaluation further showed that selecting the top-ranked subset of features significantly enhanced overall performance compared to using all features [74,75,76].
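A minimal sketch of this ranking-and-selection step is given below, assuming the shap library with a tree-based surrogate model to keep Shapley estimation tractable; the choice of surrogate and explainer is an assumption for illustration, not the exact configuration used in our experiments.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

def select_top_features(X_train: np.ndarray, y_train: np.ndarray, k: int = 500) -> np.ndarray:
    """Rank the 1024 Swin features by mean |SHAP value| and return the indices of the top-k features."""
    surrogate = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    explainer = shap.TreeExplainer(surrogate)
    raw = explainer.shap_values(X_train)
    # Older shap versions return one array per class; newer ones a single (samples, features[, classes]) array.
    arr = np.stack(raw, axis=-1) if isinstance(raw, list) else np.asarray(raw)
    if arr.ndim == 3:
        arr = np.abs(arr).mean(axis=-1)        # collapse the class axis if present
    importance = np.abs(arr).mean(axis=0)       # mean |phi_i| per feature over all samples
    return np.argsort(importance)[::-1][:k]

# Usage: top_idx = select_top_features(f_train, y_train, k=500); f_best = f_train[:, top_idx]
```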

4. Results

4.1. Evaluating Metrics

To evaluate the performance of different classification models and the overall predictive approach multiple metrics were considered. Key measures include Accuracy (ACC), MCC, Sensitivity, Specificity, Precision (PRE), and F1-score [68,77,78,79,80]. The definitions of these metrics are provided below:
$\mathrm{ACC} = \dfrac{TP + TN}{TP + TN + FN + FP}$
$\mathrm{Sensitivity} = \dfrac{TP}{TP + FN}$
$\mathrm{Specificity} = \dfrac{TN}{FP + TN}$
$\mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TN + FP)(TN + FN)(TP + FN)}}$
$\mathrm{PRE} = \dfrac{TP}{TP + FP}$
$\mathrm{F1\text{-}score} = \dfrac{2 \times (\mathrm{PRE} \times \mathrm{Recall})}{\mathrm{PRE} + \mathrm{Recall}}$
Here, TP (True Positive) refers to the number of actual positive instances that are correctly classified, while FP (False Positive) denotes the number of negative instances incorrectly predicted as positive. TN (True Negative) indicates the count of negative instances accurately identified, and FN (False Negative) represents the number of positive instances that were incorrectly classified as negative. In addition the AUC metric was employed to evaluate the overall performance of the proposed model, with the curve plotted by placing the true positive rate (TPR or sensitivity) on the y-axis and the false positive rate (FPR = 1 − specificity) on the x-axis across varying threshold values.
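All of these metrics can be computed directly from the predictions; the helper below is a small sketch using scikit-learn, where the positive class is assumed to be OSCC and y_prob holds the predicted probability of that class.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score, confusion_matrix)

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, y_prob: np.ndarray) -> dict:
    """Compute ACC, sensitivity, specificity, precision, F1, MCC and AUC for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "Sensitivity": recall_score(y_true, y_pred),   # TP / (TP + FN)
        "Specificity": tn / (tn + fp),                 # TN / (TN + FP)
        "Precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
    }
```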

4.2. Experimental Evaluation of Feature Extractors and Classifiers

To identify the most effective feature extractor, we evaluated three architectures: ConvNeXt, BEiT and Swin Transformer. Following the transfer learning approach, the fully connected layers of these models were removed and replaced with five classifiers: Random Forest, CatBoost, Focal Loss Support Vector Machine (SVM), Attention CNN (AttCNN) and ViT. This combinatorial approach was designed to systematically assess and compare the performance of each feature extractor across multiple classifiers. Additionally, it enabled the identification of the most suitable classifier among both ML and DL approaches. We used separate validation and test sets for all of the experiments. The validation set was used to select the optimal feature extractor, classifier, and their best-performing combination, while the test set was reserved exclusively for final performance evaluation. From the results presented in Table 2, it is clear that the Swin Transformer achieved the highest average performance across all five classifiers with an overall ACC of 87.25%. ConvNeXt and BEiT obtained average accuracies of 87.08% and 82.79%, respectively, corresponding to reductions of 0.17% and 4.46% relative to the Swin Transformer.
In terms of AUC Swin Transformer again led with an average score of 92.07%, exceeding ConvNeXt (91.46%) and BEiT (89.86%) by 0.61% and 2.21% respectively. MCC further highlighted the superiority of Swin Transformer which achieved 63.67% compared to 78.20% for ConvNeXt and 77.46% for BEiT. Specificity and sensitivity metrics were 77.90% and 78.20% for Swin Transformer, outperforming ConvNeXt (78.64% and 78.68%) and BEiT (76.63% and 75.59%) by notable margins. These results demonstrate that Swin Transformer consistently surpasses the other feature extractors across all evaluation metrics, establishing it as the preferred choice. Regarding classifiers the ViT consistently delivered the highest performance. It achieved an average ACC of 91.86%, outperforming Random Forest (77.86%), CatBoost (78.64%), Focal Loss SVM (83.72%) and AttCNN (87.95%) by 13.99%, 13.22%, 8.13% and 3.91% respectively. The highest average AUC score of 95.28% was also attained by the ViT exceeding the other classifiers by 8.01%, 7.20%, 3.61% and 1.93% respectively. Similarly the MCC of ViT was 90.83% with other classifiers scoring 59.55% (Random Forest), 59.67% (CatBoost), 73.74% (Focal Loss SVM) and 81.77% (AttCNN) indicating differences of 31.28%, 31.16%, 17.09% and 9.06%. Evaluation of additional metrics including specificity (90.85%) and sensitivity (90.56%) further confirmed the ViT’s superior performance. Consequently the ViT was selected as the optimal classifier. The analysis also revealed that combining the ViT with any feature extractor resulted in consistently high ACC, AUC, MCC, specificity and sensitivity values. Likewise using any classifier with the Swin Transformer yielded the best results relative to other feature extractors. Ultimately the combination of Swin Transformer as the feature extractor and ViT as the classifier achieved the highest overall ACC of 95.75% surpassing all other evaluated configurations. Based on these findings this combination was adopted to construct the proposed model. The performance differences among the three feature extractors are further highlighted by their ROC curves (Figure 3A–C) and precision recall curve (Figure 3D–F).
These curves capture the balance between sensitivity and 1 − specificity, and the corresponding AUC values summarize the overall classification effectiveness. The Swin Transformer consistently achieved the highest AUC, which reflects its stronger ability to discriminate between classes compared to ConvNeXt and BEiT. These visual findings are consistent with the quantitative metrics that confirm the superior performance of the Swin Transformer as a feature extractor. Beyond the ROC analysis, we also analyzed the impact of feature selection on performance, focusing both on the best-performing feature extractor–classifier combinations and on how varying feature counts influence ACC and AUC, as illustrated in Figure 4.

4.3. Impact of Feature Selection Techniques

To identify the most effective feature selection technique for OC detection, we evaluated five approaches: mRMR, Lasso, Boruta, Genetic Algorithm and SHAP. The comparative results are presented in Table 3. Among these, SHAP achieved the best overall performance with the highest ACC (99.25%), sensitivity (99.21%) and strong AUC (98.22%) along with superior F1 (98.43%) and MCC (98.21%) values. These findings demonstrate that SHAP substantially improves the model’s ability to differentiate between cancerous and non-cancerous samples across multiple evaluation metrics. The Genetic Algorithm also yielded competitive results, with an ACC of 97.80%, specificity of 98.88% and PRE of 98.20%, though it did not surpass SHAP in terms of balanced performance. In contrast, mRMR (96.25%), Lasso (96.12%) and Boruta (95.95%) showed relatively lower accuracies. While these methods maintained strong specificity (above 97%), they were less effective in providing consistently high values across sensitivity, F1 and MCC. Overall, SHAP emerged as the most effective feature selection technique for OC detection, delivering superior and well balanced performance across nearly all evaluation metrics.
To provide a more comprehensive comparison, Figure 5 combines multiple visual perspectives on feature selection performance and feature count optimization. Figure 5A,B reaffirm the earlier observations: SHAP consistently delivers superior predictive performance with narrow score distributions, and approximately 500 features offer the most favorable balance between ACC and stability. Figure 5C extends this evaluation by comparing all feature selection methods across multiple performance metrics using a radar chart. SHAP dominates across nearly every dimension, particularly in ACC, sensitivity, F1 and MCC, underscoring its robustness as a selection strategy. Finally, Figure 5D illustrates the performance trends across varying feature counts, where 500 features again emerge as the optimal threshold for stable and well-rounded results. Together, these visualizations highlight the critical role of both the selection technique and feature dimensionality in shaping reliable predictive models for OC detection.
The performance of the Swin Transformer combined with SHAP-based feature selection under varying feature counts is summarized in Table 4. The analysis indicates that ACC and the other evaluation metrics varied with the number of features used. With 200 features the model achieved an ACC of 0.8875 and an AUC of 0.9200, while PRE and sensitivity were 0.9831 and 0.8788, respectively. Increasing the feature count to 300 improved PRE to 0.9841 and sensitivity to 0.9394, indicating more reliable identification of positive cases. Optimal performance was observed with 500 features, where the model consistently outperformed all other configurations. In this setting it achieved an ACC of 0.9925, AUC of 0.9822, PRE of 0.9826, specificity of 0.9918, sensitivity of 0.9921, F1-score of 0.9843 and MCC of 0.9821. Beyond this point, increasing the feature count to 700 and 900 led to a slight reduction in performance. Although ACC remained high (0.9625 and 0.9375, respectively), decreases in sensitivity, F1-score and MCC suggest that excessive features introduced redundancy or noise. Overall, these findings demonstrate that SHAP-based feature selection effectively identified the most discriminative subset of features, while the Swin Transformer served as a powerful backbone for representation learning. To highlight the comparative impact of the different selection methods, Figure 6 presents a comprehensive analysis. Subfigures A–C compare feature overlap, retained feature counts and performance across metrics, while Subfigure D shows the SHAP summary plot of the top 20 influential features. These results reinforce SHAP’s consistent ability to identify the most discriminative features for OC detection. Furthermore, we present Grad-CAM visualizations of the input images to identify the regions of interest highlighted by our feature extractor. Figure 7 illustrates various input images alongside their corresponding Grad-CAM visualizations.

4.4. Hyperparameter Analysis of Swin Transformer

We evaluated the impact of key hyperparameters on the performance of the proposed Swin Transformer based feature extractor and ViT classifier. Table 5 summarizes the hyperparameter settings used for the Swin Transformer feature extractor. The model uses the swin_large_patch4_window7_224 backbone, which produces a 1536-dimensional feature representation at its final stage. This representation is subsequently refined using two linear layers with dimensions 1536 → 1280 → 1024, where each layer is followed by ReLU activation, Batch Normalization, and a dropout rate of 0.2 to reduce overfitting. The feature extractor operates on images divided into 4 × 4 patches with a window size of 7, and the resulting features are aggregated using mean pooling.
Table 6 presents the hyperparameters for the ViT classifier applied on the extracted features. The classifier takes a 1024 dimensional input embedding and projects it into a 128 dimensional embedding space using 8 attention heads and a feedforward dimension of 256. To improve generalization a dropout rate of 0.1 is applied across 4 transformer layers. The classifier is trained with a learning rate of 0.0001 for 50 epochs using a batch size of 128.
Overall these hyperparameter choices were selected based on preliminary experiments to optimize classifier performance while ensuring computational efficiency. Adjustments to learning rate, batch size and the number of training epochs were found to significantly influence convergence and generalization.

4.5. Ablation Study

To evaluate the contribution of each component of the proposed model, we conducted an ablation study. The results are summarized in Table 7, which demonstrates the impact of different architectural modifications on model performance. The full model achieved the highest performance across all metrics, with an accuracy (ACC) of 0.9925, area under the curve (AUC) of 0.9822, F1-score of 0.9843 and MCC of 0.9821, confirming the effectiveness of the complete architecture. Removing positional encoding led to a significant drop in performance (ACC = 0.8000, AUC = 0.8052, F1 = 0.8708, MCC = 0.4527), highlighting the importance of spatial information. Using a shallower ViT reduced PRE (0.9455) and F1-score (0.8592) despite relatively high SP (0.7857) and SN (0.7879), indicating that deeper layers are essential for effective feature extraction. The deep ViT variant maintained high ACC (0.9875) and perfect SN (1.0000) but showed a decrease in AUC (0.8604), suggesting limited generalization without the full architecture. Replacing attention with max pooling produced competitive results (ACC = 0.9750, F1 = 0.9851) but still underperformed compared to the full model, emphasizing the importance of attention-based feature aggregation. These findings confirm that positional encoding, model depth and attention mechanisms are all critical components of the proposed architecture. In addition to predictive accuracy, evaluating computational efficiency is essential. Figure 8 illustrates performance comparisons across seven evaluation metrics (ACC, AUC, PRE, SP, SN, F1, MCC) for different model variants, with (A) a grouped bar chart and (B) trend lines demonstrating the Full Model’s superior and consistent performance across all metrics.

4.6. Statistical Analysis of Model Performance

To assess the statistical significance of the proposed model’s performance, we conducted both metric-wise and model-wise analyses, which are summarized in Table 8 and Table 9. The metric-wise evaluation shows that all performance measures, including ACC, AUC, PRE, SN, F1 and MCC, exhibited significant improvements compared to baseline values. For the t-test, p-values ranged from 0.0005 to 0.0119, with corresponding Wilcoxon p-values ranging from 0.0065 to 0.0371. Notably, PRE and F1-score demonstrated the most significant differences (t-test p = 0.0006 and 0.0005; Wilcoxon p = 0.0120 and 0.0065), indicating substantial gains in predictive performance. The model-wise analysis further confirms the superiority of the proposed approach over baseline models: t-test and Wilcoxon test results indicate statistically significant differences for the Random Forest, CatBoost, Focal Loss SVM, AttCNN and ViT models. Overall, these statistical analyses validate that the observed improvements are not due to random chance, confirming the reliability and effectiveness of the proposed model.

4.7. Web Application for Real-Time OC Classification

To bridge the gap between experimental results and practical application, we developed a web application capable of performing real-time classification of OC from histopathological images. The deployment architecture comprises three distinct components working in concert to deliver efficient inference capabilities. First, the trained models, namely the Swin Transformer feature extractor and the ViT classification head, are deployed on the Hugging Face Model Hub (https://huggingface.co/rudradcruze/oral-cancer, accessed on 20 November 2025), ensuring version control, easy accessibility and reproducibility. Next, using these models we developed a RESTful API with the FastAPI framework in Python 3.10, deployed on Hugging Face Spaces (https://huggingface.co/spaces/rudradcruze/oral-cancer, accessed on 20 November 2025) using the Docker SDK. This API handles all of the heavy tasks, including model loading, image preprocessing, feature extraction via the Swin Transformer, feature selection based on the SHAP analysis determined in our experimental study (Table 4), and final classification through the ViT head. Finally, to make the system user-friendly and broadly accessible, we developed a client-side interface using standard modern web technologies (HTML, CSS, JavaScript and TailwindCSS), with FastAPI serving the frontend, deployed on Vercel (https://oralcancer.francisrudra.com, accessed on 20 November 2025). This interface allows users to upload histopathological images through an intuitive browser-based environment and communicates with the Hugging Face Space API to retrieve predictions. All of the experimental code used in this study is available in the GitHub repository at https://github.com/rudradcruze/oral-cancer (accessed on 20 November 2025).
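A condensed sketch of such a FastAPI inference endpoint is shown below; the stand-in models, the feature-index tensor and the response format are placeholders for illustration and do not reproduce the deployed application, which loads the real artifacts from the Hugging Face Hub.

```python
from io import BytesIO

import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import transforms

app = FastAPI(title="Oral cancer classification API (sketch)")

# Stand-ins so the sketch runs without the real artifacts hosted on the Hugging Face Hub.
extractor = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1024))
top_idx = torch.arange(500)            # indices of the SHAP-selected features (placeholder)
classifier = torch.nn.Linear(500, 2)   # stands in for the trained ViT head

to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    """Run the pipeline on one uploaded histopathology image and return the predicted class."""
    image = Image.open(BytesIO(await file.read())).convert("RGB")
    x = to_tensor(image).unsqueeze(0)
    with torch.no_grad():
        feats = extractor(x)                                        # 1024-d embedding (Swin in deployment)
        probs = torch.softmax(classifier(feats[:, top_idx]), dim=1)[0]
    label = "OSCC" if probs[1] > probs[0] else "Normal"
    return {"label": label, "confidence": float(probs.max())}
```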

4.8. Comparison with Existing Methods

To validate the effectiveness of the proposed approach, we compared its performance with state-of-the-art methods for OC detection using histopathological images, as shown in Table 10. The proposed Modified Swin Transformer with a ViT head achieved the highest ACC of 99.25%, outperforming all existing approaches. The closest competing method was reported by Albalawi et al., which achieved 98.47% ACC using EfficientNet B3 with an advanced learning mechanism, 0.78% lower than our approach. Flügge et al. achieved 97.63% with a Swin Transformer, while Das et al. reported 97.41% by combining multiple architectures. Yu et al. obtained 95.84% with a ResNet50-based DCNN framework. Other notable results include Nagarajan et al. (94.18% using MobileNetV3 with the Gorilla Troops Optimizer), Panigrahi et al. (92.14% using three CNN models) and ResNet50-based methods by Shavlokhova et al. (92.37%) and Dai et al. (92.48%). Chang et al. reported 85.63% with a ResNet50-VGG16 hybrid, while the lowest performance was observed in Sukegawa et al. (79.46% with a Probability Neural Network). The superior performance of the proposed framework can be attributed to the synergistic combination of hierarchical feature extraction with the Swin Transformer, global attention mechanisms through the ViT and effective preprocessing strategies that enhanced feature discriminability and overall classification ACC. All models were evaluated on the same dataset under identical experimental configurations to ensure a fair comparison.

5. Conclusions

This study presents a hybrid DL framework for automated OC detection using histopathological images. The proposed methodology combines Swin Transformer feature extraction, ViT classification and SHAP-based feature selection to achieve superior diagnostic performance with 99.25% ACC, outperforming existing advanced approaches. The systematic evaluation confirmed that the Swin Transformer provides optimal feature extraction, while the ViT delivers the best classification performance. SHAP-based feature selection with 500 features achieved the ideal balance between ACC and computational efficiency. Statistical analyses validated the significance of the improvements across all evaluation metrics. The framework’s practical applicability is demonstrated through a FastAPI-based web application enabling real-time deployment in clinical settings. Despite the promising results, limitations include dataset size constraints, the absence of patient-level identifiers, which prevented patient-based data splitting, and the need for validation across diverse populations. Future work should focus on multi-institutional validation and the integration of multimodal data sources to enhance clinical applicability and robustness.

Author Contributions

Conceptualization: M.F.H., F.R.D.C. and J.W.; data curation, formal analysis, investigation, methodology: F.R.D.C., M.B.A.M., M.F.A.M., Z.M. and M.F.H.; funding acquisition, project administration: Z.M., M.F.A.M. and M.B.A.M.; resources, software: J.W., M.F.H. and F.R.D.C.; validation, visualization: M.B.A.M., F.R.D.C. and Z.M.; writing—original draft: J.W., F.R.D.C. and M.F.H.; writing—review and editing: M.F.A.M., Z.M. and M.B.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset supporting the findings of this study is openly available in the Mendeley Data repository at https://data.mendeley.com/datasets/ftmp4cvtmb/1 (accessed on 20 November 2025). The codebase is available on GitHub at https://github.com/rudradcruze/oral-cancer (accessed on 20 November 2025), and the web application is publicly accessible at https://oralcancer.francisrudra.com (accessed on 20 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OC: Oral cancer
OSCC: Oral Squamous Cell Carcinoma
CAD: Computer-Aided Diagnostic
AI: Artificial Intelligence
ML: Machine Learning
DL: Deep Learning
CNN: Convolutional Neural Network
ViT: Vision Transformer
AttCNN: Attention-based CNN
SHAP: SHapley Additive exPlanations
GAN: Generative Adversarial Network
IHC: Immunohistochemistry
WSI: Whole Slide Imaging
H&E: Hematoxylin and Eosin
ACC: Accuracy
AUC: Area Under the Curve
PRE: Precision
MCC: Matthews Correlation Coefficient
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
CLAHE: Contrast Limited Adaptive Histogram Equalization
ReLU: Rectified Linear Unit
SVM: Support Vector Machine

References

  1. Liang, H.J.; Tan, X.Y.; Li, D.; Lin, C.; Huang, S.Y.; Nie, G.C.; Guo, X.F.; Zhang, Z.B.; Zhu, X.N.; Tan, S.K. New advances in oral microbiology and tumor research. World J. Clin. Oncol. 2025, 16, 106981. [Google Scholar] [CrossRef] [PubMed]
  2. Romero-Trejo, D.; Aguiñiga-Sanchez, I.; Ledesma-Martínez, E.; Weiss-Steider, B.; Sierra-Mondragón, E.; Santiago-Osorio, E. Anti-cancer potential of casein and its derivatives: Novel strategies for cancer treatment. Med. Oncol. 2024, 41, 200. [Google Scholar] [CrossRef] [PubMed]
  3. Wierzbicka, M.; Pietruszewska, W.; Maciejczyk, A.; Markowski, J. Trends in incidence and mortality of head and neck Cancer subsites among elderly patients: A Population-Based analysis. Cancers 2025, 17, 548. [Google Scholar] [CrossRef] [PubMed]
  4. Kijowska, J.; Grzegorczyk, J.; Gliwa, K.; Jędras, A.; Sitarz, M. Epidemiology, Diagnostics, and Therapy of Oral Cancer—Update Review. Cancers 2024, 16, 3156. [Google Scholar] [CrossRef]
  5. Rusinovci, S.; Aliu, X.; Jukić, T.; Štubljar, D.; Haliti, N. Analysis of THREE-year prevalence of oral cavity, neck and head tumors—A retrospective single-centre study. Acta Clin. Croat. 2020, 59, 445. [Google Scholar] [CrossRef]
  6. Wu, J.; Chen, H.; Liu, Y.; Yang, R.; An, N. The global, regional, and national burden of oral cancer, 1990–2021: A systematic analysis for the Global Burden of Disease Study 2021. J. Cancer Res. Clin. Oncol. 2025, 151, 53. [Google Scholar] [CrossRef]
  7. Kademani, D. Oral cancer. Mayo Clin. Proc. 2007, 82, 878–887. [Google Scholar] [CrossRef]
  8. Jose, J.; Wieczorek, A. Head and Neck Cancer. In Treatment of Cancer; CRC Press: Boca Raton, FL, USA, 2025; pp. 35–50. [Google Scholar]
  9. Maggiore, R.; Zumsteg, Z.S.; BrintzenhofeSzoc, K.; Trevino, K.M.; Gajra, A.; Korc-Grodzicki, B.; Epstein, J.B.; Bond, S.M.; Parker, I.; Kish, J.A.; et al. The older adult with locoregionally advanced head and neck squamous cell carcinoma: Knowledge gaps and future direction in assessment and treatment. Int. J. Radiat. Oncol. Biol. Phys. 2017, 98, 868–883. [Google Scholar] [CrossRef]
  10. Peng, H.; Wang, X.; Liao, Y.; Lan, L.; Wang, D.; Xiong, Y.; Xu, L.; Liang, Y.; Luo, X.; Xu, Y.; et al. Long-term exposure to ambient NO2 increase oral cancer prevalence in Southern China: A 3-year time-series analysis. Front. Public Health 2025, 13, 1484223. [Google Scholar] [CrossRef]
  11. Bushi, G.; Khatib, M.N.; Singh, M.P.; Pattanayak, M.; Vishwakarma, T.; Ballal, S.; Bansal, P.; Gaidhane, A.M.; Tomar, B.S.; Ashraf, A.; et al. Prevalence of suicidal ideation, attempts and associated risk factors in oral cancer patients: A systematic review and meta-analysis. BMC Oral Health 2025, 25, 140. [Google Scholar] [CrossRef]
  12. Hsu, P.C.; Huang, J.H.; Tsai, C.C.; Lin, Y.H.; Kuo, C.Y. Early Molecular Diagnosis and Comprehensive Treatment of Oral Cancer. Curr. Issues Mol. Biol. 2025, 47, 452. [Google Scholar] [CrossRef] [PubMed]
  13. Gupta, N.; Gupta, R.; Acharya, A.K.; Patthi, B.; Goud, V.; Reddy, S.; Garg, A.; Singla, A. Changing Trends in oral cancer-a global scenario. Nepal J. Epidemiol. 2016, 6, 613. [Google Scholar] [CrossRef] [PubMed]
  14. Rageh, O.A.; Mahmood, K.; Alkladi, E.; Bamuneef, A.; Algebaree, M.; Murad, A.; Munasser, M. Dental Student Knowledge of the Role of Early Detection of Oral Cancer: Multi Center Cross Sectional Study. Yemeni J. Med Sci. 2025, 19, 7. [Google Scholar] [CrossRef]
  15. Mohammed, R.A.; Ahmed, S.K. Oral cancer screening: Past, present, and future perspectives. Oral Oncol. Rep. 2024, 10, 100306. [Google Scholar] [CrossRef]
  16. Cirillo, N. Precursor lesions, overdiagnosis, and oral cancer: A critical review. Cancers 2024, 16, 1550. [Google Scholar] [CrossRef]
  17. Ng, J.Y.; Cramer, H.; Lee, M.S.; Moher, D. Traditional, complementary, and integrative medicine and artificial intelligence: Novel opportunities in healthcare. Integr. Med. Res. 2024, 13, 101024. [Google Scholar] [CrossRef]
  18. Shah, P.; Kendall, F.; Khozin, S.; Goosen, R.; Hu, J.; Laramie, J.; Ringel, M.; Schork, N. Artificial intelligence and machine learning in clinical development: A translational perspective. npj Digit. Med. 2019, 2, 69. [Google Scholar] [CrossRef]
  19. Al, M.M.F.; Hasib, F.M.; Young, L.; Na, G.; Wang, D. Diabetes Prediction and Detection System Through a Recurrent Neural Network in a Sensor Device. Electronics 2025, 14, 4207. [Google Scholar] [CrossRef]
  20. Briganti, G.; Le Moine, O. Artificial intelligence in medicine: Today and tomorrow. Front. Med. 2020, 7, 509744. [Google Scholar] [CrossRef]
  21. Kumar, Y.; Shrivastav, S.; Garg, K.; Modi, N.; Wiltos, K.; Woźniak, M.; Ijaz, M.F. Automating cancer diagnosis using advanced deep learning techniques for multi-cancer image classification. Sci. Rep. 2024, 14, 25006. [Google Scholar] [CrossRef]
  22. Unger, M.; Kather, J.N. Deep learning in cancer genomics and histopathology. Genome Med. 2024, 16, 44. [Google Scholar] [CrossRef] [PubMed]
  23. Vinay, V.; Jodalli, P.; Chavan, M.S.; Buddhikot, C.S.; Luke, A.M.; Ingafou, M.S.H.; Reda, R.; Pawar, A.M.; Testarelli, L. Artificial intelligence in oral cancer: A comprehensive scoping review of diagnostic and prognostic applications. Diagnostics 2025, 15, 280. [Google Scholar] [CrossRef] [PubMed]
  24. Khosravi, P.; Fuchs, T.J.; Ho, D.J. Artificial Intelligence–Driven Cancer Diagnostics: Enhancing Radiology and Pathology through Reproducibility, Explainability, and Multimodality. Cancer Res. 2025, 85, 2356–2367. [Google Scholar] [CrossRef] [PubMed]
  25. Pereira-Prado, V.; Martins-Silveira, F.; Sicco, E.; Hochmann, J.; Isiordia-Espinoza, M.A.; González, R.G.; Pandiar, D.; Bologna-Molina, R. Artificial intelligence for image analysis in oral squamous cell carcinoma: A review. Diagnostics 2023, 13, 2416. [Google Scholar] [CrossRef]
  26. Gupta, A.; Neelapu, B.C.; Rana, S.S. Computer-Aided Diagnosis (CAD) Tools and Applications for 3D Medical Imaging; Elsevier: Amsterdam, The Netherlands, 2025; Volume 136. [Google Scholar]
  27. Ma, Y.; Jamdade, S.; Konduri, L.; Sailem, H. AI in Histopathology Explorer for comprehensive analysis of the evolving AI landscape in histopathology. npj Digit. Med. 2025, 8, 156. [Google Scholar] [CrossRef]
  28. Ahmad, M.Y.; Mohamed, A.; Yusof, Y.A.M.; Ali, S.A.M. Colorectal cancer image classification using image pre-processing and multilayer Perceptron. In Proceedings of the 2012 International Conference on Computer & Information Science (ICCIS), Chongqing, China, 17–19 August 2012; IEEE: Piscataway, NJ, USA, 2012; Volume 1, pp. 275–280. [Google Scholar]
  29. Mira, E.S.; Saaduddin Sapri, A.M.; Aljehanı, R.F.; Jambı, B.S.; Bashir, T.; El-Kenawy, E.S.M.; Saber, M. Early diagnosis of oral cancer using image processing and Artificial intelligence. Fusion Pract. Appl. 2024, 14, 293–308. [Google Scholar] [CrossRef]
  30. Hossain, M.M.; Miah, M.B.A.; Saedi, M.; Sifat, T.A.; Hossain, M.N.; Hussain, N. An IoT-Based Lung Cancer Detection System from CT Images Using Deep Learning. In Proceedings of the International Conference on Emerging Trends in Cybersecurity (ICETCS 2025), Wolverhampton, UK, 27–28 October 2025; Lecture Notes in Electrical Engineering. Springer: Berlin/Heidelberg, Germany, 2025. [Google Scholar]
  31. Rahman, M.A.; Miah, M.B.A.; Hossain, M.A.; Hosen, A.S. Enhanced Brain Tumor Classification Using MobileNetV2: A Comprehensive Preprocessing and Fine-Tuning Approach. BioMedInformatics 2025, 5, 30. [Google Scholar] [CrossRef]
  32. Ulaganathan, G.; Niazi, K.T.M.; Srinivasan, S.; Balaji, V.; Manikandan, D.; Hameed, K.S.; Banumathi, A. A clinicopathological study of various oral cancer diagnostic techniques. J. Pharm. Bioallied Sci. 2017, 9, S4. [Google Scholar] [CrossRef]
  33. Li, L.; Pu, C.; Tao, J.; Zhu, L.; Hu, S.; Qiao, B.; Xing, L.; Wei, B.; Shi, C.; Chen, P.; et al. Development of an oral cancer detection system through deep learning. BMC Oral Health 2024, 24, 1468. [Google Scholar] [CrossRef]
  34. Nanditha, B.; MP, G. Oral cancer detection using machine learning and deep learning techniques. Int. J. Curr. Res. Rev. 2022, 14, 64–70. [Google Scholar] [CrossRef]
  35. Song, B.; Sunny, S.; Uthoff, R.D.; Patrick, S.; Suresh, A.; Kolur, T.; Keerthi, G.; Anbarani, A.; Wilder-Smith, P.; Kuriakose, M.A.; et al. Automatic classification of dual-modalilty, smartphone-based oral dysplasia and malignancy images using deep learning. Biomed. Opt. Express 2018, 9, 5318–5329. [Google Scholar] [CrossRef] [PubMed]
  36. Panigrahi, S.; Nanda, B.S.; Swarnkar, T. Comparative analysis of machine learning algorithms for histopathological images of oral cancer. In Advances in Distributed Computing and Machine Learning: Proceedings of ICADCML 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 318–327. [Google Scholar]
  37. Senthil Pandi, S.; Sutha, J.; Kumaragurubaran, T.; Kumar, P. Enhanced Classification of Oral Cancer Using Deep Learning Techniques. In Proceedings of the 2024 Second International Conference on Advances in Information Technology (ICAIT), Chikkamagaluru, India, 24–27 July 2024; IEEE: Piscataway, NJ, USA, 2024; Volume 1, pp. 1–5. [Google Scholar]
  38. Panahi, O.; Farrokh, S. The Use of Machine Learning for Personalized Dental-Medicine Treatment. Glob. J. Med. Biomed. Case Rep. 2025, 1, 2. [Google Scholar]
  39. Jeyaraj, P.R.; Samuel Nadar, E.R. Computer-assisted medical image classification for early diagnosis of oral cancer employing deep learning algorithm. J. Cancer Res. Clin. Oncol. 2019, 145, 829–837. [Google Scholar] [CrossRef] [PubMed]
  40. Halder, A.; Laha, S.; Bandyopadhyay, S.; Schwenker, F.; Sarkar, R. A Metaheuristic Optimization Based Deep Feature Selection for Oral Cancer Classification. In Proceedings of the IAPR Workshop on Artificial Neural Networks in Pattern Recognition, Montreal, BC, Canada, 10–12 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 132–143. [Google Scholar]
  41. Akhi, A.B.; Al Noman, A.; Shaha, S.P.; Akter, F.; Lata, M.A.; Sheikh, R. OCNet-23: A fine-tuned transfer learning approach for oral cancer detection from histopathological images. Int. J. Electr. Comput. Eng. (IJECE) 2025, 15, 1826–1833. [Google Scholar] [CrossRef]
  42. Rahman, T.Y. A histopathological image repository of normal epithelium of Oral Cavity and Oral Squamous Cell Carcinoma. Mendeley Data 2019. [Google Scholar] [CrossRef]
  43. Bury, T. For Image Generation Process. In Proceedings of the Information and Software Technologies: 30th International Conference, ICIST 2024, Kaunas, Lithuania, 17–18 October 2024; Springer Nature: Berlin/Heidelberg, Germany, 2025; Volume 2401, p. 61. [Google Scholar]
  44. Halloum, K.; Ez-Zahraouy, H. Enhancing Medical Image Classification through Transfer Learning and CLAHE Optimization. Curr. Med Imaging 2025, 21, e15734056342623. [Google Scholar] [CrossRef]
  45. Zhang, H.; Liu, Y.; Cai, G. A novel 3D bilateral filtering algorithm with noise level estimation assisted by multi-temporal SAR. PLoS ONE 2025, 20, e0315395. [Google Scholar] [CrossRef]
  46. Asnake, N.W.; Ayalew, A.M.; Engda, A.A. Detection of oral squamous cell carcinoma cancer using AlexNet on histopathological images. Discov. Appl. Sci. 2025, 7, 155. [Google Scholar] [CrossRef]
  47. Pérez-Enriquez, L.; Jiménez-Domínguez, M.; García-Rojas, N.; Zapotecas-Martínez, S.; Altamirano-Robles, L. Image Contrast Enhancement: The Synergistic Power of a Dual-Gamma Correction Function and Evolutionary Algorithms. Comput. Y Sist. 2025, 29, 91–101. [Google Scholar] [CrossRef]
  48. Lin, S.; Zhou, H.; Watson, M.; Govindan, R.; Cote, R.J.; Yang, C. Impact of stain variation and color normalization for prognostic predictions in pathology. Sci. Rep. 2025, 15, 2369. [Google Scholar] [CrossRef]
  49. Du, Z.; Zhang, P.; Huang, X.; Hu, Z.; Yang, G.; Xi, M.; Liu, D. Deeply supervised two stage generative adversarial network for stain normalization. Sci. Rep. 2025, 15, 7068. [Google Scholar] [CrossRef] [PubMed]
  50. Heilmann, T.A. Sharp Images and Unsharp Masks. Transbordeur. Photogr. Hist. Soc. 2025, 9. [Google Scholar] [CrossRef]
  51. Verma, K.; Srivastava, S.; Mishra, R.K. Optimized Reformed Anisotropic Diffusion Unsharp Masking Filter for MR Images. Trait. Signal 2025, 42, 2181–2194. [Google Scholar] [CrossRef]
  52. Adeoye, J.; Koohi-Moghadam, M.; Choi, S.W.; Zheng, L.W.; Lo, A.W.I.; Tsang, R.K.Y.; Chow, V.L.Y.; Akinshipo, A.; Thomson, P.; Su, Y.X. Predicting oral cancer risk in patients with oral leukoplakia and oral lichenoid mucositis using machine learning. J. Big Data 2023, 10, 39. [Google Scholar] [CrossRef]
  53. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  54. Hilal, B.K.; AlShemmary, E.N. Detecting Hypertrophic Cardiomyopathy: A Deep Learning Approach with CNNs and Swin Transformers. Int. J. Intell. Eng. Syst. 2025, 18, 44–65. [Google Scholar] [CrossRef]
  55. Emegano, D.I.; Mustapha, M.T.; Ozsahin, I.; Ozsahin, D.U.; Uzun, B. Advancing Prostate Cancer Diagnostics: A ConvNeXt Approach to Multi-Class Classification in Underrepresented Populations. Bioengineering 2025, 12, 369. [Google Scholar] [CrossRef]
  56. Kumar, A.; Yadav, S.P.; Kumar, A. An improved feature extraction algorithm for robust Swin Transformer model in high-dimensional medical image analysis. Comput. Biol. Med. 2025, 188, 109822. [Google Scholar] [CrossRef]
  57. Velu, K.; Jaisankar, N. Design of a CNN–Swin transformer model for Alzheimer’s disease prediction using MRI images. IEEE Access 2025, 13, 149409–149429. [Google Scholar] [CrossRef]
  58. Zhang, L.; Yin, X.; Liu, X.; Liu, Z. Medical image segmentation by combining feature enhancement Swin Transformer and UperNet. Sci. Rep. 2025, 15, 14565. [Google Scholar] [CrossRef]
  59. Ansith, S.; Ananth, A.; Deni, R.E.; Kala, S. Swin-RSIC: Remote sensing image classification using a modified swin transformer with explainability. Earth Sci. Inform. 2025, 18, 362. [Google Scholar]
  60. Guo, Y.; Li, W.; Zhai, P. Swin-transformer for weak feature matching. Sci. Rep. 2025, 15, 2961. [Google Scholar] [CrossRef] [PubMed]
  61. Mzoughi, H.; Njeh, I.; BenSlima, M.; Farhat, N.; Mhiri, C. Vision transformers (ViT) and deep convolutional neural network (D-CNN)-based models for MRI brain primary tumors images multi-classification supported by explainable artificial intelligence (XAI). Vis. Comput. 2025, 41, 2123–2142. [Google Scholar] [CrossRef]
  62. Jahan, I.; Chowdhury, M.E.; Vranic, S.; Al Saady, R.M.; Kabir, S.; Pranto, Z.H.; Mim, S.J.; Nobi, S.F. Deep learning and vision transformers-based framework for breast cancer and subtype identification. Neural Comput. Appl. 2025, 37, 9311–9330. [Google Scholar] [CrossRef]
  63. Mannepalli, D.; Tak, T.K.; Krishnan, S.B.; Sreenivas, V. GSC-DVIT: A vision transformer based deep learning model for lung cancer classification in CT images. Biomed. Signal Process. Control 2025, 103, 107371. [Google Scholar] [CrossRef]
  64. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  65. Hu, J.; Xiang, Y.; Lin, Y.; Du, J.; Zhang, H.; Liu, H. Multi-scale Transformer architecture for accurate medical image classification. In Proceedings of the 2025 International Conference on Artificial Intelligence and Computational Intelligence, Kuala Lumpur, Malaysia, 14–16 February 2025; pp. 409–414. [Google Scholar]
  66. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
  67. Hancock, J.T.; Khoshgoftaar, T.M.; Liang, Q. A problem-agnostic approach to feature selection and analysis using shap. J. Big Data 2025, 12, 12. [Google Scholar] [CrossRef]
  68. Miah, M.B.A.; Awang, S.; Azad, M.S.; Rahman, M.M. Keyphrases concentrated area identification from academic articles as feature of keyphrase extraction: A new unsupervised approach. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 789–796. [Google Scholar] [CrossRef]
  69. Hausken, K.; Mohr, M. The value of a player in n-person games. Soc. Choice Welf. 2001, 18, 465–483. [Google Scholar] [CrossRef]
  70. Noor, S.; AlQahtani, S.A.; Khan, S. Chronic liver disease detection using ranking and projection-based feature optimization with deep learning. AIMS Bioeng. 2025, 12, 50–68. [Google Scholar] [CrossRef]
  71. Ji, Y.; Shang, H.; Yi, J.; Zang, W.; Cao, W. Machine learning-based models to predict type 2 diabetes combined with coronary heart disease and feature analysis-based on interpretable SHAP. Acta Diabetol. 2025, 62, 1631–1646. [Google Scholar] [CrossRef]
  72. Aas, K.; Jullum, M.; Løland, A. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artif. Intell. 2021, 298, 103502. [Google Scholar] [CrossRef]
  73. Sahoo, P.; Saha, S.; Sharma, S.K.; Mondal, S. Boosting cervical cancer detection with a multi-stage architecture and complementary information fusion. Soft Comput. 2025, 29, 1191–1206. [Google Scholar] [CrossRef]
  74. Hosen, M.F.; Mahmud, S.H.; Goh, K.O.M.; Uddin, M.S.; Nandi, D.; Shatabda, S.; Shoombuatong, W. An LSTM network-based model with attention techniques for predicting linear T-cell epitopes of the hepatitis C virus. Results Eng. 2024, 24, 103476. [Google Scholar] [CrossRef]
  75. Wu, Z.; Zhang, H.; Fang, C. Research on machine vision online monitoring system for egg production and quality in cage environment. Poult. Sci. 2025, 104, 104552. [Google Scholar] [CrossRef]
  76. Li, Y.; Gao, F.; Yu, J.; Fei, T. Machine learning based thermal comfort prediction in office spaces: Integrating SMOTE and SHAP methods. Energy Build. 2025, 329, 115267. [Google Scholar] [CrossRef]
  77. Miah, M.B.A.; Awang, S.; Rahman, M.M.; Hosen, A.S.; Ra, I.H. Keyphrases frequency analysis from research articles: A region-based unsupervised novel approach. IEEE Access 2022, 10, 120838–120849. [Google Scholar] [CrossRef]
  78. Miah, M.B.A.; Awang, S.; Rahman, M.M.; Hosen, A.S.; Ra, I.H. A new unsupervised technique to analyze the centroid and frequency of keyphrases from academic articles. Electronics 2022, 11, 2773. [Google Scholar] [CrossRef]
  79. Miah, M.B.A.; Awang, S.; Rahman, M.M.; Hosen, A.S. Keyphrase Distance Analysis Technique from News Articles as a Feature for Keyphrase Extraction: An Unsupervised Approach. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 995–1002. [Google Scholar] [CrossRef]
  80. Hossain, M.N.; Bhuiyan, E.; Miah, M.B.A.; Sifat, T.A.; Muhammad, Z.; Masud, M.F.A. Detection and Classification of Kidney Disease from CT Images: An Automated Deep Learning Approach. Technologies 2025, 13, 508. [Google Scholar] [CrossRef]
  81. Shavlokhova, V.; Sandhu, S.; Flechtenmacher, C.; Koveshazi, I.; Neumeier, F.; Padrón-Laso, V.; Jonke, Ž.; Saravi, B.; Vollmer, M.; Vollmer, A.; et al. Deep learning on oral squamous cell carcinoma ex vivo fluorescent confocal microscopy data: A feasibility study. J. Clin. Med. 2021, 10, 5326. [Google Scholar] [CrossRef]
  82. Dai, Z.; Zhu, B.; Yu, H.; Jian, X.; Peng, J.; Fang, C.; Wu, Y. Role of autophagy induced by arecoline in angiogenesis of oral submucous fibrosis. Arch. Oral Biol. 2019, 102, 7–15. [Google Scholar] [CrossRef] [PubMed]
  83. Yu, M.; Ding, J.; Liu, W.; Tang, X.; Xia, J.; Liang, S.; Jing, R.; Zhu, L.; Zhang, T. Deep multi-feature fusion residual network for oral squamous cell carcinoma classification and its intelligent system using Raman spectroscopy. Biomed. Signal Process. Control 2023, 86, 105339. [Google Scholar] [CrossRef]
  84. Chang, X.; Yu, M.; Liu, R.; Jing, R.; Ding, J.; Xia, J.; Zhu, Z.; Li, X.; Yao, Q.; Zhu, L.; et al. Deep learning methods for oral cancer detection using Raman spectroscopy. Vib. Spectrosc. 2023, 126, 103522. [Google Scholar] [CrossRef]
  85. Panigrahi, S.; Nanda, B.S.; Bhuyan, R.; Kumar, K.; Ghosh, S.; Swarnkar, T. Classifying histopathological images of oral squamous cell carcinoma using deep transfer learning. Heliyon 2023, 9, e13444. [Google Scholar] [CrossRef]
  86. Sukegawa, S.; Ono, S.; Tanaka, F.; Inoue, Y.; Hara, T.; Yoshii, K.; Nakano, K.; Takabatake, K.; Kawai, H.; Katsumitsu, S.; et al. Effectiveness of deep learning classifiers in histopathological diagnosis of oral squamous cell carcinoma by pathologists. Sci. Rep. 2023, 13, 11676. [Google Scholar] [CrossRef]
  87. Das, M.; Dash, R.; Mishra, S.K. Automatic detection of oral squamous cell carcinoma from histopathological images of oral mucosa using deep convolutional neural network. Int. J. Environ. Res. Public Health 2023, 20, 2131. [Google Scholar] [CrossRef]
  88. Nagarajan, B.; Chakravarthy, S.; Venkatesan, V.K.; Ramakrishna, M.T.; Khan, S.B.; Basheer, S.; Albalawi, E. A deep learning framework with an intermediate layer using the swarm intelligence optimizer for diagnosing oral squamous cell carcinoma. Diagnostics 2023, 13, 3461. [Google Scholar] [CrossRef]
  89. Flügge, T.; Gaudin, R.; Sabatakakis, A.; Tröltzsch, D.; Heiland, M.; van Nistelrooij, N.; Vinayahalingam, S. Detection of oral squamous cell carcinoma in clinical photographs using a vision transformer. Sci. Rep. 2023, 13, 2296. [Google Scholar] [CrossRef]
  90. Albalawi, E.; Thakur, A.; Ramakrishna, M.T.; Bhatia Khan, S.; SankaraNarayanan, S.; Almarri, B.; Hadi, T.H. Oral squamous cell carcinoma detection using EfficientNet on histopathological images. Front. Med. 2024, 10, 1349336. [Google Scholar] [CrossRef]
Figure 2. Overview of the preprocessing pipeline applied to histopathology images, demonstrating the transformations from the original image through histogram equalization, CLAHE, contrast enhancement, and stain normalization for both NORMAL and OSCC samples.
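To make the preprocessing steps named in Figure 2 concrete, the sketch below applies CLAHE and simple contrast stretching with OpenCV. It is only an illustration: the clip limit, the tile size, and the decision to leave stain normalization out (it requires a reference slide, e.g., for a Macenko-style method) are assumptions, not the authors' exact settings.

```python
import cv2
import numpy as np

def enhance_patch(img_bgr: np.ndarray) -> np.ndarray:
    """Illustrative enhancement of a histopathology patch: CLAHE on the
    luminance channel followed by global contrast stretching. Stain
    normalization is omitted here because it needs a reference slide."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)

    # Contrast-limited adaptive histogram equalization on the L channel only,
    # so the hue information carried by the H&E stain is preserved.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l_eq = clahe.apply(l)

    enhanced = cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

    # Stretch intensities to the full 0-255 range for a consistent dynamic range.
    return cv2.normalize(enhanced, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Example usage on a single image file (the path is hypothetical).
patch = cv2.imread("oscc_patch.png")
processed = enhance_patch(patch)
```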
Figure 3. ROC curves for (A) Swin Transformer, (B) ConvNeXt, and (C) BEiT across all classifiers, while (D–F) show the corresponding precision–recall curves. The Swin Transformer achieves the best results among the three.
Figure 4. (A,B) illustrate the top feature extractor–classifier pairs, while (C,D) show the effect of feature count on performance using the Swin Transformer.
Figure 5. Comparison of feature selection methods (A,C) and feature counts (B,D) highlighting SHAP’s consistent performance and showing that using 500 features provides the most balanced results.
Figure 6. Visual comparison of feature selection techniques: (A) overlap of selected features, (B) feature counts across methods, (C) performance evaluation of feature subsets and (D) SHAP summary plot of the top 20 influential features.
Figure 7. Grad-CAM images were generated using the final feature representation produced by the Swin Transformer. In these visualizations, important regions are highlighted in red, while less significant areas appear in blue.
Figure 8. Comparison of model variants across evaluation metrics showing (A) grouped performance scores and (B) performance trends, with the Full Model achieving consistently superior results.
Table 1. Dataset Details.
Parameter        Specification
Total Images     528
Normal/OSCC      89/439
Patients         230
Magnification    100× (H&E stained)
Table 2. Experimental results of the performance of feature extractors with different classifiers.
Feature Extractor    Classifier        ACC      AUC      PRE      SP       SN       F1       MCC
ConvNeXt             Random Forest     0.7985   0.8724   0.7681   0.7324   0.7193   0.7523   0.6742
                     CatBoost          0.8124   0.8541   0.7891   0.7482   0.7410   0.7812   0.7311
                     Focal Loss SVM    0.8562   0.9310   0.8478   0.7589   0.7661   0.7894   0.7833
                     Attention-CNN     0.8841   0.9492   0.8847   0.8134   0.8210   0.8481   0.8294
                     ViT               0.9126   0.9662   0.9234   0.8791   0.8864   0.9170   0.8921
BEiT                 Random Forest     0.7342   0.8834   0.5893   0.6844   0.6752   0.5093   0.7081
                     CatBoost          0.7654   0.8901   0.7041   0.7103   0.6989   0.6012   0.7394
                     Focal Loss SVM    0.8073   0.8989   0.8092   0.7512   0.7394   0.6762   0.7620
                     Attention-CNN     0.8421   0.9023   0.9174   0.8133   0.8021   0.7555   0.8024
                     ViT               0.9012   0.9184   0.9321   0.8722   0.8641   0.8794   0.8613
Swin Transformer     Random Forest     0.8031   0.8622   0.7724   0.6421   0.6554   0.6681   0.4043
                     CatBoost          0.7814   0.8981   0.7463   0.6024   0.6138   0.6132   0.3194
                     Focal Loss SVM    0.8482   0.9203   0.8191   0.7911   0.7820   0.7663   0.6670
                     Attention-CNN     0.9124   0.9489   0.8912   0.8851   0.8924   0.9281   0.8214
                     ViT               0.9421   0.9738   0.9602   0.9741   0.9662   0.9623   0.9714
Table 3. Experimental results of the performance of different feature selection techniques for OC detection.
Feature Selection     ACC      AUC      PRE      SP       SN       F1       MCC
mRMR                  0.9625   0.9642   0.9633   0.9723   0.9712   0.9724   0.9735
Lasso                 0.9612   0.9552   0.9562   0.9778   0.9643   0.9634   0.9655
Boruta                0.9595   0.9462   0.9491   0.9833   0.9574   0.9544   0.9575
Genetic Algorithm     0.9780   0.9772   0.9820   0.9888   0.9705   0.9654   0.9695
SHAP                  0.9925   0.9822   0.9826   0.9918   0.9921   0.9843   0.9821
Table 4. Performance of the Swin Transformer combined with SHAP across varying numbers of selected features.
Features    ACC      AUC      PRE      SP       SN       F1       MCC
200         0.8875   0.9200   0.9831   0.9286   0.8788   0.9274   0.6977
300         0.9375   0.9400   0.9841   0.9286   0.9394   0.9606   0.8063
500         0.9925   0.9822   0.9826   0.9918   0.9921   0.9843   0.9821
700         0.9625   0.9750   0.9701   0.8571   0.9848   0.9765   0.8673
900         0.9375   0.9600   0.9552   0.7857   0.9697   0.9620   0.7778
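As a rough illustration of the SHAP-based selection behind Tables 3 and 4, the sketch below ranks extracted features by mean absolute SHAP value obtained from a tree-based surrogate classifier and keeps the top 500 (the best-performing count in Table 4). The choice of surrogate model, its settings, and the placeholder data are illustrative assumptions rather than the study's exact pipeline.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# X: (n_samples, 1024) Swin feature matrix, y: binary labels (NORMAL=0, OSCC=1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024)).astype(np.float32)   # placeholder features
y = rng.integers(0, 2, size=200)                       # placeholder labels

# Tree-based surrogate used only to obtain SHAP values for ranking features.
surrogate = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
shap_values = shap.TreeExplainer(surrogate).shap_values(X)

# Depending on the shap version this is a list (one array per class) or a single
# array; either way, average |SHAP| over every axis except the feature axis.
sv = np.stack(shap_values, axis=-1) if isinstance(shap_values, list) else np.asarray(shap_values)
importance = np.abs(sv).mean(axis=tuple(i for i in range(sv.ndim) if i != 1))

top_k = 500
selected = np.argsort(importance)[::-1][:top_k]
X_selected = X[:, selected]   # reduced feature matrix passed on to the classifier
```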
Table 5. Hyperparameters for Swin Transformer (Feature Extractor).
Name                            Parameter
Output Dimension                1024
Base Model Name                 swin_large_patch4_window7_224
Base Feature Dimension          1536
Attention Dimension             1536
First Linear Layer Input        1536
First Linear Layer Output      1280
Second Linear Layer Input       1280
Second Linear Layer Output      1024
First Activation Function       ReLU
Second Activation Function      ReLU
First Batch Normalization       BatchNorm1d(1280)
Second Batch Normalization      BatchNorm1d(1024)
Dropout Rate                    0.2
Patch Size                      4
Window Size                     7
Pooling Method                  Mean pooling
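The dimensions in Table 5 can be read as a small projection head on top of a Swin-Large backbone. The sketch below is a minimal PyTorch reading of that configuration; the use of timm, the placement of the dropout layer, and the simplification of the 1536-dimensional attention block to the backbone's mean-pooled output are assumptions made for illustration.

```python
import timm
import torch
import torch.nn as nn

class SwinFeatureExtractor(nn.Module):
    """Minimal reading of the Table 5 configuration: a Swin-Large backbone with
    mean pooling, followed by two linear layers (1536 -> 1280 -> 1024) with
    batch normalization, ReLU, and dropout."""

    def __init__(self, output_dim: int = 1024):
        super().__init__()
        # num_classes=0 removes the classification head; the backbone then returns
        # mean-pooled 1536-D features. Set pretrained=True to load ImageNet weights.
        self.backbone = timm.create_model(
            "swin_large_patch4_window7_224", pretrained=False, num_classes=0
        )
        self.head = nn.Sequential(
            nn.Linear(1536, 1280),
            nn.BatchNorm1d(1280),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1280, output_dim),
            nn.BatchNorm1d(output_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

# Example: a batch of two 224x224 RGB patches -> two 1024-D feature vectors.
extractor = SwinFeatureExtractor().eval()
with torch.no_grad():
    feats = extractor(torch.randn(2, 3, 224, 224))  # shape (2, 1024)
```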
Table 6. Hyperparameters for ViT Classifier (Proposed Model).
Name                            Parameter
Input Dimension                 1024
Embedding Dimension             128
Number of Attention Heads       8
Feedforward Dimension           256
Dropout Rate                    0.1
Number of Transformer Layers    4
Learning Rate                   0.0001
Training Epochs                 50
Batch Size                      128
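Table 6 describes a compact Transformer-encoder classifier operating on the 1024-dimensional feature vectors. The sketch below is consistent with those numbers, but how the feature vector is tokenized (here, one feature token plus a learnable class token with additive positional embeddings) and the use of PyTorch's built-in encoder layers are assumptions. Training it in line with the remaining Table 6 entries would mean a learning rate of 1e-4, a batch size of 128, and 50 epochs, with the optimizer left unspecified here.

```python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    """Sketch matching Table 6: a 1024-D input projected to a 128-D embedding and
    processed by a 4-layer Transformer encoder (8 heads, feed-forward 256,
    dropout 0.1) before a binary classification head."""

    def __init__(self, input_dim=1024, embed_dim=128, num_heads=8,
                 ff_dim=256, num_layers=4, dropout=0.1, num_classes=2):
        super().__init__()
        self.embed = nn.Linear(input_dim, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional embeddings (the component removed in the
        # "No Positional" ablation of Table 7).
        self.pos_embed = nn.Parameter(torch.zeros(1, 2, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.embed(x).unsqueeze(1)                 # (B, 1, 128)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (B, 1, 128)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(z)[:, 0])             # classify from the CLS token

logits = ViTClassifier()(torch.randn(4, 1024))  # shape (4, 2)
```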
Table 7. Ablation Study Results.
Model           ACC      AUC      PRE      SP       SN       F1       MCC
Full Model      0.9925   0.9822   0.9826   0.9918   0.9921   0.9843   0.9821
No Positional   0.8000   0.8052   0.9310   0.7143   0.8182   0.8708   0.4527
Shallow ViT     0.7875   0.8106   0.9455   0.7857   0.7879   0.8592   0.4703
Deep ViT        0.9875   0.8604   0.9851   0.9286   1.0000   0.9925   0.9562
Max Pooling     0.9750   0.9264   0.9706   0.8571   1.0000   0.9851   0.9121
Table 8. Metric-wise Statistical Test Results.
Metric    t-Test (t-Stat)    t-Test (p-Value)    Wilcoxon (Stat)    Wilcoxon (p-Value)
ACC       −4.105             0.0051              2.5                0.0284
AUC       −3.220             0.0119              4.5                0.0371
PRE       −5.928             0.0006              1.0                0.0120
SN        −4.667             0.0019              1.8                0.0213
F1        −5.872             0.0005              0.0                0.0065
MCC       −3.401             0.0112              3.5                0.0369
Table 9. Model-wise Statistical Test Results.
Model             t-Test (t-Stat)    t-Test (p-Value)    Wilcoxon (Stat)    Wilcoxon (p-Value)
Random Forest     −3.003             0.0183              3.0                0.0350
CatBoost          −4.567             0.0025              1.5                0.0201
Focal Loss SVM    −5.445             0.0009              1.0                0.0154
Attention-CNN     −3.890             0.0059              2.8                0.0298
ViT               −6.751             0.0002              0.0                0.0061
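For reference, the paired t-tests and Wilcoxon signed-rank tests reported in Tables 8 and 9 can be computed over matched per-fold (or per-run) scores. The snippet below shows the procedure with SciPy on hypothetical accuracy values, since the underlying fold-level scores are not reproduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical matched per-fold accuracies: a baseline classifier vs. the proposed model.
baseline = np.array([0.9421, 0.9380, 0.9455, 0.9402, 0.9438])
proposed = np.array([0.9925, 0.9900, 0.9937, 0.9912, 0.9930])

# Paired t-test; a negative t-statistic means the baseline is the lower-scoring member of each pair.
t_stat, t_p = stats.ttest_rel(baseline, proposed)

# Non-parametric counterpart computed on the same paired differences.
w_stat, w_p = stats.wilcoxon(baseline, proposed)

print(f"t-test:   t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"Wilcoxon: W = {w_stat:.1f}, p = {w_p:.4f}")
```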
Table 10. Comparison Study.
Study                       Technique                                           ACC
Shavlokhova et al. [81]     ResNet50 with feature fusion                        92.37%
Dai et al. [82]             ResNet50 with Raman Spectra                         92.48%
Yu et al. [83]              ResNet50 with DCNNs                                 95.84%
Chang et al. [84]           ResNet50 with VGG16                                 85.63%
Panigrahi et al. [85]       Three Kinds of CNN                                  92.14%
Sukegawa et al. [86]        Probability Neural Network                          79.46%
Das et al. [87]             Multiple techniques fusion                          97.41%
Nagarajan et al. [88]       MobileNetV3 with Gorilla Troops Optimizer           94.18%
Flügge et al. [89]          Swin Transformer                                    97.63%
Albalawi et al. [90]        EfficientNet B3 with Advanced Learning Mechanism    98.47%
Proposed Model              Swin Transformer with Vision Transformer            99.25%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
