Article

Attention-Driven Feature Extraction for XAI in Histopathology Leveraging a Hybrid Xception Architecture for Multi-Cancer Diagnosis

by Shirin Shila 1,†, Md. Safayat Hossain 2,†, Md Fuyad Al Masud 3, Mohammad Badrul Alam Miah 2,*, Afrig Aminuddin 4 and Zia Muhammad 5,*
1 Department of Food Technology and Nutritional Science, Mawlana Bhashani Science and Technology University, Tangail 1902, Bangladesh
2 Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Tangail 1902, Bangladesh
3 Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND 58102, USA
4 Department of Information System, Universitas Amikom, Yogyakarta 55283, Indonesia
5 Department of Computing, Design, and Communication, University of Jamestown, Jamestown, ND 58405, USA
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mach. Learn. Knowl. Extr. 2026, 8(2), 31; https://doi.org/10.3390/make8020031
Submission received: 8 December 2025 / Revised: 18 January 2026 / Accepted: 26 January 2026 / Published: 28 January 2026
(This article belongs to the Section Learning)

Abstract

Automated and accurate classification of histopathology images is necessary for the early detection of cancer, especially common cancers such as Colorectal Cancer (CRC) and Lung Cancer (LC). Nonetheless, classical deep learning frameworks often struggle because intra-class variation is large, different classes look alike, and image quality is not stable. To address these constraints, a multi-layer diagnostic framework is presented in detail. The process starts with a preprocessing pipeline that involves gamma correction, bilateral filtering, and adaptive CLAHE, yielding statistically significant changes in quantitative image quality measures. A hybrid attention architecture is then presented that includes an Xception backbone, a Convolutional Block Attention Module (CBAM), a Transformer block, and an MLP classifier to combine local features with global context. When tested on three publicly available datasets, the proposed model achieved classification accuracies of 99.98%, 99.58%, and 99.33% on LC25000, CRC-VAL-HE-7K, and NCT-CRC-HE-100K, respectively. To enhance transparency, detailed explainability analyses are conducted using layer-wise feature visualization and Grad-CAM. Finally, a real-world use of the framework is demonstrated through its implementation in a web-based platform, which can serve as a useful and easy-to-use tool to assist pathological diagnosis.

1. Introduction

Cancer is a leading cause of death worldwide, and lung cancer and colorectal cancer (CRC) are among the most prevalent and fatal types [1]. The definitive diagnosis of most cancers, including CRC and lung cancer, depends on histopathological analysis of hematoxylin and eosin (H&E)-stained tissue slides by trained pathologists [2]. Although this manual method is considered the gold standard, it is labor-intensive and subjective, and it is prone to substantial inter-observer and intra-observer variability, which can undermine diagnostic reliability and patient outcomes [3]. These concerns have driven the growth of computational pathology, which uses computer-aided diagnosis (CAD) systems to analyze slides more objectively and efficiently. Over the past ten years, deep learning, and Convolutional Neural Networks (CNNs) in particular, has advanced histopathological image classification considerably [4]. Architectures such as ResNet, DenseNet, and Xception [5,6,7] can extract complex hierarchical features from pixel data and, in some tasks, match or exceed human performance. However, traditional CNN models face two limitations. First, their convolutional structure is very effective at identifying local features (such as the shape of cell nuclei) but often fails to capture broader contextual information (such as the overall tissue architecture), which is required for accurate diagnosis. Second, these models are black boxes, which hinders their use in clinical practice because pathologists may not be able to identify the reasoning behind a model's output [8].
In addition, the quality of the input data is an important factor in the performance of any deep learning model. Histopathological images vary widely and are often affected by differences in stain intensity, color variation, and artifacts introduced during slide preparation [9]. This variability may severely impair a model's ability to generalize. Although stain normalization is a common preprocessing step, simple normalization might be inadequate. A strong image enhancement pipeline is needed not only to standardize images but also to reveal textural details that are critical to classification, and its effect must be quantitatively verified. To address these shortcomings, a multi-stage hybrid deep learning framework is proposed. It begins with a preprocessing pipeline that includes gamma correction, bilateral filtering, and Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance the images [10].
These improvements are quantified using image quality measures such as entropy and PSNR. For the classification task, a hybrid architecture is presented that combines the benefits of different paradigms. The Xception network [7] serves as the backbone for local feature extraction. The features are then refined using a Convolutional Block Attention Module (CBAM) [10], which adaptively re-weights them by focusing on the most important spatial and channel-wise regions. A Transformer block is added to capture the broader context that traditional CNNs can easily miss. Although Transformers were originally designed for natural language processing, Vision Transformers (ViT) [11] have proven highly capable of identifying long-range dependencies in images and are therefore well suited to analyzing tissue structure. The proposed model combines these components into a single network that is trainable end-to-end. To enhance transparency and build clinical confidence, explainability (XAI) techniques, in particular Gradient-weighted Class Activation Mapping (Grad-CAM) [12] and layer-wise feature visualization, are applied to explain the model's decisions. Finally, to demonstrate the practical usefulness of this research, the trained model was deployed as an operational web-based tool. The proposed model was trained and evaluated on three widely used public datasets: NCT-CRC-HE-100K [13], CRC-VAL-HE-7K [13], and LC25000 [14]. The framework achieves state-of-the-art performance, which supports it as a stable, accurate, and interpretable methodology for automated histopathological diagnosis. The key contributions of this study are as follows:
1. A quantitative comparison of histopathology images before and after a novel enhancement pipeline (gamma correction, bilateral filtering, CLAHE), using IQA metrics to validate its efficacy.
2. The design and implementation of a novel hybrid architecture (Xception–CBAM–Transformer) that synergistically combines local feature extraction, dual-axis attention, and global context modeling.
3. State-of-the-art classification performance demonstrated on three distinct and widely used CRC and lung cancer datasets.
4. A comprehensive explainability analysis using Grad-CAM and feature visualization to ensure model transparency and trustworthiness.
5. The development and deployment of a web-based platform, translating our research into a practical tool for pathological assistance.
The rest of this paper is structured as follows. Section 2 reviews the related literature. Section 3 describes the materials and methods, including the proposed framework and the data used. Section 4 describes the experimental setup and implementation. Section 5 presents the experimental results. Section 6 discusses the findings in detail. Lastly, Section 7 concludes the paper with a summary of the main findings and contributions.

2. Related Work

Automated histopathological image classification for cancer diagnosis has improved considerably, moving from traditional machine learning to elaborate deep learning models. This section reviews the key techniques for classifying colorectal cancer (CRC) and lung cancer, together with their strengths and weaknesses, and thereby motivates the hybrid approach proposed in this paper.

2.1. Convolutional Neural Networks (CNNs) in Pathology

Convolutional Neural Networks (CNNs) were the first significant step in computational pathology. Architectures pre-trained on ImageNet, including ResNet and Xception [5,7], are commonly used as feature extractors. Ongoing research continues to use these models, often in ensemble approaches, to attain high accuracy on datasets such as LC25000, CRC-VAL-HE-7K, and NCT-CRC-HE-100K [13,14].
Although these models excel at capturing local, hierarchical features (like nuclear atypia and mitotic figures), conventional CNNs have a key shortcoming: their effective receptive field is limited to a local scope. They find it challenging to represent long-range spatial relationships and the overall tissue architecture, such as the connections between tumor-infiltrating lymphocytes and epithelial cells that are distantly located, which are often vital for precise diagnosis and grading [15].

2.2. Attention-Enhanced CNNs

In order to overcome the shortcomings of conventional CNNs, researchers started incorporating attention mechanisms. These modules, such as the Convolutional Block Attention Module (CBAM) [10], help the network “learn what and where to emphasize.” Recent 2024 studies confirm that integrating spatial and channel attention mechanisms (including CBAM) into CNN backbones significantly improves focus on critical regions, reduces noise, and enhances classification accuracy [15]. While these attention-gated CNNs improve performance, they are still fundamentally constrained by the convolutional backbone. They enhance the focus of local feature extraction but do not solve the problem of modeling global context. A comparative summary of existing histopathological image classification approaches, their strengths, and limitations is presented in Table 1, which motivates the design of the proposed hybrid framework.

2.3. Transformers in Computational Pathology

Recently, the Vision Transformer (ViT) [19] has emerged as a formidable alternative. According to a comprehensive analysis, the application of Transformers is increasingly prevalent in the domain of pathology. They effectively understand global, long-range relationships by dividing an image into smaller segments and employing self-attention. Nonetheless, pure Transformers are known for their substantial data requirements and may miss the detailed, pixel-level textural details that CNNs excel in capturing.

2.4. Hybrid Architectures and Identified Gaps

Recent work is moving toward hybrid models that combine the strengths of both families: a CNN for effective local feature extraction and a Transformer to capture the global context of those features. Studies published in 2024 have explicitly demonstrated the efficacy of hybrid Xception–Transformer designs, in which Xception extracts local features that a Transformer then models as tokens with global context [20].
The present work builds on this line of research, with several substantial improvements. Three gaps were identified in the current literature:
  • Preprocessing: Many studies do not employ or, more importantly, quantitatively validate a preprocessing pipeline to handle the extreme stain variability in multi-center datasets.
  • Feature Refinement: While hybrid models exist, the features are often passed directly from the CNN to the Transformer. In this work, the features are first refined with a lightweight dual-attention mechanism (CBAM), which is intended to provide a more salient and robust input to the Transformer block.
  • Explainability: Many complex hybrid models remain “black boxes.” This work integrates Grad-CAM and feature visualization as a core part of the methodology to support clinical trust and interpretability.
The proposed framework, a validated preprocessing pipeline followed by an Xception–CBAM–Transformer network, is explicitly designed to address these three gaps.

3. Materials and Methods

This section details the datasets, the preprocessing pipeline, and the novel hybrid architecture used in this study. The overall workflow of our methodology is presented in Figure 1.

3.1. Dataset Description

This research work utilized three publicly available, large-scale histopathology datasets to ensure the robustness and generalizability of our model.
  • CRC-VAL-HE-7K [13]: This validation set is associated with the NCT-CRC dataset and comprises 7180 image patches from various patients, providing a strong evaluation of the model’s ability to generalize.
  • NCT-CRC-HE-100K [13]: This dataset comprises 100,000 distinct image patches derived from 86 patients. Each image has dimensions of 224 × 224 pixels and features nine different tissue categories: Adipose (ADI), Background (BACK), Debris (DEB), Lymphocytes (LYM), Mucus (MUC), Smooth Muscle (MUS), Normal Colon Mucosa (NORM), Cancer-Associated Stroma (STR), and Colorectal Adenocarcinoma Epithelium (TUM).
  • LC25000 [14]: The LC25000 dataset contains 25,000 histopathological images, which are distributed across five diagnostic classes. It was generated via extensive data augmentation from a limited number of original lung and colon images. As the publicly available dataset lacks patient- or slide-level identifiers, true patient-level separation cannot be ensured. Therefore, LC25000 is treated as an image-level benchmark in this study, and results are interpreted accordingly.
The visualization of the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets’ class distribution is shown in Figure 2.
The datasets were divided into training (80%) and validation (20%) subsets. The numerical distribution of images across the respective classes after the data partitioning is summarized in Table 2, Table 3 and Table 4.

3.2. Image Preprocessing and Enhancement

Histopathological images often show considerable variability in staining and appearance. To unify the data and amplify important features, this research developed a multi-phase preprocessing pipeline.

Image Enhancement Pipeline

The purpose of this first step, which was used prior to the train/validation split, was to enhance image fidelity.
1. Gamma Correction: Gamma correction is a nonlinear transformation of the overall brightness and contrast of an image that makes dark areas more visible and bright areas clearer.
\[ I_{gamma}(x, y) = 255 \times \left( \frac{I(x, y)}{255} \right)^{\frac{1}{\gamma}} \]
where \(I(x, y)\) is the original pixel intensity at coordinates \((x, y)\), \(I_{gamma}(x, y)\) is the gamma-corrected pixel intensity, and \(\gamma\) is the correction parameter. In this study, \(\gamma = 1.2\) was selected to enhance contrast without overexposure.
2. Bilateral Filtering: Bilateral filtering reduces noise while preserving edges, which is important for retaining the structural details of nuclei and tissues in histopathology images. The output of the bilateral filter at pixel \((x, y)\) is calculated as follows:
\[ I_{bilateral}(x, y) = \frac{1}{W_p} \sum_{(i, j) \in \Omega} I(i, j) \times \exp\left( -\frac{(i - x)^2 + (j - y)^2}{2\sigma_s^2} \right) \times \exp\left( -\frac{|I(i, j) - I(x, y)|^2}{2\sigma_r^2} \right) \]
\[ W_p = \sum_{(i, j) \in \Omega} \exp\left( -\frac{(i - x)^2 + (j - y)^2}{2\sigma_s^2} \right) \times \exp\left( -\frac{|I(i, j) - I(x, y)|^2}{2\sigma_r^2} \right) \]
where \(\Omega\) is the spatial neighborhood around pixel \((x, y)\), \(\sigma_s\) controls spatial smoothing, \(\sigma_r\) controls intensity similarity, and \(W_p\) is the normalization factor.
3. Adaptive CLAHE: Contrast Limited Adaptive Histogram Equalization (CLAHE) enhances local contrast while preventing over-amplification of noise in homogeneous regions. For each pixel \((x, y)\) in each tile, the output intensity is computed as:
\[ I_{CLAHE}(x, y) = CDF_{clip}\big(I(x, y)\big) \times (L - 1) \]
where \(CDF_{clip}\) is the clipped cumulative distribution function of the pixel intensities within the tile, and \(L\) is the total number of possible intensity levels (typically 256). Clipping limits the slope of the CDF, which prevents over-enhancement in areas that are almost uniform.
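The gamma-correction and bilateral-filtering formulas above can be sketched in plain Python on a tiny grayscale patch. This is a minimal illustration rather than the authors' implementation: the function names, the 3 × 3 window (`radius=1`), and the values of `sigma_s` and `sigma_r` are assumptions chosen for the example; only γ = 1.2 comes from the text.

```python
import math

def gamma_correct(img, gamma=1.2):
    """Apply I_gamma = 255 * (I / 255) ** (1 / gamma) to an 8-bit grayscale image."""
    return [[255.0 * (p / 255.0) ** (1.0 / gamma) for p in row] for row in img]

def bilateral_filter(img, radius=1, sigma_s=1.0, sigma_r=25.0):
    """Edge-preserving smoothing; radius and both sigmas are illustrative assumptions."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for x in range(h):
        for y in range(w):
            num = den = 0.0
            # Omega: the window around (x, y), truncated at the image border.
            for i in range(max(0, x - radius), min(h, x + radius + 1)):
                for j in range(max(0, y - radius), min(w, y + radius + 1)):
                    spatial = math.exp(-((i - x) ** 2 + (j - y) ** 2) / (2 * sigma_s ** 2))
                    similarity = math.exp(-((img[i][j] - img[x][y]) ** 2) / (2 * sigma_r ** 2))
                    num += img[i][j] * spatial * similarity
                    den += spatial * similarity  # accumulates W_p
            out[x][y] = num / den
    return out

# A dark region next to a bright vertical edge.
patch = [[10, 10, 200], [10, 12, 200], [10, 10, 200]]
bright = gamma_correct(patch)
smooth = bilateral_filter(bright)
```

With γ = 1.2 the exponent 1/γ is below 1, so dark pixels are lifted more than bright ones, and the intensity term of the bilateral filter keeps the bright column from bleeding into the dark region.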
The overall image enhancement workflow is illustrated in Figure 3, which outlines the sequential application of gamma correction, bilateral filtering, and CLAHE to improve the contrast and structural clarity of histopathological images. Initially, brightness and contrast were adjusted using gamma correction (γ = 1.2), which brought out fine details in the darkest zones. Subsequently, a bilateral filter was applied to decrease noise while maintaining tissue and tumor edges. Finally, Contrast Limited Adaptive Histogram Equalization (CLAHE), applied in the LAB color space, enhanced local contrast and emphasized fine structural details.
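The contrast-limiting step of CLAHE can be sketched for a single tile and a single channel. This is a simplified illustration under stated assumptions: the function name and the absolute `clip_limit` are hypothetical, the excess is redistributed uniformly, and the interpolation between neighboring tiles used by full CLAHE implementations is omitted. In the pipeline described above, such a mapping would be applied to the lightness channel of the LAB image.

```python
def clahe_tile_lut(tile, clip_limit=4, levels=256):
    """Build one tile's lookup table: clipped histogram -> CDF -> scaled intensity."""
    hist = [0] * levels
    for row in tile:
        for p in row:
            hist[p] += 1
    # Clip every bin and spread the clipped excess evenly over all bins,
    # which bounds the slope of the resulting CDF.
    excess = sum(max(0, count - clip_limit) for count in hist)
    hist = [min(count, clip_limit) + excess / levels for count in hist]
    total = sum(hist)
    lut, running = [], 0.0
    for count in hist:
        running += count
        lut.append(round(running / total * (levels - 1)))  # CDF_clip * (L - 1)
    return lut

# A nearly uniform tile: most pixels share the same intensity.
tile = [[100, 100, 100, 101], [100, 100, 102, 103]]
lut = clahe_tile_lut(tile, clip_limit=2)
equalized = [[lut[p] for p in row] for row in tile]
```

Lowering `clip_limit` flattens the mapping in nearly uniform tiles, which is the behavior that limits noise amplification.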
This series generated improved images of higher visual quality and structural integrity; thus, they were more suitable for the downstream classification processes. The enhanced images according to their classes are represented in Figure 4.
Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10 present the image quality assessment results for the original and enhanced images across the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets. For the CRC-VAL-HE-7K dataset (Table 5 and Table 6), the original images exhibit class-dependent variations in entropy and sharpness, with classes such as NORM, LYM, and STR showing higher entropy and Sharpness Index (SI) values, indicating richer structural information.
Table 5 presents the average image quality metrics for each class of the original images on the CRC-VAL-HE-7K dataset.
Table 6 presents the average image quality metrics for each class of enhanced images on the CRC-VAL-HE-7K dataset.
All original images maintain the maximum Peak Signal-to-Noise Ratio (PSNR) of 100 dB and Image Quality Index (IQI) of 1.0. After enhancement, entropy and SI increase consistently across all classes, confirming improved textural details, while PSNR and IQI slightly decrease as expected due to the introduction of new high-frequency information during enhancement.
A similar trend is observed in the NCT-CRC-HE-100K dataset (Table 7 and Table 8).
Table 7 presents the average image quality metrics for each class of the original images on the NCT-CRC-100K dataset.
Table 8 presents the average image quality metrics for each class of enhanced images on the NCT-CRC-100K dataset.
Once again, the original images show uniformly high PSNR and IQI values, whereas the enhanced images show higher entropy and SI values across all classes, indicating improved sharpness and contrast. Although the PSNR values drop to the range of 21–27 dB from the reference value of 100 dB, the IQI values remain high (0.95–0.99), which indicates that structural similarity is preserved.
The same trend is observed for the LC25000 dataset (Table 9 and Table 10). Original images have the reference PSNR and IQI values, whereas entropy and SI increase after enhancement, especially in colon_aca and colon_bnt, which indicates considerable texture amplification. The PSNR values of the enhanced images are within the anticipated range (16–21 dB), and the IQI values are always larger than 0.93, which indicates that the enhancement adds detail without harming structural integrity.
Table 9 presents the average image quality metrics for each class of the original images on the LC25000 dataset.
Table 10 presents the average image quality metrics for each class of enhanced images on the LC25000 dataset.
Overall, the quantitative results obtained on all three datasets indicate that the presented enhancement approach increases image entropy and sharpness while maintaining high structural quality, which improves visual and textural richness and suits the images to downstream deep learning tasks.
Effects of PSNR, IQI, and mixed-metric changes. Although the enhancement process introduces new high-frequency information, which decreases the PSNR and IQI values compared with the reference values, the decreases do not imply a loss of diagnostic utility. PSNR measures pixel-level fidelity, and IQI measures the structural similarity of the image on a global level, including luminance stability and preservation of contrast. In contrast, entropy and SI capture richness of information and edge clarity. Thus, an increase in entropy and SI together with a decrease in PSNR and IQI reflects the deliberate introduction of contrast and textural detail during enhancement. The IQI values of all the datasets are high (0.93–0.99), which indicates that the integrity of the tissue structure is not compromised.
From a deep-learning standpoint, these moderately decreased PSNR and IQI values do not adversely affect classification performance. Convolutional neural networks rely mostly on discriminative morphological features, e.g., the structure of the glands, the boundaries of the nuclei, and image texture, rather than on pixel-wise similarity to the source image. The enhanced images, which have greater entropy and SI, provide richer and more discriminative features, which improve feature separability and allow for more robust learning. Consequently, the enhancement pipeline improves the quality of the inputs used for training, which in turn supports better model performance.
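Two of the metrics discussed above, entropy and PSNR, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and capping PSNR at 100 dB for identical images is an assumed convention that is merely consistent with the 100 dB reported for the original images. SI and IQI are not implemented here because their exact definitions are not given in this section.

```python
import math

def shannon_entropy(img, levels=256):
    """Entropy in bits of the intensity histogram of an integer grayscale image."""
    hist = [0] * levels
    n = 0
    for row in img:
        for p in row:
            hist[p] += 1
            n += 1
    return -sum((c / n) * math.log2(c / n) for c in hist if c)

def psnr(ref, test, peak=255.0, cap=100.0):
    """PSNR in dB between two equally sized images; capped at `cap` (assumed convention)."""
    sq_err = [(a - b) ** 2 for ra, rb in zip(ref, test) for a, b in zip(ra, rb)]
    mse = sum(sq_err) / len(sq_err)
    if mse == 0:
        return cap  # identical images would otherwise give infinite PSNR
    return min(cap, 10.0 * math.log10(peak ** 2 / mse))

original = [[10, 10], [200, 200]]
enhanced = [[5, 15], [190, 215]]
entropy_gain = shannon_entropy(enhanced) - shannon_entropy(original)
fidelity = psnr(original, enhanced)
```

In this toy case the enhanced patch uses more distinct intensity levels, so its entropy rises while its PSNR against the original falls below the cap, mirroring the trade-off described in the text.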

3.3. Baseline Models for Comparison

Training deep learning models from scratch typically requires large datasets to avoid overfitting. Transfer learning allows for effective training with smaller, specialized datasets by refining pre-trained models, which boosts performance and reduces training time. To benchmark the proposed model, several advanced CNN architectures known for strong image recognition performance were implemented and fine-tuned. These models were fine-tuned on the enhanced histopathology datasets to compare their ability to classify CRC and lung cancer.
  • DenseNet121: Introduced by Huang et al. [6], this model utilizes dense skip connections to improve feature reuse, decrease parameters, and improve gradient flow.
  • MobileNetV2: Sandler et al. [23] introduced an architecture that employs inverted residuals and linear bottlenecks (depth-wise separable convolutions) to reduce computational cost, making it highly efficient.
  • NASNetMobile: Zoph et al. [24] used neural architecture search to identify an effective network structure optimized for high performance on mobile-sized models.
  • InceptionV3: Szegedy et al. [25] use factorized convolutions to enhance efficiency by reducing connections without sacrificing performance, and it is known for its multi-scale processing.
  • VGG16: Introduced by Simonyan and Zisserman [26], this 16-layer model is renowned for its simplicity and uniform 3 × 3 filter-based architecture.
  • Xception: Chollet [7] improved the Inception architecture by replacing Inception modules with depth-wise separable convolutions and adding residual connections, achieving high accuracy with fewer parameters.

3.4. Proposed Hybrid Architecture

The proposed model, illustrated in Figure 5, represents an innovative hybrid structure. It aims to recognize both local morphological characteristics and global contextual relationships.

3.4.1. Xception Backbone (Local Feature Extractor)

The 224 × 224 × 3 preprocessed images serve as input to the Xception Block (Figure 5a). The Xception architecture [7] was used as the primary local feature extractor. This backbone is built upon a series of Residual Blocks as detailed in Figure 6a. Each block consists of a ‘Main Path’ and a ‘Shortcut Path’.
  • Main Path: The input passes through a Separable 2D Convolution (Sep_Conv2D), Batch Normalization (BN), a GeLU activation, another Sep_Conv2D, and a final BN.
  • Shortcut Path: The original input is passed through a 1 × 1 Convolution (Conv2D 64 × 1) and a BN layer to match the channel dimensions of the main path.

3.4.2. CBAM Attention (Feature Refinement)

The high-level feature map from the Xception backbone is immediately passed to the CBAM Block (Figure 5b) for feature refinement. This module [10] sequentially infers and applies channel and spatial attention maps to recalibrate the feature map, amplifying salient information and suppressing irrelevant noise before it is passed to the Transformer.
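For reference, the sequential channel-then-spatial refinement described above is commonly written as follows in the original CBAM formulation. This is a sketch of the standard definition, not a statement of the exact configuration used in this paper: the reduction ratio of the shared MLP and the kernel size \(k\) of the spatial convolution are not given in this section and are left symbolic.

```latex
\begin{aligned}
M_c(F)  &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), & F'  &= M_c(F) \otimes F, \\
M_s(F') &= \sigma\big(f^{k \times k}([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')])\big),        & F'' &= M_s(F') \otimes F'.
\end{aligned}
```

Here \(F\) is the backbone feature map, \(M_c\) and \(M_s\) are the channel and spatial attention maps, \(\sigma\) is the sigmoid function, \(\otimes\) denotes element-wise multiplication with broadcasting, and \(F''\) is the refined map that would be passed on to the Transformer.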

3.4.3. Transformer Encoder (Global Context Modeling)

To model global context, the refined feature map from CBAM is “tokenized” (flattened into a 1D sequence of feature vectors) and combined with a learned positional embedding. This sequence is fed into the Transformer Block (Figure 5c). This block follows the standard encoder architecture, which is composed of two primary sub-layers:
  • Multi-Head Self-Attention (MHSA): As detailed in Figure 6b, the input sequence (X) is used to generate the Query (Q), Key (K), and Value (V) matrices through learned linear projections. The attention output is calculated using scaled dot-product attention: \( \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \). The outputs from multiple such “heads” are concatenated and passed through a final linear projection to produce the refined token sequence. This allows the model to weigh the importance of every feature token relative to every other token.
  • Feed-Forward Network (FFN): As shown in Figure 6c, the output of the MHSA sub-layer is passed through a position-wise FFN. This network consists of two Dense (fully connected) layers separated by a GeLU activation and Dropout layers for regularization.
Residual connections and layer normalization are applied around each of these two sub-layers (MHSA and FFN) as shown in Figure 5c.
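The scaled dot-product attention at the core of the MHSA sub-layer can be sketched for a single head in plain Python. This is a minimal illustration under stated assumptions: the learned Q, K, and V projections, the multi-head split, the residual connection, and layer normalization are all omitted, and the token values are made up for the example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for one head; Q, K, V are lists of token vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

# Three hypothetical feature tokens attending to each other (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
```

Every output token is a convex combination of all value vectors, which is how each spatial position can draw on information from every other position.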

3.4.4. Classification Head

The output sequence from the Transformer block is processed by the final Classification Head (Figure 5d). First, a Global Average Pooling 1D (GAP1D) layer is applied to condense the token sequence into a single, fixed-size feature vector. This vector is then passed through a small MLP, consisting of a Dense (FC) layer with a ReLU activation, followed by the final Dense (FC) layer with a Softmax activation to produce a probability distribution across the n output classes.
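The classification head described above (GAP1D, a ReLU Dense layer, and a Softmax Dense layer) can be sketched as a forward pass in plain Python. The weights, layer sizes, and function names below are hypothetical and chosen only so that the example is easy to follow; dropout is omitted because it is inactive at inference time.

```python
import math

def global_average_pool_1d(tokens):
    """Average a sequence of token vectors into one fixed-size feature vector."""
    n = len(tokens)
    return [sum(t[c] for t in tokens) / n for c in range(len(tokens[0]))]

def dense(x, weights, bias):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(tokens, w1, b1, w2, b2):
    pooled = global_average_pool_1d(tokens)
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]  # ReLU
    return softmax(dense(hidden, w2, b2))

# Two hypothetical tokens, a 2-unit hidden layer, and n = 3 output classes.
tokens = [[1.0, 2.0], [3.0, 4.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
probs = classify(tokens, w1, b1, w2, b2)
```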

3.4.5. Input Resizing and Dataset Compatibility

All input images, including those from LC25000, which have an original resolution of 768 × 768, are down-sampled to 224 × 224 using bilinear interpolation during preprocessing. This resizing ensures compatibility with the Xception backbone and a consistent input resolution across all datasets.
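Bilinear down-sampling of a single channel can be sketched as below. This is an illustrative assumption about how such resizing works in general, using pixel-center alignment; image libraries differ in alignment and anti-aliasing details, and the function name is hypothetical.

```python
def resize_bilinear(img, out_h, out_w):
    """Resize a 2D grayscale image with bilinear interpolation (pixel-center alignment)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        # Map the output pixel center back into input coordinates, clamped to the image.
        y = min(max((r + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = min(max((c + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

# A 4 x 4 horizontal ramp reduced to 2 x 2.
ramp = [[0, 1, 2, 3] for _ in range(4)]
small = resize_bilinear(ramp, 2, 2)
```

The same function would be called as `resize_bilinear(channel, 224, 224)` on each channel of a 768 × 768 image.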

3.5. Web-Based Deployment and Accessibility

To bridge the gap between research and clinical practice, the optimized hybrid model is deployed as a publicly available interactive web application on Hugging Face Spaces [27]. The platform demonstrates a live, remotely accessible diagnostic tool, as shown in Figure 7.
The web interface allows users to simply upload high-resolution patches of histopathology images; these images can be processed by the backend with the pre-trained Xception–CBAM–Transformer pipeline. The system provides probabilistic predictions, in real-time, of each class, thus providing an easy-to-use second-opinion validation tool, without the need to have local GPU infrastructure.

3.6. AI-Assisted Language Editing

AI-assisted tools (ChatGPT, GPT-5-mini by OpenAI and Gemini, model 1.5 Pro, by Google) were used solely for language editing and improving manuscript clarity. They did not contribute to the study design, data analysis, model development, or interpretation of results. All scientific decisions and conclusions were made by the authors.

4. Experimental Setup

This segment describes the technical details of the hardware utilized for training and the hyperparameter configurations employed across all models.

4.1. Hardware Configuration

All experiments were conducted on a workstation with the specifications detailed in Table 11. The deep learning models were created, trained, and evaluated using the TensorFlow and Keras libraries (version 2.10) in conjunction with Python 3.9 [28].

4.2. Hyperparameter Settings

The model was trained on 224 × 224 × 3 inputs for 100 epochs with a batch size of 16 using the Adam optimizer (learning rate \(1 \times 10^{-4}\)) and categorical cross-entropy loss. A dropout rate of 0.5 was applied before the SoftMax layer to reduce overfitting. A 20% validation split was used, and data augmentation included rescaling, shearing, zooming, rotating, width/height shifts, horizontal flips, and brightness modifications. The architecture consists of five residual blocks, each featuring SeparableConv2D and batch normalization, followed by global average pooling and a five-class SoftMax head. The best checkpoint was saved automatically based on validation performance. All hyperparameter settings used in the training process are summarized in Table 12.
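The settings stated above can be collected into a single configuration object, as in the sketch below. The dictionary keys and the helper function are hypothetical names; the values are taken from the text, and augmentation magnitudes are deliberately left out because they are not given in this section.

```python
TRAIN_CONFIG = {
    "input_shape": (224, 224, 3),
    "epochs": 100,
    "batch_size": 16,
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "loss": "categorical_crossentropy",
    "dropout_rate": 0.5,
    "validation_split": 0.2,
    # Augmentation types only; their ranges are not stated in the text.
    "augmentations": [
        "rescale", "shear", "zoom", "rotation",
        "width_shift", "height_shift", "horizontal_flip", "brightness",
    ],
}

def steps_per_epoch(num_train_images, batch_size):
    """Number of optimizer steps needed to see every training image once."""
    return -(-num_train_images // batch_size)  # ceiling division

# Assuming exactly 80% of the 100,000 NCT-CRC-HE-100K patches are used for training.
train_images = int(100_000 * (1 - TRAIN_CONFIG["validation_split"]))
steps = steps_per_epoch(train_images, TRAIN_CONFIG["batch_size"])
```

Under that assumption, one epoch on NCT-CRC-HE-100K corresponds to 5000 steps at a batch size of 16.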

4.3. Cross-Validation Strategy and Leakage Considerations

Five-fold cross-validation was employed to evaluate model performance. For datasets containing explicit patient identifiers, grouped cross-validation was applied to prevent patient-level data leakage. In contrast, the LC25000 dataset does not provide patient- or slide-level metadata and consists of augmented images derived from a limited number of original samples, making grouped splitting infeasible. Consequently, standard k-fold cross-validation was used for LC25000, and the reported results reflect image-level classification performance. This evaluation protocol is consistent with prior studies utilizing the LC25000 dataset.
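Grouped cross-validation, as described above, keeps all patches from one patient in the same fold. The sketch below shows one simple way to build such folds in plain Python; the greedy balancing strategy, the function name, and the patient identifiers are illustrative assumptions and not necessarily what the authors used.

```python
from collections import defaultdict

def grouped_k_fold(patient_ids, k=5):
    """Yield (train_idx, val_idx) pairs such that no patient appears on both sides."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    # Greedy balancing: place the largest patients first into the smallest fold.
    folds = [[] for _ in range(k)]
    for pid in sorted(by_patient, key=lambda p: -len(by_patient[p])):
        min(folds, key=len).extend(by_patient[pid])
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val

# Hypothetical patient identifier for each image patch.
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p6"]
splits = list(grouped_k_fold(ids, k=3))
```

For LC25000, where no patient or slide identifiers exist, such grouping is not possible, which is why the text falls back to standard k-fold splitting at the image level.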

5. Result Analysis and Discussion

This section offers an in-depth evaluation of the experimental findings. First, the evaluation metrics are defined; then, the detailed performance of the proposed model is presented; finally, the model is compared with the baseline models and existing state-of-the-art research.

5.1. Performance Metrics

Metrics of evaluation are crucial for assessing the performance of the model. Accuracy, Precision, Sensitivity, F1-score, AUC, Loss, Cohen’s Kappa, and Matthews Correlation Coefficient (MCC) were among the metrics utilized to evaluate the classification model. Here in this context, TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively [29,30,31].
$$\mathrm{Accuracy} = \frac{T_{pos} + T_{neg}}{T_{pos} + T_{neg} + F_{pos} + F_{neg}}$$
$$\mathrm{Precision} = \frac{T_{pos}}{T_{pos} + F_{pos}}$$
$$\mathrm{Recall} = \frac{T_{pos}}{T_{pos} + F_{neg}}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\text{Cohen's Kappa},\ \kappa = \frac{p_o - p_e}{1 - p_e}$$
$$\mathrm{MCC} = \frac{(T_{pos} \times T_{neg}) - (F_{pos} \times F_{neg})}{\sqrt{(T_{pos} + F_{pos})(T_{pos} + F_{neg})(T_{neg} + F_{pos})(T_{neg} + F_{neg})}}$$
where $p_o$ is the observed agreement (accuracy) and $p_e$ is the expected agreement.
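As a worked illustration of these definitions, the following sketch computes all six quantities from the four counts of a binary (one-vs-rest) confusion matrix. The function name `binary_metrics` and the counts are hypothetical and are not taken from the reported experiments. The expected agreement is computed from the marginal frequencies of the true and predicted labels, and non-degenerate counts (no zero denominators) are assumed.

```python
import math


def binary_metrics(t_pos, t_neg, f_pos, f_neg):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    total = t_pos + t_neg + f_pos + f_neg
    accuracy = (t_pos + t_neg) / total
    precision = t_pos / (t_pos + f_pos)
    recall = t_pos / (t_pos + f_neg)
    f1 = 2 * precision * recall / (precision + recall)

    # Expected agreement p_e: probability that prediction and truth agree by
    # chance, given the marginal frequencies of each label.
    p_e_pos = ((t_pos + f_pos) / total) * ((t_pos + f_neg) / total)
    p_e_neg = ((t_neg + f_neg) / total) * ((t_neg + f_pos) / total)
    p_e = p_e_pos + p_e_neg
    kappa = (accuracy - p_e) / (1 - p_e)  # observed agreement p_o is the accuracy

    mcc_denominator = math.sqrt(
        (t_pos + f_pos) * (t_pos + f_neg) * (t_neg + f_pos) * (t_neg + f_neg)
    )
    mcc = (t_pos * t_neg - f_pos * f_neg) / mcc_denominator

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(t_pos=90, t_neg=85, f_pos=15, f_neg=10)
```

With these counts, the accuracy is 175/200 = 0.875, the expected agreement is 0.5, and Cohen's Kappa is therefore (0.875 − 0.5)/(1 − 0.5) = 0.75.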

5.2. Performance Results

In this subsection, the quantitative and qualitative performance of the proposed hybrid model is analyzed.

5.2.1. Confusion Matrices

The class-wise performance of the proposed model is detailed in the normalized confusion matrices in Figure 8, Figure 9 and Figure 10.
On the CRC-VAL-HE-7K dataset (Figure 8), the model classifies most classes at or near 100% accuracy. For instance, ADI, BACK, DEB, LYM, and STR are all classified with >99.6% accuracy. The most difficult class, TUM (tumor), is still correctly identified 98.8% of the time, with only minor, clinically expected confusion with related tissues such as NORM (0.8%).
On the larger and more complex nine-class NCT-CRC-HE-100K dataset (Figure 9), the model remains robust. It achieves 100% for ADI and LYM. The TUM (tumor) class, which is notoriously difficult, is correctly classified 98.6% of the time, with slight confusion with the MUC, NORM, and STR classes, which is a common challenge in pathology.
On the five-class lung and colon dataset LC25000 (Figure 10), performance is nearly perfect. The colon_n, lung_n, and lung_scc classes are all classified with 100% accuracy. The adenocarcinoma classes (colon_aca, lung_aca) are classified with 99.9% accuracy, showing that the model distinguishes benign from malignant tissue as well as different types of carcinoma.
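The per-class percentages quoted above are the diagonal entries of a normalized confusion matrix. The sketch below shows one common normalization, dividing each row by the number of true samples of that class, so that the diagonal equals per-class recall. Whether the figures use exactly this normalization is an assumption, and the counts and the three-class subset are hypothetical.

```python
def normalize_rows(confusion):
    """Divide each row by its sum: entry [i][j] becomes the fraction of
    true-class-i samples that were predicted as class j."""
    normalized = []
    for row in confusion:
        row_total = sum(row)
        normalized.append(
            [count / row_total if row_total else 0.0 for count in row]
        )
    return normalized


# Hypothetical counts for three tissue classes (rows: true, columns: predicted).
labels = ["NORM", "STR", "TUM"]
counts = [
    [198, 1, 1],
    [0, 200, 0],
    [2, 1, 247],
]

normalized = normalize_rows(counts)
per_class_recall = {label: normalized[i][i] for i, label in enumerate(labels)}
```

In this toy example, 247 of the 250 TUM tiles lie on the diagonal, giving a per-class value of 0.988.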

5.2.2. Training and Validation

The training and validation accuracy and training and validation loss for the CRC-VAL-HE-7K dataset are shown in Figure 11 and Figure 12.
The training and validation accuracy and training and validation loss for the NCT-CRC-HE-100K dataset are shown in Figure 13 and Figure 14.
The training and validation accuracy and training and validation loss for the LC25000 dataset are shown in Figure 15 and Figure 16.

5.2.3. Feature Space and ROC Analysis

The discriminative power of the learned features is visualized in Figure 17, Figure 18 and Figure 19 using t-SNE. These plots show clear, well-separated clusters for the different tissue classes.
The proposed model is benchmarked against the baselines using the ROC curves in Figure 20, Figure 21 and Figure 22. These figures compare the Area Under the Curve (AUC) of the proposed model with that of all six baselines (InceptionV3, NASNetMobile, VGG16, Xception, DenseNet121, MobileNetV2) on the three datasets. The proposed model consistently achieves the highest AUC, approaching 1.0.
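For readers less familiar with the AUC, the following sketch computes it for a single one-vs-rest problem using its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical one-vs-rest scores for seven samples.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.30, 0.20, 0.10]
auc = roc_auc(labels, scores)
```

Here 11 of the 12 positive-negative pairs are ordered correctly, so the AUC is 11/12; a value approaching 1.0 means almost every pair is ordered correctly.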

5.2.4. Explainability Visualization

To improve the transparency of the proposed framework, explainability analyses were conducted using layer-wise feature visualization and Gradient-weighted Class Activation Mapping (Grad-CAM). These methods expose the hierarchical feature-learning process of the model and show how discriminative regions influence the final classification decisions. The qualitative results in Figure 23 show the layer-by-layer feature extraction of the proposed model.
In the initial convolutional stages, the network mostly captures low-level features such as edges, color variations, and basic texture patterns, so fine histopathological details are retained. Intermediate layers encode progressively more structural information, including glandular structure, tissue organization, and cell arrangement. Representations in the deeper layers are more abstract and class-specific, emphasizing pathological patterns such as tumor-infiltrated areas, abnormal glandular structures, and regions of high nuclear density. This hierarchy of abstraction indicates that the model learns meaningful representations rather than superficial image statistics.
To examine the decision-making mechanism further, Grad-CAM visualizations were generated for representative samples from the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets, as illustrated in Figure 24.
The Grad-CAM heatmaps highlight the spatial regions that contribute most to the predicted class. The model consistently focuses on pathologically salient structures, including tumor masses, aberrant glandular formations, and densely cellular regions, and assigns little weight to the background and staining artefacts. This behavior indicates that the model's predictions are based on clinically relevant histological features.
Furthermore, the Grad-CAM results are spatially consistent across datasets despite differences in staining conditions, image resolution, and data distribution. This consistency indicates strong generalization and stability of the proposed framework. Altogether, combining layer-wise feature visualization with Grad-CAM improves the interpretability and trustworthiness of the model and supports its use for clinical decision support in digital pathology.
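The core Grad-CAM computation can be sketched in a few lines. The sketch assumes that the feature maps of the chosen convolutional layer and the gradients of the class score with respect to those maps have already been obtained from a deep learning framework; here they are tiny hand-written arrays. The function name `grad_cam` and all values are hypothetical, and the final scaling to [0, 1] is a common visualization convention rather than part of the method's definition.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one H x W heatmap.

    feature_maps[k][i][j]: activation of channel k at position (i, j).
    gradients[k][i][j]:    gradient of the class score w.r.t. that activation.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])

    # Channel weight: global average of the gradient over spatial positions.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]

    # Weighted sum of the feature maps, followed by ReLU to keep only the
    # regions that contribute positively to the class score.
    cam = [
        [
            max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] for display (skipped if the map is all zeros).
    peak = max(max(row) for row in cam)
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


# Hypothetical 2-channel, 2 x 2 example.
feature_maps = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(feature_maps, gradients)
```

In this toy case, the second channel has a negative weight, so the positions where it is active are suppressed by the ReLU and only the positions supported by the first channel remain highlighted.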

5.3. Comparative Analysis

In this section, the performance of the proposed model is evaluated against pre-trained deep learning baselines and existing work reported in the literature, with per-class performance metrics provided.
Table 13 summarizes the comparative performance of the proposed model and the pretrained models on the CRC-VAL-HE-7K dataset. Detailed per-class performance metrics for each histopathological class in this dataset are provided in Table 14.
For the NCT-CRC-HE-100K dataset, overall and per-class performance are provided in Table 15 and Table 16, respectively.
For the LC25000 dataset, overall and per-class performance are provided in Table 17 and Table 18, respectively.
Step-by-step implementation details of the proposed model are shown in Table 19.
Comparisons with prior works are presented in Table 20, Table 21 and Table 22, demonstrating the proposed model’s superior performance across all datasets: Table 20 for NCT-CRC-HE-100K, Table 21 for CRC-VAL-HE-7K, and Table 22 for LC25000.

5.4. Patient-Level Five-Fold Cross-Validation Results

This subsection reports the results of the patient-level cross-validation, which are summarized in Table 23. All folds were evaluated using the same model architecture and hyperparameter settings.
To evaluate the robustness and generalization capability of the proposed framework while explicitly preventing patient-level data leakage, a five-fold patient-aware cross-validation strategy was employed using GroupKFold. Performance metrics were computed independently for each fold and subsequently aggregated and reported as mean ± standard deviation.
As shown in Table 23, the proposed model performs well in all five folds. Figure 25 shows the fold-wise evaluation results obtained with patient-level cross-validation.
The variance of all performance measures across folds is limited, which indicates that the model behaves consistently and generalizes well at the patient level.
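The aggregation of fold-wise metrics into mean ± standard deviation can be sketched as follows. The per-fold values are hypothetical placeholders, not the values of Table 23, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_folds(fold_scores):
    """Map each metric name to (mean, sample standard deviation) over folds."""
    return {name: (mean(values), stdev(values)) for name, values in fold_scores.items()}


# Hypothetical per-fold values for illustration only.
fold_scores = {
    "accuracy": [0.991, 0.993, 0.990, 0.994, 0.992],
    "f1": [0.990, 0.992, 0.989, 0.993, 0.991],
}

summary = summarize_folds(fold_scores)
for name, (mu, sigma) in summary.items():
    print(f"{name}: {mu:.4f} ± {sigma:.4f}")
```

A small standard deviation relative to the mean is what the text refers to as limited variance across folds.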

5.5. Image Enhancement Ablation Study

The primary aim of the preprocessing phase was to improve image quality and thereby classification performance. A detailed ablation study was carried out to determine the necessity and the individual contribution of the proposed image enhancement strategy. The impact of the following image enhancement configurations on classification performance was investigated:
  • No preprocessing, where raw images were directly used as model input.
  • Stain normalization only.
  • Spatially Adaptive NLM + Edge-Aware Sharpening.
  • Gamma correction only.
  • Gamma correction combined with bilateral filtering.
  • Complete enhancement pipeline consisting of gamma correction, bilateral filtering, and CLAHE.
To ensure a fair and controlled comparison, the same training–validation splits, model architecture, optimization strategy, and hyperparameter settings were maintained for all configurations. This ablation study quantifies the individual and combined contributions of contrast enhancement, noise suppression, and stain normalization, thereby justifying the inclusion of the proposed enhancement pipeline within the overall framework.
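To make the first stage of the enhancement pipeline concrete, the sketch below applies gamma correction to an 8-bit grayscale image through a lookup table. It covers only the gamma step; bilateral filtering and CLAHE are normally taken from an image-processing library and are not reproduced here. The convention output = 255 · (v/255)^γ, the value γ = 0.5, and the function names are assumptions made for illustration, not the settings used in the study.

```python
def gamma_lut(gamma):
    """Build a 256-entry table mapping intensity v to 255 * (v / 255) ** gamma."""
    return [round(255 * (v / 255) ** gamma) for v in range(256)]


def apply_gamma(image, gamma):
    """Apply gamma correction to an 8-bit grayscale image (list of rows)."""
    lut = gamma_lut(gamma)
    return [[lut[pixel] for pixel in row] for row in image]


# Hypothetical 2 x 3 grayscale patch; gamma < 1 brightens mid-tones under
# the convention assumed above.
image = [[0, 64, 128], [192, 255, 32]]
brighter = apply_gamma(image, gamma=0.5)
```

Because the mapping is monotonic and fixes 0 and 255, the ordering of intensities is preserved while darker regions are expanded, which is the effect that the later contrast-enhancement stages build on.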
A quantitative comparison of the preprocessing strategies is summarized in Table 24, and a visual comparison of the corresponding performance metrics is shown in Figure 26.
As shown in Table 24 and Figure 26, most single-step preprocessing strategies provide limited or inconsistent improvements over using raw images.
Stain normalization and spatially adaptive non-local means (NLM) with edge-aware sharpening yield only marginal gains, indicating that color normalization or smoothing alone is insufficient for robust histopathological discrimination. Gamma correction results in moderate performance improvements, which are further enhanced by the addition of bilateral filtering due to improved noise suppression while preserving tissue boundaries.
In contrast, the complete preprocessing pipeline combining gamma correction, bilateral filtering, and contrast-limited adaptive histogram equalization (CLAHE) consistently achieves the best performance across all evaluation metrics, as clearly depicted in Figure 26. These findings confirm that the proposed preprocessing pipeline plays a critical role in enhancing diagnostically relevant features and improving the robustness of the proposed hybrid deep learning framework.

6. Discussion

6.1. Performance Analysis and Comparison with Existing Methods

Although deep-learning models have been widely applied to histopathological cancer detection, the proposed architecture outperforms current state-of-the-art models on several datasets and evaluation metrics. This is mainly due to the synergistic combination of convolutional neural networks, attention mechanisms, and Transformer-based global context modeling.
The Xception backbone captures multi-scale spatial features while retaining computational efficiency through depth-wise separable convolutions. In contrast to traditional CNN-based methods, the Convolutional Block Attention Module (CBAM) enables the network to focus on diagnostically relevant regions along both the spatial and channel dimensions, which reduces the influence of background noise and staining artefacts. In addition, the Transformer block encodes long-range tissue dependencies, which is essential for histopathology images, where cancer-related patterns are often distributed in space rather than confined to localized areas.
Compared with standalone CNNs and existing hybrid models, the proposed model is more robust to inter-dataset variation, including differences in staining procedures, magnification factors, and tissue morphology. These properties explain its higher accuracy, precision, recall, and F1-score on the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets.

6.2. Design Insights for Cancer Detection Models

This study yields several conclusions about the design of effective models for histopathological cancer diagnosis. First, combining local feature extraction through convolutional neural networks (CNNs) with global context modeling through Transformer networks substantially improves the discrimination of complex tissue patterns. Second, attention mechanisms such as the Convolutional Block Attention Module (CBAM) can improve classification accuracy and add interpretability by directing the network to clinically meaningful parts of the image. Third, including explainability methods in model evaluation builds trust in the predictions and facilitates clinical validation.
These results suggest that future cancer-detection systems should prioritize hybrid designs that balance performance, interpretability, and generalization, rather than simply deeper or wider CNN models.

6.3. Advantages and Limitations of the Proposed Model

The proposed framework has several significant benefits. It outperforms the compared methods in classification on several benchmark datasets and generalizes well. The attention-enhanced architecture improves both accuracy and interpretability, and the modular design allows the architecture to be adapted to other histopathology and medical imaging tasks.
Nevertheless, the model also has limitations. The attention and Transformer modules require more computational power and training time than lightweight CNN models. In addition, Grad-CAM provides only coarse spatial explanations and may not capture finer-grained, cellular-level reasoning. Finally, the analysis was performed on curated public datasets, which may not fully reflect the clinical variability encountered in practice.
Future work could improve performance and applicability through model compression, knowledge distillation, self-supervised learning, and the integration of multimodal data. Improving explainability with concept-based or multi-level interpretation methods is another promising research avenue.

6.4. Clinical Implications and Application Scenarios

Clinically and methodologically, the proposed framework shows promise as a computer-aided diagnostic support tool for the analysis of histopathological cancer images. The high and consistent results on colorectal and lung cancer data indicate that the model reliably detects diagnostically significant tissue patterns that are commonly used in standard pathological evaluation.
In clinical workflows, the model could serve as a second-reader system that helps pathologists identify regions of interest for further study, thereby reducing diagnostic variability in high-volume settings. For clinical interpretability, Grad-CAM visualizations provide clear visual explanations that can be related to known histopathological characteristics.
The proposed approach is not intended to replace expert clinical judgment, but it may improve diagnostic consistency and workflow efficiency. Future research should focus on prospective clinical validation and integration with digital pathology systems to enable real-world application.

7. Conclusions

This research introduces an innovative, multi-phase hybrid deep learning framework aimed at the automated classification of colorectal and lung cancers from histopathology images. The approach addresses major limitations of traditional models by utilizing a validated image-enhancement technique, the local feature-extraction strengths of Xception, the feature-refinement functions of the CBAM attention mechanism, and the global context modeling provided by a Transformer block.
The proposed architecture achieved state-of-the-art results, reaching accuracy levels of 99.58%, 99.29%, and 99.98% on the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets, respectively. The performance was validated both quantitatively and qualitatively using confusion matrices, ROC curves, and t-SNE visualizations, which showed strong discriminative ability. Ablation studies confirmed the combined benefit of the individual components, and explainability analyses with Grad-CAM and layer-wise feature maps supported the transparency and reliability of the model.
The trained model was deployed as an accessible web tool to facilitate practical use. In summary, this study presents an accurate, interpretable, and comprehensive model intended to assist pathologists in the important task of cancer diagnosis.

Author Contributions

Conceptualization: S.S., M.S.H., and M.B.A.M.; data curation, formal analysis, investigation, methodology: S.S., M.S.H., M.B.A.M., M.F.A.M., and Z.M.; funding acquisition, project administration: Z.M., M.B.A.M., and M.F.A.M.; resources, software: S.S., M.S.H., and M.B.A.M.; validation, visualization: M.B.A.M., A.A., M.F.A.M., and Z.M.; writing—original draft: S.S., M.S.H., and M.B.A.M.; writing—review and editing: M.B.A.M., A.A., Z.M., and M.F.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are publicly available. The CRC-VAL-HE-7K and NCT-CRC-HE-100K datasets are accessible from the publicly released collection by Kather et al. [13]. The LC25000 dataset is available from the publicly released “Lung and Colon Cancer” dataset by Borkowski et al. [14]. All datasets can be obtained from their respective open-access sources without restrictions.

Acknowledgments

The authors acknowledge the use of ChatGPT (OpenAI) and Gemini (Google) for language refinement and editorial assistance during manuscript preparation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
XAI: Explainability (Explainable Artificial Intelligence)
CRC: Colorectal Cancer
SOTA: State-of-the-art
IQA: Image Quality Assessment
PSNR: Peak Signal-to-Noise Ratio
t-SNE: t-distributed Stochastic Neighbor Embedding
H&E: Hematoxylin and Eosin
CAD: Computer-Aided Diagnosis
CNN: Convolutional Neural Network
GELU: Gaussian Error Linear Unit

References

  1. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA A Cancer J. Clin. 2021, 71, 209–249.
  2. Bera, K.; Schalper, K.A.; Rimm, D.L.; Velcheti, V.; Madabhushi, A. Artificial intelligence in digital pathology—New tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 2019, 16, 703–715.
  3. Elmore, J.G.; Longton, G.M.; Carney, P.A.; Geller, B.M.; Onega, T.; Tosteson, A.N.; Nelson, H.D.; Pepe, M.S.; Allison, K.H.; Schnitt, S.J.; et al. Diagnostic Concordance Among Pathologists Interpreting Breast Biopsy Specimens. JAMA 2015, 313, 1122.
  4. Komura, D.; Ishikawa, S. Machine Learning Methods for Histopathological Image Analysis. Comput. Struct. Biotechnol. J. 2018, 16, 34–42.
  5. Targ, S.; Almeida, D.; Lyman, K. Resnet in Resnet: Generalizing Residual Architectures. March 2016. Available online: https://arxiv.org/pdf/1603.08029 (accessed on 13 January 2026).
  6. Zhu, Y.; Newsam, S. DenseNet for dense flow. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; Volume 2017, pp. 790–794.
  7. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
  8. Colling, R.; Pitman, H.; Oien, K.; Rajpoot, N.; Macklin, P.; CM-Path AI in Histopathology Working Group; Snead, D.; Sackville, T.; Verrill, C. Artificial intelligence in digital pathology: A roadmap to routine use in clinical practice. J. Pathol. 2019, 249, 143–150.
  9. Vahadane, A.; Peng, T.; Sethi, A.; Albarqouni, S.; Wang, L.; Baust, M.; Steiger, K.; Schlitter, A.M.; Esposito, I.; Navab, N. Structure-Preserving Color Normalization and Sparse Stain Separation for Histological Images. IEEE Trans. Med. Imaging 2016, 35, 1962–1971.
  10. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
  11. Zhou, T.; Niu, Y.; Lu, H.; Peng, C.; Guo, Y.; Zhou, H. Vision transformer: To discover the ‘four secrets’ of image patches. Inf. Fusion 2024, 105, 102248.
  12. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
  13. Kather, J.N.; Krisam, J.; Charoentong, P.; Luedde, T.; Herpel, E.; Weis, C.A.; Gaiser, T.; Marx, A.; Valous, N.A.; Ferber, D.; et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS Med. 2019, 16, e1002730.
  14. Borkowski, A.A.; Bui, M.M.; Thomas, L.B.; Wilson, C.P.; DeLand, L.A.; Mastorides, S.M. Lung and Colon Cancer Histopathological Image Dataset (LC25000). December 2019. Available online: https://arxiv.org/pdf/1912.12142 (accessed on 13 January 2026).
  15. Chen, R.J.; Lu, M.Y.; Wang, J.; Williamson, D.F.; Rodig, S.J.; Lindeman, N.I.; Mahmood, F. Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis. IEEE Trans. Med. Imaging 2022, 41, 757–770.
  16. Ke, Q.; Yap, W.S.; Tee, Y.K.; Hum, Y.C.; Zheng, H.; Gan, Y.J. Advanced deep learning for multi-class colorectal cancer histopathology: Integrating transfer learning and ensemble methods. Quant. Imaging Med. Surg. 2025, 15, 2329–2346.
  17. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24.
  18. Miah, M.B.A.; Yousuf, M.A. Detection of lung cancer from CT image using image processing and neural network. In Proceedings of the 2nd International Conference on Electrical Engineering and Information and Communication Technology (iCEEiCT), Savar, Bangladesh, 21–23 May 2015.
  19. Xu, H.; Xu, Q.; Cong, F.; Kang, J.; Han, C.; Liu, Z.; Madabhushi, A.; Lu, C. Vision Transformers for Computational Histopathology. IEEE Rev. Biomed. Eng. 2024, 17, 63–79.
  20. Zeynali, A.; Tinati, M.A.; Tazehkand, B.M. Hybrid CNN-Transformer Architecture With Xception-Based Feature Enhancement for Accurate Breast Cancer Classification. IEEE Access 2024, 12, 189477–189493.
  21. Dunn, C.; Brettle, D.; Hodgson, C.; Hughes, R.; Treanor, D. An international study of stain variability in histopathology using qualitative and quantitative analysis. J. Pathol. Inform. 2025, 17, 100423.
  22. Kainz, B.; Heinrich, M.P.; Makropoulos, A.; Oppenheimer, J.; Mandegaran, R.; Sankar, S.; Deane, C.; Mischkewitz, S.; Al-Noor, F.; Rawdin, A.C.; et al. Non-invasive diagnosis of deep vein thrombosis from ultrasound imaging with machine learning. NPJ Digit. Med. 2021, 4, 137.
  23. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018.
  24. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
  25. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
  26. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; Available online: https://arxiv.org/pdf/1409.1556 (accessed on 13 January 2026).
  27. Ai Colon Cancer Predictions—A Hugging Face Space by Safayat12. Available online: https://huggingface.co/spaces/safayat12/Ai_Colon_Cancer_predictions (accessed on 13 January 2026).
  28. Hossain, M.S.; Juthy, M.J.N.; Miah, M.B.A.; Awang, S.; Hossain, M.N.; Bhuiyan, E. AlzCNN: A Custom CNN Architecture for Alzheimer’s Stage Detection from MRI Images. In Proceedings of the 2025 IEEE 9th International Conference on Software Engineering & Computer Systems (ICSECS), Pekan, Pahang, Malaysia, 15–16 October 2025; pp. 359–364.
  29. Hossain, M.N.; Bhuiyan, E.; Miah, M.B.A.; Sifat, T.A.; Muhammad, Z.; Masud, M.F.A. Detection and Classification of Kidney Disease from CT Images: An Automated Deep Learning Approach. Technologies 2025, 13, 508.
  30. Miah, M.B.A.; Awang, S.; Rahman, M.M.; Hosen, A.S.M.S.; Ra, I.H. Keyphrases Frequency Analysis From Research Articles: A Region-Based Unsupervised Novel Approach. IEEE Access 2022, 10, 120838–120849.
  31. Rahman, M.A.; Miah, M.B.A.; Hossain, M.A.; Hosen, A.S. Enhanced Brain Tumor Classification Using MobileNetV2: A Comprehensive Preprocessing and Fine-Tuning Approach. BioMedInformatics 2025, 5, 30.
  32. Kumar, A.; Vishwakarma, A.; Bajaj, V. CRCCN-Net: Automated framework for classification of colorectal tissue using histopathological images. Biomed. Signal Process. Control. 2023, 79, 104172.
  33. Qin, Z.; Sun, W.; Guo, T.; Lu, G. Colorectal cancer image recognition algorithm based on improved transformer. Discov. Appl. Sci. 2024, 6, 422.
  34. Martínez-Fernandez, E.; Rojas-Valenzuela, I.; Valenzuela, O.; Rojas, I. Computer Aided Classifier of Colorectal Cancer on Histopatological Whole Slide Images Analyzing Deep Learning Architecture Parameters. Appl. Sci. 2023, 13, 4594.
  35. Ghosh, S.; Bandyopadhyay, A.; Sahay, S.; Ghosh, R.; Kundu, I.; Santosh, K.C. Colorectal Histology Tumor Detection Using Ensemble Deep Neural Network. Eng. Appl. Artif. Intell. 2021, 100, 104202.
  36. Jiang, L.; Huang, S.; Luo, C.; Zhang, J.; Chen, W.; Liu, Z. An improved multi-scale gradient generative adversarial network for enhancing classification of colorectal cancer histological images. Front. Oncol. 2023, 13, 1240645.
  37. Hajsalem, I.D.; Ayed, Y.B. Detecting early gastrointestinal polyps in histology and endoscopy images using deep learning. Front. Artif. Intell. 2025, 8, 1571075.
  38. Anju, T.E.; Vimala, S. Finetuned-VGG16 CNN Model for Tissue Classification of Colorectal Cancer. Lect. Notes Netw. Syst. 2023, 665, 73–84.
  39. Azar, A.T.; Tounsi, M.; Fati, S.M.; Javed, Y.; Amin, S.U.; Khan, Z.I.; Alsenan, S.; Ganesan, J. Automated System for Colon Cancer Detection and Segmentation Based on Deep Learning Techniques. Int. J. Sociotechnology Knowl. Dev. 2023, 15, 1–28.
  40. El-Ghany, S.A.; Azad, M.; Elmogy, M. Robustness Fine-Tuning Deep Learning Model for Cancers Diagnosis Based on Histopathology Image Analysis. Diagnostics 2023, 13, 699.
  41. Hasan, M.A.; Haque, F.; Sabuj, S.R.; Sarker, H.; Goni, M.O.F.; Rahman, F.; Rashid, M.M. An End-to-End Lightweight Multi-Scale CNN for the Classification of Lung and Colon Cancer with XAI Integration. Technologies 2024, 12, 56.
  42. Masud, M.; Sikder, N.; Nahid, A.A.; Bairagi, A.K.; AlZain, M.A. A Machine Learning Approach to Diagnosing Lung and Colon Cancer Using a Deep Learning-Based Classification Framework. Sensors 2021, 21, 748.
  43. Provath, M.A.M.; Deb, K.; Dhar, P.K.; Shimamura, T. Classification of Lung and Colon Cancer Histopathological Images Using Global Context Attention Based Convolutional Neural Network. IEEE Access 2023, 11, 110164–110183.
  44. Alotaibi, M.; Alshardan, A.; Maashi, M.; Asiri, M.M.; Alotaibi, S.R.; Yafoz, A.; Alsini, R.; Khadidos, A.O. Exploiting histopathological imaging for early detection of lung and colon cancer via ensemble deep learning model. Sci. Rep. 2024, 14, 20434.
  45. Said, M.M.R.; Islam, M.S.B.; Sumon, M.S.I.; Vranic, S.; Al Saady, R.M.; Alqahtani, A.; Chowdhury, M.E.H.; Pedersen, S. Innovative Deep Learning Architecture for the Classification of Lung and Colon Cancer From Histopathology Images. Appl. Comput. Intell. Soft Comput. 2024, 2024, 5562890.
  46. Uddin, A.H.; Chen, Y.L.; Akter, M.R.; Ku, C.S.; Yang, J.; Por, L.Y. Colon and lung cancer classification from multi-modal images using resilient and efficient neural network architectures. Heliyon 2024, 10, e30625.
  47. Vanitha, K.; R, M.T.; Sree, S.S.; Guluwadi, S. Deep learning ensemble approach with explainable AI for lung and colon cancer classification using advanced hyperparameter tuning. BMC Med. Inform. Decis. Mak. 2024, 24, 222.
  48. El-Aziz, A.A.A.; Mahmood, M.A.; El-Ghany, S.A. Advanced Deep Learning Fusion Model for Early Multi-Classification of Lung and Colon Cancer Using Histopathological Images. Diagnostics 2024, 14, 2274.
Figure 1. System architecture of the proposed hybrid classification model, including the preprocessing, feature extraction, and classification stages.
Figure 2. Representative Sample Images from Each Class in the (a) CRC-VAL-HE-7K, (b) NCT-CRC-HE-100K, and (c) LC25000 datasets.
Figure 3. Image enhancement process applied to histopathological samples, including gamma correction, bilateral filtering, and CLAHE to improve contrast and tissue structure visibility.
Figure 4. Original versus enhanced histopathological images from the (a) CRC-VAL-HE-7K, (b) NCT-CRC-HE-100K, and (c) LC25000 datasets, demonstrating the visual improvements introduced by the enhancement pipeline.
Figure 5. Overall architecture of the proposed hybrid model: (a) Xception backbone for hierarchical feature extraction, (b) Convolutional Block Attention Module (CBAM) for spatial–channel attention refinement, (c) Transformer block for capturing long-range spatial dependencies, and (d) Classification head for final prediction across all target classes.
Figure 6. Proposed model architecture with attention: (a) Residual block, (b) Multi-head self-attention, (c) Feed-forward network (FFN).
Figure 7. The web-based Graphical User Interface (GUI) deployed for model inference and accessibility.
Figure 8. Confusion matrix of the proposed model on the CRC-VAL-HE-7K dataset.
Figure 9. Confusion matrix of the proposed model on the NCT-CRC-HE-100K dataset.
Figure 10. Confusion matrix of the proposed model on the LC25000 dataset.
Figure 11. The proposed model’s training and validation accuracy for CRC-VAL-HE-7K.
Figure 12. The proposed model’s training and validation loss for CRC-VAL-HE-7K.
Figure 13. The proposed model’s training and validation accuracy for NCT-CRC-HE-100K.
Figure 14. The proposed model’s training and validation loss for NCT-CRC-HE-100K.
Figure 15. The proposed model’s training and validation accuracy for LC25000.
Figure 16. The proposed model’s training and validation loss for LC25000.
Figure 17. t-SNE visualization of deep feature representations (CRC-VAL-HE-7K).
Figure 18. t-SNE visualization of deep feature representations (NCT-CRC-HE-100K).
Figure 19. t-SNE visualization of deep feature representations (LC25000).
Figure 20. ROC curve for the proposed model (CRC-VAL-HE-7K).
Figure 21. ROC curve for the proposed model (NCT-CRC-HE-100K).
Figure 22. ROC curve for the proposed model (LC25000).
Figure 23. The proposed model’s layer-wise feature extraction process on the input images.
Figure 24. The proposed model’s Grad-CAM visualization across three different datasets: (a) CRC-VAL-HE-7K; (b) NCT-CRC-HE-100K; and (c) LC25000.
Figure 25. Fold-wise performance metrics of the proposed model on the LC25000 dataset.
Figure 26. Ablation analysis of different preprocessing strategies on the CRC-VAL-HE-7K dataset.
Table 1. Comparison of existing histopathological image classification approaches, highlighting their advantages and limitations.
| Model Category | Advantages | Limitations |
|---|---|---|
| Standard CNNs [16,17,18] | Excellent at extracting hierarchical local features (cell, nucleus morphology). Computationally efficient. | Limited effective receptive field fails to capture global tissue context. Highly sensitive to stain and color variations. |
| Pure Transformers [19] | Excellent at modeling global context and long-range spatial relationships. | Requires massive training datasets. May lose fine-grained local texture details captured by CNNs. |
| Standard Hybrid Models [15,20] | Combines CNN local feature power with Transformer global context (represents current SOTA). | It can be overly complex. Often lacks a validated preprocessing stage and may not use fine-grained, low-level attention mechanisms (CBAM/SE). |
| Stain Invariance Networks [21] | Explicitly minimizes the variability caused by inconsistent H&E staining, improving generalization across different scanning centers. | Primary focus on color consistency may neglect morphological feature enhancement. Still reliant on local feature extractors. |
| Multiple Instance Learning (MIL) [22] | Handles extremely large Whole-Slide Images (WSIs) by aggregating information from numerous small patches. Often includes a form of patch-level attention. | Aggregation layer may lose crucial spatial relationships between patches. Computationally intensive due to sequential tile processing. |
| Proposed Model (Enhancement + Xception–CBAM–Transformer) | Designed to solve all listed limitations: (1) Preprocessing handles stain variance. (2) Xception captures local detail. (3) CBAM refines feature focus. (4) Transformer models global context. | - |
Table 2. Class distribution of the CRC-VAL-HE-7K dataset following an 80–20% train–validation split.
| Class | Total | Training (80%) | Validation (20%) |
|---|---|---|---|
| ADI | 1338 | 1070 | 268 |
| BACK | 847 | 678 | 169 |
| DEB | 339 | 271 | 68 |
| LYM | 634 | 507 | 127 |
| MUC | 1035 | 828 | 207 |
| MUS | 592 | 474 | 118 |
| NORM | 741 | 593 | 148 |
| STR | 421 | 337 | 84 |
| TUM | 1233 | 986 | 247 |
| Total | 7280 | 5744 | 1536 |
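The per-class rows of Table 2 are consistent with rounding 80% of each class total to the nearest integer and assigning the remainder to validation. The sketch below reproduces that arithmetic under this assumed rounding rule; the function name `split_counts` is illustrative and does not come from the paper.

```python
def split_counts(total, train_fraction=0.8):
    """Return (train, validation) sizes for one class.

    Assumption: the training share is rounded to the nearest integer and
    the validation set receives whatever is left.
    """
    train = round(total * train_fraction)
    return train, total - train


# Class totals copied from Table 2 (CRC-VAL-HE-7K).
class_totals = {
    "ADI": 1338, "BACK": 847, "DEB": 339, "LYM": 634, "MUC": 1035,
    "MUS": 592, "NORM": 741, "STR": 421, "TUM": 1233,
}

for name, total in class_totals.items():
    train, val = split_counts(total)
    print(f"{name}: train={train}, validation={val}")
```

Splitting each class separately in this way keeps the class proportions of the training and validation sets identical to those of the full dataset.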
Table 3. Class distribution of the NCT-CRC-HE-100K dataset following an 80–20% train–validation split.
| Class | Total | Training (80%) | Validation (20%) |
|---|---|---|---|
| ADI | 15,020 | 12,016 | 3004 |
| BACK | 10,566 | 8453 | 2113 |
| DEB | 11,512 | 9210 | 2302 |
| LYM | 11,557 | 9246 | 2311 |
| MUC | 8896 | 7117 | 1770 |
| MUS | 13,536 | 10,829 | 2707 |
| NORM | 8763 | 7010 | 1753 |
| STR | 10,446 | 8357 | 2089 |
| TUM | 14,317 | 11,454 | 2863 |
| Total | 104,113 | 83,192 | 20,921 |
Table 4. Class distribution of the LC25000 dataset following an 80–20% train–validation split.
| Class | Total | Training (80%) | Validation (20%) |
|---|---|---|---|
| colon_aca | 5000 | 4000 | 1000 |
| colon_bnt | 5000 | 4000 | 1000 |
| lung_aca | 5000 | 4000 | 1000 |
| lung_scc | 5000 | 4000 | 1000 |
| lung_bnt | 5000 | 4000 | 1000 |
| Total | 25,000 | 20,000 | 5000 |
Table 5. Average image quality metrics of original images in the CRC-VAL-HE-7K dataset.
| Class | Entropy | SI | PSNR (dB) | IQI |
|---|---|---|---|---|
| ADI | 5.3087 | 13.5822 | 100.00 | 1.0 |
| BACK | 3.9238 | 4.9472 | 100.00 | 1.0 |
| DEB | 6.8641 | 18.8359 | 100.00 | 1.0 |
| LYM | 7.3665 | 24.9039 | 100.00 | 1.0 |
| MUC | 6.9295 | 20.7820 | 100.00 | 1.0 |
| MUS | 6.8906 | 17.7147 | 100.00 | 1.0 |
| NORM | 7.4485 | 17.6426 | 100.00 | 1.0 |
| STR | 7.0325 | 19.4184 | 100.00 | 1.0 |
| TUM | 7.1196 | 15.6653 | 100.00 | 1.0 |
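The Entropy column stays below 8, which is what one would expect if it is the Shannon entropy (in bits) of an 8-bit intensity histogram. That reading is an assumption, since the exact definition is not restated here. A minimal sketch of that quantity, using a hypothetical helper `shannon_entropy` on flat lists of pixel values:

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (bits) of a flat sequence of integer intensities."""
    counts = Counter(pixels)
    n = len(pixels)
    # Sum p * log2(1 / p) over the intensity levels that actually occur.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Toy examples: a constant patch carries no information, while a patch that
# uses all 256 grey levels equally often reaches the 8-bit maximum.
flat_patch = [128] * 64
uniform_patch = list(range(256))
print(shannon_entropy(flat_patch))     # 0.0
print(shannon_entropy(uniform_patch))  # 8.0
```

Under this definition, a higher value after enhancement means the intensity histogram is spread over more grey levels, which matches the direction of change between the original and enhanced tables.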
Table 6. Average image quality metrics of enhanced images in the CRC-VAL-HE-7K dataset.
| Class | Entropy | SI | PSNR (dB) | IQI |
|---|---|---|---|---|
| ADI | 5.9625 | 14.0326 | 26.9067 | 0.9943 |
| BACK | 5.1629 | 4.6107 | 18.8282 | 0.9316 |
| DEB | 7.4049 | 21.5364 | 21.9243 | 0.9744 |
| LYM | 7.7233 | 29.3981 | 21.9710 | 0.9750 |
| MUC | 7.3384 | 20.7820 | 23.4738 | 0.9814 |
| MUS | 7.3787 | 19.8815 | 22.2676 | 0.9813 |
| NORM | 7.7302 | 20.4912 | 21.9938 | 0.9749 |
| STR | 7.4978 | 22.5964 | 21.1507 | 0.9762 |
| TUM | 7.5359 | 17.6582 | 21.7172 | 0.9762 |
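PSNR compares an image against a reference, so an image compared with itself has zero mean squared error and an unbounded PSNR. The 100.00 dB entries reported for original images in Tables 5, 7, and 9 would be consistent with capping that case at 100 dB. The sketch below makes this cap an explicit assumption; the function `psnr` and the sample intensities are illustrative only.

```python
import math


def psnr(reference, test, max_value=255.0, cap=100.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Assumption: identical inputs (MSE = 0) return `cap` instead of infinity,
    which would explain the 100.00 dB entries reported for original images.
    """
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return cap
    return min(cap, 10.0 * math.log10(max_value ** 2 / mse))


original = [10, 50, 90, 130, 170, 210]
enhanced = [14, 58, 96, 140, 176, 214]  # hypothetical enhanced intensities
print(psnr(original, original))  # 100.0
print(round(psnr(original, enhanced), 2))
```

Because enhancement deliberately changes pixel values, a PSNR in the low twenties for enhanced images reflects the size of the modification rather than a loss of quality on its own.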
Table 7. Average image quality metrics of original images in the NCT-CRC-HE-100K dataset.
| Class | Entropy | SI | PSNR (dB) | IQI |
|---|---|---|---|---|
| ADI | 5.1162 | 16.2056 | 100.00 | 1.0 |
| BACK | 3.7543 | 4.9802 | 100.00 | 1.0 |
| DEB | 6.7772 | 19.1007 | 100.00 | 1.0 |
| LYM | 7.3677 | 21.6227 | 100.00 | 1.0 |
| MUC | 7.0829 | 17.7169 | 100.00 | 1.0 |
| MUS | 6.8061 | 18.9369 | 100.00 | 1.0 |
| NORM | 7.3585 | 19.1077 | 100.00 | 1.0 |
| STR | 6.9971 | 20.0641 | 100.00 | 1.0 |
| TUM | 7.1653 | 20.4176 | 100.00 | 1.0 |
Table 8. Average image quality metrics of enhanced images in the NCT-CRC-HE-100K dataset.
| Class | Entropy | SI | PSNR (dB) | IQI |
|---|---|---|---|---|
| ADI | 5.7392 | 16.4826 | 26.8352 | 0.9960 |
| BACK | 5.0444 | 4.6428 | 22.3114 | 0.9522 |
| DEB | 7.2333 | 21.5301 | 22.8247 | 0.9813 |
| LYM | 7.7413 | 27.3297 | 22.0819 | 0.9744 |
| MUC | 7.4602 | 19.4021 | 22.5264 | 0.9827 |
| MUS | 7.2743 | 21.3647 | 22.6810 | 0.9833 |
| NORM | 7.6924 | 22.1051 | 22.0992 | 0.9773 |
| STR | 7.4583 | 23.2537 | 22.1608 | 0.9812 |
| TUM | 7.5758 | 23.6460 | 21.9519 | 0.9793 |
Table 9. Average image quality metrics of original images in the LC25000 dataset.
| Class | Entropy | SI | PSNR (dB) | IQI |
|---|---|---|---|---|
| colon_aca | 7.0919 | 28.2666 | 100.00 | 1.0 |
| colon_bnt | 7.1505 | 28.0476 | 100.00 | 1.0 |
| lung_aca | 7.0451 | 14.2314 | 100.00 | 1.0 |
| lung_bnt | 6.6160 | 13.7855 | 100.00 | 1.0 |
| lung_scc | 6.7575 | 13.9977 | 100.00 | 1.0 |
Table 10. Average image quality metrics of enhanced images in the LC25000 dataset.
| Class | Entropy | SI | PSNR (dB) | IQI |
|---|---|---|---|---|
| colon_aca | 7.6567 | 34.2878 | 17.2238 | 0.9511 |
| colon_bnt | 7.7199 | 32.7993 | 16.1384 | 0.9375 |
| lung_aca | 7.5795 | 19.8720 | 21.1361 | 0.9680 |
| lung_bnt | 7.2235 | 18.2911 | 21.6530 | 0.9780 |
| lung_scc | 7.4720 | 20.4731 | 20.7948 | 0.9687 |
Table 11. Hardware specifications used for training the proposed model.
| Hardware Specifications | Details |
|---|---|
| Platform | Jupyter Notebook |
| Processor | AMD Ryzen 5 3600 |
| Memory (RAM) | 64 GB |
| Operating System | Ubuntu 23.10, 64 bit |
| Graphics Card | NVIDIA GeForce GTX 1660 (VRAM 6 GB) |
Table 12. Hyperparameters used in training the proposed model.
| Hyperparameter | Value |
|---|---|
| Input image size | 224 × 224 × 3 |
| Number of classes | 5, 9, and 9 |
| Batch size | 16 |
| Number of epochs | 50, 100, and 100 |
| Backbone | Xception |
| Frozen layers | First 100 layers |
| Attention module | CBAM (Channel + Spatial) |
| Token embedding dimension | 128 |
| Number of tokens | H × W (CNN feature map) |
| Positional encoding | Learned |
| Transformer encoder blocks | 2 |
| Attention heads | 4 |
| FFN hidden dimension | 256 |
| Transformer dropout | 0.1 |
| Classifier dropout | 0.3 |
| Optimizer | Adam (learning rate 1 × 10⁻⁴) |
| Loss function | Categorical Cross-entropy |
| Pooling layer | Global Average Pooling |
| Final activation | Softmax |
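Two quantities follow from the Transformer rows of Table 12 but are not listed there: the per-head dimension and the token count. The sketch below derives them under two labeled assumptions: that the embedding is split evenly across heads, as in the standard multi-head formulation, and that the backbone reduces a 224 × 224 input by an overall stride of 32. The class `TransformerConfig` and the function `num_tokens` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    # Values copied from Table 12.
    embed_dim: int = 128
    num_heads: int = 4
    ffn_hidden_dim: int = 256
    num_blocks: int = 2
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        # Assumption: the embedding is divided evenly across attention heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads


def num_tokens(feature_h: int, feature_w: int) -> int:
    """One token per spatial position of the CNN feature map (H x W)."""
    return feature_h * feature_w


cfg = TransformerConfig()
# Assumption: a 224 x 224 input and an overall backbone stride of 32 give a
# 7 x 7 feature map; the table itself only states "H x W".
tokens = num_tokens(224 // 32, 224 // 32)
print(cfg.head_dim, tokens)  # 32 49
```

With a sequence this short, self-attention over all spatial positions is inexpensive relative to the convolutional backbone.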
Table 13. Comparative performance of pre-trained models and the proposed model on CRC-VAL-HE-7K.
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) |
|---|---|---|---|---|---|---|
| DenseNet-121 | 99.26 | 99.07 | 99.10 | 99.30 | 99.38 | 99.37 |
| MobileNetV2 | 97.76 | 96.73 | 96.89 | 96.81 | 96.60 | 96.61 |
| Xception | 99.20 | 98.77 | 98.84 | 98.80 | 98.78 | 98.79 |
| InceptionV3 | 97.27 | 97.14 | 97.89 | 96.38 | 96.63 | 95.82 |
| VGG-16 | 98.18 | 97.54 | 96.70 | 97.12 | 97.08 | 97.09 |
| NASNetMobile | 94.35 | 95.08 | 93.30 | 94.19 | 98.32 | 93.86 |
| Proposed Model | 99.58 | 99.10 | 99.00 | 99.40 | 99.40 | 99.40 |
Table 14. Per-class performance metrics for each histopathological class on CRC-VAL-HE-7K.
| Class | Precision | Recall | F1-Score | Per-Class Acc. | MCC | Support |
|---|---|---|---|---|---|---|
| ADI | 1.0000 | 0.9925 | 0.9962 | 0.9986 | 0.9954 | 268 |
| BACK | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 169 |
| DEB | 1.0000 | 0.9851 | 0.9925 | 0.9993 | 0.9921 | 68 |
| LYM | 1.0000 | 0.9921 | 0.9960 | 0.9993 | 0.9956 | 127 |
| MUC | 1.0000 | 0.9952 | 0.9976 | 0.9993 | 0.9972 | 207 |
| MUS | 1.0000 | 0.9831 | 0.9915 | 0.9986 | 0.9907 | 118 |
| NORM | 1.0000 | 0.9932 | 0.9966 | 0.9993 | 0.9962 | 148 |
| STR | 0.9545 | 1.0000 | 0.9767 | 0.9972 | 0.9756 | 84 |
| TUM | 0.9840 | 1.0000 | 0.9919 | 0.9972 | 0.9903 | 247 |
| Macro Avg | 0.9932 | 0.9935 | 0.9932 | 0.9988 | 0.9926 | |
| Weighted Avg | 0.9946 | 0.9944 | 0.9944 | 0.9987 | 0.9937 | 1536 |
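Each row of Table 14 can be derived from four one-vs-rest counts (true/false positives and negatives). The sketch below applies the standard definitions to counts for the STR class. These counts are inferred from the reported precision, recall, and support rather than read from the confusion matrix, and they assume that the number of evaluated images equals the sum of the per-class supports (1436). The function name `one_vs_rest_metrics` is illustrative.

```python
import math


def one_vs_rest_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, accuracy and MCC from one-vs-rest counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return precision, recall, f1, accuracy, mcc


# Hypothetical counts for the STR class, inferred from Table 14: support 84
# with recall 1.0 gives TP = 84 and FN = 0; precision 0.9545 implies FP = 4;
# TN assumes 1436 evaluated images (the sum of the per-class supports).
metrics = one_vs_rest_metrics(tp=84, fp=4, fn=0, tn=1436 - 84 - 4)
print([round(m, 4) for m in metrics])
# [0.9545, 1.0, 0.9767, 0.9972, 0.9756]
```

The function does not guard against empty classes; a production version would need to handle zero denominators.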
Table 15. Comparative performance of pre-trained models and the proposed model on NCT-CRC-HE-100K.
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) |
|---|---|---|---|---|---|---|
| DenseNet-121 | 96.81 | 96.81 | 96.81 | 96.81 | 96.40 | 96.40 |
| MobileNetV2 | 96.30 | 96.25 | 96.29 | 96.29 | 95.83 | 95.82 |
| Xception | 98.16 | 98.12 | 98.18 | 98.14 | 97.92 | 97.92 |
| InceptionV3 | 95.20 | 95.17 | 95.13 | 95.14 | 94.58 | 94.58 |
| VGG-16 | 97.02 | 96.94 | 97.06 | 96.98 | 96.65 | 96.64 |
| NASNetMobile | 94.20 | 94.16 | 94.10 | 94.12 | 93.46 | 93.45 |
| Proposed Model | 99.33 | 99.27 | 99.26 | 99.27 | 99.97 | 99.17 |
Table 16. Per-class performance metrics for each histopathological class on NCT-CRC-HE-100K.
| Class | Precision | Recall | F1-Score | Per-Class Acc. | MCC | Support |
|---|---|---|---|---|---|---|
| ADI | 0.9981 | 0.9981 | 0.9981 | 0.9996 | 0.9979 | 3004 |
| BACK | 0.9991 | 0.9986 | 0.9988 | 0.9997 | 0.9987 | 2113 |
| DEB | 0.9858 | 0.9952 | 0.9905 | 0.9978 | 0.9893 | 2302 |
| LYM | 0.9996 | 0.9944 | 0.9970 | 0.9993 | 0.9966 | 2311 |
| MUC | 0.9977 | 0.9826 | 0.9901 | 0.9982 | 0.9892 | 1770 |
| MUS | 0.9937 | 0.9948 | 0.9943 | 0.9984 | 0.9934 | 2707 |
| NORM | 0.9897 | 0.9897 | 0.9897 | 0.9982 | 0.9887 | 1753 |
| STR | 0.9871 | 0.9885 | 0.9878 | 0.9974 | 0.9864 | 2089 |
| TUM | 0.9882 | 0.9923 | 0.9902 | 0.9972 | 0.9886 | 2863 |
| Macro Avg | 0.9932 | 0.9927 | 0.9929 | 0.9984 | 0.9921 | |
| Weighted Avg | 0.9930 | 0.9930 | 0.9930 | 0.9984 | 0.9921 | 20,921 |
Table 17. Comparative performance of pre-trained models and the proposed model on the LC25000 dataset.
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) |
|---|---|---|---|---|---|---|
| DenseNet-121 | 99.26 | 99.52 | 99.52 | 99.52 | 99.40 | 99.25 |
| MobileNetV2 | 99.26 | 99.37 | 99.38 | 99.37 | 99.25 | 99.20 |
| Xception | 99.68 | 99.68 | 99.68 | 99.68 | 99.60 | 99.42 |
| InceptionV3 | 97.86 | 97.88 | 97.84 | 97.83 | 97.31 | 97.32 |
| VGG-16 | 99.72 | 99.78 | 99.78 | 99.78 | 99.72 | 99.60 |
| NASNetMobile | 97.74 | 97.82 | 97.78 | 97.78 | 97.32 | 97.45 |
| Proposed Model | 99.98 | 99.98 | 99.98 | 99.98 | 99.75 | 99.75 |
Table 18. Per-class performance metrics for colon and lung cancer classification on the LC25000 dataset.
| Class | Precision | Recall | F1-Score | Per-Class Acc. | MCC | Support |
|---|---|---|---|---|---|---|
| colon_aca | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1000 |
| colon_bnt | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1000 |
| lung_aca | 0.9990 | 1.0000 | 0.9995 | 0.9996 | 0.9988 | 1000 |
| lung_bnt | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1000 |
| lung_scc | 1.0000 | 0.9990 | 0.9996 | 0.9996 | 0.9987 | 1000 |
| Macro Avg | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9995 | |
| Weighted Avg | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9995 | 5000 |
Table 19. Step-by-step implementation approaches of the proposed model on the LC25000 dataset.
Model VariantParamsFLOPS (G)Inference (ms)Accuracy (%)F1 (%) Acc Vs Base (%)
Ensemble (MobilenetV2 + Xception + VGG16)164.8 M24.211298.76 ± 0.2196.74 ± 0.233.24
Xception (ImageNet)21.9 M4.2 G7.6699.68 ± 0.1799.68 ± 0.260.30
Xception Conv57.6 M3.8 G3.697.58 ± 0.3897.48 ± 0.622.20
Xception + Spatial Attention0.66 M1.5 G0.798.54 ± 0.0498.38 ± 0.341.11
Xception (like-CNN) + SE + CBAM3.01 M2.45 G11.398.53 ± 0.1598.46 ± 0.361.45
Xception + CBAM Attention + Transformer20.1 M3.9 G7.199.98 ± 0.0199.98 ± 0.01
Table 20. Performance comparison between prior work and the proposed model on the NCT-CRC-HE-100K dataset.
| Model | Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) | Ref. |
|---|---|---|---|---|---|---|---|---|
| CRCCN-NET | NCT-CRC-HE-100K | 96.26 | 96.44 | 96.34 | 96.38 | 96.00 | 96.00 | [32] |
| CNN + Swin Transformer | NCT-CRC-HE-100K | 95.80 | 97.90 | 97.63 | 97.76 | 97.61 | 97.64 | [33] |
| VGG19 | NCT-CRC-HE-100K | 96.40 | 94.22 | 94.44 | 94.44 | NA | NA | [34] |
| Ensemble CNN | NCT-CRC-HE-100K | 96.16 | 96.17 | 96.15 | NA | NA | NA | [35] |
| GAN + Inception | NCT-CRC-HE-100K | 89.54 | 86.84 | 86.62 | 98.70 | NA | NA | [36] |
| Proposed Model | NCT-CRC-HE-100K | 99.33 | 99.27 | 99.26 | 99.27 | 99.17 | 99.17 | |
Table 21. Performance comparison between prior work and the proposed model on CRC-VAL-HE-7K.
| Model | Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) | Ref. |
|---|---|---|---|---|---|---|---|---|
| ResNet50 + Kernel Polynomial | CRC-VAL-HE-7K | 97.01 | 98.20 | 98.20 | 98.20 | 96.50 | 98.10 | [37] |
| FineTuned-VGG16 | CRC-VAL-HE-7K | 97.92 | 98.02 | 97.38 | 97.65 | 97.62 | 97.61 | [38] |
| CNN-adam | CRC-VAL-HE-7K | 90.00 | 89.00 | 87.00 | 87.00 | NA | NA | [39] |
| Proposed Model | CRC-VAL-HE-7K | 99.58 | 99.10 | 99.40 | 99.40 | 99.40 | 99.40 | |
Table 22. Performance comparison between prior work and the proposed model on the LC25000 dataset.
| Model | Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) | Ref. |
|---|---|---|---|---|---|---|---|---|
| Fine-tuned ResNet101 | LC25000 | 99.94 | 99.84 | 99.85 | 99.84 | NA | NA | [40] |
| LW-MS-CCN | LC25000 | 99.20 | 99.16 | 99.36 | 99.29 | NA | NA | [41] |
| CNN | LC25000 | 96.33 | 96.39 | 96.37 | 96.38 | 95.44 | 95.41 | [42] |
| CNN + GC attention block | LC25000 | 99.76 | 99.76 | 99.40 | 99.70 | 99.50 | 99.50 | [43] |
| HIELCC-EDL | LC25000 | 99.60 | 99.00 | 99.00 | 99.00 | 99.00 | 99.20 | [44] |
| Self-ONN | LC25000 | 99.89 | 99.74 | 99.74 | 99.74 | 99.84 | 99.78 | [45] |
| CNN + ImageNet | LC25000 | 99.96 | 99.96 | 99.96 | 99.96 | 99.96 | 98.36 | [46] |
| Ensemble (MobileNet + Xception) | LC25000 | 99.44 | 99.42 | 99.43 | 99.42 | 99.43 | 99.30 | [47] |
| Ensemble (ResNet + NasNet + EfficientNet) | LC25000 | 99.94 | 99.84 | 99.84 | 99.84 | 99.78 | 99.88 | [48] |
| Proposed Model | LC25000 | 99.98 | 99.98 | 99.98 | 99.98 | 99.95 | 99.98 | |
Table 23. Patient-level five-fold cross-validation performance (mean ± standard deviation) on the LC25000 dataset.
| Fold | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) |
|---|---|---|---|---|---|---|
| Fold-1 | 99.98 | 99.98 | 99.98 | 99.98 | 99.98 | 99.97 |
| Fold-2 | 99.96 | 99.96 | 99.96 | 99.96 | 99.95 | 99.95 |
| Fold-3 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Fold-4 | 99.92 | 99.92 | 99.92 | 99.92 | 99.90 | 99.90 |
| Fold-5 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Average ± SD | 99.97 ± 0.03 | 99.97 ± 0.03 | 99.97 ± 0.03 | 99.97 ± 0.03 | 99.97 ± 0.0004 | 99.97 ± 0.0004 |
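The accuracy summary in the last row of Table 23 can be recomputed from the five fold values with the standard library. The sketch below uses the sample standard deviation, which is an assumption because the paper does not state which form was used; the population form rounds to the same two-decimal figure for these numbers.

```python
import statistics

# Fold-wise accuracies (%) copied from Table 23.
fold_accuracy = [99.98, 99.96, 100.00, 99.92, 100.00]

mean_acc = statistics.mean(fold_accuracy)
# Assumption: sample standard deviation (n - 1 denominator); the population
# form (statistics.pstdev) rounds to the same two-decimal value here.
sd_acc = statistics.stdev(fold_accuracy)

print(f"{mean_acc:.2f} ± {sd_acc:.2f}")  # 99.97 ± 0.03
```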
Table 24. Ablation study evaluating the effect of different preprocessing configurations on classification performance on the CRC-VAL-HE-7K dataset.
| Preprocessing Strategy | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | MCC (%) | Kappa (%) |
|---|---|---|---|---|---|---|
| No preprocessing | 99.32 | 99.26 | 99.11 | 99.19 | 99.19 | 99.26 |
| Stain normalization only | 98.95 | 98.66 | 98.83 | 98.71 | 98.80 | 98.80 |
| Spatially Adaptive NLM + Edge-Aware Sharpening | 97.65 | 97.26 | 97.32 | 97.25 | 97.44 | 97.10 |
| Gamma correction only | 99.36 | 99.36 | 99.23 | 99.29 | 99.29 | 99.44 |
| Gamma + Bilateral filtering | 99.51 | 99.36 | 99.29 | 99.33 | 99.44 | 99.44 |
| Gamma + Bilateral + CLAHE | 99.58 | 99.10 | 99.40 | 99.40 | 99.40 | 99.40 |
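Gamma correction, the first stage of the best-performing configuration in Table 24, remaps each intensity v to 255 · (v / 255)^γ. The sketch below implements only that step as a lookup table. The value γ = 0.8 is purely illustrative because the setting used in the paper is not given in this section, and the bilateral filtering and CLAHE stages are not reproduced here. The function names are hypothetical.

```python
def gamma_lut(gamma):
    """Lookup table mapping each 8-bit intensity v to 255 * (v / 255) ** gamma."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return [round(255 * (v / 255) ** gamma) for v in range(256)]


def apply_gamma(pixels, gamma):
    """Apply gamma correction to a flat sequence of 8-bit intensities."""
    lut = gamma_lut(gamma)
    return [lut[p] for p in pixels]


# gamma = 0.8 is an illustrative value only; the paper's setting is not
# given in this section. Values below 1 brighten mid-tones.
print(apply_gamma([0, 64, 128, 255], gamma=0.8))
```

Precomputing the 256-entry table means the power function is evaluated once per grey level rather than once per pixel.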
