Attention-Driven Feature Extraction for XAI in Histopathology Leveraging a Hybrid Xception Architecture for Multi-Cancer Diagnosis

Shila, Shirin; Hossain, Md. Safayat; Masud, Md Fuyad Al; Miah, Mohammad Badrul Alam; Aminuddin, Afrig; Muhammad, Zia

doi:10.3390/make8020031


by Shirin Shila ^1,†, Md. Safayat Hossain ^2,†, Md Fuyad Al Masud ^3, Mohammad Badrul Alam Miah ^2,*, Afrig Aminuddin ^4 and Zia Muhammad ^5,*

^1 Department of Food Technology and Nutritional Science, Mawlana Bhashani Science and Technology University, Tangail 1902, Bangladesh

^2 Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Tangail 1902, Bangladesh

^3 Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND 58102, USA

^4 Department of Information System, Universitas Amikom, Yogyakarta 55283, Indonesia

^5 Department of Computing, Design, and Communication, University of Jamestown, Jamestown, ND 58405, USA

^* Authors to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Mach. Learn. Knowl. Extr. 2026, 8(2), 31; https://doi.org/10.3390/make8020031

Submission received: 8 December 2025 / Revised: 18 January 2026 / Accepted: 26 January 2026 / Published: 28 January 2026

(This article belongs to the Section Learning)


Abstract

Automated and accurate classification of histopathology images is necessary for the early detection of cancer, especially common cancers such as Colorectal Cancer (CRC) and Lung Cancer (LC). Nonetheless, classical deep learning frameworks often struggle because intra-class variation is large, inter-class similarity is high, and image quality is inconsistent. To address these constraints, a multi-stage diagnostic framework is proposed. The process starts with a robust preprocessing pipeline involving gamma correction, bilateral filtering, and adaptive CLAHE, which yields statistically significant changes in quantitative image quality measures. A hybrid attention architecture is then presented, comprising an Xception backbone, a Convolutional Block Attention Module (CBAM), a Transformer block, and an MLP classifier, to combine local features with global context. Tested on three publicly available datasets, the proposed model achieved classification accuracies of 99.98%, 99.58%, and 99.33% on LC25000, CRC-VAL-HE-7K, and NCT-CRC-HE-100K, respectively. To enhance transparency, detailed explainability analyses are conducted using layer-wise feature visualization and Grad-CAM. Finally, the practical value of the framework is demonstrated by deploying it on a web-based platform, which can serve as a useful and easy-to-use tool to assist pathological diagnosis.

Keywords:

histopathology; colorectal cancer; lung cancer; Xception; CBAM; image enhancement; Grad-CAM; deep learning; web-based platform

1. Introduction

Cancer is a leading cause of death worldwide, and lung cancer and colorectal cancer (CRC) are among the most prevalent and fatal types [1]. The definitive diagnosis of most cancers, including CRC and lung cancer, depends on histopathological analysis of hematoxylin and eosin (H&E)-stained tissue slides by professional pathologists [2]. Although this manual method is considered the gold standard, it is labor-intensive, subjective, and prone to significant variability, both between observers and within the same observer, which can harm diagnostic reliability and patient outcomes [3]. These concerns have driven the growth of computational pathology, which uses computer-aided diagnosis (CAD) systems to analyze slides more objectively and efficiently. Over the past ten years, deep learning, and specifically Convolutional Neural Networks (CNNs), has made substantial progress in classifying histopathological images [4]. Architectures such as ResNet, DenseNet, and Xception [5,6,7] can extract complex hierarchical features from pixel data and, in some tasks, perform as well as or better than humans. However, traditional CNN models face two limitations. First, their convolutional structure is very effective at identifying local features (such as the shape of cell nuclei) but often fails to capture larger contextual information (such as the overall tissue structure), which is required for accurate diagnosis. Second, these models are black boxes, which makes their use in clinical practice challenging because pathologists may not be able to identify the reasoning behind the model's findings [8].

In addition, the quality of the input data is a very important factor in the performance of any deep learning model. Histopathological images vary widely in stain intensity and color, and they often contain artifacts introduced during slide preparation [9]. This variability may severely impair a model's ability to generalize. Although stain normalization is a common preprocessing step, simple normalization may be inadequate. We argue that a robust image enhancement pipeline is required not only to standardize images but also to reveal textural details that are critical to classification, and that its effect must be quantitatively verified. To address these shortcomings, a multi-stage hybrid deep learning framework is proposed. It begins with a validated preprocessing pipeline that applies gamma correction, bilateral filtering, and Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance image appearance [10].

These improvements are quantified using entropy and PSNR. For the classification task, a hybrid architecture is presented that combines the benefits of different paradigms. The Xception network [7] serves as the backbone for local feature extraction. The features are then refined using a Convolutional Block Attention Module (CBAM) [10], which adaptively reweights them by focusing on the most important channels and spatial regions. A Transformer block is added to capture the broader context that traditional CNNs can easily miss. Transformers were originally designed for natural language processing; their image counterpart, the Vision Transformer (ViT) [11], has proven highly capable of identifying long-range dependencies in images and is therefore well suited to analyzing tissue structure. The proposed model combines these components into a single network that is trainable end-to-end. To enhance transparency and build clinical confidence, explainability (XAI) techniques, especially Gradient-weighted Class Activation Mapping (Grad-CAM) [12] and layer-wise feature visualization, are applied to explain the model's decisions. Finally, to demonstrate the practical usefulness of this research, the trained model was deployed as an operational web-based tool. The proposed model was extensively trained and tested on three important public datasets: NCT-CRC-HE-100K [13], CRC-VAL-HE-7K [13], and LC25000 [14]. The framework achieves state-of-the-art performance, demonstrating a stable, accurate, and interpretable methodology for automated histopathological diagnosis. The key contributions of this study are as follows:

1. A quantitative comparison of histopathology images before and after a novel enhancement pipeline (gamma correction, bilateral filtering, CLAHE), using IQA metrics to validate its efficacy.
2. The design and implementation of a novel hybrid architecture (Xception–CBAM–Transformer) that synergistically combines local feature extraction, dual-axis attention, and global context modeling.
3. State-of-the-art classification performance demonstrated on three distinct and widely used CRC and lung cancer datasets.
4. A comprehensive explainability analysis using Grad-CAM and feature visualization to ensure model transparency and trustworthiness.
5. The development and deployment of a web-based platform, translating our research into a practical tool for pathological assistance.

The rest of this paper is structured as follows. Section 2 reviews the related literature. Section 3 describes the materials and methods, including the proposed framework and the data used. Section 4 describes the experimental setup and implementation. Section 5 presents the experimental results. Section 6 discusses the findings in detail. Lastly, Section 7 concludes the paper with a summary of the main findings and contributions.

2. Related Work

Automated histopathological image classification for cancer diagnosis has improved tremendously, moving from traditional machine learning to elaborate deep learning models. This section reviews the key techniques for classifying colorectal cancer (CRC) and lung cancer, together with their respective strengths and weaknesses, and thereby motivates the hybrid approach proposed in this paper.

2.1. Convolutional Neural Networks (CNNs) in Pathology

Convolutional Neural Networks (CNNs) were the first significant step in computational pathology. Architectures pre-trained on ImageNet, including ResNet and Xception [5,7], are commonly used as feature extractors. Ongoing research continues to use these models, often in ensemble approaches, to attain high accuracy on datasets such as LC25000, CRC-VAL-HE-7K, and NCT-CRC-HE-100K [13,14].

Although these models excel at capturing local, hierarchical features (like nuclear atypia and mitotic figures), conventional CNNs have a key shortcoming: their effective receptive field is limited to a local scope. They find it challenging to represent long-range spatial relationships and the overall tissue architecture, such as the connections between tumor-infiltrating lymphocytes and epithelial cells that are distantly located, which are often vital for precise diagnosis and grading [15].

2.2. Attention-Enhanced CNNs

In order to overcome the shortcomings of conventional CNNs, researchers started incorporating attention mechanisms. These modules, such as the Convolutional Block Attention Module (CBAM) [10], help the network “learn what and where to emphasize.” Recent 2024 studies confirm that integrating spatial and channel attention mechanisms (including CBAM) into CNN backbones significantly improves focus on critical regions, reduces noise, and enhances classification accuracy [15]. While these attention-gated CNNs improve performance, they are still fundamentally constrained by the convolutional backbone. They enhance the focus of local feature extraction but do not solve the problem of modeling global context. A comparative summary of existing histopathological image classification approaches, their strengths, and limitations is presented in Table 1, which motivates the design of the proposed hybrid framework.

2.3. Transformers in Computational Pathology

Recently, the Vision Transformer (ViT) [19] has emerged as a formidable alternative. According to a comprehensive analysis, the application of Transformers is increasingly prevalent in the domain of pathology. They effectively understand global, long-range relationships by dividing an image into smaller segments and employing self-attention. Nonetheless, pure Transformers are known for their substantial data requirements and may miss the detailed, pixel-level textural details that CNNs excel in capturing.

2.4. Hybrid Architectures and Identified Gaps

Recent work is moving towards hybrid models that integrate the advantages of both frameworks: a CNN for effective local feature extraction and a Transformer to capture the global context of these features. The efficacy of hybrid Xception–Transformer designs was explicitly demonstrated in studies from 2024, in which Xception extracts local features that a Transformer then models as tokens to capture global context [20].

The present work is situated in this line of research but adds major improvements. Three gaps were identified in the current literature:

Preprocessing: Many studies do not employ or, more importantly, quantitatively validate a preprocessing pipeline to handle the extreme stain variability in multi-center datasets.
Feature Refinement: While hybrid models exist, the features are often passed directly from the CNN to the Transformer. We argue that refining these features with a lightweight dual-attention mechanism (CBAM) provides a more salient and robust input to the Transformer block.
Explainability: Many complex hybrid models remain “black boxes.” We integrate Grad-CAM and feature visualization as a core part of our methodology to ensure clinical trust and interpretability.

The proposed framework, a validated preprocessing pipeline followed by an Xception–CBAM–Transformer network, is explicitly designed to address these three gaps.

3. Materials and Methods

This section details the datasets, the preprocessing pipeline, and the novel hybrid architecture used in this study. The overall workflow of our methodology is presented in Figure 1.

3.1. Dataset Description

This research work utilized three publicly available, large-scale histopathology datasets to ensure the robustness and generalizability of our model.

CRC-VAL-HE-7K [13]: This validation set is associated with the NCT-CRC dataset and comprises 7180 image patches from various patients, providing a strong evaluation of the model’s ability to generalize.
NCT-CRC-HE-100K [13]: This dataset comprises 100,000 distinct image patches derived from 86 patients. Each image has dimensions of 224 × 224 pixels and features nine different tissue categories: Adipose (ADI), Background (BACK), Debris (DEB), Lymphocytes (LYM), Mucus (MUC), Smooth Muscle (MUS), Normal Colon Mucosa (NORM), Cancer-Associated Stroma (STR), and Colorectal Adenocarcinoma Epithelium (TUM).
LC25000 [14]: The LC25000 dataset contains 25,000 histopathological images, which are distributed across five diagnostic classes. It was generated via extensive data augmentation from a limited number of original lung and colon images. As the publicly available dataset lacks patient- or slide-level identifiers, true patient-level separation cannot be ensured. Therefore, LC25000 is treated as an image-level benchmark in this study, and results are interpreted accordingly.

The visualization of the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets’ class distribution is shown in Figure 2.

The datasets were divided into training (80%) and validation (20%) subsets. The numerical distribution of images across the respective classes after the data partitioning is summarized in Table 2, Table 3 and Table 4.

3.2. Image Preprocessing and Enhancement

Histopathological images often show considerable variability in staining and appearance. To unify the data and amplify important features, this research developed a multi-phase preprocessing pipeline.

Image Enhancement Pipeline

The purpose of this first step, which was used prior to the train/validation split, was to enhance image fidelity.

1. Gamma Correction: Gamma correction is a nonlinear transformation of the overall brightness and contrast of an image that makes dark areas more visible and bright areas clearer.

I_{\mathrm{gamma}}(x, y) = 255 \times \left(\frac{I(x, y)}{255}\right)^{\frac{1}{\gamma}}

(1)

where $I(x, y)$ is the original pixel intensity at coordinates $(x, y)$, $I_{\mathrm{gamma}}(x, y)$ is the gamma-corrected pixel intensity, and $\gamma$ is the correction parameter. In this study, γ = 1.2 was selected to enhance contrast without overexposure.
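As a minimal sketch of Equation (1), the transformation can be precomputed as a 256-entry lookup table and applied per pixel. The function names and the tiny grayscale patch below are illustrative assumptions, not the authors' implementation; the sketch assumes 8-bit intensities.

```python
def gamma_lookup_table(gamma):
    """Precompute Eq. (1) for every 8-bit intensity: 255 * (v / 255) ** (1 / gamma)."""
    return [round(255 * (v / 255) ** (1.0 / gamma)) for v in range(256)]


def apply_gamma(image, gamma=1.2):
    """Apply gamma correction to a 2D list of 8-bit intensities."""
    table = gamma_lookup_table(gamma)
    return [[table[v] for v in row] for row in image]


# Hypothetical 2 x 3 grayscale patch, for illustration only.
patch = [[0, 32, 64], [128, 192, 255]]
corrected = apply_gamma(patch, gamma=1.2)
```

With γ = 1.2 the exponent 1/γ is below one, so every intensity is mapped to an equal or higher value, and the darkest mid-tones are lifted the most while 0 and 255 stay fixed.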

2. Bilateral Filtering: Bilateral filtering reduces noise while preserving edges, which is important for retaining the structural details of nuclei and tissues in histopathology images. The output of the bilateral filter at pixel (x, y) is calculated as follows:

I_{\mathrm{bilateral}}(x, y) = \frac{1}{W_{p}} \sum_{i \in \Omega} \sum_{j \in \Omega} I(i, j) \times \exp\left(-\frac{(i - x)^{2} + (j - y)^{2}}{2 \sigma_{s}^{2}}\right) \times \exp\left(-\frac{|I(i, j) - I(x, y)|^{2}}{2 \sigma_{r}^{2}}\right)

(2)

W_{p} = \sum_{i \in \Omega} \sum_{j \in \Omega} \exp\left(-\frac{(i - x)^{2} + (j - y)^{2}}{2 \sigma_{s}^{2}}\right) \times \exp\left(-\frac{|I(i, j) - I(x, y)|^{2}}{2 \sigma_{r}^{2}}\right)

(3)

where $\Omega$ is the spatial neighborhood around pixel $(x, y)$, $\sigma_{s}$ controls spatial smoothing, $\sigma_{r}$ controls intensity similarity, and $W_{p}$ is the normalization factor.
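Equations (2) and (3) can be illustrated with a direct, unoptimized implementation on a small grayscale image. This is a sketch under assumptions: the window radius, the values of σ_s and σ_r, and the toy image are hypothetical, since the paper does not report them here.

```python
import math


def bilateral_filter(image, radius=1, sigma_s=1.0, sigma_r=25.0):
    """Apply Eqs. (2)-(3) to a 2D grayscale image given as a list of rows."""
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for x in range(height):
        for y in range(width):
            weighted_sum, w_p = 0.0, 0.0
            # Omega: square window around (x, y), clipped at the image border.
            for i in range(max(0, x - radius), min(height, x + radius + 1)):
                for j in range(max(0, y - radius), min(width, y + radius + 1)):
                    spatial = math.exp(-((i - x) ** 2 + (j - y) ** 2) / (2 * sigma_s ** 2))
                    similarity = math.exp(-((image[i][j] - image[x][y]) ** 2) / (2 * sigma_r ** 2))
                    weighted_sum += image[i][j] * spatial * similarity
                    w_p += spatial * similarity
            output[x][y] = weighted_sum / w_p
    return output


# Hypothetical noisy step edge: left half near 50, right half near 200.
step = [[48, 52, 50, 200, 198, 202],
        [51, 49, 50, 201, 199, 200],
        [50, 50, 52, 199, 202, 198]]
smoothed = bilateral_filter(step)
```

Pixels on the far side of the edge differ by about 150 intensity levels, so their similarity weight is close to zero and the edge stays sharp while the small fluctuations on each side are averaged out.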

3. Adaptive CLAHE: Contrast Limited Adaptive Histogram Equalization (CLAHE) enhances local contrast while preventing over-amplification of noise in homogeneous regions. For each pixel $(x, y)$ in each tile, the output intensity is computed as:

I_{\mathrm{CLAHE}}(x, y) = \mathrm{CDF}_{\mathrm{clip}}(I(x, y)) \times (L - 1)

(4)

where $\mathrm{CDF}_{\mathrm{clip}}$ is the clipped cumulative distribution function of the pixel intensities within the tile, and $L$ is the total number of possible intensity levels (typically 256). Clipping the histogram limits the slope of the CDF, which prevents over-enhancement in areas that are almost uniform.
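The effect of clipping in Equation (4) can be sketched on a single tile. The code below is a simplified illustration, not a full CLAHE: it omits the bilinear blending between neighboring tiles that complete implementations perform, and the clip limit and toy tile are hypothetical.

```python
def equalize_tile(tile, clip_limit, levels=256):
    """Map each pixel through the clipped CDF of its tile, as in Eq. (4)."""
    flat = [v for row in tile for v in row]
    histogram = [0] * levels
    for v in flat:
        histogram[v] += 1
    # Clip each bin and spread the removed counts uniformly over all bins.
    excess = sum(max(0, h - clip_limit) for h in histogram)
    clipped = [min(h, clip_limit) + excess / levels for h in histogram]
    # Cumulative distribution of the clipped histogram, normalized to [0, 1].
    cdf, running = [], 0.0
    for h in clipped:
        running += h
        cdf.append(running / len(flat))
    return [[round(cdf[v] * (levels - 1)) for v in row] for row in tile]


# Hypothetical near-uniform 4 x 4 tile: twelve pixels at 100, four at 101.
tile = [[100] * 4, [100] * 4, [100] * 4, [101] * 4]
plain = equalize_tile(tile, clip_limit=16)   # limit never reached: plain equalization
clipped = equalize_tile(tile, clip_limit=2)  # strong clipping
```

Without clipping, a one-level difference in this almost uniform tile is stretched across a large part of the output range; with clipping, the stretch is noticeably smaller, which is the noise-limiting behavior the text describes.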

The overall image enhancement workflow is illustrated in Figure 3, which outlines the sequential application of gamma correction, bilateral filtering, and CLAHE to improve the contrast and structural clarity of histopathological images. Initially, brightness and contrast were adjusted using gamma correction (γ = 1.2), which brought out fine details in the darkest zones. Subsequently, a bilateral filter was applied to decrease noise while maintaining tissue and tumor edges. Finally, Contrast Limited Adaptive Histogram Equalization (CLAHE), applied in the LAB color space, enhanced local contrast and emphasized fine structural details.

This sequence produced images of higher visual quality and structural integrity, making them more suitable for the downstream classification processes. The enhanced images for each class are shown in Figure 4.

Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10 present the image quality assessment results for the original and enhanced images across the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets. For the CRC-VAL-HE-7K dataset (Table 5 and Table 6), the original images exhibit class-dependent variations in entropy and sharpness, with classes such as NORM, LYM, and STR showing higher entropy and Sharpness Index (SI) values, indicating richer structural information.

Table 5 presents the average image quality metrics for each class of the original images on the CRC-VAL-HE-7K dataset.

Table 6 presents the average image quality metrics for each class of enhanced images on the CRC-VAL-HE-7K dataset.

All original images maintain the maximum Peak Signal-to-Noise Ratio (PSNR) of 100 dB and Image Quality Index (IQI) of 1.0. After enhancement, entropy and SI increase consistently across all classes, confirming improved textural details, while PSNR and IQI slightly decrease as expected due to the introduction of new high-frequency information during enhancement.

A similar trend is observed in the NCT-CRC-HE-100K dataset (Table 7 and Table 8).

Table 7 presents the average image quality metrics for each class of the original images on the NCT-CRC-HE-100K dataset.

Table 8 presents the average image quality metrics for each class of enhanced images on the NCT-CRC-HE-100K dataset.

Once again, the original images exhibit uniformly high PSNR and IQI values, whereas the enhanced images exhibit higher entropy and SI values across all classes, indicating improved visual sharpness and contrast. Although the PSNR values drop to the range of 21–27 dB from the ideal 100 dB, the IQI values remain high (0.95–0.99), which means that structural similarity is preserved.

In the case of the LC25000 dataset (Table 9 and Table 10), the same trend is observed. Original images have optimal PSNR and IQI, whereas the enhanced images show improved entropy and SI, especially in colon_aca and colon_bnt, which indicates considerable texture amplification. The PSNR values after enhancement are within the anticipated range (16–21 dB), and the IQI values are always larger than 0.93, which shows that the enhancement adds detail without harming structural integrity.

Table 9 presents the average image quality metrics for each class of the original images on the LC25000 dataset.

Table 10 presents the average image quality metrics for each class of enhanced images on the LC25000 dataset.

The quantitative results obtained on all three datasets confirm that the presented enhancement approach boosts image entropy and sharpness while maintaining high structural quality, which improves visual and textural richness and suitability for downstream deep learning tasks.

Effects of PSNR, IQI, and mixed-metric changes. Although the enhancement process introduces new high-frequency information, which decreases the PSNR and IQI values compared with the reference values, these decreases do not imply a loss of diagnostic utility. PSNR measures pixel-level fidelity, and IQI measures the global structural similarity of the image, its luminance stability, and its preservation of contrast. In contrast, entropy and SI capture richness of information and edge clarity. Thus, an increase in entropy and SI together with a decrease in PSNR and IQI reflects the deliberate introduction of contrast and textural detail during enhancement. The IQI values of all the datasets are high (0.93–0.99), which confirms that the integrity of the tissue structure is not compromised.

From a deep-learning standpoint, these moderately decreased PSNR and IQI values do not adversely affect classification performance. Convolutional neural networks rely mostly on discriminative morphological features, e.g., gland structure, nuclear boundaries, and image texture, rather than on pixel-wise similarity to the source image. The enhanced images, which have greater entropy and SI, provide richer and more distinguishable features, which improves feature separability and allows for more robust learning. Consequently, the enhancement pipeline improves the quality of the training inputs and eventually leads to better model performance.
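The opposite movement of entropy and PSNR can be reproduced with a small sketch. It assumes that entropy is the Shannon entropy of the intensity histogram and that the 100 dB value reported for original images is a cap used when an image is compared against itself (zero error); both are assumptions, and the 2 × 2 patches are hypothetical.

```python
import math


def shannon_entropy(image, levels=256):
    """Shannon entropy (bits) of the intensity histogram of a 2D image."""
    flat = [v for row in image for v in row]
    histogram = [0] * levels
    for v in flat:
        histogram[v] += 1
    n = len(flat)
    return -sum((h / n) * math.log2(h / n) for h in histogram if h > 0)


def psnr(reference, test, max_value=255.0, cap_db=100.0):
    """PSNR in dB; identical images return cap_db (an assumed convention)."""
    squared = [(r - t) ** 2
               for ref_row, test_row in zip(reference, test)
               for r, t in zip(ref_row, test_row)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return cap_db
    return min(cap_db, 10.0 * math.log10(max_value ** 2 / mse))


# Hypothetical patches: a flat original and a contrast-stretched version.
original = [[100, 100], [100, 104]]
enhanced = [[90, 100], [110, 120]]

entropy_before = shannon_entropy(original)
entropy_after = shannon_entropy(enhanced)
psnr_self = psnr(original, original)
psnr_enhanced = psnr(original, enhanced)
```

In this toy case the enhanced patch has more distinct intensity levels, so its entropy rises, while its PSNR against the original falls below the cap because the pixel values were deliberately changed.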

3.3. Baseline Models for Comparison

Training deep learning models from scratch typically requires large datasets to avoid overfitting. Transfer learning allows for effective training with smaller, specialized datasets by fine-tuning pre-trained models, which boosts performance and reduces training time. To benchmark the proposed model, several advanced CNN architectures known for their strong image recognition performance were implemented and fine-tuned. These models were fine-tuned on the enhanced histopathology datasets to compare their ability to classify CRC and lung cancer.

DenseNet121: Introduced by Huang et al. [6], this model utilizes dense skip connections to improve feature reuse, decrease parameters, and improve gradient flow.
MobileNetV2: Sandler et al. [23] introduced an architecture that employs inverted residuals and linear bottlenecks (depth-wise separable convolutions) to reduce computational cost, making it highly efficient.
NASNetMobile: Zoph et al. [24] used neural architecture search to identify an effective network structure optimized for high performance on mobile-sized models.
InceptionV3: Szegedy et al. [25] use factorized convolutions to enhance efficiency by reducing connections without sacrificing performance, and it is known for its multi-scale processing.
VGG16: Introduced by Simonyan and Zisserman [26], this 16-layer model is renowned for its simplicity and uniform 3 × 3 filter-based architecture.
Xception: Chollet [7] improved the Inception architecture by replacing Inception modules with depth-wise separable convolutions and adding residual connections, achieving high accuracy with fewer parameters.

3.4. Proposed Hybrid Architecture

The proposed model, illustrated in Figure 5, represents an innovative hybrid structure. It aims to recognize both local morphological characteristics and global contextual relationships.

3.4.1. Xception Backbone (Local Feature Extractor)

The 224 × 224 × 3 preprocessed images serve as input to the Xception Block (Figure 5a). The Xception architecture [7] was used as the primary local feature extractor. This backbone is built upon a series of Residual Blocks as detailed in Figure 6a. Each block consists of a ‘Main Path’ and a ‘Shortcut Path’.

Main Path: The input passes through a Separable 2D Convolution (Sep_Conv2D), Batch Normalization (BN), a GeLU activation, another Sep_Conv2D, and a final BN.
Shortcut Path: The original input is passed through a 1 × 1 Convolution (Conv2D 64 × 1) and a BN layer to match the channel dimensions of the main path.
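The two paths described above can be written compactly as follows. This is a sketch that assumes the main and shortcut outputs are combined by element-wise addition, as in standard residual blocks; the symbols $F$, $S$, and $y$ are introduced here for illustration only.

```latex
\begin{aligned}
F(x) &= \mathrm{BN}\Big(\mathrm{SepConv}\big(\mathrm{GELU}\big(\mathrm{BN}(\mathrm{SepConv}(x))\big)\big)\Big) && \text{(main path)} \\
S(x) &= \mathrm{BN}\big(\mathrm{Conv}_{1 \times 1}(x)\big) && \text{(shortcut path)} \\
y &= F(x) + S(x) && \text{(assumed element-wise sum)}
\end{aligned}
```

The 1 × 1 convolution on the shortcut is what makes the sum well defined, because it brings $x$ to the same number of channels as $F(x)$.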

3.4.2. CBAM Attention (Feature Refinement)

The high-level feature map from the Xception backbone is immediately passed to the CBAM Block (Figure 5b) for feature refinement. This module [10] sequentially infers and applies channel and spatial attention maps to recalibrate the feature map, amplifying salient information and suppressing irrelevant noise before it is passed to the Transformer.
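For reference, the sequential channel-then-spatial recalibration can be summarized with the formulation from the original CBAM paper. The 7 × 7 kernel is that paper's default and is an assumption here, since this section does not state the kernel size or the reduction ratio of the shared MLP.

```latex
\begin{aligned}
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), & F' &= M_c(F) \otimes F \\
M_s(F') &= \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big), & F'' &= M_s(F') \otimes F'
\end{aligned}
```

Here $F$ is the Xception feature map, $\sigma$ is the sigmoid function, $\otimes$ is element-wise multiplication with broadcasting, and $F''$ is the refined map that is passed on to the Transformer.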

3.4.3. Transformer Encoder (Global Context Modeling)

To model global context, the refined feature map from CBAM is “tokenized” (flattened into a 1D sequence of feature vectors) and combined with a learned positional embedding. This sequence is fed into the Transformer Block (Figure 5c). This block follows the standard encoder architecture, which is composed of two primary sub-layers:

Multi-Head Self-Attention (MHSA): As detailed in Figure 6b, the input sequence (X) is used to generate the Query (Q), Key (K), and Value (V) matrices through learned linear projections. The attention output is calculated using scaled dot-product attention: Attention $(Q, K, V)$ = Softmax $(\frac{Q K^{T}}{\sqrt{d_{k}}}) V$ . The outputs from multiple such “heads” are concatenated and multiplied by a final linear projection matrix to produce the refined feature map. This allows the model to weigh the importance of every feature token relative to every other token.
Feed-Forward Network (FFN): As shown in Figure 6c, the output of the MHSA sub-layer is passed through a position-wise FFN. This network consists of two Dense (fully connected) layers separated by a GeLU activation and Dropout layers for regularization.

Residual connections and layer normalization are applied around each of these two sub-layers (MHSA and FFN) as shown in Figure 5c.
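The scaled dot-product attention used inside each head can be sketched in a few lines. This is a single-head illustration with plain Python lists; the learned projections that produce Q, K, and V are omitted, and the token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scaled_dot_product_attention(q, k, v):
    """Return Softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three hypothetical feature tokens of dimension 2, used as Q, K, and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, attention = scaled_dot_product_attention(tokens, tokens, tokens)

# A zero query scores every key equally, so the output is the mean of V.
uniform_output, uniform_weights = scaled_dot_product_attention([[0.0, 0.0]], tokens, tokens)
```

Each row of `attention` sums to one, so every output token is a weighted average of all value vectors, which is how distant tissue regions can influence one another.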

3.4.4. Classification Head

The output sequence from the Transformer block is processed by the final Classification Head (Figure 5d). First, a Global Average Pooling 1D (GAP1D) layer is applied to condense the token sequence into a single, fixed-size feature vector. This vector is then passed through a small MLP, consisting of a Dense (FC) layer with a ReLU activation, followed by the final Dense (FC) layer with a Softmax activation to produce a probability distribution across the n output classes.
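The classification head can be traced step by step on toy numbers. The sketch below assumes two tokens of dimension three, a two-unit hidden layer, and three output classes; all weights are hypothetical placeholders rather than trained values.

```python
import math


def global_average_pool_1d(tokens):
    """Average over the token axis, giving one value per feature."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


def dense(vector, weights, bias):
    """Fully connected layer; weights[i][j] connects input i to output j."""
    return [sum(x * w for x, w in zip(vector, column)) + b
            for column, b in zip(zip(*weights), bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def softmax(vector):
    peak = max(vector)
    exps = [math.exp(x - peak) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical token sequence and weights, for illustration only.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
w_hidden = [[0.5, -0.5], [0.25, 0.25], [-0.5, 0.5]]   # 3 features -> 2 hidden units
b_hidden = [0.0, 0.0]
w_out = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]          # 2 hidden units -> 3 classes
b_out = [0.0, 0.0, 0.0]

pooled = global_average_pool_1d(tokens)
hidden = relu(dense(pooled, w_hidden, b_hidden))
probabilities = softmax(dense(hidden, w_out, b_out))
```

Pooling removes the dependence on sequence length, so the same head works for any number of tokens produced by the Transformer block.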

3.4.5. Input Resizing and Dataset Compatibility

The input images, including those from LC25000, which have an original resolution of 768 × 768, are down-sampled to 224 × 224 using bilinear interpolation in the preprocessing step. This resizing ensures compatibility with the Xception backbone and a consistent input resolution across all datasets.
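Bilinear down-sampling can be sketched for a single-channel image as below. The half-pixel coordinate mapping is one common convention and is an assumption, as the paper does not specify which library or alignment rule was used; the 4 × 4 example is hypothetical.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2D image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    resized = []
    for r in range(out_h):
        # Map the output pixel centre back to source coordinates.
        src_r = min(max((r + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        r0 = int(src_r)
        r1 = min(r0 + 1, in_h - 1)
        fr = src_r - r0
        row = []
        for c in range(out_w):
            src_c = min(max((c + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            c0 = int(src_c)
            c1 = min(c0 + 1, in_w - 1)
            fc = src_c - c0
            # Blend horizontally on the two nearest rows, then vertically.
            top = image[r0][c0] * (1 - fc) + image[r0][c1] * fc
            bottom = image[r1][c0] * (1 - fc) + image[r1][c1] * fc
            row.append(top * (1 - fr) + bottom * fr)
        resized.append(row)
    return resized


# Hypothetical 4 x 4 image made of four constant 2 x 2 quadrants.
quadrants = [[0, 0, 10, 10],
             [0, 0, 10, 10],
             [20, 20, 30, 30],
             [20, 20, 30, 30]]
halved = resize_bilinear(quadrants, 2, 2)
```

For an RGB patch the same function would be applied to each channel, for example `resize_bilinear(channel, 224, 224)` on a 768 × 768 channel.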

3.5. Web-Based Deployment and Accessibility

To address the gap between research and clinical practice, the optimized hybrid model is deployed as a publicly available interactive web application on Hugging Face Spaces [27]. The platform demonstrates a live, remote diagnostic solution, as shown in Figure 7.

The web interface allows users to upload high-resolution histopathology image patches, which the backend processes with the pre-trained Xception–CBAM–Transformer pipeline. The system returns per-class probabilistic predictions in real time, providing an easy-to-use second-opinion validation tool without the need for local GPU infrastructure.

3.6. AI-Assisted Language Editing

AI-assisted tools (ChatGPT, GPT-5-mini by OpenAI and Gemini, model 1.5 Pro, by Google) were used solely for language editing and improving manuscript clarity. They did not contribute to the study design, data analysis, model development, or interpretation of results. All scientific decisions and conclusions were made by the authors.

4. Experimental Setup

This section describes the hardware used for training and the hyperparameter configurations employed across all models.

4.1. Hardware Configuration

All experiments were conducted on a workstation with the specifications detailed in Table 11. The deep learning models were created, trained, and evaluated using the TensorFlow and Keras libraries (version 2.10) in conjunction with Python 3.9 [28].

4.2. Hyperparameter Settings

The model was trained on 224 × 224 × 3 inputs for 100 epochs with a batch size of 16 using the Adam optimizer (learning rate 1 × 10⁻⁴) and categorical cross-entropy loss. A dropout rate of 0.5 was applied before the SoftMax layer to reduce overfitting. A 20% validation split was used, and data augmentation included rescaling, shearing, zooming, rotation, width/height shifts, horizontal flips, and brightness modifications. The architecture consists of five residual blocks, each featuring SeparableConv2D and batch normalization, followed by global average pooling and a five-class SoftMax head. The best checkpoint was saved automatically based on validation performance. All hyperparameter settings used in the training process are summarized in Table 12.

4.3. Cross-Validation Strategy and Leakage Considerations

Five-fold cross-validation was employed to evaluate model performance. For datasets containing explicit patient identifiers, grouped cross-validation was applied to prevent patient-level data leakage. In contrast, the LC25000 dataset does not provide patient- or slide-level metadata and consists of augmented images derived from a limited number of original samples, making grouped splitting infeasible. Consequently, standard k-fold cross-validation was used for LC25000, and the reported results reflect image-level classification performance. This evaluation protocol is consistent with prior studies utilizing the LC25000 dataset.
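Grouped cross-validation can be sketched as follows: all patches from one patient are assigned to the same fold, so no patient contributes to both training and validation in any split. The greedy balancing rule and the synthetic patient identifiers are assumptions for illustration; the paper does not describe how groups were assigned to folds.

```python
from collections import defaultdict


def grouped_k_fold(patient_ids, n_splits=5):
    """Yield (train_indices, val_indices) with no patient on both sides."""
    by_patient = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        by_patient[patient].append(index)
    # Greedy balancing: largest patients first, each into the smallest fold so far.
    folds = [[] for _ in range(n_splits)]
    for patient in sorted(by_patient, key=lambda p: (-len(by_patient[p]), p)):
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_patient[patient])
    for f in range(n_splits):
        validation = sorted(folds[f])
        training = sorted(i for g in range(n_splits) if g != f for i in folds[g])
        yield training, validation


# Hypothetical metadata: 50 patches from 10 patients.
patient_ids = ["patient_%02d" % (i % 10) for i in range(50)]
splits = list(grouped_k_fold(patient_ids, n_splits=5))
```

For LC25000, where no such identifiers exist, this grouping step cannot be applied, which is why the text reports image-level results for that dataset.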

5. Result Analysis and Discussion

This section offers an in-depth evaluation of the experimental findings. First, the evaluation metrics are defined; then the detailed performance of the proposed model is presented; finally, the model is compared with the baseline models and existing state-of-the-art research.

5.1. Performance Metrics

Evaluation metrics are crucial for assessing the performance of the model. Accuracy, Precision, Sensitivity (Recall), F1-score, AUC, Loss, Cohen’s Kappa, and Matthews Correlation Coefficient (MCC) were used to evaluate the classification model. In this context, TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively [29,30,31].

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(5)

\mathrm{Precision} = \frac{TP}{TP + FP}

(6)

\mathrm{Recall} = \frac{TP}{TP + FN}

(7)

F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(8)

\text{Cohen's Kappa},\ \kappa = \frac{p_{o} - p_{e}}{1 - p_{e}}

(9)

\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

(10)

where p_{o} is the observed agreement (accuracy) and p_{e} is the expected agreement.
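As a worked illustration of Equations (5)–(10), the sketch below evaluates them for a single binary confusion table. The counts are invented for the example and are not results from this study, the function name `binary_metrics` is hypothetical, and the multi-class case used in the paper would apply the same formulas per class or to the full confusion matrix.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Evaluate Equations (5)-(10) from binary confusion counts.

    Assumes every denominator is non-zero (each class and each
    prediction occurs at least once).
    """
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Expected agreement p_e: probability that prediction and truth agree
    # if they were independent with the observed marginal frequencies.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Illustrative counts only (not taken from the paper).
m = binary_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

For these counts the observed agreement is 0.875 and the expected agreement is 0.5, so Cohen's Kappa evaluates to 0.75.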

5.2. Performance Results

In this subsection, the quantitative and qualitative performance of the proposed hybrid model is analyzed.

5.2.1. Confusion Matrices

The class-wise performance of the proposed model is detailed in the normalized confusion matrices in Figure 8, Figure 9 and Figure 10.

On the CRC-VAL-HE-7K dataset (Figure 8), the model classifies most classes at or near 100%. For instance, ADI, BACK, DEB, LYM, and STR are all classified with >99.6% accuracy. The most difficult class, TUM (Tumor), is still correctly identified 98.8% of the time, with only minor, clinically expected confusion with related tissues such as NORM (0.8%).

On the larger and more complex 9-class NCT-CRC-HE-100K dataset (Figure 9), the model remains robust. It achieves 100% for ADI and LYM. The TUM (Tumor) class, which is notoriously difficult, is correctly classified 98.6% of the time, with slight confusion with other stromal and mucosal classes (MUC, NORM, STR), a common challenge in pathology.

On the 5-class lung and colon dataset, LC25000 (Figure 10), performance is near-perfect. colon_n, lung_n, and lung_scc are all classified with 100% accuracy. The adenocarcinoma classes (colon_aca, lung_aca) are classified with 99.9% accuracy, indicating that the model distinguishes reliably between benign and malignant tissues, as well as between different types of carcinomas.
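For readers unfamiliar with normalized confusion matrices, the following sketch shows one common convention, row normalization, in which each diagonal entry equals the per-class recall. The labels are toy values and the function name is hypothetical; whether the figures use exactly this normalization is an assumption.

```python
import numpy as np


def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix.

    Entry (i, j) is the fraction of class-i samples predicted as class j.
    """
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Leave all-zero rows untouched for classes absent from y_true.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Toy labels (hypothetical): one class-2 sample is mistaken for class 1.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 1]
cm = normalized_confusion(y_true, y_pred, n_classes=3)
print(np.round(cm, 2))
```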

5.2.2. Training and Validation

The training and validation accuracy and training and validation loss for the CRC-VAL-HE-7K dataset are shown in Figure 11 and Figure 12.

The training and validation accuracy and training and validation loss for the NCT-CRC-HE-100K dataset are shown in Figure 13 and Figure 14.

The training and validation accuracy and training and validation loss for the LC25000 dataset are shown in Figure 15 and Figure 16.

5.2.3. Feature Space and ROC Analysis

The discriminative power of the learned features is visualized in Figure 17, Figure 18 and Figure 19 using t-SNE. These plots show clear, well-separated clusters for the different tissue classes.

The ROC curves in Figure 20, Figure 21 and Figure 22 benchmark the proposed model against the baselines. These figures compare the Area Under the Curve (AUC) of the proposed model with that of all six baselines (InceptionV3, NASNetMobile, VGG16, Xception, DenseNet121, MobileNetV2) on the three datasets. The proposed model consistently achieves the highest AUC, approaching 1.0.
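A multi-class AUC of this kind is commonly obtained by averaging one-vs-rest AUCs over the classes. The sketch below does this with scikit-learn's `roc_auc_score` on a toy probability matrix. The data are invented, and the choice of macro averaging is an assumption rather than something the text specifies.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example (hypothetical): 6 samples, 3 classes.
# Each row of `probs` is a softmax-like output that sums to 1.
y_true = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],  # class-2 sample on which class 1 scores highest
])

# One-vs-rest AUC per class, then an unweighted (macro) mean.
macro_auc = roc_auc_score(y_true, probs, multi_class="ovr", average="macro")
print(f"macro one-vs-rest AUC: {macro_auc:.4f}")
```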

5.2.4. Explainability Visualization

To improve the transparency of the proposed framework, explainability analyses were conducted using layer-wise feature visualization and Gradient-weighted Class Activation Mapping (Grad-CAM). These methods expose the hierarchical feature-learning process of the model and show how discriminative regions influence the final classification decisions. The qualitative results in Figure 23 illustrate the layer-by-layer feature extraction of the proposed model.

In the initial convolutional stages, the network mostly captures low-level features such as edges, color variations, and basic texture patterns, so fine histopathological details are retained. Intermediate layers encode progressively more structural information, including glandular structure, tissue organization, and cell arrangement. The representations in deeper layers are more abstract and class-specific, emphasizing pathological patterns such as tumor-infiltrated areas, abnormal glandular structures, and regions of high nuclear density. This hierarchy of abstraction indicates that the model learns meaningful representations rather than superficial image cues.

To examine the decision-making mechanism further, Grad-CAM visualizations were generated for representative samples from the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets, as illustrated in Figure 24.

The Grad-CAM heatmaps highlight the spatial regions that contribute most to the predicted class. The model consistently focuses on pathologically salient structures, including tumor masses, aberrant glandular formations, and densely cellular regions, and assigns little importance to background and staining artefacts. This behavior indicates that the model’s predictions are based on clinically relevant histological features.

Furthermore, the Grad-CAM results show a high level of spatial consistency across datasets, despite differences in staining conditions, image resolution, and data distribution. This consistency suggests strong generalization and stability of the proposed framework. Altogether, combining layer-wise feature visualization with Grad-CAM improves the interpretability and trustworthiness of the model and supports its use in digital pathology.
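The core Grad-CAM computation is small enough to sketch in NumPy: channel weights are the spatial average of the gradients, and the heatmap is the ReLU of the weighted sum of feature maps. In practice the activations and gradients would come from a deep-learning framework's automatic differentiation at a chosen convolutional layer; here they are hand-made toy arrays, and the layer choice used by the authors is not assumed.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Core Grad-CAM computation for one image.

    activations: (H, W, K) feature maps of the chosen convolutional layer.
    gradients:   (H, W, K) gradient of the class score w.r.t. those maps.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Channel weights: global average of the gradients over space.
    weights = gradients.mean(axis=(0, 1))  # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(activations, weights, axes=([2], [0])), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy input (hypothetical): channel 0 has one hot spot and a positive
# gradient; channel 1 is uniform background with a negative gradient.
acts = np.zeros((3, 3, 2))
acts[1, 1, 0] = 1.0
acts[:, :, 1] = 0.1
grads = np.zeros((3, 3, 2))
grads[:, :, 0] = 1.0
grads[:, :, 1] = -0.5

heatmap = grad_cam(acts, grads)
print(heatmap)
```

In this toy case only the hot spot at position (1, 1) survives the ReLU, which mirrors the behavior described above: regions that raise the class score are highlighted and background is suppressed.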

5.3. Comparative Analysis

In this section, the performance of the proposed model is evaluated against pre-trained deep learning baselines and existing work reported in the literature, with per-class performance metrics provided.

Table 13 summarizes the comparative performance of the proposed model and pretrained models on the CRC-VAL-HE-7K dataset. Detailed per-class performance metrics for each histopathological class in this dataset are provided in Table 14.

For the NCT-CRC-HE-100K dataset, overall and per-class performance are provided in Table 15 and Table 16, respectively.

For the LC25000 dataset, overall and per-class performance are provided in Table 17 and Table 18, respectively.

Step-by-step implementation details of the proposed model are shown in Table 19.

Comparisons with prior works are presented in Table 20, Table 21 and Table 22, demonstrating the proposed model’s superior performance across all datasets: Table 20 for NCT-CRC-HE-100K, Table 21 for CRC-VAL-HE-7K, and Table 22 for LC25000.

5.4. Patient-Level Five-Fold Cross-Validation Results

This subsection reports the fold-wise results of the patient-level cross-validation. The per-fold performance metrics are summarized in Table 23.

To evaluate the robustness and generalization capability of the proposed framework while explicitly preventing patient-level data leakage, a five-fold patient-aware cross-validation strategy was employed using GroupKFold. Performance metrics were computed independently for each fold and subsequently aggregated and reported as mean ± standard deviation.

As shown in Table 23, the performance of the proposed model is high in all five folds. Figure 25 shows the fold-wise evaluation results obtained with patient-level cross-validation.

All performance measures vary little across folds, which indicates that the model behaves consistently and generalizes well at the patient level.
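Aggregating fold-wise scores as mean ± standard deviation can be sketched in a few lines. The per-fold values below are placeholders rather than numbers from Table 23, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

# Placeholder per-fold accuracies (hypothetical values, not from Table 23).
fold_accuracy = [0.991, 0.994, 0.989, 0.993, 0.992]


def summarize(values):
    """Return (mean, sample standard deviation) across folds."""
    return mean(values), stdev(values)


avg, sd = summarize(fold_accuracy)
print(f"accuracy: {avg:.4f} ± {sd:.4f}")
```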

5.5. Image Enhancement Ablation Study

The primary aim of the preprocessing phase was to improve image quality and thereby classification performance. A detailed ablation study was carried out to determine the necessity and individual contribution of the proposed image enhancement strategy. The following image enhancement configurations were compared in terms of classification performance:

No preprocessing, where raw images were directly used as model input.
Stain normalization only.
Spatially Adaptive NLM + Edge-Aware Sharpening.
Gamma correction only.
Gamma correction combined with bilateral filtering.
Complete enhancement pipeline consisting of gamma correction, bilateral filtering, and CLAHE.

To ensure a fair and controlled comparison, the same training–validation splits, model architecture, optimization strategy, and hyperparameter settings were maintained for all configurations. This ablation study quantifies the individual and combined contributions of contrast enhancement, noise suppression, and stain normalization, thereby justifying the inclusion of the proposed enhancement pipeline within the overall framework.

A quantitative comparison of the preprocessing strategies is summarized in Table 24, and a visual comparison of the corresponding performance metrics is shown in Figure 26.

As shown in Table 24 and Figure 26, most single-step preprocessing strategies provide limited or inconsistent improvements over using raw images.

Stain normalization and spatially adaptive non-local means (NLM) with edge-aware sharpening yield only marginal gains, indicating that color normalization or smoothing alone is insufficient for robust histopathological discrimination. Gamma correction results in moderate performance improvements, which are further enhanced by the addition of bilateral filtering due to improved noise suppression while preserving tissue boundaries.

In contrast, the complete preprocessing pipeline combining gamma correction, bilateral filtering, and contrast-limited adaptive histogram equalization (CLAHE) consistently achieves the best performance across all evaluation metrics, as clearly depicted in Figure 26. These findings confirm that the proposed preprocessing pipeline plays a critical role in enhancing diagnostically relevant features and improving the robustness of the proposed hybrid deep learning framework.

6. Discussion

6.1. Performance Analysis and Comparison with Existing Methods

Although deep learning has been widely applied to histopathological cancer detection, the proposed framework outperforms current state-of-the-art models on several datasets and evaluation metrics. This is mainly due to the synergistic combination of convolutional neural networks, attention mechanisms, and Transformer-based global context modeling.

The Xception backbone captures multi-scale spatial features efficiently and retains computational efficiency through depth-wise separable convolutions. In contrast to traditional CNN-based methods, the Convolutional Block Attention Module (CBAM) enables the network to focus on diagnostically relevant regions along both the spatial and channel dimensions, reducing the influence of background noise and staining artefacts. In addition, the Transformer block encodes long-range tissue dependencies, which is essential for histopathology images, where cancer-related patterns tend to be spatially distributed rather than confined to localized areas.

Compared to standalone CNNs and existing hybrid models, the proposed model is more robust to inter-dataset variation, including differences in staining procedures, magnification factors, and tissue morphology. These properties explain its higher accuracy, precision, recall, and F1-score on the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets.
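To make the attention step concrete, the sketch below implements only the channel-attention half of a CBAM-style block in NumPy, following the published CBAM formulation: average- and max-pooled channel descriptors pass through a shared two-layer MLP, are summed, and gate the channels through a sigmoid. The weights are random, the reduction ratio is an assumed value, biases are omitted, and the spatial-attention half is left out, so this does not reproduce the trained module of the proposed model.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feature_map, w1, w2):
    """CBAM-style channel attention for one (H, W, C) feature map.

    w1: (C, C // r) and w2: (C // r, C) are the shared MLP weights.
    """
    avg_desc = feature_map.mean(axis=(0, 1))  # (C,) average-pooled descriptor
    max_desc = feature_map.max(axis=(0, 1))   # (C,) max-pooled descriptor

    def mlp(v):
        return np.maximum(v @ w1, 0.0) @ w2   # ReLU hidden layer, no biases

    scale = sigmoid(mlp(avg_desc) + mlp(max_desc))  # (C,), each value in (0, 1)
    return feature_map * scale                      # broadcast over H and W


rng = np.random.default_rng(0)
C, r = 16, 4  # channel count and reduction ratio (assumed values)
fmap = rng.normal(size=(8, 8, C))
w1 = rng.normal(scale=0.1, size=(C, C // r))
w2 = rng.normal(scale=0.1, size=(C // r, C))
refined = channel_attention(fmap, w1, w2)
print(refined.shape)
```

Because every gate lies strictly between 0 and 1, the module can only attenuate channels relative to one another; with learned weights, less informative channels would be scaled down the most.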

6.2. Design Insights for Cancer Detection Models

The study yields several conclusions on how to design effective models for histopathological cancer diagnosis. First, combining local feature extraction through convolutional neural networks (CNNs) with global context modeling through Transformer networks substantially improves the discrimination of complex tissue patterns. Second, attention mechanisms such as the Convolutional Block Attention Module (CBAM) can improve classification accuracy and also aid interpretability by directing the network to clinically meaningful parts of the image. Third, including explainability methods in model evaluation builds trust in the predictions and eases clinical validation.

These results suggest that future cancer-detection systems should favor hybrid designs that balance performance, interpretability, and generalization, rather than simply deeper or wider CNN models.

6.3. Advantages and Limitations of the Proposed Model

The proposed framework has several significant benefits. It achieves better classification performance than existing methods on several benchmark datasets and generalizes well. The attention-enhanced architecture improves both accuracy and interpretability, and the modular design allows the architecture to be adapted to other histopathology and medical imaging tasks.

Nevertheless, the model also has some limitations. The attention and Transformer modules require more computational power and training time than lightweight CNN models. In addition, Grad-CAM gives only coarse spatial explanations and may not support finer-grained, cellular-level reasoning. Finally, the analysis relies on curated public datasets, which may not capture the full clinical variability encountered in practice.

Future work could improve performance and applicability through model compression, knowledge distillation, self-supervised learning, and the integration of multimodal data. Improving explainability with concept-based or multi-level interpretation methods is another promising research avenue.

6.4. Clinical Implications and Application Scenarios

From both a clinical and a methodological perspective, the proposed framework shows promise as a computer-aided diagnostic support tool for histopathological cancer analysis. The high and consistent results on colorectal and lung cancer data indicate that the model reliably detects tissue patterns that are diagnostically significant and commonly used in standard pathological evaluation.

In clinical workflows, the model could serve as a second-reader system that helps pathologists identify regions of interest for further study, thereby reducing diagnostic variability in high-volume settings. For clinical interpretability, the Grad-CAM visualizations provide clear visual explanations that can be related to known histopathological characteristics.

The proposed approach is not intended to replace expert clinical judgment, but it may improve diagnostic consistency and workflow efficiency. Future research should focus on prospective clinical validation and integration with digital pathology systems to enable real-world use.

7. Conclusions

This research introduces an innovative, multi-phase hybrid deep learning framework aimed at the automated classification of colorectal and lung cancers from histopathology images. The approach addresses major limitations of traditional models by utilizing a validated image-enhancement technique, the local feature-extraction strengths of Xception, the feature-refinement functions of the CBAM attention mechanism, and the global context modeling provided by a Transformer block.

The proposed architecture achieved state-of-the-art results, reaching accuracy levels of 99.58%, 99.29%, and 99.98% on the CRC-VAL-HE-7K, NCT-CRC-HE-100K, and LC25000 datasets, respectively. The performance was validated both quantitatively and qualitatively using confusion matrices, ROC curves, and t-SNE visualizations, which showed strong discriminative ability. The combined advantages of the architectural components were also confirmed by ablation studies, whereas explainability analyses with Grad-CAM and layer-wise feature maps highlighted the transparency and reliability of the model.

The trained model was deployed as an accessible web tool to ease practical use. In summary, this study presents an accurate, interpretable, and comprehensive framework aimed at helping pathologists with the critical task of cancer diagnosis.

Author Contributions

Conceptualization: S.S., M.S.H., and M.B.A.M.; data curation, formal analysis, investigation, methodology: S.S., M.S.H., M.B.A.M., M.F.A.M., and Z.M.; funding acquisition, project administration: Z.M., M.B.A.M., and M.F.A.M.; resources, software: S.S., M.S.H., and M.B.A.M.; validation, visualization: M.B.A.M., A.A., M.F.A.M., and Z.M.; writing—original draft: S.S., M.S.H., and M.B.A.M.; writing—review and editing: M.B.A.M., A.A., Z.M., and M.F.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are publicly available. The CRC-VAL-HE-7K and NCT-CRC-HE-100K datasets are accessible from the publicly released collection by Kather et al. [13]. The LC25000 dataset is available from the publicly released “Lung and Colon Cancer” dataset by Borkowski et al. [14]. All datasets can be obtained from their respective open-access sources without restrictions.

Acknowledgments

The authors acknowledge the use of ChatGPT (OpenAI) and Gemini (Google) for language refinement and editorial assistance during manuscript preparation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

XAI	Explainable Artificial Intelligence
CRC	Colorectal Cancer
SOTA	State-of-the-art
IQA	Image Quality Assessment
PSNR	Peak Signal-to-Noise Ratio
t-SNE	t-distributed Stochastic Neighbor Embedding
H&E	Hematoxylin and Eosin
CAD	Computer-Aided Diagnosis
CNN	Convolutional Neural Network
GELU	Gaussian Error Linear Unit

References

Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA A Cancer J. Clin. 2021, 71, 209–249.
Bera, K.; Schalper, K.A.; Rimm, D.L.; Velcheti, V.; Madabhushi, A. Artificial intelligence in digital pathology—New tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 2019, 16, 703–715.
Elmore, J.G.; Longton, G.M.; Carney, P.A.; Geller, B.M.; Onega, T.; Tosteson, A.N.; Nelson, H.D.; Pepe, M.S.; Allison, K.H.; Schnitt, S.J.; et al. Diagnostic Concordance Among Pathologists Interpreting Breast Biopsy Specimens. JAMA 2015, 313, 1122.
Komura, D.; Ishikawa, S. Machine Learning Methods for Histopathological Image Analysis. Comput. Struct. Biotechnol. J. 2018, 16, 34–42.
Targ, S.; Almeida, D.; Enlitic, K.L. Resnet in Resnet: Generalizing Residual Architectures. March 2016. Available online: https://arxiv.org/pdf/1603.08029 (accessed on 13 January 2026).
Zhu, Y.; Newsam, S. DenseNet for dense flow. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; Volume 2017, pp. 790–794.
Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
Colling, R.; Pitman, H.; Oien, K.; Rajpoot, N.; Macklin, P.; CM-Path AI in Histopathology Working Group; Snead, D.; Sackville, T.; Verrill, C. Artificial intelligence in digital pathology: A roadmap to routine use in clinical practice. J. Pathol. 2019, 249, 143–150.
Vahadane, A.; Peng, T.; Sethi, A.; Albarqouni, S.; Wang, L.; Baust, M.; Steiger, K.; Schlitter, A.M.; Esposito, I.; Navab, N. Structure-Preserving Color Normalization and Sparse Stain Separation for Histological Images. IEEE Trans. Med. Imaging 2016, 35, 1962–1971.
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
Zhou, T.; Niu, Y.; Lu, H.; Peng, C.; Guo, Y.; Zhou, H. Vision transformer: To discover the ‘four secrets’ of image patches. Inf. Fusion 2024, 105, 102248.
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
Kather, J.N.; Krisam, J.; Charoentong, P.; Luedde, T.; Herpel, E.; Weis, C.A.; Gaiser, T.; Marx, A.; Valous, N.A.; Ferber, D.; et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS Med. 2019, 16, e1002730.
Borkowski, A.A.; Bui, M.M.; Thomas, L.B.; Wilson, C.P.; DeLand, L.A.; Mastorides, S.M. Lung and Colon Cancer Histopathological Image Dataset (LC25000). December 2019. Available online: https://arxiv.org/pdf/1912.12142 (accessed on 13 January 2026).
Chen, R.J.; Lu, M.Y.; Wang, J.; Williamson, D.F.; Rodig, S.J.; Lindeman, N.I.; Mahmood, F. Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis. IEEE Trans. Med. Imaging 2022, 41, 757–770.
Ke, Q.; Yap, W.S.; Tee, Y.K.; Hum, Y.C.; Zheng, H.; Gan, Y.J. Advanced deep learning for multi-class colorectal cancer histopathology: Integrating transfer learning and ensemble methods. Quant. Imaging Med. Surg. 2025, 15, 2329–2346.
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24.
Miah, M.B.A.; Yousuf, M.A. Detection of lung cancer from CT image using image processing and neural network. In Proceedings of the 2nd International Conference on Electrical Engineering and Information and Communication Technology (iCEEiCT), Savar, Bangladesh, 21–23 May 2015.
Xu, H.; Xu, Q.; Cong, F.; Kang, J.; Han, C.; Liu, Z.; Madabhushi, A.; Lu, C. Vision Transformers for Computational Histopathology. IEEE Rev. Biomed. Eng. 2024, 17, 63–79.
Zeynali, A.; Tinati, M.A.; Tazehkand, B.M. Hybrid CNN-Transformer Architecture With Xception-Based Feature Enhancement for Accurate Breast Cancer Classification. IEEE Access 2024, 12, 189477–189493.
Dunn, C.; Brettle, D.; Hodgson, C.; Hughes, R.; Treanor, D. An international study of stain variability in histopathology using qualitative and quantitative analysis. J. Pathol. Inform. 2025, 17, 100423.
Kainz, B.; Heinrich, M.P.; Makropoulos, A.; Oppenheimer, J.; Mandegaran, R.; Sankar, S.; Deane, C.; Mischkewitz, S.; Al-Noor, F.; Rawdin, A.C.; et al. Non-invasive diagnosis of deep vein thrombosis from ultrasound imaging with machine learning. NPJ Digit. Med. 2021, 4, 137.
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018.
Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; Available online: https://arxiv.org/pdf/1409.1556 (accessed on 13 January 2026).
Ai Colon Cancer Predictions—A Hugging Face Space by Safayat12. Available online: https://huggingface.co/spaces/safayat12/Ai_Colon_Cancer_predictions (accessed on 13 January 2026).
Hossain, M.S.; Juthy, M.J.N.; Miah, M.B.A.; Awang, S.; Hossain, M.N.; Bhuiyan, E. AlzCNN: A Custom CNN Architecture for Alzheimer’s Stage Detection from MRI Images. In Proceedings of the 2025 IEEE 9th International Conference on Software Engineering & Computer Systems (ICSECS), Pekan, Pahang, Malaysia, 15–16 October 2025; pp. 359–364.
Hossain, M.N.; Bhuiyan, E.; Miah, M.B.A.; Sifat, T.A.; Muhammad, Z.; Masud, M.F.A. Detection and Classification of Kidney Disease from CT Images: An Automated Deep Learning Approach. Technologies 2025, 13, 508.
Miah, M.B.A.; Awang, S.; Rahman, M.M.; Hosen, A.S.M.S.; Ra, I.H. Keyphrases Frequency Analysis From Research Articles: A Region-Based Unsupervised Novel Approach. IEEE Access 2022, 10, 120838–120849.
Rahman, M.A.; Miah, M.B.A.; Hossain, M.A.; Hosen, A.S. Enhanced Brain Tumor Classification Using MobileNetV2: A Comprehensive Preprocessing and Fine-Tuning Approach. BioMedInformatics 2025, 5, 30.
Kumar, A.; Vishwakarma, A.; Bajaj, V. CRCCN-Net: Automated framework for classification of colorectal tissue using histopathological images. Biomed. Signal Process. Control. 2023, 79, 104172.
Qin, Z.; Sun, W.; Guo, T.; Lu, G. Colorectal cancer image recognition algorithm based on improved transformer. Discov. Appl. Sci. 2024, 6, 422.
Martínez-Fernandez, E.; Rojas-Valenzuela, I.; Valenzuela, O.; Rojas, I. Computer Aided Classifier of Colorectal Cancer on Histopatological Whole Slide Images Analyzing Deep Learning Architecture Parameters. Appl. Sci. 2023, 13, 4594.
Ghosh, S.; Bandyopadhyay, A.; Sahay, S.; Ghosh, R.; Kundu, I.; Santosh, K.C. Colorectal Histology Tumor Detection Using Ensemble Deep Neural Network. Eng. Appl. Artif. Intell. 2021, 100, 104202.
Jiang, L.; Huang, S.; Luo, C.; Zhang, J.; Chen, W.; Liu, Z. An improved multi-scale gradient generative adversarial network for enhancing classification of colorectal cancer histological images. Front. Oncol. 2023, 13, 1240645.
Hajsalem, I.D.; Ayed, Y.B. Detecting early gastrointestinal polyps in histology and endoscopy images using deep learning. Front. Artif. Intell. 2025, 8, 1571075.
Anju, T.E.; Vimala, S. Finetuned-VGG16 CNN Model for Tissue Classification of Colorectal Cancer. Lect. Notes Netw. Syst. 2023, 665, 73–84.
Azar, A.T.; Tounsi, M.; Fati, S.M.; Javed, Y.; Amin, S.U.; Khan, Z.I.; Alsenan, S.; Ganesan, J. Automated System for Colon Cancer Detection and Segmentation Based on Deep Learning Techniques. Int. J. Sociotechnology Knowl. Dev. 2023, 15, 1–28.
El-Ghany, S.A.; Azad, M.; Elmogy, M. Robustness Fine-Tuning Deep Learning Model for Cancers Diagnosis Based on Histopathology Image Analysis. Diagnostics 2023, 13, 699.
Hasan, M.A.; Haque, F.; Sabuj, S.R.; Sarker, H.; Goni, M.O.F.; Rahman, F.; Rashid, M.M. An End-to-End Lightweight Multi-Scale CNN for the Classification of Lung and Colon Cancer with XAI Integration. Technologies 2024, 12, 56.
Masud, M.; Sikder, N.; Nahid, A.A.; Bairagi, A.K.; AlZain, M.A. A Machine Learning Approach to Diagnosing Lung and Colon Cancer Using a Deep Learning-Based Classification Framework. Sensors 2021, 21, 748.
Provath, M.A.M.; Deb, K.; Dhar, P.K.; Shimamura, T. Classification of Lung and Colon Cancer Histopathological Images Using Global Context Attention Based Convolutional Neural Network. IEEE Access 2023, 11, 110164–110183.
Alotaibi, M.; Alshardan, A.; Maashi, M.; Asiri, M.M.; Alotaibi, S.R.; Yafoz, A.; Alsini, R.; Khadidos, A.O. Exploiting histopathological imaging for early detection of lung and colon cancer via ensemble deep learning model. Sci. Rep. 2024, 14, 20434.
Said, M.M.R.; Islam, M.S.B.; Sumon, M.S.I.; Vranic, S.; Al Saady, R.M.; Alqahtani, A.; Chowdhury, M.E.H.; Pedersen, S. Innovative Deep Learning Architecture for the Classification of Lung and Colon Cancer From Histopathology Images. Appl. Comput. Intell. Soft Comput. 2024, 2024, 5562890.
Uddin, A.H.; Chen, Y.L.; Akter, M.R.; Ku, C.S.; Yang, J.; Por, L.Y. Colon and lung cancer classification from multi-modal images using resilient and efficient neural network architectures. Heliyon 2024, 10, e30625.
Vanitha, K.; R, M.T.; Sree, S.S.; Guluwadi, S. Deep learning ensemble approach with explainable AI for lung and colon cancer classification using advanced hyperparameter tuning. BMC Med. Inform. Decis. Mak. 2024, 24, 222.
El-Aziz, A.A.A.; Mahmood, M.A.; El-Ghany, S.A. Advanced Deep Learning Fusion Model for Early Multi-Classification of Lung and Colon Cancer Using Histopathological Images. Diagnostics 2024, 14, 2274.

Figure 1. System architecture of the proposed hybrid classification model, including the preprocessing, feature extraction, and classification stages.

Figure 2. Representative sample images from each class in the (a) CRC-VAL-HE-7K, (b) NCT-CRC-HE-100K, and (c) LC25000 datasets.

Figure 3. Image enhancement process applied to histopathological samples, including gamma correction, bilateral filtering, and CLAHE to improve contrast and tissue structure visibility.

Figure 4. Normal versus enhanced histopathological images from the (a) CRC-VAL-HE-7K, (b) NCT-CRC-HE-100K, and (c) LC25000 datasets, demonstrating the visual improvements introduced by the enhancement pipeline.

Figure 5. Overall architecture of the proposed hybrid model: (a) Xception backbone for hierarchical feature extraction, (b) Convolutional Block Attention Module (CBAM) for spatial–channel attention refinement, (c) Transformer block for capturing long-range spatial dependencies, and (d) Classification head for final prediction across all target classes.

Figure 6. Proposed model architecture with attention: (a) Residual block, (b) Multiheaded self-attention, (c) FFN network.

Figure 7. The web-based Graphical User Interface (GUI) deployed for model inference and accessibility.

Figure 8. Confusion matrix of the proposed model on the CRC-VAL-HE-7K dataset.

Figure 9. Confusion matrix of the proposed model on the NCT-CRC-HE-100K dataset.

Figure 10. Confusion matrix of the proposed model on the LC25000 dataset.

Figure 11. The proposed model’s training and validation accuracy for CRC-VAL-HE-7K.

Figure 12. The proposed model’s training and validation loss for CRC-VAL-HE-7K.

Figure 13. The proposed model’s training and validation accuracy for NCT-CRC-HE-100K.

Figure 14. The proposed model’s training and validation loss for NCT-CRC-HE-100K.

Figure 15. The proposed model’s training and validation accuracy for LC25000.

Figure 16. The proposed model’s training and validation loss for LC25000.

Figure 17. t-SNE visualization of deep feature representations (CRC-VAL-HE-7K).

Figure 18. t-SNE visualization of deep feature representations (NCT-CRC-HE-100K).

Figure 19. t-SNE visualization of deep feature representations (LC25000).

Figure 20. ROC curve for the proposed model (CRC-VAL-HE-7K).

Figure 21. ROC curve for the proposed model (NCT-CRC-HE-100K).

Figure 22. ROC curve for the proposed model (LC25000).

Figure 23. The proposed model’s layer-wise feature extraction process on the input images.

Figure 24. The proposed model’s Grad-CAM visualization across three different datasets: (a) CRC-VAL-HE-7K; (b) NCT-CRC-HE-100K; and (c) LC25000.

Figure 25. Fold-wise performance metrics of the proposed model on the LC25000 dataset.

Figure 26. Ablation analysis of different preprocessing strategies on the CRC-VAL-HE-7K dataset.

Table 1. Comparison of existing histopathological image classification approaches, highlighting their advantages and limitations.

Model Category	Advantages	Limitations
Standard CNNs [16,17,18]	Excellent at extracting hierarchical local features (cell, nucleus morphology). Computationally efficient.	Limited effective receptive field fails to capture global tissue context. Highly sensitive to stain and color variations.
Pure Transformers [19]	Excellent at modeling global context and long-range spatial relationships.	Requires massive training datasets. May lose fine-grained local texture details captured by CNNs.
Standard Hybrid Models [15,20]	Combines CNN local feature power with Transformer global context (represents current SOTA).	Can be overly complex. Often lacks a validated preprocessing stage and may not use fine-grained, low-level attention mechanisms (CBAM/SE).
Stain Invariance Networks [21]	Explicitly minimizes the variability caused by inconsistent H&E staining, improving generalization across different scanning centers.	Primary focus on color consistency may neglect morphological feature enhancement. Still reliant on local feature extractors.
Multiple Instance Learning (MIL) [22]	Handles extremely large Whole-Slide Images (WSIs) by aggregating information from numerous small patches. Often includes a form of patch-level attention.	Aggregation layer may lose crucial spatial relationships between patches. Computationally intensive due to sequential tile processing.
Proposed Model (Enhancement + Xception–CBAM–Transformer)	Designed to solve all listed limitations: (1) Preprocessing handles stain variance. (2) Xception captures local detail. (3) CBAM refines feature focus. (4) Transformer models global context.	-

Table 2. Class distribution of the CRC-VAL-HE-7K dataset following an 80–20% train–validation split.

Class	Total	Training (80%)	Validation (20%)
ADI	1338	1070	268
BACK	847	678	169
DEB	339	271	68
LYM	634	507	127
MUC	1035	828	207
MUS	592	474	118
NORM	741	593	148
STR	421	337	84
TUM	1233	986	247
Total	7180	5744	1436
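As a quick consistency check on the split in Table 2, the following sketch recomputes the per-class 80–20 counts from the Total column. It assumes that each class is split independently and that the training share is rounded to the nearest whole image; the helper name split_counts and this rounding rule are illustrative assumptions, not the authors' published procedure.

```python
# Per-class image counts as listed in the Total column of Table 2.
CLASS_TOTALS = {
    "ADI": 1338, "BACK": 847, "DEB": 339, "LYM": 634, "MUC": 1035,
    "MUS": 592, "NORM": 741, "STR": 421, "TUM": 1233,
}


def split_counts(n_images, train_num=4, train_den=5):
    """Return (train, validation) sizes for one class.

    The training share train_num/train_den is rounded to the nearest whole
    image using integer arithmetic (assumed rule), and the remainder goes
    to validation.
    """
    n_train = (2 * train_num * n_images + train_den) // (2 * train_den)
    return n_train, n_images - n_train


splits = {name: split_counts(n) for name, n in CLASS_TOTALS.items()}
total_train = sum(train for train, _ in splits.values())
total_val = sum(val for _, val in splits.values())

for name, (train, val) in splits.items():
    print(f"{name:5s} train={train:5d} val={val:4d}")
print("total", total_train + total_val, "train", total_train, "val", total_val)
```

Under this assumption the per-class pairs reproduce the Training and Validation columns row by row, and the printed sums give the dataset-level figures.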

Table 3. Class distribution of the NCT-CRC-HE-100K dataset following an 80–20% train–validation split.

Class	Total	Training (80%)	Validation (20%)
ADI	15,020	12,016	3004
BACK	10,566	8453	2113
DEB	11,512	9210	2302
LYM	11,557	9246	2311
MUC	8896	7117	1779
MUS	13,536	10,829	2707
NORM	8763	7010	1753
STR	10,446	8357	2089
TUM	14,317	11,454	2863
Total	104,113	83,192	20,921

Table 4. Class distribution of the LC25000 dataset following an 80–20% train–validation split.

Class	Total	Training (80%)	Validation (20%)
colon_aca	5000	4000	1000
colon_bnt	5000	4000	1000
lung_aca	5000	4000	1000
lung_scc	5000	4000	1000
lung_bnt	5000	4000	1000
Total	25,000	20,000	5000

Table 5. Average image quality metrics of original images in the CRC-VAL-HE-7K dataset.

Class	Entropy	SI	PSNR (dB)	IQI
ADI	5.3087	13.5822	100.00	1.0
BACK	3.9238	4.9472	100.00	1.0
DEB	6.8641	18.8359	100.00	1.0
LYM	7.3665	24.9039	100.00	1.0
MUC	6.9295	20.7820	100.00	1.0
MUS	6.8906	17.7147	100.00	1.0
NORM	7.4485	17.6426	100.00	1.0
STR	7.0325	19.4184	100.00	1.0
TUM	7.1196	15.6653	100.00	1.0

Table 6. Average image quality metrics of enhanced images in the CRC-VAL-HE-7K dataset.

Class	Entropy	SI	PSNR (dB)	IQI
ADI	5.9625	14.0326	26.9067	0.9943
BACK	5.1629	4.6107	18.8282	0.9316
DEB	7.4049	21.5364	21.9243	0.9744
LYM	7.7233	29.3981	21.9710	0.9750
MUC	7.3384	20.7820	23.4738	0.9814
MUS	7.3787	19.8815	22.2676	0.9813
NORM	7.7302	20.4912	21.9938	0.9749
STR	7.4978	22.5964	21.1507	0.9762
TUM	7.5359	17.6582	21.7172	0.9762
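Two of the metrics in Tables 5 and 6 can be illustrated with a short, dependency-free sketch: Shannon entropy of the intensity histogram and PSNR between an original and an enhanced image. The function names, the toy pixel lists, and in particular the rule of returning 100 dB when the two images are identical are assumptions made here to mirror the 100.00 entries reported for original images; the authors' exact implementation is not given in this section, and SI and IQI are not covered.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Entropy in bits of a flat sequence of pixel intensities."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def psnr(reference, test, max_value=255.0, cap=100.0):
    """PSNR in dB between two equally sized flat pixel sequences.

    Identical images have zero mean squared error, so the ratio is
    undefined; returning `cap` in that case is an assumption.
    """
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return cap
    return min(cap, 10.0 * math.log10(max_value ** 2 / mse))


# Toy 4-pixel "images" (hypothetical values, not taken from the datasets).
original = [0, 64, 128, 255]
enhanced = [10, 74, 138, 245]

print("entropy:", shannon_entropy(original))
print("PSNR original vs original:", psnr(original, original))
print("PSNR original vs enhanced:", round(psnr(original, enhanced), 2))
```

A higher entropy after enhancement indicates a richer intensity distribution, while a PSNR in the 17–27 dB range indicates that the enhanced image differs measurably from the original, which is expected when contrast is deliberately changed.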

Table 7. Average image quality metrics of original images in the NCT-CRC-HE-100K dataset.

Class	Entropy	SI	PSNR (dB)	IQI
ADI	5.1162	16.2056	100.00	1.0
BACK	3.7543	4.9802	100.00	1.0
DEB	6.7772	19.1007	100.00	1.0
LYM	7.3677	21.6227	100.00	1.0
MUC	7.0829	17.7169	100.00	1.0
MUS	6.8061	18.9369	100.00	1.0
NORM	7.3585	19.1077	100.00	1.0
STR	6.9971	20.0641	100.00	1.0
TUM	7.1653	20.4176	100.00	1.0

Table 8. Average image quality metrics of enhanced images in the NCT-CRC-HE-100K dataset.

Class	Entropy	SI	PSNR (dB)	IQI
ADI	5.7392	16.4826	26.8352	0.9960
BACK	5.0444	4.6428	22.3114	0.9522
DEB	7.2333	21.5301	22.8247	0.9813
LYM	7.7413	27.3297	22.0819	0.9744
MUC	7.4602	19.4021	22.5264	0.9827
MUS	7.2743	21.3647	22.6810	0.9833
NORM	7.6924	22.1051	22.0992	0.9773
STR	7.4583	23.2537	22.1608	0.9812
TUM	7.5758	23.6460	21.9519	0.9793

Table 9. Average image quality metrics of original images in the LC25000 dataset.

Class	Entropy	SI	PSNR (dB)	IQI
colon_aca	7.0919	28.2666	100.00	1.0
colon_bnt	7.1505	28.0476	100.00	1.0
lung_aca	7.0451	14.2314	100.00	1.0
lung_bnt	6.6160	13.7855	100.00	1.0
lung_scc	6.7575	13.9977	100.00	1.0

Table 10. Average image quality metrics of enhanced images in the LC25000 dataset.

Class	Entropy	SI	PSNR (dB)	IQI
colon_aca	7.6567	34.2878	17.2238	0.9511
colon_bnt	7.7199	32.7993	16.1384	0.9375
lung_aca	7.5795	19.8720	21.1361	0.9680
lung_bnt	7.2235	18.2911	21.6530	0.9780
lung_scc	7.4720	20.4731	20.7948	0.9687

Table 11. Hardware specifications used for training the proposed model.

Hardware Specifications	Details
Platform	Jupyter Notebook
Processor	AMD Ryzen 5 3600
Memory (RAM)	64 GB
Operating System	Ubuntu 23.10, 64 bit
Graphics Card	NVIDIA GeForce GTX 1660 (VRAM 6 GB)

Table 12. Hyperparameters used in training the proposed model.

Hyperparameter	Value
Input image size	224 × 224 × 3
Number of classes	5 (LC25000), 9 (CRC-VAL-HE-7K), and 9 (NCT-CRC-HE-100K)
Batch size	16
Number of epochs	50, 100, and 100
Backbone	Xception
Frozen layers	First 100 layers
Attention module	CBAM (Channel + Spatial)
Token embedding dimension	128
Number of tokens	H × W (CNN feature map)
Positional encoding	Learned
Transformer encoder blocks	2
Attention heads	4
FFN hidden dimension	256
Transformer dropout	0.1
Classifier dropout	0.3
Optimizer	Adam (learning rate 1 × 10⁻⁴)
Loss function	Categorical Cross-entropy
Pooling layer	Global Average Pooling
Final activation	Softmax
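The settings in Table 12 can be collected in a small configuration object, which also makes two derived quantities explicit: the per-head dimension of the self-attention layers and the number of tokens produced by flattening the CNN feature map. This is only a sketch. The class and field names are hypothetical, the value 1 × 10⁻⁴ is read here as the Adam learning rate, and the 7 × 7 feature map used in the example assumes a backbone with an overall stride of 32 on a 224 × 224 input, which this section does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridConfig:
    """Hyperparameters copied from Table 12; field names are illustrative."""
    input_size: int = 224
    batch_size: int = 16
    frozen_layers: int = 100
    embed_dim: int = 128
    encoder_blocks: int = 2
    num_heads: int = 4
    ffn_dim: int = 256
    transformer_dropout: float = 0.1
    classifier_dropout: float = 0.3
    learning_rate: float = 1e-4  # assumed meaning of "Adam (1 × 10⁻⁴)"

    def head_dim(self):
        """Width of each attention head; embed_dim must split evenly."""
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads

    def num_tokens(self, feature_map_hw):
        """Tokens obtained by flattening an H × W feature map."""
        height, width = feature_map_hw
        return height * width


cfg = HybridConfig()
print("per-head dimension:", cfg.head_dim())
# Assumed 7 × 7 feature map (stride-32 backbone on a 224 × 224 input).
print("tokens for a 7 × 7 map:", cfg.num_tokens((7, 7)))
```

Keeping the derived values next to the raw hyperparameters makes it easy to spot incompatible choices, such as an embedding dimension that is not divisible by the number of heads.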

Table 13. Comparative performance of pre-trained models and the proposed model on CRC-VAL-HE-7K.

Model	Accuracy	Precision	Recall	F1-Score	MCC	Kappa
DenseNet-121	99.26	99.07	99.10	99.30	99.38	99.37
MobileNetV2	97.76	96.73	96.89	96.81	96.60	96.61
Xception	99.20	98.77	98.84	98.80	98.78	98.79
InceptionV3	97.27	97.14	97.89	96.38	96.63	95.82
VGG-16	98.18	97.54	96.70	97.12	97.08	97.09
NASNetMobile	94.35	95.08	93.30	94.19	98.32	93.86
Proposed Model	99.58	99.10	99.00	99.40	99.40	99.40

Table 14. Per-class performance metrics for each histopathological class on CRC-VAL-HE-7K.

Class	Precision	Recall	F1-Score	Per-Class Acc.	MCC	Support
ADI	1.0000	0.9925	0.9962	0.9986	0.9954	268
BACK	1.0000	1.0000	1.0000	1.0000	1.0000	169
DEB	1.0000	0.9851	0.9925	0.9993	0.9921	68
LYM	1.0000	0.9921	0.9960	0.9993	0.9956	127
MUC	1.0000	0.9952	0.9976	0.9993	0.9972	207
MUS	1.0000	0.9831	0.9915	0.9986	0.9907	118
NORM	1.0000	0.9932	0.9966	0.9993	0.9962	148
STR	0.9545	1.0000	0.9767	0.9972	0.9756	84
TUM	0.9840	1.0000	0.9919	0.9972	0.9903	247
Macro Avg	0.9932	0.9935	0.9932	0.9988	0.9926	–
Weighted Avg	0.9946	0.9944	0.9944	0.9987	0.9937	1436
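The per-class columns of Table 14 follow from one-vs-rest counts of true positives, false positives, false negatives, and true negatives. The sketch below shows the standard formulas and applies them to the STR row. The counts used (84 true positives, 4 false positives, no false negatives, 1348 true negatives) are back-calculated here from the reported precision, recall, and support, together with the sum of the Support column (1436); they are an illustrative reconstruction, not values taken from the confusion matrix in Figure 8.

```python
import math


def one_vs_rest_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, accuracy, and MCC for a single class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Assumed counts for the STR class (see the note above).
str_row = one_vs_rest_metrics(tp=84, fp=4, fn=0, tn=1348)
for name, value in str_row.items():
    print(f"{name:9s} {value:.4f}")
```

With these assumed counts, the rounded outputs agree with the STR row (0.9545, 1.0000, 0.9767, 0.9972, 0.9756), which suggests that the small precision drop for STR corresponds to only a handful of misassigned patches.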

Table 15. Comparative performance of pre-trained models and the proposed model on NCT-CRC-HE-100K.

Model	Accuracy	Precision	Recall	F1-Score	MCC	Kappa
DenseNet-121	96.81	96.81	96.81	96.81	96.40	96.40
MobileNetV2	96.30	96.25	96.29	96.29	95.83	95.82
Xception	98.16	98.12	98.18	98.14	97.92	97.92
InceptionV3	95.20	95.17	95.13	95.14	94.58	94.58
VGG-16	97.02	96.94	97.06	96.98	96.65	96.64
NASNetMobile	94.20	94.16	94.10	94.12	93.46	93.45
Proposed Model	99.33	99.27	99.26	99.27	99.17	99.17

Table 16. Per-class performance metrics for each histopathological class on NCT-CRC-HE-100K.

Class	Precision	Recall	F1-Score	Per-Class Acc.	MCC	Support
ADI	0.9981	0.9981	0.9981	0.9996	0.9979	3004
BACK	0.9991	0.9986	0.9988	0.9997	0.9987	2113
DEB	0.9858	0.9952	0.9905	0.9978	0.9893	2302
LYM	0.9996	0.9944	0.9970	0.9993	0.9966	2311
MUC	0.9977	0.9826	0.9901	0.9982	0.9892	1770
MUS	0.9937	0.9948	0.9943	0.9984	0.9934	2707
NORM	0.9897	0.9897	0.9897	0.9982	0.9887	1753
STR	0.9871	0.9885	0.9878	0.9974	0.9864	2089
TUM	0.9882	0.9923	0.9902	0.9972	0.9886	2863
Macro Avg	0.9932	0.9927	0.9929	0.9984	0.9921	–
Weighted Avg	0.9930	0.9930	0.9930	0.9984	0.9921	20,921

Table 17. Comparative performance of pre-trained models and the proposed model on the LC25000 dataset.

Model	Accuracy	Precision	Recall	F1-Score	MCC	Kappa
DenseNet-121	99.26	99.52	99.52	99.52	99.40	99.25
MobileNetV2	99.26	99.37	99.38	99.37	99.25	99.20
Xception	99.68	99.68	99.68	99.68	99.60	99.42
InceptionV3	97.86	97.88	97.84	97.83	97.31	97.32
VGG-16	99.72	99.78	99.78	99.78	99.72	99.60
NASNetMobile	97.74	97.82	97.78	97.78	97.32	97.45
Proposed Model	99.98	99.98	99.98	99.98	99.75	99.75

Table 18. Per-class performance metrics for colon and lung cancer classification on the LC25000 dataset.

Class	Precision	Recall	F1-Score	Per-class Acc.	MCC	Support
colon_aca	1.0000	1.0000	1.0000	1.0000	1.0000	1000
colon_bnt	1.0000	1.0000	1.0000	1.0000	1.0000	1000
lung_aca	0.9990	1.0000	0.9995	0.9996	0.9988	1000
lung_bnt	1.0000	1.0000	1.0000	1.0000	1.0000	1000
lung_scc	1.0000	0.9990	0.9996	0.9996	0.9987	1000
Macro Avg	0.9998	0.9998	0.9998	0.9998	0.9995	–
Weighted Avg	0.9998	0.9998	0.9998	0.9998	0.9995	5000

Table 19. Step-by-step ablation of architectural variants of the proposed model on the LC25000 dataset.

Model Variant	Params (M)	FLOPs (G)	Inference (ms)	Accuracy (%)	F1 (%)	Δ Acc vs. Base (%)
Ensemble (MobileNetV2 + Xception + VGG16)	164.8	24.2	112	98.76 ± 0.21	96.74 ± 0.23	3.24
Xception (ImageNet)	21.9	4.2	7.66	99.68 ± 0.17	99.68 ± 0.26	0.30
Xception Conv5	7.6	3.8	3.6	97.58 ± 0.38	97.48 ± 0.62	2.20
Xception + Spatial Attention	0.66	1.5	0.7	98.54 ± 0.04	98.38 ± 0.34	1.11
Xception (like-CNN) + SE + CBAM	3.01	2.45	11.3	98.53 ± 0.15	98.46 ± 0.36	1.45
Xception + CBAM Attention + Transformer	20.1	3.9	7.1	99.98 ± 0.01	99.98 ± 0.01	–

Table 20. Performance comparison between prior work and the proposed model on the NCT-CRC-HE-100K dataset.

Model	Dataset	Accuracy	Precision	Recall	F1-Score	MCC	Kappa	Ref.
CRCCN-NET	NCT-CRC-HE-100K	96.26	96.44	96.34	96.38	96.00	96.00	[32]
CNN + SWIN Transformer	NCT-CRC-HE-100K	95.80	97.90	97.63	97.76	97.61	97.64	[33]
VGG19	NCT-CRC-HE-100K	96.40	94.22	94.44	94.44	NA	NA	[34]
Ensemble CNN	NCT-CRC-HE-100K	96.16	96.17	96.15	NA	NA	NA	[35]
GAN + Inception	NCT-CRC-HE-100K	89.54	86.84	86.62	98.70	NA	NA	[36]
Proposed Model	NCT-CRC-HE-100K	99.33	99.27	99.26	99.27	99.17	99.17

Table 21. Performance comparison between prior work and the proposed model on CRC-VAL-HE-7K.

Model	Dataset	Accuracy	Precision	Recall	F1-Score	MCC	Kappa	Ref.
ResNet50 + Kernel Polynomial	CRC-VAL-HE-7K	97.01	98.20	98.20	98.20	96.50	98.10	[37]
FineTuned-VGG16	CRC-VAL-HE-7K	97.92	98.02	97.38	97.65	97.62	97.61	[38]
CNN-adam	CRC-VAL-HE-7K	90.00	89.00	87.00	87.00	NA	NA	[39]
Proposed Model	CRC-VAL-HE-7K	99.58	99.10	99.40	99.40	99.40	99.40

Table 22. Performance comparison between prior work and the proposed model on the LC25000 dataset.

Model	Dataset	Accuracy	Precision	Recall	F1-Score	MCC	Kappa	Ref.
Fine-tuned ResNet101	LC25000	99.94	99.84	99.85	99.84	NA	NA	[40]
LW-MS-CCN	LC25000	99.20	99.16	99.36	99.29	NA	NA	[41]
CNN	LC25000	96.33	96.39	96.37	96.38	95.44	95.41	[42]
CNN + GC attention block	LC25000	99.76	99.76	99.40	99.70	99.50	99.50	[43]
HIELCC-EDL	LC25000	99.60	99.00	99.00	99.00	99.00	99.20	[44]
Self-ONN	LC25000	99.89	99.74	99.74	99.74	99.84	99.78	[45]
CNN + ImageNet	LC25000	99.96	99.96	99.96	99.96	99.96	98.36	[46]
Ensemble (MobileNet + Xception)	LC25000	99.44	99.42	99.43	99.42	99.43	99.30	[47]
Ensemble (ResNet + NasNet + EfficientNet)	LC25000	99.94	99.84	99.84	99.84	99.78	99.88	[48]
Proposed Model	LC25000	99.98	99.98	99.98	99.98	99.95	99.98

Table 23. Patient-level five-fold cross-validation performance (mean ± standard deviation) on the LC25000 dataset.

Fold	Accuracy	Precision	Recall	F1-Score	MCC	Kappa
Fold-1	99.98	99.98	99.98	99.98	99.98	99.97
Fold-2	99.96	99.96	99.96	99.96	99.95	99.95
Fold-3	100.00	100.00	100.00	100.00	100.00	100.00
Fold-4	99.92	99.92	99.92	99.92	99.90	99.90
Fold-5	100.00	100.00	100.00	100.00	100.00	100.00
Average ± SD	99.97 ± 0.03	99.97 ± 0.03	99.97 ± 0.03	99.97 ± 0.03	99.97 ± 0.04	99.97 ± 0.04
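The summary row of Table 23 can be reproduced from the fold-wise values with the Python standard library. The sketch below does this for the Accuracy column. Because this section does not say whether the reported spread is the population or the sample standard deviation, both are computed; the variable names are illustrative.

```python
from statistics import mean, pstdev, stdev

# Fold-wise accuracy (%) copied from the Accuracy column of Table 23.
FOLD_ACCURACY = [99.98, 99.96, 100.00, 99.92, 100.00]

avg = mean(FOLD_ACCURACY)
sd_population = pstdev(FOLD_ACCURACY)  # divides by n
sd_sample = stdev(FOLD_ACCURACY)       # divides by n - 1

print(f"{avg:.2f} ± {sd_population:.2f} (population SD)")
print(f"{avg:.2f} ± {sd_sample:.2f} (sample SD)")
```

For these five folds the two conventions agree to two decimal places, so the choice does not affect the reported accuracy spread.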

Table 24. Ablation study evaluating the effect of different preprocessing configurations on classification performance on the CRC-VAL-HE-7K dataset.

Preprocessing Strategy	Accuracy	Precision	Recall	F1-Score	MCC	Kappa
No preprocessing	99.32	99.26	99.11	99.19	99.19	99.26
Stain normalization only	98.95	98.66	98.83	98.71	98.80	98.80
Spatially Adaptive NLM + Edge-Aware Sharpening	97.65	97.26	97.32	97.25	97.44	97.10
Gamma correction only	99.36	99.36	99.23	99.29	99.29	99.44
Gamma + Bilateral filtering	99.51	99.36	99.29	99.33	99.44	99.44
Gamma + Bilateral + CLAHE	99.58	99.10	99.40	99.40	99.40	99.40

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

