1. Introduction
Breast cancer is a common malignancy among women and one of the leading causes of cancer-related deaths worldwide. Approximately 2.3 million new cases and 685,000 deaths were reported in 2020, and 317,000 new cases are predicted to be diagnosed in the United States in 2025. Incidence among Asian American women has also risen rapidly in both age groups (2.7% per year among younger and 2.5% among older women), underscoring the growing global health burden [
1]. Microscopic tissue analysis remains the gold-standard method of breast cancer diagnosis; pathologists visually assess tissue samples during the diagnostic process. However, manual assessment is time-consuming, subjective, and prone to inter-reader variability, which underscores the need for diagnostic workflows based on artificial intelligence (AI) [
2]. Early computational techniques relied on classical machine learning (ML) classifiers, including k-Nearest Neighbors (k-NN), Decision Trees, and Support Vector Machines (SVMs), which rely on manually designed features such as texture, shape, and color. However, these methods are limited because they struggle to capture the subtle, complex, hierarchical, and global structures in histopathological images [
3,
4].
Models like XGBoost, Multilayer Perceptron (MLP), and Random Forest have become popular in recent years because they achieve high predictive performance, are relatively interpretable, and are naturally robust to imbalanced datasets [
2,
4]. It has been reported that fusing deep learning features with XGBoost achieves better classification results in medical image analysis. Similarly, studies that fed deep CNN features into XGBoost achieved higher performance on the BreakHis dataset, but these models still faced a number of challenges due to the complexity of spatial feature representations [
3,
5,
6]. The introduction of deep learning has transformed medical image classification: Convolutional Neural Networks (CNNs) such as VGG16, ResNet50, and DenseNet121 have been widely adopted because they automatically learn hierarchical representations. CNNs are good at capturing local spatial patterns, but when trained on small datasets they tend to overfit and have difficulty modeling the long-range dependencies that are critical for histopathological textures [
5,
7].
Vision Transformers (ViTs), self-attention-based models that leverage long-range global feature relationships, have achieved notable improvements over CNNs [
8]. The Swin Transformer, a hierarchical variant of the ViT, has achieved leading classification results on several medical imaging tasks, indicating that Swin Transformer models learn structural variations between tissues better than CNNs [
9,
10,
11]. Nevertheless, ViTs have high data requirements and are computationally intensive; they also do not have the inductive biases of CNNs, such as locality or translation invariance [
12]. CNNs are prone to overfitting on small or unbalanced datasets, which negatively affects their performance and generalization ability [
13,
14,
15]. ViTs, by contrast, may underfit or overfit depending on tokenization granularity and often require pre-training on large datasets to achieve optimal performance [
16]. CNNs perform well on fine-grained local features, including tissue textures, cellular boundaries, and subtle morphological patterns, because their receptive fields concentrate on localized spatial information. ViTs, on the other hand, are highly effective at capturing long-range dependencies and the global contextual structure of the entire image. Hybrid CNN and ViT models have achieved improved feature representations, higher classification performance, and greater robustness to variations in image resolution, magnification, and staining protocols. Such synergy is particularly effective in histopathological image analysis, where micro-scale cellular details and macro-scale tissue structure are both important for a valid description of the disease. Hybrid models can thus mitigate the inherent trade-off between global and local features and support a more holistic diagnosis. Ensemble and hybrid learning strategies have been widely researched to alleviate the shortcomings of individual models. CNN-based ensembles, namely those that combine VGG16, ResNet50, and DenseNet121 via soft voting, have demonstrated substantial improvements in predictive performance and resistance to predictive variability. Despite these advancements, the combination of state-of-the-art CNN and ViT models has received little attention in the literature, suggesting that greater performance improvements can be achieved through tighter integration [
17,
18]. In particular, related work has rarely explored fusing the single best-performing CNN and ViT models.
This paper presents a four-stage pipeline that integrates the strengths of CNNs, ViTs, and conventional ML models, such as XGBoost, Random Forest, and Multilayer Perceptron, to effectively classify breast cancer in histopathological images. During Phase 1, the BreakHis dataset, consisting of 7909 images, is resized, normalized, and augmented, then split via stratified sampling into 70/15/15 training/validation/test partitions [
3,
17,
19]. Phase 2 performs baseline model evaluation, in which three CNN-based models (VGG16, ResNet50, and DenseNet121) and four transformer-based classifiers (DeiT, CaiT, T2T-ViT, and Swin Transformer) are independently trained and evaluated to assess their accuracy and overall performance. Classical ML classifiers, such as Random Forest, Multilayer Perceptron, and XGBoost, are additionally used to classify the deep features [
17,
20,
21]. Phase 3 builds the CNN ensemble model, in which the three CNN classifiers are combined via soft voting; this ensemble achieves higher accuracy than any individual CNN [
18,
22]. Finally, Phase 4 fuses the best-performing CNN (VGG16) and ViT (Swin Transformer) at the feature level. Global Average Pooling (GAP) is applied independently to the CNN and ViT outputs to convert spatial features into 1D vectors, and the resulting features are scaled so that both branches lie on the same scale after GAP. The normalized features are then fused using a self-attention mechanism, which improves interpretability and better leverages the strengths of each architecture. Dense layers followed by dropout refine the fused features, preventing overfitting on limited data and maximizing feature discriminability. The final classification is performed using three classifiers, XGBoost, Random Forest, and Multilayer Perceptron (MLP), with XGBoost yielding the best accuracy and stability. The Proposed Framework achieved 98.7% accuracy, 98.6% precision, 98.7% recall, and 98.7% F1-score. Its most innovative component is a dense-attention fusion block, which combines dense layers, dropout, and self-attention to explicitly weight the CNN and ViT features. An ablation study shows that adding the fusion, attention, and dense components improves performance incrementally, indicating that the Proposed Framework is computationally efficient and robust. Additionally, to assess robustness and generalization, the Proposed Framework was evaluated on the independent external BACH dataset, and explainability analyses based on Grad-CAM, Grad-CAM++, and global attention maps were applied to breast cancer histopathological images to provide interpretable, clinically relevant insights into the model's decision-making process.
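The Phase 4 fusion can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the training code: the feature maps, dimensions, and weight matrices below are randomly initialized stand-ins for the VGG16 and Swin Transformer outputs, and the attention is reduced to a simple softmax gate over the concatenated channels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in backbone outputs: a CNN spatial map and ViT patch tokens
# (random placeholders, not real VGG16 / Swin activations).
cnn_map = rng.standard_normal((7, 7, 512))    # H x W x C
vit_tokens = rng.standard_normal((49, 768))   # N_patches x D

# 1. Global Average Pooling collapses spatial/token axes to 1-D vectors.
cnn_vec = cnn_map.mean(axis=(0, 1))           # shape (512,)
vit_vec = vit_tokens.mean(axis=0)             # shape (768,)

# 2. Scale each branch (L2 normalization) so both lie on the same scale.
cnn_vec /= np.linalg.norm(cnn_vec)
vit_vec /= np.linalg.norm(vit_vec)

# 3. Concatenate and apply a softmax attention gate that re-weights each
#    fused dimension (a simplified channel-wise self-attention).
fused = np.concatenate([cnn_vec, vit_vec])    # shape (1280,)
w_att = rng.standard_normal(fused.shape[0])   # hypothetical learned weights
att = np.exp(w_att * fused)
att /= att.sum()
refined = att * fused                         # attention-weighted features

# 4. A dense projection (dropout omitted at inference) yields the vector
#    handed to XGBoost / Random Forest / MLP for final classification.
w_dense = rng.standard_normal((fused.shape[0], 128))
features_for_classifier = np.maximum(refined @ w_dense, 0.0)  # ReLU
print(features_for_classifier.shape)
```

In the actual framework the gate and dense weights are learned end-to-end; here they merely make the data flow (GAP, scaling, attention, dense refinement) concrete.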
The Breast Cancer Histopathological Image dataset (BreakHis) is a popular benchmark for the automated diagnosis of breast cancer that consists of 7909 histopathological images collected from 82 patients, divided into benign and malignant classes. Images are captured at four different magnifications, 40×, 100×, 200×, and 400×, enabling the study of tissue morphology at multiple scales, as shown in
Figure 1 below. The global architectural pattern of the glandular structures and stroma can be identified at 40× and 100× magnification. When magnifications are increased to 200× and 400×, finer cellular features, such as nuclear size, chromatin distribution, and mitotic processes, can be observed, which are important for breast cancer diagnosis.
This multi-scale imaging approach is beneficial for training models to learn both global and local features. CNNs such as VGG16 excel at extracting fine-grained local texture and shape details, whereas Vision Transformers (ViTs) such as the Swin Transformer capture long-range contextual relationships. The complementary features of these two classifiers, together with the Proposed Framework (CNN–ViT) built on them, were used to enhance the quality and predictability of automated breast cancer diagnosis.
The dataset contains eight tumor types: four benign (Adenosis, Fibroadenoma, Tubular Adenoma, and Phyllodes Tumor) and four malignant (Ductal Carcinoma, Lobular Carcinoma, Papillary Carcinoma, and Mucinous Carcinoma). Samples are nearly evenly distributed across the four magnifications: 40× (1995 samples), 100× (2081 samples), 200× (2013 samples), and 400× (1820 samples), as shown in
Table 1 below. This range allows models to learn both global tissue architecture and fine cellular structure from the BreakHis dataset.
2. Literature Review
Globally, breast cancer is the leading cause of cancer-related mortality, with an estimated 317,000 new cases predicted to be diagnosed in the United States in 2025. Incidence among Asian American women has also risen rapidly in both age groups (2.7% per year among younger and 2.5% among older women), exhibiting a negative impact on global health [
1]. Early diagnosis is vital for improving the patient's prognosis, reducing the risk of complications, enhancing survival, and guiding therapeutic interventions. Histopathological image analysis, regarded as the gold standard of breast cancer diagnosis, is traditionally conducted manually by pathologists through careful examination of tissue sections. Nevertheless, it is a time-consuming and laborious process that is prone to inter-observer bias and human error, especially under fatigue. First-generation computer-aided diagnosis systems were mainly based on classical machine learning algorithms, such as k-Nearest Neighbors (k-NN), Support Vector Machines (SVMs), decision trees, and ensemble algorithms. These methods rely on handcrafted features, including manually extracted information on tissue textures, shapes, and color patterns, as well as features derived from standard descriptors. Although these classical models have demonstrated strong potential, they are inherently constrained in their predictive ability by the quality, extent, and representational integrity of manually engineered features. Such handcrafted features do not always adequately reflect the morphological and structural heterogeneity typical of histopathological images, limiting the models' ability to adapt and generalize across different tissue patterns and variations [
3,
7,
11,
23]. Although effective for simpler classification tasks, these models do not generalize well to complex, high-dimensional histopathological images because they cannot capture intricate spatial patterns [
3,
23].
Ensemble models such as XGBoost, which offer strong regularization and interpretability, have also become powerful alternatives for classification tasks in recent years [
4,
10,
18]. Recent literature has shown that XGBoost performs particularly well in medical image analysis when trained on deep features extracted from Convolutional Neural Networks (CNNs), achieving higher classification performance [
1,
4,
24]. Deep learning and CNNs have completely transformed medical image analysis, with different architectures such as VGG16, ResNet50, and DenseNet121 demonstrating a strong performance in classifying breast cancer [
25]. Such types of networks can learn hierarchical representations of features and spatial patterns that are essential for detecting the malignant structures in different tissue regions [
7,
8,
26]. VGG16 is a deep network with sixteen weight layers and a fixed 3×3 kernel size, allowing for detailed extraction of local patterns [
27,
28]. ResNet50 uses skip connections, which mitigate the vanishing gradient problem and enable the training of significantly deeper networks [
26,
29]. CNN models can overfit despite good performance on relatively small datasets like BreakHis, especially when the dataset is not augmented or when augmentation is applied before the train–test split. Further, CNNs have an intrinsically local receptive field, limiting their ability to represent global contextual information, which can be very important for whole-tissue characterization [
7,
8].
Building on the achievements of transformers in natural language processing, Vision Transformers (ViTs) have found increasing application in medical image analysis. ViTs capture long-range relationships via self-attention mechanisms and extract global contextual information from histopathological images. Despite these benefits, traditional Vision Transformers lack the inherent inductive biases of Convolutional Neural Networks, such as locality and translation invariance, and typically perform best with large datasets and substantial computational resources [
3,
8,
12]. The Swin Transformer, a variant of the ViT, addresses many of the ViT's shortcomings, such as overfitting and redundant feature representations. Its hierarchical structure captures both local and global contextual patterns, leading to richer and more expressive feature representations [
12,
30]. Hybrid models have already proven superior to traditional CNNs for analyzing breast cancer in histopathology images. Specifically, the Multi-View Swin Transformer (MSMV-Swin), which leverages multiple perspectives to capture richer features, achieves higher performance thanks to its generalization across different tissue structures and magnifications [
10,
13,
16]. However, Vision Transformers are resource-intensive, require very large datasets to perform optimally, and can become impractical in resource-constrained environments where datasets are limited or hard to acquire [
11,
31,
32]. Recent research has focused on the strengths of hybrid architectures by combining Convolutional Neural Networks (CNNs) with variants of Vision Transformers (ViTs) for medical image classification. This approach has achieved high-quality diagnostic results, with lower diagnostic error than single CNN or ViT models, due to the combination of local CNN with global ViT representations. These results highlight the importance of hybrid frameworks for capturing multi-scale, multilevel features, especially in complex medical imaging problems where both small-scale details and large-scale context are equally important for accurate classification [
16,
18,
33]. In such hybrids, the CNN extracts the fine-grained local features while the ViT captures the long-range contextual dependencies, with a fused embedding used for final classification [
34,
35]. Evaluation metrics such as accuracy, precision, recall, and F1-score showed that the hybrid model achieved excellent results across all datasets, with reported accuracies of 99.62% on DDSM and 100% on MIAS. These findings not only indicate strong generalization potential but also demonstrate the benefits of combining local and global feature representations. The results further indicate that hybrid CNN–ViT models can significantly improve diagnostic performance with low error, underscoring their potential for clinical breast cancer diagnosis [
36,
37].
Overfitting is a critical problem for deep learning models because of the data scarcity prevalent in medical imaging. It is commonly addressed with a number of traditional methods. During training, dropout randomly deactivates some neurons so that they cannot co-adapt, leading to the learning of higher-quality features. Batch normalization normalizes the inputs at each layer, stabilizing training by enabling faster convergence and reducing sensitivity to initialization. Data augmentation artificially enlarges the training set by introducing variations in the training data, such as rotation, scaling, and flipping, allowing the classifier to learn more versatile representations. All these methods improve generalization and model performance, yielding more reliable findings on unseen medical images [
33]. However, Vision Transformer (ViT) underfitting can be caused by coarse-grained patch tokenization, whereas fine-tuning on a small dataset can lead to overfitting. This is because ViTs' weak architectural inductive biases make them highly reliant on large training datasets and on careful strategies to ensure successful and stable training [
12,
29,
38,
39,
40]. Hybrid learning algorithms have been extensively used to overcome the drawbacks of single models. Specifically, robustness and predictive performance can be improved using an ensemble of Convolutional Neural Networks (CNNs) based on soft voting; for example, soft voting across an ensemble of VGG16, ResNet50, and DenseNet121 has been shown to outperform each individual architecture for breast cancer histopathology classification by leveraging the complementary strengths of each architecture [
2,
3,
4,
16].
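The soft-voting scheme discussed above amounts to averaging per-model class probabilities and taking the argmax. A minimal sketch, where the probability arrays are made-up placeholders standing in for the outputs of VGG16, ResNet50, and DenseNet121:

```python
import numpy as np

# Hypothetical per-model class probabilities for 3 samples and
# 2 classes (benign, malignant) -- placeholders, not real model outputs.
p_vgg16 = np.array([[0.80, 0.20], [0.40, 0.60], [0.55, 0.45]])
p_resnet50 = np.array([[0.70, 0.30], [0.30, 0.70], [0.45, 0.55]])
p_densenet121 = np.array([[0.90, 0.10], [0.45, 0.55], [0.35, 0.65]])

# Soft voting: average the probabilities across models, then argmax.
p_ensemble = (p_vgg16 + p_resnet50 + p_densenet121) / 3.0
predictions = p_ensemble.argmax(axis=1)   # 0 = benign, 1 = malignant

print(p_ensemble.round(3))
print(predictions)                        # -> [0 1 1]
```

Because averaging retains each model's confidence (unlike hard majority voting), a strongly confident model can outvote two weakly confident ones, which is one reason soft voting tends to smooth out individual-model variability.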
Recent studies in 2023–2025 have made significant strides in histopathology image analysis by embracing a weakly supervised learning paradigm and transformer models. Transformers integrated with multiple instance learning (MIL) have gained popularity to model global contextual relationships in both whole-slide and patch-based histopathology images [
41]. Attention mechanisms are widely used in transformer-based MIL frameworks to identify discriminative regions, thereby enhancing classification robustness without requiring pixel-level annotations [
10,
42,
43]. Meanwhile, transformer-only architectures and self-supervised learning methods have emerged as highly effective in learning powerful representations and improving generalization across histopathological datasets [
43,
44]. Furthermore, very recently, foundation model-based pipelines and large-scale pre-trained Vision Transformers have been explored for cancer sub-typing and validation across datasets, showing promising results in computational pathology [
41,
42,
43,
44].
Despite these advancements, many such approaches are bound either to complex MIL pipelines or to large-scale transformer models. In the present work, by contrast, a lightweight and selective CNN–ViT fusion strategy is pursued, in which only the top-performing backbones are fused using a dense-attention module and the effectiveness of each component is systematically evaluated through ablation [
21,
40]. Advanced fusion schemes have also been considered in recent work, such as Token Mixer, a transformer-based architecture tailored for histopathological tokenization, while supervised contrastive techniques such as SupCon–ViT promote intra-class discrimination in invasive ductal carcinoma detection. Confusion matrices, ROC curves, Grad-CAM, global attention maps, and feature importance visualizations are some of the techniques that help clinicians understand how models make decisions and build confidence in AI systems. Performance measures such as area under the curve (AUC), precision, recall, F1-score, and standard deviation are crucial for evaluation, especially when the dataset is class-imbalanced [
10,
14,
16,
28,
33].
Literature reviews from 2020 to 2025 highlight the various limitations of case-based breast cancer histopathology classification, as summarized in
Table 2. The main limitations are heavy reliance on the training dataset, computational complexity, and poor interpretability. Models trained on particular datasets can be difficult to generalize and too complicated to deploy in real-time clinical practice. The small size of available datasets can lead to overfitting, and training every model on large-scale datasets is prohibitively expensive.
The main approaches to overcoming these shortcomings are data augmentation, transfer learning, model pruning, and more efficient architectures. Overfitting can be alleviated through regularization, while visualization and explainable AI techniques improve trust. Future research should aim to optimize models so that they perform well on small datasets with lower computational demands, enhancing their clinical usefulness and practical implementation.
4. Experimental Results
4.1. Evaluation Metrics
The Proposed Framework was evaluated on two histopathological datasets, BreakHis and BACH, and a number of metrics were calculated to assess robustness and generalization performance, including precision, recall, F1-score, AUC, and standard deviation (STD).
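These metrics follow their standard definitions. A minimal sketch computing them from raw confusion counts and summarizing cross-validation folds as mean ± STD; all numbers here are illustrative, not the paper's:

```python
import statistics

# Illustrative confusion counts (not the paper's actual numbers).
tp, fp, fn, tn = 95, 2, 3, 100

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # sensitivity / TPR
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")

# Fold-level scores are typically reported as mean +/- STD.
fold_acc = [0.986, 0.988, 0.985, 0.989, 0.987]   # hypothetical fold accuracies
print(f"{statistics.mean(fold_acc):.3f} +/- {statistics.stdev(fold_acc):.5f}")
```

The F1-score is the harmonic mean of precision and recall, which is why it is preferred over plain accuracy when the benign/malignant classes are imbalanced.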
As illustrated in
Figure 4 below, part (a) compares the trade-off between the true positive rate (TPR) and false positive rate (FPR) for all competing models, and part (b) compares the AUC score of every model. The Proposed Framework had the highest AUC and good generalization ability without overfitting across all models on both datasets. CNN-based models, i.e., VGG16 and ResNet50, had relatively low AUC values compared with transformer-based models, while the Proposed Framework achieved the highest AUCs of 0.994 on BreakHis and 0.960 on BACH. This highlights the complementary nature of CNN- and transformer-based architectures: the CNN detects fine-grained local features, whereas the Swin Transformer focuses on global contextual patterns. The consistency of the Proposed Framework across the two datasets further establishes a good balance between bias and variance, ensuring that it does not overfit and can generalize to heterogeneous histopathological image distributions.
The main goal of the comparative performance analysis was to evaluate the generalization potential of the Proposed Framework in both microscopic and clinically relevant settings. The BreakHis dataset was chosen for training and testing because it provides a large number of breast tissue samples at various magnifications (40×, 100×, 200×, 400×), making it well suited for feature extraction and for learning representations with deep architectures. Validation was therefore mainly performed on the BACH dataset, which comprises annotated breast histology images. The final accuracies of nine deep architectures (VGG16, ResNet50, DenseNet121, DeiT, CaiT, T2T-ViT, Swin Transformer, Ensemble Model, and the Proposed Framework) across the BreakHis and BACH datasets are compared, as shown in
Figure 5 below. The Proposed Framework achieved the highest accuracy of 98.74% on the BreakHis dataset and a validation accuracy of 95.80% on the BACH dataset. Moreover, the Proposed Framework performed consistently across the BreakHis dataset, demonstrating that it can learn discriminative patterns from breast histopathological images and generalize them strongly.
The effectiveness of the Proposed Framework on the BreakHis dataset is demonstrated by comparative learning behavior reflected in
Figure 6 below. The BreakHis dataset results show the mean training and validation performance across 5-fold cross-validation, and it was primarily used to train and internally validate the model.
The validation accuracy is slightly lower than the training accuracy at 50 epochs, indicating a small generalization gap, but training and validation curves converge after 70 epochs, while the validation loss stabilizes, indicating that the model continues to perform well. Scatter points highlight chosen checkpoints, while black markers and dashed lines identify the training stop point where early stopping was used. In general, the curves show that the Proposed Framework performs extremely well, with low loss across all folds, resulting in strong generalization and good resistance to overfitting.
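The early stopping used here (halting once validation loss stops improving) is commonly implemented with a patience counter; a minimal sketch on a made-up loss sequence, not the paper's actual training log:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch index at which training stops: the first epoch
    after `patience` consecutive epochs without improvement, or the
    last epoch if the patience threshold is never reached."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:          # validation loss improved
            best = loss
            waited = 0
        else:                    # no improvement this epoch
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical validation-loss curve: improves, then plateaus.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.36, 0.37, 0.37]
print(early_stop_epoch(losses, patience=3))   # stops at epoch 6
```

In practice the weights from the best-loss epoch (here epoch 3) are restored, which is what keeps the reported validation curve flat after the marked stop point.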
The analysis of comparative performance across nine deep learning architectures is shown in
Figure 7 below for both the BreakHis and BACH breast histopathology datasets. Each model was trained on the BreakHis dataset with 5-fold cross-validation, where each fold served once as the validation split while the remaining folds formed the training set. This cross-validation approach reduces bias and variance, improving the consistency of the reported measures. All models were subsequently evaluated on the BreakHis validation folds and assessed on the external BACH dataset to examine their generalization across data sources and imaging conditions.
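This evaluation protocol can be sketched with scikit-learn's stratified k-fold utilities. The synthetic features and the logistic-regression stand-in below are placeholders (the paper uses deep backbones on BreakHis images); only the fold mechanics are the point:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for extracted image features (not BreakHis data).
X, y = make_classification(n_samples=200, n_features=32,
                           weights=[0.3, 0.7], random_state=42)

# Stratified 5-fold CV: each fold validates once, rest train.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)   # placeholder classifier
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[val_idx], y[val_idx]))

print(f"mean={np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Stratification preserves the benign/malignant ratio in every fold, which matters for the imbalanced class distributions typical of histopathology datasets.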
Three standard classification measures, precision, recall, and F1-score, were used to measure the performance of the Proposed Framework. These indicators provide a balanced view of the model's ability to accurately identify cancerous tissue while reducing false positives. The Proposed Framework performed best on both datasets, achieving 98.7% precision, 98.6% recall, and 98.7% F1-score on BreakHis, and 95.7%, 95.8%, and 95.7%, respectively, on the external BACH dataset.
Interestingly, the slight and steady decrease in performance, approximately 3 percentage points between internal BreakHis validation and external BACH testing, indicates that the Proposed Framework does not overfit the training set and learns domain-invariant morphological representations that transfer to other histopathology slides and stain variations.
4.2. CNN–ViT Feature Fusion Classification Results
An ablation study was conducted to examine the contribution of each proposed component (Fusion, Attention, Dense) as they are gradually introduced, beginning with the simplest feature fusion. Each addition enhanced the classification performance, demonstrating that attention improves discriminative feature selection, whereas the dense layers increase feature compactness and reduce redundancy. Deep learning models remain very promising for medical image classification, but this promise is realized only when the features extracted from images generalize across different datasets. The Proposed Framework combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to provide stronger feature representations and a more complementary feature space.
The most stable and refined features were selected from the best-performing CNN and ViT variants. Extracted features were normalized and fused channel-wise with an attention mechanism that learns channel-wise weights to adaptively combine the two feature sets. This allowed the model to focus on the most significant discriminative features in one set while down-weighting weak features in the other. To mitigate overfitting, the fused features were used as inputs to classifiers such as Random Forest (RF), XGBoost, and Multilayer Perceptron (MLP) to evaluate their performance.
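A minimal sketch of this last stage, training the three classifier heads on stand-in fused features. The feature matrix is synthetic, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost, which lives in a separate package:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the attention-fused CNN-ViT feature vectors.
X, y = make_classification(n_samples=400, n_features=128, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

heads = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GB (XGBoost stand-in)": GradientBoostingClassifier(random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                         random_state=0),
}
for name, clf in heads.items():
    clf.fit(X_tr, y_tr)                       # fit each head on fused features
    print(f"{name}: {clf.score(X_te, y_te):.3f}")
```

Gradient-boosted trees are a plausible best performer here because, unlike the MLP, they handle correlated and partially redundant fused dimensions without additional regularization.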
The BreakHis dataset was tested across five-fold cross-validation to provide a fair and trustworthy performance comparison, as depicted in
Table 5 below. All classifiers demonstrated excellent performance across all evaluation metrics when the attention mechanism was integrated with densely connected layers.
The attention mechanism significantly enhanced precision and F1-score even when applied on its own, demonstrating consistent performance. Moreover, the dense layers further improved performance by minimizing noise and reducing feature dimensionality. XGBoost achieved the best results among all classifiers, with an accuracy of 98.74 ± 0.14% and an F1-score of 98.70 ± 0.18%.
To assess generalization, the Proposed Framework was evaluated on the BACH dataset, which was not cross-validated but used entirely as an external validation set; the performance trend on BACH was quite similar to that on BreakHis, as shown in
Table 6 below. The application of attention mechanisms and dense layers consistently improved the performance of all classifiers. Once again, the XGBoost classifier achieved the best results on the validation set, with 95.8% accuracy and a 95.75% F1-score, indicative of its strong resilience on unseen data. The MLP and RF classifiers were also strong, with accuracies of 94.8% and 94.2%, respectively. Despite a minor decrease in performance on BACH due to domain variations, these findings confirm that the Proposed Framework enables strong cross-dataset generalization.
XGBoost was the most stable across the two datasets, which can be attributed to its gradient-boosting formulation, which effectively captures the nonlinear interactions and subtle differences in the fused feature space. It demonstrates strong generalization, as evidenced by its consistent performance on the external BACH dataset compared with the MLP and RF. In sum, the experimental results demonstrate that the attention-directed CNN–ViT fusion with dense-layer refinement produces a strong, noise-resistant representation that enables high-performance classification, as summarized in
Table 7.
4.3. Visual Explainability Analysis Using Grad-CAM and Global Attention
The visual interpretability findings for all evaluated CNNs and ViTs, as well as the Proposed Framework, on the BreakHis dataset are shown below. We used attention maps, Grad-CAM, and Grad-CAM++ to understand the behavior of the various CNN and ViT variants. Grad-CAM highlights the image regions that contribute most strongly to the model prediction by backpropagating the gradients of the target class to the last convolutional layer. Its extension, Grad-CAM++, produces sharper and more localized heatmaps, particularly when several regions contribute to a class. Attention maps were also employed for the ViT variants; they expose the self-attention patterns that indicate where the transformer focuses during inference. These visualizations offer complementary insight: Grad-CAM and Grad-CAM++ highlight the class-discriminative regions, while the attention maps indicate the model's focus without direct supervision.
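The core Grad-CAM computation described here reduces to a few array operations: pool the gradients per channel, use them to weight the activation maps, and pass the sum through a ReLU. A NumPy sketch on random stand-in arrays; in a real pipeline the activations and gradients would come from the network's last convolutional layer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the last conv layer's activations A (C x H x W) and the
# gradients of the target-class score w.r.t. those activations.
activations = rng.standard_normal((64, 14, 14))
gradients = rng.standard_normal((64, 14, 14))

# Grad-CAM: channel weights = spatially averaged gradients ...
weights = gradients.mean(axis=(1, 2))                 # shape (64,)
# ... then a ReLU over the weighted sum of activation maps.
cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)

# Normalize to [0, 1] so the map can be upsampled and overlaid as a heatmap.
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)
```

Grad-CAM++ replaces the plain gradient average with pixel-wise weighting coefficients derived from higher-order gradient terms, which is what yields the sharper maps when several tissue regions drive the same class.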
Figure 8a below depicts the CNN-based variants with Grad-CAM and Grad-CAM++, while
Figure 8b highlights the ViT-based variants with attention maps, Grad-CAM, and Grad-CAM++. The visualization demonstrates how CNNs focus on the texture and structural patterns typical of histopathology images, whereas ViTs use patch-based attention to detect global contextual regions. The comparison between the two model families highlights the complementary advantages of CNN and transformer-based architectures in feature localization.
Figure 8c below shows the visualization for the Proposed Framework, which highlights how combining CNN and ViT features improves localization: the CNN contributes texture-oriented representations, while the transformer's global attention generates more illuminating and comprehensive activation maps. Addressing both performance and interpretability, the hybrid visualization highlights the areas of interest more clearly than the individual models do, reflecting the rationale behind the Proposed Framework of combining complementary architectures to better understand histopathology images.
Notably, the CNN and ViT visualizations are presented separately, prior to fusion, to explicitly indicate how each backbone provides distinct and complementary information. This is supported by the final fused maps, which are clearly more focused on diagnostically relevant regions of tissue, indicating that the Proposed Framework does not simply add the features but learns the synergistic representation that enhances interpretability and diagnostic confidence.
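As a concrete illustration of the Grad-CAM computation described above, the following minimal NumPy sketch shows the channel weighting, ReLU, and normalization steps; the activation and gradient arrays are random toy data standing in for a real backbone's target-layer tensors, not outputs of the actual models:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: weight each channel map by its average gradient,
    combine the maps, apply ReLU, and min-max normalize to [0, 1]."""
    # alpha_k: global-average-pooled gradient per channel, shape (K,)
    alphas = gradients.mean(axis=(1, 2))
    # weighted combination of channel activation maps, shape (H, W)
    cam = np.tensordot(alphas, activations, axes=1)
    cam = np.maximum(cam, 0.0)        # ReLU keeps only positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize for overlaying on the image
    return cam

# toy tensors: K=8 channels over a 7x7 spatial map
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.random((8, 7, 7))
heatmap = grad_cam(acts, grads)
```

In practice the activations and gradients would be captured with framework hooks on the chosen layer; the heatmap is then upsampled to the input resolution and overlaid on the tissue image.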
4.4. Row-Wise Confusion Matrix of BreakHis and BACH
Row-wise confusion matrices are used to summarize the classification results of the Proposed Framework to provide a clear view of how classification is performed. The confusion matrices for the BreakHis and BACH datasets are shown in
Figure 9 below. These matrices report the sample counts correctly and incorrectly categorized as benign or malignant, and the values have been row-wise normalized to make them easier to interpret. On the BreakHis dataset, the Proposed Framework achieved 97.2% and 98.4% correct prediction rates for benign and malignant samples, respectively, indicating very high discrimination between benign and cancerous tissues. The percentage of benign cases predicted as malignant was small (2.3%), and only 1.5% of malignant cases were misclassified as benign. This high sensitivity and specificity indicate that the model captures both texture-level features via the CNN and structural features via the ViT in histopathological images.
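The row-wise normalization used for these matrices can be sketched as follows; the raw counts below are illustrative values chosen for the example, not the paper's actual test split:

```python
import numpy as np

# Raw counts: rows = actual class, columns = predicted class.
# These counts are hypothetical, chosen only to illustrate the method.
cm = np.array([[623,  18],    # benign:    correct, predicted malignant
               [ 12, 747]])   # malignant: predicted benign, correct

# Divide each row by its total so every row sums to 1.0
row_norm = cm / cm.sum(axis=1, keepdims=True)

specificity = row_norm[0, 0]  # recall on the benign class
sensitivity = row_norm[1, 1]  # recall on the malignant class
```

Row normalization makes each cell directly readable as "the fraction of this actual class predicted as that label", which is why the diagonal entries correspond to per-class recall.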
The model’s results on the BACH dataset were also strong, with 94.6% of benign cases and 96.9% of malignant cases correctly classified. Although the BACH dataset is much more varied in terms of image scale, staining, and acquisition conditions, the model maintained strong generalization with only a slight decrease in performance. This stability indicates the transferability of the learned hybrid representations and suggests that the attention-based fusion helps the model adapt to unseen data distributions.
Overall, the confusion matrices highlight the highly accurate and balanced predictions of the Proposed Framework across both datasets. The findings also confirm that combining an attention mechanism with dense layers suppresses redundant features, promotes class-discriminative information, and reduces both false positives and false negatives.
4.5. Actual vs. Predicted Analysis of Models
To critically assess the strength and generalization capability of the Proposed Framework, a compound visualization was developed to incorporate a series of performance indices, as shown in
Figure 10 below. The combined visualization provides a general view of the model's ability to capture the relationship between actual and predicted values across the two datasets. In the actual vs. predicted curves, the actual and predicted values overlap closely, with 98.7% accuracy on the BreakHis dataset and 95.8% on the BACH dataset. The close alignment of these curves indicates the model's ability to learn and reproduce the nonlinear distribution of histopathological features, even across different image sizes and staining conditions. This behavior is further supported by the actual vs. predicted scatter plot, in which the points for both datasets cluster tightly along the diagonal, with coefficient of determination (R²) values of 0.998 and 0.995, indicating that the model's predictions remain highly linearly correlated with the ground truth.
The residual visualizations also confirm the model's stability and predictive reliability: the residual plot shows that the errors for both datasets are zero-mean and randomly distributed around zero, indicating no systematic bias or model drift. Similarly, the histogram of residuals is approximately Gaussian with a mean of zero, indicating that the prediction errors are unbiased and uncorrelated. The model achieved 98.7% on BreakHis and 95.8% on BACH, with the slight decline in performance attributable to domain differences. Taken together, these findings confirm that the Proposed Framework, with attention and dense feature reduction, exhibits excellent generalization, low residual variance, and predictive consistency across different histopathological domains.
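The R² and residual statistics reported here follow the standard definitions, which can be sketched on toy values (the arrays below are invented for illustration, not the paper's data):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

# toy ground-truth labels and near-perfect predicted scores
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
p = np.array([0.02, 0.97, 0.99, 0.05, 0.93, 0.01, 0.98, 0.96])

r2 = r_squared(y, p)
residuals = y - p
mean_residual = residuals.mean()   # near zero when errors are unbiased
```

A residual mean near zero with a roughly symmetric histogram is what the plots in Figure 10 check visually.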
4.6. Computational Complexity and Training Summary
The computational efficiency of the Proposed Framework was compared against the other models in terms of complexity and training requirements. The key efficiency metrics (training time, number of parameters, floating-point operations, model size, and memory consumption) are summarized in
Table 8 below. Although VGG16 and the Swin Transformer are effective models when trained individually, combining them imposes only a minor additional computational burden. The moderate complexity reflects the incorporation of both convolutional and transformer-based representations, which improve spatial and contextual feature extraction. Overall, the Proposed Framework achieves a good trade-off between computational cost and performance, validating that it provides better generalization and stronger performance while remaining computationally viable for analyzing histopathological images.
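As a minimal sketch of how such complexity metrics are derived, the parameter and FLOP counts of a single convolution layer can be computed as follows (stride 1 and same-size padding are assumed; the example layer matches the shape of VGG16's first convolution, but the functions are illustrative helpers, not part of the framework):

```python
def conv2d_params(in_ch, out_ch, k, bias=True):
    """Learnable parameters of a k x k convolution layer."""
    return out_ch * (in_ch * k * k + (1 if bias else 0))

def conv2d_flops(in_ch, out_ch, k, h, w):
    """Forward-pass FLOPs on an h x w feature map, counting a
    multiply-accumulate (MAC) as 2 operations."""
    macs = out_ch * h * w * in_ch * k * k
    return 2 * macs

# VGG16's first convolution: 3 -> 64 channels, 3x3 kernels, 224x224 input
params = conv2d_params(3, 64, 3)
flops = conv2d_flops(3, 64, 3, 224, 224)
```

Summing such per-layer counts over a network yields the parameter and FLOP totals reported in Table 8; in practice, profiler tools perform this bookkeeping automatically.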
5. Discussion
This study illustrates the synergistic effect of combining CNNs and ViTs for classifying breast cancer from histopathological images. The proposed pipeline systematically evaluates, compares, and combines the strengths of the two architectures to address the complex morphological and textural variations of tissue in histopathological images. The Proposed Framework uses VGG16 (CNN) to extract fine-grained local features and the Swin Transformer (ViT) to capture long-range global dependencies; the two are fused through an attention-guided mechanism followed by dense layers with dropout to boost discriminative learning, minimize redundancy, and avoid overfitting. The fused representation is classified using XGBoost, which effectively learns intricate decision boundaries and complements the deep feature learning.
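A minimal sketch of this attention-guided fusion step is given below; the random feature vectors and untrained weight matrices stand in for real VGG16/Swin outputs and learned parameters, and the dimensions are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_fuse(cnn_feat, vit_feat, w_dense, w_attn):
    """Concatenate backbone features, refine them with a dense
    projection, then reweight each dimension by an attention score."""
    fused = np.concatenate([cnn_feat, vit_feat], axis=-1)  # (B, Dc + Dv)
    hidden = np.maximum(fused @ w_dense, 0.0)              # dense layer + ReLU
    attn = softmax(hidden @ w_attn)                        # per-dimension weights
    return hidden * attn                                   # attended features

# toy batch of 4 samples; 512-d CNN features, 768-d ViT features (assumed sizes)
cnn_feat = rng.standard_normal((4, 512))
vit_feat = rng.standard_normal((4, 768))
w_dense = rng.standard_normal((1280, 256)) * 0.02
w_attn = rng.standard_normal((256, 256)) * 0.02

z = attention_fuse(cnn_feat, vit_feat, w_dense, w_attn)
# z would then be passed to a classifier such as XGBoost for the final decision
```

In the actual framework these weights are learned end to end with dropout; the sketch only shows the data flow: concatenation, dense refinement, attention reweighting, then classification.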
The Proposed Framework integrates the complementary strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) via a dense-attention fusion module, combining the hierarchical feature representations of CNNs with the global context modeling of ViTs to improve breast cancer histopathology image classification and model generalization. Experimental findings indicate that the Proposed Framework consistently outperforms the baseline models, highlighting the significance of feature fusion in enhancing discriminative capability for breast cancer classification. Beyond simple fusion, the framework employs a dense attention-based fusion strategy that refines the fused feature vector through densely connected layers and emphasizes the most significant features via an attention mechanism that suppresses extraneous information. This substantially improves the discriminative power of the model, enabling it to robustly handle variations across the histopathological image datasets. Additionally, ablation studies validated that each component (feature fusion, dense layers, and attention modules) enhances the overall classification performance, collectively demonstrating the effectiveness of the proposed framework in delivering accurate and reliable breast cancer diagnosis.
A comparison between the BreakHis and BACH datasets indicates that the Proposed Framework performs strongly on both, achieving accuracies of 98.7% and 95.8%, respectively, and demonstrating strong generalization. Its discriminative power is supported by large AUC values (0.994 and 0.960, respectively). The analysis of confusion matrices and error residuals shows that the predictions are consistent across folds. Comparison with Random Forest and MLP classifiers shows that the Proposed Framework compares favorably with traditional models in both performance and discriminative ability. The model balances local details (nuclei, cell boundaries, micro-textures) with global contextual interpretation, thereby improving interpretability, reliability, and diagnostic accuracy. Transparent decision-making is illustrated by the Grad-CAM, Grad-CAM++, and attention visualizations, which show that the model focuses on the discriminative regions.
Although the outcomes are promising, several limitations remain. The dataset size is small, which may limit the generalizability of the findings to a diverse population. The computational cost of training the Proposed Framework is also high, and real-world deployments would require optimized architectures. Clinical interpretability remains a persistent challenge as well: even when the visualizations are informative, they must be incorporated into routine workflows to be validated. Future work will involve validation on larger, multi-institutional datasets, optimization for clinical settings, and further study of explainability methods to enhance clinical trust and adoption.
In general, the presented results demonstrate that deep feature-level fusion and ensemble classification constitute a strong framework for analyzing histopathological images, achieving high diagnostic performance without sacrificing interpretability and offering useful clinical implications. A more comprehensive analysis of the literature on histopathological image classification from 2020 to 2025, as shown in
Table 9 below, reveals the outcomes obtained with both traditional and hybrid deep learning models. Earlier methods based on CNNs alone, or in combination with classical machine learning models such as XGBoost, reported consistent but generally moderate accuracy scores ranging from 89.9% to 95.3%. More recent works coupled CNN architectures (e.g., VGG16 and ResNet50) with machine learning classifiers, achieving up to 97.1% accuracy. Subsequently, in 2023–2024, advanced Vision Transformers were introduced and achieved high diagnostic performance owing to their understanding of global context, but most of them suffered from low localization sensitivity and high complexity.
Conversely, the Proposed Framework (CNN–ViT) combines CNN-based local feature extraction (VGG16) with Transformer-based global representation learning (Swin Transformer); the two representations are fused at the feature level and fed to the XGBoost classifier. It achieved 98.7% accuracy and a 98.7% F1-score, surpassing all previous methods. These findings confirm that a well-integrated fusion of CNNs and ViTs offers a powerful and generalizable model for high-precision binary classification of breast cancer from histopathological images.
6. Conclusions
This study introduces a robust framework that efficiently integrates the local feature representation strengths of VGG16 (CNN) with the global contextual relationships and long-range dependencies captured by the Swin Transformer (ViT). The Proposed Framework combines the CNN and ViT features through attention-based fusion, followed by dense-layer refinement and machine learning classifiers; XGBoost exhibited the highest performance among the evaluated ML classifiers, achieving 98.7% accuracy and a 98.7% F1-score on BreakHis and 95.8% accuracy on the external BACH dataset. The ablation study demonstrated that attention-based fusion and dense layers played essential roles in optimizing the model's performance and robustness. Results across datasets with varying magnifications underscore the model's versatility and robustness, enabling it to handle scale variations effectively. Moreover, explainability is provided by Grad-CAM, Grad-CAM++, and global attention visualizations, which identify the critical tissue areas of interest for diagnosis, thus balancing high performance with high interpretability.
Despite these positive results across datasets, some limitations persist. Owing to privacy and institutional data protection rules, the model was trained and tested only on publicly available datasets. As a result, large-scale clinical validation was not possible, which may have restricted exposure to rare or institution-specific tissue variants. Moreover, the deployment of Transformer-based architectures in hospitals may be limited by their high computational requirements. Collaboration with medical institutions via ethical data-sharing agreements is necessary to test the model on private clinical datasets, and lightweight or compressed variants of the framework could be developed to minimize computational cost without compromising diagnostic quality. The integration of multi-modal data, e.g., genomic or molecular data, with histopathological images would further enhance interpretability and diagnostic performance. Future studies could focus on multi-institutional datasets, self-supervised pre-training, and the incorporation of MIL-based validation to enhance model generalization and robustness. Overall, the Proposed Framework can serve as a reliable, accurate, and interpretable solution for computer-aided detection of breast cancer.