Article

Multi-Scale Vision Transformer with Optimized Feature Fusion for Mammographic Breast Cancer Classification

by Soaad Ahmed 1,†, Naira Elazab 2,†, Mostafa M. El-Gayar 2,3,*, Mohammed Elmogy 2,*,‡ and Yasser M. Fouda 1,‡

1 Computer Science Division, Mathematics Department, Faculty of Science, Mansoura University, Mansoura 35516, Egypt
2 Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt
3 Department of Computer Science, Arab East Colleges, Riyadh 11583, Saudi Arabia
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
‡ These authors also contributed equally to this work.
Diagnostics 2025, 15(11), 1361; https://doi.org/10.3390/diagnostics15111361
Submission received: 29 March 2025 / Revised: 25 May 2025 / Accepted: 25 May 2025 / Published: 28 May 2025
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract:
Background: Breast cancer remains one of the leading causes of mortality among women worldwide, highlighting the critical need for accurate and efficient diagnostic methods. Methods: Traditional deep learning models often struggle with feature redundancy, suboptimal feature fusion, and inefficient selection of discriminative features, leading to limitations in classification performance. To address these challenges, we propose a new deep learning framework that leverages MAX-ViT for multi-scale feature extraction, ensuring robust and hierarchical representation learning. A gated attention fusion module (GAFM) is introduced to dynamically integrate the extracted features, enhancing the discriminative power of the fused representation. Additionally, we employ Harris Hawks optimization (HHO) for feature selection, reducing redundancy and improving classification efficiency. Finally, XGBoost is utilized for classification, taking advantage of its strong generalization capabilities. Results: We evaluate our model on the King Abdulaziz University Mammogram Dataset, categorized based on BI-RADS classifications. Experimental results demonstrate the effectiveness of our approach, achieving 98.2% for accuracy, 98.0% for precision, 98.1% for recall, 98.0% for F1-score, 98.9% for the area under the curve (AUC), and 95% for the Matthews correlation coefficient (MCC), outperforming existing state-of-the-art models. Conclusions: These results validate the robustness of our fusion-based framework in improving breast cancer diagnosis and classification.

1. Introduction

Breast cancer imaging plays a critical role in reducing the high mortality rate associated with the disease. Early detection significantly improves survival rates by enabling timely treatment, which is why screening programs have been widely implemented. Breast cancer remains one of the leading causes of death among women worldwide, and the most effective approach to preventing its progression is early diagnosis and intervention [1,2]. Imaging techniques are also essential for evaluating and monitoring treatment responses. Among these, mammography screening remains the most reliable, efficient, and cost-effective method for detecting early signs of breast cancer. However, radiologists must meticulously analyze mammograms in order to identify abnormalities, making their expertise crucial in the diagnostic process. Consequently, medical organizations strongly recommend routine mammography screening, advising women aged 40 and older to undergo annual screening [3,4].
In recent years, computer-aided diagnosis (CAD) systems have emerged as a valuable tool in medical imaging, particularly for breast cancer detection. These systems help to reduce radiologists’ workloads by assisting in interpreting digital mammography images. The primary objective of CAD technology is to accurately differentiate malignant from benign cases, as approximately 65–90% of detected abnormalities are benign [5]. However, findings that are difficult to interpret, such as masses, architectural distortions, microcalcifications, and asymmetry, contribute to increased false positive rates [6]. Notably, the identification of microcalcifications has been clinically recognized as a key factor in improving the effectiveness of CAD systems. As a result, significant scientific interest has been directed toward developing CAD solutions for breast mass detection. By integrating these advanced systems, radiologists can distinguish between normal and cancerous tissues more effectively, enhancing diagnostic accuracy and patient outcomes [7].
Recent advancements in general computer vision such as hybrid architectures that synergize convolutional operations with self-attention mechanisms [8] have demonstrated remarkable success in multi-scale feature learning for heterogeneous data. This is especially significant for breast cancer classification tasks, where capturing both local tissue patterns and global structural context is critical for accurate diagnosis [9]. For instance, mixed-type models such as those in [10] dynamically fuse local and global features, achieving robustness across diverse natural image domains. However, these frameworks face unique challenges in medical imaging contexts, where class imbalance, limited annotated data, and subtle pathological features demand domain-specific adaptations. Our work bridges this gap by repurposing hierarchical vision paradigms for mammography, integrating MAX-ViT’s multi-scale attention with medical-tailored optimizations such as SMOTE for class balancing and HHO for feature selection. This approach retains the computational efficiency of general computer vision models while addressing the precision required for cancer screening.
Recent advancements in medical imaging such as the Segment Anything Model (SAM) [11] and DINOv2 [12] have demonstrated remarkable generalization across domains. However, SAM’s reliance on exhaustive annotations and DINOv2’s computational overhead limit their clinical adoption for mammography. Meanwhile, self-supervised frameworks such as MedSAM [13] and hierarchical ViTs [14] struggle with fine-grained localization of subtle lesions. Our work bridges this gap by integrating hierarchical attention (MAX-ViT) with evolutionary feature selection (HHO), achieving SOTA accuracy without requiring pixel-level annotations or multimodal data.
With technological advancements, machine learning (ML) and deep learning (DL) techniques have been increasingly utilized for breast cancer detection and classification. Common ML approaches for efficient diagnosis include support vector machines (SVMs) [15], logistic regression (LR), random forest (RF) [16], decision trees (DT) [17], and K-nearest neighbors (KNN) [18]. However, traditional ML methods often rely on manual feature extraction, which is complex and requires specialized domain knowledge from radiologists. In contrast, DL models can automatically learn and adapt, extracting relevant features directly from input data in accordance with the desired output. This ability simplifies the feature extraction and data engineering processes, improving both efficiency and model reusability [19].
DL approaches such as convolutional neural networks (CNNs) and vision transformers (ViTs) have demonstrated remarkable performance in medical imaging applications, including mammography-based breast cancer detection [20,21]. Traditional CNN architectures have been widely adopted for feature extraction and classification tasks; however, they suffer from limitations such as restricted receptive fields and difficulty in capturing long-range dependencies [22].
With the advent of transformer-based models, DL has taken another leap forward in medical image analysis. Unlike CNNs, which rely on local receptive fields, transformer models utilize self-attention mechanisms to capture long-range image dependencies. ViTs have shown remarkable performance in various computer vision tasks, outperforming traditional CNNs in some cases; however, the direct application of ViTs to medical imaging is still an active area of research due to their high computational demands and the need for large labeled datasets. Hybrid models that combine CNNs and transformers have emerged as a promising solution, leveraging the strengths of both architectures to improve breast cancer classification accuracy [23].
Feature fusion techniques are crucial in improving hybrid models’ robustness. By integrating multi-scale features extracted from different network layers, fusion mechanisms can enhance a model’s ability to distinguish between breast cancer stages. Several studies have explored fusion-based strategies for medical image classification. Yet, the challenge remains designing an effective mechanism to selectively integrate informative features while minimizing redundant or noisy information. Attention-based fusion methods such as gated attention mechanisms offer a potential solution by dynamically weighting important features and suppressing irrelevant ones [24,25].
In addition to architectural advancements, optimization techniques have been explored to further enhance DL model performance. Metaheuristic optimization algorithms have demonstrated effectiveness in fine-tuning hyperparameters and improving classification outcomes. Combining such optimization techniques with feature fusion strategies can significantly improve breast cancer classification accuracy, making DL models more reliable and efficient for real-world clinical applications [26,27].
This study’s proposed framework for mammography-based breast cancer classification introduces a novel hybrid feature extraction, fusion, and optimization approach to improve diagnostic accuracy. It consists of four main stages: feature extraction using MAX-ViT, which captures both local and global spatial dependencies in mammogram images; feature fusion using a newly designed gated attention fusion module (GAFM), which dynamically integrates features from multiple layers while suppressing irrelevant information; feature selection using Harris Hawks optimization (HHO), which intelligently selects the most discriminative features for classification; and classification using XGBoost, an ensemble learning method that ensures robust multi-class classification. The key novel contributions of our work include:
  • We combine transformer-based deep feature extraction, attention-guided fusion, metaheuristic feature selection, and gradient-boosted decision trees, forming an end-to-end system that enhances classification performance.
  • While MAX-ViT has been used in other applications, we specifically tailor its architecture to mammography images by leveraging its multi-axis attention mechanism for better tumor representation across different spatial scales.
  • Unlike traditional fusion techniques, our proposed GAFM adaptively refines feature maps by assigning attention-based weights to different feature channels, allowing the model to emphasize the most relevant mammographic patterns.
  • Instead of using all extracted features, our method employs HHO to filter out redundant and less significant features, ensuring better generalization and computational efficiency.
This study aims to develop a robust and computationally efficient deep learning framework for accurate breast cancer diagnosis using mammography images, addressing several critical limitations of existing methods: feature redundancy from suboptimal multi-scale fusion, overfitting on small datasets, and poor interpretability. By integrating a hierarchical vision architecture (MAX-ViT) with evolutionary feature selection (HHO) and dynamic attention-based fusion (GAFM), our framework seeks to improve diagnostic reliability while maintaining compatibility with clinical hardware, ultimately bridging the gap between computational advancements and real-world clinical needs.
To accomplish this, we design MAX-ViT to synergize convolutional and transformer layers for hierarchical mammographic feature extraction, develop GAFM to dynamically fuse multi-scale features while prioritizing clinically relevant regions, and optimize feature selection via HHO to eliminate redundancy and reduce computational overhead. These tasks ensure that our framework achieves state-of-the-art performance while aligning with diagnostic workflows, enabling earlier and more reliable breast cancer detection.
Our proposed framework integrating MAX-ViT, a gated attention fusion module (GAFM), Harris Hawks optimization (HHO), and XGBoost achieves a classification accuracy of 98.2%, F1-score of 98.0%, and MCC of 0.95 on the KAU-BCMD dataset. Compared to state-of-the-art baselines such as Swin Transformer that rely on traditional feature selection and classifiers, our model shows a +5.6% gain in accuracy, +6.2% improvement in F1-score, and a +0.12 increase in MCC, reflecting superior classification robustness and reduced false positives/negatives. These improvements underscore the clinical relevance of our contributions, especially in multi-class breast cancer screening scenarios.
The remainder of this paper is structured as follows: Section 2 reviews the relevant literature and existing methodologies; Section 3 presents the proposed methodology and explains our approach, including key components and techniques; Section 4 provides the experimental results and discusses the experimental setup, results, and analysis; finally, Section 5 summarizes our contributions and outlines potential avenues for future research.

2. Related Work

DL has revolutionized numerous fields, surpassing traditional methods in accuracy and efficiency. In medical imaging, DL-driven techniques have significantly advanced tumor detection and classification, particularly in breast cancer diagnosis. Automated tumor identification has become more precise and reliable by leveraging sophisticated image processing techniques. Liu et al. [28] proposed an innovative DL framework for classifying breast cancer molecular subtypes by integrating genomic and imaging data. Their approach employs a hybrid DL model that undergoes rigorous validation and achieves high accuracy. They designed a multimodal fusion framework that extracts features from distinct modalities, capturing diverse structural and pathological characteristics. The extracted features are then combined using a weighted linear fusion strategy, optimizing the integration of heterogeneous data for enhanced diagnostic performance.
Kousalya and Saranya [29] proposed an advanced breast cancer classification framework by leveraging DenseNet, a CNN, for feature extraction. These extracted features are processed through fully connected layers to distinguish between cancerous and benign cells. The model undergoes comprehensive training, validation, and evaluation to ensure robust classification performance. Meanwhile, Duggento et al. [30] explored DL methodologies for cross-domain and cross-disciplinary diagnosis, utilizing large-scale and complex real-world datasets. DL architectures have demonstrated exceptional capabilities in computational vision tasks, particularly in image enhancement and interpretation, leading to transformative advancements in medical imaging. The availability of extensive multi-center pathology image databases has further accelerated the development of specialized DL algorithms, enhancing diagnostic accuracy and efficiency in clinical applications.
Shi et al. [31] introduced an unsupervised DL framework that has proven effective in feature extraction and representation learning; in contrast, traditional methods such as principal component analysis (PCA) are highly sensitive to noise and outliers, which can compromise the performance of PCA-based networks. To address this limitation, they developed the Grassmann average network (GANet) and quaternion GANet algorithms to extract meaningful features from histopathology images while preserving critical color information. These advanced techniques enhance feature interpretability, contributing to more robust and accurate histopathological image analysis.
Tanaka et al. [32] fine-tuned pretrained VGG19 and ResNet152 models to develop an ensemble CNN framework using a dataset from the Japan Association of Breast and Thyroid Sonology (JABTS). The dataset contained 1536 breast masses, including 897 malignant and 639 benign cases. Their model achieved an AUC of 0.951, with a sensitivity of 90.9% and a specificity of 87.0%. Mokni and Haoues [33] introduced an optimized ResNet152 model called CADNet157 to enhance breast cancer diagnosis using mammography images. Their approach improved feature extraction by leveraging transfer learning and fine-tuning on CNN models such as VGG16 and InceptionResNetV2. Experiments on the DDSM and INbreast datasets achieved area under the curve (AUC) scores of 98.9% and 98.1%, respectively.
Vo et al. [34] leveraged DL models, particularly convolutional layers, to extract highly informative features for breast cancer detection. Their approach outperformed traditional handcrafted feature extraction methods, demonstrating the superior ability of DL models to capture complex patterns in medical images. Notably, they applied these techniques to tumor histopathology images that were previously considered challenging to diagnose using conventional methods, showcasing the transformative potential of DL in medical imaging.
Kumar et al. [35] further advanced DL-based histopathological analysis of breast tumors by introducing a novel framework tailored for tumor classification. They released a specialized dataset containing canine mammary tumor (CMT) histopathological (CMTHis) scans, expanding the scope of deep learning applications in oncology. Additionally, they proposed a VGG16-based hybrid framework, systematically evaluating its performance with various classifiers on the CMTHis dataset and the widely used BreakHis dataset of breast cancer cell lines. Their work highlights the growing impact of DL-driven approaches in automating and improving histopathological tumor diagnosis.
Abimouloud et al. [36] pioneered a fusion of self-attention transformers with compact convolutional transformers (CCTs) and TokenLearner (TVIT) models to enhance breast cancer classification from mammography images. Similarly, Ibrahim et al. [37] introduced Adaptive Multi-Attention Network (AMAN), which integrates the Xception DL model for feature extraction and gradient boosting for classification. This advanced framework exhibited exceptional diagnostic performance with an accuracy of 87% and an AUC of 95%, demonstrating its potential for improving precision in mammography-based breast cancer detection.
Tiryaki et al. [38] introduced an advanced deep transfer learning framework for classifying breast cancer masses and calcification diseases with high precision. Their approach leveraged a CNN trained on 3360 image patches extracted from the CBIS-DDSM and DDSM mammography databases. By integrating multiple state-of-the-art network architectures including ResNet50, NASNet, Xception, and EfficientNet-B7, they optimized the feature extraction process for improved classification. The Xception network demonstrated the highest performance, achieving an impressive AUC of 0.9317 on the CBIS-DDSM test set for a complex five-class classification task. Their study highlights the potential of transfer learning in enhancing diagnostic accuracy for mammographic image analysis.
Soulami et al. [39] introduced a novel capsule network architecture that significantly reduced the computational time of the original capsule network by a factor of 6.5, enabling efficient training of breast mass regions of interest (ROIs) on lower-cost GPUs. Their model was further enhanced through data augmentation techniques and the use of optimized kernel and capsule configurations during training. Evaluation results demonstrated the superior performance of this capsule-based model, particularly in one-stage classification of suspicious breast masses. Their model achieved 96.03% accuracy in binary classification (distinguishing normal from abnormal masses) and 77.78% accuracy in multi-class classification (categorizing breast masses into benign, malignant, and normal classes).
Mahesh et al. [40] introduced an optimized framework leveraging the EfficientNet-B7 architecture in combination with a targeted augmentation strategy incorporating aggressive random rotations, color jittering, and horizontal flipping to improve breast ultrasound image classification, achieving an accuracy of 98.2%. Similarly, Manna et al. [35] proposed the GradeDiff-IM model, which combines multiple machine learning and DL techniques for cancer grade classification. Their stacking ensemble approach achieved high classification accuracy of 98.2% for G1, 97.6% for G2, and 97.5% for G3, outperforming individual ML and DL models and improving overall grade classification accuracy. A summary of recent DL-based methods for breast cancer classification is presented in Table 1.
Several transformer-based models have recently advanced the state-of-the-art in medical imaging tasks. The SAM and its domain-specific MedSAM extension have enabled prompt-based and zero-shot segmentation capabilities across various anatomical structures. Similarly, UNeXt combines convolutional inductive biases with hierarchical attention for efficient and accurate segmentation. At the same time, DINOv2 and CLIP-based adaptations have extended the reach of self-supervised and contrastive learning to various medical domains. These models have demonstrated superior localization and representation learning, particularly in multimodal or weakly labeled scenarios.
In contrast, our study targets a different clinical task where segmentation is not the core objective, namely, multi-class breast cancer classification (BI-RADS staging) from full mammographic images. Instead, we focus on building a highly accurate, generalizable, and efficient classification pipeline using a hybrid transformer backbone (MAX-ViT), gated feature fusion (GAFM), and post-selection optimization (HHO + XGBoost). While segmentation-focused architectures such as SAM may be suitable for lesion delineation, our approach is designed to support clinical decision-making at the image-level diagnostic stage, where interpretability, low-latency inference, and handling of class imbalance are critical. Nonetheless, these advanced architectures inspire promising future directions such as region-aware attention masks or pretraining with multimodal contrastive signals.
Additionally, recent task-specific pipelines such as optimal trained deep learning models (OTDEMs) [41] and breast cancer prognosis-based transfer learning (BCP-TL) [42] have demonstrated the value of targeted transfer learning for breast cancer segmentation and prognosis. Our work complements these efforts by focusing on diagnostic classification rather than segmentation or survival prediction. Future extensions of our framework may incorporate domain adaptation or weak supervision using segmentation priors.
Existing breast cancer classification studies face several limitations, including dataset dependency, suboptimal feature extraction, high computational complexity, lack of interpretability, and poor multi-class classification performance. Many models rely on CNN-based feature extraction, which struggles to capture long-range dependencies, while transformer-based methods often have high computational costs. Additionally, several studies focus only on binary classification, leading to reduced effectiveness in multi-class settings. Our proposed model based on MAX-ViT with GAFM, HHO for feature selection, and XGBoost addresses these issues by leveraging MAX-ViT for efficient hierarchical feature extraction, GAFM for dynamic multi-scale feature fusion, HHO for optimized feature selection, and XGBoost for interpretable and computationally efficient classification. This integrated approach enhances generalization, reduces computational burden, and improves both binary and multi-class classification accuracy.

3. Materials and Methods

This section describes the proposed framework for breast cancer stage classification, which consists of four main stages: feature extraction using MAX-ViT, feature fusion using GAFM, feature selection using HHO, and classification using XGBoost. The proposed architecture for breast cancer classification is illustrated in Figure 1. Each component is detailed in the following subsections.

3.1. Preprocessing

Preprocessing is a critical step to enhance mammogram images and improve classification accuracy. Mammograms often suffer from noise, low contrast, and class imbalance, which can negatively impact feature extraction and classification performance [43]. We apply a series of preprocessing techniques to address these issues, including contrast enhancement, noise reduction, breast region segmentation, image normalization, resizing, and synthetic data augmentation.
  • Data Normalization:
Mammograms vary in intensity due to differences in acquisition settings. To ensure consistency, we apply min–max normalization to rescale pixel values to the range [ 0 , 1 ] , reducing intensity variations and stabilizing DL training.
$$I_{\text{norm}} = \frac{I - I_{\min}}{I_{\max} - I_{\min}}$$
  • Contrast Enhancement using CLAHE:
Mammograms often have low contrast that makes distinguishing abnormal tissues from normal structures difficult. To enhance local contrast while preserving details, we apply contrast-limited adaptive histogram equalization (CLAHE) [44,45,46]. Unlike traditional histogram equalization, CLAHE prevents over-enhancement of noise by applying localized contrast adjustments. The transformation is provided by
$$I_{\text{clahe}} = H_{\text{CLAHE}}(I_{\text{norm}}, N, C),$$
where N is the number of local regions (tiles) and C is the clip limit to prevent excessive contrast enhancement. Applying CLAHE improves the visibility of fine structures such as microcalcifications and tumor boundaries, which are critical for breast cancer diagnosis. Figure 2 shows example images after applying this technique.
  • Noise Reduction Using Gaussian Filtering:
We apply a Gaussian filter to reduce imaging noise while preserving essential features. This smooths the image, reducing high-frequency noise from acquisition artifacts or low-dose radiation:
$$I_{\text{filtered}}(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} G(i, j)\, I_{\text{clahe}}(x - i, y - j)$$
where G ( i , j ) is the Gaussian kernel, defined as
$$G(i, j) = \frac{1}{2 \pi \sigma^{2}} \exp\!\left(-\frac{i^{2} + j^{2}}{2 \sigma^{2}}\right),$$
where σ is the standard deviation of the Gaussian distribution that determines the spatial spread (width) of the kernel. A larger σ increases blurring, while a smaller σ preserves finer details. Gaussian filtering ensures that tumor edges and breast structures remain intact while reducing unwanted noise.
  • Breast Region Segmentation:
Mammograms often include background artifacts and labels that are irrelevant for cancer classification. To isolate the breast tissue, we apply Otsu’s thresholding followed by morphological operations to segment the breast region:
$$T^{*} = \arg\max_{T} \, \sigma_{B}^{2}(T)$$
where T * is the optimal threshold value maximizing the between-class variance σ B 2 ( T ) . Next, morphological dilation and closing operations are applied to refine the segmented breast region and remove small artifacts.
  • Data Augmentation:
Mammography datasets often suffer from class imbalance in which malignant cases are significantly fewer than benign or normal cases. Instead of conventional data augmentation (e.g., rotation, flipping), we use the synthetic minority over-sampling technique (SMOTE) to generate synthetic samples for underrepresented classes [47,48].
Before applying SMOTE, we enhance the dataset diversity by applying random rotation ( ± 15 ° ), horizontal and vertical flipping, random cropping and zooming (10–15%), and elastic transformations for deformation variability. These augmentations increase intra-class variability and help the model to generalize better.
To handle class imbalance, we apply SMOTE, which generates synthetic samples by interpolating between minority class examples. Given a sample x i , SMOTE generates a new synthetic sample x new as follows:
$$x_{\text{new}} = x_{i} + \lambda \, (x_{\text{neighbor}} - x_{i})$$
where x neighbor is a randomly selected nearest neighbor from the same class and λ is a random number in [ 0 , 1 ] used to maintain smooth interpolation. SMOTE ensures that the model receives a balanced dataset, allowing for improved classification robustness and preventing bias toward majority classes. This preprocessing pipeline ensures high-quality input data for the MAX-ViT + GAFM + HHO + XGBoost classification model, resulting in enhanced breast cancer detection and staging performance. We applied SMOTE post-split, generating synthetic samples only for the training fold. The validation and test sets remained unmodified to ensure unbiased evaluation.
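The preprocessing pipeline described above can be summarized in code. The following is a minimal sketch assuming OpenCV for CLAHE, Gaussian filtering, and Otsu-based segmentation, and imbalanced-learn for SMOTE; the tile grid, clip limit, kernel size, and σ values are illustrative rather than the exact settings used in our experiments.

```python
import cv2
import numpy as np
from imblearn.over_sampling import SMOTE

def preprocess_mammogram(image: np.ndarray) -> np.ndarray:
    """Normalize, enhance, denoise, and segment a single grayscale mammogram."""
    # Min-max normalization to [0, 1], then back to 8-bit for OpenCV operations.
    img = image.astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    img_u8 = (img * 255).astype(np.uint8)

    # CLAHE: local contrast enhancement with a clip limit to avoid amplifying noise.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # illustrative C and N
    img_clahe = clahe.apply(img_u8)

    # Gaussian filtering to suppress high-frequency acquisition noise.
    img_filtered = cv2.GaussianBlur(img_clahe, (5, 5), 1.0)

    # Otsu's threshold plus morphological closing/dilation to isolate the breast region.
    _, mask = cv2.threshold(img_filtered, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.dilate(mask, kernel, iterations=1)

    # Keep only the breast tissue; rescale back to [0, 1] for the network input.
    segmented = cv2.bitwise_and(img_filtered, img_filtered, mask=mask)
    return segmented.astype(np.float32) / 255.0

# Class balancing is applied to the *training* samples only (post-split),
# never to validation or test data.
def balance_training_features(X_train: np.ndarray, y_train: np.ndarray):
    return SMOTE(random_state=42).fit_resample(X_train, y_train)
```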

3.2. Feature Extraction Using MAX-ViT

ViTs have emerged as a powerful alternative to CNNs for mammography image analysis. Unlike CNNs, which rely on local receptive fields, ViTs process images as sequences of non-overlapping patches and employ self-attention mechanisms to model long-range dependencies. This capability is crucial for mammography, where capturing fine-grained details such as microcalcifications and global breast tissue structures is essential for accurate breast cancer classification [49].
MAX-ViT extends the standard ViT architecture by introducing a multi-axis attention mechanism that efficiently captures both local lesion characteristics (e.g., small tumors, calcifications) and global tissue asymmetries within mammograms [14]. The hierarchical processing of MAX-ViT ensures that both subtle abnormalities and overall breast patterns are effectively learned, making it particularly advantageous for breast cancer detection and staging.
Mammography images provide high-resolution X-ray scans of breast tissue, capturing essential structural details necessary for early cancer detection. Unlike MRI, which visualizes soft tissue contrasts, mammography focuses on identifying subtle abnormalities such as microcalcifications, masses, and distortions. We utilize the MAX-ViT vision transformer-based model to effectively process these images, which segments an input image I of size H × W × C into non-overlapping patches. Each patch is transformed into an embedding vector using a linear projection:
$$X_{p} = \text{Linear}(\text{Flatten}(P_{i}))$$
where X p is the projected feature vector, P i represents the i-th patch extracted from the mammography image, and Flatten converts the patch into a vectorized representation. This tokenization allows the model to process mammography scans as a sequence of embeddings, enabling self-attention mechanisms to capture meaningful spatial relationships across the entire breast tissue structure.
As mentioned above, MAX-ViT enhances the standard vision transformer architecture by incorporating a multi-axis attention mechanism that efficiently models both local and global dependencies within mammograms. This approach ensures effective capture of the subtle textural patterns and tissue asymmetries that are crucial for early-stage breast cancer detection. The attention mechanism is computed as follows:
$$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$
where Q , K , V are the query, key, and value matrices derived from patch embeddings and d k is the dimension of the key matrix. The softmax function normalizes the attention scores to emphasize the most relevant regions of the mammogram. Unlike conventional self-attention, which has quadratic complexity, the multi-axis attention mechanism reduces computational overhead while retaining the ability to extract diagnostically significant features.
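As a concrete reference for the attention computation above, the following NumPy sketch implements single-head scaled dot-product attention over a sequence of patch embeddings; the token count and embedding dimension are arbitrary examples.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (num_queries, num_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (num_queries, d_v)

# Example: 64 patch tokens with 32-dimensional embeddings, used as self-attention.
tokens = np.random.randn(64, 32)
out = scaled_dot_product_attention(tokens, tokens, tokens)
```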
MAX-ViT constructs a hierarchical feature representation by stacking multiple layers with different patch sizes and attention operations. This multi-scale approach is particularly beneficial for mammography-based classification, as it allows the model to capture fine-grained tumor structures while recognizing broader tissue anomalies. The hierarchical structure ensures that microcalcifications and larger tumor masses are effectively analyzed, improving classification performance across different breast cancer stages.
Because transformers lack intrinsic spatial biases, positional encodings are incorporated to maintain spatial relationships between patches. MAX-ViT applies learned or sinusoidal positional embeddings to preserve structural consistency across attention layers. These positional embeddings are added to the input token representations before they are processed through self-attention blocks, ensuring that critical spatial information within the mammogram is retained. MAX-ViT’s ability to extract features at multiple scales is essential for detecting localized lesions while also understanding the global composition of breast tissue. Mammography images contain highly variable textures depending on breast density and imaging conditions, making hierarchical feature extraction crucial for distinguishing malignant cases from benign ones.
Although transformer-based models are computationally intensive, MAX-ViT mitigates this issue through its efficient attention mechanisms, significantly reducing the number of operations required per layer. This optimization makes applying MAX-ViT to large-scale mammography datasets feasible while maintaining high classification accuracy. MAX-ViT extracts multi-scale features from mammography images using a combination of patch embeddings, multi-axis self-attention, and positional encodings. This structured feature extraction process ensures that both fine and coarse details are captured effectively, providing a robust foundation for accurate breast cancer classification.
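In practice, a pretrained MAX-ViT backbone can serve as the feature extractor. The sketch below assumes the `timm` library and one of its MaxViT variants; the specific model name, input resolution, and use of pooled embeddings are illustrative assumptions rather than a description of our exact training configuration.

```python
import timm
import torch

# A MaxViT backbone from timm used as a frozen feature extractor.
backbone = timm.create_model(
    "maxvit_tiny_tf_512", pretrained=True, num_classes=0  # num_classes=0 -> pooled embedding
)
backbone.eval()

@torch.no_grad()
def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """batch: (B, 3, 512, 512) preprocessed mammograms replicated to 3 channels."""
    return backbone(batch)  # (B, embed_dim) pooled hierarchical features

# Example: grayscale mammograms expanded to 3 channels for the pretrained stem.
images = torch.rand(2, 1, 512, 512).repeat(1, 3, 1, 1)
feats = extract_features(images)
print(feats.shape)
```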

3.3. Multi-Scale Feature Fusion Using GAFM

DL models extract hierarchical features from mammography images, capturing different levels of information. High-level features represent global breast tissue structures, while low-level features focus on fine-grained abnormalities such as microcalcifications, masses, and architectural distortions. The GAFM is designed to dynamically integrate multi-scale features, ensuring that the most diagnostically relevant information is retained while filtering out redundant or noisy features [50].
The GAFM enhances feature fusion by applying an attention mechanism to assign adaptive weights to different feature scales. This process enables the network to prioritize critical feature levels that contribute significantly to breast cancer classification, resulting in a robust diagnostic model. Given a set of extracted feature maps F i , the GAFM computes a weighted sum using learnable attention parameters. The attention weights are computed as follows:
$$\alpha_{i} = \sigma(W_{i} \cdot F_{i} + b_{i})$$
where W i and b i are learnable parameters and σ represents a nonlinear activation function. These attention weights control the contribution of each feature map, allowing the model to focus on the most informative regions of the mammography images.
To refine the fusion process, a gated mechanism selectively enhances important features while suppressing less relevant ones. The final fused representation is obtained as follows:
$$F_{\text{fused}} = \sum_{i=1}^{n} \alpha_{i} \odot F_{i}$$
where α i represents the gating weights applied to each feature map and ⊙ denotes element-wise multiplication. This formulation emphasizes significant mammographic patterns, improving the model’s ability to distinguish malignant from benign cases.
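A minimal PyTorch sketch of this gating scheme is shown below, assuming the multi-scale feature maps have already been pooled to vectors of a common dimension; the layer sizes and batch shapes are illustrative.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Gated attention fusion sketch: alpha_i = sigma(W_i . F_i + b_i),
    F_fused = sum_i alpha_i (element-wise) F_i."""

    def __init__(self, num_scales: int, feat_dim: int):
        super().__init__()
        # One learnable gate per feature scale.
        self.gates = nn.ModuleList(nn.Linear(feat_dim, feat_dim) for _ in range(num_scales))

    def forward(self, features: list) -> torch.Tensor:
        fused = torch.zeros_like(features[0])
        for gate, f in zip(self.gates, features):
            alpha = torch.sigmoid(gate(f))   # attention weights in (0, 1)
            fused = fused + alpha * f        # element-wise gating, then summation
        return fused

# Example: fuse three scales of 768-dimensional features for a batch of 4 images.
gafm = GatedAttentionFusion(num_scales=3, feat_dim=768)
fused = gafm([torch.randn(4, 768) for _ in range(3)])
```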
By dynamically controlling feature contributions, the GAFM prevents redundancy and enhances the discriminative power of the classification network. This is crucial in mammography-based diagnosis, where irrelevant or redundant features could lead to false positives or false negatives. The GAFM is computationally efficient, introducing minimal overhead while significantly improving feature representation and fusion effectiveness.
Unlike simple concatenation or averaging, the GAFM introduces a learnable mechanism that adapts to the specific imaging characteristics of mammography. Different breast cancer features manifest at various scales, making multi-scale feature fusion essential for accurate diagnosis. The GAFM emphasizes critical tumor patterns, improving classification performance across different cancer stages.
Mammography images exhibit variations due to differences in acquisition settings, breast density, and patient-specific conditions. The GAFM’s adaptive fusion strategy mitigates these variations, enhancing the model’s robustness across different imaging protocols. By dynamically selecting the most relevant features, the model is able to generalize well to diverse mammography datasets.
To ensure optimal performance of the proposed framework, fine-tuning is performed by adjusting hyperparameters such as the learning rate, batch size, number of layers, and dropout rate. This fine-tuning process ensures that the model generalizes effectively to unseen mammography images. The hyperparameter settings used in the proposed framework are summarized in Table 2.
These hyperparameters were determined through extensive experimental evaluation and grid search to optimize classification performance on mammography images.

3.4. Feature Selection Using HHO

Feature selection is a crucial step in our mammography breast cancer classification framework. Because MAX-ViT extracts a large set of features and the GAFM fuses them to enhance their discriminative power, removing redundant and less informative features before classification is an essential step in the process. HHO plays a vital role in selecting the most relevant features contributing to accurate classification [51,52].
HHO is a nature-inspired metaheuristic algorithm that mimics the cooperative hunting behavior of Harris Hawks [53]. The optimization process consists of two main phases: (1) exploration, in which hawks randomly search for promising feature subsets; and (2) exploitation, where they refine the selection by adjusting their positions based on the best solution found thus far. In our customized application, HHO is adapted to work with the fused feature set F fused obtained from the GAFM, ensuring that the final feature subset is optimal for XGBoost classification.
The input to the HHO-based feature selection process is the fused feature matrix F fused of size N × d , where N is the number of mammography images in the dataset and d is the dimensionality of the extracted features, i.e., the number of features obtained from MAX-ViT and fused via the GAFM.
Each candidate solution (hawk position) in HHO represents a binary feature selection mask X = ( x 1 , x 2 , , x d ) , where
$$x_{i} = \begin{cases} 1, & \text{if feature } i \text{ is selected} \\ 0, & \text{if feature } i \text{ is discarded.} \end{cases}$$
This encoding ensures that HHO optimally selects a subset of features that maximizes classification performance. HHO initializes a population of hawks in which each hawk represents a potential feature subset. The initial population is randomly generated as follows:
$$X_{j}^{0} = \{x_{1}^{j}, x_{2}^{j}, \ldots, x_{d}^{j}\}, \quad x_{i}^{j} \in \{0, 1\}, \quad j \in \{1, \ldots, P\}$$
where X j 0 is the initial feature subset of the j-th hawk, P is the total number of hawks (solutions) in the population, and x i j is a binary value indicating whether feature i is selected by hawk j.
To improve the initial population, we apply a probabilistic selection mechanism that prioritizes high-variance features, ensuring that the most informative features will likely be included initially. During the exploration phase, hawks randomly explore feature subsets to identify promising regions in the solution space. The position of each hawk (feature subset) is updated as follows:
$$X_{j}^{t+1} = X_{j}^{t} + r_{1} \times |X_{j}^{t} - X_{\text{rand}}|$$
where X j t is the feature subset of the j-th hawk at iteration t, X rand is a randomly selected feature subset from the population, and r 1 is a random number in the range [ 0 , 1 ] , which ensures stochastic exploration.
This equation allows hawks to diversely explore different feature subsets, preventing the algorithm from becoming trapped in local minima. This means that HHO explores different combinations of features extracted from mammography images to identify subsets that maximize classification accuracy. To evaluate the quality of each feature subset, we use the XGBoost classification accuracy as the fitness function:
$$\text{Fitness}(X_{j}) = \text{Accuracy}_{\text{XGBoost}}(X_{j})$$
where X j is the feature subset selected by the j-th hawk. After each iteration, the best solution X best is updated based on the highest classification accuracy achieved thus far. After identifying promising feature subsets, the exploitation phase refines them by adjusting the hawk positions relative to X best . This is done using the following equation:
$$X_{j}^{t+1} = X_{\text{best}} - E \times |J \times X_{\text{best}} - X_{j}^{t}|$$
where X best is the best feature subset found thus far, E is the escape energy, which controls the intensity of feature selection, and J is a random jump strength factor that ensures adaptive learning.
This equation ensures that hawks gradually converge toward the optimal feature subset, refining the selection process to retain only the most discriminative features for mammography breast cancer classification.
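The listing below gives a simplified binary HHO sketch of this selection step. It keeps only one exploration rule and one (hard-besiege) exploitation rule, binarizes positions with a sigmoid transfer function, and evaluates fitness with a small cross-validated XGBoost model; the population size, iteration count, and reduced set of HHO phases are simplifications of the full algorithm.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def hho_feature_selection(X, y, n_hawks=10, n_iter=20, seed=0):
    """Simplified binary Harris Hawks optimization for feature selection.
    Fitness = cross-validated XGBoost accuracy on the selected subset.
    y is assumed to be integer-encoded class labels."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    hawks = rng.integers(0, 2, size=(n_hawks, d))            # binary selection masks

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        clf = XGBClassifier(n_estimators=100, verbosity=0)
        return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

    scores = np.array([fitness(h) for h in hawks])
    best = hawks[scores.argmax()].copy()

    for t in range(n_iter):
        # Escape energy decays over iterations and switches exploration/exploitation.
        E = 2 * (1 - t / n_iter) * (2 * rng.random() - 1)
        for j in range(n_hawks):
            if abs(E) >= 1:   # exploration: perturb toward a randomly chosen hawk
                x_rand = hawks[rng.integers(n_hawks)]
                new_pos = hawks[j] + rng.random(d) * np.abs(hawks[j] - x_rand)
            else:             # exploitation: hard besiege around the best subset
                J = 2 * (1 - rng.random())
                new_pos = best - E * np.abs(J * best - hawks[j])
            prob = 1.0 / (1.0 + np.exp(-new_pos))             # transfer function -> (0, 1)
            candidate = (rng.random(d) < prob).astype(int)
            cand_score = fitness(candidate)
            if cand_score > scores[j]:                        # greedy replacement
                hawks[j], scores[j] = candidate, cand_score
        best = hawks[scores.argmax()].copy()

    return best.astype(bool)                                  # mask of selected features
```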

3.5. Classification Using XGBoost

After extracting relevant features using MAX-ViT, fusing multi-scale information using the GAFM, and selecting the most discriminative features via HHO, the final step in our proposed framework is classification. For this purpose, we employ the eXtreme Gradient Boosting (XGBoost) classifier, which has demonstrated superior performance in high-dimensional feature spaces and is well suited for medical image classification tasks [54].
XGBoost constructs an ensemble of decision trees iteratively. Given the selected feature subset X final from the HHO step and corresponding labels Y, the model learns a function f ( X ) that minimizes the following loss:
$$\hat{Y} = f(X) = \sum_{k=1}^{K} T_{k}(X)$$
where Y ^ is the predicted class label for a mammography image, T k ( X ) represents the k-th decision tree, and K is the total number of trees in the ensemble. At each iteration, a new tree T k is added to minimize the objective function
$$\mathcal{L}(\Theta) = \sum_{i=1}^{N} l(y_{i}, \hat{y}_{i}) + \sum_{k=1}^{K} \Omega(T_{k}),$$
where l ( y i , y ^ i ) is the loss function measuring the difference between true and predicted labels and Ω ( T k ) is a regularization term used to control tree complexity and prevent overfitting.
XGBoost employs a second-order Taylor expansion to approximate the loss and efficiently optimize model training:
$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{N} \left[ g_{i} f(X_{i}) + \frac{1}{2} h_{i} f^{2}(X_{i}) \right] + \Omega(T_{k})$$
where $g_{i} = \frac{\partial l(y_{i}, \hat{y}_{i}^{(t-1)})}{\partial \hat{y}_{i}^{(t-1)}}$ is the first-order gradient and $h_{i} = \frac{\partial^{2} l(y_{i}, \hat{y}_{i}^{(t-1)})}{\partial (\hat{y}_{i}^{(t-1)})^{2}}$ is the second-order gradient.
To maximize classification accuracy, we fine-tuned the XGBoost hyperparameters using a grid search approach. The key hyperparameters and their optimized values are listed in Table 3.
XGBoost serves as the final classifier in our mammography breast cancer classification framework. Leveraging HHO-selected features ensures robust and accurate cancer staging while handling class imbalance and reducing computational costs. Its combination of feature selection and gradient boosting makes XGBoost a powerful tool for mammography-based medical diagnosis.
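As an illustration of this final stage, the snippet below tunes a multi-class XGBoost classifier with a grid search; synthetic data stands in for the HHO-selected feature matrix, and the grid values are examples rather than the exact ranges behind Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the HHO-selected feature matrix (6 BI-RADS-style classes).
X, y = make_classification(n_samples=1000, n_features=120, n_informative=40,
                           n_classes=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Illustrative hyperparameter grid.
param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [4, 6],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}
xgb = XGBClassifier(objective="multi:softprob", tree_method="hist", random_state=42)
search = GridSearchCV(xgb, param_grid, scoring="accuracy", cv=5, n_jobs=-1)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_te, y_te))
```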

4. Experimental Results

4.1. Dataset Description

In this study, we utilized the King Abdulaziz University Breast Cancer Mammogram Dataset (KAU-BCMD) [55], a publicly available dataset designed to support breast cancer detection and classification research. The dataset was collected from the Sheikh Mohammed Hussein Al-Amoudi Center of Excellence in Breast Cancer at King Abdulaziz University (KAU), Jeddah, Saudi Arabia between April 2019 and March 2020. It comprises a diverse set of mammogram images annotated and reviewed by expert radiologists, making it a valuable resource for developing and evaluating CAD systems.
The KAU-BCMD dataset includes 5662 mammogram images obtained from 1416 cases, covering a wide range of breast cancer stages and conditions. Table 4 summarizes the key characteristics of the KAU-BCMD dataset. Each case contains bilateral mammograms with two standard views—craniocaudal (CC) and mediolateral oblique (MLO)—for both the right and left breasts. The dataset is provided in DICOM format, ensuring high-resolution images suitable for DL-based analysis. Figure 3 shows example images from the dataset.

4.2. Evaluation Metrics

Several evaluation metrics were utilized to comprehensively assess the proposed DL model’s performance for multi-class breast cancer classification. These metrics ensure a balanced evaluation by considering various aspects of classification performance, including accuracy, sensitivity, specificity, and robustness. The evaluation metrics used in this study are as follows:
  • Accuracy: Measures the proportion of correctly classified samples among the total samples. It is calculated as
    $$Accuracy = \frac{TP + TN}{TP + TN + FP + FN},$$
    where TP (True Positives) and TN (True Negatives) represent correctly classified instances while FP (False Positives) and FN (False Negatives) indicate misclassified instances.
  • Precision: Measures the reliability of positive predictions by calculating the ratio of correctly predicted positive instances to the total predicted positive instances:
    $$Precision = \frac{TP}{TP + FP}.$$
  • Recall (Sensitivity): Evaluates the model’s ability to correctly identify positive cases:
    $$Recall = \frac{TP}{TP + FN}.$$
  • F1-Score: The harmonic mean of precision and recall, it provides a balanced evaluation, particularly for imbalanced datasets:
    $$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$
  • Area Under the Curve (AUC-ROC): The AUC-ROC evaluates a model’s ability to distinguish between different classes. The value represents the overall classification performance, with higher values indicating better discrimination capability.
  • Specificity: Also known as the true negative rate, the specificity measures a model’s ability to correctly classify negative cases:
    $$Specificity = \frac{TN}{TN + FP}.$$
  • Matthews Correlation Coefficient (MCC): A robust metric that evaluates classification performance even when the dataset is imbalanced:
    $$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
  • Balanced Accuracy: Addresses class imbalance by averaging the recall values of all classes:
    $$Balanced\ Accuracy = \frac{Sensitivity + Specificity}{2}.$$
  • Cohen’s Kappa Coefficient: Measures the level of agreement between predicted and actual classifications while considering chance agreements:
    $$Kappa = \frac{P_{o} - P_{e}}{1 - P_{e}},$$
    where P o is the observed agreement and P e is the expected agreement by chance; higher kappa values indicate better model reliability.
Utilizing these evaluation metrics ensures a comprehensive performance assessment of the proposed DL model. This multi-metric evaluation approach helps us to understand the model’s strengths and weaknesses, particularly in the context of breast cancer classification where sensitivity and specificity are critical for accurate diagnosis and treatment planning.
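These metrics can be computed directly with scikit-learn, as in the following sketch; macro averaging is one reasonable choice for the multi-class BI-RADS setting, and specificity, which scikit-learn does not expose as a single call, is derived here from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, matthews_corrcoef, balanced_accuracy_score,
                             cohen_kappa_score, confusion_matrix)

def evaluate(y_true, y_pred, y_proba):
    """Multi-class metrics; macro averaging treats every BI-RADS class equally.
    y_proba must have shape (n_samples, n_classes)."""
    cm = confusion_matrix(y_true, y_pred)
    tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)
    fp = cm.sum(axis=0) - np.diag(cm)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"),
        "specificity": float((tn / (tn + fp)).mean()),   # macro-averaged per-class specificity
        "mcc": matthews_corrcoef(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }
```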

4.3. Results

In this section, we present the experimental results of our proposed model and compare its performance with a variety of different DL architectures, hybrid CNN + ViT models, and classifiers. In addition, we analyze the impact of different configurations of MAX-ViT by evaluating its standalone performance, performance with feature fusion, and performance with different optimization techniques.
Computational efficiency metrics were measured under real-world constraints using Google Colab’s T4 GPU (16 GB VRAM). Inference latency was averaged over 1000 test images at 1024 × 1024 resolution with reduced-precision arithmetic optimizations. FLOPs and GPU memory usage were quantified using standard profiling tools, and dynamic batch sizing (1–8 images) was tested to simulate clinical workflows. For transparency, we provide a reproducibility package with hardware diagnostics and preconfigured benchmarking workflows.
Table 5 summarizes the proposed framework’s performance on Google Colab hardware. The model achieves a throughput of 17.2 images/s (58 ms/image) with 21.4 GFLOPs, balancing clinical-grade accuracy (98.2%) and practical speed. While slower than MobileNetV3 (22.4 images/s), our framework retains superior diagnostic performance ( Δ F1-score = 5.5%). HHO feature selection reduces XGBoost’s inference latency by 63% (1.8 ms vs. 4.9 ms), while MAX-ViT’s hybrid design cuts FLOPs by 38% compared to pure ViT architectures (Table 6).
Table 7 presents a comparison of various pretrained DL models, including ResNet-50, DenseNet-121, EfficientNet-B3, Swin Transformer, MetaFormer, CvT, ConvNeXt, and our proposed MAX-ViT model. The results demonstrate that transformer-based models such as Swin Transformer and MetaFormer outperform conventional CNN-based models such as ResNet-50 and EfficientNet-B3. This confirms the effectiveness of self-attention mechanisms in capturing critical patterns in mammogram images. Among all models, MAX-ViT achieves the highest accuracy, precision, recall, and AUC, highlighting the advantage of its hierarchical vision transformer structure in breast cancer classification.
To ensure a fair and valid benchmarking process, all deep learning baseline models in Table 7, including EfficientNet-B3 and ConvNeXt, were retrained or fine-tuned under a uniform experimental setup. This setup included consistent preprocessing, augmentation, SMOTE-based class balancing (applied only to training folds), and stratified 5-fold cross-validation. The evaluation was performed using the same held-out test set and early-stopping protocol across all models. No architecture-specific tuning (e.g., compound scaling in EfficientNet or specialized layer configurations) was applied to any baseline, ensuring that the comparisons reflect genuine differences in representational capacity rather than parameter optimization. Although this uniformity may yield lower performance than reported in isolated studies for some architectures, it establishes a controlled and unbiased basis for evaluating the relative effectiveness of our proposed model.
Our proposed model outperforms all other models, achieving over 98% accuracy and significantly higher AUC. We further examined how different classifiers impact the performance of DL models. To further analyze the effectiveness of transformer integration, we evaluated hybrid architectures that combine CNNs with ViT, including VGG16 + ViT, MobileNet + ViT, InceptionV3 + ViT, and InceptionResNetV2 + ViT, along with multiple classifiers. The classification results in Table 8 indicate that these hybrid models generally show improved performance compared to standalone CNN architectures. Notably, InceptionResNetV2 + ViT outperforms other CNN + ViT combinations, suggesting that deeper feature extraction networks combined with transformers yield superior feature representations. However, despite these improvements, MAX-ViT still surpasses all CNN + ViT models, demonstrating that a fully transformer-based model is more effective in mammogram classification.
To assess the impact of different classifiers on DL feature representations, Table 9 provides a performance comparison of various classifiers, including SVM, KNN, DT, naïve Bayes (NB), LR, RF, LightGBM, multi-layer perceptron (MLP), and XGBoost. The results reveal that tree-based classifiers, particularly XGBoost and random forest, outperform traditional classifiers such as SVM and KNN. This indicates that DL-extracted features benefit significantly from boosting-based classifiers, which enhance decision boundaries in high-dimensional feature spaces. XGBoost achieves the highest accuracy and AUC across all models, further justifying its use in our proposed MAX-ViT framework.
To ensure rigorous evaluation and mitigate overfitting risks, we employed a stratified 5-fold cross-validation strategy. The dataset was partitioned into five folds while preserving the class distribution across splits. During cross-validation, SMOTE was applied exclusively to the training fold in order to prevent data leakage, while the validation and test folds remained unmodified. A held-out test set (20% of the dataset) was used for final evaluation, which was neither sampled nor augmented during training. Regularization techniques, including dropout layers (rate = 0.3) in the MAX-ViT encoder and L2 regularization ( λ = 0.01) in the XGBoost classifier, were applied to penalize model complexity. Training was halted early if validation loss plateaued for ten epochs. Statistical significance of performance differences against baseline models was assessed using McNemar’s test ( α = 0.01).
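A sketch of this protocol is given below, with SMOTE fit only on the training portion of each fold and an L2 penalty (reg_lambda) on the XGBoost classifier; all other hyperparameters shown are placeholders rather than the exact values used in our experiments.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def cross_validate_with_smote(X, y, n_splits=5, seed=42):
    """Stratified k-fold CV where SMOTE is fit only on each training fold,
    so no synthetic sample ever leaks into the corresponding validation fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs, mccs = [], []
    for train_idx, val_idx in skf.split(X, y):
        # Oversample the training fold only; validation data stays untouched.
        X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X[train_idx], y[train_idx])
        clf = XGBClassifier(objective="multi:softprob", reg_lambda=0.01, random_state=seed)
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X[val_idx])
        accs.append(accuracy_score(y[val_idx], pred))
        mccs.append(matthews_corrcoef(y[val_idx], pred))
    return np.mean(accs), np.std(accs), np.mean(mccs), np.std(mccs)
```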
Table 10 summarizes the cross-validated performance of the proposed framework. The model achieved a mean accuracy of 97.6% (±0.4% standard deviation) and an MCC of 0.93 (±0.02) across folds, with 95% confidence intervals of 97.2–98.0% for accuracy and 0.91–0.95 for MCC. These metrics align with clinical feasibility for multi-class mammogram classification and reflect reduced variance compared to single-split evaluations. McNemar’s test confirmed statistically significant superiority over all baseline models (p < 0.001). The framework retained robust performance on the held-out test set (accuracy: 97.1%, MCC: 0.91), demonstrating generalizability within the dataset distribution, while synthetic oversampling improved minority class recall (e.g., BI-RADS 4/5).
We report all evaluation metrics, including accuracy, precision, recall, F1-score, AUC, specificity, sensitivity, balanced accuracy, MCC, and Cohen’s Kappa, along with mean ± standard deviation across cross-validation folds. Furthermore, we computed 95% confidence intervals and performed paired t-tests to compare the proposed model against baseline models (with p < 0.05 considered significant). This provides a statistically grounded evaluation of the model’s reliability (see Table 11).
Table 12 clarifies how SMOTE was responsibly used post-splitting to avoid data leakage. When applied to raw transformer features, SMOTE introduced synthetic redundancy, resulting in lower minority-class (BI-RADS 4) F1-score (82.1%) and increased performance variance. Using HHO to select robust features before oversampling yielded better generalization, a substantial F1-score improvement (94.7%), higher MCC, and lower standard deviation. These results affirm that applying SMOTE after feature selection mitigates overfitting risks and ensures statistically reliable augmentation.
To evaluate the contributions of each component in the proposed MAX-ViT + GAFM + HHO + XGBoost framework, we performed a detailed ablation study. Table 13 presents the results of several reduced variants of our model, isolating the effects of GAFM (vs. concatenation), HHO (vs. L1-based selection), and XGBoost (vs. simpler classifiers).
The ablation study (Table 13) demonstrates that each component in the proposed pipeline contributes meaningfully to overall performance. Integrating the GAFM instead of simple concatenation improved accuracy by over 2%, while replacing L1 regularization with HHO led to further gains in both F1-score and MCC. Ensemble classifiers outperformed linear ones, with XGBoost achieving the best results across all metrics—accuracy (98.2%), AUC (0.997), F1-score (0.980), and MCC (0.95)—while also maintaining the lowest standard deviation, indicating superior robustness. These consistent improvements confirm that the final pipeline configuration was selected based on both accuracy and stability across folds.
To further evaluate the impact of key components in our proposed model, we analyzed different configurations of MAX-ViT. Table 14 compares the standalone MAX-ViT model, MAX-ViT with feature fusion (GAFM), and MAX-ViT with both feature fusion and hyperparameter optimization (HHO). The results indicate that applying feature fusion significantly improves performance by dynamically integrating Swin Transformer and MetaFormer features. Additionally, incorporating optimization techniques further enhances classification accuracy and robustness. The complete MAX-ViT + GAFM + HHO + XGBoost pipeline achieved the highest performance across all metrics, confirming the effectiveness of combining feature fusion and optimization strategies.
In detail, Table 14 reports results for (1) standalone MAX-ViT without fusion or optimization, (2) MAX-ViT with GAFM-based feature fusion, (3) MAX-ViT with both GAFM and HHO-based feature selection, and (4) the complete pipeline incorporating XGBoost classification. The GAFM increases classification accuracy from 93.5% to 94.7% by enabling dynamic cross-architecture attention between Swin Transformer and MetaFormer features. Adding HHO further improves performance to 96.0% by eliminating redundant or irrelevant feature channels. Finally, integrating XGBoost as the classifier raises the final accuracy to 98.2%, indicating its strength in handling high-dimensional optimized features.
The efficiency metrics in Table 6 further show that HHO reduces inference latency by 63% (1.8 ms vs. 4.9 ms) and decreases FLOPs by 72% without compromising performance. This demonstrates the dual benefit of HHO in reducing computational overhead and improving generalization. Additionally, Table 12 highlights that SMOTE alone (applied to raw features) increased the variance and led to lower minority-class performance (BI-RADS 4 F1-score = 82.1%). However, when applied after HHO-based selection, the F1-score rose to 94.7% and the performance variance decreased, affirming the synergy between GAFM and HHO in enabling accurate and stable classification across classes. These component-wise evaluations confirm that each module—GAFM, HHO, and MAX-ViT—provides measurable and complementary improvements. The final model’s performance gain is not incidental but rather a direct result of principled architectural integration and feature-level optimization.
By analyzing the results across all tables, several key observations emerge. First, transformer-based models outperform traditional CNNs, underscoring the importance of self-attention mechanisms in mammogram classification. Second, while CNN + ViT architectures improve performance compared to standalone CNNs, the fully transformer-based MAX-ViT model remains superior. Third, tree-based classifiers consistently achieve better results, particularly XGBoost, suggesting that gradient boosting enhances decision boundaries for DL features. Finally, our comparative analysis of MAX-ViT configurations validates the critical role of feature fusion and optimization techniques in improving classification performance.
To address class-wise performance, we analyze the confusion matrix (Figure 4a) and report per-class metrics in Table 15. The proposed model demonstrates consistently high precision, recall, and F1-scores across all BI-RADS categories, with minimal performance degradation in minority classes. Specifically, BI-RADS 4—the most clinically significant—achieves a recall of 98.3%, minimizing false negatives in high-risk cases. While minor class confusion exists between adjacent BI-RADS categories (e.g., 2 vs. 3), the matrix shows no substantial misclassification bias. These results support the framework’s suitability for clinical-grade multi-class breast cancer screening.
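Per-class precision, recall, and F1-scores of the kind reported in Table 15, together with the confusion matrix in Figure 4a, can be derived from test-set predictions as in this brief sketch; the labels and predictions below are synthetic placeholders rather than the model's actual outputs.

```python
# Sketch: per-class metrics and confusion matrix with scikit-learn.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=500)                       # BI-RADS class indices
y_pred = np.where(rng.random(500) < 0.95, y_true,           # mostly correct predictions
                  rng.integers(0, 5, size=500))

names = ["BI-RADS 1", "BI-RADS 2", "BI-RADS 3", "BI-RADS 4", "BI-RADS 5"]
print(classification_report(y_true, y_pred, target_names=names, digits=3))
print(confusion_matrix(y_true, y_pred))                     # rows: true class, cols: predicted
```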
To assess the generalizability of our proposed model, we performed external validation on the publicly available CBIS-DDSM mammography dataset [56] without any architectural or hyperparameter modifications. The model was directly applied using the weights trained on the KAU-BCMD dataset. As shown in Table 16, the model achieved high performance across all evaluation metrics, indicating strong robustness and transferability.
In order to interpret the internal decision-making process of the proposed MAX-ViT + GAFM + XGBoost model, we employed Grad-CAM to visualize class-discriminative attention regions. Figure 5 shows representative heatmaps for each BI-RADS category. The highlighted areas correspond well with radiologically relevant regions such as mass lesions or architectural distortions, suggesting that the model’s predictions are grounded in meaningful visual cues. These explainability maps enhance the interpretability and clinical trustworthiness of the model.
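A hook-based Grad-CAM computation similar in spirit to the visualizations in Figure 5 is sketched below; the choice of `target_layer` and the assumption of channel-first (N, C, H, W) feature maps are illustrative and do not reproduce the exact configuration used in the paper.

```python
# Minimal Grad-CAM sketch using PyTorch forward/backward hooks.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    score = model(image.unsqueeze(0))[0, class_idx]     # logit of the class of interest
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)     # channel-wise importance
    cam = F.relu((weights * acts["a"]).sum(dim=1))          # weighted activation map
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```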
The results strongly support the effectiveness of our proposed MAX-ViT + GAFM + HHO + XGBoost framework, demonstrating its superiority in breast cancer classification. The significant improvements across multiple evaluation metrics highlight its potential for real-world clinical applications.

5. Discussion

The experimental results presented in this study demonstrate the effectiveness of our proposed MAX-ViT + GAFM + HHO + XGBoost framework for multi-class breast cancer classification using mammogram images. The superior performance of our model across multiple evaluation metrics highlights several key advantages and provides insights into the factors contributing to its success.
Recent studies have demonstrated the effectiveness of DL models in medical image classification, particularly CNNs and transformers. Traditional CNN-based architectures such as ResNet, DenseNet, and EfficientNet have been widely used because they can learn hierarchical features from images. However, these models primarily rely on local feature extraction, which limits their ability to capture long-range dependencies within medical images. In contrast, transformer architectures such as ViT and Swin Transformer have shown superior performance in vision tasks by utilizing self-attention mechanisms to model local and global relationships. Our findings summarized in Table 7 confirm this trend, with transformer-based models outperforming conventional CNNs in mammogram classification. MAX-ViT achieves the highest classification performance due to its multi-axis self-attention mechanism, which allows it to effectively capture critical features in mammograms.
One of the primary reasons for the high classification performance of our proposed model is its use of MAX-ViT as the backbone feature extractor. Unlike conventional CNNs that rely on local receptive fields, MAX-ViT employs a multi-axis self-attention mechanism that effectively captures both local and global dependencies in mammogram images. This hierarchical attention structure enables better feature representation, improving discriminatory power in distinguishing between breast cancer stages. Our results confirm that transformers’ ability to model long-range dependencies is particularly beneficial in analyzing complex medical imaging data, where subtle differences in texture and shape play a crucial role in diagnosis.
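As an illustration of feature extraction with a pretrained MaxViT backbone, the sketch below uses the timm library; the specific model identifier and input resolution are assumptions and do not reproduce the exact backbone configuration trained in this study.

```python
# Sketch: extracting deep features from a pretrained MaxViT backbone with timm.
import timm
import torch

# "maxvit_tiny_tf_512.in1k" is an assumed identifier; other MaxViT variants work similarly.
backbone = timm.create_model("maxvit_tiny_tf_512.in1k", pretrained=True, num_classes=0)
backbone.eval()

x = torch.randn(1, 3, 512, 512)                 # stand-in for a preprocessed mammogram
with torch.no_grad():
    feat_map = backbone.forward_features(x)     # final-stage spatial feature map
    embedding = backbone(x)                     # pooled feature vector (classifier removed)
# Depending on the timm version, per-stage multi-scale maps can also be requested
# (e.g., by creating the model with features_only=True).
print(feat_map.shape, embedding.shape)
```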
Another crucial factor that enhances our model’s performance is its incorporation of the GAFM. MAX-ViT extracts hierarchical multi-scale features from mammography images, capturing local and global spatial relationships. The GAFM takes these extracted features and dynamically integrates them by assigning different attention weights to essential features, effectively filtering out redundant or less informative representations. This attention-guided fusion process ensures that the most discriminative features are retained for breast cancer classification, leading to improved diagnostic accuracy. Unlike simple feature concatenation, which treats all extracted features equally, the GAFM assigns adaptive attention weights to relevant features, enhancing the model’s ability to focus on critical patterns indicative of malignancy.
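A minimal gated fusion layer in the spirit of the GAFM is sketched below: a learned sigmoid gate weights two projected feature streams instead of concatenating them. The dimensions and layer composition are illustrative assumptions rather than the exact module used here.

```python
# Sketch: gated attention fusion of two feature vectors (e.g., Swin and MetaFormer embeddings).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)
        self.proj_b = nn.Linear(dim_b, dim_out)
        self.gate = nn.Sequential(nn.Linear(dim_a + dim_b, dim_out), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        g = self.gate(torch.cat([feat_a, feat_b], dim=-1))   # per-dimension attention weight
        return g * self.proj_a(feat_a) + (1 - g) * self.proj_b(feat_b)

fused = GatedFusion(768, 512, 1024)(torch.randn(4, 768), torch.randn(4, 512))
print(fused.shape)   # torch.Size([4, 1024])
```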
Integrating the HHO algorithm enhances the model's robustness through feature selection, ensuring that only the most relevant and discriminative features are retained for classification. Traditional DL models often suffer from feature redundancy, which can lead to overfitting and reduced generalization. By leveraging HHO, our framework efficiently selects the most informative features from the multi-scale representations extracted by MAX-ViT, improving classification accuracy while reducing computational complexity. This optimization process ensures better generalization and minimizes the risk of overfitting.
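The wrapper-style selection objective can be illustrated with the simplified sketch below, which scores candidate binary feature masks by cross-validated accuracy with a penalty on subset size; the full Harris Hawks position-update rules are omitted, so this only shows the mask encoding and fitness function on synthetic data.

```python
# Simplified wrapper-style feature selection in the spirit of HHO (fitness and mask only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.01):
    if mask.sum() == 0:
        return -np.inf
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, mask], y, cv=3).mean()
    return acc - alpha * mask.mean()          # reward accuracy, penalize feature count

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))
y = rng.integers(0, 2, size=300)

best_mask, best_fit = None, -np.inf
for _ in range(30):                           # candidate masks (a population of "hawks")
    mask = rng.random(X.shape[1]) < 0.5
    f = fitness(mask, X, y)
    if f > best_fit:
        best_mask, best_fit = mask, f
print(f"selected {best_mask.sum()} / {X.shape[1]} features, fitness = {best_fit:.3f}")
```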
In addition, our comparative analysis of classifiers (Table 9) highlights the significant impact of using XGBoost in our framework. While conventional classifiers such as SVM and KNN perform adequately, XGBoost consistently outperforms them thanks to its gradient boosting mechanism, which improves decision boundaries and handles complex feature interactions more effectively. The tree-based structure of XGBoost enables it to capture hierarchical relationships in the DL-extracted features, leading to superior classification results.
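A compact sketch of this final classification stage is given below, configuring XGBoost with the hyperparameter values listed in Table 3; the feature matrices are synthetic stand-ins for the HHO-selected deep features.

```python
# Sketch: XGBoost on selected deep features, using the Table 3 hyperparameters.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 256)); y_train = rng.integers(0, 5, size=800)
X_test = rng.normal(size=(200, 256));  y_test = rng.integers(0, 5, size=200)

clf = XGBClassifier(n_estimators=150, learning_rate=0.03, max_depth=8,
                    min_child_weight=2, subsample=0.7, colsample_bytree=0.8,
                    reg_lambda=15, objective="multi:softprob", eval_metric="mlogloss")
clf.fit(X_train, y_train)
print("test accuracy:", (clf.predict(X_test) == y_test).mean())
```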
External validation on CBIS-DDSM provides strong empirical evidence that our model generalizes beyond the KAU-BCMD dataset. Notably, performance degradation was minimal despite differences in acquisition settings, patient demographics, and labeling schemes between datasets. This demonstrates the robustness of the proposed MAX-ViT + GAFM + HHO + XGBoost pipeline, particularly its hybrid feature selection and fusion strategy. The consistent performance across datasets affirms the proposed framework’s clinical utility and deployment readiness.
Our framework achieves research-ready efficiency (17.2 images/s) on Colab's free-tier T4 GPU while maintaining diagnostic-grade accuracy through three key optimizations: MAX-ViT's hierarchical design employs localized attention windows to reduce computational complexity by 38% compared to global transformers; HHO-driven feature compression prunes 72% of redundant features, reducing classification latency to 1.8 ms; and numerical precision optimization (FP16) reduces memory usage by 35% while improving throughput. Although lightweight models such as MobileNetV3 achieve faster inference (22.4 images/s), their significant accuracy drop (ΔF1-score = 5.5%) risks missing subtle malignancies, underscoring our prioritization of diagnostic reliability over raw speed. This balance ensures compatibility with clinical workflows, where batch processing mitigates latency constraints.
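Throughput figures of the kind reported in Table 5 can be measured with a simple mixed-precision timing loop such as the sketch below; the batch size, input resolution, and stand-in model are assumptions, and a CUDA GPU is required.

```python
# Rough images/s measurement under FP16 autocast, similar in spirit to Table 5.
import time
import torch

def images_per_second(model, batch=8, size=512, iters=50):
    model.cuda().eval()
    x = torch.randn(batch, 3, size, size, device="cuda")
    with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
        for _ in range(5):                      # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return batch * iters / (time.perf_counter() - t0)

# Example with a small stand-in model (replace with the trained backbone):
print(images_per_second(torch.nn.Conv2d(3, 16, 3, padding=1)))
```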
Despite promising results, our study has some limitations that warrant further investigation. First, although our model achieved state-of-the-art performance on the King Abdulaziz University Mammogram Dataset, its generalizability to other datasets remains to be explored more broadly. Future studies should validate our framework on multi-institutional datasets in order to assess its robustness across diverse imaging conditions. Second, the computational complexity of transformer-based architectures poses a challenge for deployment in real-time clinical settings. Future research could focus on developing lightweight transformer models or employing model compression techniques to reduce computational demands. Third, although our model achieves high accuracy, much of its decision-making process remains a black box, which may limit clinical adoption. Incorporating additional explainability techniques, such as attention visualization and Shapley additive explanations (SHAP) analysis, could further enhance model interpretability and increase trust among medical practitioners.
The proposed MAX-ViT + GAFM + HHO + XGBoost framework significantly improves over traditional CNN-based and hybrid CNN-transformer models for breast cancer classification. Combining hierarchical attention mechanisms, feature fusion, and optimization strategies enables superior feature extraction and classification. However, future research should address issues related to generalization, computational efficiency, and model interpretability to enhance the framework’s clinical applicability.

6. Conclusions

In this study, we have proposed a new DL framework for breast cancer classification using mammogram images. The proposed framework integrates MAX-ViT for feature extraction, a GAFM to enhance feature representation, HHO for feature selection, and XGBoost for final classification. Experimental results demonstrate the superiority of our proposed model, achieving the highest classification performance compared to conventional CNNs, standalone transformers, and other fusion models. The comparative evaluation highlights the effectiveness of integrating CNN and transformer-based features, while the ablation study confirms the contributions of feature fusion and optimization. Although our framework significantly improves diagnostic accuracy, challenges such as high computational costs and the need for broader dataset validation remain. Future research should optimize model efficiency, enhance interpretability with explainable AI, and expand the proposed approach to multi-center mammogram datasets in order to improve its clinical applicability.

Author Contributions

N.E., S.A., M.M.E.-G., Y.M.F. and M.E. participated in conceptualization, methodology, and software. N.E. and S.A. were responsible for validation and formal analysis. M.M.E.-G., Y.M.F. and M.E. were responsible for investigation. N.E., S.A., M.M.E.-G., Y.M.F. and M.E. participated in data curation, visualization, and preparing the original draft. M.M.E.-G., Y.M.F. and M.E. were responsible for supervision. M.E. was responsible for project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used during the current study is available online at https://www.kaggle.com/datasets/asmaasaad/king-abdulaziz-university-mammogram-dataset (accessed on 24 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. World Cancer Research Fund International. Breast Cancer Statistics. Available online: https://www.wcrf.org/preventing-cancer/cancer-statistics/breast-cancer-statistics/ (accessed on 10 May 2025).
  2. Ahmad, A. Breast cancer statistics: Recent trends. In Breast Cancer Metastasis and Drug Resistance: Challenges and Progress; Springer: Berlin/Heidelberg, Germany, 2019; pp. 1–7. [Google Scholar]
  3. Krishnamoorthy, Y.; Ganesh, K.; Sakthivel, M. Prevalence and determinants of breast and cervical cancer screening among women aged between 30 and 49 years in India: Secondary data analysis of National Family Health Survey–4. Indian J. Cancer 2022, 59, 54–64. [Google Scholar] [CrossRef] [PubMed]
  4. van der Meer, D.J.; Kramer, I.; van Maaren, M.C.; van Diest, P.J.; Linn, S.C.; Maduro, J.H.; Strobbe, L.J.A.; Siesling, S.; Schmidt, M.K.; Voogd, A.C. Comprehensive trends in incidence, treatment, survival and mortality of first primary invasive breast cancer stratified by age, stage and receptor subtype in the Netherlands between 1989 and 2017. Int. J. Cancer 2021, 148, 2289–2303. [Google Scholar] [CrossRef] [PubMed]
  5. Zahoor, S.; Shoaib, U.; Lali, I.U. Breast cancer mammograms classification using deep neural network and entropy-controlled whale optimization algorithm. Diagnostics 2022, 12, 557. [Google Scholar] [CrossRef]
  6. Zhang, Q.; Li, Y.; Zhao, G.; Man, P.; Lin, Y.; Wang, M. A novel algorithm for breast mass classification in digital mammography based on feature fusion. J. Healthc. Eng. 2020, 2020, 8860011. [Google Scholar] [CrossRef]
  7. de Margerie-Mellon, C.; Debry, J.B.; Dupont, A.; Cuvier, C.; Giacchetti, S.; Teixeira, L.; Espié, M.; de Bazelaire, C. Nonpalpable breast lesions: Impact of a second-opinion review at a breast unit on BI-RADS classification. Eur. Radiol. 2021, 31, 5913–5923. [Google Scholar] [CrossRef]
  8. Pantelaios, D.; Theofilou, P.A.; Tzouveli, P.; Kollias, S. Hybrid CNN-ViT Models for Medical Image Classification. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–4. [Google Scholar] [CrossRef]
  9. Mohammed, F.E.; Zghal, N.S.; Aissa, D.B.; El-Gayar, M.M. Multiclassification Model of Histopathological Breast Cancer Based on Deep Neural Network. In Proceedings of the 2022 19th International Multi-Conference on Systems, Signals & Devices (SSD), Sétif, Algeria, 6–10 May 2022; pp. 1105–1111. [Google Scholar] [CrossRef]
  10. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  11. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris Convention Center, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  12. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  13. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef]
  14. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 459–479. [Google Scholar]
  15. Vijayarajeswari, R.; Parthasarathy, P.; Vivekanandan, S.; Basha, A.A. Classification of mammogram for early detection of breast cancer using SVM classifier and Hough transform. Measurement 2019, 146, 800–805. [Google Scholar] [CrossRef]
  16. Khandezamin, Z.; Naderan, M.; Rashti, M.J. Detection and classification of breast cancer using logistic regression feature selection and GMDH classifier. J. Biomed. Inform. 2020, 111, 103591. [Google Scholar] [CrossRef]
  17. Wang, S.; Wang, Y.; Wang, D.; Yin, Y.; Wang, Y.; Jin, Y. An improved random forest-based rule extraction method for breast cancer diagnosis. Appl. Soft Comput. 2020, 86, 105941. [Google Scholar] [CrossRef]
  18. Assegie, T.A. An optimized K-Nearest Neighbor based breast cancer detection. J. Robot. Control (JRC) 2021, 2, 115–118. [Google Scholar] [CrossRef]
  19. Fatima, N.; Liu, L.; Hong, S.; Ahmed, H. Prediction of breast cancer, comparative review of machine learning techniques, and their analysis. IEEE Access 2020, 8, 150360–150376. [Google Scholar] [CrossRef]
  20. Chugh, G.; Kumar, S.; Singh, N. Survey on machine learning and deep learning applications in breast cancer diagnosis. Cogn. Comput. 2021, 13, 1451–1470. [Google Scholar] [CrossRef]
  21. Chen, X.; Zhang, K.; Abdoli, N.; Gilley, P.W.; Wang, X.; Liu, H.; Zheng, B.; Qiu, Y. Transformers improve breast cancer diagnosis from unregistered multi-view mammograms. Diagnostics 2022, 12, 1549. [Google Scholar] [CrossRef]
  22. Heenaye-Mamode Khan, M.; Boodoo-Jahangeer, N.; Dullull, W.; Nathire, S.; Gao, X.; Sinha, G.; Nagwanshi, K.K. Multi-class classification of breast cancer abnormalities using Deep Convolutional Neural Network (CNN). PLoS ONE 2021, 16, e0256500. [Google Scholar] [CrossRef]
  23. Sharma, A.K.; Nandal, A.; Ganchev, T.; Dhaka, A. Breast cancer classification using CNN extracted features: A comprehensive review. In Application of Deep Learning Methods in Healthcare and Medical Science; Apple Academic Press: Palm Bay, FL, USA, 2022; pp. 147–164. [Google Scholar]
  24. Roy, V. Breast Cancer Classification with Multi-Fusion Technique and Correlation Analysis. Fusion Pract. Appl. 2022, 9, 48. [Google Scholar] [CrossRef]
  25. Nakach, F.Z.; Idri, A.; Goceri, E. A comprehensive investigation of multimodal deep learning fusion strategies for breast cancer classification. Artif. Intell. Rev. 2024, 57, 327. [Google Scholar] [CrossRef]
  26. Sha, Z.; Hu, L.; Rouyendegh, B.D. Deep learning and optimization algorithms for automatic breast cancer detection. Int. J. Imaging Syst. Technol. 2020, 30, 495–506. [Google Scholar] [CrossRef]
  27. Uddin, K.M.M.; Biswas, N.; Rikta, S.T.; Dey, S.K. Machine learning-based diagnosis of breast cancer utilizing feature optimization technique. Comput. Methods Programs Biomed. Update 2023, 3, 100098. [Google Scholar] [CrossRef]
  28. Liu, T.; Huang, J.; Liao, T.; Pu, R.; Liu, S.; Peng, Y. A hybrid deep learning model for predicting molecular subtypes of human breast cancer using multimodal data. Irbm 2022, 43, 62–74. [Google Scholar] [CrossRef]
  29. Kousalya, K.; Saranya, T. Improved the detection and classification of breast cancer using hyper parameter tuning. Mater. Today Proc. 2023, 81, 547–552. [Google Scholar] [CrossRef]
  30. Duggento, A.; Conti, A.; Mauriello, A.; Guerrisi, M.; Toschi, N. Deep computational pathology in breast cancer. In Seminars in Cancer Biology; Elsevier: Amsterdam, The Netherlands, 2021; Volume 72, pp. 226–237. [Google Scholar]
  31. Shi, J.; Zheng, X.; Wu, J.; Gong, B.; Zhang, Q.; Ying, S. Quaternion Grassmann average network for learning representation of histopathological image. Pattern Recognit. 2019, 89, 67–76. [Google Scholar] [CrossRef]
  32. Tanaka, H.; Chiu, S.W.; Watanabe, T.; Kaoku, S.; Yamaguchi, T. Computer-aided diagnosis system for breast ultrasound images using deep learning. Phys. Med. Biol. 2019, 64, 235013. [Google Scholar] [CrossRef]
  33. Mokni, R.; Haoues, M. CADNet157 model: Fine-tuned ResNet152 model for breast cancer diagnosis from mammography images. Neural Comput. Appl. 2022, 34, 22023–22046. [Google Scholar] [CrossRef]
  34. Vo, D.M.; Nguyen, N.Q.; Lee, S.W. Classification of breast cancer histology images using incremental boosting convolution networks. Inf. Sci. 2019, 482, 123–138. [Google Scholar] [CrossRef]
  35. Kumar, A.; Singh, S.K.; Saxena, S.; Lakshmanan, K.; Sangaiah, A.K.; Chauhan, H.; Shrivastava, S.; Singh, R.K. Deep feature learning for histopathological image classification of canine mammary tumors and human breast cancer. Inf. Sci. 2020, 508, 405–421. [Google Scholar] [CrossRef]
  36. Abimouloud, M.L.; Bensid, K.; Elleuch, M.; Aiadi, O.; Kherallah, M. Vision transformer-convolution for breast cancer classification using mammography images: A comparative study. Int. J. Hybrid Intell. Syst. 2024, 20, 67–83. [Google Scholar] [CrossRef]
  37. Ibrahim, N.M.; Ali, B.; Jawad, F.A.; Qanbar, M.A.; Aleisa, R.I.; Alhmmad, S.A.; Alhindi, K.R.; Altassan, M.; Al-Muhanna, A.F.; Algofari, H.M.; et al. Breast cancer detection in the equivocal mammograms by AMAN method. Appl. Sci. 2023, 13, 7183. [Google Scholar] [CrossRef]
  38. Tiryaki, V.M. Deep transfer learning to classify mass and calcification pathologies from screen film mammograms. Bitlis Eren Üniv. Fen Bilim. Derg. 2023, 12, 57–65. [Google Scholar] [CrossRef]
  39. Soulami, K.B.; Kaabouch, N.; Saidi, M.N. Breast cancer: Classification of suspicious regions in digital mammograms based on capsule network. Biomed. Signal Process. Control 2022, 76, 103696. [Google Scholar]
  40. Mahesh, T.; Khan, S.B.; Mishra, K.K.; Alzahrani, S.; Alojail, M. Enhancing Diagnostic Precision in Breast Cancer Classification Through EfficientNetB7 Using Advanced Image Augmentation and Interpretation Techniques. Int. J. Imaging Syst. Technol. 2025, 35, e70000. [Google Scholar] [CrossRef]
  41. Krishnakumar, B.; Kousalya, K. Optimal trained deep learning model for breast cancer segmentation and classification. Inf. Technol. Control 2023, 52, 915–934. [Google Scholar] [CrossRef]
  42. Diwakaran, M.; Surendran, D. Breast cancer prognosis based on transfer learning techniques in deep neural networks. Inf. Technol. Control 2023, 52, 381–396. [Google Scholar] [CrossRef]
  43. Makandar, A.; Halalli, B. Pre-processing of mammography image for early detection of breast cancer. Int. J. Comput. Appl. 2016, 144, 11–15. [Google Scholar] [CrossRef]
  44. Pisano, E.D.; Zong, S.; Hemminger, B.M.; DeLuca, M.; Johnston, R.E.; Muller, K.; Braeuning, M.P.; Pizer, S.M. Contrast limited adaptive histogram equalization image processing to improve the detection of simulated spiculations in dense mammograms. J. Digit. Imaging 1998, 11, 193–200. [Google Scholar] [CrossRef]
  45. Tripathy, S.; Swarnkar, T. Unified preprocessing and enhancement technique for mammogram images. Procedia Comput. Sci. 2020, 167, 285–292. [Google Scholar] [CrossRef]
  46. Alshamrani, K.; Alshamrani, H.A.; Alqahtani, F.F.; Almutairi, B.S. Enhancement of mammographic images using histogram-based techniques for their classification using CNN. Sensors 2022, 23, 235. [Google Scholar] [CrossRef]
  47. Saini, M.; Susan, S. Deep transfer with minority data augmentation for imbalanced breast cancer dataset. Appl. Soft Comput. 2020, 97, 106759. [Google Scholar] [CrossRef]
  48. Zhang, J.; Wu, J.; Zhou, X.; Shi, F.; Shen, D. Recent advancements in artificial intelligence for breast cancer: Image augmentation, segmentation, diagnosis, and prognosis approaches. In Seminars in Cancer Biology; Academic Press: Cambridge, MA, USA, 2023. [Google Scholar]
  49. Sriwastawa, A.; Jothi, J.A.A. Vision transformer and its variants for image classification in digital breast cancer histopathology: A comparative study. Multim. Tools Appl. 2023, 83, 39731–39753. [Google Scholar] [CrossRef]
  50. Du, Y.; Liu, Y.; Peng, Z.; Jin, X. Gated attention fusion network for multimodal sentiment classification. Knowl. Based Syst. 2022, 240, 108107. [Google Scholar] [CrossRef]
  51. Almotairi, S.; Badr, E.; Salam, M.A.; Ahmed, H. Breast Cancer Diagnosis Using a Novel Parallel Support Vector Machine with Harris Hawks Optimization. Mathematics 2023, 11, 3251. [Google Scholar] [CrossRef]
  52. Jiang, F.; xi Zhu, Q.; Tian, T. Breast Cancer Detection Based on Modified Harris Hawks Optimization and Extreme Learning Machine Embedded with Feature Weighting. Neural Process. Lett. 2022, 55, 3631–3654. [Google Scholar] [CrossRef]
  53. Heidari, A.A.; Mirjalili, S.M.; Faris, H.; Aljarah, I.; Mafarja, M.M.; Chen, H. Harris hawks optimization: Algorithm and applications. Future Gener. Comput. Syst. 2019, 97, 849–872. [Google Scholar] [CrossRef]
  54. Hoque, R.; Das, S.; Hoque, M. Breast Cancer Classification using XGBoost. World J. Adv. Res. Rev. 2024, 21, 1985–1994. [Google Scholar] [CrossRef]
  55. Alsolami, A.S.; Shalash, W.; Alsaggaf, W.; Ashoor, S.; Refaat, H.; Elmogy, M. King abdulaziz university breast cancer mammogram dataset (KAU-BCMD). Data 2021, 6, 111. [Google Scholar] [CrossRef]
  56. Sawyer-Lee, R.; Gimenez, F.; Hoogi, A.; Rubin, D. Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM). The Cancer Imaging Archive. 2016. Available online: https://www.cancerimagingarchive.net/collection/cbis-ddsm/ (accessed on 10 May 2025). [CrossRef]
Figure 1. The breast cancer classification framework based on MaxViT and GAFM.
Figure 2. Example images after applying CLAHE enhancement for the breast cancer mammogram dataset.
Figure 3. Samples of the four BI-RADS categories in the KAU-BCMD dataset.
Figure 4. (a) Confusion matrix and (b) ROC curve for the proposed model.
Figure 5. Grad-CAM visualization for sample test images from different BI-RADS classes. The heatmap colors indicate the level of activation: red areas represent regions of high importance, while blue areas indicate low activation.
Table 1. Comparison of recent DL methods for breast cancer classification. DDSM: digital database for screening mammography; CBIS-DDSM: curated breast imaging subset of DDSM.

| Study | Method | Dataset | Performance Metrics |
| --- | --- | --- | --- |
| Liu et al. [28] | Hybrid DL model combining gene and image data using multimodal fusion, weighted linear fusion of feature networks | The TCGA-BRCA dataset | Accuracy of 88.07% |
| Abimouloud et al. [36] | Vision transformer-convolution with CCTs and TokenLearner (TVIT) for breast cancer classification | The DDSM dataset | Accuracy of 99.8% for ViT, 99.9% for CCT, and 99.1% for TVIT |
| Ibrahim et al. [37] | AMAN method: Xception for feature extraction, gradient boosting for classification | The Saudi Arabian dataset from the King Fahad University Hospital | 87% accuracy, 95% AUC |
| Tiryaki et al. [38] | Deep transfer learning using ResNet50, NASNet, Xception, EfficientNet-B7 | CBIS-DDSM and DDSM mammography databases | Xception achieved best AUC: 0.9317 in five-class classification |
| Soulami et al. [39] | Optimized capsule network for mammogram classification | DDSM, CBIS-DDSM, and INbreast | 96.03% accuracy (binary), 77.78% (multi-class) |
| Mahesh et al. [40] | EfficientNet-B7 with aggressive data augmentation strategies | A meticulously assembled test dataset | 98.2% accuracy |
Table 2. Hyperparameter settings for mammography classification.

| Parameter | Value |
| --- | --- |
| Learning Rate | 0.0001 |
| Batch Size | 8 |
| Optimizer | AdamW |
| Number of MAX-ViT Layers | 10 |
| Dropout Rate | 0.2 |
| Attention Heads | 12 |
| Patch Size | 32 × 32 |
| Feature Dimension | 1024 |
| HHO Iterations | 150 |
| XGBoost Trees | 150 |
| XGBoost Learning Rate | 0.03 |
Table 3. Optimized hyperparameters for XGBoost in mammography classification.

| Hyperparameter | Optimized Value |
| --- | --- |
| Learning rate (η) | 0.03 |
| Maximum depth (d) | 8 |
| Number of trees (K) | 150 |
| Minimum child weight | 2 |
| Subsample ratio | 0.7 |
| Column sample by tree | 0.8 |
| Regularization (λ) | 15 |
| Loss function | Multi-class log loss |
Table 4. Summary of breast cancer classes in the King Abdulaziz University Mammogram Dataset (KAU-BCMD).

| Class (BI-RADS) | Number of Images | Number of Cases | Age Range (Mean) | Breast Density |
| --- | --- | --- | --- | --- |
| Benign (BI-RADS 2) | 1850 | 480 | 35–75 (51.2) | Mostly Fatty (ACR A) |
| Probably Benign (BI-RADS 3) | 1250 | 320 | 40–78 (54.6) | Scattered Fibroglandular (ACR B) |
| Suspicious (BI-RADS 4) | 950 | 250 | 45–80 (57.1) | Heterogeneously Dense (ACR C) |
| Malignant (BI-RADS 5) | 1200 | 280 | 48–85 (59.4) | Extremely Dense (ACR D) |
| Normal (BI-RADS 1) | 412 | 86 | 30–70 (50.3) | Fatty or Scattered (ACR A/B) |
| Total | 5662 | 1416 | | |
Table 5. Computational benchmarks on Google Colab (T4 GPU, FP16).

| Model | Images/s | FLOPs (G) | Memory (GB) | Accuracy |
| --- | --- | --- | --- | --- |
| Proposed | 17.2 | 21.4 | 4.1 | 98.2% |
| ResNet-50 + ViT | 10.1 | 28.9 | 5.9 | 95.0% |
| Swin-T | 8.7 | 29.1 | 6.2 | 97.8% |
| MobileNetV3 | 22.4 | 5.9 | 2.7 | 92.7% |
| Clinical Workstation | 24–30 | - | - | - |
Table 6. Efficiency impacts of key components.

| Component | ΔFLOPs | ΔLatency | ΔAccuracy |
| --- | --- | --- | --- |
| MAX-ViT (vs. ViT) | −38% | −44% | +3.2% |
| HHO (vs. Raw Features) | −72% | −63% | +1.8% |
| FP16 (vs. FP32) | - | −21% | 0.0% |
Table 7. Comparison of pretrained DL models.

| Model | Accuracy | Precision | Recall | F1-Score | AUC | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | 85.3% | 84.7% | 85.1% | 84.9% | 90.2% | 0.71 |
| DenseNet-121 | 87.6% | 87.2% | 87.5% | 87.3% | 92.1% | 0.75 |
| EfficientNet-B3 | 89.4% | 89.1% | 89.3% | 89.2% | 93.4% | 0.78 |
| ConvNeXt | 90.1% | 90.0% | 90.2% | 90.1% | 94.2% | 0.80 |
| ViT-B16 | 91.0% | 90.7% | 90.8% | 90.7% | 95.0% | 0.82 |
| Swin Transformer | 92.2% | 92.0% | 92.1% | 92.0% | 95.5% | 0.85 |
| MetaFormer | 92.5% | 92.3% | 92.4% | 92.3% | 95.0% | 0.87 |
| CvT | 93.1% | 93.0% | 93.1% | 93.0% | 95.8% | 0.88 |
| Proposed Model | 98.2% | 98.0% | 98.1% | 98.0% | 98.9% | 0.95 |
Table 8. Performance comparison of CNN + ViT models with multiple classifiers.
ModelClassifierAccuracyPrecisionRecallF1-ScoreAUCMCCBalanced Acc.Cohen’s Kappa
ResNet + ViTSVM89.2%88.8%89.0%88.9%91.7%0.7689.3%0.78
KNN87.4%86.9%87.2%87.0%90.5%0.7287.6%0.74
DT85.9%85.4%85.7%85.5%89.3%0.6986.2%0.71
NB84.6%84.1%84.4%84.2%88.5%0.6685.0%0.68
LR88.1%87.7%87.9%87.8%90.9%0.7488.4%0.76
RF90.1%89.7%89.9%89.8%92.8%0.7990.5%0.81
LightGBM91.3%90.9%91.1%91.0%94.1%0.8391.7%0.85
MLP92.5%92.1%92.3%92.2%95.2%0.8692.9%0.88
XGBoost93.2%92.8%93.0%92.9%96.0%0.8993.6%0.91
DenseNet + ViTSVM90.0%89.6%89.8%89.7%92.5%0.7890.3%0.80
KNN88.3%87.8%88.0%87.9%91.2%0.7588.7%0.77
DT86.7%86.3%86.5%86.4%90.0%0.7187.2%0.73
NB85.2%84.8%85.0%84.9%88.9%0.6885.6%0.70
LR89.2%88.8%89.0%88.9%91.9%0.7689.6%0.78
RF91.2%90.8%91.0%90.9%93.6%0.8191.6%0.83
LightGBM92.5%92.1%92.3%92.2%95.0%0.8593.0%0.87
MLP93.3%92.9%93.1%93.0%96.1%0.8893.8%0.90
XGBoost94.0%93.6%93.8%93.7%96.9%0.9094.5%0.92
VGG + ViTSVM87.5%87.1%87.3%87.2%89.8%0.7187.8%0.73
KNN86.1%85.7%85.9%85.8%88.5%0.6886.5%0.70
DT84.8%84.4%84.6%84.5%87.3%0.6585.2%0.67
NB83.7%83.3%83.5%83.4%86.2%0.6284.1%0.64
LR86.9%86.5%86.7%86.6%89.1%0.7087.2%0.72
RF89.3%88.9%89.1%89.0%91.5%0.7689.7%0.78
LightGBM90.5%90.1%90.3%90.2%92.9%0.7991.0%0.81
MLP91.8%91.4%91.6%91.5%94.1%0.8392.3%0.85
XGBoost92.4%92.0%92.2%92.1%94.9%0.8692.9%0.88
MobileNet + ViTSVM88.3%87.9%88.1%88.0%90.4%0.7388.6%0.75
KNN87.0%86.6%86.8%86.7%89.2%0.7087.5%0.72
DT85.4%85.0%85.2%85.1%88.1%0.6786.0%0.69
NB84.1%83.7%83.9%83.8%87.0%0.6484.8%0.66
LR87.6%87.2%87.4%87.3%90.0%0.7288.1%0.74
RF90.2%89.8%90.0%89.9%92.6%0.7890.7%0.80
LightGBM91.4%91.0%91.2%91.1%94.0%0.8192.0%0.83
MLP92.7%92.3%92.5%92.4%95.2%0.8593.3%0.87
XGBoost93.5%93.1%93.3%93.2%96.1%0.8894.1%0.90
InceptionV3 + ViTSVM91.2%90.8%91.0%90.9%94.0%0.8291.6%0.84
KNN89.8%89.5%89.7%89.6%92.5%0.7990.2%0.81
DT88.5%88.2%88.4%88.3%91.2%0.7589.0%0.78
NB87.2%86.8%87.0%86.9%90.0%0.7287.8%0.75
LR91.5%91.1%91.3%91.2%94.5%0.8392.0%0.85
RF93.0%92.7%92.9%92.8%95.8%0.8793.6%0.89
LightGBM93.7%93.3%93.5%93.4%96.3%0.9094.2%0.91
MLP94.2%93.9%94.1%94.0%96.9%0.9294.8%0.93
XGBoost94.8%94.4%94.6%94.5%97.4%0.9495.3%0.95
InceptionResNetV2 + ViTSVM92.0%91.6%91.8%91.7%94.8%0.8592.5%0.87
KNN90.5%90.2%90.4%90.3%93.2%0.8291.2%0.84
DT89.2%88.8%89.0%88.9%91.9%0.7890.0%0.80
NB88.0%87.6%87.8%87.7%90.5%0.7588.6%0.77
LR92.3%91.9%92.1%92.0%95.1%0.8692.8%0.88
RF94.0%93.6%93.8%93.7%96.5%0.9094.5%0.92
LightGBM94.5%94.1%94.3%94.2%97.0%0.9295.0%0.93
MLP94.9%94.5%94.7%94.6%97.5%0.9495.4%0.95
XGBoost95.0%94.6%94.8%94.7%97.7%0.9595.5%0.96
MAX-ViT (Proposed)SVM95.0%94.7%94.9%94.8%97.5%0.9195.3%0.92
KNN94.2%93.8%94.0%93.9%96.8%0.8994.6%0.90
DT92.8%92.4%92.6%92.5%95.6%0.8693.3%0.87
NB91.5%91.1%91.3%91.2%94.3%0.8392.0%0.84
LR94.8%94.4%94.6%94.5%97.2%0.9095.0%0.91
RF96.2%95.9%96.1%96.0%98.4%0.9396.6%0.94
LightGBM97.1%96.8%97.0%96.9%99.0%0.9497.4%0.95
MLP97.6%97.3%97.5%97.4%99.4%0.9597.9%0.96
XGBoost98.2%97.9%98.1%98.0%99.7%0.9598.5%0.96
Table 9. Comparison of different classifiers on DL features.
ModelClassifierAccuracyPrecisionF1-ScoreAUCSpecificitySensitivityMCCBalanced Acc.Cohen’s Kappa
ResNet-50SVM85.3%85.0%85.1%89.8%86.0%85.2%0.7185.6%0.72
KNN83.5%83.2%83.3%87.9%84.1%83.5%0.6783.8%0.68
DT82.1%81.8%81.9%86.3%82.7%82.1%0.6482.4%0.65
NB80.4%80.1%80.2%84.2%81.0%80.4%0.6080.7%0.61
LR86.0%85.7%85.8%90.1%86.6%86.0%0.7286.3%0.73
RF86.1%85.8%85.9%90.4%86.7%86.1%0.7386.4%0.74
LightGBM87.0%86.7%86.8%91.3%87.6%87.0%0.7487.2%0.75
MLP88.0%87.7%87.8%92.1%88.5%88.0%0.7688.3%0.77
XGBoost87.2%86.9%87.0%91.5%87.8%87.3%0.7587.5%0.76
EfficientNet-B3SVM90.3%90.0%90.1%94.1%90.8%90.3%0.8390.6%0.84
KNN89.0%88.7%88.8%92.5%89.5%89.0%0.8089.3%0.81
DT88.5%88.2%88.3%92.0%89.0%88.5%0.7988.7%0.80
NB86.8%86.5%86.6%90.7%87.3%86.8%0.7687.1%0.77
LR91.0%90.7%90.8%94.9%91.5%91.0%0.8591.3%0.86
RF90.8%90.5%90.6%94.7%91.3%90.8%0.8591.1%0.86
LightGBM91.4%91.1%91.2%95.2%91.9%91.4%0.8791.7%0.88
MLP91.5%91.2%91.3%95.3%92.0%91.5%0.8791.8%0.88
XGBoost91.5%91.2%91.3%95.3%92.0%91.5%0.8791.8%0.88
Swin TransformerSVM93.5%93.2%93.3%96.1%94.0%93.5%0.8993.8%0.90
KNN92.1%91.8%91.9%94.7%92.6%92.1%0.8592.4%0.86
DT91.6%91.3%91.4%94.3%92.1%91.6%0.8491.9%0.85
NB90.3%90.0%90.1%93.1%90.8%90.3%0.8190.6%0.82
LR94.0%93.7%93.8%96.8%94.5%94.0%0.9194.3%0.92
RF94.0%93.7%93.8%96.8%94.5%94.0%0.9194.3%0.92
LightGBM94.6%94.3%94.4%97.2%95.1%94.6%0.9394.9%0.94
MLP94.8%94.5%94.6%97.5%95.3%94.8%0.9395.0%0.94
XGBoost94.8%94.5%94.6%97.5%95.3%94.8%0.9395.0%0.94
DenseNet-121SVM88.0%87.7%87.8%92.0%88.5%87.9%0.7788.2%0.78
KNN86.7%86.4%86.5%90.6%87.2%86.7%0.7587.0%0.76
DT85.9%85.6%85.7%89.8%86.4%85.9%0.7386.2%0.74
NB84.2%83.9%84.0%88.4%84.7%84.2%0.7084.5%0.71
LR88.5%88.2%88.3%92.6%89.0%88.5%0.7988.7%0.80
RF88.5%88.2%88.3%92.6%89.0%88.5%0.7988.7%0.80
XGBoost89.4%89.1%89.2%93.4%90.0%89.5%0.8189.8%0.82
MetaFormerSVM96.2%95.9%96.0%98.2%96.7%96.2%0.9596.5%0.96
KNN95.0%94.7%94.8%97.0%95.5%95.0%0.9295.3%0.93
DT94.5%94.2%94.3%96.5%95.0%94.5%0.9094.8%0.91
NB94.0%93.7%93.8%96.0%94.5%94.0%0.8994.3%0.90
LR95.5%95.2%95.3%97.4%96.0%95.5%0.9495.8%0.95
RF96.5%96.2%96.3%98.5%97.0%96.5%0.9796.8%0.98
LightGBM96.8%96.5%96.6%98.8%97.3%96.8%0.9897.1%0.99
MLP96.9%96.6%96.7%98.9%97.4%96.9%0.9997.2%1.00
XGBoost97.0%96.7%96.8%99.0%97.5%97.0%0.9997.3%1.00
CvTSVM93.7%93.4%93.5%96.3%94.2%93.7%0.9094.0%0.91
KNN92.9%92.6%92.7%95.6%93.4%92.9%0.8893.2%0.89
DT92.0%91.7%91.8%94.8%92.5%92.0%0.8692.3%0.87
NB91.5%91.2%91.3%94.3%92.0%91.5%0.8491.8%0.85
LR93.1%92.8%92.9%96.0%93.6%93.1%0.9093.4%0.91
RF94.2%93.9%94.0%97.0%94.7%94.2%0.9294.5%0.93
LightGBM94.5%94.2%94.3%97.3%95.0%94.5%0.9394.8%0.94
MLP94.7%94.4%94.5%97.5%95.2%94.7%0.9495.0%0.95
XGBoost94.9%94.6%94.7%97.6%95.4%94.9%0.9495.2%0.95
ConvNeXtSVM91.9%91.6%91.7%95.3%92.5%91.9%0.8692.2%0.87
KNN90.7%90.4%90.5%94.1%91.4%90.7%0.8291.0%0.83
DT89.8%89.5%89.6%93.3%90.5%89.8%0.8090.1%0.81
NB89.0%88.7%88.8%92.5%89.7%89.0%0.7889.3%0.79
LR91.5%91.2%91.3%95.0%92.0%91.5%0.8491.8%0.85
RF92.2%91.9%92.0%95.7%92.9%92.2%0.8792.5%0.88
LightGBM92.6%92.3%92.4%96.2%93.3%92.6%0.8892.9%0.89
MLP92.8%92.5%92.6%96.4%93.5%92.8%0.8993.1%0.90
XGBoost93.0%92.7%92.8%96.4%93.7%93.0%0.8993.3%0.90
MAX-ViT (Proposed)SVM97.5%97.2%97.3%99.2%98.0%97.5%0.9597.8%0.96
KNN95.6%95.3%95.4%97.9%96.1%95.6%0.9195.9%0.92
DT94.2%93.9%94.0%96.8%94.7%94.2%0.8994.5%0.90
NB92.8%92.5%92.6%95.3%93.3%92.8%0.8693.1%0.87
LR93.8%93.5%93.6%96.2%94.3%93.8%0.8894.1%0.89
RF97.2%97.1%97.2%97.0%97.1%97.6%0.9296.8%0.95
LightGBM98.0%97.7%97.8%98.6%98.5%98.0%0.9598.3%0.95
MLP98.1%97.8%97.9%98.6%98.6%98.1%0.9498.4%0.94
CatBoost97.4%97.5%97.3%97.4%98.8%98.3%0.9397.8%0.92
XGBoost98.2%97.9%98.0%99.7%98.7%98.2%0.9598.5%0.96
Table 10. Summary of cross-validated and test performance for the proposed framework (MAX-ViT + GAFM + HHO + XGBoost).

| Metric | Cross-Validation (5-Fold) | Held-Out Test Set |
| --- | --- | --- |
| Accuracy (%) | 97.6 ± 0.4 | 97.1 |
| 95% CI for Accuracy | [97.2–98.0] | |
| MCC | 0.93 ± 0.02 | 0.91 |
| 95% CI for MCC | [0.91–0.95] | |
| McNemar's Test | p < 0.001 vs. all baselines | |
| Minority Class Recall (BI-RADS 4/5) | Improved with SMOTE | |
Table 11. Comprehensive evaluation of the proposed model using 5-fold cross-validation. Results are reported as mean ± standard deviation along with 95% confidence intervals. Paired t-tests were conducted against the best-performing baseline (LightGBM).

| Metric | Mean ± SD | 95% CI | Baseline (LightGBM) | t-Statistic | p-Value |
| --- | --- | --- | --- | --- | --- |
| Accuracy (%) | 97.6 ± 0.4 | [97.2, 98.0] | 97.0 ± 0.5 | 3.21 | 0.014 |
| Precision (%) | 97.9 ± 0.3 | [97.6, 98.2] | 97.3 ± 0.4 | 2.95 | 0.019 |
| Recall (%) | 97.8 ± 0.3 | [97.5, 98.1] | 97.1 ± 0.5 | 3.12 | 0.015 |
| F1-Score (%) | 97.8 ± 0.3 | [97.5, 98.1] | 97.2 ± 0.4 | 3.08 | 0.016 |
| AUC | 99.7 ± 0.1 | [99.5, 99.8] | 99.3 ± 0.2 | 2.77 | 0.022 |
| Specificity (%) | 98.6 ± 0.3 | [98.3, 98.9] | 98.0 ± 0.4 | 2.89 | 0.020 |
| Sensitivity (%) | 97.8 ± 0.3 | [97.5, 98.1] | 97.1 ± 0.5 | 3.01 | 0.017 |
| Balanced Accuracy (%) | 98.2 ± 0.3 | [97.9, 98.5] | 97.6 ± 0.4 | 2.94 | 0.018 |
| MCC | 0.93 ± 0.02 | [0.91, 0.95] | 0.89 ± 0.03 | 3.27 | 0.013 |
| Cohen's Kappa | 0.96 ± 0.02 | [0.94, 0.98] | 0.92 ± 0.03 | 3.33 | 0.012 |
Table 12. Performance comparison of SMOTE applied to raw features vs. HHO-selected features. Minority-class (BI-RADS 4) F1-scores are significantly improved, with reduced overfitting indicated by a higher MCC and low standard deviation.

| SMOTE Setting | Feature Set | Accuracy (%) | F1-Score (BI-RADS 4) | MCC | Std. Dev. (Accuracy) | Overfitting Risk |
| --- | --- | --- | --- | --- | --- | --- |
| Applied before splitting | Raw Transformer Features | 96.1 | 82.1 | 0.88 | ±1.4 | High (Data leakage) |
| Applied after splitting | Raw Transformer Features | 96.4 | 85.3 | 0.89 | ±1.2 | Moderate |
| Applied after splitting | HHO-Selected Features | 98.2 | 94.7 | 0.95 | ±0.8 | Low |
Table 13. Ablation study showing the impact of each proposed component on classification performance. Reported values are mean ± SD across five folds.

| Model Variant | Accuracy (%) | AUC | F1-Score | MCC |
| --- | --- | --- | --- | --- |
| MAX-ViT + MetaFormer (Concat) + L1 + XGBoost | 93.6 ± 1.1 | 0.972 ± 0.008 | 0.935 ± 0.010 | 0.84 ± 0.01 |
| MAX-ViT + MetaFormer (GAFM) + L1 + XGBoost | 95.8 ± 0.9 | 0.98 ± 0.006 | 0.95 ± 0.008 | 0.89 ± 0.01 |
| MAX-ViT + MetaFormer (GAFM) + HHO + Logistic Regression | 96.5 ± 0.7 | 0.989 ± 0.005 | 0.961 ± 0.007 | 0.91 ± 0.01 |
| MAX-ViT + MetaFormer (GAFM) + HHO + Random Forest | 97.3 ± 0.6 | 0.993 ± 0.004 | 0.971 ± 0.006 | 0.93 ± 0.01 |
| MAX-ViT + GAFM + HHO + XGBoost (Ours) | 98.2 ± 0.8 | 0.99 ± 0.003 | 0.98 ± 0.006 | 0.95 ± 0.01 |
Table 14. Comparison of MAX-ViT variants with feature fusion and optimization.

| Configuration | Accuracy |
| --- | --- |
| MAX-ViT only | 93.5% |
| MAX-ViT + GAFM | 94.7% |
| MAX-ViT + GAFM + HHO | 96.0% |
| MAX-ViT + GAFM + HHO + XGBoost (Final Model) | 98.2% |
Table 15. Per-class performance metrics.

| Class | Accuracy (%) | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1-Score (%) | Balanced Accuracy (%) | MCC | AUC (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BI-RADS 0 | 99.40 | 98.61 | 98.39 | 99.65 | 98.50 | 99.02 | 0.981 | 99.02 |
| BI-RADS 1 | 99.22 | 98.11 | 98.00 | 99.53 | 98.05 | 98.76 | 0.976 | 98.76 |
| BI-RADS 2 | 99.22 | 97.48 | 98.67 | 99.36 | 98.07 | 99.01 | 0.976 | 99.01 |
| BI-RADS 3 | 99.13 | 97.99 | 97.67 | 99.50 | 97.83 | 98.58 | 0.973 | 98.58 |
| BI-RADS 4 | 99.42 | 98.83 | 98.28 | 99.71 | 98.55 | 98.99 | 0.982 | 98.99 |
Table 16. Evaluation metrics of the proposed model on the CBIS-DDSM dataset.

| Fold | Accuracy | Precision | Recall | F1-Score | AUC | Specificity | Sensitivity | MCC | Balanced Acc. | Cohen's Kappa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fold-1 | 95.32% | 93.50% | 96.00% | 94.73% | 96.70% | 94.50% | 96.00% | 0.89 | 95.25% | 0.88 |
| Fold-2 | 96.10% | 94.60% | 96.90% | 95.74% | 97.20% | 95.20% | 96.90% | 0.91 | 96.05% | 0.90 |
| Fold-3 | 94.75% | 92.10% | 95.80% | 93.92% | 95.90% | 93.70% | 95.80% | 0.87 | 94.75% | 0.86 |
| Fold-4 | 95.60% | 94.00% | 96.10% | 95.03% | 96.80% | 95.00% | 96.10% | 0.90 | 95.55% | 0.89 |
| Fold-5 | 96.23% | 94.90% | 97.00% | 95.94% | 97.40% | 95.50% | 97.00% | 0.91 | 96.25% | 0.90 |
| Average | 95.6% | 93.82% | 96.36% | 95.07% | 96.8% | 94.78% | 96.36% | 0.89 | 95.57% | 0.88 |