Article

Density-Aware Multi-Dataset Evaluation of Deep Learning for Mammographic Mass Detection and BI-RADS Classification

by Hector E. Zepeda-Reyes 1, Hayde Peregrina-Barreto 1 and Gabriela C. Lopez-Armas 2,*
1 Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro 1, Santa Maria Tonantzintla, San Andres Cholula 72840, Mexico
2 Subdireccion de Investigacion y Extension, Centro de Enseñanza Técnica Industrial, C. Nueva Escocia 1885, Guadalajara 44638, Mexico
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(12), 2080; https://doi.org/10.3390/math14122080
Submission received: 11 April 2026 / Revised: 4 June 2026 / Accepted: 7 June 2026 / Published: 10 June 2026

Abstract

Breast density has a significant impact on how clearly masses appear in mammography. It can also introduce bias in automatic localization systems when density distributions are uneven. Although advances in deep learning-based detection methods have been made, most studies report overall performance without explicitly accounting for variability associated with breast density. Breast cancer diagnosis from mammography is strongly influenced by dataset composition, annotation variability, and breast density distribution, factors that are rarely controlled in current AI evaluations. We introduce Mass-Bench, a clinically balanced and harmonized multi-dataset benchmark that integrates CBIS-DDSM, INBREAST, VINDr-Mammo, and DMID under a unified canonical schema, with standardized ACR density and BI-RADS encoding. Using a leakage-controlled and distribution-aware evaluation protocol, density-stratified mass detection and lesion-centered regions of interest (ROIs) classification were assessed across datasets. YOLO-based detection models achieved peak area under the curve (AUC) values up to 0.943; however, performance systematically degraded with increasing ACR density, revealing limitations that are often masked in imbalanced evaluations. By enforcing clinically representative density distributions, Mass-Bench provides a more reliable estimation of localization performance, which directly impacts downstream clinical tasks. In this context, binary ACR classification achieved F1-scores up to 0.976, while binary BI-RADS discrimination reached accuracies up to 0.93. However, multi-class classification remained more challenging, showing increased sensitivity to dataset heterogeneity and contextual information. These findings demonstrate that conventional evaluations may overestimate robustness, particularly in dense breast categories, and highlight the importance of density-aware benchmarking for developing reliable and clinically applicable AI systems in mammography.

1. Introduction

Breast cancer (BC) is the most frequently diagnosed cancer in women worldwide and remains one of the leading causes of cancer-related mortality [1,2]. Consequently, health systems prioritize early detection and accurate characterization of breast lesions to reduce mortality and improve patient outcomes. Mammography continues to be the screening modality of choice due to its cost-effectiveness and its ability to reveal abnormalities in breast tissue, particularly breast masses and microcalcifications associated with BC development [3,4]. In clinical practice, mammographic findings are described according to the American College of Radiology (ACR) guidelines established in the Breast Imaging Reporting and Data System (BI-RADS), which provide standardized descriptors for breast density and lesion assessment [5,6]. Mass detection and characterization are influenced by multiple factors, including texture, shape, margins, and the surrounding parenchymal density [7,8,9]. In particular, high breast density poses a major challenge, as dense fibroglandular tissue may obscure lesions and reduce mammographic sensitivity, as explicitly acknowledged in the BI-RADS manual [6]. Because high breast density is prevalent among patients with suspected BC, it has been adopted as a clinically relevant risk indicator that guides screening strategies and follow-up protocols [10,11,12].
In recent years, machine learning (ML) and deep learning (DL) approaches have been widely investigated to support mammographic analysis, aiming to improve both mass detection and classification [13,14,15]. Computer-aided diagnosis (CAD) systems have traditionally relied on public mammography datasets annotated by expert radiologists, while advances in computational resources have enabled DL models to extract increasingly complex image representations. Modern convolutional neural networks, including region-based CNN architectures [16,17,18], have demonstrated strong performance in identifying suspicious regions and learning discriminative features beyond simple benign/malignant classification [19,20]. Breast masses constitute one of the most common mammographic findings associated with BC. Their detection and analysis are particularly challenging in dense breasts, where fibroglandular tissue may occupy more than 50% of the breast area and exhibit radiographic appearances similar to tumors. Several studies have reported an increased likelihood of missed cancers in women with high breast density [21,22], while epidemiological evidence indicates that dense breasts are also associated with a higher risk of developing BC [23]. As a result, automated systems have primarily focused on improving mass localization to assist radiologists in visual screening tasks [24].
Early computational approaches framed mass localization as a binary pattern recognition problem. Campanini et al. [25] employed support vector machines (SVMs) to identify suspicious regions in mammograms, achieving a sensitivity of 0.80 on the Digital Database for Screening Mammography (DDSM) [26], later improved to 0.89 using twin-SVM variants [27]. Subsequent DL-based studies leveraged convolutional architectures to enhance mass detection performance [28], reporting sensitivities around 0.85 on DDSM using networks such as AlexNet, VGGNet, and GoogLeNet. More recently, single-stage detectors from the YOLO family have been extensively applied to mammographic mass localization. High accuracies have been reported on datasets such as DDSM, INBREAST, and CBIS-DDSM [29,30,31,32,33]. However, most DL-based pipelines still require an initial lesion localization stage, as accurate characterization depends on the correct identification of Regions of Interest (ROIs).
Mass localization is essential for studies that investigate automated mass characterization and the direct prediction of BI-RADS assessment categories and ACR breast density levels from mammographic images. Early approaches relied on handcrafted descriptors describing lesion morphology, margins, and texture combined with classical ML classifiers such as support vector machines (SVMs), decision trees, and k-nearest neighbors. For example, Keller et al. [34] demonstrated the feasibility of automated breast density estimation using tissue segmentation and SVM-based analysis, reporting strong agreement with radiologist assessments, including Pearson correlation coefficient (r) values up to 0.89 for continuous density estimation and weighted Cohen’s kappa ( κ ) values up to 0.79 for categorical density assessment, supporting the use of engineered features and traditional ML approaches. With the emergence of deep learning, convolutional neural networks (CNNs) have become the dominant paradigm for BI-RADS prediction. Tsai et al. [35] proposed a block-based deep neural network capable of classifying mammographic findings into eight BI-RADS categories, achieving an accuracy of 94.2% and an AUC of 0.972. Nguyen et al. [36] proposed a two-stage multi-view framework integrating deep convolutional feature extraction with a LightGBM classifier for joint BI-RADS and breast density prediction. Their results demonstrated consistent improvements over single-view approaches, with gains of up to 5–10% in macro F1-score across datasets, highlighting the effectiveness of multi-view feature fusion.
Despite recent advances in deep learning-based mammographic analysis, the limited availability of annotated samples in public datasets remains a significant challenge. This limitation can hinder model generalization and may introduce hidden biases. While data augmentation is often used to address this issue, it often fails to maintain sufficient diversity and realistic data properties [37,38]. Additionally, most studies report performance using datasets with highly skewed breast density distributions. Because lesion visibility is significantly influenced by fibroglandular density, this imbalance can mask density-dependent performance degradation and lead to overly optimistic conclusions. Although this factor is clinically relevant, a systematic evaluation of mass localization using a density-balanced experimental design has not been reported.
Furthermore, prior studies frequently evaluate models using isolated datasets under heterogeneous experimental protocols, making direct comparison across studies difficult. In many cases, differences in dataset composition, annotation strategies, breast density distribution, and BI-RADS grouping are not explicitly controlled, which may introduce hidden sources of bias and limit the reproducibility and clinical interpretability of reported results [39,40,41]. Recent studies have also emphasized that dataset heterogeneity, distributional shifts, and the lack of harmonized evaluation protocols remain major barriers to reliable and clinically deployable mammographic AI systems [40,42]. Additionally, conventional global performance metrics may mask density-dependent degradation effects, potentially leading to overly optimistic conclusions under highly imbalanced breast density distributions. Moreover, most existing work focuses primarily on binary classification, whereas clinically relevant multi-class characterization remains considerably less explored under standardized conditions. These limitations highlight the need for a unified, density-aware benchmark to enable reproducible, clinically relevant evaluations of mammographic mass localization and classification across datasets [43]. Recent uncertainty-aware CAD frameworks have highlighted the importance of reliability-oriented evaluation strategies that characterize model confidence and robustness across heterogeneous clinical conditions [44].
This study makes the following contributions. First, we introduce Mass-Bench, a clinically balanced and harmonized multi-dataset benchmark curated from CBIS-DDSM, INBREAST, VinDr-Mammo, and DMID under a unified canonical schema. The benchmark standardizes lesion-centered ROIs and encodes ACR breast density and BI-RADS assessment categories into consistent numeric representations, enabling reproducible cross-dataset evaluation. By explicitly balancing representation across the four ACR density categories, Mass-Bench addresses the common issue of density imbalance observed in public mammography datasets. Second, we provide the first detailed density-stratified analysis of automated breast mass localization across multiple heterogeneous datasets. This reveals differences in the performance of automatic localization models that are often hidden in overall statistics. We also quantify distributional bias using the Kullback–Leibler divergence between the density distributions of our datasets and those found in clinical practice. This links the composition of datasets to how well models perform in real-world settings. The findings show that models trained on imbalanced datasets can appear more effective than they are because of differences in representation, especially for dense breast cases. Finally, the results highlight the importance of evaluation strategies that account for density and distribution when developing AI systems for mammographic mass detection and classification.

2. Materials and Methods

This section describes the experimental design and protocol used to evaluate clinical classification based on mammographic ROIs. We begin by describing the publicly available datasets integrated into Mass-Bench, followed by the construction of the benchmark and the harmonization strategy used to organize the ACR breast density and BI-RADS assessment labels. Finally, we detail the preprocessing steps, the ROI generation procedure, the classification protocol, and the evaluation metrics calculated in all experiments. All experiments were run in Python 3.10 within the Google Colab environment (Google LLC, Mountain View, CA, USA), using PyTorch 2.0 for the DL models.

2.1. Publicly Available Datasets

This study was based exclusively on samples from publicly available mammography datasets accessed from their original sources, and no data have been modified or redistributed. Each dataset was analyzed not only in terms of size and annotation availability, but also with respect to clinical label distribution (ACR and BI-RADS), enabling explicit imbalance characterization prior to integration (Table 1). The DDSM [26] was one of the first collections to provide cranio-caudal (CC) and mediolateral oblique (MLO) views of digitized mammograms for each patient. Its curated subset, CBIS-DDSM [11], contains higher-quality images and is among the most widely used today, comprising 10,480 mammograms of 2048 × 2048 pixels and 16-bit depth. It includes detailed annotations such as ACR grade, abnormality type, mass characteristics, and benign or malignant label. INBREAST [45] is a more recent dataset created by researchers from the Radiology Institute of the University of Lisbon (Portugal). It contains 410 high-quality digital mammograms of size 4096 × 4096 from 115 patients, with CC and MLO views, and provides BI-RADS and ACR grades with annotations validated by two experts. Another current dataset is DMID [46], which was acquired from participants in India and contains 2040 mammograms from 510 participants categorized as normal, benign, and malignant. It provides BI-RADS and ACR grades, as well as ground-truth annotations for the lesions, including bounding box coordinates and radii, with dimensions 4751 × 6000 pixels.
More recently, VinDr-Mammo [47] was introduced as a large-scale full-field digital mammography dataset. It comprises 20,000 images of size 4080 × 4080 from 5000 patients in Vietnam, with CC and MLO views, providing breast-level assessment and extensive lesion-level annotations. Two radiologists annotated the dataset, and in cases of discrepancy, the evaluation of a third independent radiologist was included.
Table 1. Summary of publicly available mammography datasets integrated into Mass-Bench.
| Dataset | # Patients | # Images | # Masses | BI-RADS | ACR |
|---|---|---|---|---|---|
| CBIS-DDSM [11] | 2620 | 10,480 | 1698 | Yes | Yes |
| INBREAST [45] | 115 | 410 | 108 | Yes | Yes |
| VinDr-Mammo [47] | 5000 | 20,000 | 1205 | Yes | Yes |
| DMID [46] | 510 | 2040 | 469 | Yes | Yes |
| Mass-Bench | 8245 | 32,930 | 3480 | Yes | Yes |
Mass-Bench integrates multiple publicly available mammography datasets under a unified protocol, including patient-level harmonization, mass-centered annotations, and standardized BI-RADS and ACR labels.

2.2. Canonical Schema and Label Harmonization

To ensure cross-dataset consistency, all datasets were mapped into a unified canonical schema. Each mammogram containing an annotated breast mass was represented using the following standardized fields:
  • dataset: source dataset identifier.
  • image_id: globally unique image identifier.
  • patient_id: anonymized patient-level identifier.
  • view: mammographic projection (CC or MLO).
  • laterality: breast side (left or right).
  • bbox_x, bbox_y, bbox_w, bbox_h: normalized bounding box coordinates.
  • acr_mc: multi-class ACR density (1–4).
  • acr_bin: binary ACR grouping (1–2 vs. 3–4).
  • birads_mc: multi-class BI-RADS category.
  • birads_bin: binary BI-RADS grouping (<4 vs. ≥4).
ACR density grades originally encoded as A–D were mapped to numeric values 1–4 to ensure consistent representation for multiclass classification. Binary ACR classification grouped low-density (1–2) versus high-density (3–4) cases, reflecting clinically meaningful stratification. BI-RADS categories retain their numeric form; however, this study focuses on BI-RADS 2–5, excluding BI-RADS 1 (no lesion) and BI-RADS 6 (biopsy-proven malignancy), to ensure consistency with lesion-centered ROI analysis. For binary BI-RADS classification, the two classes were benign/likely benign (<4) versus suspicious/malignant (≥4).
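The label mapping described above can be illustrated with a short sketch. It is not the authors' code: the function names, the 0/1 encoding of the binary fields, and the use of `None` for missing or excluded grades are assumptions made for illustration only.

```python
from typing import Optional, Union

# ACR letters A-D map to numeric grades 1-4, as described in the text.
ACR_LETTER_TO_NUM = {"A": 1, "B": 2, "C": 3, "D": 4}
# Only BI-RADS 2-5 are retained; 1 and 6 are excluded by the study design.
BIRADS_KEPT = {2, 3, 4, 5}


def harmonize_acr(acr: Union[str, int, None]) -> Optional[int]:
    """Return the numeric ACR grade (1-4), or None if missing or invalid."""
    if acr is None:
        return None
    if isinstance(acr, str):
        acr = acr.strip().upper()
        if acr in ACR_LETTER_TO_NUM:
            return ACR_LETTER_TO_NUM[acr]
        if not acr.isdigit():
            return None
        acr = int(acr)
    return acr if acr in (1, 2, 3, 4) else None


def harmonize_record(acr: Union[str, int, None], birads: Optional[int]) -> dict:
    """Build the four label fields of the canonical schema for one sample.

    Assumption: binary fields use 0 for the low group and 1 for the high
    group. Missing labels stay None (no imputation).
    """
    acr_mc = harmonize_acr(acr)
    birads_mc = birads if birads in BIRADS_KEPT else None
    return {
        "acr_mc": acr_mc,
        "acr_bin": None if acr_mc is None else int(acr_mc >= 3),
        "birads_mc": birads_mc,
        "birads_bin": None if birads_mc is None else int(birads_mc >= 4),
    }
```

Keeping missing grades as `None` mirrors the decision, stated later in the text, not to impute clinical labels.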

2.3. Mass-Bench Construction and Clinical Balancing

Mass-Bench was constructed with explicit consideration of the distribution of clinical labels. A search was conducted across all publicly available databases that provide the original ACR and BI-RADS frequencies, and these frequencies were quantified to analyze each dataset prior to integration. Formally, the integrated benchmark dataset is represented as:
$$D = \{(I_i, l_i, d_i, b_i)\}_{i=1}^{N} \tag{1}$$
where $I_i$ denotes the mammographic image associated with the $i$-th sample, $l_i$ represents the corresponding BI-RADS assessment label, $d_i$ denotes the associated ACR breast density category, $b_i$ corresponds to the lesion bounding box annotation, and $N$ represents the total number of mammographic mass samples included in the integrated benchmark. The empirical distribution associated with each clinical label category was estimated as:
$$P(k) = \frac{N_k}{N} \tag{2}$$
where $P(k)$ denotes the empirical probability of a clinical category $k \in \{l, d\}$, $N_k$ represents the number of samples associated with category $k$, and $N$ corresponds to the total number of samples within the evaluated dataset. Under this formulation, the benchmark integration process can be interpreted as a distribution-aware aggregation strategy that preserves clinically meaningful label relationships while reducing severe distributional imbalance across mammographic datasets.
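As a minimal sketch of the empirical distribution above, and of the Kullback–Leibler comparison mentioned in the Introduction, the following computes $P(k) = N_k / N$ over a list of labels and its divergence from a reference distribution. The function names are hypothetical, and the uniform reference used in the usage note is a placeholder, not a clinical prevalence reported by the study.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def empirical_distribution(labels: Sequence[Hashable],
                           categories: Iterable[Hashable]) -> Dict[Hashable, float]:
    """P(k) = N_k / N for every category k, including categories with zero samples."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {k: counts.get(k, 0) / n for k in categories}


def kl_divergence(p: Dict[Hashable, float], q: Dict[Hashable, float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats. Terms with p(k) = 0 contribute zero; q is floored at eps."""
    return sum(p[k] * math.log(p[k] / max(q[k], eps)) for k in p if p[k] > 0)
```

For example, `kl_divergence(empirical_distribution(acr_labels, [1, 2, 3, 4]), {k: 0.25 for k in (1, 2, 3, 4)})` measures how far a dataset's ACR distribution is from a perfectly balanced one; a clinical reference distribution would replace the uniform dictionary.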
The datasets differed in annotation completeness. However, we did not apply imputation strategies for missing clinical grades, as synthetic assignment could introduce bias. Therefore, all classification experiments were performed on explicitly labeled subsets to ensure label integrity. BI-RADS 1 was excluded, as it corresponds to cases without visible lesions and is not compatible with the lesion-centered ROI formulation used in this study. Mass instances with incomplete or inconsistent bounding box definitions were also excluded after Intersection over Union (IoU) verification [48] using Equation (3), where $B_{loc}$ denotes the harmonized lesion annotation and $B_{ref}$ represents the corresponding reference annotation (ground truth). This ensures geometric consistency between harmonized lesion regions and original expert annotations across heterogeneous datasets.
$$\mathrm{IoU}(B_{loc}, B_{ref}) \geq 0.85 \tag{3}$$
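The IoU consistency check can be sketched as follows. This is an illustrative implementation, assuming boxes are stored as `(x, y, w, h)` with `(x, y)` the top-left corner, in line with the `bbox_x, bbox_y, bbox_w, bbox_h` fields of the canonical schema; the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner assumed


def iou_xywh(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def keep_consistent(pairs: List[Tuple[Box, Box]],
                    threshold: float = 0.85) -> List[Tuple[Box, Box]]:
    """Keep (harmonized, reference) pairs whose IoU meets the threshold."""
    return [(loc, ref) for loc, ref in pairs if iou_xywh(loc, ref) >= threshold]
```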
For each mammogram that showed a mass, we standardized (i) the definition of the region of interest, (ii) the associated ACR breast density label, and (iii) the BI-RADS assessment when available. Based on this analysis, the resulting reference set includes 3480 mammograms with mass annotations, enabling consistent training and evaluation both on individual datasets and on the integrated benchmark. Compared to isolated datasets, Mass-Bench broadens acquisition heterogeneity and enables analysis with clinically realistic label distributions. The plots in Figure 1 reveal marked imbalance among ACR breast density and BI-RADS categories in the original datasets. Most samples belong to intermediate classes, which may bias model optimization toward these dominant classes and lead to overly optimistic global performance estimates. The Mass-Bench framework seeks to address this limitation through density-aware balancing that preserves inter-dataset heterogeneity while reducing class-distribution disparities.

2.4. Experimental Framework

To ensure consistency across heterogeneous datasets, all classification experiments follow a unified, leakage-controlled protocol that covers preprocessing, mass localization, ROI generation, feature extraction, and classification, as shown in Figure 2.

2.4.1. Pre-Processing

All mammograms included in Mass-Bench were preprocessed using a standardized pipeline to harmonize acquisition variability across datasets. Training, validation, and test splits (70/20/10) were fixed at the patient level prior to preprocessing to prevent information leakage. Local contrast enhancement was performed using Contrast Limited Adaptive Histogram Equalization (CLAHE) [49] to improve differentiation among image ROIs, as expressed in Equation (4), where $H(\cdot)$ denotes the CLAHE transformation operator, $I$ is the image, $\alpha$ is the contrast clipping limit, and $\beta$ defines the local tile partition size used for adaptive histogram equalization. A clip limit of 2.0 and a tile grid size of 8 × 8 were used, following established mammographic preprocessing practices [33].
$$I_{CLAHE} = H(I; \alpha, \beta) \tag{4}$$
CLAHE was selected for its ability to enhance local contrast while limiting noise amplification, which is particularly important in mammographic imaging, where subtle intensity variations define lesion visibility [50]. It has been widely adopted in medical image analysis for enhancing low-contrast structures, particularly in mammography [51,52]. To ensure consistency across datasets and to isolate the effects of dataset composition and model architecture, we used a single standardized CLAHE-based enhancement pipeline. This avoids the confounding that can arise from testing multiple preprocessing strategies: in multi-dataset settings, varying preprocessing methods introduce additional variability and complicate evaluation.
This step improves mass visibility in dense tissue while limiting noise amplification. During training, data augmentation included horizontal flips, in-plane rotations (±90°), and moderate brightness and contrast adjustments, applied as random intensity scaling factors sampled within the range [0.8, 1.2] to simulate acquisition variability while preserving anatomical consistency. Augmentation was restricted to transformations preserving lesion morphology. No lesion-specific enhancement or sensitivity-driven preprocessing was applied. For computational consistency, all images were resized to 768 × 768 pixels prior to ROI extraction and classification. While the original image resolutions range from 2048 to 6000 pixels, resizing to 768 × 768 represents a trade-off between preserving anatomical detail and ensuring computational efficiency. Fixed-resolution strategies have been widely adopted in YOLO-based mammographic detection studies across heterogeneous datasets [53,54,55,56], showing that they retain sufficient information to detect small lesions while enabling consistent and fair comparisons across models and datasets.
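The patient-level 70/20/10 split described in this subsection can be sketched as below. This is a hedged illustration rather than the authors' procedure: the function name, the fixed seed, the rounding of split sizes, and the use of `(dataset, patient_id)` as the grouping key (so that identifiers from different source datasets cannot collide) are all assumptions.

```python
import random
from typing import Dict, List, Sequence


def patient_level_split(records: List[dict],
                        ratios: Sequence[float] = (0.7, 0.2, 0.1),
                        seed: int = 0) -> Dict[str, List[dict]]:
    """Assign every record of a patient to the same partition.

    Each record is a dict with at least 'dataset' and 'patient_id' keys,
    following the canonical schema. Patients, not images, are shuffled.
    """
    patients = sorted({(r["dataset"], r["patient_id"]) for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)

    n = len(patients)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))

    assignment = {}
    for idx, patient in enumerate(patients):
        if idx < n_train:
            assignment[patient] = "train"
        elif idx < n_train + n_val:
            assignment[patient] = "val"
        else:
            assignment[patient] = "test"  # remainder goes to the test split

    splits: Dict[str, List[dict]] = {"train": [], "val": [], "test": []}
    for r in records:
        splits[assignment[(r["dataset"], r["patient_id"])]].append(r)
    return splits
```

Splitting on patients before any preprocessing means that the CC and MLO views of one patient never end up in different partitions, which is the leakage the protocol is designed to prevent.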

2.4.2. Automated Mass Localization via YOLO

Automated lesion localization can be formulated as a supervised object detection problem [57] in which a model $f_\theta$, parameterized by learnable parameters $\theta$, maps a mammographic image $I$ into a set of predicted bounding boxes $\hat{B}$ associated with suspicious lesion regions (Equation (5)). Each predicted lesion candidate $\hat{b}_i$ is described by its center coordinates $(\hat{x}_i, \hat{y}_i)$, predicted width and height $(\hat{w}_i, \hat{h}_i)$, objectness confidence score $\hat{s}_i$, and class probability $\hat{p}_i$, associated with a breast mass candidate in this case (Equation (6)).
$$f_\theta : I \rightarrow \hat{B} \tag{5}$$
$$\hat{b}_i = (\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i, \hat{s}_i, \hat{p}_i) \tag{6}$$
Under this formulation, the YOLO optimization objective $L_{YOLO}$ directly supervises the spatial localization variables, objectness confidence, and class probability components defined in Equation (6). The overall detection objective is expressed in Equation (7), where $L_{YOLO}$ denotes the total detection loss minimized during training, $L_{box}$ denotes the localization loss associated with bounding box regression, $L_{obj}$ represents the objectness confidence loss, and $L_{cls}$ corresponds to the classification loss. The weighting coefficients $\lambda_{box}$, $\lambda_{obj}$, and $\lambda_{cls}$ are positive scaling factors that control the relative contribution of each component during training, balancing spatial localization accuracy, objectness confidence estimation, and lesion classification performance.
$$L_{YOLO} = \lambda_{box} L_{box}(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i) + \lambda_{obj} L_{obj}(\hat{s}_i) + \lambda_{cls} L_{cls}(\hat{p}_i) \tag{7}$$
The YOLO architecture performs object localization in a single stage using a convolutional backbone composed mainly of small kernels. In this work, we focus on YOLOv5, YOLOv8, and YOLOv11 [58,59]. These versions have been widely used in the mammogram analysis literature, providing an accurate framework for comparison and allowing the incorporation of progressively different architectural and detection strategies. YOLOv5 corresponds to a mature anchor-based design widely adopted in medical imaging applications, whereas YOLOv8 and YOLOv11 incorporate more recent anchor-free and decoupled detection mechanisms with improved feature extraction and localization capabilities. The backbone relies on 3 × 3 convolutions with 1 × 1 bottlenecks for channel mixing, followed by normalization and nonlinear activations (LeakyReLU or SiLU). Spatial resolution is progressively reduced through stride-2 convolutions, increasing feature depth while decreasing spatial dimensions.
Multi-scale feature maps (e.g., $P_3$, $P_4$, $P_5$) are fused through a neck module (FPN/PAN), enabling the integration of low- and high-level information. The detection head produces dense predictions per spatial location. While YOLOv5 uses an anchor-based coupled head, YOLOv8 and YOLOv11 adopt an anchor-free decoupled design for classification and bounding box regression. IoU quantifies the spatial overlap between predicted and reference lesion regions, ranging from 0 (no overlap) to 1 (perfect overlap). Training incorporates IoU-based loss functions (e.g., GIoU/CIoU) to refine bounding box accuracy, and non-maximum suppression is applied during inference to remove overlapping detections. YOLOv5 emphasizes improvements in feature aggregation through cross-stage partial (CSP) connections. YOLOv11 further enhances feature extraction by incorporating modules such as C3k2, SPPF, and C2PSA blocks, improving detection performance, particularly for small objects.
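To make the inference-time suppression step concrete, the following is a minimal sketch of greedy non-maximum suppression. It is a generic textbook version, not the implementation used inside the YOLO frameworks; boxes are assumed to be `(x, y, w, h)` with a top-left origin, and the 0.5 suppression threshold is an illustrative default rather than a value reported in the text for this step.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner assumed
Detection = Tuple[Box, float]            # (box, confidence score)


def iou_xywh(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections: List[Detection],
               iou_threshold: float = 0.5) -> List[Detection]:
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    ordered = sorted(detections, key=lambda d: d[1], reverse=True)
    kept: List[Detection] = []
    for box, score in ordered:
        if all(iou_xywh(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```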
All models were trained under identical hyperparameters to ensure a fair and controlled comparison across architectures, following commonly adopted YOLO configurations reported in the literature [32,59]. The input image size was fixed at 768 × 768 pixels to preserve sufficient anatomical detail for detecting small masses while maintaining computational efficiency. A batch size of 16 was selected as a standard compromise between GPU memory constraints and stable gradient estimation during training. The initial learning rate ($lr_0 = 0.001$) and the AdamW optimizer were adopted based on widely used YOLO training configurations, providing stable convergence and effective regularization. An IoU threshold of 0.5 was used to define positive detections, consistent with standard object detection evaluation protocols. Finally, models were trained for 130 epochs, which is within the typical range reported in mammographic detection studies and was empirically verified to ensure convergence without overfitting.

2.4.3. ROI Generation

Lesion-centered ROIs, automatically extracted from each mammogram, are now evaluated in the context of the surrounding tissue. While detection targets the lesion region, the surrounding perilesional tissue may contain additional information. Research in radiomics and deep learning has demonstrated that perilesional and contextual tissues provide complementary diagnostic information beyond the lesion core [60,61]. The corresponding lesion-centered ROI $R_i$ from a predicted bounding box $\hat{b}_i$ was extracted by cropping the lesion-centered image region from the mammographic image $I$ (Equation (8)).
$$R_i = I[\hat{x}_i : \hat{x}_i + \hat{w}_i,\; \hat{y}_i : \hat{y}_i + \hat{h}_i] \tag{8}$$
Therefore, multiple ROI padding strategies were implemented to incorporate additional information. Bounding boxes were expanded by fixed relative margins of 0%, 10%, 20%, and 30% of the original lesion size, yielding progressively larger crops that incorporate increasing amounts of adjacent tissue. Context-aware ROI expansion was modeled through a padding factor $\gamma$ applied proportionally to the predicted lesion dimensions as:
$$R_i(\gamma) = I[\hat{x}_i - \gamma \hat{w}_i : \hat{x}_i + (1 + \gamma)\hat{w}_i,\; \hat{y}_i - \gamma \hat{h}_i : \hat{y}_i + (1 + \gamma)\hat{h}_i] \tag{9}$$
where $\gamma \in \{0, 0.1, 0.2, 0.3\}$ defines the contextual padding ratio relative to the predicted lesion size. In this approach, increasing values of $\gamma$ incorporate more surrounding perilesional tissue into the extracted ROI. This allows for the systematic analysis of how contextual information affects mammographic features and classification performance. Geometrically, the padding operation serves as a multiscale sampling strategy centered around the lesion, capturing both the lesion core and surrounding tissue within larger spatial neighborhoods.
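The padded crop of Equation (9) can be sketched as follows. Two assumptions are made here and should be read as such: `(x, y)` is treated as the top-left corner of the box, as the slicing in Equations (8) and (9) implies, and a padded box that would cross the image border returns `None`, in line with the exclusion of border-truncated ROIs mentioned in Section 2.4.5. The image is a plain list of rows so that the sketch needs no external library; function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def padded_roi_bounds(x: float, y: float, w: float, h: float, gamma: float,
                      img_w: int, img_h: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) of the gamma-padded ROI, or None if it leaves the image."""
    x0 = int(round(x - gamma * w))
    x1 = int(round(x + (1 + gamma) * w))
    y0 = int(round(y - gamma * h))
    y1 = int(round(y + (1 + gamma) * h))
    if x0 < 0 or y0 < 0 or x1 > img_w or y1 > img_h:
        return None  # truncated by the border: excluded rather than clamped
    return x0, y0, x1, y1


def crop_roi(image: Sequence[Sequence[float]], x: float, y: float, w: float,
             h: float, gamma: float = 0.0) -> Optional[List[List[float]]]:
    """Crop the padded ROI from a row-major image (rows index y, columns index x)."""
    img_h = len(image)
    img_w = len(image[0]) if img_h else 0
    bounds = padded_roi_bounds(x, y, w, h, gamma, img_w, img_h)
    if bounds is None:
        return None
    x0, y0, x1, y1 = bounds
    return [list(row[x0:x1]) for row in image[y0:y1]]
```

With $\gamma = 0$ the function reduces to the unpadded crop of Equation (8); each side grows by $\gamma$ times the corresponding box dimension otherwise.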

2.4.4. Feature Extraction

After ROI generation and spatial normalization, feature representations were extracted from each lesion-centered patch using two complementary strategies: handcrafted texture descriptors and deep convolutional embeddings. For handcrafted features, classical radiomic-inspired descriptors $\Phi$ were computed directly from grayscale ROIs $R_i(\gamma)$. These included first-order statistical measures (mean, variance, skewness, kurtosis), texture features derived from gray-level co-occurrence matrices (GLCM) [62], and local binary pattern (LBP) histograms [63]. GLCM features were computed across multiple orientations to capture spatial intensity dependencies, and summary statistics were aggregated to form a compact representation.
$$F_i = \left[\Phi(R_i(\gamma)),\; \phi_\omega(R_i(\gamma))\right] \tag{10}$$
In parallel, deep convolutional features were also extracted from $R_i(\gamma)$ using pretrained ImageNet models $\phi_\omega$, where $\omega$ denotes the pretrained network parameters associated with the deep feature extractor: VGG19 [64], ResNet50 [65], DenseNet121 [66], and EfficientNet-B3 [67]. Networks were used solely as fixed-feature extractors, without fine-tuning. The final convolutional activation maps were subjected to global average pooling to obtain compact embedding vectors, thereby preserving high-level semantic representations while reducing dimensionality.
For each ROI padding configuration, both handcrafted and deep representations were computed to obtain the complete feature vector $F_i$ (Equation (10)). All features were standardized prior to classifier training. This unified extraction protocol allowed systematic evaluation of representation type (texture-based vs. deep embedding), contextual inclusion (padding level), and classifier interaction under identical experimental conditions across datasets.
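As a partial illustration of the handcrafted branch, the sketch below computes only the four first-order statistics named above and concatenates them with an externally supplied embedding, in the spirit of Equation (10). GLCM and LBP descriptors are omitted. The function names are hypothetical, kurtosis is the non-excess form (a normal distribution gives 3), and the deep embedding is assumed to come from a separate pretrained network that is not shown here.

```python
import math
from typing import List, Sequence


def first_order_features(roi: Sequence[Sequence[float]]) -> List[float]:
    """Mean, variance, skewness, and (non-excess) kurtosis of a grayscale ROI."""
    pixels = [float(p) for row in roi for p in row]
    n = len(pixels)
    if n == 0:
        raise ValueError("ROI must contain at least one pixel")
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n  # population variance
    std = math.sqrt(var)
    if std == 0.0:
        # A constant patch has no shape information; report zeros by convention.
        return [mean, var, 0.0, 0.0]
    skew = sum(((p - mean) / std) ** 3 for p in pixels) / n
    kurt = sum(((p - mean) / std) ** 4 for p in pixels) / n
    return [mean, var, skew, kurt]


def build_feature_vector(roi: Sequence[Sequence[float]],
                         deep_embedding: Sequence[float]) -> List[float]:
    """Concatenate handcrafted statistics with a precomputed deep embedding."""
    return first_order_features(roi) + [float(v) for v in deep_embedding]
```

Standardization (for example, z-scoring with statistics estimated on the training partition only) would follow this step before classifier training.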

2.4.5. Classification

All ROIs were resized to a fixed spatial resolution prior to feature extraction. ROIs truncated by image borders or containing incomplete lesion information were excluded to avoid downstream bias. Binary and multiclass classification of ACR and BI-RADS grades was evaluated following the formulation in Section 2.2. Thus, the classification task for Mass-Bench was modeled as a supervised learning problem. Where F i denotes the feature representation of sample i, k i { l i , d i } represents the associated BI-RADS or ACR label, and D c l s defines the supervised classification dataset containing N samples.
$$D_{cls} = \{(F_i, k_i)\}_{i=1}^{N} \tag{11}$$
Features were extracted using a consistent protocol across datasets and the ROI padding configurations described above. All models were trained exclusively on training partitions and evaluated on held-out test sets. Performance was analyzed at three levels: (i) individual dataset evaluation, (ii) cross-dataset comparison under identical conditions, and (iii) integrated Mass-Bench benchmark assessment. This unified framework enables systematic analysis of dataset-specific variability, generalization behavior, and the impact of clinically realistic label distributions on ROI-based classification. Under this formulation, ROI-based mammographic classification can be interpreted as a mapping from multiscale lesion-centered feature representations to clinically meaningful diagnostic categories.

2.5. Evaluation Metrics

Classification performance was evaluated using metrics commonly adopted in biomedical decision-making tasks. Specifically, we report accuracy, precision, sensitivity (recall), specificity, and F1-score for all classification experiments.
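$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{14}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{15}$$
$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \tag{16}$$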
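As a worked illustration of these definitions, the following sketch computes the five metrics from confusion-matrix counts. The function name `binary_metrics` and the example counts are hypothetical and are not results from this study. Returning 0.0 for an empty denominator is an assumed convention.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, specificity and F1 from confusion counts."""
    def safe_div(num, den):
        # Assumed convention: a metric with an empty denominator is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }

# Hypothetical counts: 8 true positives, 5 true negatives, 2 false positives, 1 false negative.
example = binary_metrics(tp=8, tn=5, fp=2, fn=1)
```

With these counts, accuracy is 13/16, precision is 8/10, sensitivity is 8/9, specificity is 5/7, and the F1-score is 16/19.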
For multi-class problems such as ACR density grading and BI-RADS classification, the Macro F1-score was also computed [68]. This metric calculates the F1-score independently for each class and then averages the results, giving equal importance to all classes regardless of their frequency:
$$F1_{macro} = \frac{1}{C} \sum_{i=1}^{C} F1_i \tag{17}$$
where $C$ corresponds to the total number of classes considered in the evaluated classification task and $F1_i$ corresponds to the F1-score computed for class $i$. Macro-averaging is particularly appropriate in medical datasets with class imbalance, as it prevents dominant classes from disproportionately influencing the overall performance estimate.
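The effect of macro-averaging under imbalance can be seen in a small sketch. The function name `macro_f1` and the toy labels are hypothetical, and the set of classes is assumed to be the labels present in the ground truth.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (one-vs-rest for each class)."""
    classes = sorted(set(y_true))  # assumption: classes are those present in y_true
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: a model that always predicts the majority class.
y_true = [1] * 8 + [2] * 2
y_pred = [1] * 10
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.80, yet the macro F1-score is 4/9 (about 0.44) because the minority class contributes an F1 of zero.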
Receiver operating characteristic (ROC) curves were used to evaluate the trade-off between sensitivity and specificity across varying decision thresholds for each model and dataset. The area under the ROC curve (AUC), along with 95% confidence intervals, was reported as a threshold-independent measure of discriminative performance.
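A minimal sketch of the AUC computation is given below, using the rank-based (pairwise) formulation. The text does not state how the 95% confidence intervals were obtained; the percentile bootstrap shown here is an assumption for illustration only, and the names `roc_auc` and `bootstrap_auc_ci` are hypothetical.

```python
import random

def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed CI method, not the paper's)."""
    roc_auc(scores, labels)  # fail early if only one class is present
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # keep only resamples that contain both classes
            aucs.append(roc_auc([scores[i] for i in idx], ys))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a sketch; a sorted-rank implementation would be preferable for large test sets.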
To identify a clinically relevant operating point and further characterize discriminative performance across datasets, optimal decision thresholds were determined using the Youden Index ($J$) [69], which maximizes the combined sensitivity and specificity:
$$J = \mathrm{Sensitivity} + \mathrm{Specificity} - 1 \tag{18}$$
The optimal decision threshold τ * was defined as:
$$\tau^{*} = \arg\max_{\tau} \left( \mathrm{Sensitivity}(\tau) + \mathrm{Specificity}(\tau) - 1 \right) \tag{19}$$
Here, $\tau$ represents the decision threshold applied to the detection confidence scores. The optimal threshold $\tau^{*}$ was determined independently for each model and dataset using the Youden Index criterion and was subsequently used for ROC-based performance analysis and optimal cutoff reporting in the Supplementary Materials.
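The threshold search can be sketched as an exhaustive scan over the observed confidence scores. The function name `youden_threshold` is hypothetical, a sample is assumed to be predicted positive when its score is greater than or equal to $\tau$, and ties in $J$ are resolved in favor of the lowest threshold; none of these conventions is stated in the text.

```python
def youden_threshold(scores, labels):
    """Return (tau*, J*) maximizing sensitivity + specificity - 1 over observed scores."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required to compute the Youden Index")
    best_tau, best_j = None, float("-inf")
    for tau in sorted(set(scores)):
        # Assumed decision rule: predict positive when score >= tau.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < tau)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_tau, best_j = tau, j
    return best_tau, best_j

# Hypothetical confidence scores and ground-truth labels.
tau_star, j_star = youden_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

For these four hypothetical scores, thresholds 0.35 and 0.8 both give $J = 0.5$, and the scan returns the lower one, 0.35.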
Dataset suitability was evaluated using the Kullback–Leibler (KL) divergence [70] to compare the ACR density distribution of each dataset against the expected clinical reference distribution. It is formally defined in Equation (20), where $P(i)$ denotes the empirical probability of ACR category $i$ within the evaluated dataset, $Q(i)$ represents the corresponding clinical reference probability, and $C = 4$ corresponds to the total number of ACR density categories considered in the analysis. $Q$ follows the expected clinical proportions (10% ACR 1, 40% ACR 2, 40% ACR 3, and 10% ACR 4). Lower KL divergence values indicate greater agreement between dataset and reference distributions, suggesting improved statistical representativeness and reduced distributional bias. In contrast, larger divergence values indicate stronger deviations from clinically expected proportions and potentially less reliable evaluation conditions. Thus, KL divergence serves as an information-theoretic measure of clinical distributional realism for comparative analysis across heterogeneous mammographic datasets.
$$D_{KL}(P \parallel Q) = \sum_{i=1}^{C} P(i) \log \frac{P(i)}{Q(i)} \tag{20}$$
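Equation (20) can be evaluated directly from per-category sample counts, as in the following sketch. The function name `kl_divergence` and the example counts are hypothetical. The natural logarithm is assumed because the text does not state the base, and empty categories are skipped under the usual convention that $0 \log 0 = 0$.

```python
import math

# Clinical reference proportions for ACR 1-4, as stated in the text.
REFERENCE_Q = [0.10, 0.40, 0.40, 0.10]

def kl_divergence(counts, q=REFERENCE_Q):
    """D_KL(P || Q) where P is the empirical distribution given by `counts`."""
    total = sum(counts)
    divergence = 0.0
    for count, q_i in zip(counts, q):
        if count == 0:
            continue  # 0 * log(0 / q) is taken as 0
        p_i = count / total
        divergence += p_i * math.log(p_i / q_i)  # natural log (assumed base)
    return divergence

# Hypothetical dataset skewed toward ACR 3.
skewed = kl_divergence([5, 30, 50, 15])
# A dataset that exactly follows the reference proportions has zero divergence.
matched = kl_divergence([10, 40, 40, 10])
```

Because the divergence is non-negative and equals zero only when the two distributions coincide, any departure from the reference proportions yields a strictly positive value.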

3. Results

The first goal of this study was to localize masses across datasets from diverse sources and under different conditions. Accurate mass localization required a balanced, diverse dataset covering the four ACR density grades, and the curated Mass-Bench dataset was compiled for this purpose. To ensure reliable results, the number of samples per class was balanced to match the minority class (ACR 4), yielding 240 samples per class. Each mammogram was labeled with its corresponding breast density category according to the ACR system. For all experiments, data were split into training (70%), validation (20%), and testing (10%) sets. For mass localization, YOLOv5, YOLOv8, and YOLOv11 were trained exclusively on mammograms that exhibit masses. All models were trained for 130 epochs, and comparisons were made with and without data augmentation, as detailed in the methodology. All localization and classification experiments reported in this study were conducted using 5-fold cross-validation to improve the robustness and reliability of the evaluation.
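The balancing and partitioning described above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' pipeline: the function name `balanced_split` and the image identifiers are hypothetical, sampling is done per image, and the leakage control and 5-fold cross-validation mentioned in the text are not modeled.

```python
import random

def balanced_split(samples_by_class, per_class=240, fractions=(0.7, 0.2, 0.1), seed=0):
    """Downsample every class to `per_class` items, then split each class 70/20/10."""
    rng = random.Random(seed)
    n_train = round(fractions[0] * per_class)
    n_val = round(fractions[1] * per_class)
    splits = {"train": [], "val": [], "test": []}
    for label, items in samples_by_class.items():
        chosen = rng.sample(items, per_class)  # raises ValueError if a class is too small
        splits["train"] += [(x, label) for x in chosen[:n_train]]
        splits["val"] += [(x, label) for x in chosen[n_train:n_train + n_val]]
        splits["test"] += [(x, label) for x in chosen[n_train + n_val:]]
    return splits

# Hypothetical pool of image identifiers for ACR grades 1-4.
pool = {acr: [f"acr{acr}_img{i}" for i in range(300)] for acr in (1, 2, 3, 4)}
splits = balanced_split(pool)
```

With 240 samples in each of four classes (960 in total), the 70/20/10 split yields 672 training, 192 validation, and 96 test samples, that is, 168, 48, and 24 per class.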

3.1. Performance of Mass Localization Across ACR Breast Density Grades

To understand the influence of breast density on automatic mass localization, model performance was first evaluated for each ACR density grade using YOLOv8. A stratified analysis isolates the effect of tissue composition on localization accuracy, independent of the dataset distribution. YOLOv8 achieved its best performance in low- to mid-density categories (Table 2). In CBIS-DDSM, the highest F1-score was observed at ACR 2 (0.791), while Mass-Bench reached its peak at ACR 2–3 with F1-scores of 0.866. INBREAST reported near-perfect performance in ACR 2 and 4 (F1 = 0.941–0.999), although based on very limited samples. In contrast, VINDr-Mammo showed lower performance, with F1-scores decreasing from 0.750 (ACR 2) to 0.571 (ACR 4). DMID remained relatively stable across densities, with F1-scores ranging from 0.702 to 0.778. Overall, these results indicate that localization performance depends on breast density and generally degrades as density increases. As shown in Table 2, some datasets contain a limited number of test samples in specific ACR categories (e.g., INBREAST and VINDr-Mammo with as few as 1–4 samples). This reflects the inherent class imbalance in mammographic datasets, in which certain density levels are underrepresented. While small sample sizes may introduce variability in isolated metrics, performance in this study is evaluated across multiple datasets and density levels within the Mass-Bench framework, which helps mitigate these limitations and provides a more robust assessment of model behavior.
Higher accuracy and sensitivity were consistently observed for low-density breasts (ACR 1–2), while performance tended to degrade as breast density increased (ACR 3–4). This trend was observed across datasets. In particular, dense breast grades showed increased localization errors and reduced true-positive rates, suggesting that fibroglandular tissue limits the separability of masses. These effects could be partially masked when results are reported over the entire test set without stratification by breast density. Since low-density samples comprise most of the dataset, the average could be biased upward.

3.2. Breast Density Imbalance on Localization Performance

Although stratified analysis shows a clear relationship between localization performance and breast density, dataset composition also influences the results. Therefore, the role of density imbalance across datasets must be analyzed. In practice, many datasets exhibit highly imbalanced distributions of ACR density grades, which can significantly bias aggregate performance metrics. To quantify this imbalance, the distribution across datasets was evaluated using KL divergence (see Figure 3). As shown in the figure, DMID (0.0243) and Mass-Bench (0.0404) present the lowest divergence values, indicating a closer alignment with the expected clinical distribution and making them more suitable for training and evaluating unbiased models. CBIS-DDSM (0.0964) also shows relatively low divergence, whereas INBREAST (0.4540) and VINDr-Mammo (0.5697) exhibit substantially larger deviations, reflecting imbalanced density representation. These findings highlight that dataset representativeness plays a critical role in reliable model evaluation.
As observed in Table 2, in more balanced datasets, the metrics remain relatively consistent across grades, showing limited dispersion. In contrast, when one ACR class contains significantly more samples, performance tends to concentrate around that majority class, while other grades exhibit noticeable declines. Similarly, when a class has very few samples, the metrics become unstable, often resulting in extreme or overoptimistic values that are likely due to statistical variability rather than true generalization. Overall, class imbalance increases metric instability across density grades. Although DMID shows relatively controlled variability, it still exhibits mild distribution differences, resulting in slightly wider metric fluctuations. In comparison, Mass-Bench stands out for its balanced distribution, its reliability in grade-specific evaluation, and its reduced risk of biased performance toward a dominant class.
Therefore, comparisons of results based solely on global performance scores may be misleading. This analysis emphasizes that breast density imbalance is a significant source of hidden bias in mammographic mass localization. It underscores the importance of implementing density-aware evaluation protocols to ensure accurate, clinically relevant model assessments.

3.3. Overall Localization Performance on the Mass-Bench

Once the impact of density imbalance has been established, the overall localization performance across the benchmark is analyzed, consistent with the evaluation strategy reported in previous studies. For this purpose, three YOLO-based architectures—YOLOv5, YOLOv8, and YOLOv11—were trained and tested on the full Mass-Bench. As Table 3 summarizes, all evaluated YOLO variants achieved a high sensitivity (recall), which is clinically desirable as it minimizes the risk of missed malignant cases (false negatives). Beyond recall, the models also achieved a strong F1-Score, indicating a favorable balance between true positives and false positives. Detection quality was further evaluated using mean Average Precision (mAP@50 and mAP@50-95) metrics. As shown in Table 3, YOLOv11 achieved the best localization performance with an mAP@50 of 0.663 and an mAP@50-95 of 0.335, outperforming YOLOv8 (0.646 and 0.326, respectively) and YOLOv5 (0.598 and 0.287). Although performance differences among the three models are moderate, YOLOv11 obtained the highest overall accuracy (0.721), precision (0.717), and F1-score (0.835). Since the benchmark is density-balanced, the global metrics are not influenced by the overrepresentation of any specific ACR category. The improved performance of YOLOv11 compared to YOLOv5 and YOLOv8 can be attributed to its enhanced feature extraction capabilities, particularly through modules such as C3k2 and C2PSA, which better capture spatial and contextual dependencies. These improvements are especially beneficial in mammographic imaging, where masses often exhibit subtle intensity variations and poorly defined boundaries. Additionally, enhanced multi-scale feature integration supports the detection of small lesions within heterogeneous breast tissue, while the refined detection head design contributes to more stable localization and classification predictions.
Across datasets, AUC values ranged from moderate to high, reflecting consistent model capability for mass discrimination. YOLOv5 achieved the highest AUC in INBREAST (0.914), CBIS-DDSM (0.671), and VINDr-Mammo (0.801), while YOLOv8 obtained the best performance in DMID (0.943). YOLOv11 showed competitive performance across all datasets but did not outperform the other variants in any specific case (Figures S1–S4 of Supplementary Materials).

ACR Breast Density Classification

To evaluate the ability of lesion-centered features to characterize breast density, an ACR classification analysis was conducted across datasets using both binary (ACR 1–2 vs. 3–4) and multi-class (ACR 1–4) formulations. Table 4 summarizes the best performance obtained for each dataset under the optimal feature-classifier configuration. Binary classification was evaluated on datasets with sufficient representation of low- and high-density categories, whereas multi-class classification was assessed when all four ACR categories were available.
Table 4 presents the best performance achieved in ACR density classification using lesion-centered ROIs, while the class distribution provides context for the difficulty of each scenario. In addition to the best-performing configurations, an extensive experimental evaluation was conducted across multiple representation and classification strategies. Specifically, for each dataset, five feature extraction approaches were evaluated, including HC, VGG19, R50, DN121, and ENB3, combined with five classical machine learning classifiers, namely LR, SVM, KNN, RF, and XGB. Each configuration was further analyzed under four bounding-box padding conditions (B0, B10, B20, and B30), allowing the assessment of the impact of contextual information around the lesion. This comprehensive setup resulted in a large combinatorial exploration of feature, classifier and padding interactions, from which the best-performing configuration per dataset is reported in the table.
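The size of this search space can be made explicit with a short sketch that enumerates the configurations named above. The abbreviations are taken from the text, while the function names and the structure of the result records (a dictionary with a `score` field) are hypothetical.

```python
from itertools import product

FEATURES = ["HC", "VGG19", "R50", "DN121", "ENB3"]
CLASSIFIERS = ["LR", "SVM", "KNN", "RF", "XGB"]
PADDINGS = ["B0", "B10", "B20", "B30"]

def enumerate_configurations():
    """All feature/classifier/padding combinations evaluated per dataset and task."""
    return [
        {"feature": f, "classifier": c, "padding": p}
        for f, c, p in product(FEATURES, CLASSIFIERS, PADDINGS)
    ]

def select_best(results):
    """Pick the configuration with the highest score from a list of result records."""
    return max(results, key=lambda record: record["score"])

configurations = enumerate_configurations()
```

Five feature extractors, five classifiers, and four padding levels give 5 × 5 × 4 = 100 configurations per dataset and task, from which only the best-scoring one is reported.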
In the binary classification, the highest accuracy is achieved using VINDr-Mammo, which reports 0.95 accuracy. However, this result should be interpreted considering the strong disproportion in the density distribution, where ACR 3 clearly dominates the dataset compared to ACR 2, ACR 4, and especially ACR 1. This concentration in a single class may facilitate the separation between the two binary groups. In contrast, the CBIS-DDSM dataset presents a broader distribution, with ACR 2 being the most represented class, followed by ACR 3, while ACR 1 and ACR 4 are less represented. This dataset shows a more balanced distribution across intermediate densities and reaches a binary classification accuracy of 0.70, indicating greater difficulty in separating the groups when multiple densities are present. The DMID dataset shows a relatively balanced distribution between ACR 2 and ACR 3, while ACR 1 and ACR 4 have fewer samples. This proximity between the central densities may hinder separation of the binary groups, as reflected in a moderate accuracy of 0.68. In INBREAST, the number of samples is considerably smaller, mainly in ACR 3 and ACR 4, which limits the model's generalizability and is reflected in the lower binary performance observed (0.63 accuracy). At the Mass-Bench level, binary ACR classification achieves an accuracy of 0.90, reflecting the benefit of integrating heterogeneous datasets under a unified protocol while preserving variability in density distribution.
In multiclass classification, VINDr-Mammo again achieves the best performance (0.95 accuracy), although this result is influenced by the high prevalence of ACR 3, which may facilitate the learning of dominant patterns. INBREAST achieves 0.81 accuracy, but given its limited size and high class imbalance, the overall accuracy may mask poor performance on minority classes when the model mainly classifies the majority classes correctly. In DMID, where the intermediate ACR 2 and ACR 3 have similar sizes, multiclass classification is more challenging and achieves an accuracy of 0.70. In contrast, CBIS-DDSM achieves a lower performance in the multiclass setting (0.48 accuracy), despite covering all four ACR categories. At the Mass-Bench level, multiclass ACR classification achieves an accuracy of 0.822, indicating that integrating heterogeneous datasets under a unified protocol supports improved generalization across density categories. Larger ROI paddings (B20 and B30) appeared frequently among the best-performing configurations, suggesting that additional contextual information may support discrimination between adjacent density levels.

3.4. BI-RADS Classification Performance

In addition to mass localization and density-aware detection, we evaluated the feasibility of inferring BI-RADS assessment categories from lesion-centered ROIs. BI-RADS classification represents a higher-level clinical task that takes into account both lesion morphology and contextual information. While most studies use the entire mammographic image for this task, this experiment aims to assess the accuracy of BI-RADS classification when relying only on information centered on the mass and its surrounding area.
BI-RADS classification experiments were conducted using the same ROIs containing breast masses described previously, now labeled by BI-RADS grade. Both binary and multi-class formulations were evaluated, as supported by the dataset annotations. The class distribution in Mass-Bench comprises BI-RADS 2 (n = 300), BI-RADS 3 (n = 1258), BI-RADS 4 (n = 1591), and BI-RADS 5 (n = 597), as illustrated in Figure 1; BI-RADS 1 is not associated with annotated mass lesions. For the binary task, BI-RADS categories were grouped into benign/likely benign (BI-RADS 2–3) and suspicious/malignant (BI-RADS 4–5) to reflect a clinically relevant threshold for suspicious findings. The multi-class setting retained the original categories BI-RADS 2–5 provided by each dataset. All experiments were performed independently per dataset and subsequently analyzed under the unified Mass-Bench framework.
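The binary grouping can be expressed as a simple label mapping, sketched below with the Mass-Bench class counts reported above. The function names and the 0/1 encoding of the two groups are hypothetical.

```python
# Mass-Bench BI-RADS counts as reported in the text.
MASS_BENCH_COUNTS = {2: 300, 3: 1258, 4: 1591, 5: 597}

def to_binary(birads):
    """Map a BI-RADS grade to 0 (benign/likely benign) or 1 (suspicious/malignant)."""
    if birads in (2, 3):
        return 0
    if birads in (4, 5):
        return 1
    raise ValueError(f"BI-RADS {birads} is outside the evaluated range 2-5")

def binary_counts(counts):
    """Aggregate per-grade counts into the two binary groups."""
    grouped = {0: 0, 1: 0}
    for grade, n in counts.items():
        grouped[to_binary(grade)] += n
    return grouped

grouped = binary_counts(MASS_BENCH_COUNTS)
```

Under this grouping, the benign/likely benign group contains 300 + 1258 = 1558 ROIs and the suspicious/malignant group contains 1591 + 597 = 2188 ROIs, so the binary task is itself moderately imbalanced before any balancing is applied.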
BI-RADS classification results are summarized in Table 5. For the binary classification, the highest accuracy was achieved in DMID (0.93); however, the dataset has very few samples across all categories, which may either facilitate class separation or introduce variability in the metric. Although INBREAST achieved an accuracy of 0.83, the predominance of BI-RADS 2 and the low representation of other categories could simplify the discrimination between cases of lower and higher suspicion. CBIS-DDSM and VINDr-Mammo achieved moderate performance (0.72), and most of their samples fall within BI-RADS 3–4. In this scenario, the binary separation leaves the benign/likely benign group predominantly represented by BI-RADS 3, thereby limiting the actual difficulty of the problem. Despite these limitations, these findings suggest that binary ROI-based BI-RADS classification is feasible but remains sensitive to dataset-specific annotation consistency and class distribution. At the Mass-Bench level, binary BI-RADS classification achieves an accuracy of 0.90, indicating stable discrimination between low- and high-suspicion groups under a unified multi-dataset setting.
On the other hand, for multi-class BI-RADS classification greater variability was observed compared to the binary setting. INBREAST achieved the highest multi-class accuracy (0.86) using EfficientNet-B3 with Logistic Regression under 30% padding (B30), highlighting the importance of broader contextual inclusion for fine-grained stratification. DMID followed with 0.83 using DenseNet121 with Random Forest (B0), demonstrating strong separability even with minimal padding. In contrast, CBIS-DDSM and VINDr-Mammo exhibited lower multi-class performance (0.53 and 0.59, respectively), reflecting the increased difficulty of distinguishing adjacent BI-RADS categories under heterogeneous acquisition conditions. At the Mass-Bench level, multi-class BI-RADS classification achieves an accuracy of 0.84, demonstrating that integrating heterogeneous datasets under a unified framework supports competitive performance despite increased variability.
The results also indicate that deep embeddings (ENB3 and DN121) with minimal to moderate contextual inclusion (B0 and B10) can effectively differentiate cases in a binary classification task. However, in the multiclass case, broader contextual information is required (B30) because adjacent classes may share similarities. Overall, the analysis indicates that previously reported performance may be overestimated because of dataset-specific biases and class imbalance.

4. Discussion

This study provides a unified evaluation of deep learning–based mammographic mass detection and clinical characterization under a harmonized and density-aware protocol. By jointly examining detection performance, density-stratified behavior, ACR classification, BI-RADS feasibility, and exploratory clinical correlations, the results offer a structured perspective on the capabilities and current constraints of contemporary models across heterogeneous datasets.
In terms of mass detection, YOLO-based architectures have demonstrated competitive, stable performance for mammographic lesion localization. Table 6 shows the comparison with representative studies. Su et al. [53] proposed YOLO-LOGO, a hybrid transformer for mass localization and segmentation based on YOLOv5-L6, a variant that can process high-resolution images (full mammograms). Mass localization achieved an mAP of 0.65 on the CBIS-DDSM dataset and 0.61 on INbreast. In [54], several YOLO models were evaluated, including YOLOv5-Transformer trained on CBIS-DDSM. INbreast was used to show the difference in performance with and without transfer learning. Results showed that YOLOv5s achieved the best mass detection performance, with 0.85 precision and 0.72 recall.
Recent approaches have explored architectural modifications to improve lesion detection. In [71], an improved version of YOLOv10 was proposed for a two-stage segmentation model (DVF-YOLO-Seg). This improvement introduces varifocal loss and the DualConv modules to improve lesion detection. The model was tested for mass localization on the CBIS-DDSM dataset, achieving 0.79 and 0.81 in precision and recall, respectively. YOLOv5n was also reported in [72], achieving the best performance for mass localization on the CBIS-DDSM (mAP@50 0.50, mAP@50-95 0.20) and INbreast (mAP@50 0.68, mAP@50-95 0.31) datasets, compared with YOLOv8.
Other studies have investigated the effect of contextual modeling and multi-dataset training. Trang et al. [55] also addressed breast mass localization, proposing MANGA-YOLO, an architecture based on the MAMBA architecture and YOLOv11 that takes into account contextual information about the mass. The model was tested on three datasets: CBIS-DDSM, INbreast, and VinDr-Mammo. The best mass localization was achieved in INbreast (precision = 0.91, recall = 0.79), while it decreased significantly for VinDr-Mammo (precision = 0.68, recall = 0.70) and CBIS-DDSM (precision = 0.69, recall = 0.71). Also, a cross-dataset evaluation was performed, obtaining the best result by training on INbreast and testing on VinDr-Mammo and CBIS-DDSM, achieving 0.88 in precision and 0.36 in recall. Finally, Abdikenov et al. [56] also addressed mass localization under a similar approach of dataset combination (INbreast, CBIS-DDSM and VinDr-Mammo) and YOLOv12-L. Results showed similar behaviour, with the best individual performance on INbreast and decreasing performance on CBIS-DDSM and VinDr-Mammo. When the three datasets were combined and evaluated on YOLOv12-L and RTMDet-X, the results were similar to those of the individual CBIS-DDSM and VinDr-Mammo tests, possibly due to the influence of their majority classes.
Despite architectural improvements, the comparison across studies reveals that dataset composition remains a major limiting factor. Most publicly available mammography datasets exhibit strong imbalance across ACR density categories, which may bias the training process toward the dominant classes. This issue is particularly relevant because breast tumor localization becomes increasingly difficult in dense tissue (ACR 3–4), where lesion conspicuity is reduced due to the surrounding fibroglandular parenchyma. This behavior was also reflected in the density-stratified evaluation, where sensitivity and F1-score generally declined with increasing ACR category across datasets. Such results indicate that models trained under imbalanced density distributions may perform well under dataset-specific conditions but exhibit reduced robustness when evaluated under different density profiles. To mitigate this effect, the Mass-Bench evaluation protocol enforces density-aware balancing during training and testing. The integrated benchmark therefore enables evaluation under more representative density distributions. Within this framework, the evaluated YOLOv11 model achieved high sensitivity (recall = 0.99) while maintaining competitive localization accuracy (mAP@50 = 0.66). Although this constraint may lead to slightly lower numerical metrics, it reduces optimistic bias and provides a more realistic assessment of detection performance across heterogeneous breast density conditions.
Table 6. Recent studies on automatic breast mass localization in mammography.
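| Report | Year | Model | Dataset | ACR-Balanced | Precision | Recall | mAP@50 | mAP@50-95 |
|---|---|---|---|---|---|---|---|---|
| [53] | 2022 | YOLOv5-L6 | CBIS-DDSM | | - | - | 0.65 | |
| | | | INBREAST | | - | - | 0.61 | |
| [54] | 2024 | YOLOv5s | CBIS-DDSM | | 0.85 | 0.72 | 0.83 | - |
| [71] | 2025 | Improved YOLOv10 | CBIS-DDSM | | 0.84 | 0.86 | 0.85 | - |
| [72] | 2025 | YOLOv5n | CBIS-DDSM | | - | - | 0.50 | 0.20 |
| | | | INBREAST | | - | - | 0.68 | 0.31 |
| [55] | 2025 | MANGA-YOLO | CBIS-DDSM | | 0.69 | 0.71 | 0.66 | 0.27 |
| | | | INBREAST | | 0.91 | 0.79 | 0.88 | 0.56 |
| | | | VinDr-Mammo | | 0.68 | 0.70 | 0.69 | 0.34 |
| | | | VinDr-Mammo + CBIS-DDSM | | 0.88 | 0.36 | 0.36 | 0.07 |
| [56] | 2025 | YOLOv12-L | INBREAST | | 0.98 | 0.85 | 0.96 | - |
| | | | CBIS-DDSM | | 0.61 | 0.55 | 0.56 | - |
| | | | VinDr-Mammo | | 0.73 | 0.51 | 0.59 | - |
| | | | CBIS-DDSM + VinDr-Mammo + INBREAST | | 0.71 | 0.59 | 0.63 | - |
| | | RTMDet-X | CBIS-DDSM + VinDr-Mammo + INBREAST | | 0.73 | 0.65 | 0.68 | - |
| Ours | 2026 | YOLOv11 | Mass-Bench | | 0.71 | 0.99 | 0.66 | 0.33 |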
Most studies do not control for ACR density distribution, which may bias results. Mass-Bench achieves high recall (0.99) with competitive mAP. ACR-balanced sampling is marked as reported (✓) or not reported (✕).
In terms of ACR breast density classification, Table 7 summarizes the comparison with recent studies on public datasets. Across the literature, density estimation is addressed through various formulations, ranging from binary to full ACR grade classification. For instance, Mohamed et al. [73] considered two adjacent density categories and treated the problem as a binary classification task between them, thereby reducing its complexity. Lehman et al. [74] evaluated a large cohort dominated by intermediate classes (ACR 2 and 3), with underrepresentation of extreme categories (ACR 1 and 4). While strong performance was achieved (0.87 accuracy in binary and 0.77 in multi-class settings), class balancing was not addressed. Similarly, López-Almazán et al. [75] reported multi-class classification, documenting a severe class imbalance, particularly for ACR 4, and incorporating weighted loss functions to mitigate its effect. An accuracy of 0.85 was achieved. These findings highlight that class distribution plays a critical role in ACR classification performance and that explicit handling of imbalance can influence reported results. Rigaud et al. [76] reported both binary and multi-class classification results based on DL models, achieving accuracies of 0.88 and 0.82, respectively. Although the dataset was large, it was imbalanced, and no balancing strategy was applied.
A key contribution of this work lies in the comprehensive evaluation protocol, which incorporates both binary and multi-class formulations, as well as macro-averaged metrics that are largely absent in prior studies. While most existing works rely on accuracy or AUC, these metrics may be biased toward dominant classes in imbalanced datasets. In contrast, the use of macro F1-score enables a more balanced assessment of performance across all ACR categories, providing additional insight into class-wise behavior and model robustness. Although previous studies report higher absolute performance, these results are often obtained under simplified or controlled experimental conditions. The proposed Mass-Bench framework introduces a more realistic and challenging evaluation scenario for robust ACR breast density classification. These findings suggest that future research should prioritize standardized multi-dataset benchmarks and comprehensive evaluation protocols that better reflect real-world clinical variability.
In terms of BI-RADS-related classification, prior work has predominantly focused on binary formulations, typically distinguishing between benign and malignant findings rather than classifying explicit BI-RADS grades [19,61,78]. While these studies demonstrate strong performance, they address a coarser discrimination task. Consequently, their reported metrics are not directly comparable to BI-RADS-based categorization, particularly in multi-class settings where finer-grained distinctions are required.
To contextualize these results, Table 8 summarizes studies addressing BI-RADS classification from mammography. Baccouche et al. [79] reported multiclass accuracies of 0.85 and 0.99 for the CBIS-DDSM and INBREAST datasets, respectively, using a stacked ensemble of residual networks for BI-RADS categorization. However, their approach relies on a fully integrated CAD pipeline including prior mass detection and segmentation, where classification is performed on automatically detected and segmented lesion ROIs rather than standardized regions surrounded by parenchymal tissue.
Similarly, Li et al. [80] introduced a framework for multi-class BI-RADS prediction operating on whole mammograms. This framework incorporates BI-RADS assessments for both mass detection and breast density estimation. It achieved area under the curve (AUC) values of 0.92 on the CBIS-DDSM and CDD-CESM datasets and 0.97 on the INBREAST dataset. The closest work addressing multi-class BI-RADS classification on a balanced dataset is reported in [35], achieving 0.94 multiclass accuracy, 0.95 mean F1-score, and 0.97 AUC across BI-RADS categories 0–5 using a large private dataset.
Table 8. Audited representative studies explicitly addressing BI-RADS classification.
| Report | Year | Model | Dataset | Task | Reported values (source column order: Acc (Bin), Acc (Multi), F1 (Bin), F1 (Multi), AUC; empty cells omitted) |
|---|---|---|---|---|---|
| [79] | 2022 | Stacked ensemble of ResNet models | CBIS-DDSM | BI-RADS (2–6) on detected/segmented masses | 0.85, 0.94 |
| [79] | 2022 | Stacked ensemble of ResNet models | INBREAST | BI-RADS (2–6) on detected/segmented masses | 0.99, 1.00 |
| [35] | 2022 | Deep neural network | Clinical cohort | BI-RADS (0, 1, 2, 3, 4A, 4B, 4C, 5) | 0.94, 0.95, 0.97 |
| [81] | 2022 | Deep CNN | Clinical dataset | BI-RADS (1 vs. 2/3 vs. 4/5) on manually cropped ROIs | 0.90 |
| [82] | 2025 | Deep learning model | Multicenter clinical cohort | BI-RADS (3 vs. 4A) | 0.80, 0.74 |
| [80] | 2025 | Explainable multi-task CAD | CBIS-DDSM | BI-RADS assessment (B2–B5) on whole mammograms | 0.77, 0.92 |
| [80] | 2025 | Explainable multi-task CAD | CDD-CESM | BI-RADS assessment (B1–B5) on whole mammograms | 0.78, 0.92 |
| [80] | 2025 | Explainable multi-task CAD | INBREAST | BI-RADS assessment (B1–B5) on whole mammograms | 0.83, 0.97 |
| [83] | 2024 | InceptionResNetV2 | RSNA + VinDr-Mammo | BI-RADS (0 vs. 1–2 vs. 4–5) | |
| Ours | 2026 | ML + handcrafted/deep features | Mass-Bench | BI-RADS binary and multiclass (2–5) | 0.90, 0.84, 0.90, 0.82 |
Prior works report strong BI-RADS classification performance. Our proposed Mass-Bench framework achieves competitive results (Acc = 0.90 binary, 0.84 multiclass; F1 = 0.90 binary, 0.82 multiclass) with a balanced benchmark.
Together, these works underscore the heterogeneity of BI-RADS classification settings and the lack of standardized evaluation protocols across studies. A key observation emerging from Table 9 is that even when studies report using the same public datasets, such as CBIS-DDSM and INBREAST, class distributions differ substantially. Moreover, several studies do not explicitly report class distributions or splitting strategies, which hinders reproducibility and comparability.
In contrast, the proposed Mass-Bench framework was designed to address these limitations by enforcing a unified and reproducible evaluation setting. The global benchmark preserves the natural heterogeneity of the integrated datasets (B2 = 300, B3 = 1258, B4 = 1591, B5 = 597), while a balanced subset was constructed by matching all classes to the size of the limiting category (300 samples per class). Unlike previous studies that reported high performance in single-dataset or controlled environments, our findings in this standardized context clearly distinguish between binary and multi-class BI-RADS classification. In the binary task, the best configuration achieved 0.90 accuracy and 0.90 macro F1-score. When extending to multi-class classification, performance reached 0.84 accuracy and 0.82 macro F1-score. Multi-class BI-RADS classification involves finer-grained distinctions; consequently, misclassification may reflect class imbalance rather than purely model limitations. This also indicates that performance measured in a binary context might overlook important limitations in clinically relevant multi-class classification.
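The construction of the balanced subset can be illustrated as per-class random undersampling to the size of the limiting category. The following is a minimal sketch under stated assumptions: each sample is a (sample_id, label) pair, and the function name `balance_by_undersampling`, the seed, and the synthetic identifiers are hypothetical and not taken from the Mass-Bench code. Only the class counts come from the text above.

```python
import random
from collections import defaultdict


def balance_by_undersampling(samples, seed=0):
    """Randomly undersample every class to the size of the smallest class.

    `samples` is a list of (sample_id, label) pairs. Hypothetical helper
    for illustration only; not the authors' implementation.
    """
    by_label = defaultdict(list)
    for sample_id, label in samples:
        by_label[label].append(sample_id)

    # The limiting (smallest) category fixes the per-class quota.
    quota = min(len(ids) for ids in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        for sample_id in rng.sample(by_label[label], quota):
            balanced.append((sample_id, label))
    return balanced


# Class counts reported for the global benchmark; identifiers are synthetic.
counts = {"B2": 300, "B3": 1258, "B4": 1591, "B5": 597}
samples = [(f"{label}_{i}", label) for label, n in counts.items() for i in range(n)]

balanced = balance_by_undersampling(samples)
per_class = {label: sum(1 for _, lab in balanced if lab == label) for label in counts}
print(per_class)  # each class is reduced to 300 samples
```

This sketch samples individual items independently. In a leakage-controlled setting, the sampling would presumably also need to respect patient-level grouping, which is not handled here.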
Taken together, the task definition and the Mass-Bench evaluation highlight the gap between binary performance and clinically meaningful stratification. As the variability observed across datasets demonstrates, standardized evaluation protocols and balanced datasets are essential.

Limitations

Despite the promising results obtained with Mass-Bench, several limitations should be acknowledged. First, all images were resized to a fixed spatial resolution (768 × 768) to ensure computational consistency across models and datasets. While this approach enables standardized comparison, it may reduce fine-grained anatomical detail in very high-resolution mammograms, particularly for small lesions.
Additionally, although experiments were conducted under controlled and reproducible conditions, repeated statistical runs and uncertainty estimation were not exhaustively explored. Future work should incorporate repeated experiments, confidence interval analysis, and larger multi-institutional cohorts to further validate the robustness and generalizability of the proposed framework.
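As an illustration of the confidence interval analysis mentioned above, one common option is a percentile bootstrap over test cases. The sketch below is an assumption about how such an analysis could be set up, not a description of the study's protocol; the function name `bootstrap_accuracy_ci`, the number of resamples, and the synthetic predictions are all hypothetical.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    Returns (accuracy, lower, upper). Illustrative helper only.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    rng = random.Random(seed)

    stats = []
    for _ in range(n_boot):
        # Resample test cases with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()

    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(correct) / n, lower, upper


# Synthetic example: 90 correct predictions out of 100 (illustrative only).
y_true = [1] * 100
y_pred = [1] * 90 + [0] * 10
acc, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy={acc:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The same resampling loop could be applied to other metrics, such as F1-score, by recomputing the metric on each resample.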
Finally, the present study focused primarily on representative YOLO-based object detectors and conventional deep feature extractors. Future research may explore transformer-based architectures, multimodal learning approaches, and clinically integrated CAD pipelines for more comprehensive mammographic analysis.

5. Conclusions

This work introduces Mass-Bench, a unified benchmark for mammographic mass detection and clinically relevant characterization tasks, evaluated under standardized and reproducible conditions. By integrating heterogeneous public datasets with harmonized preprocessing and annotation protocols, Mass-Bench enables a more clinically representative evaluation scenario compared to isolated single-dataset studies. For mass localization, YOLO-based detectors demonstrated robust performance under leakage-controlled conditions, confirming their ability to generalize across diverse acquisition settings. For clinical characterization, ACR density classification showed stable performance, particularly in binary settings, whereas BI-RADS classification remained more challenging, especially in the multi-class scenario, reflecting its inherently complex and partially subjective nature.
Overall, these findings highlight the critical impact of dataset composition, class balance, and evaluation protocols on reported performance. In this context, Mass-Bench provides a more rigorous and clinically relevant framework for evaluating mammographic analysis tasks. Future work will focus on expanding underrepresented categories to improve multi-class evaluation and further enhance benchmark reliability.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math14122080/s1, Figure S1: ROC curves with 95% confidence intervals (CI) and optimal cutoffs for YOLOv5, YOLOv8, and YOLOv11 on the INBREAST dataset; Figure S2: ROC curves with 95% confidence intervals (CI) and optimal cutoffs for YOLOv5, YOLOv8, and YOLOv11 on the CBIS-DDSM dataset; Figure S3: ROC curves with 95% confidence intervals (CI) and optimal cutoffs for YOLOv5, YOLOv8, and YOLOv11 on the DMID dataset; Figure S4: ROC curves with 95% confidence intervals (CI) and optimal cutoffs for YOLOv5, YOLOv8, and YOLOv11 on the VINDr-Mammo dataset.

Author Contributions

Conceptualization, H.E.Z.-R., H.P.-B. and G.C.L.-A.; Methodology, H.E.Z.-R.; Software, H.E.Z.-R.; Validation, H.E.Z.-R.; Formal analysis, H.P.-B. and G.C.L.-A.; Investigation, H.E.Z.-R., H.P.-B. and G.C.L.-A.; Resources, G.C.L.-A.; Data curation, H.E.Z.-R.; Writing—original draft, H.E.Z.-R., H.P.-B. and G.C.L.-A.; Writing—review & editing, H.E.Z.-R., H.P.-B. and G.C.L.-A.; Supervision, H.P.-B. and G.C.L.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are publicly available from the repositories of their original publications. CBIS-DDSM is available at The Cancer Imaging Archive (TCIA): https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY. INBREAST is available upon request from the authors of the original work [45]. VinDr-Mammo is publicly available at PhysioNet: https://physionet.org/content/vindr-mammo/1.0.0/, accessed on 1 September 2024. DMID is available via Figshare: https://doi.org/10.6084/m9.figshare.24522883.v2.

Acknowledgments

The author H.E.Z.-R. thanks the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI) for the support through the scholarship CVU 743389. The authors acknowledge the use of OpenAI’s ChatGPT (GPT-5, OpenAI, San Francisco, CA, USA) as a language model tool to assist in text refinement, LaTeX formatting, and table generation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2021, 71, 209–249.
  2. DeSantis, C.E.; Ma, J.; Gaudet, M.M.; Newman, L.A.; Miller, K.D.; Sauer, A.G.; Jemal, A.; Siegel, R.L. Breast cancer statistics, 2019. CA Cancer J. Clin. 2019, 69, 438–451.
  3. Ren, W.; Chen, M.; Qiao, Y.; Zhao, F. Global guidelines for breast cancer screening: A systematic review. Breast 2022, 64, 85–99.
  4. Samala, R.K.; Chan, H.P.; Hadjiiski, L.; Helvie, M.A.; Wei, J.; Cha, K. Mass detection in digital breast tomosynthesis: Deep convolutional neural network with transfer learning from mammography. Med. Phys. 2016, 43, 6654–6666.
  5. Surendiran, B.; Ramanathan, P.; Vadivel, A. Effect of BIRADS shape descriptors on breast cancer analysis. Int. J. Med. Eng. Inform. 2015, 7, 65–79.
  6. American College of Radiology. ACR BI-RADS Atlas: Breast Imaging Reporting and Data System, 5th ed.; American College of Radiology: Reston, VA, USA, 2013.
  7. Couture, H.D.; Williams, L.A.; Geradts, J.; Nyante, S.J.; Butler, E.N.; Marron, J.S.; Perou, C.M.; Troester, M.A.; Niethammer, M. Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype. npj Breast Cancer 2018, 4, 30.
  8. Li, H.; Meng, X.; Wang, T.; Tang, Y.; Yin, Y. Breast masses in mammography classification with local contour features. BioMed. Eng. OnLine 2017, 16, 44.
  9. Bodewes, F.T.; van Asselt, A.A.; Dorrius, M.D.; Greuter, M.J.; de Bock, G.H. Mammographic breast density and the risk of breast cancer: A systematic review and meta-analysis. Breast 2022, 66, 62–68.
  10. Mann, R.M.; Athanasiou, A.; Baltzer, P.A.; Camps-Herrero, J.; Clauser, P.; Fallenberg, E.M.; Forrai, G.; Fuchsjäger, M.H.; Helbich, T.H.; Killburn-Toppin, F.; et al. Breast cancer screening in women with extremely dense breasts recommendations of the European Society of Breast Imaging (EUSOBI). Eur. Radiol. 2022, 32, 4036–4045.
  11. Lee, R.S.; Gimenez, F.; Hoogi, A.; Miyake, K.K.; Gorovoy, M.; Rubin, D.L. A curated mammography data set for use in computer-aided detection and diagnosis research. Sci. Data 2017, 4, 170177.
  12. Crivellé, M.S.I. An approach to breast density. Rev. Senol. Patol. Mamar. 2014, 27, 138–142.
  13. Dhungel, N.; Carneiro, G.; Bradley, A.P. Automated Mass Detection in Mammograms Using Cascaded Deep Learning and Random Forests. In Proceedings of the 2015 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2015; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2015.
  14. Agarwal, R.; Díaz, O.; Yap, M.H.; Lladó, X.; Martí, R. Deep learning for mass detection in Full Field Digital Mammograms. Comput. Biol. Med. 2020, 121.
  15. Hassan, N.M.; Hamad, S.; Mahar, K. Mammogram breast cancer CAD systems for mass detection and classification: A review. Multimed. Tools Appl. 2022, 81, 20043–20075.
  16. Ballard, D.; Lecun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition; Technical Report; MIT Press: Cambridge, MA, USA, 1989.
  17. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Technical Report; IEEE: New York, NY, USA, 2015.
  18. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision; Technical Report; IEEE: New York, NY, USA, 2015.
  19. Shen, L.; Margolies, L.R.; Rothstein, J.H.; Fluder, E.; McBride, R.; Sieh, W. Deep learning to improve breast cancer detection on screening mammography. Sci. Rep. 2019, 9, 12495.
  20. Medeiros, A.; Ohata, E.F.; Silva, F.H.; Rego, P.A.; Filho, P.P.R. An approach to bi-rads uncertainty levels classification via deep learning with transfer learning technique. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS); Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2020; pp. 603–608.
  21. Kolb, T.M.; Lichy, J.; Newhouse, J.H. Comparison of the Performance of Screening Mammography, Physical Examination, and Breast US and Evaluation of Factors that Influence Them: An Analysis of 27,825 Patient Evaluations. Radiology 2002, 225, 165–175.
  22. Mandelson, M.T. Breast Density as a Predictor of Mammographic Detection: Comparison of Interval- and Screen-Detected Cancers. J. Natl. Cancer Inst. 2000, 92, 1081–1087.
  23. Centers for Disease Control and Prevention. About Dense Breasts. 2024. Available online: https://www.cdc.gov/breast-cancer/about/dense-breasts.html (accessed on 15 March 2026).
  24. Nazari, S.S.; Mukherjee, P. An overview of mammographic density and its association with breast cancer. Breast Cancer 2018, 25, 259–267.
  25. Campanini, R.; Dongiovanni, D.; Iampieri, E.; Lanconelli, N.; Masotti, M.; Palermo, G.; Riccardi, A.; Roffilli, M. A novel featureless approach to mass detection in digital mammograms based on support vector machines. Phys. Med. Biol. 2004, 49, 961–975.
  26. Heath, M.; Bowyer, K.; Kopans, D.; Moore, R.; Kegelmeyer, P. The Digital Database for Screening Mammography; Technical Report; Springer: Dordrecht, The Netherlands, 1998.
  27. Xiong, S.; Lu, J. Mass detection in digital mammograms using twin support vector machine-based CAD system. In Proceedings of the 2009 WASE International Conference on Information Engineering, ICIE 2009; IEEE: New York, NY, USA, 2009; Volume 1, pp. 240–243.
  28. Ertosun, M.G.; Rubin, D.L. Probabilistic Visual Search for Masses within Mammography Images Using Deep Learning. In Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE: New York, NY, USA, 2015; pp. 1310–1315.
  29. Al-masni, M.A.; Al-antari, M.A.; Park, J.M.; Gi, G.; Kim, T.Y.; Rivera, P.; Valarezo, E.; Choi, M.T.; Han, S.M.; Kim, T.S. Simultaneous detection and classification of breast masses in digital mammograms via a deep learning YOLO-based CAD system. Comput. Methods Programs Biomed. 2018, 157, 85–94.
  30. Al-antari, M.A.; Al-masni, M.A.; Choi, M.T.; Han, S.M.; Kim, T.S. A fully integrated computer-aided diagnosis system for digital X-ray mammograms via deep learning detection, segmentation, and classification. Int. J. Med. Inform. 2018, 117, 44–54.
  31. Al-antari, M.A.; Han, S.M.; Kim, T.S. Evaluation of deep learning detection and classification towards computer-aided diagnosis of breast lesions in digital X-ray mammograms. Comput. Methods Programs Biomed. 2020, 196, 105584.
  32. Rahman, M.M.; Jahangir, M.Z.B.; Rahman, A.; Akter, M.; Nasim, M.A.A.; Gupta, K.D.; George, R. Breast Cancer Detection and Localizing the Mass Area Using Deep Learning. Big Data Cogn. Comput. 2024, 8, 80.
  33. Baccouche, A.; Garcia-Zapirain, B.; Olea, C.C.; Elmaghraby, A.S. Breast lesions detection and classification via YOLO-based fusion models. Comput. Mater. Contin. 2021, 69, 1407–1425.
  34. Keller, B.M.; Nathan, D.L.; Wang, Y.; Zheng, Y.; Gee, J.C.; Conant, E.F.; Kontos, D. Estimation of breast percent density in raw and processed full field digital mammography images via adaptive fuzzy c-means clustering and support vector machine segmentation. Med. Phys. 2012, 39, 4903–4917.
  35. Tsai, K.J.; Chou, M.C.; Li, H.M.; Liu, S.T.; Hsu, J.H.; Yeh, W.C.; Hung, C.M.; Yeh, C.Y.; Hwang, S.H. A High-Performance Deep Neural Network Model for BI-RADS Classification of Screening Mammography. Sensors 2022, 22, 1160.
  36. Nguyen, H.T.X.; Tran, S.B.; Nguyen, D.B.; Pham, H.H.; Nguyen, H.Q. A Novel Multi-View Deep Learning Approach for BI-RADS and Density Assessment of Mammograms. arXiv 2022.
  37. Goceri, E. Medical image data augmentation: Techniques, comparisons and interpretations. Artif. Intell. Rev. 2023, 56, 12561–12605.
  38. Islam, T.; Hafiz, M.S.; Jim, J.R.; Kabir, M.M.; Mridha, M. A systematic review of deep learning data augmentation in medical imaging: Recent advances and future research directions. Healthc. Anal. 2024, 5, 100340.
  39. Velarde, O.M.; Lin, C.; Eskreis-Winkler, S.; Parra, L.C. Robustness of deep networks for mammography: Replication across public datasets. J. Imaging Inform. Med. 2024, 37, 536–546.
  40. Zafari, Y.; Pan, H.; Durak, G.; Bagci, U.; Rashed, E.A.; Mabrok, M. MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization. arXiv 2025, arXiv:2511.02400.
  41. Pan, H.; Durak, G.; Aktas, H.E.; Bejar, A.M.; Tutun, B.; Uysal, E.; Bulbul, E.; Dogan, M.F.; Erok, B.; Yildirim, B.A.; et al. LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol. arXiv 2026, arXiv:2603.14644.
  42. Añez, D.; Conti, G.; Uriarte, J.J.; Serrano-Olmedo, J.J.; Martínez-Murillo, R.; Casanova-Carvajal, O. Artificial Intelligence Pipeline for Mammography-Based Breast Cancer Analysis. Medicina 2025, 61, 2237.
  43. Xu, Z.; Li, J.; Yao, Q.; Li, H.; Zhao, M.; Zhou, S.K. Addressing fairness issues in deep learning-based medical image analysis: A systematic review. npj Digit. Med. 2024, 7, 286.
  44. Chegini, M.; Mahloojifar, A. Uncertainty-aware deep learning-based CAD system for breast cancer classification using ultrasound and mammography images. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2024, 12, 2297983.
  45. Moreira, I.C.; Amaral, I.; Domingues, I.; Cardoso, A.; Cardoso, M.J.; Cardoso, J.S. INbreast: Toward a Full-field Digital Mammographic Database. Acad. Radiol. 2012, 19, 236–248.
  46. Oza, P.; Oza, R.; Oza, U.; Sharma, P.; Patel, S.; Kumar, P.O. Digital mammography Dataset for Breast Cancer Diagnosis Research (DMID). Biomed. Eng. Lett. 2023, 14, 317–330.
  47. Nguyen, H.T.; Nguyen, H.Q.; Pham, H.H.; Lam, K.; Le, L.T.; Dao, M.; Vu, V. VinDr-Mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital mammography. Sci. Data 2023, 10, 277.
  48. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019; pp. 658–666.
  49. Zuiderveld, K. Contrast Limited Adaptive Histogram Equalization. In Graphics Gems IV; Heckbert, P.S., Ed.; Academic Press: Cambridge, MA, USA, 1994; pp. 474–485.
  50. Suradi, S.H.; Abdullah, K.A.; Mat Isa, N.A. Improvement of image enhancement for mammogram images using fuzzy anisotropic diffusion histogram equalisation contrast adaptive limited (fadhecal). Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2022, 10, 67–75.
  51. Oyelade, O.N.; Ezugwu, A.E. A novel wavelet decomposition and transformation convolutional neural network with data augmentation for breast cancer detection using digital mammogram. Sci. Rep. 2022, 12, 5913.
  52. Fusco, R.; Granata, V.; Vallone, P.; Petrosino, T.; Iasevoli, M.D.; Raso, M.M.; Pupo, D.; Trovato, P.; Simonetti, I.; Pariante, P.; et al. Engineering the Image Representation for Deep Learning in Contrast-Enhanced Mammography: A Systematic Analysis of Preprocessing and Anatomical Masking. Bioengineering 2026, 13, 322.
  53. Su, Y.; Liu, Q.; Xie, W.; Hu, P. YOLO-LOGO: A transformer-based YOLO segmentation model for breast mass detection and segmentation in digital mammograms. Comput. Methods Programs Biomed. 2022, 221, 106903.
  54. Prinzi, F.; Insalaco, M.; Orlando, A.; Gaglio, S.; Vitabile, S. A yolo-based model for breast cancer detection in mammograms. Cogn. Comput. 2024, 16, 107–120.
  55. Trang, K.; Ting, F.F.; Vuong, B.Q.; Ting, C.M. MANGA-YOLO: A Mamba-inspired YOLO model with group attention for breast mass detection in mammograms. Comput. Biol. Med. 2025, 199, 111339.
  56. Abdikenov, B.; Rakishev, D.; Orazayev, Y.; Zhaksylyk, T. Enhancing breast lesion detection in mammograms via transfer learning. J. Imaging 2025, 11, 314.
  57. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 779–788.
  58. Hussain, M. YOLOv1 to v8: Unveiling Each Variant-A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833.
  59. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO. Version 8.0.0. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 June 2026).
  60. Dhungel, N.; Carneiro, G.; Bradley, A.P. Deep structured learning for mass segmentation from mammograms. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP); IEEE: New York, NY, USA, 2015; pp. 2950–2954.
  61. Ribli, D.; Horváth, A.; Unger, Z.; Pollner, P.; Csabai, I. Detecting and classifying lesions in mammograms with deep learning. Sci. Rep. 2018, 8, 4165.
  62. Haralick, R.M.; Shanmugam, K.; Dinstein, I. Textural Features for Image Classification. IEEE Trans. Syst. Man Cybern. 1973, SMC-3, 610–621.
  63. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
  64. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR); ICLR: Appleton, WI, USA, 2015.
  65. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 770–778.
  66. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2017; pp. 4700–4708.
  67. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML); PMLR: New York, NY, USA, 2019; pp. 6105–6114.
  68. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
  69. Youden, W.J. Index for Rating Diagnostic Tests. Cancer 1950, 3, 32–35.
  70. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  71. Abudukelimu, H.; Gao, Y.; Abulizi, A.; Musideke, M.; Wu, S.; Wang, M.; Aizizi, M.; Yehaiya, G.; Abudukelimu, M. DVF-YOLO-Seg: A two-stage breast mass segmentation model with enhanced feature extraction and small lesion detection. Digit. Health 2025, 11, 20552076251374192.
  72. Manolakis, D.; Bizopoulos, P.; Lalas, A.; Votis, K. A two-stage lightweight deep learning framework for mass detection and segmentation in mammograms using YOLOv5 and depthwise SegNet. J. Imaging Inform. Med. 2025, 38, 3852–3867.
  73. Mohamed, A.A.; Berg, W.A.; Peng, H.; Luo, Y.; Jankowitz, R.C.; Wu, S. A deep learning method for classifying mammographic breast density categories. Med. Phys. 2018, 45, 314–321.
  74. Lehman, C.D.; Yala, A.; Schuster, T.; Dontchos, B.; Bahl, M.; Swanson, K.; Barzilay, R. Mammographic Breast Density Assessment Using Deep Learning: Clinical Implementation. Radiology 2019, 290, 52–58.
  75. Lopez-Almazan, H.; Pérez-Benito, F.J.; Larroza, A.; Perez-Cortes, J.C.; Pollan, M.; Perez-Gomez, B.; Trejo, D.S.; Casals, M.; Llobet, R. A deep learning framework to classify breast density with noisy labels regularization. Comput. Methods Programs Biomed. 2022, 221, 106885.
  76. Rigaud, B.; Weaver, O.O.; Dennison, J.B.; Awais, M.; Anderson, B.M.; Chiang, T.Y.D.; Yang, W.T.; Leung, J.W.T.; Hanash, S.M.; Brock, K.K. Deep Learning Models for Automated Assessment of Breast Density Using Multiple Mammographic Image Types. Cancers 2022, 14, 5003.
  77. Busaleh, M.; Hussain, M.J.; Aboalsamh, H.A.; e Amin, F.; Al Sultan, S.A. TwoViewDensityNet: Two-View Mammographic Breast Density Classification Based on Deep Convolutional Neural Network. Mathematics 2022, 10, 4610.
  78. Ragab, D.A.; Sharkas, M.; Marshall, S.; Ren, J. Breast cancer detection using deep convolutional neural networks. Biomed. Signal Process. Control 2021, 65, 102280.
  79. Baccouche, A.; Garcia-Zapirain, B.; Elmaghraby, A.S. An integrated framework for breast mass classification and diagnosis using stacked ensemble of residual neural networks. Sci. Rep. 2022, 12, 12259.
  80. Li, P.; Zhong, J.; Chen, H.; Hong, J.; Li, H.; Li, X.; Shi, P. An explainable and comprehensive BI-RADS-assisted diagnosis pipeline for mammograms. Phys. Medica 2025, 132, 104949.
  81. Sabani, A.; Landsmann, A.; Hejduk, P.; Schmidt, C.; Marcon, M.; Borkowski, K.; Rossi, C.; Ciritsis, A.; Boss, A. BI-RADS-Based Classification of Mammographic Soft Tissue Opacities Using a Deep Convolutional Neural Network. Diagnostics 2022, 12, 1564.
  82. Lin, X.; Liao, T.; Yang, Y.; Ouyang, R.; Zhou, Y.; Lai, X.; Ma, J. Value of deep learning model for predicting Breast Imaging Reporting and Data System 3 and 4A lesions on mammography. Quant. Imaging Med. Surg. 2025, 15, 4047–4058.
  83. Tekin, A.; Toktay, B.; Günay, A.C.; Yazgan, H.; İnan, N.G.; Kocadağlı, O. BI-RADS classification in mammography using deep learning. In Güncel Ekonometri ve İstatistiksel Uygulamalar ile Akademik Çalışmalar; Özgür Publications: Istanbul, Turkey, 2024.
Figure 1. (a,b) ACR breast density and (c,d) BI-RADS grades distributions across the individual public datasets and the integrated Mass-Bench.
Figure 2. Overview of the Mass-Bench experimental framework for density-aware mammographic analysis. Pre-processing includes CLAHE-based contrast enhancement and resizing to 768 × 768. Lesion localization (red and green frames) is performed using YOLO-based models (YOLOv5, YOLOv8, YOLOv11), followed by lesion-centered ROI extraction with progressive contextual padding (0%, 10%, 20%, 30%). Feature extraction combines handcrafted descriptors (first-order statistics, GLCM, LBP) and deep embeddings (VGG19, ResNet50, DenseNet121, EfficientNet-B3). Classification is conducted using classical ML models to evaluate ACR density and BI-RADS categories under binary and multi-class settings.
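The lesion-centered ROI extraction with contextual padding can be illustrated with a short sketch. It assumes that the padding percentage is applied to the box width and height on each side and that the result is clipped to the 768 × 768 image; the caption does not specify either convention, so both are assumptions, and `padded_roi` and the example box are hypothetical.

```python
def padded_roi(box, pad_frac, image_size=(768, 768)):
    """Expand a bounding box by pad_frac of its size on each side.

    box is (x_min, y_min, x_max, y_max) in pixels and image_size is
    (width, height). The per-side padding convention is an assumption.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    dx = pad_frac * (x_max - x_min)
    dy = pad_frac * (y_max - y_min)
    # Clip to the image so that the crop never leaves the frame.
    return (
        max(0, round(x_min - dx)),
        max(0, round(y_min - dy)),
        min(width, round(x_max + dx)),
        min(height, round(y_max + dy)),
    )


box = (300, 350, 400, 450)  # hypothetical detected lesion box
for pad in (0.0, 0.1, 0.2, 0.3):
    print(pad, padded_roi(box, pad))
```

With 0% padding the ROI equals the detected box, and larger fractions add progressively more surrounding parenchymal context.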
Figure 3. KL divergence values of each dataset with respect to the expected clinical ACR distribution (10% A, 40% B, 40% C, 10% D). Lower values indicate distributions more closely aligned with the clinical reference, while higher values reflect increased deviation and class imbalance. As shown, Mass-Bench (0.0404) and DMID (0.0243) exhibit the lowest divergence, indicating a more representative distribution, whereas datasets such as VINDr-Mammo (0.5697) and INBREAST (0.4540) present larger deviations, reflecting skewed class distributions.
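As a rough illustration of the quantity plotted in Figure 3, the sketch below computes the KL divergence of an observed ACR distribution from the clinical reference (10% A, 40% B, 40% C, 10% D). The direction KL(observed || reference), the natural logarithm, and the example counts are assumptions made for illustration; the counts are hypothetical and do not correspond to any dataset in the figure.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i <= 0:
                raise ValueError("Q must be positive wherever P is positive")
            total += p_i * math.log(p_i / q_i)
    return total


def normalize(counts):
    """Turn class counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


reference = [0.10, 0.40, 0.40, 0.10]    # expected clinical ACR A-D shares
hypothetical_counts = [50, 150, 250, 50]  # illustrative counts only
observed = normalize(hypothetical_counts)

print(round(kl_divergence(observed, reference), 4))
```

A value of zero means the observed distribution matches the reference exactly, and larger values indicate stronger deviation.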
Table 2. Performance of YOLOv8 stratified by ACR density level (1–4) across datasets in samples containing masses. All reported metrics correspond to the test sets.
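Dataset | ACR | # Images (Train/Aug/Val/Test) | Accuracy | Precision | Sensitivity | F1-Score
CBIS | 1 | 236/1180/67/34 | 0.640 | 0.786 | 0.775 | 0.701
CBIS | 2 | 530/2650/151/76 | 0.739 | 0.766 | 0.850 | 0.791
CBIS | 3 | 314/1570/90/45 | 0.727 | 0.914 | 0.700 | 0.677
CBIS | 4 | 107/535/31/15 | 0.630 | 0.895 | 0.680 | 0.654
INBREAST | 1 | 30/150/8/4 | 0.667 | 0.833 | 0.769 | 0.714
INBREAST | 2 | 26/130/8/4 | 0.941 | 0.999 | 0.941 | 0.941
INBREAST | 3 | 15/75/4/2 | 0.624 | 0.615 | 0.889 | 0.696
INBREAST | 4 | 5/25/1/1 | 0.999 | 0.999 | 0.999 | 0.999
VINDr-Mammo | 1 | 3/15/1/1 | 0.769 | 0.769 | 0.999 | 0.870
VINDr-Mammo | 2 | 102/510/29/14 | 0.636 | 0.913 | 0.913 | 0.750
VINDr-Mammo | 3 | 690/3450/197/98 | 0.533 | 0.696 | 0.696 | 0.604
VINDr-Mammo | 4 | 49/245/14/7 | 0.538 | 0.824 | 0.609 | 0.571
DMID | 1 | 51/255/15/7 | 0.935 | 0.700 | 0.720 | 0.765
DMID | 2 | 124/620/36/18 | 0.929 | 0.780 | 0.757 | 0.778
DMID | 3 | 128/640/36/18 | 0.870 | 0.765 | 0.695 | 0.724
DMID | 4 | 25/125/7/4 | 0.899 | 0.788 | 0.674 | 0.702
Mass-Bench | 1 | 168/840/48/24 | 0.630 | 0.999 | 0.630 | 0.773
Mass-Bench | 2 | 168/840/48/24 | 0.766 | 0.999 | 0.764 | 0.866
Mass-Bench | 3 | 168/840/48/24 | 0.764 | 0.999 | 0.764 | 0.866
Mass-Bench | 4 | 168/840/48/24 | 0.716 | 0.999 | 0.716 | 0.835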
YOLOv8 shows consistent performance across ACR levels, with best results in intermediate categories (ACR 2–3). CBIS-DDSM peaks at ACR 2 (Accuracy = 0.739, F1 = 0.791), INBREAST at ACR 4 (0.999), and DMID at ACR 1 (0.935). The integrated Mass-Bench achieves its best balance in ACR 2–3 (F1 = 0.866), suggesting improved generalization in mid-density cases.
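A density-stratified evaluation of this kind can be sketched by pooling detection outcomes per ACR level before computing the metrics. The example below is a minimal illustration that assumes per-image counts of true positives, false positives, and false negatives are already available (for example after IoU matching, which is not shown); the record format, the function name, and the toy records are hypothetical and not the authors' evaluation code.

```python
from collections import defaultdict


def stratified_detection_metrics(records):
    """Compute precision, sensitivity and F1 per ACR density level.

    records is an iterable of (acr_level, tp, fp, fn) tuples, one per image.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for acr, tp, fp, fn in records:
        totals[acr][0] += tp
        totals[acr][1] += fp
        totals[acr][2] += fn

    results = {}
    for acr in sorted(totals):
        tp, fp, fn = totals[acr]
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results[acr] = {"precision": precision, "sensitivity": sensitivity, "f1": f1}
    return results


# Toy records (ACR level, TP, FP, FN); values are illustrative only.
records = [(1, 1, 0, 0), (1, 1, 1, 0), (3, 1, 0, 1), (3, 0, 1, 1)]
metrics = stratified_detection_metrics(records)
for acr, m in metrics.items():
    print(acr, {name: round(value, 3) for name, value in m.items()})
```

Reporting each stratum separately keeps a poorly performing density level from being averaged away by the more frequent ones.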
Table 3. Performance of YOLO models on the combined Mass-Bench dataset. Reported values correspond to mean ± standard deviation (SD) across repeated experimental runs. All reported metrics correspond to the test sets.
Model | Accuracy | Precision | Recall | F1-Score | mAP@50 | mAP@50-95
YOLOv5 | 0.709 ± 0.022 | 0.709 ± 0.019 | 0.990 ± 0.004 | 0.829 ± 0.016 | 0.598 ± 0.021 | 0.287 ± 0.013
YOLOv8 | 0.717 ± 0.018 | 0.714 ± 0.015 | 0.990 ± 0.003 | 0.833 ± 0.014 | 0.646 ± 0.019 | 0.326 ± 0.011
YOLOv11 | 0.721 ± 0.017 | 0.717 ± 0.014 | 0.990 ± 0.003 | 0.835 ± 0.012 | 0.663 ± 0.017 | 0.335 ± 0.010
Table 4. Summary of ACR breast density classification performance using lesion-centered ROIs. All reported metrics correspond to the test sets.
Dataset | Binary Metrics | Best Binary Setup | Multi-Class Metrics | Best Multi-Class Setup
CBIS-DDSM | 0.704 ± 0.034 / 0.796 ± 0.029 / 0.835 ± 0.024 / 0.724 ± 0.031 | ENB3 + XGB | 0.486 ± 0.048 / 0.575 ± 0.043 / 0.586 ± 0.041 / 0.452 ± 0.052 | DN121 + RF
INBREAST | 0.630 ± 0.051 / 0.953 ± 0.018 / 0.612 ± 0.067 / 0.630 ± 0.048 | R50 + SVM | 0.810 ± 0.027 / 0.786 ± 0.034 / 0.774 ± 0.038 / 0.811 ± 0.025 | HC + RF
VINDr-Mammo | 0.950 ± 0.011 / 0.952 ± 0.010 / 0.952 ± 0.009 / 0.951 ± 0.010 | HC + RF | 0.950 ± 0.012 / 0.951 ± 0.013 / 0.949 ± 0.011 / 0.948 ± 0.012 | HC + LR
DMID | 0.680 ± 0.044 / 0.690 ± 0.041 / 0.660 ± 0.046 / 0.650 ± 0.043 | R50 + LR | 0.700 ± 0.040 / 0.680 ± 0.047 / 0.740 ± 0.044 / 0.530 ± 0.110 | ENB3 + XGB
Mass-Bench | 0.902 ± 0.016 / 0.902 ± 0.015 / 0.902 ± 0.014 / 0.902 ± 0.015 | DN121 + KNN | 0.822 ± 0.021 / 0.836 ± 0.023 / 0.699 ± 0.027 / 0.744 ± 0.024 | HC + RF
Binary and multi-class metrics are reported as Accuracy ± SD / Precision ± SD / Recall ± SD / F1-score ± SD. The table summarizes the best-performing configuration for the binary (ACR 1–2 vs. 3–4) and multi-class (ACR 1–4) formulations for each dataset. All classification experiments were evaluated using 5-fold cross-validation. Abbreviations: HC: handcrafted features; R50: ResNet50; DN121: DenseNet121; ENB3: EfficientNet-B3; LR: Logistic Regression; SVM: Support Vector Machine; RF: Random Forest; KNN: k-Nearest Neighbors; XGB: XGBoost.
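The binary grouping (ACR 1–2 vs. 3–4) and the "mean ± SD" reporting over five folds can be illustrated with a short standard-library sketch. This is an illustration under stated assumptions, not the authors' pipeline: the function names `acr_to_binary` and `summarize_folds` are hypothetical, the per-fold accuracies are made-up numbers, and the use of the sample standard deviation is an assumption because the text does not state which SD estimator was used.

```python
from statistics import mean, stdev


def acr_to_binary(acr: int) -> int:
    """Map ACR 1-2 to class 0 and ACR 3-4 to class 1 (the binary formulation)."""
    if acr not in (1, 2, 3, 4):
        raise ValueError(f"unexpected ACR category: {acr}")
    return 0 if acr <= 2 else 1


def summarize_folds(fold_scores: list[float]) -> str:
    """Format per-fold scores as 'mean ± SD' using the sample SD (an assumption)."""
    return f"{mean(fold_scores):.3f} ± {stdev(fold_scores):.3f}"


# Illustrative ACR labels collapsed into the two binary groups.
acr_labels = [1, 2, 3, 4, 2, 3]
binary_labels = [acr_to_binary(a) for a in acr_labels]
print(binary_labels)

# Made-up accuracies for five cross-validation folds (NOT results from the paper).
fold_accuracy = [0.89, 0.91, 0.90, 0.92, 0.88]
print(summarize_folds(fold_accuracy))
```

The same summary function would be applied separately to precision, recall, and F1-score to produce the four slash-separated entries of each table cell.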
Table 5. Summary of BI-RADS classification performance using lesion-centered ROIs. All reported metrics correspond to the test sets.
| Dataset | Binary Metrics | Best Binary Setup | Multi-Class Metrics | Best Multi-Class Setup |
| CBIS-DDSM | 0.720 ± 0.036 / 0.704 ± 0.041 / 0.990 ± 0.009 / 0.698 ± 0.038 | DN121 + KNN | 0.530 ± 0.052 / 0.568 ± 0.048 / 0.597 ± 0.045 / 0.461 ± 0.057 | R50 + XGB |
| INBREAST | 0.838 ± 0.047 / 0.800 ± 0.061 / 0.781 ± 0.074 / 0.831 ± 0.052 | DN121 + SVM | 0.860 ± 0.027 / 0.786 ± 0.036 / 0.774 ± 0.041 / 0.860 ± 0.025 | ENB3 + LR |
| VINDr-Mammo | 0.727 ± 0.019 / 0.713 ± 0.024 / 0.724 ± 0.021 / 0.716 ± 0.022 | ENB3 + SVM | 0.597 ± 0.033 / 0.627 ± 0.037 / 0.563 ± 0.041 / 0.577 ± 0.039 | ENB3 + SVM |
| DMID | 0.931 ± 0.021 / 0.990 ± 0.008 / 0.670 ± 0.074 / 0.781 ± 0.011 | ENB3 + KNN | 0.837 ± 0.040 / 0.750 ± 0.061 / 0.702 ± 0.069 / 0.561 ± 0.110 | DN121 + RF |
| Mass-Bench | 0.904 ± 0.015 / 0.902 ± 0.017 / 0.898 ± 0.019 / 0.899 ± 0.016 | HC + RF | 0.836 ± 0.022 / 0.860 ± 0.024 / 0.798 ± 0.028 / 0.823 ± 0.026 | HC + RF |
Binary and multi-class metrics are reported as Accuracy ± SD / Precision ± SD / Recall ± SD / F1-score ± SD. The table summarizes the best-performing configuration for the binary and multi-class BI-RADS formulations for each dataset. All classification experiments were evaluated using 5-fold cross-validation. Abbreviations: HC: handcrafted features; R50: ResNet50; DN121: DenseNet121; ENB3: EfficientNet-B3; LR: Logistic Regression; SVM: Support Vector Machine; RF: Random Forest; KNN: k-Nearest Neighbors; XGB: XGBoost.
Table 7. Related studies on automatic ACR breast density classification in mammography.
| Report | Year | Model | Dataset | ACR Classes | Acc (Bin) | Acc (Multi) | F1 (Bin) | F1 (Multi) | AUC (Bin) | AUC (Multi) |
| [73] | 2017 | CNN | Clinical dataset | B vs. C (binary) | 0.94 |
| [74] | 2019 | CNN | Clinical cohort | ACR 1–4 + binary | 0.87 0.77 |
| [75] | 2022 | Deep CNN | DDM-Spain | ACR 1–4 | 0.85 |
| [77] | 2022 | TwoViewDensityNet | DDSM/INbreast | ACR 1–4 | 0.96 0.99 |
| [76] | 2022 | EfficientNet | Multi-center dataset | ACR 1–4 + binary | 0.88 0.82 0.95 0.93 |
| Ours | 2026 | ML + Deep features | Mass-Bench | ACR 1–4 | 0.90 0.82 0.90 0.74 |
Related works report strong performance in ACR density classification. Busaleh et al. [77] achieved the highest multi-class accuracy (0.96) and AUC (0.99), while Rigaud et al. [76] reported robust performance across both binary and multi-class settings. By comparison, the proposed Mass-Bench framework achieves competitive results (Acc = 0.90 binary, 0.82 multi-class; F1 = 0.90 binary, 0.74 multi-class).
Table 9. BI-RADS class composition and experimental protocol across representative studies.
| Study | BI-RADS Composition | Split |
| Baccouche et al. (2022) [79] | CBIS: 2 (792), 3 (1938), 4 (2328), 5 (3402); INBREAST: 2 (144), 3 (78), 4 (126), 5 (276), 6 (48) | 80/10/10 |
| Li et al. (2025) [80] | CBIS: 1 (0), 2 (345), 3 (434), 4 (1534), 5 (530); INBREAST: 1 (67), 2 (220), 3 (23), 4 (43), 5 (57) | 80/20; 90/10; CV |
| Tsai et al. (2022) [35] | 0 (520), 1 (0), 2 (2125), 3 (847), 4A (367), 4B (277), 4C (217), 5 (204) | Train/test (blocks) |
| Sabani et al. (2022) [81] | Grouped: 1 vs. (2–3) vs. (4–5) | 70/20/10 |
| Lin et al. (2025) [82] | 3 (632), 4A (214) | No standard split |
| Tekin et al. (2024) [83] | Not explicitly reported | Not clearly specified |
| Ours (complete bench) | 2 (300), 3 (1258), 4 (1591), 5 (597) | Multi-dataset |
| Ours (balanced) | 2 (300), 3 (300), 4 (300), 5 (300) | Balanced |
Most studies show imbalanced BI-RADS distributions and inconsistent or unclear splits, often simplifying the task through class grouping. Mass-Bench provides balanced settings under a unified multi-dataset protocol.
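The step from the complete benchmark composition (300/1258/1591/597) to the balanced setting (300 per class) can be pictured as undersampling every BI-RADS class to the size of the smallest one. The sketch below is an assumption about how such balancing could be done, not a description of the authors' procedure: the function name `balance_by_undersampling`, the fixed seed, and the use of uniform random sampling are all hypothetical choices.

```python
import random
from collections import Counter


def balance_by_undersampling(labels: list[int], seed: int = 0) -> list[int]:
    """Return sorted sample indices with every class cut to the smallest class size."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    indices_by_class: dict[int, list[int]] = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in indices_by_class.values())
    kept: list[int] = []
    for label in sorted(indices_by_class):
        # Sample without replacement; the smallest class is kept in full.
        kept.extend(rng.sample(indices_by_class[label], target))
    return sorted(kept)


# Class counts of the complete benchmark, as listed in the table above.
complete_counts = {2: 300, 3: 1258, 4: 1591, 5: 597}
labels = [label for label, count in complete_counts.items() for _ in range(count)]

kept_indices = balance_by_undersampling(labels)
balanced_counts = Counter(labels[i] for i in kept_indices)
print(dict(balanced_counts))
```

With these counts the smallest class is BI-RADS 2, so the balanced subset contains 1200 samples in total.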


MDPI and ACS Style

Zepeda-Reyes, H.E.; Peregrina-Barreto, H.; Lopez-Armas, G.C. Density-Aware Multi-Dataset Evaluation of Deep Learning for Mammographic Mass Detection and BI-RADS Classification. Mathematics 2026, 14, 2080. https://doi.org/10.3390/math14122080
