1. Introduction
Breast cancer remains one of the most prevalent malignancies among women worldwide, and early detection and accurate diagnosis are crucial for improving survival. Mammography is the widely adopted standard for breast cancer screening, but its interpretation demands extensive expertise, and diagnostic discrepancies and missed diagnoses among radiologists remain persistent challenges.
Beyond mammography, breast cancer diagnosis incorporates various methods, including visual inspection, palpation, and ultrasound examination. When these examinations reveal abnormalities, clinicians often perform highly invasive procedures such as cytological and histological examinations to reach a definitive diagnosis. If deep learning-based analysis of minimally invasive mammographic images achieved high diagnostic accuracy, the need for such invasive procedures would be reduced, while simultaneously alleviating the interpretive burden on radiologists and breast surgeons.
Recent rapid advancements in artificial intelligence (AI) technology, particularly deep learning, have significantly accelerated the development of automated analysis and diagnostic support systems for mammographic images. For various image recognition tasks, deep learning algorithms, especially convolutional neural networks (CNNs), now demonstrate performance comparable to or exceeding human capabilities. For medical image diagnosis, these technologies often achieve superior accuracy and efficiency compared to conventional methodologies.
Many studies have explored deep learning applications for mammographic image diagnosis. For instance, Zhang et al. [1] performed a two-stage classification (normal/abnormal and benign/malignant) using two-view mammograms (CC and MLO) from the public DDSM dataset with a multi-scale attention DenseNet. Lång et al. [2] evaluated the potential of AI to identify normal mammograms by classifying cancer likelihood scores with a deep learning model on a private dataset, comparing the results to radiologists’ interpretations. Another study by Lång et al. [3] indicated that deep learning models trained on a private dataset can reduce interval cancer rates without supplementary screening. Zhu et al. [4] predicted future breast cancer development in negative subjects over an eight-year period using a deep learning model with a private dataset. Kerschke et al. [5] compared human versus deep learning AI accuracy for benign–malignant screening on a private dataset, highlighting the need for prospective studies. Nica et al. [6] reported high-accuracy benign–malignant classification of craniocaudal-view mammography images using an AlexNet deep learning model and a private dataset. Rehman et al. [7] achieved high-accuracy architectural distortion detection using image processing and proprietary depth-wise 2D V-net 64 convolutional neural networks on the PINUM, CBIS-DDSM, and DDSM datasets. Yirgin et al. [8] applied a publicly available deep learning diagnostic system to a private dataset, concluding that combined assessment by both the deep learning model and radiologists yielded the best performance. Tzortzis et al. [9] demonstrated superior performance in efficiently detecting abnormalities on the public INbreast dataset using their tensor-based deep learning model, showing robustness with limited data and reduced computational requirements. Pawar et al. [10] and Hsu et al. [11] reported high-accuracy Breast Imaging Reporting and Data System (BIRADS) category classification using, respectively, a proprietary multi-channel DenseNet architecture and a fully convolutional dense connection network on private datasets. Elhakim et al. [12] investigated the feasibility of replacing the first reader with AI in double-reading mammography using a commercial AI system with a private dataset, emphasizing the importance of an appropriate AI threshold. Jaamour et al. [13] improved segmentation accuracy for mass and calcification images from the public CBIS-DDSM dataset by applying transfer learning. Kebede et al. [14] developed a model combining EfficientNet-based classifiers with a YOLOv5 object detection model and an anomaly detection model for mass screening on the public VinDr and Mini-DDSM datasets. Ellis et al. [15], using the UK national OPTIMAM dataset, developed a deep learning AI model for predicting future cancer risk in patients with negative mammograms. Elhakim et al. [16] further investigated replacing one or both readers with AI in double-reading mammography, emphasizing the clinical implications for accuracy and workload. Sait et al. [17] reported high accuracy and generalizability in multi-class breast cancer image classification using an EfficientNet B7 model combined with a LightGBM classifier on the CBIS-DDSM and CMMD datasets. Chakravarthy et al. [18] reported high classification accuracy for normal, benign, and malignant cases using an ensemble method with a modified Gompertz function on the BCDR, MIAS, INbreast, and CBIS-DDSM datasets. Liu et al. [19] achieved high classification accuracy on four binary tasks using a CNN and a private mammography dataset, suggesting the potential to reduce unnecessary breast biopsies. Park et al. [20] reported improved diagnostic accuracy, especially in challenging ACR BIRADS categories 3 and 4 with breast density exceeding 50%, by jointly learning benign–malignant classification and lesion boundaries with a ViT-B DINO-v2 model on the public CBIS-DDSM dataset. Finally, AlMansour et al. [21] reported high-accuracy BIRADS classification using MammoViT, a novel hybrid deep learning framework, on a private dataset.
Despite these advancements, several difficulties hinder the reproducibility of reported findings in deep learning applications for mammographic image diagnosis. Studies using private, non-public datasets or proprietary deep learning models with undisclosed details are difficult to verify. Methods that incorporate subject information alongside mammographic images as training data also face reproducibility issues, because such information is rarely available in common form across datasets. Similarly, studies combining mammographic images with images from other modalities require specific data combinations, further complicating reproduction of their claims.
Given these considerations, we prioritized reproducible research by focusing on studies that use publicly available datasets and open-source deep learning models. Furthermore, we emphasized the generalizability of our findings across multiple public datasets and multiple deep learning models.
Scrutiny of the datasets indicated that the number of regions of interest (ROIs) tends to increase with diagnostic severity, from normal through benign to malignant. This tendency suggests that the presence or absence of ROIs is a useful feature.
Therefore, this study tested the hypothesis that prediction accuracy improves when images are first stratified by whether annotated mask information for regions of interest is available, then trained and predicted separately for each of the four mammographic views (RCC, LCC, RMLO, LMLO), with the results subsequently merged. A standard mammographic examination typically comprises four views: the left mediolateral oblique (LMLO), right mediolateral oblique (RMLO), left craniocaudal (LCC), and right craniocaudal (RCC) views (Figure 1).
We compared this approach against a baseline in which the image data are not separated by the availability of ROI mask information. Using two public datasets and two deep learning models, we validated this hypothesis, treating the presence or absence of annotated mask information as a novel feature.
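To make the procedure concrete, the following minimal sketch illustrates the stratify-train-merge scheme on synthetic data. The column names, the random feature vectors, and the logistic-regression classifier standing in for the deep learning backbones are all illustrative assumptions, not the study’s actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
VIEWS = ["RCC", "LCC", "RMLO", "LMLO"]

# Synthetic stand-in for the image metadata: in the real pipeline each row
# would reference one mammogram, and "features" would come from a deep model.
n = 800
df = pd.DataFrame({
    "view": rng.choice(VIEWS, size=n),
    "has_mask": rng.choice([True, False], size=n),  # annotated ROI mask present?
    "label": rng.integers(0, 2, size=n),            # 0 = benign, 1 = malignant
})
features = rng.normal(size=(n, 16))

# Train one classifier per (view, mask-presence) subset, then merge predictions.
pred = np.empty(n, dtype=int)
for view in VIEWS:
    for has_mask in (True, False):
        idx = df.index[(df["view"] == view) & (df["has_mask"] == has_mask)]
        clf = LogisticRegression().fit(features[idx], df.loc[idx, "label"])
        pred[idx] = clf.predict(features[idx])

df["prediction"] = pred  # merged result across all eight subsets
```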
4. Discussion
Our “ROI-Stratified” approach emphasizes data stratification based on the binary presence or absence of an ROI. This strategy shows promise, but it does not leverage richer, quantitative information about the ROIs, such as their size, shape, and internal texture. Considering that diagnosticians use these features as diagnostic cues, our current binary treatment might oversimplify the available information. A key challenge for future work, therefore, is to explore methods that incorporate these quantitative ROI features as additional inputs into the model, potentially facilitating a more nuanced decision-making process.
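As a rough illustration of such quantitative descriptors, the sketch below extracts size, shape, and a coarse texture proxy from a binary ROI mask with scikit-image; the specific feature set is an assumption chosen for illustration, not a validated diagnostic feature list.

```python
import numpy as np
from skimage import measure

def roi_features(mask, image):
    """Extract simple quantitative descriptors for each ROI in a binary mask.

    `mask` is a binary ROI annotation and `image` the corresponding
    mammogram; both are 2D arrays of the same shape.
    """
    labeled = measure.label(mask > 0)
    return [
        {
            "area": p.area,                      # ROI size in pixels
            "eccentricity": p.eccentricity,      # 0 = circular, -> 1 = elongated
            "solidity": p.solidity,              # contour irregularity
            "mean_intensity": p.mean_intensity,  # coarse texture proxy
        }
        for p in measure.regionprops(labeled, intensity_image=image)
    ]

# Toy example: one rectangular ROI in a synthetic image.
image = np.random.rand(128, 128)
mask = np.zeros((128, 128), dtype=bool)
mask[40:80, 50:70] = True
print(roi_features(mask, image))
```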
Some cases are diagnosed as malignant yet lack a visible region of interest (ROI). This discrepancy is likely attributable to factors such as dense breast tissue, which can obscure ROIs by rendering the entire image uniformly opaque. In such cases, a malignant diagnosis is reached despite the absence of a clear ROI on the image, presumably corroborated by other diagnostic modalities such as biopsy. We also observed instances diagnosed as normal that nevertheless exhibited an ROI; the presence of an ROI in a “normal” case is unusual and suggests potential mislabeling or an artifact in the annotation process. Such anomalous data points, whether a malignant diagnosis without a discernible ROI or a normal diagnosis with an ROI, introduce label noise that can strongly hinder a deep learning model’s ability to learn accurate patterns and thereby diminish its predictive performance. Preprocessing the dataset to identify and remove, or re-evaluate, these inconsistent data points before training might therefore enhance the learning and prediction accuracy of deep learning algorithms for medical image analysis.
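A minimal sketch of such a consistency filter is shown below, assuming a hypothetical metadata table with a diagnosis label and an ROI-presence flag; the column names and values are illustrative.

```python
import pandas as pd

# Hypothetical metadata table; column names are illustrative.
df = pd.DataFrame({
    "image_id":  ["a", "b", "c", "d"],
    "diagnosis": ["malignant", "normal", "benign", "normal"],
    "has_roi":   [False, True, True, False],
})

# Flag the two inconsistent combinations discussed above.
malignant_without_roi = (df["diagnosis"] == "malignant") & ~df["has_roi"]
normal_with_roi = (df["diagnosis"] == "normal") & df["has_roi"]
inconsistent = malignant_without_roi | normal_with_roi

needs_review = df[inconsistent]    # route to manual re-evaluation
clean = df[~inconsistent]          # retained for training
print(needs_review["image_id"].tolist())  # -> ['a', 'b']
```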
This study used data with pre-existing ROI mask images. However, mammographic images requiring benign–malignant classification do not always have corresponding masks available. Future research should therefore examine generating mask images for mammographic data that lack them, using techniques such as semantic segmentation or object detection, and should validate those approaches.
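As one possible direction, the sketch below instantiates an off-the-shelf U-Net from the segmentation_models_pytorch library to predict a binary ROI mask from a mammogram patch. The encoder choice, the single input channel, and the 0.5 threshold are illustrative assumptions, and the model would of course require training on annotated masks before use.

```python
import torch
import segmentation_models_pytorch as smp

# Hypothetical U-Net for generating ROI masks where none exist.
model = smp.Unet(encoder_name="resnet34", encoder_weights=None,
                 in_channels=1, classes=1)

x = torch.randn(1, 1, 256, 256)       # one grayscale mammogram patch
with torch.no_grad():
    logits = model(x)                 # (1, 1, 256, 256) mask logits
mask = torch.sigmoid(logits) > 0.5    # binary predicted ROI mask
```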
The two deep learning models used in this study, Swin Transformer and ConvNeXtV2, demonstrated superior accuracy in both training and prediction compared to other deep learning models. We hypothesize that this improved performance derives from differences in their layer architectures; a detailed analysis of this phenomenon is a subject for future investigation.
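For reference, both backbones can be instantiated for binary benign–malignant classification via the timm library, as in the sketch below; the specific model variants named here are assumptions and may differ from those used in the study.

```python
import timm
import torch

# Hypothetical instantiation of the two backbones with a two-class head.
models = {
    "swin": timm.create_model("swin_base_patch4_window7_224",
                              pretrained=False, num_classes=2),
    "convnextv2": timm.create_model("convnextv2_base",
                                    pretrained=False, num_classes=2),
}

x = torch.randn(1, 3, 224, 224)  # one dummy 224x224 RGB input
for name, model in models.items():
    with torch.no_grad():
        print(name, model(x).shape)  # -> torch.Size([1, 2]) for both
```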
Other architectures, both existing and yet to be developed, might outperform the two models used in this study. Consequently, a systematic investigation of a broader range of high-performance models remains a key avenue for future research.
While this study specifically addressed benign–malignant classification, mammographic data are typically first categorized into normal versus abnormal findings, with abnormal cases subsequently classified as benign or malignant. An important area for future investigation is whether our methodology can also classify normal and abnormal cases effectively; if so, this capability would enable diagnostic prediction for arbitrary mammographic data.