1. Introduction
Breast cancer is the leading cause of cancer death among women, accounting for 30% of all cancers in females [1]. Imaging with X-ray mammography, magnetic resonance imaging (MRI), and ultrasound (US) forms the basis of its detection. Although X-ray mammography is used as the primary screening tool, US is usually performed as a follow-up to gather additional diagnostic information. MRI is reserved for special cases (e.g., high-risk genetic mutations, multifocal disease), and its value for screening is currently under debate due to its high cost and the need for contrast agents [2]. One of the biggest challenges of US imaging is its high operator dependence [3]. This problem concerns not only the repeatability of measurements but also the level of user expertise.
Recently, the idea of “computers helping doctors” has increasingly become a reality due to the evolution of artificial intelligence (AI) technologies. The implementation of AI in computer-aided diagnosis (CAD) systems holds great promise for the future of cancer detection [4]. These systems are designed to support first-line tumor diagnosis by providing a second, potentially objective opinion on the content of medical images.
The training process of the machine or deep learning algorithms embedded in CAD is strongly influenced by the quality and quantity of the dataset [5,6]. Hence, the first step in building a CAD system is dataset preparation. Pre-processing is an operation that suppresses undesired noise or enhances image features [7]. For medical images, this significantly increases the ability to interpret their content, even for non-imaging experts [8]. Image pre-processing also covers geometric transformations, which are widely used in data augmentation, a technique devoted to enlarging and diversifying image datasets [9]. Because AI-based CAD systems need large and diverse training datasets that are difficult to obtain, data pre-processing can help overcome this issue. For instance, using spatial transformations together with other augmentation techniques noticeably improved skin cancer detection [10]. Furthermore, Zhang et al. proposed an interesting and successful image augmentation approach that extends a dataset of breast US images with BIRADS-oriented feature maps [11]. Including these maps in the training of breast lesion classification frameworks can improve the accuracy of breast cancer diagnosis. Nevertheless, data pre-processing alone is often not enough to enlarge a dataset for a particular task. Here, transfer learning is another technique that can help to overcome the data shortage in the development of AI-based CAD systems. In this method, the network is first trained on a dataset that is not necessarily composed of medical images [12,13]. Thus, the algorithm is exposed to a broader spectrum of information, which improves its generalization capabilities. The training yields a robust pre-trained network, which can be further fine-tuned to develop task-specific detection or classification algorithms.
The lesion needs to be found before it can be classified. However, many CAD frameworks concentrate solely on lesion classification rather than detection [14,15]. For example, Han et al. presented a CAD system for breast cancer classification in which lesion detection was still performed manually by clinicians [16]. The localization of a lesion with a point or detection box constitutes the basis for the segmentation step in a CAD system, and without segmentation the lesion may not be classified [17,18]. However, the segmentation of breast lesions remains challenging because tumor margins are often incompletely depicted in US images and disturbed by artifacts. To better address the lesion segmentation task, authors have focused on developing new algorithms or refining existing ones [19,20]. For example, Xue et al. proposed a deep convolutional neural network equipped with a global guidance block that enhances breast lesion segmentation by utilizing the broad contextual features of US images [21]. Nonetheless, from the clinical perspective, detailed lesion segmentation is not required to diagnose breast cancer. Indeed, clinicians localize and measure the size of lesions to monitor growth, to perform tumor staging, and to control the therapy outcome, but for decisions about tumor type, the analysis of the border zone and the adjacent tissue is of pivotal importance [22]. Thus, detection and area definition should be carefully incorporated into CAD.
To diagnose breast cancer, the doctor analyzes a few parameters related to lesion size, shape, and echo pattern. In the early days of CAD systems, the embedded machine learning algorithms were trained to search only for the same features that a clinician would look for [15,23]. However, images are more than pictures; they can be used to extract multiple powerful features that are not visible to the naked eye [24,25]. This became possible with the introduction of radiomics analysis, which aims to exploit the full information content of medical images for cancer diagnosis [26,27,28]. For this purpose, radiomics analysis can provide new parameters reflecting important characteristics of the tumor microenvironment. For instance, using textural features extracted from US images, the differentiation between triple-negative breast cancer, invasive ductal carcinoma, and fibroadenoma can be improved [29,30]. However, mining image-based features is a complicated task that involves multiple image processing steps (i.e., segmentation, feature extraction), each of which can influence the developed tumor classification model [31]. Introducing deep learning into CAD frameworks enabled the automated derivation of descriptive features [32]. Omitting the feature extraction and selection steps therefore makes these systems more user-independent. Although deep learning algorithms often represent a black box and the ability to explain their results remains a critical issue, they already outperform machine learning-based CAD systems in the sensitivity of cancer diagnosis [33]. Furthermore, it has been shown that the sensitivity of deep learning CAD systems increases when both handcrafted and automatically derived features are used [34].
In this study (Figure 1), we investigated the advantage of image pre-processing as a data augmentation technique and assessed the influence of training dataset composition on the performance of deep learning- and machine learning-based breast lesion detection algorithms. We hypothesized that an effective radiomics signature (RS) for breast cancer classification can be extracted from lesion detection bounding boxes alone, omitting the segmentation task.
2. Materials and Methods
2.1. Internal and Public Dataset
The retrospective study was approved by the Institutional Review Board (or Ethics Committee) of the University Hospital of the RWTH Aachen University (EK 066/18) and was conducted according to the guidelines of the Declaration of Helsinki.
The study collective includes ultrasound images of 119 female patients who were identified in the database of the Department of Obstetrics and Gynecology, University Clinic Aachen, and the “Radiologie Baden-Baden” diagnosis center. In 71 patients, 77 breast cancer lesions were detected and documented by US. All breast cancers were confirmed histopathologically. In 48 patients, the diagnosis of 50 benign lesions was made with US. The diagnosis of benign lesions was confirmed by histology in 12 patients (5 fibroadenomas and 7 cysts) and by follow-up studies in 36 patients. In the latter cases, the follow-up included at least one examination carried out after 12 months.
The US images were acquired with an Acuson Antares ultrasound system (Siemens Healthineers, Erlangen, Germany) equipped with a 13.5 MHz transducer (VFX13-5). Results were stored in DICOM format (Reitz-CS computer systems, Dresden, Germany). For the study, the images were retrospectively reviewed by a breast radiologist with 11 years of experience (M.P.), with knowledge of the histologic results and/or the follow-up studies, and anonymized for the subsequent analysis.
To extend the assembled study collective, breast US images from two publicly available datasets were used [17,35]. The first dataset was collected at the UDIAT Diagnostic Centre of the Parc Taulí Corporation, Sabadell, Spain. It comprises 163 US images from different female patients, in which 110 benign and 53 malignant lesions were found. The second dataset was obtained and provided by the Department of Radiology of Thammasat University and the Queen Sirikit Center of Breast Cancer of Thailand. This collective includes US images of 249 female patients with diagnoses of 62 benign solid mass lesions, 21 fibroadenomas, 22 cysts, and 144 cancer lesions. The dataset provides manual segmentations of the documented lesions drawn by 3 clinicians from the Department of Radiology of Thammasat University.
2.2. Dataset Preparation and Sampling
The patient data, identified in the database of the Department of Obstetrics and Gynecology, University Clinic Aachen, and the “Radiologie Baden-Baden” diagnosis center, were exported from the DICOM format. All samples were saved as 8-bit grayscale images, normalized, and cropped to a size of 600 × 700 pixels. A discrete wavelet transform was used for speckle noise removal [36]. The images from the public datasets were reviewed, and examples with caliper measurements embedded in the image were excluded. The final dataset was composed of 497 patients/505 lesions (Table 1).
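The wavelet-based speckle suppression step can be sketched as follows. This is a minimal single-level 2D Haar transform with soft thresholding of the detail sub-bands, written with plain NumPy; the Haar wavelet choice and the threshold value are illustrative assumptions, not the exact MATLAB configuration used in the study [36].

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform; x must have even height and width."""
    lo = (x[:, 0::2] + x[:, 1::2]) / 2   # column-pair average (low-pass)
    hi = (x[:, 0::2] - x[:, 1::2]) / 2   # column-pair difference (high-pass)
    ll = (lo[0::2, :] + lo[1::2, :]) / 2
    lh = (lo[0::2, :] - lo[1::2, :]) / 2
    hl = (hi[0::2, :] + hi[1::2, :]) / 2
    hh = (hi[0::2, :] - hi[1::2, :]) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    lo = np.empty((ll.shape[0] * 2, ll.shape[1]))
    hi = np.empty_like(lo)
    lo[0::2, :], lo[1::2, :] = ll + lh, ll - lh
    hi[0::2, :], hi[1::2, :] = hl + hh, hl - hh
    x = np.empty((lo.shape[0], lo.shape[1] * 2))
    x[:, 0::2], x[:, 1::2] = lo + hi, lo - hi
    return x

def despeckle(img, threshold=10.0):
    """Suppress speckle by soft-thresholding the detail coefficients."""
    ll, lh, hl, hh = haar_dwt2(img.astype(float))
    soft = lambda c: np.sign(c) * np.maximum(np.abs(c) - threshold, 0.0)
    return haar_idwt2(ll, soft(lh), soft(hl), soft(hh))
```

For 8-bit output, the result would be clipped back to the 0–255 range and cast to `uint8`.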
The prepared dataset was divided into 2 data pools. The first data pool (234 patients/235 lesions) was used for developing the breast lesion detection functions. The second data pool (263 patients/270 lesions) was used for developing the breast lesion classification model.
2.3. Dataset Augmentation
The images from the first data pool were augmented spatially and by computing their exponential, logarithmic, Laplacian-of-Gaussian, square-root, squared, and wavelet derivatives. All augmentation scenarios are listed and described in Table 2 and illustrated in Figures S1–S6. The final number of augmented images derived from one original image was 118 (109 spatial and 9 filtered/processed). Data augmentation was performed in MATLAB (Version 2020a, The MathWorks Inc., Natick, MA, USA). The derived augmentation scenarios were used for building 8 training datasets (Table 3).
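A few of the filter-based and spatial augmentations above can be sketched in plain NumPy; the sketch below derives exponential, logarithmic, square-root, and squared intensity maps, each rescaled back to the 8-bit range, plus simple flip/rotation variants. The exact parameterization of the MATLAB filters used in the study may differ.

```python
import numpy as np

def rescale_8bit(x):
    """Linearly rescale an array to the 0-255 range."""
    x = x.astype(float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + 1e-12) * 255.0

def filter_augmentations(img):
    """Derive intensity-transformed copies of a grayscale image."""
    f = img.astype(float)
    return {
        "exponential": rescale_8bit(np.exp(f / 255.0)),
        "logarithm":   rescale_8bit(np.log1p(f)),
        "square_root": rescale_8bit(np.sqrt(f)),
        "squared":     rescale_8bit(f ** 2),
    }

def spatial_augmentations(img):
    """A few of the spatial transforms: flips and a 90-degree rotation."""
    return {
        "flip_lr": np.fliplr(img),
        "flip_ud": np.flipud(img),
        "rot90":   np.rot90(img),
    }
```

Each derived image presents the detector with a new matrix of gray levels while preserving the lesion geometry.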
2.4. Breast Lesion Detection
The patients from the first data pool were divided into training, validation, and test groups using random sampling (Table 4). In the training dataset, the ground truth was labeled based on the US images with caliper measurements taken by the expert radiologist (M.P.). For the validation and test datasets, 3 users (a radiologist, a physician, and an ultrasound expert) were asked to detect the lesions by marking them with a bounding box. The labelling was performed using MATLAB.
The breast lesion detection functions were developed using the Viola–Jones and YOLOv3 algorithms. The first, Viola–Jones, computes feature descriptors with a sliding window to detect objects [37]. The second, YOLOv3, is a convolutional neural network that solves a single regression problem to localize objects [38]. Both algorithms follow the underlying gray-level patterns of the images to localize the objects of interest. The Viola–Jones and YOLOv3 algorithms were trained with the 8 assembled datasets (Table 3).
The Viola–Jones classifiers were trained from scratch. In every image, 1 positive and 4 negative regions of interest were marked (Figure S7). The negative regions of interest were cropped from the original image and included in the pool of negative samples. All classifiers were trained using histogram of oriented gradients features [39]. The size of the searched object was set automatically by the Viola–Jones algorithm. The cascades were trained with 10, 15, 20, and 25 stages. The experiment was implemented in MATLAB.
The YOLOv3 classifiers were trained using the open-source Python library ImageAI [40]. This library provides classes and methods for training new detection models on any type of image without the need for additional adjustments to the dataset. The ImageAI library is built on a TensorFlow backbone. The pre-trained YOLOv3 network (i.e., the base model), provided by the ImageAI developers, was trained on the COCO dataset [13]. The custom detection functions were trained with two different transfer learning strategies. First, all detection layers of the pre-trained YOLOv3 network were frozen, and the new models were trained on top of it. Second, the new detection functions were obtained by so-called “fine tuning” of the base model, i.e., by retraining the pre-trained YOLOv3 network on the new dataset with a very low learning rate (i.e., 0.001). During this process, the pre-trained features incrementally adapt to the new data. Only positive examples had to be provided for the training. The positive annotations were issued in Pascal VOC format. The size of the searched objects was set automatically by the algorithm. Training was performed with stochastic gradient descent with a learning rate of 0.01 (transfer learning by “freezing layers”) or 0.001 (“fine tuning”) and a batch size of 4. Each model was trained for 10, 15, 20, and 25 epochs. The experiment was implemented in the Python programming language.
2.5. Evaluation and Performance Metrics for Breast Lesion Detection
The intersection over union (IoU) and localization error (LE) were used to evaluate the accuracy of the breast lesion detection functions. Both the Viola–Jones and YOLOv3 algorithms output the coordinates of the found detection boxes, which were therefore used to calculate IoU and LE.
IoU is the gold standard metric for evaluating object detection models [13,41]. The “overlap criterion” states that a detection bounding box with an IoU greater than 0.5 with the ground truth bounding box is a true positive finding; otherwise, it is considered a false positive. A case where no lesion is detected is considered a false negative [18]. No criterion for true negative detections was established. The IoU was calculated with the following formula:
IoU = Area(Bd ∩ Bgt) / Area(Bd ∪ Bgt),(1)
where Bd and Bgt denote the detection and the ground truth bounding boxes, respectively.
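For axis-aligned boxes, the IoU can be computed in a few lines. The sketch below assumes boxes given as (x, y, width, height) with (x, y) the top-left corner; this representation is our assumption, not a convention stated in the text.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes; (x, y) is the top-left corner."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # overlap of the two rectangles along each axis
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0
```

A detection with `iou(detection, ground_truth) > 0.5` would then count as a true positive under the overlap criterion.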
LE measures the disagreement in localization and size between the detection box and the ground truth box. Thus, we hypothesized that it could be used as a supporting evaluation metric alongside IoU. A detection is classified as true positive when its LE with reference to the ground truth is less than 0.1; that is, the detection box is localized less than 10 pixels from the ground truth box, and its width and height deviate by less than 10% from the original. Detections that do not meet these criteria are considered false positives. A case where no lesion is detected is considered a false negative. No criterion for true negative detections was established. The LE was calculated with the following formula:
where x_c^gt, y_c^gt are the coordinates of the center of the ground truth bounding box; x_c^d, y_c^d are the coordinates of the center of the detection bounding box; w_gt, h_gt are the width and height of the ground truth bounding box; and w_d, h_d are the width and height of the detection bounding box.
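The LE acceptance rule described above (center offset under 10 pixels, width and height within 10% of the ground truth) can be sketched as a direct check on center-form boxes. Note that this reproduces only the stated true-positive criterion, not the LE score itself; the thresholds are those given in the text.

```python
def le_true_positive(gt, det, max_offset=10.0, max_rel_size=0.1):
    """Check the LE acceptance rule for boxes given by center and size.

    gt  -- ground truth box (xc_gt, yc_gt, w_gt, h_gt)
    det -- detection box    (xc_d,  yc_d,  w_d,  h_d)
    """
    xc_gt, yc_gt, w_gt, h_gt = gt
    xc_d, yc_d, w_d, h_d = det
    # Euclidean distance between the two box centers
    offset = ((xc_gt - xc_d) ** 2 + (yc_gt - yc_d) ** 2) ** 0.5
    return (offset < max_offset
            and abs(w_gt - w_d) / w_gt < max_rel_size
            and abs(h_gt - h_d) / h_gt < max_rel_size)
```

A box slightly smaller than, and centered inside, the ground truth passes this check even when its IoU falls below 0.5, which is the behavior the metric was introduced to capture.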
The final values of IoU and LE per image were computed as the mean of the IoU and LE scores calculated separately for all users. The final IoU and LE of a lesion detection algorithm were calculated as the mean of all IoU and LE scores obtained for every image in the test dataset; additionally, the standard deviation was calculated. Together with IoU and LE, recall (3), precision (4), and F1-score (5) were computed:
Recall = TP / (TP + FN),(3)
Precision = TP / (TP + FP),(4)
F1-score = 2 × Precision × Recall / (Precision + Recall),(5)
where TP, FP, and FN denote the numbers of true positive, false positive, and false negative detections, respectively. Furthermore, the robustness of the detection algorithms was assessed with the recall-IoU and recall-LE curves [42].
When no false negative (FN) samples were obtained in the detection process, the recall (6) was calculated as the quotient of the total number of true positive (TP) findings and the total number of ground truths in the dataset (N):
Recall = TP / N.(6)
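The detection metrics (3)-(5), together with the TP/N fallback (6), translate directly into code; a minimal sketch:

```python
def detection_metrics(tp, fp, fn, n_ground_truths=None):
    """Recall, precision, and F1-score from detection counts.

    When no false negatives occur, recall falls back to TP / N,
    with N the total number of ground truths in the dataset.
    """
    if fn == 0 and n_ground_truths:
        recall = tp / n_ground_truths
    else:
        recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1
```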
2.6. Detecting Breast Lesions in the Classification Dataset
The second data pool was used to develop the breast lesion classification models. The patients comprising this data pool were divided into two subsets: Feature Selection and Classification. In each group, the patients were randomly sampled so that the benign and malignant examples were equally distributed. In the Classification Subset, the patients were further sampled into training and test groups (Table 5). The images comprising the second data pool were not augmented.
The developed detection functions (i.e., the best Viola–Jones and the best YOLOv3 models) were applied to localize breast lesions in both subsets of the second data pool. The obtained detection boxes were used solely to outline (i.e., “segment”) breast lesions in the images. The ground truth segments were obtained manually by a radiologist with 20 years of experience in breast US imaging (F.K.) (Figure 2). The images and the corresponding binary representations (i.e., masks) of the segments outlined by the expert radiologist and by the YOLOv3 and Viola–Jones detection functions were assembled into 3 separate datasets named “Manual Segmentation”, “YOLOv3”, and “Viola–Jones”. For samples that were not detected by the YOLOv3 or Viola–Jones models, segments of the size of the whole image were computed. The Manual Segmentation, YOLOv3, and Viola–Jones datasets were later used to develop 3 independent breast lesion classification models.
2.7. Radiomics Signature Extraction for Breast Lesion Classification
The radiomics features were calculated with PyRadiomics [43], an open-source package for mining radiomics features from medical images. The histogram-based (with binWidth: 25), textural (Gray Level Co-occurrence Matrix (GLCM), Gray Level Size Zone Matrix (GLSZM), Gray Level Run Length Matrix (GLRLM), Gray Level Dependence Matrix (GLDM), and Neighboring Gray Tone Difference Matrix (NGTDM)), and wavelet (with ‘coif1’ wavelet) features were calculated from the original and derived images (i.e., Laplacian of Gaussian, squared, square root, logarithm, exponential, gradient, and Local Binary Pattern). The shape-based features were not extracted. All the considered groups of features have been thoroughly described previously [27]. Radiomics features were extracted separately for the Manual Segmentation, YOLOv3, and Viola–Jones datasets; thus, 3 separate sets of features were obtained. In total, 1023 features per dataset were mined, and the values of the extracted features were normalized. The least absolute shrinkage and selection operator (LASSO) with L1 regularization was used for the feature selection task [44]. This supervised algorithm identifies features that are strongly correlated with the response variable (benign or malignant) and, conversely, those that are only loosely associated with it. Identifying the most and least descriptive traits is important because the latter, in particular, can promote overfitting of the trained model. During training, LASSO determines the magnitude of the penalty coefficient lambda, which it uses both to select the most descriptive features for the classification task and to remove the least descriptive ones.
The optimal magnitude of the penalty coefficient lambda was determined with a 10-fold cross-validation search performed on the Feature Selection Subsets of the Manual Segmentation, YOLOv3, and Viola–Jones datasets. Finally, LASSO was used to find 3 separate effective radiomics signatures (RS) for the Manual Segmentation, YOLOv3, and Viola–Jones datasets.
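As an illustration of this selection step, the sketch below fits scikit-learn's LassoCV with 10-fold cross-validation and keeps the features with non-zero coefficients. The data are synthetic, and the continuous response is a stand-in for the benign/malignant label; the study itself performed this step in its own pipeline, so every name here is a placeholder.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 lesions x 50 normalized radiomics features
# response driven by two informative features plus noise
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

# 10-fold CV chooses the penalty strength (lambda, called alpha in scikit-learn)
model = LassoCV(cv=10, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of the retained features
print(f"lambda = {model.alpha_:.4f}, kept {selected.size} of {X.shape[1]} features")
```

Features whose coefficients are shrunk exactly to zero are discarded; the surviving indices form the signature.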
2.8. Evaluation and Performance Metrics for Breast Lesion Classification
To identify the best algorithm for breast lesion classification, the Classification Learner App, a MATLAB built-in application, was used. The Manual Segmentation, YOLOv3, and Viola–Jones RS, selected from the training groups of the Classification Subsets of the second data pool, were trained with 5- and 10-fold cross-validation and 20%, 25%, and 30% holdout. All trainings were done once per set of conditions. After each training, we selected the 3 best-performing breast lesion classification functions per RS, which were later applied to the test groups of the Classification Subsets. The selected breast lesion classification models were evaluated by calculating and comparing their sensitivity (7), specificity (8), and accuracy (9). Additionally, receiver operating characteristic (ROC) curves were drawn, and the area under the ROC curve (AUROC) was computed. Finally, the best breast lesion classification model was chosen based on the calculated evaluation metrics.
Here, a correct classification was considered a true positive (TP), and an incorrectly classified sample was assigned as a false positive (FP). Data accurately assigned as negative were counted as true negatives (TN), and as false negatives (FN) in the opposite case:
Sensitivity = TP / (TP + FN),(7)
Specificity = TN / (TN + FP),(8)
Accuracy = (TP + TN) / (TP + TN + FP + FN).(9)
Finally, we conducted DeLong’s test to statistically evaluate the performance of the models trained with the YOLOv3- and Viola–Jones-derived RS by comparing their AUROC to that of the Manual Segmentation RS-based model, which in this case was considered the gold standard [45].
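The classification metrics can be sketched with NumPy alone. AUROC is computed here via the pairwise rank formulation, which is equivalent to the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one; the label convention (1 = malignant) is our assumption.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary labels (1 = malignant)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy

def auroc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # pairwise wins
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```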
4. Discussion
In this study, we methodically analyzed the different steps a CAD system should consider to detect and classify benign or malignant breast lesions in US images. First, we found that computing pre-processed images is a valid data augmentation technique for a dataset of US images. Including these images in the training dataset improves the performance of breast lesion detection models trained with the YOLOv3 and Viola–Jones algorithms. Moreover, we found that YOLOv3-based breast lesion detection is more robust and reproducible than Viola–Jones-based detection. In the second part of our study, we discovered that an effective RS can be extracted solely from the detection bounding boxes. The obtained model achieved promising results in the classification of both malignant and benign breast lesions.
Data augmentation prevents overfitting and can provide additional information that can be extracted from the original dataset [7,8,9]. In our study, we used a broad selection of spatial augmentations to build a versatile training dataset for developing the breast lesion detection strategy. Some of the used transformations may not represent the typical presentation of US images with a transducer on top; however, these images are still highly useful for expanding the learning abilities of the detection algorithms. Furthermore, we expanded the heterogeneity of the training dataset by including the pre-processed images. The application of imaging filters created a new matrix of gray levels; hence, the algorithm faced a different pool of features that could be learned. We found that using image filtering methods for data augmentation, along with the spatial transformations, can improve the performance of breast lesion detection. In particular, the inclusion of logarithmic images derived from the original US data improves the performance of the YOLOv3 algorithm in breast lesion detection. By comparison, the Viola–Jones-based model for breast lesion localization benefits from being trained on a dataset with all the presented augmentations of the original US data. Furthermore, our study showed that YOLOv3 is a better choice than Viola–Jones for developing breast lesion detection functions. The YOLOv3 models express higher robustness and reproducibility of breast lesion detection in US images and obtained higher scores when evaluated with reference to the gold standard IoU and the proposed LE. IoU is one of the most popular and most reliable metrics used for the evaluation of object detection models [13,41]. However, we showed that it is not ideal for analyzing breast lesion detection in US images. Detection boxes that are smaller than, or encompassed by, the ground truth bounding box receive a lower IoU score even though the lesion was detected correctly; this results in a high number of false positive detections. The LE score calculated for the same detection boxes classifies them as true positives. Often, where IoU would discard a positive sample, LE helps to preserve it. LE considers the seeding point plus the size of the detected bounding box, which makes it more robust than seed-point-based evaluation [17,18]. Nevertheless, using LE alone can also be misleading. Neither IoU nor LE is an ideal measure for scoring breast lesion detection; in combination, however, they give a better overview of the detection function’s performance. Our findings suggest that using LE as a supporting score for IoU is beneficial for the evaluation of breast lesion detection algorithms.
Typically, lesion detection is followed by a segmentation task in CAD systems. Segmentation is much more complex than drawing a bounding box around a region of interest. In US imaging in particular, one needs to analyze images obtained with different transducer positions to capture the whole shape of the lesion. This can be challenging for any segmentation algorithm, as it cannot work with a well-arranged series of images as in CT or MRI [16,46]. Moreover, using bounding boxes may be more real-time capable; thus, during the examination, a region of interest could be analyzed simultaneously with a changing transducer position [47,48].
Developers of classical machine learning or deep learning-based segmentation models aim to obtain a detailed outline of the tumor [16,49] because the identified segments constitute the basis for the last element of CAD, the lesion classification [46,50]. Generating an accurate segment of a breast lesion provides the opportunity to compute morphological shape features, which have been reported to have more discriminative power than textural traits [51]. However, these features are frequently computed from 2D US images; in our opinion, it would be more reliable to use 3D US images to assess breast tumor shape morphology. Thus far, it has been shown that there is no significant difference between extracting textural radiomics features from the whole lesion or from just a part of it [52]. Furthermore, the inclusion of textural features enables capturing the characteristics of breast lesions not only at the microscale but also at the macroscale, i.e., by quantifying gray level zones. Moreover, bounding boxes comprise the breast lesion and the adjacent tissue, which is not the case for accurate segmentation. Thus, the selected features reflect the underlying characteristics of the breast lesion and its neighboring tissue. In the clinic, a doctor diagnoses a breast lesion while simultaneously analyzing its surrounding tissue, and segmentation is not essential for this task.
We investigated whether the generated detection bounding boxes, representing “segments”, can be applied for breast lesion classification. First and foremost, it is of high importance to derive an RS that explains the particular classification problem well. Using a large number of features can lead to overfitting; thus, it is favorable to use feature selection methods to identify the most descriptive and reproducible traits [53,54]. In our study, we reduced the obtained feature space with the LASSO model. This resulted in the identification of three RS comprising the traits most correlated with either malignant or benign breast lesions. Our results show that the classification of benign and malignant breast lesions with an RS derived from just the detection box is a promising and robust alternative. In particular, the sensitivity and specificity of the breast lesion classification model based on the features derived from the YOLOv3 dataset are similar to those obtained by other groups [55,56]. Our model obtained balanced values of sensitivity and specificity, which implies that it has an almost equal ability to discriminate malignant and benign breast lesions. This is also the case for the RS derived from the Manual Segmentation dataset. The classification model based on the gold standard manual segmentation-derived RS obtained higher sensitivity and specificity than the YOLOv3 model, and its overall accuracy was higher than that of the other developed breast lesion classification functions. However, drawing ROC curves has an advantage over calculating overall accuracy in describing the performance of a classification model [57]. ROC graphs are plotted for different classification thresholds of machine learning or deep learning algorithms and thus indicate the robustness of the developed classification function. Furthermore, ROC curves allow the calculation of the AUROC, which represents the discriminative ability of a model. The value of the AUROC indicates how likely it is that the classifier will rank a randomly selected positive sample higher than a randomly selected negative sample [58]. Therefore, a classification model with a higher AUROC is more likely to classify a truly positive sample correctly. In our study, the Manual Segmentation dataset-derived classification had the highest AUROC of all the developed breast lesion classification models. The second best AUROC was obtained by the classification model built with the RS selected from the YOLOv3 dataset. However, the statistical analysis of the ROC curves obtained for the RS derived from the YOLOv3 detection bounding boxes and the gold standard manual segmentations revealed that these two breast lesion classification models are comparable. Therefore, both models can be used for the task, regardless of the class distribution or misclassification costs indicated by the precision metrics. The opposite conclusion was reached for the RS derived from the Viola–Jones dataset; this model also obtained the lowest AUROC. Its final classification outcome may have been worsened by the high number of false positive samples in the Viola–Jones classification dataset.
In the presented study, we concluded that bounding boxes comprising the breast lesion and adjacent tissue are promising candidates for building a breast cancer classification model. Furthermore, the classification results obtained using these bounding boxes for building effective RS are statistically comparable to those computed with RS derived from accurate segments. In the future, it would be interesting to compare the performance of our breast lesion classification method with a deep learning classification network. This would include evaluating whether a combination of YOLOv3-based lesion detection followed by CNN-based lesion characterization is superior to the use of areas segmented by alternative methods or to a CNN-based analysis of the entire US image. Moreover, the generalizability of the obtained RS may be increased by incorporating additional statistical [59] or filtering feature selection methods [60]. Finally, the performance of our breast lesion classification model may be improved by using unsupervised classification algorithms [61].
Finally, we would like to mention the limitations of our study. First, our dataset was small, which may limit the strength of our conclusions. Second, some lesions in the second data pool were not found by the selected detection methods, resulting in the extraction of descriptive features from the whole image instead of a localized region of interest. Improving the performance of our breast lesion detection method will be an important issue for future studies because it directly influences the extracted RS. Third, using classical machine learning and handcrafted features may have influenced the developed breast lesion classification models [4,32]. Finally, our study did not investigate the classification between different benign breast lesion types. Although the utilized dataset included patients with histologically proven cysts and fibroadenomas, these were not considered separate classes in the lesion classification task due to the small sub-cohorts. Building a balanced dataset with more examples of different benign breast lesion phenotypes would expand the classification abilities of the proposed algorithm. This sub-analysis, however, will be performed once our data repository has sufficiently grown.