A Decision Support System Based on BI-RADS and Radiomic Classifiers to Reduce False Positive Breast Calcifications at Digital Breast Tomosynthesis: A Preliminary Study

Abstract: Digital breast tomosynthesis (DBT) studies were introduced as a successful aid for the detection of calcifications, which can be a primary sign of cancer. Expert radiologists are able to detect suspicious calcifications in DBT, but a high number of calcifications with non-malignant diagnosis at biopsy have been reported (false positives, FP). In this study, a radiomic approach was developed and applied to DBT images with the aim of reducing the number of benign calcifications addressed to biopsy and of giving radiologists a helpful decision support system during their diagnostic activity. This allows patient management to be personalized on the basis of individual risk. For this purpose, 49 patients showing microcalcifications on DBT images were retrospectively included, classified by BI-RADS (Breast Imaging-Reporting and Data System) and analyzed. After segmentation of microcalcifications from DBT images, radiomic features were extracted. Features were then selected with respect to their stability within different segmentations and their repeatability in test-retest studies. Stable radiomic features were used to train, validate and test (nested 10-fold cross-validation) a preliminary machine learning radiomic classifier that, combined with BI-RADS classification, allowed a reduction in FP by a factor of 2 and an improvement in positive predictive value of 50%.


Introduction
Breast calcifications are a diagnostic challenge in mammography interpretation and frequently prompt a needle biopsy [1], being a possible sign of breast cancer (BC) [2].
The introduction of quasi-three-dimensional (3D) acquisition with digital breast tomosynthesis (DBT) has brought considerable advantages in BC detection rates and also, in some studies, lowered the false positive (FP) rate and hence the recall (i.e., assessment) rate of patients [3]. In addition, it is worth noting that DBT vacuum-assisted biopsy (VAB) has been recently shown to significantly reduce operation time and radiation exposure [4].
However, accurate visualization of calcifications and discrimination between benign and malignant ones remain an issue for human readers even with DBT. Impressions from initial studies on DBT in the 1990s suggested lower accuracy compared to standard digital mammography (DM), mainly because a cluster of microcalcifications may be visible on different two-dimensional (2D) images, with poor resolution in out-of-focus images and lack of comprehensive cluster visualization [5]. These drawbacks have been overcome using DBT image series and/or DBT-derived synthetic 2D views, offering a visualization of calcifications similar to, or even enhanced with respect to, standard DM. Notwithstanding these improvements, the malignancy rate of the calcifications addressed to needle biopsy on the basis of DBT remained relatively low, as it was with 2D mammography, with a non-negligible number of FP, as reported by Lang et al. [6].
Radiomics is a relatively new image-analysis approach allowing the quantitative measurement of high-throughput features from radiological images that are supposed to express the heterogeneity of texture, shape and size of distinct tissue phenotypes correlated to different clinical outcomes [7]. When combined with machine learning algorithms, the radiomic approach was proven able to provide automatic computer-aided classifiers of radiological images. Indeed, radiomic analysis has already been widely implemented in the last few years in various clinical applications [8][9][10], showing promising results in clinical decision support systems (DSS) and fostering highly tailored medical decision-making in both diagnosis and prognosis [11,12].
Radiomic analysis of DBT (as well as DM) images is supposed to have the intrinsic capability to capture and quantitatively measure those morphometric and textural heterogeneity features of calcifications invisible to the radiologists' eyes that can be associated with malignancy or poor prognosis, potentially addressing the above-mentioned shortcomings, thus boosting DBT diagnostic performance in women recalled for assessment after calcifications are detected at screening DM [3] or when using DBT as a screening tool, as recently allowed by guidelines issued by the European Commission Initiative on Breast Cancer [13].
While computer-aided detection of calcifications on DBT has been the object of various studies for automatic diagnosis [14][15][16][17][18], few radiomic applications predicting the malignant or benign nature of calcifications on DBT images have been reported [19,20]. To the best of our knowledge, no published study has shown that radiomics applied to DBT attains a higher diagnostic accuracy, which would ultimately reduce FP and underpin personalized risk strategies for the management of breast calcifications, such as replacing immediate biopsy referral with an alternative recall for low-risk patients, according to a personalized medicine approach.
In this study, we therefore conducted a radiomic analysis of DBT calcifications with the aim of predicting the risk of malignancy among those that would be addressed to biopsy by the radiologists and of giving them a helpful DSS during their diagnostic activity.

Dataset
In this retrospective, non-consecutive, study, DBT images acquired from patients at our Institution between May 2018 and December 2019 were retrieved, retrospectively evaluated and collected for a radiomic-based classification study to predict malignant vs. benign breast calcifications with low FP ratio. Patient consent was waived by the Ethical Committee due to the retrospective nature of this study. In particular, the study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethical Committee of Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico in Milan (protocol code Tomo-AI; protocol-ID 1666; approved on 14 October 2020).
DBT images were acquired using two full-field digital mammography systems with tomosynthesis (Selenia Dimensions; Hologic Inc., Marlborough, MA, USA). At the time of the study, both systems were working within the manufacturer's specifications and EC quality control regulations. Both the cranio-caudal and the mediolateral-oblique projections of the patients' breasts were reviewed by an expert breast radiologist with more than 10 years of experience, and the projection in which calcifications were most evident was ultimately analyzed to provide the diagnosis.
Inclusion criteria for the study were: patients with higher than normal risk (e.g., familial risk) with suspicious microcalcifications detected on DBT by an expert radiologist reader, classified by the radiologist according to the Breast Imaging-Reporting and Data System (BI-RADS) radiological classification [21], diagnosed by the radiologist as "suspected of malignancy" (positive at DBT), then sent to DBT-VAB for histopathological reports and followed up for 1-2 years.
DBT-VAB was performed by three dedicated breast radiologists (with 5-15 years of experience) using a prone breast biopsy system (Affirm Prone and Eviva; Hologic Inc., Marlborough, MA, USA) in combination with a 9-gauge needle for tissue sampling, acquiring at least 12 specimens for each patient.
The DBT images were selected so as to fulfill a 2:1 ratio of benign to malignant calcifications, as diagnosed at histopathological reports (and clinical follow-up), in order to properly train the model for low false-positive performance.

Radiologist Classification
For all patients considered in this study (with calcifications found positive at DBT), an expert radiologist reader performed the BI-RADS classification for each calcification on DBT images, fully blinded with respect to the histological results of breast calcifications at DBT-guided VAB (and with respect to the clinical follow up in case of B3).

Radiomic Classifier
Radiomic methodology was applied to the included DBT images of patients, according to the International Biomarker Standardization Initiative (IBSI) guidelines [22].
For this purpose, the TRACE4© radiomic platform was used [23], allowing the whole IBSI-compliant radiomic workflow combined with ensembles of machine learning (ML) classifiers to be obtained in a fully automatic way.
IBSI radiomic workflow included: (i) the segmentation of the calcification region from each patient DBT image, (ii) the preprocessing of image content within the segmented region of interest for the radiomic feature extraction, (iii) the extraction of radiomic features from the segmented region of interest, (iv) the selection of radiomic features stable with respect to different segmentations (it may occur that different human operators segment the calcification regions of different images) and repeatable in test-retest study, (v) the use of such stable radiomic features to train, validate, and test different ensembles of ML classifiers in the binary classification task of interest (malignant vs. benign), including the reduction of such stable and repeatable features to not-redundant features in a number that is statistically proper to the number of included image samples of patients.
More specifically: The segmentation of the calcification region was performed manually, slice by slice, by the expert radiologist, using the TRACE4 Segmentation tool.
The preprocessing of image intensities within the segmented region of interest included resampling to isotropic voxel spacing, using a down-sampling scheme by considering image slice thickness of 1 mm and intensity discretization using a fixed number of 64 bins.
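As an illustration of the intensity discretization step, the following minimal sketch applies a fixed-bin-number scheme with 64 bins to a list of ROI intensities. It follows the generic fixed-bin-number formula; the function name and the handling of a constant ROI are assumptions made for this example, and the sketch is not the TRACE4 implementation.

```python
import math


def discretize_fixed_bin_number(intensities, n_bins=64):
    """Map ROI intensities to integer grey levels 1..n_bins.

    Generic fixed-bin-number scheme: the ROI intensity range is split into
    n_bins equal-width bins. Hypothetical helper, for illustration only.
    """
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        # Assumption: a constant ROI is assigned to the first bin.
        return [1 for _ in intensities]
    levels = []
    for x in intensities:
        level = math.floor(n_bins * (x - lo) / (hi - lo)) + 1
        # The maximum intensity would fall in bin n_bins + 1; clamp it.
        levels.append(min(level, n_bins))
    return levels
```

With this scheme the minimum ROI intensity always maps to grey level 1 and the maximum to grey level 64, regardless of the absolute intensity scale of the image.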
The definition, computation and nomenclature of the extracted radiomic features are compliant with the IBSI guidelines, except for the features of the morphology family, originally designed for 3D images, which were replaced with ten 2D equivalent features (e.g., the 3D features volume and surface were replaced with the 2D features area and perimeter, respectively).
Additional details:
• Morphology (10 features). Morphological features (such as area) describe the geometry of an ROI and are based on the voxel contained in the analyzed ROI. For this feature class, IBSI guidelines could not be entirely followed, since 3D images are required while mammography images are 2D images by definition.
• Intensity-based statistics (18 features). Intensity-based statistical features (such as mean intensity, median intensity and intensity variance) describe the intensity distribution within the ROI.
• Intensity histogram (22 features). To calculate intensity histogram features (such as mean discretized intensity, discretized intensity variance and discretized intensity kurtosis), the original intensity distribution is discretized into intensity bins.
• Grey level co-occurrence matrix (100 features). These features are second-order features describing image texture according to pixels' distribution. For this purpose, a specific matrix was designed to represent in each element the number of times two specific pixels are at a defined distance and angle. The given matrix was used to calculate features such as autocorrelation, cluster shade and cluster prominence [24,25].
• Grey level run length matrix (63 features). These features derive from a support matrix representing the number of times specific pixel/voxel intensity is present in a given direction. Examples for this group of features are grey level non-uniformity, run-length non-uniformity and grey level variance [24,25].
• Grey level size zone matrix (32 features). Features of this group (such as small and large area emphasis, describing the distribution of small and large size zones respectively, size-zone non-uniformity, and zone percentage) quantify the grey level zone of an image, defined as the zone where adjacent (i.e., with distance equal to 1) pixels/voxels share the same intensity [25,26].
• Grey level distance zone matrix (30 features). These features describe how many homogeneous connected areas are present within the Region-Of-Interest (ROI) volume, considering a certain intensity and distance to the shape border [6].
• Neighborhood grey tone difference matrix (10 features). Features from this group (such as coarseness, contrast and busyness) describe spatial changes in the intensity of pixels/voxels in the ROI, analyzing differences between a specific pixel and the surrounding ones. This is accomplished through the creation of a one-dimensional matrix containing, for each pixel intensity value, the summation of the differences between the analyzed pixel and all surrounding neighbors [26,27].
• Neighboring grey level dependence matrix (34 features). These features describe the grey level dependency, defined as the number of voxels within a given distance from the central voxel that they depend on. A neighboring voxel is considered dependent if the difference between that voxel and the central voxel is smaller than a defined threshold. Therefore, the ensuing grey level dependence matrix contains the number of times a specific voxel has n dependent voxels in its neighborhood [26,27].
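To make the co-occurrence idea concrete, the sketch below counts how often two grey levels occur at a given offset in a small 2D image and derives the autocorrelation feature from the normalized matrix. It is a simplified illustration under stated assumptions (a single offset, grey levels already discretized and stored as 0-based integers, a symmetric matrix by default); the function names are hypothetical and the code does not reproduce the TRACE4 feature extraction.

```python
def glcm(image, dx=1, dy=0, symmetric=True):
    """Grey level co-occurrence counts for one pixel offset (dx, dy).

    `image` is a list of rows of integer grey levels starting at 0
    (assumption for this sketch). Entry [i][j] counts pixel pairs in which
    a pixel of level i has a neighbor of level j at the given offset.
    """
    n_levels = max(max(row) for row in image) + 1
    counts = [[0] * n_levels for _ in range(n_levels)]
    n_rows, n_cols = len(image), len(image[0])
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    # Also count the pair seen from the opposite direction.
                    counts[j][i] += 1
    return counts


def glcm_autocorrelation(counts):
    """Autocorrelation: sum over i, j of i * j * p(i, j), with 1-based levels."""
    total = sum(sum(row) for row in counts)
    return sum(
        (i + 1) * (j + 1) * value / total
        for i, row in enumerate(counts)
        for j, value in enumerate(row)
    )
```

In practice, matrices computed for several directions are typically aggregated before or after feature computation; the aggregation strategy is not specified here and is left out of the sketch.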
These steps were performed using the TRACE4 Radiomic tool. Radiomic features were reported by TRACE4 according to IBSI standards.
The selection of radiomic features that were stable with respect to different segmentations and repeatable in the test-retest study was performed by ICC (ICC > 0.80) when comparing features obtained by data augmentation strategies, (a) randomly manipulating the manual segmentation of the lesion region (performed by the expert operator), and (b) rotating the original images and segmentations. The selected radiomic features (stable and repeatable) were reported by TRACE4.
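The stability criterion can be sketched as follows. The ICC variant used by the platform is not stated in the text, so this example assumes a one-way random-effects, single-measurement ICC, computed per feature over the calcifications (rows) and their perturbed measurements (columns). Function names and the input layout are hypothetical.

```python
def icc_oneway(measurements):
    """One-way random-effects ICC for a table of repeated measurements.

    `measurements` has one row per calcification and one column per
    perturbed segmentation or rotation (assumed layout). The choice of
    this ICC variant is an assumption of the sketch.
    """
    n = len(measurements)
    k = len(measurements[0])
    grand_mean = sum(sum(row) for row in measurements) / (n * k)
    row_means = [sum(row) / k for row in measurements]
    # Between-subject and within-subject mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(measurements, row_means) for x in row
    ) / (n * (k - 1))
    denominator = msb + (k - 1) * msw
    if denominator == 0:
        return float("nan")  # constant feature: ICC is undefined
    return (msb - msw) / denominator


def stable_features(feature_tables, threshold=0.80):
    """Return the names of features whose ICC exceeds the threshold."""
    return [
        name
        for name, table in feature_tables.items()
        if icc_oneway(table) > threshold
    ]
```

A feature that barely changes under perturbation of the segmentation yields an ICC close to 1 and is retained, whereas a feature dominated by perturbation noise falls below the 0.80 threshold and is discarded.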
Two different ensembles of ML classifiers were trained, validated, and tested for the binary classification task (malignant vs. benign or negative, based on histopathology and radiological results), selecting stable, reproducible and non-redundant features. Oversampling of the minority class (malignant) was performed with the adaptive synthetic sampling (ADASYN) method in order to balance the training set.
The first ML system considered was an ensemble of 200 Decision Trees combined with the Gini index; the second was an ensemble of 100 Support Vector Machines combined with principal component analysis and the Fisher discriminant ratio. For both systems, a nested k-fold cross-validation method was used (k = 10), and a majority vote rule was applied to assign the binary classes.
For both classification ensembles, the predictive performances were measured across the different folds (k = 10) in terms of mean Accuracy, Sensitivity, Specificity and AUC (with 95% Confidence Interval and p-value), as well as FP and FN. The classification system with the best performances was chosen as the best classification system for the binary task of interest (malignant vs. benign).
Moreover, a permutation test was performed to assess the statistical significance of the results and to exclude the presence of false discoveries. The permutation test consists of (1) randomly permuting the labels associated with the DBT images, (2) performing training, validation and testing using these permuted labels, and (3) computing classification performance. This procedure is repeated 100 times, thus obtaining a set of 100 "permuted models" and 100 corresponding classification performances. This set of 100 classification performances is then compared to the performance obtained by the model generated with the non-permuted labels. A p-value is calculated as the fraction of "permuted" models that performed better than the "original" model. These steps were performed by using the TRACE4 Statistics and Modeling tool.
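The two mechanisms named above, nested k-fold splitting and majority voting, can be sketched in a few lines. This is an illustration under stated assumptions: folds are built without shuffling or stratification, the inner loop also uses 10 folds, and a tie in the vote is resolved as "malignant". None of these choices is specified in the text, and the function names are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k disjoint folds."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def nested_splits(n_samples, k_outer=10, k_inner=10):
    """Yield (test, inner) pairs for nested cross-validation.

    `test` is the outer test fold; `inner` is a list of (train, validation)
    splits built only from the remaining samples, so that test samples are
    never seen during training or model selection.
    """
    for test in k_fold_indices(n_samples, k_outer):
        held_out = set(test)
        remaining = [i for i in range(n_samples) if i not in held_out]
        inner = []
        for positions in k_fold_indices(len(remaining), k_inner):
            validation = [remaining[p] for p in positions]
            validation_set = set(validation)
            train = [i for i in remaining if i not in validation_set]
            inner.append((train, validation))
        yield test, inner


def majority_vote(votes, tie_label="malignant"):
    """Return the most frequent label; ties go to `tie_label` (assumption)."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]
```

For a cohort of 49 samples, `nested_splits(49)` produces ten outer test folds of four or five samples each, and every sample is tested exactly once.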
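The permutation procedure can be sketched as below. The callable `train_and_score` is a hypothetical placeholder for the full train, validate and test pipeline, returning one performance value (e.g., accuracy) for a given labeling. The p-value is taken here as the fraction of permuted models scoring strictly better than the original one; this reading, the seed handling and the function names are assumptions of the example.

```python
import random


def permutation_p_value(observed_score, permuted_scores):
    """Fraction of permuted models that score better than the original."""
    better = sum(1 for score in permuted_scores if score > observed_score)
    return better / len(permuted_scores)


def permutation_test(labels, train_and_score, n_permutations=100, seed=0):
    """Compare the original model against models built on shuffled labels.

    `train_and_score(labels)` is a hypothetical callable that runs the whole
    modeling pipeline on the given labels and returns a performance score.
    """
    rng = random.Random(seed)
    observed = train_and_score(list(labels))
    permuted_scores = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # break the image-label association
        permuted_scores.append(train_and_score(shuffled))
    return observed, permutation_p_value(observed, permuted_scores)
```

A more conservative variant counts ties as well and adds one to both numerator and denominator, which avoids reporting a p-value of exactly zero.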

Dataset
In this retrospective study, 49 DBT images including 49 breast calcifications from 49 patients (mean age 51 years, interquartile range 41-81 years) who matched inclusion criteria (positives at DBT for an expert radiologist, see Materials and Methods) were ultimately analyzed and used to build a radiomic-based classifier designed to reduce FP microcalcifications at DBT.
All patients had their calcification histopathological reports available from DBT-guided VAB and 1-2 years of follow-up. Histological results at biopsy were malignant (B4 or B5) in 18/49 calcifications (37%) and benign (B1, B2, or B3 with disease-free survival at 1-2 years of follow-up) in 31/49 (63%), with a benign-to-malignant ratio equal to 1.7, thus approximately matching the criterion of a 2:1 benign-to-malignant ratio (refer to Materials and Methods). Among the 31 benign calcifications, 19 were diagnosed as B2 at biopsy, 4 were diagnosed as B1, and 8 were reported as B3 at biopsy but had a disease-free survival of 12-24 months. Thus, calcification class labels ("malignant" or "benign") were known and used as the reference standard for the supervised training of the radiomic classifier (refer to Materials and Methods). Table 1 shows the reference standard classification for every single patient included in the study.

Radiologist Classification
Table 2 shows, for all 49 patients considered in this study (with calcifications found positive at DBT), the results of the BI-RADS classification for each calcification as performed by an expert radiologist reader on DBT images (refer to Materials and Methods). The histological results of breast calcifications at DBT-guided VAB (with the clinical follow-up in the case of B3) (reference standard classification) are reported for comparison. Among the 26 calcifications classified as BI-RADS 4 or 5, 12 were benign (FP) at DBT-guided VAB, and 14 were malignant (true positives, TP). Among the 23 classified as BI-RADS 3, 19 were benign (FP) and 4 were malignant (TP). An overall Positive Predictive Value (PPV) of 37% (18/49) was found from the radiologist assessment (31 FP, refer to Table 3). However, PPV was 54% for BI-RADS 4 or 5 and only 17% for BI-RADS 3.

Radiomic Classifier
A total of 319 radiomic features compliant with IBSI guidelines [23,24] were extracted from each ROI similarly segmented over the 49 breast calcifications on their DBT images (refer to Materials and Methods). The list, nomenclature and values of such radiomic features are reported in Supplementary file S1 for the above-mentioned representative patient with malignant calcification.
Among the 319 radiomic features, 150 were found stable considering different segmentation perturbations originating from the segmentations defined by the expert radiologist, and the test-retest study (intra-class correlation coefficient > 0.8) (refer to Materials and Methods). Stable features were found belonging to the following sub-groups: morphology (7 features), intensity-based statistics (16 features), intensity histogram (7 features) [...] (see Table 4 for a summary of the performances and Figure 2 for ROC-AUC). Permutation test showed a p < 0.005 for the presence of false discoveries with respect to Accuracy, Specificity, AUC, and PPV, and a p = 0.02 for NPV. No statistical significance was found for Sensitivity (p = 0.1).
Table 4. Performances obtained in testing the best ensemble of machine learning radiomic classifiers for benign calcifications versus malignant calcifications. Performances are reported as sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and Area Under the Curve (AUC), (95% Confidence Interval), * = p-value < 0.05, ** = p-value >
Radiomic classification results are reported in Table 5, for each tested patient (with reference standard classification from DBT-guided VAB and follow-up reported for comparison), and in Table 6, for all tested patients, in terms of True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). As the result of our study most likely to impact the management of patients, we can observe from Table 6 that only six calcifications were classified by our radiomic classifier as FP, a reduction by a factor of 5 in FP with respect to radiological classification (31 FP, Table 3). Among these six FP, we did not find BI-RADS 3 (from the total of 23 BI-RADS 3), suggesting that such a radiomic classifier could be effective in reducing FP in this class for radiologists. The overall Positive Predictive Value (PPV) of our radiomic classifier was 70% (14/20). However, due to the non-negligible number of FN (4 of the 18 malignant calcifications, 22%), our radiomic classifier cannot be used as an automatic reader of DBT images, at least in this preliminary state. In combination with an expert radiologist's reading and classification, however, it may be used to reduce FP in the BI-RADS 3 class and to stratify the risk.
According to such a combined approach, we defined the following DSS for the BI-RADS 3 class: (1) if the radiologist reader classifies a breast calcification as BI-RADS 3 AND the radiomic classifier classifies as BENIGN -> DSS predicts low risk of malignancy; (2) if the radiologist reader classifies a breast calcification as BI-RADS 3 AND the radiomic classifier classifies as MALIGNANT -> DSS predicts a high risk of malignancy; (3) if the radiologist reader classifies a breast calcification as BI-RADS 4 or 5 -> DSS assigns a high risk of malignancy.
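The three rules above translate directly into a small function. The sketch below is illustrative only: the function name, the label strings and the error handling for inputs outside the study's scope (BI-RADS categories other than 3, 4 or 5) are assumptions of this example.

```python
def dss_risk(bi_rads, radiomic_label):
    """Combine the BI-RADS category with the radiomic prediction.

    bi_rads: integer BI-RADS category assigned by the radiologist (3, 4 or 5).
    radiomic_label: "benign" or "malignant", as predicted by the classifier.
    Returns "low" or "high" risk of malignancy.
    """
    if radiomic_label not in ("benign", "malignant"):
        raise ValueError("radiomic_label must be 'benign' or 'malignant'")
    if bi_rads in (4, 5):
        # Rule (3): the radiomic output does not change the assignment.
        return "high"
    if bi_rads == 3:
        # Rules (1) and (2): the radiomic classifier stratifies BI-RADS 3.
        return "high" if radiomic_label == "malignant" else "low"
    raise ValueError("the DSS is defined only for BI-RADS 3, 4 or 5")
```

By construction, the radiomic classifier can only downgrade a BI-RADS 3 finding to low risk; it never overrides a BI-RADS 4 or 5 assessment.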
Overall, there were 15 FP for our DSS, a reduction by a factor of 2 with respect to the FP from radiological classification (31). The overall PPV was 54.5% (18/33), a relative improvement of about 50% over the PPV of the radiologist (37%) (refer to Table 7 for detailed results and Table 8 for the confusion matrix summarizing the reference-standard and DSS-classification results of the tested patients).
Table 7. Reference standard classification (from DBT-guided VAB and follow-up) and DSS classification, for each single patient included in the study.
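The figures quoted in this section can be checked from the TP and FP counts reported in the text. The short sketch below recomputes the PPV of the radiologist, of the radiomic classifier and of the DSS, together with the FP reduction factor and the relative PPV improvement. The variable names are chosen for this example only.

```python
def ppv(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


# TP and FP counts as reported in the text.
counts = {
    "radiologist": {"tp": 18, "fp": 31},
    "radiomic": {"tp": 14, "fp": 6},
    "dss": {"tp": 18, "fp": 15},
}

ppv_by_reader = {name: ppv(c["tp"], c["fp"]) for name, c in counts.items()}

# FP reduction of the DSS with respect to the radiologist.
fp_reduction_factor = counts["radiologist"]["fp"] / counts["dss"]["fp"]

# Relative PPV improvement of the DSS over the radiologist.
relative_ppv_gain = ppv_by_reader["dss"] / ppv_by_reader["radiologist"] - 1
```

The relative gain evaluates to 16/33, i.e., roughly 48%, which is consistent with the rounded improvement of about 50% stated above.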

Discussion
Radiomics derives multiple quantitative features from single or multiple medical imaging modalities and techniques, highlighting image traits that are not visible to the naked eye and hence potentially augmenting, to a significant degree, the diagnostic and prognostic power of medical imaging. This quantitative "big data" approach is a relatively new discipline with potentially limitless applications in clinical practice and research [28,29].
The strengths of radiomics in oncology, however, have until now been demonstrated mostly for tomographic imaging techniques, which are inherently quantitative tools and for which radiomics can provide a comprehensive noninvasive characterization of the whole 3D tumor, defining what has been named the radiomics signature of the tumor [7]. Among published studies, radiomics has been increasingly gaining ground to improve cancer diagnosis, monitoring of treatment response and prognosis, also in the field of breast care [30].
In this work, a radiomic classifier was developed for DBT images of suspected breast calcifications to predict associated malignancy at needle biopsy, with the particular purpose of reducing the FP of DBT. We analyzed a total of 49 breast calcifications detected at DBT as positive, classified by an expert radiologist according to BI-RADS, and addressed to DBT-guided VAB, with known histopathologic characterization and follow-up at 1-2 years. We used these supervised associations to train an ensemble of ML classifiers to automatically distinguish malignant from benign cases from their DBT. Such a predictive radiomic classifier achieved an Accuracy, Sensitivity, Specificity, AUC, PPV and NPV of 0.82, 0.78, 0.85, 0.80, 0.74, and 0.88, respectively, according to a nested 10-fold cross-validation classification. A sensitivity of 78% is currently insufficient to avoid immediate referral to VAB only on the basis of our radiomic model. However, since such limited performances have been obtained on a limited cohort of patients, we can expect that our preliminary classifier could increase its predictive power when a larger sample (labeled by histological finding and follow-up) becomes available.
Concerning the application of radiomics to 2D DM for calcifications, very few papers have been published. Chen et al. [31] showed that a multimodal radiomic model, consisting of radiomic features from both mammography and dynamic contrast-enhanced magnetic resonance imaging (MRI), combined with a random forest classifier, achieved a Sensitivity of 83% and a Specificity of 80% at Leave-One-Out validation, performances comparable to those obtained by our single-modality DBT-based radiomic classifier with an ensemble of random forests. Moreover, our performances were validated with a more robust method (nested k-fold cross-validation) and also tested with a permutation test.
In literature, radiomics has been also applied to other mammographic techniques such as contrast-enhanced spectral mammography. Mao et al. [32] published a multicenter study showing that a radiomics nomogram of contrast-enhanced spectral mammography was able to predict axillary lymph node metastasis in breast cancer. Other authors showed correlations between mammographic radiomics features and the level of tumor-infiltrating lymphocytes in patients with triple-negative breast cancer [33].
A recent study on Chinese patients published by Zhang et al. [34] confirmed the promise of radiomics combined with DBT to distinguish malignant from benign calcifications, showing that, when radiomic features can be extracted from the reconstructed volume of calcifications and integrated into the 2D radiomic features, the model is able to reduce the FP rate by up to 20%. Notably, in our work, the number of FP from our radiomic classifier was reduced by a factor of 5 with respect to the initial number classified by the expert radiologist, thus effectively reducing the risk of inappropriate biopsies. However, we must consider that a reduction in FP rate without a concurrently high NPV (over 98%) has no clinical impact if BI-RADS clinical rules are applied.
In order to account for this, we designed a DSS based on both the radiological classification (BI-RADS) and the low FP rate achieved by our radiomic classifier on suspected breast calcifications classified as BI-RADS 3. The final results of such a DSS were very good (Table 4), showing a reduction by a factor of 2 in FP with respect to the initial expert radiologist, at no cost in FN, and an improvement in the PPV of 50%, suggesting the potential of our approach and warranting further validation of these preliminary results with larger datasets. If confirmed by multicenter studies with large sample sizes, our DSS could support the decision, for the BI-RADS 3 category, not to send to biopsy but to recall for monitoring, or to downscale the diagnostic category from BI-RADS 3 to BI-RADS 2.
Our study has important limitations. First, its retrospective single-center design made it difficult to obtain a larger sample size useful to train and independently test both the radiomic classifier and the DSS developed. However, the examinations were collected using different DBT units; thus, we can consider the performance of our model to be somewhat robust with respect to independent DBT systems. Second, our study lacks a temporally and geographically independent dataset to test the developed models. However, we performed a robust nested k-fold cross-validation on the radiomic model to avoid using testing data during training, and, as already highlighted above, a permutation test to assess the statistical significance of our results and to exclude the presence of false discoveries (p < 0.05). Third, the non-consecutive enrollment of cases does not allow a disease-prevalence interpretation of the PPV and NPV obtained in our work, although they can be compared quantitatively. Fourth, the DBT system guiding the VAB was different from the one used for the initial image acquisition. Potential differences between those images could have influenced the choices of the radiologist performing the VAB (site of biopsy, number of samplings). However, the aim of the VAB is to sample the target finding identified on the initial DBT images, and careful breast positioning and repeated comparison with the initial images are performed to obtain a precise tissue sampling, so that the pathological examination can be strictly correlated with the initially detected finding. Moreover, it should be considered that the aim of this work was to provide a decision-support system to reduce the false positive cases sent to biopsy, using a combination of negative follow-up and final pathology for all cases of negative or borderline (B1, B2, B3) VAB results.

Conclusions
In conclusion, we think that our preliminary results open the way to further research in radiomics of calcifications on DBT, in particular for reducing the FP rate and improving diagnostic confidence in patients with suspected malignancy through a risk stratification approach. In our study, this advantage seems to be particularly useful for the borderline BI-RADS 3 category, and in larger studies it could extend to BI-RADS 4a, suggesting that immediate biopsy referral could be replaced by an alternative recall for low-risk patients. Increased sample size and integration with clinical, personal and history data may lead to increased performances of radiomic-based classifiers and DSS in this particular field.
Informed Consent Statement: Patient consent was waived by the Ethical Committee due to the retrospective nature of this study.
Conflicts of Interest: C.S. is CEO of DeepTrace Technologies S.R.L, a spin-off of Scuola Universitaria Superiore IUSS, Pavia, Italy. I.C. and M.I. own DeepTrace Technologies S.R.L shares. All other authors declare that they have no conflict of interest and that they have nothing to disclose.