Breast Lesion Classification with Multiparametric Breast MRI Using Radiomics and Machine Learning: A Comparison with Radiologists’ Performance

Simple Summary: Currently, breast contrast-enhanced MRI is the most sensitive imaging technique for breast cancer detection; however, its specificity is low given the common characteristics shared by benign breast lesions and some cancers. This leads to a high number of false-positive cases and, therefore, unnecessary biopsies. Multiparametric MRI including diffusion-weighted imaging assists in this task by increasing the specificity for breast lesion discrimination. Nevertheless, interpretation of breast MRI is still highly dependent on the reader's level of experience. Our work combines radiomic features extracted from multiparametric MRI to generate predictive models for breast cancer differentiation. Additionally, decision support models were compared with the performance of two dedicated breast radiologists for lesion differentiation. Our work demonstrates the potential of multiparametric radiomics coupled with machine learning to be implemented in clinical practice for lesion differentiation on breast MRI. AI algorithms show value to assist less experienced readers, improving the accuracy for breast lesion discrimination.
Abstract: This multicenter retrospective study compared the performance of radiomics analysis coupled with machine learning (ML) with that of radiologists for the classification of breast tumors. A total of 93 consecutive women (mean age: 49 ± 12 years) with 104 histopathologically verified enhancing lesions (mean size: 22.8 ± 15.1 mm), classified as suspicious on multiparametric breast MRIs, were included. Two experienced breast radiologists assessed all of the lesions, assigning a Breast Imaging Reporting and Data System (BI-RADS) suspicion category, providing a diffusion-weighted imaging (DWI) score based on lesion signal intensity, and determining the apparent diffusion coefficient (ADC). Ten predictive models for breast lesion discrimination were generated using radiomic features extracted from the multiparametric MRI.
The area under the receiver operating characteristic curve (AUC) and the accuracy were used to assess performance, and diagnostic accuracies were compared using McNemar's test. Multiparametric radiomics with DWI score and BI-RADS (accuracy = 88.5%; AUC = 0.93) and multiparametric radiomics with ADC values and BI-RADS (accuracy = 88.5%; AUC = 0.96) models showed significant improvements in diagnostic accuracy compared to the multiparametric radiomics (DWI + DCE data) model (p = 0.01 and p = 0.02, respectively), but performed similarly compared to the multiparametric assessment by radiologists (accuracy = 85.6%; p = 0.39). In conclusion, radiomics analysis coupled with the ML of multiparametric MRI could assist in breast lesion discrimination, especially for less experienced readers of breast MRIs.


Introduction
Medical imaging has always played a pivotal role in breast cancer diagnosis and treatment decision-making. The inherently high sensitivity of dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) (81-100%) [1] has led to its wide use in the evaluation of breast cancer, with many indications. Despite its powerful ability to identify abnormalities in the breast, DCE-MRI has limitations, such as reduced availability, high cost, and a limited pooled specificity of 70% [2][3][4][5][6][7].
In the last decade, radiomics has become an area of increasing interest. As a technique that mines quantitative imaging features that are hidden from the radiologist's eye, radiomics can be combined with clinical data (e.g., histopathologic, genomic, or molecular information) and artificial intelligence (AI) to build algorithms that are capable of emulating the human brain in tasks of learning and problem solving. To date, the development of such decision-support algorithms for breast cancer evaluation has mainly relied on radiomics data derived from DCE-MRI. These algorithms have been applied to the characterization of different molecular profiles of breast cancer [19,20], the prediction of the likelihood of axillary lymph node metastatic involvement [21], and the probability of tumor response to chemotherapy treatment [22], as well as the differentiation between breast lesions [23][24][25][26][27][28][29]. However, how these decision support models should be implemented clinically is still to be determined, particularly in the setting of multiparametric MRI.
The purpose of this multicenter retrospective study was to evaluate the diagnostic value of radiomics coupled with machine learning (ML) of multiparametric MRI as used in the clinical routine by comparing its performance with that of experienced radiologists in the classification of enhancing breast tumors. Multiparametric MRI-based algorithms could help less experienced breast MRI readers in the task of breast lesion differentiation.

Study Sample
This institutional review board-approved multicenter retrospective study was conducted in compliance with the United States Health Insurance Portability and Accountability Act. The need for written informed consent was waived. Some patients were previously reported on in a different context [8,30].
Consecutive patients were identified following a review of databases from Memorial Sloan Kettering Cancer Center (MSK), spanning the period from January 2018 to March 2020, and the Medical University of Vienna (MUV), spanning the period from January 2011 to August 2014. Figure 1 illustrates the selection of the patients included in the study. Inclusion and exclusion criteria are described in the Supplementary Materials.
DWI was acquired consistently before gadolinium-based contrast injection, and the apparent diffusion coefficient (ADC) maps were obtained using built-in software. The structure and parameters for both MRI protocols are presented in the Supplementary Materials (Tables S1 and S2).

Imaging Evaluation and Processing
Lesions were manually segmented for radiomics analysis. Two breast radiologists (IDN and RLG), each with five years of experience in breast imaging, reviewed Digital Imaging and Communications in Medicine (DICOM) images from early post-contrast T1-weighted imaging, DWI, and ADC mapping in consensus to segment lesions. Lesions from the three sets of images were matched on OsiriX viewer v 9.0, and the slice containing the largest lesion diameter was recorded. Subsequently, one 3D segmentation was performed on each set of images by using the online ITK-SNAP v 3.6.0 tool. The same number of segmentations was performed per radiologist by delineating the borders of each lesion in every slice where it was visible to obtain a volume of interest (VOI). In the case of DW images, VOIs were directly extrapolated to ADC maps and manually corrected in the case of mismatched areas for feature extraction.
One month after segmentation, independent reads of the multiparametric MRI images were performed. Two radiologists (IDN and JSR) with five and six years of experience in breast imaging, respectively, rated cases in two sessions separated by at least three weeks. The radiologists were blinded to the patients' selection criteria, histopathological results, and previous imaging. In the first reading session, DW images and corresponding ADC maps were reviewed by each radiologist, using the previously recorded slice containing the largest lesion diameter as a reference. A category for suspicion (1-very low, 2-low, 3-intermediate, 4-high, 5-very high) was assigned according to the signal intensity of the lesions on DW images (b = 800 s/mm²). Lesions assigned a category ≥4 were considered positive for malignancy. Additionally, the corresponding ADC values on ADC maps were noted, using a cut-off of 1.3 × 10⁻³ mm²/s. Lesions with ADC values below the cut-off were considered positive for malignancy, as recommended by the European Society of Breast Imaging International Breast Diffusion-Weighted Imaging working group [17]. An example of region of interest placement to obtain ADC values is shown in Figure 2.

Figure 2. Axial MR images of a 48-year-old woman with a 14-mm benign mass in the right breast, of which biopsy yielded fibro-adenomatoid changes (yellow arrows). (A) Axial dynamic contrast-enhanced image depicts a heterogeneous, oval, and circumscribed enhancing mass in the right breast, corresponding to a heterogeneous hyperintense lesion on (B) axial diffusion-weighted imaging (DWI) at a b value of 800 s/mm². (C) Correlative parametric apparent diffusion coefficient (ADC) map with a region of interest (ROI) placed within the darkest part of the lesion and ROI information. ADC values are expressed in mm²/s. This lesion was heterogeneous vs. non-enhancing septa and therefore characterized as BI-RADS 3 and 3 based on the DWI score in the consensus reading of radiologists.
In the second reading session, DCE images were assessed using BI-RADS. Like the suspicion score for the DW images, lesions assigned a category for suspicion ≥4 based on BI-RADS were considered positive for malignancy. Additionally, a multiparametric MRI classification combining BI-RADS categories and ADC values was performed. In cases of discrepancy between the suspicion level for BI-RADS categories and DWI scores, ADC with a cut-off of 1.3 × 10⁻³ mm²/s was used to classify the lesions.
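The reading rule described above can be sketched in a few lines of Python. This is only an illustration, not the readers' actual workflow: the function names are hypothetical, the handling of concordant cases (taking the shared call) is an assumption, and the sketch assumes that an ADC value at or below the cut-off is read as suspicious, in line with the restricted diffusion typical of malignant lesions.

```python
ADC_CUTOFF = 1.3e-3  # mm^2/s, the cut-off stated in the text


def is_positive(category: int) -> bool:
    """Suspicion category (BI-RADS or DWI score): >= 4 counts as positive."""
    return category >= 4


def multiparametric_call(bi_rads: int, dwi_score: int, adc: float) -> bool:
    """Return True if the lesion is called positive for malignancy.

    Assumption: when the BI-RADS and DWI calls agree, that shared call is kept.
    When they disagree, the ADC value decides.
    """
    dce_positive = is_positive(bi_rads)
    dwi_positive = is_positive(dwi_score)
    if dce_positive == dwi_positive:
        return dce_positive
    # Assumption: ADC at or below the cut-off is treated as suspicious.
    return adc <= ADC_CUTOFF
```

For example, a lesion rated BI-RADS 4 with a DWI score of 2 would be resolved by its ADC value alone under this sketch.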
In a third reading session, consensus analysis of the two radiologists for all cases regarding the level of suspicion was made for the BI-RADS, DWI score, and multiparametric MRI suspicion score. This allowed for comparison of the radiologists' performance to that of the ML models. ADC values obtained by the radiologist with more experience in using DWI (IDN) were used for consensus on the multiparametric MRI suspicion score. Additionally, BI-RADS descriptors for enhancing lesions were noted for mass and non-mass enhancement (NME) lesions as shown in Table 1.

Radiomics Analysis
The information extracted from the VOIs derived from DCE and DW images was entered into the Computational Environment for Radiological Research (CERR) software (available on GitHub) using an in-house MATLAB (MathWorks Inc., Natick, MA, USA) code. CERR then allowed for the calculation of radiomic features [31], based on the grey level run length matrix (RLM), grey level co-occurrence matrix (GLCM), grey level size zone matrix (SZM), neighborhood grey tone difference matrix, neighborhood grey level dependence matrix, and first-order statistics.
Data reduction to 16 grey levels was performed to account for the reduced number of pixels in some lesions. To ensure sufficient counting statistics for the calculation of texture features, only a distance of one was regarded between pixels. To optimize the models, lesions containing fewer than 40 pixels were disregarded. As a result, 23 patients with 23 lesions were excluded. The final study sample consisted of 93 patients (30 from MSK and 63 from MUV) with 104 lesions (38 from MSK and 66 from MUV), with 11 patients showing more than one lesion on MRI.
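To make the grey-level reduction and the distance-one co-occurrence idea concrete, the following is a minimal Python sketch. It is not CERR's implementation: it assumes equal-width binning, works on a small rectangular 2D patch rather than a 3D VOI, and counts only horizontal neighbors. All function names are hypothetical.

```python
from collections import Counter


def quantize(pixels, levels=16):
    """Map intensities to integer grey levels 0..levels-1 (equal-width bins)."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0 for _ in row] for row in pixels]
    width = (hi - lo) / levels
    # The maximum intensity would fall in bin `levels`, so clamp it to the top bin.
    return [[min(int((v - lo) / width), levels - 1) for v in row] for row in pixels]


def glcm_horizontal(quantized, distance=1):
    """Symmetric co-occurrence counts for pixel pairs `distance` apart in a row."""
    counts = Counter()
    for row in quantized:
        for a, b in zip(row, row[distance:]):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def glcm_contrast(counts):
    """GLCM contrast: mean squared grey-level difference over all counted pairs."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n * (a - b) ** 2 for (a, b), n in counts.items()) / total
```

With few pixels and many grey levels, most cells of the co-occurrence matrix stay empty, which is the counting-statistics concern that motivates the reduction to 16 levels and the single pixel distance.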

Reference Standard
The reference standard was histopathology obtained through image-guided biopsy in all lesions, whether MRI-guided (30 lesions) or ultrasound-guided (74 lesions). In cases of histopathology that yielded a benign but high-risk lesion (e.g., atypical ductal or lobular hyperplasia or papilloma), the postsurgical histopathology report was consulted to verify concordance with results from image-guided biopsy.

Statistical Analysis and Predictive Model Building
Means (±SD) and medians (range) were used to define continuous variables whereas proportions were used to summarize categorical variables.
Prior to statistical analysis (SPSS version 25, IBM Corp., Armonk, NY, USA), ComBat harmonization was performed to reduce possible variability between the different MRI protocols used [32]. Afterwards, univariable modeling was used to identify radiomic features that differed significantly between benign lesions and cancers. To prevent model overfitting, feature selection was performed using a fivefold cross-validated elastic net, which combines least absolute shrinkage and selection operator (LASSO) regression and ridge regression. We selected the top five radiomic features to ensure sufficient cases per feature for model building, given the size of the minority class. Multivariable modeling with a medium Gaussian support vector machine (SVM) and fivefold cross-validation was then used to generate ML models for breast lesion differentiation. Z-score normalization of the radiomic parameters was used for model building in consideration of the different orders of magnitude found in radiomic features. Figure 3 shows the workflow for radiomics and radiologist analysis.
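Two of the preprocessing steps named above, z-score normalization and the assignment of cases to five cross-validation folds, can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the authors' pipeline: it uses the population standard deviation, assigns folds deterministically within each class (a stratified split is an assumption; the text only states fivefold cross-validation), and the function names are hypothetical.

```python
import statistics


def z_score_columns(rows):
    """Standardize each feature column to zero mean and unit standard deviation.

    Assumption: population standard deviation; constant columns map to 0.0.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]


def stratified_folds(labels, k=5):
    """Assign each case a fold index 0..k-1, cycling within each class so that
    benign and malignant cases are spread as evenly as possible across folds."""
    folds = [0] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, i in enumerate(members):
            folds[i] = position % k
    return folds
```

In a cross-validated setting, the normalization parameters would normally be estimated on the training folds only and then applied to the held-out fold, so that no information leaks from the test cases into the model.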

The area under the receiver operating characteristic curve (AUC) and accuracy were used to assess the models' performance. Diagnostic accuracies were compared using McNemar's test, and p-values < 0.05 were considered significant. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for both radiologists and models. Diagnostic metrics were obtained for mass and NME lesions together, as well as for masses alone, allowing for the evaluation of models that utilize individual BI-RADS descriptors (internal enhancement, shape, margins, enhancing kinetics).
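The diagnostic metrics and the paired comparison of accuracies can be sketched as follows. This is a minimal illustration, not the statistical software used in the study: the text does not state which variant of McNemar's test was applied, so the exact (binomial) two-sided form is assumed here, and the function names are hypothetical.

```python
from math import comb


def diagnostic_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 table (assumes no denominator is zero)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: lesions classified correctly by method A only
    c: lesions classified correctly by method B only
    Lesions on which both methods agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of k or fewer discordant pairs on one side under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because the test depends only on the discordant pairs, two methods can differ by several percentage points in accuracy and still not differ significantly when few lesions are classified differently.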

Patient Sample and Breast Lesion Characteristics
A total of 93 women (mean age: 48.5 years ± 12 years) with 104 lesions (mean size: 22.8 ± 15.1 mm) were included in the final patient sample. There were 46 cancers (mean size: 28.8 ± 18.2 mm), of which 35 were mass lesions and 11 were non-mass enhancements. Benign lesions accounted for 58 lesions (mean size: 18.2 ± 10 mm), of which 50 were mass lesions and 8 were non-mass enhancements. Patient and lesion characteristics are summarized in Tables 2 and 3.

Radiomics Analysis for Breast Lesion Differentiation
The median size of segmented lesions was 255 pixels (range: 40-5379 pixels) for benign lesions and 2104 pixels (range: 115-58,485 pixels) for malignant lesions.
After CERR analysis, 102 radiomic features were obtained: 22 based on first-order statistics; 26 based on GLCM; 16 based on RLM; 16 based on SZM; 17 based on neighborhood grey level dependence matrix; and 5 based on neighborhood grey tone difference matrix.
Univariable analysis yielded 34 DWI and 27 DCE radiomic features that were significantly different between benign and malignant lesions. Feature selection, followed by multivariable modelling, resulted in ten models for the classification of all lesions, as well as for the classification of mass lesions alone. The top five radiomics parameters selected to develop each model are provided in Tables S3 and S4.

Radiologist Performance vs. Radiomics Coupled with ML for Malignant vs. Benign Classification for Mass Lesions
The performance of radiologist consensus reading, as well as that of different models for the classification of mass lesions, is shown in Table 4. The "radiomics DWI data model" demonstrated a higher diagnostic accuracy for the classification of mass lesions based on DWI (78.6%, CI: 68.3-86.8%) than either the ADC value (73.8%, CI: 63.1-82.8%) or the DWI score (75.0%, CI: 64.4-83.8%) assessed by radiologists. However, this increase in diagnostic accuracy was not significant (p > 0.38). Neither the "radiomics DWI data with DWI score model" (78.6%, CI: 68.3-86.8%) nor the "radiomics DWI data with ADC value model" (82.1%, CI: 72.3-89.7%) led to a significant improvement in diagnostic accuracy when compared with the "radiomics DWI data model" (p > 0.39 for both). BI-RADS descriptors for masses model" offered no further significant improvement in diagnostic accuracy (86.9%, CI: 77.8-93.3%; p = 1.00).

Radiologist Performance vs. Radiomics Coupled with ML for Malignant vs. Benign Classification for All Lesions (Mass and Non-Mass Lesions)
For detailed results regarding radiologist performance vs. radiomics coupled with ML performance for the classification of all lesions together (mass and non-mass enhancement), see the Supplementary Data (Tables S5-S7).

Discussion
We investigated the diagnostic performance of different models for the classification of enhancing breast tumors that were deemed suspicious on routine clinical breast MRI evaluation and subsequently recommended for biopsy. We compared the performance of radiomics analysis coupled with machine learning models against that of dedicated breast radiologists. A total of ten models were developed that used radiomic features extracted from DCE and ADC maps derived from DW images with clinical information (e.g., BI-RADS category, BI-RADS descriptors, or DWI-derived data) to discriminate between malignant and benign breast lesions.
Our results showed that multiparametric MRI interpretation by radiologists, as well as radiomic models based on multiparametric MRI combined with BI-RADS and DWI clinical data, achieved the highest accuracies and AUC values. While yielding slightly higher diagnostic accuracies, the multiparametric radiomics models with BI-RADS and ADC values did not significantly improve upon the diagnostic accuracy of dedicated study radiologists. It must be noted that all the lesions evaluated in this study had been previously classified as suspicious on routine reads and were already recommended for biopsy, indicating that such AI-enhanced multiparametric MRI models would be promising in clinical practice where readers of all levels of experience are reading breast MRIs.
Regarding non-multiparametric assessments, the models based on DWI features did not improve upon radiologist performance using DWI alone. Based on DCE, only the "radiomics DCE data with BI-RADS model" provided a borderline significant improvement in diagnostic accuracy when compared with radiologists' assessment of breast lesions using BI-RADS classification. This may be due to the addition of an algorithmic/decision-tree component to the subjective assessment with BI-RADS. This trend was sustained, and a significant improvement was observed when the actual BI-RADS descriptors (internal enhancement, shape, margins, enhancing kinetics) were incorporated into a radiomics model based on DCE in the classification of mass lesions.
Multiparametric breast MRI with DCE-MRI and DWI as a supportive sequence for the discrimination of breast lesions has been shown to be the best imaging technique for breast cancer diagnosis [33,34]. Yet only a few studies have been published comparing the performance of AI-enhanced models to that of radiologists for breast cancer diagnosis. This information is key for the implementation of these models in clinical practice. We showed that a multiparametric radiomics model with BI-RADS and ADC values achieved a slightly higher diagnostic accuracy (88.5%, CI: 80.7-93.9%) than radiologists using multiparametric MRI with ADC values (85.6%), although the difference was not significant (p = 0.39). The trend was more pronounced in the subgroup of masses, in which the multiparametric radiomics model with individual BI-RADS descriptors and ADC values approached significance when compared with the accuracy of radiologists based on multiparametric MRI using ADC values (91.7% vs. 86.9%; p = 0.063). This result is unsurprising given that non-mass lesions usually present as diffuse infiltrating enhancements with ill-defined margins and often represent a diagnostic challenge for both manual segmentation and DWI scoring [35].
Among the few studies directly comparing AI-enhanced models and radiologists, Sutton et al. [36] showed that quantitative radiomic features extracted from DCE-MRI of breast cancer could replicate human-extracted tumor size and BI-RADS imaging phenotypes. Another study, conducted by Truhn et al. [37], assessed the performance of a convolutional neural network (CNN) model against radiomics analysis, comparing it with the prospective assessment of three breast radiologists discriminating breast MRI enhancing lesions. The CNN model seemed to outperform radiomics analysis (AUC of 0.88 vs. 0.81) but did not achieve better performance than multiparametric MRI interpreted by breast radiologists (AUC of 0.98). Unlike in our study, the input used to perform radiomics analysis and to generate the CNN model consisted only of features extracted from DCE images, and was thus not representative of the full diagnostic potential of breast multiparametric MRI. Lo Gullo et al. [38] compared the qualitative morphological assessment with BI-RADS classification to radiomics coupled with machine learning for the differentiation of subcentimeter breast masses in BRCA mutation carriers. They found that radiomics analysis coupled with ML achieved a better diagnostic accuracy (81.5%) compared with radiologists using BI-RADS classification (53.4%). In yet another study, Bickelhaupt et al. [27] investigated two radiomics classifiers based on contrast-free MRI sequences (DWI and T2-weighted sequences) alone, and combined with the ADC parameter, for the discrimination of breast lesions found suspicious on screening mammography. As in our study, they reported that their radiomics models performed better than ADC alone and that the inclusion of the mean ADC increased the accuracy of the model (from an AUC of 0.842 to 0.851), demonstrating the advantages of data sharing. Nevertheless, the performance of the proposed model combined with ADC was lower than that of expert breast radiologists (AUC of 0.959) using multiparametric breast MRI.
It is worth noting some of our study's strengths. First, our study included data from 3D segmentations, which contribute more pixels and thereby enable better model building and accuracy. Moreover, our data are derived from images acquired from different scanners and MRI protocols across two different institutions. Although this could be understood as a weakness, e.g., the introduction of data noise or dilution of the association by the protocol/image quality differences, it is helpful for the generalizability of our results. Second, our study's readers were experienced, breast-dedicated radiologists. Their expertise in assessing the lesions certainly influenced our results. Therefore, our study indicates potential for AI-enhanced multiparametric MRI to be useful in clinical practice as a decision support tool for readers of all levels of experience or as support for breast radiology residents and fellows. Having said that, it is important to highlight that breast MRI has a high cost, and its access may be limited in certain countries. Therefore, it is important to seek alternative AI-enhanced tools with wider availability. New approaches such as ultrasound elastography coupled with machine learning techniques may represent a feasible alternative to MRI for breast cancer diagnosis [39].
Regarding limitations, our study included a relatively small representative sample comprising 104 breast lesions. This small sample size precluded the separation of data into training and test sets. In addition, some of these lesions, particularly benign tumors, were subcentimeter, which may reduce the number of pixels contributing to feature extraction and lead to an increased proportion of features that were potentially contaminated by partial volume effects. We tried to overcome this limitation while ensuring adequate counting statistics by including only lesions with at least 40 pixels and reducing the data to 16 grey levels (vs. 32 or 64 grey levels, as previously employed in breast MRI).

Conclusions
In conclusion, multiparametric radiomics analysis coupled with ML and combined with clinical data from multiparametric MRI performed similarly to breast radiologists for the classification of breast enhancing lesions on MRI. Multiparametric models could be useful as a supportive decision tool to accurately classify breast lesions, especially for less experienced breast MRI readers.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers14071743/s1, Table S1: Summary of imaging protocols and acquisition parameters; Table S2: Summary of DWI protocols and acquisition parameters; Table S3: Summary of radiomics features selected for each model for the analysis of all lesions; Table S4: Summary of radiomics features selected for each model for the analysis of mass only lesions; Table S5: Diagnostic metrics for the performance of radiologists and radiomics combining different approaches for mass and non-mass lesions; Table S6: Results from radiologist consensus reading regarding DWI suspicion score and BI-RADS descriptors and classification for mass and non-mass lesions; Table S7: Results from radiologist independent reading regarding DWI suspicion score, BI-RADS classification and multiparametric MRI classification for mass and non-mass lesions.
Informed Consent Statement: Patient consent was waived due to the retrospective nature of the study.

Data Availability Statement: The datasets used and analyzed in this study are not publicly available due to patient privacy requirements but are available upon reasonable request from the corresponding author. The code for radiomic feature extraction used in this study is publicly available via the open-source software CERR (https://github.com/cerr/CERR, accessed on 7 June 2021).