Prediction of Breast Cancer Histological Outcome by Radiomics and Artificial Intelligence Analysis in Contrast-Enhanced Mammography

Simple Summary The assessment of breast lesions through mammographic images is currently challenging, especially in dense breasts. Contrast-enhanced mammography has been shown to overcome the limitations of standard mammography but it greatly depends on the interpretative skills of the physician. The aim of this study was to evaluate the potentialities of statistical and artificial intelligence algorithms as a tool for helping the radiologists in the interpretation of images. The most remarkable results were achieved in discriminating benign from malignant lesions and in the identification of the presence of the hormone receptor. A tool to support the physician’s decision-making process may be designed starting from simple logistic regression and tree-based algorithms. This type of tool may help the radiologist in assessing the investigated breast and in choosing the appropriate follow-up without resorting to histology. Abstract Purpose: To evaluate radiomics features in order to: differentiate malignant versus benign lesions; predict low versus moderate and high grading; identify positive or negative hormone receptors; and discriminate positive versus negative human epidermal growth factor receptor 2 related to breast cancer. Methods: A total of 182 patients with known breast lesions and that underwent Contrast-Enhanced Mammography were enrolled in this retrospective study. The reference standard was pathology (118 malignant lesions and 64 benign lesions). A total of 837 textural metrics were extracted by manually segmenting the region of interest from both craniocaudally (CC) and mediolateral oblique (MLO) views. Non-parametric Wilcoxon–Mann–Whitney test, receiver operating characteristic, logistic regression and tree-based machine learning algorithms were used. The Adaptive Synthetic Sampling balancing approach was used and a feature selection process was implemented. Results: In univariate analysis, the classification of malignant versus benign lesions achieved the best performance when considering the original_gldm_DependenceNonUniformity feature extracted on CC view (accuracy of 88.98%). An accuracy of 83.65% was reached in the classification of grading, whereas a slightly lower value of accuracy (81.65%) was found in the classification of the presence of the hormone receptor; the features extracted were the original_glrlm_RunEntropy and the original_gldm_DependenceNonUniformity, respectively. The results of multivariate analysis achieved the best performances when using two or more features as predictors for classifying malignant versus benign lesions from CC view images (max test accuracy of 95.83% with a non-regularized logistic regression). Considering the features extracted from MLO view images, the best test accuracy (91.67%) was obtained when predicting the grading using a classification-tree algorithm. Combinations of only two features, extracted from both CC and MLO views, always showed test accuracy values greater than or equal to 90.00%, with the only exception being the prediction of the human epidermal growth factor receptor 2, where the best performance (test accuracy of 89.29%) was obtained with the random forest algorithm. Conclusions: The results confirm that the identification of malignant breast lesions and the differentiation of histological outcomes and some molecular subtypes of tumors (mainly positive hormone receptor tumors) can be obtained with satisfactory accuracy through both univariate and multivariate analysis of textural features extracted from Contrast-Enhanced Mammography images.


Introduction
Mammography is one of the main techniques in the diagnosis of breast cancer, showing a key role in both screening and follow-up [1,2]. Mammographic screening has been shown to be highly accurate in detection of breast lesions; however, it suffers from some limitations, especially in the case of dense breasts. In fact, dense breasts show a hyper-intense signal over the mammary parenchyma, resulting in very little contrast between the latter and the lesions. For the mammographic screening of patients with dense breasts, other techniques, such as Magnetic Resonance Imaging (MRI), are commonly preferred [3,4]. In particular, one of the most recent novel approaches is Contrast-Enhanced Mammography (CEM). Combining the potential and benefits of Full-Field Digital Mammography (FFDM), CEM has been shown to be highly effective for the detection and the correct staging of cancer, particularly in dense breasts [4][5][6][7][8][9][10]. More specifically, CEM combines the enhancing properties of the intravenous administration of an iodinated contrast medium with the high precision of digital imaging from FFDM; therefore, the neo-vascularity associated with actively growing malignancy is remarkably emphasized. Due to this property, CEM is not only able to detect cancer with high accuracy, but it is also a powerful technique for the identification of cancers that are obscure at mammography; furthermore, it allows a more accurate evaluation of the disease extent and offers guidance in the planning of surgery and treatment [4][5][6][7][8][9][10]. However, as in all imaging techniques, the evaluation of CEM images depends on the experience and skills of the radiologist, making the identification of automated or semi-automated techniques, which can provide decision support, a considerable challenge.
Recent significant advancements in this sense rely on the application of artificial intelligence and radiomics for the processing of large quantities of data by different imaging modalities [11,12].
Radiomics is the process of extracting quantitative properties, named features, from medical images. This feature extraction generally includes pattern recognition algorithms and provides, as a result, a set of numbers, each representing a quantitative description of a specific either geometrical or physical property of the image portion under consideration. In the context of tumor characterization, the radiomics features typically considered are those that describe properties related to size, shape, intensity, and texture of the tumor [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27].
Biological and molecular features related to breast cancer are commonly extracted by biopsy, which is invasive and not always able to detect tumor heterogeneity [28]. In recent years, there has been growing interest in non-invasive methods to directly derive insights from radiologic images. In this context, the radiomics analysis of tumor features extracted from CEM represents an important tool for breast tumor characterization. As several authors suggest, radiomics analysis combined with artificial intelligence techniques can be used to create a tool to support the physician's decision-making process in the classification of breast cancer [29][30][31][32][33][34][35][36][37][38][39][40][41][42][43][44]. In fact, through an appropriate tool, the physician would be able to discriminate the tumor nature and/or grading, identifying the adequate treatment for a single patient (e.g., neoadjuvant therapy) or even a more conservative approach (e.g., wait-and-see or conservative surgery). However, based on our knowledge, only some recent studies have used CEM in the prediction of histological grading and receptor status of breast cancer [45,46].
This work aimed to evaluate radiomics features to differentiate malignant versus benign lesions, to predict low versus moderate and high grading, to identify positive or negative hormone receptors, and to discriminate positive versus negative human epidermal growth factor receptor 2 related to breast cancer.

Patient Selection
From October 2017 to December 2021, according to regulations issued by the local Institutional Review Board, 182 patients (mean age ± standard deviation of 55.3 ± 10.9 years (range 31-80)) with known breast lesions were enrolled retrospectively. All women signed informed consent.
Inclusion criteria: patients with known breast lesions (from radiological or clinical screening, symptom of palpable lesions), histologically proven, and that underwent dualenergy CEM. CEM images of patients were acquired at Istituto Nazionale Tumori-IRCCS-Fondazione G. Pascale (Naples, Italy) and at Istituto Tumori "Giovanni Paolo II" of Bari (Bari, Italy).
Exclusion criteria: patient with breast implants, presence of non-removable drilling at the nipple, pacemakers, clips or other metal implants, pregnancy or possible pregnancy, inability to keep upright immobility during the examination, renal disease, or chemotherapy treatment at the time of imaging [41].
Overall, 118 malignant lesions and 64 benign lesions were analyzed.

Imaging Protocol
A total of 136 CEM examinations were performed using the Selenia ® Dimensions ® Unit dual-energy mammography system (Hologic, Bedford, MA, USA), whereas the remaining 46 CEM image were acquired with the Senographe Essential dual-energy mammography system (GE Healthcare, Princeton, NJ, USA).
The same acquisition protocol was implemented for all the images using both scanners. Specifically, two minutes after the intravenous injection of 1.5 mL/(kg bw) of iodinated contrast medium (Visipaque 320; GE Healthcare, Inc., Princeton, NJ, USA) at a rate of 2-3 mL/s, a set of images was acquired in quick succession, in both CC and MLO views. The CEM examination obtained two images: a low-energy (LE) acquisition at 26-30 kVp and a high-energy (HE) acquisition at 45-49 kVp, depending on breast density and thickness. CEM acquisition details were reported in previous studies [41,42,45].

Image Processing
Two expert radiologists, with 25 and 20 years of experience in breast imaging, manually segmented images by drawing slice-by-slice the contours of the lesions where contrast uptake was emphasized both in CC and MLO views.

MRI Post-Processing with PyRadiomics Tool
For each region of interest, 837 radiomics features were extracted as median values using the PyRadiomics Python package [47] including: First Order Statistics, Grey Level Co-occurrence Matrix, Grey Level Run Length Matrix, Grey Level Size Zone Matrix, Neigh-boring Grey Tone Difference Matrix, and Grey Level Dependence Matrix features before and after the wavelet filtering. The extracted features comply with feature definitions as described by the Imaging Biomarker Standardization Initiative (IBSI) [48] and as reported in (https://readthedocs.org/projects/pyradiomics/downloads/, accessed on 20 January 2017).
We used wavelet filtering, with all possible combinations of both high-pass (H) and low-pass (L) filters along the three axes (X, Y, and Z axes), to derive six different matrices: A graphical representation of the process for features extraction in a radiomics context is reported in Figure 1.
ually segmented images by drawing slice-by-slice the contours of the lesions where contrast uptake was emphasized both in CC and MLO views.

MRI Post-Processing with PyRadiomics Tool
For each region of interest, 837 radiomics features were extracted as median values using the PyRadiomics Python package [47] including: First Order Statistics, Grey Level Co-occurrence Matrix, Grey Level Run Length Matrix, Grey Level Size Zone Matrix, Neighboring Grey Tone Difference Matrix, and Grey Level Dependence Matrix features before and after the wavelet filtering. The extracted features comply with feature definitions as described by the Imaging Biomarker Standardization Initiative (IBSI) [48] and as reported in (https://readthedocs.org/projects/pyradiomics/downloads/, accessed on 20 January 2017).
We used wavelet filtering, with all possible combinations of both high-pass (H) and low-pass (L) filters along the three axes (X, Y, and Z axes), to derive six different matrices: A graphical representation of the process for features extraction in a radiomics context is reported in Figure 1.

Histopathological Analysis
The reference standard (ground truth) was the histopathologic examination of tissue as reported in [41]. Breast lesions were categorized based on the American Joint Committee on Cancer staging. The histological grade and the expression of estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), and Ki-67 antigen associated with cell proliferation were determined by immune-histochemical analysis.
The tumor grade G was defined on a three-grade scale according to the Elston-Ellis modification of the Scar-Bloom-Richardson grading system.
The hormone receptor (HR) was also considered; a breast cancer is classified as HRpositive if its cells have receptors for the hormones estrogen and progesterone.
Before proceeding with statistical analysis, the dataset was balanced with respect of each outcome. The balancing was performed through the synthetization of samples for the less-represented classes using the Adaptive Synthetic Sampling (ADASYN) approach [50,51].
In the context of univariate analysis, the non-parametric Wilcoxon-Mann-Whitney test for continuous variables was used. Receiver operating characteristic (ROC) analysis and the Youden index were considered to obtain the optimal cut-off value for each feature; then, the area under ROC curve (AUC), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), and accuracy (ACC) were computed. Bonferroni correction was used to adjust for multiple comparison.
In the context of multivariate analysis, logistic regression and tree-based algorithms were appropriately designed to predict each outcome individually; the main predictive features were also extracted. Before proceeding with the analysis, three pre-processing steps were performed.
Firstly, the dataset was randomly split into a training set and a test set, using the createDataPartition R function. Specifically, 90% of the entire dataset was used to train the algorithms, designing a cross-validated procedure; the remaining 10% of samples was used to estimate the accuracy of algorithms on 'new' samples, which are samples not used to train the algorithms themselves. Successively (and before running algorithms), a variable selection procedure was designed to remove redundant features from the training set. To achieve this aim, the cross-correlation between each predictor was calculated and all the features with a correlation higher than 0.7 (as an absolute value) with each single predictor were discarded. Finally, the input predictors were centered and scaled before running the logistic regression algorithm.
The machine learning approaches designed for the aim of this paper are described in the following. For each approach, the performance (accuracy) was assessed on both the training and test sets, also considering the values of sensitivity and specificity.
Logistic Regression. Considering the dichotomic nature of each outcome, a logistic regression was executed using all non-redundant features. The method was run using the glm R function.
Logistic Regression with least absolute shrinkage and selection operator (LASSO) method. In a different approach, the logistic regression model was fitted on training data, performing a further variable selection with the LASSO regularization method [52,53]. The LASSO was designed using the glmnet R function and the hyperparameter was tuned through a 10-fold cross validation procedure. The variables selected were saved to train the logistic regression algorithm.
Logistic Regression with two predictors. An additional variation of the logistic regression was considered predicting each outcome with all possible couples of features. All combinations that reached a test accuracy higher than 0.9 were saved and analyzed.
Tree-based algorithms. Among all tree-based algorithms, Classification and Regression Trees (CART) and Random Forest (RF) algorithms were chosen and designed. The CART algorithm was trained taking into account the possibility of obtaining a decision chart, whereas the RF method was used for a more robust evaluation of performances. Tuning of functions' hyperparameters was performed through a 10-fold cross validation procedure. Table 1 shows the distribution of characteristic of analyzed patients.  Table 2 reports the diagnostic accuracy of significant textural parameters for dualenergy CEM, in both CC and MLO views, obtained in the context of univariate analysis. As the table shows, in the classification of malignant versus benign lesions, the best performance was reached by the original_gldm_DependenceNonUniformity feature, extracted on CC view, with an accuracy of 89.83%, a sensitivity of 92.37%, and a specificity of 85.59%, and with a cut-off of 2.31.

Results
In the classification of grading, the best performance was reached by the origi-nal_glrlm_RunEntropy feature, extracted on CC view, with an accuracy of 83.65%, a sensitivity of 90.38%, and a specificity of 76.92%, and with a cut-off of 0.80.
In the identification of HER2+, the best performance was reached by the wavelet_HLH_ gldm_LargeDependenceHighGrayLevelEmphasis feature, extracted on MLO view, with an accuracy of 69.63%, a sensitivity of 62.22%, and a specificity of 77.04%, and with a cut-off of 0.74.
In the identification of HR+, the best performance was reached by the original_gldm_ DependenceNonUniformity feature, extracted on CC view, with an accuracy of 81.65%, a sensitivity of 96.99%, and a specificity of 65.59%, and with a cut-off of 2.55. Tables 3 and 4 show the results obtained with logistic regression-based and tree-based methods, respectively. Considering the CC view, the best performances were obtained when predicting the tumor nature (malignant versus benign). Logistic regression proved to be the best performing model (test accuracy of 95.83%) when using an approach without LASSO regularization. The goodness of the logistic regression method was also observed with LASSO regularization (test accuracy of 91.67%). Almost comparable results were obtained when using the tree-based algorithms (Table 5), with a test accuracy of 91.67% in the prediction of tumor nature. The decisional chart obtained with the CART method is shown in Figure 2 and the goodness of training procedure on 500 trees with the RF method is shown in Figure 3.    Considering the features extracted from MLO view images, the best test accuracy (91.67%) was obtained when predicting the grading using a CART algorithm, while the use of all non-redundant features (44 predictors) in a logistic regression model significantly reduces the accuracy value (max test accuracy of 75.00%). The goodness of treebased algorithms is confirmed by the error evolution plot of RF, reaching an accuracy value of 87.50% on the test set. The decision chart and error evolution are shown in Figure  4.
Combinations of only two features, extracted from both CC and MLO views, always showed test accuracy values greater than or equal to 90.00% (Table 5), with the only ex- Considering the features extracted from MLO view images, the best test accuracy (91.67%) was obtained when predicting the grading using a CART algorithm, while the Combinations of only two features, extracted from both CC and MLO views, always showed test accuracy values greater than or equal to 90.00% (Table 5), with the only exception being the HER2 outcome, where the best performance (test accuracy of 89.29%) was obtained with the RF algorithm.

Discussions
The radiomics analysis of tumor features extracted from CEM images represents an important tool for breast cancer characterization.
In this study, we aimed to perform radiomics analysis with texture features extracted by dual-energy CEM, evaluating its ability to classify malignant and benign breast lesions, and to predict grading and breast cancer receptors status (HER2+ and HR+).
In a retrospective study, Marino et al. [46] examined the potential of radiomics analysis using features from both CEM and MRI. In particular, they assessed the tumor invasiveness, the hormone receptor status, and the tumor grade in patients with primary breast cancer through common radiomics parameters. In their results, they showed remarkable accuracies when performing CEM radiomics analysis for discriminating HR+ versus HR− breast cancers (95.6%) and invasive versus non-invasive breast cancers (92.0%); slightly lower results were obtained, instead, in the classification of G1 + G2 versus G3 invasive cancers (77.8%).
The results of the univariate analysis of the present study show that the classification of malignant versus benign lesions achieved the best performance when considering the original_gldm_DependenceNonUniformity feature extracted on CC view (accuracy of 88.98%). The features extracted on CC view appeared to perform better, as the results in the classification of both grading and HR suggest. In fact, an accuracy of 83.65% was reached in the classification of grading, whereas a slightly lower value of accuracy (81.65%) was found in the classification of HR+; the features extracted were the original_glrlm_RunEntropy and the original_gldm_DependenceNonUniformity, respectively. In the identification of HER2+, the best performance, however low (accuracy of 69.63%), was reached when considering the wavelet_HLH_gldm_LargeDependenceHighGrayLevelEmphasis feature, extracted on MLO view images.
The results of multivariate analysis showed that better performances could be achieved when using two or more features as predictors for the classification of malignant and benign lesions and for the prediction of HR positive status. The best performance was achieved when predicting the tumor nature from the CC images through a logistic regression model, where the test accuracy reached a value of 95.83% without LASSO regulation. Nevertheless, the LASSO regularization selected 12 out of 27 predictors, significantly reducing the model complexity, at the price of an imperceptible reduction in performance (91.67%).
The same performance of logistic regression was not observed when predicting the same outcome (malignant versus benign classification) using MLO images, confirming the tendency of results in the univariate analysis context.
Features extracted from MLO images were shown to be useful in the prediction of grading with a CART algorithm. However, the results obtained with a logistic regression approach using only two predictors (minimum test accuracy of 90.00%, maximum test accuracy of 95.83%) suggest that simpler models are preferred, with the only exception of the HER2 outcome, where the best performance (test accuracy of 89.29%) was obtained with the RF algorithm.
Remarkable results were also obtained in the prediction of both the grading and the HR+, from both CC and MLO views; however, the performance of logistic regression (regularized or not) and of tree-based algorithms is surpassed by the accuracies obtained when using only two predictive features. Therefore, it can be stated that the prediction of all the outcomes is preferable with less complex models (that is, logistic regression with only two predictors or with a regularized approach). It is furthermore useful to note that the results of univariate analysis are less performant of those of the multivariate approach, suggesting that artificial intelligence can be powerfully used to extract insights from CEM images analysis.
The main limitation of this study is the need for manual segmentation of the images, which is time consuming and operator dependent. The problem of biased results due to this weakness was addressed by having two radiologists perform the segmentation. A foreseeable solution may be the use of automatic or semi-automatic segmentation; however, this may be difficult to implement, especially in the cases of multicentric lesions or background parenchymal enhancement. A further limit is that the interpretation of machine learning algorithm results is not always intuitive and may require specific expertise from the clinician.

Conclusions
The results of this study confirm that radiomics textural features extracted from CEM images can be highly informative about both the tumor nature and grading, and some molecular subtypes of tumors. Therefore, the results suggest that the combination of artificial intelligence algorithms with the concept of radiomics analysis can be successfully used to create a tool for supporting the physician's decision-making process in the classification of breast cancer. In particular, the identification of malignant breast lesions and HR positive status can be performed with a high predictive power, even using simpler models.