Radiomic Feature Reduction Approach to Predict Breast Cancer by Contrast-Enhanced Spectral Mammography Images

Contrast-enhanced spectral mammography (CESM) is an advanced instrument for breast care whose reading is still operator dependent. The aim of this paper is to propose an automated system able to discriminate benign and malignant breast lesions based on radiomic analysis. We selected a set of 58 regions of interest (ROIs) extracted from 53 patients referred to the Istituto Tumori “Giovanni Paolo II” of Bari (Italy) for breast cancer screening between March 2017 and June 2018. We extracted 464 features of different kinds, such as points and corners of interest and textural and statistical features, from both the original ROIs and the ones obtained by a Haar decomposition and a gradient image implementation. The feature data were high dimensional, which can affect the process and accuracy of cancer classification; therefore, a classification scheme including dimension reduction was needed. Specifically, a principal component analysis (PCA) dimension-reduction technique that includes the calculation of the variance proportion for eigenvector selection was used. For the classification step, we trained three different classifiers, namely a random forest, a naïve Bayes and a logistic regression, on each sub-set of principal components (PCs) selected by a sequential forward algorithm. Moreover, we focused on the starting features that contributed most to the calculation of the PCs that returned the best classification models. The method based on the random forest classifier resulted in the best prediction of benign/malignant ROIs, with median values for sensitivity and specificity of 88.37% and 100%, respectively, by using only three PCs. The features that contributed most to the definition of these PCs were almost all extracted from the LE images. Our system could represent a valid support tool for radiologists interpreting CESM images.


Introduction
Early diagnosis by mammography plays a major role in reducing the breast cancer mortality rate among women worldwide [1,2].
In recent years, new mammography techniques such as contrast-enhanced spectral mammography (CESM) and tomosynthesis have contributed to further improve the performance of mammography, even in less "readable" types of breasts due to their high density [3,4].
Specifically, CESM represents a unique methodology in the clinical and scientific field, owing to the considerable amount of information derived from the morphological images and the contrast dynamics obtained simultaneously on the same breast and the same region of interest.
Although CESM has the same type of indications as DCE-MRI, because of its radiation dose and its non-uniform availability across the territory it is currently used, according to the guidelines, as an alternative to DCE-MRI when the latter is not feasible (owing to absolute or relative contraindications) or not available [5].
In MRI, the performance of the human reader can be reduced by the presence of marked background parenchymal enhancement (BPE), defined as the normal background enhancement of breast tissue, whose intensity and distribution vary widely with age, hormonal phase, menopausal status and any ongoing therapies [6]. In these circumstances, diagnostic aids to reporting and fusion-imaging systems in different positions can support the diagnosis [7][8][9]. These aspects are also expected in CESM.
Nowadays, clinicians are aided in the interpretation of these images by automated support systems, known as computer-aided detection/diagnosis (CAD), which is software used in clinical medicine to suggest diagnoses and treatments based on the clinical data and algorithms for their interpretation. Although various models of CAD systems for breast diseases using mammographic images have been developed in the past years [10][11][12][13][14][15], the literature is limited with regard to CAD systems for breast lesions using CESM images. Most of the recent literature includes comparative studies on the diagnostic performance of human readers on CESM images with respect to those of mammographic and MR images [3,[16][17][18]. However, interest in this new instrumentation is strong and some support tools have recently been proposed for characterizing breast lesions [19][20][21][22][23] and predicting the response to neoadjuvant therapy [24,25].
For this reason, we have developed an accurate computer-aided diagnosis tool to classify breast cancer lesions based on radiomic analysis of CESM images. In our previous work, we proposed a preliminary radiomics analysis aimed to explore the usefulness of quantitative information extracted from CESM images, to understand the behavior of each different set of well-known textural features automatically extracted from CESM images, and to compare them with each other [21,23]. In these works, an important role was played by the feature selection processes used to describe and characterize the regions of interest (ROIs) identified by our expert radiologists. In order to reduce the dataset dimensionality, starting from the initial feature set, a sub-set of these features, which were characterized by their high discriminating power, was selected by filtering techniques (i.e., statistical tests) for more manageable data processing [26]. Then, we selected the most important features by developing different approaches to feature selection, such as embedded and wrapper methods.
The feature selection techniques can influence the process and accuracy of cancer classification. Indeed, although the subset of features identified was the one with the greatest discriminating power with respect to the various evaluation criteria considered, feature selection involves a natural and inevitable loss of information due to the exclusion of features from the original dataset. For this reason, before proceeding with the exploration of new features useful for increasing the accuracy of the classification performance, we considered it appropriate to evaluate a different dimensionality reduction technique that does not produce information losses but only a reduction of the noise inherent in the dataset. Therefore, we evaluated a classification scheme that includes dimension reduction. Specifically, first, a principal component analysis (PCA) dimension-reduction method that includes the calculation of the proportion of variance for eigenvector selection was used. Subsequently, the ROI classification was performed by training different binary classifiers on a subset identified by sequential supervised learning procedures.

CESM Examination
The CESM is based on a dual-energy exposure after a single injection of an iodinated contrast medium (CM), which produces three images: a low-energy (LE), a high-energy (HE) and a third image, defined as a recombined (RC) image that is obtained thanks to the digital subtraction of the LE image from the HE one.
Finally, two images are displayed to the radiologist on the reporting monitor for diagnostic purposes: the LE image, which is superimposable on a standard 2D mammogram and allows morphological assessment, and the RC image, which provides information on the tumor neoangiogenesis of the breast.
An example of CESM images is shown in Figure 1, which includes a LE image (a), a HE image (b) and their RC image (c).


Figure 1.
Images produced by contrast-enhanced spectral mammography (CESM instrumentation). Typical example of low energy (a), high energy (b), and recombined (c) images [17]. The white arrow points out a suspicious lesion.
For all CESM exams, a modified Full-Field Digital Mammography (FFDM) system derived from a standard Senographe Essential (GE Healthcare) was used. First, the breast with no pathology was imaged, then the breast with the suspected lesion. Both craniocaudal (CC) and mediolateral oblique (MLO) views were collected. All of the images obtained were in DICOM format and were evaluated by a dedicated radiologist with more than 10 years of experience in reading mammography and breast MR images and trained in reading contrast-enhanced images.


Experimental Dataset
The study was pre-approved by the Scientific Board of the Istituto Tumori "Giovanni Paolo II" of Bari, Italy. As this is a retrospective study, the anonymized images of patients who had given consent to the use of data for scientific purposes, as required by our Regulations, were acquired.
Once we had selected images from 53 patients aged between 37 and 76 years (with a mean of 52.2 ± 10.1 years), showing a positive result according to CESM, a radiologist from our Institute dedicated to senologic diagnostics manually identified the ROIs with a box on the image by using the reporting tool. Some patients had more than one, so a total of 58 ROIs were identified according to the BIRADS classification [27]: lesions belonging to BIRADS 2 and 3 classes were considered as benign, while lesions belonging to BIRADS 4 and 5 classes were labeled as malignant. Then, the histological diagnosis based on bioptic sampling established that 15 ROIs contained benign lesions while 43 ROIs included malignant ones.


Methods
As summarized in Figure 2, after the radiologist manually segmented a region of interest (ROI) from each LE and RC image, a large feature set of five different kinds was extracted: points and corners of interest, and textural and statistical features, computed from both the original ROIs and those obtained by a Haar decomposition and a gradient image implementation. Subsequently, a principal component analysis (PCA) dimension-reduction method, which includes the calculation of the variance proportion for eigenvector selection, was applied to each of the five feature sets. Finally, different binary classifiers were trained to discriminate benign and malignant ROIs through a sequential forward feature selection algorithm that selected feature sub-sets first individually, and then simultaneously. MATLAB R2017a (MathWorks, Inc., Natick, MA, USA) software was used for all analyses.

Figure 2.
Schematic overview of the CESM images classification process. First, five sets of features were automatically extracted from each region of interest (ROI), then a principal component analysis was performed on each feature set. Finally, three binary classifiers were trained and their performances were evaluated on 100 ten-fold cross-validation rounds.

Feature Extraction
Starting from each ROI extracted from both original LE and RC images, five feature sets were automatically extracted to make the lesion classification more objective and operator independent. We then started to mathematically define a digital image [28].
An image can be defined as a two-dimensional function mapping the spatial coordinates x and y into a value f (x, y) representing the intensity of gray level of the image at that point. When x, y, and f (x, y) are all finite and discrete quantities, we call the image a digital image. A digital image is composed of a finite number of elements called pixels, each one having a particular position and gray intensity value. The section of the real plane spanned by the coordinates of the image is called the spatial domain. A very common representation for a digital image is a two-dimensional array represented as an M × N numerical matrix.
Therefore, given a digital image represented by a matrix, it was possible to extract statistical and textural features and, moreover, to identify points, edges and corners of interest. As shown in Figure 3, in this work we used five different extracted feature subsets, which we named as follows: STAT set, COUNT set, GRAD set, HAAR set and GLCM set [21,23].
The STAT set consists of 22 statistical features that describe the distribution of the ROI gray levels, measuring the likelihood of observing a gray value in random positions in the image, without taking into account the spatial information [29]. From each LE and RC original ROI, we extracted the following features: mean, standard deviation and their ratio, variance, skewness, entropy, relative smoothness, kurtosis, minimum and maximum values of gray-level and their difference.
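As an illustration, these first-order statistics can be computed in a few lines. The sketch below is in Python rather than the MATLAB used for the analyses; the function name, the 256-bin entropy estimate and the smoothness normalization are our assumptions, not taken from the paper.

```python
import numpy as np

def stat_features(roi):
    """First-order statistics of the ROI gray levels (no spatial information).

    Illustrative sketch of STAT-style descriptors; assumes 8-bit gray levels.
    """
    x = roi.astype(float).ravel()
    mean, std, var = x.mean(), x.std(), x.var()
    # Shannon entropy of the gray-level histogram.
    counts, _ = np.histogram(x, bins=256, range=(0, 256))
    p = counts[counts > 0] / counts.sum()
    entropy = -np.sum(p * np.log2(p))
    # Relative smoothness: 1 - 1/(1 + variance), with gray levels scaled to [0, 1].
    smooth = 1.0 - 1.0 / (1.0 + var / 255.0 ** 2)
    skew = np.mean((x - mean) ** 3) / std ** 3 if std > 0 else 0.0
    kurt = np.mean((x - mean) ** 4) / std ** 4 if std > 0 else 0.0
    return {"mean": mean, "std": std, "std_over_mean": std / mean,
            "variance": var, "skewness": skew, "kurtosis": kurt,
            "entropy": entropy, "smoothness": smooth,
            "min": x.min(), "max": x.max(), "range": x.max() - x.min()}
```

Computed on both the LE and RC ROI, the eleven values returned here would give the 22 features of the STAT set.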
The COUNT set contains a total of 10 features that describe the points, edges and corners of interest. In this work, we used five known detection algorithms:
• Scale invariant feature transform (SIFT) algorithm [30,31], which detects and describes local image features
• Minimum eigenvalue algorithm, which underlies the Shi-Tomasi corner detection algorithm [32] for identifying the corners of an object
• Features from accelerated segment test (FAST) algorithm [33,34], which is another corner detection method
• Binary robust invariant scalable key-points (BRISK) method [35], which combines the SIFT and FAST approaches for feature detection, descriptor composition and key-point matching
• Maximally stable extremal regions (MSER) algorithm [36], a blob detection method whose aim is to find correspondences between image elements from two images with different viewpoints.
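To make one of these detectors concrete, the minimum-eigenvalue (Shi-Tomasi) criterion can be sketched as follows. This is a simplified Python illustration, not the paper's implementation: it uses central-difference gradients, a fixed 3×3 summation window, no non-maximum suppression, and an illustrative relative threshold.

```python
import numpy as np

def min_eigenvalue_corners(img, rel_threshold=0.2):
    """Count corner candidates via the smaller eigenvalue of the local
    structure tensor, the criterion behind the Shi-Tomasi detector."""
    f = img.astype(float)
    gy, gx = np.gradient(f)                      # central-difference gradients
    ixx, iyy, ixy = gx * gx, gy * gy, gx * gy

    def box3(a):
        # Sum each entry of the tensor over a 3x3 neighborhood (zero-padded).
        p = np.pad(a, 1)
        h, w = a.shape
        return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

    sxx, syy, sxy = box3(ixx), box3(iyy), box3(ixy)
    # Smaller eigenvalue of the 2x2 tensor [[sxx, sxy], [sxy, syy]].
    lam_min = 0.5 * (sxx + syy - np.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2))
    if lam_min.max() <= 0:
        return 0
    return int(np.sum(lam_min > rel_threshold * lam_min.max()))
```

A flat region yields no corners, while a bright square yields responses clustered at its four corners; straight edges are suppressed because their structure tensor is nearly rank-one.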
The GRAD set is formed by 24 features and contains some previously defined statistical features, that is, mean, variance, skewness, entropy, relative smoothness and kurtosis, extracted from the gradient's magnitude and direction of each LE and RC original ROI. The gradient of an image is represented as a two-component vector (x- and y-derivative) defined at each pixel [28]. Since a digital image is a discrete function for which the derivatives are not defined, the gradient is approximated by convolving the image with a kernel, such as the Sobel or Prewitt operator; in this work, a Sobel kernel was used [28]. For each pixel (x, y), the magnitude Gmag measures how quickly the intensity is changing in the direction of the gradient, while the direction Gdir represents the orientation of greatest intensity change in the neighborhood of the pixel. The importance of the gradient image thus lies in the two pieces of information it provides: how quickly the image is changing, and in which direction it is changing most rapidly.
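A minimal sketch of this gradient computation, with Python standing in for the MATLAB pipeline; the zero padding at the borders and the helper names are illustrative choices.

```python
import numpy as np

# Sobel kernels for the horizontal (x) and vertical (y) derivatives.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def convolve_same(img, kernel):
    """Naive 3x3 'same' convolution with zero padding (no SciPy needed)."""
    k = kernel[::-1, ::-1]          # flip the kernel: true convolution
    p = np.pad(img.astype(float), 1)
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + h, j:j + w]
    return out

def gradient_image(img):
    gx = convolve_same(img, SOBEL_X)    # x-derivative
    gy = convolve_same(img, SOBEL_Y)    # y-derivative
    gmag = np.hypot(gx, gy)             # how quickly the intensity changes
    gdir = np.arctan2(gy, gx)           # orientation of greatest change
    return gmag, gdir
```

On a vertical step edge, Gmag is nonzero only along the boundary and Gdir there is horizontal, matching the description above.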
The HAAR set contains 96 features, the same statistical features previously computed in the GRAD set, but this time extracted from each sub-ROI obtained by decomposing each LE and RC original ROI through the Haar wavelet transform [28,37]. This technique obtains multi-resolution representations of images, which are very effective for analyzing the information content of images due to the dependence of the texture on the scale at which an image is analyzed. Particularly, once the Haar two-dimensional scaling and wavelet functions are computed, these can be used as filters in order to decompose the image into four bands. First the image is low-pass filtered and downscaled in order to obtain the low-low (LL1) band, then it is high-pass filtered in the three different directions in order to obtain the three types of detail images: horizontal (HL1), vertical (LH1) and diagonal (HH1). The operations can then be repeated on the LL1 band using the identical filters, as shown in Figure 4.
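One level of this decomposition can be sketched as follows (Python; the 1/2 normalization per axis and the band-naming convention are common choices, not taken from the paper).

```python
import numpy as np

def haar_level1(img):
    """One level of the 2D Haar decomposition into LL1, HL1, LH1, HH1.

    Minimal sketch using averaging and differencing over 2x2 blocks.
    """
    f = img.astype(float)
    h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2
    f = f[:h, :w]                            # crop to even dimensions
    lo_r = (f[0::2, :] + f[1::2, :]) / 2     # row low-pass + downscale
    hi_r = (f[0::2, :] - f[1::2, :]) / 2     # row high-pass + downscale

    def cols(g):
        return (g[:, 0::2] + g[:, 1::2]) / 2, (g[:, 0::2] - g[:, 1::2]) / 2

    LL1, HL1 = cols(lo_r)    # approximation and horizontal detail
    LH1, HH1 = cols(hi_r)    # vertical and diagonal detail
    return LL1, HL1, LH1, HH1
```

A constant image has all its energy in LL1 and zero detail bands; repeating `haar_level1` on LL1 gives the next decomposition level.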
Finally, the GLCM set is formed by a total of 312 features, which represent the spatial relationship that on average links the gray levels of the image to each other. Texture analysis returns textural variables, which are any geometric and repetitive arrangement of gray levels. The textural information of an image can be described by second-order variables computed on the N × N gray-level co-occurrence matrix (GLCM) (N represents the number of intensity values in the gray-scale image, which is usually reduced from 256 to 8 at the beginning of the algorithm to reduce the computational cost) [38,39].
Given a gray-level image f, the first step in building the matrix consists in defining a specific spatial relationship between pixels. This relationship, known as offset, is the distance D between a pixel of interest and its neighbors with respect to a specific direction, identified by an angle θ ∈ {0°, 45°, 90°, 135°}. Thus, the (i, j) element of the GLCM represents the number of times the combination of levels i and j occurs in two pixels of the image that satisfy the relationship given by the offset [28].
Thus, for each sub-ROI previously decomposed by the Haar transform only at the first level, the co-occurrence matrices are extracted in the four possible directions by choosing the parameter D equal to 2 in order to evaluate the relationship between the gray levels of immediately neighboring pixels in the image. Finally, the following features are extracted from both LE and RC sub-ROIs: contrast, correlation, cluster prominence, cluster shade, dissimilarity, energy, entropy, homogeneity, sum average, sum variance, sum entropy, difference entropy and normalized inverse difference moment.
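The GLCM construction and a few of the listed features can be sketched as follows (Python; only four of the thirteen features are shown, and the matrix is neither symmetrized nor normalized before counting, which are our simplifications).

```python
import numpy as np

# Offsets (row, column) for D = 2 in the four directions 0, 45, 90, 135 degrees.
OFFSETS = {0: (0, 2), 45: (-2, 2), 90: (-2, 0), 135: (-2, -2)}

def glcm(img, offset, levels=8):
    """Gray-level co-occurrence matrix for one offset; gray levels are
    first quantized from 256 to `levels` bins, as described in the text."""
    q = np.clip((img.astype(float) / 256.0 * levels).astype(int), 0, levels - 1)
    dr, dc = offset
    h, w = q.shape
    m = np.zeros((levels, levels), dtype=int)
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                m[q[r, c], q[r2, c2]] += 1
    return m

def glcm_features(m):
    """A few of the second-order features listed in the text."""
    p = m / m.sum()
    i, j = np.indices(p.shape)
    return {"contrast": float(np.sum(p * (i - j) ** 2)),
            "energy": float(np.sum(p ** 2)),
            "homogeneity": float(np.sum(p / (1.0 + np.abs(i - j)))),
            "entropy": float(-np.sum(p[p > 0] * np.log2(p[p > 0])))}
```

On a perfectly uniform region the whole mass falls on one diagonal cell, so contrast and entropy vanish and energy is maximal.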

Principal Component Analysis
The feature extraction procedures returned five feature sets. The next step consisted of exploring the discriminating power of these sets to identify a sub-set of significant features. To do this, a feature reduction was performed through PCA [40].
The central idea of PCA is to reduce the dimensionality of a dataset consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the data set. This is achieved by performing a linear transformation of the features that projects the original ones into a new Cartesian system, where the variables are sorted in descending order with respect to the overall variance percentage explained. In this work, a PCA was performed for each feature set.
The first step consists of representing the data set as a matrix. In particular, if n is the number of observations, in this case ROIs, and m the number of variables, a n × m matrix X is obtained. Then, the raw data in the matrix have to be standardized so that each variable contributes equally to the analysis. Thus, a new matrix Z is computed.
The next step consists of processing the correlation matrix R of Z, a m × m matrix whose elements are the correlation coefficients between the variables.
In general, the k-th principal component y_k is given by y_k = γ_k^T z, with Var(y_k) = λ_k, where λ_k is the k-th largest eigenvalue of R and γ_k is the corresponding eigenvector. Now, a sub-set of principal components has to be selected to replace the m elements of Z by a much smaller number p of PCs. There are different rules for deciding how many PCs should be retained in order to account for most of the variation in Z, without serious information loss [40]. In this work, we adopted the explained variance criterion: the required number of PCs is the smallest value of p for which the chosen percentage, in this case 80%, is exceeded. In particular, since the principal components are sorted in descending order with respect to their variance, it is sufficient to select the p components whose summed variance exceeds the selected threshold value.
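The procedure above can be sketched as a standard correlation-matrix PCA with the 80% explained-variance rule (Python; the function name is ours, and the standardization convention is one common choice).

```python
import numpy as np

def pca_correlation(X, threshold=0.8):
    """PCA on the correlation matrix R of the standardized data Z, keeping
    the smallest number p of components whose cumulative explained-variance
    proportion exceeds `threshold` (0.8, as in the text)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized n x m matrix
    R = np.corrcoef(Z, rowvar=False)             # m x m correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)         # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]            # sort by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()          # proportion of variance
    p = int(np.searchsorted(np.cumsum(explained), threshold)) + 1
    scores = Z @ eigvecs[:, :p]                  # y_k = gamma_k' z
    return scores, explained, p
```

When two of three variables are nearly collinear, the first two components already carry essentially all the variance, so p = 2 at the 80% threshold.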

Classification Model
Once the significant PCs were selected for each feature set, three different classification models were trained to discriminate the ROIs into benign and malignant, first on each PCs' sub-set, then on the set containing all the selected PCs. We trained a random forest (RF) classifier [41], a naïve Bayes (NB) classifier [42] and a logistic regression (GLM) [43] in a procedure of PCs stepwise forward selection [44].
The RF algorithm is among the most popular machine learning algorithms because it generally provides good predictive performance combined with low over-fitting. In particular, the tree-based strategy used by RF naturally ranks features by how well they improve the purity of the nodes, as measured by Gini's diversity index: nodes with the greatest decrease in impurity are at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees.
The NB algorithm is a probabilistic approach based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
The GLM is a statistical model that uses a logistic function to model a binary variable. This regression transforms its output using the sigmoid function to return a probability value that can then be mapped to the two classes under question.
The stepwise forward selection algorithm identifies the best sub-set of PCs by sequentially adding a PC to the set selected in the previous steps. Starting from the model composed of the single PC that showed the highest median AUC [45] over 100 ten-fold cross-validation rounds, we iteratively added the PC that yielded the highest classification performance in terms of median AUC. The process was repeated until all variables were included in the model. The performance of each classification was evaluated over 100 ten-fold cross-validation rounds in order to estimate the variability of the experimental results.
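The forward selection loop can be sketched as follows. This Python illustration makes two simplifications relative to the paper: a nearest-centroid scorer stands in for the RF/NB/GLM classifiers, and the loop stops at the first non-improving step instead of exploring all subset sizes before picking the best.

```python
import numpy as np

def auc(scores, y):
    """AUC via the rank-sum (Mann-Whitney) identity; y is a 0/1 array."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n1, n0 = int((y == 1).sum()), int((y == 0).sum())
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def cv_auc(X, y, folds=10, seed=0):
    """Cross-validated AUC of a nearest-centroid score (a simple stand-in
    for the classifiers used in the paper)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = np.zeros(len(y))
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        # Project onto the direction joining the two class centroids.
        w = X[train][y[train] == 1].mean(0) - X[train][y[train] == 0].mean(0)
        scores[test] = X[test] @ w
    return auc(scores, y)

def forward_select(X, y):
    """Greedy sequential forward selection on cross-validated AUC."""
    remaining, selected, best = list(range(X.shape[1])), [], 0.0
    while remaining:
        a, j = max((cv_auc(X[:, selected + [c]], y), c) for c in remaining)
        if a <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = a
    return selected, best
```

On synthetic data where only one column separates the classes, the loop picks that column first and reaches a high AUC.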
Once the best model was selected for each method and dataset, we also compared the classification performances of these prediction models in terms of sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP), where TP and TN stand for the true positive (number of true malignant ROIs identified) and true negative (number of true benign ROIs identified) cases, while FP (number of benign ROIs identified as malignant) and FN (number of malignant ROIs identified as benign) are the false positive and false negative ones, respectively. Specifically, the above values were calculated at the optimal threshold identified by means of Youden's index on ROC curves [46], an index able to cope with dataset imbalance (15 benign and 43 malignant).
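The threshold choice can be sketched as follows (Python; the scan over unique score values is one straightforward way to implement Youden's criterion, and ties or degenerate inputs are not handled).

```python
import numpy as np

def youden_threshold(scores, y):
    """Pick the decision threshold maximizing Youden's index
    J = sensitivity + specificity - 1 along the ROC curve."""
    best_j, best_t = -1.0, None
    for t in np.unique(scores):
        pred = scores >= t                    # predict malignant above t
        tp = int(np.sum(pred & (y == 1)))
        fn = int(np.sum(~pred & (y == 1)))
        tn = int(np.sum(~pred & (y == 0)))
        fp = int(np.sum(pred & (y == 0)))
        sens = tp / (tp + fn)                 # TP / (TP + FN)
        spec = tn / (tn + fp)                 # TN / (TN + FP)
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

Because J weighs sensitivity and specificity symmetrically, the chosen threshold does not simply favor the majority class, which is why it suits the 15/43 imbalance noted above.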

Results
The aim of this study was to devise a prediction model that was successful in discriminating benign and malignant ROIs. First, we reduced the initial dataset dimensionality through a principal component analysis, and we obtained one sub-set of discriminant principal components for each initial feature set. Then, we used three different classification algorithms on both the individual obtained sub-set and on the complete sub-set of significant principal components.

Principal Component Analysis
We performed a PCA for each standardized set of features, then we adopted the explained variance criterion to select a sub-set of discriminant principal components. We chose a threshold value equal to 0.8.
The STAT set required 4 PCs, the GRAD set was represented by a set of 9 PCs, the COUNT set was replaced by a set of 3 PCs, the sub-set linked to the HAAR set was formed by a total of 19 PCs, and the GLCM set needed 11 PCs, as shown in Figure 5.

Classification Performances
Once the PCs' subsets were extracted, we trained the random forest algorithm, the naïve Bayes method and the logistic regression in the procedure of PCs stepwise forward selection, first on each PCs' sub-set, then on the set containing all the selected PCs.
Concerning the individual PCs' sub-sets, the best models were obtained by training either the RF or the GLM classifier on sub-sets containing two PCs at most (Table 1). Table 1. Classification performances of the best models obtained from the individual PCs' sets calculated on 100 ten-fold cross-validation rounds. For each set of features, the performance measures of the different classification algorithms implemented are reported in correspondence with the best PC combination. The third column shows the PCs selected by the sequential forward selection algorithm. The best results are highlighted in bold.

Since the PCs were obtained as a linear combination of the starting variables, we were interested in understanding which of these features contributed most, and positively, to the computation of the PCs that determined the best classification model for each set. We selected the variables shown in Table 2, excluding the ones characterized by coefficients close to zero compared to those of the chosen variables. Table 2. Overview of the features that were important in the computation of the selected PCs on 100 ten-fold cross-validation rounds. The factors LE and RC identify the features extracted from the LE images and the RC images, respectively. Table 3. Classification performance of the best models obtained from the complete set of PCs calculated on 100 ten-fold cross-validation rounds. The best result is highlighted in bold.

Indeed, while the NB classifier reached a median AUC of 88.99% and the GLM achieved a median AUC of 90.08%, the model obtained by training the RF classifier on the set including the second and the twelfth PCs belonging to the HAAR set and the first PC belonging to the GRAD set reached a median AUC of 95.66% and a median accuracy of 90.52%. This model also outperformed the best model obtained by considering the single PCs' sets, reaching a sensitivity of 88.37% and a specificity of 100%.

Discussion
Although its use is not yet widespread in this field, CESM is, in our opinion, a very interesting technique and, due to its intrinsic characteristics, particularly suitable for radiomic image analysis. The opportunity to simultaneously analyze morphological images comparable to standard mammograms and dynamic contrast-enhanced images for the evaluation of neoangiogenesis provides the system with a remarkable and varied amount of information that can yield excellent results. Moreover, the scarcity of studies on this method and on these issues makes, in our opinion, this work even more valuable.
In this work, we proposed an automated support system able to characterize and discriminate breast lesions as benign or malignant. We extracted 58 ROIs from 53 CESM images and, for each ROI, we determined five sets of features: the STAT set, which included statistical features extracted from the original ROIs; the GRAD and HAAR sets, which comprised statistical measures extracted from the ROIs' manipulations by filters and wavelet functions, respectively; the GLCM set, which included textural features; and the COUNT set, which comprised information about points and corners of interest. Subsequently, each feature set was replaced by the set containing the related discriminant principal components. In particular, for each set a principal component analysis was performed and a sub-set of principal components was selected by setting an explained variance threshold value of 0.8. Then, all the PCs' sets obtained were used, first individually and then simultaneously, to train three different classifiers combined with a stepwise forward algorithm. The classification performances were evaluated and compared in terms of accuracy, sensitivity, specificity and AUC values on 100 ten-fold cross-validation rounds.
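The repeated cross-validation protocol (100 rounds of stratified ten-fold, summarized by median scores) can be sketched with scikit-learn's `RepeatedStratifiedKFold`; the classifier, data and reduced repeat count below are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def median_cv_auc(model, X, y, n_splits=10, n_repeats=100, seed=0):
    """Median AUC over repeated stratified k-fold rounds (the paper uses 100 x 10-fold)."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return float(np.median(scores))

# toy check with few repeats to keep it fast
rng = np.random.default_rng(3)
y = np.array([0] * 20 + [1] * 40)   # class imbalance similar to 15 benign / 43 malignant
X = rng.normal(size=(60, 3))
X[:, 0] += 1.5 * y                  # one informative component
auc = median_cv_auc(LogisticRegression(), X, y, n_repeats=5)
```

Reporting the median over many rounds makes the estimate robust to the fold-to-fold variability that a single ten-fold split would exhibit on only 58 ROIs.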
The best model among the ones developed on the single PCs' sets turned out to be the model obtained by training the RF classifier on the HAAR PCs' sub-set including the second and the twelfth PCs. It was observed that using only these two variables led to a median AUC value of 94.95%, a median accuracy of 87.93%, a sensitivity of 86.05% and a specificity of 100%.
On the other hand, it was observed that adding the first GRAD set's PC to the previous model led to better results. Indeed, this new model reached a median AUC value of 95.66%, a median accuracy of 90.52%, a sensitivity of 88.37% and a specificity of 100%. In both cases, the specificity value highlights that these prediction models are able to correctly identify all the benign ROIs.
Thus, the sequential forward selection algorithm allowed us to obtain a very well-performing classification model by selecting only three PCs: the second and the twelfth PCs from the HAAR set and the first PC from the GRAD set. The above-mentioned HAAR PCs were estimated thanks to the contributions of variables mainly calculated on LE images; only one of them was extracted from the RC images. On the other hand, the features that contributed most to the calculation of the first GRAD PC were all extracted from the LE images.
We compared the performance of the proposed approach with the literature: in this work, we improved the classification performances obtained with the CAD system developed in our previous work [21] (Table 4). Indeed, by using principal component analysis as a feature reduction technique instead of a backward feature selection algorithm combined with a naïve Bayes classifier, we reduced the number of discriminant variables and, at the same time, obtained better results in terms of sensitivity and specificity (in the previous work we obtained a sensitivity of 87.5% and a specificity of 91.7%). Moreover, compared to state-of-the-art models, our model seems to perform better. In [19], the authors reached a sensitivity of 88% and a specificity of 92% by training an SVM classifier on a feature set extracted from 50 lesions manually segmented by radiologists.
Finally, we want to emphasize the importance of radiomic analysis for the characterization of benign and malignant ROIs and for the achievement of more balanced values of sensitivity and specificity. Specifically, we compare our results with the ones obtained in [20] (a sensitivity of 100% and a specificity of 66%), where only textural descriptors provided by the radiologist were used, combined with CESM pixel information extracted directly from the images.

Conclusions
With the aim of improving the number of early diagnoses of breast cancer, in this work we proposed an automated expert system for discriminating benign and malignant ROIs. We proposed the use of a principal component analysis combined with machine learning techniques in order to select the optimal subset for characterizing breast regions and classifying them. Particularly, we trained three binary classifiers on an increasing number of features sorted by their diagnostic power, evaluated in terms of AUC. Our model's performance represents a step forward compared to our previous work, both in regard to the greater accuracy achieved in the classification of benign and malignant lesions and the smaller number of variables used to obtain these results.