Breast Cancer Detection in Mammography Images: A CNN-Based Approach with Feature Selection

The prompt and accurate diagnosis of breast lesions, including the distinction between cancer, non-cancer, and suspicious cancer, plays a crucial role in the prognosis of breast cancer. In this paper, we introduce a novel method based on feature extraction and reduction for the detection of breast cancer in mammography images. First, we extract features from multiple pre-trained convolutional neural network (CNN) models and concatenate them. The most informative features are then selected based on their mutual information with the target variable and classified using a machine learning algorithm. We evaluate our approach using four machine learning algorithms: neural network (NN), k-nearest neighbor (kNN), random forest (RF), and support vector machine (SVM). Our results demonstrate that the NN-based classifier achieves an accuracy of 92% on the RSNA dataset. This newly introduced dataset includes two views as well as additional features such as age, which contributed to the improved performance. We compare our proposed algorithm with state-of-the-art methods and demonstrate its superiority, particularly in terms of accuracy and sensitivity. For the MIAS dataset, we achieve an accuracy as high as 94.5%, and for the DDSM dataset, an accuracy of 96%. These results highlight the effectiveness of our method in accurately diagnosing breast lesions and surpassing existing approaches.


Introduction
Breast cancer (BC) is a widespread form of cancer, with millions of new diagnoses and deaths each year [1]. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths [2]. Although mortality rates have declined due to the implementation of regular mammography screening, early detection and treatment remain important for reducing cancer fatalities [3]. Currently, early detection of BC from radiology images requires the expertise of highly trained radiologists, and a looming shortage of radiologists in several countries will likely worsen this problem [4]. Mammography screening also leads to a high incidence of false positive results, which can cause unnecessary anxiety, inconvenient follow-up care, extra imaging tests, and sometimes a need for tissue sampling (often a needle biopsy) [5,6]. Machine learning techniques, such as graph-based clustering, also have the potential to improve the evaluation of multiple-view radiology images [7][8][9][10]. In recent years, deep learning, a subset of machine learning, has revolutionized the interpretation of diagnostic imaging studies [11]. The convolutional neural network (CNN) is one of the most significant architectures in the deep learning field [12], and CNNs have emerged as a prominent method for pattern recognition in image analysis [13]. Compared to traditional screening techniques, computer-aided diagnosis (CAD) systems based on CNNs offer faster, more reliable, and more robust screening. CNNs have been extensively used for breast cancer detection in different types […] DICOM format from roughly 11,000 patients. For each patient, at least four images are provided: the craniocaudal (CC) and mediolateral oblique (MLO) views of both the left and right breasts. The images vary in size and format (including JPEG and JPEG 2000) and in photometric interpretation (MONOCHROME1 and MONOCHROME2). The dataset also provides additional features, some of which can be used for classification purposes: age, implant, BIRADS, and density.
We base our work on this dataset; however, since it is new, it has not yet been used in any published research, so for comparison purposes we also use two other well-known datasets, MIAS and DDSM. The RSNA dataset is highly imbalanced: only 2 percent of its images are from cancer patients, which biases any classification method toward the non-cancer class. To compensate for this, we use all positive cases and only 2320 images from negative cases. Figure 1 depicts two sample images from this dataset for cancer and normal cases.
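The class-balancing step can be sketched as random undersampling of the negative images. This is a minimal illustration, not the authors' code: the paper does not state how the 2320 negative images were chosen, so uniform random sampling with a fixed seed is an assumption, and the function name `undersample_negatives` is hypothetical.

```python
import numpy as np


def undersample_negatives(labels, n_negatives=2320, seed=0):
    """Return indices of all positive images plus a random subset of negatives.

    Assumption: negatives are drawn uniformly at random without replacement;
    the text only states that 2320 negative images were kept.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    positives = np.flatnonzero(labels == 1)
    negatives = np.flatnonzero(labels == 0)
    # Never request more negatives than actually exist.
    k = min(n_negatives, negatives.size)
    kept_negatives = rng.choice(negatives, size=k, replace=False)
    return np.sort(np.concatenate([positives, kept_negatives]))


# Toy example with 2% positives, mirroring the RSNA class ratio.
labels = np.array([1] * 20 + [0] * 980)
idx = undersample_negatives(labels, n_negatives=100)
print(len(idx), labels[idx].sum())  # 120 20
```

With the real dataset, `n_negatives` keeps its default of 2320 and `labels` would hold one 0/1 cancer label per image.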
B. The mammographic image analysis society (MIAS) [23] dataset is a well-known and widely used dataset for the development and evaluation of CAD systems for BC detection. It consists of 322 mammographic images, each accompanied by a ground-truth annotation indicating whether it is normal or contains a benign or malignant abnormality. The dataset is particularly valuable for researchers developing machine learning algorithms for BC detection, as it includes examples of both normal and abnormal mammograms, as well as a range of breast densities and lesion types. Figure 2 depicts two sample images from this dataset for cancer and normal cases.
C. The digital database for screening mammography (DDSM) [24] includes 55,890 images, of which 14% are positive and the remaining 86% are negative. The images were tiled into 598 × 598 tiles, which were then resized to 299 × 299. A subset of positive cases, called CBIS-DDSM, has been annotated, and its regions of interest have been extracted by experts. In this research, we do not use CBIS-DDSM; instead, we use the original DDSM dataset, because we classify images from normal subjects versus cancer patients. Figure 3 depicts two sample images from this dataset for cancer and normal cases. Table 1 summarizes these three datasets.

C.
EfficientNet [27] is a family of deep CNN architectures, introduced in 2019, that has achieved state-of-the-art performance on a range of computer vision tasks. EfficientNet uses a compound scaling method that jointly scales the depth, width, and input resolution of the network, allowing it to achieve high accuracy while maintaining computational efficiency. It consists of a backbone network that extracts features from input images and a head network that performs the final classification. The backbone combines mobile inverted bottleneck convolutional layers with squeeze-and-excitation (SE) blocks to capture both spatial and channel-wise correlations in the input, and the head uses global average pooling followed by fully connected layers to perform the final classification.
D.
MobileNet [28] is a lightweight deep learning architecture designed for efficient image analysis, which makes it suitable for analyzing medical images in the context of BC diagnosis. With its emphasis on computational efficiency, MobileNet can extract features from mammography images that capture subtle patterns or abnormalities associated with breast cancer. By using depthwise separable convolutions, MobileNet reduces memory consumption and computational load, making it well suited to resource-constrained environments. Its ReLU6 activation function, which caps activations at 6, is robust under low-precision computation and further supports efficient deployment. Overall, MobileNet offers a practical option for BC analysis, providing accurate results while operating on limited computational resources.
E.
ConvNeXt [29] is a purely convolutional architecture that modernizes the standard ResNet design with choices borrowed from vision Transformers, such as large-kernel depthwise convolutions, inverted bottleneck blocks, layer normalization, and GELU activations. It has demonstrated excellent performance on various computer vision tasks, including image classification, object detection, and semantic segmentation, and its ability to capture complex relationships between features has made it a popular choice for tasks requiring a high-level understanding of visual data.
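The efficiency argument for MobileNet's depthwise separable convolutions (item D above) can be made concrete by counting weights. The sketch below compares a standard k × k convolution with a depthwise k × k convolution followed by a 1 × 1 pointwise convolution. Biases are ignored, and the layer sizes are arbitrary illustrations rather than MobileNet's actual configuration.

```python
import math


def standard_conv_weights(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k, c_in, c_out):
    """Depthwise k x k conv (one filter per input channel) followed by a
    1 x 1 pointwise conv that mixes channels (bias ignored)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 32, 64
std = standard_conv_weights(k, c_in, c_out)        # 3*3*32*64 = 18432
sep = depthwise_separable_weights(k, c_in, c_out)  # 288 + 2048 = 2336
print(std, sep, round(sep / std, 3))  # 18432 2336 0.127
```

The ratio equals 1/c_out + 1/k², so with 3 × 3 kernels the separable layer needs roughly 8 to 9 times fewer weights. For stride-1 layers the multiply-add count scales by the same ratio, because both variants are applied at every output position.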

Proposed Method
In this paper, we propose a method based on the extraction and concatenation of features obtained from various CNN models. The concatenated features are then reduced so that only informative features are retained, and these are used to classify images as normal or cancerous. Figure 4 illustrates the block diagram of the proposed system: the images from the different datasets are first preprocessed, features are then extracted with several CNN models, and the extracted features are reduced and finally classified into two classes, cancer and no cancer. The details of each block are as follows: A. Preprocessing: The images obtained from the various datasets vary in size and resolution. Using the contour of the largest object in the globally thresholded image, which corresponds to the breast area, we generated a mask that allows us to crop the image and isolate the region of interest for further analysis.
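The region-of-interest step (global thresholding, keeping the largest object as the breast, masking, and cropping) can be sketched as below, assuming an 8-bit image that has already been normalized. Two choices here are assumptions rather than the paper's stated method: Otsu's method stands in for the unspecified global threshold, and connected-component labelling with `scipy.ndimage.label` stands in for contour extraction, treating the largest foreground component as the breast. The function names are hypothetical.

```python
import numpy as np
from scipy import ndimage


def otsu_threshold(img):
    """Otsu's global threshold for a uint8 image (assumed thresholding method)."""
    if img.dtype != np.uint8:
        raise ValueError("expected an 8-bit (uint8) image")
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # weight of the "dark" class for each cut t
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    between_var = np.zeros_like(omega)
    valid = denom > 0                        # skip cuts that leave one class empty
    between_var[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(between_var))      # pixels > t are foreground


def crop_breast_roi(img):
    """Mask the largest bright object (assumed to be the breast) and crop to it."""
    foreground = img > otsu_threshold(img)
    labels, n = ndimage.label(foreground)
    if n == 0:
        return img, foreground
    sizes = ndimage.sum(foreground, labels, index=np.arange(1, n + 1))
    breast = labels == (int(np.argmax(sizes)) + 1)
    rows = np.flatnonzero(breast.any(axis=1))
    cols = np.flatnonzero(breast.any(axis=0))
    masked = np.where(breast, img, 0)        # suppress labels, markers, and noise
    cropped = masked[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    return cropped, breast


# Synthetic example: a large "breast" region plus a small bright marker.
img = np.zeros((100, 100), dtype=np.uint8)
img[20:60, 10:50] = 200
img[90:93, 90:93] = 255
cropped, mask = crop_breast_roi(img)
print(cropped.shape, int(mask.sum()))  # (40, 40) 1600
```

Keeping only the largest component is what discards small bright artifacts such as film labels; the small marker in the example is removed for that reason.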

3.
Image Alignment: Breast cancer datasets contain images of two lateralities, left and right. To improve consistency and accuracy in the analysis, we align all images to the left-breast orientation by horizontally flipping all right-breast images, so that every image in the datasets has a uniform orientation. Standardizing the laterality in this way yields a consistent and reliable dataset for further analysis.
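The alignment step amounts to a horizontal flip driven by the laterality label. In this minimal sketch, the label values "L" and "R" are an assumption about how laterality is encoded in the metadata, and `align_to_left` is a hypothetical helper name.

```python
import numpy as np


def align_to_left(img, laterality):
    """Flip right-breast images horizontally so that all images share the
    left-breast orientation; left-breast images are returned unchanged.

    Assumption: laterality is encoded as "L" or "R".
    """
    if laterality not in ("L", "R"):
        raise ValueError(f"unexpected laterality: {laterality!r}")
    return np.fliplr(img) if laterality == "R" else img


img = np.array([[1, 2, 3],
                [4, 5, 6]])
print(align_to_left(img, "R").tolist())  # [[3, 2, 1], [6, 5, 4]]
```

`np.fliplr` returns a view, so the flip costs no copy until the image is modified.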
B. Feature extraction: For feature extraction, we use the features computed by the pretrained CNN models described in Section 2.2. For each model, the features are taken from the layer immediately preceding the final fully connected (FC) layer, because the final FC layer has been trained to output the 1000 classes of the ImageNet dataset. Table 2 lists, for each CNN model used in this paper, the layer before the final FC layer and the number of features extracted from it.
C. Feature concatenation: The one-dimensional (1D) features extracted in the previous step are concatenated to form a single 1D feature vector. Since features are extracted from two different views (CC and MLO) for each CNN model, 10 1D vectors (five models × two views) are concatenated here, forming a vector of 18,384 features. For the RSNA dataset, which we use as the basis of our research, an additional useful feature is available: the patient's age. Figure 5 depicts the distribution of the age feature provided by the RSNA dataset for cancer and non-cancer subjects. As can be observed, age can also be considered a valuable feature, so we normalize it and append it to the feature vector, giving 18,385 features in total.
D. Feature selection: The majority of the features are redundant, carry no useful information, and only increase the complexity of the system. Figure 6 illustrates two examples of good and weak features. For a weak feature, the distributions of the feature for normal and cancerous subjects are similar, which shows that the feature carries no useful information; the mutual information computed between this feature and the class label is zero. For a good feature, normal and cancerous subjects have clearly different distributions, which shows that the feature carries useful information that, although small, can improve the performance of the classifiers used in the next step. To compute mutual information, we use the method in [30], and we retain the features whose mutual information exceeds a threshold; we empirically found that a threshold of 0.02 gives the best results. Note that we adopted mutual-information-based feature selection empirically, after trying various feature selection methods. The number of features for each dataset before and after feature selection is presented in Table 3.
E. Feature classification: After selecting the best features, we classify them. For this purpose, we tried multiple machine learning algorithms: k-NN, random forest (RF), SVM, and NN. For the RF classifier, we construct an ensemble of 100 trees, set the minimum number of samples required to split a node to 2, limit the maximum number of features considered for each tree to 5, and limit the maximum tree depth to 4. These parameter settings were chosen to optimize the performance of the model for breast cancer detection in our X-ray image datasets. For the SVM classifier, we use a linear kernel and set the regularization parameter C to 1. The linear kernel learns a linear decision boundary, while C balances the trade-off between training accuracy and the complexity of the decision boundary. For the k-NN classifier, we set k = 5. For the NN classifier, we use two fully connected (FC) layers: a hidden layer with 96 neurons and a single-neuron classification layer with a sigmoid activation function that separates non-cancer cases from cancerous ones.
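The concatenation, selection, and classification steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The per-model feature matrices are assumed to be precomputed (the CNN forward pass is omitted), and age is min-max scaled because the normalization is not specified. A simple histogram estimate of mutual information in nats stands in for the estimator of [30], so the 0.02 threshold may not transfer directly. The scikit-learn classes mirror the reported hyperparameters; `MLPClassifier` is a stand-in for the authors' two-layer NN, and its ReLU hidden activation is an assumption. All function names are hypothetical.

```python
import numpy as np


def concatenate_features(blocks, age=None):
    """Stack per-model, per-view feature matrices side by side.

    blocks: list of (n_samples, d_i) arrays, e.g. 5 CNN models x 2 views = 10.
    age: optional (n_samples,) array, min-max scaled and appended last
         (assumed normalization).
    """
    X = np.hstack(blocks)
    if age is not None:
        age = np.asarray(age, dtype=np.float64)
        span = age.max() - age.min()
        scaled = (age - age.min()) / span if span > 0 else np.zeros_like(age)
        X = np.column_stack([X, scaled])
    return X


def mutual_information(x, y, n_bins=10):
    """Histogram (plug-in) estimate of I(x; y) in nats, continuous x, discrete y.

    A simple stand-in for the estimator of [30], which may differ.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y)
    edges = np.histogram_bin_edges(x, bins=n_bins)
    x_bin = np.digitize(x, edges[1:-1])      # bin index in 0..n_bins-1
    classes = np.unique(y)
    joint = np.zeros((n_bins, classes.size))
    for j, c in enumerate(classes):
        joint[:, j] = np.bincount(x_bin[y == c], minlength=n_bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def select_features(X, y, threshold=0.02, n_bins=10):
    """Indices of columns whose estimated MI with y exceeds threshold, plus all MI values."""
    y = np.asarray(y)
    mi = np.array([mutual_information(X[:, j], y, n_bins) for j in range(X.shape[1])])
    return np.flatnonzero(mi > threshold), mi


def build_classifiers():
    """The four classifiers with the hyperparameters reported in the text."""
    # Imported here so the NumPy helpers above also work without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    return {
        # scikit-learn applies max_features at every split (not per tree),
        # and it must not exceed the number of selected features.
        "RF": RandomForestClassifier(n_estimators=100, min_samples_split=2,
                                     max_features=5, max_depth=4),
        "SVM": SVC(kernel="linear", C=1.0),
        "kNN": KNeighborsClassifier(n_neighbors=5),
        # One hidden layer of 96 units; for binary targets MLPClassifier
        # uses a single logistic (sigmoid) output unit.
        "NN": MLPClassifier(hidden_layer_sizes=(96,)),
    }


# Toy data: column 0 equals the label, columns 1-2 are constant, age is uninformative.
y = np.array([0, 1] * 50)
blocks = [y[:, None].astype(float), np.ones((100, 2))]
X = concatenate_features(blocks, age=np.arange(100))
keep, mi = select_features(X, y)
print(X.shape, keep.tolist())  # (100, 4) [0]
```

A fitted pipeline would call `select_features` on the training fold only and then fit each entry of `build_classifiers()` on `X[:, keep]`, so that feature selection does not see the test data.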
Information 2023, 14, x FOR PEER REVIEW 7 of 14

CNN Models
Table 2. This table shows the CNN models used in the proposed method along with the layer name where the features have been extracted and the number of features extracted from each model.

Results and Discussion
This section presents the results obtained on the three datasets introduced in Section 2.1 using the models described in Section 2.2, as well as on a combination of all datasets, as illustrated in Figure 4. For each dataset, we employed k-fold cross-validation with k = 10: the method was trained and tested 10 times, with 90% of the data used for training and 10% for testing in each iteration.
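The 10-fold protocol can be sketched as below. This is a minimal illustration that assumes plain shuffled (unstratified) folds, since the text does not say whether folds were stratified; scikit-learn's `StratifiedKFold` would additionally preserve the class ratio in every fold. The function name `kfold_splits` is hypothetical.

```python
import numpy as np


def kfold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation.

    Every sample is tested exactly once; with k = 10 each iteration trains
    on about 90% of the data and tests on the remaining ~10%.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]


splits = list(kfold_splits(25, k=10))
print([len(test) for _, test in splits])  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

When the sample count is not divisible by k, `np.array_split` makes the first folds one element larger, which is why the test folds above have sizes 3 and 2.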

Evaluation Metrics [31]
To assess the performance of our experiments, we utilize various evaluation metrics.

•
True positives (TP): Instances where the predicted class and actual class are both positive. This indicates that the classifier correctly classified the instance with a positive label.

•
False positives (FP): Instances where the predicted class is positive but the actual class is negative. This means that the classifier incorrectly assigned a positive label to the instance. In the context of breast abnormality classification, statisticians call an FP response a type I error. For example, it could refer to a calcification image being classified as a mass lesion, or a benign mass lesion being classified as malignant.

•
True negatives (TN): Instances where the predicted class and actual class are both negative. This indicates that the classifier correctly classified the instance with a negative label.

•
False negatives (FN): Instances where the predicted class is negative but the actual class is positive. This means that the classifier incorrectly assigned a negative label to the instance. In the context of breast abnormality classification, an FN response is considered a type II error. For instance, it could refer to a mass mammogram being classified as calcification, or a malignant mass lesion being classified as benign. Type II errors have particularly serious consequences.

•
Accuracy: This metric represents the proportion of correctly classified instances among all instances:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
In the case of the abnormality classifier, accuracy reflects the correct classification of image patches containing either a mass or a calcification. Similarly, in the pathology classifier, accuracy reflects the correct classification of image patches as either malignant or benign.

•
Sensitivity or Recall: This metric represents the proportion of positive image patches that are correctly classified:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
In the abnormality type classifier, sensitivity indicates the fraction of image patches that are truly mass lesions and are correctly classified. Similarly, in the pathology classifier, it indicates the fraction of truly malignant image patches that are correctly classified. Given the significance of type II errors, this metric is particularly valuable for evaluating performance.

•
Precision: This metric reflects the proportion of positive predictions that are correct. It is calculated using the following formula:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

•
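F1 Score: This measure combines recall and precision through their harmonic mean, weighting both equally and penalizing cases where either one is very low. It is commonly calculated using the formula:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$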
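The metric definitions above can be checked with a short sketch that derives the four confusion counts from binary labels and predictions (1 = cancer). Returning 0.0 when a ratio is undefined, for example precision with no positive predictions, is a convention assumed here, and `binary_metrics` is a hypothetical name.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Confusion counts plus accuracy, sensitivity (recall), precision, and F1."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(t & p))
    fp = int(np.sum(~t & p))
    tn = int(np.sum(~t & ~p))
    fn = int(np.sum(t & ~p))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "accuracy": accuracy,
            "sensitivity": sensitivity, "precision": precision, "f1": f1}


m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(m["TP"], m["FP"], m["TN"], m["FN"], m["accuracy"])  # 2 1 4 1 0.75
```

In this example one cancer case is missed (an FN, type II error) and one normal case is flagged (an FP, type I error), giving a precision, sensitivity, and F1 of 2/3 each.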

Performance Evaluation of the Proposed Model for Different Classifiers
Table 4 compares the performance metrics of the different CNN models on the RSNA dataset with the NN classifier. Among the individual CNN models, EfficientNet consistently outperforms the others in terms of accuracy, sensitivity, precision, AUC, and F-score. Its superior performance can be attributed to its architecture, which enables it to capture relevant features and make accurate predictions on the RSNA dataset, making it the most effective individual model for classifying images in this dataset. The last row of the table shows that the proposed concatenation scheme significantly improves all performance metrics; for instance, its accuracy is 6 percent higher than that of the best individual CNN model, EfficientNet. Table 5 summarizes the results obtained using the kNN classifier with k = 5. These results show a significant decline in performance compared to the NN classifier. Specifically, without feature concatenation, the highest accuracy is achieved with AlexNet, which is 8 percent lower than the accuracy of the same model with the NN classifier and 13 percent lower than that of the best-performing EfficientNet model with the NN classifier. The accuracy of the concatenated model is also 14 percent lower than that of the concatenated model with the NN classifier. Table 6 displays the results obtained with the RF classifier. The accuracy of the concatenated model is equivalent to that obtained with the kNN classifier but falls short of the NN classifier. Among the individual models, EfficientNet exhibits the most favorable performance metrics, while mobileNetSmall exhibits the least favorable. Table 7 displays the results of the proposed method using the SVM classifier. SVM exhibits the lowest accuracy of the four investigated classifiers: the accuracy of the SVM-based method is 19 percent lower than that of the NN-based method, and the accuracy of its concatenated model is 5 percent lower than those of the kNN- and RF-based systems. Based on the findings presented in Tables 4-7, the NN classifier achieves the highest level of performance. Therefore, we used the proposed approach with the NN classifier as the benchmark for comparison with existing methods.
To the best of our knowledge, the RSNA dataset has not been used in any previously published paper. Consequently, we compared our proposed model against existing methods on the MIAS and DDSM datasets and summarized the results in Table 8. As Table 8 shows, our proposed model outperforms state-of-the-art algorithms in terms of accuracy and sensitivity on both the MIAS and DDSM datasets. Although the method described in [32] achieved slightly better precision on the MIAS dataset, our algorithm outperformed it on the remaining two performance metrics.

Cross-Dataset Validation
So far, we have trained and tested the proposed method on the same dataset. However, it is crucial to evaluate whether a model trained on one dataset performs well on other datasets, that is, on images collected with different machines and under different image acquisition standards. In this subsection, we train our method on one of the three datasets (RSNA, MIAS, or DDSM) and test it on images from a different dataset. The results of these experiments are summarized in Table 9. Because the RSNA dataset comprises images of various types and resolutions, cross-dataset validation involving it yields slightly lower performance metrics; specifically, when the method is trained on either the MIAS or the DDSM dataset and tested on RSNA images, performance is slightly reduced. Figures 1-3 show that RSNA images resemble MIAS images more closely than DDSM images, which is consistent with the observation that cross-dataset validation between RSNA and MIAS leads to higher accuracy than cross-dataset validation between RSNA and DDSM. The results presented in Table 9 support this finding.

Conclusions
We have developed a novel method for the accurate diagnosis of breast cancer in mammography images. Our approach extracts and selects features from multiple pre-trained CNN models and then classifies them using various machine learning algorithms: kNN, SVM, RF, and NN. The results obtained for different datasets demonstrate the effectiveness of the proposed scheme.
Our findings indicate that the NN-based classifier yielded the best performance in our experiments. Notably, we achieved accuracies of 92%, 94.5%, and 96% on the RSNA, MIAS, and DDSM datasets, respectively. On MIAS and DDSM, these results surpass those of existing methods, underscoring the advantage of our approach in terms of accuracy and sensitivity.
In terms of future work, we envision several directions to enhance our method.Firstly, exploring advanced deep learning techniques, such as attention mechanisms, could further improve the model's performance.Secondly, investigating the integration of additional clinical and genomic data could potentially enhance the accuracy and predictive capabilities of our system.Lastly, conducting rigorous validation on larger-scale datasets from multiple healthcare institutions would provide more robust evidence of the method's effectiveness and generalizability.

Figure 1. These figures show two sample images from the RSNA dataset for (a) a cancerous and (b) a normal subject.

Figure 2. These figures show two sample images from the MIAS dataset for (a) cancerous and (b) normal subjects.

Figure 3. These figures show two sample images from the DDSM dataset for (a) cancerous and (b) normal subjects.

Figure 4. Block diagram of the proposed system.

Figure 5. This figure shows the distribution of age for cancer and non-cancer subjects in the RSNA dataset.

Figure 6. These figures show the distributions of (a) a good feature and (b) a weak feature, extracted using a pre-trained CNN model, for cancer and non-cancer subjects in the DDSM dataset. The mutual information computed for these two features is 0.035 and zero, respectively.

Table 1 summarizes these three datasets.

Table 1. This table shows the description of the three datasets.

[…] ImageNet dataset, which contains over one million images, demonstrated the potential of deep neural networks for image recognition tasks and paved the way for further advances in the field of computer vision.
B.
ResNet50 [26] is a deep CNN architecture that uses residual connections to enable learning in very deep architectures without suffering from the vanishing gradient problem. It consists of 50 layers, including convolutional layers, batch normalization layers, ReLU activation functions, and fully connected layers. ResNet50 also uses skip connections that bypass several layers in the network, allowing it to effectively learn both low-level and high-level features.
1. Normalization: The RSNA dataset consists of images in various formats, including 12 and 16 bits per pixel, and with two different photometric interpretations, MONOCHROME1 and MONOCHROME2. In MONOCHROME1 images, the minimum pixel value is displayed as white, so brightness decreases as pixel values increase; in MONOCHROME2 images, the minimum pixel value is displayed as black, so brightness increases with pixel value. To ensure consistency within the RSNA dataset, we convert all MONOCHROME1 images to MONOCHROME2. We then perform intensity normalization, scaling the pixel values to the range 0 to 255 (8 bits per pixel), so that pixel values across the dataset are consistent and comparable. The DDSM and MIAS images already have pixel values within the range 0 to 255, so no further normalization is required for them.
2. Region of Interest Selection: To select the region of interest, we first apply a global thresholding method to the image and then extract the contour of the largest object in the image, which corresponds to the breast area.
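The normalization step can be sketched as an inversion of MONOCHROME1 images followed by min-max scaling to 8 bits. This is a minimal illustration, not the authors' code: scaling over each image's own intensity range (rather than its stored bit depth) is an assumption, and `to_monochrome2_uint8` is a hypothetical name.

```python
import numpy as np


def to_monochrome2_uint8(pixels, photometric_interpretation):
    """Convert a raw mammogram to MONOCHROME2 and min-max scale it to 0..255 (uint8).

    Assumption: scaling uses the image's observed min/max, not its bit depth.
    """
    img = np.asarray(pixels, dtype=np.float64)
    if photometric_interpretation == "MONOCHROME1":
        # Invert so that higher values are displayed brighter, as in MONOCHROME2.
        img = img.max() - img
    lo, hi = img.min(), img.max()
    if hi > lo:
        img = (img - lo) / (hi - lo) * 255.0
    else:
        img = np.zeros_like(img)        # constant image: map everything to 0
    return np.round(img).astype(np.uint8)


raw = np.array([[0, 100, 400]], dtype=np.uint16)
print(to_monochrome2_uint8(raw, "MONOCHROME2").tolist())  # [[0, 64, 255]]
print(to_monochrome2_uint8(raw, "MONOCHROME1").tolist())  # [[255, 191, 0]]
```

With pydicom, the two arguments would typically be `ds.pixel_array` and `ds.PhotometricInterpretation` of a dataset read with `pydicom.dcmread`.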

Table 3. The total number of features obtained from each dataset before and after feature selection.
¹ The RSNA dataset provides two views for each subject and one additional feature for age.

Table 4. Performance comparison of the proposed method for different CNN models and the concatenated model (Concat. Model) with the NN classifier on the RSNA dataset.

Table 5. Performance comparison of the proposed method for different CNN models and the concatenated model (Concat. Model) with the kNN classifier on the RSNA dataset.

Table 6. Performance comparison of the proposed method for different CNN models and the concatenated model (Concat. Model) with the RF classifier on the RSNA dataset.

Table 7. Performance comparison of the proposed method for different CNN models and the concatenated model (Concat. Model) with the SVM classifier on the RSNA dataset.

Table 8. Performance comparison of our proposed model vs. existing methods on the MIAS and DDSM datasets.

Table 9. Performance of the proposed model with cross-dataset validation, i.e., trained and tested on different datasets.