Classification of Breast Cancer in Mammograms with Deep Learning Adding a Fifth Class

Breast cancer is one of the most prevalent diseases worldwide and a cause of profound concern; early detection and diagnosis, achieved through imaging techniques such as mammography, play the leading role against it. Radiologists tend to have a high false positive rate for mammography diagnoses and an accuracy of around 82%. Deep learning (DL) techniques have shown promising results in the early detection of breast cancer through computer-aided diagnosis (CAD) systems that implement convolutional neural networks (CNNs). This work focuses on applying, evaluating, and comparing the AlexNet, GoogLeNet, Resnet50, and Vgg19 architectures to classify breast lesions, using transfer learning with fine-tuning and training the CNNs with regions extracted from the MIAS and INbreast databases. We analyzed 14 classifiers involving four classes, as several previous studies have done, corresponding to benign and malignant microcalcifications and masses; as our main contribution, we also added a fifth class for the normal tissue of the mammary parenchyma, which increases the rate of correct detection. The architectures were evaluated with a statistical analysis based on the receiver operating characteristic (ROC) curve, the area under the curve (AUC), F1 score, accuracy, precision, sensitivity, and specificity. The best results were obtained with GoogLeNet trained with five classes on a balanced database, with an AUC of 99.29%, F1 score of 91.92%, accuracy of 91.92%, precision of 92.15%, sensitivity of 91.70%, and specificity of 97.66%. We conclude that GoogLeNet is the most suitable of the evaluated networks as a classifier in a CAD system for breast cancer.


Introduction
Breast cancer is the most common type of cancer in the world, with almost 2.26 million cases, and ranks 5th in causes of death with 685,000 deaths, surpassed by stomach, liver, colon, and lung cancer [1]. According to the World Health Organization, breast cancer is the most prevalent cancer globally; in 2020, about 2.3 million women were diagnosed with breast cancer. At the end of 2020, there were 7.8 million women alive who had been diagnosed with breast cancer [2]. For 2017 in the U.S. alone, according to the study by [3], 255,180 new cases were estimated with approximately 40,610 deaths. In 2019, 268,600 new cases were estimated, with about 66,020 deaths [4]. In Mexico, according to the National Institute of Statistics and Geography [5], in 2017, 24 out of 100 hospital discharges in the population aged 20 years or more were due to malignant breast tumors. In 2018, 7257 female deaths caused by breast cancer were registered. In 2019, for every 100,000 women aged 20 years or older, 35.24 new breast cancer cases were reported. At the national level, the mortality rate from breast cancer is 17.19 deaths per 100,000 women aged 20 years or older. In 2020, 29,929 new cases were diagnosed, making breast cancer the disease with the highest incidence, and 7931 deaths were registered, with a cumulative mortality risk of 1.18 [6]. Most developed health care systems have implemented imaging for breast cancer detection based on the evidence that mammography confers a reduction in mortality [7].
The European Commission on Breast Cancer Screening and diagnosis (ECIBC) conducted a study on the optimal number of readings for mammography readers working in mammography screening programs. They concluded that an increasing number of readings is related to a decrease in the false positive rate of 1.35% for fewer than 3500 readings per year; likewise, sensitivity increases between 0.893 and 0.939 when carrying out approximately 7000 readings. When these annual numbers of readings are exceeded, the rates above begin to decrease [8].
Some studies on human efficiency in detecting breast cancer report an area under the curve (AUC) of around 82%, a sensitivity lower than 86%, and a false positive detection rate of around 9.1% [9,10]. Another study maintains that the frequency of false positives varies among readers and screening programs and has been reported in the range of 3 to 12% of screening examinations [11]. Human readers can also misinterpret a mammogram; about 15% to 35% of cancers that occur in screened women are not detected on imaging and present clinically as interval cancers, either due to errors or because the tumor is not perceptible to the radiologist [12]. The most surprising fact from crowdsourcing studies in the medical domain is the conclusion that many non-professional and inexperienced users as well as medical experts can perform radiological imaging [13]. Due to this fact, artificial intelligence with computer-aided diagnosis systems is beginning to play a significant role in imaging techniques and the timely detection of breast cancer.
Research has been carried out around the world on different computer-aided diagnosis (CAD) systems to detect benign and malignant masses and microcalcifications to increase the timely detection of cancer. The works analyze mammograms from various databases and images, such as full field digital mammogram (FFDM), histopathology, ultrasound, and magnetic resonance imaging (MRI). The authors include machine learning (ML) or deep learning (DL) techniques to obtain classifiers such as support vector machines (SVM), k-nearest neighbor (KNN), linear discriminant analysis (LDA), and stacked autoencoders (SAE) to create deep networks [14]. Likewise, transfer learning is applied to train convolutional neural networks (CNN) and combined with ML techniques such as SVM [15][16][17]. In order to make comparisons or change CNN attributes such as weights and learning rates, the architectures have been trained several times with different optimizers such as stochastic gradient descent (SGD) [18], stochastic gradient descent with momentum (SGDM), adaptive moment estimation (Adam) [19], and RMSprop. An example is the work of Masud et al. in which they apply the previously mentioned optimizers in eight different CNNs with a breast ultrasound database, testing the performance of their classifiers, including a third class to refer to normal breast tissue and even creating their model of a shallow CNN with which they claim to have 100% on all performance metrics.
The area of deep convolutional neural networks (DCNN) has been explored by several studies, including shallow DCNN [20], the addition of optimizers, and combinations with ML techniques such as SVM and extreme learning machine (ELM) [21]. In turn, these deep networks have been implemented together with the fully complex-valued relaxation network (FCRN) [22] and in works with faster region convolutional neural networks (Faster R-CNN) [23]. Since deep learning methods have achieved the most advanced performance in different medical applications, they have been studied further to take the leading step in medical diagnostics. However, there is still room for improvement [18,24].
In this study, we analyze the performance metrics of 12 classifiers based on AlexNet [25], GoogLeNet [26], and Resnet50 [27]. These CNNs are trained with balanced and unbalanced databases to classify benign and malignant masses and microcalcifications. Moreover, we added a fifth class to classify normal tissue in order to analyze which CNN best categorizes the lesions. We aim to obtain a classifier for a CAD system that can help in the early detection of breast cancer, reducing the mortality rate as well as unnecessary examinations and expenses.

Related Work
Early detection of breast cancer by imaging is a large and extensively researched area of study. Using ML techniques to classify masses and calcifications, Wang et al. implemented a semi-automatic segmentation method on a training group of 1000 images with 677 benign and 323 malignant lesions. Testing several ML classifiers and a DL model, they obtained 85.8%, 84.3%, 74.0%, and 89.7% accuracy for the SVM, KNN, LDA, and the DL model, respectively. There is also research on histopathological images using the Breast Cancer Wisconsin database, which contains 569 records with 32 characteristics of benign and malignant tumors. Shirsat and Bhosale [28] researched five features using principal component analysis (PCA) and random forest to classify the lesions. They conclude that the best feature selection methods are recursive selection with cross-validation, which achieves 97.34% accuracy, and recursive feature selection, with 95.8% accuracy.
Another investigation [29] used 75 lumbar images to develop a method for classifying the image quality perceived by radiologists; features extracted from the images were used to feed a classification algorithm, which was validated with the remaining 20 images of the data set. In their experiment, three neuroradiologists established the criteria for evaluating an image and classifying it as good, poor, fair, reasonable, or excellent. Using a parallel system, they extracted the features and classified them with machine learning techniques such as LDA, quadratic discriminant analysis (QDA), SVM, logistic regression, and MLP, obtaining accuracies between 70% and 77% with a recall of 82% and an area under the curve (AUC) of 77% at best.
The work of Hong et al., 2021 [30] proposes a method of classifying multiple classes of lung diseases with a CNN pre-trained on ImageNet, applying fine-tuning from its initial weights and feeding the network with images in *.tif format. The datasets used were that of the U.S. National Institutes of Health (NIH), divided into Normal, Pneumonia, and Pneumothorax, and the Cheonan Soonchunhyang University Hospital dataset, which includes Tuberculosis. To improve performance, they used the Center Crop technique in pre-processing while maintaining the image's aspect ratio in the training procedure; they also updated the training weights for classes with a larger number of images and performed data augmentation. As a result, they obtained a performance of 85.32%. The predictions of four classes measured with data from the Soonchunhyang University Hospital in Cheonan had a mean precision of 96.1%, a mean sensitivity of 92.2%, and a mean specificity of 97.4%.
Mahmood et al., 2021 [31] enhanced the mammograms using median, Gaussian, and bilateral filters, then applied the Contrast Limited Adaptive Histogram Equalization method (CLAHE) and the Otsu method to cut out the breast area. For training, they used 7259 mammography images from MIAS, INbreast, and one private dataset, of which 87.4% (6346) were used for training and the rest (913) for testing. They performed data augmentation in seven different ways to train several fine-tuned CNNs in parallel and classify masses and microcalcifications in mammograms. Through transfer learning, they changed the final layers of the CNNs, keeping the initial weights intact. The CNNs they used were VGGNet, MobileNet, GoogLeNet, ResNet, and DenseNet, and they proposed a deep hybrid network, ConvNet + SVM. After training and testing with the hybrid network, they obtained an accuracy of 97.8%, an AUC of 91.4%, and an F1-score of 97.06%.
In [32], the authors compared 14 different neural networks applied to diverse databases to identify which method better classifies malignant breast cells, concluding that the CNN offers slightly higher precision than the multi-layer perceptron neural network (MLP). Huynh et al. used a database with 607 breast lesions, labeled 261 as benign and 346 as malignant; of the two classification methods they used, the one based on transfer learning combined the CNN AlexNet with an SVM classifier trained on the extracted features and obtained an AUC = 0.86. Similarly, [33] used random patches to extract benign and malignant lesions in order to find a relationship with their surroundings and then used the neural networks of the Visual Geometry Group (VGG) [34] and ResNet to compare the results. They obtained AUCs of 0.943 and 0.923, respectively, and showed that both models suffer from overfitting, which is more pronounced in the ResNet model. Moreover, Dhungel et al. [16] built a mass detection, segmentation, and classification system based on several cascaded DL detectors combined with Bayesian optimization, the level set method, and more than 700 hand-annotated characteristics to remove false positives and obtain better results. When applied to the INbreast database [35], it obtains an AUC of 0.91 ± 0.12 for the manual configuration and 0.76 ± 0.23 for the configuration with minimal user intervention.
Ragab et al. [17] used the DDSM and CBIS-DDSM databases, from which they extracted a total of 5272 regions of interest (ROI) to feed the AlexNet network, with 70% for training and 30% for testing. They obtained an efficiency of 71.01% with an AUC of 88%, and after applying SVM, they increased the efficiency to 87.2% and the AUC to 94%. Using transfer learning, Hagos et al. [18] proposed a patch-based multi-input VGG-based CNN that uses the symmetric difference to detect breast masses in conjunction with the SGD optimizer. They trained on a data set of 28,294 images, obtained after augmenting the data through rotations and filters, with which they achieved an AUC of 0.933. In a similar work, Lévy and Jain, 2016 [19] applied AlexNet with an Adam optimizer, GoogLeNet with an SGDM optimizer, and a shallow CNN based on AlexNet. They used the DDSM database to classify only benign or malignant masses using 1820 images, which were augmented 25 times per image. Splitting the database into 80% for training, 10% for testing, and 10% for validation, they obtained 89% accuracy for AlexNet, 92.9% for GoogLeNet, and 60.4% for the shallow CNN.
Likewise, Masud et al. [36] used two databases of breast ultrasound images to test eight different pre-trained and fine-tuned CNNs and created a shallow CNN-based model to classify the findings into only three classes (benign, malignant, and normal). They have 1030 ultrasound images, of which they used 80% (824) for training and 20% (206) for testing. The analyzed CNNs are AlexNet, DarkNet19, GoogLeNet, MobileNet, ResNet18, Resnet50, VGG16, and XCeption; trained with the SGDM, Adam, and RMSprop optimizers. For the CNNs of our interest, the accuracies obtained for AlexNet, GoogLeNet, and Resnet50 with SGDM are 96.7%, 95.9%, and 95.2%, respectively. In the model that they propose, they obtained 100% in all performance metrics.
In a deep learning approach with CNN (DCNN), G and Suresh [21] used the CBIS-DDSM database to classify malignant and benign masses with the AlexNet topology; by adding SVM classifiers in the final layers of the network, they obtained an accuracy of 97.36%. When adding an ELM, they reported a precision of 100% in the classification. Following the previous approach, Duraisamy and Emperumal [22] employed the MIAS [37] and BCDR databases to classify masses as benign or malignant, augmenting the regions seven times to feed a DCNN with 80% of the ROIs and reserving 20% for testing. In the best of their experiments, they used the level set method to segment the images and a DCNN based on the VGG topology, and added a fully complex-valued relaxation network (FCRN), with which they obtained an accuracy of 99%, a specificity of 100%, a sensitivity of 98.75%, and an AUC of 98.15%. Furthermore, Gao et al., 2018 [20] used low energy (LE) and FFDM images from the INbreast database to recombine them and perform malignant or benign mass classification with a shallow DCNN (SD-CNN) based on ResNet. They obtained an AUC of 0.84 for LE images, and when using LE plus recombined images, increased the AUC to 0.92 with an accuracy of 0.90, a specificity of 0.94, and a recall of 0.83.
Using the DDSM database, Ben-Ari et al. [23] applied transfer learning using a CNN based on the VGG architecture. Following the region proposal convolutional neural network (R-CNN) methodology, they extended the work to a specific domain (DS-RCNN), adding SVM to classify architectural distortions and achieving more than 80% sensitivity and specificity, with 0.46 false positives per image at a true positive rate of 83%. Aiming to generate faster detections, Hadush et al. [38] used the INbreast and CBIS-DDSM databases [39], applying CLAHE to improve the quality of the images [40] and Otsu thresholding to remove their background and subtract initial weights to train their model with mammograms taken in Ethiopian hospitals. Creating their own Faster R-CNN based on VGG to perform classifications, they reached a detection precision of 91.86%, a sensitivity of 94.67%, and an AUC-ROC of 92.2%. Table 1 summarizes the studied approaches and methodologies that use neural networks to classify breast cancer. It shows the databases used with their type and quantity of images, data augmentation, detectable classes, and models with different performance metrics.
Among methods to improve the quality of the images, CLAHE stands out due to its ability to improve images through the histogram by limiting contrast, especially in homogeneous areas, avoiding noise amplification. The main advantages of CLAHE are its modest calculation requirements, ease of use, and excellent result in most images [39]. Makandar and Halalli [41] aimed to use filters such as the median, average, minmax, and wiener, combined with CLAHE; applying the method to 20 mammograms and concluding in a mammography enhancement where the wiener filter reduces noise while CLAHE enhances the image.
When performing segmentation and classification with DL and neural networks, the literature usually reports classifications for malignant and benign masses or microcalcifications; nevertheless, few studies use both kinds of lesions. Furthermore, several ROIs segmented by the algorithms are false positives [20,23], which reduces the rate of correct detections. For this reason, in this work, we train three CNNs and propose a fifth class of the "Normal" type, corresponding to breast tissue without any lesions, intending to increase the detector's effectiveness by classifying such regions and reducing the false positive rate.

Datasets
In this work, the software MATLAB® was used to analyze the databases of the Mammographic Image Analysis Society (MIAS) [37] and INbreast [35], widely studied in research regarding breast cancer [16,20,22,33,38].
The MIAS database contains 161 cases with 322 images digitized in PGM format, of which 204 are normal, and 118 contain some findings; of the latter, 66 are benign and 52 malignant. This database provides the annotations organized in 7 columns, showing the name of the mammogram, the type of tissue, the class of anomaly, its severity (Benign/Malignant), the X and Y coordinates, and the radius of the finding.
The INbreast database contains a total of 115 cases with 410 images in DICOM format, with four mammograms per case from the left and right craniocaudal (CC) and mediolateral oblique (MLO) views. The database authors provide an XML table with information on the contours of the findings and another file that includes the laterality of the breast, the type of view, the date of acquisition, the name of the file, and the classifications of the American College of Radiology (ACR) and the Breast Imaging Reporting and Data System (BI-RADS). Four more fields were added to classify each finding as a mass, microcalcification (MC), distortion, or asymmetry. Of the 410 mammograms, some have only masses, others have only MC, and others contain both types of lesions.
When linking the INbreast and MIAS databases, we take the BI-RADS classification as a basis. BI-RADS findings from 0 to 3 are classified as benign and those from 4 to 6 as malignant or suspicious. We proposed five classes: benign or malignant masses, benign or malignant microcalcifications, and the normal class, introduced to reduce false positives. In Figure 1, several regions of the "normal" type can be observed, extracted from random areas of the mammary parenchyma with fatty, glandular, and dense tissues. In the MIAS database, there are 330 annotations listed of FFDM mammograms, distributed as follows: 15 benign calcifications, 3 malignant calcifications, 54 benign masses, 51 malignant masses, and 207 mammograms taken as normal. The INbreast database provides information on 483 ROIs as follows: 225 calcifications with BI-RADS category 1 to 3, which we handle as benign, 83 calcifications with category 4 to 6 taken as malignant, 37 masses with BI-RADS category between 1 and 3, and 71 masses diagnosed as BI-RADS 4 to 6. The databases only report calcification clusters, and not all masses are listed; thus, we can select smaller groups or annotate each microcalcification separately. For this reason, we proceeded to annotate more visible microcalcifications and masses to feed the neural network for training and testing.
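The labeling rule described above can be sketched as a small function. This is only an illustration of the stated BI-RADS thresholds (0 to 3 benign, 4 to 6 malignant); the function name, the string labels, and the use of a "normal" lesion type are hypothetical choices, not taken from the original MATLAB workflow.

```python
def assign_class(lesion_type, birads=None):
    """Map a finding to one of the five classes (labels are illustrative)."""
    if lesion_type == "normal":
        # Regions without lesions carry no BI-RADS-based severity.
        return "normal"
    if lesion_type not in ("mass", "microcalcification"):
        raise ValueError(f"unsupported lesion type: {lesion_type!r}")
    if birads is None or not 0 <= birads <= 6:
        raise ValueError("BI-RADS category must be between 0 and 6")
    # BI-RADS 0-3 is handled as benign, 4-6 as malignant or suspicious.
    severity = "benign" if birads <= 3 else "malignant"
    return f"{severity}_{lesion_type}"


print(assign_class("mass", 2))
print(assign_class("microcalcification", 5))
```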
Using the MATLAB Image Labeler App, more regions with lesions were annotated, as well as regions with the "normal" classification; for the latter, positions not marked by the databases were chosen, with different sizes. In the MIAS database, 123 lesions are recorded, while with the annotations carried out, 1829 regions were reached, 613 being lesions and 1194 normal regions. From the INbreast database, we took 416 ROIs, and after making more annotations, we obtained 4189 ROIs, 543 from findings and 3646 from areas with no lesions.

Data Augmentation and Transfer Learning
When working with mammograms, the number of images is limited; to train a CNN better without overfitting [13,24], it is necessary to increase the number of ROIs to obtain more efficient classifiers. For this reason, we performed data augmentation by rotating the images in 90° intervals, flipping them from left to right, and applying the CLAHE technique to improve the contrast of the cropped ROIs. Experimentation shows that using the exponential function in CLAHE best fits the histogram model usable for a mammogram. We observed how values were equalized, tending toward dark values and partially maintaining certain gray levels, resulting in a decrease in image brightness and a greater contrast [36,39]. The different types of classification can be observed in Figure 2, showing the significant effect that the CLAHE method with the exponential distribution has on the regions. By augmenting the data using "normal" ROIs, applying CLAHE, rotating the images 90°, 180°, and 270°, flipping left to right, and removing their backgrounds, we reached a total of 12,582 regions of interest from both databases. Table 2 shows the data augmentation. Since designing a CNN architecture can take too much time and resources in testing, we decided to use transfer learning. A previously trained image classification network that has already learned to extract robust features from images can be a starting point for training breast cancer classifiers. Accuracy, precision, speed, and network size are the most important characteristics; furthermore, networks should be chosen with a trade-off between these characteristics [42].
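The geometric part of this augmentation can be sketched with NumPy. The sketch assumes that every ROI is rotated clockwise by 0°, 90°, 180°, and 270°, and that the same rotations are repeated on the left-right flipped ROI, giving eight orientations; the function name is hypothetical, and the CLAHE and background-removal steps are deliberately left out.

```python
import numpy as np


def augment_orientations(roi):
    """Return the eight rotated/flipped variants of a 2-D ROI array."""
    variants = []
    for base in (roi, np.fliplr(roi)):      # original, then left-right flip
        for quarter_turns in range(4):      # 0, 90, 180, 270 degrees
            # A negative k makes np.rot90 rotate clockwise.
            variants.append(np.rot90(base, k=-quarter_turns))
    return variants


roi = np.arange(12).reshape(3, 4)           # stand-in for a cropped region
for variant in augment_orientations(roi):
    print(variant.shape)
```

Because the rotations and flips only rearrange pixels, they add orientation variety without changing the gray-level statistics of a region.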
For the lesion classification, we used four CNN topologies that were pre-trained with 1 million images to classify the 1000 classes of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [43]. The first CNN we used was AlexNet [25], with a depth of eight layers (five convolutional and three fully connected) and 61 million parameters. The second CNN we used was GoogLeNet [26]; it contains 22 layers with 7 million parameters and is based on the inception architecture, which makes better use of the computational resources in the network and allows it to go deeper, achieving this through the network-in-network (NIN) approach of [44]. Due to the network architecture, it is possible to work on computers with low computational resources and even to train on CPUs only. The third network we used was ResNet50, created by He et al., 2015 [27], with a depth of 50 layers and 25.6 million parameters; it set the tone at ILSVRC 2015 by successfully controlling the vanishing gradient problem. The last CNN applied was Vgg19 [34], which has 144 million parameters and a depth of 19 layers; with a size of 535 MB, it is the heaviest network in this work.
Implementing the AlexNet, GoogLeNet, Resnet50, and Vgg19 topologies, we carried out fourteen studies with different training parameters, applying transfer learning. Each of these widely used classification topologies has advantages in efficiency, convergence, and processing time. Because the networks mentioned above were created to classify 1000 objects, the fully connected layer (layer index AlexNet: 23, GoogLeNet: 142, Resnet50: 175, and Vgg19: 45) was modified to classify 4 or 5 classes. To feed the neural networks, all regions were resized to the input size of each CNN, being 256 × 256 pixels for AlexNet and 224 × 224 for GoogLeNet, Resnet50, and Vgg19.
According to the number of ROIs obtained after data augmentation, we created two databases of regions of interest, one unbalanced and one balanced. Training employs 70% of the ROIs, and the remaining 30% is used for testing. The unbalanced database uses 12,582 regions with the five classes and only 6548 for training and testing with four classes. In contrast, the balanced database uses 2336 regions for each class, encompassing 11,680 images when handling the five classes and 9344 when using four classes. Table 3 summarizes the number of regions used for each database. Initially, we trained the topologies with the unbalanced databases only for four classes, and later the fifth class was added, making a total of seven classifiers. Afterward, the number of malignant calcifications was doubled, and the other classes were matched to this number to balance the database. With this, we trained another six classifiers on the four topologies with 4 and 5 classes. Figure 3 shows the data augmentation for a malignant mass, starting from the original image at the top and the effect of applying the CLAHE method at the bottom; each image is then rotated clockwise by 90°, 180°, and 270°, and later flipped from left to right and rotated again. Overfitting is an undesirable factor in training; to avoid it, we added a function that stops training once a network reaches 99.5% accuracy. Under this guideline, AlexNet and GoogLeNet were trained for 20 epochs, while Resnet50 converged much faster and stopped training at the 5th epoch; Vgg19 stopped at the 13th epoch when using four classes and at the 10th epoch when training with five classes. All CNNs were trained using the SGDM optimizer with a learning rate α = 1 × 10⁻⁴ and a mini-batch size of 8.
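As a quick check of the split sizes for the balanced database, the sketch below assumes that the 70/30 split is applied separately to each class and that the training share is rounded down; the text does not state the rounding rule, so this is an assumption, and the function name is hypothetical. Under that assumption, the five-class case gives the 8175 training and 3505 test regions reported in the Results.

```python
def stratified_split_sizes(regions_per_class, n_classes, train_percent=70):
    """Return (train, test) totals for a per-class split, rounding train down."""
    train_per_class = regions_per_class * train_percent // 100
    test_per_class = regions_per_class - train_per_class
    return train_per_class * n_classes, test_per_class * n_classes


print(stratified_split_sizes(2336, 5))  # five classes: (8175, 3505)
print(stratified_split_sizes(2336, 4))  # four classes: (6540, 2804)
```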
Regarding each CNN, we counted the number of true positive cases (TP), which indicate the number of cases correctly identified, followed by obtaining the true negative cases (TN), that is, the number of predictions in which the negative classes are classified correctly. The false positive (FP) cases represent the number of predictions where the classifier interprets a negative class as positive. Finally, the false negative (FN) cases refer to the number of predictions where the classifier predicts the positive class as negative.
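For a multi-class problem, these four counts are obtained per class in a one-vs-rest fashion. The following sketch assumes a confusion matrix whose rows are the true classes and whose columns are the predicted classes (the orientation is an assumption), and the small matrix is made-up illustrative data, not a result from this work.

```python
import numpy as np


def one_vs_rest_counts(confusion):
    """Per-class TP, TN, FP, FN from a square confusion matrix.

    Assumes confusion[i, j] counts samples of true class i predicted as j.
    """
    cm = np.asarray(confusion)
    tp = np.diag(cm)                 # correctly identified cases of each class
    fn = cm.sum(axis=1) - tp         # class i predicted as something else
    fp = cm.sum(axis=0) - tp         # other classes predicted as class i
    tn = cm.sum() - tp - fn - fp     # everything not involving class i
    return tp, tn, fp, fn


example = [[5, 1, 0],
           [2, 6, 1],
           [0, 2, 8]]
tp, tn, fp, fn = one_vs_rest_counts(example)
print(tp, tn, fp, fn)
```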
Regarding the performance metrics, accuracy shows the proportion of correct predictions made over the different classes. Precision indicates the ability of a classification model to identify only the relevant cases and therefore reflects the rate of false positives. Sensitivity or recall indicates the ability of a CNN to detect each positive case, while specificity indicates the ability to discriminate negative cases, that is, regions without malignancy. In general, increasing precision tends to decrease recall and vice versa. Therefore, the F1 score is added, which combines recall and precision by using the harmonic mean; it is more representative than a simple average because it takes into account both the false positives and the false negatives.
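A minimal sketch of these metrics for a single class, using the standard definitions in terms of TP, TN, FP, and FN, is given below. The function name and the counts in the example are hypothetical; the example is chosen only to show that the harmonic mean (F1) lies below the simple average of precision and recall when the two differ.

```python
def class_metrics(tp, tn, fp, fn):
    """Standard one-vs-rest metrics for one class from its four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


m = class_metrics(tp=90, tn=870, fp=10, fn=30)
simple_average = (m["precision"] + m["recall"]) / 2
print(m)
print(simple_average)
```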
An expression for the confusion matrix with multiple classes (a_n) is shown in (1). Expressions (2), (3), (4), and (5) describe the elements of a confusion matrix for a particular class. Equations (6) to (10) give the values of accuracy (6), precision (7), recall/sensitivity (8), specificity (9), and F1 score (10) for a specific class. The average value of each metric is obtained by computing its mean over all classes.
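For reference, the standard one-vs-rest formulation consistent with the description above can be written as follows. It assumes that a_{ij} counts the regions of true class i predicted as class j, for n classes; this row/column convention and the symbol names are assumptions.

```latex
A =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}

TP_i = a_{ii}, \qquad
TN_i = \sum_{j \neq i} \sum_{k \neq i} a_{jk}, \qquad
FP_i = \sum_{j \neq i} a_{ji}, \qquad
FN_i = \sum_{j \neq i} a_{ij}

\mathrm{Accuracy}_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}, \qquad
\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}

\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}, \qquad
\mathrm{Specificity}_i = \frac{TN_i}{TN_i + FP_i}

F1_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}, \qquad
\overline{M} = \frac{1}{n} \sum_{i=1}^{n} M_i
```

Here M stands for any of the five per-class metrics, so the last expression is the mean over all classes.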

Results
We evaluated the classification efficiency of the CNNs with various performance metrics based on the classification performed on the test set and its confusion matrix. Table 4 shows the types of training and their results on the test set; the table is ordered by the best AUC and shows the accuracy, precision, sensitivity (recall), specificity, and F1 score for each classifier. The CNN with the best performance and highest metrics was the GoogLeNet network trained with the balanced database with five classes, which includes the normal tissue class. In this training, the values obtained show the best percentages for precision, recall, and F1 score.
The training and testing of GoogLeNet with the balanced database of five classes used 11,680 regions, with 70% for training and 30% for testing, corresponding to 8175 and 3505 regions, respectively. In Figure 4, the training progress shows that as the iterations and training epochs increase, the accuracy also increases and the error decreases. It can also be seen that the stop programmed at 99.5% training accuracy performed its function of avoiding overfitting, halting the training in the middle of the 20th epoch at iteration 20,054 of 20,420, after approximately 3 h of training. Table 5 shows the classifier's results, comparing the actual regions with the predictions of the GoogLeNet network with five classes, and the percentages of each prediction in the resulting confusion matrix. Finally, to measure the efficiency of the classifiers generated by the CNNs through transfer learning and fine-tuning, the receiver operating characteristic (ROC) curves were computed, measuring the AUC of all the classifiers. Figure 5 shows the x and y coordinates of the optimal ROC operating points for each type of finding and their respective AUCs for the best GoogLeNet network. Averaging the values, we obtain an AUC = 0.99286 and the operating point x = 0.01424 and y = 0.91214, which indicates an excellent classifier.
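The iteration counts quoted above follow from the training-set size and the mini-batch size of 8. The sketch below assumes that an incomplete final mini-batch is discarded in every epoch, which reproduces the 20,420 planned iterations; the function names are hypothetical.

```python
def training_schedule(n_train, batch_size, epochs):
    """Iterations per epoch and in total, dropping incomplete mini-batches."""
    per_epoch = n_train // batch_size
    return per_epoch, per_epoch * epochs


def epoch_of_iteration(iteration, per_epoch):
    """1-based epoch that contains a given 1-based iteration."""
    return (iteration - 1) // per_epoch + 1


per_epoch, total = training_schedule(n_train=8175, batch_size=8, epochs=20)
stop_epoch = epoch_of_iteration(20054, per_epoch)
# Fraction of the stopping epoch that had been completed at the stop.
progress = (20054 - (stop_epoch - 1) * per_epoch) / per_epoch
print(per_epoch, total, stop_epoch, round(progress, 2))
```

With these assumptions, each epoch has 1021 iterations, and iteration 20,054 falls about 64% of the way through the 20th epoch, in line with the stop occurring in the middle of that epoch.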

Discussion
The implementation of deep learning in medical imaging is changing inspection paradigms, making these techniques new forms of analysis to better detect lesions. As mentioned in [9][10][11][12], human inspection involves errors or omissions in detection that could lead to fatal events, unnecessary follow-up, or higher costs. As can be seen in this work, all topologies exceed the radiologist's performance (≈ 82%). Therefore, the implementation of CAD systems is a must in the early detection of breast cancer. In addition, a CAD system does not experience fatigue, is faster, and improves with more training data.
We have analyzed the AlexNet, GoogLeNet, Resnet50, and Vgg19 topologies on breast cancer lesions, evaluating their AUC, F1 score, accuracy, precision, sensitivity, and specificity and obtaining high performance metrics, especially for GoogLeNet when training with the fifth class of normal tissue. This work shows that all topologies perform better when the additional class is included, which supports our hypothesis: it reduces false positives and false negatives and increases sensitivity and specificity.
The confusion matrix shows that the classifier has relatively lower efficiencies when classifying benign microcalcifications and masses, leading to false-positive cases. On the other hand, it has greater efficiencies when detecting malignant microcalcifications and masses. The misdetections reveal a weak spot in the classifier: it can confuse benign microcalcifications and benign masses with malignant ones, but not the opposite. As such, the correct detection and classification of the lesions that can occur in a breast is not an easy task [20,31]. False negatives occur in grayish mammograms associated with dense breast tissue, which could be reduced by implementing CLAHE image enhancement, as it markedly improves areas with low contrast.
In this study, we did not use other CNN architectures because of the trade-off between relative prediction time and accuracy. The MobileNet and DarkNet architectures have lower accuracy and longer prediction times than the architectures tested in this work [40]. Regarding the networks, Vgg19 has the largest number of parameters (144 million), followed by AlexNet (61 million), which has more than double the parameters of ResNet50 (25.6 million) and more than eight times those of GoogLeNet (7 million); this indicates that a large number of parameters does not guarantee efficiency. Likewise, regarding the depth of each CNN, AlexNet is the shallowest with 16 layers but performs better than ResNet50, which is the deepest with 50 layers, suggesting that overfitting is more significant in the ResNet model.
The data we obtained on the analyzed architectures agree with [17,19] concerning GoogLeNet and AlexNet, since similar metrics were obtained. However, these metrics differ from the work of [21], where the AlexNet-based network achieves accuracies above 97% with the Adam optimizer. The authors of [33] reported performance metrics similar to ours; however, for ResNet50, their values are higher than those we achieved. Their shallow DCNN, with 100% in all metrics, is questionable since it was trained with only 1030 ultrasound images to classify benign or malignant masses and normal tissue. The DCNN could be overfitted, since no evidence of data augmentation was found; it would be interesting to test it with FFDM images. Moreover, our results on adding a fifth class support the assumption of [18], which mentioned that adding one more class can improve detection. In [33] and [38], the AUC of Vgg was reported as 94.3% and 93.3%, below the values of our five-class approach, which underscores the value of the extra class.
Compared with [30], the procedures are quite similar when classifying masses and microcalcifications; what distinguishes our work is the fifth category with the normal class, which better discriminates false positives. It is also notable how annotation and data augmentation significantly improve our work. For GoogLeNet and ResNet50, our performance metrics are higher, although only by a small margin over their hybrid-network model, which provides a fair contrast of the efficiency of the two methods.
Previous work on breast cancer classification indicates that mass detection in breast tissue is more challenging than microcalcification (MC) detection. As Table 1 shows, most of the analyzed works use only masses, not microcalcifications. In [14], both types of lesion are used, and the performance metrics are lower than those of this work, which suggests that DL techniques yield better detection than ML techniques. Looking ahead, the classifiers and trained networks can be used in breast cancer detection interfaces and could even be combined with techniques such as Faster R-CNN or integrated into an expert system with several CNNs.

Conclusions
Mammography as an early detection technique has been vital in the fight against breast cancer, and the work carried out by specialists in the diagnosis is critical. Additionally, CAD systems can provide significant support by performing diagnoses with greater accuracy and in less time than a human being. For this work, we inspected the performance of four CNN topologies after training them with a balanced and an unbalanced database, using 4 and 5 classes: benign and malignant microcalcifications and masses, plus a fifth class of the "normal" type to represent the non-lesioned tissue of the breast.
Data augmentation played an essential role in this work: from 539 annotated regions of masses and calcifications in the MIAS and INbreast databases, we reached 12,582 regions. Augmentation was performed by rotating the images in 90° intervals, flipping them from left to right, removing their background, and improving their quality with the CLAHE method.
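The geometric part of this augmentation can be sketched in plain Python on an image stored as a list of rows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the left-to-right operation is read as a mirror flip, and background removal and CLAHE are left out. The sketch yields eight variants per region and does not attempt to reproduce the exact count of 12,582 regions.

```python
def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_left_right(img):
    """Mirror a 2D image horizontally."""
    return [row[::-1] for row in img]


def augment(img):
    """Return the four 90-degree rotations of img, each with its mirror."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_left_right(current))
        current = rotate90(current)
    return variants


# Hypothetical 2x2 "region" with four distinct pixel values.
toy_region = [
    [1, 2],
    [3, 4],
]
toy_variants = augment(toy_region)
```

For a region without symmetry, all eight variants are distinct, so each annotated region contributes eight training samples before any further processing.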
The transfer learning and fine-tuning concepts applied in this work produced outstanding results with the AlexNet, GoogLeNet, ResNet50, and Vgg19 topologies after adapting their output parameters for training. The AUC of all networks was above 95%.
The GoogLeNet topology obtained the best performance metrics when trained for five classes with a balanced database of 2336 regions per class, that is, 11,680 regions in total. The images were divided into 8175 for training and 3505 for testing, corresponding to 70% and 30%, obtaining AUC = 99.29%, F1 score = 91.92%, accuracy = 91.70%, precision = 92.15%, sensitivity = 91.70%, and specificity = 97.66%. These values were the highest recorded except for the accuracy, which was surpassed only by the same GoogLeNet network trained with five classes on the unbalanced database, with a value of 91.97%, that is, 0.27 percentage points higher.
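The split sizes can be checked with a short calculation. A plain 70% of 11,680 would be 8176, so the reported 8175 and 3505 are consistent with splitting each class of 2336 regions separately and rounding the training share down. That per-class rounding is an assumption made here to match the reported numbers, not something the text states, and the function name is hypothetical.

```python
def stratified_split_counts(per_class, n_classes, train_percent):
    """Training/test totals when each class is split on its own.

    Assumption: the training share of each class is rounded down.
    """
    train_per_class = per_class * train_percent // 100  # integer floor
    test_per_class = per_class - train_per_class
    return train_per_class * n_classes, test_per_class * n_classes


total_regions = 2336 * 5
train_total, test_total = stratified_split_counts(2336, 5, 70)
```

Under this assumption each class contributes 1635 training and 701 test regions, which sums to the 8175 and 3505 regions reported.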
GoogLeNet is followed by the Vgg19 CNN, also trained with five classes on a balanced dataset, with closely matching metrics and even a better AUC of 99.50%. GoogLeNet trained with four classes on a balanced database achieved third place, differing by only 0.15% in the AUC; it likewise obtains values above 90% in all its metrics. Therefore, we consider our hypothesis supported: adding the fifth class of normal breast tissue improves the classifier's performance, not to a great extent, but enough to make a difference and improve the medical diagnosis. In general, CNNs trained with five classes performed better than their counterparts trained with four classes, which is most evident for the Vgg19 topologies in Table 4.
On the other hand, the networks with the lowest performance were those trained with the unbalanced database. This was most pronounced for the ResNet50 topology, whose rapid convergence was associated with a low number of training epochs and lower classification efficiency; the same can be seen in the Vgg19 CNN trained with four classes.
A CAD system that includes these classifiers could reduce the rate of false-positive findings and eliminate the follow-up associated with such cases, such as biopsies, additional appointments, or double examinations. In addition, CAD systems could include information from the patient's clinical history or family medical history that correlates with breast cancer, to increase the rate of correct diagnoses made by specialists.