Application of Deep Learning to Construct Breast Cancer Diagnosis Model

(1) Background: According to statistics from Taiwan's Ministry of Health, the rate of breast cancer in women is increasing annually. Each year, more than 10,000 women suffer from breast cancer, and over 2000 die of the disease. The mortality rate is increasing annually, but if breast cancer tumors are detected earlier and appropriate treatment is provided immediately, the survival rate of patients will increase enormously. (2) Methods: This research aimed to develop a stepwise breast cancer model architecture to improve diagnostic accuracy and reduce the misdiagnosis rate of breast cancer. In the first stage, a breast cancer risk factor dataset was utilized. After pre-processing, an Artificial Neural Network (ANN) and a support vector machine (SVM) were applied to the dataset to classify breast cancer tumors, and their performances were compared. The ANN achieved 76.6% classification accuracy, and the SVM using the radial basis function achieved the best classification accuracy of 91.6%. Therefore, the SVM was utilized in the determination of results concerning the relevant breast cancer risk factors. In the second stage, we trained AlexNet, ResNet101, and InceptionV3 networks using transfer learning. The networks were studied using Adaptive Moment Estimation (ADAM) and Stochastic Gradient Descent with Momentum (SGDM) based optimization algorithms to diagnose benign and malignant tumors, and the results were evaluated; (3) Results: According to the results, AlexNet obtained 81.16%, ResNet101 85.51%, and InceptionV3 achieved a remarkable accuracy of 91.3%. The results of the three models were utilized in establishing a voting combination, and the soft-voting method was applied to average the prediction results, for which a test accuracy of 94.20% was obtained; (4) Conclusions: Despite the small number of images in this study, the accuracy is higher than that reported in other literature. The proposed method has demonstrated the need for an additional productive tool in clinical settings when radiologists are evaluating mammography images of patients.


Introduction
According to the 2020 global statistics compiled by the World Health Organization, breast cancer is reported as a dominant disease and is ranked as the highest cause of death in women [1]. Taking Taiwan as an example, according to the Ministry of Health's 2019 death statistics for women, breast cancer mortality ranked fourth among the leading causes of death, which include diseases such as diabetes, chronic respiratory disease, hypertension, and chronic liver diseases. Its mortality rate increased significantly from 12.8% (1439) in 2006 to 22.2% (2633) in 2019 [2]. By comparison, the United States of America (USA) projected 41,760 deaths in 2019, at a rate of 19.9 per 100,000 women per year, and more than 3.8 million women there have a history of breast cancer [3,4].
Despite the upsurge in the annual mortality rate, early detection of breast cancer increases the survival rate of patients if appropriate treatment is provided, and it can avoid the need for a surgical procedure. R. Kate and R. Nadig (2016) [5] mentioned that physicians and healthcare workers may make more informed decisions regarding a patient's condition if breast cancer survivability can be accurately predicted. In the last decade, numerous data mining tools have been used to determine the factors affecting the survival of patients with breast cancer [6][7][8][9][10][11][12]. Due to the advancement of technology, many machine learning tools have been used to predict and diagnose patients with breast cancer [13]. These data mining tools have assisted doctors in making accurate diagnoses. Among the most used machine learning methods for detecting or diagnosing diseases are classifiers [14]. It was noted by [15][16][17] that ANN and SVM are among the most commonly used supervised learning methods in the medical field for breast cancer diagnosis. Although multiple classifiers have been used to deal with medical-related classification problems, Ren (2012) [18] suggested that when the training samples are imbalanced, balanced learning with an optimized decision should be employed to improve the performance of both ANN and SVM. Some previous works have also highlighted significant gains from deep learning in identifying lesions in breast cancer [19]. Despite the small number of images in this study, we developed a model architecture by exploiting ANN, SVM, AlexNet [20], ResNet101 [21], and InceptionV3 [22] networks to improve the diagnostic accuracy and reduce the misdiagnosis rate of patients with breast cancer.

Materials and Methods
This study applied machine learning methods to train and construct models for breast cancer diagnosis using data on patients' breast cancer risk factors and mammograms. In the first stage, ANN and SVM were applied to the breast cancer dataset, and their performances were compared to determine the relevant breast cancer risk factors. In the second stage, an image recognition model was set up using pretrained AlexNet, ResNet101, and Inception_V3 networks fed with preprocessed mammograms. The networks were studied using Adaptive Moment Estimation (ADAM) and Stochastic Gradient Descent with Momentum (SGDM) based optimization algorithms to diagnose benign or malignant tumors, after which their accuracies were evaluated. Since the accuracy of breast cancer survivability is essential, the stepwise approach has significantly boosted the classification accuracy level. Furthermore, soft-voting was applied to average the prediction results obtained from the pretrained networks.

Research Framework
This study focuses on using machine learning to build a system that can assist doctors in making diagnostic decisions. First, the patient's breast cancer risk factor data are input into the breast cancer diagnosis model, which classifies the patient's current status as normal or breast cancer. When the result is normal, the patient is followed up regularly. When the result shows a high risk of breast cancer, the patient is scheduled to undergo mammography for in-depth examination, and the images obtained from mammography are processed and input into the breast cancer confirmation model for suspected tumors. The mammogram model determines whether the tumor in the mammogram is benign or malignant, which provides an additional basis for the physician to make a judgment and allows the patient to receive the required treatment. The overall structure is shown in Figure 1.

Classification
In this study, two classification methods, Backpropagation Neural Network (BPN) and Support Vector Machine (SVM) were used to classify the breast cancer risk factors of breast cancer patients.

Back Propagation Network (BPN)
Step 1. Determining the network architecture
This step includes determining the variables of the input and output layers and selecting the number of hidden layers, the number of neurons in the hidden layer, and the activation function.

1. Input layer: Data entering the input layer generally need to be preprocessed to reduce prediction errors caused by differing data units. After the data are normalized, the weight adjustment rates of the attributes should be similar, which avoids weight dispersion. The number of neural processing units depends on the problem; in this study, the number of neurons in the input layer is the number of attributes of the cancer data used.

2. Output layer: The output layer represents the output variables of the network, and the number of neural processing units depends on the problem. In this study, the output was set to benign and malignant.

3. Hidden layer: The hidden layer represents the interaction between the neural processing units of the input layer. In this study, the hidden layer represents the interaction between the input and output layers. The number of hidden layers, the number of neurons, and the activation function were determined here. (a) Number of hidden layers: A neural network with one hidden layer can approximate most complex functions with the required accuracy [23]. Therefore, this study sets the number of hidden layers to one. (c) Activation function: The primary role of the activation function is to convert the weighted input of a processing unit into the output of that unit. In this study, a sigmoid function, as shown in Equation (3), was used. This function converts the output value to a value between zero and one.
Step 2. Finding the best parameter combination

1. Learning rate and momentum: The learning rate mainly controls the magnitude of each weight change. If it is too large or too small, it may negatively impact the network by (1) causing the model to converge too quickly to a suboptimal solution or (2) causing the training process to stall. Therefore, in this study, we set a relatively large initial value (e.g., 0.01) for the learning rate and then gradually reduce it during training to strike a balance between speeding up convergence and avoiding oscillation. This concept is also known as a decaying learning rate, as shown in Equation (4).
where $\alpha_{\text{current}}$ is the learning rate at the current stage, $\alpha_{\text{initial}}$ is the initial learning rate, $r_{\text{decay}}$ is the decay rate, and $n_{\text{epoch}}$ is the current iteration number. In this study, $\alpha_{\text{initial}}$ was set to 0.01 and $r_{\text{decay}}$ was set to 0.9. To prevent the learning rate from approaching 0 after too many iterations, the minimum learning rate was set to $1 \times 10^{-8}$.
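The decaying learning rate described above can be illustrated with a short Python sketch. Equation (4) is not reproduced in this text, so the exponential form used here (initial rate multiplied by the decay rate raised to the epoch number) is an assumption, as is the function name `decayed_learning_rate`; only the three parameter values come from the study.

```python
def decayed_learning_rate(n_epoch, alpha_initial=0.01, r_decay=0.9, alpha_min=1e-8):
    """Return the learning rate for a given epoch.

    Assumed form: alpha_initial * r_decay ** n_epoch, clipped from below
    at alpha_min so the rate never reaches zero on long runs.
    """
    return max(alpha_initial * r_decay ** n_epoch, alpha_min)


# Learning rate at a few epochs; the lower bound takes over on long runs.
schedule = {n: decayed_learning_rate(n) for n in (0, 1, 10, 100, 1000)}
for epoch, rate in schedule.items():
    print(f"epoch {epoch:>4}: learning rate = {rate:.3e}")
```

Under this assumed form the rate shrinks by 10% per epoch, and the floor of $1 \times 10^{-8}$ is reached after roughly 130 epochs.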

2. Convergence conditions: To find a stable and predictive network architecture, evaluation indicators should be set as criteria for network architecture selection. Since this study addresses a classification problem, the classification accuracy rate and Mean Square Error (MSE) are used as indices to evaluate the prediction ability of the network. The accuracy rate and MSE are shown in Equations (5) and (6), respectively.
where N is the total number of samples, $\hat{y}_i$ is the predicted value of the ith sample, and $y_i$ is the actual value of the ith sample. In this study, the first priority for selecting the best network architecture and parameter combination was the highest Accuracy Rate, and the second priority was the smallest MSE value on the test sample.
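As a minimal sketch of these two selection criteria, the following Python code computes an accuracy rate and an MSE with NumPy. Equations (5) and (6) are not reproduced in this text, so the usual definitions are assumed: accuracy as the fraction of correctly classified samples, and MSE as the mean of the squared differences between predicted and actual values. The sample values and the 0.5 decision threshold are hypothetical.

```python
import numpy as np


def accuracy_rate(y_true, y_pred):
    """Fraction of samples whose predicted label equals the actual label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def mean_square_error(y_true, y_score):
    """Mean of squared differences between predicted and actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    return float(np.mean((y_score - y_true) ** 2))


# Hypothetical network outputs for four samples (1 = malignant, 0 = benign).
y_true = np.array([1, 0, 1, 1])
y_score = np.array([0.9, 0.2, 0.4, 0.8])      # sigmoid outputs in [0, 1]
y_pred = (y_score >= 0.5).astype(int)         # assumed decision threshold

acc = accuracy_rate(y_true, y_pred)
mse = mean_square_error(y_true, y_score)
print(f"accuracy = {acc:.2f}, MSE = {mse:.4f}")
```

With these two numbers per candidate architecture, the candidates can be sorted by accuracy first and by MSE second, which mirrors the priority order stated above.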
3. Support Vector Machine (SVM): SVM is a machine learning method published by Cortes and Vapnik in 1995 [24]. SVM has been widely used in recent years to solve various classification problems [25], [26]. From the training data, SVM finds an optimal hyperplane and a classification decision function that effectively separate the data points belonging to two categories; when a new case is to be classified, the hyperplane is used to determine the category to which the case belongs [27].
To use SVM for nonlinear problems, a kernel function can be used to map the observation points to a higher-dimensional feature space in which they can be separated by a linear hyperplane, and the solution is then found in that space; in other words, the kernel function converts nonlinear data into linearly separable data, and the classifier performs the classification work. The definition of the kernel function is shown in Equation (7).
where ϕ is the mapping function, through which x can be mapped to a higher-dimensional feature space.
The four commonly used kernel functions are Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid, as shown in Equations (8) to (11), respectively.
In the above equations, γ, r, and d are the kernel parameters. Different kernel functions matched with different kernel parameters yield different classification effects. Therefore, the selection of the kernel function and its parameters is important. According to Hsu et al. (2010) [28], the Radial Basis Function (RBF) should be given priority when selecting a kernel function because of its advantages:

1. The Radial Basis Function can classify non-linear and high-dimensional data.

2. Only the upper bound parameter C and the kernel parameter γ need to be adjusted, so the operation is less complex and achieves better prediction capability.

3. The input data is limited to between 0 and 1, which reduces the complexity of the calculation and the calculation time.

Due to the high-dimensional nature of the data used in this study, the RBF kernel was used: the optimal C and γ were searched for, and the test data were then input to evaluate the prediction accuracy rate (Accuracy Rate) of the model.
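The four kernel functions named above can be sketched in a few lines of NumPy. Equations (8) to (11) are not reproduced in this text, so the forms below are the standard ones from the practical guide by Hsu et al. cited as [28], and they are assumed to match the paper's equations; the function names and the sample vectors are illustrative only.

```python
import numpy as np


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return float(np.dot(x, y))


def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=3):
    """K(x, y) = (gamma * x . y + r) ** d"""
    return float((gamma * np.dot(x, y) + r) ** d)


def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    """K(x, y) = tanh(gamma * x . y + r)"""
    return float(np.tanh(gamma * np.dot(x, y) + r))


# Two hypothetical feature vectors.
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
print("linear    :", linear_kernel(x, y))
print("polynomial:", polynomial_kernel(x, y, gamma=1.0, r=1.0, d=2))
print("rbf       :", rbf_kernel(x, y, gamma=0.5))
print("sigmoid   :", sigmoid_kernel(x, y, gamma=0.1))
```

Note that the RBF kernel value always lies in (0, 1] and depends on a single parameter γ, which, together with C, is what a parameter search for an RBF SVM has to tune.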

Image Recognition
In this study, a Convolutional Neural Network (CNN) was used for image recognition of mammograms to determine whether a tumor is benign or malignant. The functions of the Convolutional Layer, Pooling Layer, and Fully Connected Layer of the CNN and the parameters required to construct a CNN are described below.

Convolutional Neural Network (ConvNet)
CNNs are biologically inspired feed-forward networks characterized by sparse connectivity and weight sharing among their neurons [29]. In contrast with other DL algorithms, a CNN accepts two-dimensional input data [30]. A CNN can also be regarded as a sequence of convolution and subsampling layers in which the network takes an input image of size (h, w, c), where h is the height, w is the width, and c is the number of channels in the image. The channels usually correspond to the different (RGB) colors [31]. The network outputs the conditional probability distribution over the categories, p(y|x), which is computed through a sequence of nonlinear transformations of the image [32]. For each pixel in an image, the kernel multiplies the pixel and the adjacent pixels that it covers by the corresponding kernel weights. The products are then summed, and the result is set as the pixel value at the corresponding location in the convolved image [33].
The CNN architecture consists of three kinds of layers: convolutional layers, pooling layers, and a final fully connected layer [34].
The convolutional layer: This is the main layer that forms a ConvNet and is also the most computationally intensive layer. Its main function is to extract features from the input image pixels. The first few layers of a ConvNet extract lower-level features. As the network progresses to deeper levels, the level of the features that can be extracted by the convolutional layers gradually increases. The calculation is an element-wise multiplication of the Input and the Filter, followed by a summation. The Filter (also known as the Kernel) can be regarded as a window composed of several weights. The window is slid over the image, and each pixel value in the area covered by the window is multiplied by the weight of the window at the corresponding position; the products are summed, resulting in a convolution, hence the name convolutional layer.
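The sliding-window computation described above can be made concrete with a small NumPy sketch. It is an illustration under simplifying assumptions (a single-channel image, stride 1, no padding, and no kernel flipping, as is usual in ConvNet implementations); the function name `convolve2d_valid` and the sample values are hypothetical.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    At each position, the covered pixels are multiplied element-wise by
    the kernel weights and the products are summed into one output value.
    """
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(window * kernel)
    return output


# A 4 x 4 toy image and a 2 x 2 filter that compares diagonal neighbours.
image = np.arange(16).reshape(4, 4)
kernel = np.array([[1, 0],
                   [0, -1]])
feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

A 4 × 4 input and a 2 × 2 filter give a 3 × 3 feature map, which shows how the output shrinks when no padding is used.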

Pooling Layer or Downsampling
In most ConvNet, after the computation of the convolutional layer, the Input usually enters the pooling layer. The main purpose of the pooling layer is to reduce the spatial dimension of the Feature Map (Resolution) [35]. Downsampling is conducted along the width and height of the image to reduce the computational requirements progressively through the network and minimize overfitting.
Pooling reduces the image size by sliding a Filter window over the image after the convolution is completed. The image size is reduced by extracting a specific pixel value (the maximum or the average) from the Filter window at each position. In addition, when the image size is reduced, the number of parameters to be calculated is also reduced. As the number of parameters decreases, the computation time decreases.
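A minimal sketch of this downsampling step, assuming non-overlapping square windows and a single-channel feature map, is shown below. The function name `pool2d` and the toy input are hypothetical and not taken from the study.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Downsample a 2-D feature map with non-overlapping size x size windows."""
    feature_map = np.asarray(feature_map, dtype=float)
    out_h = feature_map.shape[0] // size
    out_w = feature_map.shape[1] // size
    # Drop rows/columns that do not fill a complete window.
    trimmed = feature_map[:out_h * size, :out_w * size]
    # Axes 1 and 3 index the position inside each window.
    blocks = trimmed.reshape(out_h, size, out_w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")


feature_map = np.arange(16).reshape(4, 4)
max_pooled = pool2d(feature_map, size=2, mode="max")
avg_pooled = pool2d(feature_map, size=2, mode="average")
print(max_pooled)   # [[ 5.  7.] [13. 15.]]
print(avg_pooled)   # [[ 2.5  4.5] [10.5 12.5]]
```

Here a 4 × 4 map becomes 2 × 2, so the later layers process a quarter as many values.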
Fully connected layers: In a typical ConvNet architecture, the last layer after the convolutional and pooling layers is the Fully Connected layer. The Input enters at the beginning of the ConvNet, passes through several convolutional and pooling layers, and then passes through the last pooling layer (provided that there is one). When it is about to enter the fully connected layer, the neurons of the preceding pooling layer are connected to the activated feature values, and from that point on the structure is that of an ordinary neural network.
Transfer Learning: Transfer learning pretrains a network architecture on an extensive dataset and then uses the trained model on a smaller dataset for a new classification task [36]. Applied to ConvNets, transfer learning extracts the features of pictures using the weights of the pretrained model and uses the extracted features for classification.

Stochastic Gradient Descent
The common algorithms used to calculate the gradient in training networks are Stochastic Gradient Descent with Momentum (SGDM) and Adaptive Moment Estimation (ADAM) [37,38]. In this study, we used different optimizers, including SGDM and Adaptive Moment Estimation (ADAM), to train the network and later compared the performance of the two. The SGDM algorithm is shown in Equation (12), while the ADAM algorithm uses Equations (13) and (14) and updates the network parameters with Equation (15).
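To make the difference between the two optimizers concrete, the sketch below implements one update step of each in NumPy and applies both to a toy quadratic loss. Equations (12) to (15) are not reproduced in this text, so the forms here are assumptions: the momentum update follows the common heavy-ball formulation, and the ADAM update follows the widely used formulation with bias-corrected moment estimates, which may differ in detail from the paper's equations. All function names, hyperparameter values, and the toy loss are illustrative.

```python
import numpy as np


def sgdm_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One SGDM step: the previous update is reused, scaled by momentum."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad          # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


def loss_gradient(theta):
    """Gradient of the toy loss f(theta) = sum(theta ** 2)."""
    return 2.0 * theta


theta_sgdm = np.array([1.0, -2.0])
velocity = np.zeros(2)
theta_adam = np.array([1.0, -2.0])
m = np.zeros(2)
v = np.zeros(2)

for t in range(1, 1001):
    theta_sgdm, velocity = sgdm_step(theta_sgdm, loss_gradient(theta_sgdm), velocity)
    theta_adam, m, v = adam_step(theta_adam, loss_gradient(theta_adam), m, v, t, lr=0.05)

print("SGDM distance from minimum:", np.linalg.norm(theta_sgdm))
print("ADAM distance from minimum:", np.linalg.norm(theta_adam))
```

The main design difference is visible in `adam_step`: each parameter's step is divided by the root of its own running squared-gradient average, so the effective step size adapts per parameter, whereas SGDM applies one global learning rate.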
When there is enough data for training a model, a new model can be built and trained from scratch, but when there is less data, the problem of overfitting can easily occur, and then transfer learning can be used to overcome overfitting.

Soft Voting
The output probabilities of breast cancer obtained from each model were averaged with Equation (16) [39], and the predicted category was then determined with Equation (17):

$$\text{Output}_i = \frac{1}{n} \sum_{j=1}^{n} \text{net}_j(i) \qquad (16)$$

$$y_{\text{predict}} = \arg\max_{i \in \{i_0,\, i_1\}} \text{Output}_i \qquad (17)$$

where $\text{Output}_i$ is the output of the voting combination model for category i, n is the number of ConvNets used for the voting model, and $\text{net}_j(i)$ is the output i of the jth ConvNet.
In Equation (17), $y_{\text{predict}}$ is the category predicted using the combined voting model; $i_0$ represents category B or N, and $i_1$ represents category M. That is, if the averaged probability of category B or N is the greater one, the output is category B or N; if the averaged probability of category M is the greater one, the output is category M.
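The soft-voting rule can be sketched in a few lines of Python: the class probabilities from each ConvNet are averaged, and the class with the larger average is returned. The probability values below are hypothetical and chosen only to show a case in which soft voting and a simple majority vote disagree; the function names are illustrative.

```python
import numpy as np


def soft_vote(probabilities):
    """Average class probabilities over models and pick the largest.

    probabilities: array of shape (n_models, n_classes), one row per
    ConvNet, with columns ordered as [B or N, M].
    """
    probabilities = np.asarray(probabilities, dtype=float)
    averaged = probabilities.mean(axis=0)
    return averaged, int(np.argmax(averaged))


def majority_vote(probabilities):
    """Each model votes for its own most probable class (for comparison)."""
    votes = np.argmax(np.asarray(probabilities, dtype=float), axis=1)
    return int(np.bincount(votes).argmax())


# Hypothetical outputs of three networks for a single mammogram.
model_outputs = np.array([
    [0.55, 0.45],   # network 1 leans slightly towards B or N
    [0.52, 0.48],   # network 2 leans slightly towards B or N
    [0.10, 0.90],   # network 3 is confident the tumor is M
])
averaged, soft_label = soft_vote(model_outputs)
hard_label = majority_vote(model_outputs)
labels = ["B or N", "M"]
print("averaged probabilities:", averaged)
print("soft voting   ->", labels[soft_label])
print("majority vote ->", labels[hard_label])
```

In this toy case two networks are only marginally in favor of B or N while the third is strongly in favor of M, so averaging the probabilities yields M even though a majority vote would yield B or N.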

Confusion Matrix
The model performance was evaluated with standard classification measures based on accuracy, sensitivity, and specificity. True positive (TP) and true negative (TN) results represent correctly classified cases. Accuracy is the fraction of true positive and true negative instances among all cases, as computed in Equation (18). Sensitivity is the proportion of positive cases (those with cancer) that are correctly identified (also known as the TP rate), as in Equation (19). Specificity is the proportion of negative cases (those without cancer, i.e., cancer-free) that are correctly identified (also known as the TN rate), as in Equation (20) [40].
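As a worked example of these three measures, the sketch below uses the Inception v3 test counts reported later in the paper (Table 12): 35 malignant cases classified correctly, 4 malignant cases misclassified, 28 benign or normal cases classified correctly, and 2 misclassified as malignant. Treating malignant as the positive class is an assumption made here for illustration, and the formulas are the standard definitions, since Equations (18) to (20) are not reproduced in this text.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (TP rate), and specificity (TN rate)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts taken from the reported Inception v3 test results,
# with malignant assumed to be the positive class.
metrics = confusion_metrics(tp=35, tn=28, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# accuracy: 91.30%
# sensitivity: 89.74%
# specificity: 93.33%
```

The computed accuracy of 63/69 agrees with the 91.3% reported for Inception v3.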

Data Description
This study uses data obtained from the Breast Cancer Surveillance Consortium (BCSC), Data Resource [41,42]. The breast cancer risk factor assessment dataset is used in constructing the first stage of the breast cancer diagnosis model.
The breast cancer risk factor assessment dataset contains 2,392,998 cases with 12 attributes, namely menopause, age group, breast density, ethnicity, Hispanic origin, BMI value, age, number of relatives with breast cancer, previous breast-related surgery, last mammogram result, menopause mode, and hormonal treatment or not, together with the response variable Class. The response variable (Class) indicates whether the patient had invasive breast cancer or noninvasive breast cancer (Ductal Carcinoma in Situ). After the cases with missing values were removed, the number of cases per category was found to be imbalanced. Therefore, the Synthetic Minority Oversampling Technique (SMOTE) [43][44][45] duplicate method, which is considered the "de facto" standard for learning from imbalanced data, was used to increase the number of cases in the smaller category. After processing, a total of 88,763 cases were used to train and classify the model.
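The core idea of SMOTE, creating synthetic minority cases by interpolating between a minority case and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration on random numeric data, not the procedure or settings used in the study; the function name `smote_oversample`, the neighbor count `k`, and the toy data are all assumptions, and categorical attributes would need additional handling.

```python
import numpy as np


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n_samples = len(minority)
    k = min(k, n_samples - 1)

    # Pairwise squared distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    distances = np.sum(diffs ** 2, axis=2)
    np.fill_diagonal(distances, np.inf)            # a sample is not its own neighbor
    neighbors = np.argsort(distances, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbors at random.
    base_idx = rng.integers(0, n_samples, size=n_synthetic)
    chosen = rng.integers(0, k, size=n_synthetic)
    neighbor_idx = neighbors[base_idx, chosen]

    # Place each synthetic point somewhere on the connecting line segment.
    gap = rng.random((n_synthetic, 1))
    return minority[base_idx] + gap * (minority[neighbor_idx] - minority[base_idx])


# Toy minority class: 20 samples with 3 numeric features.
toy_rng = np.random.default_rng(1)
minority = toy_rng.normal(size=(20, 3))
synthetic = smote_oversample(minority, n_synthetic=50)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority cases, the new cases stay inside the region already occupied by the minority class rather than being exact copies.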

Data Pre-Processing
Before inputting the mammogram image into ConvNet for training, it is necessary to preprocess the mammogram images, and the following steps were applied.
(1) Cropping-The original image was cropped to retain only the main Region of Interest (ROI), i.e., the tumor area was cropped out. An example is depicted in Figure 2.
(2) Rotation-The image was rotated at a random angle within a specific range.

Input Layer
All images were then scaled to the input size required by each Convolutional Neural Network: 227 × 227 × 3 for AlexNet, 224 × 224 × 3 for ResNet101, and 299 × 299 × 3 for Inception v3, where three is the number of color channels. This means that the input images of these ConvNets are all RGB images (color images). The last fully connected layer of AlexNet, ResNet101, and Inception v3 was removed and replaced with a new spreading layer and a Softmax layer, and the number of output neurons was changed from 1000 to 2 (benign or normal (B or N) and malignant (M)).

Model Development
In this study, MATLAB R2018a software [46] was used to build the model. In transfer learning, the weights of a specific number of layers need to be fixed so that the learned weights are used to extract the features of the image, which reduces the probability of overfitting. At the same time, allowing the deeper convolutional layers to conduct higher-level feature extraction on the images may also improve the classification accuracy of the model. Therefore, we let the deeper convolutional layers of ResNet101 and Inception v3, which have greater network depth, learn from the images (i.e., the weights of the deeper layers were not fixed) and compared the results to find the best set of fixed-weight layers for each of the two models. Since AlexNet is not as deep as ResNet101 and Inception v3, the number of layers with fixed weights was not explored for it.

Back Propagation Network (BPN)
The number of neurons in the input layer of BPN is the number of attributes in the BCSC breast cancer risk factor data set (there were 12 neurons in the input layer), and the number of neurons in the output layer is used to determine the presence of invasive breast cancer or noninvasive breast cancer (N = 1). Based on this, to find the best network architecture for BPN, we set the hidden layer of the network to one and determined the number of neurons in the hidden layer using the Equations (21) and (22).
Number of neurons in the hidden layer = (12 + 1)/2 = 6.5 (21)
Number of neurons in the hidden layer = √(12 × 1) = 3.46 (22)
According to the above equations, the number of neurons in the hidden layer was set to two, three, four, five, six, and seven, and each setting was tested. The sigmoid function was used as the activation function, and the results were averaged over three runs for each number of neurons (Table 1). The BCSC breast cancer risk factor dataset was entered into the BPN, and the results were obtained by training the architecture with six hidden layer neurons. Among those with cancer, 31,901 cases were correctly classified, while 10,212 cases were classified as not having cancer. Among those without cancer, 34,339 cases were correctly classified, and 12,291 cases were classified as having cancer.
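The two rules of thumb in Equations (21) and (22) can be reproduced with a few lines of Python. The function name `hidden_neuron_estimates` is hypothetical; the two formulas (arithmetic mean and geometric mean of the input and output layer sizes) are read directly from the equations above.

```python
import math


def hidden_neuron_estimates(n_inputs, n_outputs):
    """Two rule-of-thumb estimates for the hidden layer size.

    mean_rule      : (inputs + outputs) / 2        -> Equation (21)
    geometric_rule : sqrt(inputs * outputs)        -> Equation (22)
    """
    mean_rule = (n_inputs + n_outputs) / 2
    geometric_rule = math.sqrt(n_inputs * n_outputs)
    return mean_rule, geometric_rule


# 12 risk factor attributes in, 1 output neuron, as in the BPN described above.
mean_rule, geometric_rule = hidden_neuron_estimates(12, 1)
print(f"mean rule: {mean_rule}, geometric rule: {geometric_rule:.2f}")
```

The two estimates (6.5 and about 3.46) bracket the middle of the range of two to seven neurons that was actually tested.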
The experiment performed on our breast cancer risk factor model produced results for the various evaluation metrics. Tables 2 and 3 depict the accuracy, sensitivity, and specificity for the BPN and SVM. The accuracies of the BPN and SVM are 74.63% and 91.6%, respectively, showing the degree of match between the predicted and the actual instances. The sensitivity and specificity are inversely related, meaning that as the sensitivity increases, the specificity decreases, and vice versa [47]. For instance, the BPN sensitivity is 24.24% while the specificity is 75.75%; under cancer-free, the sensitivity is 73.64% while the specificity is 26.36%.

Support Vector Machine (SVM)
The kernel function used here is the Radial Basis Function (RBF), and the SVM outputs zero (normal) or one (breast cancer); the algorithm used to train this SVM is Sequential Minimal Optimization. The SVM was then cross-validated 10 times to obtain a loss of 0.0842, i.e., a classification accuracy of 91.6%, and an AUC of 0.96. According to the classification results, among those with cancer, 42,092 cases were correctly classified, and 21 cases were classified as not having cancer. Among those without cancer, 39,213 cases were correctly classified, and 7437 cases were classified as having cancer (Table 3).
From the above experimental results, we found that both SVM and BPN were excellent in correctly classifying cancer patients. The performance of the support vector machine was better than that of the backpropagation neural network in terms of classification accuracy (91.6%) and AUC (0.96).

Breast Cancer Validation Model
Because personal information is attached to the mammograms, data collection was difficult. As a result, the amount of data obtained was relatively small, and when the data are few, training the network from scratch can easily cause overfitting and affect the generalization of the model. Therefore, transfer learning is well suited to training the image recognition model.
Since the number of mammograms in this study was very small, and benign tumor images were the fewest among them, the diagnostic results of benign and normal were combined into one category (B or N), and malignant (M) formed the other category. This reduces the occurrence of overfitting.

1. AlexNet-Optimizer: Adaptive Moment Estimation (ADAM)
The classification accuracies obtained from three rounds of training and testing were 79.71%, 81.16%, and 81.16%, with an average classification accuracy of 80.68% (Table 4).

2. AlexNet-Optimizer: Stochastic Gradient Descent with Momentum (SGDM)
The classification accuracies obtained from three rounds of training and testing were 81.16%, 85.51%, and 84.06%, with an average classification accuracy of 83.58% (Table 5). Comparing the average accuracy in Table 4 with that in Table 5, the classification accuracy obtained by AlexNet using SGDM as the optimizer during training was better than that obtained using ADAM. In the classification results, 24 of the benign or normal cases were correctly classified and six were misclassified as malignant; 32 of the malignant cases were correctly classified and seven were misclassified as benign or normal (Table 6).

3. ResNet101-Optimizer: Adaptive Moment Estimation (ADAM)
From the experimental results, it can be found that when using the ADAM optimizer, the average classification accuracy of 81.16% was obtained after fixing the weights of each layer in front of module 5c, 83.09% was obtained after fixing the weights of each layer in front of module 5b, and 81.16% was obtained after fixing the weights of each layer in front of module 5a (Table 7). It can be seen from Table 7 that not allowing all the deeper convolutional layers of the model to perform higher-level feature extraction on the image will improve the classification accuracy. From the perspective of ResNet101, when using the ADAM optimizer, fixing the weights of the first to the 323rd layer can make the model develop a higher classification accuracy.

4. ResNet101-Optimizer: Stochastic Gradient Descent with Momentum (SGDM)
From the experimental results, it was observed that an average classification accuracy of 79.71% was obtained after fixing the weights of each layer in front of module 5c, 82.61% after fixing the weights of each layer in front of module 5b, and 79.71% after fixing the weights of each layer in front of module 5a (Table 8). From Table 8, it can be observed that when ResNet101 uses SGDM as the optimizer, fixing the weights of layer one to layer 323 still results in the best average classification accuracy, 82.61%, although the performance is not as good as when using ADAM (83.09%). Generally, the classification accuracy when using the ADAM optimizer is still slightly better than when using SGDM. After applying the data to ResNet101 with ADAM as the optimizer, a classification accuracy of 85.51% was obtained: 25 of the benign or normal cases were correctly classified and five were misclassified as malignant; 34 of the malignant cases were correctly classified and five were misclassified as benign or normal (Table 9).

5. Inception v3-Optimizer: Adaptive Moment Estimation (ADAM)
From the experimental results, it can be observed that when using ADAM as the optimizer, an average classification accuracy of 85.51% was obtained after fixing the weights of each layer before the merge point mixed10. An average classification accuracy of 87.44% was obtained after fixing the weights of each layer before the merge point mixed9, and an average classification accuracy of 90.71% was obtained after fixing the weights of each layer before the merge point mixed8 (Table 10). From Table 10, it can be observed that the classification accuracy of Inception v3 increases as the deeper layers are allowed to conduct higher-level feature extraction on the images.
For this reason, the number of layers with fixed weights was reduced further to test whether doing so could increase the classification accuracy again (i.e., fixed weights from layer one to layer 198).
From Table 10, the classification accuracies obtained through three rounds of training and testing were 82.61%, 84.06%, and 81.16%, with an average accuracy of 82.61%; in other words, higher classification accuracy was obtained when the weights of Inception v3 were fixed from layer one to layer 230.

6. Inception v3-Optimizer: Stochastic Gradient Descent with Momentum (SGDM)
From the experimental results, it can be observed that when using SGDM as the optimizer, an average classification accuracy of 83.09% was obtained after fixing the weights of each layer before the merge point mixed10. An average classification accuracy of 81.64% was obtained after fixing the weights of each layer before mixed9, and 83.58% was obtained after fixing the weights of each layer before mixed8 (Table 11). As shown in Table 11, the highest classification accuracy is still obtained when using the SGDM optimizer with fixed weights from the first layer to the 230th layer. In contrast with the ADAM optimizer, however, the second-highest classification accuracy was obtained with fixed weights from the first layer to the 281st layer instead of from the first layer to the 250th layer. It was also observed that the overall performance of Inception v3 using the ADAM optimizer was better than its performance using SGDM.
Since the best classification accuracy was obtained using Inception v3 with the ADAM optimizer and transfer learning with fixed weights from the first layer to the 230th layer, the test data set was applied to this completed model, and a classification accuracy of 91.3% was obtained. Twenty-eight of the benign or normal cases were correctly classified and two were incorrectly classified as malignant; 35 of the malignant cases were correctly classified and four were incorrectly classified as benign or normal (Table 12). Generally, from the classification results of AlexNet, ResNet101, and Inception v3, Inception v3 had the best performance, while the classification accuracy of ResNet101 was higher than that of AlexNet. It can also be observed that as the network deepens, the classification accuracy increases.
The results obtained from the training and testing of every single model are summarized in Tables 13-15. The accuracy of the soft-voting model is 94.20%, which is about 2.9% higher than the accuracy of the single model Inception v3. Table 14 shows the classification results of the soft-voting model, and Table 15 shows the classification results of the majority voting. It can be observed that applying soft-voting can successfully reduce the cases of misclassification and improve the classification accuracy, and this led to a significant classification accuracy rate of 89.85%.

Discussion
The proposed model demonstrates that it is capable of improving diagnostic accuracy and reducing the misdiagnosis rate of breast cancer. The results showed that when the three networks were compared using ADAM and SGDM, InceptionV3 achieved the highest accuracy (91.30%), which also compares favorably with [48]. This was due to the depth of the InceptionV3 network after fine-tuning. Although AlexNet is capable of achieving excellent results on highly challenging datasets using purely supervised learning, its performance degrades if a single convolution layer is removed [49]. A related effect was observed when the layers in front of module 5a were fixed (Table 7): the performance of the network degraded, resulting in an accuracy of 81.16%.
In comparing ADAM and SGDM, ADAM outperformed SGDM because it adapts the learning rate scale for different layers, whereas in SGDM the learning rate is hand-picked manually [50].
Regarding SVM and ANN, the SVM outperformed the ANN, and this is attributed to the ability of SVM to handle a large feature space, avoid overfitting, and condense the information of a given dataset [51]. In this regard, the SVM results demonstrated a high classification accuracy of 91.60%.
Although our soft voting model was able to correct data misclassified by a single model, we are cognizant of the fact that only a small amount of data was used, with limited computational resources, which hindered our efforts to fine-tune the networks fully. Thus, in future research, we will consider employing a larger dataset and carrying out more exhaustive tests to optimize the performance of the deep learning networks, and testing other algorithms such as the AdaBelief [52] optimizer, which converges fast and has high accuracy on image classification and language modeling.

Conclusions
This research was aimed at developing a stepwise breast cancer model architecture to improve diagnostic accuracy and reduce the misdiagnosis rate of breast cancer. In the first stage, a breast cancer risk factor dataset was used. In the second stage, an image recognition model was set up using pretrained AlexNet, ResNet101, Inception_V3 fed with preprocessed mammograms. The networks were studied using Adaptive Moment Estimation (ADAM), and SGDM based optimization algorithms to diagnose benign or malignant tumors, and their accuracies were evaluated. Since the accuracy of breast cancer survivability is essential, the stepwise approach has significantly boosted the classification accuracy level. It was observed that using a single model may misclassify a patient with benign or normal tumor as malignant; or misclassify a patient with malignant tumor as benign or normal, resulting in a missed opportunity to receive appropriate treatment. However, using multiple ConvNets voting models, soft voting can classify several cases that were originally misclassified using a single model to the correct category. This allows patients to have more time to receive proper treatment.