Breast Cancer Classification from Ultrasound Images Using Probability-Based Optimal Deep Learning Feature Fusion

After lung cancer, breast cancer is the second leading cause of death in women. If breast cancer is detected early, mortality rates in women can be reduced. Because manual breast cancer diagnosis takes a long time, an automated system is required for early cancer detection. This paper proposes a new framework for breast cancer classification from ultrasound images that employs deep learning and the fusion of the best selected features. The proposed framework is divided into five major steps: (i) data augmentation is performed to increase the size of the original dataset for better learning of Convolutional Neural Network (CNN) models; (ii) a pre-trained DarkNet-53 model is considered and the output layer is modified based on the augmented dataset classes; (iii) the modified model is trained using transfer learning and features are extracted from the global average pooling layer; (iv) the best features are selected using two improved optimization algorithms known as reformed differential evolution (RDE) and reformed gray wolf (RGW); and (v) the best selected features are fused using a new probability-based serial approach and classified using machine learning algorithms. The experiment was conducted on an augmented Breast Ultrasound Images (BUSI) dataset, and the best accuracy was 99.1%. When compared with recent techniques, the proposed framework outperforms them.


Introduction
Breast cancer is one of the most common cancers in women; it starts in the breast and spreads to other parts of the body [1]. This cancer affects the breast glands [2] and is the second most common tumor in the world, next to lung tumors [3]. Breast cancer cells create a tumor that might be seen in X-ray images. In 2020, approximately 1.8 million cancer cases were diagnosed, with breast cancer accounting for 30% of those cases [4]. There are two types of breast cancer: malignant and benign. Cells are classified based on their various characteristics. It is critical to detect breast cancer at an early stage in order to reduce the mortality rate [5].
Many imaging tools are available for the prior recognition and early treatment of breast cancer. Breast ultrasound is one of the most commonly used modalities in clinical practice for the diagnosis process [6,7]. Epithelial cells that border the terminal duct lobular unit are the source of the breast cancer. In situ or noninvasive cancer cells are those that remain inside the basement membrane of the draining duct and the basement membrane of the parts of the terminal duct lobular unit [8]. One of the most critical factors in predicting treatment decisions in breast cancer is the status of axillary lymph node metastases [9]. Ultrasound imaging is one of the most widely used test materials for detecting and categorizing breast disorders [10]. In addition to mammography, it is a common imaging modality used for performing radiological cancer diagnosis. The problems we may encounter in real life are not even reported. It is imperative to consider the presence of speckle, and to consider pre-processing such as wavelet-based denoising [11], in the first and second generations [12].
Ultrasound is non-invasive, well-tolerated by women, and radiation-free; therefore, it is a method that is frequently used in the diagnosis of breast tumors [9]. In dense breast tissue, ultrasound is a highly powerful diagnostic tool, often finding breast tumors that are missed by mammography [13]. Other types of medical imaging, such as magnetic resonance imaging (MRI) and mammography, are less portable and more costly than ultrasound imaging [14]. Computer-aided diagnosis (CAD) systems were developed to assist radiologists in the analysis of breast ultrasound tests [15,16]. Earlier CAD systems often relied on handmade visual information that is difficult to generalize across ultrasound images taken using different methods [17][18][19][20][21][22]. Recent developments have helped the construction of artificial intelligence (AI) technologies for the automated identification of breast tumors using ultrasound images [23][24][25]. A computerized method includes a few important steps such as the pre-processing of ultrasound images, tumor segmentation, extraction of features from the segmented tumor, and finally classification [26].
Recently, deep learning showed a huge improvement for cell segmentation [27], skin melanoma detection [28], hemorrhage detection [29], and a few more [30,31]. In medical imaging, deep learning was successful, especially for breast cancer [32], COVID-19 [33], Alzheimer's disease recognition [34], brain tumor [35] diagnostics, and more [36][37][38]. CNN is a type of deep learning that includes several hierarchies of layers. Through CNN, image pixels are transformed into features. The features are later utilized for infection detection and classification. In CNN, the features are extracted from the raw images. The features extracted from the raw images also produce some irrelevant features that later affect the classification performance. Therefore, it is essential to select only the most relevant features for a better classification precision rate [39].
The selection of the best features from the originally extracted features is an active research topic. Many selection algorithms have been introduced in the literature and applied in medical imaging, such as the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and a few more. Using these methods, the best subset of features is selected instead of the entire feature space. The main advantage of feature selection methods is that they improve system accuracy while decreasing computational time [40]. However, a few important features are sometimes discarded during the selection process, which reduces the system accuracy. Therefore, computer vision researchers introduced feature fusion techniques [41]. The fusion process increases the number of predictors and increases the accuracy of the system [42]. Some well-known feature fusion techniques are serial-based fusion and parallel fusion [43].
The following problems are considered in this article: (i) the available ultrasound images are not enough for the training of a good deep model as a model trained on a smaller number of images performs incorrect prediction; (ii) the similarity among benign and malignant breast cancer lesions is very high, which leads to misclassification; (iii) the features extracted from images contain irrelevant and redundant information that causes wrong predictions. To solve these problems, we propose a new fully automated deep learning-based method for breast cancer classification from ultrasound images.
The major contributions of this work are listed below.
• We modified a pre-trained deep model named DarkNet53 and trained it on augmented ultrasound images using transfer learning.
• The best features are selected using reformed differential evolution (RDE) and reformed gray wolf (RGW) optimization algorithms.
• The best selected features are fused using a probability-based approach and classified using machine learning algorithms.
The rest of the manuscript is organized as follows. The related work of this manuscript is described in Section 2. Section 3 presents the proposed methodology, which includes deep learning, feature selection, and fusion. Results and analysis are discussed in Section 4. Finally, we conclude the proposed methodology in Section 5.

Related Work
Researchers present a number of computer vision-based automated methods for breast cancer classification using ultrasound images [44,45]. A few of them concentrated on the segmentation step, followed by feature extraction [46], and a few extracted features from raw images. Researchers used the preprocessing step in a few studies to improve the contrast of the input images and highlight the infected part for better feature extraction [47]. For example, Sadad et al. [48] presented a computer-aided diagnosis (CAD) method for the detection of breast cancer. They applied Hilbert Transform (HT) for reconstructing brightness-mode images from the rough data. After that, the tumor is segmented using a marker-controlled watershed transformation. In the subsequent step, shape, and textural features are extracted and classified using the K-Nearest Neighbor (KNN) classifier and the ensemble decision tree model. Badawy et al. [3] performed semantic segmentation, fuzzy logic, and deep learning for breast tumor segmentation and classification from ultrasound images. They used fuzzy logic in the preprocessing step and segmented the tumor using the semantic segmentation approach. Later, eight pre-trained models were applied for final tumor classification.
Mishra et al. [49] introduced a machine learning (ML) radiomics-based classification pipeline. The region of interest (ROI) was separated, and useful features were extracted. The extracted features were classified using machine learning classifiers for the final classification. The experimental process was conducted on the BUSI dataset and showed improved accuracy. Byra [14] introduced a deep learning-based framework for the classification of breast mass from ultrasound images. They used transfer learning (TL) and added deep representation scaling (DRS) layers between pre-trained CNN blocks to improve information flow. Only the parameters of the DRS layers were updated during network training to modify the pre-trained CNN to analyze breast mass classification from the input images. The results showed that the DRS method was significantly better compared with the recent techniques. Irfan et al. [5] introduced a Dilated Semantic Segmentation Network (Di-CNN) for the detection and classification of breast cancer. They considered a pre-trained DenseNet201 deep model and trained it using transfer learning that was later used for feature extraction. Additionally, they implemented a 24-layered CNN and parallel fused feature information with the pre-trained model and classified the nodules. The results showed that the fusion process improves the recognition accuracy.
Hussain et al. [50] presented a contextual level set method for segmentation of breast tumors. They designed a UNet-style encoder-decoder architecture network to learn high-level contextual aspects from semantic data. Xiangmin et al. [51] presented a deep doubly supervised transfer learning network for breast cancer classification. They introduced a Learning using Privileged Information (LUPI) paradigm, which was executed through the Maximum Mean Discrepancy (MMD) criterion. Later, they combined both techniques using a novel doubly supervised TL network (DDSTN) and achieved improved performance. Woo et al. [52] introduced a computerized diagnosis system for breast cancer classification using ultrasound images. They introduced an image fusion technique and combined it with image content representation and several CNN models. The experimental process was conducted on BUSI and private datasets and achieved notable performance. Byra et al. [53] presented a deep learning model for breast mass detection in ultrasound images. They considered the problem of variation in breast mass size, shape, and characteristics. To solve these issues, they used a selective kernel U-Net CNN. Based on this approach, they fused the information and performed an experimental process on 882 breast images. Additionally, they considered three more datasets and achieved improved accuracy.
Kadry et al. [54] created a computerized technique for detecting the breast tumor section (BTS) from breast MRI slices. This study employs a combined thresholding and segmentation approach to improve and extract the BTS from 2D MRI slices. To improve the BTS, a tri-level thresholding based on the Slime Mould Algorithm and Shannon's Entropy is created, and Watershed Segmentation is implemented to mine the BTS. Following the extraction of the BTS, a comparison between the BTS and ground truth is carried out, and the required Image Performance Values are generated. Lahoura et al. [55] used an Extreme Learning Machine (ELM) to diagnose breast cancer. The gain ratio feature selection approach is used to exclude unimportant features. Finally, a cloud computing-based method for remote breast cancer diagnostics is presented and validated on the Wisconsin Diagnostic Breast Cancer dataset.
Maqsood et al. [56] offered a brain tumor diagnosis technique based on edge detection and the U-NET model. The suggested tumor segmentation system is based on image enhancement, edge detection, and classification using fuzzy logic. The contrast enhancement approach is used to pre-process the input pictures, and a fuzzy logic-based edge detection method is utilized to identify the edge in the source images, and dual tree-complex wavelet transform is employed at different scale levels. The decaying sub-band pictures are used to calculate the features, which are then classified using the U-NET CNN classification, which detects meningioma in brain images. Rajinikanth et al. [57] created an automated breast cancer diagnosis system utilizing breast thermal images. First, they captured images of various breast orientations. They then extracted healthy/DCIS image patches, processed the patches with image processing, used the Marine Predators Algorithm for feature extraction and feature optimization, and performed classification using the Decision Tree (DT) classifier, which achieved higher accuracy (>92%) when compared with other methods. In [58], the authors presented a novel layer connectivity based architecture for the low contrast nodules segmentation from ultrasound images. They employed dense connectivity and combined it with high-level coarse segmentation. Later, the dilated filter was applied to refine the nodule. Moreover, a class imbalance loss function is also proposed to improve the accuracy of the proposed architecture.
Based on the techniques mentioned above, we observed that most researchers do not pay attention to the preprocessing step. Typically, researchers performed the segmentation step first, followed by the extraction of features. A few of them used feature fusion to improve their classification results. They did not, however, concentrate on the selection of optimal features. They also ignored computational time, which is now an important factor. In this paper, we propose an optimal deep learning feature fusion framework for breast mass classification. A summary of a few of the latest techniques is given in Table 1.

Proposed Methodology
The proposed framework for breast cancer classification using ultrasound images is presented in this section. Figure 1 illustrates the architecture of the proposed framework. First, data augmentation is performed on the original ultrasound images, which are then passed to the fine-tuned deep network DarkNet53 for training. Training is performed using TL, and features are extracted from the global average pooling layer. The extracted features are refined using two reformed feature optimization techniques: the reformed differential evolution (RDE) and reformed gray wolf (RGW) algorithms. The best selected features are fused using a probability-based approach. Finally, the fused features are classified using machine learning classifiers. A detailed description of each step is given below.

Dataset Augmentation
Data augmentation has been an important research area in deep learning in recent years. Neural networks require many training samples; however, existing datasets in the medical domain are small. Therefore, a data augmentation step is necessary to increase the diversity of the original dataset.
In this work, the BUSI dataset is used for the validation process. There are 780 images in the collection, with an average image size of 500 × 500 pixels. The dataset consists of three categories: normal (133 images), malignant (210 images), and benign (487 images) [59], as illustrated in Figure 2. We split the dataset into training and testing sets at a 50:50 ratio. After the split, the training set contained 56 normal, 105 malignant, and 243 benign images. This is not enough to train a deep learning model; therefore, a data augmentation step is employed. Three operations, namely horizontal flip, vertical flip, and 90° rotation, are applied to the original ultrasound images to increase the diversity of the dataset. These operations are performed repeatedly until the number of images in each class reaches 4000. After the augmentation process, the dataset contains 12,000 images.
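The three augmentation operations can be illustrated with a minimal sketch. The experiments in this work were run in MATLAB, so the Python code below is not the authors' implementation; the function names, the list-of-rows image representation, and the random choice of source image and operation are all assumptions made for illustration.

```python
import random

def horizontal_flip(img):
    """Mirror the image left-to-right by reversing every row."""
    return [row[::-1] for row in img]

def vertical_flip(img):
    """Mirror the image top-to-bottom by reversing the row order."""
    return [list(row) for row in img[::-1]]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

# Hypothetical registry of the three operations named in the text.
OPERATIONS = (horizontal_flip, vertical_flip, rotate_90)

def augment_class(images, target_count, seed=0):
    """Grow one class to `target_count` images by repeatedly applying a
    randomly chosen operation to a randomly chosen existing image.
    The sampling policy is an assumption; the text only states that the
    operations are repeated until each class reaches the target size."""
    rng = random.Random(seed)
    augmented = [[list(row) for row in img] for img in images]
    while len(augmented) < target_count:
        source = rng.choice(augmented)
        operation = rng.choice(OPERATIONS)
        augmented.append(operation(source))
    return augmented

# Toy 2 x 2 "image" standing in for an ultrasound scan.
tiny = [[1, 2],
        [3, 4]]
grown = augment_class([tiny], target_count=8)
```

In the setting described above, `augment_class` would be called once per class with `target_count=4000`.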

Modified DarkNet-53 Model
DarkNet-53 is a 53-layer deep convolutional neural network. It serves as the backbone for the YOLOv3 object detection method. By borrowing the residual connections of ResNet, it achieves a strong feature representation while avoiding the gradient problems caused by an overly deep network. The structure of the DarkNet-53 model is shown in Figure 3. It contains successive 1 × 1 and 3 × 3 convolution layers and residual blocks. The convolutional layer is defined as follows:

$$a_m^n = \sum_{j \in X_i} a_j^{n-1} * y_{jm}^{n} + z_m^n \tag{1}$$

In Equation (1), the input image is convolved with several convolution kernels to produce $m$ separate feature maps; $a_m^n$ is the $m$th feature map in layer $n$. The symbol $*$ represents the convolution operation. The feature vector of the image is represented by $X_i$, the $j$th element of the $m$th convolution kernel in layer $n$ is represented by $y_{jm}^{n}$, and $z_m^n$ is the corresponding bias term. The next important layer is the batch normalization (BN) layer.
In Equation (2), the scaling factor is represented by ∝, the mean of all outputs by ∂, the input variance by ω, the constant by ϕ, and the offset by γ; the result of BN applied to the convolution calculation result is denoted by $a_{out}$. Batch normalization normalizes the outputs so that the values within the same batch follow the same distribution. Placed after a convolutional layer, it accelerates network convergence and helps avoid overfitting. The next layer is the activation layer. In DarkNet-53, a leaky ReLU layer is included as the activation function. This function increases the nonlinearity of the network. In Equation (3), the input value is denoted by $y_j$, the activation value by $x_j$, and the fixed parameter in the interval $(1, +\infty)$ by $b_j$. Another important layer in this network is the pooling layer, which is employed for downsampling; max-pooling is used in this network. In the last layer, all weights are combined in the form of a 1D array, also called features. These extracted features are finally classified in the output layer. The depth of this model is 53, the size is 155 MB, the number of parameters is 41.6 million, and the image input size is 256-by-256. The detailed layer-wise architecture is given in Figure 4.
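The batch normalization and leaky ReLU steps can be made concrete with a short sketch. It assumes that Equation (2) takes the standard batch normalization form, scale × (input − mean) / sqrt(variance + constant) + offset, and that Equation (3) returns $y_j$ for non-negative inputs and $y_j / b_j$ otherwise; both forms are inferred from the symbol descriptions rather than taken from the equations themselves. The function names and the default values (`eps`, `b=100.0`) are hypothetical.

```python
import math

def batch_norm(values, scale=1.0, offset=0.0, eps=1e-5):
    """Standard batch normalization over one batch of scalar outputs.
    Assumed symbol mapping: scale is the scaling factor, the batch mean
    and variance are computed here, eps is the constant, and offset is
    the additive shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [scale * (v - mean) / math.sqrt(variance + eps) + offset
            for v in values]

def leaky_relu(y, b=100.0):
    """Leaky ReLU with a fixed parameter b in (1, +inf):
    returns y for y >= 0 and y / b otherwise (assumed form)."""
    if b <= 1:
        raise ValueError("b must lie in the interval (1, +inf)")
    return y if y >= 0 else y / b

# Toy convolution outputs for one batch.
conv_out = [-2.0, 0.0, 2.0, 4.0]
normalized = batch_norm(conv_out)
activated = [leaky_relu(v) for v in normalized]
```

After normalization the batch has approximately zero mean and unit variance, and the leaky ReLU keeps a small non-zero response for negative inputs instead of zeroing them.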

Transfer Learning
Transfer learning (TL) is a machine learning approach in which a pre-trained model is reused for another task [60]. Reusing or transferring information from previously learned tasks to newly learned tasks has the potential to dramatically improve the sample efficiency of a supervised learning agent from a practical standpoint [61]. Here, TL is employed for deep feature extraction. For this purpose, the pre-trained model is first fine-tuned and then trained using TL. Mathematically, TL is defined as follows. Task: Given a particular domain $d$, a task $t = \{X, g(\cdot)\}$ has two components: a label space $X$ and a predictive function $g(\cdot)$, which is not observed but can be derived from the training data $\{(m_j, n_j) \mid j \in \{1, 2, 3, \ldots, N\}\}$, where $m_j \in Y$ and $n_j \in X$. From a probabilistic point of view, $g(m_j)$ may also be written as $p(n_j \mid m_j)$; thus, we can rewrite the task as $t = \{X, P(X \mid Y)\}$. If two tasks are dissimilar, their label spaces may differ, $X_p \neq X_q$, or their conditional probability distributions may differ, $P(X_p \mid Y_p) \neq P(X_q \mid Y_q)$.
The visual process of transfer learning is illustrated in Figure 5.

Best Features Selection
In this work, two optimization algorithms, differential evolution and gray wolf, are reformed for the selection of the best features, and their outputs are fused for the final classification. The feature matrix after applying the differential evolution algorithm has size 4788 × 818, where 4788 is the number of images and 818 is the number of features. After applying the gray wolf optimization algorithm, its size is 4788 × 734.

Reformed Differential Evolution (RDE) Algorithm
The DE algorithm searches the solution space using the differences between individuals as a guide. The main idea of DE is to scale the difference between two distinct vectors in the same population and add it to a third individual vector of that population to generate a mutation vector, which is crossed with the parent vector with a certain probability to produce a test vector. Finally, greedy selection is applied to the test vector and the parent vector, and the better vector is preserved for the next generation. The fundamental evolution processes of DE are as follows.

Initialization: $D$-dimensional vectors are used as the starting solutions in the DE algorithm. The population size is represented by $P$, and the $j$th individual is denoted by $z_j(Y) = [z_{j1}(Y), z_{j2}(Y), z_{j3}(Y), \ldots, z_{jn}(Y)]$, where $z_j(Y)$ holds the deep extracted features. The starting population is produced in $[z_{min}, z_{max}]$, where $Y$ denotes the $Y$th generation, $z_{max}$ and $z_{min}$ are the maximum and minimum values of the search space, respectively, and rand(0, 1) indicates a random number drawn uniformly from (0, 1).

Mutation Operation: The DE method generates a mutation vector $M_{j,Y}$ for each individual $z_{j,Y}$ in the existing population (the target vector) using the mutation operation. A specific mutation technique generates the corresponding mutation vector for each target vector. Several DE mutation strategies have been established based on the different ways of generating mutant individuals. The most widely utilized mutation techniques include DE/rand/1, DE/best/1, DE/rand-to-best/1, and DE/rand/2. In these strategies, mutually exclusive random integers $r_1$, $r_2$, $r_3$, $r_4$, and $r_5$ are generated within $[1, D]$. The scaling factor $E$ is a positive constant used to scale a difference vector. In the $Y$th generation, $z_{best,Y}$ is the individual vector with the best global value.
Crossover Operation: To construct a test vector $v_{j,Y} = [v_{j,Y}^{1}, v_{j,Y}^{2}, v_{j,Y}^{3}, \ldots, v_{j,Y}^{K}]$, each target vector $z_{j,Y}$ is crossed with its matching mutation vector $M_{j,Y}$. A binomial crossover is defined as follows in the DE algorithm:

$$v_{j,Y}^{i} = \begin{cases} M_{j,Y}^{i}, & \text{if } rand_i(0, 1) \le C \text{ or } i = i_{rand} \\ z_{j,Y}^{i}, & \text{otherwise} \end{cases} \qquad i = 1, 2, 3, \ldots, K$$

where $C$ denotes the crossover rate and is a constant on [0, 1]; it limits the proportion of components copied from the mutation vector. The randomly selected integer on $[1, K]$ is denoted by $i_{rand}$.

Selection Operation: If the parameter values exceed the upper or lower bounds, they are regenerated randomly and uniformly within the specified range. The objective function values of all test vectors are then evaluated, and the selection operation is carried out. The objective function value $f(v_{j,Y})$ of each test vector is compared with that of the associated target vector in the current population. If the test vector's objective function value is less than or equal to the target vector's, the test vector replaces the target vector in the next generation; otherwise, the target vector is kept for the following generation.
After obtaining the selected features $v_{j,Y}$, the features are further refined using another threshold function called the selected standard error of mean (SSEoM). Using this new threshold function, the features $Sl(k)$ are selected in the final phase, where $Tr$ is a threshold function and $SM$ is the standard error of the mean.
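The initialization, mutation, crossover, and selection steps can be summarized in a minimal sketch. It uses the standard DE/rand/1 mutation with binomial crossover and greedy selection for a minimization problem; it is not the reformed algorithm itself, and it omits the SSEoM refinement. The function name `de_select`, the toy fitness function, the default parameter values, and the 0.5 cut-off that turns the best vector into a keep/drop mask are all assumptions for illustration.

```python
import random

def de_select(fitness, dim, pop_size=10, generations=30, E=0.5, C=0.9,
              z_min=0.0, z_max=1.0, seed=0):
    """Minimize `fitness` with DE/rand/1 and binomial crossover.
    Returns (mask, best_vector, best_score); the mask marks components
    above 0.5 as selected (a hypothetical selection rule)."""
    rng = random.Random(seed)

    def random_value():
        return z_min + rng.random() * (z_max - z_min)

    # Initialization: uniform random vectors inside [z_min, z_max].
    pop = [[random_value() for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(z) for z in pop]

    for _ in range(generations):
        new_pop, new_scores = list(pop), list(scores)
        for j in range(pop_size):
            # Mutation (DE/rand/1): three distinct individuals other than j.
            r1, r2, r3 = rng.sample(
                [k for k in range(pop_size) if k != j], 3)
            mutant = [pop[r1][i] + E * (pop[r2][i] - pop[r3][i])
                      for i in range(dim)]
            # Out-of-bound components are regenerated uniformly in range.
            mutant = [m if z_min <= m <= z_max else random_value()
                      for m in mutant]
            # Binomial crossover: component i_rand always comes from the mutant.
            i_rand = rng.randrange(dim)
            trial = [mutant[i] if (rng.random() <= C or i == i_rand)
                     else pop[j][i] for i in range(dim)]
            # Greedy selection: keep the trial if it is not worse.
            trial_score = fitness(trial)
            if trial_score <= scores[j]:
                new_pop[j], new_scores[j] = trial, trial_score
        pop, scores = new_pop, new_scores

    best = min(range(pop_size), key=scores.__getitem__)
    mask = [1 if v > 0.5 else 0 for v in pop[best]]
    return mask, pop[best], scores[best]

# Toy objective: squared distance to a hypothetical "ideal" feature mask.
target = [1.0, 0.0, 1.0, 0.0, 0.0]

def toy_fitness(z):
    return sum((a - b) ** 2 for a, b in zip(z, target))

mask, best_vec, score = de_select(toy_fitness, dim=5)
```

In a feature selection setting, the fitness function would instead score the classification error obtained with the features marked in the mask.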

Reformed Binary Gray Wolf (RBGW) Optimization
The key update equation for bGWO1 in this approach is a crossover, where crossover(l, m, n) is an appropriate crossover between solutions l, m, n, and $l_1$, $l_2$, $l_3$ are binary vectors showing the effect of wolves moving towards the alpha, beta, and delta grey wolves, in that order.

$l_1$ is computed using Equation (13), where $l_1^t$ denotes the position vector in dimension t and $\mathrm{bistep}_a^t$ denotes the binary step in dimension t. The binary step is computed using Equation (14), where rand is a random number drawn from a uniform distribution over [0, 1] and $\mathrm{costep}_a^t$ denotes the continuous-valued step size. The latter is computed using Equation (15), where $X_1^t$ and $D_a^t$ are computed through Equations (16) and (17), which are later employed for the threshold selection in Equation (18). In Equation (16), $\vec{X}$ is the updated position of the prey, $\vec{r}_1$ denotes a random vector, and c is steadily reduced from 2 to 0. In Equation (17), $\vec{D}_\alpha$ represents the distances of the prey from each gray wolf and $\vec{C}_1$ represents the coefficient variable.

In Equation (18), $l_2^t$ denotes the position vector in dimension t and $\mathrm{bistep}_b^t$ denotes the binary step in dimension t. The binary step is computed using Equation (19), where rand is a random number drawn from a uniform distribution over [0, 1] and $\mathrm{costep}_b^t$ denotes the continuous-valued step size. The latter is computed using Equation (20), where $D_b^t$ in dimension t is computed by Equation (21).

For the third vector, $l_3^t$ denotes the position vector in dimension t and $\mathrm{bistep}_c^t$ denotes the binary step in dimension t. The binary step is computed using Equation (23), where rand is a random number drawn from a uniform distribution over [0, 1] and $\mathrm{costep}_c^t$ denotes the continuous-valued step size. The latter is computed by Equation (24), where $Y_c^t$ in dimension t is computed by Equation (25).

A stochastic crossover is then applied per dimension to combine the solutions u, v, and w. Here $u^t$, $v^t$, and $w^t$ are the binary values of the three solutions in dimension t, and $l^t$ denotes the output of the crossover in dimension t.
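The update described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard bGWO1 formulation: a sigmoid with steepness 10 and offset 0.5 for the continuous step, a threshold of 1 on the sum of leader bit and binary step, and a crossover that picks each of the three candidates with probability 1/3. These constants, and all function and parameter names, are assumptions made for the sketch and are not taken from the reformed variant proposed here.

```python
import math
import random


def continuous_step(coef, distance):
    """Continuous step: sigmoid of coef * distance (assumed steepness 10, offset 0.5)."""
    return 1.0 / (1.0 + math.exp(-10.0 * (coef * distance - 0.5)))


def binary_step(cstep, rng):
    """Binary step: 1 when the continuous step reaches a uniform draw from [0, 1)."""
    return 1 if cstep >= rng.random() else 0


def move_towards(leader_bit, coef, distance, rng):
    """One bit of l1, l2 or l3: leader bit plus binary step, thresholded at 1."""
    bstep = binary_step(continuous_step(coef, distance), rng)
    return 1 if leader_bit + bstep >= 1 else 0


def stochastic_crossover(u, v, w, rng):
    """Per dimension, copy the bit of u, v or w with equal probability."""
    out = []
    for u_t, v_t, w_t in zip(u, v, w):
        r = rng.random()
        out.append(u_t if r < 1 / 3 else v_t if r < 2 / 3 else w_t)
    return out


def bgwo1_update(wolf, alpha, beta, delta, c, rng):
    """New binary position of one wolf, given the three leaders and the control value c."""
    candidates = []
    for leader in (alpha, beta, delta):
        bits = []
        for x_t, lead_t in zip(wolf, leader):
            coef = 2.0 * c * rng.random() - c                  # coefficient in [-c, c]
            distance = abs(2.0 * rng.random() * lead_t - x_t)  # distance to the leader
            bits.append(move_towards(lead_t, coef, distance, rng))
        candidates.append(bits)
    u, v, w = candidates
    return stochastic_crossover(u, v, w, rng)


rng = random.Random(0)
# Every leader bit is 1, so every candidate bit is 1 regardless of the random draws.
print(bgwo1_update([0, 1, 0, 1], [1] * 4, [1] * 4, [1] * 4, c=1.0, rng=rng))  # [1, 1, 1, 1]
```

In a feature-selection setting, each bit marks whether the corresponding deep feature is kept, and the control value c would be lowered from 2 to 0 over the iterations.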
The algorithm is summarized in Algorithm 1.

Algorithm 1. Reformed Features Optimization Algorithm
Input: g, the total number of grey wolves in the pack; Gth, the number of optimization iterations.
Output: $l_\alpha$, the binary position of the optimal grey wolf; $f(l_\alpha)$, the best fitness value.
Begin
1. Create a population of g wolves with random positions ∈ [0, 1].
2. Find the a, b, c solutions based on fitness.
II. Examine the individual wolf positions.
III. Update a, b, c.

Feature Fusion and Classification
The best features selected by the RDE and RGW algorithms are fused into a single feature vector for the final classification. A probability-based serial approach is adopted for this fusion. First, a probability is computed for both selected vectors, and only one feature, the one with the highest probability value, is retained. A comparison against this high-probability feature is then conducted, and the features are fused into one matrix. The main purpose of this comparison is to handle features that are redundant across the two vectors. The fused features are then classified using machine learning algorithms. After fusion, the feature matrix has a size of 4788 × 704.
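The serial part of this fusion, and the idea of discarding redundant features, can be illustrated with a short Python sketch. The probability criterion is not specified here in enough detail to reproduce, so the sketch swaps in a much simpler rule: a column is treated as redundant only when it exactly duplicates an earlier column. The function names and the toy matrices are hypothetical, and the sketch should not be read as the proposed method.

```python
def serial_fuse(features_a, features_b):
    """Concatenate two per-sample feature matrices column-wise (serial fusion)."""
    if len(features_a) != len(features_b):
        raise ValueError("both matrices must describe the same samples")
    return [list(row_a) + list(row_b) for row_a, row_b in zip(features_a, features_b)]


def drop_duplicate_columns(matrix):
    """Keep the first occurrence of each column and drop later identical columns.

    This exact-duplicate rule is a stand-in for the probability-based comparison.
    """
    seen = set()
    kept = []
    for index, column in enumerate(zip(*matrix)):
        if column not in seen:
            seen.add(column)
            kept.append(index)
    return [[row[i] for i in kept] for row in matrix]


# Two toy selections for two samples; the second column of each matrix is identical.
selected_rde = [[1, 2], [3, 4]]
selected_rgw = [[2, 5], [4, 6]]
fused = drop_duplicate_columns(serial_fuse(selected_rde, selected_rgw))
print(fused)  # [[1, 2, 5], [3, 4, 6]]
```

Rows correspond to images and columns to features, so a fused matrix of size 4788 × 704 would hold 704 features for each of 4788 images.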

Experimental Results and Analysis
Experimental Setup: During training of the fine-tuned deep learning model, the following hyperparameters are employed: a learning rate of 0.001, a mini-batch size of 16, 200 epochs, the Adam optimization method, and a sigmoid feature activation function. Moreover, the multiclass cross-entropy loss function is employed for the calculation of loss.
All experiments are performed in MATLAB 2020b on a desktop computer with a Core i7 processor, an 8 GB graphics card, and 16 GB of RAM.
The following experiments have been performed to validate the proposed method:

Results
The results of the first experiment are given in Table 2. The sensitivity rate of Cubic SVM is validated through the confusion matrix illustrated in Figure 6. In addition, the computational time of each classifier is noted; the best time is 120.909 (s) for LDA, and the worst time is 207.879 (s) for MGSVM.

The results of the second experiment are given in Table 3. The best accuracy of 99.3% was obtained for Cubic SVM. A few other measures are also computed for this classifier, namely sensitivity rate, precision rate, F1 score, accuracy, FNR, and computational time, and their values are 99.3%, 99.3%, 99.3%, 99.3%, 0.7%, and 11.112 (s), respectively. The MGSVM and Q-SVM classifiers obtained the second-best accuracies of 99.3% and 99.2%, respectively. The rest of the classifiers also performed well. The confusion matrix of the Cubic SVM is illustrated in Figure 7. In addition, the computational time of each classifier is noted; the minimum time is 111.112 (s) for the Cubic SVM, whereas the highest time is 167.126 (s) for ESD. Compared with the results of the first experiment in Table 2, the classification accuracy is consistent, but the computational time is reduced.

The results of the third experiment are given in Table 4. The best accuracy of 98.9% was obtained for Cubic SVM. The MGSVM and Q-SVM classifiers obtained the second-best accuracy of 98.7%. The remaining classifiers, ESD, LSVM, ESKNN, FKNN, LD, CGSVM, and WKNN, achieved accuracies of 98.7%, 98.6%, 98%, 97.8%, 98.1%, 97.9%, and 97.2%, respectively. The confusion matrix of Cubic SVM is illustrated in Figure 8. In addition, the computational time of each classifier is noted; the best time is 76.2 (s) for Cubic SVM, and the worst time is 107.679 (s) for the ESKNN classifier. The accuracy of the classifiers in experiments (i)-(iii) using different training/testing ratios is summarized in Figure 9. This figure shows that the performance at 50:50 is overall better than at the other selected ratios.

Figure 9. Summary of DarkNet53 classification accuracy using different training/testing ratios.

Table 5 presents the results of the fourth experiment. In this experiment, a 50:50 training/testing ratio is used, and the best features are selected using the binary DE method. After feature selection, Cubic SVM achieves an accuracy of 99.1%. A few other measures are also computed for this classifier, namely sensitivity rate, precision rate, F1 score, accuracy, FNR, and computational time, and their values are 99.1%, 99.06%, 99.08%, 1, 0.9, and 16.082, respectively. The confusion matrix of Cubic SVM is illustrated in Figure 10. The computational time of each classifier is also noted; the best time is 28.082 (s) for the CSVM classifier, and the worst time is 42.829 (s) for the WKNN classifier. This shows that the computational time after the selection process is significantly reduced compared with the times given in Tables 2 and 3.

Table 5. Classification results of binary differential evolution selector using ultrasound images, where the training/testing ratio is 50:50.

The results of the fifth experiment are given in Table 6. In this experiment, the binary gray wolf optimization algorithm is implemented to select the best features for the final classification. The best accuracy of 99.1% was obtained for Cubic SVM. A few other measures are also computed for this classifier, namely sensitivity rate, precision rate, F1 score, accuracy, FNR, and computational time, and their values are 99.06%, 99.1%, 99.08%, 1, 0.94, and 15.239, respectively. The confusion matrix of Cubic SVM is illustrated in Figure 11. The computational time of each classifier is also noted, and the best time is 25.239 (s) for CSVM. This table shows that the overall time is reduced, while the accuracy is consistent with Tables 2 and 3.

Finally, the best selected features are fused using the proposed approach. The results are given in Table 7. The best accuracy of 99.1% was obtained for Cubic SVM. The confusion matrix of Cubic SVM is illustrated in Figure 12. In this figure, the diagonal values show the correctly predicted values. In addition, the computational time of each classifier is noted, and the best time is 13.599 (s) for the CSVM classifier.
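The measures reported in these experiments (sensitivity, precision, F1 score, accuracy, and FNR) can all be derived from a confusion matrix. The Python sketch below shows one common way to compute them, assuming that rows hold the true classes and columns the predicted classes, and that per-class values are macro-averaged. The averaging scheme and the example matrix are assumptions made for illustration; the counts are not taken from the figures of this paper.

```python
def classification_metrics(cm):
    """Macro-averaged metrics from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    sensitivities, precisions, f1_scores = [], [], []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # true class i, predicted as another class
        fp = sum(cm[r][i] for r in range(n)) - tp  # another class, predicted as class i
        sens = tp / (tp + fn) if tp + fn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        sensitivities.append(sens)
        precisions.append(prec)
        f1_scores.append(2 * prec * sens / (prec + sens) if prec + sens else 0.0)
    sensitivity = sum(sensitivities) / n
    return {
        "sensitivity": sensitivity,
        "precision": sum(precisions) / n,
        "f1": sum(f1_scores) / n,
        "accuracy": sum(cm[i][i] for i in range(n)) / total,
        "fnr": 1.0 - sensitivity,  # false negative rate complements sensitivity
    }


# Hypothetical three-class confusion matrix with 100 samples per true class.
example = [
    [98, 1, 1],
    [0, 100, 0],
    [2, 0, 98],
]
for name, value in classification_metrics(example).items():
    print(f"{name}: {100 * value:.2f}%")
```

Under this definition the FNR is the complement of the sensitivity, which matches the pairing of 99.3% sensitivity with 0.7% FNR reported for the second experiment.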

Statistical Analysis
For statistical analysis and comparison of the results, we used the post-hoc Nemenyi test. Demšar [62] has suggested using the Nemenyi test to compare techniques in a paired manner. The test determines a critical difference (CD) value for a given degree of confidence α. If the difference in the average ranks of two techniques exceeds the CD value, the null hypothesis, H0, that both methods perform equally well, is rejected.
The results of statistical analysis are summarized in Figure 14 (mean ranks of classifiers) and Figure 15 (mean ranks of feature selection methods). The best classifier is CSVM, but MGSVM and QSVM also show very good results, in terms of accuracy, which are not significantly different from CSVM. The best feature selection method among the four methods analyzed is the proposed feature fusion approach, which is significantly better than other approaches (DE, BGWO, and original).
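The critical difference used by the Nemenyi test can be computed as CD = q_α · sqrt(k(k + 1) / (6N)), where k is the number of compared techniques and N is the number of measurements over which ranks are averaged. The Python sketch below illustrates this, assuming α = 0.05 and the commonly tabulated critical values for the two-tailed Nemenyi test. The function names and the example numbers (k = 4, N = 10, and the two mean ranks) are hypothetical and are not results from this study.

```python
import math

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared techniques k (commonly tabulated values).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def critical_difference(k, n, q_table=Q_ALPHA_005):
    """Critical difference for k techniques ranked over n measurements."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6.0 * n))


def significantly_different(mean_rank_a, mean_rank_b, cd):
    """Reject H0 (equal performance) when the mean ranks differ by more than cd."""
    return abs(mean_rank_a - mean_rank_b) > cd


# Hypothetical setting: 4 feature selection methods ranked over 10 classifiers.
cd = critical_difference(k=4, n=10)
print(round(cd, 3))
print(significantly_different(1.2, 3.4, cd))
```

A larger N shrinks the critical difference, so the same gap in mean ranks becomes easier to declare significant as more measurements are ranked.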

Comparison with the State of the Art
The proposed method is compared with state-of-the-art techniques, as given in Table 8. In [63], the authors used ultrasound images and achieved an accuracy of 73%. In [64], the adaptive histogram equalization method was used to enhance ultrasound images, and an accuracy of 89.73% was obtained. In [52], a CAD system was presented for tumor identification that combines an imaging fusion method with various formats of image content and ensembles of multiple CNN architectures. The accuracy achieved for this dataset was 94.62%. In [65], the source breast ultrasound image was first processed using bilateral filtering and fuzzy enhancement methods. The accuracy achieved was 95.48%. In [66], the authors implemented a semi-supervised generative adversarial network (GAN) model and achieved an accuracy of 90.41%. The proposed method achieved an accuracy of 99.1% on the augmented BUSI dataset, with a computational time of 13.599 (s).

Table 8. Comparison with the state-of-the-art techniques.

Conclusions
We proposed an automated system for breast cancer classification using ultrasound images. The proposed method consists of a few sequential steps. Initially, the breast ultrasound data are augmented and a DarkNet-53 deep learning model is retrained on them. Next, features are extracted from the pooling layer, and the best features are selected using two optimization algorithms, the reformed BGWO and the reformed DE. The selected features are finally fused using the proposed approach and then classified using machine learning algorithms. Several experiments were performed, and the proposed method achieved the best accuracy of 99.1% (using feature fusion and the CSVM classifier). The comparison with recent techniques shows that the proposed framework improves the results. The strengths of this work are: (i) augmentation of the dataset improved the training strength, (ii) selection of the best features removed irrelevant features, and (iii) the fusion method further reduced the computational time while keeping the accuracy consistent.
In the future, we will focus on two key steps: (i) increasing the size of the database, and (ii) designing a CNN model from scratch for breast tumor classification. We will discuss our proposed model with ultrasound imaging specialists and medical doctors, aiming for practical implementation at hospitals.

Data Availability Statement:
The dataset used in this paper is available from https://scholar.cu.edu.eg/?q=afahmy/pages/dataset (accessed on 20 November 2021).

Conflicts of Interest:
The authors declare no conflict of interest.