Breast Cancer Tumor Classification Using a Bag of Deep Multi-Resolution Convolutional Features

Abstract: Breast cancer accounts for 30% of all female cancers. Accurately distinguishing dangerous malignant tumors from harmless benign ones is key to ensuring that patients receive lifesaving treatments on time. However, as doctors currently fail to identify 10% to 30% of breast cancers during regular assessment, automated methods for detecting malignant tumors are desirable. Although several computerized methods for breast cancer classification have been proposed, convolutional neural networks (CNNs) have demonstrably outperformed other approaches. In this paper, we propose an automated method for the binary classification of breast cancer tumors as either malignant or benign that utilizes a bag of deep multi-resolution convolutional features (BoDMCF) extracted from histopathological images at four resolutions (40×, 100×, 200×, and 400×) by three pre-trained state-of-the-art deep CNN models: ResNet-50, EfficientNet-b0, and Inception-v3. The BoDMCF extracted by the pre-trained CNNs was pooled using global average pooling and classified using a support vector machine (SVM) classifier. While some prior work has utilized CNNs for breast cancer classification, it did not explore using CNNs to extract and pool a bag of deep multi-resolution features. Other prior work utilized CNNs for deep multi-resolution feature extraction from chest X-ray radiographs to detect conditions such as pneumoconiosis, but not for breast cancer detection from histopathological images. In rigorous evaluation experiments, our deep BoDMCF approach with global pooling achieved an average accuracy of 99.92%, sensitivity (or recall) of 0.9987, specificity of 0.9797, positive predictive value (PPV, or precision) of 0.9987, F1-score of 0.9987, MCC of 0.9980, Kappa of 0.8368, and AUC of 0.9990 on the publicly available BreaKHis breast cancer image dataset. The proposed approach outperforms the prior state of the art for histopathological breast cancer classification as well as a comprehensive set of CNN baselines, including ResNet18, Inception-v3, DenseNet201, EfficientNet-b0, SqueezeNet, and ShuffleNet, both when classifying images at any individual resolution (40×, 100×, 200×, or 400×) and when SVM is used to classify a BoDMCF extracted using any single pre-trained CNN model. We also demonstrate, through a carefully constructed set of experiments, that each component of our approach contributes non-trivially to its superior performance, including transfer learning (pre-training and fine-tuning), deep feature extraction at multiple resolutions, global pooling of deep multi-resolution features into a powerful BoDMCF representation, and classification using SVM.


Introduction
Breast cancer accounts for 30% of all female cancers [1,2], has the highest death rate of all types of cancers [1], and the number of new cases is expected to rise by almost 70% in the next two decades. There are two kinds of growth in breast tissue, non-harmful (benign) and dangerous (malignant, or cancerous), that should be distinguished from each other during patient assessments. Our proposed approach outperforms the prior state of the art as well as a comprehensive set of CNN baselines when classifying any single resolution (40×, 100×, 200×, or 400×). In our evaluation, we demonstrate through a carefully constructed set of experiments that each component of our approach contributes non-trivially to its superior performance, including transfer learning (pre-training and fine-tuning), deep feature extraction at multiple resolutions, global pooling of deep multi-resolution features into a powerful BoDMCF representation, and classification using SVM.
Novelty: Our work is novel because, while some prior work has utilized CNNs for breast cancer classification, none of it explored using CNNs to extract and pool a bag of deep multi-resolution features. Other prior work utilized CNNs for deep multi-resolution feature extraction from chest X-ray radiographs to detect conditions such as pneumoconiosis, but not for breast cancer detection from histopathological images. The BoDMCF approach innovatively leverages several key insights. First, pre-training state-of-the-art CNNs on large repositories such as the 14-million-image ImageNet repository equips them with the intelligence to learn the most predictive features and low-level image attributes, such as edges and corners, from histopathological breast cancer images. Secondly, extracting and pooling features from multiple resolutions of histopathological images improves classification accuracy, as discriminative visual attributes may be most visible at different resolutions. Thirdly, global pooling of multi-resolution breast cancer features creates a bag of features that is so powerful that classifying it using SVM achieves highly accurate binary breast cancer classification (malignant vs. benign) of histopathological images. The deep BoDMCF approach has yielded impressive results in other image classification domains, including multimedia image retrieval [28] and remote sensing image scene classification [26]. Ours is the first work to apply this powerful representation learning technique to binary breast cancer image classification (malignant vs. benign). The specific combination of state-of-the-art deep learning architectures we utilize is also novel and was carefully selected after extensive, systematic experimentation.
Challenges: First, the heterogeneity of the visual texture patterns observable in breast histopathological images makes tumor malignancy classification a challenging task even for CNNs, affecting their performance [29]. Secondly, the discriminative visual attributes of tumor malignancy may be most visible at different resolutions of histopathological images. By directly addressing these two challenges, the BoDMCF approach is particularly suited to classifying tumor malignancy.
Related work that utilized deep learning and CNNs for breast cancer tumor classification is summarized in Table 1. While there has been some prior work that utilized neural networks for breast cancer classification, none of it explored the deep BoDMCF representation with the global pooling approach that we propose. Maqsood et al. [30] classified screening mammograms using a CNN and achieved an average accuracy of 97.49%. Spanhol et al. [31] utilized the AlexNet CNN model for classifying tumors in histopathological images as malignant or benign. Kowal et al. [32] explored deep learning models for nuclei segmentation, in which the instances were classified as benign or malignant on a dataset of 269 images, achieving average accuracies from 80.2% to 92.4%. Shen et al. [33] utilized an active learning approach to classify breast cancer images. Byra et al. [34] combined statistical parameters with a CNN for breast cancer classification. Nejad et al. [35] used a fast one-layer CNN for breast cancer classification that was tested on histopathological images with a magnification factor of 40×. Nahid et al. [36] used DNN models guided by unsupervised clustering methods for breast cancer classification. Murtaza et al. [3] comprehensively reviewed cutting-edge deep-learning-based breast cancer classification using medical images. Ogundokun et al. [37] utilized artificial neural networks and CNNs with hyperparameter optimization for malignant vs. benign classification, while the support vector machine (SVM) and multilayer perceptron (MLP) were utilized as baseline classifiers for comparison. Vogado et al. [38] proposed a technique for correctly classifying images with different characteristics derived from different image databases that does not require a segmentation process. Gandomkar et al. [39] classified breast histopathological images into malignant and benign subtypes using deep residual networks. Han et al. [40] previously utilized deep neural networks to classify histopathological breast images into their sub-types and used majority voting for patient classification. Whilst their work focused on classifying breast histopathological images into their sub-types and achieved 93.2% accuracy, we perform binary classification using a BoDMCF extracted from breast histopathological images without considering image subtypes.
Related work that used CNNs to extract deep features from medical images: Wichakam et al. [41] proposed an automated system that uses a CNN for feature extraction and an SVM for classification for mass detection on digital mammographic images, but did not explore multi-resolution extraction and pooling to create a bag of deep features. Devnath et al. [42] used CNN models for the automated detection of pneumoconiosis by extracting deep multi-level features from X-ray images that were then classified using SVM. Devnath et al. [43] conducted a systematic review of computer-aided diagnosis of coal workers' pneumoconiosis in chest X-ray radiographs using machine learning, which included approaches that utilized CNNs for feature extraction. Devnath et al. [44] utilized the CheXNet-121 model as a feature extractor as part of a method for detecting and visualizing pneumoconiosis using an ensemble of multi-dimensional deep features learned from chest X-rays. Firstly, they removed the last layer close to the output layer; next, a global average pooling layer was added, which converted the output of the model into one-dimensional vectors. Huynh et al. [45] tested the optimal point at which to extract features from a pre-trained CNN, identifying the specific utility of transfer learning in computer-aided diagnosis (CADx) systems. Zhang et al. [46] proposed building ensemble learners by fusing multiple deep CNN learners for pulmonary nodule classification. Other related work includes research by Filipczuk et al. [47] and George et al. [48], who previously extracted nuclei features from fine-needle biopsies: first, the circular Hough transform was utilized to detect nuclei candidates, followed by false-positive reduction using machine learning and Otsu thresholding.
The rest of this paper is structured as follows. Section 2 presents the background required to understand our work, including an introduction to the BreaKHis database, basic concepts of CNNs, and a description of the pre-trained CNNs we explored for feature extraction. Our proposed BoDMCF representation and machine learning methodology are presented in Section 3. Section 4 presents our experimental results, Section 5 discusses our findings, and Section 6 concludes the paper.

BreakHis Breast Cancer Histopathological Image Dataset
Our neural network breast cancer models were created by analyzing the BreaKHis database [27], which contains 7909 microscopic histopathological biopsy images of benign and malignant breast tumors. The distribution of images in the BreaKHis database is summarized in Table 2. In an IRB-approved study, patients with traces of breast cancer who visited the P&D Lab, Brazil, between January and December 2014 were recruited. Those who agreed to participate provided proper consent. Breast tissue biopsy slides were created by staining the samples with hematoxylin and eosin, prepared for histopathological examination, and marked by pathologists at the P&D Lab. The widely accepted paraffin preparation methodology was utilized. The overall preparation technique incorporates several steps, including fixation, dehydration, clearing, infiltration, embedding, and cutting [49]. Lastly, an experienced pathologist diagnosed every case, and each diagnosis was confirmed by correlative tests, such as immunohistochemistry assessment. An Olympus BX-50 microscope with a relay lens at a magnification of 3.3×, fixed to a Samsung SCC-131AN digital camera, was employed to acquire digitized pictures of the breast tissue slides. Images were acquired in the red, green, and blue (RGB, 3-channel) color space.

Convolutional Neural Networks (CNNs)

Convolutional neural networks (CNNs) have recently become the best-performing neural networks for image analysis and classification. The BoDMCF approach utilizes pre-trained, state-of-the-art CNN models for feature extraction. This section provides a summary of some of the technical details of the CNN architecture. CNNs are in the category of feedforward neural network (FFN) models, in which the signal passes through the network without looping back, and can be expressed as the composition of their layers, as in Equation (1) [50]:

$$f(I) = g_H\big(g_{H-1}(\cdots g_2(g_1(I)))\big) \qquad (1)$$
where $H$ indicates the number of hidden layers, and $g_i$ denotes the function in the matching layer $i$. The core functional layers in a typical CNN model incorporate convolutional, activation, pooling, fully connected (FC), and classification layers. The convolutional layer, $f$, is comprised of various convolutional kernels $(f_1, \ldots, f_{y-1}, f_y)$, where every $f_y$ denotes a linear function in the $y$th kernel that can be represented by Equation (2):

$$f_y\big(I(x, j, z)\big) = \sum_{p=1}^{m} \sum_{q=1}^{n} \sum_{r=1}^{w} W_y(p, q, r)\, I(x + p - 1,\; j + q - 1,\; z + r - 1) \qquad (2)$$

where the position of the pixel in the input $I$ is denoted by the coordinates $(x, j, z)$, the weights for the $y$th kernel are denoted by $W_y$, and the height, width, and depth of the filter are denoted by $m$, $n$, and $w$, respectively. The rectified linear unit (ReLU), a pixel-wise non-linear function $g$ known as the activation layer, is represented in Equation (3) [50-52]:

$$g(I) = \max(0, I) \qquad (3)$$
The pooling layer, $k$, is a layered non-linear down-sampling function designed to repeatedly decrease the feature representation size. The FC layer is considered a variation of the convolutional layer whose kernel has size $1 \times 1$. The classification softmax layer, $\sigma(z)_i = e^{z_i} / \sum_{j=1}^{K} e^{z_j}$, is typically added after the last fully connected layer to calculate the probabilities of $I_i$ belonging to the different classes. Figure 2 shows a simple example of a CNN model that is made up of convolutional, ReLU, max-pooling, and FC layers. The first, second, and fifth ReLU layers precede the max-pooling layers, which in turn precede the three FC layers. To express max-pooling formally, let $Z$ be an $n_l \times n_l \times m_l$ tensor. Max-pooling involves determining the maximum value over the element-wise product of the subtensor $Z_k^l(i, j, q)$ and the filter $W$, as given by Equation (4):

$$k(Z)(i, j, q) = \max\big( Z_k^l(i, j, q) \odot W \big) \qquad (4)$$
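To make the layer composition above concrete, the following is a minimal sketch of a CNN with the convolution, ReLU, max-pooling, FC, and softmax structure just described, written in PyTorch. This is an illustrative toy network, not the paper's architecture; all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN illustrating the layer types described above.
    Layer sizes are illustrative, not the paper's architecture."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer f (Eq. 2)
            nn.ReLU(),                                   # activation layer g (Eq. 3)
            nn.MaxPool2d(2),                             # pooling layer k (Eq. 4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 112 * 112, num_classes),      # FC layer (1x1-kernel analogue)
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)              # softmax layer: class probabilities

# Example: a batch of two 224x224 RGB images -> two class probabilities each
probs = TinyCNN()(torch.randn(2, 3, 224, 224))
print(probs.shape)  # torch.Size([2, 2])
```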

Pre-Trained CNNs for Deep Image Feature Extraction
To create the bag of deep multi-resolution convolutional features (BoDMCF) representation, features are extracted from four resolutions (40×, 100×, 200×, and 400×) of histopathological breast cancer images using three state-of-the-art CNN models: (1) EfficientNet (EfficientNet-b0) [17], (2) the Inception deep CNN architecture (Inception-v3) [18], and (3) ResNet50 [19]. These models were pre-trained on the ImageNet repository, which has 14 million images in 1000 categories, enabling them to gain significant intelligence about images [53]. Pre-training is part of a transfer learning approach, which yields higher starting/initial model accuracy during training, faster convergence, and higher asymptotic accuracy (the accuracy level to which the training converges). We now provide some background on these state-of-the-art deep CNN image classification models.
EfficientNet [17]: This architecture and scaling method utilizes a compound coefficient to uniformly scale the depth, width, and resolution dimensions of the CNN using a set of fixed scaling coefficients. Given a ConvNet $\mathcal{N}(d, w, r)$ whose depth, width, and resolution are scaled by $d$, $w$, and $r$, the EfficientNet architecture can be formulated as an optimization problem, given by Equation (5):

$$\max_{d,\, w,\, r} \;\; \text{Accuracy}\big(\mathcal{N}(d, w, r)\big) \qquad (5)$$

subject to memory and FLOPS budgets. In a principled manner, EfficientNet scales network width, depth, and resolution based on a single compound coefficient $\delta$, as expressed in Equation (6):

$$d = \alpha^{\delta}, \quad w = \beta^{\delta}, \quad r = \gamma^{\delta}, \quad \text{s.t.} \;\; \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \quad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1 \qquad (6)$$
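As a concrete check of Equation (6), the coefficients reported in the original EfficientNet paper for the B0 base network are $\alpha = 1.2$, $\beta = 1.1$, and $\gamma = 1.15$; the small worked calculation below (ours, for illustration only) verifies that they approximately satisfy the FLOPS constraint:

$$\alpha \cdot \beta^{2} \cdot \gamma^{2} = 1.2 \times 1.1^{2} \times 1.15^{2} = 1.2 \times 1.21 \times 1.3225 \approx 1.92 \approx 2$$

so each unit increase in the compound coefficient roughly doubles the computational cost.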
For instance, in order to utilize $2^N$ times more computational resources, the network depth can simply be increased by $\alpha^N$, the width by $\beta^N$, and the image size by $\gamma^N$, where $\alpha$, $\beta$, and $\gamma$ are constant coefficients determined by a small grid search on the original small model. In order to capture more fine-grained patterns from a larger input image, the compound scaling method uses more layers to increase the receptive field and more channels to capture finer-grained patterns. MobileNet-V2's [49] inverted bottleneck residual blocks, along with squeeze-and-excitation blocks, are the basis of EfficientNet-B0's base network. Figure 3 shows the architecture of the EfficientNet-B0 model.

Inception: This architecture has been introduced in multiple versions. The first version of the Inception CNN model was introduced as GoogLeNet [54], named Inception-v1. The enhanced usage of computing resources within the network is the fundamental feature of this architecture, accomplished by increasing the network's depth and width while sustaining the computational budget. Version 2 (Inception-v2) incorporated batch normalization [55]. Version 3 (Inception-v3) utilized additional factorization ideas [18]. The main distinction of Inception-v3 is that two consecutive layers of 3 × 3 convolutions with up to 128 filters were used instead of 5 × 5 convolutional layers, together with the addition of a Batch Norm (BN)-auxiliary. A BN-auxiliary is a version of the auxiliary classifier in which the fully connected layer, in addition to the convolutions, is also normalized. The RMSProp optimizer was also utilized, whose update rule can be expressed as Equation (7):

$$E[g^2]_t = \beta\, E[g^2]_{t-1} + (1 - \beta)\left(\frac{\delta C}{\delta w}\right)^{2}, \qquad w_t = w_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t}}\, \frac{\delta C}{\delta w} \qquad (7)$$

where $E[g^2]$ is the moving average of squared gradients, $\left(\frac{\delta C}{\delta w}\right)^2$ is the squared gradient of the cost function with respect to the weight, $\eta$ is the learning rate, and $\beta$ is the moving average parameter. The classification layers utilized label smoothing regularization (LSR). LSR can be obtained by replacing the single cross-entropy $H(q, p)$ in the loss function with a pair of cross-entropy losses, $H(q, p)$ and $H(u, p)$, as given in Equation (8) below:

$$H(q', p) = (1 - \epsilon)\, H(q, p) + \epsilon\, H(u, p) \qquad (8)$$

The second loss penalizes the deviation of the predicted label distribution $p$ from the prior $u$, with relative weight $\epsilon / (1 - \epsilon)$. $H(u, p)$ is a measure of how dissimilar the predicted distribution $p$ is to uniform.
The model is 48 layers deep and capable of classifying images into 1000 image classes, including various object types such as keyboard, mouse, and pencil, and different animals. This pre-training ensures that the model has learned deep, high-level feature representations of an extensive variety of images. Figure 4 shows the architecture of the Inception-v3 model.
ResNet: This architecture introduced a deep residual learning structure, which reformulates the CNN's layers as learning residual functions of the layer inputs. Denoting the desired underlying mapping as $K(i)$, the stacked non-linear layers are made to fit another mapping $E(i) := K(i) - i$. ResNet solved the vanishing gradient problem, whereby the value of the neural network's gradient decreases so significantly during backpropagation that its weights barely change, using skip connections that add the original input to the output of each convolutional block. A skip connection is a direct connection that skips over some of the model layers and can be expressed as

$$y = E(i) + i$$

where $E(i)$ represents the residual mapping to be learned. ResNet utilizes the SGD optimizer with momentum, given by Equation (9):

$$v_{t+1} = \rho\, v_t + \nabla f(x_{t-1}), \qquad x_{t+1} = x_t - \alpha\, v_{t+1} \qquad (9)$$

where $v_{t+1}$ is the momentum value, $\rho$ is a friction coefficient, $\nabla f(x_{t-1})$ is the gradient of the objective function at iteration $t - 1$, $x_t$ are the parameters, and $\alpha$ is the learning rate. ResNet50 [19], which our approach utilizes, is a variant of ResNet. It has 48 convolutional layers, 1 max-pooling layer, and an average pooling layer. Figure 5 shows the architecture of the ResNet model.
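For illustration, here is a minimal sketch of a residual block implementing the identity skip connection $y = E(i) + i$ described above, written in PyTorch. The channel counts are illustrative assumptions, not the exact ResNet-50 bottleneck configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = E(x) + x (identity skip connection)."""
    def __init__(self, channels=64):
        super().__init__()
        # E(x): the stacked non-linear layers fitting the residual mapping
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The skip connection adds the original input to the block output,
        # letting gradients flow directly through the identity path.
        return torch.relu(self.residual(x) + x)

# Example: the block preserves the input shape
y = ResidualBlock()(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```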

Materials and Methods
Our overall approach involves extracting deep multi-resolution features from four resolutions (40×, 100×, 200×, and 400×) of high-resolution (2048 × 1536) histopathological breast cancer images using the pre-trained EfficientNet-b0 [17], Inception-v3 [18], and ResNet50 [19] CNN models, and pooling the extracted features using global pooling to create a BoDMCF. A support vector machine (SVM) classifier then uses the BoDMCF to classify histopathological breast cancer images as either malignant or benign. As shown in Figure 6, the proposed breast cancer classification framework consists of three main modules: (i) data pre-processing, (ii) deep BoDMCF feature extraction, and (iii) classification using SVM.

Step 1: Histopathological Image Pre-Processing
During this step, each histopathological image is resized to fit the input size of the different deep CNN models. The histopathological images were resized from 2048 × 1536 to 299 × 299 for Inception-v3 and EfficientNet-B0 and to 224 × 224 for ResNet-50. Random color data augmentation was also performed on each image by changing its brightness randomly between 50% (1 − 0.5) and 150% (1 + 0.5) of the original image (see Figure 7). Data augmentation generates diverse samples, which enables the model to learn a robust representation that is invariant to minor changes [56]. Examples of resized histopathological images are shown in Figure 8. After pre-processing, training and test sets were created using a 70:30 split ratio.
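The following is a minimal sketch of this pre-processing step using torchvision transforms. The paper does not show the authors' actual code, so this is an illustrative reconstruction under the stated resize and brightness parameters.

```python
from torchvision import transforms

# Illustrative reconstruction of Step 1: resize to each CNN's input size and
# randomly scale brightness between 50% (1 - 0.5) and 150% (1 + 0.5).
def make_preprocess(input_size):
    return transforms.Compose([
        transforms.Resize((input_size, input_size)),  # 2048x1536 -> input_size x input_size
        transforms.ColorJitter(brightness=0.5),       # brightness factor drawn from [0.5, 1.5]
        transforms.ToTensor(),
    ])

preprocess_inception = make_preprocess(299)  # Inception-v3, EfficientNet-B0
preprocess_resnet = make_preprocess(224)     # ResNet-50
```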

Step 2: Deep Multi-Resolution Feature Extraction Using Pre-Trained CNNs
This stage involves extracting the BoDMCF by modifying the final layers of the three pre-trained deep convolutional networks: EfficientNet-b0, Inception-v3, and ResNet50. These models were pre-trained on full-sized ImageNet images; transfer learning (fine-tuning) was then performed on the histopathological breast cancer images in our dataset. These feature extractor CNN models utilize layer activations as features. The rich multi-level activations (features) extracted from the four resolutions of histopathological images were then pooled to form the BoDMCF and finally used to train a support vector machine (SVM).
EfficientNet [17]: The input size of EfficientNet-b0 was 224 × 224, and Table 3 shows the activation strengths of 56 features learned by its average pooling layer, obtained by setting the channels to the vector of indices 1:56 and setting the pyramid levels to 3 (three) so that the images are not scaled. Inception-v3 [18]: Table 4 shows the activation strengths of 56 features learned by the average pooling layer, obtained by setting the channels to the vector of indices 1:56 and setting the pyramid levels to 1 (one) so that the images are not scaled. ResNet-50 [19]: The input size of ResNet-50 is 224 × 224, and Table 5 shows the activation strengths of 56 features learned by the average pooling layer, obtained by setting the channels to the vector of indices 1:56 and setting the pyramid levels to 1 (one) so that the images are not scaled.
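As an illustrative sketch (not the authors' exact code, which read out layer activations as described above), deep features can be extracted with the three pre-trained backbones by exposing the global-average-pooled activations; the torchvision model constructors and weight enums below are assumptions about the tooling.

```python
import torch
from torchvision import models

# Illustrative sketch: pre-trained backbones as deep feature extractors.
# Replacing the classification head with Identity exposes the global
# average pooling output as the feature vector.
def make_extractor(name):
    if name == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        m.fc = torch.nn.Identity()            # 2048-d pooled features
    elif name == "inception_v3":
        m = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
        m.fc = torch.nn.Identity()            # 2048-d pooled features
    elif name == "efficientnet_b0":
        m = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        m.classifier = torch.nn.Identity()    # 1280-d pooled features
    return m.eval()

@torch.no_grad()
def extract(model, batch):
    return model(batch)  # (N, feature_dim) pooled activations
```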

Step 3: Global Pooling of Features to Create BoDMCF
Features extracted by the three state-of-the-art CNN models (ResNet-50, Inception-v3, and EfficientNet-b0) were pooled to acquire high-quality image descriptions using the activations of the global pooling layers at the end of each network, as shown in Figure 9. The network constructs a hierarchical representation of the input images: deeper layers contain higher-level features, constructed from the lower-level features of earlier layers. To obtain the feature representations of the training and test images, the activations of the global pooling layer ('avg_pool') at the end of the network are utilized. The global pooling layer pools the input features over all spatial locations, giving 512 features in total, as described in Figure 9. At each of the spatial locations (1,1), (1,2), ..., (h,w), the f activation maps labelled f_1, f_2, f_3, ..., f_512 form a 1 × 1 × f column of features. These pooled features are then concatenated into a BoDMCF that is classified using SVM.
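A minimal sketch of this pooling-and-concatenation step is shown below, building on the hypothetical `make_extractor`/`extract` helpers sketched earlier; the loop structure is an assumption about how the per-resolution, per-model features are combined.

```python
import torch

# Illustrative: build the BoDMCF by concatenating the globally pooled
# features from every (magnification, model) pair.
extractors = [make_extractor(n) for n in ("resnet50", "inception_v3", "efficientnet_b0")]
magnifications = ("40X", "100X", "200X", "400X")

def bodmcf(batches_by_mag):
    """batches_by_mag: dict mapping magnification -> preprocessed image batch.
    Assumes the same set of samples is available at every magnification and
    that each batch was resized for the model consuming it (see Step 1)."""
    parts = []
    for mag in magnifications:
        for model in extractors:
            parts.append(extract(model, batches_by_mag[mag]))
    return torch.cat(parts, dim=1)  # one long multi-resolution feature vector per image
```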

Step 4: BoDMCF Classification Using SVM
SVM was utilized to classify the BoDMCF extracted by the three CNN models as described above. Given a training set with class labels $(B_n, A_n)$, $n = 1, \ldots, N$, $B_n \in \mathbb{R}^D$, $A_n \in \{-1, 1\}$, the support vector machine (SVM) classifier [57] tries to find a hyperplane in feature space that maximizes the margin between the two classes (malignant vs. benign). SVM is based on the theory of maximum-margin linear discriminants. For two classes to be classified, SVM finds the peripheral data points in each class that are closest to the other class (called support vectors). For a dataset $D$ with $n$ points $x_i$ in a $d$-dimensional space, a hyperplane function $h(x)$ can be defined as

$$h(x) = w^{T} x + b$$

Overall, with $n$ points, the margin of the linear classifier can be defined as the minimum distance of a point from the separating hyperplane, given as

$$\delta^{*} = \min_{i} \frac{A_i\, h(x_i)}{\lVert w \rVert}$$

The SVM classifier finds the optimal hyperplane dividing the two classes by solving the minimization problem with the objective function

$$\min_{w,\, b} \; \frac{\lVert w \rVert^{2}}{2}$$

with linear constraints

$$A_i \big(w^{T} x_i + b\big) \ge 1, \quad i = 1, \ldots, n$$

Then, the class of a new point $x$ is predicted as

$$\hat{A} = \operatorname{sign}\big(w^{T} x + b\big)$$
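A minimal sketch of this classification step with scikit-learn follows, assuming `X_train`/`X_test` hold the concatenated BoDMCF vectors from the previous step and `y_train`/`y_test` the labels (hypothetical names); the kernel and regularization settings are illustrative, not the paper's tuned hyperparameters.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Illustrative: classify BoDMCF vectors as benign (0) vs. malignant (1).
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)                 # X_*: BoDMCF matrices, y_*: labels
y_pred = svm.predict(X_test)
print("test accuracy:", svm.score(X_test, y_test))
```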

Evaluation Metrics
The following metrics were used to evaluate all neural network breast cancer classification models. In the formulas below, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Accuracy (Acc): This demonstrates how many malignant cases and how many benign cases are correctly predicted, as described by Equation (15):

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \qquad (15)$$

Sensitivity (Sens): This is the percentage of positive instances correctly predicted, which can be computed using Equation (16):

$$Sens = \frac{TP}{TP + FN} \qquad (16)$$

Precision (Prec): This expresses how many of the positive predictions are actually correct, as expressed in Equation (17):

$$Prec = \frac{TP}{TP + FP} \qquad (17)$$

Specificity (Spec): This measures the percentage of correct negative predictions and can be expressed as Equation (18):

$$Spec = \frac{TN}{TN + FP} \qquad (18)$$

F1-score (Fscore): This analyzes sensitivity and precision in harmony by applying a penalty to extreme values in order to reflect their simultaneous impact, and can be expressed as Equation (19):

$$Fscore = \frac{2 \times Prec \times Sens}{Prec + Sens} \qquad (19)$$

AUC: This is the area under the receiver operating characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold values, essentially measuring how well the model separates the 'signal' from the 'noise'. AUC is a number that ranges from 0 to 1: a value of one indicates a perfect model, while a value of 0.5 or below indicates an inadequate model. It is expressed as Equation (20):

$$AUC = \frac{\sum_{i=1}^{I_p} R_i - I_p (I_p + 1)/2}{I_p\, I_n} \qquad (20)$$

where $I_p$ and $I_n$ denote the number of malignant and benign breast images, respectively, and $R_i$ is the rank of the $i$th positive image in the ranked list.
The Matthews Correlation Coefficient (MCC): This is a contingency matrix metric for calculating the Pearson product-moment correlation coefficient between actual and predicted values that is unaffected by the unbalanced dataset issue. MCC can be expressed as Equation (21):

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (21)$$

Kappa (Kapp): This is a statistic that compares the observed accuracy $p_o$ with the expected (chance) accuracy $p_e$. It is a measure of how well the instances categorized by a classifier match the data designated as ground truth. Equation (22) can be used to calculate Kappa:

$$Kapp = \frac{p_o - p_e}{1 - p_e} \qquad (22)$$
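As an illustrative sketch, all of the metrics above can be computed from the predictions of the SVM step with scikit-learn; `y_test`, `y_pred`, and `y_score` are the hypothetical names from the earlier sketches (`y_score` would come from, e.g., the SVM's `decision_function`).

```python
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score, matthews_corrcoef,
                             cohen_kappa_score)

# Sensitivity is recall on the positive (malignant) class; specificity is
# recall on the negative (benign) class.
print("Acc :", accuracy_score(y_test, y_pred))             # Eq. (15)
print("Sens:", recall_score(y_test, y_pred))               # Eq. (16)
print("Prec:", precision_score(y_test, y_pred))            # Eq. (17)
print("Spec:", recall_score(y_test, y_pred, pos_label=0))  # Eq. (18)
print("F1  :", f1_score(y_test, y_pred))                   # Eq. (19)
print("AUC :", roc_auc_score(y_test, y_score))             # Eq. (20), needs scores
print("MCC :", matthews_corrcoef(y_test, y_pred))          # Eq. (21)
print("Kapp:", cohen_kappa_score(y_test, y_pred))          # Eq. (22)
```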

Baseline State-of-the-Art CNN Image Classification Architectures
The baseline CNN models we compared against were carefully selected for various reasons, including being winning entries in image analysis and classification competitions, being state of the art, and/or performing well on similar problems. They include: DenseNet201 [58]: This is a 201-layer CNN in which each layer is connected to every other layer in a feedforward manner to eliminate the vanishing gradient problem, enhance feature propagation, promote the reuse of features, and drastically reduce the number of parameters. DenseNet is based on the idea that convolutional networks can be more accurate and more efficient to train if they have shorter connections between the layers near the input and the layers near the output. We selected DenseNet201 because it was utilized in prior work [59] as a feature extractor for deep hybrid architectures for the binary classification of breast cancer images.
SqueezeNet [60]: This is a lightweight CNN that employs various design strategies to reduce the number of parameters, particularly fire modules, which "squeeze" parameters using 1 × 1 convolutions so that the network carries fewer parameters. The problems of storage efficiency and model prediction speed were solved using a technique known as model compression, accomplished by (i) compressing the model weight values and (ii) compressing the network architecture. SqueezeNet was selected because it was previously utilized for deep feature extraction and classification of breast ultrasound images [61].
ShuffleNet [62]: This is a convolutional neural network specifically designed for mobile devices with low processing power. The architecture uses two new operations, pointwise group convolution and channel shuffle, to reduce computation costs while preserving accuracy. ShuffleNet was selected as a baseline because it was utilized for breast cancer classification in prior work [63].

Experiments
In this section, we describe experiments to rigorously evaluate our proposed BoDMCF approach using the BreakHis dataset of histopathological breast cancer images [27] summarized in Table 2. The classification task was performed by fine-tuning (transfer learning), on the BreakHis dataset, the CNN models that were previously pre-trained on the ImageNet dataset. The various hyperparameters shown in Table 6 were determined using grid search, followed by pre-processing, training, and validation on the histopathological images. Test images were then provided as inputs to the trained models. The fine-tuned, pre-trained CNN models were used to extract features at four resolutions, which were pooled to form the BoDMCF that was then classified using SVM. Classifier performance was evaluated with ten-fold cross-validation, yielding a cross-validation error of 0.0462.
Experiment: train-test curves: Figure 10 shows sample train-test curves generated during training of the EfficientNet-b0 model, demonstrating model convergence after about 200 epochs.
Experiment: classification of individual resolutions of histopathological images by baseline CNN models pre-trained on ImageNet: The goal of this experiment was to establish the baseline performance of individual state-of-the-art CNN image classification models (ResNet18, InceptionV3, InceptionResnetV2, DenseNet201, ResNet50, EfficientNetB0, SqueezeNet, and ShuffleNet) using weights determined via pre-training on ImageNet (no fine-tuning on the BreakHis dataset). Classification was performed at individual image resolutions with no pooling of features to create the BoDMCF. Our goal was to eventually demonstrate that pooling features from multiple resolutions to create our BoDMCF outperforms these powerful baselines that perform classification on single image resolutions. The results of this experiment are shown in Table 7. Except for the precision metric (ResNet18 on 200× magnified images has the highest precision), SqueezeNet performed best on all other metrics (accuracy, F1 score, recall, AUC, Kappa, and MCC). These results suggest that the visual attributes that most clearly distinguish malignant tumors from benign ones are most observable at 100× magnification and that the SqueezeNet model outperforms all other baseline models when model weights learned from ImageNet during pre-training (no fine-tuning on the BreakHis dataset) are utilized.
Experiment: classification of deep features extracted from individual resolutions (40×, 100×, 200×, and 400×) of histopathological breast cancer images by baseline CNN models fine-tuned on the BreakHis dataset, which are then classified using SVM: The goal of this experiment was to demonstrate the power of pooling multiple resolutions of deep CNN features. Specifically, we benchmarked the performance of deep features extracted at individual magnifications using state-of-the-art fine-tuned CNN image classification models (ResNet18, InceptionV3, InceptionResnetV2, DenseNet201, ResNet50, EfficientNetB0, SqueezeNet, and ShuffleNet) without pooling multiple magnifications into a single BoDMCF representation as we propose. The results of this experiment are shown in Table 8. Except for the precision metric (ResNet50 on 40× magnified images has the highest precision), DenseNet201 performed best on all other metrics (accuracy, F1 score, recall, AUC, Kappa, and MCC). These results suggest that, when CNNs are utilized as feature extractors, the visual attributes that most clearly distinguish malignant tumors from benign ones are most observable at 40× magnification and that DenseNet201 fine-tuned on the BreakHis dataset outperforms all other baselines as a feature extractor.
Experiment: classification of pooled deep features extracted from all four magnifications (40×, 100×, 200×, and 400×) of histopathological breast cancer images using baseline models with model parameters (weights) determined by pre-training on ImageNet: The main difference from our proposed approach is that, while all four magnifications were pooled in this experiment, only a single pre-trained CNN model (one of ResNet18, InceptionV3, InceptionResnetV2, DenseNet201, ResNet50, EfficientNetB0, SqueezeNet, and ShuffleNet) was used for classifying the pool of images at a time. In contrast, our proposed approach extracts features using an ensemble of three CNN models (ResNet-50, Inception-v3, and EfficientNet-b0). The results of this experiment are shown in Table 9. Except for the precision metric (DenseNet201 has the highest precision), SqueezeNet performed best on all other metrics (accuracy, F1 score, recall, AUC, Kappa, and MCC). These results suggest that the SqueezeNet architecture outperforms all other baselines on a multi-resolution bag of features when model weights learned from ImageNet during pre-training are utilized.
Experiment: classification of pooled deep features extracted from all four magnifications (40×, 100×, 200×, and 400×) of histopathological images using baseline CNN models, with the features classified using SVM: The main difference from our proposed approach is that, while all four magnifications were pooled in this experiment, features were extracted using only a single pre-trained CNN model (one of ResNet18, InceptionV3, InceptionResnetV2, DenseNet201, ResNet50, EfficientNetB0, SqueezeNet, and ShuffleNet) at a time. In contrast, our proposed approach extracts features using an ensemble of three CNN models (ResNet-50, Inception-v3, and EfficientNet-b0). The results of this experiment are shown in Table 10. Except for the precision metric (ResNet50 has the highest precision), EfficientNetB0 performed best on all other metrics (accuracy, F1 score, recall, AUC, Kappa, and MCC). These results suggest that the EfficientNetB0 architecture outperforms all other baselines as a deep feature extractor from a pool of multiple magnifications of histopathological images.
Results of our BoDMCF approach: Our approach has two key distinctions from the baseline approaches presented thus far. First, we extract features from all four magnifications of histopathological images, which are then pooled into a BoDMCF. Secondly, we use multiple (three) state-of-the-art CNN models (ResNet-50, InceptionV3, and EfficientNet-b0) as feature extractors. The results of our approach, shown in Table 10 for individual networks and in Table 11, demonstrate that our approach outperforms the baseline approaches. Figure 11 shows samples of test images with their predicted labels from our proposed method. Finally, to demonstrate that the difference in performance between our BoDMCF approach and the other ensemble baselines was statistically significant, we performed the Nemenyi post hoc test [64]. At a confidence level α = 0.05, the critical distance (CD) is 1.2536.
Table 11. Results of our proposed approach with features extracted from all four histopathological image magnifications (40×, 100×, 200×, and 400×) by three state-of-the-art CNN models (ResNet-50, InceptionV3, and EfficientNet-b0). The effects of the number of features used on model performance are also shown. The 2- and 3-model combinations were based on the best-performing single-model results in Table 9. Accuracies achieved by prior breast cancer binary classification work are also shown in the bottom rows.
Figure 11. Four sample test images with their predicted labels from our proposed algorithm.
Experiment: ROC curves: The receiver operating characteristic (ROC) curve, shown in Figure 12 for our approach, is a graphical plot that shows the diagnostic ability of a binary classifier as its discrimination threshold is varied. In simple terms, the ROC curve plots our approach's FPR vs. its TPR. The ROC curve is almost a perfect right angle at the top left corner, demonstrating that our proposed approach achieves excellent FPR and TPR.
Experiment: confusion matrix: The confusion matrix for our approach is shown in Figure 13. The columns correspond to the target class, and the rows correspond to the output (predicted) class. The diagonal cells correspond to observations that are correctly classified, while the off-diagonal cells refer to incorrect classifications. The percentage of the overall number of observations and the number of observations in every cell are also presented. The column on the extreme right displays the proportions of incorrect (red) and correct (green) classifications among the predictions; these metrics are referred to as the false discovery rate and the positive predictive value. The lowest row indicates the percentages of incorrect and correct classifications, referred to as the false negative rate (FNR) and the true positive rate (TPR), and the bottom-right cell shows the overall accuracy. A column-normalized column summary displays the percentages of incorrectly and correctly classified observations for every predicted class, and a row-normalized row summary displays the percentages of incorrectly and correctly classified observations for every true class. In the confusion matrix, most of the results fall on the leading diagonal with very few off the diagonal, which demonstrates that the proposed approach did not confuse the benign and malignant classes.
Experiment: classifying the BoDMCF representation using different machine learning classifiers: The goal of this experiment was to compare the performance of the support vector machine (SVM) with other traditional machine learning (ML) classifiers for the task of classifying the BoDMCF representation into the target labels of "Benign" and "Malignant". The results in Table 13 show that SVM outperformed all other ML classifiers on this binary classification task. This is likely because SVM is well known to perform well on binary classification tasks.
Experiment: CNN model interpretability using Grad-CAM [65]: The goal of this experiment was to ensure that the breast cancer classification model focused on the appropriate regions of the image during analysis. Grad-CAM computes the gradient of the classification score with respect to a convolutional layer's feature maps, highlighting the specific regions of interest (ROIs) with the greatest gradient scores. The gradients of the score for class $c$, $y^c$, with respect to the feature maps $A^k$ of a convolutional layer are global-average-pooled to obtain the importance weights $\alpha_k^c$:

$$\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k}$$

where $\alpha_k^c$ represents a partial linearization of the deep network downstream from $A$, capturing the importance of feature map $k$ for a target class $c$. A Grad-CAM heatmap is then generated as a weighted combination of the forward activation feature maps, followed by a ReLU activation function:

$$L_{Grad\text{-}CAM}^{c} = ReLU\left(\sum_{k} \alpha_k^c A^k\right)$$

where $L_{Grad\text{-}CAM}^{c}$ is the class-discriminative localization map. Grad-CAM was applied to produce a coarse localized map highlighting the most important ROIs in the histopathological images for classifying them as benign or malignant. Sample Grad-CAM results are shown in Figure 14.
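The following is a minimal Grad-CAM sketch in PyTorch implementing the two equations above via forward/backward hooks. The choice of `model.layer4` (a ResNet-50 stage) as the target convolutional layer is an illustrative assumption, not necessarily the layer the authors visualized.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
acts, grads = {}, {}

# Hooks capture the feature maps A^k and their gradients dy^c/dA^k.
layer = model.layer4
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def grad_cam(x, target_class):
    score = model(x)[0, target_class]   # y^c, the score for class c
    model.zero_grad()
    score.backward()
    # alpha_k^c: global-average-pool the gradients over spatial dims (i, j)
    alpha = grads["g"].mean(dim=(2, 3), keepdim=True)
    # L^c: ReLU of the alpha-weighted combination of feature maps A^k
    cam = F.relu((alpha * acts["a"]).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear")  # upsample to input size

heatmap = grad_cam(torch.randn(1, 3, 224, 224), target_class=1)
```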
Experiment: analysis of misclassified images: The objective of this experiment was to discover the reasons behind model misclassifications, which could be addressed either to improve this work or in future work. Misclassifications resulted from benign images that looked similar to malignant images, or vice versa. One example each of a misclassified malignant and benign histopathological image from the BreakHis dataset is shown in Figure 15. The outline and uniformity of the texture differences in the benign image are comparable to those in a malignant image. There is less dispersion of cells in the misclassified malignant images than in ordinary malignant images; consequently, the cells appear benign, resulting in misclassification. Benign histopathology images usually have fewer dispersed cells, with only a few spread elsewhere.

Discussion
Through rigorous experimentation, as shown in Table 11, we demonstrated that the proposed BoDMCF approach outperforms a comprehensive set of baselines as well as the prior state-of-the-art methods (Table 12) for the binary classification of histopathological images. Our results also demonstrate that all key components of our approach contribute non-trivially to its superior results, including:
Transfer learning, by pre-training on a large image repository (ImageNet) with fine-tuning on the BreakHis breast cancer image dataset, enables the CNN feature extractor models to learn a robust image representation from the large image repository. Fine-tuning on the BreakHis breast cancer dataset transfers the learned intelligence to the task of analyzing and classifying breast cancer. This conclusion is evident from comparing the results in Table 7 (pre-training with no fine-tuning) and Table 8 (pre-training with fine-tuning).
Using an ensemble of CNNs as deep feature extractors achieves superior performance to using any single pre-trained CNN for feature extraction, which is evident from comparing the results in Tables 10 and 11. In fact, as shown by the results in Table 11, the three specific state-of-the-art CNN models (ResNet-50, InceptionV3, and EfficientNet-b0), discovered through extensive experimentation and utilized for feature extraction, outperform other CNN combinations and ensembles. Intuitively, each CNN extracts slightly different image features. Feature extraction using multiple CNNs combines these different features into a superset of features that outperforms the features extracted from any single CNN.
Extracting deep features from four magnifications (40×, 100×, 200×, and 400×) of histopathological images, which are then pooled into a BoDMCF, is important, as the visual attributes that distinguish malignant from benign tumors may be most discernible at different resolutions. This conclusion is evident because the results of the pooled, multi-resolution BoDMCF features (Table 13) outperform the results of classifying deep features extracted from any single resolution, as shown in Table 8.
Global pooling of multi-resolution features to create a bag (BoDMCF) is an essential step that also enables downstream classification using traditional machine learning algorithms such as SVM. Deep BoDMCFs are a powerful representation, which had the best performance for all combinations of CNN models explored in this study, as shown in Table 13. The proposed technique of using BoDMCF features, pooled and classified using SVM, outperformed the single-CNN-model approaches in Table 10.
SVM outperformed all other traditional ML classification algorithms for classifying the BoDMCF into malignant and benign target classes as shown in Table 13. We believe that this is because SVM's maximal margin hyperplane determination approach performs well on binary classification.
Limitations of this work and potential future work: The results acquired show that very significant classification performance can be achieved. While our proposed approach is shown to perform well on the BreakHis dataset, one of the most widely distributed publicly available histopathological image datasets, some limitations can be addressed in future work. Firstly, extending the dataset to include more images at more magnifications could yield more robust classifiers before deployment for use in hospitals. Secondly, we used three existing deep models; in future, fusing deeper models could yield better performance. Third, we would like to validate our results on other histopathological breast cancer datasets. Finally, implementing our methods on mobile devices could be a promising direction that facilitates deployment in under-resourced environments such as developing countries.

Conclusions
We have proposed an automatic method for classifying breast cancer histopathological images into malignant vs. benign categories. In particular, we have shown that deep BoDMCF features extracted from multiple magnifications (40×, 100×, 200×, and 400×) of histopathological images using three state-of-the-art pre-trained CNN models (ResNet-50, Inception-v3, and EfficientNet-b0), with global pooling and classification using SVM, can be leveraged for binary (malignant vs. benign) breast cancer classification. Moreover, combining deep, rich features from the global average pooling layers of various pre-trained convolutional deep models was shown to yield improved classification performance. In rigorous evaluation experiments, our deep BoDMCF approach with global pooling achieved an average accuracy of 99.92% for the classification task, sensitivity (or recall) of 0.9987, specificity of 0.9797, positive predictive value (PPV, or precision) of 0.9987, F1-score of 0.9987, MCC of 0.9980, Kappa of 0.8368, and AUC of 0.9990 on the BreaKHis dataset [27]. Our deep BoDMCF approach outperforms state-of-the-art CNN baselines, including ResNet18, InceptionV3, DenseNet201, EfficientNetb0, SqueezeNet, and ShuffleNet, when classifying any of the individual resolutions (40×, 100×, 200×, or 400×) or when SVM is used to classify a BoDMCF extracted using any single pre-trained CNN model. The high accuracy, sensitivity, PPV, and F1 score achieved by our approach are extremely encouraging, and the approach could be useful for supporting the work of health practitioners in low-resource settings with few experts. However, before deployment, a careful validation study comparing our model's performance to that of human experts needs to be conducted. In future work, combining several other image magnifications using emerging CNN models could yield even better breast cancer classification models.