A Multi-Stage Approach to Breast Cancer Classification Using Histopathology Images

Breast cancer is one of the deadliest diseases worldwide among women. Early diagnosis and proper treatment can save many lives. Breast image analysis is a popular method for detecting breast cancer. Computer-aided diagnosis of breast images helps radiologists do the task more efficiently and appropriately. Histopathological image analysis is an important diagnostic method for breast cancer, which is basically microscopic imaging of breast tissue. In this work, we developed a deep learning-based method to classify breast cancer using histopathological images. We propose a patch-classification model to classify the image patches, where we divide the images into patches and pre-process these patches with stain normalization, regularization, and augmentation methods. We use machine-learning-based classifiers and ensembling methods to classify the image patches into four categories: normal, benign, in situ, and invasive. Next, we use the patch information from this model to classify the images into two classes (cancerous and non-cancerous) and four other classes (normal, benign, in situ, and invasive). We introduce a model to utilize the 2-class classification probabilities and classify the images into a 4-class classification. The proposed method yields promising results and achieves a classification accuracy of 97.50% for 4-class image classification and 98.6% for 2-class image classification on the ICIAR BACH dataset.


Introduction
Breast cancer is one of the most dangerous and fatal diseases in women around the world. Breast cancer has been ranked the number one cancer among Indian females. It has an age-adjusted rate as high as 25.8 per 100,000 women and a mortality rate of 12.7 per 100,000 women. The breast cancer projection for India in 2020 suggests the number will go as high as 1,797,900. Better health awareness and availability of breast cancer screening programs and treatment facilities would cause a favorable and positive clinical picture in the country [1]. According to the American Cancer Society (ACS), breast cancer is the most common cancer found in American women, except for skin cancers. The average risk of a woman in the U.S. developing breast cancer sometime in her life is about 12% or a 1 in 8 chance. The chance that a woman will die from breast cancer is about 2.6% or a 1 in 38 chance. According to a 2013 World Health Organisation (WHO) report, in 2011, around half a million women worldwide succumbed to breast cancer. The mortality rate of breast cancer is very high compared to other types of cancer [2]. The prognosis for cancer patients varies depending on the type of cancer. Patients with small-cell lung carcinoma have the worst prognosis. Surgery, chemotherapy, radiation, targeted therapy, and immunotherapy are currently used to treat cancer. These methods of curative treatment are only available to people with localized disease or treatment-sensitive malignancies; therefore, their contributions to cancer curability are limited [3]. However, early diagnosis significantly increases treatment success. Specifically, during the diagnosis procedure, specialists evaluate both overall and local tissue organization via whole-slide and microscopy images. However, the large amount of data and complexity of the images coefficient-based feature selection method. The majority of these methods concentrated on a two-class classification task of malignant or non-malignant tissues [14,15]. For instance, Basavanhally et al. in 2011 [14] used the O'Callaghan neighborhood to solve the problem of tubule identification on hematoxylin and eosin (H & E)-stained breast cancer (BCa) histopathology images, where a tubule is characterized by a central lumen surrounded by cytoplasm and a ring of nuclei around the cytoplasm. The detection of tubules is important because tubular density is an important predictor in determining the grade of cancer. The potential lumen areas are segmented using a hierarchical normalized cut (HNCut) and an initialized color-gradient-based active contour model (CGAC). Evaluation of 1226 potential lumen areas from 14 patient studies produced an area under the receiver operating characteristic curve (AUC) of 0.91, along with the ability to classify true lumens with 86% accuracy. Dundar et al. in 2011 [15] evaluated digitized slides of tissues for certain cytological criteria and classified the tissues based on the quantitative features derived from the images. The system was trained using a total of 327 regions of interest (ROIs) collected across 62 patient cases and tested with a sequestered set of 149 ROIs collected from 33 patient cases. An overall accuracy of 87.9% was achieved on the test data. The test accuracy of 84.6% was obtained in the borderline case. Melekoodapattu et al. (2022) [16] implemented a system for autonomously diagnosing cancer using an integration method, which includes CNN and image texture attribute extraction. The nine-layer customized convolutional neural network is used to categorize data in the CNN stage. To improve the effectiveness of categorization in the extraction-based phase, texture features are defined and their dimensions are reduced using Uniform Manifold Approximation and Projection (UMAP). The findings of each phase were combined by an ensemble algorithm to arrive at the ultimate conclusion. The final categorization is presumed to be malignant if any of the stage's output is malignant. In the MIAS repository, the ensemble method's testing specificity and accuracy were 97.8% and 98%, respectively, whereas on the DDSM repository, they were 98.3% and 97.9%. However, these handcrafted features and engineering techniques suffered from some major drawbacks. This is because extracting high quality features from low resolution images is very difficult. Then comes another drawback of combining patch-level classification with image-level classification. Usually, patch-level classification is not very effective; to obtain good image-level accuracy, the patch features have to be used in a proper and efficient way.
Breast cancer is one of the few cancers for which an effective screening test is available. Breast cancer detection has been one of the most important fields in the computer vision field. Many DL-based networks have been used by researchers in the past [17][18][19][20][21][22][23]. There are basically two ways of making breast cancer predictions: using DL models and using feature descriptors. In current scenarios where computational capability is increasing exponentially, it is becoming quite natural to utilize it to the fullest to save both time and human effort. Moreover, since huge amounts of computational power are required for prediction using DL models, there is no scope for compromising with human efficiency. Moreover, machines, when trained, can gather vast amounts of information and use it to predict in a short amount of time.
For instance, in this paper [24], Melekoodappattu et al. proposed a model using extreme learning machines (ELM) with the fruitfly optimization algorithm (ELM-FOA) to tune the input weight to obtain optimal output at the ELM's hidden node to obtain the solution analytically. The testing sensitivity and precision of ELM-FOA were 97.5% and 100%, respectively. The developed method can detect calcifications and tumors with 99.04% accuracy. Nirmala et al. in 2021 [25] proposed a new methodology for integrating the run-length features with the bat-optimized learning machine-BORN. BORN also features the most efficient visual saliency segmentation process to obtain a highly efficient diagnosis. BORN was tested with two different datasets, MIAS and DDSM, with different learning kernels, and compared to other intelligent algorithms, such as RF-ELM, EGAM, and associate classifiers. The proposed classifier achieved 99.5% accuracy. In [26], a soft computing classifier for the seven different CNN models was proposed for breast cancer histopathology-image classification. The proposed methodology uses the basic CNN with four convolutional layers, the basic CNN with five layers (with data augmentation), the VGG-19 transfer-learned model (with and without data augmentation), the VGG-16 transfer-learned model (without data augmentation), and the Xception transfer-learned model (with and without data augmentation). It uses seven models to extract features and then passes them through the classifier to make the final predictions. The dataset used in this research was the ICIAR BACH dataset for experimentation, and the model achieved a maximum accuracy of 96.91%. The primary drawback of this method is that training using the seven CNN models takes a huge amount of memory and time. Preetha et al. [27] proposed a PCA-LDA-based feature extraction and reduction (FER) technique that reduces the original feature space to a large extent. For classification, an ANNFIS classifier that uses the neural network concept with some fuzzy rule logic was used. In another paper [28], the authors combined two DCNNs to extract distinguished image features using the concept of transfer learning. The pre-trained Inception and Xceptions models were used in parallel. Then, the feature maps were combined and reduced by dropout before being fed to the last fully connected layer for classification. Then followed sub-image classification and whole-image classification based on a majority vote and maximum probability rules.
In this work, four tissue malignancy levels are considered: normal, benign, in situ carcinoma, and invasive carcinoma. The experiments were performed on the BACH dataset. The overall accuracy for the sub-image classification was 97.29%, and for the carcinoma cases, the sensitivity achieved was 99.58%. Wang et al. [29] proposed a hybrid model with a CNN (for feature extraction) and support vector machine (SVM) (for classification) to classify breast cancer histology images into four classes, benign, normal, in situ, and invasive, for the ICIAR BACH dataset. In addition to traditional image augmentation techniques, this study introduced deformation to microscopic images and then used a multi-model vote to obtain a validation accuracy of 92.5% and a test accuracy of 91.7%. This model used Xception and Inception ResNet v2 as the backbone models of the CNN. Nazeri et al. [30] proposed two consecutive CNNs to predict the classes of breast-cancer images. Firstly, due to the high resolution of the images, the images were converted to 512 × 512 resolution patches, and patch features were extracted by the first CNN. The second CNN utilized these patch features to obtain final image classification. It obtained 95% accuracy on the validation dataset. Training the patch-wise network with the same labels as the image-wise network is a disadvantage to the model's performance because not every patch in an image represents the same category. Golatkar et al. [31] proposed a nuclei-based patch extraction strategy and fine-tuned a pretrained Inception-v3 network to classify the patches. It obtained an average accuracy of 85% for the four classes and 93% for non-cancer (i.e., normal or benign) vs. malignant (in situ or invasive carcinoma). The authors used majority voting in the patch predictions to classify the images. Sanyal et al. [32] used a patch classification ensemble technique for the ICIAR BACH Dataset 2018. At first, the patches are stain-normalized and then passed through VGG-19, Inception-ResNet v2, Inception v3, and ResNet101, and the posterior probabilities are then classified using the XGBoost classifier. Then, all five classifications (four from the models and one from the classifier) are ensembled to obtain the patch-prediction probabilities. After that, in the image-wise pipeline, the patch probabilities were ensembled using mean, product, weighted mean, weighted product, and majority voting to obtain the final image classification. This model obtained a 4-class classification accuracy of 95% and a 2-class classification accuracy of 98.75%.
To capture more discriminant deep features for pathological breast-cancer images, Zou et al. [33] introduced a novel attention high-order deep network (AHoNet) by simultaneously embedding attention mechanisms and high-order statistical representation into a residual convolutional network. AHoNet first employed an efficient channel attention module with non-dimensionality reduction and local cross-channel interaction to achieve local salient deep features of pathological breast-cancer images. Then, their second-order covariance statistics were further estimated through matrix power normalization, which provides Diagnostics 2023, 13, 126 5 of 23 a more robust global feature presentation of pathological breast-cancer images. AHoNet achieved optimal patient-level classification accuracy of 85% on the BACH database. Vang et al. [34] fine-tuned a pre-trained Inception-v3 network on the extracted patches. To obtain image-wise prediction, the patch-wise prediction was ensembled using majority voting, logistic regression, and gradient boosting trees. Mohamed et al. [35] proposed a deep learning approach to detect breast cancer from biopsy microscopy images, in which they examined the effects of different data preprocessing techniques on the performances of deep learning models. They introduced an ensemble method for aggregating the best models in order to improve performance. They showed that Densenet 169, Resnet 50, and Resnet 101 were the three best models, achieving accuracy scores of 62%, 68%, and 85%, respectively, without data pre-processing. With the help of data augmentation and segmentation, the accuracy of these models increased by 20%, 17%, and 6%, respectively. Additionally, the ensemble learning technique improved the accuracy of the models even further. The results show that the best accuracy achieved was 92.5%.
Many methods have been proposed to classify histology images for the ICIAR BACH 2018 dataset, which is an extension of the Bio-imaging 2015 dataset. In all these papers [12][13][14][15][16][24][25][26][27], the high-resolution histology images (1536 × 2048) were pre-processed using different techniques and then segmented into patches. The patches were then passed through fine-tuned, pre-trained deep learning models such as VGG-19, Inception, Xception, and ResNet, and the patch-level results were obtained. After that, the patch-level results were used to obtain image-level results using different functions specific to each paper. Awan et al. [36] fine-tuned the ResNet model for patch-wise classification. Then, the trained ResNet model was used to extract deep feature representations from the patches. After that, the flattened features of 2 × 2 overlapping blocks of patches were used to train an SVM classifier, followed by a majority voting scheme for image-wise classification. Rakhlin et al. [37] proposed a transfer learning approach without fine-tuning based on extracting deep convolutional features from pre-trained networks. The authors used pre-trained CNNs to encode the patches to obtain sparse feature descriptors of low dimensionality, which were trained using a LightGBM classifier to classify the histopathology images.
With the advent of DL models, researchers have started to use these techniques for histopathological breast image classification and have achieved state-of-the-art results. DL is a subcategory of machine learning but is more efficient than machine learning classifiers due to its ability to learn new functions, unlike them, where there is a predefined classifier function. In this study, we used deep CNN models and multiple classification algorithms to form an ensemble for histopathology image classification. The main problem with this sort of learning algorithm is that high-resolution images take a huge amount of time to learn. Moreover, such images are difficult to train because different parts of the image require different attention weights. Hence, a viable alternative is to use learning from patches of the whole image and then ensemble it for the final image prediction.
In the present work, we propose a DL-based method to classify high-resolution breast histology images into one of the four classes: normal, benign, in situ, and invasive. We consider the images and divide them into training, testing, and validation sets. Then, we perform patch-level classification. The issue here is that we do not know what the patch labels are. What we know are the image labels. Thus, at the beginning, we assign each patch the same label as that of its parent image and design a model that outputs the probability of how likely it is that that patch label will be the same as that of the parent image. After that, we extract the deep features using a DL model. Lastly, we classify the patch as either normal, benign, in situ, or invasive using different classifiers and ensemble them. The final step is to convert the patch-level classification into an image-level classification. We take each image and create an array containing the number of patches corresponding to the four classes predicted by the patch prediction model for that image. Then, we pass this frequency array through a two-stage model to obtain the final prediction of four classes.
The key points of this work are as follows: • We aimed to increase the image-level breast histopathology classification accuracy using a patch-level classification.

•
We designed an efficient patch-level training model that is computationally fast. We fine-tuned a pre-trained model (pre-trained on the ImageNet dataset) with convolution, max pooling, and dense layers at the end. We obtained the patch features from this model.

•
We introduce a two-stage model using a neural network to map the patch-level information to image-level prediction into two classes (cancerous and non-cancerous) and four classes (normal, benign, in situ, and invasive).

•
We evaluated our model on a publicly available BACH histopathological dataset and achieved state-of-the-art classification accuracy of 97.50% for four classes and 98.6% for two classes.
The remainder of the paper is structured as follows. The proposed method is described in Section 2. The experimental results and analysis of the proposed method are presented in Section 3, where the discussion of the results is also presented. Finally, we conclude our work and state the limitations and some future possibilities in Section 4.

Materials and Methods
In this section, we first detail the dataset used for the experimentation, and then we discuss the complete process of our proposed model and go over each step in detail.

Dataset Description
The dataset used in this work was ICIAR 2018 BACH (BreAst Cancer Histology images) [38]. The dataset is composed of hematoxylin and eosin (H&E)-stained breast histology microscopy and whole-slide images. Microscopy images are labelled as normal, benign, in situ carcinoma, or invasive carcinoma according to the predominant cancer type or its absence in each image. The annotation was performed by two medical experts, and images where there was a disagreement were discarded. The dataset contains a total of 400 microscopy images; each of the four classes has 100 images. Microscopy images are in .tiff format and have the following specifications: •

Methodology
First, we divided the dataset into training, testing, and valid sets. First, we split the 400 histology images into 80% as the intermediate set and 20% as the test set. Further, we divided the intermediate set into 65% for the training set and 15% for the validation set. In the later part of the whole model, after patch classification, we added the 15% of the validation data into the training data during image classification. Next, we divided the training and validation images into patches and trained a patch-level classifier. After that, we applied an image pre-processing model and normalized the images using the Macenko stain normalization [39] method. Then, we fed the images into different pre-trained deep learning models to extract model features and concatenated these features. Next, we converted these patch-level predictions to create a frequency array and merge the validation data with the training data. Then, we passed the features through different classifiers and ensembled them using different techniques, analyzing which ensemble technique works best. After that, we devised a neural network to map these patch-level predictions to imagelevel predictions. State-of-the-art methods have achieved this patchy image transition through different ensembling techniques, including the mean rule, product rule, max rule, etc. We designed a deep learning model and then leave the work of learning an appropriate function for the model. We divide the model into 2 broad classes: cancerous (invasive and in situ) and non-cancerous (normal and benign). Then, we use this information to modify the frequency array. Another neural network was created to classify the final image into four categories: normal, benign, in situ, and invasive. We basically made a two-stage neural network. As a result, we did not have to experiment with various ensemble methods to convert this patch-level classification into image-level classification. The overall pipeline of the proposed method is shown in Figure 1. Concisely, the following are the major steps of our proposed method:

•
Dividing the dataset into training, testing, and validation sets. • Patch segmentation and labelling. • Pre-processing the image using a stain normalization technique and an image augmentation model.

•
Feature extraction from patches using concatenation of features from different DL models.

•
Patch classification using an ensemble of machine learning classifiers. • Image classification using the patch classification results, where a two-stage neural network has been applied.
through different ensembling techniques, including the mean rule, product rule, max ru etc. We designed a deep learning model and then leave the work of learning an approp ate function for the model. We divide the model into 2 broad classes: cancerous (invas and in situ) and non-cancerous (normal and benign). Then, we use this information modify the frequency array. Another neural network was created to classify the final i age into four categories: normal, benign, in situ, and invasive. We basically made a tw stage neural network. As a result, we did not have to experiment with various ensem methods to convert this patch-level classification into image-level classification. The ov all pipeline of the proposed method is shown in Figure 1. Concisely, the following are t major steps of our proposed method: • Dividing the dataset into training, testing, and validation sets.

•
Patch segmentation and labelling.

•
Pre-processing the image using a stain normalization technique and an image au mentation model.

•
Feature extraction from patches using concatenation of features from different D models.

•
Patch classification using an ensemble of machine learning classifiers. • Image classification using the patch classification results, where a two-stage neu network has been applied.

Image Splitting
There are a total of 400 images with 1536 × 2048 resolution in the ICIAR BACH 20 dataset [38]. The pixel values of the images are in the range 0 to 255. Thus, to prevent t values from exploding, we converted the pixel values into the range 0 to 1 by dividing 255. Thus, for every image, we had 1536 × 2048 data points (pixels) with a standard de ation value of 1 and a mean value between 0 and 1. We used 65% of the dataset for t training, 15% of the dataset for validation, and the remaining 20% for the testing of t model. We added 15% of the validation data as training data in the latter part of this pip line during image classification. We did this so that the image classification model can

Image Splitting
There are a total of 400 images with 1536 × 2048 resolution in the ICIAR BACH 2018 dataset [38]. The pixel values of the images are in the range 0 to 255. Thus, to prevent the values from exploding, we converted the pixel values into the range 0 to 1 by dividing by 255. Thus, for every image, we had 1536 × 2048 data points (pixels) with a standard deviation value of 1 and a mean value between 0 and 1. We used 65% of the dataset for the training, 15% of the dataset for validation, and the remaining 20% for the testing of the model. We added 15% of the validation data as training data in the latter part of this pipeline during image classification. We did this so that the image classification model can be trained using data from the validation set, resulting in a more robust model that performed better on the test set. Figure 2 shows all four classes of images in the dataset. trained using data from the validation set, resulting in a more robust model that performed better on the test set. Figure 2 shows all four classes of images in the dataset.

Patch Segmentation and Labelling
Since these individual images have very high resolution, we converted the images into patches. The main problems with learning from high-resolution images are the time and space requirements. It is almost impossible to fit those high-resolution images into a model unless we have high-end resources. Thus, we divide each image into patches and then feed it to a model for learning. We consider the image and convert it into patches of 512 × 512 × 3 with an overlap of 50% with the adjacent patch to keep the spatial orientation of the data. The size of the patches is taken as 512 because making it smaller would mean the patches would get a much-magnified portion and may be misclassified by a learning model. Smaller patches, on the other hand, would fail to take the context of the images into account and would appear similar for different classes of images. Again, another problem is that it is not completely correct to label all 35 patches with the same label as the image of which they are part. The reason is that there are normal, healthy cells among the cancer cells in the invasive and in situ cases. As a result, some patches are less likely to have the same label as the image of which they are a part. For example, a cancerous cell image may have a patch that is predominantly healthy, i.e., normal or benign. Thus, the problem now is the patch labelling. Since we do not have a region of interest (ROI) for each cell, we assign the same label to the patch as to the image. Thus, all 35 patches (with a patch size of 512 × 512 and a stride of 256) from an image will have the same labels. L(i, j) represents the label of the jth patch of the ith image, which we assign as the ground truth. If the ground truth label of each image is represented as L(i), then l(i, j) can be defined as l(i, j) = L(i) for every j such that 1 ≤ j ≤ 35.

Patch Segmentation and Labelling
Since these individual images have very high resolution, we converted the images into patches. The main problems with learning from high-resolution images are the time and space requirements. It is almost impossible to fit those high-resolution images into a model unless we have high-end resources. Thus, we divide each image into patches and then feed it to a model for learning. We consider the image and convert it into patches of 512 × 512 × 3 with an overlap of 50% with the adjacent patch to keep the spatial orientation of the data. The size of the patches is taken as 512 because making it smaller would mean the patches would get a much-magnified portion and may be misclassified by a learning model. Smaller patches, on the other hand, would fail to take the context of the images into account and would appear similar for different classes of images. Again, another problem is that it is not completely correct to label all 35 patches with the same label as the image of which they are part. The reason is that there are normal, healthy cells among the cancer cells in the invasive and in situ cases. As a result, some patches are less likely to have the same label as the image of which they are a part. For example, a cancerous cell image may have a patch that is predominantly healthy, i.e., normal or benign. Thus, the problem now is the patch labelling. Since we do not have a region of interest (ROI) for each cell, we assign the same label to the patch as to the image. Thus, all 35 patches (with a patch size of 512 × 512 and a stride of 256) from an image will have the same labels. L(i, j) represents the label of the jth patch of the ith image, which we assign as the ground truth. If the ground truth label of each image is represented as L(i), then l(i, j) can be defined as l(i, j) = L(i) for every j such that 1 ≤ j ≤ 35.

Image Augmentation and Pre-Processing
Data normalization and regularization are the two most important aspects of any machine learning algorithm. Data normalization helps us overcome many training issues, such as vanishing or exploding gradients, faster training, and smoother descent to minima for the cost function. Data regularization, on the other hand, prevents the training data from over-fitting.
We designed a sequential model, as shown in Figure 3, which adds regularized versions of the images to the train dataset. The model is:

•
Horizontal and vertical flips randomly; • Rotates by a random angle given as a parameter; • Translates by a random fraction.
• Rotates by a random angle given as a parameter; • Translates by a random fraction.
At first, we pass the patches through Macenko stain normalization [39]. This basicall converts all the images into the same fixed color space as the template image, which taken to be one of the patches of the training dataset. This helps to give a constant bas and reduce stain varieties across the dataset. This also gives a generic base for the image and helps increase the accuracy of the model. The original image, the stained normalize image, and the separated hematoxylin and eosin stain are all shown in Figure 4. The no malized image has a more contrasting effect that is even easier for the human eye to re ognize than that of the original image. These random layers help to increase variety in th training dataset and help in generalizing the training dataset, which ultimately increase the size and orientation of training data and helps the model learn a greater variety o images. This increases the overall accuracy of the patch prediction model and preven the model from over-fitting the training data. Figure 3. Image pre-processing method that consists of three sequential layers that take images the input and give pre-processed images as the output. Figure 3. Image pre-processing method that consists of three sequential layers that take images as the input and give pre-processed images as the output.
At first, we pass the patches through Macenko stain normalization [39]. This basically converts all the images into the same fixed color space as the template image, which is taken to be one of the patches of the training dataset. This helps to give a constant base and reduce stain varieties across the dataset. This also gives a generic base for the images and helps increase the accuracy of the model. The original image, the stained normalized image, and the separated hematoxylin and eosin stain are all shown in Figure 4. The normalized image has a more contrasting effect that is even easier for the human eye to recognize than that of the original image. These random layers help to increase variety in the training dataset and help in generalizing the training dataset, which ultimately increases the size and orientation of training data and helps the model learn a greater variety of images. This increases the overall accuracy of the patch prediction model and prevents the model from over-fitting the training data.

Patch-Level Feature Extraction
We designed a DL-based model to extract features from patches, with four outputs each representing the probability with which the patch belongs to that particular class, as mentioned earlier. Let "P" be the output vector of this model. P(i, j) is the probability with which the "ith" patch belongs to the "jth" class, where 1 ≤ j ≤ 4, which indicates each of the 4 classes: normal, benign, in situ, and invasive. We obtained the probability value from a softmax function applied at the last layer of the neural network. We divided the model into two stages: first, dimension reduction, and second, finetuning. We designed a dimension reduction method using a pre-trained DL model. We first pass all the patches through a pre-trained model and use their trained weights to output a low-dimensional feature vector. We first fine-tuned a few deep learning models, VGG-19 [40], VGG-16 [40], Inception-ResNet v2 [41], Xception [42], and ResNet 50 [43]; and calculated the validation accuracy of these models. We found out that VGG-19, VGG-16, and Inception-ResNet v2 produce better accuracy than Xception and ResNet 50. Thus, we moved forward by taking VGG-19, VGG-16, and Inception-ResNet v2 as the dimension-reduction models.
However, the problem is that these features are untrained and may not be completely suitable for this classification. Now comes the second stage, i.e., fine-tuning. Thus, we trained these features using a few convolutional, dropout, activation, batch-normalization and dense layers. The output was a dense layer of four nodes with softmax activation, as shown in Figure 5. We used a dropout of 0.4 and 64 neurons in the added layer. We also used L2 regularization, "He-uniform" as the kernel initializer and sigmoid as the activation function. The sigmoid function can be defined as follows:

Patch-Level Feature Extraction
We designed a DL-based model to extract features from patches, with four outputs each representing the probability with which the patch belongs to that particular class, as mentioned earlier. Let "P" be the output vector of this model. P(i, j) is the probability with which the "ith" patch belongs to the "jth" class, where 1 ≤ j ≤ 4, which indicates each of the 4 classes: normal, benign, in situ, and invasive. We obtained the probability value from a softmax function applied at the last layer of the neural network. We divided the model into two stages: first, dimension reduction, and second, finetuning. We designed a dimension reduction method using a pre-trained DL model. We first pass all the patches through a pre-trained model and use their trained weights to output a low-dimensional feature vector. We first fine-tuned a few deep learning models, VGG-19 [40], VGG-16 [40], Inception-ResNet v2 [41], Xception [42], and ResNet 50 [43]; and calculated the validation accuracy of these models. We found out that VGG-19, VGG-16, and Inception-ResNet v2 produce better accuracy than Xception and ResNet 50. Thus, we moved forward by taking VGG-19, VGG-16, and Inception-ResNet v2 as the dimensionreduction models.
However, the problem is that these features are untrained and may not be completely suitable for this classification. Now comes the second stage, i.e., fine-tuning. Thus, we trained these features using a few convolutional, dropout, activation, batch-normalization and dense layers. The output was a dense layer of four nodes with softmax activation, as shown in Figure 5. We used a dropout of 0.4 and 64 neurons in the added layer. We also used L2 regularization, "He-uniform" as the kernel initializer and sigmoid as the activation function. The sigmoid function can be defined as follows: the average values for every channel. Thus, for every channel, there was only one value left. Thus, the dimension became (batch_size x no. of channels). This then passed through the batch normalization layer, followed by two dense layers with 64 and 4 neurons, respectively. The first dense layer had a sigmoid activation function. Then, lastly, the data passed through a softmax activation function. Then, after the training of the model, we obtain the features from the output of the GAP layer. Thus, the features have 512 dimensions. Details of hyper-parameter tuning for the VGG19 model were as follows: • A dropout value of 0.4 produced the best results. • Learning rate: the learning rate was varied as lr = lr0/ (1 + decay rate) after each epoch. Lr0 was chosen as 0.008.    Figure 5 depicts the extraction of the first untrained patches from the three models: VGG-19, VGG-16, and Inception-ResNet v2. The fine-tuning model was then used to train these features one by one to extract them. After that, the features were concatenated from the three separate models. In the fine-tuning model, the first layer had 512 kernels of size (3 × 3) with a stride of 1. The next max pooling layer had a (2 × 2) kernel and a stride of (2, 2). Then, the data passed through a GAP (global average pooling) layer, which resulted in the average values for every channel. Thus, for every channel, there was only one value left. Thus, the dimension became (batch_size x no. of channels). This then passed through the batch normalization layer, followed by two dense layers with 64 and 4 neurons, respectively. The first dense layer had a sigmoid activation function. Then, lastly, the data passed through a softmax activation function. Then, after the training of the model, we obtain the features from the output of the GAP layer. Thus, the features have 512 dimensions. Details of hyper-parameter tuning for the VGG19 model were as follows: We compared the results of some pre-trained models for the purpose of dimension reduction and observed that the VGG-19 model gave the best result. During training, the models overfit after about 40 epochs, so it was necessary to stop early during the epochs when the validation accuracy increased by a negligible amount in the consecutive five epochs. Thus, we defined a callback function to do the same. We obtained the features from the second-to-last layer of the three pre-trained models (VGG-19, VGG-16, and Inception-ResNet v2) and then concatenated them. Since different models give different types of features, we allowed the overall model to learn as much as possible from all three models by concatenating the three feature sets from the three models. We concatenated the features after the fine-tuning stage, which were later classified by machine learning classifiers, as described in Section 2.2.5. We trained this model using the adam optimizer and the cross-entropy loss function. Cross-entropy loss is calculated according to Equation (2).
Cross Entropy Loss = − ∑ y original log y predicted (2) where y original is the original label of the patch and y predicted is the predicted label of the patch. Categorical accuracy is calculated as the ratio of the patches correctly predicted to the total number of patches.

Patch Classification
We obtained concatenated features from the feature extraction model, as shown in Figure 5. Now, to get output probabilities from the features, we need to classify these features into four different classes. In this stage, patch features are classified using different machine learning classifiers [44]. The classification algorithms used are: We took the patch features obtained in the previous step and then passed them through the different classifiers mentioned earlier. Different classifiers have different benefits, and no single classifier can provide all the information. Hence, to get combined information from different classification techniques based on the probability values obtained previously, we performed different ensemble techniques on the outputs of these classifiers, which are described in Section 2.2.6. Ensembling provided a higher level of accuracy than any of the individual standalone classifiers.

Ensemble Techniques
An ensemble method is a machine learning technique that combines the outputs of several base models in order to produce one optimal predictive model. It basically uses different types of results, giving different types of information about an entity, and then combines them using an ensemble function to obtain more refined information. In this case, we use an ensemble technique, as shown in Figure 6. Different ensemble techniques are defined as follows: Out of these, we chose the one that produces the maximum accuracy as the final ensemble technique. Next, to normalize the probabilities, these one-hot probabilities were passed through a softmax activation layer. From patch feature extraction and patch classification we obtained the class of which a particular patch is a member. For each image, we generated a frequency array (as shown in Figure 7) that counts the numbers of normal, benign, in situ, and invasive patches. We also added the remaining 15% of the validation set for prediction with the original 65% train images at this stage. Then, we divided the patch predictions into 35 equal halves. As a result, the first 35 patches represent the first image, the following 35 patches represent the second image, and so on. Then, for each set of 35 patches, we defined a one-dimensional array of size 4. Array positions 0, 1, 2, and 3 represent the counts of normal, benign, in situ, and invasive patches of that image. Therefore, the array for the "ith" image would be Freq(i) = {Count (Normal), Count (Benign), Count (In situ), Count (Invasive)}, where Count(x) represents the frequency of patches predicted as type "x" of all the 35 patches belonging to "ith" image. Out of these, we chose the one that produces the maximum accuracy as the final ensemble technique. Next, to normalize the probabilities, these one-hot probabilities were passed through a softmax activation layer.

Image Classification: Part 1
From patch feature extraction and patch classification we obtained the class of which a particular patch is a member. For each image, we generated a frequency array (as shown in Figure 7) that counts the numbers of normal, benign, in situ, and invasive patches. We also added the remaining 15% of the validation set for prediction with the original 65% train images at this stage. Then, we divided the patch predictions into 35 equal halves. As a result, the first 35 patches represent the first image, the following 35 patches represent the second image, and so on. Then, for each set of 35 patches, we defined a one-dimensional array of size 4. Array positions 0, 1, 2, and 3 represent the counts of normal, benign, in situ, and invasive patches of that image. Therefore, the array for the "ith" image would be Freq(i) = {Count (Normal), Count (Benign), Count (In situ), Count (Invasive)}, where Count(x) represents the frequency of patches predicted as type "x" of all the 35 patches belonging to "ith" image. In this stage, we classify images into two classes: cancerous and non-cancerous. We consider both normal and benign types as non-cancerous, whereas in situ and invasive types as cancerous. For this, we created another frequency array, which can be defined as follows: New_Freq(i) = [Cancerous Count, Non-Cancerous Count] for i = train images + validation images where, Non-cancerous Count = Normal Count + Benign Count and Cancerous Count = In situ Count + Invasive Count. Now, to map the frequencies into patch predictions, we used a neural network to select combinations of these frequencies and a suitable activation for getting image-level predictions. The model shown in Figure 8 used the softmax activation layer, the dense layer with one output node with "L2" regularization, and the sigmoid activation. Here, we used the adam optimizer and the binary cross entropy loss function. We have also experimented with the RMSprop and the SGD as optimizer functions, but Adam produced the best result, as it combines the properties of both the RMSprop and the SGD.  In this stage, we classify images into two classes: cancerous and non-cancerous. We consider both normal and benign types as non-cancerous, whereas in situ and invasive types as cancerous. For this, we created another frequency array, which can be defined as follows: New_Freq(i) = [Cancerous Count, Non-Cancerous Count] for i = train images + validation images where, Non-cancerous Count = Normal Count + Benign Count and Cancerous Count = In situ Count + Invasive Count. Now, to map the frequencies into patch predictions, we used a neural network to select combinations of these frequencies and a suitable activation for getting image-level predictions. The model shown in Figure 8 used the softmax activation layer, the dense layer with one output node with "L2" regularization, and the sigmoid activation. Here, we used the adam optimizer and the binary cross entropy loss function. We have also experimented with the RMSprop and the SGD as optimizer functions, but Adam produced the best result, as it combines the properties of both the RMSprop and the SGD. In this stage, we classify images into two classes: cancerous and non-cancerous. We consider both normal and benign types as non-cancerous, whereas in situ and invasive types as cancerous. For this, we created another frequency array, which can be defined as follows: New_Freq(i) = [Cancerous Count, Non-Cancerous Count] for i = train images + validation images where, Non-cancerous Count = Normal Count + Benign Count and Cancerous Count = In situ Count + Invasive Count. Now, to map the frequencies into patch predictions, we used a neural network to select combinations of these frequencies and a suitable activation for getting image-level predictions. The model shown in Figure 8 used the softmax activation layer, the dense layer with one output node with "L2" regularization, and the sigmoid activation. Here, we used the adam optimizer and the binary cross entropy loss function. We have also experimented with the RMSprop and the SGD as optimizer functions, but Adam produced the best result, as it combines the properties of both the RMSprop and the SGD.  From the 2-class classification model, we obtained a probability array that indicates whether the image is cancerous or non-cancerous. Then, we predicted the training images using the above method and considered the image to be cancerous if the probability was greater than 0.5, and non-cancerous otherwise. Let the sigmoid probability value for the "ith" image to be cancerous be P i . This means that the probability of the image being cancerous is P i , and the probability of it being non-cancerous is (1 − P i ). Thus, the new modified frequency array will be defined as follows: In situ Count X (P i ), Invasive Count X (P i )] (7) In this case, a factor of P i is multiplied by the cancerous class (i.e., in situ and invasive types) and (1 − P i ) by the non-cancerous class, resulting in a weighted frequency array that improves and accelerates learning. These probabilities are similar to attention weights that are used on different inputs. Thus, to make our input features, i.e., the frequency array, better, we used weights on them so that cancerous images are not predicted as non-cancerous or vice versa.

Image Classification: Part 2
This stage helps in training the final image prediction model. By studying the old frequency array, it was found that cancerous images are rarely misjudged as non-cancerous or vice versa. Rather, it happens in intra-class scenarios. Normal cells are misjudged as benign, or vice versa. Similarly, in situ cells are misjudged as invasive, or vice versa. Thus, we classify the cells into cancerous and non-cancerous types, then perform multiplication (as shown in Figure 9) between the weights and the original frequency array to get the modified frequency array. This stage helps in training the final image prediction model. By studying the frequency array, it was found that cancerous images are rarely misjudged as non-ca ous or vice versa. Rather, it happens in intra-class scenarios. Normal cells are misju as benign, or vice versa. Similarly, in situ cells are misjudged as invasive, or vice v Thus, we classify the cells into cancerous and non-cancerous types, then perform m plication (as shown in Figure 9) between the weights and the original frequency arr get the modified frequency array.
The modified frequency array is then subjected to a softmax activation to norm the value. We pass these frequency arrays through our final image prediction model model depicted in Figure 10 employs the dense layer with four output neurons and moid activation, and the dense layer with four output neurons and softmax activa The final softmax layer yields the probabilities with which the image belongs to that c Output layer = {P1, P2, P3, P4}, where Pi is the probability with which the imag longs to the "ith" class for 1 ≤ i ≤ 4, where "i" represents either normal, benign, in sit invasive. During the training of this model, we provide a decay rate of approxim (learning rate/number of epochs) to the learning rate. The learning rate (lr) gets upd as lr = lr/ (1 + decay rate) after each epoch.   The modified frequency array is then subjected to a softmax activation to normalize the value. We pass these frequency arrays through our final image prediction model. The model depicted in Figure 10 employs the dense layer with four output neurons and sigmoid activation, and the dense layer with four output neurons and softmax activation. The final softmax layer yields the probabilities with which the image belongs to that class.
Output layer = {P1, P2, P3, P4}, where Pi is the probability with which the image belongs to the "ith" class for 1 ≤ i ≤ 4, where "i" represents either normal, benign, in situ, or invasive. During the training of this model, we provide a decay rate of approximately (learning rate/number of epochs) to the learning rate. The learning rate (lr) gets updated as lr = lr/ (1 + decay rate) after each epoch.
Performing convolution on high-resolution images is very time-consuming. Moreover, learning from a large image has its own problems, as the model sometimes fails to decide which parts of the image to emphasize and which parts to ignore. Thus, dividing images into patches is one of the ways to solve this problem. We trained the patches first, and then devised a method for mapping patch classification to image classification.

Experimental Setup
The computational system used for the work has a single GPU system P100 with 25 GB RAM and 128 GB storage. We used an IPython notebook for writing the code and used quite a few libraries to facilitate our model to get faster results. These libraries included tensorflow keras libraries for using the inbuilt layers such as Dense and Convolution, to name a few; matplotlib.pyplt for plotting the graphs of accuracy vs. epochs; sklearn for train-test-split function to divide the dataset into required sets.

Performance Evaluation Metrics
We evaluated our model using some standard performance metrics, based on which we can determine how well the model works for classifying the breast cancer histology images. We define the metrics as follows: Accuracy is the measure of the ratio of correctly predicted labels to the total size of the dataset. It is defined by Equation (8).
Precision is defined as the proportion of correctly predicted labels in a class to the total number of samples in the class. It is calculated according to Equation (9).
Recall is defined as the proportion of positive samples to all positive samples in that class.
The F1-score is the weighted harmonic mean of the precision and recall scores.
where true positive, true negative, false positive, and false negative are defined as follows: True positive = number of images correctly classified as that class. False positive = number of images incorrectly classified as that class. False negative = number of images in that class incorrectly classified as some other class.

Experimental Results
At first, we experimented with a few transfer learning models and analyzed the results obtained on the BACH dataset. Table 1 shows the accuracy of different fine-tuned models, which shows that small models such as VGG-16 and VGG-19 produce comparatively better results than the deeper models. For this experimentation, we froze all but the last layer for each model, then added a dense layer with softmax activation to get the class prediction. Next, the last two layers for each of the models were unfrozen, and we calculated the accuracy and went on increasing the number of unfrozen layers. The unfrozen layers were actually the ones that were being trained. We observed that after four layers of unfreezing, the accuracy was not increasing but rather experiencing a slight dip towards the end. This is because in the beginning, the amount of data was quite high for training small parameters, but as we went on unfreezing the layers, the number of parameters to train went on increasing, and an optimum was reached, after which the data became smaller in comparison to the number of parameters to train. Thus, we kept four unfrozen layers at the end while training the fine-tuned model. The number of epochs was generally 50-60, after which it started overfitting and there was a drop in validation accuracy. We used the adam optimizer with beta_1 set to 0.8, beta_2 set to 0.99, and the rest of the parameters left alone. Here, beta_1 and beta_2 are the decay rates of the first and second moment estimates, respectively. From Table 1, we can see that the fine-tuned VGG-16 model yielded the best result of 81.5% validation accuracy-superior to Inception-Resnet v2 and VGG-19. We see that lighter models such as VGG-16 or VGG-19 tend to give better results when huge amounts of data are present. After fine-tuning the pre-trained model, we tried out different splitting combinations so as to find the optimal solution. The idea is to get enough data to train the parameters, but we also have to keep in mind not to overfit the model. Overfitting is one of the main problems in the training phase. To counteract this, we introduce the concept of a validation dataset. The validation set checks the cross-entropy loss and categorical accuracy after each epoch. This helps to check whether the accuracy of unseen data is increasing with epochs or the training data are being overfit. Hence, as soon as the validation accuracy reaches a flat area in the curve, we tend to stop the training at that point. From Table 2, we can see that the train-test split 80/20 worked best. Figure 11 shows how the training and validation accuracies increased on increasing the number of epochs. It also shows the model overfits, but we stopped training when the validation accuracy did not increase when further increasing the number of epochs. Figure 12 shows the validation loss curves of the VGG-16, VGG-19 and Inception-ResNet v2 models.

Patch Classification Results
A maximum accuracy of 81.5% was obtained with the VGG16 model, as shown in Table 1. After that, the features were concatenated, and the concatenated features were used for the classification using machine learning classifiers. The classification accuracy was listed in Table 3.

Patch Classification Results
A maximum accuracy of 81.5% was obtained with the VGG16 model, as shown in Table 1. After that, the features were concatenated, and the concatenated features were used for the classification using machine learning classifiers. The classification accuracy was listed in Table 3.

Patch Classification Results
A maximum accuracy of 81.5% was obtained with the VGG16 model, as shown in Table 1. After that, the features were concatenated, and the concatenated features were used for the classification using machine learning classifiers. The classification accuracy was listed in Table 3. We extracted features from the three pre-trained models (VGG-19, VGG-16, and Inception-ResNet v2) and then fine-tuned them. After that, we concatenated the features. Now, after concatenating the features, we classified the features using machine learning classifiers, such as KNN, SVM, random forest, Adaboost, and XGBoost. Among the classifiers, SVM yielded the best result of 81.5% accuracy. We ensembled the classification results using the average, product, and maximum rules, as mentioned in Section 2.2.6. We obtained the final ensemble results, which used information from all the DCNN models and various ensemble techniques. Out of the three ensemble techniques, we found that the averaging technique produced the best accuracy of 82.5%. Since the ensemble uses all the information from all the classification techniques, it produces better results than any of them.
In Table 4, we can see the different performance metrics explained in Section 3.2 for all four classes of histopathology images. We found that invasive images had the highest precision and F1 score, whereas in situ images had the highest recall values.

Image Classification Results
The model, as explained in Section 2.2.7, gives an accuracy of 98.6% in 2-class classification: benign (normal and benign) and malignant (invasive and in situ). However, going further into the model, we use the posterior probabilities only to modify the frequency array and use the probabilities as weights to amplify the input going into the final image classification model.
From Table 5, we can see that normal, in situ, and invasive have comparatively higher accuracy than benign images, and we achieved the highest overall accuracy of 97.50% across all the four classes, which is superior performance to many state-of-the-art models. Table 4 lists the performance metrics for the patches of the four classes. From Table 4, we can see that invasive patches have the highest precision, and in situ patches have the lowest precision. The reason is invasive images have very different structures compared to the other three classes. In situ images, on the other hand, have localized infections, and their infections seem as lobules in normal and benign cases. Additionally, another valid reason can be that while converting images to patches, we have labelled all the patches to have the same label as that of the image. Thus, it is difficult to even manually detect in situ images, so we achieved a low level of precision. However, in reality, we can have a few normal or benign patches in in situ images. Thus, labelling those patches as in situ will be like feeding wrong information to the model, which can have a negative impact on the accuracy of the patch prediction model. After patch classification comes the part of developing a method that takes the patchlevel output as input and uses it to get image-level output. For each image, as described in the proposed method section, we create a frequency array that indicates the number of patches that belong to each class for an image. Out of 35 patches for an image, something like {28, 4, 2, 1} indicates that 28 patches belong to class 1, 4 patches belong to class 2, 2 patches belong to class 3, and 1 patch belongs to class 4. Here, we added the 20% untrained validation data to this trained data to prevent the data from over-fitting. Now, we use a neural network that maps this frequency array to the final image prediction. The training and validation accuracies for 4-class image classification over the epochs are shown in Figure 13. An accuracy of 97.5% on the test dataset was obtained. In Table 5, we list different performance metrics, such as precision, recall, F1 score, and accuracy for the image-level classification.  After patch classification comes the part of developing a method that takes the patchlevel output as input and uses it to get image-level output. For each image, as described in the proposed method section, we create a frequency array that indicates the number of patches that belong to each class for an image. Out of 35 patches for an image, something like {28, 4, 2, 1} indicates that 28 patches belong to class 1, 4 patches belong to class 2, 2 patches belong to class 3, and 1 patch belongs to class 4. Here, we added the 20% untrained validation data to this trained data to prevent the data from over-fitting. Now, we use a neural network that maps this frequency array to the final image prediction. The training and validation accuracies for 4-class image classification over the epochs are shown in Figure 13. An accuracy of 97.5% on the test dataset was obtained. In Table 5, we list different performance metrics, such as precision, recall, F1 score, and accuracy for the imagelevel classification. We compared our proposed method with some of the modern methods and record the results in Table 6. It can be seen in Table 6 that our method outperforms the state-ofthe-art methods considered here for comparison by a good margin.  We compared our proposed method with some of the modern methods and record the results in Table 6. It can be seen in Table 6 that our method outperforms the state-of-the-art methods considered here for comparison by a good margin.

Conclusions
Classification of breast-cancer tissues using histopathological images is a challenging task. In the present work, we proposed a DL-based method for breast cancer image classification with histopathological images. First, we introduced a model to classify image patches, as these images are of high resolution, and it is very expensive to deal with the images directly in DL-based models. We divide the images into patches, perform pre-processing, and classify these patches into four categories. After that, we use the information about patch probabilities from this model to convert this patch-level accuracy to image-level accuracy. In general, state-of-the-art models convert patch-level classification results to image-level predictions by using various ensembling techniques, such as averaging the results, taking the element wise product, or taking the maximum of all, but in this work, we designed a two-stage neural network to map the patch-level information to image-level prediction in four classes: normal, benign, invasive, and in situ. We compared the proposed method's results with past methods, and it is shown that the method produces promising results and can be used as a reliable diagnostic tool for breast cancer classification from histopathological breast cancer images.
Our proposed model has achieved promising classification results with histopathological breast images. However, for this model, the major limitation is that we have to train quite a few models to achieve the ensemble accuracy, which is time-consuming and makes the entire model a little bit complex. Hence, development of an end-to-end deep learning model can be explored in future research attempts, where high-resolution images could be directly fed as input and hopefully produce good results. Additionally, the patch-level accuracy could be improved by designing a model that can assign patch-level labels more accurately, rather than manually assigning all the patches the same labels as the images of which they are parts. Other histopathological image datasets can be explored in the future for the robustness of the model.