Machine Learning Methods for Histopathological Image Analysis: A Review

Histopathological images (HIs) are the gold standard for evaluating some types of tumors for cancer diagnosis. The analysis of such images is not only time- and resource-consuming, but also very challenging, even for experienced pathologists, resulting in inter- and intra-observer disagreements. One way of accelerating such an analysis is to use computer-aided diagnosis (CAD) systems. In this paper, we present a review of machine learning methods for histopathological image analysis, including shallow and deep learning methods. We also cover the most common tasks in HI analysis, such as segmentation and feature extraction. In addition, we present a list of publicly available and private datasets that have been used in HI research.


Introduction
Current hardware capabilities and computing technologies enable computers to solve problems in many fields. The medical field employs such technologies as a means of improving population health and quality of life, and medical computer-aided diagnosis is a suitable example thereof. Among image-based diagnostic modalities, magnetic resonance imaging (MRI), X-rays, computed tomography (CT), and ultrasound have been attracting growing interest from scientists and academics. Likewise, histopathological images (HIs) are another kind of medical imaging, obtained by microscopy of tissue from biopsies, which enables specialists to observe tissue characteristics at the cell level (Figure 1).
Cancer is a disease with high mortality rates in both developed and developing countries. In addition to causing death, the costs of treatment are high and burden both public and private healthcare systems, penalizing the government and the population. As mentioned by Torre et al. [1], the mortality rate in high-income countries is stabilizing or even decreasing thanks to programs targeting the reduction of risk factors (e.g. smoking, overweight, physical inactivity) and to improvements in treatment. In low- and middle-income countries, mortality rates are rising due to an increase in risk factors. One of the keys to improving treatment is the early detection of tumors. In fact, in 140 out of 184 countries, breast cancer is the most prevalent type of cancer among women [2]. Imaging exams like mammography, ultrasound or CT can detect masses growing in breast tissue, but the type of tumor can only be confirmed by means of a biopsy. Biopsies, in turn, take more time to provide a result due to the acquisition procedure (e.g. fine-needle aspiration or open surgical biopsy), the tissue processing (slide preparation and staining), and finally the pathologist's visual analysis. This analysis is a highly specialized and time-consuming task, prone to inter- and intra-observer discordance [3]. Furthermore, the staining process itself introduces variability into the analysis: although hematoxylin and eosin (H&E) is the most common and accessible stain, it can produce different color intensities depending on the brand, storage time, and temperature. Therefore, computer-aided diagnosis (CAD) can increase pathologists' throughput and improve confidence in the results by adding reproducibility to the diagnosis process and reducing observer subjectivity.
Figure 1. Example of (a) benign and (b) malignant HIs [4]. One important feature in cancer diagnosis is the observation of nuclei. Tumors like ductal carcinoma and lobular carcinoma present irregular growth of the epithelial cells in these structures. A high number of nuclei or a high number of mitotic cells in a small region can indicate irregular tissue growth, i.e. a tumor. An HI can capture this feature, but besides the nuclei it will also capture healthy tissue that can likewise be seen in images of benign tumors. Stroma, for instance, is a type of tissue that shows the same characteristics in parts of malignant and benign images. Selecting the most relevant patches could therefore improve the classification process.
In recent years, we have witnessed an increasing use of machine learning (ML) methods in CAD and HI analysis. ML methods have been used in the pathological diagnosis of cancer in different tissues and organs such as breast, prostate, skin, brain, bones, and liver. They have been widely used in segmentation, feature extraction and classification of HIs. HIs have rich geometric structures and complex textures, which differ from the visual characteristics of the macro-vision images used in other machine learning tasks such as object recognition, face recognition, scene reconstruction or event detection.
In this review we attempt to capture the most relevant works of the last decade that employ ML methods for HI analysis. We present a comprehensive overview of ML methods for HI analysis, including segmentation, feature extraction and classification. The motivation is to understand the development and use of ML methods in HI analysis and to discover their future potential. Furthermore, this review aims to address the following three research questions: 1. Which ML methods have been used for HI classification, and how are HIs provided to them (raw images, pre-processed images, or extracted features)? This question aims at identifying which monolithic classifiers, ensembles of classifiers or DL methods have been frequently used to classify HIs.
2. Which elements of HIs are considered the most important ones and how are they obtained? This question aims at identifying which types of tissues or structures can be identified using ML methods. 3. What trends have been dominating HI analysis? This question aims at identifying the most promising ML methods for HI analysis in the near future.
The main contributions of this paper are: (i) it covers a period of exponential change in computer vision, from handcrafted features to representation learning methods; (ii) it is a comprehensive review, which does not focus on HIs of specific tissues or organs; (iii) it categorizes the works according to the task: segmentation, feature extraction, classification, and representation learning and classification. This allows researchers to compare their works in the same context. There are several survey and review papers related to HI analysis, which are presented in Section 7. Unlike previous reviews or surveys on HIs, which focus only on specific tissues or organs, or on a single learning modality (supervised, unsupervised or DL techniques), this review covers different approaches, methodologies, datasets, and experimental results, so that readers can identify possible opportunities for future research in HI analysis.
This paper is organized as follows. Section 2 proposes a taxonomy to categorize the ML methods used in HIs, as well as an overview of the process of selecting journals and proceedings. Section 3 presents segmentation methods that attempt to identify important structures in HIs, which may help diagnosis. Section 4 presents the feature extraction methods that have been used to represent HIs for further classification. Section 5 presents the shallow methods that have been used for classifying the main types of tissues and tumors in HIs. Given the importance and the growing interest in DL methods, Section 6 is devoted to presenting the recent approaches for HI analysis that employ such methods. Section 7 brings together other review and survey papers that have been published recently, as well as a compilation of several HI datasets that have been used in the last decade. Finally, the last section presents conclusions and perspectives for future work.

Taxonomy and Overview
Based on the three research questions presented previously, we have created a search query 1, which was slightly adapted to each search engine. We searched for references covering the period between 2008 and 2020 in five research portals (engines): IEEE Xplore, ACM Digital Library, Science Direct, Web of Science and Scopus. Table 1 presents the number of results obtained with the search query. We searched the title, abstract and keywords in all search engines except Science Direct, where we also added a full-text search because the number of relevant works was very low. 1 ((histology AND image) or (histopathology AND image) or (eosin AND hematoxylin)) and (("machine learning") or ("artificial intelligence") or ("image processing")) Based on these results, the first exclusion criterion was based on the title and abstract. Most of the exclusions in this step were papers that mentioned "image processing", but used the term to refer to digitizing HIs for visual analysis by pathologists. Another criterion required the presence of the terms eosin and hematoxylin or histopathology, excluding medical images that are not the focus of this review, such as CT, MRI or radiology images. Finally, after eliminating duplicated articles, we ended up with 353 articles. The second exclusion criterion was based on full-text reading to evaluate the adherence of each paper's content to the goal of this review, which excluded almost 50% of the papers retained by the first filtering, leaving 178 articles. Besides the papers selected using the search query, we have also included several papers related to the HI datasets used in many of the cited references, as well as some other references that discuss specific ML methods and techniques referred to in many of the selected references.
This review focuses on ML methods for HI analysis. Therefore, we have categorized the ML methods according to the most common ML tasks, as shown in Figure 2. The top-level categories are: segmentation, feature extraction, shallow methods and deep methods. Although DL approaches can be employed for both segmentation and classification, we propose this division to highlight how the recent advances in DL have impacted research on HI analysis, causing a paradigm shift from traditional ML methods to DL methods.
Segmentation of HIs was a very popular category during the first years covered by this review. Most works were based on image processing techniques such as filtering, thresholding, and contour detection, while others relied on ML methods such as classification and unsupervised learning at the pixel level. Moreover, since annotation for segmentation is a very time-consuming task, it is common to find unsupervised methods alongside supervised ones. Most of the early works used segmentation to highlight information in HIs for specialists. Feature extraction aims at finding discriminative characteristics in HIs and aggregating them into a feature vector used to train ML algorithms. Most shallow classifiers and ensemble methods use such a vector representation to learn linear or non-linear decision boundaries. We divided the category of shallow methods into two subcategories: monolithic classifiers and ensemble methods. Ensemble methods combine several diverse base models to reduce bias and/or variance in predictions and to improve their accuracy. Works in both subcategories require a previous feature extraction step.
Finally, the category of deep methods contains works focused on supervised and unsupervised learning of different architectures of deep neural networks. Most of the works within this category are end-to-end learning approaches, which integrate representation learning and decision-making.
The number of publications related to the field of this research is presented in Figure 3, which shows that research on the topic has been increasing in recent years. The search was carried out in January 2020, and even though it was performed at the beginning of the year, some publications from that year were already found. It is also possible to note a great increase in the use of DL methods, while ensembles and feature extraction kept steady rates. Table 2 shows the number of publications per journal between 2008 and 2020 on the subject of this review, and Table 3 shows the top 15 journals by number of publications.

Segmentation Methods for HIs
Typically, pathologists look for tissue regions relevant to the disease being diagnosed. HI segmentation usually aims to label regions of pixels according to the structure they may represent. For instance, the identification of nuclei can be used to extract morphological features, such as the number of nuclei per region, their size, and their shape, which may be very helpful for diagnosing a tumor. In this section we present several approaches for segmenting HIs, most of which are based on either supervised or unsupervised ML methods. The former require HI datasets with region annotations, while the latter do not require any type of annotation.

Unsupervised Approaches
The k-means algorithm is an unsupervised ML method for clustering that has been used for the segmentation of pixel regions; in the context of this review, it is the core of fourteen segmentation methods, as shown in Table 4. Fatakdawala et al. [5] proposed a methodology based on an expectation-maximization-driven geodesic active contour for detecting lymphocyte nuclei, which is able to identify four structures: lymphocyte nuclei, stroma, cancer nuclei, and background. The process starts with a segmentation by k-means, which clusters pixels of similar intensities, and the resulting clusters are then refined with an expectation-maximization algorithm. Contours are identified based on magnetic interaction theory; once they have been defined, an algorithm searches for concavities in the contours, which indicate overlapping nuclei. The experiments were conducted on a breast cancer dataset. A multi-scale segmentation with k-means is the subject of the study by Roullier et al. [6]. This work mimics the way a pathologist analyzes a whole slide image (WSI): the segmentation starts at a lower magnification factor and finishes at a higher magnification, where it is easier to identify mitotic cells. The clustering identifies regions of interest at each magnification. Rahmadwati et al. [7] employed k-means to help classify HIs; although the focus is not on k-means per se but on Gabor filters, the clustering is essential in the segmentation process. Peng et al. [8] used k-means and principal component analysis (PCA) to split HIs into four types of structures: glandular lumen, stroma, epithelial-cell cytoplasm, and cell nuclei, followed by morphological closing and filling operations. He et al. [9] used a mixture of local region-scalable fitting and k-means to segment cervix HIs. Fatima et al. 
[10] used k-means for segmentation, followed by skeletonization and shock graphs to identify nuclei in the previously segmented image. If the shock graph provides a confidence value smaller than 0.5 for nucleus identification, a second identification attempt is made using a multilayer perceptron (MLP). This hybrid approach achieves 92.5% of accuracy in nucleus identification.
Mazo et al. [11] also used k-means to segment cardiac images into three classes: connective tissue, light areas, and epithelial tissue. A flooding algorithm processes the light areas in order to merge its result with the epithelial regions and improve the final result. Finally, a plurality rule was used to classify cells as flat, cubic, or cylindrical. This method achieved a sensitivity of 85%, and the work was later extended in Mazo et al. [12]. Tosun et al. [13] proposed a segmentation based on k-means that clusters all pixels into three categories (purple, pink, white), which are further divided into three subcategories each. The object-level segmentation based on clustering achieved 94.89% of accuracy against 86.78% for pixel-level segmentation. Nativ et al. [14] presented a k-means clustering based on morphological features of lipid droplets previously segmented using active contour models. A decision tree (DT) was used to verify the rules that lead to the classes obtained by the clustering; the correlation with pathologists' evaluations reached 97%. A two-step k-means is used by Shi et al. [15] to segment follicular lymphoma HIs. The first step separates nuclei from all other tissue; the second step splits the "other tissue" area into three classes (nuclei, cytoplasm and extracellular spaces). A final watershed step extracts better nucleus contours. The difference between manual and automated segmentation was around 1%. Brieu et al. [16] presented a segmentation approach based on k-means whose result is refined and simplified using a sequence of thresholds that attempt to preserve the shape of objects; the key point of the method is not the segmentation itself, but nucleus detection. Shi et al. [17] used k-means to cluster pixels represented in the L*a*b color space using pixel neighborhood statistics. A thresholding step improves the contour detection of fat droplets, and human specialists analyze morphological information about the droplets to reach a diagnosis. Shi et al. [18] proposed a segmentation method that considers the local correlation of each pixel. A first clustering performed by k-means yields a poorly segmented cytoplasm, so a second clustering that excludes the nuclei identified by the first one is performed. Finally, a watershed transform completes the segmentation.
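The pixel-level clustering at the heart of these works can be illustrated with a minimal sketch; the random array below is only a stand-in for an H&E patch, and the cluster count and RGB features are illustrative rather than taken from any specific paper.

```python
# K-means pixel clustering for HI segmentation: each pixel becomes a
# 3-dimensional RGB sample, and clusters play the role of tissue classes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # placeholder for a real H&E patch

pixels = image.reshape(-1, 3)            # one sample per pixel
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)

# Each pixel is assigned to one of four clusters (e.g. nuclei, stroma,
# cytoplasm, background in the works discussed above).
label_map = km.labels_.reshape(image.shape[:2])
print(label_map.shape, len(np.unique(label_map)))
```

Real pipelines typically cluster in a perceptual color space (e.g. L*a*b, as in Shi et al. [17]) and post-process the label map with morphological operations or a watershed transform.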
Other clustering algorithms have also been used to segment HIs. Liu et al. [19] used the iterative self-organizing data analysis technique (ISODATA) to cluster cell images and create prototypes. Hafiane et al. [20] studied two strategies for initializing clustering methods: geodesic active contours and multi-phase vector level sets. The latter proved more efficient when using spatial-constraint fuzzy c-means, with accuracy values of 68.1% and 67.9%, respectively, while k-means achieved 60.6%. He et al. [21] presented a segmentation based on Gaussian mixture models. Their methodology exploits the stain color features (hematoxylin in blue, eosin in pink and red) to apply two segmentation steps, first in the red channel and then in the other channels. No ground-truth comparison is presented, only visual results compared to k-means. The work in [22] presented a quasi-supervised approach based on nearest neighbors to cluster an unlabeled dataset using both the dataset itself and another labeled dataset. A comparison between the quasi-supervised approach and a support vector machine (SVM) showed that the SVM performs better, but requires labeled data. Yang et al. [23] proposed a system for content retrieval based on a three-step method that uses histogram features. The first two steps use histogram dissimilarity measures to find candidate images; the last step uses mean-shift clustering. The area under the curve (AUC) of the proposed method is 0.87, better than the 0.84 achieved by a method based on local binary pattern (LBP) features. A mitotic cell detection system using a dictionary of cells is presented by Sirinukunwattana et al. [24]. A shrinkage/thresholding method groups intensity features represented by sparse coding to create a dictionary. This method achieved 80.5% and 77.9% of F-score on the Aperio and Hamamatsu subsets of the MITOS dataset, respectively.
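Several of the works above exploit the H&E stain colors directly. The following hedged sketch shows stain separation via scikit-image's color deconvolution; the random image is a placeholder for a real slide, and the mean-based threshold is illustrative only.

```python
# Color deconvolution: unmix an RGB image into Hematoxylin, Eosin and DAB
# channels. Thresholding the hematoxylin channel is a common first step
# for nucleus detection, since nuclei take up hematoxylin (blue).
import numpy as np
from skimage.color import rgb2hed

rng = np.random.default_rng(0)
rgb = rng.random((16, 16, 3))     # placeholder for a stained tile

hed = rgb2hed(rgb)                # shape (16, 16, 3): H, E and DAB channels
hematoxylin = hed[..., 0]         # nuclei are typically strongest here
nuclei_mask = hematoxylin > hematoxylin.mean()   # illustrative threshold
print(hed.shape, nuclei_mask.dtype)
```

On real slides the separated hematoxylin channel is far more robust to staining variability than a plain gray-level conversion.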
Huang [25] proposed a semi-supervised method based on exclusive component analysis (XCA) that uses stain separation to improve performance. This method requires minimal user interaction: the user must provide a set of reference points for the nuclei and for the cytoplasm. Finally, it is worth mentioning that unsupervised methods based on DL approaches have also been proposed for segmenting HIs; we present some recent works in Section 6. Table 4. Summary of publications on unsupervised ML methods for HI segmentation.

Supervised Approaches
In this section, we present works on HI segmentation based on supervised ML approaches. Most of them rely on classification algorithms and therefore require labeled datasets in which pixels or pixel regions are annotated. Table 5 summarizes the recent publications on supervised ML methods used for segmentation, where eight out of fourteen works are based on SVM classifiers.
Yu and Ip [26] presented an approach to encode HIs using a patching procedure and a method called spatial hidden Markov model (SHMM). Each patch is represented by a feature vector mixing Gabor energy and gray-level features. The SHMM showed improvements from 4% to 17% on multiple tissues in comparison with a hidden Markov model. The work of Arteta et al. [27] uses the concept of extremal regions on gray-scale images to identify nuclei in HIs. To select the threshold of the extremal regions, which are organized in an overlap tree, they used an SVM classifier. This approach achieved 88.5% of F1-score against 69.8% for the state-of-the-art, considering the number of cells found after segmentation. Janssens et al. [28] presented a segmentation procedure to identify muscular cells. First, a thresholding-based segmentation identifies connective tissue and cells. Then, an SVM receives the segmented regions and classifies them recursively into three classes (connective tissue, clumps of cells, and cells) until only connective and cell tissues remain. This approach achieved an F-score of 62%, the state-of-the-art at that time. Saraswat and Arya [29] proposed a segmentation procedure with a non-dominated sorting genetic algorithm (NSGA-II) and a threshold classifier. NSGA-II learns thresholds for feature values from ground-truth images, and the comparison between the learned thresholds and feature values produces the segmentation. Breast cancer prognosis is the subject of the study by Qu et al. [30]. They used an SVM to perform pixel-wise classification separating nuclei from stroma; a second step based on a watershed algorithm identifies the nuclei. The approach achieved 72% of accuracy using pixel-level, object-level, and semantic-level features. Salman et al. [31] proposed a segmentation method based on k-NN to analyze WSIs. 
The method computes histograms from patches of 64×64 pixels extracted from the H&E channels obtained by color deconvolution. The best accuracy was 73.2%, using histograms of both H&E channels. Chen et al. [32] proposed a method based on pixel-wise SVM to identify stroma and tumor nests. Nuclei segmentation is carried out by a watershed algorithm, which yields 314 object-level features and 16 semantic-level features; the feature dimensionality was then reduced by analyzing feature importance. Geessink et al. [33] used a normal-density-based quadratic discriminant classifier (QDA) to segment colorectal images. The segmentation uses the L*a*b color space with a threshold to eliminate background pixels, and the HSV color space to classify the remaining pixels. After classification, errors are corrected based on histological constraints. The algorithm produced an error rate of 0.6% for tumor quantification, which, according to the authors, is lower than the pathologists' error (4.4%). Zarella et al. [34] trained an SVM to distinguish stained from unstained pixels. To this end, they manually selected positively and negatively stained pixels from a set of representative images in HSV color space. The SVM identifies regions of interest for further analysis. Santamaria-Pang et al. [35] proposed an algorithm to enhance general segmentation methods using a cell-shape ranking function: the shapes of the cells detected by a watershed transform are used to train an SVM that discriminates real cells from false positives. Wang et al. [36] proposed the use of wavelet decomposition, region growing, a double-strategy splitting model and curvature scale space to highlight nucleus regions for further classification. Textural and shape features are extracted from the nuclei, and feature selection is carried out using genetic algorithms and SVM. The best results were 91.5% sensitivity and 91.6% specificity. Arteta et al. 
[37] improved the post-processing step of the method proposed by Arteta et al. [27]. Nucleus regions are refined using a surface, and the optimal area of two adjacent nucleus regions is defined by a smoothness factor. The improvement provided 91% of F1-score on the same dataset. A nuclei segmentation method based on an adaptive neighborhood provided by a regression tree was proposed by Brieu and Schmidt [38]; a comparison showed an improvement of 9% over a nuclei segmentation without adaptive thresholding. Finally, Song et al. [39] presented nuclear segmentation as a cascade of two-class classification problems. An effective learning formulation was proposed by adapting sparse convolutional models across the different layers to estimate latent morphology information. To improve the region probabilities, low-level appearance features from the original images and high-level contextual features from the estimated probability maps are integrated into a new sequence of probabilistic binary DTs. The outcome is a reliable contour set for each nucleus and complete final contour inference. Experimental results over 26,500 nuclei from the Farsight, KIRC, and Kumar datasets showed that the proposed method achieved better performance than other automated segmentation approaches. Again, it is worth mentioning that supervised methods based on DL approaches have also been proposed for segmenting HIs; we present such recent works in Section 6. Table 5. Summary of publications on supervised ML methods for HI segmentation.
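Several of the pipelines in this section follow the pixel classification step with a watershed transform to split touching nuclei. A minimal, self-contained sketch of that post-processing step follows; the binary mask is synthetic (two overlapping disks standing in for touching nuclei), and the marker-selection parameters are illustrative.

```python
# Marker-controlled watershed for separating touching nuclei: peaks of the
# distance transform act as one marker per nucleus.
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

yy, xx = np.mgrid[0:80, 0:80]
mask = ((yy - 40) ** 2 + (xx - 30) ** 2 < 15 ** 2) | \
       ((yy - 40) ** 2 + (xx - 52) ** 2 < 15 ** 2)   # two touching "nuclei"

distance = ndi.distance_transform_edt(mask)
coords = peak_local_max(distance, min_distance=10, labels=mask)
markers = np.zeros_like(mask, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

# Flooding the negated distance map from the markers splits the clump.
labels = watershed(-distance, markers, mask=mask)
print(labels.max())  # number of separated nuclei
```

In a real pipeline the mask would come from the pixel-wise classifier (e.g. the SVM of Qu et al. [30]) rather than from synthetic geometry.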

Feature Extraction for HIs
Supervised shallow methods depend on feature extraction from raw data before performing classification. HI problems require a transformation of the image pixels into meaningful features prior to classification. Feature extraction methods process images and provide a reasonable number of features summarizing the information contained in the image. In fact, feature extraction aims not only to reduce the dimensionality of the input, but also to highlight relevant information related to the problem (presence/absence or amount of a certain element, texture, shape, histogram, etc.), providing a representation independent of translation, scale, and rotation. Several different types of features have been used with HIs, such as shape, size, texture, and fractal features, or even combinations of these. Table 6 summarizes the articles related to feature extraction.
Object-level and morphometric features like shape and size are particularly important for disease grading and diagnosis. Ballarò et al. [40] proposed the segmentation of HIs to identify unhealthy or healthy megakaryocytes, structures from which morphometric features are extracted. Petushi et al. [41] employed the Otsu algorithm to highlight nuclei and then extracted different features such as inside radial contact, inside line contact, area, perimeter, area-perimeter ratio, curvature, aspect ratio, and major axis alignment. Feature vectors are built by the concatenation of histograms of all these features. Madabhushi et al. [42] presented an approach for predicting disease outcome from multiple modalities including MRI, digital pathology, and protein expression. For histopathology images, they used graph-based features such as Voronoi diagram (total area of all polygons, polygon area, polygon perimeter, polygon chord length), Delaunay triangulation (triangle side length, triangle area), minimum spanning tree (edge length), and nuclear statistics (density of nuclei, distance to nearest nuclei in different pixel radius) to represent the spatial arrangement of nuclei. Song et al.
[43] applied thresholding and watershed transform to extract features like cystic cytoplasm length, cystic mucin production, and cystic cell density. These three features are used to train different classifiers. The experimental results showed that these three features outperformed morphological features (shape and size), achieving 90% of accuracy against 64%. However, combining them with the morphological features achieved only 85% of accuracy. The system described by Gorelick et al.
[44] for prostate cancer detection and classification uses a segmentation step to identify superpixels, where the segmented images are represented by morphometric and geometric features. The framework for cytological analysis and breast cancer diagnosis presented by Filipczuk et al. [45] employed morphometric features. After isolating the nuclei in the images, they calculated for each nucleus the area, perimeter, eccentricity, major and minor axis length, luminance mean and variance, and the distance to the centroid of all nuclei. Ozolek et al.
[46] performed the classification of follicular lesions in thyroid tissue. After a preprocessing step for nucleus segmentation, the chromatin texture of the nuclei, characterized with linear optimal transport, provides features for the final classification. Fukuma et al.
[47] compared spatial-level and object-level descriptors like Voronoi tessellation, Delaunay triangulation, minimum spanning tree, and elliptical, convex hull, bounding box and boundary features. Object-level features reached 99.07% of accuracy in the best case against 82.88% achieved by the spatial-level ones. Morphometric features can also be obtained from other structures like glands, which are easier to identify due to the difference between the lumen and other cellular structures. This is the subject of the work presented by Loeffler et al.
[48], which uses inverse compactness and inverse solidness as measures of gland alteration in prostate cancer. The features were obtained from the area (object and convex hull area) and perimeter of threshold-highlighted objects. Marugame et al.
[49] used morphometric features extracted from image objects indicating nuclear aggregations to represent three categories of ductal carcinomas in breast HIs. The number of pixels, length, and thickness of the objects reflect their size and shape. Osborne et al.
[50] employed four geometrical features extracted from nuclei after segmentation for melanoma diagnosis in skin HIs. The four features are the ratio of the nucleus area to the cytoplasm area, the ratio of a nucleus's perimeter to its area, the ratio of a nucleus's major axis length to its minor axis length, and the ratio of the number of nuclei to the cytoplasm area. The multi-view approach to detect prostate cancer presented by Kwak and Hewitt [51] extracted morphological and intensity features at multiple resolutions. Features like area, compactness, smoothness, roundness, convex hull ratio, major-minor axis ratio, extent, bounding circle ratio, distortion, and shape context are extracted from lumens and epithelial nuclei, along with relational features between them. Olgun et al.
[52] introduced a feature extractor for HIs based on the local distribution of objects segmented by color intensity. The feature extractor measures the distance between an object and its neighborhood. The proposed method outperformed 13 other methods that use textural and structural features.
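The object-level pipeline recurring throughout this subsection (threshold, label connected components, then measure per-object morphometry) can be sketched as follows; the synthetic blobs and the particular feature set are illustrative only.

```python
# Morphometric feature extraction: Otsu threshold, connected-component
# labeling, then per-object measurements (area, perimeter, eccentricity).
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

img = np.zeros((60, 60))
img[10:25, 10:25] = 1.0          # synthetic "nucleus" 1
img[35:50, 35:55] = 1.0          # synthetic "nucleus" 2

mask = img > threshold_otsu(img)
regions = regionprops(label(mask))

# One feature tuple per object; real works concatenate many more measures.
features = [(r.area, r.perimeter, r.eccentricity) for r in regions]
print(len(features))
```

Spatial-arrangement features such as those of Madabhushi et al. [42] would then build Voronoi or Delaunay graphs on the object centroids instead of (or in addition to) these per-object measures.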
Texture descriptors have become quite popular in HI analysis due to the different types of textures found in HIs. For instance, high/low concentrations of nuclei and stroma present quite different patterns of texture. For this reason, several researchers have been investigating a large spectrum of textural descriptors for HI classification. Descriptors based on GLCM have been used by several authors to represent textures in HIs. Kuse et al.
[53] used GLCM as features with a pre-segmentation process based on unsupervised mean-shift clustering. Such a method reduces color variety to segment the image using thresholds. After this process, nuclei are identified and overlaps are removed using contour and area restrictions. Finally, GLCM features are extracted from the segmented image and used for classification. Caicedo et al.
[54] combined seven feature extraction methods, including GLCM, and created a kernel-based representation of the data for each feature type. Kernels are used inside an SVM to find similarity between data and to implement a content retrieval mechanism. Fernández-Carrobles et al.
[55] presented a feature extraction method based on frequency and spatial textons. The use of textons implies that images are represented by a reduced vocabulary of textures. Features used for the classification are histograms of textons and GLCM features extracted from texton maps. They also evaluated the impact of different colormaps on these features. The best classification results (98.1%) were achieved by combining six color models and GLCM for textons. Since GLCM requires a gray-level image, and the conversion of an H&E color image to gray-level is affected by the variability of the staining color, GLCM features inherit this staining variability.
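As a concrete illustration of what the GLCM-based features in these works measure, the pure-Python sketch below (not any cited implementation) counts co-occurrences of gray levels for one pixel offset and derives the Haralick contrast from the normalized matrix:

```python
def glcm(image, dx=1, dy=0, levels=4):
    """Normalized gray-level co-occurrence matrix for one offset (dx, dy)."""
    h, w = len(image), len(image[0])
    m = [[0] * levels for _ in range(levels)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                m[image[y][x]][image[ny][nx]] += 1
    total = sum(sum(row) for row in m)
    return [[c / total for c in row] for row in m]

def contrast(p):
    """Haralick contrast: sum of p(i, j) * (i - j)^2."""
    return sum(p[i][j] * (i - j) ** 2
               for i in range(len(p)) for j in range(len(p)))

# A toy 4-level image with smooth 2x2 blocks: low contrast is expected
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [2, 2, 3, 3],
       [2, 2, 3, 3]]
p = glcm(img)
```

Real systems aggregate several Haralick statistics (contrast, homogeneity, entropy, correlation) over multiple offsets and angles into one feature vector.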
Another descriptor that is very often used to represent texture is the Local Binary Pattern (LBP). Mazo et al. [12] proposed the classification of cardiac tissues into five categories using a patching approach that aims to optimize the patch size to improve the representation. The texture of HIs was described using Local Binary Patterns (LBP), LBP Rotation Invariant (LBPri), Haralick features and different concatenations between them. Haralick features include contrast, angular second moment, homogeneity, correlation, entropy, and first and second correlation measures. Peyret et al.
[56] applied LBP in the context of multispectral HIs. They used an SVM to evaluate the proposed LBP, which aligns all spectra and uses pixels from all other bands. It also uses a multi-scale kernel size. This feature extractor reached 99% of accuracy compared to 88.3% achieved by the standard LBP and 95.8% reached by the concatenated spectra LBP. Bruno et al.
[57] used a curvelet transform to handle multiscale HIs. The LBP algorithm extracts features from curvelet coefficients, which are reduced by an ANOVA analysis. The algorithm proposed by Phoulady et al. [58] uses adaptive and iterative thresholding to find nuclei areas and extracts texture information using LBP and histograms of oriented gradients (HOG). The proposed method achieved 93.3% of accuracy against 92.3% of the second-best method. The work presented by Reis et al.
[59] focused on the stroma maturity to evaluate breast cancer. The features for the stroma are Basic Image Features (BIF), obtained by convolving images with a bank of derivatives-of-Gaussian filters, and LBP with multiple scales for the neighborhood. Gertych et al. [60] presented a system for prostate cancer classification, which also uses LBP as features. The best accuracy was 68.4% for cancer detection. Balazsi et al. [61] presented an invasive ductal breast carcinoma detector that extracts patches by tessellation without the square shape constraint. A set of 16,128 features derived from multiple histograms and LBP (multiple radii) using L*a*b, gray-scale and RGB color spaces is used to represent each patch. Atupelage et al. [62] extracted features using fractal geometry analysis and compared them with Gabor filter bank, Leung-Malik filter bank, LBP and GLCM features. The proposed approach outperformed the other methods, achieving 95% of accuracy.
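The basic LBP operator underlying these variants is compact enough to sketch: each pixel is encoded by thresholding its neighbors against its own intensity, and a histogram of the resulting codes is the texture descriptor. The sketch below (illustrative, single 3x3 neighborhood, not the multi-scale or multispectral versions cited above) computes one such code:

```python
def lbp_code(patch):
    """Basic 8-neighbor LBP code for the center pixel of a 3x3 patch.
    Neighbors are visited clockwise from the top-left corner."""
    c = patch[1][1]
    neighbors = [patch[0][0], patch[0][1], patch[0][2],
                 patch[1][2], patch[2][2], patch[2][1],
                 patch[2][0], patch[1][0]]
    code = 0
    for bit, n in enumerate(neighbors):
        if n >= c:  # neighbor at least as bright as the center sets a bit
            code |= 1 << bit
    return code  # 0..255; histograms of these codes form the descriptor

# Bright top row over a dark background yields a code with the low bits set
code = lbp_code([[9, 9, 9],
                 [1, 5, 1],
                 [1, 1, 1]])
```

Multi-scale variants change the radius and number of sampled neighbors; rotation-invariant variants map codes that are bit-rotations of each other to one bin.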
Huang et al. [63] proposed a two-step feature extraction approach composed of a receptive field for detecting regions of interest and a sparse coding. The sparse coding groups features from patches of the same region. The mean and covariance matrix of receptive fields and sparse coding form the final features. Noroozi and Zakerolhosseini [64] proposed an automated method for discriminating basal cell carcinoma tumor from squamous cell carcinoma tumor in skin HIs using Z-transform features, which are obtained from the combination of Fourier transform features. Wan et al. [65] used a dual-tree complex wavelet transform (DT-CWT) to represent the images in the context of mitosis detection in breast cancer HIs. Generalized Gaussian distribution and symmetric alpha-stable distribution parameters were used as features. Chan and Tuszynski [66] also used fractal dimension features for breast cancer detection. These features perform well at an HI magnification of 40× to distinguish between malignant and benign tumors.
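Fractal dimension features such as those in [62] and [66] are typically estimated by box counting: cover the binary structure with boxes of decreasing size and fit the slope of log(count) versus log(1/size). A minimal pure-Python sketch (illustrative only, on a tiny binary mask):

```python
import math

def box_count(mask, box):
    """Number of box x box cells containing at least one foreground pixel."""
    n = len(mask)
    count = 0
    for y in range(0, n, box):
        for x in range(0, n, box):
            if any(mask[yy][xx]
                   for yy in range(y, min(y + box, n))
                   for xx in range(x, min(x + box, n))):
                count += 1
    return count

def fractal_dimension(mask, boxes=(1, 2, 4)):
    """Least-squares slope of log(count) versus log(1/box size)."""
    xs = [math.log(1.0 / b) for b in boxes]
    ys = [math.log(box_count(mask, b)) for b in boxes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# A filled 8x8 square is a plain 2-D object, so its dimension is 2
solid = [[1] * 8 for _ in range(8)]
```

Irregular tumor boundaries yield non-integer dimensions between 1 and 2, which is what makes the estimate discriminative.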
Recently, deep features have become very popular in several image classification tasks, including HIs. Niazi et al. [67] presented a CAD system for bladder cancer that focuses on the extraction of epithelium features with segmentation using an automatic color deconvolution matrix construction. Spanhol et al. [68] used deep features from a pre-trained AlexNet to classify breast benign and malignant tumors. Vo et al. [69] presented a method for feature extraction based on the combination of CNNs and boosting tree classifiers (BTC). This method utilizes an ensemble of inception CNNs to extract visual features from multi-scale images. In the first stage, data augmentation methods were employed. Afterwards, ensembles of CNNs were trained to extract multi-context information from multi-scale images. The latter stage extracted both global and local features of breast cancer tumors. George et al. [70] proposed an approach for breast cancer diagnosis, which extracts features from nuclei based on CNNs. The methodology consists of different approaches for extracting nucleus features from HIs and selecting the most discriminative spatially sparse nucleus patches. A pre-trained set of CNNs was used to extract features from such patches. Subsequently, features belonging to individual images are fused using 3-norm pooling to obtain image-level features.
Finally, several works use or combine different feature categories in an attempt to capture information from both textures and geometrical structures found in HIs. Leo et al. [71] presented a method for quantifying instability of features across four prostate cancer datasets with known variations due to staining, preparation, and scanning platforms. They evaluated five families of features: graph-based features, which include first- and second-order descriptors of Voronoi diagrams, Delaunay triangulations, minimum spanning trees, and gland density; gland shape features, which measure the average shape of all the glands in an image, and include the lumen boundaries and the resulting area, perimeter, distance, smoothness, and Fourier descriptors; co-occurring gland tensor features, which capture the disorder of neighborhoods of glands as measured by the entropy of orientation of the major axes of glands within a local neighborhood; subgraph features, which describe the connectivity and clustering of small gland neighborhoods using gland centroids; and Haralick texture features. Yu et al. [72] investigated the best features for characterizing lung cancer. The authors extracted objective quantitative image features such as Haralick texture features of the nuclei (sum entropy, InfoMeas, difference variance, angular second moment), edge intensity of the nuclei, texture features of the cytoplasm and intensity distribution of the cytoplasm, Zernike shape, texture and radial distribution of intensity. Caicedo et al. [73] proposed a low-level to high-level mapping to facilitate image retrieval. This mapping process consists of gray and color histograms, LBP, Tamura texture histogram, Sobel histogram, and invariant feature histograms. Pang et al. [74] proposed a CAD system for lung cancer detection, which uses textural features such as LBP, GLCM, and Tamura, shape features such as SIFT, global features and morphological features. Kruk et al. 
[75] used morphometric, textural, and statistical (histogram) features to describe nuclei for clear-cell renal carcinoma grading. A genetic algorithm and Fisher discriminant were used to select the most important features. Basavanhally et al. [76] proposed a multi field-of-view (FOV) classification scheme to recognize low versus high-grade ductal carcinoma from breast HIs. It uses a multiple patch size procedure for WSI to analyze whether morphological, textural or graph-based features are the most relevant to each patch size. Tashk et al. [77] presented a complete framework for breast HI classification that estimates mitotic pixels in the L*a*b color space. A combination of LBP, morphometric, and statistical features is extracted from mitotic candidates. Cruz-Roa et al. [78] proposed a patching method on HI slides to create small regions and extract scale-invariant feature transform (SIFT), luminance level, and discrete cosine transform features to create a bag-of-words. Semantic features are high-level information that can be associated with HIs to aid their classification.
Orlov et al. [79] compared four color spaces (RGB, L*a*b, gray-scale and RGB) with H&E representation and eleven features such as Zernike, Chebyshev, Chebyshev-Fourier, color histograms, GLCM, Tamura, Gabor, Haralick, edge statistics and others to represent lymph node HIs. De et al. [80] proposed a fusion of several feature types for uterine cervical cancer HI classification. They used a 62-dimensional feature vector based on GLCM, Delaunay triangulation and weighted density distribution. Vanderbeck et al. [81] used morphological, textural and pixel neighboring statistics features to represent seven categories of white regions of liver HIs. Kandemir et al. [82] proposed an MIL approach to detect Barrett's cancer in HIs. They used cell-level morphometric features such as central power sums, area, radius, perimeter, and roundness of segments, maximum, mean, and minimum intensity, and intensity covariance, variance, skewness, and kurtosis within regions, and patch-level features such as LBP, SIFT and color histograms from images segmented using the watershed algorithm. Coatelen et al. [83][84] proposed a feature selection method for liver HI classification based on morphometric features such as area, compactness, perimeter, aspect ratio, Zernike moments, etc., textural features such as GLCM, LBP, fractal dimension, Fourier distance, etc., and structural or graph-based features such as number of nodes/edges, modularity, pi, eta, theta, beta, alpha, gamma and Shimbel indexes, etc. Two greedy algorithms (fselector and in-house recursive) selected features in a pool of 200 features where the fitness function was implemented by an SVM classifier. Michail et al. [85] highlighted nuclei using connected-component labeling to classify centroblast and non-centroblast cells. Morphometric, textural and color features are used as features. Das et al. [86] proposed the so-called geometric- and texture-aware features, which are based on Hu moments and fractal dimension, respectively. 
Such a set of features was applied to detect geometrical and textural changes in nuclei to discriminate mitotic and non-mitotic cells. The method proposed by Kong et al. [87] classifies neuroblastomas using textural and morphological features. It considers that pathologists use morphological features for their analysis and that textural features can be easily extracted. They also use GLCM features and sequential floating forward selection to select features.
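Graph-based features such as the Voronoi, Delaunay, and minimum spanning tree descriptors mentioned above ultimately reduce to statistics over nucleus centroids. As an illustration only (not any cited implementation), the sketch below builds a minimum spanning tree over a few centroids with Prim's algorithm and summarizes its edge lengths:

```python
import math

def mst_edge_lengths(points):
    """Prim's algorithm over centroids; returns the MST edge lengths."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        # Cheapest edge crossing from the tree to the remaining points
        best = min((dist(points[i], points[j]), j)
                   for i in in_tree for j in range(len(points))
                   if j not in in_tree)
        edges.append(best[0])
        in_tree.add(best[1])
    return edges

# Three collinear nucleus centroids spaced 1.0 apart
lengths = mst_edge_lengths([(0, 0), (1, 0), (2, 0)])
features = {"mean_edge": sum(lengths) / len(lengths),
            "max_edge": max(lengths)}
```

Statistics such as the mean, variance, and maximum of MST edge lengths characterize how densely and regularly nuclei are arranged, which is exactly what these graph features encode.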
In summary, given the rich geometric structures and complex textures found in HIs, most works combine different types of features. Morphometric features are important to characterize geometric structures, but they are more complex to obtain since they require elaborate pre-processing, e.g. finding the contours of nuclei to count them. On the other hand, textural features such as LBP and GLCM usually do not require a previous segmentation of HIs. Finally, the most recent feature extraction methods are focused on deep features. These can be interpreted as a sequence of filters that can detect both geometric structures and textures. Therefore, deep features and deep methods seem to be very promising for HI analysis.
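The view of deep features as a sequence of learned filters can be made concrete: a single convolution followed by ReLU already responds to an oriented edge, and deeper layers stack such operations on each other's outputs. A pure-Python sketch with a fixed (not learned) vertical-edge kernel, for illustration only:

```python
def conv2d_valid(img, kernel):
    """'Valid' 2-D cross-correlation followed by ReLU: the basic
    building block that deep feature hierarchies stack in sequence."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(img) - kh + 1):
        row = []
        for x in range(len(img[0]) - kw + 1):
            s = sum(img[y + i][x + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(max(0, s))  # ReLU keeps positive responses
        out.append(row)
    return out

# A fixed vertical-edge kernel responds where intensity rises left to right
edge = [[-1, 0, 1]] * 3
img = [[0, 0, 5, 5],
       [0, 0, 5, 5],
       [0, 0, 5, 5]]
out = conv2d_valid(img, edge)
```

In a CNN the kernel weights are learned rather than fixed, so early layers discover edge- and blob-like detectors while later layers compose them into texture and shape detectors.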

Shallow Methods for HI Classification
ML algorithms trained in a supervised fashion can accomplish different HI analyses such as identifying types of tumors and tissues, nucleus features (e.g. mitosis phases) or specific characteristics in some organs (e.g. fat inside the liver or the size of epithelial tissue in the cervix). We present in this section the ML methods based on shallow classifiers. We start by presenting some works that employ single (monolithic) classifiers, followed by classification methods based on ensembles of classifiers. Both monolithic and ensemble methods depend on a previous stage of feature extraction because they rely on handcrafted feature vectors to learn discriminant functions. Therefore, most of the feature extraction methods presented in Section 4 can be used together with the methods presented in this section.

Monolithic Classifiers
Different ML methods for supervised learning have been employed in HI analysis, such as support vector machines (SVM), decision trees (DT), naïve Bayes (NB), k-nearest neighbors (k-NN), multilayer perceptron (MLP), among others. Table 7 summarizes the works reviewed in this section in terms of classification algorithm, tissue or organ from where the HI was obtained and the publication year.
SVMs are the most used classification algorithm for HIs. Several works have employed SVM with different feature categories. Mazo et al. [12] proposed the classification of cardiac tissues into five categories using a patching approach that aims to optimize the patch size to improve the representation. A cascade of linear SVMs separates tissues into four classes, followed by a polynomial SVM, which classifies one of these four classes into two sub-classes. Osborne et al.
[50] employed segmentation and morphological features with an SVM classifier for melanoma diagnosis in skin HIs.
Table 6. Summary of publications devoted to feature extraction from HIs.

Reference | Year | Tissue/Organ | Features
Caicedo et al. [73] | 2008 | Skin | Color and gray histograms, LBP, Tamura
Ballarò et al. [40] | 2008
[89] presented an automatic orientation detection for the epithelium with more features and used an SVM classifier. Harai and Tanaka [90] presented a colorectal CAD system, which starts with an Otsu thresholding of the red channel to separate nuclei, background, and stroma. An SVM classifier achieves 78.3% of accuracy against 67% of a method based on texture features. Peikari et al. [91] proposed a nucleus segmentation pipeline based on multi-level thresholding and the watershed algorithm on the L*a*b color space. The nucleus classification uses a cascade of SVMs. The cascade initially separates lymphocytes from epithelial tissue and then classifies epithelial tissue as benign or malignant. An interesting comparison with two pathologists' evaluations shows that the agreement between pathologists was 89%, while the agreement between the automated system and each pathologist was 74% and 75%. The classification of ovarian cancer is the subject of study of BenTaieb et al. [92]. The proposed method localizes regions of cancer in WSI using a multi-scale mechanism, considering that each tumor type has specific characteristics which are better detected at different scales. De et al. [80] proposed grading of uterine cervical cancer using an LDA classifier. A specialist manually segmented the images to identify tumors and split them into ten segments for feature extraction. A voting strategy combined results from the segments. The best grading result was 70.5% for the whole epithelium against 62.3% for the vertically segmented epithelium. Mete and Topaloglu [95] evaluated eleven different color spaces for representing HIs. The combination of a spherical coordinate transform and DT achieved the best accuracy, outperforming SVM and NB classifiers. Sidiropoulos et al. [96] proposed a classification algorithm based on a probabilistic neural network (PNN) implemented on GPUs to grade cases of rare brain tumors. 
The advantage is the reduced processing time, which allows an exhaustive feature combination search. For demonstration purposes, a comparison of CPU- and GPU-based algorithms showed that the GPU version takes 278 times less computation time than the CPU version for a feature vector with 20 attributes. The work presented by Michail et al. [97] classifies follicular lymphomas using a preprocessing step to segment images based on intensity thresholds and an expectation maximization algorithm. The segmented cells are classified by LDA, achieving a detection rate of 82.6%. A random kitchen sink (RKS) classifier is used by Beevi et al. [98] to identify mitotic nuclei in breast cancer HIs. Nuclei are identified using thresholding in the red channel. A local active contour model selects and models nuclei. The approach achieved an F-score of 96.0% for RKS and 83.4% for RFs on the MITOS 2014 dataset. A CAD system proposed by Jothi and Rajam [99] used HIs converted to gray-scale giving priority to the red channel. Otsu thresholding guided by particle swarm optimization segments the gray-scale images, and noise segments are reduced using area constraints based on the nuclei size. The closest matching rule (CMR) classifier achieved an accuracy of 100% against 99.5% for NB. Awan et al. [100] studied the classification of mitosis in breast cancer using a dataset labeled with the four major phases of mitosis. Classes are imbalanced, posing a challenge for the classification. They proposed a data augmentation method based on PCA and its eigenvectors and compared it to the synthetic minority over-sampling technique. Barker et al. [101] used a patching procedure based on a grid over the WSI to grade brain tumor type. Each patch has general features clustered using k-means. The final classification is performed over the features of the nuclei identified in the clustering step using an elastic net model. 
The proposed model outperformed the methods from the 2014 MICCAI Pathology Classification challenge.
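Otsu thresholding recurs as a segmentation step throughout these works: it picks the gray level that maximizes the between-class variance of the two resulting pixel groups. A minimal sketch over a gray-level histogram (pure Python, illustrative only, not any cited pipeline):

```python
def otsu_threshold(hist):
    """Return the gray level t maximizing between-class variance;
    pixels with level <= t form one class, the rest the other."""
    total = sum(hist)
    grand_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = cum = 0.0
    for t in range(len(hist) - 1):
        w0 += hist[t]          # weight of the low class
        cum += t * hist[t]     # intensity mass of the low class
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = cum / w0, (grand_sum - cum) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# A bimodal histogram over 8 gray levels: two peaks separated by a gap
t = otsu_threshold([10, 20, 10, 0, 0, 10, 20, 10])
```

In the pipelines above the same idea is applied per color channel (e.g. to the red channel to isolate nuclei) before morphological cleanup.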
Multiple instance learning (MIL) has also been used for the classification of HIs. Several works have employed MIL methods with different classifiers and feature categories. MIL is a weakly supervised learning paradigm which considers that instances are naturally grouped in labeled bags, without the need for all the instances of each bag to have individual labels. The MIL method proposed in [82] compares three MIL SVMs (SIL-SVM, MI-SVM and mi-SVM) with mi-Graph. mi-Graph achieved accuracy of 87% against 69% of mi-SVM. The proposed methodology is based on patching. All images are previously segmented using the watershed algorithm. Another work from the same research group carried out a benchmark of MIL SVM methods, finding out that MILBoost gives better accuracy for the instance-level approach (66.7%) and mi-Graph performs better in bag-level prediction (72.5%) [102]. A stain separation is performed in Cosatto et al. [103] using a support vector regressor (SVR) to identify regions of interest (ROIs), i.e. regions with a high occurrence of hematoxylin at low magnification. This work uses an MIL approach because the WSIs are labeled but the ROIs are not, so all ROIs from a positive slide receive positive labels. MIL uses an MLP for classification, but it requires a modified loss function to represent the one-positive rule for a slide, which means that if only one ROI is predicted as positive, the entire slide is positive. In the comparison between the MIL approach and SVM classification, the SVM required ROI labeling. The AUC of MIL was 0.95 against 0.94 of SVM, with the advantage of reducing labeling efforts. Xu et al. [104] introduced MCIL, an MIL-based method that uses a patching procedure to create instance-level Gaussian classifiers, which are clustered using the k-means algorithm. The work performs comparisons with regular image-level classification methods and MIL methods. 
The fully supervised method presented an F-score of 76.6% (using patch labeling) and the proposed method achieved 69.9%. MCIL achieved 71.7% and 60.1% in another dataset (not patch labeled) with constrained and unconstrained MCIL respectively, against 25.3% of MIL-Boosting. Sudharshan et al. [105] compared different MIL approaches for the diagnosis of breast cancer patients. In this approach, every patient is seen as a bag, which is labeled with her diagnosis. Therefore, HIs do not need to be individually labeled, as they can share the bag label. Instances are patches extracted from the corresponding HIs, considering different magnification factors (40×, 100×, 200× and 400×). The hypothesis is that a bag-based (patches) analysis is valuable for the analysis of HIs in comparison with single instances (entire images). The experiments were carried out on the BreakHis database using CNN, 1-NN, QDA, RF and SVM classifiers, and the best accuracy of 92.1% was achieved for 40× magnification by non-parametric MIL.
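The bag-level assumption shared by these MIL works is compact: a bag (a slide or a patient) is positive if at least one of its instances (ROIs or patches) is positive. The slide names and scores below are made up for illustration:

```python
def bag_label(instance_scores, threshold=0.5):
    """Standard MIL assumption: the bag is positive iff at least one
    instance score reaches the threshold (the 'one-positive' rule)."""
    return int(any(s >= threshold for s in instance_scores))

# Hypothetical patch-level scores; labels exist only at the bag level
slides = {"slide_a": [0.1, 0.2, 0.9],   # one suspicious patch
          "slide_b": [0.1, 0.3, 0.2]}   # all patches look benign
labels = {name: bag_label(scores) for name, scores in slides.items()}
```

This is why MIL reduces labeling effort: only the slide or patient diagnosis is needed, and the per-patch labels are inferred during training.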
Finally, several works compare the performance of different monolithic classifiers on HIs, but without combining their predictions. Ballarò et al. [40] proposed the segmentation of HIs to identify unhealthy or healthy megakaryocytes. The classification is carried out by a k-NN and DTs. Song et al.
[43] trained different classifiers such as SVM, k-NN, neural networks, and naïve Bayes (NB) on morphometric features. The experimental results showed that these features outperform morphological features, achieving 90% of accuracy against 64%. Besides that, the combination of both feature types achieves only 85% of accuracy. Bruno et al.
[57] used a curvelet transform to handle multi-scale HIs. Texture features are extracted from curvelet coefficients, reduced by an ANOVA analysis, and evaluated using DTs, SVM, RF and polynomial classifiers. They achieved an AUC of 1.00, which is higher than the best previous method (0.986). Pang et al. [74] proposed a CAD system for lung cancer detection. Sparse contribution analysis selects non-redundant features, which are used to train SVM, RF and extreme learning machine (ELM) classifiers. Another contribution is the concave-convex variation, which consists of measuring the concavity of all nuclei in an image and using such a measurement to weight the probabilities of the classifiers. This method achieved the accuracy of 98.74%, which is slightly better than RFs (97.68%). Orlov et al. [79] compared four color spaces (RGB, L*a*b, gray-scale and RGB) with H&E representation. A weighted k-NN achieved the best results (99%) followed by an RBF network and NB with 99% and 90% of accuracy, respectively. The best results were achieved for a color space called eosin representation. Irshad et al. [106] presented a multimodal approach with multispectral images focusing on selecting the best spectral bands for the classification of mitotic cells on the MITOS 2012 dataset. SVM, DT, and MLP are used for classification purposes. SVM achieved the best F-score (63.7%) using only the eight best bands, which is higher than the state of the art (59%). WSI is the core of the work proposed by Homeyer et al. [107], which compares k-NN, NB and RFs for the classification of slides based on a patching procedure and textural and intensity features. RFs with a group of all features achieved the best result (94.7%). Khan et al. [108] proposed a framework for malignant cell classification in breast cytology images. Selected features train SVM, NB and RF classifiers. At the end, an ensemble method is employed to combine the classifiers based on majority voting. 
The experiments have shown an accuracy of 98.0% in the detection and classification of malignant cells. Kurmi et al. [109] presented an approach consisting of nuclei localization in HIs and further classification as benign or malignant using MLP and SVM models. MLP achieved the best average accuracy of 95.03%.
Table 7. Summary of publications focusing on HI classification based on monolithic classifiers.

Ensemble Approaches
Ensemble approaches combine the predictions of multiple base classifiers in an attempt to improve generalization and/or robustness over a single classifier. Several researchers have proposed combining classifiers for improving the performance of HI approaches. Table 8 summarizes the works reviewed in this section in terms of type of base classifier and combination strategy, tissue or organ from where the HI was obtained and the publication year.
Zarella et al. [34] employed classification using an ensemble of SVMs on ROIs segmented from WSI. Multiple "weak" classifiers trained with subsets of features and different parameters combined with a weighted sum (WS) function achieved the accuracy of 88.6%. Daskalakis et al. [110] used a preprocessing step of segmentation to enhance nuclei and extract morphological and textural features. A multi-classifier approach combines k-NN, linear least squares minimum distance (LLSMD), statistical quadratic Bayesian, SVM, and PNN using majority vote, minimum, maximum, average, and product rules. PNN achieved the best accuracy of 89.6% for a base classifier while the ensemble method achieved 95.7% with the majority vote rule. The method proposed by Kong et al. [87] classifies neuroblastomas using textural and morphological features. An ensemble approach combining k-NN, LDA, NB and SVM classifiers using the weighted voting (WV) rule achieved the accuracy of 87.8%. Meng et al. [111] proposed an ensemble of principal component classifiers (PCC). This ensemble classified 25 patches of each image, which are represented by 50 features. The accuracy achieved on a liver dataset was 96.41% using the majority vote (MV) rule compared to 95.09% achieved by a 3-NN. The same approach achieved 99.4% of accuracy on lymphoma classification against 92.08% achieved by the Adaboost approach. A CAD system composed of a staining separation module, densitometric and texture feature extraction and an AdaBoost algorithm was proposed by Wang and Yu [112]. The proposed system achieved accuracy of 94.37% against 86.44% of the best base classifier (k-NN) trained on raw H&E images. The system described by Gorelick et al.
[44] uses a segmentation step to identify superpixels. An Adaboost algorithm classifies the segmented images represented by morphometric and geometric features. The system achieved the accuracy of 85%. A framework for cytological analysis is presented by Filipczuk et al. [45]. Morphometric features represent nuclei obtained after segmentation. The proposed method uses a combination of random subspaces with perceptrons to create an ensemble. The comparison showed that the ensemble approach achieved accuracy of 96.0% compared to 90% achieved by a boosting algorithm. Vink et al. [113] proposed a nucleus detection method based on two Adaboost stages. The first stage is based on features extracted from stain-separated images. The second Adaboost refines the result of the first with line-based features. An optimal active contour refines the results from the second ensemble, achieving the accuracy of 95.02%. Phoulady et al. [114] proposed an ensemble of Otsu thresholding algorithms with certain constraints and morphological operations. Four segmentation algorithms are responsible for the segmentation, but each image can have characteristics that would require different parameters for the segmentor set. The final segmentation is the one, among 18 segmentor sets, that shares the most parameters with the set of segmentors whose segmentations differ the least. This approach achieved accuracy of 84.3% compared to 77.4% achieved by other methods.
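The combination rules that recur throughout this subsection, majority vote (MV) and weighted vote (WV), reduce to simple counting over the base classifiers' predictions. A minimal sketch, with made-up class labels and weights for illustration:

```python
from collections import Counter

def majority_vote(predictions):
    """MV rule: the class predicted by most base classifiers wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """WV rule: each classifier contributes its weight to its class."""
    scores = Counter()
    for label, w in zip(predictions, weights):
        scores[label] += w
    return scores.most_common(1)[0][0]

# Three base classifiers disagree on a patch
mv = majority_vote(["benign", "malignant", "benign"])
# With weights, a single strong classifier can outvote two weak ones
wv = weighted_vote(["benign", "malignant", "malignant"], [0.6, 0.3, 0.2])
```

The sum, product, minimum, and maximum rules mentioned above operate analogously on the classifiers' class probabilities rather than on their hard labels.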
Di Franco et al. [115] used an ensemble of SVM classifiers, where each model is trained with a variation of images pre-processed by Gaussian filters and color spaces. The classifiers are combined using the average rule and the best AUC value achieved was 0.978. Albashish et al. [116] proposed a feature selection method that uses the entropy of a feature in relation to a class as a redundancy criterion and constraints this value and the inter-feature entropy. SVM classifiers specialized in one subtype of tissue derived from prior segmentation are combined using the sum rule. The performance of the ensemble approach using 37 features (94.08%) is only 0.2% better than the best SVM with the recursive feature elimination (RFE) method using 46 features. A comparison of multiple classifiers and features is presented by Huang and Kalaw [117]. A set of monolithic classifiers is compared with Adaboost implemented with SVM, DT, and RF. Adaboost achieved 97.8% of accuracy. Fernández-Carrobles et al. [118] presented a classification framework for WSI with a bagging of DTs and GLCM features, which achieved 0.995 for AUC and 98.13% for true positives. The multi-view approach presented by Kwak and Hewitt [51] extracts features from multiple resolutions. A boosting algorithm combining linear SVMs and the features from multiple views achieved 0.98 of AUC compared to 0.96 of the concatenation of views. Kruk et al. [75] used morphometric, textural, and statistical features to describe nuclei for classification. An ensemble made up of SVM and RF classifiers and trained with a subset of features resulting from the feature selection achieved the accuracy of 96.7%, which was higher than the state of the art (93.1%) and the best single SVM classifier (91.1%). An Adaboost ensemble is used by Romo-Bucheli et al. [119] to grade skin cancer. The ensemble classifies images described by features created with graph theory to represent the nuclei distribution. The ensemble achieved 72% of accuracy. 
A multi field-of-view (FOV) classification scheme is proposed by Basavanhally et al. [76]. It uses a multiple patch size procedure for WSI that firstly analyzes which features are the most relevant to each patch size. After that, it uses an RF to aggregate multiple FOV patches. They do not present a baseline for accuracy comparison, only the AUC result, showing better values for nucleus architecture features to recognize low- versus high-grade ductal carcinoma.
Table 8. Summary of recent publications on ensemble approaches for HI analysis.

Reference | Year | Tissue/Organ | Base Classifier | Combination Rule/Function
Daskalakis et al. [110] | 2008 | Thyroid | k-NN, LLSMD, SQ-Bayes, SVM, PNN | Vot, Min, Max, Sum, Prod
Kong et al. [87] | 2009 | Neuroblastoma | k-NN, LDA, Bayesian, SVM | WV
Meng et al. [111] | 2010 | Liver, Lymphocytes | PCC | WV
DiFranco et al. [120] | 2011 | Prostate | SVM and RF | MV
Wang and Yu [112] | 2013 | Lung | DT | Adaboost
Gorelick et al. [44] | 2013 | Prostate | DT | Adaboost
Filipczuk et al. [45] | 2013 | Breast | SVM, Perceptron | Perceptron
Vink et al. [113] | 2013 | Breast | DT, Stumps | Adaboost
Basavanhally et al. [76] | 2013 | Breast | RF | MV
Phoulady et al. [114] | 2014 | Uterus | Otsu segmentors | Similarity
Zarella et al. [34] | 2015 | Lymphoma | SVM | WS
Di Franco et al. [115] | 2015 | Prostate | SVM | Avg
Gertych et al. [60] | 2015 | Prostate | SVM, RF | MV
Tashk et al. [77] | 2015 | Breast | SVM, RF | MV
Albashish et al. [116] | 2015 | Prostate | SVM | Sum
Huang and Kalaw [117] | 2016 | Prostate | k-NN, SVM, DT, RF, LDA, QDA, NB | Adaboost
Balazsi et al. [61] | 2016 | Breast | RF | MV
Wright et al. [121] | 2016 | Colorectal | RF | MV
Fernández-Carrobles et al. [118]
Tashk et al. [77] presented a complete framework for HI classification. They employ maximum likelihood estimation to obtain the mitotic pixels in the L*a*b color space. A cascading classification is performed firstly with SVM and then with RFs. A comparison shows that this method achieves the accuracy of 96.5% against 82.4% of the best previous result on the MITOS 2012 dataset. Gertych et al. [60] presented a system for prostate cancer classification, which consists of SVM and RF classifiers. The SVM separates the stroma and epithelium, and the RF identifies benign/normal and carcinogenic tissue. The best accuracy was 68.4% for cancer detection. Balazsi et al. [61] extended the work described in [123]. The authors used simple linear iterative clustering (SLIC) to extract patches by tessellation. A set of multiple histogram and texture features is extracted from the L*a*b, gray-scale and RGB color spaces of each patch. 
The resulting feature set is well suited to an RF classifier, which achieved an F-score of 79.51% for tessellated patches, compared to 77.57% for squared patches and 71.80% for the baseline. SLIC is also applied by Wright et al. [121] to initially segment images in a pipeline for colorectal cancer. Histogram and texture features were extracted from the HSV color space, along with statistics from the H&E channels and GLCM features. A comparison showed that the proposed work achieved an accuracy of 79% against 75% from their previous work with RF. Valkonen et al. [122] presented a system for the classification of WSIs. The segmentation step uses Otsu thresholding, morphological operations, and histological constraints. The classification algorithms, including RF, SVM, k-NN, and logistic regression, were trained with textural, morphometric, and statistical features extracted from random patches of segmented images. RF achieved the best accuracy (93%). A comparison between different ensemble approaches to classify WSI patches is presented in [120]. A set of 114 features was selected and ranked using RFs. Based on these features, multiple linear and RBF SVM and RF classifiers were built, aggregated by majority vote. The AUCs were 0.955, 0.951, and 0.948 for the RBF SVM, RF, and linear SVM, respectively; the best previous result was 0.935.
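The majority-vote aggregation used by several of the ensembles above can be sketched as follows; this is a minimal illustration rather than any specific paper's implementation (the classifier names and labels are made up):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate per-classifier labels for one sample by majority vote.

    predictions: list of class labels, one per base classifier.
    Ties are broken by the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(per_classifier_labels):
    """per_classifier_labels: one label list per classifier, each of
    length n_samples. Returns one aggregated label per sample."""
    return [majority_vote(sample) for sample in zip(*per_classifier_labels)]

# Three hypothetical classifiers voting on four samples:
svm_rbf = ["benign", "malignant", "benign", "malignant"]
svm_lin = ["benign", "benign", "benign", "malignant"]
rf      = ["malignant", "malignant", "benign", "malignant"]
print(ensemble_predict([svm_rbf, svm_lin, rf]))
# → ['benign', 'malignant', 'benign', 'malignant']
```

Weighted voting (WV) differs only in that each classifier's ballot is multiplied by a weight, typically its validation accuracy, before the counts are compared.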

Methods Based on Deep Learning (DL)
DL methods are gaining the attention of the scientific community due to recent achievements in solving complex machine learning problems on large datasets. A convolutional neural network (CNN) is able to learn, in a single optimization process, both a representation and a decision boundary. However, CNNs usually require a huge amount of data for adequate training in order to avoid overfitting, while most HI datasets have only a few patients and hundreds of images; for comparison, ImageNet, one of the most common datasets used for training CNNs for object recognition, has more than 14 million images. Data augmentation [124] [125] and transfer learning [126] are two possible approaches to circumvent the lack of data in HI datasets. Data augmentation generates new HIs from existing ones using affine transformations or morphological operations. Another common form of data augmentation is patching, which splits each HI into many smaller pieces that preserve the tissue structures of interest, multiplying the number of training samples. Transfer learning, on the other hand, reuses CNNs previously trained on large datasets, which usually belong to a different domain from the target problem. The pre-trained CNNs can be used in two ways: to extract features from HIs and use these features with shallow classifiers, as already described in Sections 4 and 5; or to fine-tune such CNNs on an HI dataset, which means that filters learned on a large dataset are adapted to the HI dataset. Recently, DL methods have been employed in HI analysis. Table 9 summarizes the works reviewed in this section in terms of network architecture, tissue or organ from which the HIs were obtained, and publication year.
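Patching and affine transforms, the two augmentation strategies mentioned above, can be sketched as follows. This is a minimal numpy illustration; the image size, patch size, and stride are arbitrary example values:

```python
import numpy as np

def extract_patches(image, patch_size, stride):
    """Slide a square window over an H×W×C image and collect patches.
    Patching turns a few large HIs into many training samples."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

def augment(patch):
    """Simple affine-style augmentation: 4 rotations × horizontal flip."""
    out = []
    for k in range(4):
        rotated = np.rot90(patch, k)
        out.append(rotated)
        out.append(np.fliplr(rotated))
    return out

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for one HI
patches = extract_patches(image, patch_size=128, stride=128)
print(len(patches))                              # → 16 non-overlapping patches
print(len(patches) * len(augment(patches[0])))   # → 128 samples after augmentation
```

In practice the stride is often smaller than the patch size, so that overlapping patches further increase the sample count.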
Malon et al. [88] were among the first authors to employ DL methods in HI analysis. They used a classical LeNet-5, a 7-layer CNN architecture proposed by Lecun et al. [127] in 1998, to learn a representation from HIs previously segmented with an SVR. The features extracted by the CNN were classified by an SVM, with the goal of finding mitotic nuclei. The remarkable aspect of this work is the comparison between machines and three pathologists: the pathologists showed a Cohen's kappa between 0.13 and 0.44 in the best case, emphasizing the inter-observer problem. Kainz et al. [128] presented two CNNs based on the LeNet-5 architecture for segmentation and classification of glands in benign and malignant colorectal tissue. The first CNN separates glands from background, while the second identifies gland-separating structures. Experimental results on the Warwick-QU colon adenocarcinoma and GlaS@MICCAI2015 challenge datasets showed tissue classification accuracies of 98% and 95%, respectively.
Some works used CNNs based on the AlexNet architecture proposed by Krizhevsky et al. [129] in 2012. AlexNet follows the same principles as LeNet-5 but is deeper, with more filters per layer and stacked convolutional layers. Stanitsas et al. [130] employed the AlexNet CNN to classify breast cancer HIs. They compared the CNN results with handcrafted feature extractors and shallow classifiers, and concluded that the CNN was not able to outperform the shallow methods. Spanhol et al. [131] evaluated architectures based on the AlexNet CNN for breast cancer HI classification. Experimental results on the BreaKHis dataset showed that the CNN achieved mean patient-level accuracy rates between 81.7% and 88.6%, depending on the magnification, which is better than other shallow ML approaches based on textural features. Sharma et al. [132] also used an AlexNet CNN as well as other custom CNN architectures to classify benign and malignant tumors. Due to the small sample size, the authors had to carry out data augmentation by patching and affine transforms: for cancer classification, 11 WSIs produced 231,000 images, and for necrosis detection, four WSIs produced 47,130 training images. Both AlexNet and the custom CNN architectures compared favorably to most handcrafted features with an RF classifier. Budak et al. [133] proposed an end-to-end model based on a pre-trained AlexNet CNN and a bidirectional LSTM (BLSTM) for detecting breast cancer in HIs. The convolutional layers encode HIs into a high-level representation, which is flattened and fed into the BLSTM. Experimental results on the BreaKHis dataset showed that the proposed model achieved its best average accuracy, 96.32%, for the magnification factor of 200×; for the magnification factors of 40×, 100×, and 400×, the average accuracies were 95.69%, 93.61%, and 94.29%, respectively.
Some works use CNNs based on the inception architecture proposed by Szegedy et al. [134]. Inception modules have parallel paths in which the input is passed through filters of different dimensions (1×1, 3×3, 5×5), along with a max-pooling path; the outputs are concatenated and sent to the next inception module. GoogLeNet, a.k.a. Inception-V1 [134], has nine such inception modules stacked linearly, for a total of 27 layers, and employs global average pooling after the last inception module. Inception-V2 and Inception-V3 [135] use an upgraded inception module and auxiliary outputs, which increase accuracy and reduce computational complexity. Another architecture is Inception-ResNet, which combines the inception model with the ResNet model [136]. Li et al. [137] compared AlexNet and Inception-V1 CNNs, handcrafted features with an SVM, and features extracted by CNNs for classifying regions of colon histology images as either gland or non-gland. The combination of handcrafted features with an SVM and the prediction of a CNN showed the best results. They used data augmentation with rotations and mirroring for both the handcrafted features and the CNNs. Yan et al. [138] integrated a pre-trained Inception-V3 with a BLSTM for classifying breast cancer HIs into normal, benign, in situ carcinoma, or invasive carcinoma. The method divides each HI into 12 small patches on average. Afterwards, a fine-tuned Inception-V3 CNN extracts features from the patches, where a 5,376-dimensional feature vector is made up of the concatenation of the weights of the last three layers of the CNN. Such feature vectors are the input of a four-layer BLSTM that fuses the features of the 12 small patches to produce an image-wise classification. The experiments show that this approach achieved an average accuracy of 91.3%. de Matos et al.
[126] proposed a classification approach for breast cancer HIs that uses transfer learning to extract features from HIs with an Inception-V3 CNN pre-trained on the ImageNet dataset. The proposed approach improved the classification accuracy by 3.7% using feature extraction transfer learning, and by an additional 0.7% using irrelevant patch elimination.
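When a pre-trained CNN such as Inception-V3 is used as a fixed feature extractor, the fully connected layers are discarded and the last convolutional output is collapsed into a feature vector, typically by global average pooling. A minimal numpy sketch of that pooling step follows; the feature-map dimensions mimic an Inception-V3-like backbone, and the values are random placeholders:

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse a C×H×W stack of convolutional feature maps into a
    C-dimensional vector by averaging each map over its spatial extent.
    The resulting vector can feed a shallow classifier such as an SVM."""
    return feature_maps.mean(axis=(1, 2))

# Pretend output of the last convolutional layer: 2048 maps of size 8×8,
# roughly what an Inception-V3-like backbone produces for one image.
maps = np.random.rand(2048, 8, 8)
features = global_average_pool(maps)
print(features.shape)  # → (2048,)
```

The pooled vectors from a labeled HI dataset then become ordinary tabular training data for any shallow classifier.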
The deep residual network (ResNet) [139] is another architecture that has been used in the classification of HIs; its residual blocks alleviate the problem of training very deep networks. Khosravi et al. [140] evaluated the versatility of CNNs on eight different datasets of breast, lung, and bladder tissues with H&E and immunohistochemistry (IHC) images. The evaluation included Inception and ResNet CNNs, the combination of both, and transfer learning. Results showed good performance in spite of using raw images without any pre-processing. Vizcarra et al. [141] fused CNN and SVM outputs for HI classification. The pipeline consists of extracting SURF features for the shallow learner (SVM), and of color normalization (Reinhard method) and image resizing (downsampling) for fine-tuning Inception-V3 and Inception-ResNet-V2 CNNs pre-trained on the ImageNet dataset. The outputs of the shallow and deep learners are fused for the final prediction. Experimental results on the BACH dataset showed moderate accuracies of 79% and 81% for the SVM and the CNN, respectively; the fusion of SVM and CNN outputs outperformed the individual learners, achieving an accuracy of 92%. Zerhouni et al. [142] proposed a wide residual CNN to classify mitotic and non-mitotic pixels in breast HIs. The CNN is trained on mitotic and non-mitotic patches extracted from the ground-truth images. Experimental results on the MICCAI TUPAC Challenge dataset showed that the wide residual CNN outperformed most other approaches. Gandomkar et al. [143] proposed the MuDeRN framework, which first classifies HIs as benign or malignant and then into four subtypes. In the first stage, a ResNet with 152 layers was trained to classify HI patches of different magnification factors as benign or malignant. Afterwards, the benign and malignant classes were each subdivided into four subcategories.
Lastly, for each patient, the diagnosis was obtained by combining the outputs of the ResNet using a meta-DT. MuDeRN [144] also used a ResNet to detect invasive ductal carcinoma as well as to classify lymphoma subtypes. First, the convolutional layers are trained in an unsupervised way to extract features useful for reconstructing the input image, while the fully connected layers are trained in a supervised way. In both cases, a softmax classifier produces the probability of the input image belonging to a given class. Talo [145] presented an approach based on pre-trained ResNet-50 and DenseNet-161 CNN models for the automatic classification of gray-scale and color HIs. The results achieved by both CNNs outperformed existing studies in the literature, with a total accuracy of 95.79% for gray-scale images; ResNet-50 achieved a total accuracy of 97.77% for color HIs.
Another CNN architecture that has been used in HI classification is VGG-net, a very uniform architecture built from stacks of small 3×3 convolutional filters; the VGG16 variant has 16 weight layers. Bejnordi et al. [146] used a VGG-net CNN [147] to classify tissue into epithelium, stroma, and fat, followed by a VGG16 CNN to classify stroma as normal or tumor-associated. The first CNN achieved a pixel-level accuracy of 95.5%, while the second achieved a pixel-level binary accuracy of 92.0%. The authors employed data augmentation by randomly rotating and flipping patches, as well as by randomly jittering the hue and saturation of pixels in the HSV color space. The work presented by Xu et al. [148] segments and distinguishes glands. They proposed an approach combining a fully convolutional network (FCN) for the foreground segmentation channel, a faster region-based CNN (R-CNN) for the object detection channel, and a holistically-nested edge detector CNN for the edge detection channel. All three CNNs are based on the VGG16 CNN, and their results feed another CNN that outputs a segmented image. Data augmentation by affine and elastic transformations was carried out to enhance performance and avoid overfitting. The proposed approach achieved state-of-the-art results on the dataset from the MICCAI 2015 Gland Segmentation Challenge. Kumar et al. [149] developed a variant of the VGG16 architecture that replaces the fully connected layers with different classifiers. The approach consists of stain normalization and data augmentation, which uses images both with and without normalized stain. The augmented dataset is fed to the fused VGG16, where features are taken at the global average pooling layer. Finally, binary classification is carried out by SVM and RF classifiers. Experiments were conducted on canine mammary tumor (CMTHis) and breast cancer (BreakHis) HI datasets, both randomly split into training (70%) and test (30%) sets.
The approach achieved accuracies of 97% and 93% on the BreakHis and CMTHis datasets, respectively.
Other CNN architectures have also been used in HI classification, such as DenseNet [150] and MobileNet [151]. Kassani et al. [152] proposed an approach for the classification of breast cancer HIs based on an ensemble of three pre-trained CNNs, namely VGG19, MobileNet, and DenseNet. Stain normalization, data augmentation, fine-tuning, and hyper-parameter tuning were used to improve the performance of the CNNs. The multi-model ensemble achieved better performance than single classifiers, with accuracies of 98.13%, 95.00%, 94.64%, and 83.10% on the BreakHis, ICIAR, PatchCamelyon, and Bioimaging datasets, respectively. Yang et al. [153] introduced additional region-level supervision for classifying breast cancer HIs with a DenseNet-169 CNN pre-trained on ImageNet. For this purpose, ROIs are localized and used to guide the attention of the classification network, activating neurons in regions relevant to diagnosis while suppressing activation in irrelevant and noisy areas. Hence, the prediction of the network is based on the regions on which a pathologist would expect the network to focus. This approach achieved an accuracy of 93% on the BACH dataset.
Finally, several works proposed custom CNN architectures for HI classification, usually based on well-known architectures, in which the authors attempt to optimize mainly the number and dimension of kernels and the number of layers. Bayramoglu et al. [154] proposed two different CNN architectures, both with 10 layers, for breast cancer HI classification. The first CNN predicts only malignancy, while the second predicts both malignancy and image magnification level simultaneously. Experimental results on the BreaKHis dataset showed that the magnification-independent CNN improved on the performance of the magnification-specific model, and that the results are comparable with previous state-of-the-art results obtained with handcrafted features. They also used data augmentation based on affine transformations.
Albarqouni et al. [155] introduced a CNN that aggregates annotations from crowds while learning a model for a challenging classification task. During the learning-from-crowds phase, the CNN architecture is augmented with an aggregation layer that estimates the ground truth from the crowd-vote matrix. Experimental results on the AMIDA13 dataset showed that the proposed CNN architecture is robust to noisy labels, which positively influenced performance. Cruz-Roa et al. [123] proposed a custom 3-layer CNN to classify patches of WSIs as invasive ductal carcinoma (breast cancer) or not. Patches inherited their labels from region-level annotations; some regions of the WSIs, such as background and adipose cells, were excluded manually and not patched. Patches were pre-processed using color normalization and the YUV color space. The CNN outperformed an RF trained on the best handcrafted feature extractor by 4%. Compared to other works, this one has a simple protocol and uses a small network, but it was one of the precursors of CNNs for analyzing HIs. Ciompi et al. [156] proposed an 11-layer CNN to analyze the impact of stain normalization on the training and evaluation pipeline of an automatic system for CRC tissue classification. Experimental results on the CRC dataset validated the performance of the proposed CNN as well as the role of stain normalization in CRC tissue classification. Kwak and Hewitt [157] proposed a 6-layer CNN to identify prostate cancer and compared it with other CNNs (AlexNet, VGG, GoogLeNet, ResNet) as well as with shallow classifiers such as SVM, RF, k-NN, and NB. Experimental results on four tissue microarrays showed that the 6-layer CNN achieved an AUC of 0.974, outperforming all other approaches, whether based on handcrafted features with shallow classifiers or on other CNN architectures.
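Several of the works above apply stain normalization before training. A minimal sketch of the statistics-transfer idea behind the Reinhard method is shown below; note that Reinhard's original method performs this transfer in the decorrelated lαβ color space, whereas this sketch stays in RGB purely for brevity:

```python
import numpy as np

def match_stats(source, target):
    """Per-channel mean/std transfer from a target (reference) image to a
    source image. Reinhard's method applies exactly this transfer, but in
    the lαβ color space; RGB is used here only to keep the sketch short."""
    src = source.astype(np.float64)
    tgt = target.astype(np.float64)
    out = np.empty_like(src)
    for c in range(src.shape[2]):
        s_mean, s_std = src[..., c].mean(), src[..., c].std()
        t_mean, t_std = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (src[..., c] - s_mean) / (s_std + 1e-8) * t_std + t_mean
    return np.clip(out, 0, 255).astype(np.uint8)

# Random stand-ins for a slide to normalize and a reference slide:
src = np.random.randint(0, 256, size=(64, 64, 3)).astype(np.uint8)
ref = np.random.normal(120, 15, size=(64, 64, 3)).clip(0, 255).astype(np.uint8)
normalized = match_stats(src, ref)
print(normalized.shape)  # → (64, 64, 3)
```

After the transfer, the color statistics of every slide approximate those of the chosen reference, reducing the stain variability discussed in the introduction.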
Roy et al. [158] proposed five custom CNN architectures for the classification of breast cancer HI patches. The approach consists of extracting patches, classifying them, and comparing the results for individual patches with that of the whole image; the output is considered correct only if the class labels of all extracted patches agree with the class label of the HI. They also boosted the number of training samples per class using affine transformations for data augmentation. Experimental results on the ICIAR 2018 challenge dataset showed that a 14-layer CNN achieved the best results: patch-wise classification accuracies of 77.4% and 84.7% for four and two classes, respectively, and image-wise classification accuracies of 90.0% and 92.5% for four and two classes, respectively. de Matos et al. [124] proposed a 7-layer CNN architecture based on texture filters that has fewer parameters than traditional CNNs but is able to capture the difference between malignant and benign tissues with reasonable accuracy. Experimental results on the BreakHis dataset showed that the proposed texture CNN achieves 85% accuracy for classifying benign and malignant tissues. The authors also employed data augmentation based on composed random affine transforms including flipping, rotation, and translation. Ataky et al. [125] proposed a novel approach for augmenting HI datasets that accounts for inter-patient variability by means of image blending using the Gaussian-Laplacian pyramid. Experimental results on the BreakHis dataset with a texture CNN [124] showed promising gains over the majority of data augmentation techniques presented in the literature. The research carried out by Gecer et al. [159] presented a method for breast diagnosis based on WSIs, which aims at classifying images into five categories.
At first, salience detection is performed by a pipeline of four sequential 9-layer CNNs based on the VGG-net [147] architecture for multi-scale processing of the WSIs, considering different magnifications for localizing diagnostically pertinent ROIs. Afterwards, a patch-based multi-class CNN is trained on representative ROIs resulting from the consensus of three experienced pathologists. Finally, the slide-level diagnosis is obtained by fusing the salience detector and the CNN for pixel-wise labeling of the WSIs with a majority-vote rule. The authors claimed that the CNNs used for both detection and classification outperformed competing methods based on handcrafted features and statistical classifiers. Moreover, the proposed method achieved results comparable to the diagnoses provided by 45 pathologists on the same dataset. Experiments using 240 WSIs showed a five-class slide-level accuracy of 55%.
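The patch/image agreement rule used by Roy et al. [158] above can be sketched as follows; this is a simplified reading of the evaluation protocol, not the authors' code:

```python
def image_prediction_is_consistent(patch_labels, image_label):
    """An image-level output is accepted as correct only when every
    extracted patch agrees with the label predicted for the whole image."""
    return all(label == image_label for label in patch_labels)

# Twelve unanimous patches accept the image-level label...
print(image_prediction_is_consistent(["invasive"] * 12, "invasive"))   # → True
# ...but a single disagreeing patch rejects it.
print(image_prediction_is_consistent(["invasive"] * 11 + ["benign"],
                                     "invasive"))                      # → False
```

This unanimity requirement is stricter than the majority-vote fusion used by most of the ensemble methods reviewed earlier.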
Wang et al. [160] employed a bilinear CNN (BCNN), which consists of two individual CNNs whose convolutional outputs are multiplied by an outer product at each corresponding spatial location, resulting in a quadratic number of feature maps. The inputs of the two CNNs are the H and E channels of H&E images, separated in a pre-processing stage by a color decomposition algorithm. The proposed BCNN-based algorithm achieves the best performance, with a mean classification accuracy of 92.6%, improving on other CNN-based algorithms by at least 2.4% on the CRC dataset. Li et al. [162] presented an automatic method for mitosis detection based on semantic segmentation. The method uses a CNN in which a novel label with concentric circles replaces the single-pixel representation of mitosis: the inner circle represents a mitotic region, whereas the ring around it is a "middle ground". This concentric loss allows training the semantic segmentation CNN with weakly annotated mitosis data. The semantic segmentation, employed on breast cancer HIs to seek out mitotic cells, achieved F-scores of 0.562, 0.673, and 0.669 on the ICPR 2014 MITOSIS, AMIDA13, and TUPAC16 datasets, respectively. Hou et al. [161] proposed a semi-supervised approach that uses a sparse convolutional autoencoder (CAE) with a crosswise constraint that decomposes patches from HIs into foreground (e.g. nuclei) and background (e.g. cytoplasm). The CAE initializes a supervised CNN, which carries out nucleus detection, feature extraction, and classification/segmentation in an end-to-end fashion. The experimental results showed that the proposed approach outperformed other approaches and highlighted the role of the crosswise constraint in boosting performance. The proposed CAE-CNN achieved results comparable to the state of the art using only 5% of the training data needed by other methods. Sheikh et al.
[163] proposed a four-input 24-layer custom CNN for classification.
Table 9. Summary of publications using DL methods in HI analysis.
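Returning to the bilinear CNN of Wang et al. [160] described above: the bilinear pooling step can be sketched in numpy. The feature vectors of the two streams are combined by an outer product at every spatial location and summed over locations; this is a minimal sketch of the pooling operation only, with random placeholder feature maps:

```python
import numpy as np

def bilinear_pool(fa, fb):
    """fa: Ca×H×W feature maps from stream A; fb: Cb×H×W from stream B.
    Returns a (Ca*Cb)-dimensional descriptor: the outer product of the two
    feature vectors at each spatial location, summed over all locations."""
    ca, h, w = fa.shape
    cb = fb.shape[0]
    fa2 = fa.reshape(ca, h * w)   # Ca × N, one column per location
    fb2 = fb.reshape(cb, h * w)   # Cb × N
    pooled = fa2 @ fb2.T          # sums the per-location outer products
    return pooled.reshape(ca * cb)

fa = np.random.rand(64, 7, 7)  # e.g. features from the H-channel stream
fb = np.random.rand(64, 7, 7)  # e.g. features from the E-channel stream
print(bilinear_pool(fa, fb).shape)  # → (4096,)
```

The quadratic growth of the descriptor (Ca×Cb entries) is the "quadratic number of feature maps" mentioned above; it captures pairwise channel interactions between the two streams.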

Reviews, Surveys and Datasets
This section summarizes the reviews and surveys related to HIs and ML methods. As shown in Table 10, we found nineteen works in this category. Reviews and surveys published between 2012 and 2015 highlight mainly approaches for nucleus segmentation and classification, while recent publications focus on the classification of whole medical images. The reviews presented by Saha et al. [171], Nawaz and Yuan [169], Chen et al. [172], and Robertson et al. [173] were published in medical journals and provide a deeper view of the histology information; however, they overlook aspects related to ML methods. For instance, Nawaz and Yuan [169] analyzed the characteristics of tumors and presented a brief study on how computational methods can deal with HIs. Komura and Ishikawa [180] presented the use of ML methods in HI analysis as well as several HI datasets. Litjens et al. [176] reviewed DL methods applied to a variety of medical images, including HIs. Zhou et al. [181] presented a comprehensive overview of breast HI analysis techniques based on both classical and DL methods, along with publicly available HI datasets. Finally, Krithiga and Geetha [182] presented a systematic review of breast cancer detection, segmentation, and classification on HIs, focused on the performance evaluation of ML and DL techniques to predict breast cancer recurrence rates.
Given the importance of datasets for research on HIs, we have also compiled, in Tables 11 and 12, a list of the datasets that have been used in the experiments of the works covered in this review. We include the dataset reference, year of creation, contents in terms of the number of images and patients, and references to the papers that have used them.
Table 11. Summary of publicly available HI datasets.

Dataset | Year | Reference | Dataset Size

Conclusion
In this paper, we have presented a review of the ML methods usually employed in the analysis of HIs. The review revealed an increasing interest in the classification task, while interest in other tasks such as segmentation and feature extraction has clearly declined in recent years, as shown in Tables 4 to 8, where the related works are arranged in ascending chronological order. We point out that the main reason for this change is the introduction of DL methods, which are able to deal with raw HIs with little or no pre-processing. Normalization is one of the most used pre-processing steps, but in earlier years other pre-processing methods such as thresholding, filtering, and color-model conversion were also used to improve the quality of HIs for subsequent tasks such as segmentation, feature extraction, or classification.
In the years preceding the wide adoption of DL methods, several works focused on identifying nuclei in HIs, which are important structures for cancer diagnosis. This led to the exploration of different segmentation approaches, as reviewed in Section 3. Some works used the concept of semantic features, based, for example, on nucleus counts, their relation to the stroma, and the distances between nuclei. Stain normalization is also a recurrent topic that has appeared in several works across the years covered by this review. Such an image processing method, which reduces the color and intensity variations present in stained images, has been widely used even in conjunction with DL methods. Feature extraction methods were the focus of interest of researchers between 2008 and 2016. Morphometric features and textural features such as GLCM, LBP, and their variants have been the most frequently used features in HI analysis, either alone or in combination with other feature types. It is important to note that shallow classifiers require a feature extraction method. Again, the adoption of DL methods, which are able to learn representations and decision boundaries in a single optimization process, is probably the main cause of the declining interest in feature extraction methods from 2016 onwards. Furthermore, pre-trained CNNs can also be used as feature extractors for HIs: several works removed the fully connected layers of pre-trained CNNs and used the output of the last convolutional layer as feature vectors to feed shallow classifiers. Comparing Tables 7, 8, and 9, we can say that DL approaches have become prevalent over shallow approaches in the last five years, although studies are still necessary to understand how these networks learn data representations, especially with respect to HIs.
Finally, Tables 11 and 12 also help us to understand the increasing interest in HI analysis in recent years. We have found that most of the early works are based on small private datasets, which makes it difficult for researchers without access to such HI datasets to carry out research in this area or to reproduce the reported results. On the other hand, most of the recent works are based on public HI datasets, which are a great contribution to science, as they provide a way for researchers to develop new methods and compare their performance with existing ones. However, there is still a lack of large-scale supervised WSI datasets.
In conclusion, this review shows the evolution of HI analysis and the recent shift toward DL methods. It also provides valuable information to researchers in the field about datasets and other reviews and surveys.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: