Classification of Multiple H&E Images via an Ensemble Computational Scheme

In this work, a computational scheme is proposed to identify the main combinations of handcrafted descriptors and deep-learned features capable of classifying histological images stained with hematoxylin and eosin. The handcrafted descriptors were those representatives of multiscale and multidimensional fractal techniques (fractal dimension, lacunarity and percolation) applied to quantify the histological images with the corresponding representations via explainable artificial intelligence (xAI) approaches. The deep-learned features were obtained from different convolutional neural networks (DenseNet-121, EfficientNet-b2, Inception-V3, ResNet-50 and VGG-19). The descriptors were investigated through different associations. The most relevant combinations, defined through a ranking algorithm, were analyzed via a heterogeneous ensemble of classifiers with the support vector machine, naive Bayes, random forest and K-nearest neighbors algorithms. The proposed scheme was applied to histological samples representative of breast cancer, colorectal cancer, oral dysplasia and liver tissue. The best results were accuracy rates of 94.83% to 100%, with the identification of pattern ensembles for classifying multiple histological images. The computational scheme indicated solutions exploring a reduced number of features (a maximum of 25 descriptors) and with better performance values than those observed in the literature. The presented information in this study is useful to complement and improve the development of computer-aided diagnosis focused on histological images.


Introduction
Histopathology considers the study of biological tissues and their microscopic structures.Histopathologists examine tissue samples under a microscope to identify abnormal changes in cell structure and indicate possible pathological conditions [1,2].In this process, staining techniques, such as hematoxylin and eosin (H&E), and microscopy are commonly explored in order to obtain information for the diagnosis, treatment and understanding of the progression of hyperplasias, dysplasias, metaplasias and neoplasms [1,3,4].For example, when the H&E technique is used, the staining process aims to highlight the basophils and eosinophils in tissues [5,6].The first dye highlights cell nuclei with a bluish color, while the second makes the cytoplasm reddish.Other cellular structures have colors derived from mixing these dyes [7].These conditions contribute to the analysis of regions of interest, whether by specialists or automated systems aimed at classifying and recognizing cancer patterns, considered a global public health issue.Over the past decade, there has been a 20% increase in cancer incidence, and more than 25 million new cases are expected by 2030.According to [8], cancer has become an important cause of premature mortality globally and is associated with high social and economic costs.The estimated productivity losses are EUR 104.6 billion of the gross national domestic product in Europe and USD 46.3 billion of the combined gross domestic product of the BRICS countries (Brazil, Russia, India, China and South Africa).
In this context, delays in the diagnosis and treatment of cancer can increase the rates of the disease in advanced stages and consequently, mortality.On the other hand, the characterization of cellular changes and their associations with cancer are complex and challenging processes for histopathologists [3,4,9].Methods based on computer vision and artificial intelligence techniques have provided important advances to minimize these challenges, increasing cancer diagnostic and prognostic accuracy values, especially through computer-aided diagnosis (CAD) [10,11].In these systems, the extraction and classification of features are essential for the recognition of histological patterns commonly examined by pathologists, especially in the contexts of colorectal cancer, breast cancer, oral dysplasia and liver tissue [12,13].It is noted, for example, that the combined use of techniques (ensemble learning), such as those based on convolutional neural networks (CNN), make it possible to characterize information at different levels and scales to verify how they relate to each other in the data space [14].
Although a CNN architecture directly performs image classifications, when the values in its internal layers (deep-learned features) were investigated separately, the results indicated more relevant performance, with new approaches in the context of medical images [15][16][17][18][19][20], including the use of transfer learning [21] or even different ensembles of descriptors [22][23][24][25][26][27].Among the combinations with deep-learned features, the use of fractal descriptors (handcrafted), such as fractal dimension, lacunarity and percolation, deserves to be highlighted because they are capable of appropriately measuring complex shapes generally found in nature and in the context of H&E samples [12,[28][29][30][31][32][33][34].Moreover, the handcrafted features that we selected are commonly used for describing complex structures like the ones found in histopathological images.Therefore, it is noted that proposals based on ensemble learning have been indicated as one of the main research fields for the development of new models [35].
In addition, it is important to note that explainable artificial intelligence (xAI) has contributed significantly to the improvement of ensemble learning models, especially in the validation and interpretation of results [36,37], in order to ensure that accurate classifications are determined for the right reasons [38,39].This question motivated the development of strategies based on class activation mappings (CAMs) [40], specifically, gradient-weighted class activation mappings (Grad-CAMs) and local interpretable model-agnostic explanations (LIME) [41].These approaches can be applied to produce visualizations of the image regions that support the CNN classification process.Quantifying this type of image with fractal techniques can complement the process of classification and pattern recognition of histological images.Despite these observations, it is noted that this hypothesis has not yet been investigated in the specialized literature in the context of multiple H&E datasets such as those explored here.
The strategies highlighted previously result in a highly complex feature space, a fact that can make investigations of contexts with a reduced number of samples unfeasible [42,43], such as histological datasets commonly used to investigate colorectal cancer, breast cancer, oral dysplasia and liver tissues.This situation can be overcome by identifying the most relevant features for the classification process and, consequently, indicating more accurate and robust CAD systems.Thus, feature selection plays a critical role in identifying patterns [44], but there is no universal approach in order to define the best results for all contexts [45][46][47][48].On the other hand, algorithms based on ranking and filters, such as ReliefF, are capable of detecting feature dependencies and provide the best solutions in different experiments [47,[49][50][51].This strategy provides sets of features with different dimensions via any desired criteria.In addition, a set of descriptors can also be highly dependent on the heuristics of the classifier used to evaluate the model [52].This challenge can be minimized through a classification process with different heuristics (ensemble of classifiers) [53].Techniques that are part of this set consider the so-called crowd wisdom, in which a decision is made from different perspectives and when associated, can be more accurate.The justification is that possible individual errors are compensated by the successes of the other components [53].Therefore, investigating the most relevant combinations of descriptors and techniques for the analysis, classification and pattern recognition of H&E images remains an ongoing challenge.This makes ensemble-learning-based solutions more generalizable and robust.
Even with some initiatives observed [12,[54][55][56][57], models based on ensemble learning with multiscale and multidimensional fractal descriptors, such as those investigated here, have not yet been fully explored in the literature, including the quantification with the Grad-CAM and LIME representations.In this context, some insights are still pertinent, such as whether it is possible to define standards between the techniques used to classify multiple types of H&E images; whether multiscale and multidimensional fractal descriptors indicate gains in relation to the results achieved via deep-learned features with transfer learning; whether fractal descriptors obtained from Grad-CAM and LIME representations can contribute to the performance of an ensemble learning scheme; and whether the combination of ensembles (descriptors and classifiers) indicates more competitive performance in relation to that available in the specialized literature.Some architectures such as DenseNet-121 [58], Inception-V3 [59], ResNet-50 [60] and VGG-19 [61] can still be explored to provide deeplearned features, even with the success provided in the direct classification of some types of histological images [23,[62][63][64].EfficientNet [65] has not yet been fully explored with the approaches presented here.Therefore, the strategies and conditions previously presented are useful to make knowledge comprehensible to the specialists focused on developing and improving CAD systems.Moreover, the proposed scheme to identify, select and classify the main combinations is based on aspects widely discussed in information theory, image processing and pattern recognition, especially to obtain more robust baseline schemes in this context of histological samples.
In this work, a computational scheme was defined to identify the most relevant feature ensembles in order to complement and improve the development of CAD systems focused on H&E images.The handcrafted descriptors were defined using multiscale and multidimensional fractal techniques (percolation, fractal dimension and lacunarity) to quantify the original H&E samples and the corresponding LIME and Grad-CAM representations.The deep-learned features were obtained from the DenseNet-121, EfficientNet-b2, Inception-V3, ResNet-50 and VGG-19 architectures.The descriptors were analyzed based on different ensembles, considering the ReliefF algorithm with an ensemble of classifiers (support vector machine, naive Bayes, random forest and K-nearest neighbors).The proposed methodology was applied to distinguish histological samples representative of breast cancer, colorectal cancer, oral dysplasia and liver tissue.The information and conditions obtained were detailed from each experiment.The main contributions of this work are: • A computational scheme capable of indicating the main ensembles of descriptors for the study of histological images, exploring the ReliefF algorithm and multiple classifiers; • An optimized ensemble of deep-learned features with the best results for classifying colorectal cancer, liver tissue and oral dysplasia, using a reduced number of features (up to 25 descriptors); • Indications of the discriminative power of ensembles based on fractal features from the LIME and CAM representations; • Solutions without overfitting and a more robust baseline scheme, with the necessary details for comparisons and improvements of CAD systems focused on H&E images.
In the Section 2 of this paper, the methodology is described in detail, and in Section 3, the results are presented and discussed.The conclusion is drawn in Section 4.

Materials and Methods
In this section, the main steps for the proposed scheme are described, exploring the combined use of deep-learned features via transfer learning with fractal descriptors obtained from original H&E images and their CAM and LIME representations.In the first step, the CNN architectures were defined, and the output layer was fine-tuned to match the classes available on each dataset.Also, in the same step, the CAM and LIME representations were generated considering the fine-tuned models.The second step defined the extraction of the deep-learned features from the selected architectures and the multiscale and multidimensional fractal features using the original H&E images and their xAI representations.In the third step, the extracted features were combined (feature vectors) in different ensembles in order to identify the most relevant information in each context.In the fourth step, the features from each ensemble were ranked and selected to define the best solutions with a reduced number of descriptors.In the fifth and last step, the discriminative capacities of the optimized ensembles were verified through a heterogeneous ensemble with four classifiers.An overview of the proposed scheme is illustrated in Figure 1, with details presented in the next sections.

Datasets
The proposed scheme was tested on four H&E-stained histological datasets, representatives of breast cancer (UCSB) from [66], colorectal cancer (CR) from [67], liver tissue (LG) from [68] and oral epithelial dysplasia (OED) from [69].The main details about these datasets are in Table 1, with some samples presented in Figure 2 in order to illustrate each context.Five CNN architectures were considered in the present scheme: DenseNet-121 [58], EfficientNet-b2 [65], Inception-V3 [59], ResNet-50 [60] and VGG-19 [61].These models were chosen considering their different image classification strategies, which provided a broader investigation of the proposed scheme with different deep-learned features.All models were obtained from the PyTorch library, with details presented in Table 2, including accuracy values achieved on the ImageNet dataset [70].
The fine-tuning step was applied to map the last layer of each CNN with the available classes, changing the final connections and weights corresponding to the total number of groups in each H&E dataset (Table 1).This strategy avoided the full network training stage and made it possible to investigate datasets with a reduced number of images.This process was performed with a k-fold cross-validation (k = 10), which divided the input in k folds of approximately the same size in order to train the model from several possible samples.For each of the k iterations of the training process, a fold was used as evaluation data and the other k − 1 as training data.Each iteration should train a new model, and the output selected was the model with the highest accuracy when classifying test data left out of the process's training input.This technique was applied to reduce the possibility of overfitting.LG from [68], male (e) and female (f); OED from [69], healthy (g) and severe (h).
In addition, each training considered 10 epochs, using the stochastic gradient descent (SGDM) strategy, and an initial learning rate lr = 0.01 with a reduction factor of 0.75 every 2 epochs; the cross-entropy function was used to calculate the adjustment on the parameters.This was repeated for each permutation of architecture and dataset.It is important to highlight that the input images were normalized considering the standard deviation and average of the ImageNet dataset's color channel values to match the methodology used on the model's pretraining [72].Finally, the resulting fine-tuned models were the ones that achieved the highest accuracy for the evaluation set, independent of the epoch.

xAI Representations: LIME and Grad-CAM
To obtain each xAI representation, LIME and Grad-CAM methods were applied to every image using the fine-tuned model for the corresponding dataset.The Grad-CAM representations were defined through the last convolutional layer of each CNN architecture, using the approach of [73].This choice was based on the idea that the deepest layers contained values related to global patterns on the input [74].The output was a map, converted into a heatmap, with the weights indicating the contribution of each pixel to the final classification.In Figure 3, the Grad-CAM representations of some H&E images are illustrated.For the LIME method, the results were defined via 1000 local disturbances and using the quick-shift segmentation algorithm [75].The obtained representation indicated an image with five regions of interest, or superpixels, that were most relevant to explaining the classification result.In Figure 4, some examples of LIME representations obtained from H&E images are illustrated.

Step 2-Feature Extraction
In this study, the attributes were defined from three origins to compose the ensembles: fractal features from the H&E images; deep-learned features from the layer preceding the output on multiple CNN models; and fractal features from the LIME and CAM representations.These three groups were identified as handcrafted features, deep-learned features and xAI features, respectively.

Handcrafted Features: Multiscale and Multidimensional Fractal Techniques
The quantification was carried out using multiscale and multidimensional fractal techniques, specifically the fractal dimension, lacunarity and percolation approaches.The fractal dimension was based on the idea of expanding Euclidean concepts, in which measurements were contained in an n-dimensional space, with n being an integer greater than zero.The fractal dimension quantified the amount of space filled, indicating the roughness of the structure under analysis.Lacunarity is a complementary measure to the fractal dimension, quantifying the distribution and organization of pixels contained in an image.The lacunarity values represented how the patterns were organized at different observation scales.Percolation is a physical concept that can be observed in the movement and filtering of fluids through porous materials.A classic example of this phenomenon is water flowing through a glass of coffee powder.Considering the quantification process, this concept was explored to indicate the number of clusters, image porosity and cluster size [32].
Fractal techniques were calculated from probability matrices, responsible for storing the probabilities of an image containing a square region filled with each of the shapes that constituted it.This filling was verified via a distance relationship between pixel values under analysis and the size of the area in question [76].Each matrix was obtained using the gliding-box method, which consisted of sliding a square box of side r across the entire image and checking whether the pixels were inside or outside the box [77].Thus, given an r-sided box with a central pixel p c , a pixel p was considered when its distance d in relation to p c was less than or equal to r.This process resulted in a frequency distribution matrix N(m, r), with m representing the number of pixels within a box of side r.The probability matrix P(m, r) was obtained through the normalization of N(m, r), according to Equation (1), dividing the value of each count by n r , which represented the total number of boxes of side r contained in the image with the application of the gliding box, as seen in Equation (2).
n r = (width − r + 1) * (height − r + 1). (2) The previously described strategy indicated a multiscale quantification due to the variation in r.In addition, as described by [78], in the proposed scheme, the Chebyshev distance or chessboard was applied to calculate d.Finally, as the application context uses colored images with RGB color space, the multidimensional strategy was applied considering a pixel via a 5-dimensional representation, such as (x, y, r, g, b) [78].
From these procedures, the fractal dimension FD was calculated considering the estimate of boxes of side r necessary to overlay an image, as shown in Equation (3).The lacunarity L was calculated from the first and second moments of the probability matrix, obtained according to Equations ( 4) and ( 5), respectively.These moments were combined using Equation ( 6) and resulted in the measurement of the lacunarity on a specific observation scale.
Percolation measurements were also extracted using the gliding-box algorithm, following the strategy of [32].Therefore, given a box with side r, the pixels contained in the box were considered pores and labeled using the Hoshen-Kopelman [79] algorithm.Pores with the same label were understood as part of the same cluster.This process was repeated for each box of side r.Thus, it was possible to compute the metrics C(r), which indicated the average number of clusters present in each box; Q(r), which defined the average coverage ratio of the largest cluster; and P(r), which provided the ratio of percolating boxes.
The C(r) metric was obtained via an average count of the number of clusters c (present in each box i), according to Equation (7).The metric Q(r) was obtained considering an average of the size of the largest cluster |c max | in each of the i boxes, according to Equation (8).Finally, the ratio of percolating boxes was defined by dividing the number of boxes in which percolation occurred by the number of boxes computed in an image.Percolation in a box p i occurred when the ratio between the number of pores Ω i and the total number of pixels in the box r 2 exceeded the percolation threshold 0.59275 [12,32], as shown in Equation ( 9).Thus, P(r) was defined via Equation (10).
It is important to highlight that the quantification carried out in this work used matrices defined from boxes with side r, within the 3 ≤ r ≤ 41 range, according to descriptions presented by [12,78].Also, the r parameter was set to an odd value to guarantee the existence of a central pixel in each box.Therefore, the increase in the value of r was of two units in each iteration.The probability matrix based on these parameters guaranteed a quantification on 20 different scales.
In addition, both lacunarity functions and percolation measures were also interpreted as scalar values in order to obtain representative descriptors of possible patterns existing in each observation [12,32,78,[80][81][82][83].In these proposals, the authors were able to point out how some of these curves displayed a distinct behavior for each of the classes, making them relevant to the classification process.These features were defined as:

•
Area under the curve (A): It indicates the complexity of the texture.For a discrete function consisting of N points defined in x 1 , . . ., x n , this descriptor can be obtained via Equation (11), with a and b as the point indices that delimit the analysis range; • Skewness (S): it is defined via Equation (12), where N is the number of points in the function, x i is the i-th point in the function, x is the average of the function values, and a and b are the indices of the points that delimit the interval; • Area ratio (R): From the asymmetry, the ratio between the halves of the area under the curve must also present similar values for similar classes.This descriptor was obtained through Equation ( 11), with a and b indicating the points that delimit the interval; ) • Maximum point: It indicates the value in the largest heterogeneous area of the curve.Thus, images from the same class can present similar values, for both f (x) and x.
Totally different values are expected for different classes.
In summary, the fractal descriptors were organized into a feature vector with 116 descriptors: 20 fractal dimensions; 20 lacunarities; 20 average numbers of clusters present in each box; 20 average ratios covered by the largest cluster on each box; 20 percentages of boxes percolated; and 16 curve descriptors, 4 for each of the functions L(r), C(r), Q(r) and P(r).

Deep-Learned Features
In this study, the deep-learned features were extracted from five CNN architectures: DenseNet-121 [58], EfficientNet-b2 [65], Inception-V3 [59], ResNet-50 [60] and VGG-19 [61].The results were five vectors of deep-learned features.It is important to highlight that in general, the values in the initial layers of a CNN define the quantification of local patterns, such as shape, edge and color.The deeper layers are useful for identifying global patterns, such as texture and semantics [74].Therefore, in order to explore the global patterns, the normalization layer after the last dense block was chosen from the DenseNet-121 architecture.This layer provided 1024 values.From the EfficientNet-b2, Inception-V3 and ResNet-50 architectures, each final average pooling layer contributed 1408, 2048 and 2048 values, respectively.From VGG-19, the extraction occurred in the last fully connected layer before the output, providing a vector with 4096 values.

Step 3-Feature Ensemble
This step was defined to organize the handcrafted and deep-learned descriptors with the corresponding ensembles, concatenating the features in order to analyze their discriminative capabilities in each of the H&E datasets [84].It is important to emphasize that each representation obtained through xAI techniques was quantified with fractal approaches (Section 2.3.1),completing the handcrafted set.For each of the five CNN architectures, two feature vectors of xAI representations were defined via fractal techniques, with one from the Grad-CAM images and the other from the LIME images.Each vector considered 116 descriptors (described in Section 2.3.1):20 fractal dimensions; 20 lacunarity values; 20 average numbers of clusters present in each box; 20 average ratios covered by the largest cluster on each box; 20 percentages of boxes that percolated; and 16 curve descriptors, 4 for each of the functions L(r), C(r), Q(r) and P(r).
In this context, the descriptors explored here were organized into 55 distinct compositions, with 16 individual compositions according to their origins and the number of available descriptors.Among the vectors, three groups of distinct origins were de  3. The remaining 39 feature vectors were created with ensembles of these individual vectors by aggregation, as illustrated in Figure 5.The ensembles were divided into three groups: an ensemble of handcrafted and deep-learned features; an ensemble of deeplearned features; and an ensemble of xAI features.The first group was composed of six vectors, five of them associating the fractal vector with each of the deep-learned features from a CNN, and the other was composed via fractal vector with all the deep-learned features from all CNN models, as presented in Table 4.The ensemble group of deep-learned features was composed of 11 vectors, with 10 permutations of deep-learned features from 2 distinct architectures and one with all available deep-learned features, as shown in Table 5.The last group contained 22 vectors with ensembles of xAI representation descriptors, organized according to the methodology applied to deep-learned ensembles (see Table 6).

Step 4-Feature Selection
The results achieved in Step 3 were vectors with high dimensions, which can lead to classifications with overfitting [43].In order to avoid this problem, the solution was to apply a dimensionality reduction based on the ranking of the ReliefF technique [85].This technique was considered to control the number of descriptors under analysis and due to its success found in other studies [45,47,86].It is important to highlight that the analyzed vectors had dimensions between 116 and 10,740.In this context, considering that the total samples available in our experiments ranged from 58 to 265 images (Table 2), the tests were carried out with totals between 5 and 25 descriptors, starting the analysis process with the maximum value of descriptors and exploring decrements of 5 descriptors, until reaching the smallest dimension [83].

Step 5-Classifier Ensemble and Evaluation Metrics
The last step consisted of carrying out the classifications based on the vectors with an ensemble of heterogeneous classifiers, representative of different categories: support vector machine (SVM) [87], based on functions; naive bayes [88], based on probabilities; random forest [89], based on decision trees; and instance-based K-nearest neighbors [90].
The classifications were combined based on the average of probabilities.In this strategy, the average of all probabilities resulting from each of the classifiers was calculated, so the class with the highest average was selected as the answer (assigned class).
The results were analyzed considering the accuracy metric capable of indicating the global performance of the model (among all the classifications, how many the model classified correctly) [91].Accuracy was determined through Equation ( 14), where true positives (TP) are the positive values that the model correctly classified as positive; true negatives (TN) are the negative values that the model correctly classified as negative; false positives (FP) are the negative values that the model incorrectly classified as positive; and false negatives (FN) are the positive values that the model incorrectly classified as negative.
In addition, the classification process also considered the k-fold cross-validation approach, with k = 10.It is important to highlight that the descriptor selection process was applied to each training set of each fold.The selected descriptors were used to perform the classification on the corresponding test set.This technique was applied to reduce the possibility of overfitting occurring, since the result was obtained through the classification average [92].Figure 6 illustrates this cross-validation process with the selection of descriptors.

Software Packages and Execution Environment
In the proposed scheme, the Pytorch 1.9.0 [71] machine learning library was used to define the convolutional network models.Grad-CAM explanations were obtained via the pytorch-grad-cam 1.3.1 library [93].The LIME explanations were obtained using the lime 0.2.0.1 library [94], implemented in python [41].The application of each model was carried out in a remote environment on Google Colab, which provided a free console for executing codes in Python3 and a GPU for processing with 12 GB of available memory.The fractal descriptors were implemented and executed in parts in Matlab R2019b [95] and Python3.Feature selection and classification were processes carried out using the Weka v3.8.5 platform [96].For this last part, the executions were carried out using an Intel® Core™ i5-8265U 1.60 GHz processor and 8 GB of RAM (Intel, Santa Clara, CA, USA).

Results and Discussion
The proposed scheme was tested on datasets of histological images (Section 2.1), with comparisons of benign versus malignant (UCSB and CR datasets), healthy versus severe (OED dataset), and male versus female (LG dataset).A performance overview is presented in Section 3.2, with results through the main associations against those from CNNs applied directly to classify the samples and from studies available in the specialized literature.

Feature Ensemble Performance
The combinations were tested according to details presented in Section 2.4, and the 10 highest accuracy values using the smallest number of descriptors were computed to define the average performance in each category.The average performance values are in Table 7 with the highest values in bold.From the average values, it is noted that the ensemble of deep-learned features provided the highest values in all datasets, with an emphasis on the CR dataset (average accuracy of 100%).In the other datasets, this type of solution indicated accuracy values of 99.66% (LG), 97.23% (OED) and 92.93% (UCSB).Moreover, these facts confirm the feasibility of using the transfer learning strategy in order to obtain relevant compositions for the analysis of histological images.In addition, the ensemble of handcrafted with deep-learned features was the second association capable of indicating relevant accuracy values for the CR (99.76%),LG (99.02%) and OED (96.49%) datasets.Classifications using descriptors based on the xAI representations presented less expressive results, with average accuracy values between 78% and 89% (approximately), but with clear indications about the discriminative potential that can be explored on other research fronts.These facts are relevant contributions to support solutions without overfitting and more robust baseline schemes commonly explored for the improvement of CAD systems focused on H&E images.
These results were analyzed with the Friedman test in order to verify whether there were statistically significant differences between the compositions.The p-value was contrasted against α = 0.05.Thus, if p-value < α, the difference was considered significant.The test was based on the averages of the first 10 results present in each descriptor category.Thus, taking into account a comparison among the six types of feature vectors, the p-value was 0.0033, indicating statistically significant differences, with an emphasis on the ensemble of deep-learned features against handcrafted, xAI and xAI ensemble.Each pairwise p-value is displayed in Table 8.The differences between the obtained results via an ensemble of deep-learned features and those from an ensemble of handcrafted with deep-learned features were not statistically significant.In addition, the Friedman test provided an average ranking between the performance values of the main associations tested here.The best-ranked combination was the ensemble of deep-learned features, with average accuracy values ranging from 92.93% (UCSB) to 100% (CR).The best combination was defined based on the highest accuracy value using the smallest number of descriptors.Considering this criterion, the obtained rankings are displayed in Tables 9-12 with the first 10 solutions for the CR, LG, OED and UCSB datasets, respectively.It is noted that the average ranking of the solution validates its position.
In relation to the data collected from the CR dataset (Table 9), it is possible to observe that the 10 best results indicated an accuracy of 100%.It is noted that the descriptors from the DenseNet-121 (D) and EfficientNet-b2 (E) models were those that contributed the most to these results.In these cases, the vectors were defined with a maximum of 20 descriptors, minimizing overfitting.
When the LG dataset was considered (Table 10), it was observed that only one combination indicated an accuracy value of 100%: the ensemble of deep-learned features from the DenseNet-121 (D) and ResNet-50 (R), exploring only 25 descriptors.The contribution of descriptors from the DenseNet-121 was a highlight, present in all vectors of the best results.For the OED dataset (see Table 11), the highest accuracy was 97.97% with an ensemble of descriptors from the Inception-V3 (I) and VGG-19 (V) architectures, using a reduced number of attributes (20 features).Another combination that achieved the same performance considered the descriptors from the EfficientNet-b2 (E) and Inception-V3 (I) models.However, this last combination involved 25 attributes.In addition, Inception-V3 (I) was the architecture that contributed the most to the best results, followed by DenseNet-121 (D).The main vectors were also defined with up to 25 descriptors.
Finally, considering the UCSB dataset, which is shown in Table 12, it is noted that the highest accuracy value was 94.83%.This performance was defined via the association of deep-learned features from DenseNet-121 (D) with EfficientNet-b2 (E) exploring a total of 25 descriptors.From the top 10 solutions, it is observed that nine compositions were defined based on the DenseNet-121 or ResNet-50 architectures.The total number of descriptors present in the vectors follows the previously found pattern consisting of solutions with a maximum of 25 attributes.

Feature Summary
The best result was defined through the ensembles of deep-learned features, considering the top 10 solutions for each H&E dataset.For the CR dataset (Figure 7a), it is observed that the compositions were defined mainly based on the EfficientNet-b2 (E) model, representing from 60% to 90% of the features present in 7 of the top 10 solutions.Regarding the LG dataset (Figure 7b), the obtained vectors from the DenseNet-121 (D) architecture predominated (9 of the 10 compositions), indicating from 48% to 100% of the total features in each ensemble.When the OED dataset (Figure 7c) was considered, the highest frequency was obtained through Inception-V3 (I), present in 7 of the 10 compositions, but with a smaller presence of features in the ensembles (from 10% to 40%).Finally, the compositions for the UCSB dataset (Figure 7d) were based on more homogeneous compositions among DenseNet-121 (D), EfficientNet-b2 (E) and ResNet-50 (R).However, in each ensemble, there was a higher incidence of features from the ResNet-50 (R) architecture (from 73% to 100% of the total descriptors).Based on the details presented previously, some patterns and/or behaviors were observed.The ensembles explored here effectively contributed to a distinction between the H&E images, surpassing the results provided via single-source descriptors.The ensemble of deep-learned features supported the main solutions, with highlights, in terms of occurrence, for the descriptors extracted from the DenseNet-121 (present in 28 of the 40 vectors) and EfficientNet-b2 (present in 19 of the 40 vectors) models.Finally, considering the dimensionalities of the feature vectors, compositions involving 10 to 25 deep-learned features were sufficient to determine the top 10 solutions.

Proposed Scheme versus Fine-Tuned CNN Classifications
In order to verify possible gains of the proposed scheme in relation to CNN models applied directly to each H&E dataset, the performance values of the architectures (DenseNet-121 [58], EfficientNet-b2 [65], Inception -V3 [59], ResNet-50 [60] and VGG-19 [61]) were collected after the fine-tuning process.The methodological details were presented in Section 2.2.In addition,  show the best accuracy rates via CNNs against those of the proposed scheme.It is important to highlight that other experiments could be defined, involving different conditions and limits, but those carried out here provided sufficient information to support the contributions of our investigation.When image classifications were given directly by the CNN architectures, the DenseNet-121 and EfficientNet-b2 models indicated the lowest performance values, ranging from 67.81% to 79.26% and from 54.35% to 80.35%, respectively.The VGG-19 and ResNet-50 networks provided the best performance: LG (88.29%) and OED (97.59%) classified via VGG-19; CR (91.26%) and UCSB (88.52%) through the ResNet-50.However, these accuracy values were lower than those achieved through the proposed scheme, exploring an ensemble of deep-learned features with four heterogeneous classifiers.Also, through the information detailed here, it was noted that the classification performance via a CNN model was not directly related to that achieved through deep-learned features used in an independent process.Most of the best results explored features from the DenseNet-121 and EfficientNet-b2 networks, indicating the lowest performance.These results confirm the contributions obtained in this study.

Performance Overview in Relation to the Literature
Different techniques are available in the specialized literature in order to investigate patterns in histological images, such as for the CR, LG, OED and UCSB datasets [12,13,19,55,69,[97][98][99][100].Therefore, an illustrative overview is important to show the quality of our proposal, indicated in Tables [13][14][15][16] for each of the datasets.From this illustrative overview, it is possible to conclude that the proposed scheme provided solutions that surpassed a single type of descriptor or even other relevant associations [19,55,69,97,99,100,106,112].Moreover, the computational scheme was capable of indicating optimized ensembles with the best results for classifying colorectal cancer, liver tissue and oral dysplasia, considering a maximum of 25 descriptors.Finally, these solutions without overfitting and with more robust baseline schemes are for improving CAD systems focused on H&E images.These contributions complement the proposal presented here in this illustrative overview.

Conclusions
In this work, a computational scheme was developed in order to define the main ensembles of descriptors for the study of histological images, exploring their ranking based on the ReliefF algorithm with a robust ensemble of classifiers (four heterogeneous algorithms).The handcrafted descriptors were established from multiscale and multidimensional fractal techniques (fractal dimension, lacunarity and percolation) and applied to quantify H&E images and their Grad-CAM and LIME representations.The deep-learned features were obtained from multiple CNN architectures, considering the transfer learning strategy.The experiments were carried out on H&E images, representative of breast cancer, colorectal cancer, oral dysplasia and liver tissue.
From the results, the ensemble of deep-learned features provided the highest values in all datasets, with accuracy rates of 94.83% (UCSB), 97.97% (OED) and 100% (CR and LG), exploring a reduced number of features (up to 25 attributes).The descriptors were mainly obtained from the DenseNet-121 and EfficientNet-b2 architectures.In addition, the proposed scheme also indicated that handcrafted ensembles with deep-learned features provided expressive distinctions in the contexts of multiple histological images, with accuracy rates of 99.76% (CR), 99.02% (LG) and 96.49% (OED).This type of composition indicated better performance values than those achieved through individualized analyses, commonly observed in the specialized literature, whether exploring only deep learning [102,103,106,112] or handcrafted techniques [33,69,74,107].In both categories of ensembles, this study provided useful details and conditions for the community interested in the development and improvement of models for classifying patterns in H&E samples.In relation to the experiments exploring xAI representations, the results were less expressive, with average accuracy values ranging from 78% to 89%.On the other hand, this type of composition achieved results similar to those of directly applied networks, responsible for providing the xAI representations.Thus, we believe that there are still several study avenues to understand the full information capacity present in this type of representation, seeking to improve CAD system designs focused on histological images.
The best solutions were analyzed in relation to the results obtained from consolidated machine-learning techniques, directly applying CNN models to classify the histological datasets.This process considered the DenseNet-121, EfficientNet-b2, Inception-V3, VGG-19 and ResNet-50 architectures.The results were accuracy values from 54.35 to 97.59.The VGG-19 and ResNet-50 networks indicated the best rates: LG (88.29%) and OED (97.59%) via VGG-19; CR (91.26%) and UCSB (88.52%) through ResNet-50.These performance values were lower than those achieved through the best solutions with the proposed scheme.When an illustrative overview was considered in relation to the specialized literature, it was possible to conclude that the proposed scheme provided solutions that surpassed a single type of descriptor or even other relevant associations [55,69,97,106,112].Moreover, the computational scheme was capable of indicating optimized ensembles with the best results for classifying colorectal cancer, liver tissue and oral dysplasia.Therefore, these conditions highlight the ability of the proposed scheme to present solutions without overfitting and a more robust baseline scheme, with the necessary details for the analysis and testing of CAD systems, focused on H&E samples.Regarding classifications involving representative samples of breast cancer (UCSB dataset), the proposed scheme provided a lower performance, indicating a possible limit of the main solution in this context.
In future works, it is intended to (1) expand the number of handcrafted techniques to quantify H&E images and their representations, especially to define the possible limits involving combinations via xAI; (2) carry out new tests after applying adjustments to the parameters of the CNN architectures in order to verify their impacts on the xAI representations and corresponding quantification; (3)

Figure 1 .
Figure 1.An overview of the proposed scheme.

Figure 3 .
Figure 3. Examples of xAI representations based on the Grad-CAM technique, with (a) the original image, (b) the weights mapped to indicate the contribution of each pixel, (c) the mapping transformed into a heatmap and (d) the heatmap overlaying the original image.

Figure 4 .
Figure 4. Examples of LIME representations with (a) indicating the original image and (b) the five selected superpixels for the explanation.

Figure 5 .
Figure 5. Illustration of the ensemble of features by aggregation.

Table 4 .
Ensembles of handcrafted and deep-learned features.+ E + I + R + V 10,740

Figure 6 .
Figure 6.Illustration of the k-fold cross-validation strategy applied in this step.

Figure 7 .
Figure 7. Proportion of features from each CNN in the ensembles, observing the 10 best accuracy values for the CR (a), LG (b), OED (c) and UCSB (d) datasets.

Figure 8 .
Figure 8. Ranking of accuracy values (%) provided by different approaches in classifying the CR dataset.

Figure 9 .
Figure 9. Ranking of accuracy values (%) provided by different approaches in classifying the LG dataset.

Figure 10 .
Figure 10. of accuracy values (%) provided by different approaches in classifying the OED dataset.

Figure 11 .
Figure 11.Ranking of accuracy values (%) provided by different approaches in classifying the UCSB dataset.

Table 1 .
Main details of the H&E-stained histological datasets.

Table 2 .
[71]ils of the pretrained architectures and respective accuracies (data from[71]) explored to define the deep-learned features and xAI representations.

Table 3 .
Feature vectors obtained via different techniques.

Table 5 .
Ensembles of deep-learned features.
E LI ME + I LI ME + R LI ME + V LI ME 580

Table 7 .
Average accuracy values (%) computed from the 10 best results in each combination and for each type of H&E image.

Table 8 .
Ranking of the best associations with their respective pairwise p-values, according to average accuracy values and the Friedman test.

Table 9 .
Top 10 resultsfor the classification of the CR dataset with feature vectors composed of ensembles of deep-learned features.

Table 10 .
LG dataset: top 10 results exploring the ensembles of deep-learned features.

Table 11 .
OED dataset: top 10 results via the ensembles of deep-learned features.

Table 12 .
UCSB dataset: Top 10 results exploring the ensembles of deep-learned features.

Table 13 .
Classification of colorectal samples: accuracy values (%) provided by different approaches.

Table 14 .
Classification of liver tissue: accuracy values (%) via different approaches.

Table 15 .
Classification of oral dysplasia: accuracy rates (%) provided by different methods.
explore multiview learning approaches to complement multiple representations, including investigations into possible gains after applying learning enrichment strategies with fractal techniques.