A Hierarchical Feature-Based Methodology to Perform Cervical Cancer Classification

Abstract: Prevention of cervical cancer could be performed using Pap smear image analysis. This test screens pre-neoplastic changes in the cervical epithelial cells; accurate screening can reduce deaths caused by the disease. Pap smear test analysis is exhaustive and repetitive work performed visually by a cytopathologist. This article proposes a workload-reducing algorithm for cervical cancer detection based on analysis of cell nuclei features within Pap smear images. We investigate eight traditional machine learning methods to perform a hierarchical classification. We propose a hierarchical classification methodology for computer-aided screening of cell lesions, which can recommend fields of view from the microscopy image based on the nuclei detection of cervical cells. We evaluate the performance of several algorithms against the Herlev and CRIC databases, using a varying number of classes during image classification. Results indicate that the hierarchical classification performed best when using Random Forest as the key classifier, particularly when compared with decision trees, k-NN, and the Ridge methods.


Introduction
The World Health Organization recently estimated 605,000 new cases and 342,000 deaths from cervical cancer worldwide [1]. Over the years, the use of the Pap smear test for population-based cervical cytological screening has shown remarkable success in the early detection of such cancers; despite this, there is much to improve within this program [2,3].
The Papanicolaou exam, commonly known as the Pap smear, identifies pre-neoplastic changes in the cervix's desquamated cells based on several cytomorphological and clinical criteria. The main criteria are based on nuclear characteristics, such as nuclear augmentation, irregularity of the nuclear membrane, nuclear hyperchromasia, and relation of the nucleus and cytoplasm sizes [4].
In the laboratory routine, a cytopathologist evaluates up to 300,000 cervical cells in a single smear [4], and the workload can reach 100 smears per day. The recommended daily workload varies by country: 80 smears/day in Canada, 70 smears/day in Brazil, and 100 smears/day in the United States [5,6]. This scenario involves tiring and repetitive work that leads to errors inherent in human visual interpretation. Investigations conducted since before the 1990s show false-negative rates of 2% to 62% in Pap test results [7][8][9][10][11].
To overcome these limitations and improve the quality of screening exams, computer vision and computer-aided systems are used to analyze Pap smear images, making the process more accurate and reliable [12]. One of the great difficulties in proposing such systems is the need for robust data: many real images of cervical cells, properly labeled by cytopathologists using the widespread Bethesda System nomenclature. However, the existing Papanicolaou examination image datasets have limitations, including synthetic images, images without classes, images with pre-neoplastic or incomplete alteration, images with single cells, and liquid-based cytology images. The most widely used base, Herlev [13], has images with a single cell and a division into seven pre-neoplastic classes that do not follow the most-used nomenclature; the ISBI database [14] has simulated images and images without pre-neoplastic changes; SIPaKMeD [15] divides its images into five categories that differ from the Bethesda System.
Many authors have proposed solutions to this problem of detecting cervical cells, using synthetic databases or working with databases that do not represent the reality of conventional Pap smear images, in which there are many cells, often overlapping, in a single image [16][17][18][19][20][21][22][23][24]. Therefore, the investigation of methodologies capable of being applied in the real context of cervical cancer screening is still a great challenge.
Performing cell classification is one step in constructing a decision-aid tool for analyzing the Pap smear test. Some authors perform the cell classification with traditional machine learning [25][26][27], and others employ convolutional neural networks [17,23,28,29].
Diniz et al. [30] proposed a methodology using Simple Linear Iterative Clustering (SLIC), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Iterated Local Search (ILS) algorithms to segment nuclei in synthetic images based on their morphologic features. Using the irace package of López-Ibáñez et al. [31], Diniz et al. [16] concluded that the important features for the methodology were minimum circularity, maximum intensity, and minimum area.
Ghoneim et al. [17] proposed a methodology based on the Shallow, VGG-16, and CaffeNet architectures to extract cervical cell characteristics. They also used the Extreme Learning Machine and Autoencoder to classify the cells into two or seven classes.
Lin et al. [18] presented a CNN-based method to classify cells based on their appearance and morphology. They analyzed different input images for the proposed method: a 2-channel image (the nucleus and cytoplasm masks), a 3-channel image (the RGB image), and a 5-channel image that joins the 2-channel and 3-channel images. The authors showed experimentally that 5-channel input images improve the classification.
Di Ruberto et al. [32] analyzed different descriptors used to extract image features from seven databases representing different computer vision problems. They used a k-NN model to evaluate Hu, Legendre, and Zernike moments, Local Binary Patterns (LBP), and co-occurrence matrix features. The authors concluded that extracting the invariant moments from the Gray Level Co-occurrence Matrices (GLCM) improves the overall accuracy. They also observed that extracting the descriptors from RGB images works better than extracting them from grayscale images.
Ensemble methods consult several classifiers before making a final decision and have been used by many researchers in bioinformatics. Bora et al. [24] introduced an ensemble method that uses Least Squares Support Vector Machine (LSSVM), Multilayer Perceptron (MLP), and Random Forest (RF) to construct a decision model based on shape, texture, and color features.
Gómez et al. [19] compared several algorithms to classify cervical cancer cells into two classes: normal and abnormal. They used 20 morphologic features and found that the algorithm combinations Bagging + MultilayerPerceptron and AdaBoostM1 + LMT performed best among the scenarios they analyzed.
Lakshmi and Krishnaveni [20] presented a method to extract nuclei and cytoplasm features of Pap smear images. Attributes such as center, perimeter, area, and average intensity were considered. The method uses the expectation-maximization (EM) algorithm and a Gaussian mixture model (GMM). Finally, the authors state that the method can be used to determine the cancer stage and be efficient for classifying cervical cells that present low-grade squamous intraepithelial lesion (LSIL) and high-grade squamous intraepithelial lesion (HSIL).
Win et al. [21] applied a median filter to the images to remove noise and used Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance the contrast. The k-Means algorithm was implemented to segment the nucleus and cytoplasm regions of cervical cells. From these regions, 38 characteristics of texture, shape, and color were extracted. Attributes were selected using the Random Forest method. Next, the cells were classified into two and seven classes using the ensemble bagging method. The authors compared the approach with five classifiers (LD, SVM, k-NN, boosted trees, and bagged trees) and showed that their method performed better.
Hussain et al. [22] explored AlexNet, VGG-16, VGG-19, ResNet-50, ResNet-101, and GoogLeNet for the classification of cervical lesions. The authors also proposed an ensemble method of the three best models. They found GoogLeNet to be the best individual architecture, and they used the AUC-ROC curve to show that the ensemble improved the results.
This article proposes the classification of pre-neoplastic cervical lesions based on features extracted from nuclei. The main contributions of this work can be summarized as: The outline of the paper is as follows. Section 2 exhibits the materials and methods considered. Section 3 displays the computational experiments and their results and discussion. Finally, Section 4 presents the conclusions of this work.

Materials and Methods
This section presents the materials and methods considered. Section 2.1 presents the cervical cell databases, Herlev and CRIC, used for lesion classification. Section 2.2 exhibits how the features were extracted and analyzes the correlation between the handcrafted and biological nuclei features. Section 2.3 presents the classification groups of each database used in the experiments. Section 2.4 shows the oversampling techniques analyzed in the experiments. Section 2.5 points out the classifier methods used. Finally, Section 2.6 shows the hierarchical classification structure proposed for nuclei classification.

Database
This work deals with two databases of cervical cells: (i) Herlev, well known and used in the literature, and (ii) CRIC, a new database with nucleus and cell segmentation results in smear images.
The Herlev database [13] (http://mde-lab.aegean.gr/index.php/downloads (accessed on 24 January 2021)) was collected at the Department of Pathology of Herlev University Hospital and the Department of Automation at the Technical University of Denmark. It consists of 917 single cervical cell images, divided into seven classes: superficial squamous epithelial; intermediate squamous epithelial; columnar epithelial; mild, moderate, and severe squamous non-keratinizing dysplasia; and squamous cell carcinoma in situ intermediate. Each image also has a label of its regions (nucleus and cytoplasm). Figure 1 shows a Herlev example image (a) and its label (b).
The CRIC Searchable Image Database (https://database.cric.com.br/ (accessed on 24 March 2021)) comprises cervical cell images and is being developed by the Center for Recognition and Inspection of Cells (CRIC) to support Pap smear analysis. It covers cervical cells from conventional cytology, labeled according to the Bethesda System, the standardized and most widely used nomenclature in the diagnosis area.
Currently, the CRIC database is divided into two collections: one containing only the marking of the cell's center (classification) and another containing the segmentation of the cell's nucleus and cytoplasm. In both cases, each cell also has its classification. Only the segmentation collection will be used in this work since the nucleus region's delimitation will be important for the methodology used. There are 400 images obtained from Pap smears, with 3233 segmentations. Figure 2 presents an example of a segmentation image.  Table 1 shows each database division, indicating the nuclei's categories and classifications. The number of nuclei in each class is also shown.
In 1941, George N. Papanicolaou created the first classification system for normal and abnormal cells (class I, II, III, IV, and V). The second system was created by James Reagan in 1953, separating the abnormal cells into mild, moderate, severe dysplasias, and carcinoma in situ. In 1967, Ralph Richart proposed the division into CIN I, CIN II, and CIN III (Cervical Intraepithelial Neoplasia). To standardize the terminologies, in 1988, the "Bethesda System" was developed and approved by the National Cancer Institute in the USA; the system underwent reviews in 1991, 2001, and 2014. With this new nomenclature system, the current terms for the classification of abnormal squamous cells are ASC-US, LSIL, ASC-H, HSIL, SC [33,34].
Herlev's database uses the second classification system, developed in 1953, while the CRIC database uses the most current one, the Bethesda System. Comparing the terminologies, mild dysplasia corresponds to LSIL, while moderate dysplasia, severe dysplasia, and carcinoma in situ correspond to HSIL. Therefore, Herlev does not include the ASC-US, ASC-H, and SC classes used in the laboratory routine today.
Our proposal uses information from the segmented nucleus to perform the classification of cells.

Biological versus Computational Features
As mentioned before, during the screening examination in a cytology laboratory, the cytopathologist manually analyzes optical images of cervical cells. Visual analysis is related to human interpretation of cervical smears, and even with many detailed procedures and routines, it is susceptible to errors of interpretation.
During the analysis, the cytopathologist assesses the variation in the smear cells' cytomorphological features. Some examples of this variation are the increase in the nucleus/cytoplasm ratio, the nuclear membrane irregularity, the nucleus hyperchromasia, and the chromatin granularity. All of them provide guidance on reporting of cytologic findings in cervical cytology in agreement with the Bethesda System [4,34].
Errors related to diagnostic interpretation happen when the cytopathologist either recognizes altered cells, but wrongly classifies them, or does not recognize them at all. Both situations may be attributed to the lack of experience of the professional, variation in the appearance of cytomorphological features, or workload, which affects the subjectivity of the process [35][36][37].
Our proposal extracts and evaluates morphological and texture characteristics of the cell nucleus that correlate with the Bethesda System's visual interpretation. The computational results can guide cell classification systems and assist lesion diagnosis and interpretation, reducing errors.
The methodology starts with a feature extraction of each nucleus segmented in the database. The following algorithms were employed: Region Props, Haralick's features, Local Binary Patterns (LBP), Threshold Adjacency Statistics (TAS), Zernike moments, and Gray Level Co-occurrence Matrix (GLCM). All were implemented in Python, in which Region Props and GLCM are from the scikit-image package [38], and the others are from the Mahotas package [39]. Unlike Di Ruberto et al. [32], we also include morphological and other texture features.
The Local Binary Patterns (LBP) [45], a set of texture features, were also extracted. The advantage of these features is that they are insensitive to orientation and lighting.
The Threshold Adjacency Statistics (TAS) [46] features were also considered in the classification. They are used to differentiate images of distinct subcellular localization quickly and with high accuracy.
The Zernike moment [47] features were extracted and considered in the proposed methodology because they measure how the mass is distributed in the region. Finally, the Gray Level Co-occurrence Matrices [44] provide texture features that consider the pixels' spatial relation.
Figure 3 shows a sample image for each type of lesion present in the CRIC database, and Table 2 presents some feature values extracted from the images in Figure 3. These features are area, eccentricity (eccent.), circularity (circ.), maximum intensity (max. int.), and contrast.
We extracted features inspired by the ones that a cytopathologist would use to perform the classification manually. Morphological features such as area, perimeter, extent, and eccentricity are important because they are related to the nuclear size, which characterizes one of the fundamental biological criteria for differentiating abnormal cells from normal ones. For example, ASC-US interpretation requires that the cells in question demonstrate nuclei approximately 2.5 to 3 times the area of the nucleus of a normal intermediate squamous cell (approximately 35 µm²) or twice the size of a squamous metaplastic cell nucleus (approximately 50 µm²). The cells interpreted as ASC-H are the size of metaplastic cells with nuclei that are up to 2.5 times larger than normal. Nuclear enlargement of more than three times the area of normal intermediate nuclei characterizes LSIL. HSIL often contains relatively small basal-type cells with nuclear augmentation. The characteristic cells of carcinoma (SC) vary markedly in area but usually show karyomegaly. Table 2 shows that the area feature behaves as observed by cytopathologists: the normal cell has the smallest area value, and the value increases according to the lesion.
Another biologically relevant feature is the nuclear membrane shape, as abnormal cells have different degrees of irregularity. ASC-US shows minimal variation in the nuclear shape, while LSIL presents a nuclear membrane contour ranging from smooth to very irregular with notches. ASC-H and HSIL show irregular nuclear contours, with the anisokaryosis of HSIL being more pronounced. Carcinoma cells may show very marked nuclear pleomorphism (bizarre forms). As a whole, abnormal cells may have multinucleation or variations in the circular shape of a normal cell's nucleus. This work considered morphological features related to these characteristics (nuclear contour and multinucleation), such as circularity, eccentricity, and minor and major axis. Table 2 shows eccentricity and circularity values as examples of features used in this work to measure the nuclear membrane's shape as it would typically be analyzed by cytopathologists. The eccentricity measures the nuclei irregularity, while the circularity value represents how circular the nuclei are. Analyzing the images in Figure 3, the least circular nucleus is the SC one, which has the smallest circularity value (0.462). Likewise, the most irregular nucleus is also the SC one, which has the largest eccentricity value (0.952).
Nuclear hyperchromasia and irregular chromatin distribution are essential biological characteristics for categorizing cells as abnormal. These characteristics also assist in differentiating among ASC-US, LSIL, ASC-H, and HSIL. The computational features directly related to these characteristics are the minimum, mean, and maximum intensities, solidity, contrast, mass distribution in the region (Zernike moments), and a set of texture features such as Local Binary Patterns, Haralick features, and Gray Level Co-occurrence Matrices. Table 2 shows the maximum intensity and the contrast (Haralick feature) values. With attributes of intensity (minimum, maximum, and mean) and texture, it is possible to estimate the chromatin distribution that the cytopathologist assesses in the manual analysis.
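To make the two shape descriptors concrete, the sketch below computes them from basic region measurements. It assumes the common definitions (circularity as 4πA/P², and eccentricity as that of an ellipse with the given major and minor axes); the source does not state the exact formulas it used, and the function names are hypothetical.

```python
import math


def circularity(area: float, perimeter: float) -> float:
    """Assumed definition 4*pi*A / P**2: 1.0 for a circle, lower for irregular outlines."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


def eccentricity(major_axis: float, minor_axis: float) -> float:
    """Eccentricity of an ellipse with these axes: 0.0 for a circle, close to 1.0 when elongated."""
    if major_axis <= 0 or not 0 <= minor_axis <= major_axis:
        raise ValueError("expected 0 <= minor_axis <= major_axis and major_axis > 0")
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)
```

Under these definitions, a round nucleus gives a circularity near 1 and an eccentricity near 0, which matches the direction of the SC values quoted from Table 2 (low circularity, high eccentricity).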
A total of 232 attributes of the cervical cell nuclei were extracted and considered in this work. A quick analysis of the attribute selection indicated that all attributes used brought benefits to the classification task; thus, all of them were used in our proposal for the nuclei classification.

Classification Groups
As shown in Table 1, the database images can be classified according to their category (normal/abnormal) or their classification (7 classes in Herlev and 6 classes in CRIC).
Based on cytopathologists' analysis of Herlev's data, this work proposes a five-class grouping in which only the abnormal cells are subdivided. Once a cell is classified as normal, it is not necessary to further distinguish its particular type. Thus, the classes superficial squamous epithelial, intermediate squamous epithelial, and columnar epithelial can be grouped as normal cells. Figure 4 shows the possible classification groups for the Herlev database, beginning with the 2-class group in Figure 4a.
Concerning CRIC, another possible classification task is grouping images into three classes: normal, low-grade lesion (ASC-US and LSIL), and high-grade lesion (ASC-H, HSIL, and SC) cells. This classification is feasible due to their common disease follow-up. Women diagnosed with low-grade cell changes should repeat the exam after a certain period, according to their age, while those diagnosed with high-grade lesions should be referred for colposcopy followed by a biopsy [48]. Figure 5 shows the CRIC classification groups considered in this work's computational experiments, beginning with the 2-class group in Figure 5a.

Oversampling
In classification problems, a database is imbalanced when the number of samples differs greatly between classes [49]. Classification algorithms are often sensitive to imbalance: they tend to favor classes with more data and ignore classes with fewer data [50,51]. Table 1 shows that the databases considered in this work are not balanced, so balancing techniques were analyzed.
The first technique used is the Synthetic Minority Oversampling Technique (SMOTE) [52], which creates artificial sample data based on neighboring interpolation to oversample the minority class. This technique considers the k-nearest neighbors of each sample $x_i$ of the minority class and creates a synthetic sample $x_{new}$ as follows:
$$x_{new} = x_i + \delta \, (\hat{x}_i - x_i) \tag{1}$$
In Equation (1), $\hat{x}_i$ is one of the k neighbors of $x_i$, chosen at random, and $\delta$ is a random number in the interval [0,1]. The new sample $x_{new}$ is a point on the line segment that connects $x_i$ and $\hat{x}_i$.
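The interpolation step described above can be sketched in a few lines. This is only an illustration of how one synthetic sample is generated, not the imbalanced-learn implementation used in the experiments; the function name and the assumption that the k nearest minority neighbors are already available are ours.

```python
import random


def smote_sample(x_i, neighbors, rng):
    """Create one synthetic sample between x_i and a randomly chosen neighbor."""
    x_hat = rng.choice(neighbors)  # one of the k nearest minority-class neighbors
    delta = rng.random()           # random interpolation factor in [0, 1)
    # Move from x_i toward x_hat by a fraction delta, feature by feature.
    return [a + delta * (b - a) for a, b in zip(x_i, x_hat)]
```

Because the factor is shared by all features, every synthetic sample lies on the straight segment between the original sample and the selected neighbor.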
Another technique was the Borderline-SMOTE [53]. The difference between the Borderline-SMOTE and the original SMOTE is that the Borderline-SMOTE only oversamples the minority-class samples near the class borderline. Finally, the third technique studied is SVM SMOTE. This technique differs from the others because it trains a Support Vector Machine (SVM) classifier and uses its support vectors to approximate the class boundaries.
All three techniques were implemented in Python using the imbalanced-learn package [54], and their results were compared according to accuracy.

Decision Tree (DT)
A Decision Tree [57] is a supervised classification method that describes the data through tree-structured patterns. For the Decision Tree, no prior relationship between the input and the goal variables needs to be assumed. Moreover, it can handle data at different scales [63].

k-NN
The k-NN [58] is a supervised learning method in which, when a new instance needs to be classified, its distance to all training instances is calculated and it receives the label that prevails among its k nearest neighbors. In this way, generalization and prediction happen only when a new instance needs to be classified (lazy learning). The distance used in the method is the Euclidean distance between points p and q, denoted $d_{p,q}$ and calculated as follows:
$$d_{p,q} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{2}$$
In Equation (2), n represents the number of features.
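A minimal sketch of the k-NN rule with the Euclidean distance is given below for illustration. It is a plain-Python toy, not the library classifier used in the experiments; the function names and the default k = 3 are assumptions.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))


def knn_predict(train_x, train_y, x, k=3):
    """Label x with the majority label among its k nearest training samples."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

All the work happens at prediction time, which is what makes the method lazy: no model is fitted beforehand, and each query scans the whole training set.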

Random Forest (RF)
The Random Forest [61] is an ensemble learning method that uses multiple decision trees for decision making. In classification problems, the label defined by most decision trees is the label given to the new instance.
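The voting step can be illustrated with a toy forest. The sketch below only shows how the majority label is selected; the three one-rule "trees" and their thresholds are hypothetical stand-ins for trained decision trees, not values from this work.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the label predicted by most trees for the feature dictionary x."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical one-rule "trees"; the thresholds are illustrative, not learned.
toy_trees = [
    lambda x: "abnormal" if x["area"] > 100 else "normal",
    lambda x: "abnormal" if x["circularity"] < 0.7 else "normal",
    lambda x: "abnormal" if x["max_intensity"] < 120 else "normal",
]
```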

Ridge
The Ridge [62] classifier converts the class labels into numeric targets and then solves the problem with a regression method. In prediction, the class with the highest predicted value is taken as the target class. For multiclass classification, multi-output regression is applied.

Hierarchical Classification
Some classification problems present hierarchical relations between classes, indicating that it is possible to divide the problem into sub-problems of less complexity that, when combined, reach the classification expected by the whole problem. These problems are known as hierarchical classification problems.
Here, we address a hierarchical classification problem because it can be reduced into a normal and abnormal classification followed by deeper classifications to discover the nuclei type. Figure 6 presents the hierarchical classification proposed in this work to classify nuclei features. Figure 6a shows the hierarchical classification of Herlev nuclei, while Figure 6b shows the hierarchical classification of CRIC nuclei. Considering the Herlev data (Figure 6a), the data can be classified first (with classifier 1 in blue) between normal and abnormal and, subsequently, the lesion can be classified (with classifier 2 in orange) into four other classes (mild squamous non-keratinizing dysplasia, moderate squamous non-keratinizing dysplasia, severe squamous non-keratinizing dysplasia, and squamous cell carcinoma in situ intermediate), and the normal ones can be classified (with classifier 3 in orange) into another three classes (superficial squamous epithelial, intermediate squamous epithelial, columnar epithelial). Therefore, the 2-class classification requires only classifier 1. In turn, the 5-class classification requires classifiers 1 and 2, while the 7-class classification requires the three classifiers.
Considering the CRIC database data (Figure 6b), the data can be classified (with classifier 1 in blue) into normal and abnormal. The lesion can be classified (with classifier 2 in orange) into low-grade lesion and high-grade lesion, which are then classified (with classifiers 3 and 4 in purple) according to their lesion's type. Thereby, the 2-class classification applies classifier 1; the 3-class, classifiers 1 and 2; and the 6-class, the four classifiers.
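The routing between classifiers for the CRIC hierarchy can be sketched as follows. This is an illustrative reading of the description above, not the code from the authors' repository: the four classifiers are passed in as hypothetical callables, and the label strings are assumptions.

```python
def classify_cric(features, clf1, clf2, clf3, clf4, n_classes=6):
    """Route one nucleus through the CRIC hierarchy down to the requested depth."""
    label = clf1(features)  # classifier 1: normal vs. abnormal
    if label == "normal" or n_classes == 2:
        return label
    grade = clf2(features)  # classifier 2: low-grade vs. high-grade lesion
    if n_classes == 3:
        return grade
    # Classifiers 3 and 4: lesion type within the low-grade or high-grade branch.
    return clf3(features) if grade == "low-grade" else clf4(features)
```

The Herlev hierarchy would follow the same pattern, with classifier 2 separating the four lesion classes and classifier 3 separating the three normal classes.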

Experiments and Results
This section discusses the experiments developed to evaluate the proposed hierarchical classification. The experiments were performed on a computer with an Intel Core i7-8700 3.20 GHz processor, 16 GB of RAM, and a 64-bit Windows operating system. The proposed hierarchical classification was implemented in Python, version 3.7.9. The code is available at https://github.com/agcbianchi/AppliedScience-Feature.
Dhurandhar and Dobra [64] investigated cross-validation performance and presented explanations concerning varying sample size, number of folds, and the correlation between input and output attributes. They pointed out that 10- and 20-fold cross-validation worked best for small datasets. Thus, our experiments applied 10-fold cross-validation. All results correspond to the average over the 10 folds.
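For readers unfamiliar with the procedure, the sketch below shows one way to generate the train/test index splits of a k-fold cross-validation. It is an assumption-laden illustration: the source does not say whether the folds were shuffled or stratified, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each sample is used for testing exactly once, and the reported value of a metric is the mean of the k per-fold values.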

Performance Metrics
Initially, the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) was calculated. TP and TN are the numbers of positive and negative classes correctly predicted, while FP and FN are the numbers of positive and negative classes incorrectly predicted. Table 3 shows the metrics used to measure the performance of the proposed hierarchical classification.

Table 3. Performance metrics.
Metric | Equation | Goal
Precision (Prec.) | TP / (TP + FP) | Indicates, among the positive ratings, the amount that is correct.
Recall | TP / (TP + FN) | Indicates the correct detection of abnormal nuclei.
F1 | 2 × Precision × Recall / (Precision + Recall) | Harmonic mean between precision and recall.
Accuracy | (TP + TN) / (TP + FP + TN + FN) | Evaluates the proportion of all the correct tests (TP and TN) over all the results obtained.
Specificity | TN / (TN + FP) | Identifies whether the method correctly excludes nuclei without lesions.

Oversampling Results
Table 4 shows the performance results obtained on the CRIC data with the balancing techniques and without any oversampling. The 6-class classification is the most challenging; therefore, it was used in this experiment, aiming to provide greater differentiation in the results. As can be seen in Table 4, any oversampling technique improves the classification, but the best technique observed was the Borderline-SMOTE. For this reason, the Borderline-SMOTE was used as the oversampling technique in the experiments carried out. We applied this oversampling technique only to the training data. The test set remained unchanged, guaranteeing the credibility of the classification results. The balancing techniques applied to the Herlev database performed similarly to the CRIC database, so these results are not presented.
Before balancing the Herlev data with the Borderline-SMOTE technique, the training base contained 67 superficial squamous epithelial samples; 63 intermediate squamous epithelial samples; 83 columnar epithelial samples; 164 mild squamous non-keratinizing dysplasia samples; 132 moderate squamous non-keratinizing dysplasia samples; 178 severe squamous non-keratinizing dysplasia samples; and 135 squamous cell carcinoma in situ intermediate samples. The largest class size (severe squamous non-keratinizing dysplasia, with 178 samples) was used as the reference size for balancing the data. Thus, after balancing, each of the seven classes comprised 178 samples. This 7-class balancing was reused in the 2-class balancing so that all classes were represented within the groups. In the 2-class balancing, we grouped the four abnormal classes, resulting in 712 (=4 × 178) abnormal cells, equally distributed among the four classes. The group of normal classes resulted in 534 (=3 × 178) samples. Finally, we balanced the normal and abnormal groups so that they had the same number of samples.
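The balancing arithmetic reported in this section for Herlev and CRIC can be checked with a short script. The class counts are the ones quoted in the text; the helper name and the short class keys are hypothetical.

```python
def balanced_group_sizes(class_counts, groups):
    """Raise every class to the largest class size, then add up the classes of each group."""
    reference = max(class_counts.values())
    sizes = {name: reference * len(members) for name, members in groups.items()}
    return reference, sizes


# Training-set counts quoted in the text (short keys are ours).
herlev = {"superficial": 67, "intermediate": 63, "columnar": 83,
          "mild": 164, "moderate": 132, "severe": 178, "carcinoma in situ": 135}
cric = {"NILM": 776, "ASC-US": 258, "LSIL": 539, "ASC-H": 483, "HSIL": 787, "SC": 70}

herlev_ref, herlev_2class = balanced_group_sizes(herlev, {
    "normal": ["superficial", "intermediate", "columnar"],
    "abnormal": ["mild", "moderate", "severe", "carcinoma in situ"],
})
cric_ref, cric_2class = balanced_group_sizes(cric, {
    "normal": ["NILM"],
    "abnormal": ["ASC-US", "LSIL", "ASC-H", "HSIL", "SC"],
})
_, cric_lesions = balanced_group_sizes(cric, {
    "low-grade": ["ASC-US", "LSIL"],
    "high-grade": ["ASC-H", "HSIL", "SC"],
})
```

The second balancing pass described in the text then raises the smaller group to the size of the larger one, for example 787 to 3935 in the CRIC 2-class case.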
In the CRIC database, there were 776 NILM, 258 ASC-US, 539 LSIL, 483 ASC-H, 787 HSIL, and 70 SC samples. Thus, analogous to the Herlev data balancing, we chose as the reference size 787, the size of the most frequent class, HSIL. At the end of the balancing, each of the six classes had the same number of samples. The 6-class balancing was also used for the 2- and 3-class balancing, aiming at the classes' representativeness within the groups. Thus, the abnormal group resulted in 3935 (=5 × 787) samples, and the normal one in 787 (=1 × 787). We then applied a new balancing so that the two groups (normal and abnormal) each had 3935 samples. Finally, in the 3-class balancing, the high-grade lesion group resulted in 2361 (=3 × 787) samples and the low-grade group in 1574 (=2 × 787). These last two groups were again balanced, and both finally comprised 2361 samples.

Hierarchical Classification Results
Table 5 presents the results of precision, recall, F1, accuracy, and specificity for the 2-class, 5-class, and 7-class hierarchical classification of Herlev database nuclei images. The 7-class classification without hierarchy allows us to determine whether the hierarchical classification improves the task. The best results are highlighted in bold. The RF classifier performs better than k-NN, Ridge, and DT across all metrics and numbers of classes. The results also reveal that the hierarchical methodology improves the classification.
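The five metrics reported in the results tables follow the definitions of Table 3 and can be computed from the four confusion counts as in the sketch below. The function name and the convention of returning 0.0 for an empty denominator are assumptions made for this illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Table 3 metrics from true/false positive and negative counts."""
    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

For example, hypothetical counts of 40 TP, 45 TN, 5 FP, and 10 FN give a recall of 0.80, a specificity of 0.90, and an accuracy of 0.85.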
In turn, Table 6 shows the results of the 2-class, 3-class, and 6-class hierarchical classification, and of the 6-class classification without hierarchy, for the CRIC database nuclei images. These results are similar to those for Herlev: our findings show that the RF is the best classifier, and the hierarchical methodology improves the classification. In RF, an ensemble learning technique [61], multiple decision trees are combined in a committee, as in boosting [65] (although RF builds its committee by bagging), whose final performance is better than that of the base classifiers. Each decision tree is trained with different features and is responsible for predicting diverse data in the classifier. Thus, the decision boundary becomes more stable and accurate with more trees. Simultaneously, the unpruned and diverse trees result in a high resolution in the feature space and a smoother decision boundary between the classes. These essential characteristics of RF, combined with the nonlinear correlation among features, contribute to the good classification prediction [66].

Statistical Analysis
The statistical analysis aims to verify whether there is a statistically significant difference between the implemented algorithms' results.
Initially, we used the Shapiro-Wilk test [67] with a significance level of 0.05 to verify whether the normal distribution can approximate the probability distribution of the classifiers' results. It was found that the results obtained in all metrics do not follow a normal distribution.
For this reason, the Kruskal-Wallis non-parametric test [68] was chosen to determine whether the results obtained suggest that the samples are from different populations or are just random variations among random samples from the same population.
Thus, the best classifier found in the experiments, the RF, was compared pairwise with the other classifiers using the non-parametric Kruskal-Wallis test with a significance level of 0.05 to check whether there was a statistically significant difference between RF and the other classifiers concerning all performance metrics. Table 7 shows the p-values obtained in the Kruskal-Wallis test when comparing the RF results with those of k-NN, Ridge, and DT for 2-class, 3-class, and 7-class hierarchical classification, and 7-class classification without hierarchy using the Herlev database. The results highlighted in bold show no statistically significant difference from the RF results (p-value > 0.05). Table 7 reveals that in 7-class hierarchical classification, the Ridge results for accuracy and specificity do not differ significantly from the RF results.
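The pairwise comparison can be illustrated with a small pure-Python version of the Kruskal-Wallis test for two samples. This is a sketch under assumptions: the per-run scores are invented for illustration, the function names are hypothetical, and the p-value uses the usual chi-square approximation with one degree of freedom, which is only approximate for small samples. In practice a library routine such as `scipy.stats.kruskal` would normally be used instead.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_pair(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (H, p) for two samples, with tie correction and df = 1."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[: len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)
    ) - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over groups of tied values.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    # Chi-square survival function for df = 1: P(X > h) = erfc(sqrt(h / 2)).
    p = math.erfc(math.sqrt(max(h, 0.0) / 2))
    return h, p


# Hypothetical per-run accuracies (not results from this work).
rf_scores = [0.95, 0.96, 0.97, 0.98, 0.99]
other_scores = [0.90, 0.91, 0.92, 0.93, 0.94]
h, p = kruskal_wallis_pair(rf_scores, other_scores)
print(f"H = {h:.3f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

A p-value above 0.05 in such a comparison means only that the test found no significant difference between the two sets of results, not that the two classifiers are equivalent.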
In turn, Table 8 presents the p-values obtained in the Kruskal-Wallis test when comparing the RF results with those of k-NN, Ridge, and DT for 2-class, 3-class, and 6-class hierarchical classification and 6-class classification without hierarchy using the CRIC database. As no p-value was higher than 0.05, it can be concluded that the RF statistically outperformed all other classifiers on the data from the CRIC database, considering all metrics.
Table 9 presents the precision, recall, F1, accuracy, and specificity values obtained by the best method found in these experiments, the RF hierarchical classification, and by other literature methods. Blank fields indicate that the literature methods did not report the respective metrics. The best result for each metric is highlighted in bold. As can be seen in Table 9, in the 2-class classification the proposed RF hierarchical classifier obtained the best values of precision and F1, and achieved high recall, accuracy, and specificity when compared to the other methods. In the 5-class and 7-class classification, the proposed RF hierarchical classifier obtained the best values of all metrics considered.

Conclusions
This work proposes a hierarchical classification methodology to classify nuclei of Pap smear images using handcrafted features. As mentioned before, in the cytopathologist's routine, image analysis is entirely manual and subjective, a tiring and monotonous task. The proposal performs a computational screening procedure capable of excluding irrelevant nuclei images to identify possible lesions and reduce the number of images analyzed visually by the cytopathologist. The reduction of the professional workload helps to focus attention on the analysis of relevant images, decreasing false-negative rates.
The experiments indicate that hierarchical classification improves the results when compared with classification without hierarchy. Considering the Herlev database, the results outperform the literature methods for 5-class and 7-class classification concerning the precision, recall, F1, accuracy, and specificity metrics. For the 2-class classification, our RF hierarchical method achieves the best results for two of the five metrics and competitive results in the others. For the metrics in which our method does not present the best result, the best result for each metric comes from a different literature method. However, the method with the best result for a given metric performs worse than our method in all the other metrics.
Additionally, this work introduces the CRIC segmentation cervix collection and presents 2-class, 3-class, and 6-class classification results considering precision, recall, F1, accuracy, and specificity metrics.
The present findings on cell nuclei classification enhance our understanding of the handcrafted features used in the machine learning algorithms. The hypothesis that features should be inspired by the biological criteria for differentiating abnormal cells from normal ones proved feasible. The feature vector included a combination of nuclear contour shape morphologies with chromatin distribution (texture), and all attributes were used in the classification task.
We tested eight traditional machine learning classifiers to perform the nuclei classification and chose the four best ones (Decision Tree, k-NN, Random Forest, and Ridge) to report their results for the proposed hierarchical classification. A statistical analysis shows that the Random Forest is the best one to classify nuclei images of the Herlev and CRIC databases regardless of the number of classes.