Computer-Assisted Screening for Cervical Cancer Using Digital Image Processing of Pap Smear Images

Abstract: Cervical cancer can be prevented by having regular screenings to find any precancers and treat them. The Pap test looks for any abnormal or precancerous changes in the cells on the cervix. However, manual screening of Pap smears under the microscope is subjective, with poorly reproducible criteria. Therefore, the aim of this study was to develop a computer-assisted screening system for cervical cancer using digital image processing of Pap smear images. The analysis of Pap smear images is important in a cervical cancer screening system. There were four basic steps in our cervical cancer screening system. In cell segmentation, nuclei were detected using a shape-based iterative method, and the overlapping cytoplasm was separated using a marker-controlled watershed approach. In the feature extraction step, three important types of features were extracted from the regions of segmented nuclei and cytoplasm. The random forest (RF) algorithm was used as a feature selection method. In the classification stage, a bagging ensemble classifier was applied, which combined the results of five classifiers: LD (linear discriminant), SVM (support vector machine), KNN (k-nearest neighbor), boosted trees, and bagged trees. The SIPaKMeD and Herlev datasets were used to demonstrate the effectiveness of our proposed system. According to the experimental results, 98.27% accuracy in two-class classification and 94.09% accuracy in five-class classification were achieved using the SIPaKMeD dataset. Compared with the five individual classifiers, our proposed method performed significantly better on both the two-class and five-class problems.


Introduction
Cancer is the uncontrolled growth of abnormal cells in the body. Rather than responding appropriately to the signals that control normal cell behavior, cancer cells grow and divide in an uncontrolled manner, invading normal tissues and organs and eventually spreading throughout the body [1]. Cervical cancer is cancer arising from the cervix. The most important risk factor for cervical cancer is infection with human papillomavirus (HPV). The goal of cervical screening is to identify and remove significant precancerous lesions in addition to preventing mortality from invasive cancer. Cervical cancer is the fourth most frequent cancer in women, with an estimated 570,000 new cases in 2018, representing 6.6% of all female cancers. Approximately 90% of deaths from cervical cancer occur in low- and middle-income countries. Precancerous changes in the cervix usually do not cause any signs or symptoms [2]. Symptoms of cervical cancer include irregular, intermenstrual (between periods) or abnormal vaginal bleeding after sexual intercourse; back, leg, or pelvic pain; fatigue; and weight loss.

Recently, Chuanyun et al. [21] proposed the segmentation of nuclei and cytoplasm regions using the gradient vector flow (GVF) method. However, that study analyzed only a single cell as input. Because each slide can contain over 10,000 cells [22], a single-cell approach is not sufficient in real cases, and it does not solve the issue of overlapping cells. Many different algorithms have been proposed to solve the problem of overlapping cells [23]. Among them, the marker-controlled watershed transformation is one of the most widely used methods. The main problem with this approach is over-segmentation [24]. In feature extraction, the most commonly used features are shape [25], texture [26], and color features. The major advantage of texture features is their simplicity. Therefore, texture features were extracted using the GLCM (gray level co-occurrence matrix) in our feature extraction stage. The most commonly used classifiers in multi-cell cervical image analysis are the support vector machine (SVM) [27], LDA (linear discriminant analysis) [28], k-nearest neighbor (KNN) [29], and ANN (artificial neural networks) [30]. There have been many research studies on cervical cancer detection, but most have targeted only the segmentation of nuclei regions [31]. The segmentation of cytoplasm regions is also essential, because the features extracted from cytoplasm regions are helpful for the classification of abnormal cells [32].
This paper is divided into four parts. Part 1 introduces the cervical cancer screening system and discusses previous research studies. Part 2 describes the datasets and the methodology used in this study. Part 3 presents the results of the proposed system. Finally, Part 4 presents the conclusion.

Materials and Methods
The system flow diagram of the proposed computer-assisted screening of cervical cytology is presented in Figure 2. Our proposed system involved six stages: image acquisition, image enhancement, cell segmentation, feature extraction, feature selection, and classification [33]. In the image acquisition step, the SIPaKMeD dataset was used for multi-cell images, and the Herlev dataset was used for single-cell images. In the image enhancement step, input Pap smear images were enhanced and denoised to improve image quality. The next step was cell segmentation, which partitioned the input images into the regions of interest: nucleus and cytoplasm. After segmentation, distinctive features were extracted in the feature extraction step. In the feature selection stage, the random forest algorithm was used as the selection method. The final step was classification, in which the cells were classified as normal or abnormal using a bagging ensemble classifier. The details of each step are explained in the following subsections.

Image Acquisition
For image acquisition, we used two datasets, SIPaKMeD [34] and Herlev [35]. The Herlev dataset was used for single-cell classification, and the SIPaKMeD dataset was used for multi-cell classification. The Herlev dataset contained 917 images. Classes 1 to 3 are normal cervical cells, and classes 4 to 7 are abnormal cervical cells. The multi-cell dataset contained 966 images, from which 4049 cells were cropped. Cells were divided into normal, benign, and abnormal stages. There were five classes: superficial-intermediate cells, parabasal cells, metaplastic cells, dyskeratotic cells, and koilocytotic cells. The details of each dataset are given in Tables 1 and 2. Sample Pap smear images from the Herlev and SIPaKMeD datasets are shown in Figures 3 and 4.
Figure 3 panels include moderate squamous non-keratinizing dysplasia, (f) severe squamous non-keratinizing dysplasia, and (g) squamous cell carcinoma in situ.

Image Enhancement
Most of the Pap smear images were noisy and had low contrast, as shown in Figure 5a. Therefore, image enhancement was needed to remove the noise and increase the contrast. A median filter was used to remove the noise, as shown in Figure 5b, and CLAHE (contrast limited adaptive histogram equalization) was used to enhance the contrast, as shown in Figure 5c. High-contrast images were easier to segment, and gave more precise results in the cell segmentation stage, than low-contrast images.
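As an illustration of the denoising step, the following is a minimal pure-Python sketch of a median filter on a grayscale image stored as a list of rows. The 3x3 window, the truncated window at the image border, and the use of the lower median are assumptions made for this example; the paper does not state the filter size or border handling. CLAHE is not sketched here, since in practice it would normally come from an image processing library.

```python
from statistics import median_low


def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighborhood.

    `image` is a list of rows of integer gray values. At the border the
    window is simply truncated to the pixels that exist (an assumption).
    """
    half = size // 2
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            # median_low keeps the result an actual pixel value (no averaging).
            out[r][c] = median_low(window)
    return out


# A single bright "salt" pixel in a flat region is removed by the filter.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
denoised = median_filter(noisy)
```

The median is preferred over a mean filter for this kind of impulse noise because a single outlier cannot shift the median of the window.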

Cells Segmentation
The aim of this step was to segment the cell regions from the input images. The nucleus and cytoplasm are the important components of the cell region. In a Pap smear screening system, cytologists examine microscope images of cells and label them as cancerous or normal based on the appearance of the cell components. The automated screening system follows the same procedure, so the segmentation of cell components is an important step in automated detection. There are many difficulties in multi-cell segmentation, such as overlapping cells and unwanted artifacts. Nuclei segmentation is easier than cytoplasm segmentation. In our multi-cell images, nuclei had low intensity and well-structured shapes, mostly oval or round, and differed significantly from the other regions (background and cytoplasm). The major issues in cytoplasm segmentation are overlapping boundaries and poor contrast. Most studies have focused on nuclei segmentation, and few have focused on cytoplasm segmentation. In Pap smear analysis, however, the characteristics of the cytoplasm are very important. We used the marker-controlled watershed algorithm for cytoplasm segmentation to detect overlapping boundaries and to split touching cells into individual cells. The main problem of the standard watershed transform is over-segmentation. To avoid this problem, we used markers. The marker image was a binary image with one or multiple marker points. The flowchart of the proposed modified watershed transform algorithm is shown in Figure 6.
In our proposed overlapping cells' segmentation method, there were ten stages to segment the multi-cell images into individual cells, which were then used for extracting the nuclei and cytoplasm regions. The steps are summarized in Table 3. First, the original images were converted to grayscale. Next, foreground and background markers were extracted, the image was segmented using the watershed transform, and the segmented results were displayed. The foreground and background markers were computed using techniques based on morphological operations. To separate overlapping cells into individual cells, the boundaries of the cytoplasm regions were detected after the overlapping cells' detection stage. Then, unwanted objects were removed by keeping only regions whose area lay between predefined minimum and maximum values. After that, the detected regions were cropped using bounding boxes, and the cropped regions were classified into three classes using k-means, an unsupervised machine learning method, with six intensity-based features. The intensity variations of the three groups of cell patches (isolated, touching, and overlapping cells) were significantly different, as shown in Figure 7. We therefore used six intensity-based features: mean, variance, skewness, kurtosis, energy, and entropy.
Table 3. Processing steps of the proposed overlapping cells' segmentation method.
Step 1: Read the color image and convert it to a gray image
Step 2: Mark the foreground objects
Step 3: Compute the background objects
Step 4: Use a marker image whose markers lie roughly in the middle of the cells to be segmented
Step 5: Compute the watershed transform of the marker image
Step 6: Show the detected overlapping cells' regions
Step 7: Calculate the boundaries of the detected regions in the image
Step 8: Detect the areas between the minimum and maximum values for cell regions
Step 9: Crop the regions
Step 10: Classify the cell regions into isolated, touching, or overlapping cells
The k-means clustering approach was used to classify the cropped cell patches into three groups. It was first proposed by MacQueen [36]. It organizes a set of observations, represented as feature vectors, into clusters based on their similarity. There are three basic steps in the training algorithm for k-means: initialization, assignment, and update. Initialization either assigns each observation randomly to one of the k clusters or takes k observations randomly from the data set as the initial cluster centers. Figure 8 shows the three steps of the k-means clustering algorithm. The results of the proposed cell segmentation algorithm are shown in Figure 9.
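The patch classification described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the definition of energy as the sum of squared histogram probabilities, the base-2 entropy, and the optional `init` argument (used here only to make the example deterministic) are all assumptions.

```python
import math
import random


def intensity_features(pixels):
    """Return (mean, variance, skewness, kurtosis, energy, entropy) of gray values."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(var)
    # Standardized third and fourth moments; set to 0 for a constant patch.
    skew = sum((p - mean) ** 3 for p in pixels) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum((p - mean) ** 4 for p in pixels) / (n * std ** 4) if std > 0 else 0.0
    # Normalized gray-level histogram of the patch.
    hist = {}
    for p in pixels:
        hist[p] = hist.get(p, 0) + 1
    probs = [count / n for count in hist.values()]
    energy = sum(q * q for q in probs)
    entropy = -sum(q * math.log2(q) for q in probs)
    return (mean, var, skew, kurt, energy, entropy)


def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init=None, iters=100, seed=0):
    """Lloyd's k-means: returns (labels, centers) for a list of tuples."""
    rng = random.Random(seed)
    # Initialization: k random observations become the initial centers.
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Toy 2-D points standing in for patch feature vectors, three obvious groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (20.0, 1.0)]
labels, centers = kmeans(pts, 3, init=[pts[0], pts[2], pts[4]])
```

In a real pipeline each cropped patch would be reduced to its six-element feature vector first, and the features would usually be rescaled before clustering, because variance and mean live on very different numeric scales.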

Nuclei and Cytoplasm Segmentation
In this stage, the nuclei and cytoplasm regions were segmented from each cell produced by the overlapping cells' segmentation stage. That stage output three types of segmented cells: isolated cells, touching cells, and overlapping cells. We therefore divided nuclei and cytoplasm segmentation into three sub-processes: segmentation of the cell components from isolated cells, from touching cells, and from overlapping cells. The cytoplasm boundaries of all three cell types could be extracted from the results of the watershed transform approach proposed in the overlapping cells' segmentation stage. However, the touching and overlapping cytoplasm regions obtained in the segmentation step did not represent the cytoplasm boundaries well enough. Thus, we smoothed the cytoplasm boundaries using the edge smoothing method described in Table 4.
Table 4. Processing steps of the edge smoothing method.
Step 1: Read the grayscale image and convert it to a binary image
Step 2: Extract the largest blob only
Step 3: Crop off the frame on the left and top
Step 4: Fill holes
Step 5: Blur the image
Step 6: Threshold again
Step 7: Show the smoothed binary image
The nuclei of isolated cells, touching cells, and overlapping cells were segmented using the shape-based iteration method described in Table 5.
Table 5. Processing steps of the shape-based nuclei detection algorithm.
Step 1: Read the input color image and invert its grayscale version
Step 2: Remove noise using the median filter
Step 3: Predefine the minimum area, major and minor axis lengths, minimum and maximum intensity values, and solidity
Step 4: Binarize the image using the lowest and highest thresholds
Step 5: Remove the regions outside the shape and intensity limits
Step 6: Segment nuclei
Nuclei segmentation was easier than cytoplasm segmentation. In our multi-cell images, nuclei had low intensity and well-structured shapes, mostly oval or round, and differed significantly from the other regions (background and cytoplasm). The area, average intensity, major and minor axis lengths, and solidity were the five most important features for distinguishing nuclei from other objects, cytoplasm, and background. Therefore, we used these features to develop a method to detect and segment nuclei. A median filter was used to remove the noise from the original Pap smear images. According to the experiments, the minimum area of nuclei was 362 µm². The average intensity value was between 60 and 150, and solidity values were less than or equal to 0.98. The major axis length of nuclei was between 24 µm and 117 µm, and the minor axis length was between 17 µm and 87 µm. The names and formulas of the shape-based features are given in Figure 10b.
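The region filtering in Step 5 of Table 5 can be illustrated with a small sketch that applies the thresholds reported in the text to precomputed region measurements. The `Region` class, the `is_nucleus` function, and the choice of inclusive bounds are hypothetical and introduced only for this example; how the measurements themselves are obtained (for instance from a connected-component analysis) is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Measurements of one candidate region (hypothetical container)."""
    area: float            # region area, in the units used in the text
    mean_intensity: float  # average gray value, 0 to 255
    major_axis: float      # major axis length
    minor_axis: float      # minor axis length
    solidity: float        # region area divided by convex hull area


def is_nucleus(r: Region) -> bool:
    """Apply the thresholds reported in the text (inclusive bounds assumed)."""
    return (
        r.area >= 362
        and 60 <= r.mean_intensity <= 150
        and 24 <= r.major_axis <= 117
        and 17 <= r.minor_axis <= 87
        and r.solidity <= 0.98
    )


candidates = [
    Region(area=500, mean_intensity=100, major_axis=40, minor_axis=30, solidity=0.95),
    Region(area=300, mean_intensity=100, major_axis=40, minor_axis=30, solidity=0.95),
    Region(area=500, mean_intensity=200, major_axis=40, minor_axis=30, solidity=0.95),
]
# Keep only the candidates that satisfy every shape and intensity limit.
nuclei = [r for r in candidates if is_nucleus(r)]
```

Keeping the limits in one predicate makes it easy to re-tune them for images acquired at a different magnification or staining protocol.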
The results of the nuclei segmentation step are shown in Figures 11 and 12. The performance of our proposed nuclei detection method was evaluated, and the results are shown in Table 6. The nuclei and cytoplasm segmentation results for isolated cells, touching cells, and overlapping cells are given in Tables 7-9, respectively. The proposed overlapping cells' segmentation method was tested on images containing 4049 cytoplasm regions in total. About 6.47% of the cytoplasm regions were isolated, 35.27% were touching, and 58.26% were overlapping. The rate of well-segmented cytoplasm was calculated by dividing the number of detected cytoplasm regions by the actual number and multiplying by 100, and the results are shown in Table 10.
In the tables, the dice similarity coefficient (DSC) takes values between 0 and 1. The false-negative rate (FNR) is the proportion of nucleus pixels that were incorrectly classified as non-nucleus. The true-positive rate (TPR) is the proportion of nucleus pixels that were correctly classified as nucleus.
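For reference, the pixel-level measures named above and the well-segmented rate can be computed as in the following sketch. It assumes the standard definitions DSC = 2TP / (2TP + FP + FN), TPR = TP / (TP + FN), and FNR = FN / (TP + FN), with binary masks stored as lists of rows; the function names are illustrative only.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def dice(pred, truth):
    """Dice similarity coefficient, between 0 and 1."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # two empty masks agree fully


def tpr(pred, truth):
    """Fraction of true nucleus pixels that were detected."""
    tp, _, fn, _ = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 0.0


def fnr(pred, truth):
    """Fraction of true nucleus pixels that were missed."""
    tp, _, fn, _ = confusion_counts(pred, truth)
    return fn / (tp + fn) if (tp + fn) else 0.0


def well_segmented_rate(detected, actual):
    """Detected count divided by actual count, times 100."""
    return 100.0 * detected / actual


truth = [[1, 1, 0], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 1]]
scores = (dice(pred, truth), tpr(pred, truth), fnr(pred, truth))
```

With these definitions TPR and FNR always sum to 1, so reporting both mainly serves readability.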

Features Extraction
After the cell segmentation stage, the next stage was feature extraction. In this stage, three important types of features were extracted: texture, shape, and color. Texture features were obtained using the gray level co-occurrence matrix. In cervical Pap smear images, healthy and abnormal cells differ markedly in their distributions of color and shape [37]; therefore, we also extracted shape and color features. The extracted texture features were contrast, smoothness, third moment, uniformity, and entropy for the RGB channels [38]. The color-based features were the means of six color channels, namely red, green, and blue (R, G, and B) from the RGB model and hue, saturation, and value (H, S, and V) from the HSV model, each extracted independently. The names of the features extracted from the nuclei and cytoplasm are shown in Table 11.
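
To make the color and texture descriptors more concrete, the sketch below computes the six channel means for one segmented region and three co-occurrence statistics (contrast, uniformity, and entropy) for a single pixel offset. It assumes the usual textbook definitions of these statistics; the exact definitions, quantization levels, and offsets used in the study are not stated here, and smoothness and third moment are omitted. All function and feature names are hypothetical.

```python
import colorsys
import math


def color_mean_features(pixels):
    """Mean of R, G, B, H, S, V over the (r, g, b) pixels (0..255) of one region."""
    pixels = list(pixels)
    if not pixels:
        raise ValueError("region is empty")
    sums = [0.0] * 6
    for r, g, b in pixels:
        rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
        h, s, v = colorsys.rgb_to_hsv(rn, gn, bn)   # all values in [0, 1]
        for i, value in enumerate((rn, gn, bn, h, s, v)):
            sums[i] += value
    names = ("mean_R", "mean_G", "mean_B", "mean_H", "mean_S", "mean_V")
    return {name: total / len(pixels) for name, total in zip(names, sums)}


def glcm_features(gray, levels, dx=1, dy=0):
    """Contrast, uniformity, and entropy of the co-occurrence matrix for offset (dx, dy).

    gray is a 2D list of integer gray levels in the range [0, levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(gray), len(gray[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                counts[gray[y][x]][gray[y2][x2]] += 1
                total += 1
    if total == 0:
        raise ValueError("image is too small for this offset")
    contrast = uniformity = entropy = 0.0
    for i in range(levels):
        for j in range(levels):
            p = counts[i][j] / total            # normalized co-occurrence probability
            if p > 0:
                contrast += (i - j) ** 2 * p
                uniformity += p * p
                entropy -= p * math.log2(p)
    return {"contrast": contrast, "uniformity": uniformity, "entropy": entropy}


# Hypothetical inputs, for illustration only.
color = color_mean_features([(255, 0, 0), (0, 0, 0)])
texture = glcm_features([[0, 0], [0, 1]], levels=2)
print(color)
print(texture)
```

Note that hue is an angular quantity, so a plain arithmetic mean of H is only a rough summary when the hues in a region straddle the wrap-around point.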

Features Selection
In this stage, the random forest algorithm was used as a feature selection algorithm. The main reasons for using a feature selection method were to select the most important features and to improve the accuracy of the classifier. Feature selection can also reduce the complexity of the classification model and the training time of machine learning algorithms. Among the many feature selection algorithms available, we used the random forest algorithm because the tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, that is, by the mean decrease in impurity over all trees. Nodes with the greatest decrease in impurity occur at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a particular node, we could create a subset of the most important features. Table 12 shows the feature names and the top-ranked attributes as determined by the random forest (RF) algorithm.
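
The idea of ranking features by decrease in impurity can be illustrated with a deliberately simplified sketch. The code below is not a random forest: it scores each feature by the largest Gini impurity decrease obtainable from a single threshold split on the full data, whereas a random forest averages such decreases over many trees grown on bootstrap samples. The feature names and values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_impurity_decrease(values, labels):
    """Largest Gini decrease achievable by one threshold split on a single feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Every distinct value except the largest is a candidate threshold.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_features(rows, labels, names):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {
        name: best_impurity_decrease([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature table: one row per cell, two features per row.
rows = [[1.0, 5.0], [2.0, 3.0], [8.0, 4.0], [9.0, 6.0]]
labels = ["normal", "normal", "abnormal", "abnormal"]
ranking = rank_features(rows, labels, ["nucleus_area", "mean_R"])
print(ranking)
```

In this toy example the first feature separates the two classes perfectly and therefore ranks first; keeping only the top-ranked features is the selection step.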

Classification
In this stage, we used a bagging ensemble classifier. Ensemble learning can help to improve prediction results by combining several models. Bagging uses bootstrap sampling to obtain the data subsets for training the base learners and combines the outputs of the base learners by voting for classification. Combining stable learners is less advantageous since the ensemble would not help improve generalization performance. In our proposed classifier, five classifiers were trained, and the predictions of each classifier were combined, with the final result decided by majority voting. The block diagram of the combined five classifiers is shown in Figure 13.
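
The two mechanisms named above, bootstrap sampling and majority voting, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the per-classifier predictions are hard-coded hypothetical outputs rather than the results of trained LD, SVM, KNN, boosted-tree, or bagged-tree models.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) training examples with replacement."""
    n = len(rows)
    indices = [rng.randrange(n) for _ in range(n)]
    return [rows[i] for i in indices], [labels[i] for i in indices]


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions holds one list of predicted labels per classifier,
    all of the same length.
    """
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of five base classifiers for three cells.
predictions = [
    ["normal", "abnormal", "normal"],
    ["normal", "abnormal", "abnormal"],
    ["abnormal", "abnormal", "normal"],
    ["normal", "normal", "abnormal"],
    ["normal", "abnormal", "abnormal"],
]
final = majority_vote(predictions)
print(final)

sample_rows, sample_labels = bootstrap_sample(
    [[0.1], [0.2], [0.3]], ["normal", "normal", "abnormal"], random.Random(0)
)
```

With five voters and two classes a tie cannot occur. In a multi-class setting ties are possible, and this sketch resolves them in favor of the label that appears first among the votes.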

Results
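To validate the effectiveness of our proposed system, the SIPaKMeD (multi-cell) dataset and the Herlev (single-cell) dataset were used. The multi-cell dataset contained 996 images, from which 4049 cells were cropped. These cells were divided into five classes: class 1 (superficial intermediate cells), class 2 (parabasal cells), class 3 (metaplastic cells), class 4 (dyskeratotic cells), and class 5 (koilocytotic cells). For the two-class problem, the first three classes were grouped into normal cells, 1618 cells in total, and the last two classes were grouped into abnormal cells, 2431 cells in total. The performance measures were accuracy, recall, specificity, precision, and F-measure [39]. The formulas are given in the equations below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.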
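
As a worked example of the five performance measures, the following sketch computes them from confusion-matrix counts using their standard definitions. The counts are made up for illustration and are not results from this study; the function name is likewise hypothetical.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_measures(tp, tn, fp, fn):
    """Accuracy, recall, specificity, precision, and F-measure from counts."""
    recall = safe_div(tp, tp + fn)            # also called sensitivity
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up counts, treating abnormal cells as the positive class.
measures = binary_measures(tp=90, tn=80, fp=20, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

For a multi-class problem, one common approach is to compute these measures per class in a one-versus-rest fashion and then average them; whether the study did so is not stated here.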

The classification accuracy of the five individual classifiers and of our ensemble classifier was compared using four feature sets: nuclei features only, cytoplasm features only, combined nuclei and cytoplasm features without feature selection, and combined features with feature selection. The results for the SIPaKMeD dataset are shown in Figure 14 for the two-class problem and in Figure 15 for the five-class problem. Moreover, the five performance measures of each classifier using the selected features are shown in Figure 16 for the two-class problem and Figure 17 for the five-class problem.

The corresponding accuracy comparison of the five classifiers with our ensemble classifier across the same four feature sets for the Herlev dataset is shown in Figure 18 for the two-class problem and Figure 19 for the seven-class problem. The five performance measures of each classifier using the selected features for the Herlev dataset are shown in Figure 20 for the two-class problem and Figure 21 for the seven-class problem.

Conclusions
This paper proposed a system for computer-assisted screening for cervical cancer using digital image processing of Pap smear images. The proposed system consisted of six steps: image acquisition, image enhancement, cell segmentation, feature extraction, feature selection, and classification. The system first segmented the components of each cell, such as the nucleus and cytoplasm, and then detected whether the cells were cancerous using a machine learning-based technique. Several techniques have been proposed in the past in this direction, but their accuracy has not been found to be satisfactory [40]. In our work, the average classification accuracy was 98.47% and 98.27% in the two-class problem, and 90.84% (seven-class) and 94.09% (five-class) in the multi-class problem, using the Herlev dataset and the SIPaKMeD dataset, respectively. The main advantage of our proposed method is an increase in the predictive performance in separating the abnormal cells from the normal cells.

The proposed system could be further enhanced by using other classifiers. It showed better classification accuracy, sensitivity, and specificity than each of the five individual classifiers. This framework could therefore be used in a cervical cancer screening system to detect women with precancerous lesions.
Author Contributions: K.P.W., Y.K., and K.H. designed and developed the study; K.P.W. performed the experiments, data analysis, validation, simulation, and manuscript writing; T.M.A. provided the medical reviews, annotated the pathologic cells, and verified the results; Y.K. and K.H. coordinated the whole efforts as the supervisors of this research. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.