Data-Driven Cervical Cancer Prediction Model with Outlier Detection and Over-Sampling Methods

Globally, cervical cancer remains one of the most prevalent cancers in females. Hence, it is necessary to identify the important risk factors of cervical cancer in order to classify potential patients. The present work proposes a cervical cancer prediction model (CCPM) that offers early prediction of cervical cancer using risk factors as inputs. The CCPM first removes outliers by using outlier detection methods such as density-based spatial clustering of applications with noise (DBSCAN) and isolation forest (iForest). It then increases the number of cases in the dataset in a balanced way, for example, through the synthetic minority over-sampling technique (SMOTE) and SMOTE with Tomek link (SMOTETomek). Finally, it employs random forest (RF) as a classifier. Thus, CCPM covers four scenarios: (1) DBSCAN + SMOTETomek + RF, (2) DBSCAN + SMOTE + RF, (3) iForest + SMOTETomek + RF, and (4) iForest + SMOTE + RF. A dataset of 858 potential patients was used to validate the performance of the proposed method. We found that the combinations of iForest with SMOTE and iForest with SMOTETomek performed better than those of DBSCAN with SMOTE and DBSCAN with SMOTETomek. We also observed that RF performed the best among several popular machine learning classifiers. Furthermore, the proposed CCPM showed better accuracy than previously proposed methods for forecasting cervical cancer. In addition, a mobile application that collects cervical cancer risk factor data and provides results from CCPM was developed to support instant and proper action at the initial stage of cervical cancer.


Introduction
Cervical cancer is a form of gynecological cancer, and its complications are often associated with human papillomavirus (HPV) infection. It is a common debilitating disease among women worldwide. It is the third most frequently diagnosed cancer (~485,000 cases) and the fourth leading cause of cancer-related deaths worldwide (236,000) each year [1,2]. The main cause of cervical cancer is persistent infection by oncogenic HPV. Cervical intraepithelial neoplasia 1-3 and in situ carcinoma are the early manifestations of cervical cancer [3]. Additional factors, including sexually transmitted infections, oral contraceptive use, smoking status, parity, and diet, can contribute to the development of cervical cancer [4]. Generally, patients with cervical cancer at its initial phases show no noticeable signs or symptoms, which could lead to misdiagnosis [5]. The risk of cervical cancer can increase by 2 to 3 times if an HPV-infected patient smokes [6]. With regard to pregnancy, HPV-infected patients without pregnancies have a lower occurrence of cervical cancer than those with more than one full-term pregnancy [7].

Related Work
Past research primarily employed clinical feature-based approaches, genetic feature-based approaches, and image classification and segmentation to classify and understand the presence of cervical cancer. For cervical cancer cell images, a study by Zhang et al. [26] used various machine learning algorithms and paired their segmentation refinement with an artifact-nucleus classifier, among which random forest gave the best output. Along with other robust refinement methods, supervised and unsupervised methods were used to distinguish image patches or superpixels from extracted elements, such as Adaboost detectors [27], support vector machine (SVM) [28], or Gaussian mixture models [29]. In a study by Zhao et al. [30], a novel superpixel-based Markov random field (MRF) segmentation was also implemented for non-overlapping cells.
A linear kernel SVM classifier was used by Tareef et al. [31] on superpixels, accompanied by edge enhancement and adaptive thresholding. The results indicated a nuclei precision of 94.3%, a recall of 92.0%, and a Dice similarity coefficient (DSC) of 0.926, with a cytoplasm DSC of 0.914. In another study, Zhao et al. [30] used an MRF classifier with a gap-search algorithm and an automatic labeling map. The findings revealed a nuclei DSC of 0.93 and a cytoplasm DSC of 0.82. In another work by Tareef et al. [28], the authors used SVM classification with a shape-guided level set based on sparse coding for overlapping cytoplasm. The results showed a nuclei precision of 95%, a recall of 93%, and a DSC of 0.93, with a cytoplasm DSC of 0.89.
Tseng et al. [32] have reported three classification models, namely C5.0, SVM, and extreme learning machine, to predict cervical cancer recurrence and to identify the most strongly associated risk factors by utilizing a clinical dataset (e.g., age, radiation therapy, cell type, and tumor size). Their findings indicate that cell type and radiation therapy are two risk factors associated with the recurrence of cervical cancer, and that C5.0 had the highest classification accuracy among all classifiers [32].
Hu et al. [33] have described a predictive model using multiple logistic regression analysis and an artificial neural network to predict the presence of cervical cancer and to identify the main risk factors linked with cervical cancer. They used features such as HPV, four genetic factors, and educational level. The experiment identified the HLA DRB1×13-2 and HLA DRB1×3-17 alleles as two risk factors of cervical cancer. Such risk factors have been the source of rising cervical cancer risks. Their results indicated that back-substitution fitting of the artificial neural network achieved the highest classification accuracy among all classifiers. Sharma [34] has presented a classification model for identifying stages of cervical cancer using C5.0 with different options such as rule sets, boosting, and advanced pruning. Features such as clinical diameter, uterine body, renal pelvic, and primary renal carcinoma were used. Experimental results indicated that C5.0 with advanced pruning achieved the highest accuracy in identifying stages of cervical cancer [34]. Sobar et al. [35] used social science behavior theory to classify the probability of being at risk of cervical cancer with classification methods such as naïve Bayes and logistic regression. Their findings indicated that naïve Bayes had better accuracy than logistic regression [35].
Wu and Zhou [10] proposed a classification model based on SVM for the diagnosis of cervical cancer. They used recursive feature elimination (RFE) and principal component analysis (PCA) techniques for feature elimination. Their findings revealed that SVM-PCA had higher accuracy for feature selection than SVM-RFE. Although the SVM method can accurately classify cervical cancer data, its high computation cost is a limitation. Recently, Abdoh et al. [22] used a random forest classifier with SMOTE and feature reduction techniques such as RFE and PCA for cervical cancer diagnosis. Their findings revealed that the SMOTE-RF model outperformed the SVM classification technique reported by Wu and Zhou [10].

Feature Selection
Feature selection is the process of choosing the subset of relevant features in the data that are the most valuable for model construction. It reduces overfitting and training time while improving accuracy [36]. In the present study, the algorithm does not need all features present in the data. Training it on the more significant features is expected to give better findings than utilizing all features.

Chi-Squared Feature Selection
Chi-squared feature selection is used to measure a feature's dependence on the class label [37]. It is one of the most frequently used methods for deciding which features are effective. In chi-square, a feature's information value is measured by calculating the chi-square statistic [38]. Several studies have used chi-square as a feature extraction technique, such as in breast cancer [37], Parkinson's disease using voice signals [39], cancer classification [38], computer-aided diagnosis of Parkinson's disease [40], and healthcare tweet classification [41]. The statistic is given in Equation (1):
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}} \qquad (1)$$
where $c$ is the number of classes, $I$ is the number of intervals, $E_{ij}$ is the expected number of samples, and $A_{ij}$ is the number of samples of the $i$-th class within the $j$-th interval. The larger the value of $\chi^2$, the more information the related feature provides.
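As a rough illustration of the statistic referred to as Equation (1), the sketch below computes a chi-square score for one already-discretized feature from a class-by-interval table of observed and expected counts. The function name `chi_square_score` and the toy arrays are hypothetical and are not taken from the study, which does not describe its implementation at this level of detail.

```python
import numpy as np

def chi_square_score(feature_bins, labels):
    """Chi-square statistic of one discretized feature against the class label."""
    classes = np.unique(labels)
    intervals = np.unique(feature_bins)
    # observed[i, j]: number of samples of class i that fall in interval j
    observed = np.array(
        [[np.sum((labels == c) & (feature_bins == b)) for b in intervals]
         for c in classes],
        dtype=float,
    )
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # expected[i, j]: count expected if the feature and the class were independent
    expected = row_totals * col_totals / observed.sum()
    return float(np.sum((observed - expected) ** 2 / expected))

# Hypothetical toy data: one feature tracks the label, the other ignores it.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
informative = np.array([0, 0, 0, 0, 1, 1, 1, 1])
uninformative = np.array([0, 1, 0, 1, 0, 1, 0, 1])

score_informative = chi_square_score(informative, labels)
score_uninformative = chi_square_score(uninformative, labels)
print(score_informative, score_uninformative)  # prints 8.0 0.0
```

Ranking all features by such a score and keeping the highest-scoring ones is one way to realize the selection step; library routines may use a slightly different formulation of the statistic.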

Outlier Detection Method
In information science, an outlier is an observation point that lies far from the bulk of observations. Outlier detection is the process of identifying and removing outliers from a dataset.
The key benefit of removing outliers is that it can improve classification accuracy. To the best of our knowledge, none of the studies in cervical cancer used an outlier removal technique. Several studies have reported that removing outliers with the DBSCAN method [42][43][44][45] improves the performance of the prediction system. Hence, we expect that outlier detection methods can enhance the accuracy of the classification model for cervical cancer. The present work used two outlier detection techniques, namely DBSCAN and iForest.

DBSCAN
DBSCAN is a clustering-based method of outlier detection which may be used to isolate outliers [23]. Outliers are points that do not belong to any cluster. The two key parameters of DBSCAN are epsilon (eps) and minimum points (MinPts). The eps parameter is the radius of the neighborhood around a point x (the ε-neighborhood of x), while MinPts is the minimum number of neighbors inside the eps radius. DBSCAN is a valuable tool for identifying and removing outliers. Past research has revealed that DBSCAN can successfully recognize outliers, showing excellent performance in social network communities [43], wireless sensor networks [44], and type 2 diabetes and hypertension [46]. Verbiest et al. [47] have reported that combining outlier removal with an oversampling method can produce better outcomes. Accordingly, merging DBSCAN for outlier detection with the SMOTE and SMOTETomek methods might improve the accuracy of CCPM.
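The idea that points belonging to no cluster are treated as outliers can be sketched with scikit-learn's `DBSCAN`, which labels such points `-1`. The synthetic data and the `eps` and `min_samples` values below are illustrative assumptions, not the settings tuned in this study.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: one dense cluster plus two far-away points.
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, far_points])

# eps is the neighborhood radius; min_samples plays the role of MinPts
# (scikit-learn counts the point itself among its neighbors).
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Points that do not belong to any cluster receive the label -1.
is_outlier = labels == -1
X_clean = X[~is_outlier]
print(f"removed {is_outlier.sum()} of {len(X)} points")
```

Only the rows kept in `X_clean` would then be passed on to the balancing and classification steps.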

iForest
Isolation forest (iForest) is an outlier detection technique [48,49]. It identifies outliers by building isolation trees (iTrees) and treating as outliers the instances that have a short average path length inside the iTrees. Past studies have reported noteworthy findings using iForest for outlier and anomaly detection. Domingues et al. [50] have evaluated diverse outlier detection techniques on UCI repository datasets. Their findings revealed that iForest could be successfully used to identify outliers while providing outstanding scalability on large datasets with acceptable memory use. Calheiros et al. [51] have applied iForest for unsupervised anomaly detection to locate problems in large-scale cloud datacenters. Their findings indicate that iForest is a viable and beneficial way to locate anomalies. iForest uses the property that outliers are more susceptible to isolation, so it is possible to identify outliers as observations with short expected path lengths (i.e., fewer splits) throughout the forest [52]. Many studies have used iForest as an outlier detection technique, such as in the detection of insulin pump in artificial pancreas [53], fault detection in artificial pancreas [54], medication errors [55], diabetics [56], Medicare provider fraud [52], and detection of anomalous vital signs of the elderly [57].
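A minimal sketch of this isolation idea, using scikit-learn's `IsolationForest` on synthetic data, is given below. The data, the number of trees, and the subsample fraction are illustrative assumptions; `score_samples` returns lower values for points that are isolated after fewer splits.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 ordinary points and 3 extreme ones.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 3))
extreme = np.array([[10.0, 10.0, 10.0],
                    [-10.0, 9.0, -11.0],
                    [12.0, -10.0, 10.0]])
X = np.vstack([inliers, extreme])

# n_estimators is the number of isolation trees; max_samples=0.1 means each
# tree is built on a 10% subsample. Both values are illustrative.
forest = IsolationForest(n_estimators=100, max_samples=0.1, random_state=0)
forest.fit(X)

scores = forest.score_samples(X)  # lower score = shorter average path length
pred = forest.predict(X)          # -1 for outliers, 1 for inliers
X_clean = X[pred == 1]
```

With the default threshold, some ordinary points may also be flagged, so the number of removed samples depends on the chosen configuration.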

Oversampling Method for Imbalance Dataset
Machine learning methods can face difficulties when one class dominates a dataset (i.e., the number of records in one class greatly exceeds that of the other classes). Such a dataset is called an imbalanced dataset. It biases the classifier toward the majority class, with a negative impact on findings. In the present study, we used SMOTE and SMOTETomek to handle the imbalanced dataset problem.

SMOTE
SMOTE is an oversampling technique proposed by Chawla et al. [24]. It randomly produces new minority class instances from a sample's nearest minority class neighbors. These instances are created by taking into account the features of the original dataset, so that they resemble the original minority class instances. It has been used in various fields, including breast cancer detection [58,59], liver cancer [60], and cervical cancer [22], to resolve the class imbalance problem. To increase the minority class, SMOTE uses Equation (2):
$$x_{new} = x_i + (x_{knn} - x_i) \times \delta \qquad (2)$$
where $\delta$ is a random number from 0 to 1. Firstly, SMOTE takes a feature vector $x_i$ and finds its K-nearest neighbors $x_{knn}$. Then, it calculates the difference between the feature vector and a k-nearest neighbor. Thereafter, it multiplies the difference by a random number from 0 to 1. It then adds the result to the feature vector to identify a new point on the line segment. Lastly, it repeats the above steps for the other feature vectors.
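The interpolation steps described for Equation (2) can be sketched in a few lines of NumPy. This is a simplified illustration rather than the imbalanced-learn implementation used in the study; the function name `smote_sample`, its parameters, and the toy minority points are hypothetical.

```python
import numpy as np

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority feature vector x_i at random.
        x_i = minority[rng.integers(len(minority))]
        # Rank all minority samples by distance to x_i; the closest entry is
        # x_i itself (when there are no duplicates), so it is skipped.
        dist = np.linalg.norm(minority - x_i, axis=1)
        neighbor_idx = np.argsort(dist)[1:k + 1]
        x_knn = minority[rng.choice(neighbor_idx)]
        # Move a random fraction of the way from x_i toward the neighbor.
        delta = rng.random()
        synthetic.append(x_i + (x_knn - x_i) * delta)
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(minority, k=3, n_new=10)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class.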

SMOTETomek
SMOTETomek is a technique used to handle imbalanced data. Many past studies have used SMOTETomek and reported favorable outcomes in balancing the data and enhancing model performance. It showed a better area under the curve value than the synthetic minority oversampling technique with edited nearest neighbor (SMOTEENN) when numerous imbalanced datasets were used [61]. Goel et al. [62] have reviewed five sampling techniques to resolve the imbalanced data problem by using eight datasets from the UCI repository. Their findings indicate that for most datasets, SMOTETomek can increase model accuracy. Chen et al. [63] have used SMOTETomek to resolve the imbalanced data issue in lane-changing behavior and random forest to predict the risk associated with lane changing. Their results revealed that SMOTETomek considerably enhanced the model by as much as 80.3%. Tomek links can be described as a method for undersampling or as a technique for cleaning up data. A Tomek link is a pair of samples of opposite classes that are each other's nearest neighbors [64]. Tomek links are used to remove the overlapping samples that SMOTE adds. Past studies have used SMOTETomek as an oversampling technique in various healthcare areas, such as self-care problem identification for children with disability [65], cancer gene expression data [66], vertebral column pathologies, diabetes and Parkinson's disease [67], and breast cancer [68].
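To make the notion of a Tomek link concrete, the following sketch finds pairs of opposite-class samples that are each other's nearest neighbors and then drops both members of each pair. It is a simplified illustration, not the imbalanced-learn routine used in the study; the function name `tomek_links`, the toy data, and the choice to remove both members are assumptions.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Hypothetical 1-D data: samples 2 and 3 sit very close but carry different labels.
X = np.array([[0.0], [0.1], [1.0], [1.05], [3.0]])
y = np.array([0, 0, 0, 1, 1])

links = tomek_links(X, y)
to_drop = sorted({idx for pair in links for idx in pair})
keep = np.setdiff1d(np.arange(len(X)), to_drop)
X_clean, y_clean = X[keep], y[keep]
print(links)  # prints [(2, 3)]
```

In a SMOTETomek-style pipeline this cleaning step would run after oversampling, so that synthetic samples overlapping the other class are removed.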

Random Forest
The RF algorithm is an ensemble classifier that builds multiple decision trees, each a weak classifier learned from a random sample of the data [69,70]. RFs overcome numerous issues with decision trees. For instance, they can reduce overfitting and produce low variance. We used random forest for prediction in cervical cancer. The following steps describe the generation of each tree in a random forest:
• Choose a value of n, the number of trees to be grown in the forest;
• Generate n bootstrap samples of the training set with the bagging technique;
• For each bootstrap dataset, grow a tree. If the training set consists of M input variables, m << M inputs are selected randomly out of M, and the best split on these m attributes is used to split the node. The value of m remains constant while the forest is grown;
• Grow each tree to the largest possible extent;
• Obtain the prediction as the mode (most frequent class) of the predictions of the decision trees in the forest.
Past work has revealed that RF is useful for predicting cervical cancer, with high classification accuracy [22]. Past studies have also shown that outlier data and imbalanced datasets are difficult issues in classification. For instance, they may decrease the system's overall performance [47]. Hence, we propose a CCPM that comprises DBSCAN and iForest for outlier detection to eliminate the outlier data, SMOTE and SMOTETomek for class balancing, and RF for predicting cervical cancer. By eliminating outlier data and balancing the dataset, RF is expected to give better results.
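The steps above map fairly directly onto the parameters of scikit-learn's `RandomForestClassifier`, as sketched below on synthetic data. The data and all parameter values are illustrative assumptions and do not reproduce the configuration of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 features, label driven by the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=100,     # n: number of trees grown in the forest
    bootstrap=True,       # each tree is trained on a bootstrap sample
    max_features="sqrt",  # m << M candidate inputs considered at each split
    max_depth=None,       # trees are grown until the leaves are pure
    random_state=0,
)
forest.fit(X_train, y_train)

# The forest combines the predictions of its trees into one class label.
accuracy = forest.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Note that scikit-learn combines trees by averaging their class probabilities rather than by a strict majority vote, which usually yields the same label.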

Dataset Description
The dataset used in this study was collected at Hospital Universitario de Caracas in Caracas, Venezuela, and published in the UCI repository [71]. The dataset contains 858 instances with 36 features. Table 1 displays the dataset features, the total number of entries, and the number of missing values for each feature. To deal with missing values, the present study replaced them with the mean, as depicted in Equation (3).
There are four target variables (Schiller, Hinselmann, Cytology, and Biopsy). Schiller's test applies an iodine solution to the cervix. The cervix is then examined by the naked eye to diagnose cervical cancer [72]. Due to its poor performance, Schiller's test has been replaced by cytology. The cytology test is used to examine cancer, precancerous conditions, and urinary tract infection. Hinselmann's test is applied to study the cervix, vulva, and vagina.
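Replacing missing entries with the mean of the observed values can be sketched as follows. The small array and its values are hypothetical, and the column-wise treatment is an assumption about how the mean was applied.

```python
import numpy as np

# Hypothetical feature matrix; np.nan marks a missing entry.
X = np.array([[18.0, 4.0, np.nan],
              [np.nan, 1.0, 0.0],
              [34.0, np.nan, 1.0],
              [52.0, 5.0, 2.0]])

# Mean of each column computed over the observed (non-missing) values only.
col_means = np.nanmean(X, axis=0)

# Fill every missing entry with the mean of its own column.
X_imputed = np.where(np.isnan(X), col_means, X)
print(X_imputed)
```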

Prediction Model for Cervical Cancer
The proposed CCPM consists of outlier detection based on DBSCAN and iForest, data balancing with SMOTE and SMOTETomek, and RF for cancer prediction. Lastly, the performance of the proposed CCPM is compared with the performances of other existing models. Figure 1 illustrates the proposed CCPM model. We used 70% of the dataset for training and 30% for testing, with 10-fold cross-validation. We used the Python programming language with the scikit-learn, pandas, and numpy libraries for the machine learning models. For outlier detection, we used the scikit-learn library [73]. For oversampling, we used the imbalanced-learn Python library [74].

Evaluation Metrics
The prediction output may have the following four possible outcomes on the basis of a confusion matrix [75]: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Table 2 displays the precision, recall, specificity, F1 score, and accuracy. Table 3 shows the different outcomes of two-class prediction.
Table 3. Different outcomes of two-class prediction.
             | Predicted as "Yes" | Predicted as "No"
Actual "Yes" | TP                 | FN
Actual "No"  | FP                 | TN
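The five metrics can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard definitions of precision, recall (sensitivity), specificity, F1 score, and accuracy; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard two-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On heavily imbalanced data, accuracy alone can look high even when few positive cases are found, which is why sensitivity and specificity are reported alongside it.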

Results and Discussion
This section presents the results of feature extraction and of the four CCPM scenarios in terms of precision, sensitivity, specificity, F1 score, and accuracy. The four scenarios are covered in separate sections, each displaying the results and their explanation. We then compare the biopsy results with the results of past studies and discuss some practical implications to conclude this section.

Feature Extraction Results
In the present study, we used chi-square to extract features from the dataset. The main aim of the feature extraction technique is to extract the most valuable features from a given dataset rather than using all of its features. For simplicity, we extracted the ten features with the highest chi-square scores and report their scores in the feature extraction table. We used these features for our analysis. After selecting these features, we used outlier detection techniques to remove outliers from the data. Table 4 displays the results of chi-square.

DBSCAN and iForest for Outlier Detection
To implement the DBSCAN-based outlier detection, the optimum values of MinPts and eps must first be established. If the value of eps is too low, it will generate more clusters and normal data may be counted as outliers. On the other hand, if it is too large, it will produce fewer clusters, and true outliers could be categorized as normal data [23,43,44]. We set the MinPts value to 5. Next, we have to determine the optimal value of eps. First, we measure each point's average distance from its k nearest neighbors, where k corresponds to MinPts and is defined by the user. The goal is to find the "knee" of the sorted k-distance curve, which is used to estimate eps. A "knee" is the point at which a sharp change occurs along the k-distance curve [46]. Figure 2 displays the sorted k-dist graph for the cervical cancer dataset and the optimal value of eps. The "knee" shows up at a distance of 3 in the cervical cancer dataset. Lastly, the outlier data are excluded, and the normal data are used for further analysis.
iForest works in two phases. The first (training) stage constructs isolation trees using subsamples of the training set. The second (testing) stage passes each case through the isolation trees to obtain an outlier score. Both the subsample size (MaxSample) and the number of trees to be built (NumTree) are essential parameters to be set. iForest works well when MaxSample is kept small; a larger MaxSample reduces the ability of iForest to isolate outlier data, as normal data can interfere with isolation [48][49][50]. NumTree influences the size of the ensemble. We tried different parameter configurations and found that a MaxSample of 10% of the total data size and a NumTree of 100 were optimal. The iForest was implemented using the scikit-learn Python library. With DBSCAN, we found two outliers. We removed these outliers and processed the data for further analysis. iForest, however, found eighty-six outliers. We removed those outliers as well and processed the data.
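The k-distance procedure used to choose eps for DBSCAN can be sketched as follows. The synthetic data are hypothetical, and picking the knee as the largest jump between consecutive sorted values is a crude heuristic assumed here for illustration; in the study the knee is read from the plotted curve.

```python
import numpy as np

def k_distance_curve(X, k):
    """Sorted average distance from each point to its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dist.sort(axis=1)  # column 0 holds the zero distance of a point to itself
    avg_knn = dist[:, 1:k + 1].mean(axis=1)
    return np.sort(avg_knn)

# Hypothetical data: one dense cluster plus two far-away points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               [[8.0, 8.0], [-9.0, 7.5]]])

curve = k_distance_curve(X, k=5)  # k corresponds to MinPts

# Crude knee estimate: the value just before the sharpest rise of the curve.
knee_index = int(np.argmax(np.diff(curve)))
eps_estimate = float(curve[knee_index])
print(f"estimated eps: {eps_estimate:.3f}")
```

Plotting `curve` against the point index gives the kind of k-dist graph from which the knee is normally judged by eye.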

SMOTE and SMOTETomek for Balancing the Dataset
We also used over-sampling methods to increase the number of cases in a balanced way. We applied the SMOTE or SMOTETomek method to balance the datasets. SMOTE oversamples the minority class by randomly generating new minority class instances, and Tomek links under-sample a class to remove noise while maintaining balanced distributions. As can be seen in Table 5, the dataset is balanced after the application of SMOTE and SMOTETomek. The aim of classification is to diminish errors during the learning process; hence, we anticipate that better model accuracy can be attained from the balanced datasets.


Results of Target Variables: Biopsy, Schiller, Hinselmann, Cytology
The ten features extracted by chi-square were used for all models (SVM, multilayer perceptron (MLP), logistic regression (LR), naïve Bayes, and K-nearest neighbors (KNN)) and all four target variables (Biopsy, Schiller, Hinselmann, and Cytology). For each of the target variables, the CCPM was compared with these conventional machine learning models and outperformed them (see Tables 6-21).
The key reason for these results is the combination of the outlier removal and data balancing techniques, which improves the accuracy of our CCPM.

Comparison with Previous Studies
We compared the results of our CCPM with past studies that used the cervical cancer dataset. The study by Wu and Zhou [10] used SVM-RFE and SVM-PCA, while Abdoh et al. [22] used SMOTE-RF-RFE and SMOTE-RF-PCA. Tables 22-25 compare the sensitivity, specificity, and accuracy of CCPM for all target variables with those of Wu and Zhou [10] and Abdoh et al. [22]. All four proposed scenarios of CCPM surpassed the past studies by Wu and Zhou [10] and Abdoh et al. [22] in terms of sensitivity, specificity, and accuracy. A study by Deng et al. [76] used SMOTE for handling the imbalanced data, and SVM, XGBoost, and RF to identify the risk factors of cervical cancer. The accuracies they achieved with SVM, XGBoost, and random forest were 90.34%, 96.34%, and 97.39%, respectively. A recent study by Adem et al. [20] used a stacked autoencoder with a soft-max layer and achieved an accuracy of 97.25% on the cervical cancer dataset. Our CCPM surpassed Deng et al. [76] and Adem et al. [20] in terms of accuracy. Thus, we can conclude that our proposed CCPM performs better than other machine learning models as well as past studies.
Our proposed CCPM has four combinations of two outlier detection methods and two data oversampling methods, and their results varied across target variables and performance measures. For example, the combination of iForest and SMOTETomek showed the best performance in sensitivity and accuracy for biopsy and in specificity for Schiller and Hinselmann, but the combination of DBSCAN and SMOTE was the best in sensitivity and accuracy for the Schiller and Hinselmann tests. In summary, iForest showed better results than DBSCAN for biopsy tests, but DBSCAN was usually better than iForest for the cytology test. Therefore, we recommend using all four combinations for predicting cervical cancer and using their ensemble results for more robust prediction. As for risk factors, Table 1 shows the names of all attributes and the corresponding number of each attribute. The ten features in our study were Smokes (years), Hormonal Contraceptives (years), STDs (number), STDs: genital herpes, STDs: HIV, STDs: Number of diagnosis, Dx: Cancer, Dx: HPV, Dx. We compared the risk factors obtained from the feature extraction results with past studies [10,22,77]. The factor Dx: Cancer found in our study was also reported by Wu and Zhou [10], while the factors smokes per years and Hormonal contraceptives found by Wu and Zhou [10], and Smokes and Hormonal Contraceptives found by Abdoh et al. [22], are closely related to the factors Smokes (years) and Hormonal Contraceptives (years) found in the present study. Furthermore, we compared our feature results with the past study by Nithya and Ilango [77].
The complexity of an algorithm is generally expressed using Big-O notation [78][79][80]. Time complexity and space complexity are two types of computational complexity [81,82]. Time complexity deals with how long the algorithm takes to execute, while space complexity deals with how much memory the algorithm uses. Let N denote the amount of data that an algorithm processes. If an algorithm does not depend on N, then the algorithm has constant complexity, denoted O(1). Table 26 depicts the time and space complexities of some machine learning algorithms. In our CCPM, the RF is slower and requires more memory for training than the other algorithms. In addition, the proposed CCPM requires additional computation for outlier detection and data balancing. Nevertheless, our CCPM achieved better accuracy than other conventional machine learning methods.

Practical Implications
Lee et al. [83] have studied the impact of educational text messages concerning HPV vaccination and its advantages and observed a substantial increase in HPV vaccination uptake in the targeted populations. Cancer screening programs have also used text messaging in an attempt to improve screening uptake. Weaver et al. [84] have recently examined whether elderly patients would be interested in text messages intended to motivate their participation in a screening program. Their findings indicated that older populations were extremely interested in such messages. A recent study by Ijaz et al. [46] has used IoT for a healthcare monitoring system for patients at home, with personal healthcare devices that sense and estimate a person's biomedical signals. The system can notify health personnel in real time when patients experience emergency situations.
In the present work, we implemented CCPM in a mobile app to show its practical use for ordinary users. Figure 3 shows the architecture framework of CCPM. The mobile app collects the user's risk factor data and sends it through a Representational State Transfer (REST) API to be stored on a secure remote server. We used the NoSQL database MongoDB to store user data, keeping in mind that it can store large amounts of data. Finally, CCPM is used to forecast the presence of cervical cancer once users input their risk factors. Prediction results are shown in the mobile app. Figure 4a,b shows a prototype of the mobile app. When a user presses the "send" button, the risk factors that the user has entered are stored on the remote server. CCPM is then activated to predict the presence of cervical cancer. Figure 4b depicts the interface of the mobile application when the user receives prediction results. Therefore, it is expected that the CCPM mobile application can help users find the risk of cervical cancer efficiently at an early stage.
In our CCPM mobile application, maintaining the safety of healthcare data is of paramount importance. The data in the CCPM mobile application are encrypted. Encryption is essential for our CCPM mobile application since it scrambles a user's personal data. We used secure sockets layer (SSL) technology, which encrypts information transmitted between the mobile application and the server. SSL uses a cryptographic system with two keys to encrypt data, i.e., a public key known to everyone and a private or secret key known only to the recipient of the message [85,86]. The NoSQL MongoDB helps to store the user data. A secure log-in feature called two-factor authentication (2FA) can be added to the mobile application. 2FA improves the mobile application's security. Some examples of 2FA include: username/password + SMS code, username/password + code sent via email, and username/password + biometric authentication (a fingerprint). Another important way to secure user data is data wiping. To implement data wiping, an app can log out a user after a certain period of inactivity, keep information in an encrypted form, and perform an automatic data wipe after a certain number of unsuccessful login attempts. This gives users a sense of greater control over the privacy, security, and confidentiality of their healthcare data. The security measures described above can keep the health data secure and safe in our mobile application.

Conclusions
According to the World Health Organization (WHO), about 80% of cervical cancer cases occur in developing nations. The cure ratio is defined as the proportion of female patients who recover from the disease. It can be improved by classifying the risk factors of cervical cancer [22]. This study proposed a CCPM that used Chi-square as a feature extraction technique. We extracted ten features and used them in our study. The dataset is unbalanced and has many missing values, which we replaced with the mean. The proposed CCPM combines DBSCAN and iForest for outlier detection with SMOTE and SMOTETomek for class balancing and RF as a classifier. The CCPM can help users find the risk of cervical cancer at an early stage. For Biopsy, the accuracies achieved by DBSCAN + SMOTETomek + RF, DBSCAN + SMOTE + RF, iForest + SMOTETomek + RF, and iForest + SMOTE + RF were 96.708%, 97.007%, 98.919%, and 98.925%, respectively. For Schiller, the accuracies achieved by DBSCAN + SMOTETomek + RF, DBSCAN + SMOTE + RF, iForest + SMOTETomek + RF, and iForest + SMOTE + RF were 99.48%, 99.22%, 98.50%, and 98.71%, respectively. For Hinselmann, the accuracies achieved by DBSCAN + SMOTETomek + RF, DBSCAN + SMOTE + RF, iForest + SMOTETomek + RF, and iForest + SMOTE + RF were 99.50%, 99.01%, 99.50%, and 99.50%, respectively. For Cytology, the accuracies achieved by DBSCAN + SMOTETomek + RF, DBSCAN + SMOTE + RF, iForest + SMOTETomek + RF, and iForest + SMOTE + RF were 97.72%, 97.22%, 97.50%, and 97.51%, respectively. Hence, combining iForest with SMOTE and SMOTETomek can produce better results than combining DBSCAN with SMOTE and SMOTETomek. In addition, we compared the Hinselmann, Schiller, Cytology, and Biopsy results with past studies by Wu and Zhu [10] and Abdoh et al. [22] in terms of sensitivity, specificity, and accuracy.
Our results revealed that all four scenarios (DBSCAN + SMOTETomek + RF, DBSCAN + SMOTE + RF, iForest + SMOTETomek + RF, and iForest + SMOTE + RF) surpassed the past studies by Wu and Zhu [10] and Abdoh et al. [22]. In addition, our CCPM surpassed Deng et al. [76] and Adem et al. [20] in terms of accuracy. We therefore conclude that the proposed CCPM performs better than the other models and past studies considered.
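The iForest + SMOTE + RF scenario summarized above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function names `smote` and `fit_ccpm_sketch`, the ordering of imputation and Chi-square selection before outlier removal, the `contamination=0.05` setting, and the synthetic demo data are all assumptions. SMOTE is written out by hand here to keep the example self-contained; in practice a library implementation such as imbalanced-learn's `SMOTE` or `SMOTETomek` would normally be used.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.neighbors import NearestNeighbors


def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE: interpolate between a minority sample and one of
    its k nearest minority-class neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Drop the first neighbour, which is typically the point itself.
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neigh[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])


def fit_ccpm_sketch(X, y, n_features=10, seed=0):
    """Mean imputation -> Chi-square selection -> iForest -> SMOTE -> RF."""
    # 1. Replace missing values with the column mean.
    imputer = SimpleImputer(strategy="mean").fit(X)
    X = imputer.transform(X)
    # 2. Keep the ten best features by Chi-square score (needs values >= 0).
    selector = SelectKBest(chi2, k=n_features).fit(X, y)
    X = selector.transform(X)
    # 3. Remove points that isolation forest flags as outliers (-1).
    #    contamination=0.05 is an assumed setting, not taken from the paper.
    iforest = IsolationForest(contamination=0.05, random_state=seed)
    keep = iforest.fit_predict(X) == 1
    X, y = X[keep], y[keep]
    # 4. Over-sample the minority class up to the majority class size.
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_bal = np.vstack([X, smote(X[y == minority], n_new, seed=seed)])
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    # 5. Train the random forest on the balanced data.
    clf = RandomForestClassifier(random_state=seed).fit(X_bal, y_bal)

    def predict(X_raw):
        return clf.predict(selector.transform(imputer.transform(X_raw)))

    return predict


if __name__ == "__main__":
    # Synthetic stand-in for the risk-factor table (not the real dataset).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(300, 15)).astype(float)
    y = (rng.random(300) < 0.1).astype(int)
    X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing values
    predict = fit_ccpm_sketch(X, y)
    print(predict(X[:5]))
```

The DBSCAN-based scenarios would replace step 3 with a clustering step that discards points labelled as noise, and the SMOTETomek scenarios would additionally remove Tomek links after over-sampling.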
In future work, we will employ more diverse outlier detection and over-sampling techniques and apply each combination to CCPM to improve its diagnostic performance. The proposed method can also be applied to other cervical cancer datasets. Results on these may provide additional insights for the early diagnosis of cervical cancer.
This study also has limitations. Only one dataset was employed, because the study focuses solely on cervical cancer. In future research, the proposed CCPM can be applied to diverse cancer datasets (such as breast, liver, lung, prostate, thyroid, and kidney) to enhance the clarity and quality of results. Another limitation is that our algorithm, which combines outlier detection and data balancing with RF, runs more slowly and needs more memory; however, we consider this cost acceptable given the improved accuracy.