Cervical Cancer Diagnosis Using an Integrated System of Principal Component Analysis, Genetic Algorithm, and Multilayer Perceptron

Cervical cancer is one of the most dangerous diseases that affect women worldwide. The diagnosis of cervical cancer is challenging, costly, and time-consuming. Existing literature has focused on traditional machine learning techniques and deep learning to identify and predict cervical cancer. This research proposes an integrated system of Genetic Algorithm (GA), Multilayer Perceptron (MLP), and Principal Component Analysis (PCA) that accurately predicts cervical cancer. GA is used to optimize the MLP hyperparameters, and the MLPs act as simulators within the GA that provide the prediction accuracy of candidate solutions. The proposed method uses PCA to transform the available factors; the transformed features are subsequently used as inputs to the MLP for model training. To contrast with the PCA method, different subsets of the original factors are also selected. The performance of the integrated PCA-GA-MLP system is compared with nine different classification algorithms. The results indicate that the proposed method outperforms the studied classification algorithms. The PCA-GA-MLP model achieves the best accuracy in diagnosing Hinselmann, Biopsy, and Cytology when compared to existing approaches in the literature that were implemented on the same dataset. This study introduces a robust tool that allows medical teams to predict cervical cancer at an early stage.


Introduction
Cancer is a leading cause of death across the world. In 2020, around 604,000 women were diagnosed with cervical cancer and 342,000 cervical cancer deaths were recorded [1]. Cervical cancer is one of the most dangerous diseases for women, and approximately 80% of those diagnosed were aged 15 to 45.
Cervical cancer is caused by mutations in genes that regulate cell division and proliferation. Two associated symptoms of early-stage cervical cancer are pelvic discomfort and vaginal bleeding. Because cervical cancer is difficult to diagnose in its early stages, these symptoms are often the only early warning signs. If left unnoticed, cervical cancer can spread to other body regions, such as the lungs and abdomen. In its later stages, cervical cancer can be identified with diffusion-weighted Magnetic Resonance Imaging. Symptoms, which include tiredness, back discomfort, leg pain, weight loss, and potential bone fractures, often become more severe as the disease progresses [1].
Cervical cancer risk is raised by early pregnancy, contraceptive use, numerous pregnancies, cigarette use, and Human Papillomavirus (HPV) [2]. HPV, one of the most critical risk factors for cervical cancer, is a DNA virus that spreads mainly through sexual contact. HPV is identified in cervical cancer patients 99.7% of the time and comprises more than 100 distinct types [3].
Cervical cancer is detected using four specific tests: Hinselmann, Schiller, Cytology, and Biopsy (HSCB). Hinselmann belongs to the "colposcopy with acetic acid" group of tests, whereas Schiller uses Lugol iodine [4]. Diagnosing cervical cancer is expensive and time-consuming. Unfortunately, low-income nations face significant challenges in raising cancer awareness and screening rates. In addition, a lack of resources, such as medical expertise, equipment, and specialist doctors, contributes to the spread of cervical cancer in developing nations. As a result, patient fatality rates are rising [5].
One of the most effective techniques for predicting cervical cancer is the Neural Network (NN). There are several types of NNs, such as the Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Probabilistic Neural Network (PNN), and Recurrent Neural Network (RNN) [18,19]. Many researchers over the last decade have endorsed the use of MLP for predicting cervical cancer because it provides decent classification accuracy [19,20].
The MLP is a feedforward NN that learns via the Backpropagation method. It has an input layer of neurons that function as receivers, one or more hidden layers of neurons that transform the data, and an output layer that predicts the result, as summarized in Figure 1. Genetic Algorithm (GA) is a stochastic population-based search technique that examines large search spaces efficiently. GA has been used to optimize a range of MLP parameters, including momentum, the number of neurons per layer, stopping criteria, solver(s), and activation function(s) [19,21].
This paper is organized as follows: Section 2 discusses related literature on cervical cancer prediction. Section 3 summarizes the contribution of this research. Section 4 describes the research methods, data preprocessing, the dataset for the study, and the procedures used to combine GA with MLP. Section 5 discusses the results of the proposed method and compares it with other classification approaches. Finally, Section 6 provides the conclusions and future research directions.
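As a minimal illustration of such a network, the sketch below trains a one-hidden-layer MLP with scikit-learn; the toy data, layer size, and settings are assumptions for demonstration, not the tuned values obtained later in this study:

```python
# Minimal MLP sketch: one hidden layer trained by backpropagation.
# The toy data and all hyperparameter values here are illustrative only.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # 200 samples, 4 input factors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # separable toy target

mlp = MLPClassifier(hidden_layer_sizes=(16,),  # one hidden layer, 16 neurons
                    activation="relu",         # activation function
                    solver="adam",             # weight-update solver
                    max_iter=500, random_state=0)
mlp.fit(X, y)
train_acc = mlp.score(X, y)              # fraction of correct predictions
```

The hidden-layer sizes, solver, and activation function shown here are exactly the kinds of hyperparameters that the GA tunes later in the paper.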

Related Literature
This section reviews and critiques relevant machine learning (ML) approaches for predicting cervical cancer. Most of the studies conducted to predict cervical cancer used the cervical cancer dataset from the University of California Irvine (UCI) Machine Learning Repository.
Wu and Zhou [4] implemented three different methods to diagnose the four target variables of cervical cancer (HSCB). The dataset contained 32 factors that potentially cause cervical cancer. The authors used the Random Oversampling method to balance the dataset of 668 patients. They combined SVM with Recursive Feature Elimination (RFE), combined SVM with Principal Component Analysis (PCA), and employed the traditional SVM, and they compared these methods' classification of cervical cancer. The classification results show that SVM-RFE and SVM-PCA, using only eight selected factors, provide more accurate results than the traditional SVM technique [4,22].
Abdoh et al. [22] predicted cervical cancer by using RFE and PCA to eliminate several features. The researchers used the Synthetic Minority Oversampling Technique (SMOTE) to balance the cervical cancer dataset from UCI. Random Forest (RF) was used to diagnose cervical cancer and was compared with RF-PCA and RF-RFE. Their results show that integrating SMOTE with RF classifiers enhances classification accuracy by about 4% when compared with the work conducted by Wu and Zhou [4].
Deng et al. [23] applied RF, SVM, and eXtreme Gradient Boosting (XGBoost) to diagnose four target variables of cervical cancer: HSCB. Similar to Abdoh et al. [22], the authors used SMOTE to balance the dataset. However, their research did not use feature elimination techniques. Their results show that RF and XGBoost performed better than SVM in terms of classification accuracy. The results of Deng et al. [23] are similar to those reported by Abdoh et al. [22].
Adem et al. [15] applied deep learning to the same cervical cancer dataset by employing Softmax classification with a stacked autoencoder. The stacked autoencoder was used as a dimensional reduction tool, while the softmax layer was used to predict the HSCB target variables of cervical cancer. The authors validated their approach against six ML methods: RF, Decision Tree (DT), MLP, SVM, Rotation Forest models, and KNN [15]. Their proposed approach for diagnosing the four target variables (HSCB) of cervical cancer achieved better performance than the method of Wu and Zhou [4], with close to a 4% improvement in classification accuracy.
Alsmariy et al. [24] used RF, Logistic Regression (LR), and DT to diagnose HSCB of cervical cancer. Similar to Deng et al. [23] and Abdoh et al. [22], the authors applied SMOTE to balance the dataset. They used PCA as a feature reduction approach. They implemented a voting technique that enables several algorithms to vote to select a winner. Their results perform better than those of Abdoh et al. [22], Deng et al. [23], and Wu and Zhou [4].
Wahid and Al-Mazini [5] adopted a meta-heuristic algorithm, Ant Colony Optimization (ACO) with the Ant-Miner data classification rule, to select the most critical risk factors to diagnose HSCB of cervical cancer. The authors claimed that their method was used for the first time in the literature for the UCI cervical cancer dataset. Their classification results perform better than those of Wu and Zhou [4] but are inferior to the reported results of Alsmariy et al. [24], Abdoh et al. [22], and Deng et al. [23].
Other researchers such as Devi et al. [10] and Fernandes et al. [14] used image processing and Deep Learning to diagnose and predict cervical cancer with a different dataset from the UCI. Devi et al. [10] classified cervical cancer into normal and abnormal cells using NN and Learning Vector Quantification (LVQ). Digital photographs of patients were used as inputs. To diagnose cervical cancer, LVQ was used to obtain the coefficient mean value of the extracted photographs. Their model achieves a 90% classification accuracy [10].
In contrast, deep learning algorithms were used by Fernandes et al. [14] to predict cervical cancer. Their approach is founded on a loss function that permits dimensional reduction to identify the most important classification variables. They concentrated on a single target variable, Biopsy. Their algorithm achieves an Area Under the Curve (AUC) of 0.6875 [14].
There were a few studies that combined GA with MLP to diagnose cervical cancer. In one study, GA was used to determine the optimal initial weights and bias of the MLP to classify cervical cancer [25]. The study was conducted on a dataset with 401 patients, of which 51.2% had cervical cancer, and included 16 risk factors. The accuracy of their proposed model improved from 94.51% to 96.26% when combining GA with MLP. Similarly, another study adopted GA to optimize the MLP's initial weights and threshold to identify Nanoparticle (NP) sensors in the early diagnosis of cervical cancer cells. That study compared the performance of the GA and MLP combination with a standalone MLP; the results indicated that combining GA with MLP achieved statistically better Root Mean Square (RMS) error and Mean Absolute Error (MAE) than MLP alone [26].
In summary, most of the research used classical machine learning classification algorithms and deep learning approaches to diagnose cervical cancer. The literature on diagnosing HSCB cited above is also summarized in Section 5, where it is compared with the proposed hybrid system of PCA-GA-MLP. Two studies used GA to optimize the MLP's initial weights, threshold, or bias to diagnose cervical cancer. However, none of the research optimized the hyperparameters of the MLP, which include the size of each hidden layer, solvers, and activation functions, to diagnose cervical cancer. Further, a hybrid model of PCA-GA-MLP for the diagnosis of cervical cancer has not been proposed in the literature.

Contribution
This study is the first that integrates PCA, GA, and MLP altogether in one framework that accurately predicts cervical cancer using the benchmark dataset from UCI. The proposed method transforms all available features using the PCA method. The transformed features are utilized in model constructions of the MLP, which is within a hybrid system of GA and MLP. GA is used to optimize the MLP parameters, whereas the MLP acts as a simulator within the GA. The hybrid system iteratively evolves the optimal design of MLP that provides the best cervical cancer classification accuracy. The developed framework introduces a robust tool that allows medical teams to predict cervical cancer as a preventive strategy that reduces cervical cancer rates and costs while improving the quality of care for cancer patients.

Research Methodology
This research has four main steps, as summarized in Figure 2.
Step 1 involves preprocessing and balancing the dataset.
Step 2 describes the application of the feature selection process. In this research, four feature selection approaches are applied separately to diagnose each target variable/test of cervical cancer: using the transformed features from PCA; using all original features; using the top 18 features based on RF importance; and using the top 10 features based on RF importance.
Step 3 explains how GA is used as an optimization tool to determine the optimal parameters of the MLP to predict cervical cancer. This process is applied to the four target variables (HSCB) of cervical cancer separately. Four feature selection approaches are implemented for each target variable, which results in 16 different scenarios (i.e., four scenarios for each cervical cancer variable: PCA-GA-MLP, GA-MLP using all 30 factors, GA-MLP using the top 18 factors, and GA-MLP using the top 10 factors).
Step 4 determines the performance measures for each scenario using a 5-fold cross-validation and compares each scenario's results with nine other classification algorithms. Five-fold cross-validation is used because it allows all available data instances to be used for both model development and model validation [27]. The nine algorithms are as follows: RF, Linear Discriminant Analysis (LDA), SVM, LR, Gaussian Naïve Bayes (NB), KNN, DT, Adaptive Boosting (AdaBoost), and Centroid-Displacement-based KNN (CD-KNN) [28]. Therefore, a total of 160 different experiments are implemented.
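The 5-fold comparison described in Step 4 can be sketched as follows; the synthetic data and the three classifiers shown are stand-ins for the preprocessed dataset and the full set of nine algorithms:

```python
# 5-fold cross-validation sketch: each classifier is scored on 5 held-out
# folds and mean accuracies are compared. Synthetic data is a stand-in.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
y = (X[:, 0] - X[:, 3] > 0).astype(int)

classifiers = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=1),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=1),
}
mean_acc = {name: cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in classifiers.items()}
```

With 5 folds, every data instance contributes to both model development and validation, which is the rationale given above.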

Dataset Description and Data Preprocessing
UCI published the cervical cancer dataset. The dataset contains 858 patients, each described by 32 cervical cancer factors, as summarized in Table 1. It has four cervical cancer target variables: Hinselmann, Schiller, Cytology, and Biopsy. Each cervical cancer target variable is treated as a separate problem in this research; therefore, four datasets are prepared and used separately. Because some patients refused to answer personal questions, some data are missing. As a result, the dataset needed to be preprocessed to account for the null data. Two factors were removed due to having a large amount of missing data (i.e., factors 27 and 28 in Table 1), and some samples were removed for the same reason. As a result, there are 668 patients in the final (preprocessed) dataset. Each data instance has 30 distinct risk factors and four target variables (HSCB) of cervical cancer. Normalization is applied to some numerical factors to eliminate data redundancy and avoid undesirable effects caused by the wide range of values in those factors. The percentage of cervical cancer patients within the dataset is 4.5% for Hinselmann, 9.4% for Schiller, 5.8% for Cytology, and 6.7% for Biopsy. Therefore, the random oversampling technique is used to balance the unbalanced dataset.
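A compact sketch of this preprocessing pipeline is shown below; the tiny array, the 50% missingness threshold, and min-max scaling are assumptions chosen for illustration only:

```python
# Preprocessing sketch: drop high-missingness columns, drop incomplete rows,
# min-max normalize, then randomly oversample the minority class.
# NaN marks a refused answer; the tiny array is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[:15, 4] = np.nan          # column 4: mostly missing -> will be dropped
X[0, 1] = np.nan            # row 0: one missing value -> will be dropped
y = np.array([1, 1] + [0] * 18)

# 1) Drop factors with too much missing data (threshold is an assumption).
keep_cols = np.isnan(X).mean(axis=0) < 0.5
X = X[:, keep_cols]
# 2) Drop remaining samples that still have missing values.
keep_rows = ~np.isnan(X).any(axis=1)
X, y = X[keep_rows], y[keep_rows]
# 3) Min-max normalization per factor.
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# 4) Random oversampling: resample the minority class with replacement.
minority = y == 1 if (y == 1).sum() < (y == 0).sum() else y == 0
extra = rng.choice(np.flatnonzero(minority),
                   size=(~minority).sum() - minority.sum(), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```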

Feature Selection and Principal Component Analysis
Feature selection is the process of minimizing the number of input variables when creating a predictive model. The number of input variables might be reduced to decrease the computational cost of modeling and, in some circumstances, to improve the model's performance [33,34].
RF is a Bagging algorithm that combines several different DTs. In tree-based techniques such as RF, features are naturally ranked by how much they improve node purity or, in other words, how much they reduce impurity (Gini impurity) across all trees. The nodes with the greatest reduction in impurity are located at the top of the trees, while those with the smallest reduction are found near the leaves. A subset of the most important features can be created by pruning trees below a certain node [35][36][37]. The proposed framework investigates two feature selection methods, PCA and RF importance, where RF is used to select the best 18 factors and the best 10 factors with the highest relative importance in predicting cervical cancer, as shown in Tables 2 and 3. The factors listed in Tables 2 and 3 refer to the factor numbers provided in Table 1.

PCA is a statistical method that uses eigenvectors to determine the orientation of features. PCA's fundamental idea is to map a j-dimensional feature space into an i-dimensional space, generally known as the principal components, where i < j. The covariance matrix of the data is calculated, and its eigenvectors and eigenvalues are computed. Because an eigenvalue measures the amount of variance captured along its eigenvector, the eigenvector with the greatest eigenvalue is selected as the first principal component of the cervical cancer dataset. The eigenvalues are sorted in descending order to select the most significant principal component(s), while the components with the lowest eigenvalues are discarded. This process reduces a high-dimensional dataset to a lower-dimensional one while retaining most of the variance, which measures the dispersion of the data in the cervical cancer dataset. Finally, the component loadings can be transformed using varimax orthogonal rotation or oblique rotation [38].
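The RF-importance ranking described above can be sketched as follows, with synthetic data standing in for the risk factors of Table 1 and k chosen to match one of the two subset sizes used in this research:

```python
# Random-Forest importance sketch: rank factors by Gini-impurity reduction
# and keep the k most important ones. Synthetic data is a stand-in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 30))                 # 30 candidate risk factors
y = (X[:, 0] + 2 * X[:, 5] > 0).astype(int)    # only factors 0 and 5 matter

rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)
k = 10
top_k = np.argsort(rf.feature_importances_)[::-1][:k]  # best k factor indices
X_top = X[:, top_k]                            # reduced feature matrix
```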
In this research, PCA transforms the 30 original factors into a two-dimensional space, which accounts for most of the variance in the original dataset. The PCA approach reduces the noise in the cervical cancer dataset and also reduces the computational time for model development.
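A minimal PCA sketch consistent with this description (center the data, compute the covariance matrix, eigendecompose it, and keep the two components with the largest eigenvalues) is shown below; the correlated synthetic data is a stand-in for the 30 preprocessed factors:

```python
# PCA sketch: covariance matrix, eigendecomposition, keep the components
# with the largest eigenvalues, and project the data onto them.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 30)) @ rng.normal(size=(30, 30))  # correlated factors
Xc = X - X.mean(axis=0)                       # center each factor

cov = np.cov(Xc, rowvar=False)                # 30 x 30 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # re-sort descending
components = eigvecs[:, order[:2]]            # top-2 principal components
X_2d = Xc @ components                        # project 30-D data to 2-D
explained = eigvals[order[:2]].sum() / eigvals.sum()  # variance retained
```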

Combination of GA-MLP
GA is used in this research as an optimization tool to determine the optimal hyperparameters for the MLP that provides the highest classification accuracy in diagnosing each target variable of cervical cancer. The MLP has several hyperparameters that need to be fine-tuned, which include the size of each hidden layer, solvers, and activation functions. The hyperparameters of MLP are encoded as chromosomes in the GA. The population of solutions/chromosomes in the GA represents a population of MLPs. The classification accuracy of each MLP, after network training is completed, is used as the fitness value of that solution.
Initially, the GA has a random population of solutions. In each generation, each solution (MLP) goes through the training process to determine its fitness value (classification accuracy). The evaluated solutions then go through the typical evolution process of the GA: select two parent solutions, crossover the parent solutions to create two children with a probability of Pc, and mutate the children with a probability of Pm. Once the internal loop is completed and reaches half of the population size (n/2), the replacement process culls all the parent solutions, the children advance to the next generation, and the generation counter increases by one. This process continues until the termination criterion is met (gmax). Finally, the best solution out of all generations is selected, which represents the MLP's optimized parameters for predicting patients with cervical cancer. Algorithm 1 presents the pseudocode of the hybrid GA-MLP for optimizing the MLP parameters.

Algorithm 1. Hybrid GA-MLP pseudocode.
  Generate a random initial population of n solutions
  for g = 1 until gmax do
    Train each MLP and evaluate its fitness (classification accuracy)
    for i = 1 until n/2 do
      Select two parents
      Crossover to create two children with Pc
      Mutate children with Pm
    end for
    Replace parents with children
  end for
  Return the best solution
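The generational loop just described can be sketched in Python. The "one-max" fitness below is a stand-in for training an MLP and returning its accuracy, and the small parameter values are assumptions chosen only to keep the example fast:

```python
# Generational GA sketch: tournament selection, single-point crossover (Pc),
# bitwise mutation (Pm), full replacement, and elitist best tracking.
import random

random.seed(0)
L, n, Pc, Pm, gmax, k = 20, 20, 1.0, 0.05, 40, 4

def fitness(sol):
    # Stand-in for training an MLP and returning its classification accuracy.
    return sum(sol) / L                          # "one-max": share of 1-bits

def tournament(pop):
    # Tournament selection: best of k randomly drawn solutions.
    return max(random.sample(pop, k), key=fitness)

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(n)]
best = max(pop, key=fitness)
for g in range(gmax):
    children = []
    for _ in range(n // 2):                      # inner loop of n/2 pairs
        p1, p2 = tournament(pop), tournament(pop)
        if random.random() < Pc:                 # single-point crossover
            cut = random.randrange(1, L)
            c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        else:
            c1, c2 = p1[:], p2[:]
        for c in (c1, c2):                       # bitwise mutation with Pm
            for i in range(L):
                if random.random() < Pm:
                    c[i] = 1 - c[i]
        children += [c1, c2]
    pop = children                               # replace parents with children
    best = max(pop + [best], key=fitness)        # track the best overall
```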

Main Operators and MLP Hyperparameters
The following parameters are used to design the GA: tournament selection is used to select the parents (k = 4) for breeding; the crossover probability is set at 1 to perform a single-point crossover operation; the mutation probability is set at 0.001; the population size is fixed at 50; and the stopping criterion is 200 generations.
The hyperparameters of MLP are the encoded solution vectors in the GA. Each solution vector considers 4 types of activation functions, 3 types of solvers, and 50 different sizes for the first and the second layers in the MLP. However, some hyperparameters for the MLP are fixed: the learning rate is set at 0.001, the momentum is set at 0.90, and the stopping criterion is set to 200 iterations. Each solution undergoes network training, and the classification accuracy of the trained MLP is used as its fitness value in the GA. The proposed methodology is coded in the Python 3.9 environment.
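A sketch of how such a solution vector might be decoded is shown below. The gene layout and the specific value pools are assumptions, since the text specifies only their sizes (4 activation functions, 3 solvers, and 50 sizes per hidden layer):

```python
# Decoding a GA chromosome into MLP hyperparameters. The gene layout and
# the value pools below are assumptions for illustration.
ACTIVATIONS = ["identity", "logistic", "tanh", "relu"]  # 4 activation functions
SOLVERS = ["lbfgs", "sgd", "adam"]                      # 3 solvers
LAYER_SIZES = list(range(1, 51))                        # 50 sizes per layer

def decode(chromosome):
    """chromosome = (activation_gene, solver_gene, layer1_gene, layer2_gene)."""
    a, s, h1, h2 = chromosome
    return {
        "activation": ACTIVATIONS[a % len(ACTIVATIONS)],
        "solver": SOLVERS[s % len(SOLVERS)],
        "hidden_layer_sizes": (LAYER_SIZES[h1 % 50], LAYER_SIZES[h2 % 50]),
        # Fixed (non-evolved) settings stated in the text:
        "learning_rate_init": 0.001,
        "momentum": 0.90,
        "max_iter": 200,
    }

params = decode((3, 2, 15, 29))  # -> relu, adam, hidden layers (16, 30)
```

The decoded dictionary could then be passed to an MLP implementation (e.g., `MLPClassifier(**params)` in scikit-learn), and the trained network's accuracy would serve as the chromosome's fitness value.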

Performance Metrics
Accuracy, sensitivity, specificity, and precision are the most common data mining performance metrics [39]. These performance metrics are defined in Equations (1)-(5). The confusion matrix output determines the following counts: True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP) [40]. TP is the number of correct predictions that a patient has cervical cancer, i.e., cervical cancer is correctly classified as cervical cancer. TN is the number of correct predictions that a patient does not have cervical cancer, i.e., non-cervical cancer is correctly classified as non-cervical cancer. FN is the number of incorrect predictions that a patient does not have cervical cancer, i.e., cervical cancer is identified as non-cervical cancer. FP is the number of incorrect predictions that a patient has cervical cancer, i.e., non-cervical cancer is detected as cervical cancer. Table 4 summarizes the confusion matrix for diagnosing cervical cancer.

In cervical cancer diagnosis, the performance metrics are interpreted as follows: accuracy refers to how well the model categorizes TP and TN cervical cancer cases out of all instances. Sensitivity is the percentage of cervical cancer cases that the model correctly classifies out of all patients with cervical cancer. Specificity measures how well the model categorizes individuals as not having cervical cancer out of those without the disease. Precision is the percentage of correctly classified cervical cancer cases out of all cases that are classified as cervical cancer. Finally, the F1-score, the harmonic mean of sensitivity and precision, is generally considered a better summary measure for imbalanced data.
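These definitions translate directly into code; the confusion-matrix counts below are illustrative values, not results from the study:

```python
# Performance metrics computed from confusion-matrix counts, following the
# definitions of Equations (1)-(5). The example counts are illustrative.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall on cervical cancer cases
    specificity = tn / (tn + fp)      # recall on non-cancer cases
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

acc, sens, spec, prec, f1 = metrics(tp=45, tn=40, fp=5, fn=10)
```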

Results and Discussion
Experimental results for each target variable of cervical cancer are discussed separately in this section. The proposed approach is compared with nine different classification algorithms. Further, the performance of the PCA-GA-MLP model is compared with the best results in the studies that were conducted on the same cervical cancer dataset from UCI.

Target Variable: Hinselmann
Table 5 summarizes the performance measures obtained when comparing GA-MLP with other classifiers to diagnose Hinselmann. When adopting PCA as a dimensional reduction tool, PCA-GA-MLP outperforms all nine classification algorithms; it has an accuracy of 98.20%, a sensitivity of 100.00%, a specificity of 96.37%, a precision of 96.54%, and an F1-score of 98.24%. PCA-AdaBoost is the next best method; it has an accuracy of 94.67%, a sensitivity of 100.00%, a specificity of 89.29%, a precision of 90.41%, and an F1-score of 94.96%.

Target Variable: Schiller
The performance measures of GA-MLP and other algorithms in diagnosing Schiller are summarized in Table 7. When applying PCA as a feature selection method, PCA-GA-MLP achieves the highest accuracy, sensitivity, specificity, precision, and F1-score of 96.78%, 100.00%, 93.61%, 93.97%, and 96.87%, respectively. PCA-CD-KNN is the second-best method, with an accuracy of 87.52%, a sensitivity of 99.15%, a specificity of 75.88%, a precision of 80.50%, and an F1-score of 88.83%.
When using all 30 factors to diagnose Schiller, GA-MLP has the best accuracy, precision, and F1-score (94.13%, 89.82%, and 94.45%, respectively), followed by CD-KNN and KNN. When using the best 18 factors to diagnose Schiller, GA-MLP provides the best results across all five performance measures. When selecting the best 10 factors to diagnose Schiller, GA-MLP performs better than all other classifiers in terms of accuracy, sensitivity, precision, and F1-score (93.39%, 99.17%, 88.98%, and 93.79%, respectively).
The proposed method, PCA-GA-MLP, is compared with the best approaches in the literature in the diagnosis of Schiller, using the same benchmark dataset from UCI. As shown in Table 8, PCA-GA-MLP performs better than the methods reported by Wu and Zhou [4], Abdoh et al. [22], Deng et al. [23], and Wahid and Al-Mazini [5] in terms of accuracy, sensitivity, and F1-score. The PCA-GA-MLP method achieves the highest sensitivity when compared with all other approaches. However, the method of Alsmariy et al. [24] has slightly better accuracy, specificity, and F1-score than the proposed PCA-GA-MLP method.

Target Variable: Cytology
Table 9 compares the performance of GA-MLPs with other algorithms in diagnosing Cytology. When implementing PCA as a dimensional reduction technique, PCA-GA-MLP achieves the highest accuracy, specificity, precision, and F1-score of 97.54%, 95.15%, 95.27%, and 97.56%, respectively. The next best method is PCA-RF, which has an accuracy of 91.58%, a sensitivity of 95.30%, a specificity of 88.06%, a precision of 88.73%, and an F1-score of 91.83%.
Table 10 compares PCA-GA-MLP with other approaches from the literature in diagnosing Cytology. PCA-GA-MLP achieves better accuracy and F1-score than all reported approaches. In terms of sensitivity, PCA-GA-MLP ranks second, next to the method reported by Wu and Zhou [4]. In terms of specificity, PCA-GA-MLP performs better than the methods reported by Wu and Zhou [4] and Alsmariy et al. [24].

Target Variable: Biopsy
Table 11 compares the performance measures of GA-MLP with other classifiers to diagnose Biopsy. When adopting PCA as a dimensional reduction tool, PCA-GA-MLP achieves the highest accuracy, specificity, precision, and F1-score of 97.75%, 95.54%, 95.63%, and 97.76%, respectively. PCA-AdaBoost is the second-best method with an accuracy of 90.93%, a sensitivity of 94.24%, a specificity of 87.90%, a precision of 88.47%, and an F1-score of 91.19%. When using all 30 factors or the top 10 factors to diagnose Biopsy, GA-MLP outperforms all other approaches in all performance measures. In the case of using the top 18 factors to diagnose Biopsy, GA-MLP outperforms all other approaches in terms of accuracy, sensitivity, precision, and F1-score.
Figure 6 compares the performance of the hybrid system of PCA-GA-MLP and the three GA-MLPs that used the top 10 factors, the top 18 factors, and all 30 factors to diagnose Biopsy. As shown in Figure 6, the hybrid system of PCA-GA-MLP outperforms all three GA-MLPs across all but one performance indicator. All three GA-MLPs have similar performance regardless of the number of factors included in the GA-MLP models.
Table 12 compares the performance of PCA-GA-MLP with other approaches from the literature in diagnosing Biopsy. PCA-GA-MLP achieves better accuracy and F1-score than all reported approaches. In terms of sensitivity, PCA-GA-MLP ranks third, next to the methods reported by Wu and Zhou [4] and Alsmariy et al. [24].

Four Target Variables
This section compares the classification accuracy of the proposed PCA-GA-MLP method with reported studies from the literature for all four target variables of cervical cancer (HSCB) on the same dataset from UCI. The classification accuracy is the only performance measure that is reported in the six relevant studies from the literature.
As shown in Figure 7, PCA-GA-MLP achieves the highest classification accuracy in diagnosing Hinselmann, Cytology, and Biopsy. In diagnosing Schiller, PCA-GA-MLP ranks third, next to the methods reported by Alsmariy et al. [24] and Adem et al. [15].

Conclusions and Future Work
This research proposed an integrated system of PCA, GA, and MLP for diagnosing cervical cancer cases using the benchmark dataset from UCI. There are four target variables of cervical cancer in the dataset: Hinselmann, Schiller, Cytology, and Biopsy. Four feature selection approaches were explored; dimensional reduction was performed using the PCA method, and different subsets of the original factors were selected based on Random Forest Importance.
Experimental results show that the hybrid system of PCA-GA-MLP outperforms all other classification algorithms and the three GA-MLP versions on all four target variables. In comparison with the existing approaches in the literature that were implemented on the same cervical cancer dataset, the PCA-GA-MLP model achieves the highest classification accuracy in Hinselmann, Cytology, and Biopsy (98.20%, 97.54%, and 97.75%, respectively).
Given the growing interest in cervical cancer research using ML algorithms, this research developed a robust predictive tool for cervical cancer. Physicians and healthcare providers can use the proposed model to identify patients with cervical cancer in its early stages. For the identified cases, the medical team can focus their effort on preventive actions and plans to improve women's care and ultimately reduce cervical cancer rates and the associated costs.
The proposed research has a few limitations. The current approach did not consider RFE as a tool for selecting the best risk factors. Furthermore, the random oversampling technique was used to balance the unbalanced dataset. For future work, feature selection using RFE and advanced data balancing techniques such as SMOTE and cost-sensitive learning can be explored in the proposed method to further improve the overall performance.