In this section, the results of the experiments are presented and explained, starting with the evaluation metrics, followed by the feature selection results, and concluding with the classification results.
4.2. Feature Selection Results
The results obtained from the feature selection methods, presented in Table 3, for classifying cervical cancer using the binary Waterwheel Plant Algorithm and Particle Swarm Optimization (bWWPAPSO) showcase varying performances across multiple evaluation metrics. These metrics provide crucial insights into the efficacy and suitability of each method in identifying pertinent features for accurate classification. The average error rate, a pivotal metric indicating classification accuracy, reveals notable differences among the methods. Notably, the bWWPAPSO method stands out with the lowest average error rate of 0.712, suggesting a superior ability to accurately classify cervical cancer cases compared to other techniques such as bPSO, bBA, bWAO, bBBO, and others. This lower error rate implies higher precision in distinguishing between cancerous and non-cancerous cases, signifying the potential effectiveness of bWWPAPSO in this specific context. Considering the average select size, which measures the number of features selected by each method, bWWPAPSO demonstrates a relatively lower average select size of 0.685. A smaller average select size indicates a more concise feature subset, which can potentially reduce computational complexity and overfitting. However, this needs careful consideration alongside classification accuracy to strike an optimal balance between a reduced feature count and maintaining high prediction performance. The fitness metrics (average, best, worst, and standard deviation) provide further insights into the selected feature subsets’ quality, variability, and stability across different runs. Regarding average and best fitness, bWWPAPSO consistently displays competitive results, indicating its effectiveness in generating feature subsets with strong classification capabilities. Moreover, the relatively lower standard deviation (Std Fitness) observed in bWWPAPSO runs suggests a more stable performance than other methods like bBBO, bMVO, and others.
The proposed binary Waterwheel Plant Algorithm and Particle Swarm Optimization (bWWPAPSO) method exhibits promising results in accurately classifying cervical cancer, showcased through its lower average error rate and competitive fitness metrics. These findings suggest the potential effectiveness of bWWPAPSO in selecting a concise yet powerful subset of features crucial for classification. Nonetheless, additional validation studies, including testing on independent datasets and further analysis, are essential to affirm its robustness, generalizability, and suitability for real-world applications in cervical cancer classification.
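For context, wrapper-based binary feature selection methods of this kind typically score a candidate subset by combining the classification error with the fraction of selected features. The following Python sketch illustrates one common formulation of such a fitness function; the weighting factor alpha and the k-NN wrapper classifier are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.99):
    """Wrapper fitness for a binary feature mask: lower is better.

    Combines the cross-validated classification error with the
    fraction of selected features; alpha trades accuracy against
    subset size. The k-NN wrapper is an illustrative choice.
    """
    selected = np.flatnonzero(mask)
    if selected.size == 0:          # empty subsets are invalid
        return 1.0
    clf = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(clf, X[:, selected], y, cv=5).mean()
    error = 1.0 - acc
    return alpha * error + (1.0 - alpha) * selected.size / X.shape[1]
```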
The statistical analysis of the feature selection results for classifying cervical cancer using the binary Waterwheel Plant Algorithm and Particle Swarm Optimization (bWWPAPSO), presented in Table 4, provides a comprehensive overview of various statistical measures across the different feature selection methods. The analysis encompasses descriptive statistics such as standard deviation, maximum, minimum, quartiles, mean, and other key parameters for ten feature selection methods. These statistics offer insights into the distribution, variability, and central tendencies of the performance metrics evaluated. Across the methods, the statistics unveil nuanced differences in the performance metrics. For instance, the minimum and maximum values indicate the range within which the performance metrics fluctuate. The minimum values showcase the least optimal performance achieved by each method, while the maximum values highlight the best-performing scenarios. In this case, the bWWPAPSO method demonstrates a minimum value of 0.710 and a maximum of 0.713, showcasing a relatively smaller range (0.003) than other methods, such as bBA, with a range of 0.026. The quartiles (the 25th, 50th or median, and 75th percentiles) indicate the spread of the data around the median. Interestingly, the quartiles exhibit identical values for most methods, including bWWPAPSO, suggesting a consistent distribution of performance metrics within these methods. Moreover, the mean and standard deviation provide insights into the central tendency and dispersion of the data, respectively. The mean values for bWWPAPSO and the other methods help gauge the average performance, while the standard deviation offers a measure of the variability or spread of the performance metric values around the mean. Here, bWWPAPSO displays a smaller standard deviation (0.001), indicating less variability in its performance across different runs than the other methods. Overall, the statistical analysis of the feature selection results reveals nuanced differences among the various methods, providing valuable insights into their performance characteristics. The bWWPAPSO method demonstrates competitive and consistent performance, as indicated by its relatively smaller range, consistent quartile values, and lower standard deviation. However, while these statistics offer a quantitative understanding of performance, further analysis and validation, considering the trade-offs between accuracy, feature subset size, and stability, are essential to determine the most compelling feature selection method for cervical cancer classification.
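Descriptive statistics of this kind are straightforward to reproduce from the per-run error rates. A minimal sketch using pandas, assuming ten runs per method are stored in a DataFrame with one column per method (the values below are placeholders, not the paper's data):

```python
import pandas as pd

# One column per feature selection method, one row per run
# (placeholder values, not the paper's measurements).
runs = pd.DataFrame({
    "bWWPAPSO": [0.712, 0.710, 0.713, 0.711, 0.712,
                 0.712, 0.711, 0.713, 0.712, 0.711],
    "bBA":      [0.745, 0.760, 0.771, 0.752, 0.748,
                 0.769, 0.755, 0.750, 0.763, 0.747],
})

# min, max, quartiles, mean, and std per method, plus the range
summary = runs.describe().T
summary["range"] = summary["max"] - summary["min"]
print(summary[["min", "25%", "50%", "75%", "max", "mean", "std", "range"]])
```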
The ANOVA (Analysis of Variance) test, presented in Table 5, applied to the feature selection results for classifying cervical cancer using the binary Waterwheel Plant Algorithm and Particle Swarm Optimization (bWWPAPSO), provides critical insights into the significance of the differences among the performance metrics obtained from the various feature selection methods. The ANOVA table consists of three main components, treatment, residual, and total, each revealing specific information about the variability and significance of the methods’ performance. Treatment: this section of the ANOVA table assesses the variance between the different treatment groups, i.e., the various feature selection methods. The Sum of Squares (SS) for treatment is 0.056, indicating the total variability attributed to differences among the methods. The Degrees of Freedom (DF) is 9, representing the number of feature selection methods minus one. The Mean Square (MS), calculated as SS divided by DF, is 0.0062. The F-statistic (F) measures the ratio of the variance between the methods to the variance within the methods. In this case, the F-statistic is 286.3, with degrees of freedom for the numerator (DFn) of 9 and for the denominator (DFd) of 90, resulting in a highly significant p-value (p < 0.0001). This implies significant differences among the feature selection methods regarding their impact on classification performance for cervical cancer. Residual: this section of the table focuses on the variance within each method, i.e., the variability that the treatment (feature selection methods) cannot explain. The SS for the residual is 0.002, indicating the unexplained variance within the methods. The DF is 90, representing the total number of observations minus the number of treatment groups. The MS for the residual is 0.00002. Total: the total variability in the dataset is accounted for in this section. The Total SS is 0.058, encompassing both the variance due to the treatment (feature selection methods) and the residual variance. The Total DF is 99, the sum of the DF for treatment and residual. The ANOVA test results indicate significant differences among the feature selection methods (treatments) concerning their impact on the classification performance for cervical cancer. The highly significant p-value (p < 0.0001) indicates that the variability in performance metrics across these methods is not due to random chance; instead, there are genuine differences in their effectiveness. This analysis underscores the importance of selecting the most appropriate feature selection method, as it significantly influences the classification outcome in cervical cancer analysis.
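A one-way ANOVA of this design can be reproduced with scipy; a minimal sketch, assuming ten per-run error rates for each of the ten methods (synthetic placeholder data):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Ten methods x ten runs of error rates (placeholder data).
methods = [rng.normal(loc=0.71 + 0.005 * i, scale=0.005, size=10)
           for i in range(10)]

# One-way ANOVA: DFn = 10 - 1 = 9, DFd = 100 - 10 = 90
f_stat, p_value = f_oneway(*methods)
print(f"F(9, 90) = {f_stat:.1f}, p = {p_value:.2e}")
```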
To verify the robustness and reliability of the ANOVA results, the feature selection evaluations were repeated under additional configurations. Specifically, the F-statistic and significance levels were re-examined under changed DF values by running the evaluations with 20 and 30 repetitions for each method.
Table 6 and Table 7 present the ANOVA results for both configurations. The treatment DF remains constant at 9, representing the ten feature selection methods under analysis, while the number of repetitions per method determines the residual DF: 190 for twenty runs (200 observations minus 10 groups) and 290 for thirty runs (300 minus 10). All tests remain highly significant (p < 0.0001), which validates the dependability of the feature selection evaluation.
The Wilcoxon signed-rank test, presented in Table 8, applied to the feature selection results for classifying cervical cancer using the binary Waterwheel Plant Algorithm and Particle Swarm Optimization (bWWPAPSO), aims to ascertain whether there are statistically significant differences between the performances of these methods. The test compares measurements from the same dataset to determine whether one method consistently outperforms another. In this case, the theoretical median (expected performance) for all methods is 0, while the actual median performance of each method is given. The Wilcoxon signed-rank test examines the hypothesis that there is no difference in the medians of the paired samples. The “sum of signed ranks” (W), “sum of positive ranks”, and “sum of negative ranks” are calculated from the ranks of the differences between the paired observations. Here, the sum of signed ranks (W) for each method is 55, suggesting a consistent trend in performance across the methods evaluated. The “p-value (two-tailed)” associated with each method is 0.002, indicating a high significance level. This p-value suggests that there is only a 0.2% probability (assuming the null hypothesis is true) of observing such extreme differences in medians between the methods by chance alone. Hence, the low p-value leads to rejecting the null hypothesis, suggesting significant differences in performance between these feature selection methods for classifying cervical cancer. The “Discrepancy” column illustrates the difference between the theoretical and actual medians for each method, reflecting the magnitude of the deviation from the expected performance.
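The one-sample form of this test is available in scipy; a minimal sketch, testing ten per-run error rates for one method against the theoretical median of 0 (placeholder values):

```python
import numpy as np
from scipy.stats import wilcoxon

# Ten per-run error rates for one method (placeholder values).
errors = np.array([0.712, 0.710, 0.713, 0.711, 0.709,
                   0.714, 0.708, 0.715, 0.707, 0.716])

# All differences from the theoretical median (0) are positive,
# so the sum of positive ranks is 1 + 2 + ... + 10 = 55 and the
# exact two-tailed p-value is 2 / 2**10 ~= 0.002. Note that scipy
# reports min(R+, R-), here 0, rather than the signed-rank sum 55.
res = wilcoxon(errors - 0.0)
print(f"W = {res.statistic}, p = {res.pvalue:.3f}")
```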
The Wilcoxon signed-rank test reveals statistically significant differences in the performance of the feature selection methods, highlighting that these methods do not perform equally when applied to cervical cancer classification. This underscores the importance of choosing the most effective method based on both statistical significance and actual performance metrics when selecting features for classifying cervical cancer cases. In addition, the average classification error for cervical cancer using the proposed feature selection method, compared to the other feature selection algorithms, is shown in Figure 5. The figure clearly shows that the proposed feature selection algorithm achieves the lowest average error among all compared feature selection methods.
4.3. Cervical Cancer Classification Results
The classification results, shown in Table 9, for cervical cancer based on selected features demonstrate varying performances across different machine learning algorithms. These metrics, including Accuracy, Sensitivity (True Positive Rate, TPR), Specificity (True Negative Rate, TNR), Positive Predictive Value (PPV), Negative Predictive Value (NPV), and F-score, provide comprehensive insights into the effectiveness of each algorithm in correctly classifying cancerous and non-cancerous cases. The Neural Network (Multi-Layer Perceptron, MLP) exhibits the highest Accuracy among the models, achieving 0.881, indicating the proportion of correctly classified cases. Additionally, it showcases a high Sensitivity (TPR) of 0.862, demonstrating the model’s ability to correctly identify most of the positive (cancerous) cases. Its high Specificity (TNR) of 0.889 implies its effectiveness in accurately identifying negative (non-cancerous) cases. Moreover, the Neural Network achieves a substantial Positive Predictive Value (PPV) of 0.769 and an impressive Negative Predictive Value (NPV) of 0.938, emphasizing its ability to precisely predict positive and negative cases, respectively. The F-score of 0.813 highlights a balanced performance between precision and recall.
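All of the reported metrics derive directly from the four confusion matrix counts; a minimal helper sketch, assuming binary label arrays (the function name is illustrative):

```python
from sklearn.metrics import confusion_matrix

def summarize(y_true, y_pred):
    """Compute the reported metrics from the confusion matrix counts."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)              # Sensitivity (True Positive Rate)
    tnr = tn / (tn + fp)              # Specificity (True Negative Rate)
    ppv = tp / (tp + fp)              # Positive Predictive Value
    npv = tn / (tn + fn)              # Negative Predictive Value
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)  # F-score (harmonic mean of PPV, TPR)
    return {"Accuracy": acc, "TPR": tpr, "TNR": tnr,
            "PPV": ppv, "NPV": npv, "F-score": f1}
```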
The classification performance of the proposed model on the training data is depicted through its confusion matrix (Figure 6). The matrix reports the counts of true-positive, true-negative, false-positive, and false-negative cases, from which the sensitivity and specificity achieved during training can be read directly.
The confusion matrix in Figure 7 assesses the proposed model’s generalization capability on the test set. The near-perfect ratio of correct predictions and the minimal number of false outcomes indicate that the model’s strong performance extends beyond the training data and remains stable in practice.
The Random Forest algorithm demonstrates good overall performance with an Accuracy of 0.767. It achieves a notably high Sensitivity (TPR) of 0.896, indicating a robust capability to identify positive cases correctly. However, its Specificity (TNR) is comparatively lower at 0.537, suggesting a moderate ability to identify negative cases accurately. The Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are 0.775 and 0.744, respectively. The F-score of 0.831 reflects a balanced performance between precision and recall. Other algorithms like Support Vector Machine, Gradient Boosting, K-Nearest Neighbors, Decision Tree, Logistic Regression, and AdaBoost exhibit varying levels of performance in terms of Accuracy, Sensitivity, Specificity, PPV, NPV, and F-score. They generally demonstrate moderate to good performance, with differences in their strengths in correctly classifying cervical cancer cases and their ability to avoid misclassification. The results highlight the diverse performance of machine learning algorithms in cervical cancer classification. While the Neural Network (MLP) and Random Forest show promising results with high Accuracy and balanced TPR-TNR trade-offs, the choice of the most suitable model should consider the specific needs of the application, balancing trade-offs between sensitivity, specificity, and predictive values for clinical or practical relevance.
The classification results, shown in Table 10, for cervical cancer based on selected features using the WWPAPSO+MLP method showcase outstanding performance across various evaluation metrics, highlighting its effectiveness in accurately distinguishing between cancerous and non-cancerous cases. This method combines the binary Waterwheel Plant Algorithm and Particle Swarm Optimization for feature selection, followed by classification using a Multi-Layer Perceptron (MLP) model. The WWPAPSO+MLP method achieves an exceptionally high Accuracy of 0.973, indicating the proportion of correctly classified cases among the total instances. Moreover, it demonstrates an impressive Sensitivity (TPR) of 0.988, suggesting an exceptional ability to identify positive (cancerous) cases correctly. Additionally, it maintains a strong Specificity (TNR) of 0.914, indicating its capability to identify negative (non-cancerous) cases accurately. The Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are notably high at 0.978 and 0.952, respectively. This signifies the method’s ability to predict positive and negative cases precisely, emphasizing its reliability in making accurate predictions. The F-score of 0.983 showcases a balanced performance between precision and recall, indicating the WWPAPSO+MLP method’s ability to maintain high precision and recall simultaneously, making it a robust and well-rounded model for cervical cancer classification. Comparatively, other methods, including WWPA+MLP, PSO+MLP, WAO+MLP, FA+MLP, and GA+MLP, also exhibit strong performance in terms of Accuracy, Sensitivity, Specificity, PPV, NPV, and F-score. They demonstrate slightly lower metrics than the WWPAPSO+MLP approach but perform excellently in accurately classifying cervical cancer cases. The WWPAPSO+MLP method emerges as a highly accurate and reliable approach for cervical cancer classification based on selected features. Its exceptional performance across multiple evaluation metrics highlights its potential as a powerful tool in aiding accurate diagnosis and decision-making in clinical settings. However, while these results are promising, further validation studies and assessments on larger datasets are crucial to ensure their robustness and generalizability in real-world applications.
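As a quick consistency check, the reported F-score follows from the reported PPV (precision) and Sensitivity (recall):

```latex
F = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}
  = \frac{2 \times 0.978 \times 0.988}{0.978 + 0.988} \approx 0.983
```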
The ANOVA (Analysis of Variance) test, shown in Table 11, applied to the classification results of cervical cancer, assesses whether there are statistically significant differences in performance among multiple treatment groups, i.e., classification methods. Treatment: this section examines the variation in classification performance attributed to the different treatment groups or methods. The Sum of Squares (SS) for treatment is 0.022, representing the total variability among the methods regarding their classification results. The Degrees of Freedom (DF) for treatment is 5, the number of treatment groups minus one. The Mean Square (MS), calculated as SS divided by DF, is 0.004472. The F-statistic (F) measures the ratio of the variance between the methods to the variance within the methods. In this case, the F-statistic is 227.8, with degrees of freedom for the numerator (DFn) of 5 and for the denominator (DFd) of 54, resulting in a highly significant p-value (p < 0.0001). This indicates significant differences among the classification methods regarding their performance on cervical cancer classification. Residual: this section of the table assesses the unexplained variance within each method, not accounted for by the treatment (classification methods). The SS for the residual is 0.001, representing the unexplained variance within the methods. The DF for the residual is 54, the total number of observations minus the number of treatment groups. The MS for the residual is approximately 0.00002 (0.001/54, consistent with F = MS_treatment/MS_residual ≈ 227.8). Total: the total variability in the dataset, encompassing both the treatment and residual variability, is accounted for in this section. The Total SS is 0.023, with a Total DF of 59, the sum of the DF for treatment and residual. The ANOVA test results indicate significant differences in the performance of the various classification methods used for cervical cancer classification. The highly significant p-value (p < 0.0001) suggests that the observed variability in performance metrics among these methods is unlikely to be due to random chance, indicating genuine differences in effectiveness. This analysis underscores the importance of selecting the most appropriate classification method, as it significantly influences the outcome in cervical cancer classification. Further exploration, validation, and comparison of these methods on larger datasets or different populations are essential for a comprehensive understanding of their effectiveness and generalizability.
To evaluate the consistency of the optimization component of the model, the ANOVA test was repeated across several experiments: the six optimization algorithms were evaluated in both 20-run and 30-run configurations.
Table 12 and Table 13 present the results. The treatment DF is 5 because six optimization methods were assessed in each execution, while the residual DF depends on the number of repetitions per method (the total number of observations minus the six groups). Both configurations yield high F-statistics with p-values below 0.0001, so statistical significance remains strong regardless of run size.
The Wilcoxon signed-rank test, shown in Table 14, applied to the classification results of cervical cancer, assesses whether there are statistically significant differences in performance among the classification approaches. This non-parametric test is particularly useful when the data may not meet the assumptions of normality, and it aims to determine whether one method consistently outperforms the others. The table presents results for the different classification approaches, including WWPAPSO+MLP, WWPA+MLP, PSO+MLP, WAO+MLP, FA+MLP, and GA+MLP. The “Theoretical median” represents the expected median performance (0 in this case), while the “Actual median” indicates the observed median performance for each method. The Wilcoxon signed-rank test computes the sum of signed ranks (W) based on the ranks of the differences between paired observations, indicating the consistency and direction of the differences between the methods. In this case, all methods yield a sum of signed ranks (W) of 55, suggesting a consistent trend in performance across the methods evaluated. The “p-value (two-tailed)” associated with each method is 0.002 in all cases. This low p-value indicates a high significance level, suggesting that there is only a 0.2% probability (assuming the null hypothesis is true) of observing such extreme differences in medians by chance alone; the null hypothesis is therefore rejected, indicating significant differences in performance among these classification approaches.
On the other hand, Figure 8 presents the accuracy of the cervical cancer classification using the proposed approach in comparison to the alternative approaches, namely WWPA+MLP, PSO+MLP, WAO+MLP, FA+MLP, and GA+MLP. As shown in this figure, the classification accuracy achieved by the proposed approach exceeds that of the other approaches, confirming the superiority of the proposed methodology.
Moreover, the results of the ANOVA analysis are visualized in the plots shown in Figure 9, which include residual, homoscedasticity, quantile–quantile (QQ), and heatmap plots. These plots support the proposed methodology’s effectiveness from the perspective of statistical analysis.
Figure 10 illustrates the mean values of six performance metrics for different optimization algorithms combined with an MLP classifier. WWPAPSO+MLP achieves the highest mean values across most metrics.
Figure 11 presents boxplots comparing six key metrics across various optimization algorithms combined with an MLP classifier: WWPAPSO+MLP, WWPA+MLP, PSO+MLP, WAO+MLP, FA+MLP, and GA+MLP. The WWPAPSO+MLP model consistently outperforms others across all metrics, showcasing the highest median values for accuracy, sensitivity, specificity, and F-score, indicating its robustness and effectiveness in classification tasks.
The pair plot in Figure 12 illustrates the relationships between performance metrics across the models. Each scatter plot in the grid compares a pair of metrics for all models, while the diagonal elements display the distributions of the individual metrics. WWPAPSO+MLP consistently appears at the upper end of most metrics, showcasing its superior performance compared to the other models.
Figure 13 displays the box plot with a swarm overlay. It highlights the accuracy distribution for each model while allowing for an examination of individual data points through the swarm overlay. WWPAPSO+MLP achieves the highest accuracy with minimal variance.
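A box plot with a swarm overlay of this type can be produced with seaborn; a minimal sketch with synthetic placeholder accuracies (model names as in the paper, values illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
models = ["WWPAPSO+MLP", "WWPA+MLP", "PSO+MLP"]
df = pd.DataFrame({
    "Model": np.repeat(models, 10),
    "Accuracy": np.concatenate([
        rng.normal(0.973, 0.002, 10),   # placeholder per-run accuracies
        rng.normal(0.955, 0.004, 10),
        rng.normal(0.948, 0.005, 10),
    ]),
})

# Box plot summarizes the distribution; the swarm overlay exposes
# every individual run on top of it.
ax = sns.boxplot(data=df, x="Model", y="Accuracy", color="lightgray")
sns.swarmplot(data=df, x="Model", y="Accuracy", ax=ax, size=4)
plt.tight_layout()
plt.show()
```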
The convergence performance of the proposed WWPAPSO technique was examined against the stand-alone WWPA, PSO, GA, FA, and WAO algorithms, each paired with the same MLP classifier to ensure a fair comparison. Figure 14 shows that the WWPAPSO+MLP combination reaches the optimal solution much faster and attains lower fitness values across the iterations, whereas the other optimizers show limited or stagnant fitness improvements. The superior performance of this hybrid approach demonstrates its effectiveness in escaping local optima and converging quickly, making it suitable for biomedical applications with complex datasets.
The consistency of the model’s classifications was evaluated through a regression analysis between Sensitivity (True-Positive Rate) and F-score across various configurations. Figure 15 demonstrates a positive relationship between the two measures, indicating that improvements in sensitivity directly translate into F-score improvements. The model therefore detects positive cases effectively while keeping precision and recall in balance. This alignment is vital in medical diagnosis, where both missed cases and false alerts carry substantial risks.
We performed a regression analysis between sensitivity (True-Positive Rate) and F-score across multiple evaluations to study their correlation. The linear fit in Figure 16 demonstrates that elevated sensitivity directly improves the F-score, showing that the model detects positive cases while sustaining an appropriate precision–recall balance, which reduces false-negative results in medical diagnostics.
A regression analysis of overall accuracy against F-score was used to validate model reliability across the evaluation trials. As demonstrated in Figure 17, a close linear correlation exists between classification accuracy and F-score, indicating that higher accuracy is accompanied by better F-score results. The model thus performs consistently, identifying samples correctly while maintaining the balance between precision and recall.
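A metric-versus-metric regression of this kind reduces to an ordinary least-squares fit between two vectors; a minimal sketch using scipy (the values are placeholders, not the paper's measurements):

```python
import numpy as np
from scipy.stats import linregress

# Per-trial accuracy and F-score values (placeholder data).
accuracy = np.array([0.961, 0.965, 0.968, 0.970, 0.973,
                     0.971, 0.969, 0.966, 0.972, 0.974])
f_score  = np.array([0.972, 0.975, 0.978, 0.980, 0.983,
                     0.981, 0.979, 0.976, 0.982, 0.984])

# Least-squares line and the strength of the linear relationship
fit = linregress(accuracy, f_score)
print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}, "
      f"r^2 = {fit.rvalue**2:.3f}")
```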
A heatmap visualization provides a detailed comparison of the evaluation metrics across the optimization algorithms. The table presented in Figure 18 covers the six hybrid models, each merging a metaheuristic optimizer with the MLP classifier, together with their Accuracy, Sensitivity (TPR), Specificity (TNR), Positive Predictive Value (PPV), Negative Predictive Value (NPV), and F-score. The developed WWPAPSO+MLP system performs best on all assessment indicators, including sensitivity, demonstrating effective identification of true cases while maintaining precision. A gradient color scheme highlights the performance gaps, allowing the models to be compared at a glance.
Quantile–Quantile (Q–Q) plots were used to assess the normality of the performance metric distributions across all optimization models. Each panel in Figure 19 presents the Q–Q analysis for one evaluation metric: Accuracy, Sensitivity (TPR), Specificity (TNR), Positive Predictive Value (PPV), Negative Predictive Value (NPV), and F-score. The points lie close to the red diagonal line, indicating approximately normal distributions for the performance metric data and thereby justifying the parametric tests used in our evaluation. This step strengthens the statistical foundation of the comparative analysis.
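Each Q–Q panel can be generated with scipy's probplot; a minimal sketch for a single metric (placeholder data):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
accuracy = rng.normal(0.973, 0.002, 30)   # placeholder per-run accuracies

# Compare sample quantiles against a fitted normal distribution;
# points near the red fit line suggest approximate normality.
stats.probplot(accuracy, dist="norm", plot=plt)
plt.title("Q-Q plot: Accuracy")
plt.show()
```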
The entire experimental workflow, covering data processing, feature extraction, model building, and assessment, was implemented in Python 3.10. NumPy and Pandas were used for data processing, scikit-learn for the machine learning and preprocessing methods, Matplotlib 3.9.0 and Seaborn for graphics, and TensorFlow/Keras for the MLP classifier. The experiments ran on a Dell Precision 3660 high-performance workstation with a 12th Generation Intel Core i7-12700 processor (2.10 GHz base frequency, up to 4.9 GHz boost clock), 64 GB of DDR5 RAM, and a 2 TB SSD for fast data access. The initial cervical cancer dataset comprised 858 records with 36 features; after data preprocessing, 737 complete records were retained. The data were divided into training and validation subsets using an 80:20 split.
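For illustration, a minimal sketch of the loading and split step described above, assuming the public risk-factors CSV with “?” as the missing-value marker and “Biopsy” as the target column (the file name, cleaning steps, and stratification are assumptions, not the paper's exact pipeline):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cervical cancer risk-factors dataset (file name illustrative).
df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?")

# Keep complete records only; the paper reports 737 of 858 records
# remaining after preprocessing (exact cleaning steps assumed here).
df = df.dropna()

X = df.drop(columns=["Biopsy"]).to_numpy(dtype=float)
y = df["Biopsy"].astype(int).to_numpy()

# 80:20 train/validation split; stratification (assumed) preserves
# the class balance across the two subsets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_val.shape)
```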