1. Introduction
Breast cancer is the most frequent cancer among women, according to the World Health Organization. It affects 2.1 million women each year and causes the greatest number of cancer-related deaths among women [1, 2]. In India, breast cancer is the most common form of cancer. In metropolitan cities such as Mumbai, Delhi, and Bangalore, breast cancer accounts for 25% to 32% of female cancers. The situation is more serious because the disease is now increasingly observed in younger age groups: around 50% of all cases occur in the 25 to 50 age group [3]. The numbers are alarming and constantly rising [4, 5]. According to the Indian Council for Medical Research, the total number of new cancer cases in 2016 was about 14.5 × 10^5, and this figure is likely to increase to 17.3 × 10^5 in 2020. As the number of breast cancer cases in India increases, so does the fear of cancer. Even when breast cancer cannot be prevented, survival rates can be increased by being informed and choosing the right treatment at the right time. Early detection is critical to improving breast cancer outcomes and survival, and it can be supported effectively by machine learning (ML) and data mining techniques [6, 7]. ML algorithms such as classification techniques can be used to develop a model that diagnoses breast cancer as either malignant or benign [8, 9]. Data mining techniques such as class balancing and re-sampling can be used to handle the dataset and improve classification accuracy. Once the data imbalance has been handled, feature selection algorithms can be applied to the balanced data to obtain the most important features, which both contribute to the accuracy of the classification model and reduce the computation time. Many such approaches have been proposed; in this work we use nature-inspired algorithms.
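The class-balancing idea mentioned above can be illustrated with a minimal sketch of SMOTE-style over-sampling: each synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy data are assumptions made for illustration only; this is not the implementation used in the paper.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating minority samples with their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class (values are made up for illustration)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_synthetic=6)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing records.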
The computations were carried out on breast cancer datasets available from the University of California, Irvine (UCI) repository. We implemented different classification methods to separate the malignant and benign groups in the given dataset, and applied various imbalanced-data handling techniques and feature selection algorithms to improve the performance of the classifiers. Many researchers have worked on classifying breast cancer cells as malignant or benign with ML classifiers. Saoud et al. [10] examined six different ML techniques for breast cancer diagnosis and found that the Bayes network and support vector machine (SVM) gave an accuracy of 97.2818% on the Wisconsin breast cancer dataset. Saoud et al. [11] proposed an approach for breast cancer detection using supervised and unsupervised machine learning algorithms and showed that supervised algorithms are more efficient. Domingo et al. [12] analyzed various decision trees for classifying breast cancer stages and observed that the fuzzy decision tree performed better than the J48 tree. Sahu et al. [13] studied parameters such as accuracy, specificity, and sensitivity, and found that SVM was more efficient than other techniques. Al-Shargabi et al. [14] obtained the best result for breast cancer classification with K-Nearest Neighbors and Random Forest, with an accuracy of 100%; the original Multi-Layer Perceptron ranked second with an accuracy of 97.19%. Zhang et al. [15] addressed the diagnosis of breast cancer and the class imbalance problem using the K-Boosted C5.0 algorithm based on under-sampling. Devi et al. [16] performed a comparative analysis of various ML algorithms, evaluated on the basis of the accuracy and ROC curve of each classifier. Fotouhi et al. [17] examined oversampling and under-sampling on various cancer datasets and found that balancing techniques improved the classification of cancer datasets.
Many researchers have worked on feature selection approaches to further improve the accuracy of ML classifiers. al Haq et al. [18] suggested a hybrid framework using ML classifiers with the Relief, Lasso, and mRMR feature selection algorithms for heart disease. Ahmed Abdullah Farid et al. [19] proposed an early diagnosis system for breast cancer using the CHFS feature selection algorithm and SVM, which achieved 98.25% accuracy. Bibhuprasad Sahu et al. [20] proposed a cancer classification approach based on SVM optimized with particle swarm optimization and the reverse firefly algorithm. Sahu et al. [21] proposed a predictive model of cancer diagnosis using multivariate statistical and machine learning techniques for better accuracy. Kewat et al. [22] evaluated the wrapper-based feature selection techniques particle swarm optimization, genetic search, and greedy stepwise. Tabrizchi et al. [23] proposed breast cancer diagnosis using a multi-verse optimizer-based gradient boosting decision tree, combining the Gradient Boosting Decision Tree (GBDT) and the multi-verse optimizer (MVO) into a robust classifier for optimal classification.
Based on the literature review, in the majority of cases various classifiers and feature selection approaches have been used to improve accuracy. In this paper, we propose a framework with a hybrid approach that extends [24], where improving accuracy by handling the data imbalance with re-sampling and SMOTE (Synthetic Minority Over-sampling Technique) was suggested. Here, the same technique is used together with nature-inspired feature selection approaches to improve the accuracy of the classification models. In this work, the ML classifiers are also evaluated with measures such as the Kappa statistic and MCC (Matthews Correlation Coefficient), which have rarely been considered in the literature.
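For readers less familiar with these two measures, the following minimal sketch computes Cohen's Kappa and the MCC for a binary classifier from the four cells of a confusion matrix. The function name `kappa_and_mcc` and the example counts are assumptions chosen for illustration; they are not results from this study.

```python
import math


def kappa_and_mcc(tp, fp, fn, tn):
    """Return (Cohen's kappa, Matthews correlation coefficient) for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # observed agreement, i.e. plain accuracy
    # Agreement expected by chance, from the marginal totals of predictions and labels
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return kappa, mcc


# Hypothetical counts: 45 true positives, 5 false positives, 5 false negatives, 45 true negatives
kappa, mcc = kappa_and_mcc(tp=45, fp=5, fn=5, tn=45)
```

Unlike plain accuracy, both measures discount agreement that would be expected by chance, which is why they are informative on imbalanced data.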
Lahoura et al. [25] used techniques based on artificial neural networks (ANN) to verify whether they can be applied to disease diagnosis. The extreme learning machine is an example of an ANN with considerable potential for solving classification problems. The approach proposed in that paper combines three research domains. First, an extreme learning machine was used to diagnose breast cancer. Then, the gain ratio feature selection method was used to eliminate insignificant features. Finally, a cloud computing-based system for remote diagnosis was proposed. The reported results indicate an accuracy of around 0.987, a recall of 0.913, a precision of 0.905, and an F1-score of 0.813.
Yu et al. [26] compared RMAF with ReLU and other activation functions on deeper models, and RMAF was judged the best performer. The experiments involved training and classification with a multi-layer perceptron (MLP) on benchmark data: the Wisconsin breast cancer, MNIST, Iris, and Car evaluation datasets. RMAF achieved a performance of 98.74%, 99.67%, 98.81%, and 99.42%, respectively, when compared with Sigmoid, Tanh, and ReLU. A further experiment used a convolutional neural network on the MNIST, CIFAR-10, and CIFAR-100 data, where the reported accuracy was 99.73%, 98.77%, and 79.82%, respectively, in comparison with Tanh, ReLU, and Swish.
Ferreira et al. [27] distinguished five types of cancer using RNA-Seq datasets: thyroid, skin, stomach, breast, and lung. The performance comparison was based on three autoencoders applied as a deep neural network weight initialization technique.
This paper is organized as follows:
Section 2 presents details about the materials and proposed methodology applied in this research.
Section 2.1 describes the dataset under consideration.
Section 2.2 describes in detail the methodology of the proposed system, which includes feature selection algorithms, ML classifiers, evaluation parameters, etc.
Section 3 presents and discusses the results obtained and the comparison of the results with different approaches applied as proposed in the methodology.
Section 4 presents the conclusions of the conducted experiment and suggests a methodology for obtaining improved accuracy by implementing the hybrid approach.
The research has a significant social impact, as it will facilitate medical treatment through the prognosis of breast cancer at its early stages. The accuracy rates reported here can serve as a reference threshold for future work on breast cancer treatment. The research will be helpful for academicians, researchers, and oncologists, and the results may lead to novel techniques for the proper prognosis of breast cancer.
4. Discussion
The following observations can be made from the accuracy comparison in Figure 4 and Figure 5. In most cases, the accuracy of the ML classifiers is better with features selected from pre-processed data than with features selected from data without pre-processing. This indicates that performing feature selection after applying re-sampling and SMOTE improves the accuracy of the classifier. For the SVM classifier, out of all the feature selection algorithms, PSO with the KNN evaluator gives the maximum accuracy of 98.24%. Likewise, for the J48 decision tree, the maximum accuracy of 98.83% is achieved by Genetic search with a J48 evaluator. For the Multilayer Perceptron classifier, the highest accuracy of 98.59% is obtained by using the Genetic search algorithm with KNN. Accuracy details of all the above-mentioned classifiers, together with other performance evaluation parameters such as MCC, sensitivity, specificity, AUC, and Kappa statistics, are shown in Table 6. Of all the classifiers, the J48 decision tree with the Genetic search feature selection algorithm outperforms the others not only in terms of accuracy but also in terms of the Matthews correlation coefficient, Cohen's Kappa statistic, and specificity; its sensitivity (98.95%) is second only to that of SVM (99.11%).
With the feature selection approach (PSO + Naive Bayes), the J48 and Multilayer Perceptron classifiers deliver the best accuracy (98.0%). With (PSO + KNN), the SVM classifier has the best accuracy (98.2%). With (PSO + J48), J48 delivers the best accuracy of 98.4%. With (PSO + Random Forest), J48 delivers the best accuracy of 98.2%. With (Genetic Search + Naive Bayes), J48 delivers the best accuracy of 97.8%. With (Genetic Search + KNN), Multilayer Perceptron gives the best accuracy of 98.6%. With (Genetic Search + Random Forest) and (Genetic Search + J48), J48 delivers the best accuracy of 98.8%. With (Greedy Stepwise + Naive Bayes), Multilayer Perceptron delivers the best accuracy of 97.1%. With (Greedy Stepwise + KNN), (Greedy Stepwise + J48), and (Greedy Stepwise + Random Forest), the J48 classifier delivers the best accuracy, above 97%. The Kappa statistic is highest (0.973) for the J48 classifier with the Genetic Search feature selection algorithm. The error rate is also lowest for J48, which yields the best accuracy of 98.83%. The sensitivity and specificity are highest for SVM (99.11%) and J48 (98.58%), respectively. The results can be further assessed using 95% confidence intervals.
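As an indication of what such an assessment could look like, the sketch below computes a 95% Wilson score interval for a reported accuracy. The function name `wilson_interval` is hypothetical, and the number of evaluated instances `n = 500` is an assumed value used only for illustration, since the evaluation set size is not stated in this section.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 corresponds to 95% confidence."""
    denom = 1 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# 98.83% is the J48 accuracy reported above; n = 500 is an assumption, not a value from the paper
low, high = wilson_interval(0.9883, n=500)
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better when the accuracy is close to 100%, as it is for all three classifiers.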
5. Conclusions
This paper presents a hybrid supervised machine learning classifier system for breast cancer prognosis using feature selection and data imbalance handling approaches. The performance of the classifiers was tested on all attributes and on the selected features separately, so that the achieved accuracy could be compared. A wrapper-based feature selection approach with nature-inspired algorithms such as Particle Swarm Optimization, Genetic Search, and Greedy Stepwise was used to identify important features. On these selected features, popular machine learning classifiers such as Support Vector Machine, J48 (the C4.5 decision tree algorithm), and Multilayer Perceptron (a feed-forward ANN) were used in the system. The methodology of the proposed system is structured into five stages: (1) data pre-processing; (2) data imbalance handling; (3) feature selection; (4) machine learning classifiers; (5) classifier performance evaluation. Based on the experimental results, the Support Vector Machine with the Particle Swarm Optimization algorithm for feature selection achieves an accuracy of 98.24%, an MCC of 0.961, a sensitivity of 99.11%, a specificity of 96.54%, and a Kappa statistic of 0.9606. The J48 decision tree classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.83%, an MCC of 0.974, a sensitivity of 98.95%, a specificity of 98.58%, and a Kappa statistic of 0.9735. Furthermore, the Multilayer Perceptron ANN classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.59%, an MCC of 0.968, a sensitivity of 98.6%, a specificity of 98.57%, and a Kappa statistic of 0.9682. Given the above, the J48 decision tree classifier appears to be the most appropriate machine learning-based classifier for breast cancer prognosis among those evaluated. This work will facilitate medical treatment towards breast cancer prognosis in the light of machine learning.
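The wrapper-based feature selection stage summarized above can be illustrated with a minimal sketch of greedy forward (stepwise) selection: features are added one at a time as long as the score of a wrapped evaluator improves. The leave-one-out 1-nearest-neighbour evaluator, the function names, and the toy data are assumptions made for illustration; this is not the configuration or the implementation used in the experiments.

```python
import math


def loo_1nn_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the chosen features."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist([row[f] for f in features],
                                    [rows[j][f] for f in features]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def greedy_forward_selection(rows, labels):
    """Add the best remaining feature until the wrapped evaluator stops improving."""
    remaining = list(range(len(rows[0])))
    selected, best_score = [], 0.0
    while remaining:
        score, feature = max(
            (loo_1nn_accuracy(rows, labels, selected + [f]), f) for f in remaining
        )
        if score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data (made up): feature 0 separates the classes, features 1 and 2 are noise
rows = [(1.0, 5.0, 0.3), (1.2, 1.0, 0.9), (0.9, 3.0, 0.1), (1.1, 4.0, 0.7),
        (3.0, 4.5, 0.2), (3.2, 1.5, 0.8), (2.9, 3.5, 0.4), (3.1, 0.5, 0.6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected, score = greedy_forward_selection(rows, labels)
```

On this toy data the search keeps only the informative feature. Population-based searches such as Particle Swarm Optimization or Genetic Search explore feature subsets differently, but they wrap a classifier-based evaluator in the same way.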
The future scope of work includes the prognosis of breast cancer using thermal images and IoT-based sensors.