Abstract
Software fault prediction (SFP) is vital for ensuring software system reliability by detecting and mitigating faults. Machine learning has proven effective in addressing SFP challenges. However, fault data drawn from historical repositories often suffer from high dimensionality because of the large number of software metrics involved. Feature selection (FS) mitigates this problem by identifying the most informative features. This research enhances the Whale Optimization Algorithm (WOA) by combining truncation selection with single-point crossover to strengthen exploration and avoid local optima. An evaluation on 14 SFP datasets from the PROMISE repository shows that the enhancement outperforms the original WOA and other variants, demonstrating its potential for improved SFP.
1. Introduction
SFP greatly aids in producing high-quality software at a low cost by identifying fault-prone software modules [1]. Machine learning algorithms like decision trees, Bayesian learners, neural networks, support vector machines, and rule-based learning have shown promise, as have soft computing approaches like fuzzy computing, neural networks, evolutionary computing, and swarm intelligence [2].
Feature selection is frequently used to improve the SFP performance of machine learning (ML) algorithms, with the aim of increasing data-processing effectiveness and avoiding algorithmic errors [3]. It is often performed with metaheuristic algorithms such as genetic algorithms and particle swarm optimization [4]. Among these metaheuristic approaches, the Whale Optimization Algorithm (WOA) has emerged as a promising choice for feature selection. However, the WOA is susceptible to becoming trapped in local optima, a challenge that is amplified on large datasets.
This study addresses the local optima problem in the WOA for feature selection in SFP by introducing the truncation selection method. Building on recent advancements in WOA variants [5], this research investigates the effectiveness of truncation selection within the context of WOA selection enhancement. This novel approach aims to improve WOA’s performance in SFP, offering a potential solution to the local optima challenge.
In summary, this research aims to contribute to software fault prediction by combining a metaheuristic algorithm, the WOA, with the truncation selection method, building on previous advancements to address the local optima challenge and enhance the effectiveness of machine learning algorithms in SFP.
2. Review of Related Work
A comprehensive review of the literature related to Software fault prediction (SFP), machine learning (ML)-based SFP, feature selection, and metaheuristic algorithms for software fault prediction is presented. This review aims to provide a foundational understanding of existing knowledge in this field to support the development and evaluation of the proposed methodology.
A study by [4] brings to the fore the traditional techniques used in SFP, encompassing software metrics, soft computing (SC), and machine learning (ML). While these approaches have significantly contributed to early fault prediction and the development of dependable software, they still have limitations in predicting certain types of faults. Additionally, they can be time-consuming, particularly when applied to complex software projects, which may diminish the effectiveness of software testing.
Representative examples of ML-based SFP techniques are reported in [6,7,8,9,10].
Feature selection (FS) has become a significant step in data mining in general and machine learning in particular, since it helps to clean data by removing noisy, irrelevant, and redundant features [11]. A study by [12] developed a novel FS method, evolving populations with mathematical diversification (FS-EPWMD), which uses arithmetic diversification among candidate solutions to avoid local optima. The guiding principle of populations evolving through crossover and mutation is the survival of the fittest. The results demonstrated that FS-EPWMD outperforms other models.
Swarm intelligence (SI) is a computational intelligence technique used to solve complicated problems [13]. Swarm intelligence algorithms have demonstrated excellent performance in lowering running time and addressing the FS problem. For instance, to solve the FS problem in software fault prediction, ref. [14] proposed an island model to improve the binary moth flame optimization (BMFO) algorithm. The enhanced BMFO (EBMFO) with the SVM classifier produced the best results overall. These findings show that the proposed model can be a useful predictor of software faults.
3. Proposed Work
This section presents the research workflow and outlines the proposed methodology employed in this study, covering the enhancement of the algorithm and subsequent performance evaluation.
3.1. The Research Workflow
The proposed model works in four phases: the literature review, methodology, implementation, and evaluation and results. The workflow proceeds from the literature review through to evaluation and results, with each phase represented by its activities. Figure 1 shows the entire workflow of the model.
Figure 1.
The research workflow.
In the first phase, we conducted a comprehensive literature review of 42 articles related to software fault prediction (SFP), ML-based SFP, feature selection (FS), and selection schemes. The second phase defined the main components of the proposed ML-based SFP model and the enhanced WOA: the ML-based SFP model handles prediction, while the enhanced WOA improves its feature selection. In the third phase, we implemented the ML-based SFP model in Google Colab and replicated the baseline work in the same environment for comparison. Four well-known classifiers were employed to evaluate the selected feature subsets: support vector machine (SVM), decision tree (DT), linear discriminant analysis (LDA), and K-Nearest Neighbors (KNN). In the final phase, the proposed model's performance was evaluated using the following metrics: area under the curve (AUC), precision, recall, F1 score, and accuracy. Model performance was also assessed with a hold-out validation, in which 80% of each dataset was used for training and 20% for testing.
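To make this evaluation setup concrete, the following minimal sketch shows how such a pipeline can be assembled with scikit-learn; it is an illustration under stated assumptions rather than the authors' implementation, and the dataset path, label column name ("bug"), and random seed are hypothetical placeholders.

```python
# Minimal sketch of the evaluation setup (assumed scikit-learn pipeline, not the authors' exact code).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

df = pd.read_csv("ant-1.7.csv")                       # placeholder path to a PROMISE dataset
X = df.drop(columns=["bug"])                          # "bug" is an assumed label column name
y = (df["bug"] > 0).astype(int)                       # fault-prone (1) vs. not fault-prone (0)

# 80/20 train/test split, as described in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

classifiers = {
    "SVM": SVC(probability=True),
    "DT": DecisionTreeClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]
    print(name,
          "acc=%.3f" % accuracy_score(y_te, pred),
          "prec=%.3f" % precision_score(y_te, pred, zero_division=0),
          "rec=%.3f" % recall_score(y_te, pred),
          "f1=%.3f" % f1_score(y_te, pred),
          "auc=%.3f" % roc_auc_score(y_te, proba))
```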
3.2. The Proposed ML-Based SFP Model
This section details the research methodology for ML-based software fault prediction (SFP). Figure 2 presents the proposed model, which works in five stages: data collection, data pre-processing, feature selection, machine learning classifiers, and evaluation. Table 1 describes the PROMISE datasets used.
Figure 2.
The proposed model.
Table 1.
Description of the PROMISE Dataset.
Stage 1 involves gathering 14 datasets from the PROMISE dataset repository, with details provided in Table 1. In Stage 2, the data were pre-processed into a form suitable for training and validation. Stage 3 employs the Whale Optimization Algorithm (WOA) for feature selection, with a focus on improving its selection scheme using truncation selection. Stage 4 deploys four ML classifiers (DT, KNN, LDA, and SVM) to predict software faults and to assess the quality of the feature subsets selected by the WOA. In the final stage, we evaluated the model using the evaluation metrics listed above.
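The paper does not state the fitness function used to score candidate feature subsets; the sketch below shows one common wrapper formulation, given purely as an assumption, in which a binary feature mask is scored by the classification error of a KNN model on held-out data plus a small penalty on the number of selected features (the weight `alpha` and the helper name are hypothetical).

```python
# Assumed wrapper-style fitness for a binary feature mask (not confirmed by the paper).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def fitness(mask, X_tr, y_tr, X_te, y_te, alpha=0.01):
    """Lower is better: classification error plus a penalty on subset size.
    X_tr / X_te are NumPy feature matrices; mask is a 0/1 vector over columns."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:                 # an empty subset is maximally unfit
        return 1.0
    clf = KNeighborsClassifier()
    clf.fit(X_tr[:, selected], y_tr)
    error = 1.0 - accuracy_score(y_te, clf.predict(X_te[:, selected]))
    return (1 - alpha) * error + alpha * selected.size / mask.size
```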
3.3. The Proposed Enhanced Whale Optimization Algorithm
This work employs the Whale Optimization Algorithm (WOA) and enhances it by incorporating truncation selection into its selection scheme, as illustrated in Figure 3. The enhancement addresses the problem of the best solution becoming stuck in a local optimum by combining the WOA with truncation selection and single-point crossover. It focuses on the selection step of the WOA, where truncation selection is employed: individuals are ranked by their fitness values, and only the best-performing individuals are chosen as parents for the next generation. This selection is governed by a primary parameter, the TRS threshold, which can vary between 10% and 50%.
Figure 3.
Flowchart of the proposed enhanced WOA.
The TRS threshold determines the percentage of the population that will serve as parents; individuals falling below it are eliminated as unfit for reproduction. The truncation selection process proceeds as follows (a code sketch is given after the list):
- The population is sorted based on each individual’s evaluation scores;
- The poorest-performing fraction of the population is removed;
- The eliminated individuals are replaced by offspring of the top-performing fraction, with each of the best individuals producing one offspring that takes the place of a removed, lower-performing individual.
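A minimal sketch of this selection step, combined with the single-point crossover mentioned above, is shown below; it assumes a population of equal-length solution vectors and a fitness to be minimised, and the random pairing of parents and the default TRS value of 0.3 are illustrative assumptions rather than the authors' settings.

```python
# Sketch of truncation selection with single-point crossover (illustrative; minimisation assumed).
import numpy as np

def truncation_selection_step(population, fitnesses, trs=0.3, rng=None):
    """Keep the best trs fraction as parents; refill the removed slots with their offspring."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(fitnesses)                     # sort by fitness, best (lowest) first
    n_parents = max(2, int(np.ceil(trs * len(population))))
    parents = [population[i] for i in order[:n_parents]]

    new_population = list(parents)                    # the top fraction survives unchanged
    while len(new_population) < len(population):
        i, j = rng.choice(n_parents, size=2, replace=False)
        point = rng.integers(1, len(parents[i]))      # single-point crossover position
        child = np.concatenate([parents[i][:point], parents[j][point:]])
        new_population.append(child)
    return new_population
```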
4. Discussion of Results
This section is devoted to the presentation of the experimental results.
4.1. Implementation Environment
The work was implemented in the Google Colab environment using Python with the Pandas and TensorFlow libraries.
4.2. Proposed Model Performance
The proposed enhanced WOA was evaluated by iteratively evolving a population of candidate solutions using truncation selection, single-point crossover, and the WOA coefficient-update procedures. The experiments revealed promising outcomes, as shown in Table 2.
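For reference, the sketch below outlines one generation of the standard WOA position update (shrinking encircling, random search, and spiral moves driven by the coefficients a, A, and C), after which the truncation-selection and crossover step of Section 3.3 would be applied; it is a simplified continuous-space illustration rather than the authors' implementation, and a feature-selection variant would additionally binarise positions (e.g., via a transfer function).

```python
# Simplified sketch of one enhanced-WOA generation (continuous positions; illustrative only).
import numpy as np

def woa_generation(population, best, t, max_iter, rng, b=1.0):
    """One WOA position-update pass; truncation selection + crossover would follow."""
    a = 2.0 - 2.0 * t / max_iter                 # a decreases linearly from 2 to 0
    new_pop = []
    for x in population:
        A = 2 * a * rng.random() - a             # coefficient A in [-a, a]
        C = 2 * rng.random()                     # coefficient C in [0, 2]
        p = rng.random()
        l = rng.uniform(-1, 1)
        if p < 0.5:
            if abs(A) < 1:                       # exploitation: shrink towards the best whale
                x_new = best - A * np.abs(C * best - x)
            else:                                # exploration: move relative to a random whale
                rand = population[rng.integers(len(population))]
                x_new = rand - A * np.abs(C * rand - x)
        else:                                    # spiral (bubble-net) update around the best whale
            d = np.abs(best - x)
            x_new = d * np.exp(b * l) * np.cos(2 * np.pi * l) + best
        new_pop.append(x_new)
    # The enhancement then applies truncation selection and single-point crossover (Section 3.3).
    return new_pop
```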
Table 2.
Truncation-based WOA (TRBWOA) with SVM.
In Table 2, Ant-1.7 performed best with an accuracy of 0.834, while Lucene-2.2 scored the lowest at 0.534. Log4j-1.2 had the highest precision (0.789), and Log4j-1.0 had the highest recall (0.727).
Log4j-1.0 also achieved the highest F1 score (0.643) and AUC (0.790). Conversely, Camel-1.6 consistently had lower scores across these metrics.
Table 3 summarizes the per-dataset performance with KNN. Ant-1.7 had the best accuracy (0.832), while Camel-1.6 scored the lowest. Camel-1.4 achieved perfect precision (1.0), whereas Camel-1.6 had the lowest precision (0.333). Camel-1.2 had the highest recall (0.504), and Jedit-4.2 had the highest F1 score (0.500) and AUC (0.863). Camel-1.4 had the lowest AUC (0.517), and Camel-1.6 had the lowest F1 score (0.244).
Table 3.
Truncation-based WOA (TRBWOA) with KNN.
4.3. Results Comparison
Regarding AUC, Table 4 shows that the TRBWOA performed better than the TBWOA and all other variants when the decision tree classifier was applied. The TRBWOA performed remarkably well on datasets such as Ant-1.7 (0.803), while the random-based Whale Optimization Algorithm (RBWOA) had the worst performance on Log4j-1.2 (0.486).
Table 4.
Comparison of results of the WOA implemented with different selection schemes with DT classifier in terms of AUC.
Regarding AUC, Table 5 shows that the TRBWOA outperformed all other WOA variants, including the TBWOA, when KNN was applied. The TRBWOA performed well on datasets such as Jedit-3.2 (0.863), while the linear rank-based Whale Optimization Algorithm (LRBWOA) had the worst performance on Camel-1.6 (0.47). Figure 4 presents these results graphically.
Table 5.
Comparison of results of the WOA implemented with different selection schemes with KNN classifier in terms of AUC.
Figure 4.
Comparison of results of the WOA implemented with different selection schemes with KNN classifier in terms of AUC.
5. Conclusions
The proposed work demonstrated significant improvements over the previous work across all evaluated metrics and datasets, consistently outperforming it on the various evaluation measures. These findings affirm the efficiency and effectiveness of the proposed approach in addressing the research objectives and achieving improved outcomes.
Author Contributions
Contribution: Conceptualization, H.A. and L.Y.; methodology, H.A.; software, H.A.; validation, H.A., K.U. and R.A.; resources, H.A. and K.M.; supervision, K.U. and L.Y.; project administration, K.U. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data were obtained from datasets gathered from the PROMISE dataset repository [http://promise.site.uottawa.ca/SERepository/].
Conflicts of Interest
The authors declare no conflict of interest.
References
- Rathore, S.S.; Kumar, S. A study on software fault prediction techniques. Artif. Intell. Rev. 2019, 51, 255–327. [Google Scholar] [CrossRef]
- Singh, P.D.; Chug, A. Software defect prediction analysis using machine learning algorithms. In Proceedings of the 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence, Noida, India, 12–13 January 2017. [Google Scholar]
- Cai, J.; Luo, J.; Wang, S.; Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing 2018, 300, 70–79. [Google Scholar] [CrossRef]
- Hassouneh, Y.; Turabieh, H.; Thaher, T.; Tumar, I.; Chantar, H.; Too, J. Boosted whale optimization algorithm with natural selection operators for software fault prediction. IEEE Access 2021, 9, 14239–14258. [Google Scholar] [CrossRef]
- Heidari, A.A.; Aljarah, I.; Faris, H.; Chen, H.; Luo, J.; Mirjalili, S. An enhanced associative learning-based exploratory whale optimizer for global optimization. Neural Comput. Appl. 2020, 32, 5185–5211. [Google Scholar] [CrossRef]
- Bowes, D.; Hall, T.; Petrić, J. Software defect prediction: Do different classifiers find the same defects? Softw. Qual. J. 2018, 26, 525–552. [Google Scholar] [CrossRef]
- Pandey, S.K.; Mishra, R.B.; Tripathi, A.K. Machine learning based methods for software fault prediction: A survey. Expert Syst. Appl. 2021, 172, 114595. [Google Scholar] [CrossRef]
- Suryadi, A. Integration of feature selection with data level approach for software defect prediction. Sink. J. Dan Penelit. Tek. Inform. 2019, 4, 51–57. [Google Scholar] [CrossRef]
- Bahaweres, R.B.; Suroso, A.I.; Hutomo, A.W.; Solihin, I.P.; Hermadi, I.; Arkeman, Y. Tackling feature selection problems with genetic algorithms in software defect prediction for optimization. In Proceedings of the 2020 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), Jakarta, Indonesia, 19–20 November 2020. [Google Scholar]
- Alkhasawneh, M.S. Software Defect Prediction through Neural Network and Feature Selections. Appl. Comput. Intell. Soft Comput. 2022, 2022, 2581832. [Google Scholar] [CrossRef]
- Chantar, H.; Mafarja, M.; Alsawalqah, H.; Heidari, A.A.; Aljarah, I.; Faris, H. Feature selection using binary grey wolf optimizer with elite-based crossover for Arabic text classification. Neural Comput. Appl. 2020, 32, 12201–12220. [Google Scholar] [CrossRef]
- Goyal, S. Software fault prediction using evolving populations with mathematical diversification. Soft Comput. 2022, 26, 13999–14020. [Google Scholar] [CrossRef]
- Shrivastava, D.; Sanyal, S.; Maji, A.K.; Kandar, D. Bone cancer detection using machine learning techniques. In Smart Healthcare for Disease Diagnosis and Prevention; Elsevier: Amsterdam, The Netherlands, 2020; pp. 175–183. [Google Scholar]
- Khurma, R.A.; Alsawalqah, H.; Aljarah, I.; Elaziz, M.A.; Damaševičius, R. An enhanced evolutionary software defect prediction method using island moth flame optimization. Mathematics 2021, 9, 1722. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).