Article

Evaluating the Predictive Power of Software Metrics for Fault Localization

1 Adrem Data Laboratory, Department of Computer Science, University of Antwerp, 2020 Antwerpen, Belgium
2 Faculty of Computer Science, North Dakota State University, Fargo, ND 58108, USA
3 College of Computer and Information Sciences, Prince Sultan University, Riyadh 12435, Saudi Arabia
* Author to whom correspondence should be addressed.
Computers 2025, 14(6), 222; https://doi.org/10.3390/computers14060222
Submission received: 8 April 2025 / Revised: 3 June 2025 / Accepted: 4 June 2025 / Published: 6 June 2025
(This article belongs to the Special Issue Best Practices, Challenges and Opportunities in Software Engineering)

Abstract

Fault localization remains a critical challenge in software engineering, directly impacting debugging efficiency and software quality. This study investigates the predictive power of various software metrics for fault localization by framing the task as a multi-class classification problem and evaluating it using the Defects4J dataset. We fitted thousands of models and benchmarked different algorithms—including deep learning, Random Forest, XGBoost, and LightGBM—to choose the best-performing model. To enhance model transparency, we applied explainable AI techniques to analyze feature importance. The results revealed that test suite metrics consistently outperform static and dynamic metrics, making them the most effective predictors for identifying faulty classes. These findings underscore the critical role of test quality and coverage in automated fault localization. By combining machine learning with transparent feature analysis, this work delivers practical insights to support more efficient debugging workflows. It lays the groundwork for an iterative process that integrates metric-based predictive models with large language models (LLMs), enabling future systems to automatically generate targeted test cases for the most fault-prone components, which further enhances the automation and precision of software testing.

1. Introduction

Software fault localization is a critical and time-consuming task in the software development lifecycle. The process of ensuring the delivery of fault-free software is a major challenge in software quality assurance. This activity represents one of the costliest endeavors, particularly in safety-critical domains such as high-integrity embedded system development [1,2]. According to a study by the U.S. National Institute of Standards and Technology, software bugs account for an annual economic loss of USD 59.5 billion, highlighting their profound financial impact [3]. Identifying and resolving these defects demands significant time and effort from practitioners, prompting extensive research into efficient fault localization methods. Foundational studies on fault localization, such as [4,5,6], have attracted significant attention within the research community, serving as a catalyst for the development of more advanced techniques. These efforts aim to reduce the workload of developers while meeting the growing complexity of modern software systems [7].
The software development process is typically guided by systematic engineering principles encapsulated within the software development lifecycle. This process emphasizes iterative cycles, including planning, designing, building, testing, deployment, and monitoring—often framed within the DevOps stages. Each stage is crucial in ensuring the quality and reliability of the final product. For example, design frameworks like UML (Unified Modeling Language) aid in the design phase by visualizing object-oriented systems or database structures [8]. However, challenges such as the lack of semantics in UML models have spurred researchers to propose solutions that enhance their utility [9]. In recent years, AI and machine learning techniques have been increasingly applied across nearly every phase of the DevOps pipeline—from code generation and automated testing to deployment [10].
The software testing lifecycle particularly includes requirement analysis, test planning, test designs and reviews, test case preparation, test execution, test reporting, bug fixing, regression testing, and software release. In the testing phase, fault localization is a key objective that aims to reduce costs, maximize efficiency, and accelerate software roll-out [11]. Automated techniques such as Spectrum-Based Fault Localization (SBFL) aim to improve debugging efficiency by ranking program elements based on their likelihood of being faulty. SBFL uses metrics derived from successful and failed test cases to compute the suspiciousness of code elements. While popular metrics like TARANTULA [12] and OCHIAI [13] provide a ranked list of suspected faults, empirical evidence suggests that SBFL techniques often lack practical effectiveness, particularly for large-scale projects [14,15]. Studies have highlighted issues such as vague inspection ordering and the inability of SBFL techniques to adapt to real-world developer behaviors, which can significantly impact their utility [14]. Despite advancements in metrics and evaluation methodologies, achieving reliable fault localization remains a challenge [14,16].
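For reference, these two suspiciousness scores are commonly defined as follows, where failed(e) and passed(e) denote the numbers of failing and passing test cases that execute element e, and totalfailed and totalpassed are the corresponding totals over the whole suite (standard formulations from the SBFL literature, not reproduced from this paper):

$\text{Tarantula}(e) = \dfrac{\frac{\text{failed}(e)}{\text{totalfailed}}}{\frac{\text{failed}(e)}{\text{totalfailed}} + \frac{\text{passed}(e)}{\text{totalpassed}}}, \qquad \text{Ochiai}(e) = \dfrac{\text{failed}(e)}{\sqrt{\text{totalfailed} \cdot \left(\text{failed}(e) + \text{passed}(e)\right)}}$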
In recent years, machine learning has shown remarkable potential in solving complex problems across diverse domains. As an example, it has revolutionized fields such as bioinformatics for protein sequence analysis, computer vision for autonomous driving, mass spectrometry-based proteomics for protein profiling, cheminformatics for drug discovery, and natural language processing for text generation and sentiment analysis [17,18,19]. This success has inspired researchers to explore its applications in software engineering, including fault prediction and localization [20].
Recent machine learning research in software testing has seen a growing focus on the application of large language models (LLMs) to enhance various stages of the software testing lifecycle [21]. LLMs have shown strong performance, particularly in the mid-to-late phases of testing. In the test case preparation stage, they are frequently employed for tasks such as unit test generation, test oracle creation, and input generation for system testing [22,23]. These capabilities help developers and testers automate labor-intensive processes and catch issues early before further development progresses. Furthermore, LLMs have been highly effective in post-execution phases [21], particularly in test reporting, bug triage, and bug fixing [24,25,26]. Tasks such as bug report summarization, duplicate report detection, and automated patch generation have all been supported by LLMs, leveraging their strength in processing natural language and code semantics. These contributions are valuable in diagnosing and resolving defects before release.
In contrast to the recent focus on the semantic and generative capabilities of large language models (LLMs), our work takes a complementary, data-driven approach to software fault localization by leveraging software metrics to proactively identify fault-prone classes. Together, these approaches form a synergistic feedback loop—where metric-based models identify likely fault zones and LLMs act upon them—paving the way for more intelligent, adaptive, and efficient software testing. Our contributions address two central research questions: (1) To what extent can predictive models built on software metrics accurately localize faults at the class level? (2) Which types of metrics, among static, dynamic, and test suite characteristics, contribute most significantly to prediction accuracy?
The problem is formulated as a multi-class classification task. We evaluated our approach on the Defects4J benchmark [27], a widely used dataset in this domain. Defects4J comprises six major projects (JFreeChart, Closure Compiler, Apache Commons-Lang, Apache Commons-Math, Mockito, and Joda-Time), making it an ideal testbed for assessing fault localization strategies. In terms of features, we compiled a large set of 53 class-level metrics, categorized into three groups: static, dynamic, and test suite characteristics. By focusing on predicting faulty classes and identifying the most informative metrics, our approach aims to improve software quality assurance, streamline debugging efforts, and expand the practical applicability of machine learning in fault localization.

2. Materials and Methods

2.1. Dataset

A key prerequisite for any machine learning study is to prepare the dataset to train predictive models. In this study, we utilize Defects4J [27], a widely recognized repository of real-world projects maintained specifically for software testing research, development, and automation. Defects4J is a benchmark dataset commonly used in fault localization and other software testing studies. The repository employed in this research included six well-known projects: JFreeChart (27 buggy versions), Closure Compiler (134 buggy versions), Apache Commons-Lang (66 buggy versions), Apache Commons-Math (107 buggy versions), Mockito (39 buggy versions), and Joda-Time (28 buggy versions). Each buggy version included explicitly defined faulty components, providing a well-structured and reliable basis for further analysis. The complete process used to construct the software metrics dataset, covering static, dynamic, and test suite metrics, is illustrated in Supplementary Figure S1. For implementation specifics and reproducibility details, please refer to Section 2.9: Code Availability.

2.2. Prediction Abstraction Level

One of the goals of this research is to automate the localization of faulty components in the above-introduced dataset using three types of software metrics. Achieving this requires clearly defining the abstraction level for predictions. In object-oriented programming, faults can be localized at various levels of abstraction, including statements, blocks, methods, classes, or packages. Early research in software testing predominantly focused on the statement level [28,29,30]. However, more recent studies have explored fault localization at higher abstraction levels, particularly the class level [26,31,32,33,34,35].
In this work, we targeted the class level for fault localization, leveraging Defects4J to evaluate machine learning techniques at this abstraction. This choice aligns with contemporary research trends and offers a practical balance between granularity and computational feasibility, aiming to advance software quality assurance through more effective and scalable fault localization methods [26,36,37].

2.3. Class Labels

Spectrum-Based Fault Localization (SBFL) is a dynamic program analysis technique that identifies and ranks program elements that are likely to be faulty. It achieves this by analyzing program executions, specifically leveraging code coverage information (also known as program spectra) and test case outcomes [38,39]. Code coverage tracks which program elements—such as statements, blocks, functions, or classes—are executed during each test case, while test outcomes are categorized as either passed (expected behavior) or failed (unexpected behavior). The SBFL process uses a ranking formula to estimate the likelihood of faultiness for each program element, with prior evaluations indicating that most ranked formulas (also known as similarity measures) perform comparably [15,40].
To systematically define the multi-class labels for our study, we utilized the SBFL output generated using the GZoltar [41] tool, applied to the previously mentioned buggy project versions. GZoltar facilitated the collection and ranking of program elements based on their likelihood of being faulty, enabling a structured approach to defining class labels for our analysis. Furthermore, previous research [36] reported that software engineers find SBFL results less useful if the actual faulty class is ranked outside the top ten positions. Based on the information presented, we established the following multi-class labeling rules for our dataset:
  • Strongly Faulty: Real faulty class.
  • Faulty: Classes ranked between 1 and 5.
  • Fairly Faulty: Classes ranked between 6 and 10.
  • Weakly Faulty: Classes ranked between 11 and 15.
  • Not Faulty: Classes ranked beyond position 15.
This ranking approach, based on the Jaccard suspiciousness metric [40], provides a systematic method to define labels for analyzing and predicting fault-prone classes.
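As a minimal sketch of this labeling step (not the authors' exact implementation), assuming per-class spectrum counts named after Table 3 (Ncf, Nuf, Ncs) and the set of real faulty classes known from Defects4J:

def jaccard_suspiciousness(ncf: int, nuf: int, ncs: int) -> float:
    # Ncf: failing tests covering the class, Nuf: failing tests not covering it,
    # Ncs: passing tests covering the class (names follow Table 3).
    denom = ncf + nuf + ncs
    return ncf / denom if denom else 0.0

def label_classes(spectra, real_faulty):
    # spectra: {class_name: (ncf, nuf, ncs)}; real_faulty: set of truly faulty class names.
    ranked = sorted(spectra, key=lambda c: jaccard_suspiciousness(*spectra[c]), reverse=True)
    labels = {}
    for rank, cls in enumerate(ranked, start=1):
        if cls in real_faulty:
            labels[cls] = "Strongly Faulty"
        elif rank <= 5:
            labels[cls] = "Faulty"
        elif rank <= 10:
            labels[cls] = "Fairly Faulty"
        elif rank <= 15:
            labels[cls] = "Weakly Faulty"
        else:
            labels[cls] = "Not Faulty"
    return labels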

2.4. Feature Generation

To carry out machine learning experiments, we need to compute numerical statistics for each class in our dataset. These metrics serve as quantitative features encoding various aspects of the software, including source code characteristics, test suite properties, execution details, and fault repair information. Metrics were categorized into three groups based on their data source. The selection of metrics in the final feature set was guided by the hypothesis that each provides relevant information influencing fault localization prediction. Notably, the chosen code metrics were selected for their actionable nature, enabling software engineers to act on their values to manage code complexity and improve quality during development.

2.4.1. Static Metrics

Static software metrics are quantitative measures of software characteristics that are directly derived from the source code without requiring its execution. In this study, we utilized two tools for extracting these metrics:
CK (https://github.com/mauricioaniche/ck, accessed on 2 November 2024): This tool performs static analysis to compute a wide range of metrics, such as the well-known McCabe’s complexity, also referred to as the Weighted Method Per Class (WMC).
SCMS (https://github.com/issararab/SCMS, accessed on 18 November 2024): This software adopts a bottom-up approach, compiling method- and class-level metrics based on statement-level metrics. Its four foundational statement-level metrics are the number of operators (Op), number of levels (Lev), data flow (DF), and data usage (DU). For each of these metrics, SCMS calculates the total and maximum values at the method level, which are then aggregated at the class level. Additionally, this tool provides the number of within-class method calls and the in–out-degree of a class—this is defined as the number of external methods invoked by at least one method in the class. Another key metric is the count of public members in a class, encompassing both public fields and methods. (For further details, refer to [42]). Table 1 displays the list of compiled static metrics.

2.4.2. Dynamic Metrics

Dynamic software metrics measure a program’s runtime behavior, capturing aspects such as resource usage and performance during execution, which serve as a complement to static metrics. These metrics are often derived from dynamic call graphs generated by tools like JDCallgraph (https://github.com/dkarv/jdcallgraph, accessed on 8 September 2024). Additionally, both SCMS and CK can extract certain metrics based on call graphs.
Due to technical challenges in generating dynamic call graphs with JDCallgraph for all the projects in our dataset, we opted to rely exclusively on the dynamic metrics provided by CK and SCMS. The complete list of compiled dynamic metrics is presented in Table 2.

2.4.3. Test Suite Characteristics

Software test suite characteristics are quantitative measures that evaluate aspects such as the coverage, effectiveness, and efficiency of a test suite—this refers to a collection of test cases designed to verify software behavior and functionality. These metrics were obtained using Gzoltar [41] for coverage analysis. The complete list of compiled test suite metrics is presented in Table 3.

2.5. Data Preparation

An initial exploratory data analysis (EDA) was conducted to assess the structure and quality of the dataset (detailed in notebooks/exploratory_data_analysis.ipynb). The analysis revealed two key issues among others: (1) several features exhibited a high correlation (see Supplementary Figure S2), and (2) a significant class imbalance was observed in the distribution of faulty and non-faulty classes (Supplementary Figure S3), which could introduce bias during model training.
For preprocessing and feature selection, we applied a Scikit-learn [43] pipeline composed of four stages. First, a univariate imputer removed columns lacking any computed values and filled in missing values with the column mean. Second, feature standardization was performed by centering the data and scaling them to unit variance. Third, zero-variance features were discarded. Finally, for feature pairs with a Pearson correlation coefficient above 0.90, one feature was randomly removed to eliminate redundancy. This preprocessing pipeline reduced the dataset to 48 distinct software metrics.
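A minimal sketch of this four-stage pipeline is shown below, assuming a purely numeric feature matrix; unlike the text, the correlation filter here deterministically drops the later feature of each highly correlated pair rather than a random one (the exact implementation is in the project notebooks):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Drop one feature from every pair with |Pearson r| above the threshold."""
    def __init__(self, threshold: float = 0.90):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr = np.corrcoef(np.asarray(X), rowvar=False)
        upper = np.triu(np.abs(corr), k=1)          # upper triangle, no diagonal
        self.drop_idx_ = [j for j in range(upper.shape[1])
                          if (upper[:, j] > self.threshold).any()]
        return self

    def transform(self, X):
        return np.delete(np.asarray(X), self.drop_idx_, axis=1)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # mean imputation; all-empty columns are dropped
    ("scale", StandardScaler()),                  # center and scale to unit variance
    ("var", VarianceThreshold(threshold=0.0)),    # discard zero-variance features
    ("corr", CorrelationFilter(threshold=0.90)),  # drop one of each highly correlated pair
])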
Given the limited number of faulty class instances across all buggy versions, we applied a downsampling strategy to create a balanced dataset. The final dataset included five classes, each with 410 instances. This balanced dataset was then divided into a development set (80%) and an external test set (20%). Full details on how this dataset was generated can be found in the project’s notebooks.
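A compact sketch of the balancing and splitting step with the sizes stated above (410 instances per class, 80/20 split); the label column name, function name, and seed are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

def balance_and_split(df: pd.DataFrame, label_col: str = "label",
                      per_class: int = 410, seed: int = 42):
    # Downsample every label to the same number of instances, then split with stratification.
    balanced = (df.groupby(label_col, group_keys=False)
                  .apply(lambda g: g.sample(n=per_class, random_state=seed)))
    dev, test = train_test_split(balanced, test_size=0.20,
                                 stratify=balanced[label_col], random_state=seed)
    return dev, test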

2.6. Random Baseline Estimation

To assess classifier performance against a naive predictor, we estimated the random baseline in two ways. First, under the assumption of a uniform class distribution (given the balanced set), the random baseline accuracy could be computed analytically as 1/M, where M is the number of distinct classes. Second, to account for possible class imbalance, we performed a Monte Carlo simulation: random predictions were generated by sampling classes according to the empirical class probabilities observed in the test set. The random baseline accuracy was then estimated as the average proportion of correct predictions across the simulated random labels. Given the balanced dataset that we created, the Monte Carlo simulation yielded a similar result to the analytical calculation. For more details, see the notebook “random_baseline_calculation.ipynb” in the publicly available repository.
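Both estimates can be reproduced in a few lines; the sketch below assumes only the test-set labels, and the number of simulations and seed are arbitrary choices:

import numpy as np

def random_baseline(y_test, n_sim: int = 10_000, seed: int = 0):
    y_test = np.asarray(y_test)
    classes, counts = np.unique(y_test, return_counts=True)
    analytical = 1.0 / len(classes)                        # uniform assumption: 1/M
    rng = np.random.default_rng(seed)
    sims = rng.choice(classes, size=(n_sim, len(y_test)), p=counts / counts.sum())
    monte_carlo = float((sims == y_test).mean())           # mean accuracy of random predictions
    return analytical, monte_carlo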

2.7. Machine Learning Models

The objective of this research is to evaluate which software metrics most effectively contribute to predicting faulty classes in Java projects. Given the multi-class classification nature of the problem and the need for interpretable results, we selected tree-based machine learning algorithms, namely Random Forest (RF), XGBoost, and LightGBM, as these models are widely recognized for their strength in explainability tasks [44]. To further validate our findings, we also trained a deep learning (DL) model, using it as a baseline to ensure that the best-performing tree-based model was at least comparable to, if not better than, the DL approach. Although the DL model was not the primary focus, it served as an important benchmark. An extensive hyperparameter grid search was conducted for all models (refer to Supplementary Tables S1–S4 for details). For tree-based models, performance was evaluated using 3-fold cross-validation on the training dataset to identify the best configuration for each algorithm. In total, we fitted 216 RF, 1944 XGBoost, and 648 LightGBM models.
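As an illustration of this selection loop, the sketch below wires a hypothetical, much smaller LightGBM grid into scikit-learn's GridSearchCV with 3-fold cross-validation; the grids actually used are those in Supplementary Tables S1–S4:

from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

def select_lightgbm(X_dev, y_dev):
    # Illustrative grid only; the study's full grids are listed in the Supplementary Tables.
    param_grid = {
        "n_estimators": [200, 500, 1000],
        "num_leaves": [31, 63, 127],
        "learning_rate": [0.01, 0.05, 0.1],
    }
    search = GridSearchCV(
        LGBMClassifier(random_state=42),
        param_grid,
        cv=3,                   # 3-fold cross-validation on the development set
        scoring="accuracy",
        n_jobs=-1,
    )
    search.fit(X_dev, y_dev)
    return search.best_estimator_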

2.8. Evaluation Metrics

To evaluate the performance of the multi-class classification models, we used the following metrics:
- Accuracy (ACC): It measures the proportion of correctly predicted instances over the total number of instances. It provides a general overview of the model’s performance but can be misleading when dealing with imbalanced datasets.
$\text{Accuracy} = \dfrac{\#\text{ of correct predictions}}{\text{total number of predictions}}$
- Weighted Precision (WP): It quantifies the proportion of correctly identified positive predictions out of all predictions for a given class. Weighted precision extends this calculation by incorporating the support of each class, ensuring that the precision metric reflects the class distribution as follows:
$\text{Weighted Precision} = \sum_{i=1}^{N} W_i \cdot \text{precision}_i$
where $W_i$ is the fraction of true instances belonging to class $i$, and $N$ is the total number of classes.
- Weighted Recall (WR): It measures the proportion of actual positive instances correctly identified by the model for a given class. Similarly to weighted precision, weighted recall accounts for class support, ensuring that it reflects the overall dataset distribution as follows:
$\text{Weighted Recall} = \sum_{i=1}^{N} W_i \cdot \text{recall}_i$
where $W_i$ is the fraction of true instances belonging to class $i$, and $N$ is the total number of classes.
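In practice, these support-weighted scores correspond to scikit-learn's "weighted" averaging option; a minimal sketch:

from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    # "weighted" averaging weights each per-class score by its support fraction.
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "WP": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "WR": recall_score(y_true, y_pred, average="weighted", zero_division=0),
    }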

2.9. Code Availability

All code for this project is publicly available as open source under the Apache 2.0 license in the GitHub repository, https://github.com/issararab/software-metrics-fault-localization-prediction (accessed on 1 May 2025). The implementation utilizes the following libraries: XGBoost (v2.0.3), LightGBM (v4.5.0), PyTorch (v2.1.2), and Scikit-Learn (v1.3.1) for data preprocessing and model development; NumPy (v1.23.5) and Pandas (v2.1.4) for scientific computing; and Matplotlib (v3.8.2) and Seaborn (v0.13.2) for data visualization. All data analysis was conducted using Jupyter Notebooks (v7.4.3).
The notebooks used to reproduce the results presented in this study are available at https://github.com/issararab/software-metrics-fault-localization-prediction/tree/main/notebooks (accessed on 1 May 2025). The workflow for reproducing the software metrics used (for all buggy versions) is detailed in Supplementary Figure S1. A mapping of scripts to their respective steps in the workflow diagram is provided in Supplementary Table S5, and all related scripts are accessible at https://github.com/issararab/software-metrics-fault-localization-prediction/tree/main/software_metrics_compilation (accessed on 1 May 2025).

3. Results and Discussion

After collecting metrics from all buggy versions of projects in the Defects4J repository, we applied an unsupervised preprocessing pipeline to the dataset, initially consisting of 53 features. This process reduced the feature space to 48 metrics by discarding five features. The removed features included two static metrics (Tot2DU and TotMaxDU), two dynamic metrics (MaxTotDF and TotMaxDF), and one test suite metric (Nntc, representing the total number of test cases).
The balanced dataset was used to train several machine learning models with different configurations. Tree-based models significantly outperformed both deep learning models and the random baseline (see Supplementary Figure S4) when evaluated on an external test set. The top-performing models—Random Forest (RF), XGBoost, and LightGBM—were selected to address the primary research questions. To evaluate the statistical significance of their performance, we conducted a bootstrap analysis with 1000 iterations on the evaluation set. Table 4 presents the benchmark results, reporting accuracy, weighted precision, and weighted recall for the multi-class classification task, along with the error bars of each model.
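The reported error bars can be reproduced with a standard bootstrap over the evaluation-set predictions, sketched below with 1000 resamples (function name and seed are illustrative):

import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_accuracy(y_true, y_pred, n_boot: int = 1000, seed: int = 0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))   # resample with replacement
        scores.append(accuracy_score(y_true[idx], y_pred[idx]))
    return float(np.mean(scores)), float(np.std(scores))       # point estimate and error bar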
Gradient boosting models, specifically XGBoost and LightGBM, outperformed RF by approximately five percentage points while demonstrating similar performance to each other. Among these, LightGBM was selected as the best-performing model due to its slightly higher overall performance. The confusion matrix on the evaluation set, depicted in Figure 1, highlights the strong predictive capability of the chosen model. It is also worth mentioning that the model shows strong robustness, as evidenced by stable performance through bootstrap sampling and validation on an external set.
However, generalizability may be constrained by the reliance on Defects4J and the lack of cross-domain validation, suggesting that future work should assess performance across broader software domains. That said, since the studied projects are all object-oriented systems written in Java that share common design principles, it is plausible that the findings extend to other projects developed with similar paradigms and language ecosystems. Nevertheless, we should be cautious when extrapolating these results to systems built with fundamentally different architectural styles or programming languages.
LightGBM, combined with the 48 input features, demonstrated significant potential in predicting the faulty class with approximately 90% confidence. To better understand the contribution of individual metrics and metric groups to the model’s decisions, Figure 2 illustrates the feature importance within the best-performing model.
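The importances behind Figure 2 come directly from the fitted model; a minimal sketch, assuming feature_names holds the 48 retained metric names:

import pandas as pd
import matplotlib.pyplot as plt

def plot_feature_importances(model, feature_names):
    # LightGBM's default importance is the number of splits that use each feature.
    imp = pd.Series(model.feature_importances_, index=feature_names).sort_values()
    imp.plot(kind="barh", figsize=(6, 12))
    plt.xlabel("Feature importance (split count)")
    plt.tight_layout()
    plt.show()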
Fifteen test suite metrics emerged as the most influential in this analysis, with TotPassTestRatio ranking first. This result aligns with our expectations, as TotPassTestRatio intuitively represents a highly informative feature. In an ideal scenario, a perfectly designed test suite capable of covering all edge cases could rely solely on this metric to effectively locate faults within the software. However, in practice, test suites are rarely comprehensive, necessitating the inclusion of additional metrics to make more informed decisions for identifying and prioritizing potentially faulty classes.
Beyond test suite metrics, the next most important features included one dynamic metric, Tot2DF, which ranked 16th, and one static metric, CBO, which ranked 17th. Tot2DF, which is based on the statement level data flow metric, is a dynamic data complexity metric that captures data interactions and behavior within the software. Its importance is rooted in its ability to convey critical insights into data flow, which aligns closely with the object-oriented principle of cohesion and has long been valued by software testers for its diagnostic utility.
Similarly, CBO (Coupling Between Objects) stands out as a pivotal static metric in software testing. By quantifying the degree of interdependence among classes in a codebase, CBO sheds light on the complexity of software architecture. High coupling not only complicates testing efforts but also increases the risk of cascading failures, as changes in one class may propagate errors or necessitate modifications in dependent classes. As such, CBO provides essential insights into modularity and maintainability, serving as a valuable indicator for identifying integration testing hotspots and guiding focused testing efforts to mitigate potential system fragility.

4. Conclusions

This work demonstrates the applicability of machine learning in improving software fault localization by leveraging diverse software metrics. By adopting a multi-class classification framework and utilizing the Defects4J dataset, we developed models capable of effectively predicting faulty classes with significant accuracy. The superior performance of LightGBM, combined with the insights derived from feature importance analysis, underscores the critical role of test suite metrics in fault localization. Additionally, the relevance of static and dynamic metrics highlights the need for a holistic approach that integrates structural and behavioral software characteristics.
As part of the practical implications of this work, the findings that test suite metrics, like TotPassTestRatio, are the most influential imply that improving test case design directly enhances fault localization effectiveness. This insight can guide quality assurance teams to invest in better test case coverage and diversity. Furthermore, this study shows further actionable feature insights, with features like CBO and Tot2DF being ranked highly and offering specific architectural guidance. Teams can focus on reducing coupling or analyzing data flow hotspots to preempt fault clusters.
We also discussed how software testing is not a strictly linear process but rather an iterative cycle of feedback and refinement. While LLMs have recently advanced test generation, debugging, and understanding of bug reporting, they largely operate in isolation or downstream in the testing pipeline. Our approach, which evaluates the predictive power of software metrics—particularly including test suite execution characteristics—naturally complements these methods by introducing fault localization models that evolve alongside test cycles. These predictive models not only benefit from data produced during test execution but also feed insights back into the testing process, enabling more targeted LLM applications.
This aligns with the current landscape of advanced intelligent systems, where there is a growing recognition that no single AI agent can effectively handle the full complexity of software engineering tasks. Instead, the future points toward compound AI systems as a combination of specialized agents or models that collaborate to complement each other’s strengths. Within this paradigm, metrics-based predictive models and LLMs represent two distinct but highly synergistic components, enabling a more robust, adaptive, and context-aware approach to software fault localization and testing. In this setup, metric-based models prioritize fault-prone classes, allowing LLMs to focus test generation on those areas—maximizing coverage and efficiency—while new test results feed back into the model to refine predictions and guide further targeted testing. This highlights a symbiotic relationship between metric-based prediction models and LLMs, working together to iteratively improve the effectiveness, efficiency, and coverage of software testing.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/computers14060222/s1. Figure S1: Workflow diagram to compile the software metrics dataset used for model building and data analysis in this study; Figure S2: Correlation matrix of all the collected software metrics; Figure S3: Class distribution of the fully compiled dataset; Figure S4: Benchmarking of all different models fitted on the data and evaluated on the external test dataset; Table S1: Hyperparameters considered for the Deep Learning model; Table S2: Hyperparameters considered for the Random Forest model; Table S3: Hyperparameters considered for the XGBoost model; Table S4: Hyperparameters considered for the LightGBM model; Table S5: Mapping table of the scripts used in each step detailed in Supplementary Figure S1 to compile the dataset.

Author Contributions

Conceptualization, I.A., K.M. and M.A.; methodology, I.A.; validation, I.A., K.M. and M.A.; data curation, I.A.; writing—original draft preparation, I.A.; writing—review and editing, I.A., K.M. and M.A.; visualization, I.A.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to acknowledge the support of Prince Sultan University for paying the Article Processing Charge (APC) of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Haghighatkhah, A.; Oivo, M.; Banijamali, A.; Kuvaja, P. Improving the state of automotive software engineering. IEEE Softw. 2017, 34, 82–86. [Google Scholar] [CrossRef]
  2. Rana, R.; Staron, M.; Hansson, J.; Nilsson, M. Defect prediction over software life cycle in automotive domain state of the art and road map for future. In Proceedings of the 2014 9th International Conference on Software Engineering and Applications (ICSOFT-EA), Vienna, Austria, 29–31 August 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 377–382. [Google Scholar]
  3. Planning, S. The economic impacts of inadequate infrastructure for software testing. Natl. Inst. Stand. Technol. 2002, 1, 169. [Google Scholar]
  4. Jones, J.A.; Harrold, M.J. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, Long Beach, CA, USA, 7–11 November 2005; pp. 273–282. [Google Scholar]
  5. Renieres, M.; Reiss, S.P. Fault localization with nearest neighbor queries. In Proceedings of the 18th IEEE International Conference on Automated Software Engineering, Montreal, QC, Canada, 6–10 October 2003; IEEE: Piscataway, NJ, USA, 2003; pp. 30–39. [Google Scholar]
  6. Zeller, A. Isolating cause-effect chains from computer programs. ACM SIGSOFT Softw. Eng. Notes 2002, 27, 1–10. [Google Scholar] [CrossRef]
  7. Omri, S.; Montag, P.; Sinz, C. Static analysis and code complexity metrics as early indicators of software defects. J. Softw. Eng. Appl. 2018, 11, 153. [Google Scholar] [CrossRef]
  8. Arab, I.; Bourhnane, S.; Kafou, F. Unifying modeling language-merise integration approach for software design. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 4. [Google Scholar] [CrossRef]
  9. Falah, B.; Akour, M.; Arab, I.; M’hanna, Y. An attempt towards a formalizing UML class diagram semantics. In Proceedings of the New Trends in Information Technology (NTIT-2017), Amman, Jordan, 25–27 April 2017; pp. 21–27. [Google Scholar]
  10. Oyeniran, O.C.; Adewusi, A.O.; Adeleke, A.G.; Akwawa, L.A.; Azubuko, C.F. AI-driven DevOps: Leveraging machine learning for automated software deployment and maintenance. Eng. Sci. Technol. J. 2023, 4, 6, 728–740. [Google Scholar] [CrossRef]
  11. Arab, I.; Bourhnane, S. Reducing the cost of mutation operators through a novel taxonomy: Application on scripting languages. In Proceedings of the International Conference on Geoinformatics and Data Analysis, Prague, Czech Republic, 20–22 April 2018; pp. 47–56. [Google Scholar]
  12. Jones, J.A.; Bowring, J.F.; Harrold, M.J. Debugging in parallel. In Proceedings of the 2007 International Symposium on Software Testing and Analysis, London, UK, 9–12 July 2007; pp. 16–26. [Google Scholar]
  13. Abreu, R.; Zoeteweij, P.; Van Gemund, A.J. On the accuracy of spectrum-based fault localization. In Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques-MUTATION (TAICPART-MUTATION 2007), Windsor, UK, 10–14 September 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 89–98. [Google Scholar]
  14. Wong, W.E.; Gao, R.; Li, Y.; Abreu, R.; Wotawa, F. A survey on software fault localization. IEEE Trans. Softw. Eng. 2016, 42, 707–740. [Google Scholar] [CrossRef]
  15. Pearson, S.; Campos, J.; Just, R.; Fraser, G.; Abreu, R.; Ernst, M.D.; Pang, D.; Keller, B. Evaluating and improving fault localization. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), Buenos Aires, Argentina, 20–28 May 2017; pp. 609–620. [Google Scholar]
  16. Liblit, B.; Naik, M.; Zheng, A.X.; Aiken, A.; Jordan, M.I. Scalable statistical bug isolation. ACM Sigplan Not. 2005, 40, 15–26. [Google Scholar] [CrossRef]
  17. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386. [Google Scholar] [CrossRef]
  18. Arab, I.; Barakat, K. ToxTree: Descriptor-based machine learning models for both hERG and Nav1.5 cardiotoxicity liability predictions. arXiv 2021, arXiv:2112.13467. [Google Scholar]
  19. Hapke, H.; Howard, C.; Lane, H. Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python. Simon and Schuster; Amazon: Seattle, WA, USA, 2019. [Google Scholar]
  20. Zou, Y.; Li, H.; Li, D.; Zhao, M.; Chen, Z. Systematic Analysis of Learning-Based Software Fault Localization. In Proceedings of the 2024 10th International Symposium on System Security, Safety, and Reliability (ISSSR), Xiamen, China, 16–17 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 478–489. [Google Scholar]
  21. Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software testing with large language models: Survey, landscape, and vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936. [Google Scholar] [CrossRef]
  22. Tufano, M.; Drain, D.; Svyatkovskiy, A.; Deng, S.K.; Sundaresan, N. Unit test case generation with transformers and focal context. arXiv 2020, arXiv:2009.05617. [Google Scholar]
  23. Mastropaolo, A.; Cooper, N.; Palacio, D.N.; Scalabrino, S.; Poshyvanyk, D.; Oliveto, R.; Bavota, G. Using transfer learning for code-related tasks. IEEE Trans. Softw. Eng. 2022, 49, 1580–1598. [Google Scholar] [CrossRef]
  24. Sobania, D.; Briesch, M.; Hanna, C.; Petke, J. An analysis of the automatic bug fixing performance of chatgpt. In Proceedings of the 2023 IEEE/ACM International Workshop on Automated Program Repair (APR), Melbourne, Australia, 16 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 23–30. [Google Scholar]
  25. Mukherjee, U.; Rahman, M.M. Employing deep learning and structured information retrieval to answer clarification questions on bug reports. arXiv 2023, arXiv:2304.12494. [Google Scholar]
  26. Mahbub, P.; Shuvo, O.; Rahman, M.M. Explaining software bugs leveraging code structures in neural machine translation. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 640–652. [Google Scholar]
  27. Just, R.; Jalali, D.; Ernst, M.D. Defects4j: A database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ser. ISSTA 2014, New York, NY, USA, 21–26 July 2014; ACM: New York, NY, USA, 2014; pp. 437–440. [Google Scholar]
  28. Artzi, S.; Dolby, J.; Tip, F.; Pistoia, M. Fault localization for dynamic web applications. IEEE Trans. Softw. Eng. 2011, 38, 314–335. [Google Scholar] [CrossRef]
  29. Mariani, L.; Pastore, F.; Pezze, M. Dynamic analysis for diagnosing integration faults. IEEE Trans. Softw. Eng. 2010, 37, 486–508. [Google Scholar] [CrossRef]
  30. Baah, G.K.; Podgurski, A.; Harrold, M.J. Mitigating the confounding effects of program dependences for effective fault localization. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering; ACM: New York, NY, USA, 2011; pp. 146–156. [Google Scholar]
  31. Ye, X.; Bunescu, R.; Liu, C. Learning to rank relevant files for bug reports using domain knowledge. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering; ACM: New York, NY, USA, 2024; pp. 689–699. [Google Scholar]
  32. Zhou, J.; Zhang, H.; Lo, D. Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In Proceedings of the 2012 34th International Conference on Software Engineering (ICSE); ACM: New York, NY, USA, 2012; pp. 14–24. [Google Scholar]
  33. Kim, D.; Tao, Y.; Kim, S.; Zeller, A. Where should we fix this bug? a two-phase recommendation model. IEEE Trans. Softw. Eng. 2013, 39, 1597–1610. [Google Scholar]
  34. Zagane, M.; Abdi, M.K.; Alenezi, M. Deep learning for software vulnerabilities detection using code metrics. IEEE Access 2020, 8, 74562–74570. [Google Scholar] [CrossRef]
  35. Kochhar, P.S.; Xia, X.; Lo, D.; Li, S. Practitioners’ expectations on automated fault localization. In Proceedings of the 25th International Symposium on Software Testing and Analysis, Saarbrücken, Germany, 18–20 July 2016; pp. 165–176. [Google Scholar]
  36. Sarhan, Q.I.; Beszédes, Á. A survey of challenges in spectrum-based software fault localization. IEEE Access 2022, 10, 10618–10639. [Google Scholar] [CrossRef]
  37. Ma, C.; Tan, T.; Chen, Y.; Dong, Y. An if-while-if model-based performance evaluation of ranking metrics for spectra-based fault localization. In Proceedings of the 2013 IEEE 37th Annual Computer Software and Applications Conference, Kyoto, Japan, 22–26 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 609–618. [Google Scholar]
  38. Li, P.; Jiang, M.; Ding, Z. Fault localization with weighted test model in model transformations. IEEE Access 2020, 8, 14054–14064. [Google Scholar] [CrossRef]
  39. Keller, F.; Grunske, L.; Heiden, S.; Filieri, A.; van Hoorn, A.; Lo, D. A critical evaluation of spectrum-based fault localization techniques on a large-scale software system. In Proceedings of the 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), Prague, Czech Republic, 25–29 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 114–125. [Google Scholar]
  40. Campos, J.; Riboira, A.; Perez, A.; Abreu, R. Gzoltar: An eclipse plug-in for testing and debugging. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, Essen, Germany, 3–7 September 2012; pp. 378–381. [Google Scholar]
  41. Arab, I.; Falah, B.; Magel, K. SCMS: Tool for Assessing a Novel Taxonomy of Complexity Metrics for any Java Project at the Class and Method Levels based on Statement Level Metrics. Adv. Sci. Technol. Eng. Syst. J. 2019, 4, 220–228. [Google Scholar] [CrossRef]
  42. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  43. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A review of machine learning interpretability methods. Entropy 2020, 23, 18. [Google Scholar] [CrossRef] [PubMed]
  44. Al Qasem, O.; Akour, M.; Alenezi, M. The influence of deep learning algorithms factors in software fault prediction. IEEE Access 2020, 8, 63945–63960. [Google Scholar] [CrossRef]
Figure 1. The confusion matrix for the external evaluation set of the best model—LightGBM.
Figure 2. The LightGBM model’s feature importance for all the 48 gathered software metrics.
Table 1. Static metrics (metric name followed by a short description).
CBO: Counts the number of dependencies a class has
NOF: Counts the number of fields in a class, no matter their modifiers
NOPF: Counts only the public fields
NOSF: Counts only the static fields
NOM: Counts the number of methods, no matter their modifiers
NOPM: Counts only the public methods
NOSM: Counts only the static methods
WMC: Counts the number of branch instructions in a class
LOC: Counts the lines of code, ignoring empty lines
LCOM: Calculates the Lack of Cohesion of Methods
Tot2Op: Counts the total number of operators
TotMaxOp: Counts the total of the max operators
Max2Op: Counts the max operators
MaxTotOp: Counts the max total number of operators based on method results
Tot2Lev: Counts the total number of levels in the whole class code based on method results
TotMaxLev: Counts the sum of the maximum level in each method
MaxTotLev: Counts the max of the total number of levels in each method
Max2Lev: Counts the max level in the whole class (in other words, the deepest branch)
Tot2DU: Counts the total amount of data usage in the class
TotMaxDU: Counts the total amount of max data usage in the class
Max2DU: Counts the max amount of data usage in the class
MaxTotDU: Counts the max amount of total data usage in the class
PubMembers: Counts the number of public members (fields or methods)
Table 2. Dynamic metrics (metric name followed by a short description).
DIT: Counts the number of “fathers” a class has
NOSI: Counts the number of invocations to static methods
NOC: Counts the number of children a class has
RFC: Counts the number of unique method invocations in a class
Tot2DF: Counts the total number of data flows in a class
TotMaxDF: Counts the total max data flow in each method of the class
Max2DF: Counts the max data flow in each method of the class
MaxTotDF: Counts the max of total data flows in each method of the class
TotInMetCall: Counts the total number of within-class method calls
MaxInMetCall: Counts the max number of within-class method calls
InOutDeg: Counts the number of external methods invoked from within the class (similar to the out-degree of a dynamic call graph)
Table 3. Test suite metrics (metric name followed by a short description).
Run Time: The run time in seconds that it took Gzoltar to run all the tests and generate the matrix
Ncf: The number of failed test cases that cover the class
Nuf: The number of failed test cases that do not cover the class
Ncs: The number of successful test cases that cover the class
Ns: The number of successful tests
Nf: The number of failed tests
Ntsc: The total number of statements in the class covered by the test suite
Ndsc: The distinct number of statements covered by the test suite in a class
Nntc: The total number of test cases
PassTestRatio: The ratio of passed test cases in a class vs. the total number of tests that cover the class
FailTestRatio: The ratio of failed test cases in a class vs. the total number of tests that cover the class
TotPassTestRatio: The ratio of passed test cases in a class vs. the total number of tests in the test suite
TotFailTestRatio: The ratio of failed test cases in a class vs. the total number of tests in the test suite
NTestRunPerRT: The number of tests run on a class vs. the total run time
Um: Uniqueness metric
Md: Matrix density
Nmd: Normalized matrix density
Gs: Gini Simpson
DDU: Density Diversity Uniqueness
Table 4. Performance of best models (weighted precision, WP; weighted recall, WR; accuracy, ACC).
RF: WP = 73.8% ± 2.1%, WR = 73.5% ± 2.1%, ACC = 73.5% ± 2.1%
XGBoost: WP = 78.5% ± 2.0%, WR = 78.1% ± 2.0%, ACC = 78.1% ± 2.0%
LightGBM: WP = 79.0% ± 1.9%, WR = 78.6% ± 2.0%, ACC = 78.6% ± 2.0%
