1. Introduction
Software reliability, availability, and security lie at the heart of modern computing research. A method that has gained traction in recent years for improving the dependability of software systems is fault or bug prediction—the estimation of which parts of a codebase are likely to contain defects. By guiding developers in prioritizing testing and maintenance efforts, fault prediction mitigates the risks of vulnerabilities, downtime, and expensive post-release fixes [1].
Over the years, numerous bug-prediction models and metrics have been explored, ranging from static code measures (e.g., Halstead metrics) to process-centric metrics (e.g., code churn) [2,3,4]. However, many existing studies—including those leveraging modern machine learning approaches—rely on limited or narrowly scoped datasets [1]. For instance, many benchmarks provide only static measures of older projects [5] or focus on a small number of open-source repositories with restricted metric suites [6]. Consequently, these datasets may not fully capture the real-world diversity of domains, commit practices, or code-change characteristics found in contemporary software ecosystems.
In this paper, we address these limitations by leveraging a large and diverse set of open-source projects—ranging from data processing frameworks to healthcare and security tools—encompassing millions of commits. We analyze each project at the file level on a per-commit basis over its entire lifetime, providing a granular perspective on code evolution.
To analyze this diverse dataset and predict error-prone code therein, we employed CatBoost [7,8], a gradient boosting algorithm known for its strong performance across a variety of problems.
As model inputs, we used code complexity metrics to extract information from each commit across all repositories, along with an indicator of whether a bug fix occurred. This resulted in a dataset comprising a selection of code metrics and a target variable indicating the presence or absence of a bug fix. We then trained the machine learning model on this dataset to predict whether a given commit was more likely to introduce a bug or resolve one.
Beyond its predictive capabilities, we also leveraged CatBoost in reverse—using a feature importance analysis to determine which factors (i.e., model inputs) had the greatest influence on classification outcomes. This allowed us to identify key factors and how they contribute to code quality and bug-proneness, providing deeper insights into software evolution patterns and informing best practices for improving software reliability. The entire process is illustrated in Figure 1, which outlines the six key steps of our approach.
In summary, the contributions of this work are as follows:
- Large-Scale Multi-Project Analysis: We collect and preprocess commits from 33 open-source projects, yielding a rich, heterogeneous dataset that spans domains, programming languages, and repository sizes.
- Commit-Level Feature Extraction: Beyond standard static metrics, we incorporate process-centric (e.g., churn, revision frequency, age) and entropy-based features by analyzing changes commit by commit at the file level, thereby capturing the scattered nature of code evolution over time.
- Robust Boost Classifier Implementation: We integrate a Bayesian-optimized CatBoost classifier with a sophisticated undersampling strategy to achieve maximal generalizability and accuracy.
- Empirical Evaluation and Feature Importance: We validate our model’s performance using standard metrics (precision, recall, F1-score) on real-world data and provide a detailed feature importance analysis, identifying how churn, age, and entropy measures synergize in indicating defect risk.
Our work’s key contribution lies in showcasing how a feature importance study—in conjunction with a broad metric suite and a realistic dataset—produces both strong predictive performance and concrete, actionable insights for practitioners. This clarity is especially needed in modern software projects, where understanding why certain files are risky can be as critical as predicting which files to inspect.
The remainder of this paper is structured as follows: Section 2 reviews related work on bug prediction approaches and datasets, highlighting existing gaps. Section 3 details our methodology and experimental setup, including the dataset, labeling strategy, and metric extraction process. We present our results—including our machine learning predictions and corresponding feature importance analysis—in Section 4, and discuss threats to validity in Section 5. Finally, Section 6 draws conclusions and offers directions for future work.
2. Related Work
One finds a wide range of approaches for predicting bugs and faults in software, spanning from traditional correlation-based metric analyses to modern machine learning and artificial intelligence (AI)-driven methods. Our proposed approach bridges the gap between these classical metric-based techniques and contemporary AI models. In the following, we discuss related work and examine historical datasets relevant to defect prediction.
Bug Prediction Datasets. Many defect prediction approaches rely on datasets such as the NASA MDP repository or the PROMISE dataset, which typically offer limited static code metrics from a small number of projects [1,5,6]. While such benchmarks help researchers compare machine learning models, their narrow scope does not reflect modern development practices. More recent studies have included data from open-source projects, yet often remain confined to language-specific or small-scale samples, limiting generalizability [9].
Machine Learning Approaches and CatBoost. Techniques such as Random Forest, Logistic Regression, and XGBoost have shown promise for fault prediction [10,11]. CatBoost, a gradient boosting framework notable for handling high-cardinality categorical features and missing values [7,8], has likewise been tested in defect prediction. However, these CatBoost-based efforts typically rely on either NASA MDP-like datasets or constrained open-source repositories, inheriting the same limitations in project diversity and metric variety [10,12,13,14,15,16]. Moreover, many of these studies apply only rudimentary parameter tuning, potentially overlooking CatBoost’s full capabilities. Despite CatBoost’s potential, prior works remain limited to small, static datasets or do not explore advanced tuning, leaving a significant opportunity for more robust evaluation.
Although advanced boosting models can provide high predictive accuracy, software engineering practitioners often need to understand why certain files or commits are labeled as defect-prone. In bug prediction contexts, such interpretability can illuminate which code metrics drive predictions, enabling project teams to better allocate testing resources and grasp emergent risk factors. Despite this growing emphasis, many prior bug-prediction studies treat boosting models as largely “black-box” methods, providing minimal clarity on how each metric contributes to final decisions [10,17].
Process Metrics. Recent studies increasingly prioritize process-centric features over purely static code measures for fault prediction [4]. Graves et al. [18] pioneered a rich set of change metrics that capture how frequently and in what manner a file is modified over its lifecycle. This set was later expanded by Moser et al. [19] and includes attributes such as the number of revisions, code churn, and age, which tracks a file’s lifespan. Beyond this original definition, relative churn emerged as a set of metrics that normalize churn by the total lines of code, enabling comparisons across files of varying sizes [3]. Likewise, size metrics such as totalLoc help gauge the scope of each code unit at commit time [20].
Entropy-based features, another type of process metric, complement these change metrics by focusing on how and where changes occur within a project. Hassan et al. introduced the concept of code-change entropy [21], which quantifies the scattering of modifications across multiple files during a certain time window, providing insight into whether developers introduce small, localized edits or large, system-wide alterations. High entropy has been linked to fault-proneness, as widespread changes can be harder for developers to track and may mask newly introduced issues [21]. Yet, while there have been efforts to integrate such entropy-based features into bug prediction, few adopt a commit-by-commit perspective across multiple domains. In this study, we unify size metrics, Moser-inspired change metrics (including our own relative variants), and Hassan’s entropy measures with a rigorously tuned CatBoost model, aiming to yield a more holistic view of code evolution and its relationship to defects.
3. Methodology
This study aims to develop a robust machine learning approach for bug prediction that can generalize across a diverse set of open-source repositories. To achieve this, we employ a commit-level feature extraction process and a careful hyperparameter-tuning pipeline, prioritizing both predictive performance and interpretability. While we use CatBoost as our primary boosting algorithm—chosen for its strong handling of categorical data and support for advanced feature-importance analysis—our overarching goal is to validate a high-performing, explainable ML method for heterogeneous software environments. The following sections detail our data collection, labeling, feature engineering, and modeling techniques, laying out how we combine different metrics with a robust boosting framework to predict fault-prone code.
3.1. Data Collection
Table 1 summarizes the 33 open-source projects in our dataset, spanning a wide range of domains (e.g., security, healthcare, data processing, and web servers), with project ages up to over a decade and sizes ranging from a few dozen to tens of thousands of commits. In total, we collected 2,407,917 commits across these repositories, each annotated with code complexity and process metrics (see Section 3.2). We labeled commits as bugfix or non-bugfix based on commit-message keywords (Section 3.1), yielding a binary target variable isBugfix. Notably, the dataset exhibits class imbalance: 1,992,334 (82.7%) non-bugfix commits vs. 415,583 (17.3%) bugfix commits. This diversity in domain, language, and repository size provides a robust foundation for evaluating CatBoost’s effectiveness in realistic, heterogeneous software development environments.
All commit data in this study was extracted from Git repositories. Each commit provides code changes along with metadata (author, timestamp, message), forming the basis for our bug prediction dataset. While some projects link commits to external issue trackers (e.g., Jira, GitHub Issues), aligning such data across multiple repositories is often infeasible—especially for older projects spanning many years. Instead, we employ a keyword-based approach inspired by Mockus et al. [22], parsing commit messages for trigger words (fix, bug, fail, and correct) to label a commit as a bugfix. Before matching, we normalize and stem each message, ensuring that variations like [Bug#1234] or bugfixed are detected correctly. Spot checks confirm sufficient accuracy for our repositories, enabling efficient bugfix identification without extensive manual annotation or third-party data.
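To make the labeling procedure concrete, the following minimal sketch illustrates keyword-based bugfix labeling of commit messages. It is an illustration only: the trigger stems come from the list above, but the normalization and prefix-based matching are simplifying assumptions rather than the exact rules used in our pipeline.

```python
import re

# Hypothetical sketch of keyword-based bugfix labeling (not the study's exact implementation).
TRIGGER_STEMS = ("fix", "bug", "fail", "correct")

def is_bugfix(message: str) -> bool:
    """Return True if any normalized token of the commit message starts with a trigger stem."""
    normalized = re.sub(r"[^a-z0-9 ]", " ", message.lower())  # strips punctuation such as '[Bug#1234]'
    tokens = normalized.split()
    # Crude prefix matching stands in for stemming: 'bugfixed' matches 'bug', 'fixes' matches 'fix'.
    return any(tok.startswith(stem) for stem in TRIGGER_STEMS for tok in tokens)

print(is_bugfix("[Bug#1234] bugfixed the parser"))  # True
print(is_bugfix("Add export feature"))              # False
```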
3.2. Code Metrics
We gather features that capture how files change (process metrics), as well as how large (size metrics) and how “chaotic” (entropy metrics) these changes are. We treat each commit as a set of inserted, deleted, or replaced lines, thereby allowing richer churn-based calculations than simply taking the net difference in lines of code.
Absolute Modifications (modified). Each file’s commit-level changes are summarized by the total number of lines added, deleted, or replaced. Replacements are split into a removed portion and an added portion, ensuring that even zero-sum modifications (e.g., swapping lines of code) register as nontrivial changes. We intentionally treat additions and deletions equivalently in this metric, as our focus is on capturing the overall magnitude of change rather than its direction, based on the assumption that extensive modifications of any type correlate with increased fault risk.
Relative Modifications (relativeAdded, relativeReplaced). Beyond counting raw changes, we normalize them by the file’s current size. RelativeAdded is the fraction of added lines over the total number of lines in the new version, and relativeReplaced is the fraction of how many have been replaced. This highlights smaller files where even minimal changes may have a significant impact on maintainability. Note that while our absolute ’modified’ metric accounts for deletions, we do not include a separate ’relativeDeleted’ metric because deletions reference code that no longer exists in the current file state, making normalization less straightforward.
Revision Count (revision). We track how many times a file has been revised, offering a straightforward indicator of its edit frequency.
Total Lines of Code (totalLoc). This size metric quantifies the file’s scale at each commit. Large files might accumulate high modified values purely due to their size, so combining totalLoc with relative metrics helps contextualize high-churn files of varying scales.
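As an illustration of how these churn and size features could be assembled per file and commit, the sketch below defines a small helper. The helper name, its arguments, and the convention of counting each replacement on both its removed and added side are our own assumptions for illustration, not the paper’s exact tooling.

```python
from dataclasses import dataclass

@dataclass
class FileChangeMetrics:
    modified: int            # total lines added, deleted, or replaced in this commit
    relativeAdded: float     # added lines / total lines of the new file version
    relativeReplaced: float  # replaced lines / total lines of the new file version
    revision: int            # number of revisions of this file so far
    totalLoc: int            # file size (lines of code) at this commit

def compute_metrics(added: int, deleted: int, replaced: int,
                    total_loc_after: int, prior_revisions: int) -> FileChangeMetrics:
    # Assumption: each replacement contributes a removed and an added portion to 'modified'.
    modified = added + deleted + 2 * replaced
    denom = max(total_loc_after, 1)  # guard against empty files
    return FileChangeMetrics(
        modified=modified,
        relativeAdded=added / denom,
        relativeReplaced=replaced / denom,
        revision=prior_revisions + 1,
        totalLoc=total_loc_after,
    )
```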
Entropy-Based Metrics (hcpf1, hcpf2, and hcm2). To capture how scattered file modifications are over time, we adopt an entropy-based view of code changes inspired by Hassan [21]. Intuitively, entropy should be higher when changes are spread across many files and lower when edits concentrate in fewer files. This provides finer granularity than simple churn alone, which merely aggregates how many lines of code have changed.
Sliding-Window Computation. Concretely, for each commit $c$ with timestamp $t_c$, we examine all prior commits in the repository whose timestamps fall within a specified recently modified window (in our case, 180 days) before $t_c$. Let $S$ be the set of such commits. We then gather every file-level code edit from each commit in $S$, summing the absolute lines changed to obtain a global total $\Delta_{\mathrm{total}}$. For each file $f$ appearing in the set $S$, we denote by $\Delta_f$ the total lines changed in $f$ over those same 180 days. We define $p_f = \Delta_f / \Delta_{\mathrm{total}}$ to be the fraction of lines changed in file $f$ relative to the entire window. For newly introduced files within the sliding window, $\Delta_f$ includes all lines of the initial file creation (counted as “added” lines) plus any subsequent modifications to that file within the same 180-day window. This approach ensures that new files contribute to entropy calculations proportionally to their size and any early evolution they undergo. The entropy of these changes is then:

$$H = -\sum_{f \in F} p_f \log p_f,$$

where $F$ is the set of all files touched in $S$, and we use a logarithm base (e.g., 2 or $e$) that is consistent across the entire study.
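A minimal sketch of this sliding-window entropy, assuming a hypothetical list of prior edits given as (timestamp, file path, absolute lines changed) tuples, is shown below; it follows the formula above with a base-2 logarithm.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=180)  # the "recently modified" window used in this study

def window_entropy(edits, commit_time: datetime, base: float = 2.0) -> float:
    """Entropy H of line changes per file within the 180-day window before commit_time."""
    per_file = defaultdict(int)
    for ts, path, lines in edits:
        if commit_time - WINDOW <= ts < commit_time:
            per_file[path] += lines
    total = sum(per_file.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total, base) for n in per_file.values())
```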
hcpf1 and hcpf2. We assign this computed entropy $H$ to every file $f$ changed in the current commit $c$ as the hcpf1 metric: it captures how “scattered” the broader development context is around the time of $c$. In addition, we define:

$$\mathrm{hcpf2}(f, c) = p_f \cdot H,$$

which scales $H$ by the proportion $p_f$ of lines changed in $f$ over the previous 180 days. As a result, hcpf2 reflects both the scattering of changes and the file’s relative contribution in that window.
HCM2. Finally, we accumulate these entropy values over multiple preceding intervals to reflect the historical “entropy baggage” of a file. Let $c_1, \ldots, c_n$ be all prior commits affecting file $f$, and let $\mathrm{hcpf2}(f, c_1), \ldots, \mathrm{hcpf2}(f, c_n)$ be the scaled entropy values computed at each of those commits in their respective sliding windows. The hcm2 metric is then:

$$\mathrm{hcm2}(f) = \sum_{j=1}^{n} \mathrm{hcpf2}(f, c_j).$$

This summation reflects how persistently a file has undergone development activity in high-entropy contexts. A large hcm2 value implies that the file has repeatedly experienced widely distributed changes over its history, potentially correlating with elevated fault risk [21].
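Continuing the sketch above (and reusing its window_entropy helper and imports), the snippet below derives per-file hcpf1 and hcpf2 for a commit and accumulates hcm2 in a running history. It encodes our reading of hcm2 as the sum of a file’s prior hcpf2 values; the exact bookkeeping in our pipeline may differ in details.

```python
def hcpf_metrics(edits, commit_time, changed_files, history):
    """history: dict mapping file path -> hcm2 accumulated over earlier commits."""
    h = window_entropy(edits, commit_time)
    per_file = defaultdict(int)
    for ts, path, lines in edits:
        if commit_time - WINDOW <= ts < commit_time:
            per_file[path] += lines
    total = sum(per_file.values()) or 1
    metrics = {}
    for f in changed_files:
        p_f = per_file.get(f, 0) / total
        hcpf1, hcpf2 = h, p_f * h
        history[f] = history.get(f, 0.0) + hcpf2  # hcm2 accumulates over the file's history
        metrics[f] = {"hcpf1": hcpf1, "hcpf2": hcpf2, "hcm2": history[f]}
    return metrics
```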
Overall, we obtain three main entropy-based metrics per file and commit: hcpf1, hcpf2, and the cumulative hcm2. They collectively offer a view of short-term scattering (via hcpf1 and hcpf2) and long-term historical complexity (via hcm2), complementing churn- or size-based features for more robust bug prediction.
Age (age). We quantify a file’s longevity at each commit by measuring its age in seconds, i.e., the elapsed time between the file’s first appearance in the repository and the current commit timestamp. Using seconds allows for a precise chronological scale, ensuring that even short-lived files or rapid iteration phases are captured. When files are deleted, their age is no longer tracked, as they cease to appear in subsequent commits. In practice, older files may accumulate technical debt or undergo incremental degradations, making age a potentially strong predictor of fault-proneness.
We provide standard statistics for all discussed metrics in Table 2. Among the features, hcpf1 shows a relatively high mean (0.9165) and low standard deviation (0.1234), while hcpf2 remains close to zero. The hcm2 metric follows a similar distribution but exhibits a maximum value of 1.5072. The modified metric is highly skewed, with an extreme maximum of 629,569 modified lines, suggesting the presence of large commits in the dataset. The age metric spans a significant range from 0 to over 729 million seconds. The revision metric shows a maximum of 6569 revisions per file. The totalLoc metric ranges from 0 to 800,376 lines, emphasizing the presence of large-scale software projects.
We additionally provide these statistics for each programming language (e.g., Java) in Appendix A for interested readers. Across the four language groups, JavaScript projects exhibit the highest mean revision count, pointing to frequent iterative changes, while Python shows the greatest present- and cumulative-entropy values (hcpf1, hcm2), indicating more widely scattered modifications. Java repositories contain the oldest and, on average, largest files, whereas TypeScript tops the average lines-of-code metric; despite these differences, churn-related measures remain of similar magnitude across languages, supporting the presented language-agnostic approach.
Overall, the dataset presents a mix of numerical and categorical features, with varying levels of completeness and skewness. The high imbalance in the isBugfix variable suggests that specialized resampling techniques, such as undersampling or synthetic data generation, may be necessary to enhance model performance in detecting bugfix commits effectively.
3.3. Machine Learning
For our machine learning approach, we employed CatBoost, a modern boosting classifier [7,8]. Decision trees, the fundamental building blocks of boosting methods, are non-parametric supervised learning models used for both classification and regression tasks [23].
Boosting classifiers extend the basic concept of decision trees by sequentially training a series of weak learners, typically simple decision trees, and combining their predictions to form a strong, robust model. Unlike bagging techniques such as Random Forest [24], which build multiple independent models in parallel, boosting methods iteratively focus on correcting the errors of their predecessors. Prominent examples of boosting methods include XGBoost [25] and LightGBM [26].
CatBoost is a gradient boosting algorithm that uses a permutation-driven approach to handle categorical features natively, reducing the need for extensive preprocessing. It is reported to achieve high accuracy and strong performance across diverse problem domains, even when used out of the box with minimal hyperparameter optimization [27,28]. Moreover, CatBoost has proven particularly effective when leveraging complexity metrics and other derived features, e.g., in software testing or agriculture [29,30].
In addition, CatBoost is appreciated for its ease of implementation and built-in GPU support, features that are essential for efficiently processing large datasets.
Furthermore, selecting appropriate hyperparameters is crucial for obtaining optimal performance in any machine learning approach. Hyperparameter tuning involves adjusting parameters such as the learning rate, tree depth, and regularization coefficients to balance bias and variance. For our study, we employed Bayesian optimization, a probabilistic model-based approach that explores the hyperparameter space by balancing exploration and exploitation [31]. In particular, we utilized the scikit-optimize Python package [32] to automate this process.
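A sketch of this tuning step with scikit-optimize’s BayesSearchCV wrapped around a CatBoost classifier is given below. The search ranges and the F1 scoring choice are placeholders for illustration, not the values used in the study; X_train and y_train stand for the undersampled training data described in Section 3.5.

```python
from catboost import CatBoostClassifier
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Placeholder hyperparameter ranges for illustration only.
search_spaces = {
    "depth": Integer(4, 10),
    "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
    "l2_leaf_reg": Real(1.0, 10.0),
    "iterations": Integer(200, 1000),
}

opt = BayesSearchCV(
    estimator=CatBoostClassifier(verbose=0),
    search_spaces=search_spaces,
    n_iter=30,        # number of hyperparameter settings evaluated
    cv=3,             # three-fold cross-validation, as described in Section 3.5
    scoring="f1",
    random_state=42,
)
# opt.fit(X_train, y_train)
# best_model = opt.best_estimator_
```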
3.4. Feature Importance Analysis
Understanding the contribution of each feature to the model’s predictions is essential for model interpretability and for identifying potential areas of improvement. To achieve this, we utilized CatBoost’s feature importance method, which calculates the importance of each feature based on its contribution to the model’s predictions [33].
In our analysis, we employed the PredictionValuesChange method, which measures the average change in the model’s prediction when a feature’s value is altered. This method estimates how much the model output shifts when a given feature is perturbed, making it the default for non-ranking metrics. Using this approach, we identified which features significantly influence prediction outcomes, thereby providing insights into the underlying patterns within the data.
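In code, retrieving this ranking amounts to a single call on the fitted classifier, as sketched below; best_model refers to the tuned classifier from the earlier sketch, and the example assumes the features are held in a pandas DataFrame X_train so that column names can label the scores.

```python
import pandas as pd
from catboost import EFstrType

# PredictionValuesChange importances from the fitted CatBoost model.
importances = best_model.get_feature_importance(type=EFstrType.PredictionValuesChange)
ranking = pd.Series(importances, index=X_train.columns).sort_values(ascending=False)
print(ranking)  # highest-ranked features drive the predictions most strongly
```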
3.5. Experimental Setup
The actual dataset for our machine learning experiments consists of all commits from all projects under study. Consequently, for each file and each commit, we obtained a collection of code (complexity) metrics and a target variable indicating whether a bugfix occurred or not.
We conducted our experiments using a two-step data splitting procedure. First, the data was divided into a training set and a final test set with an 80/20 split to ensure that the final evaluation remained unbiased. Further, this initial split was performed randomly using shuffling, stratified with respect to the binary target. The training set was then undersampled: we selected 80% of the minority class (Bugfix) and an equal amount from the majority class (No Bugfix). This undersampled set was repeatedly split into training and validation subsets for model development and parameter tuning. This iterative approach reduced variance in performance estimates and mitigated overfitting.
Additionally, for each undersampled train/test split within the training data for our CatBoost model, we applied Bayesian optimization combined with three-fold cross-validation to systematically search for optimal hyperparameters.
The final model performance was assessed on the initially held-out test set using accuracy, precision, recall, F1-score, and confusion matrices [34]. All metrics were accumulated and averaged over repeated undersampling (100 iterations) to further address class imbalance, ensuring robust and balanced model behavior and evaluation.
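The following sketch summarizes the overall protocol under the assumptions above: a stratified 80/20 split, 80% undersampling of the minority class, and repeated balanced evaluation with averaged metrics. It reuses best_model from the tuning sketch; X and y denote the full feature matrix (as a pandas DataFrame) and the isBugfix labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

# Stratified 80/20 split with shuffling, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42
)

rng = np.random.default_rng(42)

def undersample(X_part, y_part, minority_frac=0.8):
    """Draw minority_frac of the bugfix rows and an equal number of non-bugfix rows."""
    minority_idx = np.flatnonzero(y_part == 1)
    majority_idx = np.flatnonzero(y_part == 0)
    n = int(minority_frac * len(minority_idx))
    keep = np.concatenate([
        rng.choice(minority_idx, n, replace=False),
        rng.choice(majority_idx, n, replace=False),
    ])
    return X_part.iloc[keep], y_part.iloc[keep]

# Repeated balanced evaluation on the held-out test set (100 iterations in the paper).
scores = []
for _ in range(100):
    X_bal, y_bal = undersample(X_test, y_test)
    y_pred = best_model.predict(X_bal)
    scores.append(precision_recall_fscore_support(y_bal, y_pred, average="binary"))
```

Averaging over scores then yields the aggregated precision, recall, and F1 values reported in Section 4.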
Moreover, we conducted a feature importance analysis to understand the relative contribution of different metrics in the classification process. This analysis provides insights into how various process-centric, size-based, and entropy-based features influence the model’s predictions and supports interpretability in bug prediction tasks.
Additionally, for the top four features identified by our feature importance analysis, we analyzed the distributions of the predicted classes, specifically how values are distributed between Bugfix and No Bugfix for each feature. We again employed undersampling to generate representative distributions by bootstrapping, selecting 50% of the minority class and an equivalent amount from the majority class (No Bugfix). This procedure was repeated 100 times for both Bugfix and No Bugfix classes to generate boxplots illustrating the value ranges for the top four features.
4. Results and Discussion
The classification results show that our CatBoost-based model achieved an overall accuracy of 0.7674 on the final test dataset, which was entirely separated from the training and validation processes to ensure an unbiased evaluation.
To evaluate these results, we conducted 100 iterations, each time randomly sampling 80% of the minority class and an equal number from the majority class, creating balanced subsets of the test data. We assessed the model’s performance on the final validation dataset by aggregating classification metrics and confusion matrices across all iterations. The final report and matrices reflect averaged results from these accumulations, i.e., balanced, randomly selected subsets from the initially split test data. Additionally, we normalized the confusion matrices to depict relative proportions, as illustrated in Figure 2.
Figure 2 provides further insight into classification performance: the model classifies most non-bugfix commits correctly, but it also misclassifies a notable proportion of bugfix commits. From the relative confusion matrix, approximately 77.7% of non-bugfix commits are correctly classified, and 75.7% of bugfix commits are identified accurately (Table 3).
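For reference, the row-normalized matrix in Figure 2 corresponds to the following call on one balanced subset from the evaluation loop sketched in Section 3.5 (variable names as in that sketch):

```python
from sklearn.metrics import confusion_matrix

# Row-normalized confusion matrix on one balanced test subset.
cm_rel = confusion_matrix(y_bal, best_model.predict(X_bal), normalize="true")
# cm_rel[0, 0]: fraction of non-bugfix commits classified correctly (~0.777 reported)
# cm_rel[1, 1]: fraction of bugfix commits classified correctly (~0.757 reported)
```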
4.1. Feature Importance and Metric Synergy
Feature-importance analysis here is not merely a check on whether our model is leveraging each variable appropriately. Instead, we employ it to reveal how the different metrics drive predictions, offering deeper interpretability and insights into bug-prone code attributes. In essence, identifying the relative contribution of our features helps us understand why certain commits are deemed risky, guiding developers in prioritizing reviews or refactoring efforts.
Figure 3 highlights the relative importance of each metric in the CatBoost model’s predictions. Here, hcpf1 emerges as the most influential feature. This aligns with the hypothesis that widespread, system-wide modifications are more prone to defects, especially when multiple files are changed within the same time window [21]. This is captured by hcpf1 because it measures how strongly changes were distributed across files in the preceding 180-day window; large-scale or widely scattered modifications lead to higher entropy values, indicating a higher defect risk. Close behind is age, indicating that older files, or those maintained over extended periods, are more likely to accumulate technical debt or incremental design flaws that can trigger bugs. The revision metric, which counts how often a file has been updated, ranks next. This finding is consistent with existing literature on process-centric metrics [18,19], where frequent revisions often correlate with elevated fault-proneness.
Beyond these top three, other metrics such as totalLoc, hcm2, and modified also contribute appreciably. Notably, the relative measures (relativeReplaced and relativeAdded) carry non-negligible weight, demonstrating that a spread of metrics capturing size, churn, and entropy together yields a stronger predictive signal than any single type of feature. This synergy suggests that bug proneness reflects a multifaceted process, where historical complexity (entropy), accumulated edits (revision, age), and localized changes (churn, lines added or replaced) all intertwine. Our results extend this conclusion beyond the common benchmark datasets to a large, real-world, and diverse set of repositories.
4.2. Expected vs. Observed Outcomes
From a theoretical standpoint, one might anticipate that age and revision would rank among the leading predictors, given that older or frequently edited files often accumulate design decay and thus become more fault-prone [35]. The prominence of hcpf1 is likewise unsurprising, given previous research linking scattering or spread of changes to higher fault risk [21].
However, our boxplot analysis (Figure 4) reveals a somewhat counterintuitive trend: commits labeled as bugfix tend to exhibit lower average values for hcpf1 than non-bugfix commits. This finding stands in contrast to the typical assumption that “more scattering leads to more defects” and indicates the importance of recognizing that these are bug-fixing commits, not necessarily fault-introducing commits. Indeed, many large-scale bug prediction studies—in the interest of practicality—employ fix-based commit labeling, since it is far simpler to detect terms like “fix” or “bug” in commit messages than to pinpoint exactly where a defect was introduced. Consequently, if a file incurred a bug during a period of high entropy, developers might only push the fix after that broader scattering subsides, producing lower measured hcpf1 values at fix time.
By illustrating that hcpf1 can be lower for fix-labeled commits, our analysis highlights the nuanced relationship between scattering and defects, and cautions that the “chaotic” conditions under which a bug is introduced may differ markedly from those under which it is fixed. Meanwhile, hcpf2, which scales hcpf1 by the fraction of lines changed, retains notable importance, suggesting that not just where changes occur but also how intensively a file is edited refines risk. In contrast, relativeAdded and relativeReplaced—though influential—do not surpass top features such as hcpf1, age, and revision, reflecting that in large-scale, heterogeneous projects, global entropy and historical file metrics can overshadow local churn ratios. Overall, the overlapping distributions and unexpected behavior of hcpf1 emphasize the complexity of the problem, affirming that bug prediction cannot rely solely on linear or single-metric assumptions.
4.3. Implications for Practice
Multiple moderately influential metrics reinforce the notion that bug prediction benefits from a diverse metric suite. Our analysis shows that churn, entropy, and change-based measures each contribute distinct predictive signals, indicating that tools focusing solely on static complexity could overlook crucial indicators of fault-proneness. Moreover, the synergy among these metrics offers actionable insights for project managers: for instance, an older file that undergoes a large, scattered edit might warrant deeper code reviews or additional testing. Overall, our feature ranking supports the argument that combining process-centric, entropy-based, and size-related metrics yields a robust and explainable bug prediction model. Nonetheless, it is crucial to interpret entropy values in the light of fix-based labeling conventions, bearing in mind that bug-introducing commits may appear in a higher-entropy state than the eventual fixes.
Overall, these results show that CatBoost is a viable method for bug prediction across highly diverse repositories. The top features (hcpf1, age, revision, etc.) underscore that project history and scattered edits can strongly correlate with bug risk. However, improvements in precision may be achieved through class weighting, refined feature selection, or alternative sampling strategies.
5. Threats to Validity
While our study provides valuable insights into bug prediction across multiple repositories, certain limitations must be acknowledged. These potential threats to validity arise from factors such as dataset composition, labeling methodology, and the scope of selected features, which may influence the generalizability and interpretability of our findings. Below, we outline key aspects related to internal and construct validity.
- Internal Validity. Our labeling approach for bug-fix commits relies on keyword matching in commit messages (e.g., “fix”, “bug”, “fail”), following prior work [22]. Although manual spot checks suggested reasonable accuracy, some bug fixes may not be labeled if developers do not use such keywords. Conversely, commits mentioning “fix” in unrelated contexts could be falsely flagged. Furthermore, developers often batch multiple changes into a single commit, complicating the link between a labeled bug-fix and the actual fault-introducing code.
- Construct Validity. Our chosen code metrics may not capture all facets of software complexity—such as concurrency issues or architectural smells. In particular, large commits with multiple refactorings can inflate churn measures without necessarily increasing true fault-proneness. Similarly, some bug-prone patterns might stem from design decisions not reflected in these metrics.
- Testing Practices and Bug Patterns. Another limitation, common in fault prediction research, is the absence of testing-related metrics in our feature set. Testing practices, including code coverage, test density, and continuous integration configurations, likely influence both the detection and introduction of bugs across projects, yet these factors are rarely incorporated into prediction models. Repositories with comprehensive test suites and robust CI pipelines may show different defect patterns than those with minimal testing infrastructure. These testing characteristics could potentially explain some of the variation in bug patterns we observed across projects and might help disambiguate the causal factors behind defect-prone files. Future work should incorporate these testing metrics to provide a more complete picture of the relationship between development practices and defect prediction.
6. Conclusions and Future Work
This paper presents a CatBoost-based approach for bug prediction across a diverse set of 33 open-source projects, spanning domains such as security, healthcare, and data processing. By examining commits at a granular, file-level scope, the study highlighted the importance of combining process-centric metrics, including churn, file age, revision frequency, and entropy-based features, in accurately distinguishing bugfix commits. Notably, hcpf1, file age, revision, and totalLoc emerged as the most important factors influencing our model’s decisions.
Based on the distributions of the predicted classes, we found that increased values of age, revision, and totalLoc are indicative of bugs, whereas, counterintuitively, decreased values of hcpf1 (the present entropy) were observed for bugfix commits. However, the remaining metrics also contributed substantially, indicating that a broad combination of file history, churn, and size factors yields a strong predictive model.
Thus, overall, we conclude that predicting bugs in code is not a straightforward linear task but rather a complex challenge with many nuances, requiring different perspectives—in terms of metrics—and sophisticated machine learning approaches such as CatBoost and other modern boosted classifiers.
The empirical results demonstrate that a properly tuned CatBoost classifier is capable of handling large, imbalanced real-world datasets while providing reasonable predictive performance in terms of an overall accuracy of 76.7%. Moreover, the successful application of this model to projects of various sizes, ages, and programming languages underscores its generalizability.
In summary, we contribute the following:
- A large-scale, multi-repository dataset with varied metrics;
- A code-metric-based feature extraction taking into account process, size, and entropy factors;
- A Bayesian-optimized CatBoost model combined with undersampling for maximal generalizability on imbalanced data;
- A feature importance analysis of our trained model highlighting the importance of age, entropy, and churn.
In practice, such a bug prediction tool can assist developers and maintainers in allocating resources more efficiently, ultimately enhancing the dependability of software systems.
Future work could expand this approach beyond open-source repositories by applying it to private, industrial codebases and by incorporating additional developer-related or architectural metrics that capture complexity at broader levels. Furthermore, the methodology could be extended to focus on specific types of vulnerabilities, complementing risk-based assessment strategies. By refining both the metric set and modeling techniques, future research can push the boundaries of robust, transparent, and actionable bug prediction in real-world software development contexts.
Another trajectory for future work would be to integrate our historically grounded fault prediction approach, based on long-term commit- and file-level metrics across large-scale repositories, with retrieval-augmented generation (RAG)-based copilot systems such as [36]. Combining predictive insights from past code evolution with real-time context embeddings and dynamic graph updates could enable synchronized, context-aware bug detection and proactive code suggestions during development.
Further, our evaluation relies on established classification metrics such as precision, recall, and F1-score to quantify the predictive accuracy of our models in identifying buggy commits. In contrast, the Copilot for Testing system [36] is evaluated through a combination of objective and subjective metrics tailored to interactive, AI-assisted workflows. Specifically, their metrics emphasize accuracy (via bug detection rates, similar to our approach), efficiency (through fix/test case acceptance rates and developer feedback), and coverage (focused on high-impact code paths). While their evaluation aligns with the usability demands of AI pair programming, our selection of metrics captures performance in large-scale historical prediction scenarios. Integrating both perspectives, predictive classification and developer-in-the-loop effectiveness, represents a promising avenue for future hybrid evaluation protocols and would further improve current limitations of our approach, such as internal validity, i.e., the reliance on bug-fix commit filters.
Author Contributions
Conceptualization, P.K. and S.R.; methodology, P.K., S.R., D.T. and F.O.; software, P.K., S.R., D.T. and F.O.; validation, P.K., S.R., D.T., F.O. and C.K.; formal analysis, P.K., S.R., D.T. and F.O.; investigation, P.K. and S.R.; resources, A.S. and K.M.; data curation, P.K., S.R., D.T. and F.O.; writing—original draft preparation, P.K. and S.R.; writing—review and editing, P.K., S.R., C.K., A.S. and K.M.; visualization, S.R.; supervision, A.S. and K.M.; project administration, P.K., A.S. and K.M.; funding acquisition, P.K., A.S. and K.M. All authors have read and agreed to the published version of the manuscript.
Funding
The financial support by the Austrian Federal Ministry of Labour and Economy, the National Foundation for Research, Technology, and Development, and the Christian Doppler Research Association is gratefully acknowledged. SBA Research (SBA-K1 NGC) is a COMET Center within the COMET—Competence Centers for Excellent Technologies Programme and funded by BMIMI, BMWET, and the federal state of Vienna. The COMET Programme is managed by FFG. The authors acknowledge the funding by TU Wien Bibliothek for financial support through its Open Access Funding Program.
Data Availability Statement
The data used in this study is currently part of a larger project and cannot be shared at this stage, but it will be made available in future publications by the authors related to this topic.
Conflicts of Interest
Authors Philip König, Sebastian Raubitzek, Dennis Toth, Fabian Obermann and Kevin Mallinger were employed by the company SBA Research gGmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Appendix A. Language-Specific Code-Complexity Metrics
This appendix provides the summary statistics for all code-complexity features examined in the study. For each primary programming language—JavaScript, Java, Python, and TypeScript—we list the mean, standard deviation, minimum, and maximum values of every metric used in the analysis. These tables complement the results in
Section 3 by documenting the full empirical distribution of the underlying data.
Table A1. Summary statistics of key features for JavaScript repositories.
Feature | Mean | Std. Dev. | Min | Max |
---|---|---|---|---|
hcpf1 | 0.7716 | 0.1183 | 0.3291 | 1.9183 |
hcpf2 | 0.0016 | 0.0107 | 0.0000 | 1.2789 |
hcm2 | 0.0087 | 0.0799 | 0.0000 | 1.5072 |
modified | 62.5496 | 1118.7253 | 0 | 235,984 |
age | 39,075,072.1569 | 43,938,929.9307 | 0 | 266,159,297 |
revision | 240.1124 | 536.6532 | 0 | 3574 |
totalLoc | 1252.6035 | 6225.6154 | –20,994 | 165,420 |
relativeReplaced | 0.0395 | 0.1142 | –11.0000 | 8.5000 |
relativeAdded | 0.0881 | 0.2398 | –11.0000 | 35.0000 |
Table A2. Summary statistics of key features for Java repositories.
Feature | Mean | Std. Dev. | Min | Max |
---|---|---|---|---|
hcpf1 | 0.9188 | 0.1007 | 0.3115 | 1.5151 |
hcpf2 | 0.0014 | 0.0085 | 0.0000 | 0.6563 |
hcm2 | 0.0072 | 0.0374 | 0.0000 | 1.1659 |
modified | 59.9954 | 930.6493 | 0 | 410,270 |
age | 73,370,696.9635 | 108,411,689.9427 | –64,472,991 | 729,206,297 |
revision | 62.9542 | 364.4293 | 0 | 6569 |
totalLoc | 699.1993 | 3425.9644 | –266,792 | 800,376 |
relativeReplaced | 0.0307 | 0.4685 | –118.0000 | 178.0000 |
relativeAdded | 0.0810 | 6.7391 | –8936.0000 | 702.0000 |
Table A3. Summary statistics of key features for Python repositories.
Feature | Mean | Std. Dev. | Min | Max |
---|---|---|---|---|
hcpf1 | 1.0455 | 0.1861 | 0.0615 | 2.6513 |
hcpf2 | 0.0048 | 0.0215 | 0.0000 | 1.8096 |
hcm2 | 0.0251 | 0.0949 | 0.0000 | 1.3396 |
modified | 61.8068 | 3485.5183 | 0 | 629,569 |
age | 44,234,789.8377 | 56,339,843.2555 | 0 | 453,642,525 |
revision | 26.5250 | 46.4008 | 0 | 531 |
totalLoc | 548.1044 | 4379.3585 | –18,222 | 629,569 |
relativeReplaced | 0.0240 | 0.1574 | –16.0000 | 35.0000 |
relativeAdded | 0.0412 | 0.1972 | –56.6667 | 12.0000 |
Table A4. Summary statistics of key features for TypeScript repositories.
Feature | Mean | Std. Dev. | Min | Max |
---|---|---|---|---|
hcpf1 | 0.7697 | 0.1073 | 0.3902 | 1.0072 |
hcpf2 | 0.0078 | 0.0465 | 0.0000 | 0.4547 |
hcm2 | 0.0129 | 0.0827 | 0.0000 | 0.9675 |
modified | 58.3682 | 977.2964 | 1 | 75,288 |
age | 27,797,963.1990 | 24,321,847.4456 | 0 | 156,951,163 |
revision | 46.7624 | 60.1226 | 0 | 357 |
totalLoc | 1340.1495 | 7845.6898 | –2431 | 75,288 |
relativeReplaced | 0.0419 | 0.6879 | –89.0000 | 8.3333 |
relativeAdded | 0.0626 | 0.2903 | –24.0000 | 6.2727 |
References
- Catal, C.; Diri, B. Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem. Inf. Sci. 2009, 179, 1040–1058. [Google Scholar] [CrossRef]
- Li, Z.; Jing, X.Y.; Zhu, X. Progress on approaches to software defect prediction. IET Softw. 2018, 12, 161–175. [Google Scholar] [CrossRef]
- Nagappan, N.; Ball, T. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, ICSE ’05, New York, NY, USA, 15–21 May 2005; pp. 284–292. [Google Scholar] [CrossRef]
- Rathore, S.S.; Kumar, S. A study on software fault prediction techniques. Artif. Intell. Rev. 2019, 51, 255–327. [Google Scholar] [CrossRef]
- Shepperd, M.; Song, Q.; Sun, Z.; Mair, C. NASA MDP Software Defects Data Sets. IEEE Trans. Softw. Eng. 2018, 39, 1208–1215. [Google Scholar] [CrossRef]
- D’Ambros, M.; Lanza, M.; Robbes, R. An Extensive Comparison of Bug Prediction Approaches. In Proceedings of the MSR 2010 (7th IEEE Working Conference on Mining Software Repositories), Cape Town, South Africa, 2–3 May 2010; IEEE CS Press: Piscataway, NJ, USA, 2010; pp. 31–41. [Google Scholar]
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: unbiased boosting with categorical features. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; pp. 6638–6648. [Google Scholar]
- Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: gradient boosting with categorical features support. In Proceedings of the Workshop on ML Systems at NeurIPS, Long Beach, CA, USA, 8 December 2017. [Google Scholar]
- Bhandari, K.; Kumar, K.; Sangal, A.L. Data quality issues in software fault prediction: a systematic literature review. Artif. Intell. Rev. 2023, 56, 7839–7908. [Google Scholar] [CrossRef]
- Gunda, S.K. Software Defect Prediction Using Advanced Ensemble Techniques: A Focus on Boosting and Voting Method. In Proceedings of the 2024 International Conference on Electronic Systems and Intelligent Computing (ICESIC), Chennai, India, 22–23 November 2024; pp. 157–161. [Google Scholar] [CrossRef]
- Choudhary, G.R.; Kumar, S.; Kumar, K.; Mishra, A.; Catal, C. Empirical analysis of change metrics for software fault prediction. Comput. Electr. Eng. 2018, 67, 15–24. [Google Scholar] [CrossRef]
- Aljamaan, H.; Alazba, A. Software defect prediction using tree-based ensembles. In Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2020, New York, NY, USA, 8–9 November 2020; pp. 1–10. [Google Scholar] [CrossRef]
- Akbar, A.M.; Herteno, R.; Saputro, S.W.; Faisal, M.R.; Nugroho, R.A. Optimizing Software Defect Prediction Models: Integrating Hybrid Grey Wolf and Particle Swarm Optimization for Enhanced Feature Selection with Popular Gradient Boosting Algorithm. J. Electron. Electromed. Eng. Med. Inform. 2024, 6, 169–181. [Google Scholar] [CrossRef]
- Al-Smadi, Y.; Eshtay, M.; Al-Qerem, A.; Nashwan, S.; Ouda, O.; Abd El-Aziz, A. Reliable prediction of software defects using Shapley interpretable machine learning models. Egypt. Inform. J. 2023, 24, 100386. [Google Scholar] [CrossRef]
- Zahan, M. Prediction of Faults in Embedded Software Using Machine Learning Approaches. In Proceedings of the 2023 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), Dhaka, Bangladesh, 21–23 September 2023; pp. 352–356. [Google Scholar] [CrossRef]
- Uddin, M.N.; Li, B.; Mondol, M.N.; Rahman, M.M.; Mia, M.S.; Mondol, E.L. SDP-ML: An Automated Approach of Software Defect Prediction employing Machine Learning Techniques. In Proceedings of the 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), Virtual, 14–16 September 2021; pp. 1–4. [Google Scholar] [CrossRef]
- Gupta, A.; Sharma, S.; Goyal, S.; Rashid, M. Novel XGBoost Tuned Machine Learning Model for Software Bug Prediction. In Proceedings of the 2020 International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 17–19 June 2020; pp. 376–380. [Google Scholar] [CrossRef]
- Graves, T.; Karr, A.; Marron, J.; Siy, H. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng. 2000, 26, 653–661. [Google Scholar] [CrossRef]
- Moser, R.; Pedrycz, W.; Succi, G. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 30th International Conference on Software Engineering, ICSE ’08, New York, NY, USA, 10–18 May 2008; pp. 181–190. [Google Scholar] [CrossRef]
- Basili, V.R.; Perricone, B.T. Software errors and complexity: an empirical investigation. Commun. ACM 1984, 27, 42–52. [Google Scholar] [CrossRef]
- Hassan, A.E. Predicting faults using the complexity of code changes. In Proceedings of the 2009 IEEE 31st International Conference on Software Engineering, Vancouver, BC, Canada, 16–24 May 2009; pp. 78–88. [Google Scholar] [CrossRef]
- Mockus, A.; Votta, L.G. Identifying reasons for software changes using historic databases. In Proceedings of the 2000 International Conference on Software Maintenance, San Jose, CA, USA, 11–14 October 2000; pp. 120–130. [Google Scholar] [CrossRef]
- Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 3149–3157. [Google Scholar]
- Raubitzek, S.; Corpaci, L.; Hofer, R.; Mallinger, K. Scaling Exponents of Time Series Data: A Machine Learning Approach. Entropy 2023, 25, 1671. [Google Scholar] [CrossRef] [PubMed]
- Raubitzek, S.; Mallinger, K. On the Applicability of Quantum Machine Learning. Entropy 2023, 25, 992. [Google Scholar] [CrossRef] [PubMed]
- Corpaci, L.; Wagner, M.; Raubitzek, S.; Kampel, L.; Mallinger, K.; Simos, D.E. Estimating Combinatorial t-Way Coverage Based on Matrix Complexity Metrics. In Proceedings of the Testing Software and Systems, Naples, Italy, 31 March–4 April 2025; Menéndez, H.D., Bello-Orgaz, G., Barnard, P., Bautista, J.R., Farahi, A., Dash, S., Han, D., Fortz, S., Rodriguez-Fernandez, V., Eds.; Springer: Cham, Switzerland, 2025; pp. 3–20. [Google Scholar]
- Mallinger, K.; Raubitzek, S.; Neubauer, T.; Lade, S. Potentials and limitations of complexity research for environmental sciences and modern farming applications. Curr. Opin. Environ. Sustain. 2024, 67, 101429. [Google Scholar] [CrossRef]
- Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv 2012. [Google Scholar]
- Head, T.; Kumar, M.; Nahrstaedt, H.; Louppe, G.; Shcherbatyi, I. Scikit-Optimize/Scikit-Optimize (v0.9.0). 2021. Available online: https://scikit-optimize.github.io/stable/ (accessed on 1 July 2025).
- CatBoost Developers. Feature Importance Calculation in CatBoost. 2024. Available online: https://catboost.ai/docs/en/features/feature-importances-calculation (accessed on 10 March 2025).
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Parnas, D. Software aging. In Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy, 16–21 May 1994; pp. 279–287. [Google Scholar] [CrossRef]
- Wang, Y.; Guo, S.; Tan, C. From Code Generation to Software Testing: AI Copilot with Context-Based RAG. IEEE Softw. 2025, 42, 34–42. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).