Open Access, Editor’s Choice, Article

Peer-Review Record

Boost-Classifier-Driven Fault Prediction Across Heterogeneous Open-Source Repositories

Big Data Cogn. Comput. 2025, 9(7), 174; https://doi.org/10.3390/bdcc9070174

by Philip König ¹,†, Sebastian Raubitzek ¹,†, Alexander Schatten ²,*, Dennis Toth ¹, Fabian Obermann ¹, Caroline König ³ and Kevin Mallinger ¹

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 3 June 2025 / Revised: 26 June 2025 / Accepted: 30 June 2025 / Published: 2 July 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Please see attached file for full review.

Comments for author File: Comments.pdf

Author Response

file attached

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

I find this paper interesting: bug detection and prediction remain unsolved problems in software engineering, and the paper offers a thorough empirical study of fault prediction in software systems. In particular, the authors focus on the early identification of bug-prone commits in large-scale open-source projects. The scale is commendable: the study draws on a dataset of 2.4 million commits from 33 heterogeneous repositories spanning a broad range of domains such as healthcare, cybersecurity, and data engineering. Another key contribution is the interpretable and actionable feature analysis on which the paper is based. The findings show that files with long lifespans, frequent modifications, and scattered changes tend to be more defect-prone, perhaps because projects of this scale involve more human developers? Such insights are highly valuable to practitioners aiming to prioritize testing, refactoring, or code reviews in large and evolving software systems.

That said, the paper could benefit from deeper engagement with several important areas:

The paper computes a range of process metrics, such as churn, revision frequency, file age, size-based indicators, and entropy measures that capture the dispersion of changes over time. Is there any particular relationship between these features, and are some more important than others?
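To make the notion of an entropy measure over changes concrete, the following is a minimal sketch, assuming a plain Shannon entropy over the time periods in which a file was modified. The function name `change_entropy`, the monthly period labels, and the example data are hypothetical illustrations; the paper's actual entropy definition may differ.

```python
import math
from collections import Counter


def change_entropy(period_labels):
    """Shannon entropy (in bits) of a file's changes across time periods.

    `period_labels` holds one label per change, e.g. the month in which
    the change was committed (an assumption made for this sketch).
    """
    counts = Counter(period_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # sum of p * log2(1 / p) over the periods that contain changes
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical data: eight changes in one month vs. spread over four months.
concentrated = ["2024-01"] * 8
scattered = ["2024-01", "2024-02", "2024-03", "2024-04"] * 2

print(change_entropy(concentrated))
print(change_entropy(scattered))
```

Under this definition, a file whose changes all fall in a single period has zero entropy, while changes spread evenly over more periods yield higher values, which is one way to quantify "scattered changes".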

The paper can be strengthened by a discussion of how the presence or absence of systematic testing may influence the observed bug patterns. Bug frequency may sometimes be more reflective of testing practices than inherent code quality. Including metrics such as code coverage, test density, or even proxy indicators (e.g., presence of CI pipelines) would help to disambiguate these factors. At minimum, acknowledging the variability in testing across projects would provide helpful context for interpreting defect-proneness.

The paper could discuss how predictive performance is evaluated. While standard metrics like precision, recall, F1 score, and AUC are likely employed, it would be beneficial to state this explicitly and to consider how such metrics compare to those used in deep learning–based code models. See the recent work of Y. Wang et al., "From Code Generation to Software Testing: AI Copilot With Context-Based Retrieval-Augmented Generation," IEEE Software, vol. 42, no. 4, pp. 34-42, 2025, and Christof Ebert et al., "Testing Software Systems," IEEE Software, pp. 8-17, 2022. Some discussion of the bug prediction metrics used to assess the correctness of code generation for testing would be helpful, particularly as AI-assisted bug prediction increasingly intersects with automated code repair and synthesis.
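For reference, the standard metrics named above can be sketched as follows. This is a minimal illustration, assuming binary labels where 1 marks a bug-prone commit; the function name and the toy label vectors are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (bug-prone) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced sample: 2 bug-prone commits out of 10.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this toy sample, accuracy is considerably higher than precision, recall, and F1, which illustrates why class-sensitive metrics matter when bug-prone commits are a small minority.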

The current work leverages hand-crafted features, which are used to train a gradient boosting classifier under realistic class-imbalance conditions. Under what conditions can we be assured of generalizable predictive performance across diverse codebases? Some discussion of how the findings, particularly with regard to entropy and process metrics, can inform models that rely on code embeddings, transformers, or attention-based architectures would be welcome. This could help future work extend this study in line with recent trends in intelligent software testing.

Lastly, given the scale and relevance of the dataset, it would be interesting for the community to access it and to explore the processed features and model configurations further. A public dataset would be useful because it enhances reproducibility; there may of course be privacy issues, and these can be mentioned in the paper.
