Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper, “Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark,” addresses inconsistencies in benchmark scores reported for large language models (LLMs) on the HumanEvalFix bug-fixing benchmark. Given the growing reliance on these benchmarks for evaluating LLMs in software engineering tasks, the study is both timely and significant. The work is highly relevant to the fields of software engineering, machine learning for code (ML4Code), and LLM evaluation.
* Strengths
- The paper evaluates 11 models (including various sizes and instruction/base variants) across 60 configuration combinations, including prompt formatting, temperature, generation length, and quantization. This comprehensive coverage allows the authors to isolate and quantify the effects of each setting.
- The findings are of real practical use to researchers and practitioners. Key insights include the impact of using incorrect prompt templates, confusion between base and instruction-tuned models, and the effects of truncation due to generation length limits. These are common yet often underreported issues.
- Figure 1 shows model performance discrepancies across papers in a clean and intuitive format. Table 2 presents an exhaustive grid of experimental results, highlighting which previously reported scores could be matched under what conditions.
- By reviewing GitHub issues filed by researchers unable to reproduce benchmark results, the paper grounds its discussion in real-world reproducibility challenges.
* Areas for Improvement
- Since prompt formatting is cited as a major cause of performance variance, it would be beneficial to include concrete examples of both correct and incorrect prompts, possibly in an appendix.
- While changes in pass@1 scores are discussed with standard deviations, the paper does not apply formal statistical testing. Additionally, the ±1% threshold used to define a “close match” lacks a clear justification.
- The focus is entirely on the HumanEvalFix benchmark. A discussion of whether these findings generalize to other widely-used benchmarks (e.g., BugsInPy, MBPP) would broaden the paper’s applicability.
- The paper identifies reproducibility challenges but stops short of proposing any concrete checklist, standard evaluation protocol, or best practices for future studies.
Author Response
Thank you for the thorough review and the insightful comments.
Comments 1:
- Since prompt formatting is cited as a major cause of performance
variance, it would be beneficial to include concrete examples of
both correct and incorrect prompts, possibly in an appendix.
Response 1: Thank you for pointing this out; we agree. We included the
prompt templates in the Appendix of the revised version of our paper.
Comments 2:
- While changes in pass@1 scores are discussed with standard
deviations, the paper does not apply formal statistical testing.
Additionally, the ±1% threshold used to define a “close match”
lacks a clear justification.
Response 2: We chose ±1% as a close match because some papers round
percentages while others floor them, and it is not always clear from the
paper which convention was used. A one-percentage-point difference could
very well arise from these conventions. As Figure 1 shows, when the
difference is larger it is very clear that such random differences could
not cause it; for example, 47.5% vs. 81.0% for DeepSeekCoder-Inst (33B)
(the first result in the Figure).
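To make the rounding-versus-flooring point concrete, here is a small illustrative sketch, assuming the 164-problem HumanEvalFix set and one reported decimal digit; the raw score of 78 solved problems and the `floor_pct` helper are hypothetical, not taken from any reviewed paper:

```python
import math

def floor_pct(x: float, nd: int = 1) -> float:
    """Truncate a percentage to nd decimal digits instead of rounding."""
    f = 10 ** nd
    return math.floor(x * f) / f

# Hypothetical raw result: 78 of 164 problems solved.
score = 78 / 164 * 100        # 47.5609...%
print(round(score, 1))        # 47.6 if the paper rounds
print(floor_pct(score))       # 47.5 if the paper floors
```

Two papers describing the same underlying run could thus report slightly different numbers, and with coarser reported precision such differences can approach a full percentage point, which motivates the ±1% window.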
Comments 3:
- The focus is entirely on the HumanEvalFix benchmark. A discussion
of whether these findings generalize to other widely-used
benchmarks (e.g., BugsInPy, MBPP) would broaden the paper’s
applicability.
Response 3: We reviewed further literature to investigate whether benchmark
result inconsistencies also appear more broadly, with a focus on the MBPP
benchmark and DeepSeekCoder models. We included our findings in the discussion
of the revised paper.
Comments 4:
- The paper identifies reproducibility challenges but stops short of
proposing any concrete checklist, standard evaluation protocol, or
best practices for future studies.
Response 4: We included an evaluation protocol in the discussion of
the revised version of our paper.
Reviewer 2 Report
Comments and Suggestions for Authors
Positive Points:
+ Thorough investigation into the reproducibility of code correction using HumanEvalFix with LLMs.
+ Comprehensive experiments across 11 models with varied settings, demonstrating the scope of the issue.
+ Highlighting discrepancies between reported results and datasets, which is critical for reproducibility.
Negative Points:
- There should be more discussion on reproducibility.
The paper addresses critical issues in reproducibility by identifying discrepancies between reported results and datasets, which is a significant contribution.
The experiment was carefully set up and the results are interesting.
The experiment is interesting, but if the purpose of this paper is not to "conduct fair comparisons" but rather to "focus on reproducibility," the discussion section could be further elaborated.
For example, based on the current experimental results, there is a consideration regarding potential confusion in model names. What about the experimental parameters?
Possibilities include "inaccurate descriptions in the existing paper," "parameters are not unified within the existing paper," or "the existing paper lacks descriptions regarding parameters altogether."
Additionally, for cases where the scores do not match existing papers, possibilities such as “most of the scores in the existing paper could not be reproduced, suggesting the absence of parameters tested in this study” or “some of the scores in the existing paper were reproduced, implying the possibility of partial recording errors” may be considered.
By incorporating these discussions, examining "what points should be noted when describing experimental settings for future publications of similar papers" could contribute to the discussion on reproducibility.
Alternatively, if the focus is not on reproducibility but rather on variability in numerical results due to experimental settings, the title might be changed slightly.
Author Response
Thank you for the thorough review and the insightful comments.
Comments 1:
- There should be more discussion on reproducibility.
The paper addresses critical issues in reproducibility by identifying
discrepancies between reported results and datasets, which is a
significant contribution. The experiment was carefully set up and the
results are interesting. The experiment is interesting, but if the
purpose of this paper is not to “conduct fair comparisons” but rather
to “focus on reproducibility,” the discussion section could be further
elaborated. For example, based on the current experimental results,
there is a consideration regarding potential confusion in model names.
What about the experimental parameters? Possibilities include
“inaccurate descriptions in the existing paper,” “parameters are not
unified within the existing paper,” or “the existing paper lacks
descriptions regarding parameters altogether.” Additionally, for cases
where the score does not match existing papers, possibilities such as
“most of the scores in the existing paper could not be reproduced,
suggesting the absence of parameters tested in this study” or “some of
the scores in the existing paper were reproduced, implying the
possibility of partial recording errors” may be considered. By
incorporating these discussions, examining “what points should be
noted when describing experimental settings for future publications of
similar papers” could contribute to the discussion on reproducibility.
Response 1: Thank you for pointing this out. We reread the papers and discuss
their reporting on their experimental setups in the revised paper. We also
extended the revised paper with related work about why discrepancies might occur.
Reviewer 3 Report
Comments and Suggestions for Authors
"Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark"
The scope is limited to the HumanEvalFix benchmark, which, while widely used, is only one of several bugfixing benchmarks (e.g., QuixBugs, BugsInPy, MDEVAL). The study could have included a broader range of benchmarks to generalize findings further.
The review lacks a deeper analysis of why discrepancies occur beyond evaluation settings (e.g., differences in dataset versions, preprocessing, or unreported fine-tuning). Additionally, the study does not discuss whether the reviewed papers provided sufficient methodological details to enable reproduction.
The study does not justify the selection of the 11 models beyond their appearance in multiple papers. Including a broader range of model architectures (e.g., non-transformer-based models or newer models like Llama3) could enhance generalizability.
The baseline configuration (model-specific prompt, greedy decoding, 2048 tokens, bf16 precision, no quantization) is well-defined but assumes optimal settings. The study could have explored suboptimal baseline settings to reflect real-world misuse.
The study reports that 20 out of 57 unique scores are mathematically unobtainable (e.g., due to incorrect rounding), but it does not investigate potential causes (e.g., errors in reporting, different dataset sizes, or post-processing). This limits the depth of the analysis.
The study does not provide access to the code, data, or specific model versions used, which limits its own reproducibility. Additionally, the GitHub issues referenced are not linked or archived, making it difficult for readers to verify them.
The discussion lacks a broader reflection on the implications for the ML4Code community, such as how these findings might apply to other benchmarks or tasks (e.g., code generation, code summarization). It also does not address potential solutions beyond reporting settings, such as standardized evaluation protocols.
Some technical terms (e.g., pass@1, bf16 precision) are introduced without sufficient explanation for readers unfamiliar with ML4Code. The paper also contains minor typographical errors (e.g., “empricial” instead of “empirical” on page 9).
The study’s focus on evaluation settings is not entirely new, as prior work (e.g., Pimentel et al., 2024) has discussed variability in LLM evaluation frameworks. The novelty lies in the specific application to HumanEvalFix, but this is not explicitly positioned against related work.
The study does not discuss ethical implications, such as the potential for misleading benchmark results to influence commercial adoption of LLMs or the environmental impact of running exhaustive evaluations on A100 GPUs.
Some references (e.g., Hugging Face Blog for StarCoder) are non-peer-reviewed, which may weaken the academic rigor. The study could also cite more recent work on reproducibility challenges beyond 2024.
Justifying the focus on HumanEvalFix and the selection of models.
Providing access to code and data for reproducibility.
Addressing ethical implications and broader generalizability.
Correcting typographical errors and improving clarity for non-expert readers.
Comments for author File: Comments.pdf
Author Response
Thank you for the thorough review and the insightful comments.
Comments 1: The scope is limited to the HumanEvalFix benchmark, which,
while widely used, is only one of several bugfixing benchmarks (e.g.,
QuixBugs, BugsInPy, MDEVAL). The study could have included a broader
range of benchmarks to generalize findings further.
Response 1: Bugfixing benchmarks like BugsInPy and MDEVAL are not widely
adopted by the research community, so there is a lack of reported results that
would support a comparative analysis of the kind we conducted using
HumanEvalFix. As for the QuixBugs benchmark, although it is relatively
well-known, it is quite dated and contains only 40 short buggy programs.
Several models, such as o1, already achieve near-perfect accuracy on it, which
limits its usefulness for meaningful comparisons. Consequently, recent studies
often omit QuixBugs results. However, the issue of inconsistent benchmark
results across different studies is not unique to HumanEvalFix; it also
applies to other benchmarks such as MBPP, which we discuss in the revised
version of the paper.
Comments 2: The review lacks a deeper analysis of why discrepancies
occur beyond evaluation settings (e.g., differences in dataset
versions, preprocessing, or unreported fine-tuning). Additionally, the
study does not discuss whether the reviewed papers provided sufficient
methodological details to enable reproduction.
Response 2: Thank you for the suggestions. We reread the papers and discuss
their reporting on their experimental setups in the revised paper. We also
extended the revised paper with related work about why discrepancies might
occur. To our knowledge, the HumanEvalFix dataset doesn’t have multiple
versions, and it doesn’t need preprocessing.
Comments 3: The study does not justify the selection of the 11 models
beyond their appearance in multiple papers. Including a broader range
of model architectures (e.g., non-transformer-based models or newer
models like Llama3) could enhance generalizability.
Response 3: Thank you for pointing this out; we agree that specific inclusion
criteria for the evaluated models are necessary. We concretized this: we
include models that have different reported benchmark scores in at least two
papers and do not exceed 15B parameters. We extended the revised paper to
include this information. We also broadened the range of evaluated models with
WizardCoder (15B), which meets the inclusion criteria described above.
Furthermore, we tried to review all papers that evaluate on the HumanEvalFix
benchmark and have not found any non-Transformer models, possibly because the
benchmark was originally constructed to measure LLMs. Llama 3 was not included
because there were no significant discrepancies among reported results in the
literature for the 8B version, and the 70B version requires too much GPU
memory for us to conduct local evaluations with it.
Comments 4: The baseline configuration (model-specific prompt, greedy
decoding, 2048 tokens, bf16 precision, no quantization) is
well-defined but assumes optimal settings. The study could have
explored suboptimal baseline settings to reflect real-world misuse.
Response 4: We evaluated 60 settings for each model, out of which one
is the baseline, and the other 59 can be considered suboptimal.
Comments 5: The study reports that 20 out of 57 unique scores are
mathematically unobtainable (e.g., due to incorrect rounding), but it
does not investigate potential causes (e.g., errors in reporting,
different dataset sizes, or post-processing). This limits the depth of
the analysis.
Response 5: Thank you very much for drawing our attention to the need for
an explanation here. We thoroughly reread the papers that report HumanEvalFix
benchmark results, and found the most probable cause for these scores: some
papers calculate pass@1 not by generating 1 sample, but by generating n=20
samples to provide a more refined pass@1 performance estimate. This indicates
that these scores are indeed mathematically obtainable. Therefore, we have
removed our incorrect findings related to this and included an explanation in
the revised paper. Again, thank you very much for bringing this to our
attention.
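For context, the n-sample estimate described above corresponds to the standard unbiased pass@k estimator introduced with the original HumanEval benchmark (Chen et al., 2021); the following is a minimal sketch with illustrative values, not results from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem,
    c of them pass; returns the expected pass rate with budget k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=1, each problem contributes 0 or 1, so a 164-problem benchmark
# can only yield pass@1 averages that are multiples of 1/164 (~0.61
# percentage points). With n=20, each problem contributes a multiple of
# 1/20, so finer-grained averages such as 47.5% become obtainable.
print(pass_at_k(20, 15, 1))   # 0.75 (15 of 20 samples passed)
```

This explains why scores that are not multiples of 1/164 need not be reporting errors: the attainable grid of benchmark averages depends on the number of samples generated per problem.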
Comments 6: The study does not provide access to the code, data, or
specific model versions used, which limits its own reproducibility.
Additionally, the GitHub issues referenced are not linked or archived,
making it difficult for readers to verify them.
Response 6: The study evaluates the HumanEvalFix benchmark using the
Code Generation LM Evaluation Harness provided by the BigCode Project
(cited in the paper), so it doesn’t have its own code or dataset. The
model versions are already included in the paper, and we have added the
links to the GitHub issues in the Appendix of the revision.
Comments 7: The discussion lacks a broader reflection on the
implications for the ML4Code community, such as how these findings
might apply to other benchmarks or tasks (e.g., code generation, code
summarization). It also does not address potential solutions beyond
reporting settings, such as standardized evaluation protocols.
Response 7: Thank you for the suggestion, we have included a recommended
evaluation protocol in the revision. We also performed and included a
small-scale study on the MBPP code generation benchmark, specifically focusing
on reported scores of DeepSeekCoder models. We found that our observations
apply more broadly and are not limited to the HumanEvalFix benchmark. It would be
interesting to conduct comprehensive evaluations in code generation in future
work.
Comments 8: Some technical terms (e.g., pass@1, bf16 precision) are
introduced without sufficient explanation for readers unfamiliar with
ML4Code. The paper also contains minor typographical errors (e.g.,
“empricial” instead of “empirical” on page 9).
Response 8: Thank you for your suggestion and for noticing the typo.
We included these explanations and thoroughly reviewed the paper for
typos.
Comments 9: The study’s focus on evaluation settings is not entirely
new, as prior work (e.g., Pimentel et al., 2024) has discussed
variability in LLM evaluation frameworks. The novelty lies in the
specific application to HumanEvalFix, but this is not explicitly
positioned against related work.
Response 9: Thank you for your suggestion; we discuss such related
work in the revision. We also reviewed the work of Pimentel et al.,
but chose papers that are more closely related to our field instead.
Comments 10: The study does not discuss ethical implications, such as
the potential for misleading benchmark results to influence commercial
adoption of LLMs or the environmental impact of running exhaustive
evaluations on A100 GPUs.
Response 10: Thank you, we agree that this is a good idea; we have
included it in a new "Broader Impact" section in the revision.
Comments 11: Some references (e.g., Hugging Face Blog for StarCoder)
are non-peer-reviewed, which may weaken the academic rigor. The study
could also cite more recent work on reproducibility challenges beyond
2024.
Response 11: Thank you for your suggestion. We checked the references and have
made the necessary adjustments. We also extended our study with related works
on reproducibility challenges.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have responded clearly to my comments.
Reviewer 2 Report
Comments and Suggestions for Authors
I would like to thank the authors for their thorough responses to the review comments. All my concerns have been addressed.
Reviewer 3 Report
Comments and Suggestions for Authors
All the comments and queries have been resolved by the authors.