Review

LLMs for Commit Messages: A Survey and an Agent-Based Evaluation Protocol on CommitBench

by Mohamed Mehdi Trigui 1,* and Wasfi G. Al-Khatib 1,2,*
1 Information & Computer Science Department (ICS), King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
2 Interdisciplinary Research Center for Intelligent Secure Systems (IRC-ISS), King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
* Authors to whom correspondence should be addressed.
Computers 2025, 14(10), 427; https://doi.org/10.3390/computers14100427
Submission received: 23 August 2025 / Revised: 16 September 2025 / Accepted: 23 September 2025 / Published: 7 October 2025

Abstract

Commit messages are vital for traceability, maintenance, and onboarding in modern software projects, yet their quality is frequently inconsistent. Recent large language models (LLMs) can transform code diffs into natural language summaries, offering a path to more consistent and informative commit messages. This paper makes two contributions: (i) it provides a systematic survey of automated commit message generation with LLMs, critically comparing prompt-only, fine-tuned, and retrieval-augmented approaches; and (ii) it specifies a transparent, agent-based evaluation blueprint centered on CommitBench. Unlike prior reviews, we include a detailed dataset audit, preprocessing impacts, evaluation metrics, and error taxonomy. The protocol defines dataset usage and splits, prompting and context settings, scoring and selection rules, and reporting guidelines (results by project, language, and commit type), along with an error taxonomy to guide qualitative analysis. Importantly, this work emphasizes methodology and design rather than presenting new empirical benchmarking results. The blueprint is intended to support reproducibility and comparability in future studies.

1. Introduction

Version control systems such as Git are indispensable in modern software development, enabling collaborative work, precise change tracking, and effective management of complex projects [1,2]. At the heart of these systems, commit messages act as lightweight documentation that conveys the purpose, rationale, and scope of each change. When written well, they facilitate code review, accelerate bug diagnosis, support team knowledge sharing, and ease the onboarding of new contributors [1,3].
Despite this importance, empirical studies show that commit messages in practice are often vague, incoherent, or overly brief, commonly due to time pressure, unclear guidelines, or limited awareness of their long-term value [2,4,5]. These shortcomings hinder traceability and maintenance and can impair communication as projects evolve [6,7]. To address this, research on automated commit message generation (CMG) has progressed from early neural machine translation (NMT) approaches [8,9,10,11] to methods powered by large language models (LLMs) [5,12].
Recent LLMs (e.g., GPT-5 and DeepSeek-V3.1) demonstrate strong potential for CMG by leveraging transformer architectures to produce fluent, context-aware summaries that can approach or surpass human-written quality [4,5,7]. However, the literature still exhibits notable limitations: many studies evaluate a single model or dataset [13,14], rely on narrow metrics, or omit practical pathways to deployment [6,15,16]. Moreover, few works compare proprietary and open-source models side by side on large, diverse benchmarks such as CommitBench [17].
Scope. This paper surveys recent methods for LLM-based CMG and specifies a transparent, agent-based evaluation protocol centered on CommitBench. We detail dataset usage and splits, prompting and context settings (with optional retrieval augmentation), scoring and selection rules, and reporting guidelines. The emphasis is on methodology and design rather than empirical findings; we do not report new benchmark results in this study. Instead, we provide a standardized evaluation blueprint that future researchers can apply for reproducible multi-LLM comparisons.
Contributions. This paper makes the following contributions: (i) a systematic survey of LLM-based commit message generation, covering prompt-only, fine-tuned, and retrieval-augmented approaches; (ii) a reproducible evaluation blueprint centered on CommitBench, which specifies dataset preprocessing, model conditions, standardized prompting, evaluation metrics (BLEU, ROUGE, and METEOR) together with human-centered criteria, an error taxonomy, and reporting conventions. The blueprint is intended to enable transparent and comparable future studies.
Organization. The remainder of this paper is structured as follows. Section 2 reviews prior work on CMG and LLMs for software engineering. Section 3 outlines the problem and research gaps. Section 4 presents the evaluation blueprint (datasets, preprocessing, model conditions, evaluation, agent workflow, and error taxonomy). Section 5 discusses implications and extensions. Section 6 concludes the paper.

2. Related Work

2.1. Commit Message Quality and Its Importance

Commit messages act as concise yet critical documentation for code changes in version control systems [1,2,3]. High-quality commit messages improve traceability, ease code review, accelerate bug detection, and support the onboarding of new contributors [2,5]. Conversely, poorly written commit messages, which are often short, vague, or inconsistent, can hinder collaboration, slow down debugging, and obscure the rationale for code changes [6,7]. Empirical analyses [2,3] reveal that these issues frequently stem from time pressure, unclear guidelines, or undervaluation of documentation, making automated assistance highly desirable.

2.2. Automated Commit Message Generation Approaches

Early work on automated commit message generation (CMG) applied statistical and neural machine translation (NMT) techniques to map code diffs into natural language summaries [8,9,10,11]. These approaches leveraged token-level sequence-to-sequence models but were often limited in semantic understanding, producing generic or incomplete summaries [10,11]. Later research explored richer code representations, including abstract syntax trees and context-aware embeddings [14,18], as well as dataset-specific enhancements like modification embeddings [11] and contextual reasoning [13].
Benchmarking of CMG approaches has also gained attention. Schall et al. [17] introduced CommitBench, a multi-language benchmark for CMG, enabling standardized evaluation across datasets, languages, and domains. However, many prior studies focus on isolated datasets or evaluation metrics (e.g., BLEU, ROUGE) without integrating human assessments [8,9].

2.3. Technical Background

To situate our survey and evaluation blueprint, we briefly introduce three key ingredients: Transformer architectures, fine-tuning for CMG, and Retrieval-Augmented Generation (RAG).

2.3.1. Transformer Architectures

Modern LLMs rely on the Transformer architecture, which introduces self-attention mechanisms to model complex relationships among sequence elements [5,12]. Unlike traditional recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), Transformers enable parallelized computation and excel at capturing long-range dependencies, establishing them as the foundation for state-of-the-art performance in both natural language- and code-related tasks.
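To make the self-attention computation concrete, the following minimal sketch (ours, for illustration only) implements single-head scaled dot-product attention in NumPy; it is a didactic simplification of what production Transformer libraries provide.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values.
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                    # pairwise token affinities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the key positions
        return weights @ V                                 # context-mixed token representations

    # Toy self-attention over 4 tokens with 8-dimensional representations (Q = K = V).
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    contextualized = scaled_dot_product_attention(x, x, x)

Real models add learned projection matrices, multiple heads, and masking, but the core operation is the weighted mixing shown here.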

2.3.2. Fine-Tuning for Commit Message Generation

Fine-tuning adapts a pre-trained Transformer model to the specific task of commit message generation using paired code diff–message data [8,9]. This process leverages transfer learning to retain general language and code understanding while optimizing for contextual, task-specific output. The fine-tuning procedure is guided by maximizing the likelihood of generating reference-quality commit messages and adapting to project-specific conventions [1,13].
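As an illustration of this objective, the sketch below fine-tunes a causal LM on a single (diff, message) pair, masking the prompt tokens out of the loss so that only the reference message is optimized. It assumes the Hugging Face transformers and torch packages; the model identifier, prompt wording, and hyperparameters are placeholders rather than the configuration of any reported system.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "Qwen/Qwen2.5-Coder-3B"   # placeholder base model identifier
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    def training_step(diff: str, message: str) -> float:
        # The model is trained to maximize the likelihood of the reference
        # message conditioned on the diff-bearing prompt.
        prompt = f"Write a one-line commit message for this diff:\n{diff}\nMessage:"
        full = prompt + " " + message + tokenizer.eos_token
        enc = tokenizer(full, return_tensors="pt", truncation=True, max_length=1024)
        labels = enc["input_ids"].clone()
        prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
        labels[:, :prompt_len] = -100          # ignore prompt tokens in the loss
        loss = model(input_ids=enc["input_ids"],
                     attention_mask=enc["attention_mask"],
                     labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

In practice this step is wrapped in batched epochs with validation-based early stopping, as described in Section 4.4.3.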

2.3.3. Retrieval-Augmented Generation (RAG)

While fine-tuning large language models can yield strong results, it often comes at a high computational cost. Retrieval-Augmented Generation (RAG) offers a more efficient and scalable alternative [17,19]. RAG retrieves relevant information such as similar past commits, issue discussions, or documentation from a separate index and feeds that context into the model at generation time. This enables accurate, context-aware commit messages without retraining for each project. The process involves (i) building a searchable index from historical commit messages, code diffs, and metadata; (ii) retrieving the top-k most relevant entries for a new commit; and (iii) appending the retrieved context to the input prompt for LLM-based generation.
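The three steps above can be prototyped with a simple lexical retriever. The sketch below is an illustration under stated assumptions, not a prescribed implementation: it uses scikit-learn TF-IDF vectors over a toy corpus of historical diffs as the index, whereas a production system would typically use dense embeddings and a persistent vector store.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # (i) Build a searchable index from historical (diff, message) pairs (toy corpus).
    history = [
        ("fix null check when refreshing auth token", "fix: handle expired token on refresh"),
        ("add retry logic to payment client", "feat: retry failed payment requests"),
    ]
    vectorizer = TfidfVectorizer()
    index = vectorizer.fit_transform([d for d, _ in history])

    def retrieve(new_diff: str, k: int = 2):
        # (ii) Rank historical commits by cosine similarity to the new diff.
        sims = cosine_similarity(vectorizer.transform([new_diff]), index)[0]
        return [history[i] for i in sims.argsort()[::-1][:k]]

    def build_prompt(new_diff: str) -> str:
        # (iii) Append the retrieved context to the generation prompt.
        examples = "\n".join(f"Diff: {d}\nMessage: {m}" for d, m in retrieve(new_diff))
        return (f"Similar past commits:\n{examples}\n\n"
                f"New diff:\n{new_diff}\n"
                f"Write a one-line commit message in imperative mood:")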

2.4. Large Language Models for Commit Message Generation and Integration

Recent advancements in large language models (LLMs) such as GPT-4, ChatGPT, DeepSeek, and LLaMA have expanded CMG capabilities by leveraging transformer-based architectures for context-rich, fluent summarization [4,5,7,12]. Empirical studies demonstrate their superior performance over traditional NMT-based methods, especially in generating more semantically coherent and relevant commit messages [6,13,14].
Nonetheless, limitations persist. Most works compare only a subset of LLMs [20,21], lack cross-model benchmarks, or fail to assess deployment feasibility in real-world workflows [15,16,22,23,24]. Furthermore, research on integrating CMG directly into continuous integration/continuous deployment (CI/CD) pipelines using RAG or context enrichment remains sparse [19,25].

2.5. Comparative Method Matrix

Table 1 summarizes the three dominant paradigms for commit message generation with LLMs, namely prompt-only, fine-tuned, and RAG-augmented, highlighting their mechanisms, resource requirements, strengths, and limitations. This comparative view goes beyond prior surveys, which typically focus only on datasets and metrics.

2.6. Tooling Examples Across Paradigms

Several open-source software and commercial tools already explore CMG or related tasks in practice. Table 2 maps examples to the three paradigms (prompt-only, fine-tuned, and RAG-augmented). This clarifies existing prototypes and highlights where operational gaps remain.

2.7. Error Taxonomies in Commit Message Generation

Beyond quantitative metrics, prior research has proposed categorizing qualitative errors in generated commit messages. Lopes et al. [6] analyzed ChatGPT outputs and identified common mistake types such as lack of context (e.g., omitting issue IDs or module references), hallucination (inventing unsubstantiated rationales such as a “security fix”), and perceptual errors (misreading or over-emphasizing parts of a diff). For example, a model may misinterpret a documentation edit as a code change, or justify a parameter update with an incorrect rationale. Including such taxonomies in evaluation frameworks highlights failure modes that BLEU/ROUGE cannot capture, underscoring the need for qualitative analysis alongside automated metrics.

2.8. Summary

Table 3 summarizes key prior works (datasets, methods, metrics, and limitations). Our contribution is twofold: (i) a structured survey of LLM-based commit message generation (CMG), and (ii) a transparent, agent-based evaluation protocol on CommitBench that combines automated metrics and human judgments, with optional retrieval augmentation (RAG). We focus on methodology and reporting guidance; empirical benchmarking and deployment are deferred to future work.

Illustrative Examples

Building on Tian et al. [3], we distinguish ideal commit messages from messy ones using a concise rubric:
  • Ideal message: states what changed and why it changed; uses clear, imperative mood; links to issues/tickets when relevant; and scopes the change precisely (no unrelated edits).
  • Common pitfalls (messy): vague phrases (e.g., “fix login”), no rationale, no reference to the related issue, and overly broad or ambiguous scope.
These criteria guide both our prompt design and our human-evaluation protocol (clarity, informativeness, and relevance). As shown in Figure 1, the ideal message encodes both the purpose and the rationale of the change, often including an issue/ticket identifier, whereas the messy example omits key context. In our study, this rubric is operationalized in the annotation guide used for human ratings and in lightweight heuristics (e.g., presence of an imperative verb, a token length window, and an optional ticket/issue reference) to aid qualitative analysis; a minimal sketch of such heuristics follows.
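The sketch below shows how such lightweight heuristics might be encoded; the verb list, length window, and issue-reference pattern are illustrative choices of ours, not a fixed specification.

    import re

    IMPERATIVE_VERBS = {"add", "fix", "remove", "update", "refactor",
                        "rename", "document", "revert"}   # illustrative, not exhaustive

    def rubric_flags(message: str, min_tokens: int = 3, max_tokens: int = 20) -> dict:
        tokens = message.strip().split()
        first = tokens[0].lower().rstrip(":") if tokens else ""
        return {
            "imperative_mood": first in IMPERATIVE_VERBS,            # starts with an imperative verb
            "length_ok": min_tokens <= len(tokens) <= max_tokens,    # concise but informative
            "has_issue_ref": bool(re.search(r"(#\d+|[A-Z]+-\d+)", message)),  # e.g., #123 or PROJ-42
        }

    print(rubric_flags("Fix login token refresh to handle expired sessions (#482)"))
    # {'imperative_mood': True, 'length_ok': True, 'has_issue_ref': True}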

3. Problem Statement and Research Gap

The convergence of artificial intelligence (AI) and software engineering has transformed how code is authored, maintained, and documented. Large Language Models (LLMs) have shown remarkable potential in automating complex, context-sensitive tasks such as commit message generation [1,4,5]. Despite these advances, several unresolved challenges hinder both academic progress and industrial adoption of LLM-based commit message generation systems.

3.1. Systematic Benchmarking Deficit

While prior studies demonstrate the capabilities of individual LLMs, there is a notable lack of systematic, comparative benchmarking on large-scale, realistic datasets specifically for commit message generation [4,5,7]. Without rigorous head-to-head evaluations, it is difficult for researchers and practitioners to make evidence-based decisions on model selection and deployment across diverse repositories, programming languages, and domains [1,13].

3.2. Generalization and Transferability Challenges

Current research often focuses on narrow benchmarks or limited project types, leaving generalization and transferability across repositories, programming languages, and software scales underexplored [1,7]. Real-world environments demand models that are robust and adaptable to heterogeneous codebases and evolving project requirements. The absence of systematic studies addressing these dimensions limits the applicability and trustworthiness of existing solutions.

3.3. Evaluation Metrics and Usability Limitations

Automated metrics such as BLEU, ROUGE, and METEOR are widely used to evaluate commit message generation [1,4,5]. However, these metrics do not fully capture human perceptions of clarity, relevance, or informativeness. Recent studies stress the need for human-centered evaluation to assess practical utility, user satisfaction, and actionable quality [2,5]. Overreliance on automated metrics risks overlooking real usability issues and impedes adoption in development teams.

3.4. Practical Integration and Deployment Barriers

Despite algorithmic advances, integrating LLM-based commit message generation into real-world workflows is underexplored [6,7,13]. Successful operationalization requires more than accuracy: models must offer low latency, enterprise scalability, privacy and security compliance, and support for continuous user feedback. The current literature provides limited guidance or validated blueprints for achieving these objectives.

3.5. Underutilization of CommitBench

CommitBench [17] is a recent large-scale, multi-language benchmark designed for commit message generation, featuring rigorous quality controls and broad applicability. Yet its adoption in the literature remains limited; to date, it has been used as follows:
  • Bogomolov et al. [26] used CommitBench within a broader benchmark suite for long-context code models, without focusing on commit message generation.
  • Zhao et al. [27] referenced CommitBench for evaluating LLM code understanding, but did not generate or benchmark commit messages.
  • Cao et al. [28] profiled CommitBench in a meta-analysis of code-related benchmarks but did not conduct direct experiments on commit message generation.
No study has systematically benchmarked modern LLMs on CommitBench or explored its integration into practical DevOps pipelines. This work addresses that gap.

3.6. Research Objective

Based on the gaps identified, including the lack of systematic benchmarking, limited generalization studies, insufficient human-centered evaluation, and underutilization of CommitBench, this study specifies a blueprint for a comprehensive empirical evaluation of state-of-the-art LLMs for automated commit message generation. The blueprint covers both proprietary and open-source models on CommitBench, combining automated metrics (BLEU, ROUGE, and METEOR) with human evaluations of clarity, informativeness, and relevance. We also outline a Retrieval-Augmented Generation (RAG) architecture for real-world DevOps integration.
The specific research questions and their motivations are summarized in Table 4.

4. Methodology

This section specifies the evaluation blueprint used throughout the paper: datasets and schema, preprocessing, modeling conditions (prompt-only, RAG, fine-tuned), metrics, and reporting conventions. Deployment/HITL integration is not part of the evaluation protocol and is discussed separately as an optional, post-evaluation pipeline. The overall workflow of this evaluation blueprint is illustrated in Figure 2.

4.1. Walkthrough Example

To demonstrate implementability and clarify the blueprint, we illustrate how a single commit flows through the evaluation protocol.
  • Toy Commit (input git diff)
Consider a small bug fix in the authentication module:
    --- a/auth/login.py
    +++ b/auth/login.py
    @@ def login(user, password):
    -   token = issue_token(user.id)
    -   logger.info("login ok")
    -   return token
    +   try:
    +       token = issue_token(user.id)
    +       logger.info("login ok")
    +       return token
    +   except TokenError as e:
    +       logger.error(f"token failure: {e}")
    +       return None
  • Step 1—Dataset preprocessing
The diff is normalized (whitespace, case, Unicode), trivial commits are filtered, and language tags are standardized.
  • Step 2—Optional retrieval (RAG)
The retriever looks up top-k similar commits from CommitBench, e.g., other authentication fixes, and appends them as additional context.
  • Step 3—Prompt construction
A standardized prompt template is assembled. It instructs the model to produce a single-line commit message in imperative mood, including both what changed and why.
  • Step 4—Generation
Each candidate model (prompt-only, RAG-augmented, fine-tuned) generates one or more commit messages.
  • Step 5—Evaluation
Generated messages are compared with the gold reference message in CommitBench using BLEU-4, ROUGE-L, and METEOR. In addition, human annotators rate clarity, informativeness, and relevance on a Likert scale; a minimal scoring sketch is given after this walkthrough.
  • Outcome
Scores are aggregated at micro- and macro-levels and broken down by language and commit type. The results across all models determine which approach performs best under the evaluation blueprint.
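To ground Step 5, the sketch below scores one illustrative candidate against a hypothetical gold reference (both invented here for the toy commit) using common implementations of the three metrics; it assumes the sacrebleu, rouge-score, and nltk packages are installed.

    import sacrebleu
    from rouge_score import rouge_scorer
    from nltk.translate.meteor_score import meteor_score   # requires the nltk wordnet data

    reference = "fix login: wrap token issuance in try/except to handle TokenError"
    candidate = "fix login to handle TokenError when issuing tokens"

    bleu = sacrebleu.sentence_bleu(candidate, [reference]).score          # 0-100 scale
    rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
        .score(reference, candidate)["rougeL"].fmeasure                   # 0-1 scale
    meteor = meteor_score([reference.split()], candidate.split())         # 0-1 scale

    print(f"BLEU-4={bleu:.1f}  ROUGE-L={rouge_l:.3f}  METEOR={meteor:.3f}")

Human ratings of clarity, informativeness, and relevance are collected separately and aggregated as described in Section 4.4.4.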

4.2. Dataset Collection and Preparation

4.2.1. Source: CommitBench

This study utilizes CommitBench, a large-scale, multi-language benchmark containing over 500,000 commit diffs and messages from diverse open-source repositories [17,29]. Its scale, diversity, and rich metadata make it an ideal foundation for robust empirical studies.

4.2.2. Record Schema and Example

Each entry in CommitBench is structured as a record containing several key fields: the commit hash, the corresponding unified diff, the reference message, the project (repository name), a split label indicating its assignment to train/validation/test sets, and the inferred diff_languages. Figure 3 provides a representative example of a CommitBench record, illustrating the structure of these fields and the typical range of their lengths.
As a complement to the visual snapshot in Figure 3, Table 5 enumerates the CommitBench fields we use and how each supports our pipeline (e.g., modeling input, supervision target, stratified splitting, and language analysis).

4.2.3. Dataset Landscape

Table 6 summarizes key public datasets that have been used for commit message generation, including CommitBench. We report the number of commits, dominant programming languages, average message length, typical diff size, and whether issue/ticket links are available. This landscape highlights why CommitBench is particularly suited for standardized evaluation; it is larger, multi-language, and better curated than prior proprietary sets.

4.2.4. Preprocessing Impacts

Data preprocessing strongly influences the reliability of evaluation metrics in CMG. Table 7 summarizes the main steps and their observed effects. For instance, BLEU/ROUGE are highly sensitive to case and punctuation, so lowercasing and normalization can inflate scores by 1–2 points. Filtering trivial commits avoids misleadingly high scores on near-empty diffs. Tokenization choice (subword vs. word-level) impacts the handling of rare identifiers and multilingual data. We recommend Unicode normalization (NFKC), preservation of identifiers and punctuation, and stratified splitting by project/language to avoid leakage.

4.3. Dataset Preprocessing Analysis

CommitBench initially contains ~1.16 M commits with heterogeneous language labels (e.g., duplicates like py,py, mixed case Js/js, and multi-language diffs). We apply a compact, reproducible cleaning pipeline to ensure comparability and reduce noise: (i) normalize language tags and retain six canonical languages (Python, JavaScript, PHP, Ruby, Java, Go); (ii) remove bot-like commits (e.g., dependabot, renovate) and trivial/low-information messages (e.g., “bump version”); (iii) filter extreme lengths (messages 1–80 tokens; diffs 1–4000 tokens); and (iv) balance the final evaluation set to equal per-language counts.
Message/diff length statistics motivate these thresholds: messages have a median of 8 tokens (max 111) and diffs a median of 63 tokens (max 343). Because lexical-overlap metrics (BLEU/ROUGE/METEOR) are sensitive to casing and tokenization, we apply Unicode NFKC normalization, case-folding, whitespace compaction, and preserve punctuation/identifiers to avoid penalizing informative tokens. Our intent is to provide a blueprint for dataset cleaning rather than to report new experimental results. Additional distributions (e.g., raw language mix, trivial/bot shares, length histograms) are available from the corresponding author upon request.
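A compact sketch of the cleaning rules described above is shown below; the language-tag map and the bot/trivial patterns are illustrative approximations of the released filtering scripts, and diff_languages is assumed to be a comma-separated string as in Figure 3.

    import re
    import unicodedata

    CANONICAL = {"py": "Python", "js": "JavaScript", "php": "PHP",
                 "rb": "Ruby", "java": "Java", "go": "Go"}          # illustrative tag map
    BOT_PATTERN = re.compile(r"\b(dependabot|renovate)\b", re.IGNORECASE)
    TRIVIAL_PATTERN = re.compile(r"^(bump version|update readme)\b", re.IGNORECASE)

    def normalize_text(text: str) -> str:
        # Unicode NFKC, case-folding, whitespace compaction; identifiers and punctuation kept.
        return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).casefold().strip()

    def keep_commit(record: dict) -> bool:
        message = normalize_text(record["message"])
        diff = normalize_text(record["diff"])
        languages = {CANONICAL.get(tag.strip().lower())
                     for tag in record["diff_languages"].split(",")} - {None}
        return (bool(languages)                                # (i) canonical language present
                and not BOT_PATTERN.search(message)            # (ii) drop bot-like commits
                and not TRIVIAL_PATTERN.match(message)         # (ii) drop trivial messages
                and 1 <= len(message.split()) <= 80            # (iii) message length bounds
                and 1 <= len(diff.split()) <= 4000)            # (iii) diff length bounds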
Reproducibility. We provide scripts for language normalization, bot/trivial filtering patterns, and length filters, plus a manifest of commit IDs kept/excluded, enabling exact reconstruction of Table 8.
Table 8 reports the cumulative effect of each preprocessing step on dataset size, showing that CommitBench remains large-scale even after filtering. To ensure fair evaluation across languages, we constructed a balanced evaluation set with equal samples from six canonical languages (Python, JavaScript, PHP, Ruby, Java, and Go). Figure 4 illustrates this distribution, confirming that each language contributes equally (96,057 commits) to the final evaluation set.

4.4. Model Selection, Training, and Evaluation

4.4.1. Modeling Paradigms

We study three complementary settings for automated commit message generation (illustrated in Figure 5):
  • Generic LLMs (zero-/few-shot): Prompt-engineered proprietary or open-source LLMs without repository-specific retrieval or task-specific fine-tuning.
  • RAG-augmented LLMs: The model receives the current git diff plus retrieved context (similar past commits, issues, docs) from a vector index.
  • Fine-tuned LLMs: Supervised training on CommitBench pairs (diff, message) to specialize the base model for CMG.
This triad defines the experimental conditions compared throughout our blueprint (RQ1) and is evaluated using both automated and human-centered metrics (RQ2), with breakdowns across commit types, domains, and languages (RQ3).
Figure 5. Modeling paradigms compared in the evaluation blueprint.

4.4.2. Model Suite

The evaluation covers proprietary LLMs (e.g., GPT-5 and DeepSeek-V3.1) and open-source models fine-tuned on CommitBench [7,13]. Proprietary models are adapted through prompt engineering and in-context learning, while open-source models undergo supervised fine-tuning. In the open-source setting, we fine-tuned a Qwen2.5-Coder-3B base model on CommitBench using our preprocessing pipeline, yielding a task-specific variant. Proprietary models (ChatGPT and DeepSeek) are included only in prompt-only and RAG-augmented modes (no fine-tuning) to provide a fair contrast between closed-source prompting and open-source fine-tuning.

4.4.3. Training Procedure

Hyperparameters such as learning rate, batch size, and sequence length are optimized on validation sets. Training uses early stopping and continuous monitoring of BLEU, ROUGE, and METEOR scores to avoid overfitting [9,10]. All experiments use fixed random seeds and documented setups for reproducibility.

4.4.4. Evaluation Metrics

The evaluation of candidate commit messages is carried out along three complementary dimensions: (i) reference-based automated metrics, which measure lexical similarity with the gold-standard message; (ii) human judgments, which capture developer-centered quality aspects such as clarity, informativeness, and relevance; (iii) operational measures, which assess practical deployment considerations including latency, token usage, and cost efficiency.
Table 9 summarizes the selected metrics, their primary purpose, and the associated reporting conventions. In addition, systematic qualitative shortcomings are classified using the error taxonomy (E1–E8) listed in Table 9, which provides a structured framework for analyzing common failure modes beyond lexical overlap.
Reference-Based Metrics
Following prior CMG work [4,5], we report BLEU-4, ROUGE-L (summary variant), and METEOR. Because commit messages are short and often formulaic, we apply conservative preprocessing: Unicode normalization (NFKC), case-folding, whitespace compaction, and the preservation of punctuation and code identifiers. Scores are computed at the example level and aggregated as both micro-averages (over all examples) and macro-averages (per project and per language, then averaged) to reduce dataset-domain skew [30]. We report 95% bootstrap confidence intervals (10,000 resamples).
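The micro/macro aggregation and the percentile bootstrap can be written in a few lines; the sketch below operates on example-level scores and group labels (project or language) and is a simplified stand-in for the released aggregation code.

    import numpy as np

    def micro_macro(scores, groups):
        # scores: example-level metric values; groups: project or language label per example.
        scores, groups = np.asarray(scores, dtype=float), np.asarray(groups)
        micro = scores.mean()                                                      # over all examples
        macro = np.mean([scores[groups == g].mean() for g in np.unique(groups)])   # per group, then averaged
        return micro, macro

    def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
        # Percentile bootstrap confidence interval for the mean score.
        rng = np.random.default_rng(seed)
        scores = np.asarray(scores, dtype=float)
        means = rng.choice(scores, size=(n_resamples, len(scores)), replace=True).mean(axis=1)
        return tuple(np.quantile(means, [alpha / 2, 1 - alpha / 2]))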
Human Evaluation
Human judgments cover clarity, informativeness, and relevance, each rated on a 1–5 Likert scale by two annotators per sample; agreement is reported via Cohen’s κ and Krippendorff’s α, and disagreements are reconciled (see Section 5.3 and Table 9).
Operational Measures
For deployability, we record end-to-end latency (p50/p95), prompt and generation token counts, and estimated unit cost (USD/1K tokens, when applicable). We also track style compliance rates (e.g., imperative mood, and presence of concise what+why [3]) and output length distributions. These measures enable practical trade-off analysis across prompt-only, RAG, and fine-tuned configurations.
Significance testing. For pairwise model comparisons, we use stratified bootstrap tests (by project and language) over example-level scores; differences are reported with 95% CIs. Where multiple hypotheses are tested, we apply Benjamini–Hochberg FDR control.
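For completeness, the sketch below shows one way to implement a (non-stratified) paired bootstrap test and the Benjamini–Hochberg procedure; stratification would resample indices within each project/language stratum instead of globally.

    import numpy as np

    def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
        # Two-sided bootstrap test on the mean score difference between two models
        # evaluated on the same examples.
        rng = np.random.default_rng(seed)
        diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
        idx = rng.integers(0, len(diff), size=(n_resamples, len(diff)))
        null = diff[idx].mean(axis=1) - diff.mean()      # center resampled means at the null
        return float((np.abs(null) >= abs(diff.mean())).mean())

    def benjamini_hochberg(pvalues, q=0.05):
        # Boolean mask of hypotheses rejected at FDR level q.
        p = np.asarray(pvalues, dtype=float)
        order = np.argsort(p)
        passed = p[order] <= q * np.arange(1, len(p) + 1) / len(p)
        k = passed.nonzero()[0].max() + 1 if passed.any() else 0
        reject = np.zeros(len(p), dtype=bool)
        reject[order[:k]] = True
        return reject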
Transparency. We release evaluation scripts, prompts, tokenization rules, and aggregation code to facilitate exact reproduction of Table 9 outputs and to enable consistent future comparisons.

4.5. Integration and Implementation

4.5.1. System Architecture

Integration into workflows is achieved via pre-commit hooks, CI/CD plugins, and REST APIs, enabling scalability and low latency deployment [31,32,33,34,35].
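As one concrete integration point, a Git prepare-commit-msg hook can pre-fill the message from the staged diff. The sketch below is illustrative: generate_commit_message() is a placeholder for a call to the selected model or an internal REST endpoint, not a real API.

    #!/usr/bin/env python3
    # Illustrative prepare-commit-msg hook (install as .git/hooks/prepare-commit-msg, executable).
    import subprocess
    import sys

    def generate_commit_message(diff: str) -> str:
        # Placeholder for the chosen backend (prompt-only, RAG-augmented, or fine-tuned model).
        return "chore: describe staged changes here"

    def main() -> None:
        msg_file = sys.argv[1]        # Git passes the path of the commit message file
        diff = subprocess.run(["git", "diff", "--cached"],
                              capture_output=True, text=True, check=True).stdout
        if not diff.strip():
            return                    # nothing staged; keep Git's default message
        with open(msg_file, "w", encoding="utf-8") as handle:
            handle.write(generate_commit_message(diff) + "\n")  # developer can still edit it

    if __name__ == "__main__":
        main()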

4.5.2. Implementation Challenges

Challenges include managing inference latency, scaling across projects, addressing privacy/security issues, and implementing fallback mechanisms for ambiguous commits [19,22].

4.5.3. User Feedback and Continuous Improvement

Developer interactions (accept, reject, and edit) are logged and used for iterative model improvements [22,36,37]. This feedback loop is consistent with recent efforts on large-scale codebase reconciliation and on issue-commit link recovery that similarly leverage developer edits and annotations [36,38].

4.6. Reproducibility Checklist

To ensure transparency and comparability, we require that future studies using this blueprint report the following items.
  • Model identifiers, provider, and version/date (e.g., GPT-4-0613).
  • Prompts: full text including few-shot exemplars and stop sequences.
  • Hyperparameters: temperature, top-p, max tokens, learning rate, batch size, and sequence length.
  • Tokenization and preprocessing rules (e.g., Unicode normalization and identifier preservation).
  • Seed control for reproducibility.
  • Dataset version and exact split recipe (CommitBench release and stratification).
  • Evaluation scripts, metric variants (e.g., ROUGE-L), and postprocessing rules.
  • Operational settings: latency measurement method, cost estimation (tokens × USD/1K).

4.7. Compute Resources and Constraints

Typical fine-tuning experiments for CommitBench-scale subsets (100 k–500 k commits) require 1–2 high-memory GPUs (e.g., NVIDIA A100 40 GB) or TPU v3 equivalents. Training time is on the order of 8–20 h depending on batch size and sequence length. Cost and hardware availability should be explicitly reported to support reproducibility and fairness in comparisons.

4.8. Limitations and Future Directions

4.8.1. Limitations

This work contributes a survey and a reproducible evaluation blueprint but does not present new benchmarking results. Consequently, the research questions in Section 3 are intentionally deferred. Additional limitations include dataset bias and generalization risks [1,5].

4.8.2. Future Work

Future work may explore multi-modal integration, explainability, cross-language transfer, and personalized commit styles [7,13]. In particular, we plan a systematic benchmarking of prompt-only, RAG-augmented, and fine-tuned models on CommitBench, with both automated metrics (BLEU, ROUGE, and METEOR) and human-centered evaluation (clarity, informativeness, and relevance).

4.9. Ethical Considerations

The study follows ethical practices: respecting open-source licenses, ensuring transparency for AI-generated content, mitigating bias, and preserving developer autonomy [5,12].

4.10. Post-Evaluation Deployment Pipeline

The integration of large language models (LLMs) into the commit process is operationalized through the workflow illustrated in Figure 6.
This pipeline is designed to seamlessly integrate with developers’ existing version control systems, ensuring minimal disruption to workflows while maximizing the quality and relevance of generated commit messages.
The process begins in the local repository, where developers modify source code across various files and formats (e.g., Python, Java, and JavaScript). Once changes are staged, the system automatically triggers a git diff analysis (Step 1), extracting the precise set of code modifications for subsequent processing.
In Step 2, a retrieval mechanism can optionally enrich the input by gathering similar past commits, issue tracker discussions, or related documentation. This retrieved context is appended to the raw diff before being passed to the LLM, enabling retrieval-augmented generation (RAG) for greater factual grounding and style alignment.
Next, the selected LLM (identified as the best-performing model from evaluation) generates a commit message (Step 3). Lightweight style heuristics such as the what+why criterion or imperative mood compliance are applied to ensure message quality.
In Step 4, the generated message is validated against style checks and then presented to the developer for review.
Finally, once approved, the chosen commit message is pushed to the remote repository (e.g., GitHub or GitLab) along with the corresponding code changes (Step 5). This ensures that high-quality, context-aware messages are preserved for downstream tasks such as traceability, code review, and maintenance.
Overall, this pipeline couples automated evaluation with human-in-the-loop validation, providing both scientific rigor and practical usability. It supports diverse deployment scenarios, including cloud-hosted, on-premise, and hybrid setups, ensuring that privacy and scalability requirements can be met across different DevOps environments.

Deployment Illustration

Figure 6 illustrates how, once the best-performing model has been identified using the evaluation blueprint, it can be integrated into a developer workflow. The numbered stages show a simple sequence, from local git diff extraction, through optional retrieval and commit message generation with the selected LLM, to lightweight style checks, and finally commit/push to the remote repository. This figure is included as an illustrative example of post-evaluation deployment and is not part of the evaluation protocol described in Section 4.

5. Discussion

This paper proposes a transparent, agent-based evaluation protocol for automated commit message generation (CMG) on CommitBench; the emphasis is on methodology and reporting rather than on new empirical findings.
Here, we discuss the implications for research and practice, design choices and trade-offs in the protocol, considerations for human evaluation, reproducibility standards, threats to validity, and ethical aspects. We close with limitations and potential extensions.

5.1. Implications for Research and Practice

A standardized protocol can accelerate progress on CMG by making results across models and settings comparable (Table 9). For researchers, the agent workflow in Figure 2 clarifies the experimental surface: dataset preparation, prompting (optionally with retrieval), scoring, ranking, and reporting. For practitioners, adopting a principled process, paired with style guidance informed by what/why criteria [3] and failure categorization via the error taxonomy (E1–E8), should improve traceability and review ergonomics without mandating a single model or vendor.

5.2. Protocol Design Choices and Trade-Offs

Prompt-only vs. RAG vs. fine-tuning (Figure 5). Prompt-only LLMs are the easiest to trial and compare, but they are highly sensitive to prompt formulation and context-window limitations. Retrieval-augmented generation (RAG) improves factual grounding and style alignment by injecting similar commits or ticket references at inference time, though it requires constructing and maintaining a retrieval index. Fine-tuning can provide stable performance gains, but it involves licensing constraints, additional compute resources, and risks of model drift; it also reduces flexibility for rapid model replacement. Our protocol enables all three paradigms to be evaluated under consistent scoring and reporting conditions.
Ranking and selection rules. When multiple candidates are generated (e.g., across models or prompts), selection may be based on (i) reference-oriented metrics (BLEU/ROUGE/METEOR), (ii) learned rerankers, or (iii) lightweight heuristics (e.g., penalizing errors such as E1, E4, or E5 from the error taxonomy). We recommend reporting both oracle (best-of-k) and single-shot results to establish realistic performance bounds; a minimal selection sketch follows.
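The selection rules can be made explicit as below; metric and heuristic_score stand for any callables (e.g., a ROUGE scorer or an error-taxonomy-aware penalty), and the functions are a sketch rather than a prescribed reranker.

    def single_shot(candidates, heuristic_score):
        # Deployment-style selection: pick the candidate preferred by a heuristic
        # or learned reranker, without access to the reference message.
        return max(candidates, key=heuristic_score)

    def oracle_best_of_k(candidates, reference, metric):
        # Upper bound: pick the candidate that scores best against the reference.
        return max(candidates, key=lambda candidate: metric(candidate, reference))

    # Reporting both numbers brackets realistic performance: single_shot reflects what a
    # user would see, while oracle_best_of_k shows the headroom left for better selection.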
Latency, cost, and privacy. Prompt-only and RAG configurations differ in their latency and cost profiles. RAG adds retrieval latency, while fine-tuning lowers per-inference token costs but requires expensive upfront training. In environments with strict privacy constraints that disallow external API calls or indexing of sensitive diffs, on-premise or fully open-source models with local retrieval remain preferable [39,40].

5.3. Human Evaluation Considerations

Automated metrics are known to diverge from developer-centered judgments [2,41,42]. Accordingly, our evaluation specifies the following: (i) three Likert-scale dimensions (clarity, informativeness, and relevance); (ii) dual annotation with reconciliation for disagreements; and (iii) inter-annotator agreement reported via Cohen’s κ and Krippendorff’s α. To reduce sampling bias, we recommend stratification by commit type, programming language, and diff length, and we require publication of the sampling procedure. Error labels E1–E8 are applied to make qualitative analysis systematic and comparable across studies; a minimal agreement-computation sketch is shown below.
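A minimal sketch of the agreement computation, assuming scikit-learn is available and using invented ratings; Krippendorff’s α can be computed analogously with the third-party krippendorff package.

    from sklearn.metrics import cohen_kappa_score

    # Likert ratings (1-5) for one dimension (e.g., clarity) from two annotators.
    rater_1 = [5, 4, 2, 3, 5, 1, 4]
    rater_2 = [5, 3, 2, 3, 4, 1, 4]

    # Quadratic weights penalize large disagreements more than adjacent ones,
    # which suits ordinal Likert scales.
    kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
    print(f"Cohen's kappa (quadratic-weighted) = {kappa:.2f}")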

5.4. Reporting and Reproducibility

To support reliable comparison and replication, reports should include the following: (a) CommitBench version and exact split recipe [17]; (b) prompt templates, stop criteria, temperature/top-p, and max tokens; (c) model identifiers, provider and version/date; (d) RAG index construction (sources, filters, top-k); (e) seed control and batch sizes; (f) references for any style rules; and (g) full metric scripts and postprocessing. Where licensing permits, release hashed commit IDs and prompts to enable recomputation.

5.5. Threats to Validity

Construct validity. BLEU/ROUGE/METEOR may not capture “why” adequacy or project style; we mitigate this with human ratings and E1–E8 taxonomy usage.
Internal validity. Prompt leakage, retrieval contamination, or inconsistent parameter settings can bias results. We specify fixed prompts, seeds, and audit logs for each run.
External validity. Results on CommitBench may not generalize to private monorepos, non-English projects, or atypical workflows. We encourage reporting per-language and per-domain breakdowns and clearly stating scope boundaries.

5.6. Ethical and Responsible Use

CMG systems must avoid leaking secrets present in diffs, respect repository licenses, and disclose AI assistance where organizational policy requires it. Teams should keep human-in-the-loop review by default, monitor for hallucinations (E4) and incorrect rationales (E7), and retain edit telemetry for continuous improvement, while complying with data retention policies.

5.7. Limitations and Extensions

This work specifies a protocol rather than reporting empirical results. We defer a full multi-LLM comparison (proprietary and open-source), ablation of RAG components, and deployment studies in CI contexts to future work. Extensions include multilingual evaluation, personalization to repository/style, learned reranking with error-aware features, measurement of developer effort (e.g., edit distance from AI suggestion), unit test generation [43], and live A/B tests of reviewer throughput.

6. Conclusions

We presented a survey of LLM-based CMG and a reproducible, agent-based evaluation protocol centered on CommitBench. The protocol standardizes datasets, prompting (with optional RAG), scoring, selection, and reporting, and includes a qualitative error taxonomy. Future work may execute systematic comparative experiments across proprietary and open-source LLMs, building on the proposed blueprint.

Author Contributions

Conceptualization, M.M.T. and W.G.A.-K.; methodology, M.M.T.; software, M.M.T.; validation, M.M.T. and W.G.A.-K.; formal analysis, M.M.T.; investigation, M.M.T. and W.G.A.-K.; resources, W.G.A.-K.; data curation, M.M.T.; writing—original draft preparation, M.M.T.; writing—review and editing, M.M.T. and W.G.A.-K.; visualization, W.G.A.-K.; supervision, W.G.A.-K.; project administration, W.G.A.-K.; funding acquisition, W.G.A.-K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Interdisciplinary Research Center for Intelligent Secure Systems (IRC-ISS) at King Fahd University of Petroleum & Minerals (KFUPM). The APC was funded by KFUPM.

Data Availability Statement

The data presented in this study are openly available at Hugging Face in the CommitBench repository: https://huggingface.co/datasets/Maxscha/commitbench. This dataset was originally described by Schall et al. [17]. The preprocessing scripts and filtering patterns used in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

LLM: Large Language Model
CMG: Commit Message Generation
NMT: Neural Machine Translation
RAG: Retrieval-Augmented Generation
BLEU: Bilingual Evaluation Understudy
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
METEOR: Metric for Evaluation of Translation with Explicit ORdering

References

  1. Zhang, Y.; Qiu, Z.; Stol, K.-J.; Zhu, W.; Zhu, J.; Tian, Y.; Liu, H. Automatic commit message generation: A critical review and directions for future work. IEEE Trans. Softw. Eng. 2024, 50, 816–835. [Google Scholar] [CrossRef]
  2. Li, J.; Ahmed, I. Commit message matters: Investigating impact and evolution of commit message quality. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 15–16 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 806–817. [Google Scholar]
  3. Tian, Y.; Zhang, Y.; Stol, K.-J.; Jiang, L.; Liu, H. What makes a good commit message? In Proceedings of the 44th International Conference on Software Engineering (ICSE ’22), Pittsburgh, PA, USA, 21–29 May 2022; ACM: New York, NY, USA, 2022; pp. 2389–2401. [Google Scholar] [CrossRef]
  4. Xue, P.; Wu, L.; Yu, Z.; Jin, Z.; Yang, Z.; Li, X.; Yang, Z.; Tan, Y. Automated commit message generation with large language models: An empirical study and beyond. IEEE Trans. Softw. Eng. 2024, 50, 3208–3224. [Google Scholar] [CrossRef]
  5. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large language models for software engineering: A systematic literature review. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–79. [Google Scholar] [CrossRef]
  6. Lopes, C.V.; Klotzman, V.I.; Ma, I.; Ahmed, I. Commit messages in the age of large language models. arXiv 2024, arXiv:2401.17622. [Google Scholar] [CrossRef]
  7. Zhang, L.; Zhao, J.; Wang, C.; Liang, P. Using large language models for commit message generation: A preliminary study. In Proceedings of the 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 12–15 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 126–130. [Google Scholar]
  8. Jiang, S.; Armaly, A.; McMillan, C. Automatically generating commit messages from diffs using neural machine translation. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), Urbana, IL, USA, 30 October–3 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 135–146. [Google Scholar]
  9. Liu, Z.; Xia, X.; Hassan, A.E.; Lo, D.; Xing, Z.; Wang, X. Neural-machine-translation-based commit message generation: How far are we? In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Montpellier, France, 3–7 September 2018; pp. 373–384. [Google Scholar]
  10. Nie, L.Y.; Gao, C.; Zhong, Z.; Lam, W.; Liu, Y.; Xu, Z. Coregen: Contextualized code representation learning for commit message generation. Neurocomputing 2021, 459, 97–107. [Google Scholar] [CrossRef]
  11. He, Y.; Wang, L.; Wang, K.; Zhang, Y.; Zhang, H.; Li, Z. Come: Commit message generation with modification embedding. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, Seattle, WA, USA, 17–21 July 2023; pp. 792–803. [Google Scholar]
  12. Zhang, Q.; Fang, C.; Xie, Y.; Zhang, Y.; Yang, Y.; Sun, W.; Yu, S.; Chen, Z. A survey on large language models for software engineering. arXiv 2023, arXiv:2312.15223. [Google Scholar] [CrossRef]
  13. Li, J.; Faragó, D.; Petrov, C.; Ahmed, I. Only diff is not enough: Generating commit messages leveraging reasoning and action of large language model. Proc. ACM Softw. Eng. 2024, 1, 745–766. [Google Scholar] [CrossRef]
  14. Beining, Y.; Alassane, S.; Fraysse, G.; Cherrared, S. Generating commit messages for configuration files in 5G network deployment using LLMs. In Proceedings of the 2024 20th International Conference on Network and Service Management (CNSM), Prague, Czech Republic, 28–31 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar]
  15. Pandya, K. Automated Software Compliance Using Smart Contracts and Large Language Models in Continuous Integration and Continuous Deployment with DevSecOps. Master’s Thesis, Arizona State University, Tempe, AZ, USA, 2024. [Google Scholar]
  16. Kruger, J. Embracing DevOps Release Management: Strategies and Tools to Accelerate Continuous Delivery and Ensure Quality Software Deployment; Packt Publishing Ltd.: Birmingham, UK, 2024. [Google Scholar]
  17. Schall, M.; Czinczoll, T.; De Melo, G. Commitbench: A benchmark for commit message generation. In Proceedings of the 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 12–15 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 728–739. [Google Scholar]
  18. Huang, Z.; Huang, Y.; Chen, X.; Zhou, X.; Yang, C.; Zheng, Z. An empirical study on learning-based techniques for explicit and implicit commit messages generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 544–555. [Google Scholar]
  19. Gao, C.; Hu, X.; Gao, S.; Xia, X.; Jin, Z. The current challenges of software engineering in the era of large language models. ACM Trans. Softw. Eng. Methodol. 2025, 34, 1–30. [Google Scholar] [CrossRef]
  20. Palakodeti, V.K.; Heydarnoori, A. Automated generation of commit messages in software repositories. arXiv 2025, arXiv:2504.12998. [Google Scholar] [CrossRef]
  21. Bektas, A. Large Language Models in Software Engineering: A Critical Review of Evaluation Strategies. Master’s Thesis, Freie Universität Berlin, Berlin, Germany, 2024. [Google Scholar]
  22. Liu, Y.; Chen, J.; Bi, T.; Grundy, J.; Wang, Y.; Yu, J.; Chen, T.; Tang, Y.; Zheng, Z. An empirical study on low-code programming using traditional vs large language model support. arXiv 2024, arXiv:2402.01156. [Google Scholar]
  23. Don, R.G.G. Comparative Research on Code Vulnerability Detection: Open-Source vs. Proprietary Large Language Models and Lstm Neural Network. Master’s Thesis, Unitec Institute of Technology, Auckland, New Zealand, 2024. [Google Scholar]
  24. Sultana, S.; Afreen, S.; Eisty, N.U. Code vulnerability detection: A comparative analysis of emerging large language models. arXiv 2024, arXiv:2409.10490. [Google Scholar] [CrossRef]
  25. Wang, S.-K.; Ma, S.-P.; Lai, G.-H.; Chao, C.-H. ChatOps for microservice systems: A low-code approach using service composition and large language models. Future Gener. Comput. Syst. 2024, 161, 518–530. [Google Scholar] [CrossRef]
  26. Bogomolov, E.; Eliseeva, A.; Galimzyanov, T.; Glukhov, E.; Shapkin, A.; Tigina, M.; Golubev, Y.; Kovrigin, A.; van Deursen, A.; Izadi, M.; et al. Long code arena: A set of benchmarks for long-context code models. arXiv 2024, arXiv:2406.11612. [Google Scholar] [CrossRef]
  27. Zhao, Y.; Luo, Z.; Tian, Y.; Lin, H.; Yan, W.; Li, A.; Ma, J. Codejudge-eval: Can large language models be good judges in code understanding? arXiv 2024, arXiv:2408.10718. [Google Scholar]
  28. Cao, J.; Chan, Y.-K.; Ling, Z.; Wang, W.; Li, S.; Liu, M.; Wang, C.; Yu, B.; He, P.; Wang, S.; et al. How should I build a benchmark? arXiv 2025, arXiv:2501.10711. [Google Scholar]
  29. Kosyanenko, I.A.; Bolbakov, R.G. Dataset collection for automatic generation of commit messages. Russ. Technol. J. 2025, 13, 7–17. [Google Scholar] [CrossRef]
  30. Li, Y.; Huo, Y.; Jiang, Z.; Zhong, R.; He, P.; Su, Y.; Bri, L.C.; Lyu, M.R. Exploring the effectiveness of LLMs in automated logging statement generation: An empirical study. IEEE Trans. Softw. Eng. 2024, 50, 3188–3207. [Google Scholar] [CrossRef]
  31. Allam, H. Intelligent automation: Leveraging LLMs in DevOps toolchains. Int. J. AI Bigdata Comput. Manag. Stud. 2024, 5, 81–94. [Google Scholar] [CrossRef]
  32. Ragothaman, H.; Udayakumar, S.K. Optimizing service deployments with NLP-based infrastructure code generation—An automation framework. In Proceedings of the 2024 IEEE 2nd International Conference on Electrical Engineering, Computer and Information Technology (ICEECIT), Jember, Indonesia, 22–23 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 216–221. [Google Scholar]
  33. Joshi, S. A review of generative AI and DevOps pipelines: CI/CD, agentic automation, MLOps integration, and large language models. Int. J. Innov. Res. Comput. Sci. Technol. 2025, 13, 1–14. [Google Scholar] [CrossRef]
  34. Coban, S.; Mattukat, A.; Slupczynski, A. Full-Scale Software Engineering. Master’s Thesis, RWTH Aachen University, Aachen, Germany, 2024. [Google Scholar]
  35. Krishna, A.; Meda, V. AI Integration in Software Development and Operations; Springer: Berlin/Heidelberg, Germany, 2025. [Google Scholar]
  36. Gandhi, A.; De, S.; Chechik, M.P.; Pandit, V.; Kiehn, M.; Chee, M.C.; Bedasso, Y. Automated codebase reconciliation using large language models. In Proceedings of the 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge), Ottawa, ON, Canada, 27–28 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–11. [Google Scholar]
  37. Cihan, U.; Haratian, V.; İçöz, A.; Gül, M.K.; Devran, O.; Bayendur, E.F.; Uçar, B.M.; Tüzün, E. Automated code review in practice. arXiv 2024, arXiv:2412.18531. [Google Scholar] [CrossRef]
  38. Parveen, R. Investigating T-BERT for Automated Issue–Commit Link Recovery. Master’s Thesis, University of Tampere, Tampere, Finland, 2025. [Google Scholar]
  39. Jaju, I. Maximizing DevOps Scalability in Complex Software Systems. Master’s Thesis, Uppsala University, Department of Information Technology, Uppsala, Sweden, 2023; p. 57. [Google Scholar]
  40. Kolawole, I.; Fakokunde, A. Machine learning algorithms in DevOps: Optimizing software development and deployment workflows with precision. Int. J. Res. Publ. Rev. 2025, 2582, 7421. [Google Scholar] [CrossRef]
  41. Zhang, X.; Muralee, S.; Cherupattamoolayil, S.; Machiry, A. On the effectiveness of large language models for GitHub workflows. In Proceedings of the 19th International Conference on Availability, Reliability and Security, Vienna, Austria, 30 July–2 August 2024; pp. 1–14. [Google Scholar]
  42. Fernandez-Gauna, B.; Rojo, N.; Graña, M. Automatic feedback and assessment of team-coding assignments in a DevOps context. Int. J. Educ. Technol. High. Educ. 2023, 20, 17. [Google Scholar] [CrossRef]
  43. Cellamare, F.P. AI-Driven Unit Test Generation. Ph.D. Thesis, Politecnico di Torino, Torino, Italy, 2025. [Google Scholar]
Figure 1. Ideal vs. messy commit messages. The ideal one states what and why, uses clear scope, and links to an issue; the messy one is vague and lacks context.
Figure 2. Evaluation blueprint on CommitBench: (1) dataset preparation; (2) prompting (optionally RAG); (3) scoring and ranking with reference-based metrics; (4) reporting by project/language/commit type. This figure describes the evaluation workflow only (no deployment).
Figure 3. Snapshot of CommitBench record used in our pipeline (hash, unified diff, reference message, project, split, and inferred languages).
Figure 4. Balanced evaluation set with equal samples per language (Python, JavaScript, PHP, Ruby, Java, and Go).
Figure 6. Deployment (post-evaluation): (1) local git diff; (2) optional retrieval; (3) selected LLM generates commit message; (4) commit and push. This figure illustrates how the best model from the evaluation could be used in practice; it is not part of the evaluation protocol.
Table 1. Comparison of CMG paradigms using LLMs.
Paradigm | Mechanism | Strengths | Limitations
Prompt-only LLMs | Zero-/few-shot prompting of proprietary or open-source LLMs; no task-specific training. | Easy to adopt; no infra; flexible across repos. | Highly prompt-sensitive; context-window limited; costly per call; privacy risks with API use.
Fine-tuned LLMs | Supervised training on diff–message pairs; model specialized for CMG. | Stable performance; adapts to repo style; cheaper per inference. | High compute cost; risk of drift; licensing/IP issues; less flexible for rapid model switching.
RAG-augmented LLMs | Diff + retrieved similar commits/issues/docs fed to LLM. | Grounded outputs; style alignment; avoids retraining. | Extra retrieval latency; index maintenance overhead; contamination risks.
Table 2. Examples of tools and prototypes for commit message generation mapped to paradigms.
Paradigm | Example | Notes
Prompt-only | GitHub Copilot Chat; OpenCommit (OSS CLI) | Suggests commit summaries via API prompts. Sensitive to wording; no repo history.
Fine-tuned | CoMe model [11]; CodeT5+ finetunes | Trained on labeled diffs; stronger style control; requires compute + licenses.
RAG-augmented | Custom Git hooks with local vector DB; experimental CI plugins [22] | Retrieves similar past commits or issue links; latency + index maintenance are challenges.
Table 3. Summary of related work on automated commit message generation (datasets, methods, metrics, and limitations).
Study | Dataset | Methodology | Metrics | Limitations
Jiang et al. [8] | Proprietary Java diffs | Seq2Seq NMT | BLEU, human | Shallow semantics; poor generalization
Liu et al. [9] | Proprietary diffs | NMT + AST features | BLEU | Java-specific; lacks multilingual coverage
Nie et al. [10] | Proprietary | Contextual code embeddings | BLEU, METEOR | No benchmark-scale evaluation
He et al. [11] | Proprietary | Modification embeddings | BLEU, ROUGE | Dataset tuning; weak generalization
Huang et al. [18] | Proprietary | Explicit/implicit CMG | BLEU | Underperforms on unseen projects
Schall et al. [17] | CommitBench | Multiple baselines | BLEU, ROUGE | Lacks human evaluation
Xue et al. [4] | Proprietary | LLM-based CMG | BLEU, ROUGE | Single LLM; low diversity
Beining et al. [14] | Proprietary | LLMs for config commits | BLEU, ROUGE | Domain-specific; not generalizable
Palakodeti & Heydarnoori [20] | Proprietary | LLM-based CMG | BLEU | No DevOps integration
Wang et al. [25] | Proprietary | Low-code ChatOps + LLM | BLEU, human | Peripheral CMG focus
This work (protocol) | CommitBench | Multi-LLM ensemble | BLEU, ROUGE, METEOR, human | Protocol only; results forthcoming
Table 4. Research Questions and motivations.
Research Question | Motivation
RQ1: How do modern LLMs (e.g., ChatGPT, DeepSeek, LLaMA) compare in generating high-quality commit messages? | Identify the most effective model for practical commit message generation to guide academic benchmarking and tool selection.
RQ2: How well do automated metrics (BLEU, ROUGE, and METEOR) align with human perceptions (clarity, informativeness, relevance)? | Test whether common metrics reflect human-centered quality; determine if additional evaluation is needed [2].
RQ3: How does performance vary across commit types, domains, and languages within CommitBench? | Support generalization/robustness analysis across diverse software contexts.
Table 5. CommitBench record fields used in this study.
Field | Type | Description
hash | string (40 chars) | Git commit SHA for traceability and deduplication.
diff | string (unified diff) | Normalized patch (added/removed hunks); primary model input.
message | string | Reference commit message (supervision/evaluation).
project | string | Repository identifier (stratified splitting/domain analysis).
split | categorical | Dataset partition: train/validation/test.
diff_languages | string/set | Languages inferred from changed files (e.g., py, js, php).
Table 6. Comparison of representative datasets for commit message generation.
Criteria/Dataset | CommitGen | CoDiSum | CommitBERT | MCMD | CommitBench
Train size | 26,208 | 75,000 | 276,392 | 1,800,000 | 1,165,213
Valid size | 3000 | 8000 | 34,717 | 225,000 | 249,689
Test size | 3000 | 7661 | 34,654 | 225,000 | 249,688
Repositories | 1000 | 1000 | 52 k | 500 | 72 k
Programming languages | Java | Java | Java, Ruby, JS, Go, PHP, Python | Java, C#, C++, Python, JavaScript | Java, Ruby, JS, Go, PHP, Python
Reproducibility, deduplication, license-aware filtering, and published dataset licenses (Apache 2.0, CC BY-NC) also differ across these datasets; see Schall et al. [17] for the per-dataset indicators.
Table 7. Preprocessing steps and their impacts on evaluation quality.
Step | Motivation | Observed Impact
Deduplication/filter merges | Remove trivial/duplicate commits | Prevents data leakage; avoids inflated BLEU from repeated examples. In our case, no duplicates were found after normalization.
Lowercasing + Unicode normalization | Standardize tokens across repositories/languages | Stabilizes BLEU/ROUGE; improves cross-language consistency.
Preserve identifiers and punctuation | Identifiers are semantically critical in CMG | Avoids semantic drift; improves human judgments even if BLEU unchanged.
Length bounds (min/max) | Filter trivial commits (e.g., “update”) and very large diffs | Avoids skew; improves interpretability of error taxonomy.
Subword tokenization (BPE) | Handle rare identifiers and compound tokens | Better generalization to unseen projects; reduces OOV errors.
Stratified splitting | Balance projects and languages across train/val/test | Prevents leakage; yields more reliable macro-level reporting. 1
1 We retain the official CommitBench splits and use a separate balanced subset only for language-fair evaluation.
Table 8. Summary of preprocessing steps and resulting dataset size.
Category Count
Raw total (after normalization) 1,165,213
Duplicates removed 0
Bot-like commits removed 122
Trivial commits removed 9294
Length-based filtering 871
Final cleaned total 1,154,926
Balanced evaluation set 576,342 (96,057 per language)
Table 9. Evaluation dimensions, metrics, and reporting conventions used in this study.
Dimension | Metric | Definition/Purpose | Report as
Reference-based | BLEU-4 | 4-gram precision with brevity penalty; standard overlap proxy for short commit messages [4,5]. | Mean ± 95% CI; micro & macro.
Reference-based | ROUGE-Lsum | Longest common subsequence; captures sequence-level overlap robust to small reorderings [4]. | Mean ± 95% CI; micro & macro.
Reference-based | METEOR | Stem/synonym-aware alignment; stronger correlation on short texts [5]. | Mean ± 95% CI; micro & macro.
Human judgment | Clarity (1–5) | Is the message easy to read and unambiguous? Two raters per sample. | Mean, median; κ/α for agreement.
Human judgment | Informativeness (1–5) | Does it capture the essential what and the relevant why? [3]. | Mean, median; κ/α.
Human judgment | Relevance (1–5) | Does it accurately reflect the given diff without scope drift? | Mean, median; κ/α.
Human judgment | Error taxonomy (E1–E8) | Qualitative failure modes: missing what/why, hallucination, scope drift, style violations, ambiguity, incorrect rationale, and formatting issues. | Prevalence (%), per-model breakdown.
Operational | Latency | End-to-end generation time (ms). | p50/p95; per-model.
Operational | Tokens (prompt/gen) | Token counts for input and output; proxy for cost/limits. | Mean, p95; per-model.
Operational | Unit cost | Estimated USD/1K tokens (if applicable). | Mean; sensitivity range.
Operational | Style compliance | Share of outputs meeting guidelines (imperative mood, concise, correct scope, ticket reference). | Rate (%); per-model.
Operational | Length (chars/words) | Distribution of output size for readability and policy checks. | Mean, p95.
