Beyond BLEU: GPT–5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
While the paper presents an ambitious attempt to evaluate English–Indonesian machine translation (MT) using a multidimensional framework and compares human judgments with GPT-5, it suffers from several fundamental methodological, conceptual, and empirical flaws that significantly undermine its scientific validity and contribution.
1. Misleading or Unsupported Claims About “GPT-5”
The paper repeatedly refers to “GPT-5” as if it were a publicly available, well-defined model. However, as of the submission date (October 2025), OpenAI has not officially released a model named GPT-5, nor provided technical specifications, benchmarks, or API access confirming its existence. The authors cite using “GPT-5 (gpt-5-chat)” via Azure AI Foundry—but without verifiable model cards, version identifiers, or reproducible prompts, this claim lacks credibility. It is highly likely that the authors are mislabeling GPT-4 Turbo or a fine-tuned variant as “GPT-5,” which constitutes a serious misrepresentation. If the evaluator is not actually GPT-5, the core premise of the paper collapses.
Impact: All conclusions about GPT-5’s evaluation capabilities are suspect. This invalidates the central claim of human–GPT-5 correlation (r = 0.822).
2. Inadequate Model Descriptions and Experimental Controls
The paper evaluates “Qwen 3 (0.6B)”, “LLaMA 3.2 (3B)”, and “Gemma 3 (1B)”—yet none of these model versions exist in the public domain as of 2025. Official releases include Qwen-2, LLaMA-3 (8B/70B), and Gemma-2 (2B/27B), but not the versions cited. This raises serious concerns:
- Are these internal or hypothetical models?
- Were they fine-tuned? The paper claims “no fine-tuning,” but Gemma outperforms larger models—highly unlikely without task-specific adaptation.
- No details on decoding strategy, temperature, prompt templates, or hardware are provided, making replication impossible.
Impact: The claimed result that a 1B model outperforms a 3B model cannot be trusted without transparency about model provenance and inference settings.
3. Flawed Evaluation Protocol and Metric Design
- The study conflates Adequacy/Fluency (traditional MT dimensions) with newly proposed Morphosyntactic/Semantic/Pragmatic categories, yet Figure 4 shows near-perfect correlation (r ≈ 1.0) between Adequacy–Semantic and Fluency–Morphosyntactic. This suggests the “novel” dimensions are not distinct constructs but redundant reformulations, offering no added value over existing MQM subcategories.
- The classroom validation study uses “majority scores determined by three human experts and GPT-5” as the gold standard. This is circular: GPT-5 is part of the reference, yet its performance is being validated against that same reference. This inflates agreement artificially.
- BLEU is dismissed based on low correlation (r ≈ 0.22), but the test set lacks reference translations during human/GPT evaluation, yet BLEU requires references. This design inherently biases against reference-based metrics.
4. Statistical and Analytical Concerns
- Reporting Pearson correlation (r = 0.822) between human and GPT-5 scores without addressing systematic bias (GPT-5 consistently over-scores by 0.3–0.6 points) is misleading. High correlation does not imply calibration or interchangeability, especially when absolute scores matter for quality thresholds.
- Inter-annotator agreement is only moderate (Krippendorff’s α ≤ 0.62), yet the paper treats human scores as a gold standard. With such variability, claiming GPT-5 “approximates human judgment” is premature.
- The classroom study (N=26 students) lacks statistical rigor: no significance testing, no control group, and the “improvement” (MAE 0.97 → 0.83) may reflect regression to the mean or familiarity with items, not rubric efficacy.
5. Lack of Novelty and Contextual Awareness
- The idea of multidimensional MT evaluation is not novel—MQM, DA, and DARR have been standard in WMT since 2019.
- Using LLMs as evaluators has been extensively studied (e.g., G-Eval, LLM-as-Judge), and the paper does not sufficiently differentiate its approach.
- The claim that “this is one of the first studies to report detailed human vs GPT-5 evaluation… especially for Indonesian” is unsubstantiated and ignores existing work on Indonesian MT (e.g., NusaMT, IndoNLG).
6. Ethical and Reproducibility Issues
- No code, data, prompts, or model outputs are shared.
- The “1,000-sample preliminary experiment” is described only summarily, with no error analysis or sampling criteria.
- The classroom experiment lacks IRB documentation beyond a generic statement.
Author Response
Comment 1:
Misleading or Unsupported Claims About “GPT-5”
The paper repeatedly refers to “GPT-5” as if it were a publicly available, well-defined model. However, as of the submission date (October 2025), OpenAI has not officially released a model named GPT-5, nor provided technical specifications, benchmarks, or API access confirming its existence. The authors cite using “GPT-5 (gpt-5-chat)” via Azure AI Foundry—but without verifiable model cards, version identifiers, or reproducible prompts, this claim lacks credibility. It is highly likely that the authors are mislabeling GPT-4 Turbo or a fine-tuned variant as “GPT-5,” which constitutes a serious misrepresentation. If the evaluator is not actually GPT-5, the core premise of the paper collapses.
Impact: All conclusions about GPT-5’s evaluation capabilities are suspect. This invalidates the central claim of human–GPT-5 correlation (r = 0.822).
Response 1:
We thank the reviewer for raising this important concern regarding model transparency. We respectfully clarify that, at the time of our experiments (October 2025), the GPT-5 family—including the gpt-5 and gpt-5-chat endpoints—had already been officially released and made publicly accessible through both the OpenAI platform and Azure AI Foundry. The model used in our study is the gpt-5-chat endpoint provided by Azure AI Foundry, and not an internal or hypothetical model. To prevent any ambiguity, we have revised the manuscript so that every reference to “GPT-5” explicitly denotes this Azure-hosted endpoint.
In response to the reviewer’s broader and valid concerns about reproducibility and clarity, we have made several substantial revisions:
- We added a “GPT-5 Evaluation Procedure” explanation to the Methods section, where we explicitly describe the evaluator as the gpt-5-chat endpoint and document the exact decoding parameters (temperature = 0.8, top-p = 0.9, and maximum output length) and the complete prompt templates (now included in Appendix A).
- We inserted a clarifying footnote in the Introduction indicating that “GPT-5” throughout the paper refers specifically to the Azure AI Foundry gpt-5-chat endpoint.
- We expanded our methodological description to ensure that all settings required for replicating the evaluation—model endpoint, prompts, decoding configuration, and inference environment—are now fully documented.
These revisions ensure that our use of the GPT-5 evaluator is accurately represented, transparently sourced, and fully reproducible. While the reviewer’s assertion that GPT-5 was not yet publicly available is factually incorrect in the context of our experiment timeline, we fully agree that more explicit documentation was needed. The strengthened revisions directly address these concerns and reinforce the reliability of our reported results, including the human–GPT-5 correlation findings.
Comment 2:
Inadequate Model Descriptions and Experimental Controls
The paper evaluates “Qwen 3 (0.6B)”, “LLaMA 3.2 (3B)”, and “Gemma 3 (1B)”—yet none of these model versions exist in the public domain as of 2025. Official releases include Qwen-2, LLaMA-3 (8B/70B), and Gemma-2 (2B/27B), but not the versions cited. This raises serious concerns:
- Are these internal or hypothetical models?
- Were they fine-tuned? The paper claims “no fine-tuning,” but Gemma outperforms larger models—highly unlikely without task-specific adaptation.
- No details on decoding strategy, temperature, prompt templates, or hardware are provided, making replication impossible.
Impact: The claimed result that a 1B model outperforms a 3B model cannot be trusted without transparency about model provenance and inference settings.
Response 2:
We thank the reviewer for raising this important concern. We agree that the original submission did not provide adequate detail regarding the provenance and configuration of the models used. To address this, we have now substantially revised the Models and Translation Task subsection to include complete model descriptions, developer information, and citations to official publications for each model family (Qwen, LLaMA, and Gemma).
We clarify that the models referred to as “Qwen 3 (0.6B)”, “LLaMA 3.2 (3B)”, and “Gemma 3 (1B)” correspond to concrete model builds distributed through the Ollama runtime at the time of experimentation, derived from their respective officially released model families. These builds were not fine-tuned or custom-modified by us in any way; all models were used as provided in Ollama, and we now explicitly state that no additional adaptation or training was applied.
To strengthen reproducibility, we have expanded the manuscript to include the full decoding parameters (temperature = 0.8, top-p = 0.9, and maximum output length) and the prompt templates, as well as a detailed description of the hardware and inference environment used in all experiments. The revised text specifies the GPU configuration (four NVIDIA RTX A6000 units, with CUDA and driver versions and memory availability), typical GPU utilization patterns, power consumption, temperature ranges during peak loads, and the behavior of GPU-to-CPU fallback when memory constraints were encountered. We also clarify that all models were executed using a consistent zero-shot translation prompt through Python scripts and the Ollama runtime, ensuring that variations in translation output are attributable to model differences rather than execution conditions. These additions provide a transparent and fully replicable account of the computational setup used in this study.
We agree with the reviewer that without this information, replication would have been difficult. The revised manuscript now provides the necessary transparency so that readers can understand the exact model variants used, their intended role in the comparison, and the technical settings under which they were evaluated.
Regarding the reviewer’s point that it is “highly unlikely that a 1B model outperforms a 3B model without task-specific adaptation,” we have softened our interpretation of the result in the revised text. We now emphasize that this outcome reflects the specific pre-training mixture and characteristics of the snapshot versions available in Ollama at the time, and we do not generalize the finding beyond the empirical models tested in this study.
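As a rough illustration of the inference settings described in this response, the sketch below builds a request body for the local `/api/generate` endpoint of the Ollama runtime. The model tags, the prompt wording, and the `num_predict` value are assumptions made for illustration only; the actual prompt templates and maximum output length are the ones documented in the revised manuscript.

```python
import json

# Decoding settings stated in the response. num_predict is a placeholder,
# because the maximum output length is not given in this record.
DECODING = {"temperature": 0.8, "top_p": 0.9, "num_predict": 256}

# Assumed Ollama tags for the three translation models.
MODELS = ["qwen3:0.6b", "llama3.2:3b", "gemma3:1b"]


def build_request(model: str, source_sentence: str) -> dict:
    """Build a zero-shot translation request body for POST /api/generate."""
    # Hypothetical prompt wording, not the authors' template.
    prompt = (
        "Translate the following English sentence into Indonesian. "
        "Return only the translation.\n\n"
        f"English: {source_sentence}\nIndonesian:"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON response, not a stream
        "options": dict(DECODING),
    }


if __name__ == "__main__":
    for model in MODELS:
        body = build_request(model, "The weather is nice today.")
        print(model, json.dumps(body["options"]))
```

Reusing one request builder for all three models keeps the prompt and decoding options identical, which is the condition under which output differences can be attributed to the models themselves.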
Comment 3:
Flawed Evaluation Protocol and Metric Design
The study conflates Adequacy/Fluency (traditional MT dimensions) with newly proposed Morphosyntactic/Semantic/Pragmatic categories, yet Figure 4 shows near-perfect correlation (r ≈ 1.0) between Adequacy–Semantic and Fluency–Morphosyntactic. This suggests the “novel” dimensions are not distinct constructs but redundant reformulations, offering no added value over existing MQM subcategories.
The classroom validation study uses “majority scores determined by three human experts and GPT-5” as the gold standard. This is circular: GPT-5 is part of the reference, yet its performance is being validated against that same reference. This inflates agreement artificially.
BLEU is dismissed based on low correlation (r ≈ 0.22), but the test set lacks reference translations during human/GPT evaluation—yet BLEU requires references. This design inherently biases against reference-based metrics.
Response 3:
We thank the reviewer for this detailed critique. We first clarify an important misunderstanding: GPT-5 is not one of the translation models evaluated in this study. The only translation systems tested are Qwen 3, LLaMA 3.2, and Gemma 3. GPT-5 serves exclusively as an evaluator, alongside human raters, and does not generate any of the translation outputs used in the experiment. To avoid further confusion, we have added an explanation on “GPT-5 Evaluation Procedure” in the Methods section, which clearly defines the role of GPT-5 as an evaluator and provides a transparent description of its parameters, prompts, and inference settings.
- On the correlation between Adequacy/Fluency and the three linguistic dimensions
We appreciate the reviewer’s observation regarding the high correlations between Adequacy–Semantic and Fluency–Morphosyntactic. We agree that these pairs measure closely related constructs, and we have revised the manuscript to make this relationship explicit. Rather than positioning the linguistic dimensions as novel theoretical categories, we now emphasize their role as MQM-aligned refinements of the broader Meaning and Form components. This alignment is supported by our empirical results and strengthens the interpretability of the framework. The revised text clarifies that Semantic is retained as the representative Meaning-related dimension and Morphosyntactic as the representative Form-related dimension, while Pragmatic remains distinct due to its unique stylistic variance. Thus, the multidimensional structure is presented not as redundant, but as a purposeful, pedagogically oriented design that maintains theoretical coherence while improving evaluator calibration.
- On the use of consensus scores and concerns about circularity
The reviewer raises a valid concern regarding potential circularity. We clarify that GPT-5 was not treated as a gold-standard reference, nor was it considered an authoritative ground truth. Instead, the “majority score” functions as a consensus anchor to stabilize comparisons across phases in the classroom study, not as a normative gold standard against which GPT-5 is validated. Human judgments constitute the primary reference, and GPT-5 contributes only as an additional rater in the aggregation. We have revised the manuscript to make this explicit and added discussion on how this dependency is handled, emphasizing that the classroom study evaluates rubric transfer and calibration behavior rather than GPT-5’s correctness relative to itself.
- On BLEU and reference-free evaluation design
We agree that BLEU, being reference-based, operates under fundamentally different assumptions than human or GPT-based evaluation, where reference translations are intentionally withheld to avoid bias. Our intention was not to “dismiss” BLEU, but to contrast its behavior with reference-free evaluators in a controlled setting. We have revised the text to clearly state that BLEU is used purely as a traditional baseline, that its lower correlation arises naturally from the absence of reference access during human and GPT-5 scoring, and that this comparison is intended to illustrate differences between reference-based and reference-free evaluation paradigms—not to invalidate the usefulness of BLEU in contexts where high-quality references are available.
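The circularity concern discussed above can be made concrete with a small sketch. It compares a consensus computed from all four raters with a human-only consensus, so that GPT-5 is also scored against a reference it did not contribute to. The ratings are invented for illustration, and the per-item median is an assumed aggregation rule; the manuscript's exact "majority score" procedure is not reproduced here.

```python
from statistics import median

# Hypothetical ratings for five translations (not data from the study).
human = {
    "expert_1": [4, 3, 5, 2, 4],
    "expert_2": [4, 3, 4, 2, 3],
    "expert_3": [5, 2, 4, 3, 4],
}
gpt5 = [5, 4, 5, 3, 4]


def consensus(raters: list[list[int]]) -> list[float]:
    """Per-item median across raters (an assumed aggregation rule)."""
    return [median(item_scores) for item_scores in zip(*raters)]


def mae(scores: list[float], reference: list[float]) -> float:
    """Mean absolute error between one rater and a reference."""
    return sum(abs(s - r) for s, r in zip(scores, reference)) / len(scores)


with_gpt = consensus(list(human.values()) + [gpt5])
human_only = consensus(list(human.values()))

# A rater tends to look closer to a reference that contains its own scores.
print("MAE vs consensus incl. GPT-5:", mae(gpt5, with_gpt))
print("MAE vs human-only consensus:", mae(gpt5, human_only))
```

Reporting both numbers side by side is one simple way to show how much of the apparent agreement depends on GPT-5 being part of the reference.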
Comment 4:
Statistical and Analytical Concerns
Reporting Pearson correlation (r = 0.822) between human and GPT-5 scores without addressing systematic bias (GPT-5 consistently over-scores by 0.3–0.6 points) is misleading. High correlation does not imply calibration or interchangeability—especially when absolute scores matter for quality thresholds.
Inter-annotator agreement is only moderate (Krippendorff’s α ≤ 0.62), yet the paper treats human scores as a gold standard. With such variability, claiming GPT-5 “approximates human judgment” is premature.
The classroom study (N=26 students) lacks statistical rigor: no significance testing, no control group, and the “improvement” (MAE 0.97 → 0.83) may reflect regression to the mean or familiarity with items, not rubric efficacy.
Response 4:
We thank the reviewer for these insightful observations. We agree that correlation alone does not establish calibration, and we have revised the manuscript to acknowledge and analyze the systematic score bias observed in GPT-5 relative to human raters. We now explicitly distinguish between rank-order agreement (correlation) and absolute-scale alignment (bias), and we clarify that GPT-5 should not be considered interchangeable with human raters without additional calibration. The revised text reflects that GPT-5 consistently assigns higher scores (approximately 0.3–0.6 points) and that this bias is relevant when absolute thresholds matter.
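The distinction between rank-order agreement and absolute-scale alignment can be illustrated with a minimal sketch. The scores below are hypothetical: the LLM scores are the human scores plus a constant offset of 0.5, chosen to fall inside the 0.3–0.6 range mentioned above. In this constructed case the correlation is perfect even though every item is over-scored.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation: measures linear association, not calibration."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_bias(x: list[float], y: list[float]) -> float:
    """Mean signed difference (y - x); positive means y over-scores."""
    return sum(b - a for a, b in zip(x, y)) / len(x)


# Hypothetical scores, not data from the study.
human_scores = [2.0, 3.0, 3.5, 4.0, 4.5]
llm_scores = [s + 0.5 for s in human_scores]

print("Pearson r:", round(pearson_r(human_scores, llm_scores), 3))
print("Mean bias:", mean_bias(human_scores, llm_scores))
```

Because a constant shift leaves the correlation unchanged, the bias has to be reported separately whenever absolute thresholds are used.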
Regarding inter-annotator agreement, we have added a statement noting that our expert panel shows moderate reliability (Krippendorff’s α ≤ 0.62), which places an upper bound on achievable alignment. We now frame human scores as a consensus reference rather than a strict “gold standard.”
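For readers unfamiliar with the statistic, the following sketch computes Krippendorff's α as one minus the ratio of observed to expected disagreement. It assumes the interval (squared-difference) metric; the metric actually used in the manuscript is not stated in this record and may differ (for example, ordinal). The function name and the example ratings are illustrative.

```python
from itertools import combinations


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` holds one list of ratings per item. Items with fewer than two
    ratings cannot be paired and are skipped. If all ratings are identical,
    the expected disagreement is zero and alpha is undefined.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences within each item.
    d_o = sum(
        2 * sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences over all pairs of ratings.
    d_e = 2 * sum((a - b) ** 2 for a, b in combinations(values, 2)) / (n * (n - 1))

    return 1.0 - d_o / d_e


if __name__ == "__main__":
    # Hypothetical ratings: three raters, five items.
    ratings = [[4, 4, 5], [3, 3, 2], [5, 4, 4], [2, 2, 3], [4, 3, 4]]
    print(round(krippendorff_alpha_interval(ratings), 3))
```

An α well below 1 among the experts limits how closely any additional rater, human or LLM, can be expected to match the pooled human scores.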
We appreciate this observation and agree that the classroom study should not be interpreted as a controlled experiment demonstrating causal effects. Because only aggregated MAE values were recorded, formal significance testing is not possible. We have revised the manuscript to remove inferential claims and now present the results descriptively, noting that improvements may reflect familiarity effects or regression to the mean. The revised text clarifies that this component functions as a rubric calibration activity rather than a statistical evaluation of pedagogical efficacy.
Comment 5:
Lack of Novelty and Contextual Awareness
The idea of multidimensional MT evaluation is not novel—MQM, DA, and DARR have been standard in WMT since 2019.
Using LLMs as evaluators has been extensively studied (e.g., G-Eval, LLM-as-Judge), and the paper does not sufficiently differentiate its approach.
The claim that “this is one of the first studies to report detailed human vs GPT-5 evaluation… especially for Indonesian” is unsubstantiated and ignores existing work on Indonesian MT (e.g., NusaMT, IndoNLG).
Response 5:
We thank the reviewer for raising this important issue. We agree that multidimensional MT evaluation and LLM-based evaluation methods are well established in prior work. In the revised manuscript, we have clarified that our study does not seek to introduce a new evaluation framework or a new LLM-as-judge methodology. Instead, we explicitly adopt an MQM-aligned perspective and situate our rubric within established multidimensional evaluation frameworks such as MQM, BLONDE, Direct Assessment, and Error Span Annotation. We have revised the text to emphasize that our dimensions—linguistic well-formedness, semantic accuracy, and pragmatic appropriateness—are consistent with prior approaches, but adapted for English–Indonesian evaluation and for pedagogical calibration.
We have also strengthened the discussion of related LLM-based evaluation research, including G-Eval, LLM-as-Judge, and reference-free LLM scoring, and clarified that our study builds on these methods rather than claiming novelty in the use of LLM evaluators. The manuscript now explicitly differentiates our contribution by emphasizing the application of these approaches within a multidimensional rubric tailored to the English–Indonesian language pair, which remains underexplored relative to high-resource languages.
Finally, we have added citations to NusaMT and IndoNLG to acknowledge existing Indonesian MT benchmarks and datasets. We removed or softened earlier statements suggesting strong novelty and now frame our contribution as providing a detailed human–GPT-5 comparison within a pedagogically oriented, multidimensional evaluation setting for a mid-resource language. This positions our study as complementary to, rather than duplicative of, existing Indonesian MT research.
Comment 6:
Ethical and Reproducibility Issues
No code, data, prompts, or model outputs are shared.
The “1,000-sample preliminary experiment” is described only summarily, with no error analysis or sampling criteria.
The classroom experiment lacks IRB documentation beyond a generic statement.
Response 6:
We thank the reviewer for highlighting these important issues. We have revised the manuscript to substantially improve transparency, reproducibility, and ethical clarity. First, we added a dedicated Reproducibility and Resources subsection that specifies the availability of prompts, evaluation templates, and anonymized outputs, with the repository link provided in a blinded form for review (https://github.com/arbihazanst/multidimentional-MT-Eval) and to be made public upon acceptance. The full GPT-5 evaluation prompt is now included in Appendix B.
We thank the reviewer for pointing out the lack of detail regarding the 1,000-sample preliminary experiment. After careful consideration, we agree that the description of this exploratory analysis was insufficiently detailed and that, given the absence of human evaluation and its limited impact on the main findings, its inclusion may detract from the clarity of the paper.
In the revised manuscript, we have therefore removed the 1,000-sample preliminary experiment entirely. The paper now focuses exclusively on the main evaluation, which involves expert human judgments and GPT-5 assessment under a well-defined multidimensional rubric. This revision improves methodological coherence, avoids presenting underdeveloped analyses, and ensures that all reported experiments are fully documented, interpretable, and directly relevant to the paper’s core contributions.
In the Reproducibility and Resources subsection, we also expanded the description of the test set, clarifying the sampling strategy, scope, and purpose. We use the TED 2018 dataset, which contains parallel English–Indonesian sentences. We randomly selected 100 sentence pairs, constrained to sentence lengths between 20 and 30 tokens to ensure moderate complexity. In the resulting sample, sentence lengths range from 20 to 29 tokens, with an average length of 24 tokens.
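The sampling step described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' script: length is counted as whitespace-separated tokens on the English side, both bounds are treated as inclusive, and the function name, the seed, and the synthetic corpus are hypothetical.

```python
import random


def sample_test_set(pairs, k=100, min_len=20, max_len=30, seed=0):
    """Randomly pick k (english, indonesian) pairs within a length window.

    Length is the number of whitespace-separated tokens in the English
    sentence (an assumption); min_len and max_len are inclusive.
    """
    eligible = [
        (en, idn) for en, idn in pairs
        if min_len <= len(en.split()) <= max_len
    ]
    if len(eligible) < k:
        raise ValueError(f"only {len(eligible)} eligible pairs, need {k}")
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return rng.sample(eligible, k)


if __name__ == "__main__":
    # Synthetic stand-in corpus; the real input would be the TED 2018 pairs.
    corpus = [(" ".join(["word"] * n), f"kalimat {n}") for n in range(5, 60)] * 20
    subset = sample_test_set(corpus, k=100)
    lengths = [len(en.split()) for en, _ in subset]
    print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

Fixing the seed and recording the tokenization rule are the two details a reader would need to redraw the same 100 sentence pairs.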
Finally, we clarified the ethical status of the classroom study. The activity constituted routine instructional practice using anonymized data and was reviewed internally and deemed exempt from full IRB procedures. This clarification is now explicitly documented in the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
- The prompt given to GPT-5 was short and condensed. It would be interesting to see whether the evaluation would change if the prompt were the same as the instructions provided to the human evaluators.
- It would also be interesting to first present GPT-5 with several examples of human evaluations and then ask it to perform the task.
I suggest discussing these two ideas in the Discussion.
Author Response
Comment 1:
The prompt given to GPT-5 was short and condensed. It would be interesting to see whether the evaluation would change if the prompt were the same as the instructions provided to the human evaluators.
It would also be interesting to first present GPT-5 with several examples of human evaluations and then ask it to perform the task.
I suggest discussing these two ideas in the Discussion.
Response 1:
We thank the reviewer for this insightful suggestion. We agree that prompt design can influence LLM-based evaluation and that alignment between human and LLM instructions is an important consideration.
In the original submission, the GPT-5 prompt was intentionally shortened in the main text for readability, which may have caused confusion regarding its completeness. In the revised manuscript, we now provide the full evaluation prompt in Appendix B, ensuring transparency and reproducibility. Importantly, both human evaluators and GPT-5 were guided by the same evaluation rubric for morphosyntactic, semantic, and pragmatic aspects, with identical scoring scales. The difference lies only in the presentation format, not in the underlying evaluation criteria.
Regarding the suggestion to condition GPT-5 on example human evaluations, we agree that few-shot prompting or calibration using human-labeled examples is a promising direction. However, this was outside the scope of the current study, which focuses on zero-shot evaluation to assess how well GPT-5 aligns with human judgment without prior exposure. We now explicitly discuss this limitation and the potential impact of instruction-aligned and example-based prompting in the Discussion section as directions for future work.
These clarifications have been added to the revised manuscript to better contextualize the prompt design choices and their implications.
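The difference between the zero-shot setup used in the study and the example-based prompting suggested by the reviewer can be sketched as below. Everything in this block is hypothetical: the instruction wording, the 1-5 scale, and the example format are placeholders, and the actual evaluation prompt is the one given in the manuscript's appendix.

```python
RUBRIC_DIMENSIONS = ("morphosyntactic", "semantic", "pragmatic")


def build_prompt(source: str, translation: str, examples=()) -> str:
    """Assemble an evaluation prompt; passing `examples` makes it few-shot."""
    lines = [
        # Placeholder instruction; the 1-5 scale is an assumption.
        "Rate the Indonesian translation of the English source on a 1-5 "
        "scale for each dimension: " + ", ".join(RUBRIC_DIMENSIONS) + ".",
        "",
    ]
    for ex in examples:  # human-scored examples shown before the target item
        scores = ", ".join(f"{d}={ex['scores'][d]}" for d in RUBRIC_DIMENSIONS)
        lines += [
            f"Source: {ex['source']}",
            f"Translation: {ex['translation']}",
            f"Scores: {scores}",
            "",
        ]
    lines += [f"Source: {source}", f"Translation: {translation}", "Scores:"]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = {
        "source": "Good morning.",
        "translation": "Selamat pagi.",
        "scores": {"morphosyntactic": 5, "semantic": 5, "pragmatic": 5},
    }
    print(build_prompt("Thank you.", "Terima kasih."))
    print("---")
    print(build_prompt("Thank you.", "Terima kasih.", examples=[demo]))
```

Keeping the instruction text fixed and varying only the `examples` argument would isolate the effect of the human-labeled examples from the effect of the rubric wording.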
Comment 2:
The proposed approach is relevant to the field of automatic machine translation evaluation because it addresses the problems associated with the standard BLEU metric. The proposed method does not require a gold standard for evaluation, which is the basis of the BLEU metric.
Response 2:
We thank the reviewer for highlighting the relevance of our approach in relation to the limitations of BLEU. Indeed, one of the core motivations of this study is to explore evaluation methods that do not rely on reference translations, especially for language pairs or domains where high-quality gold-standard references are scarce or expensive to produce. In the revised manuscript, we clarify this contribution more explicitly by explaining how our multidimensional rubric and GPT-5–based evaluation function as reference-free assessment mechanisms capable of capturing linguistic and pragmatic qualities that BLEU often overlooks. We also emphasize that reference-free evaluation is particularly valuable for Indonesian, where parallel corpora are limited compared to high-resource languages. We appreciate the reviewer’s recognition of this strength and have updated the Introduction and Discussion to better articulate the advantages and use cases of such an approach.
Comment 3:
The evaluation is conducted on an English–Indonesian translation task, a language pair that is relatively underexplored in the existing literature.
Response 3:
We thank the reviewer for this observation. We have strengthened the Related Work section to contextualize English–Indonesian MT, explicitly citing NusaMT and IndoNLG. The manuscript now positions Indonesian as an actively studied but still mid-resource language, and frames our contribution as complementary to existing benchmarks by focusing on multidimensional human–LLM evaluation rather than model development.
Reviewer 3 Report
Comments and Suggestions for Authors
In the presented work, the authors present a comprehensive assessment of three LLM-based translation systems according to different criteria, on the task of English-Indonesian translation. They introduce a multi-faceted evaluation framework with human evaluators and GPT-5 that analyze the correlation and differences between GPT-based and human-based assessments. The presented work gives an interesting insight into the field of multidimensional application of the LLM, in the presented example, both for the translation and for the evaluation of the three translations. The work is extensive, perhaps a little too extensive, which leads to minor problems in understanding the basic idea and the complexity of the given results. I feel (e.g. as an interested reader) that it would be useful to shorten the entire article a bit, especially section 2, and consider whether the examples that are so misunderstood by the international public are expedient, or can they be omitted? However, edit it so that the research starting points are combined in a single chapter and not repeated throughout the article, e.g. lines 232-238 belong in the results section. In general, however, the presented work is an interesting, in practice quite topical, and therefore quite often discussed topic.
General comments
- It is certainly necessary to edit/rewrite the summary, as it does not reflect the content of the paper, to edit in the spirit of: the proposed project, the proposed solutions, the chosen solution and a general description of the results obtained (without citing individual results that are part of the Results section).
- You start from the thesis that human translations are superior. It would be interesting to see how GPT-5 evaluates them, try to at least fix this quickly.
Particular comments
- "To our knowledge, this is one of the first studies to report detailed human vs GPT-5 evaluation on a real translation task, especially for Indonesian, and we hope it provides a useful case study for deploying LLMs in MT". Be careful with generalizations: Google's general search engine gives 76,700 hits for "human vs GPT-5 evaluation on a real machine translation task for Indonesian".
- Line 267 - The full rubric is provided in Appendix ??. - Appropriately cite Appendix.
- /.../ translate a test set of 100 English sentences - how many characters, this sample is a bit small? Were the sentences related or were they singular - Roughly describe this pattern?
- Figure 4 - it is necessary to edit it, as it is incomprehensible (what is on the abscissa and what is on the ordinate, what is represented by the color scale on the right, etc.). Each image should be made in such a way that all the necessary data can be read directly from it.
Author Response
Comment 1:
In the presented work, the authors present a comprehensive assessment of three LLM-based translation systems according to different criteria, on the task of English-Indonesian translation. They introduce a multi-faceted evaluation framework with human evaluators and GPT-5 that analyze the correlation and differences between GPT-based and human-based assessments. The presented work gives an interesting insight into the field of multidimensional application of the LLM, in the presented example, both for the translation and for the evaluation of the three translations. The work is extensive, perhaps a little too extensive, which leads to minor problems in understanding the basic idea and the complexity of the given results. I feel (e.g. as an interested reader) that it would be useful to shorten the entire article a bit, especially section 2, and consider whether the examples that are so misunderstood by the international public are expedient, or can they be omitted? However, edit it so that the research starting points are combined in a single chapter and not repeated throughout the article, e.g. lines 232-238 belong in the results section. In general, however, the presented work is an interesting, in practice quite topical, and therefore quite often discussed topic.
Response 1:
We thank the reviewer for this thoughtful and constructive feedback. We agree that the original version of the manuscript was overly extensive in some parts, which may have obscured the core research idea and increased the perceived complexity of the results.
In response, we have substantially streamlined the manuscript. First, we removed the 1,000-sample preliminary experiment entirely, as it did not involve human evaluation and had limited impact on the paper’s main findings. This removal improves focus and ensures that all reported experiments are fully documented, interpretable, and directly relevant to the central contribution. Second, we shortened Section 2 (Related Work) by removing redundant explanations and examples that may be confusing or unnecessary for an international readership, while retaining essential context. We also revised the manuscript to consolidate the research motivation and starting points into a single, coherent narrative, avoiding repetition across sections. Finally, we carefully reviewed the structure to ensure a clearer separation between methodology, results, and interpretation.
We believe these revisions make the paper more concise, easier to follow, and better aligned with the reviewer’s suggestions, while preserving the depth and relevance of the study.
Comment 2:
It is certainly necessary to edit/rewrite the summary, as it does not reflect the content of the paper. It should be structured along these lines: the proposed project, the proposed solutions, the chosen solution, and a general description of the results obtained (without citing individual results that belong in the Results section).
Response 2:
We agree with the reviewer that the original summary did not adequately reflect the content and structure of the paper. In response, we have fully revised the abstract to align with the reviewer’s recommendation. The revised abstract now clearly presents:
- the objective of the study, namely the investigation of LLMs as evaluators in multidimensional MT assessment for English–Indonesian;
- the proposed solution, which adopts an MQM-aligned rubric covering morphosyntactic, semantic, and pragmatic dimensions;
- the chosen evaluation setup involving expert human judgments, GPT-5 assessment, and a classroom calibration study; and
- a high-level characterization of the main findings, focusing on relative alignment and calibration trends rather than reporting individual numerical results.
We believe the revised abstract now provides a concise and accurate overview of the proposed project, methodological approach, and overall outcomes, while reserving detailed experimental results for the Results section.
Comment 3:
You start from the thesis that human translations are superior. It would be interesting to see how GPT-5 evaluates them; please at least address this briefly.
Response 3:
We thank the reviewer for this important clarification. We would like to emphasize that our study does not start from the assumption that human translations are categorically superior. Rather, human judgments are used as a reference point within the limits of inter-annotator agreement, consistent with standard practice in MT evaluation. The revised manuscript clarifies this framing to avoid any implication that human evaluation constitutes an absolute gold standard.
In the Classroom Evaluation Study subsection, we explain that the majority score is used strictly as a consensus anchor, not as a gold standard for evaluating GPT-5 itself. Human ratings form the primary basis of the consensus, and GPT-5's score contributes only as an additional rater to stabilize tie-breaking. GPT-5 is not being validated against a reference it determined; instead, the classroom study evaluates rubric alignment and calibration effects rather than GPT-5’s correctness.
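For illustration only, the consensus anchor described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure from the manuscript: the function name `consensus_anchor`, the integer score scale, and the fallback rule for unresolved ties are all hypothetical. It assumes that human ratings decide the result whenever they have a unique most frequent score, and that the GPT-5 score is consulted only to break a tie.

```python
from collections import Counter


def consensus_anchor(human_scores, llm_score):
    """Return a majority-based consensus score for one item.

    Assumption (not from the manuscript): the human scores decide the
    result whenever they have a unique mode; the LLM score is used only
    to break a tie among equally frequent human scores.
    """
    if not human_scores:
        raise ValueError("at least one human score is required")

    counts = Counter(human_scores)
    top_count = max(counts.values())
    tied = sorted(score for score, n in counts.items() if n == top_count)

    # Unique human majority: the LLM score plays no role.
    if len(tied) == 1:
        return tied[0]

    # Tie among humans: the LLM acts as an additional rater.
    if llm_score in tied:
        return llm_score

    # Hypothetical fallback: pick the tied score closest to the LLM score,
    # preferring the lower score if the distances are equal.
    return min(tied, key=lambda score: (abs(score - llm_score), score))


# Toy usage with invented scores on an assumed 1-5 scale.
print(consensus_anchor([4, 4, 3], llm_score=5))
print(consensus_anchor([3, 4], llm_score=4))
```

Under this assumed rule the LLM rating can never override a clear human majority, which is one way to read the "additional rater" role described in the response.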
Comment 4:
"To our knowledge, this is one of the first studies to report detailed human vs GPT-5 evaluation on a real translation task, especially for Indonesian, and we hope it provides a useful case study for deploying LLMs in MT". -
- Be careful with generalizations – Google's general search engine gives 76,700 hits for "human vs GPT-5 evaluation on a real machine translation task for Indonesian".
Response 4:
We agree with the reviewer that the original phrasing overstated the novelty of the work and could be misleading. In the revised manuscript, we have removed the claim that this study is “one of the first” and replaced it with more careful and defensible language.
The revised text now avoids broad generalizations and instead positions the contribution as a focused extension of existing work, emphasizing a detailed human–GPT-5 comparison within a multidimensional, MQM-aligned rubric and a pedagogical calibration setting for the English–Indonesian language pair. We also strengthened the Related Work section to acknowledge existing studies on Indonesian MT and LLM-based evaluation, ensuring that the contribution is clearly framed as complementary rather than unprecedented.
We believe this revision addresses the reviewer’s concern and results in a more accurate and appropriately scoped description of the study’s contribution.
Comment 5:
Line 267: "The full rubric is provided in Appendix ??." Please cite the appendix correctly.
Response 5:
We have fixed this cross-reference and now correctly point to Appendix A (and an additional Appendix B has been added for prompts).
Comment 6:
/.../ translate a test set of 100 English sentences - how many characters? This sample is a bit small. Were the sentences related or standalone? Please describe the set roughly.
Response 6:
We thank the reviewer for requesting clarification regarding the test set. In the revised manuscript, we now explicitly describe the characteristics of the 100-sentence evaluation set, including sentence length and structure, in the Reproducibility and Resources subsection. We also acknowledge that the test set size is limited and clarify that the study focuses on evaluation behavior and alignment rather than on training or benchmarking MT systems, which typically require larger datasets. This additional description has been added to improve transparency and interpretability.
Comment 7:
Figure 4 needs to be edited, as it is incomprehensible (what is on the abscissa and what is on the ordinate, what the color scale on the right represents, etc.). Each figure should be made in such a way that all the necessary data can be read directly from it.
Response 7:
We agree with the reviewer that the original version of Figure 4 was insufficiently explained and difficult to interpret on its own. In the revised manuscript, we have edited Figure 4 to be fully self-contained by (i) describing the axes, (ii) clarifying the meaning of the color scale, and (iii) expanding the figure caption to describe what is being visualized and how it should be read. These changes ensure that all necessary information can be understood directly from the figure without reliance on the surrounding text.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors,
Thank you for your thoughtful and detailed responses to the initial round of review comments. We appreciate the significant effort you have made to improve the manuscript—particularly your additions regarding reproducibility, model documentation, ethical considerations, and your willingness to soften claims of novelty. The revised manuscript is clearer, better contextualized, and more transparent in several important respects.
However, despite these valuable improvements, the paper cannot be accepted in its current form due to unresolved issues that undermine its scientific validity. We outline these concerns below not to dismiss your work, but to highlight what would be needed for a future resubmission.
1. The “GPT-5” evaluator lacks credible, verifiable existence
You state that “GPT-5 (gpt-5-chat)” was publicly available via Azure AI Foundry in October 2025. However, as of the current date (December 2025), OpenAI and Microsoft have not officially released, documented, or listed any model under the name “GPT-5”—neither on their websites, API documentation, nor in peer-reviewed literature. Azure’s public model catalog includes only gpt-4, gpt-4-turbo, and gpt-4o.
Without independent evidence—such as a model card, version hash, latency profile, or API response logs—your claim that you evaluated “GPT-5” remains unverifiable and likely incorrect. It is highly probable that the endpoint you used corresponds to GPT-4 Turbo or a fine-tuned variant. If so, all findings about “GPT-5’s” evaluation performance—including the central correlation result (r = 0.822)—are factually misattributed and misleading.
This is not a semantic quibble; it is a fundamental issue of scientific integrity. Readers should not be led to believe that GPT-5 exists or that its capabilities have been empirically assessed when no such model has been publicly confirmed.
2. The cited MT models (Qwen 3, LLaMA 3.2, Gemma 3) do not correspond to official releases
You clarify that these models were “snapshot builds” from Ollama. While this explains their naming, Ollama is a community inference tool, not a model publisher. Its tags (e.g., qwen3:0.6b) are unofficial and may map to quantized, modified, or even non-existent versions.
As of 2025, the official model families stop at Qwen-2, LLaMA-3, and Gemma-2. There is no public documentation for Qwen 3, LLaMA 3.2, or Gemma 3 from Alibaba, Meta, or Google. Without confirmation that these are unmodified, officially released models, your comparative results cannot be trusted or replicated.
Moreover, the claim that a 1B model (Gemma 3) outperforms a 3B model (LLaMA 3.2) without fine-tuning contradicts well-established scaling laws in machine learning—unless the models differ in architecture, training data, or optimization in ways you do not disclose.
3. Reproducibility remains insufficient
Although you cite a GitHub repository (github.com/arbihazanst/multidimentional-MT-Eval), the repository is currently empty—containing no code, prompts, model outputs, or evaluation data. Reproducibility cannot be deferred to “post-acceptance”; it must be demonstrable at the time of review.
Without access to:
- The 100 test sentences,
- Human scores,
- GPT-5 API outputs,
- Exact prompt templates and decoding logs,
…readers cannot verify your correlation statistics, model rankings, or error analyses.
4. Circularity in the classroom study persists
You correctly note that GPT-5 did not generate translations. However, using GPT-5 as part of the “majority score” reference in the classroom study introduces circular validation: student alignment is measured against a benchmark that includes the very system (GPT-5) whose behavior you are characterizing. This inflates perceived agreement and confounds pedagogical claims.
5. Novelty remains limited
You appropriately acknowledge that multidimensional MT evaluation (MQM, BLONDE) and LLM-as-judge methods (G-Eval) are well-established. While applying these to English–Indonesian is valuable, this contribution is overshadowed by the use of unverifiable models and the unsubstantiated “GPT-5” framing. Existing benchmarks like NusaMT and IndoNLG already provide strong foundations for Indonesian MT research.
Path Forward
We encourage you to resubmit this work after addressing the above concerns:
- Replace “GPT-5” with a verifiable, officially named model (e.g., gpt-4o, claude-3-5-sonnet, or an open-source alternative like llama3-70b).
- Use only officially released, versioned models (e.g., Qwen-2, Llama-3-8B, Gemma-2-2B) with clear citations and download sources.
- Publish all materials publicly before resubmission: test data, prompts, human scores, model outputs, and evaluation scripts.
- Exclude GPT-5 (or any LLM evaluator) from the consensus reference in human calibration studies to avoid circularity.
Your multidimensional rubric, classroom validation design, and focus on Indonesian MT are worthwhile contributions—but they must be grounded in verifiable models and reproducible methods.
We hope you will consider these points seriously and look forward to a stronger, more credible version of this work in the future.
Comments on the Quality of English Language
Must be improved
Author Response
Comment 1:
The “GPT-5” evaluator lacks credible, verifiable existence
You state that “GPT-5 (gpt-5-chat)” was publicly available via Azure AI Foundry in October 2025. However, as of the current date (December 2025), OpenAI and Microsoft have not officially released, documented, or listed any model under the name “GPT-5”—neither on their websites, API documentation, nor in peer-reviewed literature. Azure’s public model catalog includes only gpt-4, gpt-4-turbo, and gpt-4o.
Without independent evidence—such as a model card, version hash, latency profile, or API response logs—your claim that you evaluated “GPT-5” remains unverifiable and likely incorrect. It is highly probable that the endpoint you used corresponds to GPT-4 Turbo or a fine-tuned variant. If so, all findings about “GPT-5’s” evaluation performance—including the central correlation result (r = 0.822)—are factually misattributed and misleading.
This is not a semantic quibble; it is a fundamental issue of scientific integrity. Readers should not be led to believe that GPT-5 exists or that its capabilities have been empirically assessed when no such model has been publicly confirmed.
Response 1:
We thank the reviewer for revisiting this issue. We would like to clarify that this concern was already addressed in the first round of revision, where we explicitly documented the provenance and usage of the GPT-5 evaluator and revised the manuscript accordingly. To further eliminate any remaining ambiguity, we now provide independent, official confirmation of GPT-5’s public availability.
First, Microsoft officially announced the release and availability of GPT-5 via Azure AI Foundry on August 7, 2025, prior to our experimentation period (October 2025). This announcement is documented in the official Azure blog post:
- GPT-5 in Azure AI Foundry: The Future of AI Apps and Agents Starts Here (Microsoft Azure, Aug 7, 2025):
https://azure.microsoft.com/en-us/blog/gpt-5-in-azure-ai-foundry-the-future-of-ai-apps-and-agents-starts-here
Second, GPT-5 is listed in the official Azure AI model catalog, which enumerates all production models accessible through Azure AI Foundry:
- Azure AI Model Catalog:
https://ai.azure.com/catalog/models
At the time of our experiments, the gpt-5-chat endpoint was publicly selectable and callable within Azure AI Foundry, and this is the evaluator used throughout our study. As clarified in the revised manuscript, all references to “GPT-5” explicitly denote this Azure-hosted gpt-5-chat endpoint, not a hypothetical or internal model. At the time of writing, the newer gpt-5.2-chat version can also be used.
As already described in our first-round revision, we have strengthened the manuscript by:
- explicitly naming the evaluator as the Azure AI Foundry gpt-5-chat endpoint in the Methods section,
- adding a clarifying footnote in the Introduction specifying this terminology,
- documenting the decoding parameters, prompt templates, and inference environment, with full prompts provided in the appendix.
We note that, consistent with standard practice for proprietary, closed-source LLMs, Azure does not expose internal model cards, hashes, or architectural details for GPT-5, just as such artifacts are not available for GPT-4, GPT-4 Turbo, or GPT-4o. Our conclusions are therefore carefully scoped to the specific Azure-hosted endpoint used, and do not claim architectural characterization beyond observable evaluation behavior.
We hope this clarification—together with the official Microsoft documentation cited above—fully resolves the reviewer’s concern and confirms that the model evaluated in our study is both real and verifiably deployed at the time of experimentation.
Comment 2:
The cited MT models (Qwen 3, LLaMA 3.2, Gemma 3) do not correspond to official releases
You clarify that these models were “snapshot builds” from Ollama. While this explains their naming, Ollama is a community inference tool, not a model publisher. Its tags (e.g., qwen3:0.6b) are unofficial and may map to quantized, modified, or even non-existent versions.
As of 2025, the official model families stop at Qwen-2, LLaMA-3, and Gemma-2. There is no public documentation for Qwen 3, LLaMA 3.2, or Gemma 3 from Alibaba, Meta, or Google. Without confirmation that these are unmodified, officially released models, your comparative results cannot be trusted or replicated.
Moreover, the claim that a 1B model (Gemma 3) outperforms a 3B model (LLaMA 3.2) without fine-tuning contradicts well-established scaling laws in machine learning—unless the models differ in architecture, training data, or optimization in ways you do not disclose.
Response 2:
We thank the reviewer for prompting further clarification on model provenance and documentation. We have revised the manuscript and response to distinguish clearly between the official model families and their documentation, on the one hand, and the specific model instantiations used in our experiments, on the other. We also clarify the role of Ollama as a supported deployment mechanism.
First, we clarify that Qwen 3 is an officially released and documented model family. The Qwen 3 series, including a 0.6B-parameter variant, is publicly documented by Alibaba Cloud through the Hugging Face model hub (https://huggingface.co/Qwen), the official GitHub repository (https://github.com/QwenLM/Qwen3), and the Qwen 3 Technical Report: http://arxiv.org/abs/2505.09388. In this study, however, we instantiated Qwen 3 (0.6B) via a snapshot build distributed through the Ollama runtime, rather than loading the official Hugging Face or GitHub checkpoints directly. We have revised the manuscript to make this distinction explicit and to avoid any implication that the Ollama build is a benchmark or exact replica of the official release.
Second, we note that LLaMA 3.2 is also officially documented by Meta AI. Official model cards and prompt formats are provided at https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2, and the model can be obtained via Meta’s official distribution channels at https://www.llama.com/llama-downloads. As with Qwen 3, the version used in our experiments was instantiated through an Ollama-distributed snapshot build, following Ollama’s internal versioning conventions. We therefore treat LLaMA 3.2 (3B) strictly as a community-instantiated build derived from the LLaMA family, not as an evaluation of the official Meta release.
Third, and most importantly with respect to Ollama’s trustworthiness, we clarify that Gemma 3 is officially documented by Google DeepMind, with explicit guidance on usage and deployment provided at https://deepmind.google/models/gemma/gemma-3/. Notably, this official documentation includes Ollama as a supported method for running Gemma 3 locally, alongside other deployment options. This confirms that Ollama is not merely a third-party community tool, but a recognized and documented distribution pathway referenced by the model’s original developer. In our study, Gemma 3 (1B) was instantiated via the Ollama-distributed build, consistent with the official usage guidance provided by Google DeepMind.
Crucially, we emphasize that this paper is not a benchmarking study. We do not claim that the Ollama-instantiated models used here are equivalent to, or representative of, the official checkpoints distributed by Alibaba, Meta, or Google. The selected models serve solely as fixed translation generators producing outputs of varying quality, which are then used to analyze evaluation behavior, specifically the alignment, bias, and calibration differences between human evaluators and an LLM-based evaluator (GPT-5). Observations such as a smaller Ollama build receiving higher scores than a larger one are therefore reported strictly as empirical outcomes of the specific snapshot builds tested, and are not generalized beyond this context or interpreted as statements about official model scaling laws.
Comment 3:
Reproducibility remains insufficient
Although you cite a GitHub repository (github.com/arbihazanst/multidimentional-MT-Eval), the repository is currently empty—containing no code, prompts, model outputs, or evaluation data. Reproducibility cannot be deferred to “post-acceptance”; it must be demonstrable at the time of review.
Without access to:
- The 100 test sentences,
- Human scores,
- GPT-5 API outputs,
- Exact prompt templates and decoding logs,
…readers cannot verify your correlation statistics, model rankings, or error analyses.
Response 3:
We thank the reviewer for emphasizing the importance of reproducibility, and we agree that materials supporting replication should be available at the time of review. We would like to clarify that this concern arises from a misinterpretation of the repository state, rather than an absence of shared materials.
In our first-round revision, we explicitly provided the repository (https://github.com/arbihazanst/multidimentional-MT-Eval), which already contains the materials required to reproduce the core analyses reported in the paper. Specifically, the repository includes:
- the 100 English test sentences used in the evaluation study,
- the human expert scores for all evaluation dimensions,
- the GPT-5 evaluation outputs corresponding to the same items,
- the exact prompt templates used for GPT-5 evaluation, which are included in the source code.
The repository is publicly accessible and does not require post-acceptance release to verify the reported results. We have double-checked repository visibility and structure and confirmed that all listed materials are present.
We respectfully note that reproducibility has therefore already been addressed in the revised submission. Nevertheless, we appreciate the reviewer’s vigilance and have strengthened the manuscript’s documentation to make the availability of code, data, and prompts unmistakably clear to readers and reviewers.
Comment 4:
Circularity in the classroom study persists
You correctly note that GPT-5 did not generate translations. However, using GPT-5 as part of the “majority score” reference in the classroom study introduces circular validation: student alignment is measured against a benchmark that includes the very system (GPT-5) whose behavior you are characterizing. This inflates perceived agreement and confounds pedagogical claims.
Response 4:
We thank the reviewer for revisiting this issue. We respectfully note that the concern raised here has already been addressed both conceptually and textually in the current version of the paper, and that the classroom study, as written, does not introduce circular validation.
Specifically, the manuscript now makes the following points explicit:
- The classroom study does not evaluate GPT-5. As stated in the Classroom Evaluation Study section, GPT-5 does not generate translations and is not the object of analysis in the classroom experiment. The study’s goal is to examine rubric calibration and consistency among novice evaluators, not to assess or validate GPT-5’s scoring behavior.
- The reference used in the classroom study is not a gold standard. The manuscript consistently describes the reference as a majority-based consensus anchor, not an authoritative ground truth. Human expert judgments form the primary basis of this anchor, while GPT-5 is included only as an additional rater to stabilize scale usage. The paper explicitly avoids framing this reference as a benchmark against which GPT-5 itself is evaluated.
- Student outcomes are interpreted as calibration effects, not convergence toward GPT-5. The reported reductions in MAE and increases in exact-match rates are interpreted strictly as evidence of reduced variance and improved rubric adherence among students. Nowhere does the paper claim that students are learning to “match GPT-5,” nor that GPT-5’s judgments are being validated through the classroom results.
- No circular claim is made in the Results or Conclusions. The classroom findings are presented descriptively and pedagogically, and the conclusions explicitly restrict their scope to instructional alignment rather than system validation. The main human–GPT-5 alignment analysis is conducted independently in the expert evaluation section, not in the classroom study.
Given these clarifications already present in the manuscript, the scenario described by the reviewer as “circular validation” does not apply to the claims being made. The classroom experiment evaluates human learning and rubric transfer, not GPT-5’s correctness or reliability.
We acknowledge that alternative classroom designs (e.g., human-only consensus) are possible, but the current design is methodologically sound for the stated pedagogical objective, and the manuscript does not overstate its implications. We therefore believe this concern has been adequately resolved in the revised paper.
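The two calibration metrics named in this response, MAE and exact-match rate, can be made concrete with a short sketch. The function names and the example scores below are invented for illustration and do not come from the study. The `reference` list stands in for whichever consensus anchor is used, whether it includes an LLM rater or is built from human ratings only.

```python
def mean_absolute_error(scores, reference):
    """Average absolute distance between rater scores and reference scores."""
    if not scores or len(scores) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(s - r) for s, r in zip(scores, reference)) / len(scores)


def exact_match_rate(scores, reference):
    """Fraction of items where the rater score equals the reference score."""
    if not scores or len(scores) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(s == r for s, r in zip(scores, reference)) / len(scores)


# Invented example: one student's scores before and after calibration,
# compared with a hypothetical consensus anchor on an assumed 1-5 scale.
anchor = [3, 4, 4, 4]
before = [2, 5, 3, 4]
after = [3, 4, 3, 4]

for label, scores in (("before", before), ("after", after)):
    print(label, mean_absolute_error(scores, anchor), exact_match_rate(scores, anchor))
```

Running the same two functions against a human-only anchor and against an anchor that includes the LLM rater would show directly how much the choice of reference changes the reported calibration effect.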
Comment 5:
Novelty remains limited
You appropriately acknowledge that multidimensional MT evaluation (MQM, BLONDE) and LLM-as-judge methods (G-Eval) are well-established. While applying these to English–Indonesian is valuable, this contribution is overshadowed by the use of unverifiable models and the unsubstantiated “GPT-5” framing. Existing benchmarks like NusaMT and IndoNLG already provide strong foundations for Indonesian MT research.
Path Forward
We encourage you to resubmit this work after addressing the above concerns:
- Replace “GPT-5” with a verifiable, officially named model (e.g., gpt-4o, claude-3-5-sonnet, or an open-source alternative like llama3-70b).
- Use only officially released, versioned models (e.g., Qwen-2, Llama-3-8B, Gemma-2-2B) with clear citations and download sources.
- Publish all materials publicly before resubmission: test data, prompts, human scores, model outputs, and evaluation scripts.
- Exclude GPT-5 (or any LLM evaluator) from the consensus reference in human calibration studies to avoid circularity.
Your multidimensional rubric, classroom validation design, and focus on Indonesian MT are worthwhile contributions—but they must be grounded in verifiable models and reproducible methods.
We hope you will consider these points seriously and look forward to a stronger, more credible version of this work in the future.
Response 5:
We thank the reviewer for summarizing the concerns and for acknowledging that the multidimensional rubric, classroom validation design, and focus on English–Indonesian MT constitute worthwhile directions. We respectfully clarify that the issues raised in this summary comment have already been addressed individually in Responses 1–4 and in the revised manuscript, and that the remaining concern pertains primarily to positioning and interpretation of novelty, rather than to methodological soundness.
First, regarding novelty, we fully agree—and explicitly state in the manuscript—that multidimensional MT evaluation (e.g., MQM, BLONDE) and LLM-as-judge approaches (e.g., G-Eval) are well established. Accordingly, the revised paper does not claim to introduce a new evaluation framework or a new LLM-based metric. Instead, the contribution is positioned as a focused empirical study that integrates (i) an MQM-aligned multidimensional rubric, (ii) an LLM evaluator, and (iii) a classroom calibration experiment, within a single, coherent evaluation pipeline for English–Indonesian translation. We have removed or softened all “first” or priority claims and now frame the work explicitly as a case study rather than a methodological breakthrough.
Second, with respect to the concern that the contribution is overshadowed by “unverifiable models” or “GPT-5 framing,” we note that these issues have been addressed comprehensively in Responses 1–3. The manuscript now clearly documents the evaluator as the Azure AI Foundry gpt-5-chat endpoint, with official Microsoft documentation cited, and all translation models are explicitly described as Ollama-instantiated builds, with official documentation for their underlying model families (Qwen 3, LLaMA 3.2, Gemma 3) cited alongside transparent disclaimers that this study is not a benchmarking paper. We believe this resolves concerns about verifiability and attribution while preserving the empirical validity of the evaluation analysis.
Third, we fully acknowledge the importance of existing Indonesian MT resources such as NusaMT and IndoNLG, which are now cited and discussed in the revised manuscript. Our study does not compete with or replace these benchmarks. Rather, it complements them by examining evaluation behavior—specifically, human–LLM alignment and rubric calibration—using a small, controlled test set designed for detailed analysis rather than system benchmarking or dataset construction.
Finally, we respectfully note that the reviewer’s suggested “path forward” describes an alternative study design, not a correction of errors in the present work. While using different evaluator models, restricting generators to officially released checkpoints, or excluding LLMs entirely from classroom consensus are all valid future directions, they do not invalidate the current study’s claims as now framed. The manuscript’s conclusions are carefully scoped to the models, endpoints, and experimental design actually used, and do not rely on assumptions that would require redesign along the suggested lines.
In summary, we have taken the reviewer’s comments seriously and revised the manuscript to ensure conservative claims, transparent documentation, and accurate positioning within the literature. The novelty of the work lies not in proposing new metrics or models, but in integrating established evaluation paradigms with an explicit pedagogical calibration study for a mid-resource language pair, supported by fully documented experimental materials. We hope this clarified framing addresses the reviewer’s remaining concerns.
Author Response File:
Author Response.pdf
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The revised manuscript “Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation” addresses all major concerns raised during initial review. The authors have significantly strengthened methodological transparency—particularly regarding model instantiation (Ollama snapshots vs. official releases), evaluation protocols, and the scope of their claims. They now clearly position their work as an investigation into evaluator behavior and alignment, not absolute MT system benchmarking, which appropriately contextualizes the observed model rankings (e.g., Gemma 3 > LLaMA 3.2).
The multidimensional MQM-aligned rubric is well-justified and consistently applied across both expert human and GPT-5 evaluators. The analysis of GPT-5’s strong correlation (r ≈ 0.82) with human judgments—alongside systematic leniency, especially in pragmatic dimensions—is handled with nuance and offers actionable insights for calibration. The classroom validation study is now convincingly framed as a pedagogical contribution, demonstrating that rubric-guided calibration improves student evaluators’ alignment with expert consensus.
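As a hedged illustration of the two quantities discussed in this paragraph, the sketch below computes a Pearson correlation between human and LLM scores and a mean signed difference, where a positive value means the LLM scores higher on average (that is, more leniently). The function names and the toy scores are assumptions made for this example and are not the study's data or code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_signed_difference(llm_scores, human_scores):
    """Average of (LLM - human); positive values indicate LLM leniency."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(l - h for l, h in zip(llm_scores, human_scores)) / len(llm_scores)


# Invented scores on an assumed 1-5 scale, for illustration only.
human = [1, 2, 3, 4, 5]
llm = [2, 3, 4, 5, 5]

print(round(pearson_r(human, llm), 3))
print(mean_signed_difference(llm, human))
```

The two numbers answer different questions: a high correlation says the evaluators rank items similarly, while the signed difference captures a systematic offset that a correlation alone would hide.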
Remaining limitations (e.g., sentence-level evaluation, test set size) are explicitly acknowledged and do not detract from the paper’s core contributions. The public release of code, prompts, and anonymized annotations further enhances reproducibility.
