PEARL: A Rubric-Driven Multi-Metric Framework for LLM Evaluation
Abstract
1. Introduction
- (i) We identify key limitations in current LLM evaluation paradigms and synthesize requirements for interpretable, multidimensional assessment;
- (ii) We define the PEARL metric suite, including component rubrics and seven formalized metrics spanning accuracy, explanation, argumentation, and robustness;
- (iii) We validate the framework across four curated synthetic evaluation conditions using education-aligned prompt sets (rubric-matching, explanation tasks, dialectical sequences, and paraphrase consistency);
- (iv) We analyze alignment patterns relative to a model-based proxy (GPT-4), reporting Pearson r and Spearman ρ and noting mixed alignment across metrics, and we show the metrics' advantage in producing pedagogically useful, reproducible, and model-agnostic feedback.
2. Background and Limitations of Existing Metrics
2.1. Token-Level Metrics: BLEU, ROUGE, and METEOR
2.2. Pairwise Comparison and Win-Rate Leaderboards
2.3. Emergent Rubric-Based Evaluation
2.4. Absence of Explanation-Aware Metrics
2.5. Alignment and Robustness: Poorly Captured
3. The PEARL Metric Suite
3.1. Design Principles
3.2. Component Rubrics
3.3. Formalized Evaluation Metrics
- (1) Comparative performance metrics include the Rubric Win Count (RWC), Global Win Rate (GWR), and Rubric Mean Advantage (RMA). These metrics measure the relative performance of competing models by comparing their scores across individual rubric dimensions, enabling transparent comparison and ranking based on interpretable evaluation criteria.
- (2) The explanation-aware metric, the Explanation Quality Index (EQI), evaluates the clarity, coherence, and pedagogical value of model-generated justifications.
- (3) The qualitative reasoning metric, the Dialectical Presence Rate (DPR), measures the presence and frequency of dialectical reasoning elements across a structured sequence (opinion → counterargument → synthesis). Rather than measuring progression, DPR reports a presence rate in [0, 1] based on rubric-aligned scoring of the three stages.
- (4) Robustness and confidence metrics include the Consistency Spread (CS) and Win Confidence Score (WCS), which assess model stability under prompt variation and the certainty of comparative outcomes (a minimal computational sketch of all seven metrics follows this list).
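The subsections that follow formalize each metric. As a purely illustrative companion, the Python sketch below shows one plausible way to compute the seven metrics from rubric scores. The input shapes, the use of simple means, the population standard deviation for CS, and the normalization constant Z for WCS are simplifying assumptions made for readability, not the paper's definitive formulations.

```python
# Illustrative sketch of the seven PEARL metrics (assumed formulations).
from statistics import mean, pstdev

def rwc(a, b):
    """Rubric Win Count: how often model A out-scores model B on a rubric dimension.
    a, b: per-prompt lists of per-dimension scores, e.g. a[prompt][dimension]."""
    return sum(1 for pa, pb in zip(a, b) for sa, sb in zip(pa, pb) if sa > sb)

def gwr(a, b):
    """Global Win Rate: share of prompts whose total rubric score favours model A."""
    return sum(1 for pa, pb in zip(a, b) if sum(pa) > sum(pb)) / len(a)

def rma(a, b):
    """Rubric Mean Advantage: average per-prompt mean-score margin of A over B."""
    return mean(mean(pa) - mean(pb) for pa, pb in zip(a, b))

def eqi(clarity, accuracy, usefulness):
    """Explanation Quality Index: assumed here to be the mean of the three
    explanation-rubric dimensions, averaged over prompts."""
    return mean(mean(t) for t in zip(clarity, accuracy, usefulness))

def dpr(stage_flags):
    """Dialectical Presence Rate: fraction of expected dialectical elements present
    across the opinion -> counterargument -> synthesis sequence.
    stage_flags: one list of 0/1 rubric-aligned presence flags per sequence."""
    flags = [f for seq in stage_flags for f in seq]
    return sum(flags) / len(flags)

def cs(run_scores):
    """Consistency Spread: dispersion of total rubric scores across repeated runs
    of the same prompt (population standard deviation as the assumed spread)."""
    return pstdev(run_scores)

def wcs(a, b, z):
    """Win Confidence Score: mean absolute per-prompt margin normalized by Z
    (symmetric; reports decisiveness, not a winner)."""
    return mean(abs(sum(pa) - sum(pb)) / z for pa, pb in zip(a, b))
```

For instance, rwc([[4, 3]], [[3, 3]]) returns 1, since model A wins exactly one rubric dimension on the single prompt.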
3.3.1. Rubric Win Count (RWC)
3.3.2. Global Win Rate (GWR)
3.3.3. Rubric Mean Advantage (RMA)
3.3.4. Explanation Quality Index (EQI)
- Clarity—the degree to which the explanation is easy to follow and well-articulated;
- Accuracy—whether the reasoning is logically sound and factually correct;
- Usefulness—the extent to which the explanation enhances user understanding or provides educational benefit.
3.3.5. Dialectical Presence Rate (DPR)
3.3.6. Consistency Spread (CS)
3.3.7. Win Confidence Score (WCS)
3.4. Linguistic and Pedagogical Dimensions Captured by PEARL
3.5. Metric-Data Alignment and Task Requirements
- Standard prompts—general-purpose queries suitable for rubric scoring and model comparison.
- Explanation prompts—questions that explicitly request reasoning, justification, or pedagogical elaboration.
- Dialectical prompts—multi-turn tasks structured around dialectical reasoning (opinion, counterargument, synthesis).
- Repeat runs—multiple generations from the same model on the same prompt to test for consistency.
4. Methodology for Metric Validation
4.1. Validation Goals and Setup
4.2. Validation Scenarios
4.2.1. Rubric-Matching Evaluation Conditions
4.2.2. Explanation Quality Tasks
4.2.3. Dialectical Reasoning Sequences
4.2.4. Stylistic Paraphrase Consistency
4.3. Metric-Rubric Alignment and Applicability
5. Results
5.1. Metric-Level Results
5.1.1. Rubric Win Count (RWC)
5.1.2. Global Win Rate (GWR)
5.1.3. Rubric Mean Advantage (RMA)
5.1.4. Explanation Quality Index (EQI)
5.1.5. Dialectical Presence Rate (DPR)
5.1.6. Consistency Spread (CS)
5.1.7. Win Confidence Score (WCS)
5.2. Per-Metric Ablations
5.3. Cross-Metric Comparative Analysis
5.4. Key Insights
6. Discussion
6.1. Alignment with a Model-Based Proxy (GPT-4)
6.2. Stability and Robustness of Metrics
6.3. Discriminative Power Across Model Quality Levels
6.4. Complementarity and Coverage of Evaluation Dimensions
6.5. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Type | Dimensions |
---|---|
Technical Rubric | Accuracy, Clarity, Completeness, Terminology |
Argumentative Rubric | Clarity, Coherence, Originality, Dialecticality |
Explanation Rubric | Clarity, Accuracy, Usefulness |
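As a concrete illustration of how these component rubrics could be carried through an evaluation pipeline, the sketch below stores each rubric as a named list of dimensions and checks that a score sheet covers exactly those dimensions. The dictionary layout, the helper name validate_scores, and the 1 to 10 scale in the example are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative representation of the PEARL component rubrics (assumed layout).
COMPONENT_RUBRICS = {
    "technical": ["accuracy", "clarity", "completeness", "terminology"],
    "argumentative": ["clarity", "coherence", "originality", "dialecticality"],
    "explanation": ["clarity", "accuracy", "usefulness"],
}

def validate_scores(rubric: str, scores: dict[str, float]) -> dict[str, float]:
    """Check that a score sheet covers exactly the dimensions of the chosen rubric."""
    expected = set(COMPONENT_RUBRICS[rubric])
    if set(scores) != expected:
        raise ValueError(f"expected dimensions {sorted(expected)}, got {sorted(scores)}")
    return scores

# Example: an explanation-rubric score sheet (values on an assumed 1-10 scale).
sheet = validate_scores("explanation", {"clarity": 9, "accuracy": 8, "usefulness": 8})
```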
PEARL Metric | Linguistic/Pedagogical Property Captured |
---|---|
RWC, GWR | Fine-grained scoring consistency; rubric agreement across prompts |
RMA | Magnitude of performance deltas in clarity, completeness, and terminology |
EQI | Explanation clarity, logical fidelity, didactic usefulness |
DPR | Dialectical reasoning, argumentative depth, engagement with alternative views |
CS | Semantic stability across paraphrased or restructured prompts |
WCS | Confidence in comparative scoring; robustness of preference signals |
PEARL Metric | Required Input Type(s) | Evaluation Scenario | Comparison Mode |
---|---|---|---|
RWC | Standard prompts + comparison | Tracks how often a model wins on individual rubric dimensions when compared to another. | Pairwise, per-dimension wins (M1 vs. M2) |
GWR | Standard prompts + comparison | Aggregates full-prompt wins to assess overall performance dominance. | Pairwise, per-prompt global wins (M1 vs. M2) |
RMA | Standard prompts + comparison | Computes the average margin of rubric score advantage across prompts. | Pairwise, per-prompt score margin (M1 − M2) |
EQI | Explanation prompts | Evaluates the quality of explanatory responses in terms of clarity, coherence, and usefulness. | Single-model explanation quality (rubric-scored) |
DPR | Dialectical prompts | Measures the presence and integration of dialectical elements across the opinion-counterargument-synthesis sequence. | Intra-sequence presence (opinion → counter → synthesis) |
CS | Repeat runs (same model) | Assesses the stability of rubric scores across repeated generations from the same model. | Single-model repeatability (spread across runs) |
WCS | Standard prompts + comparison | Average normalized win-margin decisiveness across prompts. | Pairwise \|Δ\|/Z across prompts (symmetric; no winner label)
Model Identifier | Model Family | Parameters | Developer | Role |
---|---|---|---|---|
gemma:7b-instruct | Gemma | 7B | Google | Evaluated
mistral:7b-instruct | Mistral | 7B | Mistral AI | Evaluated |
dolphin-mistral:latest | Mistral (fine-tuned) | 7B | Cognitive Computations | Evaluated |
zephyr:7b-beta | Zephyr | 7B | Hugging Face | Evaluated |
deepseek-r1:8b | DeepSeek | 8B | DeepSeek AI | Evaluated |
llama3:8b | LLaMA 3 | 8B | Meta AI | Evaluated |
openhermes:latest | OpenHermes | ~7B | Teknium | Evaluated |
nous-hermes2:latest | Nous-Hermes 2 | ~7B | Nous Research | Evaluated |
gpt-4 | GPT-4 | - | OpenAI | Primary scorer
llama3:instruct | LLaMA 3 | 8B | Meta AI | Secondary scorer |
Rubric | Mean Δ | MAD | % LLaMA3 Higher | % GPT-4 Higher | Cohen κ (95% CI) | Mean Δ LLaMA3 − GPT-4 (95% CI) |
---|---|---|---|---|---|---|
Technical | 0.86 | 8.64 | 60.7 | 35.7 | - | - |
Argumentative | 1.36 | 7.00 | 60.7 | 35.7 | - | - |
Overall agreement | - | - | - | - | 0.00 [0.00, 0.00] | 0.00 [−4.13, 4.04] |
Rubric | Model A | Model B | RWC GPT-4 | RWC LLaMA3:Instruct | Δ |
---|---|---|---|---|---|
Technical | deepseek-r1:8b | gemma:7b-instruct | 31 | 7 | −24 |
Technical | llama3:8b | mistral:7b-instruct | 32 | 8 | −24 |
Argumentative | llama3:8b | zephyr:7b-beta | 36 | 15 | −21 |
Argumentative | llama3:8b | mistral:7b-instruct | 36 | 18 | −18 |
Argumentative | llama3:8b | nous-hermes2:latest | 35 | 19 | −16 |
Technical | dolphin-mistral:latest | gemma:7b-instruct | 25 | 9 | −16 |
Argumentative | gemma:7b-instruct | nous-hermes2:latest | 3 | 18 | 15 |
Argumentative | deepseek-r1:8b | llama3:8b | 0 | 14 | 14 |
Argumentative | gemma:7b-instruct | mistral:7b-instruct | 3 | 16 | 13 |
Technical | gemma:7b-instruct | nous-hermes2:latest | 0 | 13 | 13 |
Rubric | Mean Δ | MAD | % LLaMA3 Higher | % GPT-4 Higher | Cohen κ (95% CI) | Mean Δ LLaMA3 − GPT-4 (95% CI) |
---|---|---|---|---|---|---|
Technical | 0.145 | 0.415 | 60.7 | 35.7 | - | - |
Argumentative | 0.083 | 0.242 | 57.1 | 28.6 | - | - |
Overall agreement | - | - | - | - | 0.00 [0.00, 0.00] | 0.00 [−0.16, 0.17] |
Rubric | Model A | Model B | GWR GPT-4 | GWR LLaMA3:Instruct | Δ |
---|---|---|---|---|---|
Technical | deepseek-r1:8b | nous-hermes2:latest | 0.000 | 0.833 | 0.833 |
Technical | gemma:7b-instruct | nous-hermes2:latest | 0.000 | 0.778 | 0.778 |
Technical | gemma:7b-instruct | openhermes:latest | 0.000 | 0.778 | 0.778 |
Technical | dolphin-mistral | gemma:7b-instruct | 1.000 | 0.333 | −0.667 |
Technical | gemma:7b-instruct | mistral:7b-instruct | 0.000 | 0.667 | 0.667 |
Technical | dolphin-mistral | nous-hermes2:latest | 0.000 | 0.556 | 0.556 |
Argumentative | deepseek-r1:8b | llama3:8b | 0.000 | 0.556 | 0.556 |
Technical | deepseek-r1:8b | openhermes:latest | 0.111 | 0.667 | 0.556 |
Argumentative | gemma:7b-instruct | openhermes:latest | 0.333 | 0.889 | 0.556 |
Technical | llama3:8b | mistral:7b-instruct | 1.000 | 0.444 | −0.556 |
Rubric | Mean Δ | MAD | % LLaMA3 Higher | % GPT-4 Higher | ICC(2,1) (95% CI) | Lin’s CCC (95% CI) | Mean Δ LLaMA3 − GPT-4 (95% CI) |
---|---|---|---|---|---|---|---|
Technical | 0.242 | 0.628 | 57.1 | 42.9 | - | - | - |
Argumentative | 0.097 | 0.499 | 57.1 | 39.3 | - | - | - |
Overall agreement | - | - | - | - | 0.43 [0.25, 0.57] | 0.42 [0.25, 0.56] | 0.00 [−0.33, 0.32] |
Rubric | Model A | Model B | RMA GPT-4 | RMA LLaMA3:Instruct | Δ |
---|---|---|---|---|---|
Technical | gemma:7b-instruct | nous-hermes2:latest | −1.444 | 0.208 | 1.653 |
Technical | gemma:7b-instruct | llama3:8b | −1.556 | −0.028 | 1.528 |
Technical | gemma:7b-instruct | openhermes:latest | −1.167 | 0.236 | 1.403 |
Argumentative | gemma:7b-instruct | llama3:8b | −1.417 | −0.097 | 1.319 |
Argumentative | deepseek-r1:8b | llama3:8b | −0.917 | 0.153 | 1.069 |
Argumentative | llama3:8b | openhermes:latest | 2.000 | 0.958 | −1.042 |
Argumentative | gemma:7b-instruct | nous-hermes2:latest | −0.194 | 0.806 | 1.000 |
Technical | mistral:7b-instruct | nous-hermes2:latest | −0.861 | 0.125 | 0.986 |
Technical | gemma:7b-instruct | zephyr:7b-beta | −1.000 | −0.028 | 0.972 |
Argumentative | dolphin-mistral:latest | llama3:8b | −1.778 | −0.847 | 0.931 |
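Lin's concordance correlation coefficient (CCC) reported in the RMA agreement table combines precision (the correlation between the two evaluators) with accuracy (how close their scores lie to the identity line). The sketch below applies its standard formula to hypothetical paired scores; ICC(2,1) is omitted because it requires a two-way random-effects ANOVA decomposition beyond this illustration.

```python
# Lin's concordance correlation coefficient between two evaluators' scores.
# Standard formula: ccc = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
import numpy as np

def lins_ccc(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.cov(x, y, bias=True)[0, 1]   # population covariance
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Hypothetical paired RMA-style values for two evaluators (illustrative numbers only).
evaluator_a = [0.24, 0.10, -0.64, 1.06, -0.18, 0.30, -0.31, -0.10]
evaluator_b = [0.27, 0.15, -0.50, 0.90, -0.20, 0.25, -0.40, -0.05]
print(round(lins_ccc(evaluator_a, evaluator_b), 3))
```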
Mean Δ | MAD | % LLaMA3 Higher | % GPT-4 Higher | ICC(2,1) (95% CI) | Lin’s CCC (95% CI) | Mean Δ (95% CI) |
---|---|---|---|---|---|---|
0.667 | 0.667 | 100.0 | 0.0 | 0.03 [−0.14, 0.34] | 0.03 [−0.13, 0.33] | [0.20, 1.21] |
Model | EQI GPT-4 | EQI LLaMA3 | Δ |
---|---|---|---|
gemma:7b-instruct | 8.481 | 8.519 | 0.038 |
llama3:8b | 8.444 | 8.593 | 0.149 |
deepseek-r1:8b | 8.370 | 8.444 | 0.074 |
nous-hermes2:latest | 8.222 | 8.296 | 0.074 |
mistral:7b-instruct | 7.778 | 8.296 | 0.518 |
openhermes:latest | 7.556 | 8.370 | 0.814 |
zephyr:7b-beta | 6.630 | 8.630 | 2.000 |
dolphin-mistral:latest | 6.593 | 8.259 | 1.666 |
Mean Δ | MAD | % LLaMA3 Higher | % GPT-4 Higher | ICC(2,1) (95% CI) | Lin’s CCC (95% CI) | Mean Δ (95% CI) |
---|---|---|---|---|---|---|
−0.084 | 0.084 | 0.0 | 100.0 | −0.02 [−0.05, 0.00] | 0.02 [−0.05, 0.00] | [−0.10, −0.07]
Model | DPR GPT-4 | DPR LLaMA3 | Δ |
---|---|---|---|
deepseek-r1:8b | 0.108 | 0.000 | −0.108 |
nous-hermes2:latest | 0.104 | 0.002 | −0.102 |
dolphin-mistral:latest | 0.091 | 0.002 | −0.089 |
zephyr:7b-beta | 0.087 | 0.000 | −0.087 |
llama3:8b | 0.087 | 0.000 | −0.087 |
openhermes:latest | 0.087 | 0.000 | −0.087 |
mistral:7b-instruct | 0.079 | 0.012 | −0.067 |
gemma:7b-instruct | 0.062 | 0.017 | −0.045 |
Mean Δ | MAD | % LLaMA3 Higher | % GPT-4 Higher | ICC(2,1) (95% CI) | Lin’s CCC (95% CI) | Mean Δ (95% CI) |
---|---|---|---|---|---|---|
0.594 | 0.829 | 75.0 | 12.5 | −0.384 [−1.058, 0.010] | −0.344 [—] | [0.068, 1.054] |
Model | CS GPT-4 | CS LLaMA3 | Δ |
---|---|---|---|
llama3:8b | 0.000 | 1.472 | 1.472 |
deepseek-r1:8b | 0.000 | 0.848 | 0.848 |
dolphin-mistral:latest | 0.094 | 0.609 | 0.515 |
openhermes:latest | 0.094 | 0.721 | 0.626 |
nous-hermes2:latest | 0.094 | 1.476 | 1.381 |
zephyr:7b-beta | 0.283 | 1.131 | 0.849 |
mistral:7b-instruct | 0.660 | 0.660 | 0.000 |
gemma:7b-instruct | 1.125 | 0.189 | −0.936 |
Mean Δ | MAD | % LLaMA3 Higher | % GPT-4 Higher | ICC(2,1) (95% CI) | Lin’s CCC (95% CI) | Mean Δ (95% CI) |
---|---|---|---|---|---|---|
−0.020 | 0.021 | 12.5 | 87.5 | 0.22 [0.00, 0.42] | 0.22 [0.00, 0.41] | [−0.01, 0.01] |
Model | WCS GPT-4 | WCS LLaMA3 | Δ |
---|---|---|---|
llama3:8b | 0.119 | 0.053 | −0.066 |
gemma:7b-instruct | 0.094 | 0.055 | −0.040 |
openhermes:latest | 0.088 | 0.066 | −0.023 |
deepseek-r1:8b | 0.073 | 0.076 | +0.004 |
dolphin-mistral:latest | 0.071 | 0.055 | −0.016 |
mistral:7b-instruct | 0.066 | 0.053 | −0.013 |
nous-hermes2:latest | 0.064 | 0.063 | −0.002 |
zephyr:7b-beta | 0.057 | 0.053 | −0.004 |
Measure | GPT-4 | LLaMA3 |
---|---|---|
EQI vs. length (Pearson) | −0.376 | 0.246 |
EQI vs. length (Spearman) | −0.548 | −0.048 |
WCS vs. length (Pearson) | −0.611 | −0.591 |
WCS vs. length (Spearman) | −0.405 | −0.405 |
GWR: P (longer wins) | 0.536 | 0.685 |
GWR: P (longer wins), trimmed | 0.500 | 0.650 |
RMA: P (longer → positive margin) | 0.607 | 0.870 |
RMA: P (longer → positive margin), trimmed | 0.548 | 0.825 |
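The length-bias diagnostics above relate each metric to response length and report the probability that the longer response wins. The sketch below shows one way such diagnostics could be computed; the data layout and the trimming rule that drops near-equal-length pairs are assumptions made for illustration.

```python
# Illustrative length-bias diagnostics: correlation with length and P(longer wins).
from scipy.stats import pearsonr, spearmanr

def length_correlations(scores, lengths):
    """Pearson and Spearman correlation between a metric and response length."""
    return pearsonr(scores, lengths)[0], spearmanr(scores, lengths)[0]

def p_longer_wins(len_a, len_b, winner_is_a, min_gap=0):
    """Share of comparisons won by the longer response.
    min_gap > 0 drops near-equal-length pairs (the assumed 'trimmed' variant)."""
    kept = [(la, lb, wa) for la, lb, wa in zip(len_a, len_b, winner_is_a)
            if abs(la - lb) > min_gap]
    wins = sum(1 for la, lb, wa in kept if (la > lb) == wa)
    return wins / len(kept)
```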
Metric | Evaluator | Kendall τ (w_tech ∈ {0.25, 0.75} vs. 0.5) | Max \|Δrank\| (Across Weights) |
---|---|---|---|
GWR | GPT-4 | 0.929–1.000 | 1 |
GWR | LLaMA3 | 0.643–0.857 | 2 |
WCS | GPT-4 | 1.000–1.000 | 0 |
WCS | LLaMA3 | 1.000–1.000 | 0 |
RMA | GPT-4 | 0.643–0.929 | 2 |
RMA | LLaMA3 | 0.929–1.000 | 1 |
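The weight-sensitivity analysis above re-ranks models after shifting the technical-versus-argumentative rubric weight and compares the resulting rankings with Kendall τ. A rough sketch of this procedure follows; blending the two rubric means with a single weight w_tech is an assumed simplification, not necessarily the authors' exact weighting scheme.

```python
# Illustrative rubric-weight sensitivity check via Kendall tau on model rankings.
from scipy.stats import kendalltau, rankdata

def weighted_ranking(tech_means, arg_means, w_tech):
    """Rank models by a weighted blend of technical and argumentative rubric means."""
    blended = [w_tech * t + (1 - w_tech) * a for t, a in zip(tech_means, arg_means)]
    return rankdata(blended)

def weight_sensitivity(tech_means, arg_means, w_alt, w_ref=0.5):
    """Kendall tau between the ranking at an alternative weight and at the reference weight."""
    r_alt = weighted_ranking(tech_means, arg_means, w_alt)
    r_ref = weighted_ranking(tech_means, arg_means, w_ref)
    return kendalltau(r_alt, r_ref)[0]
```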
Metric | Kendall τ (GPT-4 vs. LLaMA3) |
---|---|
EQI | 0.327 |
GWR | 0.143 |
WCS | 1.000 |
RMA | 0.214 |
Metric | Mean Δ | MAD | % LLaMA3 Higher | % LLaMA3 Lower | Pearson r | Spearman ρ | Sign Test p |
---|---|---|---|---|---|---|---|
RWC | 1.107 | 2.179 | 75.0 | 25.0 | 0.520 | 0.527 | 0.289 |
EQI | 0.667 | 0.667 | 100.0 | 0.0 | 0.135 | 0.395 | 0.008 |
CS | 0.594 | 0.829 | 75.0 | 12.5 | −0.686 | −0.565 | 0.125 |
RMA | 0.169 | 0.195 | 87.5 | 12.5 | 0.550 | 0.643 | 0.070 |
GWR | 0.114 | 0.128 | 87.5 | 12.5 | 0.350 | 0.238 | 0.070 |
DPR | −0.084 | 0.084 | 0.0 | 100.0 | −0.783 | −0.484 | 0.008 |
WCS | −0.020 | 0.021 | 12.5 | 87.5 | −0.149 | −0.071 | 0.070 |
Metric | N | Pearson r | Pearson p | 95% CI (Pearson r) | Spearman ρ | Spearman p | Δ(LLaMA3 – GPT-4) | 95% CI (Δ) | FDR (q = 0.10) |
---|---|---|---|---|---|---|---|---|---|
CS | 8 | −0.686 | 0.060 | [−0.937, 0.036] | −0.565 | 0.145 | 0.594 | [0.068, 1.054] | – |
DPR | 8 | −0.783 | 0.022 | [−0.959, −0.175] | −0.484 | 0.224 | −0.084 | [−0.096, −0.069] | ✓ |
EQI | 8 | 0.135 | 0.749 | [−0.630, 0.767] | 0.395 | 0.333 | 0.667 | [0.199, 1.208] | – |
GWR | 56 | 0.230 | 0.088 | [−0.035, 0.465] | 0.227 | 0.092 | 0.000 | [−0.164, 0.168] | – |
RMA | 56 | 0.535 | <0.001 | [0.317, 0.700] | 0.485 | <0.001 | 0.000 | [−0.328, 0.322] | ✓ |
RWC | 56 | 0.488 | <0.001 | [0.258, 0.666] | 0.404 | 0.002 | 0.000 | [−4.125, 4.045] | ✓ |
WCS | 56 | 0.254 | 0.059 | [−0.010, 0.485] | 0.183 | 0.177 | 0.000 | [−0.010, 0.009] | – |
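The alignment table above combines correlation estimates with a sign test on paired evaluator differences and a false discovery rate correction at q = 0.10 (assumed here to be Benjamini-Hochberg). The sketch below reproduces these computations in outline; the use of scipy's binomtest for the sign test and the hand-rolled step-up correction are tooling assumptions, not a record of the authors' scripts.

```python
# Illustrative evaluator-alignment statistics: correlations, sign test, BH-FDR flags.
import numpy as np
from scipy.stats import pearsonr, spearmanr, binomtest

def alignment_stats(llama3_scores, gpt4_scores):
    """Pearson r, Spearman rho, and a sign test on paired LLaMA3 - GPT-4 differences."""
    x, y = np.asarray(llama3_scores, float), np.asarray(gpt4_scores, float)
    r, r_p = pearsonr(x, y)
    rho, rho_p = spearmanr(x, y)
    diffs = x - y
    nonzero = diffs[diffs != 0]
    sign_p = binomtest(int((nonzero > 0).sum()), n=len(nonzero), p=0.5).pvalue
    return {"pearson_r": r, "pearson_p": r_p, "spearman_rho": rho,
            "spearman_p": rho_p, "sign_test_p": sign_p}

def bh_fdr_flags(pvalues, q=0.10):
    """Benjamini-Hochberg step-up rule: which p-values survive FDR control at level q."""
    p = np.asarray(pvalues, float)
    order = np.argsort(p)
    adjusted = p[order] * len(p) / np.arange(1, len(p) + 1)
    passed = adjusted <= q
    passed = np.maximum.accumulate(passed[::-1])[::-1]   # reject all ranks up to the largest passing one
    flags = np.zeros(len(p), dtype=bool)
    flags[order] = passed
    return flags
```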
Model | CS | WCS |
---|---|---|
deepseek-r1:8b | 0.000 | 0.073 |
llama3:8b | 0.000 | 0.119 |
dolphin-mistral:latest | 0.094 | 0.071 |
nous-hermes2:latest | 0.094 | 0.064 |
openhermes:latest | 0.094 | 0.088 |
zephyr:7b-beta | 0.283 | 0.057 |
mistral:7b-instruct | 0.660 | 0.066 |
gemma:7b-instruct | 1.126 | 0.094 |
Metric | n_high | n_low | mean_high | mean_low | Δ | d | p_Welch | p_MW |
---|---|---|---|---|---|---|---|---|
RMA | 3 | 3 | 1.0157 | −0.7624 | 1.7782 | 2.6529 | 0.0479 | 0.1000 |
GWR | 3 | 3 | 0.7891 | −0.9864 | 1.7755 | 2.4877 | 0.0539 | 0.1000 |
DPR | 3 | 3 | 0.8082 | −0.7595 | 1.5677 | 1.7262 | 0.1071 | 0.2000 |
CS | 3 | 3 | 0.6568 | −0.8325 | 1.4893 | 1.6209 | 0.1828 | 0.1157 |
RWC | 3 | 3 | 0.9641 | −0.3256 | 1.2897 | 1.8012 | 0.1279 | 0.2000 |
EQI | 3 | 3 | 0.7518 | −0.1820 | 0.9339 | 1.0715 | 0.3168 | 0.7000 |
WCS | 3 | 3 | 0.3077 | −0.1001 | 0.4078 | 0.3535 | 0.6942 | 1.0000 |
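The per-metric ablation above contrasts a high-scoring and a low-scoring group of three models each and reports the gap as Cohen's d alongside Welch's t-test and a Mann-Whitney U test. The sketch below shows one way to reproduce such a comparison; the pooled-standard-deviation form of Cohen's d, the two-sided alternatives, and the example numbers are assumptions for illustration.

```python
# Illustrative high- vs. low-group discrimination test (Cohen's d, Welch t, Mann-Whitney U).
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

def group_discrimination(high, low):
    high, low = np.asarray(high, float), np.asarray(low, float)
    n1, n2 = len(high), len(low)
    # Cohen's d with a pooled standard deviation (assumed convention).
    pooled_sd = np.sqrt(((n1 - 1) * high.var(ddof=1) + (n2 - 1) * low.var(ddof=1))
                        / (n1 + n2 - 2))
    return {
        "delta": high.mean() - low.mean(),
        "cohen_d": (high.mean() - low.mean()) / pooled_sd,
        "p_welch": ttest_ind(high, low, equal_var=False).pvalue,      # Welch's t-test
        "p_mw": mannwhitneyu(high, low, alternative="two-sided").pvalue,
    }

# Example with three models per group (hypothetical standardized metric values).
print(group_discrimination([1.1, 0.9, 1.0], [-0.6, -0.9, -0.8]))
```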
Model | CS | DPR | EQI | GWR | RMA | RWC | WCS |
---|---|---|---|---|---|---|---|
deepseek-r1:8b | 0.000 | 0.108 | 8.370 | 0.468 | 0.266 | 12.929 | 0.073 |
dolphin-mistral:latest | 0.094 | 0.091 | 6.593 | 0.190 | −0.401 | 0.714 | 0.071 |
gemma:7b-instruct | 1.125 | 0.062 | 8.481 | 0.167 | −0.639 | −3.714 | 0.094 |
llama3:8b | 0.000 | 0.087 | 8.444 | 0.921 | 1.0600 | 15.214 | 0.119 |
mistral:7b-instruct | 0.660 | 0.079 | 7.778 | 0.389 | −0.178 | −6.143 | 0.066 |
nous-hermes2:latest | 0.094 | 0.104 | 8.222 | 0.714 | 0.298 | −1.071 | 0.064 |
openhermes:latest | 0.094 | 0.087 | 7.556 | 0.579 | −0.306 | −8.357 | 0.088 |
zephyr:7b-beta | 0.283 | 0.087 | 6.630 | 0.571 | −0.099 | −9.571 | 0.057 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Anghel, C.; Anghel, A.A.; Pecheanu, E.; Craciun, M.V.; Cocu, A.; Niculita, C. PEARL: A Rubric-Driven Multi-Metric Framework for LLM Evaluation. Information 2025, 16, 926. https://doi.org/10.3390/info16110926