1. Introduction
The digital transformation of the tax and accounting fields is accelerating the adoption of algorithmic technologies capable of processing large volumes of regulations and providing preliminary interpretations of tax obligations. Modern tax systems are characterized by structural complexity, frequent legislative changes, and interdependencies between primary rules, secondary rules, and administrative instructions. This complexity leads to increased compliance costs and risk of error for both taxpayers and professionals [
1].
In accounting, research highlights the usefulness of LLMs in explaining concepts, analyzing financial notes, or generating coherent accounting justifications [
2]. LLMs excel at natural language processing and data analysis tasks, demonstrating their potential for adoption in the financial and accounting functions of many enterprises. With increased computational power and improved algorithms in their latest versions, commercial LLMs have demonstrated noteworthy capabilities in understanding complex contexts, answering questions, and writing content [
3].
These capabilities position LLMs as increasingly relevant for the financial domain [
3]. Finance is a complex, highly specialized field that requires data analysis, prediction, and decision-making [
3]. Current applications of LLMs include the automation of tasks such as financial report generation, market trend forecasting, investor sentiment analysis, and the provision of personalized financial advice [
3].
Research questions. To make the study objectives explicit, we investigate the following research questions:
RQ1: How accurately do current LLMs answer Romania-specific taxation and accounting questions in a closed-book setting?
RQ2: What is the impact of retrieval augmentation (web search + summarization) on correctness and legal citation quality?
RQ3: How well do LLM-as-a-Judge rubric scores align with specialist human evaluations on a stratified subset of the benchmark?
We present RO-FIN-LLM as a first public benchmark and an extensible foundation rather than a definitive standard, with documented limitations and a clear expansion roadmap.
Paper structure. Section 2 (Literature Review) summarizes prior work on LLM-as-a-Judge evaluation and financial-domain benchmarks.
Section 3 describes the dataset, models, prompts, and evaluation protocol (including human validation).
Section 4 presents results, and
Section 5 discusses implications and limitations.
Section 6 concludes and outlines future work.
1.1. Motivation
The growing adoption of LLMs in enterprise and financial workflows raises the crucial question of assessing their abilities and identifying the most suitable tasks for automation [
3]. However, the existing evaluation landscape highlights a significant gap: while many benchmarks exist, most rely on English datasets [
4]. There is a particular lack of jurisdiction-specific evaluations covering Romanian practice in the financial, accounting, and business domains. Because finance is a high-stakes domain, reliable artificial intelligence (AI) is essential, and introducing a Romanian financial benchmark directly addresses this need for domain-specific evaluation. In addition, the Romanian financial-accounting system is characterized by a high degree of legislative dynamism, and the frequent changes in the regulatory framework reinforce the importance of obtaining valid and efficient responses from AI models.
Given the diversity of businesses in size, workforce structure, and industry, a significant challenge for Enterprise Systems is identifying the applicable regulatory frameworks and navigating the complex legal landscape. Enterprise Resource Planning (ERP) systems have seen convergence with AI to enhance cost optimization and operational efficiency [
5]. Incorporating AI in ERP workflows enables firms to automate repetitive tasks, provide insights, and improve decision-making agility [
6]. Large Language Models (LLMs) show promise in processing legal text and interpreting context to provide insights based on scenario-specific questions. Generative AI assists in summarizing ERP documents for quickly understanding key points and for making informed decisions [
7]. Leveraging these models allows non-technical ERP users to solve business tasks through natural language, without requiring extensive technical expertise [
7].
By providing detailed financial analysis, automating tasks and improving customer relations, LLMs are demonstrating potential to transform services provided in this domain [
4]. For instance, Generative AI is used to manage tax jurisdictions within ERP systems by identifying missing compliance configurations and generating suggestions to resolve them, ultimately reducing the manual workload for tax accountants [
7]. Customer feedback on GenAI-assisted applications shows that, for United States tax and sales configurations, the effort required is reduced by 50% to 90% in time and resources [
7]. GPT-based solutions such as JPMorgan’s COiN program and BloombergGPT are also transforming financial operations by enabling contract evaluation and financial report generation [
8]. Other institutions employ similar solutions to create personalized client insights, detect fraud by analyzing behavioral trends and provide investment consultation [
8]. Nevertheless, applying LLMs in this high-stakes sector requires validation to mitigate the risk of hallucinations, ensuring accuracy and reliability in decision-making tasks [
4].
The integration of these models into Decision Support Systems (DSSs) presents an opportunity to enhance Business Intelligence (BI) capabilities. This benchmark assesses the performance of LLMs across the three aforementioned areas of Taxation, Financial Accounting and Management, and HR and Governance. By leveraging this technology, personnel can quickly access specialized knowledge, gaining a partner for complex tasks. In a fast-paced business environment, responsiveness and resilience can be improved by implementing AI-enabled DSSs, which provide real-time recommendations and insights [
9]. This reduces the time spent on legal questions, allowing the financial accounting specialist to shift their focus from information retrieval to planning or other relevant tasks. While implementation strategies vary, using these models in conversational interfaces such as chatbots represents the most intuitive approach for decision assistance. An example is the possibility of employing natural-language queries rather than writing complex SQL scripts, with the key advantage being easier navigation and functionality access [
10]. LLM-powered NLP interfaces have been shown to significantly improve operational efficiency and reporting accuracy in financial and accounting workflows [
11]. The collaboration of human experts with AI agents ensures that the business can leverage insights from provided data with higher efficiency and accuracy. While machines take on repetitive and data-intensive tasks, the human experts will provide strategic insight, creativity and ethical reasoning [
12].
1.2. Goal
The goal of this research is to establish the capabilities of existing AI models to answer questions in the Romanian tax, accounting, and financial domains. The foundation of this study is the RO-FIN-LLM benchmark dataset, which contains 1045 questions created by domain experts and reflecting authentic, real-life scenarios they have encountered. The models tested in this investigation were selected based on their popularity, their position in relevant benchmarks, or their status as open-source models. The models evaluated include: GPT-5, GPT-5 mini, and GPT-OSS from OpenAI; DeepSeek R1 70b; Qwen 3 32b from Alibaba; Claude Opus 4.1 and Sonnet 4.5 from Anthropic; Gemini 2.5 Flash and 2.5 Pro from Google; Mistral Small 3; and GLM 4.5 Air. Notably, Mistral Small 3.2 (an earlier version) was included under the hypothesis that a European LLM might perform better in the Romanian context, potentially offering specialized European expertise for comparison. After a preliminary performance analysis, the three top contenders remained GPT-5, Claude 4.5 Sonnet, and Gemini 2.5 Pro.
1.3. Research Contributions
This research aims to introduce a benchmark for the evaluation of LLMs on Romania-specific regulatory question answering in two core areas:
Taxation (e.g., VAT regimes, micro/profit tax, dividends).
Financial accounting under Romanian regulation (e.g., postings, amortization, provisions, foreign exchange (FX)).
The methodology utilizes a rubric-based evaluation, while employing the LLM-as-a-Judge approach. This technique is essential because traditional reference-based metrics (like BLEU and ROUGE) have limited correlation with human evaluations for generative tasks. Using the LLM-as-a-Judge approach allows for multi-dimensional assessment separated into distinct rubrics that align with relevant points derived from expert solutions. Key aspects assessed by the rubrics include:
Correctness/Mathematical Reasoning: Essential for verifying that content is factually and mathematically accurate, as mathematical reasoning and calculations are crucial components of financial questions.
Legal Citation Evaluation: Specifically designed to ensure answers adhere to Romanian law and use current versions of all relevant statutes and legal acts, while also mitigating hallucinations by systematically detecting legally unsupported responses.
Clarity and structure: Essential for measuring how organized and easy to understand a response is and whether its structure facilitates comprehension by non-expert readers.
This process ensures thorough coverage of all important answer aspects, maintaining high standards of factual accuracy, completeness, and clarity. Furthermore, the dataset includes timing benchmarks that allow the LLMs’ response latency and completeness to be evaluated against human expert performance. The final step of this methodology validates the generated answers against the expected answers provided alongside each question, leveraging this structured, multi-dimensional LLM-as-a-Judge assessment.
The answers provided by the selected LLMs were also evaluated by 12 specialists with over 10 years of experience in the financial-accounting domain. Due to time and resource constraints, the human evaluation covered a subset of the dataset: the answers to 60 questions, equally covering the six areas of the Romanian financial and business regulation domain. The specialists provided a Yes or No label for each rubric for each answer in this subset. After assessing the LLMs’ question-answering capabilities, the answers from only three models were selected for human evaluation, resulting in 180 answers. In addition to these labels, the professionals were asked to state their preference in pairwise comparisons between the three models’ answers for each question in the subset. Furthermore, to capture the nuances behind the Yes and No labels for the correctness rubric, the raters also scored each answer on a Likert scale (1–5). For a visual representation of the question answering and evaluation processes, refer to
Figure 1.
3. Materials and Methods
This study details an evaluation benchmark designed to assess the performance of Large Language Models (LLMs) in a financial-accounting question answering task. The materials comprise a set of human-generated questions and their corresponding answers. To establish a baseline for difficulty and response time, human participants were asked to estimate the time and difficulty required to produce each answer; the attributed values were reviewed and validated by the assessors to mitigate potential bias. As a first method, a preliminary evaluation prompted a range of proprietary and open-source answerer models to answer the questions from the dataset. An initial screening of a small response sample was conducted using the G-Eval framework [
13]. The best performing models, namely GPT-5, Claude 4.5 Sonnet, and Gemini 2.5 Pro, were selected as the main contenders for the full evaluation process and were also used as AI Judges for the final assessment phase.
3.1. Evaluation Dataset
The RO-FIN-LLM benchmark comprises 1045 questions created and validated by experts in Romanian taxation, accounting standards, and business management practices, each with more than 10 years of experience in the field. Each question in the dataset includes a human expert-generated answer, a difficulty classification, a category assignment, and the time required by the aforementioned specialists to formulate a complete response with appropriate legal references. Because the dataset was curated by domain experts, it captures authentic, real-life scenarios that these professionals have encountered throughout their careers. The question categories reflect six critical areas of Romanian financial and business regulation. The largest representation comes from VAT and eVAT questions (25.64%), followed by Accounting and Monography of Accounting Entries (20.47%). Questions regarding Income Tax constitute 17.22%, those concerning Profit Tax account for 16.26%, those addressing Micro Enterprises comprise 8.32%, and Other Obligations make up 12.05%.
Each question was assigned a difficulty level based on the legal complexity, reasoning depth, and specificity of regulatory knowledge needed to answer it. From this perspective, the dataset contains 37.4% Easy, 49.2% Medium, and 13.3% Hard questions. This distribution is a direct result of real-world consultation experience, ensuring a robust evaluation across complexity levels with emphasis on moderately challenging scenarios. The recorded answer times represent the time specialists required to provide a comprehensive response, including research, citations, and documentation of relevant legal sources. Most questions were answered within 10 min or 5 min, indicating relatively straightforward scenarios with clear regulatory guidance. A substantial portion required 15 min (159 questions), reflecting moderate complexity and cross-referencing of multiple provisions. More complex questions demanded 20, 30, or up to 60 min, typically involving multi-faceted scenarios, ambiguous provisions, or calculations requiring detailed documentation. These timing benchmarks provide context for evaluating the LLMs’ response latency and completeness against human expert performance.
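As a sanity check on the figures above, the reported category percentages map back onto the 1045 questions as integer counts; the small sketch below is purely illustrative (the counts are derived by rounding, not restated from the source data):

```python
# Illustrative sketch: derive approximate per-category question counts
# from the percentages reported for the 1045-question dataset.
TOTAL_QUESTIONS = 1045

category_shares = {
    "VAT and eVAT": 25.64,
    "Accounting and Monography of Accounting Entries": 20.47,
    "Income Tax": 17.22,
    "Profit Tax": 16.26,
    "Micro Enterprises": 8.32,
    "Other Obligations": 12.05,
}

approx_counts = {
    name: round(share / 100 * TOTAL_QUESTIONS)
    for name, share in category_shares.items()
}

# The rounded counts add back up to the full dataset size.
assert sum(approx_counts.values()) == TOTAL_QUESTIONS
```

That the rounded counts sum exactly to 1045 indicates the reported percentages are mutually consistent.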
3.2. Evaluated Models
The purpose of this investigation is to establish the capabilities of existing AI models. The models tested are: GPT-5, GPT-5 mini, and GPT-OSS from OpenAI; DeepSeek R1 70b; Qwen 3 32b from Alibaba; Claude Opus 4.1 and Sonnet 4.5 from Anthropic; Gemini 2.5 Flash and 2.5 Pro from Google; Mistral Small 3; and GLM 4.5 Air. Considering the variety of solutions, the starting point was an article that discusses a benchmark in the financial sector [
Further investigation identified finance-sector benchmarks published by some of the model developers on their websites [28]. Some of these benchmarks are academic, while the most relevant ones here are proprietary, such as TaxEval and Finance Agent. The TaxEval benchmark evaluates a model’s ability to answer hard tax-related questions, focusing on both answer correctness and structured reasoning capabilities. The other relevant benchmark is Finance Agent [
29], whose dataset contains questions that cover quantitative and qualitative retrieval, numerical reasoning, pattern analyzing, financial modelling and market analysis.
The tested models were selected on the basis of their popularity, their position in these benchmarks, or their open-source availability. In the latest benchmarks available at the time of writing, the chosen models rank among the top 20 in accuracy, with a few exceptions: Mistral Small 3.1, an earlier version, is ranked lower on both benchmarks. Despite this lower ranking, Mistral Small 3.2 was chosen under the supposition that a European LLM might perform better in the Romanian scenario, thereby bringing European expertise to the comparison. Another important criterion was open-source availability. The open-source models, whose weights are publicly released under permissive licenses (MIT or Apache 2.0), are: OpenAI’s GPT-OSS, DeepSeek’s R1, Alibaba’s Qwen 3, Mistral Small 3, and Zhipu AI’s GLM-4.5 Air. Their weights can be downloaded, used, modified, and deployed locally or commercially.
The remaining six models are the latest versions available from popular AI research and development companies: OpenAI’s GPT-5 and GPT-5 mini, Anthropic’s Claude Opus 4.1 and Claude Sonnet 4.5, and Google’s Gemini 2.5 Flash Thinking and 2.5 Pro Thinking were likewise used in the experiments to establish their potential.
3.3. Answer Generation Prompt
The prompt used to answer the questions in this benchmark follows prompt engineering principles, ensuring that the interaction with the model yields useful answers. These principles are applied by providing a persona and explicit instructions, and by writing clear phrases and questions. The prompt was first developed in English, and Claude was used to check its clarity of expression. It was then translated into Romanian and adapted into two versions: one for zero-shot prompting and one that allows the use of an online search tool.
The prompt whose aim was to test the capabilities of the AI models and their knowledge of the Romanian financial sector can be found in
Appendix B.1.
Appendix B.2 contains the instructions that give the LLMs access to a web search tool. The limit is six search calls; this number was chosen so that the whole answer generation process costs less than $1.
To enable Retrieval Augmented Generation (RAG), an online search engine API named Tavily Search API [
30] is used, which allows a plug-and-play setup to seamlessly integrate into the existing application. It is generally acknowledged as a search engine optimized for LLMs and RAG [
31]. Its advantage in the general context includes providing high-quality, LLM-generated summaries synthesized from retrieved results, which contributes to a higher resolution rate on the GAIA benchmark compared to other general search APIs [
32,
33]. While similar products, namely Brave and Exa, achieve comparable resolution rates on the aforementioned benchmark and choose which webpages to open based on search snippets, Tavily and Exa additionally provide higher-quality LLM-generated summaries. To mitigate potential hallucinations, Tavily offers a single LLM-generated answer per query, synthesized across all search results, which is more likely to be accurate than individual page summaries.
This API allows the users to specify the country and prioritizes the content from said country in the search results [
30]. This feature is only available for the ‘general’ search topic, which implies a broader search area while still maintaining relevance to the Romanian context. As another article [
31] also mentions, the aforementioned API does not require the development team to crawl the web pages manually, which consequently allows for the focus to be on the various experiments.
According to documentation [
30], the Tavily Search API uses proprietary AI to rank the most relevant sources and content for the provided query. The response comes in JSON form and includes a ‘raw_content’ property containing the scraped content of each URL it returns. This allows the research to also assess each model’s capability to process raw content, rather than relying solely on Tavily’s own summarization tool.
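As a minimal illustration, extracting the ‘raw_content’ of each result from a Tavily-style response might look as follows; the exact response shape shown here is a simplified assumption based on the fields named above, so the authoritative schema should be taken from the Tavily documentation:

```python
# Minimal sketch: pull the scraped page text ("raw_content") out of a
# Tavily-style search response. The response structure below is a
# simplified assumption, not the full documented schema.
from typing import Any

def extract_raw_contents(response: dict[str, Any]) -> list[str]:
    """Return the non-empty raw_content strings from each search result."""
    results = response.get("results", [])
    return [r["raw_content"] for r in results if r.get("raw_content")]

# Example with a mocked response (URLs and content are placeholders):
mock_response = {
    "query": "cota TVA Romania 2025",
    "results": [
        {"url": "https://example.ro/lege", "raw_content": "Art. 291 ..."},
        {"url": "https://example.ro/ghid", "raw_content": None},
    ],
}
print(extract_raw_contents(mock_response))  # ['Art. 291 ...']
```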
The raw content provided by the Tavily search is then summarized by the same answerer model, using the prompt in Appendix B.3, to maintain continuity and limit the bias toward processing already-summarized content; this process can be seen in
Figure 2.
The summarization prompt explicitly requires that the result be focused on the main question, thus ensuring the provided information is relevant. As a consequence, the competence of the generative AI tools is tested for the entire process until the final answer is generated.
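This end-to-end flow, from search through focused summarization to the final answer, can be sketched as below; the helper names, the callable interfaces, and the wiring of the six-call budget are illustrative assumptions rather than the actual implementation:

```python
# Illustrative pipeline sketch (hypothetical helpers): answer one question
# with at most MAX_SEARCH_CALLS web searches, summarizing raw page content
# with the same model before composing the final answer.
from typing import Callable, Optional

MAX_SEARCH_CALLS = 6  # search budget, chosen to keep per-run cost low

def answer_with_search(
    question: str,
    propose_query: Callable[[str, list[str]], Optional[str]],  # model picks next query
    search: Callable[[str], list[str]],         # query -> raw page contents
    summarize: Callable[[str, str], str],       # (question, raw text) -> focused summary
    answer: Callable[[str, list[str]], str],    # (question, summaries) -> final answer
) -> str:
    summaries: list[str] = []
    for _ in range(MAX_SEARCH_CALLS):
        query = propose_query(question, summaries)
        if query is None:  # the model decides it has enough context
            break
        for page in search(query):
            # Summaries stay focused on the main question to limit drift.
            summaries.append(summarize(question, page))
    return answer(question, summaries)
```

The design keeps summarization question-focused, matching the requirement of the summarization prompt described above.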
3.4. Preliminary LLM Evaluation
Recent findings in financial and agentic benchmarks [
26] show that metrics such as G-Eval can be used for preliminary evaluation. In complex, real-world financial tasks, proprietary LLMs often establish a baseline for performance [
26]. OpenAI’s o3 achieved an accuracy of 46.8%, followed closely by Claude 3.7 Sonnet at 45.9%. In contrast, open-source models such as LLaMA 4 Maverick and Mistral Small 3.1 demonstrated weaker capabilities, reaching accuracy scores of only 3.1% and 10.8%, respectively [
26].
Furthermore, evaluation on financial knowledge benchmarks such as IDEA-FinBench shows that proprietary models like GPT-4 exhibit exceptional performance when tested on a CFA Level II (CFA-L2) exam [
4]. This exam consists of multiple-choice questions, covering diverse topics, including Financial Statement Analysis, Fixed Income, Economics, etc. [
34].
G-Eval is a framework designed to use Large Language Models with chain-of-thought and a form-filling paradigm, to assess the quality of Natural Language Generation (NLG) outputs [
13]. It addresses the limitations of conventional metrics, such as BLEU or ROUGE, which have a low correlation with human judgment, especially on open-ended questions [
13]. As seen in
Figure 3, the evaluated models, namely GLM, DeepSeek, Qwen, GPT-5-mini, GPT-OSS, and Mistral, scored lower on reasoning and factual accuracy, dimensions similar to the correctness rubric proposed below. GPT-5-mini’s comparatively promising results corroborate the strong performance of OpenAI’s GPT models.
In addition to the G-Eval evaluation, a small batch of answers from GLM, DeepSeek, Qwen, and Mistral was examined. To ensure this evaluation provided relevant information, the predicted answers were drawn from the search-enabled experiments. Based on these results, the research team decided not to move forward with the analysis of open-source models and to focus on the capabilities of OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5, and Google’s Gemini 2.5 Pro.
3.5. Rubric-Based Evaluation
The detailed evaluation of the answers from the remaining models, namely Claude Sonnet 4.5, Gemini 2.5 Pro and ChatGPT 5, relies on specific criteria to assess the quality of their generated answers in a more structured manner. Moving beyond accuracy metrics to a rubric-based evaluation allows the judgment to capture the nuanced performance in domains such as the law or finance, thus aligning with evaluation methodologies like G-Eval and LLM-Rubric [
13,
18]. Under the LLM-as-a-Judge approach, the assessment is separated into distinct rubrics that align with relevant points derived from expert solutions, allowing a multi-dimensional evaluation. This process ensures thorough coverage of all important aspects of the response, maintaining high standards of factual accuracy, completeness, and clarity [
26]. The criteria chosen for this article were composed with human evaluators in mind as well. All evaluators were provided a brief description of what each rubric entails, as detailed in the subsections below.
For each answer provided to the 1045 questions by the six models, one judge was allocated for evaluation. The judge model was selected at random to use financial resources economically.
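This allocation step amounts to a seeded random assignment; the helper below is a hypothetical sketch (only the three judge model names come from this study):

```python
# Minimal sketch (hypothetical helper): assign exactly one randomly chosen
# judge model to each generated answer, seeded for reproducibility.
import random

JUDGES = ["claude-sonnet-4.5", "gemini-2.5-pro", "gpt-5"]

def allocate_judges(answer_ids: list[str], seed: int = 42) -> dict[str, str]:
    rng = random.Random(seed)  # fixed seed so the allocation can be replayed
    return {answer_id: rng.choice(JUDGES) for answer_id in answer_ids}

# 1045 questions x 6 answerer models = 6270 answers to judge.
assignment = allocate_judges([f"ans-{i}" for i in range(6270)])
assert len(assignment) == 6270
assert set(assignment.values()) <= set(JUDGES)
```

Seeding the generator makes the single-judge-per-answer allocation reproducible, which matters when the goal is to economize on judging costs without biasing which judge sees which answer.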
3.5.1. Clarity and Structure Evaluation
Most of the articles surveyed ask for similar metrics, such as coherence [
13] and logical and structural integrity [
21]. Another dimension encapsulated in this criterion is the comprehensibility of the answer to experts and non-experts alike. This emphasis on accessibility ensures that complex information can be grasped by a wider audience, enhancing the overall utility of the response. The evaluation therefore involves a careful assessment of terminology usage, sentence structure, and overall readability, as these factors significantly influence communication effectiveness.
3.5.2. Correctness Evaluation
This evaluation includes the many dimensions which fall under the umbrella of correctness. Following another prompt structure where correctness is evaluated [
15], the formulation for this prompt takes into consideration the completeness [
21] and mathematical reasoning behind the final result, whether or not numerical in nature.
The correctness rubric prompt asks for an evaluation of the completeness and mathematical reasoning of the predicted answer. For mathematical questions, the reasoning must be sound even if the final numerical result is equivalent. A similar explanation was provided to the human raters in Romanian, asking whether both the answer and its mathematical reasoning are correct. The human evaluators were also asked to rate correctness on a Likert scale, with 1 meaning completely incorrect and 5 meaning completely correct.
It is crucial that the mathematical reasoning is followed exactly, even when the final numerical result matches the gold answer, because a model may rely on an older version of the law or on an ambiguous mathematical explanation. In finance, mathematical reasoning and calculations are essential components of many critical financial questions and workflows [
20]. Thus, high correctness scores and explicit mention of this validation focus are important.
3.5.3. Legal Citation Evaluation
The prompt used for this rubric asks the judge to verify that the legal citations provided in the predicted answer are relevant and match those in the gold answer. If any citations are contradictory or absent from the gold answer, the label should be invalid. The labels were further processed numerically, using 0 and 1. The human evaluators were asked to assess whether the cited articles of law are correct or incorrect.
This rubric was designed to serve three functions. First, its formulation ensures that answers adhere to Romanian law and the established legal framework for the finance domain. Second, it requires that answers use the current versions of all relevant statutes and legal acts, guaranteeing an up-to-date legal basis. Lastly, these dimensions safeguard data integrity by mitigating possible hallucinations and allowing the systematic detection of legally unsupported responses.
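A simplified, rule-based sketch of the check this rubric describes is given below, treating citations as normalized strings; the function, its name, and the string-based normalization are illustrative assumptions (the actual check is performed by the judge model on free-form text):

```python
# Hypothetical sketch of the legal-citation rubric as a deterministic rule:
# citations in the predicted answer must appear among the gold citations,
# otherwise the binary label is 0 (invalid).
def citations_valid(predicted: set[str], gold: set[str]) -> int:
    """Return the 0/1 label later used in the numerical processing."""
    if not predicted:
        return 0  # an answer citing no legal basis is unsupported
    return 1 if predicted <= gold else 0

# Example: a stray citation absent from the gold answer invalidates the label.
gold = {"Codul fiscal art. 291", "Codul fiscal art. 293"}
print(citations_valid({"Codul fiscal art. 291"}, gold))  # 1
print(citations_valid({"Codul fiscal art. 999"}, gold))  # 0
```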
3.6. Human Evaluation
A sample of 60 of the 1045 questions was selected, evenly covering the six categories with 10 questions each. The answers provided by Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-5, all with Tavily Search enabled, were evaluated by 12 specialists from the financial-accounting domain. The 12 raters were asked to provide binary scores on the same three rubrics used by the LLM-as-a-Judge models. Moreover, correctness was also evaluated on a separate Likert scale from 1 (completely incorrect) to 5 (completely correct), to better understand the nuances behind the raters’ binary choices.
In addition to evaluating the individual answers, the human raters were asked to state their preference between two models, taking into consideration the question and the ground truth answer. The evaluators were shown one question at a time, together with the ground truth answer and 2 of the 3 model answers, and asked to choose answerer model A, answerer model B, or a tie. To ensure blinded evaluation, the answerer models were referred to only as model A and model B. Each question was presented three times, so that preferences were established for all possible model pairs.
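The construction of these blinded comparisons can be sketched as follows; the helper name and the coin-flip blinding mechanism are illustrative assumptions, not the study's actual tooling:

```python
# Hypothetical sketch: build the three blinded A/B comparisons for one
# question (one per unordered model pair), with a coin flip deciding
# which model the rater sees as "A".
import itertools
import random

MODELS = ["claude", "gemini", "gpt"]

def blinded_pairs(question_id: str, seed: int = 0) -> list[dict]:
    rng = random.Random(f"{question_id}-{seed}")  # reproducible per question
    comparisons = []
    for left, right in itertools.combinations(MODELS, 2):
        if rng.random() < 0.5:
            left, right = right, left  # randomize which model appears as A
        comparisons.append({"question": question_id, "A": left, "B": right})
    return comparisons

pairs = blinded_pairs("q17")
assert len(pairs) == 3  # all pairwise combinations of the three models
```

Randomizing the A/B position per comparison prevents raters from learning which slot a given model occupies across questions.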
3.7. Hypotheses
The first hypothesis is that Claude would be the best performer, based on the research team’s experience with this model on other tasks. The second hypothesis is that GPT-5 would be the least expensive, since it has the lowest cost per token of the three models. The third supposition concerns the use of the Tavily Search API: Retrieval Augmented Generation with online search should yield better legal citations. Lastly, based on the literature on LLM-as-a-Judge, the model evaluations should be similar to the human ones.
4. Results
In this section, because the same models were used both for evaluation and for generating dataset answers, the following naming convention is used: Judge Claude, Judge Gemini, and Judge GPT (or Evaluator Claude, Evaluator Gemini, and Evaluator GPT) when referring to the models as evaluators. The models being evaluated, which provided the answers to the dataset questions, are named Answerer Claude, Answerer Gemini, and Answerer GPT, or by their model name annotated with or without search, such as Claude-NoSearch or Claude-WithSearch.
Across all evaluations from the three LLM judges, only 8% of answers received full scores on all three rubrics. On average, the cost per successful response was $0.2252. Within this 8%, the distribution of successful responses reflects whether search tools were used: just over a third (38.4%) came from models without search, while models with search enabled accounted for almost double that share (61.6%), providing evidence for this pairing as a route to improved results.
In regards to the human evaluations, the smaller dataset consisting of 60 questions with the answers from the three models—Claude-WithSearch, Gemini-WithSearch and GPT-WithSearch—was analyzed. These evaluations were analyzed from two perspectives, namely to establish the quality of the answers provided, and to compare the human judgments to the LLMs’.
4.1. Per Rubric Analysis
This subsection concentrates on the scores provided by the evaluators, whether the human specialists or Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-5. For the LLM evaluators, the whole dataset of 1045 questions answered by the six LLM answerers, totaling 6270 answers, was assessed from the three-rubric perspective. For the three rubrics of correctness, clarity and structure, and legal citation, the LLM and human evaluators were asked to provide a binary score. All evaluators were provided with a brief explanation of what each rubric implied.
4.1.1. LLM-as-a-Judge
For each of the three rubrics, the mean score was calculated. For clarity and structure, the judges score the responses at 96.2% on average. Verbosity bias is the tendency of a judge, human or model-based, to favor longer responses regardless of the actual information presented [
14]. This bias, in combination with the way LLMs are constructed to adhere to human norms requiring well-structured examples, may be part of the reason why the model evaluators decided to place such high rankings for this metric [
4].
In our study, an average correctness score of 42.5% attributed by the LLM judges signals the difficulty of the question answering task. In a similar study [25], the best-performing model (GPT o3) achieved an accuracy of 46.8%, with no model surpassing 50% accuracy on the Finance Agent Benchmark. When evaluated on CPA single-answer questions in IDEA-FinBench [4], the models also yielded low scores, with ChatGPT achieving 42.64% accuracy. An LLM's performance is measured against the evaluation criteria provided, and such an average correctness score highlights shortcomings that require further investigation.
The dimension on which the judges agreed most strongly is clarity and structure. The scores provided by the judges for the models’ answers (see Table 1) are roughly similar, ranging from 90.5% for Answerer GPT-5 to 99.5% for Answerer Gemini with search. A similar agreement trend holds for the legal citation rubric, where all models performed poorly: the lowest average score was given to Claude-NoSearch (5.9%), and the best to GPT-WithSearch (16.1%).
Upon further investigation, an average score was calculated for each evaluation rubric, for each evaluator model and for each evaluated model, to check whether the judges align with each other in their assessments. For a more detailed distribution of how each evaluator scores the answerer models, refer to Figure 4. The scoring trend holds for each answerer model. A noticeable difference is Judge Gemini’s more lenient scoring across all rubrics and all answerer models.
The analysis of the confidence intervals for the provided answers reveals distinct performance tiers across the evaluated models (Table 2). GPT-5-WithSearch demonstrated the highest correctness, outperforming all the other models, as indicated by the non-overlap of the 95% confidence intervals. Interestingly, the confidence intervals of Claude-WithSearch (0.358 [0.329, 0.387]) and Gemini-NoSearch (0.360 [0.331, 0.389]) overlap substantially despite a slight numerical difference. The performance gap between these two models is therefore marginal and potentially attributable to noise, whereas the jump between Gemini-WithSearch and GPT-NoSearch represents a significant improvement in model capacity.
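As a rough check, the half-width of these intervals is consistent with a normal-approximation binomial interval over the full question set. A minimal sketch (the sample size n = 1045 is our assumption, inferred from the dataset size):

```python
import math

def wald_ci(p, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Claude-WithSearch correctness: 0.358 over an assumed n = 1045 answers
lo, hi = wald_ci(0.358, 1045)
print(round(lo, 3), round(hi, 3))  # → 0.329 0.387, matching Table 2
```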
4.1.2. Human Evaluators
Examining the computed average scores for each rubric and each model in Figure 5, a stark discrepancy is seen in the correctness and legal citation rubrics. The human evaluators are more inclined to label an answer as Yes (1) than the LLMs. The LLM judges, on the other hand, are stricter in their assessment, rarely attributing the Yes label for these rubrics.
Considering the additional evaluation done by the specialists, who also attributed a correctness score on a Likert scale, and the aforementioned difference, the correlation between the Yes/No label and the Likert score was analyzed. In Figure 6, the distribution reveals that a score of 3 on the Likert scale marks the threshold of partial correctness. For a Likert score of 3, the probability that a specialist attributes a No label is only 35%. Even in borderline cases, then, the tendency is to attribute a Yes label to a medium-quality answer. This trend may account for the elevated average scores seen in the human evaluations compared to the LLM judges.
4.2. Inter Rater Agreement on Rubric Scores
To assess the validity of the evaluations provided by the specialists, inter-rater agreement on the correctness score was calculated using multiple methods, as was the agreement between human and LLM raters. First, the Pearson correlation coefficient was calculated between each pair of raters for the correctness score [14]. The values in the matrix exhibited weak correlation, with the highest value reaching 0.28 between two human raters, and −0.21 between Judge GPT and Judge Gemini. The maximum correlation between human and LLM raters is 0.26, still weak. These results led to the evaluation of Spearman’s rank correlation coefficient for each of the three models’ results. The answers provided by Claude, Gemini and GPT with search enabled were separated, and the coefficient was calculated over the evaluations of the batch of 60 answers for each model. For Claude-WithSearch’s answers, the strongest correlations were found between a human rater and Judge Gemini and Judge Claude, at 0.58 and 0.59. For Gemini-WithSearch’s answers, the highest coefficient was 0.70, between Judge Gemini and two human raters. Among human raters, the closest score was 0.69. For GPT-WithSearch’s answers, the strongest scores were between Judge Claude and two human raters, at 0.69 and 0.67.
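Because the scores contain many tied values (binary and Likert ratings), Spearman's coefficient must be computed on average ranks. A minimal stdlib sketch of the pairwise computation (illustrative only; the actual analysis may have used a statistics package):

```python
from statistics import mean

def avg_ranks(xs):
    """Assign 1-based ranks, averaging ranks within tie groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        r = (i + j) / 2 + 1          # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = avg_ranks(x), avg_ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # → -1.0 (perfectly opposed ratings)
```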
The next stage of investigating the agreement between the 15 raters used Cohen’s Kappa, which produced lower scores than Spearman’s rank correlation coefficient. A Cohen’s Kappa matrix was built from the correctness scores of each pair of raters. The values in the matrix did not show strong or moderate agreement, with the highest score being 0.28 between two human raters. These results prompted the investigation of other coefficients better suited to assessing the agreement of the 12 human raters and three LLM judges. Given the binary nature of the ratings, and in order to assess the specialists’ group agreement, Fleiss’ Kappa was also calculated. A low Fleiss’ Kappa of 0.2473 indicates only fair inter-rater agreement [35].
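Fleiss' Kappa compares the observed per-item agreement with the agreement expected from the overall category proportions. A minimal sketch for binary ratings with a fixed number of raters per item (toy data, not the study's scores):

```python
def fleiss_kappa(ones_per_item, n_raters):
    """Fleiss' Kappa for binary ratings; ones_per_item[i] is the number
    of raters who scored item i as 1, out of n_raters."""
    N = len(ones_per_item)
    # per-item observed agreement, averaged over items
    p_bar = sum((o * o + (n_raters - o) ** 2 - n_raters) /
                (n_raters * (n_raters - 1)) for o in ones_per_item) / N
    # chance agreement from the overall proportion of 1s
    pi1 = sum(ones_per_item) / (N * n_raters)
    p_e = pi1 ** 2 + (1 - pi1) ** 2
    return (p_bar - p_e) / (1 - p_e)

# 4 items, 3 raters: two unanimous items, two split 2-to-1
print(round(fleiss_kappa([3, 0, 1, 2], 3), 4))  # → 0.3333
```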
For the evaluation and confirmation of the expert evaluator group, IntraClass Correlation Coefficient was calculated in an effort to assess the reliability of the ratings as well [
14,
36,
37].
As seen in Table 3, the low ICC1, ICC2 and ICC3 values show poor reliability: the human raters disagree with each other in many cases, so the judgment of a single rater should not be trusted. The collective, however, carries more information. The ICC1k, ICC2k and ICC3k scores range from 0.74 to 0.78, indicating good reliability: even though individual raters disagree, the group as a whole is consistent. Since the last three scores reflect the means of k raters, the resulting good reliability justifies using a majority vote or the average score of the 12 raters to determine the ground truth.
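The gap between single-rater and averaged reliability follows directly from the Spearman–Brown relation, ICC(k) = k·ICC(1) / (1 + (k−1)·ICC(1)). A quick illustration with a hypothetical single-rater ICC of 0.20 (the actual single-rater values are in Table 3):

```python
def spearman_brown(icc1, k):
    """Reliability of the mean of k raters given single-rater reliability."""
    return k * icc1 / (1 + (k - 1) * icc1)

# a modest single-rater reliability of 0.20, averaged over 12 raters
print(round(spearman_brown(0.20, 12), 4))  # → 0.75
```

This shows how a low individual ICC can still yield an ICC2k around 0.75 once 12 raters are averaged.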
The stark difference between a low Fleiss’ Kappa and a high averaged ICC (an ICC2k of 0.75) turned the investigation toward the unbalanced nature of the dataset. The raters attributed correctness scores of 0 and 1 in a dissimilar manner: 0s make up 23.3% of the 2160 correctness scores (12 specialists rated 180 answers for the 60 questions answered by the three LLMs with the Tavily Search API), and 1s make up 76.7%. Since the Kappa measurement depends on trait prevalence in the population under consideration—here, the distribution of 0s and 1s—another statistic was calculated, Gwet’s AC1 [35]. Gwet’s AC1 was 0.5816 for correctness, 0.5105 for legal citation, and 0.6293 for clarity and structure. The score for correctness falls into the Moderate Agreement category, showing moderate individual agreement alongside high group reliability.
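Gwet's AC1 replaces Fleiss' prevalence-sensitive chance term π² + (1−π)² with 2π(1−π), which stays small when one label dominates. A toy comparison on heavily imbalanced binary data (illustrative values, not the study's scores) shows Kappa collapsing while AC1 stays high:

```python
def agreement_stats(ones_per_item, n_raters):
    """Observed agreement, Fleiss' Kappa and Gwet's AC1 for binary ratings."""
    N = len(ones_per_item)
    p_bar = sum((o * o + (n_raters - o) ** 2 - n_raters) /
                (n_raters * (n_raters - 1)) for o in ones_per_item) / N
    pi1 = sum(ones_per_item) / (N * n_raters)
    p_e_fleiss = pi1 ** 2 + (1 - pi1) ** 2
    kappa = (p_bar - p_e_fleiss) / (1 - p_e_fleiss)
    p_e_ac1 = 2 * pi1 * (1 - pi1)
    ac1 = (p_bar - p_e_ac1) / (1 - p_e_ac1)
    return p_bar, kappa, ac1

# 10 items, 3 raters, almost all rated 1 (prevalence ≈ 0.97)
p_bar, kappa, ac1 = agreement_stats([3, 3, 3, 3, 3, 3, 3, 3, 2, 3], 3)
print(round(kappa, 2), round(ac1, 2))  # Kappa ≈ -0.03, AC1 ≈ 0.93
```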
Consensus Label
Based on the aforementioned results, a Consensus Label of 1 or 0 was computed for each rubric via majority vote [14], flagging the cases where the 12 raters were tied. The scores were first grouped by question number and answerer model, then the mean score of the 12 raters and the number of 1s were calculated. For the majority vote, the threshold for the mean was 0.5: values above the threshold were labeled 1, and values at or below it were labeled 0. An agreement confidence score was also calculated to capture how unanimous the raters' decision was, ranging from 0.5 (total disagreement) to 1 (total agreement). The results can be seen in Table 4. The number of high-confidence binary scores per rubric counts the scores supported by at least 9 of the 12 human raters. High confidence in 75% of the correctness Consensus Labels shows strong agreement among the raters, as do the high percentages for legal citation and clarity and structure: 70% and 82.22%, respectively.
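The consensus procedure can be sketched as follows (a minimal version of the majority vote described above; function and variable names are ours):

```python
def consensus_label(scores):
    """Majority-vote consensus over binary rater scores.
    Returns (label, agreement_confidence, tie_flag)."""
    n, ones = len(scores), sum(scores)
    label = 1 if ones / n > 0.5 else 0        # ties fall to 0 and are flagged
    confidence = max(ones, n - ones) / n      # 0.5 = even split, 1.0 = unanimous
    return label, confidence, ones * 2 == n

print(consensus_label([1] * 9 + [0] * 3))  # → (1, 0.75, False): high confidence
print(consensus_label([1] * 6 + [0] * 6))  # → (0, 0.5, True): flagged tie
```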
The same set of 180 answers used for the human evaluation was also assessed by the LLM judges. To compare human and LLM judgments, we first computed a human Consensus Label (majority vote across the 12 raters) for each answer, and then measured how often each LLM judge agreed with this human consensus. In a separate analysis, a Consensus Label was calculated for the three LLM judges; its accuracy against the human Consensus Label was 57%, revealing low agreement between the combined judges and the human consensus on correctness. A more detailed breakdown of accuracy, shown in Table 5, indicates that for correctness, Judge Gemini was closest to the human evaluators; it is also the most accurate judge for legal citation. For the clarity and structure rubric, agreement was very high, which may be a direct result of both human and LLM evaluators attributing a number of 1s close to the maximum: for this rubric's Consensus Label, 1s account for 97.78% (176/180).
An examination of the LLM judges' agreement with the Consensus Label, broken down by question category, highlights areas where the judge models might struggle, as seen in Figure 7. Monography of Accounting Entries is difficult for all three models, with none surpassing 64% accuracy, while Micro Enterprises yields the highest average accuracy. Judge Gemini remains the closest to the Consensus Label, which encapsulates the majority of the human raters' labels, while Judge Claude provides moderate accuracy consistently across all categories.
The Consensus Label was also used to gain insight into how each answerer model performed in each question category. The 60 questions selected for the human evaluation dataset contain 10 questions from each category. The Consensus Label for the correctness score of each answerer model in each category can be seen in Figure 8. From this analysis, the search-enabled answers of Claude and Gemini were considered incorrect in half the cases, suggesting that these models struggle in this particular area of the financial-accounting domain.
With the help of the confusion matrices (see Appendix C.1, Appendix C.2 and Appendix C.3), we can better understand the strengths and weaknesses of each LLM judge relative to the Consensus Label established from the 12 human experts' evaluations. In the correctness rubric, all three models exhibited a significant False Negative bias, labeling answers considered correct by the majority of the experts as false. Judges Claude and GPT were identical in performance, both missing 84 correct instances while correctly identifying only 67 correct answers. Judge Gemini showed better alignment, capturing 105 True Positives while still producing 46 False Negatives. Although the number of answers the human consensus flagged as false is small (29), the models are highly reliable in labeling those answers as false, with Gemini the only judge to produce any False Positives (5). This suggests that while the model judges are conservative and can be considered harsh, Judge Gemini's performance is the most balanced for the correctness rubric.
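Treating the human Consensus Label as ground truth, the per-judge counts form a standard binary confusion matrix. A minimal sketch of the tally (toy labels; the real counts are in Appendix C):

```python
def confusion(consensus, judge):
    """Confusion matrix of a judge against the human consensus (ground truth)."""
    pairs = list(zip(consensus, judge))
    tp = pairs.count((1, 1))   # both say correct
    fn = pairs.count((1, 0))   # judge misses a correct answer (the dominant error)
    fp = pairs.count((0, 1))   # judge accepts an incorrect answer
    tn = pairs.count((0, 0))   # both say incorrect
    return tp, fn, fp, tn

print(confusion([1, 1, 1, 0, 1, 0], [1, 0, 0, 0, 1, 0]))  # → (2, 2, 0, 2)
```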
The legal citation rubric evaluates whether the legal citations provided in the predicted answers are valid and whether all the cited articles are relevant and match the gold answer. This rubric showed the highest disparity between the judge models and the human judgment. All three judge models labeled most of the actually correct answers as false, with GPT producing the most False Negatives (153), followed by Claude (140) and Gemini (121). Gemini captured the most True Positives (40) among the judges, with True Negatives (18) similar to the other two models (19). While the human raters attributed the fewest false labels on this rubric (19), all three models were able to identify those same answers as false. Notably, the LLM judges treat legal citation more restrictively than the human evaluators do. Given that the answers in this subset were generated with retrieval augmentation, the harshness of the LLM judgments relative to the human experts' highlights the need for a more critical analysis with human oversight.
For the clarity and structure rubric, the dataset leaned heavily toward True labels, making this a test of alignment. All three models performed very well, led by Gemini, which correctly identified 175 of the 176 true-labeled answers. GPT follows closely with 172 True Positives, and Claude sits slightly behind with 166. The judge models show very high agreement with the human Consensus Label on this rubric.
4.3. Human Expert Model Evaluation
Model Preference
Each specialist was asked to choose between two answers presented alongside the text of the ground truth answer for the question. The pairwise evaluation was performed three times to establish whether there is a collective preference for a particular answerer model. Given the comparative nature of the evaluation, the win rate was calculated for each model as the total number of wins divided by the total number of judgments across all questions, with ties counted as 0.5 for both models. GPT-5's answers scored the highest, with a win rate of 40.79%. Gemini (29.81%) and Claude (29.40%) are similar to each other but far from the winner. This suggests either a preference across the specialists for GPT-5's answers, or that Gemini and Claude accumulated many ties against each other.
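The win-rate computation can be sketched as follows (a minimal version with hypothetical model names; each record is one pairwise judgment):

```python
from collections import defaultdict

def win_rates(judgments):
    """judgments: (model_a, model_b, winner) triples, where winner is a
    model name or 'tie'; ties count as 0.5 for both models."""
    wins, games = defaultdict(float), defaultdict(int)
    for a, b, winner in judgments:
        games[a] += 1
        games[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1
    return {m: wins[m] / games[m] for m in games}

sample = [("GPT", "Claude", "GPT"), ("GPT", "Gemini", "tie"), ("Claude", "Gemini", "tie")]
print(win_rates(sample))  # → {'GPT': 0.75, 'Claude': 0.25, 'Gemini': 0.5}
```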
To better understand the individual preferences of the specialists, the win rate was transformed into a weighted score based on the models' positions for each question: first place received three points, second place two, and third place one, per question. Compared to the average Likert correctness score attributed by each human rater, as seen in Table 6, the weighted score reveals differences between raters. They cannot agree on the answers from Gemini, as shown by the low correlation between the two scores and across all raters. With a low average Likert score and an average weighted score of 110, Gemini's answers are regarded as passable, but rarely as the winning answers. GPT's answers hold both the highest average Likert score and the highest average weighted score, 126.5. This is the strongest correlation: when a rater grades GPT highly, they also pick it as the winner. Claude closely follows GPT with an average weighted score of 123.16, showing that it won many comparisons, though its average Likert score does not surpass GPT's. Among the raters there are some outliers: Rater 7 shows a bias toward Claude's answers and attributes the lowest scores to GPT, disagreeing with the rest of the group. Rater 4 behaves similarly, though less harshly toward GPT's answers. Another outlier is Rater 9, who gave average Likert scores of 5 to almost all of the answers. The weighted score remains the more revealing metric because it forces the raters to rank the models. While the Likert scale provides more nuance, the weighted score reduces potential bias toward or against certain answerer models, revealing Gemini's weakness in direct competition.
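The weighted score can be sketched as follows (a minimal version with hypothetical rankings; each entry orders the three models from first to third place for one question):

```python
def weighted_scores(rankings):
    """Sum 3/2/1 points for first/second/third place across questions."""
    points = {}
    for order in rankings:                  # order: models from best to worst
        for place, model in enumerate(order):
            points[model] = points.get(model, 0) + (3 - place)
    return points

sample = [["GPT", "Claude", "Gemini"], ["Claude", "GPT", "Gemini"]]
print(weighted_scores(sample))  # → {'GPT': 5, 'Claude': 5, 'Gemini': 2}
```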
4.4. Search-Enabled Results Using LLM-as-a-Judge
During the answer generation process, the number of times each model used the search tool was recorded. On average, GPT leveraged Tavily the most, with four calls per answer, while Claude used 2.5 calls and Gemini barely used the tool at all, averaging 0.3 calls per answer.
One source [25] shows that another GPT model, GPT-4o-mini, also issued a large number of tool calls, 24.8 on average. In that study, Claude made the second most tool calls, 11.1 on average, and Gemini 4.5. These usage trends are comparably followed throughout the question answering stage of our experimentation. In that study, the earlier GPT model was an outlier, issuing a high number of tool calls while also having the highest error rate, indicating poor tool utilization [25].
To assess the correlation between the number of tool calls and model performance, a Spearman matrix was created; the strength and direction of the relationships between the chosen variables are shown in Figure 9. The values for number_of_searches versus clarity_and_structure_score, legal_citation_score or correctness_score are all close to 0, indicating no relationship between the number of tool calls and any of the performance scores: the number of searches a model performs does not help predict whether the rubric-based scores will be high or low. On the other hand, the same matrix shows a slightly positive relationship between search_active and successful_response, legal_citation_score and correctness_score, suggesting that when one is higher, the other tends to rise as well. Notably, the correlations are quite low due to the large number of responses scored zero. Still, the matrix suggests that enabling the search tool contributes to achieving a correct response.
For Figure 10, a subset of 500 answers from the dataset was employed, consisting of Q&A pairs with at least one successful answer, meaning that at least one of the six possible answers to a question scored 100% across all three rubric metrics. In this matrix there is a strong correlation between the number of successful answers and the legal citation score, whereas correctness alone does not necessarily imply a successful answer. Another important observation is that the ability to use the Tavily Search API does not always translate into its use; however, there is a strong link between the two.
As shown in Table 1, the correctness scores improve for all models when the Tavily Search API is used. In GPT-5's case, the improvement amounts to 16.4% on the correctness rubric, while Claude improves by 14% and Gemini by only 4%. A similar trend holds for the legal citation scores: GPT becomes the best performer on this rubric after enabling search, with the largest improvement of 5.7%, reaching a score of 16.1%. Without search, Gemini scored 13.6%, and the web search tool boosted that score by only 1.6%. An interesting exception to Tavily's tendency to increase scores is Claude's clarity and structure score, which decreases by 2.4%. Here again, Gemini receives a small boost of 0.1% and GPT the largest increase of all models, 1.5%.
4.5. Question Category and Difficulty-Based Analysis for LLM-as-a-Judge
According to the average correctness score for each answerer model, the pattern of attributed values remains roughly the same across question categories. Notably, Gemini-WithSearch, Gemini-NoSearch, GPT-NoSearch and Claude-NoSearch provide better answers for questions from Accounting and Monography of Accounting Entries. Enabling web search allowed Claude to provide better answers in the Income Tax category and GPT to perform better on questions regarding Micro Enterprises. All of the models had a hard time answering VAT and eVAT questions correctly, even though the majority of these questions, over 86%, were classified as Medium (59.8%) or Easy (26.43%). This highlights a significant and uniform weakness in handling this category under Romanian law, in contrast to the remaining categories (see Figure 11).
On average, the answers in the Profit Tax category scored the highest in legal citation, indicating that references in this area are more accessible. The lowest score for the same rubric is in the Other Obligations category, which may be due to the wide variety of issues this category covers (see Figure 12).
The heatmap in Figure 13 shows that GPT-WithSearch provides the most successful answers, with better answering capabilities for Medium and Easy questions. A similar trend appears in Figure 14 for the number of questions answered with a correctness score of 1 (Yes), where GPT-WithSearch again performed best, receiving the largest number of affirmative labels. For both metrics, answers to Medium-level questions score better than answers to Easy ones, which is an unforeseen result. In total, 2681 of the 6270 answers were labeled Yes (1) on the correctness rubric across difficulty levels (see Figure 14). Conversely, a more expected outcome appears when comparing Easy with Hard: the number of successful and accurate responses drops, as question and answer pairs at the hardest difficulty level expose the models' weaknesses and yield worse results.
4.6. Cost Analysis
For a successful response, where all three rubric metrics score 1 based on the LLM-as-a-Judge evaluation, the average cost is $0.2252. From the perspective of average query cost versus correctness percentage (see Figure 15), GPT-WithSearch ranks on top compared to the other models. The worst cost-to-correctness ratio belongs to Claude-NoSearch, with the search-enabled version improving on both dimensions. Interestingly, enabling the search tool did not substantially improve Gemini's results on either dimension, leaving it ranked similarly in the bidimensional analysis. The results achieved by GPT and Claude in this comparative analysis further corroborate existing research on the cost–accuracy trade-off when deploying LLMs and other predictive systems [26].
As shown in Figure 15, the human answerers require more time and thus a higher average query cost; however, they provide better results. Once again, Gemini's performance improves little when given a search tool and more time. While requiring the least time, Claude-NoSearch provides the fewest correct answers; correctness improves when search is enabled, with a corresponding increase in time, though still below 150 s on average. GPT provides higher correctness scores but, conversely, requires more time to answer.
Considering the human evaluations, more precisely the Consensus Label, an average cost per correct query was calculated for each rubric. A correct answer costs on average $0.57 with GPT-5, $0.33 with Claude and $0.09 with Gemini. For an answer with correct legal citations, GPT remains the most expensive at $0.58, followed by Claude at $0.32 and Gemini at $0.08. The same ranking holds for obtaining a clear and structured answer.
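Such figures follow from dividing total spend by the number of answers the human consensus deems correct. A minimal sketch (hypothetical per-query costs and labels):

```python
def cost_per_correct(costs_usd, consensus_labels):
    """Average spend per answer labeled correct by the human consensus."""
    return sum(costs_usd) / sum(consensus_labels)

# e.g. three queries at varying cost, two judged correct
print(cost_per_correct([0.25, 0.25, 0.5], [1, 0, 1]))  # → 0.5
```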
5. Discussion
This research builds on previous studies in this domain, which are reviewed in Section 2 (Literature Review). The research team contributed a new dataset focused on the Romanian financial domain and its regulations.
Concerning the potential bias the evaluator models might have toward their own answers, the average scores given to the answerer models by each evaluator were calculated and plotted. The distribution of evaluations by answerer model and evaluator model can be found in Figure 16. A noticeable tendency of Judge Gemini to attribute better scores to all the evaluated models, without favoring itself, can be seen in Figure 4. All three judges evaluated GPT's answers with higher correctness scores, while Answerer Claude and Answerer Gemini received similar correctness scores. Given the almost equal distribution across judges for the answerer models, we conclude that there is no detectable self-bias beyond acceptable limits in the models' evaluations.
The research team's first hypothesis was that Claude would outperform the other two models, based on its results in daily use. Not only does it fail to achieve the best performance, but in several instances it ranks as the lowest scorer. In the human evaluations, Claude with search enabled ranks second in average Likert score. Based on the Consensus Label derived from the specialists' scores on a subset of the answers provided by the models with access to Tavily, Claude performs the worst of the three models, as seen in Figure 7. The team's expectation may be explained by verbosity bias, combined with the model's performance on daily tasks that do not require knowledge of the financial-accounting domain.
The second research hypothesis stated that ChatGPT (GPT-5) would incur the lowest expenditure because it has the lowest cost per token. When the Tavily Search API is enabled, all models produce longer answers, which the judge models must then process. GPT-WithSearch uses the most input tokens for answer generation, 114,300 on average per query, and the most output tokens, 39,797 on average. Another important factor in disproving this hypothesis, which also explains the large token usage, is that GPT makes the most tool calls.
Another hypothesis was that tool-enabled Retrieval Augmented Generation would produce better legal citation results. The Spearman correlation matrix for tool calls shows no relationship between the number of tool calls and the legal citation score, so increased tool use does not equate to better results. At the same time, Figure 17 shows an improvement in the legal citation score when the search tool is used. This is also consistent with the large number of legislative changes in Romania's financial-accounting field in 2025, which further underlines the need for advanced search and analysis tools provided by artificial intelligence models. Both findings nuance the hypothesis: performance is independent of the number of tool calls, but using the search tool at all improves it.
The disparity between human and LLM evaluations for correctness and legal citation brings verbosity bias into the discussion. This bias relates to the presentation of the answer and the tendency of a human (or model-based) judge to favor longer and more structured responses [14,38]. Human judges often prefer different textual properties—some, for example, look for a concise answer rather than a detailed one—which directly impacts the overall assessment. This was the case for Raters 4 and 7, the outliers of the group due to a suspected preference for the answers of one model. The evaluation rubrics attempted to separate the assessment into distinct dimensions, yet they revealed discrepancies between individual raters.
Based on the human evaluation of the answers provided by the three models with access to the Tavily Search API, the average scores across all three rubrics suggest that GPT may be better suited for the question answering task, even in the financial-accounting domain. The other two models are close contenders, scoring higher than GPT in clarity and structure. An important variable here is the average number of tool calls: GPT used four calls on average, followed by Claude with 2.5 and Gemini with 0.3. Considering its much lower number of calls, Gemini's results are remarkable.
Calculating the Consensus Label for each rubric and answerer model for each question allowed a better evaluation of the scores attributed by the LLMs-as-Judges. It revealed that for the correctness and legal citation rubrics, Gemini is the most accurate, agreeing with the evaluations of most of the human raters. It is also important to note that the LLMs attributed the fewest Yes labels (scores of 1) on these two rubrics for all models. For the rubric with the highest average scores and most 1s attributed, GPT is closest to the consensus, especially considering it had the highest number of answers to evaluate () of all the judges.
One of the main goals of this research was to establish the capability of Large Language Models to answer domain-specific questions on finance and accounting. Another relevant contribution of this study is the creation of a dataset relevant to the Romanian landscape for this domain. The dataset includes questions from multiple areas of interest: VAT and electronic VAT, Accounting and Monography of Accounting Entries, Income and Profit Tax, Micro Enterprises, and Other Obligations. It also contains responses from several models: one set was generated by Claude 4.5 Sonnet from Anthropic, Gemini 2.5 Pro from Google and GPT-5 from OpenAI, and another set was produced by the same models using a retrieval tool that enables search, the Tavily Search API. In the initial assessment, open-source models were included and analyzed using the G-Eval framework.
The limitations of this research are, first, the size of the subset validated by specialists. While only 5.74% () of the 1045 questions were taken into consideration, each question had three answers to evaluate, resulting in a total of 180 validations. Second, financial limitations allowed each answer to be evaluated by only one model, assigned at random to mitigate potential self-bias.
Future directions for this research include involving human evaluators in the preliminary evaluation of the models' answers, and evaluating the judges themselves in order to establish the most reliable LLM judge.
Bearing in mind that all human raters attributed, on average, high Likert correctness scores to GPT-5's search-enabled answers, the findings suggest this as the model of choice for applications requiring knowledge of the financial-accounting domain. GPT-5's better results come with a higher bill, requiring on average $0.58 per query judged correct by the majority of human reviewers.
Regarding scalability, while high-load stress testing was outside the scope of this study, the benchmark shows efficiency gains in time comparisons. The models answered significantly faster than human specialists (see Figure 15), suggesting that adoption at scale should focus on managing API rate limits and operational costs when handling many simultaneous queries. In a similar study, an earlier OpenAI model, o3, took 3.1 min and cost $3.78 per query, achieving only 46.8% accuracy [26]. In our case, GPT-5, also an OpenAI model, reached the highest average correctness rate of 52.82%, requiring, on average, 393.43 s per question, as shown in Figure 15. Another important topic discussed was the use of LLMs as judges and the trust that can be placed in such evaluations in similar research settings. Other studies highlight the potential of LLM judges due to their scalability and reproducibility [14]. Our findings show that while human expert evaluation is time-consuming and LLM-based evaluation is more efficient in both cost and time, the models do not correlate strongly with humans when evaluating question-answer pairs. The highest similarity between a model's judgments and the Consensus Label derived from the human experts' labels was obtained by Gemini, at 68.33%. This indicates that human oversight remains necessary, despite the cost and time benefits of using LLMs as judges.
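The 68.33% similarity figure is an agreement rate against a majority-vote consensus over the human experts' labels. A minimal sketch of that computation, with toy labels rather than the study's data:

```python
from collections import Counter


def consensus_label(human_labels: list[str]) -> str:
    """Majority vote over the human experts' labels for one item
    (ties broken by first-seen order)."""
    return Counter(human_labels).most_common(1)[0][0]


def agreement_rate(judge_labels: list[str], consensus: list[str]) -> float:
    """Fraction of items where the LLM judge matches the consensus."""
    matches = sum(j == c for j, c in zip(judge_labels, consensus))
    return matches / len(consensus)


# Toy example: three items, three human labels each.
consensus = [consensus_label(h) for h in [
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "correct"],
    ["correct", "correct", "correct"],
]]
rate = agreement_rate(["correct", "correct", "correct"], consensus)
```

Raw agreement is easy to interpret but does not correct for chance; a chance-corrected statistic such as Cohen's kappa would be a natural complement in the multi-judge analyses proposed below.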
To ensure robustness, the benchmark used a dataset curated by domain experts, ensuring diversity across the three main areas. In accordance with the framework [7], the benchmark does not merely label an answer correct or incorrect; it captures nuances regarding legal citations and clarity, thus validating the logic and integrity of the answers provided. The LLM-as-a-Judge evaluation in our article relies on structured rubrics, a mechanism found in robust evaluation frameworks [18]. While the models were not tested on incomplete or disordered data, their ability to correctly interpret the questions indicates an understanding of specific terminology and complex legal texts. In the business context, the legal dimension is a main area of focus, and incorrect facts present a real risk. To mitigate the risk of hallucination, the models were allowed to use an external tool, Tavily, to search the web and thus augment their internal knowledge with recent results. Related work suggests preventing hallucination through Retrieval Augmented Generation (RAG) and vector engines, noting that a robust system must also be tested on its ability to anchor answers in the data these tools provide. In our study, the use of Tavily improved both the number of answers found correct by the LLM judges, as seen in Figure 11, and the number of answers labeled true for all three rubrics, as seen in Figure 13. Finally, regarding security, this benchmark uses general queries that do not require sensitive or personal information. However, since data privacy is of great importance in enterprise deployments, a practical implementation of these models might require personnel training or the adoption of open-source models that can be hosted locally, ensuring that business data does not leave the secure enterprise infrastructure.
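The "true for all three rubrics" count referenced above can be sketched as a conjunction over the three rubric verdicts. The data structure below is an illustration of the scheme, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Per-answer verdicts for the three evaluation rubrics."""
    correctness: bool
    legal_citation: bool
    clarity: bool

    def all_true(self) -> bool:
        """An answer passes only if every rubric verdict is true."""
        return self.correctness and self.legal_citation and self.clarity


# Toy scores for three answers; the second is factually correct and
# clear but cites a weak legal basis, so it fails the conjunction.
scores = [
    RubricScore(True, True, True),
    RubricScore(True, False, True),
    RubricScore(True, True, True),
]
n_all_true = sum(s.all_true() for s in scores)
```

Counting the strict conjunction separately from per-rubric rates is what surfaces legal citation quality as the weak link: an answer can be correct and clear yet still fail the combined criterion.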
6. Conclusions
This paper introduced RO-FIN-LLM, a Romania-specific benchmark for regulatory question answering in taxation and financial accounting. We evaluated state-of-the-art LLMs in closed-book and retrieval-augmented (RAG) settings using a rubric-based evaluation (correctness, legal citation quality, and clarity/structure) and validated a stratified subset with specialist human raters.
Our results show that retrieval augmentation substantially improves correctness, but legal citation quality remains a key weakness across models, reinforcing the need for careful evidence handling and human oversight in compliance-oriented deployments. The judge models exhibit task-specific variance, and while they are well-calibrated for qualitative assessments of clarity and structure, their high false-negative rates in correctness and legal citation rubrics suggest they are not yet a substitute for human oversight in legally rigorous domains.
We emphasize that RO-FIN-LLM is intended as a first public benchmark with limitations and should be viewed as an extensible foundation rather than a definitive standard.
Limitations and future work. The current evaluation uses a single LLM judge per answer (a randomly allocated judge, except for the subset used for human validation) to control cost, and the human validation set is limited (60 questions, or 180 Q&A pairs). In future work, we plan to (i) increase human validation coverage, (ii) adopt multi-judge evaluation (e.g., cross-judging or ensembles) and report judge agreement/sensitivity analyses on the whole dataset, (iii) report uncertainty for model comparisons via paired bootstrap confidence intervals, and (iv) strengthen evidence fidelity by logging retrieved passages/URLs and analyzing error modes (outdated legal basis, mismatched applicability, and incorrect effective dates).
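The paired bootstrap proposed in point (iii) resamples question indices with replacement and recomputes the mean difference between two models on each resample, which respects the pairing of answers to the same questions. A minimal sketch with hypothetical binary correctness scores:

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000,
                        alpha=0.05, seed=0):
    """Percentile confidence interval for the mean score difference
    (A - B) over the same questions, resampling question indices
    with replacement so the pairing is preserved."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        d = sum(scores_a[i] - scores_b[i] for i in idx) / n
        diffs.append(d)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical binary correctness scores on the same 10 questions.
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
lo, hi = paired_bootstrap_ci(a, b)
```

If the interval excludes zero, the difference between the two models is unlikely to be an artifact of which questions happened to be sampled; with only 10 toy items the interval is naturally wide.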
Reproducibility. To support independent reproduction and extension, we provide a release plan for the benchmark schema, prompts, evaluation scripts, and RAG logs (see Data Availability). OpenAI's model variant is gpt-5-2025-08-07, with tier 4 usage (RPM: 10k, TPM: 4M, Batch Queue Limit: 200M), from the default endpoint (api.openai.com). The Google Gemini model variant is gemini-2.5-pro (no snapshotting), released on 17 June 2025, with rate limits of 150 req/min, 2M tokens/min, and 10k req/day, from the Google AI Studio endpoint (generativelanguage.googleapis.com). Anthropic's Claude model variant is claude-sonnet-4-5-20250929, with tier 4 usage (4k req/min, 2M input tokens/min, 400k output tokens/min), from the api.anthropic.com endpoint. The machine configuration is: CPU—AMD Ryzen 9 5950X (16-core/32-thread, Zen 3, up to 5.086 GHz boost); GPU—2× NVIDIA GeForce RTX 3090 (24 GB VRAM each); RAM—128 GB DDR4; Storage—2 TB NVMe SSD (WD Black SN850X) + 12 TB HDD; OS—Pop!_OS 22.04 LTS (Ubuntu-based); Kernel—6.16.3; NVIDIA Driver—580.82.09.