1. Introduction
The digital transformation of the tax and accounting fields is accelerating the adoption of algorithmic technologies capable of processing large volumes of regulations and providing preliminary interpretations of tax obligations. Modern tax systems are characterized by structural complexity, frequent legislative changes, and interdependencies between primary rules, secondary rules, and administrative instructions. This complexity leads to increased compliance costs and risk of error for both taxpayers and professionals [
1].
In accounting, research highlights the usefulness of LLMs in explaining concepts, analyzing financial notes, or generating coherent accounting justifications [
2]. LLMs excel at natural language processing and data analysis tasks, demonstrating their potential for adoption in the financial and accounting functions of many enterprises. With increased computational power and improved algorithms in their latest versions, commercial LLMs have demonstrated noteworthy capabilities in understanding complex contexts, answering questions, and writing content [
3].
These capabilities position LLMs as increasingly relevant for the financial domain [
3]. Finance is a complex, highly specialized field that requires data analysis, prediction, and decision-making [
3]. Current applications of LLMs include the automation of tasks such as financial report generation, market trend forecasting, investor sentiment analysis, and the provision of personalized financial advice [
3].
Research questions. To make the study objectives explicit, we investigate the following research questions:
RQ1: How accurately do current LLMs answer Romania-specific taxation and accounting questions in a closed-book setting?
RQ2: What is the impact of retrieval augmentation (web search + summarization) on correctness and legal citation quality?
RQ3: How well do LLM-as-a-Judge rubric scores align with specialist human evaluations on a stratified subset of the benchmark?
We present RO-FIN-LLM as a first public benchmark and an extensible foundation rather than a definitive standard, with documented limitations and a clear expansion roadmap.
Paper structure. Section 2 (Literature Review) summarizes prior work on LLM-as-a-Judge evaluation and financial-domain benchmarks.
Section 3 describes the dataset, models, prompts, and evaluation protocol (including human validation).
Section 4 presents results, and
Section 5 discusses implications and limitations.
Section 6 concludes and outlines future work.
1.1. Motivation
The growing adoption of LLMs in enterprise and financial workflows raises the crucial question of assessing their abilities and identifying the most suitable tasks for automation [
3]. However, the existing evaluation landscape highlights a significant gap: while many benchmarks exist, most rely on English datasets [
4]. There is a particular lack of jurisdiction-specific evaluations covering Romanian practice in the financial, accounting, and business domains. Because finance is a high-stakes domain, reliable artificial intelligence (AI) is essential, and introducing a Romanian financial benchmark directly addresses this need for domain-specific evaluation. In addition, the Romanian financial-accounting system is characterized by a high degree of legislative dynamism, and the frequent changes in the regulatory framework reinforce the importance of obtaining valid and efficient responses from AI models.
Given the diversity of businesses in size, workforce structure, and industry, a significant challenge for Enterprise Systems is identifying the applicable regulatory frameworks and navigating the complex legal landscape. Enterprise Resource Planning (ERP) systems have seen convergence with AI to enhance cost optimization and operational efficiency [
5]. Incorporating AI in ERP workflows enables firms to automate repetitive tasks, provide insights, and improve decision-making agility [
6]. Large Language Models (LLMs) show promise in processing legal text and interpreting context to provide insights based on scenario-specific questions. Generative AI assists in summarizing ERP documents for quickly understanding key points and for making informed decisions [
7]. Leveraging these models allows non-technical ERP users to solve business tasks through natural language, without requiring extensive technical expertise [
7].
By providing detailed financial analysis, automating tasks and improving customer relations, LLMs are demonstrating potential to transform services provided in this domain [
4]. For instance, Generative AI is used to manage tax jurisdictions within ERP systems by identifying missing compliance configurations and generating suggestions to resolve them, ultimately reducing the manual workload for tax accountants [
7]. Customer feedback on GenAI-assisted applications shows that, for United States tax and sales configurations, the effort required is reduced by 50% to 90% in time and resources [
7]. GPT-based solutions such as JPMorgan’s COiN program and BloombergGPT are also transforming financial operations by enabling contract evaluation and financial report generation [
8]. Other institutions employ similar solutions to create personalized client insights, detect fraud by analyzing behavioral trends and provide investment consultation [
8]. Nevertheless, applying LLMs in this high-stakes sector requires validation to mitigate the risk of hallucinations, ensuring accuracy and reliability in decision-making tasks [
4].
The integration of these models into Decision Support Systems (DSSs) presents an opportunity to enhance Business Intelligence (BI) capabilities. This benchmark assesses the performance of LLMs across the three aforementioned areas of Taxation, Financial Accounting and Management, and HR and Governance. By leveraging this technology, personnel can quickly access specialized knowledge, gaining a partner for complex tasks. In a fast-paced business environment, responsiveness and resilience can be improved by implementing AI-enabled DSSs, which provide real-time recommendations and insights [
9]. This reduces the time spent on legal questions, allowing the financial accounting specialist to shift their focus from information retrieval to planning or other relevant tasks. While implementation strategies vary, using these models in conversational interfaces such as chatbots represents the most intuitive approach for decision assistance. An example is the possibility of employing natural-language queries rather than writing complex SQL scripts, with the key advantage being easier navigation and functionality access [
10]. LLM-powered NLP interfaces have been shown to significantly improve operational efficiency and reporting accuracy in financial and accounting workflows [
11]. The collaboration of human experts with AI agents ensures that the business can leverage insights from provided data with higher efficiency and accuracy. While machines take on repetitive and data-intensive tasks, the human experts will provide strategic insight, creativity and ethical reasoning [
12].
1.2. Goal
The goal of this research is to establish the capabilities of existing AI models to answer questions in the Romanian tax, accounting, and financial domains. The foundation of this study is the RO-FIN-LLM benchmark dataset, which contains 1045 questions created by domain experts and reflecting authentic, real-life scenarios they have encountered. The models tested in this investigation were selected based on their popularity, their position in relevant benchmarks, or their status as open-source models. The models evaluated include: GPT-5, GPT-5 mini, and GPT-OSS from OpenAI; DeepSeek R1 70b; Qwen 3 32b from Alibaba; Claude Opus 4.1 and Sonnet 4.5 from Anthropic; Gemini 2.5 Flash and 2.5 Pro from Google; Mistral Small 3; and GLM 4.5 Air. Notably, Mistral Small 3.2 (an earlier version) was included under the hypothesis that a European LLM might perform better in the Romanian context, potentially offering specialized European expertise for comparison. After a preliminary performance analysis, the three top contenders remained GPT-5, Claude 4.5 Sonnet, and Gemini 2.5 Pro.
1.3. Research Contributions
This research aims to introduce a benchmark for the evaluation of LLMs on Romania-specific regulatory question answering in two core areas:
Taxation (e.g., VAT regimes, micro/profit tax, dividends).
Financial accounting under Romanian regulation (e.g., postings, amortization, provisions, foreign exchange (FX)).
The methodology utilizes a rubric-based evaluation, while employing the LLM-as-a-Judge approach. This technique is essential because traditional reference-based metrics (like BLEU and ROUGE) have limited correlation with human evaluations for generative tasks. Using the LLM-as-a-Judge approach allows for multi-dimensional assessment separated into distinct rubrics that align with relevant points derived from expert solutions. Key aspects assessed by the rubrics include:
Correctness/Mathematical Reasoning: Essential for verifying that content is factually and mathematically accurate, as mathematical reasoning and calculations are crucial components of financial questions.
Legal Citation Evaluation: Specifically designed to ensure answers adhere to Romanian law and use current versions of all relevant statutes and legal acts, while also mitigating hallucinations by systematically detecting legally unsupported responses.
Clarity and structure: Essential for measuring how organized and easy to understand a response is and whether its structure facilitates comprehension by non-expert readers.
This process ensures thorough coverage of all important answer aspects, maintaining high standards of factual accuracy, completeness, and clarity. Furthermore, the dataset includes timing benchmarks that allow the LLMs’ response latency and completeness to be evaluated against human expert performance. The final step of this methodology validates the generated answers against the expected answers provided alongside each question, leveraging this structured, multi-dimensional LLM-as-a-Judge assessment.
The answers provided by the selected LLMs were also evaluated by 12 specialists with over 10 years of experience in the financial-accounting domain. Due to time and resource constraints, the human evaluation covered a subset of the dataset: the answers to 60 questions, equally covering the six areas of the Romanian financial and business regulation domain. The specialists provided a Yes or No label for each rubric for each answer in this subset. After assessing the LLMs’ question-answering capabilities, the answers from only three models were selected for human evaluation, resulting in 180 answers. In addition to these labels, the professionals were asked to state their preference in pairwise comparisons between the three models’ answers for each question in the subset. Furthermore, to capture the nuances behind the Yes and No labels for the correctness rubric, the raters also scored each answer on a Likert scale (1–5). For a visual representation of the question answering and evaluation processes, refer to
Figure 1.
3. Materials and Methods
This study details an evaluation benchmark designed to assess the performance of Large Language Models (LLMs) in a financial-accounting question answering task. The materials comprise a set of human-generated questions and their corresponding answers. To establish a baseline for difficulty and response time, human participants were asked to estimate the time and difficulty required to produce each answer; the attributed values were reviewed and validated by the assessors to mitigate potential bias. As a first method, a preliminary evaluation prompted a range of proprietary and open-source answerer models to answer the questions from the dataset. An initial screening of a small response sample was conducted using the G-Eval framework [
13]. The best performing models, namely GPT-5, Claude 4.5 Sonnet, and Gemini 2.5 Pro, were selected as the main contenders for the full evaluation process and were also used as AI Judges for the final assessment phase.
3.1. Evaluation Dataset
The RO-FIN-LLM benchmark comprises 1045 questions created and validated by experts in Romanian taxation, accounting standards, and business management practices, each with more than 10 years of experience in the field. Each question in the dataset includes a human expert-generated answer, a difficulty classification, a category assignment, and the time required by the aforementioned specialists to formulate a complete response with appropriate legal references. Because the dataset was curated by domain experts, it captures authentic, real-life scenarios that these professionals have encountered throughout their careers. The question categories reflect six critical areas of Romanian financial and business regulation. The largest representation comes from VAT and eVAT questions (25.64%), followed by Accounting and Monography of Accounting Entries (20.47%). Questions regarding Income Tax constitute 17.22%, those concerning Profit Tax account for 16.26%, those addressing Micro Enterprises comprise 8.32%, and Other Obligations make up 12.05%.
Each question was assigned a difficulty level based on the legal complexity, reasoning depth, and specificity of regulatory knowledge needed to answer it. From this perspective, the dataset contains 37.4% Easy, 49.2% Medium, and 13.3% Hard questions. This distribution is a direct result of real-world consultation experience, ensuring a robust evaluation across complexity levels with emphasis on moderately challenging scenarios. The recorded answer times represent the time specialists required to provide a comprehensive response, including research, citations, and documentation of relevant legal sources. Most questions were answered within 10 min or 5 min, indicating relatively straightforward scenarios with clear regulatory guidance. A substantial portion required 15 min (159 questions), reflecting moderate complexity and cross-referencing of multiple provisions. More complex questions demanded 20, 30, or up to 60 min, typically involving multi-faceted scenarios, ambiguous provisions, or calculations requiring detailed documentation. These timing benchmarks provide context for evaluating the LLMs’ response latency and completeness against human expert performance.
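As a sanity check on the figures above, the reported category percentages map back onto the 1045 questions as integer counts; the small sketch below is purely illustrative (the counts are derived by rounding, not restated from the source data):

```python
# Illustrative sketch: derive approximate per-category question counts
# from the percentages reported for the 1045-question dataset.
TOTAL_QUESTIONS = 1045

category_shares = {
    "VAT and eVAT": 25.64,
    "Accounting and Monography of Accounting Entries": 20.47,
    "Income Tax": 17.22,
    "Profit Tax": 16.26,
    "Micro Enterprises": 8.32,
    "Other Obligations": 12.05,
}

approx_counts = {
    name: round(share / 100 * TOTAL_QUESTIONS)
    for name, share in category_shares.items()
}

# The rounded counts add back up to the full dataset size.
assert sum(approx_counts.values()) == TOTAL_QUESTIONS
```

That the rounded counts sum exactly to 1045 indicates the reported percentages are mutually consistent.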
3.2. Evaluated Models
The purpose of this investigation is to establish the capabilities of existing AI models. The models tested are: GPT-5, GPT-5 mini, and GPT-OSS from OpenAI; DeepSeek R1 70b; Qwen 3 32b from Alibaba; Claude Opus 4.1 and Sonnet 4.5 from Anthropic; Gemini 2.5 Flash and 2.5 Pro from Google; Mistral Small 3; and GLM 4.5 Air. Considering the variety of solutions, the starting point was an article that discusses a benchmark in the financial sector [
Further investigation identified finance-sector benchmarks published by some of the model developers on their websites [28]. Some of these benchmarks are academic, while the most relevant ones here are proprietary, such as TaxEval and Finance Agent. The TaxEval benchmark evaluates a model’s ability to answer hard tax-related questions, focusing on both answer correctness and structured reasoning capabilities. The other relevant benchmark is Finance Agent [
29], whose dataset contains questions that cover quantitative and qualitative retrieval, numerical reasoning, pattern analyzing, financial modelling and market analysis.
The tested models were selected on the basis of their popularity, their position in these benchmarks, or their open-source availability. In the latest benchmarks available at the time of writing, the chosen models rank among the top 20 in accuracy, with a few exceptions: Mistral Small 3.1, an earlier version, is ranked lower on both benchmarks. Despite this lower ranking, Mistral Small 3.2 was chosen under the supposition that a European LLM might perform better in the Romanian scenario, thereby bringing European expertise to the comparison. Another important criterion was open-source availability. The open-source models, whose weights are publicly released under permissive licenses (MIT or Apache 2.0), are: OpenAI’s GPT-OSS, DeepSeek’s R1, Alibaba’s Qwen 3, Mistral Small 3, and Zhipu AI’s GLM-4.5 Air. Their weights can be downloaded, used, modified, and deployed locally or commercially.
The remaining six models are the latest versions available from popular AI research and development companies: OpenAI’s GPT-5 and GPT-5 mini, Anthropic’s Claude Opus 4.1 and Claude Sonnet 4.5, and Google’s Gemini 2.5 Flash Thinking and 2.5 Pro Thinking were likewise used in the experiments to establish their potential.
3.3. Answer Generation Prompt
The prompt used to answer the questions in this benchmark follows prompt engineering principles, ensuring that the interaction with the model yields useful answers. These principles are applied by providing a persona and explicit instructions, and by writing clear phrases and questions. The prompt was first developed in English, and Claude was used to check its clarity of expression. It was then translated into Romanian and adapted into two versions: one for zero-shot prompting and one that allows the use of an online search tool.
The prompt whose aim was to test the capabilities of the AI models and their knowledge of the Romanian financial sector can be found in
Appendix B.1.
Appendix B.2 contains the instructions that give the LLMs access to a web search tool. The limit is six search calls; this number was chosen so that the whole answer generation process costs less than $1.
To enable Retrieval Augmented Generation (RAG), an online search engine API named Tavily Search API [
30] is used, which allows a plug-and-play setup to seamlessly integrate into the existing application. It is generally acknowledged as a search engine optimized for LLMs and RAG [
31]. Its advantage in the general context includes providing high-quality, LLM-generated summaries synthesized from retrieved results, which contributes to a higher resolution rate on the GAIA benchmark compared to other general search APIs [
32,
33]. While similar products, namely Brave and Exa, achieve comparable resolution rates on the aforementioned benchmark and choose which webpages to open based on search snippets, Tavily and Exa additionally provide higher-quality LLM-generated summaries. To mitigate potential hallucinations, Tavily offers a single LLM-generated answer per query, synthesized across all search results, which is more likely to be accurate than individual page summaries.
This API allows the users to specify the country and prioritizes the content from said country in the search results [
30]. This feature is only available for the ‘general’ search topic, which implies a broader search area while still maintaining relevance to the Romanian context. As another article [
31] also mentions, the aforementioned API does not require the development team to crawl the web pages manually, which consequently allows for the focus to be on the various experiments.
According to documentation [
30], the Tavily Search API uses proprietary AI to rank the most relevant sources and content for the provided query. The response comes in JSON form and includes a ‘raw_content’ property containing the scraped content of each URL it returns. This allows the research to also assess each model’s capability to process raw content, rather than relying solely on Tavily’s own summarization tool.
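As a minimal illustration, extracting the ‘raw_content’ of each result from a Tavily-style response might look as follows; the exact response shape shown here is a simplified assumption based on the fields named above, so the authoritative schema should be taken from the Tavily documentation:

```python
# Minimal sketch: pull the scraped page text ("raw_content") out of a
# Tavily-style search response. The response structure below is a
# simplified assumption, not the full documented schema.
from typing import Any

def extract_raw_contents(response: dict[str, Any]) -> list[str]:
    """Return the non-empty raw_content strings from each search result."""
    results = response.get("results", [])
    return [r["raw_content"] for r in results if r.get("raw_content")]

# Example with a mocked response (URLs and content are placeholders):
mock_response = {
    "query": "cota TVA Romania 2025",
    "results": [
        {"url": "https://example.ro/lege", "raw_content": "Art. 291 ..."},
        {"url": "https://example.ro/ghid", "raw_content": None},
    ],
}
print(extract_raw_contents(mock_response))  # ['Art. 291 ...']
```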
The raw content provided by the Tavily search is then summarized by the same answerer model, using the prompt in Appendix B.3, to maintain continuity and limit the bias toward processing already-summarized content; this process can be seen in
Figure 2.
The summarization prompt explicitly requires that the result be focused on the main question, thus ensuring the provided information is relevant. As a consequence, the competence of the generative AI tools is tested for the entire process until the final answer is generated.
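This end-to-end flow, from search through focused summarization to the final answer, can be sketched as below; the helper names, the callable interfaces, and the wiring of the six-call budget are illustrative assumptions rather than the actual implementation:

```python
# Illustrative pipeline sketch (hypothetical helpers): answer one question
# with at most MAX_SEARCH_CALLS web searches, summarizing raw page content
# with the same model before composing the final answer.
from typing import Callable, Optional

MAX_SEARCH_CALLS = 6  # search budget, chosen to keep per-run cost low

def answer_with_search(
    question: str,
    propose_query: Callable[[str, list[str]], Optional[str]],  # model picks next query
    search: Callable[[str], list[str]],         # query -> raw page contents
    summarize: Callable[[str, str], str],       # (question, raw text) -> focused summary
    answer: Callable[[str, list[str]], str],    # (question, summaries) -> final answer
) -> str:
    summaries: list[str] = []
    for _ in range(MAX_SEARCH_CALLS):
        query = propose_query(question, summaries)
        if query is None:  # the model decides it has enough context
            break
        for page in search(query):
            # Summaries stay focused on the main question to limit drift.
            summaries.append(summarize(question, page))
    return answer(question, summaries)
```

The design keeps summarization question-focused, matching the requirement of the summarization prompt described above.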
3.4. Preliminary LLM Evaluation
Recent findings in financial and agentic benchmarks [
26] show that metrics such as G-Eval can be used for preliminary evaluation. In complex, real-world financial tasks, proprietary LLMs often establish a baseline for performance [
26]. OpenAI’s o3 achieved an accuracy of 46.8%, followed closely by Claude 3.7 Sonnet at 45.9%. In contrast, open-source models such as LLaMA 4 Maverick and Mistral Small 3.1 demonstrated weaker capabilities, reaching accuracy scores of only 3.1% and 10.8%, respectively [
26].
Furthermore, evaluation on financial knowledge benchmarks such as IDEA-FinBench shows that proprietary models like GPT-4 exhibit exceptional performance when tested on a CFA Level II (CFA-L2) exam [
4]. This exam consists of multiple-choice questions, covering diverse topics, including Financial Statement Analysis, Fixed Income, Economics, etc. [
34].
G-Eval is a framework designed to use Large Language Models with chain-of-thought and a form-filling paradigm, to assess the quality of Natural Language Generation (NLG) outputs [
13]. It addresses the limitations of conventional metrics, such as BLEU or ROUGE, which have a low correlation with human judgment, especially on open-ended questions [
13]. As seen in
Figure 3, the evaluated models, namely GLM, DeepSeek, Qwen, GPT-5-mini, GPT-OSS, and Mistral, scored lower on reasoning and factual accuracy, dimensions similar to the correctness rubric proposed below. GPT-5-mini’s comparatively promising results corroborate the strong performance of OpenAI’s GPT models.
In addition to the G-Eval evaluation, a small batch of answers from GLM, DeepSeek, Qwen, and Mistral was examined. To ensure this evaluation provided relevant information, the predicted answers were drawn from the search-enabled experiments. Based on these results, the research team decided not to move forward with the analysis of open-source models and to focus on the capabilities of OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5, and Google’s Gemini 2.5 Pro.
3.5. Rubric-Based Evaluation
The detailed evaluation of the answers from the remaining models, namely Claude Sonnet 4.5, Gemini 2.5 Pro and ChatGPT 5, relies on specific criteria to assess the quality of their generated answers in a more structured manner. Moving beyond accuracy metrics to a rubric-based evaluation allows the judgment to capture the nuanced performance in domains such as the law or finance, thus aligning with evaluation methodologies like G-Eval and LLM-Rubric [
13,
18]. Under the LLM-as-a-Judge approach, the assessment is separated into distinct rubrics that align with relevant points derived from expert solutions, allowing a multi-dimensional evaluation. This process ensures thorough coverage of all important aspects of the response, maintaining high standards of factual accuracy, completeness, and clarity [
26]. The criteria chosen for this article were composed with human evaluators in mind as well. All evaluators were provided a brief description of what each rubric entails, as detailed in the subsections below.
For each answer provided to the 1045 questions by the six models, one judge was allocated for evaluation. The judge model was selected at random to use financial resources economically.
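This allocation step amounts to a seeded random assignment; the helper below is a hypothetical sketch (only the three judge model names come from this study):

```python
# Minimal sketch (hypothetical helper): assign exactly one randomly chosen
# judge model to each generated answer, seeded for reproducibility.
import random

JUDGES = ["claude-sonnet-4.5", "gemini-2.5-pro", "gpt-5"]

def allocate_judges(answer_ids: list[str], seed: int = 42) -> dict[str, str]:
    rng = random.Random(seed)  # fixed seed so the allocation can be replayed
    return {answer_id: rng.choice(JUDGES) for answer_id in answer_ids}

# 1045 questions x 6 answerer models = 6270 answers to judge.
assignment = allocate_judges([f"ans-{i}" for i in range(6270)])
assert len(assignment) == 6270
assert set(assignment.values()) <= set(JUDGES)
```

Seeding the generator makes the single-judge-per-answer allocation reproducible, which matters when the goal is to economize on judging costs without biasing which judge sees which answer.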
3.5.1. Clarity and Structure Evaluation
Most of the articles surveyed ask for similar metrics, such as coherence [
13] and logical and structural integrity [
21]. Another dimension encapsulated in this criterion is the comprehensibility of the answer to experts and non-experts alike. This emphasis on accessibility ensures that complex information can be grasped by a wider audience, enhancing the overall utility of the response. The evaluation therefore involves a careful assessment of terminology usage, sentence structure, and overall readability, as these factors significantly influence communication effectiveness.
3.5.2. Correctness Evaluation
This evaluation includes the many dimensions which fall under the umbrella of correctness. Following another prompt structure where correctness is evaluated [
15], the formulation for this prompt takes into consideration the completeness [
21] and mathematical reasoning behind the final result, whether or not numerical in nature.
The correctness rubric prompt asks for an evaluation of the completeness and mathematical reasoning of the predicted answer. For mathematical questions, the reasoning must be sound even if the final numerical result is equivalent. A similar explanation was provided to the human raters in Romanian, asking whether both the answer and its mathematical reasoning are correct. The human evaluators were also asked to rate correctness on a Likert scale, with 1 meaning completely incorrect and 5 meaning completely correct.
It is crucial that the mathematical reasoning is followed exactly, even when the final numerical result matches the gold answer, because a model may rely on an older version of the law or on an ambiguous mathematical explanation. In finance, mathematical reasoning and calculations are essential components of many critical financial questions and workflows [
20]. Thus, high correctness scores and explicit mention of this validation focus are important.
3.5.3. Legal Citation Evaluation
The prompt used for this rubric asks the judge to verify that the legal citations provided in the predicted answer are relevant and match those in the gold answer. If any citations are contradictory or absent from the gold answer, the label should be invalid. The labels were further processed numerically, using 0 and 1. The human evaluators were asked to assess whether the cited articles of law are correct or incorrect.
This rubric was designed to serve three functions. First, its formulation ensures that answers adhere to Romanian law and the established legal framework for the finance domain. Second, it requires that answers use the current versions of all relevant statutes and legal acts, guaranteeing an up-to-date legal basis. Lastly, these dimensions safeguard data integrity by mitigating possible hallucinations and allowing the systematic detection of legally unsupported responses.
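A simplified, rule-based sketch of the check this rubric describes is given below, treating citations as normalized strings; the function, its name, and the string-based normalization are illustrative assumptions (the actual check is performed by the judge model on free-form text):

```python
# Hypothetical sketch of the legal-citation rubric as a deterministic rule:
# citations in the predicted answer must appear among the gold citations,
# otherwise the binary label is 0 (invalid).
def citations_valid(predicted: set[str], gold: set[str]) -> int:
    """Return the 0/1 label later used in the numerical processing."""
    if not predicted:
        return 0  # an answer citing no legal basis is unsupported
    return 1 if predicted <= gold else 0

# Example: a stray citation absent from the gold answer invalidates the label.
gold = {"Codul fiscal art. 291", "Codul fiscal art. 293"}
print(citations_valid({"Codul fiscal art. 291"}, gold))  # 1
print(citations_valid({"Codul fiscal art. 999"}, gold))  # 0
```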
3.6. Human Evaluation
A sample of 60 of the 1045 questions was selected, evenly covering the six categories with 10 questions each. The answers provided by Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-5, all with Tavily Search enabled, were evaluated by 12 specialists from the financial-accounting domain. The 12 raters were asked to provide binary scores on the same three rubrics used by the LLM-as-a-Judge models. Moreover, correctness was also evaluated on a separate Likert scale from 1 (completely incorrect) to 5 (completely correct), to better understand the nuances behind the raters’ binary choices.
In addition to evaluating the individual answers, the human raters were asked to state their preference between two models, taking into consideration the question and the ground truth answer. The evaluators were shown one question at a time, together with the ground truth answer and 2 of the 3 model answers, and asked to choose answerer model A, answerer model B, or a tie. To ensure blinded evaluation, the answerer models were referred to only as model A and model B. Each question was presented three times, so that preferences were established for all possible model pairs.
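The construction of these blinded comparisons can be sketched as follows; the helper name and the coin-flip blinding mechanism are illustrative assumptions, not the study's actual tooling:

```python
# Hypothetical sketch: build the three blinded A/B comparisons for one
# question (one per unordered model pair), with a coin flip deciding
# which model the rater sees as "A".
import itertools
import random

MODELS = ["claude", "gemini", "gpt"]

def blinded_pairs(question_id: str, seed: int = 0) -> list[dict]:
    rng = random.Random(f"{question_id}-{seed}")  # reproducible per question
    comparisons = []
    for left, right in itertools.combinations(MODELS, 2):
        if rng.random() < 0.5:
            left, right = right, left  # randomize which model appears as A
        comparisons.append({"question": question_id, "A": left, "B": right})
    return comparisons

pairs = blinded_pairs("q17")
assert len(pairs) == 3  # all pairwise combinations of the three models
```

Randomizing the A/B position per comparison prevents raters from learning which slot a given model occupies across questions.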
3.7. Hypotheses
The first hypothesis is that Claude would be the best performer, based on the research team’s experience with this model on other tasks. The second hypothesis is that GPT-5 would be the least expensive, since it has the lowest cost per token of the three models. The third supposition concerns the use of the Tavily Search API: Retrieval Augmented Generation with online search should yield better legal citations. Lastly, based on the literature on LLM-as-a-Judge, the model evaluations should be similar to the human ones.
4. Results
In this section, because the same models were used both for evaluation and for generating dataset answers, the following naming convention is used: Judge Claude, Judge Gemini, and Judge GPT (or Evaluator Claude, Evaluator Gemini, and Evaluator GPT) when referring to the models as evaluators. The models being evaluated, which provided the answers to the dataset questions, are named Answerer Claude, Answerer Gemini, and Answerer GPT, or by their model name annotated with or without search, such as Claude-NoSearch or Claude-WithSearch.
Across all evaluations from the three LLM judges, only 8% of answers received full scores on all three rubrics. On average, the cost per successful response was $0.2252. Within this 8%, the distribution of successful responses reflects whether search tools were used: just over a third (38.4%) came from models without search, while models with search enabled accounted for almost double that share (61.6%), providing evidence for this pairing as a route to improved results.
In regards to the human evaluations, the smaller dataset consisting of 60 questions with the answers from the three models—Claude-WithSearch, Gemini-WithSearch and GPT-WithSearch—was analyzed. These evaluations were analyzed from two perspectives, namely to establish the quality of the answers provided, and to compare the human judgments to the LLMs’.
4.1. Per Rubric Analysis
This subsection concentrates on the scores provided by the evaluators, whether the human specialists or Claude 4.5 Sonnet, Gemini 2.5 Pro, and GPT-5. For the LLM evaluators, the whole dataset of 1045 questions answered by the six LLM answerers, totaling 6270 answers, was assessed from the three-rubric perspective. For the three rubrics of correctness, clarity and structure, and legal citation, the LLM and human evaluators were asked to provide a binary score. All evaluators were provided with a brief explanation of what each rubric implied.
4.1.1. LLM-as-a-Judge
For each of the three rubrics, the mean score was calculated. For clarity and structure, the judges score the responses at 96.2% on average. Verbosity bias is the tendency of a judge, human or model-based, to favor longer responses regardless of the actual information presented [
14]. This bias, in combination with the way LLMs are constructed to adhere to human norms requiring well-structured examples, may be part of the reason why the model evaluators decided to place such high rankings for this metric [
4].
In our study, an average correctness score of 42.5% attributed by the LLM judges signals the difficulty of the question answering task. In a similar study [25], the best-performing model (GPT o3) achieved an accuracy of 46.8%, with no model surpassing 50% accuracy on the Finance Agent Benchmark. When evaluated on CPA single-answer questions in IDEA-FinBench [4], the models also yielded low scores, with ChatGPT achieving 42.64% accuracy. An LLM's performance is measured against the evaluation criteria provided, and such an average correctness score highlights shortcomings that require further investigation.
The dimension on which the judges agreed most strongly is clarity and structure. The scores provided by the judges for the models’ answers (see Table 1) are roughly similar, ranging from 90.5% for Answerer GPT-5 to 99.5% for Answerer Gemini with search. A similar agreement trend holds for the legal citation rubric, where all models performed poorly: the lowest average score was given to Claude-NoSearch (5.9%), and the best to GPT-WithSearch (16.1%).
Upon further investigation, an average score was calculated for each evaluation rubric, for each evaluator model and for each evaluated model, to check whether the judges align with each other in their assessments. For a more detailed distribution of how each evaluator scores the answerer models, refer to Figure 4. The scoring trend holds for each answerer model. A noticeable difference is Judge Gemini’s more lenient scoring across all rubrics and all answerer models.
The analysis of the confidence intervals for the provided answers reveals distinct performance tiers across the evaluated models (Table 2). GPT-5-WithSearch demonstrated the highest correctness, outperforming all the other models, as indicated by the non-overlap of the 95% confidence intervals. Interestingly, the confidence intervals of Claude-WithSearch (0.358 [0.329, 0.387]) and Gemini-NoSearch (0.360 [0.331, 0.389]) overlap substantially despite a slight numerical difference. The performance gap between these two models is therefore marginal and potentially attributable to noise, whereas the jump between Gemini-WithSearch and GPT-NoSearch represents a significant improvement in model capacity.
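As a rough check, the half-width of these intervals is consistent with a normal-approximation binomial interval over the full question set. A minimal sketch (the sample size n = 1045 is our assumption, inferred from the dataset size):

```python
import math

def wald_ci(p, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Claude-WithSearch correctness: 0.358 over an assumed n = 1045 answers
lo, hi = wald_ci(0.358, 1045)
print(round(lo, 3), round(hi, 3))  # → 0.329 0.387, matching Table 2
```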
4.1.2. Human Evaluators
Examining the computed average scores for each rubric and each model in Figure 5, a stark discrepancy is seen in the correctness and legal citation rubrics. The human evaluators are more inclined to label an answer as Yes (1) than the LLMs. The LLM judges, on the other hand, are stricter in their assessment, rarely attributing the Yes label for these rubrics.
Considering the additional evaluation done by the specialists, who also attributed a correctness score on a Likert scale, and the aforementioned difference, the correlation between the Yes/No label and the Likert score was analyzed. In Figure 6, the distribution reveals that a score of 3 on the Likert scale marks the threshold of partial correctness. For a Likert score of 3, the probability that a specialist attributes a No label is only 35%. Even in borderline cases, then, the tendency is to attribute a Yes label to a medium-quality answer. This trend may account for the elevated average scores seen in the human evaluations compared to the LLM judges.
4.2. Inter Rater Agreement on Rubric Scores
To assess the validity of the evaluations provided by the specialists, inter-rater agreement on the correctness score was calculated using multiple methods, as was the agreement between human and LLM raters. First, the Pearson correlation coefficient was calculated between each pair of raters for the correctness score [14]. The values in the matrix exhibited weak correlation, with the highest value reaching 0.28 between two human raters, and −0.21 between Judge GPT and Judge Gemini. The maximum correlation between human and LLM raters is 0.26, still weak. These results led to the evaluation of Spearman’s rank correlation coefficient for each of the three models’ results. The answers provided by Claude, Gemini and GPT with search enabled were separated, and the coefficient was calculated over the evaluations of the batch of 60 answers for each model. For Claude-WithSearch’s answers, the strongest correlations were found between a human rater and Judge Gemini and Judge Claude, at 0.58 and 0.59. For Gemini-WithSearch’s answers, the highest coefficient was 0.70, between Judge Gemini and two human raters. Among human raters, the closest score was 0.69. For GPT-WithSearch’s answers, the strongest scores were between Judge Claude and two human raters, at 0.69 and 0.67.
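Because the scores contain many tied values (binary and Likert ratings), Spearman's coefficient must be computed on average ranks. A minimal stdlib sketch of the pairwise computation (illustrative only; the actual analysis may have used a statistics package):

```python
from statistics import mean

def avg_ranks(xs):
    """Assign 1-based ranks, averaging ranks within tie groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        r = (i + j) / 2 + 1          # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = avg_ranks(x), avg_ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # → -1.0 (perfectly opposed ratings)
```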
The next stage of investigating the agreement between the 15 raters used Cohen’s Kappa, which produced lower scores than Spearman’s rank correlation coefficient. A Cohen’s Kappa matrix was built from the correctness scores of each pair of raters. The values in the matrix did not show strong or moderate agreement, with the highest score being 0.28 between two human raters. These results prompted the investigation of other coefficients better suited to assessing the agreement of the 12 human raters and three LLM judges. Given the binary nature of the ratings, and in order to assess the specialists’ group agreement, Fleiss’ Kappa was also calculated. A low Fleiss’ Kappa of 0.2473 indicates only fair inter-rater agreement [35].
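Fleiss' Kappa compares the observed per-item agreement with the agreement expected from the overall category proportions. A minimal sketch for binary ratings with a fixed number of raters per item (toy data, not the study's scores):

```python
def fleiss_kappa(ones_per_item, n_raters):
    """Fleiss' Kappa for binary ratings; ones_per_item[i] is the number
    of raters who scored item i as 1, out of n_raters."""
    N = len(ones_per_item)
    # per-item observed agreement, averaged over items
    p_bar = sum((o * o + (n_raters - o) ** 2 - n_raters) /
                (n_raters * (n_raters - 1)) for o in ones_per_item) / N
    # chance agreement from the overall proportion of 1s
    pi1 = sum(ones_per_item) / (N * n_raters)
    p_e = pi1 ** 2 + (1 - pi1) ** 2
    return (p_bar - p_e) / (1 - p_e)

# 4 items, 3 raters: two unanimous items, two split 2-to-1
print(round(fleiss_kappa([3, 0, 1, 2], 3), 4))  # → 0.3333
```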
For the evaluation and confirmation of the expert evaluator group, IntraClass Correlation Coefficient was calculated in an effort to assess the reliability of the ratings as well [
14,
36,
37].
As seen in Table 3, the low ICC1, ICC2 and ICC3 values show poor reliability: the human raters disagree with each other in many cases, so the judgment of a single rater should not be trusted. The collective, however, carries more information. The ICC1k, ICC2k and ICC3k scores range from 0.74 to 0.78, indicating good reliability: even though individual raters disagree, the group as a whole is consistent. Since the last three scores reflect the means of k raters, the resulting good reliability justifies using a majority vote or the average score of the 12 raters to determine the ground truth.
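The gap between single-rater and averaged reliability follows directly from the Spearman–Brown relation, ICC(k) = k·ICC(1) / (1 + (k−1)·ICC(1)). A quick illustration with a hypothetical single-rater ICC of 0.20 (the actual single-rater values are in Table 3):

```python
def spearman_brown(icc1, k):
    """Reliability of the mean of k raters given single-rater reliability."""
    return k * icc1 / (1 + (k - 1) * icc1)

# a modest single-rater reliability of 0.20, averaged over 12 raters
print(round(spearman_brown(0.20, 12), 4))  # → 0.75
```

This shows how a low individual ICC can still yield an ICC2k around 0.75 once 12 raters are averaged.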
The stark difference between a low Fleiss’ Kappa and a high averaged ICC (an ICC2k of 0.75) turned the investigation toward the unbalanced nature of the dataset. The raters attributed correctness scores of 0 and 1 in a dissimilar manner: 0s make up 23.3% of the 2160 correctness scores (12 specialists rated 180 answers for the 60 questions answered by the three LLMs with the Tavily Search API), and 1s make up 76.7%. Since the Kappa measurement depends on trait prevalence in the population under consideration—here, the distribution of 0s and 1s—another statistic was calculated, Gwet’s AC1 [35]. Gwet’s AC1 was 0.5816 for correctness, 0.5105 for legal citation, and 0.6293 for clarity and structure. The score for correctness falls into the Moderate Agreement category, showing moderate individual agreement alongside high group reliability.
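Gwet's AC1 replaces Fleiss' prevalence-sensitive chance term π² + (1−π)² with 2π(1−π), which stays small when one label dominates. A toy comparison on heavily imbalanced binary data (illustrative values, not the study's scores) shows Kappa collapsing while AC1 stays high:

```python
def agreement_stats(ones_per_item, n_raters):
    """Observed agreement, Fleiss' Kappa and Gwet's AC1 for binary ratings."""
    N = len(ones_per_item)
    p_bar = sum((o * o + (n_raters - o) ** 2 - n_raters) /
                (n_raters * (n_raters - 1)) for o in ones_per_item) / N
    pi1 = sum(ones_per_item) / (N * n_raters)
    p_e_fleiss = pi1 ** 2 + (1 - pi1) ** 2
    kappa = (p_bar - p_e_fleiss) / (1 - p_e_fleiss)
    p_e_ac1 = 2 * pi1 * (1 - pi1)
    ac1 = (p_bar - p_e_ac1) / (1 - p_e_ac1)
    return p_bar, kappa, ac1

# 10 items, 3 raters, almost all rated 1 (prevalence ≈ 0.97)
p_bar, kappa, ac1 = agreement_stats([3, 3, 3, 3, 3, 3, 3, 3, 2, 3], 3)
print(round(kappa, 2), round(ac1, 2))  # Kappa ≈ -0.03, AC1 ≈ 0.93
```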
Consensus Label
Based on the aforementioned results, a Consensus Label of 1 or 0 was computed for each rubric via majority vote [14], flagging the cases where the 12 raters were tied. The scores were first grouped by question number and answerer model, then the mean score of the 12 raters and the number of 1s were calculated. For the majority vote, the threshold for the mean was 0.5: values above the threshold were labeled 1, and values at or below it were labeled 0. An agreement confidence score was also calculated to capture how unanimous the raters' decision was, ranging from 0.5 (total disagreement) to 1 (total agreement). The results can be seen in Table 4. The number of high-confidence binary scores per rubric counts the scores supported by at least 9 of the 12 human raters. High confidence in 75% of the correctness Consensus Labels shows strong agreement among the raters, as do the high percentages for legal citation and clarity and structure: 70% and 82.22%, respectively.
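The consensus procedure can be sketched as follows (a minimal version of the majority vote described above; function and variable names are ours):

```python
def consensus_label(scores):
    """Majority-vote consensus over binary rater scores.
    Returns (label, agreement_confidence, tie_flag)."""
    n, ones = len(scores), sum(scores)
    label = 1 if ones / n > 0.5 else 0        # ties fall to 0 and are flagged
    confidence = max(ones, n - ones) / n      # 0.5 = even split, 1.0 = unanimous
    return label, confidence, ones * 2 == n

print(consensus_label([1] * 9 + [0] * 3))  # → (1, 0.75, False): high confidence
print(consensus_label([1] * 6 + [0] * 6))  # → (0, 0.5, True): flagged tie
```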
The same set of 180 answers used for the human evaluation was also assessed by the LLM judges. To compare human and LLM judgments, we first computed a human Consensus Label (majority vote across the 12 raters) for each answer, and then measured how often each LLM judge agreed with this human consensus. In a separate analysis, a Consensus Label was calculated for the three LLM judges; its accuracy against the human Consensus Label was 57%, revealing low agreement between the combined judges and the human consensus on correctness. A more detailed breakdown of accuracy, shown in Table 5, indicates that for correctness, Judge Gemini was closest to the human evaluators; it is also the most accurate judge for legal citation. For the clarity and structure rubric, agreement was very high, which may be a direct result of both human and LLM evaluators attributing a number of 1s close to the maximum: for this rubric's Consensus Label, 1s account for 97.78% (176/180).
An examination of the LLM judges' agreement with the Consensus Label, broken down by question category, highlights areas where the judge models might struggle, as seen in Figure 7. Monography of Accounting Entries is difficult for all three models, with none surpassing 64% accuracy, while Micro Enterprises yields the highest average accuracy. Judge Gemini remains the closest to the Consensus Label, which encapsulates the majority of the human raters' labels, while Judge Claude provides moderate accuracy consistently across all categories.
The Consensus Label was also used to gain insight into how each answerer model performed in each question category. The 60 questions selected for the human evaluation dataset contain 10 questions from each category. The Consensus Label for the correctness score of each answerer model in each category can be seen in Figure 8. From this analysis, the search-enabled answers of Claude and Gemini were considered incorrect in half the cases, suggesting that these models struggle in this particular area of the financial-accounting domain.
With the help of the confusion matrices (see Appendix C.1, Appendix C.2 and Appendix C.3), we can better understand the strengths and weaknesses of each LLM judge relative to the Consensus Label established from the 12 human experts' evaluations. In the correctness rubric, all three models exhibited a significant False Negative bias, labeling answers considered correct by the majority of the experts as false. Judges Claude and GPT were identical in performance, both missing 84 correct instances while correctly identifying only 67 correct answers. Judge Gemini showed better alignment, capturing 105 True Positives while still producing 46 False Negatives. Although the number of answers the human consensus flagged as false is small (29), the models are highly reliable in labeling those answers as false, with Gemini the only judge to produce any False Positives (5). This suggests that while the model judges are conservative and can be considered harsh, Judge Gemini's performance is the most balanced for the correctness rubric.
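Treating the human Consensus Label as ground truth, the per-judge counts form a standard binary confusion matrix. A minimal sketch of the tally (toy labels; the real counts are in Appendix C):

```python
def confusion(consensus, judge):
    """Confusion matrix of a judge against the human consensus (ground truth)."""
    pairs = list(zip(consensus, judge))
    tp = pairs.count((1, 1))   # both say correct
    fn = pairs.count((1, 0))   # judge misses a correct answer (the dominant error)
    fp = pairs.count((0, 1))   # judge accepts an incorrect answer
    tn = pairs.count((0, 0))   # both say incorrect
    return tp, fn, fp, tn

print(confusion([1, 1, 1, 0, 1, 0], [1, 0, 0, 0, 1, 0]))  # → (2, 2, 0, 2)
```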
The legal citation rubric evaluates whether the legal citations provided in the predicted answers are valid and whether all the cited articles are relevant and match the gold answer. This rubric showed the highest disparity between the judge models and the human judgment. All three judge models labeled most of the actually correct answers as false, with GPT producing the most False Negatives (153), followed by Claude (140) and Gemini (121). Gemini captured the most True Positives (40) among the judges, with True Negatives (18) similar to the other two models (19). While the human raters attributed the fewest false labels on this rubric (19), all three models were able to identify those same answers as false. Notably, the LLM judges treat legal citation more restrictively than the human evaluators do. Given that the answers in this subset were generated with retrieval augmentation, the harshness of the LLM judgments relative to the human experts' highlights the need for a more critical analysis with human oversight.
For the clarity and structure rubric, the dataset leaned heavily toward True labels, making this a test of alignment. All three models performed very well, led by Gemini, which correctly identified 175 of the 176 true-labeled answers. GPT follows closely with 172 True Positives, and Claude sits slightly behind with 166. The judge models show very high agreement with the human Consensus Label on this rubric.
4.3. Human Expert Model Evaluation
Model Preference
Each specialist was asked to choose between two answers presented alongside the text of the ground truth answer for the question. The pairwise evaluation was performed three times to establish whether there is a collective preference for a particular answerer model. Given the comparative nature of the evaluation, the win rate was calculated for each model as the total number of wins divided by the total number of judgments across all questions, with ties counted as 0.5 for both models. GPT-5's answers scored the highest, with a win rate of 40.79%. Gemini (29.81%) and Claude (29.40%) are similar to each other but far from the winner. This suggests either a preference across the specialists for GPT-5's answers, or that Gemini and Claude accumulated many ties against each other.
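The win-rate computation can be sketched as follows (a minimal version with hypothetical model names; each record is one pairwise judgment):

```python
from collections import defaultdict

def win_rates(judgments):
    """judgments: (model_a, model_b, winner) triples, where winner is a
    model name or 'tie'; ties count as 0.5 for both models."""
    wins, games = defaultdict(float), defaultdict(int)
    for a, b, winner in judgments:
        games[a] += 1
        games[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1
    return {m: wins[m] / games[m] for m in games}

sample = [("GPT", "Claude", "GPT"), ("GPT", "Gemini", "tie"), ("Claude", "Gemini", "tie")]
print(win_rates(sample))  # → {'GPT': 0.75, 'Claude': 0.25, 'Gemini': 0.5}
```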
To better understand the individual preferences of the specialists, the win rate was transformed into a weighted score based on the models' positions for each question: first place received three points, second place two, and third place one, per question. Compared to the average Likert correctness score attributed by each human rater, as seen in Table 6, the weighted score reveals differences between raters. They cannot agree on the answers from Gemini, as shown by the low correlation between the two scores and across all raters. With a low average Likert score and an average weighted score of 110, Gemini's answers are regarded as passable, but rarely as the winning answers. GPT's answers hold both the highest average Likert score and the highest average weighted score, 126.5. This is the strongest correlation: when a rater grades GPT highly, they also pick it as the winner. Claude closely follows GPT with an average weighted score of 123.16, showing that it won many comparisons, though its average Likert score does not surpass GPT's. Among the raters there are some outliers: Rater 7 shows a bias toward Claude's answers and attributes the lowest scores to GPT, disagreeing with the rest of the group. Rater 4 behaves similarly, though less harshly toward GPT's answers. Another outlier is Rater 9, who gave average Likert scores of 5 to almost all of the answers. The weighted score remains the more revealing metric because it forces the raters to rank the models. While the Likert scale provides more nuance, the weighted score reduces potential bias toward or against certain answerer models, revealing Gemini's weakness in direct competition.
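The weighted score can be sketched as follows (a minimal version with hypothetical rankings; each entry orders the three models from first to third place for one question):

```python
def weighted_scores(rankings):
    """Sum 3/2/1 points for first/second/third place across questions."""
    points = {}
    for order in rankings:                  # order: models from best to worst
        for place, model in enumerate(order):
            points[model] = points.get(model, 0) + (3 - place)
    return points

sample = [["GPT", "Claude", "Gemini"], ["Claude", "GPT", "Gemini"]]
print(weighted_scores(sample))  # → {'GPT': 5, 'Claude': 5, 'Gemini': 2}
```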
4.4. Search-Enabled Results Using LLM-as-a-Judge
During the answer generation process, the number of times each model used the search tool was recorded. On average, GPT leveraged Tavily the most, with four calls per answer, while Claude used 2.5 calls and Gemini barely used the tool at all, averaging 0.3 calls per answer.
One source [25] shows that another GPT model, GPT-4o-mini, also issued a large number of tool calls, 24.8 on average. In that study, Claude made the second most tool calls, 11.1 on average, and Gemini 4.5. These usage trends are comparably followed throughout the question answering stage of our experimentation. In that study, the earlier GPT model was an outlier, issuing a high number of tool calls while also having the highest error rate, indicating poor tool utilization [25].
To assess the correlation between the number of tool calls and model performance, a Spearman matrix was created; the strength and direction of the relationships between the chosen variables are shown in Figure 9. The values for number_of_searches versus clarity_and_structure_score, legal_citation_score or correctness_score are all close to 0, indicating no relationship between the number of tool calls and any of the performance scores: the number of searches a model performs does not help predict whether the rubric-based scores will be high or low. On the other hand, the same matrix shows a slightly positive relationship between search_active and successful_response, legal_citation_score and correctness_score, suggesting that when one is higher, the other tends to rise as well. Notably, the correlations are quite low due to the large number of responses scored zero. Still, the matrix suggests that enabling the search tool contributes to achieving a correct response.
For Figure 10, a subset of 500 answers from the dataset was employed, consisting of Q&A pairs with at least one successful answer, meaning that at least one of the six possible answers to a question scored 100% across all three rubric metrics. In this matrix there is a strong correlation between the number of successful answers and the legal citation score, whereas correctness alone does not necessarily imply a successful answer. Another important observation is that the ability to use the Tavily Search API does not always translate into its use; however, there is a strong link between the two.
As shown in Table 1, the correctness scores improve for all models when the Tavily Search API is used. In GPT-5's case, the improvement amounts to 16.4% on the correctness rubric, while Claude improves by 14% and Gemini by only 4%. A similar trend holds for the legal citation scores: GPT becomes the best performer on this rubric after enabling search, with the largest improvement of 5.7%, reaching a score of 16.1%. Without search, Gemini scored 13.6%, and the web search tool boosted that score by only 1.6%. An interesting exception to Tavily's tendency to increase scores is Claude's clarity and structure score, which decreases by 2.4%. Here again, Gemini receives a small boost of 0.1% and GPT the largest increase of all models, 1.5%.
4.5. Question Category and Difficulty-Based Analysis for LLM-as-a-Judge
According to the average correctness score for each answerer model, the pattern of attributed values remains roughly the same across question categories. Notably, Gemini-WithSearch, Gemini-NoSearch, GPT-NoSearch and Claude-NoSearch provide better answers for questions from Accounting and Monography of Accounting Entries. Enabling web search allowed Claude to provide better answers in the Income Tax category and GPT to perform better on questions regarding Micro Enterprises. All of the models had a hard time answering VAT and eVAT questions correctly, even though the majority of these questions, over 86%, were classified as Medium (59.8%) or Easy (26.43%). This highlights a significant and uniform weakness in handling this category under Romanian law, in contrast to the remaining categories (see Figure 11).
On average, the answers in the Profit Tax category scored the highest in legal citation, indicating that references in this area are more accessible. The lowest score for the same rubric is in the Other Obligations category, which may be due to the wide variety of issues this category covers (see Figure 12).
The heatmap in Figure 13 shows that GPT-WithSearch provides the most successful answers, with better answering capabilities for Medium and Easy questions. A similar trend appears in Figure 14 for the number of questions answered with a correctness score of 1 (Yes), where GPT-WithSearch again performed best, receiving the largest number of affirmative labels. For both metrics, answers to Medium-level questions score better than answers to Easy ones, which is an unforeseen result. In total, 2681 of the 6270 answers were labeled Yes (1) on the correctness rubric across difficulty levels (see Figure 14). Conversely, a more expected outcome appears when comparing Easy with Hard: the number of successful and accurate responses drops, as question and answer pairs at the hardest difficulty level expose the models' weaknesses and yield worse results.
4.6. Cost Analysis
For a successful response, where all three rubric metrics score 1 based on the LLM-as-a-Judge evaluation, the average cost is $0.2252. From the perspective of average query cost versus correctness percentage (see Figure 15), GPT-WithSearch ranks on top compared to the other models. The worst cost-to-correctness ratio belongs to Claude-NoSearch, with the search-enabled version improving on both dimensions. Interestingly, enabling the search tool did not substantially improve Gemini's results on either dimension, leaving it ranked similarly in the bidimensional analysis. The results achieved by GPT and Claude in this comparative analysis further corroborate existing research on the cost–accuracy trade-off when deploying LLMs and other predictive systems [26].
As shown in Figure 15, the human answerers require more time and thus a higher average query cost; however, they provide better results. Once again, Gemini's performance improves little when given a search tool and more time. While requiring the least time, Claude-NoSearch provides the fewest correct answers; correctness improves when search is enabled, with a corresponding increase in time, though still below 150 s on average. GPT provides higher correctness scores but, conversely, requires more time to answer.
Considering the human evaluations, more precisely the Consensus Label, an average cost per correct query was calculated for each rubric. A correct answer costs on average $0.57 with GPT-5, $0.33 with Claude and $0.09 with Gemini. For an answer with correct legal citations, GPT remains the most expensive at $0.58, followed by Claude at $0.32 and Gemini at $0.08. The same ranking holds for obtaining a clear and structured answer.
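Such figures follow from dividing total spend by the number of answers the human consensus deems correct. A minimal sketch (hypothetical per-query costs and labels):

```python
def cost_per_correct(costs_usd, consensus_labels):
    """Average spend per answer labeled correct by the human consensus."""
    return sum(costs_usd) / sum(consensus_labels)

# e.g. three queries at varying cost, two judged correct
print(cost_per_correct([0.25, 0.25, 0.5], [1, 0, 1]))  # → 0.5
```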
5. Discussion
This research builds on previous studies in this domain, which are reviewed in Section 2 (Literature Review). The research team contributed a new dataset focused on the Romanian financial domain and its regulations.
Concerning the potential bias the evaluator models might have toward their own answers, the average scores given to the answerer models by each evaluator were calculated and plotted. The distribution of evaluations by answerer model and evaluator model can be found in Figure 16. A noticeable tendency of Judge Gemini to attribute better scores to all the evaluated models, without favoring itself, can be seen in Figure 4. All three judges evaluated GPT's answers with higher correctness scores, while Answerer Claude and Answerer Gemini received similar correctness scores. Given the almost equal distribution across judges for the answerer models, we conclude that there is no detectable self-bias beyond acceptable limits in the models' evaluations.
The research team's first hypothesis was that Claude would outperform the other two models, based on its results in daily use. Not only does it fail to achieve the best performance, but in several instances it ranks as the lowest scorer. In the human evaluations, Claude with search enabled ranks second in average Likert score. Based on the Consensus Label derived from the specialists' scores on a subset of the answers provided by the models with access to Tavily, Claude performs the worst of the three models, as seen in Figure 7. The team's expectation may be explained by verbosity bias, combined with the model's performance on daily tasks that do not require knowledge of the financial-accounting domain.
The second research hypothesis stated that ChatGPT (GPT-5) would incur the lowest expenditure because it has the lowest cost per token. When the Tavily Search API is enabled, all models produce longer answers, which the judge models must then process. GPT-WithSearch uses the most input tokens for answer generation, 114,300 on average per query, and the most output tokens, 39,797 on average. Another important factor in disproving this hypothesis, which also explains the large token usage, is that GPT makes the most tool calls.
Another hypothesis was that tool-enabled Retrieval Augmented Generation would produce better legal citation results. The Spearman correlation matrix for tool calls shows no relationship between the number of tool calls and the legal citation score, so increased tool use does not equate to better results. At the same time, Figure 17 shows an improvement in the legal citation score when the search tool is used. This is also consistent with the large number of legislative changes in Romania's financial-accounting field in 2025, which further underlines the need for advanced search and analysis tools provided by artificial intelligence models. Both findings nuance the hypothesis: performance is independent of the number of tool calls, but using the search tool at all improves it.
The disparity between human and LLM evaluations for correctness and legal citation brings verbosity bias into the discussion. This bias relates to the presentation of the answer and the tendency of a human (or model-based) judge to favor longer and more structured responses [14,38]. Human judges often prefer different textual properties—some, for example, look for a concise answer rather than a detailed one—which directly impacts the overall assessment. This was the case for Raters 4 and 7, the outliers of the group due to a suspected preference for the answers of one model. The evaluation rubrics attempted to separate the assessment into distinct dimensions, yet they revealed discrepancies between individual raters.
Based on the human evaluation of the answers provided by the three models with access to the Tavily Search API, the average scores across all three rubrics suggest that GPT may be better suited for the question answering task, even in the financial-accounting domain. The other two models are close contenders, scoring higher than GPT in clarity and structure. An important variable here is the average number of tool calls: GPT used four calls on average, followed by Claude with 2.5 and Gemini with 0.3. Considering its much lower number of calls, Gemini's results are remarkable.
Calculating the Consensus Label for each rubric and answerer model for each question allowed a better evaluation of the scores attributed by the LLMs-as-Judges. It revealed that for the correctness and legal citation rubrics, Gemini is the most accurate, agreeing with the evaluations of most of the human raters. It is also important to note that the LLMs attributed the fewest Yes labels (scores of 1) on these two rubrics for all models. For the rubric with the highest average scores and most 1s attributed, GPT is closest to the consensus, especially considering it had the highest number of answers to evaluate () of all the judges.
One of the main goals of this research was to establish the capability of Large Language Models to answer domain-specific questions on finance and accounting. Another relevant contribution of this study is the creation of a dataset relevant to the Romanian landscape for this domain. The dataset includes questions from multiple areas of interest: VAT and electronic VAT, Accounting and Monography of Accounting Entries, Income and Profit Tax, Micro Enterprises, and Other Obligations. It also contains responses from several models: one set was generated by Claude 4.5 Sonnet from Anthropic, Gemini 2.5 Pro from Google and GPT-5 from OpenAI, and another set was produced by the same models using a retrieval tool that enables search, the Tavily Search API. In the initial assessment, open-source models were included and analyzed using the G-Eval framework.
The limitations of this research are, first, the size of the subset validated by specialists. While only 5.74% () of the 1045 questions were taken into consideration, each question had three answers to evaluate, resulting in a total of 180 validations. Second, financial limitations allowed each answer to be evaluated by only one model, assigned at random to mitigate potential self-bias.
Future directions for this research include involving human evaluators in the preliminary evaluation of the models' answers, and evaluating the judges themselves in order to establish the most reliable LLM judge.
Bearing in mind that all human raters attributed, on average, high Likert correctness scores to GPT-5's search-enabled answers, the findings suggest this as the model of choice for applications requiring knowledge of the financial-accounting domain. GPT-5's better results come with a higher bill, requiring on average $0.58 per query judged correct by the majority of human reviewers.
Regarding scalability, while high-load stress testing was outside the scope of this study, the benchmark shows efficiency gains in time comparisons. The models answered significantly faster than human specialists (see Figure 15), suggesting that adoption at scale should focus on managing API rate limits and operational costs when handling many simultaneous queries. In a similar study, an earlier OpenAI model, o3, took 3.1 min and cost $3.78 per query, achieving only 46.8% accuracy [26]. In our case, GPT-5, also an OpenAI model, reached the highest average correctness rate of 52.82%, requiring, on average, 393.43 s per question, as shown in Figure 15. Another important topic discussed was the use of LLMs as judges and the trust that can be placed in such evaluations in similar research settings. Other studies highlight the potential of LLM judges due to their scalability and reproducibility [14]. Our findings show that while human expert evaluation is time-consuming and LLM-based evaluation is more efficient in both cost and time, the models do not correlate strongly with humans when evaluating question-answer pairs. The highest similarity between a model's judgments and the Consensus Label derived from the human experts' labels was obtained by Gemini, at 68.33%. This indicates that human oversight remains necessary, despite the cost and time benefits of using LLMs as judges.
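The 68.33% similarity figure is an agreement rate against a majority-vote consensus over the human experts' labels. A minimal sketch of that computation, with toy labels rather than the study's data:

```python
from collections import Counter


def consensus_label(human_labels: list[str]) -> str:
    """Majority vote over the human experts' labels for one item
    (ties broken by first-seen order)."""
    return Counter(human_labels).most_common(1)[0][0]


def agreement_rate(judge_labels: list[str], consensus: list[str]) -> float:
    """Fraction of items where the LLM judge matches the consensus."""
    matches = sum(j == c for j, c in zip(judge_labels, consensus))
    return matches / len(consensus)


# Toy example: three items, three human labels each.
consensus = [consensus_label(h) for h in [
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "correct"],
    ["correct", "correct", "correct"],
]]
rate = agreement_rate(["correct", "correct", "correct"], consensus)
```

Raw agreement is easy to interpret but does not correct for chance; a chance-corrected statistic such as Cohen's kappa would be a natural complement in the multi-judge analyses proposed below.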
To ensure robustness, the benchmark used a dataset curated by domain experts, ensuring diversity across the three main areas. In accordance with the framework [7], the benchmark does not merely label an answer correct or incorrect; it captures nuances regarding legal citations and clarity, thus validating the logic and integrity of the answers provided. The LLM-as-a-Judge evaluation in our article relies on structured rubrics, a mechanism found in robust evaluation frameworks [18]. While the models were not tested on incomplete or disordered data, their ability to correctly interpret the questions indicates an understanding of specific terminology and complex legal texts. In the business context, the legal dimension is a main area of focus, and incorrect facts present a real risk. To mitigate the risk of hallucination, the models were allowed to use an external tool, Tavily, to search the web and thus augment their internal knowledge with recent results. Related work suggests preventing hallucination through Retrieval Augmented Generation (RAG) and vector engines, noting that a robust system must also be tested on its ability to anchor answers in the data these tools provide. In our study, the use of Tavily improved both the number of answers found correct by the LLM judges, as seen in Figure 11, and the number of answers labeled true for all three rubrics, as seen in Figure 13. Finally, regarding security, this benchmark uses general queries that do not require sensitive or personal information. However, since data privacy is of great importance in enterprise deployments, a practical implementation of these models might require personnel training or the adoption of open-source models that can be hosted locally, ensuring that business data does not leave the secure enterprise infrastructure.
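The "true for all three rubrics" count referenced above can be sketched as a conjunction over the three rubric verdicts. The data structure below is an illustration of the scheme, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Per-answer verdicts for the three evaluation rubrics."""
    correctness: bool
    legal_citation: bool
    clarity: bool

    def all_true(self) -> bool:
        """An answer passes only if every rubric verdict is true."""
        return self.correctness and self.legal_citation and self.clarity


# Toy scores for three answers; the second is factually correct and
# clear but cites a weak legal basis, so it fails the conjunction.
scores = [
    RubricScore(True, True, True),
    RubricScore(True, False, True),
    RubricScore(True, True, True),
]
n_all_true = sum(s.all_true() for s in scores)
```

Counting the strict conjunction separately from per-rubric rates is what surfaces legal citation quality as the weak link: an answer can be correct and clear yet still fail the combined criterion.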
6. Conclusions
This paper introduced RO-FIN-LLM, a Romania-specific benchmark for regulatory question answering in taxation and financial accounting. We evaluated state-of-the-art LLMs in closed-book and retrieval-augmented (RAG) settings using a rubric-based evaluation (correctness, legal citation quality, and clarity/structure) and validated a stratified subset with specialist human raters.
Our results show that retrieval augmentation substantially improves correctness, but legal citation quality remains a key weakness across models, reinforcing the need for careful evidence handling and human oversight in compliance-oriented deployments. The judge models exhibit task-specific variance, and while they are well-calibrated for qualitative assessments of clarity and structure, their high false-negative rates in correctness and legal citation rubrics suggest they are not yet a substitute for human oversight in legally rigorous domains.
We emphasize that RO-FIN-LLM is intended as a first public benchmark with limitations and should be viewed as an extensible foundation rather than a definitive standard.
Limitations and future work. The current evaluation uses a single LLM judge per answer (a randomly allocated judge, except for the subset used for human validation) to control cost, and the human validation set is limited (60 questions, or 180 Q&A pairs). In future work, we plan to (i) increase human validation coverage, (ii) adopt multi-judge evaluation (e.g., cross-judging or ensembles) and report judge agreement/sensitivity analyses on the whole dataset, (iii) report uncertainty for model comparisons via paired bootstrap confidence intervals, and (iv) strengthen evidence fidelity by logging retrieved passages/URLs and analyzing error modes (outdated legal basis, mismatched applicability, and incorrect effective dates).
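The paired bootstrap proposed in point (iii) resamples question indices with replacement and recomputes the mean difference between two models on each resample, which respects the pairing of answers to the same questions. A minimal sketch with hypothetical binary correctness scores:

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000,
                        alpha=0.05, seed=0):
    """Percentile confidence interval for the mean score difference
    (A - B) over the same questions, resampling question indices
    with replacement so the pairing is preserved."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        d = sum(scores_a[i] - scores_b[i] for i in idx) / n
        diffs.append(d)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical binary correctness scores on the same 10 questions.
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
lo, hi = paired_bootstrap_ci(a, b)
```

If the interval excludes zero, the difference between the two models is unlikely to be an artifact of which questions happened to be sampled; with only 10 toy items the interval is naturally wide.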
Reproducibility. To support independent reproduction and extension, we provide a release plan for the benchmark schema, prompts, evaluation scripts, and RAG logs (see Data Availability). OpenAI's model variant is gpt-5-2025-08-07, with tier 4 usage (RPM: 10k, TPM: 4M, Batch Queue Limit: 200M), from the default endpoint (api.openai.com). The Google Gemini model variant is gemini-2.5-pro (no snapshotting), released on 17 June 2025, with rate limits of 150 req/min, 2M tokens/min, and 10k req/day, from the Google AI Studio endpoint (generativelanguage.googleapis.com). Anthropic's Claude model variant is claude-sonnet-4-5-20250929, with tier 4 usage (4k req/min, 2M input tokens/min, 400k output tokens/min), from the api.anthropic.com endpoint. The machine configuration is: CPU—AMD Ryzen 9 5950X (16-core/32-thread, Zen 3, up to 5.086 GHz boost); GPU—2× NVIDIA GeForce RTX 3090 (24 GB VRAM each); RAM—128 GB DDR4; Storage—2 TB NVMe SSD (WD Black SN850X) + 12 TB HDD; OS—Pop!_OS 22.04 LTS (Ubuntu-based); Kernel—6.16.3; NVIDIA Driver—580.82.09.