Article

Analysis of Large Language Models for Company Annual Reports Based on Retrieval-Augmented Generation

1 School of Business, University of Applied Science and Arts Northwestern Switzerland, 4600 Olten, Switzerland
2 Institute for Information Systems, University of Applied Science and Arts Northwestern Switzerland, 4600 Olten, Switzerland
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 786; https://doi.org/10.3390/info16090786
Submission received: 30 July 2025 / Revised: 5 September 2025 / Accepted: 8 September 2025 / Published: 10 September 2025

Abstract

Large language models (LLMs) like ChatGPT-4 and Gemini 1.0 demonstrate significant text generation capabilities but often struggle with outdated knowledge, domain specificity, and hallucinations. Retrieval-Augmented Generation (RAG) offers a promising solution by integrating external knowledge sources to produce more accurate and informed responses. This research investigates RAG’s effectiveness in enhancing LLM performance for financial report analysis. We examine how RAG and the specific prompt design improve the provision of qualitative and quantitative financial information in terms of accuracy, relevance, and verifiability. Employing a design science research approach, we compare ChatGPT-4 responses before and after RAG integration, using annual reports from ten selected technology companies. Our findings demonstrate that RAG improves the relevance and verifiability of LLM outputs (by 0.66 and 0.71, respectively, on a scale from 1 to 5), while also reducing irrelevant or incorrect answers. Prompt specificity is shown to critically impact response quality. This study indicates RAG’s potential to mitigate LLM biases and inaccuracies, offering a practical solution for generating reliable and contextually rich financial insights.

1. Introduction

In the evolving landscape of natural language processing, large language models (LLMs) such as ChatGPT-4 [1] and Gemini 1.0 [2] have demonstrated remarkable capabilities in generating human-like text across a plethora of domains. However, the integration of Retrieval-Augmented Generation (RAG) presents a significant leap forward, particularly in addressing the challenges of outdated knowledge, domain specificity, and hallucinations—issues that undermine the trustworthiness and relevance of LLM outputs. RAG is a technique that enhances large language models (LLMs) by integrating external knowledge sources. It leverages retrieved documents to generate more accurate and informed responses, which is useful for tasks such as creative text generation, question answering, error correction, and summarization [3].
This research focuses on the effectiveness of RAG in enhancing the accuracy and relevance of LLM responses, especially in the analysis of financial reports. By investigating the accuracy of LLMs in an RAG setting, using financial data as a test case, this study aims to shed light on how RAG can mitigate biases and hallucinations inherent in LLMs. Specifically, the research examines RAG’s ability to provide qualitative and quantitative information in response to financial data queries, the impact of different prompts on the correctness (accuracy), relevance, and verifiability of the information retrieved, and the efficiency of RAG in conveying the source and context of financial data. Through a comprehensive literature review and empirical analysis, this study endeavors to contribute to the understanding of RAG’s potential in making LLM outputs more accurate, reliable, and domain-specific, thus broadening the applicability of LLMs in specialized fields such as finance.

1.1. Problem Statement

Although modern LLMs such as ChatGPT or Gemini can reflect a wide range of knowledge, this knowledge may be outdated or may not answer certain domain-specific questions [3]. As Gao et al. [4] stated, outdated knowledge can lead to inaccuracies, while hallucinations [5,6] and lack of domain expertise can undermine the trustworthiness and relevance of generated content. These challenges are especially apparent in dynamic fields, such as real-time news updates and specialized sectors, where the prompt and precise generation of information is of paramount importance [7]. According to Loukas et al. [8], the challenges in retrieval efficiency and domain adaptation stem from the complexity of effectively integrating external knowledge sources into LLMs. Furthermore, inadequate retrieval processes, limited domain-specific fine-tuning, and the lack of dynamic adaptation mechanisms contribute to the persistence of these challenges.
While the RAG approach offers promising solutions to enhance response accuracy in LLMs, there is a gap in understanding how this methodology can be effectively tailored to specific use cases. Additionally, research is scant on evaluating the efficacy of RAG in delivering both qualitative and quantitative outcomes across diverse applications. This gap highlights the need for focused studies that not only investigate the adaptability of RAG to varied contexts but also rigorously assess its performance in these settings. With our study we contribute a tailored RAG solution for assessing the content of financial documents based on state-of-the-art technologies and a qualitative evaluation of the implemented prototype (in comparison to a solution without RAG) considering the accuracy, relevance, and verifiability of responses.

1.2. Research Questions

LLMs exhibit training and dataset biases that lead to hallucinations in generative models. The aim of this study is to analyze hallucinations from LLMs applied to financial data. Using a set of documents such as annual reports of publicly listed companies and a set of test questions, this research investigates how accurate the answers are in an RAG scenario and whether the specific origin is traceable.
Considering the problem statement, the following main research question is formulated: How effective is RAG when integrated with LLMs in providing accurate and relevant responses for analyzing financial reports, especially for qualitative questions?
To answer the main research question, the following sub-research questions (RQs) are considered and categorized according to the respective phases of the research design process (awareness, suggestion, development, and evaluation):
  • RQ1: Awareness: How accurately does RAG provide qualitative information in response to specific financial data queries?
  • RQ2: Suggestion: What impact do different prompts have on the correctness (accuracy), relevance, and verifiability of financial information retrieved by RAG, and how effective is RAG in general for correct and verifiable financial data retrieval?
  • RQ3: Development: How could a suitable RAG approach be designed to enhance the accuracy, relevance, and verifiability of LLMs in analyzing financial reports?
  • RQ4: Evaluation: How effectively does RAG convey the source and context of financial data, such as referencing specific documents or indicating data sources?

2. Literature Review

In this section, we provide a review of the related literature, highlighting foundational theories and models, practical applications and use cases, along with recent advances and state-of-the-art techniques in RAG. The literature search was conducted using keywords such as RAG, LLM, and Artificial Intelligence in academic databases such as Scopus, Google Scholar, and Semantic Scholar.

2.1. Background and Models

Any model that can be fine-tuned to a wide range of subsequent tasks after being trained on large amounts of data is called a foundation model [9]. Examples include BERT, GPT-3, and CLIP [10]. LLMs are a class of general-purpose deep learning models that have been pre-trained on substantial volumes of text data; they can identify, translate, predict, or create content, as well as extract complex patterns from massive datasets [11,12].
According to Hamza and Awan [13], LLMs possess emergent skills such as in-context learning, reasoning, planning, decision-making, and question answering, and they allow for generalization and domain adaptability. When an AI model is trained to identify and classify objects or concepts without having previously seen examples of those categories or concepts, the process is known as “zero-shot learning”.
A powerful technique called RAG combines the computational abilities of LLMs with information retrieval. RAG has two phases: retrieval and content generation. The process begins by gathering relevant information from a dataset; such datasets can, for example, be obtained from search engines [14]. An LLM then uses the retrieved information as input to produce content or responses tailored to the user’s needs. RAG is particularly helpful for tasks such as information retrieval, question answering, and content summarization, as it combines generation and retrieval techniques to increase output accuracy and contextual relevance.
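To make the two phases concrete, the following minimal Python sketch shows retrieval followed by prompt augmentation. The toy corpus, the keyword-overlap scoring, and the helper names are illustrative assumptions only and do not reflect the embedding-based retrieval used later in this study.

# Minimal sketch of the two RAG phases (illustrative only):
# 1) retrieval: score documents against the query, 2) generation: augment the LLM prompt
# with the retrieved passages. Corpus and keyword-overlap scoring are toy assumptions.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by simple keyword overlap with the query and return the top k."""
    query_terms = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(query_terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_augmented_prompt(query: str, retrieved: list[str]) -> str:
    """Combine retrieved passages and the user query into a single prompt for the LLM."""
    context = "\n\n".join(retrieved)
    return (f"Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

corpus = [
    "Revenue for fiscal year 2022 was USD 5.5 billion, up 12% year over year.",
    "The company repurchased 3 million shares under its buyback program.",
    "Operating expenses consist of R&D, sales and marketing, and G&A costs.",
]
question = "What was the revenue for fiscal year 2022?"
prompt = build_augmented_prompt(question, retrieve(question, corpus))
print(prompt)  # this augmented prompt would then be sent to the LLM in the generation phase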

2.2. Applications and Use Cases

The applications and use cases of RAG span various domains and scenarios, showcasing its versatility and effectiveness in text generation tasks [3]. Furthermore, Siriwardhana et al. [7] highlight that RAG models are primarily designed for Open-Domain Question Answering (ODQA) tasks, where they excel in retrieving and generating answers to questions from a vast external knowledge base. Therefore, by leveraging retrieval mechanisms, RAG models can provide accurate and comprehensive answers to a wide range of questions.
RAG is highlighted as a cost-effective querying method for LLMs in resource-limited text classification, such as in the banking domain [8]. One of the standout applications of RAG is in the domain of financial sentiment analysis [15]. The conventional limitations of NLP models, restricted by the scale of their training data, often resulted in less effective analyses of complex financial information. RAG addresses these limitations by augmenting LLMs with external financial databases, news articles, and reports, thus significantly enhancing their analytical capabilities. For instance, Zhang et al. [15] demonstrated a remarkable improvement in financial sentiment analysis, achieving a 15% to 48% increase in accuracy over traditional models through the integration of RAG with instruction-tuned LLMs. Instruction-tuned large language models (LLMs) are AI models specifically optimized to accurately understand and execute natural language instructions, enhancing their reliability and usability in practical applications.
RAG has also proven invaluable in the legal domain, particularly in improving the evaluation of LLM-generated texts for legal question answering tasks. By utilizing relevant documents retrieved through the RAG process, models such as Eval-RAG have successfully identified factual inaccuracies in LLM outputs, aligning closely with the precision required in legal analysis [4]. This approach not only bolsters the reliability of LLMs in legal applications but also offers a novel method to assess their outputs based on factual correctness [4]. In addition, Ryu et al. [16] emphasized that in applications where sensitive information is involved, such as legal contexts, RAG methods can help alleviate ethical concerns by ensuring that the generated texts are based on relevant and accurate information retrieved from trusted sources.
In conclusion, by leveraging extensive external databases, RAG significantly enriches the output quality of language models across a range of natural language processing (NLP) tasks [17]. It is particularly effective in augmenting question answering and knowledge-intensive tasks, where precision and contextual depth are paramount [18]. Moreover, the RAG methodology is instrumental in mitigating hallucinations in model outputs, thereby enhancing their reliability for critical applications. Additionally, its capability to facilitate multilingual response generation expands its utility, positioning it as a versatile instrument in global communication initiatives [19]. Furthermore, these attributes underscore RAG’s pivotal role in bridging the divide between conventional language models and the complex requirements of specialized domains, heralding an era of more sophisticated, reliable, and adaptable AI-driven technologies.

2.3. Recent Advances and State-of-the-Art Techniques

A groundbreaking development in the area of NLP in the financial domain is the introduction of BloombergGPT, a 50-billion parameter language model trained on a wide range of financial data. Its training dataset comprises 345 billion tokens from general-purpose datasets and 363 billion tokens from Bloomberg’s proprietary financial data collected over the last four decades.
BloombergGPT consistently outperforms existing models such as GPT-3, BLOOM176B, GPT-NeoX, and OPT66B across a range of financial benchmarks. BloombergGPT excels in tasks related to finance, such as sentiment analysis and named entity recognition, where understanding the nuances of the financial language is extremely important [20].
Recent research has also introduced an innovative framework named Dynamic RAG with Information Needs (DRAGIN) [21]. DRAGIN’s approach to real-time information retrieval during text generation is to assess the necessity of retrieval based on an LLM’s uncertainty and the semantic significance of tokens. By using self-attention mechanisms, DRAGIN creates accurate retrieval queries tailored to the context of text generation. This results in more relevant and useful retrieved information. This approach enhances the coherence and accuracy of the generated text while balancing retrieval augmentation with computational costs. The DRAGIN framework consists of two components, Real-Time Information Needs Detection (RIND) and Query-Based Formulation based on Self-Attention (QFS). RIND is the process of identifying and understanding the user’s information needs as they evolve during the interaction with the LLM. RIND is designed to trigger retrieval based on the real-time information needs of LLMs. It considers the importance, semantics, and influence of each token. QFS involves formulating search queries using the self-attention mechanism. Self-attention allows the model to weigh the importance of different words in a query based on their relevance to each other. QFS leverages self-attention to generate more effective search queries tailored to the user’s information needs [21].
In contrast, Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection (KCTS) tackled the hallucination problem in LLMs through model-agnostic decoding methods and knowledge-constrained decoding techniques [22]. KCTS incorporates token-level hallucination detection and discriminator-guided decoding to ensure that the generated text remains faithful to the reference knowledge without the need for fine-tuning the language model. This improves the overall quality of the natural language generation tasks. KCTS addresses the hallucination problem in LLMs in the following ways.
Model agnosticism emphasizes the versatility of KCTS. It is not confined to a particular model architecture or training methodology. It can instead adapt to various language models. Knowledge-constrained decoding (KCD) employs an additional knowledge classifier atop a language model. This classifier helps identify hallucinations and directs the decoding process based on the model’s adherence to established knowledge. KCTS employs a technique called Reward Inflection Point Approximation (RIPA) to pinpoint where hallucinations may occur within the generated text, thereby enhancing the accuracy of the output. KCTS aims to constrain the generated text to be faithful to the reference knowledge without the need for fine-tuning the language model. KCTS utilizes discriminator-guided decoding to ensure that the generated text is grounded on the reference knowledge [22].
The idea behind the CHAIN-OF-NOTE (CON) is to generate sequential reading notes for retrieved documents. This allows for a systematic evaluation of the relevance of the retrieved information to the input question before formulating a final response. The framework aims to filter out irrelevant or less credible content, leading to more precise and contextually relevant responses [23].
Regarding the evaluation of LLM solutions related to financial applications, especially in question answering, the focus is usually on typical machine learning-related measures such as accuracy. For instance, in the RAG-based solution suggested in [24], an accuracy of 78.6% and a recall of 89.2% were found for a fully optimized configuration. An RAG-based solution for question answering for financial reports is also investigated in [25], showing accuracy values between 57% and 93% for retrieving text data in different setups of the considered solution.
A solution developed by Gondhalekar, Patel, and Yeh [26] considered RAG for question answering in financial documents with text-, table-, and image-based contents. For text-related queries, accuracy values of 83.6% and 90.4% (depending on the LLM considered) were found. In [27], an RAG-based solution utilizing function calling on an SQL database with financial information is investigated. The evaluation focuses on the syntactic and semantic similarity of the generated responses to reference responses, considering measures such as BLEU, SacreBLEU, and BERTScore, and indicates promising results.
While considerable research focuses on enhancing the accuracy of LLMs using RAG, a notable gap exists in exploring the quality of answers to quantitative and qualitative questions while also considering broader assessment metrics. Specifically, for analyzing financial documents, more comprehensive studies are needed to assess how RAG can improve the relevance and reliability of LLM-generated responses, particularly for domain-specific queries and the verifiability of sourced information. This research addresses this gap by systematically evaluating the effectiveness of RAG-enhanced LLMs in financial data analysis, providing insights into their potential to mitigate biases and hallucinations inherent in traditional LLMs. In particular, we consider the relevance and the verifiability of the LLM responses in addition to the usually considered accuracy.

3. Methodology

To answer the research questions and validate the thesis statement, this study adopts a design science research (DSR) approach with a strong emphasis on experiments and evaluation. Initially, financial reports are processed by an LLM to generate baseline answers to a set of predefined questions. Next, the same financial reports are analyzed using an RAG approach. The responses generated by the RAG-enhanced LLM are then compared to the initial LLM responses. This comparison allows for a thorough evaluation of the quality improvements achieved through the RAG approach, specifically regarding the answers’ accuracy, relevance, and verifiability. This study uses only a single LLM to maintain simplicity and focus on evaluating the RAG approach and its potential quality improvements.
We selected 10 companies for our research based on the criteria listed below (Table 1). The number of 10 companies is certainly rather small compared to the total number of listed companies, but it appears sufficient to represent those fulfilling our specific criteria and is comparable to the samples considered in similar studies on evaluating annual reports, such as [28,29,30], also considering that the study is qualitative in nature and relies on manual assessment. The selection criteria below serve as a robust framework to guide our research efforts. We believe that these criteria help reduce the variance in regulatory reporting and accounting standards and practices.
  • Single Regulatory Market (US): Our selection process prioritizes companies operating within a single regulatory environment, specifically focusing on the United States. This criterion streamlines our research scope, enabling a deeper understanding of the regulatory landscape and its implications for the selected companies.
  • NASDAQ Listing: We have exclusively chosen companies listed on the NASDAQ stock exchange. NASDAQ-listed companies often represent dynamic, growth-oriented firms with a strong emphasis on innovation and technology.
  • Exclusion of Financial Sector: To maintain consistency and comparability across our research cohort, we excluded companies operating within the financial sector. Financial institutions often employ unique accounting practices and regulatory frameworks distinct from those of other industries, which could skew our analysis. By excluding this sector, we ensure a more homogenous group for meaningful insights.
  • Focus on the Technology Sector: Our research targets companies within the technology sector, characterized by their asset-light business models and emphasis on innovation-driven growth. Unlike traditional industries, technology firms typically rely less on physical assets and more on intellectual property, talent, and digital infrastructure. This strategic focus allows us to explore the dynamics of disruptive technologies, market trends, and competitive strategies within this high-growth sector.
  • Revenue Range of USD 5 to USD 7 Billion: Within the technology sector, we have focused on companies with annual revenues ranging from USD 5 to USD 7.3 billion. This is to reduce the variance in the complexity of the reports generated by the companies.
These selection criteria ensure that we do not analyze an overly diverse set of companies, which could lead to stronger deviations in the results. In addition, this facilitates the manual examination of responses.
Our research methodology consists of the analysis of responses generated by the LLM to a series of ten questions (Table 2). These questions are divided into two categories: qualitative and quantitative. The qualitative segment comprises four inquiries aimed at capturing non-financial information. The quantitative segment encompasses six inquiries, with three focusing on extracting financial data readily available within annual reports and the remaining three targeting the calculation of financial ratios derived from data provided within these reports (standard definitions of these ratios are sketched below).
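For reference, the three calculated ratios follow standard textbook definitions. The snippet below is a minimal illustration with hypothetical balance-sheet figures; note that the debt-to-equity ratio can also be computed with total debt instead of total liabilities, so the exact line items depend on the convention applied to each report.

# Standard definitions of the three calculated ratios (hypothetical figures, USD million).
current_assets = 4_200.0
current_liabilities = 2_100.0
inventory = 300.0
total_liabilities = 5_600.0
shareholders_equity = 3_500.0

current_ratio = current_assets / current_liabilities                    # 2.00
quick_ratio = (current_assets - inventory) / current_liabilities        # 1.86
debt_to_equity = total_liabilities / shareholders_equity                # 1.60

print(f"current ratio:  {current_ratio:.2f}")
print(f"quick ratio:    {quick_ratio:.2f}")
print(f"debt-to-equity: {debt_to_equity:.2f}")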
For our use case, we use the functionalities of OpenAI’s GPT-4 model as provided via ChatGPT. We decided to use this model because the current GPT-4 version provides a 128k context length and allows text and image input and output. As our study involves the analysis of data from annual reports in PDF format, GPT-4 is the best-fitting model for our research.
The process begins with the user query, which triggers two main actions: retrieval and generation. Upon sending the user query, an API initiates the retrieval by gathering current data from a content store, which in our case consists of the annual reports in PDF format of the companies selected for our use case, uploaded to GPT-4. Subsequently, this information is forwarded to the LLM (GPT-4), along with the user’s query and the retrieved relevant data snippets. In the generation phase, these data snippets are used to augment the query, and the model is instructed to generate a response based on the user query and the retrieved context [31].
To evaluate the effectiveness of the RAG approach integrated with LLMs in the context of financial report analysis, we employ a self-evaluation methodology. The responses before and after the RAG enhancement are compared based on three key criteria: accuracy, relevance, and verifiability, with each criterion scored on a scale from 0 to 5. By analyzing these scores, we aim to quantify the improvements resulting from the RAG approach, thereby demonstrating its potential to enhance the quality and reliability of LLM-generated outputs in financial data analysis.
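A minimal sketch of how such a criterion-wise comparison can be tabulated is shown below; the dictionary layout and the example scores are placeholders for illustration, not the study’s actual evaluation data.

# Sketch of the before/after comparison on the three criteria (placeholder scores).
# Each answer is scored per criterion; we average per criterion and report the delta.
from statistics import mean

criteria = ("accuracy", "relevance", "verifiability")

# One score dict per (company, question) pair -- placeholder values for illustration.
scores_without_rag = [
    {"accuracy": 5, "relevance": 4, "verifiability": 4},
    {"accuracy": 5, "relevance": 4, "verifiability": 4},
]
scores_with_rag = [
    {"accuracy": 5, "relevance": 5, "verifiability": 5},
    {"accuracy": 4, "relevance": 5, "verifiability": 5},
]

for criterion in criteria:
    before = mean(s[criterion] for s in scores_without_rag)
    after = mean(s[criterion] for s in scores_with_rag)
    print(f"{criterion:>13}: {before:.2f} -> {after:.2f} (delta {after - before:+.2f})")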

4. Results

This section delineates the findings from our investigation into the efficacy of RAG in augmenting the accuracy and relevance of LLMs for financial data analysis. The results underscore the significant enhancements achieved through the integration of RAG. The results are shown in Appendix A.

4.1. Accuracy of Qualitative Information

The accuracy scores before and after the RAG introduction (see Figure 1) were almost identical, showing that RAG maintained a high response quality, while relevance and verifiability improved. A significant benefit observed was the reduction in irrelevant or incorrect answers, indicating the model’s improvement in avoiding irrelevant information. The verifiability scores after RAG reflect the system’s ability to generate reliable and verifiable information. RAG’s ability to efficiently retrieve and present verifiable financial data was confirmed, significantly improving the quality of the information provided by the LLM.
The research contributes to understanding how RAG can be effectively integrated with LLMs to improve their performance in domain-specific tasks such as financial report analysis. The findings highlight the potential of RAG to mitigate biases and inaccuracies in traditional LLMs, making them more suitable for specialized applications. The developed tool, which integrates RAG with ChatGPT-4 for financial data analysis, provides a practical solution for businesses and researchers by ensuring accurate and contextually rich responses.
The study confirms that using RAG to enhance the capabilities of LLMs leads to more accurate and relevant responses when analyzing company annual reports. While this study provides valuable insights, further research is needed to explore the scalability of RAG-enhanced LLMs across different domains and larger datasets. Investigating the long-term impacts of continuous learning and adaptation mechanisms in LLMs could enhance their utility and reliability.
Figure 1 illustrates the comparison of the quality metrics (accuracy, relevance, and verifiability) for the LLMs before and after the integration of RAG. The chart shows that while accuracy slightly decreased from 4.89 to 4.79 (−0.1), relevance and verifiability significantly improved from 4.16 to 4.81 (+0.66) and from 4.11 to 4.82 (+0.71), respectively. This demonstrates that integrating RAG enhances the overall relevance and reliability of the responses generated by the LLMs.

4.2. Impact of Different Prompts

To assess the impact of different prompts on the responses generated by the LLM, queries were posed with three levels of granularity. Firstly, the simplest form involved directly asking the LLM the question without any additional context. Secondly, the queries included a moderate amount of context to provide some background information. Finally, the most detailed prompts included extensive context and a brief description of the relevant document.
For instance, a simple prompt may just ask “What is the company’s policy for revenue recognition?” while a more specific one provides some further information such as company name and context, e.g., “Analyze the Financial Report of AutoDesk2022 uploaded above and let me know if the auditors have expressed a qualified or unqualified opinion about the financial statements. Please provide relevant details.” A very specific prompt further extends the background information or makes more specific what is asked for, e.g., “Have there been any changes in the accounting standards or practices adopted by AutoDesk for the fiscal year 2022? If so, please provide detailed information about the specific changes implemented, the reasons for these changes, and their impact on the financial statements. Additionally, mention if there were any new accounting pronouncements adopted during the fiscal year and their effects on the company’s financial reporting.”
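The three granularity levels can be expressed as simple templates, as sketched below. The wording follows the examples above, while the helper names and parameters (e.g., company, fiscal_year, extra_context) are our own illustrative choices rather than the exact prompts used in the study.

# Prompt templates for the three granularity levels (wording follows the examples in the text;
# function and parameter names are illustrative).

def simple_prompt(question: str) -> str:
    return question

def moderate_prompt(question: str, company: str, fiscal_year: int) -> str:
    return (f"Analyze the financial report of {company} for fiscal year {fiscal_year} "
            f"uploaded above and answer the following question. {question} "
            "Please provide relevant details.")

def detailed_prompt(question: str, company: str, fiscal_year: int, extra_context: str) -> str:
    return (f"Analyze the annual report of {company} for fiscal year {fiscal_year}. "
            f"{question} {extra_context} "
            "If applicable, reference the specific sections of the report you used.")

q = "Have there been any changes in the accounting standards or practices adopted by the company?"
print(detailed_prompt(q, "Autodesk", 2022,
                      "If so, describe the specific changes, the reasons for them, "
                      "and their impact on the financial statements."))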
This approach aimed to evaluate whether varying degrees of prompt detail affect the accuracy, relevance, and verifiability of the LLM’s answers and, if so, how large the differences between the granularities are. Additionally, it examined whether there is a difference between the answers of the LLM with and without the RAG enhancement.
It was observed that simple prompts achieve an average of 4.76 for accuracy, 4.71 for relevance, and 4.71 for verifiability (see Figure 2). The slightly more precise prompts resulted in an accuracy grade of 4.07, a relevance grade of 3.94, and a verifiability grade of 3.95; this may be attributed to missing data. Lastly, the very detailed prompts resulted in an accuracy grade of 4.73, a relevance grade of 4.75, and a verifiability grade of 4.76.
Regarding the difference before and after the RAG improvement, a comparison of Figure 2 and Figure 3 shows that, for the solution utilizing RAG, the accuracy, relevance, and verifiability of the responses increase with the specificity of the prompt. This contrasts with the solution without RAG, which apparently cannot always benefit from further prompt details when content retrieved from relevant documents is not provided. Therefore, for generating high-quality responses, the use of very specific prompts is highly recommended.
Experiments with different prompt structures revealed that prompts explicitly requesting source references and providing further details markedly improved the verifiability of responses. The correctness of the retrieved financial information increased by 5.73% with prompts emphasizing the context of the information to be retrieved, as observed for the very specific prompt category. However, the gains from additional prompt specificity diminish, so it appears questionable whether providing even more detail in the prompts would be beneficial. It should also be considered that manual prompt generation requires effort, so, depending on the use case, the prompt should not become overly detailed.
Moreover, the efficiency of RAG in retrieving and presenting verifiable financial data was quantified, showing a reduction in the incidence of hallucinated or irrelevant responses by 15.55%. This is based on the combined average improvement in relevance and verifiability across all prompt types.

4.3. Suitable RAG Approach

The developed artifact integrates RAG with ChatGPT-4 via the OpenAI API, specifically for financial data analysis. The implementation follows these steps: Firstly, the retrieval module sources relevant financial documents, in this case annual reports, using advanced search algorithms to ensure high relevance. This data is then processed and fed into the generation module. In the generation phase, the LLM uses the retrieved data to generate responses that are accurate and contextually relevant. This integration of RAG ensures that the model provides current and specific information, reducing hallucinations and inaccuracies. The artifact’s workflow begins with query-triggered document retrieval, followed by information processing and response generation. This approach leverages the synergy between retrieval and generation to enhance the precision and reliability of the LLM outputs. Detailed technical specifications, including retrieval algorithms, data preprocessing, and integration methods, are provided in Appendix A.

5. Discussion

Our study explored how integrating RAG enhances the accuracy and relevance of LLMs when analyzing company annual reports. The findings suggest that RAG significantly improves the accuracy, relevance, and verifiability of LLM responses, particularly in the context of financial data analysis.
Our key findings are as follows:
  • Enhanced accuracy and relevance: Integrating RAG with LLMs improved the accuracy of qualitative responses and their alignment with current financial contexts, indicating the potential of RAG in refining the performance of LLMs in domain-specific tasks. These findings confirm those of other studies such as [24,26].
  • Impact of prompt specificity: The level of prompt specificity significantly affects the quality of LLM responses, with detailed prompts resulting in higher accuracy, relevance, and verifiability. This is also in accordance with the findings of other studies such as [32].
  • Reduction in hallucinations: By using RAG, the incidence of hallucinations or irrelevant responses was reduced by 15.55%, highlighting the role of RAG in enhancing the reliability of LLM outputs. Nevertheless, hallucinations cannot be fully avoided, so this risk needs to be considered in any practical application of such a solution.
  • Efficiency in data retrieval: RAG demonstrated efficiency in retrieving and presenting verifiable financial data, particularly when prompts explicitly requested source references and provided further context.
  • It is interesting to note that for accuracy, the solution without RAG performed similarly to the solution with RAG (which is clearly better regarding relevance and verifiability). We assume that this is caused by the type of questions, which mostly ask for general and qualitative information. In other studies focusing on specific quantitative information (such as from financial tables) [30], we found a clear advantage of an RAG-based solution.
While these qualitative findings are rather explicit in our study, there may be biases due to the selection of test data (selected companies) and the chosen test questions that may in particular affect the specific quantitative results. As is known from other studies (e.g., [28]), there are various factors influencing information extraction capabilities from annual reports in PDF format. These include business-related aspects such as reporting standards and the complexity of the document (which is frequently related to the size and complexity of the company), the layout and formatting of the report, and technical aspects regarding the generation of the PDF file.

6. Conclusions

We investigated how RAG can be effectively combined with LLMs to enhance their performance in specialized tasks such as financial report analysis. The results emphasize the potential of RAG to reduce biases and inaccuracies in traditional LLMs, making them more suitable for specific applications.
The developed tool integrates RAG with ChatGPT-4 for financial data analysis, offering a practical solution for businesses and researchers. By using advanced retrieval algorithms and the generative capabilities of LLMs, the tool ensures accurate, relevant, and contextually rich responses, which can significantly benefit financial analysis and decision-making processes.
While this study offers valuable insights, more research is needed to assess the scalability of RAG-enhanced LLMs across different domains and larger datasets. First, to overcome a major limitation of our study, we suggest a more comprehensive exploration of the proposed approach involving a larger and more diverse set of companies, e.g., not only US technology firms. For instance, all companies in a particular stock market index could be investigated in a future study. This would indicate the generalizability of our approach to further companies and allow for a stronger evaluation using statistical testing methods such as t-tests or ANOVA. As discussed above, the current focus on 10 selected companies may induce a bias considering the total number of companies publishing annual reports; we therefore suggest a more comprehensive study, which would require a more powerful computation setup given the demands on storage and processing capabilities in the RAG setup. To further reduce bias, we suggest using external evaluators or developing a suitable approach for the automatic assessment of LLM responses, given the significant manual effort involved when a large number of companies is considered.
In addition, further exploration of the long-term impacts of continuous learning in LLMs could enhance their utility and reliability.
In conclusion, integrating RAG with LLMs shows promise in overcoming the limitations of traditional language models. By improving accuracy and verifiability, RAG enables LLMs to deliver more reliable and domain-specific outputs, enhancing their effectiveness, particularly in financial analysis.
For future research, we also suggest applying our concept to other sectors and use cases, which appears quite straightforward. For instance, contracts, product specifications, or technical documents could be analyzed in a similar way. In addition, it would be useful to conduct similar studies on documents in further languages to explore the multilingual capabilities of modern LLMs.

Author Contributions

Conceptualization, methodology, investigation, A.M., B.P. and C.D.; validation, supervision, T.H.; writing—original draft preparation, A.M., B.P. and C.D.; writing—review and editing, T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Documentation of the RAG Chatbot

The script implements a chat-like interface using RAG to answer questions based on the contents of PDF documents. The script extracts text from PDFs, generates embeddings for these texts, and then uses OpenAI’s GPT-4 model to generate responses based on relevant documents retrieved using the nearest neighbor search.
Key Components and Steps:
  • Importing Libraries: The script imports the necessary libraries for PDF text extraction (pdfminer.six), embedding generation (transformers, torch), nearest neighbor search (sklearn), and response generation (openai).
  • Setting Up Directory and Extracting Text from PDFs: The directory containing the PDF files is specified. Text is extracted from each PDF using the pdfminer.six library. Extracted texts are saved to a file to avoid redundant processing in future runs.
  • Generating Embeddings: A pre-trained BERT model from Hugging Face is used to tokenize and encode the extracted texts. Embeddings are generated from the encoded texts using the BERT model and saved to a file.
  • Nearest Neighbors Model: A nearest neighbors model from sklearn is created and fitted with the document embeddings. This model is used to retrieve the most relevant documents for a given query.
  • Retrieving Relevant Documents: The retrieve_documents function encodes a query and finds the most relevant documents using the nearest neighbors model.
  • Truncating Context: The truncate_context function limits the size of the context to a specified number of tokens to ensure that the input stays within the token limit for OpenAI’s API.
  • Generating Response with GPT-4: The generate_response_with_gpt4 function generates a response using OpenAI’s GPT-4 model. It constructs the prompt using the retrieved documents and the query, ensuring that the context is truncated to fit within the token limit.
  • Chat-like Interaction: The chat function handles user input in a loop, allowing for a chat-like interaction. It retrieves relevant documents for each query, generates a response using GPT-4, and prints the response.
The complete code is available on request as a file “OpenAI_RAG.py”. Please note that a personal OpenAI API key must be entered for the script to function, and the directory for the retrieved documents should be adjusted.
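A condensed sketch of these steps is given below. It mirrors the components and function names listed above (pdfminer.six text extraction, Hugging Face BERT embeddings, an sklearn nearest neighbors index, and the OpenAI chat completions API), but the specific model name (bert-base-uncased), the character-based context truncation, the directory name, and other parameters are simplifying assumptions; the full OpenAI_RAG.py script differs in detail.

# Condensed sketch of the RAG chatbot pipeline described above (not the full OpenAI_RAG.py).
# Assumptions: bert-base-uncased embeddings, cosine nearest-neighbor retrieval over whole
# documents, and the openai>=1.0 client; adjust the PDF directory and API key as needed.
import os
import torch
from pdfminer.high_level import extract_text
from sklearn.neighbors import NearestNeighbors
from transformers import AutoModel, AutoTokenizer
from openai import OpenAI

PDF_DIR = "annual_reports"  # directory containing the PDF documents (adjust as needed)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(text: str) -> torch.Tensor:
    """Mean-pooled BERT embedding of (the first 512 tokens of) a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Extract text from each PDF and embed it.
texts = [extract_text(os.path.join(PDF_DIR, f))
         for f in sorted(os.listdir(PDF_DIR)) if f.endswith(".pdf")]
embeddings = torch.stack([embed(t) for t in texts]).numpy()
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embeddings)

def retrieve_documents(query: str) -> list[str]:
    """Return the documents closest to the query in embedding space."""
    _, idx = index.kneighbors(embed(query).numpy().reshape(1, -1))
    return [texts[i] for i in idx[0]]

def truncate_context(context: str, max_chars: int = 12000) -> str:
    """Crude length cap so the prompt stays within the model's context window."""
    return context[:max_chars]

def generate_response_with_gpt4(query: str) -> str:
    """Build a prompt from the retrieved documents and generate an answer with GPT-4."""
    context = truncate_context("\n\n".join(retrieve_documents(query)))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the provided annual report excerpts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    while True:  # chat-like interaction loop
        question = input("Question (empty to quit): ").strip()
        if not question:
            break
        print(generate_response_with_gpt4(question))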

References

  1. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2024, arXiv:2303.08774v6. [Google Scholar]
  2. Gemini Team; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar]
  3. Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; Chen, E.; et al. CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models. arXiv 2024, arXiv:2401.17043. [Google Scholar] [CrossRef]
  4. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-augmented generation for large language models: A survey. arXiv 2024, arXiv:2312.10997v5. [Google Scholar] [CrossRef]
  5. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv 2023, arXiv:2311.05232. [Google Scholar] [CrossRef]
  6. McKenna, N.; Li, T.; Cheng, L.; Hosseini, M.J.; Johnson, M.; Steedman, M. Sources of hallucination by large language models on inference tasks. arXiv 2023, arXiv:2305.14552. [Google Scholar] [CrossRef]
  7. Siriwardhana, S.; Weerasekera, R.; Wen, E.; Kaluarachchi, T.; Rana, R.; Nanayakkara, S. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Trans. Assoc. Comput. Linguist. 2023, 11, 1–17. [Google Scholar] [CrossRef]
  8. Loukas, L.; Stogiannidis, I.; Diamantopoulos, O.; Malakasiotis, P.; Vassos, S. Making LLMs worth every penny: Resource-limited text classification in banking. In Proceedings of the ICAIF ‘23: 4th ACM International Conference on AI in Finance, Brooklyn, NY, USA, 27–29 November 2023; pp. 392–400. [Google Scholar] [CrossRef]
  9. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  10. Zhou, C.; Li, Q.; Li, C.; Yu, J.; Liu, Y.; Wang, G.; Zhang, K.; Ji, C.; Yan, Q.; He, L.; et al. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv 2023, arXiv:2302.09419. [Google Scholar] [CrossRef]
  11. Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv. 2024, 56, 1–40. [Google Scholar] [CrossRef]
  12. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A comprehensive overview of large language models. arXiv 2024, arXiv:2307.06435. [Google Scholar] [CrossRef]
  13. Hamza, M.; Awan, W.N. Understanding the Landscape of Generative AI: A Computational Literature Review. 2025. Available online: https://ssrn.com/abstract=5327156 (accessed on 29 July 2025).
  14. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking large language models in retrieval-augmented generation. Proc. AAAI Conf. Artif. Intell. 2023, 38, 17754–17762. [Google Scholar] [CrossRef]
  15. Zhang, B.; Yang, H.; Zhou, T.; Ali Babar, M.; Liu, X.-Y. Enhancing financial sentiment analysis via retrieval augmented large language models. In Proceedings of the ICAIF ‘23: 4th ACM International Conference on AI in Finance, Brooklyn, NY, USA, 27–29 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 349–356. [Google Scholar] [CrossRef]
  16. Ryu, C.; Lee, S.; Pang, S.; Choi, C.; Choi, H.; Min, M.; Sohn, J.-Y. Retrieval-based evaluation for LLMs: A case study in Korean legal QA. In Proceedings of the Natural Legal Language Processing Workshop 2023; Association for Computational Linguistics: Singapore, 2023; pp. 132–137. [Google Scholar]
  17. Finardi, P.; Avila, L.; Castaldoni, R.; Gengo, P.; Larcher, C.; Piau, M.; Costa, P.; Caridá, V. The chronicles of RAG: The retriever, the chunk and the generator. arXiv 2024, arXiv:2401.07883. [Google Scholar] [CrossRef]
  18. Ren, Y.; Cao, Y.; Guo, P.; Fang, F.; Ma, W.; Lin, Z. Retrieve-and-sample: Document-level event argument extraction via hybrid retrieval augmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 293–306. [Google Scholar]
  19. Wang, R.; Bao, J.; Mi, F.; Chen, Y.; Wang, H.; Wang, Y.; Li, Y.; Shang, L.; Wong, K.-F.; Xu, R. Retrieval-free knowledge injection through multi-document traversal for dialogue models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 6608–6619. [Google Scholar] [CrossRef]
  20. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A large language model for finance. arXiv 2023, arXiv:2303.17564. [Google Scholar] [CrossRef]
  21. Su, W.; Tang, Y.; Ai, Q.; Wu, Z.; Liu, Y. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 12991–13013. [Google Scholar]
  22. Choi, S.; Fang, T.; Wang, Z.; Song, Y. KCTS: Knowledge-constrained tree search decoding with token-level hallucination detection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 14035–14053. [Google Scholar]
  23. Yu, W.; Zhang, H.; Pan, X.; Ma, K.; Wang, H.; Yu, D. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv 2023, arXiv:2311.09210. [Google Scholar]
  24. Wang, J.; Ding, W.; Zhu, X. Financial analysis: Intelligent financial data analysis system based on llm-rag. arXiv 2025, arXiv:2504.06279. [Google Scholar] [CrossRef]
  25. Sælemyr, J.; Femdal, H.T. Chunk Smarter, Retrieve Better: Enhancing LLMS in Finance: An Empirical Comparison of Chunking Techniques in Retrieval Augmented Generation for Financial Reports. Master’s Thesis, Norwegian School of Economics, Bergen, Norway, 2024. [Google Scholar]
  26. Gondhalekar, C.; Patel, U.; Yeh, F.C. MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering. arXiv 2025, arXiv:2506.20821. [Google Scholar]
  27. Chinaksorn, N.; Wanvarie, D. LLM-RAG for Financial Question Answering: A Case Study from SET50. In Proceedings of the 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 18–21 February 2025; pp. 0952–0957. [Google Scholar]
  28. Balsiger, D.; Dimmler, H.R.; Egger-Horstmann, S.; Hanne, T. Assessing large language models used for extracting table information from annual financial reports. Computers 2024, 13, 257. [Google Scholar] [CrossRef]
  29. Bas, H.; Christen, M.; Radović, D.; Hanne, T. Large Language Models for Table Content Extraction from Annual Reports. In Congress on Intelligent Systems; Springer Nature: Singapore, 2024; pp. 61–79. [Google Scholar]
  30. Iaroshev, I.; Pillai, R.; Vaglietti, L.; Hanne, T. Evaluating Retrieval-Augmented Generation Models for Financial Report Question and Answering. Appl. Sci. 2024, 14, 9318. [Google Scholar] [CrossRef]
  31. Martineau, K. What Is Retrieval-Augmented Generation? IBM Research Blog, 1 May 2024. Available online: https://research.ibm.com/blog/retrieval-augmented-generation-RAG?ref=robkerr.ai (accessed on 29 July 2025).
  32. Rosa, S. Large Language Models for Requirements Engineering. Ph.D. Dissertation, Politecnico di Torino, Turin, Italy, 2025. [Google Scholar]
Figure 1. Comparison of quality metrics before (left columns in blue) and after (right columns in green) RAG.
Figure 2. Average values by prompt type without RAG.
Figure 3. Average values by prompt type with RAG.
Table 1. List of companies selected for the study.

Sr. No. | Symbol | Company | Revenue (USD Billion) | Market Cap (USD Billion) | Sector Classification
1 | PAYX | Paychex, Inc., Rochester, NY, USA | 5.007 | 118.45 | Technology Services
2 | FTNT | Fortinet, Inc., Sunnyvale, CA, USA | 5.305 | 65.2 | Technology Services
3 | TTWO | Take-Two Interactive Software, Inc., New York, NY, USA | 5.351 | 43.07 | Technology Services
4 | ADSK | Autodesk, Inc. | 5.497 | 209.95 | Technology Services
5 | SSNC | SS&C Technologies Holdings, Inc., San Francisco, CA, USA | 5.503 | 61.52 | Technology Services
6 | SNPS | Synopsys, Inc., Sunnyvale, CA, USA | 5.853 | 523.38 | Technology Services
7 | ROP | Roper Technologies, Inc., Sarasota, FL, USA | 6.178 | 510.82 | Technology Services
8 | PANW | Palo Alto Networks, Inc., Santa Clara, CA, USA | 6.893 | 295.32 | Technology Services
9 | WDAY | Workday, Inc., Pleasanton, CA, USA | 7.197 | 250.85 | Technology Services
10 | EA | Electronic Arts Inc., Redwood City, CA, USA | 7.241 | 128.5 | Technology Services
Table 2. List of test questions.

No. | Question | Category
1 | What is the net revenue for the fiscal year and its breakdown? | Quantitative
2 | What is the cost of revenue for the fiscal year and its breakdown? | Quantitative
3 | What are the operating expenses for the fiscal year and their breakdown? | Quantitative
4 | What is the current ratio as per the balance sheet? | Quantitative—Calculated
5 | What is the debt-to-equity ratio as per the balance sheet? | Quantitative—Calculated
6 | What is the quick ratio as per the balance sheet? | Quantitative—Calculated
7 | Have the auditors expressed a qualified or unqualified opinion about the financial statements? | Qualitative
8 | Have there been any changes in the accounting standards or practices adopted by the company for the fiscal year? | Qualitative
9 | What is the company’s policy for revenue recognition? | Qualitative
10 | Does the company have a stock repurchase program? If yes, what is it? | Qualitative
