Beyond Fuzzy Matching: A Dual-Augmentation RAG System for Robust Product Reconciliation in Accounting

Dadopoulos, Michail; Moschidis, Stratos

doi:10.3390/jrfm19060402

by Michail Dadopoulos ¹,* and Stratos Moschidis ²

¹ Information Technologies Institute, Centre for Research & Technology Hellas, 57001 Thessaloniki, Greece

² Department of Accounting and Finance, University of Macedonia, 54636 Thessaloniki, Greece

* Author to whom correspondence should be addressed.

J. Risk Financial Manag. 2026, 19(6), 402; https://doi.org/10.3390/jrfm19060402

Submission received: 27 April 2026 / Revised: 26 May 2026 / Accepted: 29 May 2026 / Published: 31 May 2026

(This article belongs to the Special Issue Judgment and Decision-Making Research in Auditing, 2nd Edition)

Abstract

Accurate product-to-catalog invoice matching is a foundational internal control for financial oversight and audit quality, yet it is bottlenecked by inconsistent vendor descriptions and the resulting ‘long tail’ of supplier heterogeneity, driving costly manual reconciliation in Enterprise Resource Planning (ERP) environments. This study pursues three objectives: (i) to design a Retrieval-Augmented Generation (RAG) architecture that matches invoice line items to a product catalog under conditions of optical character recognition noise, vendor-specific abbreviations, and multilingual heterogeneity; (ii) to evaluate this architecture on three public entity resolution benchmarks against established lexical and Dense retrieval baselines; and (iii) to assess its viability as a decision support system in a real accounts payable workflow with audit-trail requirements. To address (i), we introduce a novel ‘augment-both-sides’ strategy: large language models (LLMs) proactively enrich each catalog Stock Keeping Unit (SKU) with synonyms and alternative descriptions before vectorization, while invoice lines undergo runtime query expansion, and an LLM-based reranker produces the final Top-3 candidates. For (ii), evaluation on the Abt-Buy, Amazon-Google, and Walmart-Amazon datasets yields Top-3 Recall of 91.60% to 97.96%, matching or exceeding the strongest non-LLM baseline on every benchmark. For (iii), a production deployment on approximately 200 manually verified Greek invoice lines (proprietary dataset, anecdotal observation) yields a Top-3 hit rate of approximately 97%, consistent with the public-benchmark results. The architecture functions as a reliable intelligent decision aid, narrowing the search space from thousands of SKUs to a precise candidate set for structured human verification.

Keywords:

invoice reconciliation; retrieval-augmented generation; large language models; entity resolution; accounts payable; decision support systems; audit trail

1. Introduction

Automated invoice processing is a core component of modern ERP and financial accounting systems, which aim to integrate and streamline end-to-end financial operations (Grabski et al., 2011; O’Leary, 2000). A critical and challenging component of this workflow is the accurate reconciliation of product line items from vendor invoices against an internal corporate catalog. This reconciliation is a foundational internal control, forming the basis of the “three-way match”¹ required for rigorous financial oversight and the prevention of fraudulent or erroneous payments. Evaluating the literal discrepancies between invoice and catalog descriptions requires professional judgment, and the consistency of that judgment is of direct interest to internal audit functions that test the reliability of the underlying controls. This task is notoriously difficult and represents a complex “fuzzy matching” problem. Invoice product descriptions are often noisy because of optical character recognition (OCR) errors, are highly abbreviated, and use heterogeneous, vendor-specific terminology, especially when derived from scanned or semi-structured documents (Cristani et al., 2018; Ha & Horák, 2022). In contrast, internal catalogs may contain structured, but lexically different, descriptions. Traditional fuzzy string-matching algorithms, which rely on character-level similarity (e.g., Levenshtein distance) or simple keyword-based searches, are brittle and fail in this low-signal, high-variance environment (Cohen et al., 2003).

This failure of traditional automation is a primary source of operational friction in modern accounts payable (AP), with direct implications for audit quality and financial-data integrity. It breaks the ‘touchless’ processing workflow, forcing literal discrepancies to be routed for manual intervention. Such manual invoice handling, data entry, visual inspection, reconciliation, and archiving remains slow and costly in practice and can materially delay downstream payments and procurement operations (Krieger et al., 2023). Under cognitive fatigue, human evaluators face a complex decision task that increases the risk of misclassification (F. Huang & Vasarhelyi, 2019), thereby threatening the reliability of internal controls over financial reporting. The cost of these failures is not merely operational: misclassified line items can propagate into expense accounts, distort management reports, and surface as exceptions in external audits (Abderrahman & Makarem, 2026). Improving the upstream reconciliation step therefore strengthens both operational efficiency and the assurance environment in which Accounting Information Systems (AIS) operate.

Recent advancements in LLMs and RAG offer a new paradigm for tackling this advanced fuzzy matching problem. While Robotic Process Automation (RPA) has successfully automated deterministic, rule-based tasks, it struggles when processes require flexible interpretation of unstructured and variable text (F. Huang & Vasarhelyi, 2019; Ng et al., 2021), and real-world deployment in purchasing/procurement highlights both practical potential and implementation barriers (Flechsig et al., 2022). In this broader context of procurement digitization and automation (Bode et al., 2023; Strohmer et al., 2020), we present an end-to-end system that uses LLMs for robust, scalable invoice-to-catalog matching.

However, LLM adoption in this domain remains constrained by two unresolved gaps. From an academic standpoint, prior entity resolution research has primarily evaluated matching architectures on clean, English-language e-commerce datasets, leaving open how generative LLM-based pipelines perform under the OCR noise, abbreviations, and multilingual heterogeneity that characterize real AP streams (Althaf et al., 2025; Peeters et al., 2024). From a practical standpoint, available enterprise tooling either applies deterministic RPA bots that cannot handle textual variability, or applies LLMs as opaque end-to-end matchers that satisfy neither the auditability nor the latency requirements of production AP environments.

Our work addresses both gaps. We build on two complementary lines of RAG research, query-side rewriting (Gao et al., 2023; Ma et al., 2023) and index-side document augmentation (Raina & Gales, 2024), and on recent LLM-based entity-matching architectures (Althaf et al., 2025; Peeters et al., 2024; Zhang et al., 2025). Prior approaches typically intervene on only one side of the retrieval pipeline: query rewriting transforms the incoming query, while document augmentation enriches the index. Our core contribution is a novel “augment-both-sides” strategy that combines these two interventions in a single end-to-end pipeline, with a domain-specific LLM reranker that enforces AP-specific judgment such as unit and packaging equivalence:

Catalog Augmentation: We first proactively enrich the internal corporate catalog. An LLM generates additional keywords, synonyms, and potential invoice-variants for each product, and these enhanced entries are stored as embeddings in a vector database.
Query Augmentation and Reranking: During live invoice processing, our system uses an LLM to generate multiple augmented query variants from the raw, extracted invoice line. This “query expansion” retrieves a broad set of potential candidates, which are then evaluated by a specialized LLM-based reranker to produce the final Top-3 matches.

This system is designed to function as a high-efficiency decision support system, rather than a fully autonomous “black box.” In a production AP workflow, the operational goal is to accelerate human review by presenting the correct match within the operator’s immediate field of view. Therefore, we define our primary performance metrics as Top-1 Recall (representing the potential for fully “touchless” automation) and Top-3 Recall (representing rapid, human-verified processing). In this study we focus on matchable line items (where a correct catalog mapping exists). Detecting true non-catalog items and exception-only lines is important in practice but is out of scope for this evaluation.

This system was developed and validated in a real-world enterprise setting, processing complex and heterogeneous Greek invoices. In this production environment, the system was evaluated on approximately 200 Greek invoice line items, each manually verified by an AP analyst against the corporate catalog; the system surfaced the correct catalog entry within the Top-3 in roughly 97% of cases. We report this as an anecdotal production observation rather than as a reproducible benchmark, since the underlying dataset is not publicly available. Residual failures were dominated by lines whose ground-truth catalog entry shared neither brand nor product name with the invoice description, and that therefore could not be resolved from the line text alone. Due to the commercial sensitivity of this corporate data, we cannot publish the dataset. Therefore, to ensure academic reproducibility and benchmark our method, we evaluate the core of our matching methodology on three well-established public entity resolution datasets: Abt-Buy, Amazon-Google, and Walmart-Amazon. Our results demonstrate that this Hybrid architecture achieves competitive performance against established baselines, validating its effectiveness as a scalable solution for enterprise accounting environments.

This study makes two principal contributions. Theoretically, we extend the RAG paradigm from standard one-sided augmentation to a dual-sided architecture and demonstrate that this design closes capability gaps that Hybrid retrieval alone does not close on noisy, multilingual product-matching data. Practically, we provide a deployable architecture with measured performance on both reproducible public benchmarks and a real production AP environment, together with an explicit decision support framing that aligns the system’s outputs with the auditability and traceability requirements of internal control. The remainder of the paper is structured as follows. Section 2 reviews related work on e-invoicing and RPA governance, automated invoice information extraction, entity resolution for AP, and LLM-based contextual matching. Section 3 describes the system architecture and experimental setup. Section 4 presents the empirical results across the three public benchmarks. Section 5 discusses limitations, audit-trail implications, and threats to validity, and concludes with directions for future work.

2. Related Work

This section reviews four bodies of work that inform our system design. We first situate invoice reconciliation within the broader transformation of AP through e-invoicing and RPA (Section 2.1), then survey automated invoice information-extraction techniques (Section 2.2), entity resolution and product-matching methods (Section 2.3), and finally the application of LLMs to contextual matching and judgment (Section 2.4). Before proceeding, we define two terms used throughout the paper. By robust product reconciliation in accounting, we mean the automated mapping of unstructured vendor-invoice line items to a structured internal product catalog under conditions of OCR noise, vendor-specific abbreviations, multilingual or mixed-script text, and long-tail supplier heterogeneity. This reconciliation constitutes one leg of the three-way match (purchase order, goods-receipt note, invoice) and therefore acts as a foundational internal control. By Dual-Augmentation RAG we mean a Retrieval-Augmented Generation pipeline (Lewis et al., 2020) in which the LLM enriches both the indexed corpus (catalog side) and the incoming query (invoice side) prior to retrieval and reranking, in contrast to standard RAG architectures that augment only one side.

2.1. E-Invoicing Adoption and RPA Governance

Electronic invoicing (e-invoicing) and RPA have been central to the transformation of AP within AIS. At the adoption level, empirical work using the Technology-Organization-Environment (TOE) lens shows that compatibility, complexity, relative advantage, trialability, firm size, and competitive/regulatory pressure are significant determinants of e-invoicing uptake, underscoring why many AP functions continue to operate with Hybrid (paper/PDF + EDI/XML) flows and why downstream automation must tolerate heterogeneity (Tiwari et al., 2023).

Earlier research in AIS and operations on indirect procurement further clarifies why AP matching is difficult: organizations use diverse B2B e-procurement processes and platforms, and “fit” varies by process archetype, driving the long-tailed distribution of document formats and item descriptions that AP must reconcile (J.-I. Kim & Shunk, 2004). Within auditing and controllership, RPA has been framed as a way to offload “well-defined, low-judgment” tasks so professionals can focus on higher-judgment work (F. Huang & Vasarhelyi, 2019). At the same time, newer risk work warns that RPA initiatives carry operational/controllability risks that must be explicitly rated and governed, for example via impact-uncontrollability matrices (Schlegel et al., 2024). Together, these streams explain why, despite high RPA adoption, a significant volume of residual manual reconciliation persists, necessitating more intelligent, cognitive automation approaches. A recent systematic review of emerging technologies in external auditing confirms RPA as one of six dominant innovations transforming audit practice, while also surfacing persistent gaps in governance and the integration of AI-augmented decision steps (Abderrahman & Makarem, 2026).

The governance gap is particularly acute for AP reconciliation because, unlike rule-based RPA, language-model-based reconciliation introduces a non-deterministic decision step that is harder to audit. Recent enterprise RAG guidance (Abderrahman & Makarem, 2026; Iaroshev et al., 2024) emphasizes three controls that traditional RPA does not require: comprehensive logging of retrieval and generation steps, per-task explainability linking each output to the documents that produced it, and continuous adversarial testing against prompt injection. Our architecture is designed to be compatible with these controls by exposing inspectable intermediate artifacts at every stage (the LLM-generated query variants, the deduplicated candidate shortlist, and the reranker’s Top-3 selections), so that the resulting decision trail meets the same audit-evidence standards that internal auditors apply to deterministic AP workflows.

2.2. Automated Invoice Processing and Information Extraction (IE)

Invoice IE has progressed from rule/layout heuristics to multimodal deep models and industrial Intelligent Document Processing (IDP) services. Early template-free systems (Palm et al., 2017) demonstrated generalization beyond fixed forms; subsequent Visual Document Understanding (VDU) architectures fuse text, layout, and vision to improve robustness across varied layouts. Representative systems include OCRMiner (text and layout features) (Ha & Horák, 2022), semantic graph-based methods (S. Luo & Yu, 2024), entity-relevancy models like MatchVIE (Tang et al., 2021), and pretrain-then-finetune families such as LayoutLM/LayoutLMv3 (Y. Huang et al., 2022; Xu et al., 2020). OCR-free transformers like Document Understanding Transformer (DONUT) (G. Kim et al., 2022) further show that strong layout with vision priors can recover structured fields without external OCR. Most recently, instruction-tuned document LLMs (C. Luo et al., 2024) and general-purpose Vision Language Models (VLMs) (e.g., GPT-4V, LLaVA) exhibit strong zero/few-shot capabilities and can be prompted for structured outputs (e.g., JSON) directly from page images (Liu et al., 2023; OpenAI et al., 2024).

Public evaluations such as DocILE highlight remaining pain points (abbreviations, noisy OCR, vendor-specific phrasing, and layout edge cases), especially at the line-item level (Šimsa et al., 2023). Field reports echo this: in production, the “long-tail” supplier problem (rare formats, inconsistent layout, infrequent vendors) limits pure template/RPA approaches, while transformer-based models generalize better yet still leave residual errors that must be handled downstream (Krieger et al., 2023). Recent evaluations quantify these residual errors directly: on the SROIE invoice benchmark, an LLM-prompted extraction pipeline combined with an established OCR backend achieves 91.5% field-level accuracy (Chen et al., 2025), leaving a non-trivial fraction of line-level errors that propagates into any downstream reconciliation step. A complementary stream uses structured e-invoices for downstream accounting automation (e.g., VAT/account-code classification) to reduce manual bookkeeping effort (Bardelli et al., 2020).

Commercial IDP services (Azure Document Intelligence, Amazon Textract, Google Document AI) and open-source toolchains (e.g., LlamaParse, Docling) now expose pretrained invoice parsers via APIs, offering high extraction quality on headers and line items. As both DocILE results and industry practice suggest, outputs often remain abbreviated or partially erroneous, motivating post-extraction repair. Parallel evaluations of generative LLMs on financial-table extraction tasks confirm both the promise and the residual accuracy gaps of these models in the broader financial-document setting (Balsiger et al., 2024). For example, retrieval-augmented pipelines that pair DONUT with RAG have been used to correct systematically mis-extracted addresses in parcel invoices, illustrating how external knowledge can reliably fix recurrent IE errors (Jeong et al., 2025).

2.3. Entity Resolution (ER) and Product Matching for AP

The core of our pipeline addresses the ER problem in AP, specifically matching invoice line items to items in the purchasing catalog. In procure-to-pay (P2P) workflows, this step is a well-known bottleneck in two- and three-way matching, because human-readable descriptions on supplier invoices rarely match catalog records exactly. The issue is particularly pronounced in indirect procurement, where organizations rely on multiple, heterogeneous B2B e-procurement systems. Historically, baseline approaches to this matching problem have relied on lexical and bag-of-words similarity methods, ranging from edit-distance-based string metrics (Cohen et al., 2003; Wagner & Fischer, 1974) to ranking models such as TF-IDF (Salton et al., 1975) and BM25 (Robertson et al., 1995). While foundational, these techniques are brittle in the presence of vendor-specific abbreviations, missing or mis-ordered attributes, multilingual text, and the OCR noise that is typical of scanned invoices in AP.
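To make this brittleness concrete, the following minimal sketch implements the Wagner-Fischer edit distance and a length-normalized similarity. The function names and the normalization are illustrative assumptions rather than part of the evaluated baselines, and the two example strings are chosen only to resemble an abbreviated invoice line and its full catalog description.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the Wagner-Fischer dynamic programme (row by row)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; the normalization is an assumption."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Illustrative strings only: an abbreviated invoice line and a catalog entry.
invoice_line = "SS hexblt 10mm"
catalog_entry = "M6 Stainless Steel Hex Bolt, 10mm, 100-pack"
score = similarity(invoice_line, catalog_entry)
```

Because the catalog entry is 29 characters longer than the invoice line, at least 29 edits are needed, so the normalized score cannot exceed roughly 0.33 even though a human reader would treat the two strings as plausible matches.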

In response, supervised deep-learning approaches to ER have emerged. Early neural models such as DeepMatcher and Ditto demonstrated substantial gains over lexical baselines by learning contextual representations directly from record pairs (Li et al., 2020; Mudgal et al., 2018). Subsequent work has refined these approaches by analyzing the impact of different pre-trained embeddings for both blocking and matching (Zeakis et al., 2023), investigating alternative training strategies such as supervised contrastive learning (Peeters & Bizer, 2022) and proposing new neural architectures for product matching (Mistiawan & Suhartono, 2024). A complementary pillar is candidate generation at catalog scale: semantic search and Dense retrieval are increasingly used to reduce the match space before classification or reranking and are now widely adopted in large e-commerce catalogs (Nigam et al., 2019).

More recently, generative LLMs have been evaluated as direct entity matchers, with prompted GPT-class models reported to outperform fine-tuned Pre-trained Language Model (PLM) baselines on standard ER benchmarks (Peeters et al., 2024). Building on this paradigm, recent work has begun to combine LLMs with multi-agent coordination for entity resolution (Althaf et al., 2025) and with Retrieval-Augmented Generation for domain-specific named-entity matching (Zhang et al., 2025).

Procurement and AP contexts introduce domain-specific constraints (unit and packaging variants, legacy item descriptions, multilingual text, and supplier-specific shorthand) as well as organizational requirements such as auditability and exception handling. Prior B2B tendering research on “semantic product matching” addresses product heterogeneity at scale under different governance and data-sharing arrangements and informs our treatment of cross-source variation (Mehrbod et al., 2018). Within AP specifically, two recurring themes in recent work are (1) the long tail of suppliers and invoice formats, and (2) interactive learning from users: systems that learn online from practitioner feedback to improve invoice line-item matching (Maurya et al., 2020) and multi-stage AI architectures that combine robust candidate retrieval with human-in-the-loop disambiguation and traceability for invoice exceptions (Tater et al., 2022).

These strands motivate our candidate-generation plus LLM reranking design. We combine scalable semantic retrieval to cope with catalog size, with an instruction-tuned LLM reranker that acts as a policy enforcement layer. This layer applies AP-specific logic, such as normalizing units of measure and resolving packaging equivalences, ensuring that technical matches also make sense from a procurement and accounting perspective. In addition, we expose feedback mechanisms so that user corrections on long-tail cases are captured and can be exploited for continual model improvement.

Table 1 summarizes the relative strengths and weaknesses of the main matching paradigms relevant to AP reconciliation.

Two patterns emerge from Table 1. First, no classical retrieval paradigm dominates across all dimensions: methods optimized for exact-token signal (edit-distance, BM25) preserve identifiers and remain fully auditable but cannot bridge synonym, abbreviation, or cross-lingual gaps without external resources. Dense retrieval inverts these strengths, gaining semantic robustness at the cost of identifier integrity and interpretability. Hybrid retrieval mitigates but does not close these gaps, particularly when a single invoice line simultaneously carries OCR noise, vendor-specific abbreviations, and mixed-script tokens, a routine occurrence rather than an edge case in vendor invoices.

Second, the dimensions where every non-LLM method fails—abbreviation expansion, cross-lingual code-switching, and contextual disambiguation between near-duplicate SKUs—are exactly the dimensions where the P2P process incurs its highest manual-review costs. The Dual-Augmentation RAG architecture is designed to close these specific gaps by embedding contextual judgment at the index, query, and reranking stages. Its advantages, however, are conditional: rewrite validation, controlled abbreviation expansion, multilingual support in the retrieval stack, SKU locking, and comprehensive logging must each be engineered into the pipeline rather than assumed from LLM prompting alone. Where these safeguards are absent, the failure modes listed in the final row of Table 1—hallucinated expansions, code corruption, reranker bias—become operationally significant and may degrade rather than improve audit-trail quality.

2.4. Applying LLMs for Contextual Matching and Judgment

The emergence of LLMs offers a new paradigm for solving the complex entity resolution (ER) challenges outlined in Section 2.3. While some research explores using LLMs to perform zero-shot matching directly (Peeters et al., 2024; Wang et al., 2024), such “single-step” approaches can be slow, costly, and difficult to audit when applied at the scale required for enterprise AP. Closer to our setting, recent work has begun to apply RAG to financial-document understanding tasks such as report question answering (Dang et al., 2026; Iaroshev et al., 2024), though P2P reconciliation remains an underexplored application.

Instead, our system integrates LLMs into a multi-stage process, often called a RAG pipeline (Lewis et al., 2020), that is more scalable, controllable, and traceable. This design moves LLMs from being a single “black box” oracle to a specialized component that enhances traditional retrieval methods. Our “augment-both-sides” strategy uses LLMs to mimic specific forms of expert AP judgment at three distinct stages:

Proactive Catalog Enrichment: First, we use an LLM to “read” our internal product catalog and proactively generate realistic synonyms, common abbreviations, and alternative descriptions for each item. This is an automated form of master data enhancement. For example, “M6 Stainless Steel Hex Bolt, 10mm, 100-pack” might be enriched with terms like “SS M6 bolt,” “hex 10mm,” or “box of 100 bolts.” This enriched data is stored in a high-speed vector database, allowing our system to anticipate the messy, inconsistent language suppliers use before their invoices even arrive. This concept builds on RAG research into document-side augmentation (James et al., 2025; Raina & Gales, 2024) but applies it as a practical control for data quality in a procurement context.
Interpreting Noisy Invoice Queries: When a noisy line item like “SS hexblt 10mm” is extracted from an invoice, it often fails to match the catalog directly. Our system uses an LLM to rewrite this ambiguous query into multiple, clearer variants (e.g., “stainless steel hex bolt 10mm,” “M10 hex bolt stainless”). This step mimics an AP clerk’s “best guess” at what the supplier meant to say, translating vendor-specific shorthand into our internal terminology. This query-expansion step builds on prior work in LLM-driven query rewriting (Ma et al., 2023) and hypothetical document expansion (Gao et al., 2023), and is critical for handling OCR errors and vendor-specific phrasing, ensuring good candidates are found even when the initial data is poor.
Applying Contextual Judgment (Reranking): The first two steps retrieve a list of potential matches from the catalog. This list is then passed to a final LLM-based reranker, which acts like a senior AP professional performing a final check. It compares the original invoice line to the top candidates and re-orders them based on contextual cues. This stage is crucial for resolving ambiguities that lexical methods miss, such as a “box” versus an “each” unit of measure or packaging equivalences (e.g., “10-pack” vs. “10 units”). This LLM-based reranking (Adeyemi et al., 2023) applies nuanced business logic, improving the quality of the final Top-3 matches presented to the user.

Combined, these stages provide a scalable and robust solution that addresses the core AP challenges of noisy data and vendor heterogeneity by embedding contextual judgment at key points, all while maintaining the high throughput and traceability required for modern AIS.

3. Materials and Methods

3.1. Research Design

Our research proposes an “augment-both-sides” RAG architecture. Unlike traditional rule-based matching engines used in legacy ERP systems, this approach uses LLMs to bridge the semantic gap between unstructured invoice data and structured master data.

The system design consists of two distinct phases: (1) Offline Catalog Indexing, where the corporate master data is prepared and enriched; and (2) Online Runtime Processing, where incoming invoice line items are resolved against the catalog. Figure 1 summarizes the proposed “augment-both-sides” architecture across both phases.

3.2. System Architecture and Implementation

3.2.1. Phase 1: Catalog Augmentation and Vector Indexing

To address the vocabulary mismatch problem, we implemented a proactive enrichment strategy during the indexing phase. We do not simply index the raw product title. Instead, we construct a composite “chunk text” for every catalog item. This composite text concatenates the product description with an Enriched Metadata field: a set of synthetic keywords, synonyms, common abbreviations, and “invoice-style” variations generated by an LLM (gpt-5-mini). This ensures that the vector representation of the product anticipates the variable language likely to appear on vendor invoices.
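The construction of the composite chunk text can be sketched as follows. This is a minimal illustration under stated assumptions: the `CatalogItem` fields, the “Enriched metadata:” label, the comma separator, and the case-insensitive deduplication are all hypothetical choices, and the LLM call that produces the enriched terms is deliberately omitted (the terms are passed in as if already generated).

```python
from dataclasses import dataclass, field


@dataclass
class CatalogItem:
    # Hypothetical record layout; the actual catalog schema is not specified.
    sku: str
    description: str
    enriched_terms: list[str] = field(default_factory=list)  # LLM output, supplied externally


def build_chunk_text(item: CatalogItem) -> str:
    """Concatenate the description with deduplicated enrichment terms."""
    seen: set[str] = set()
    terms: list[str] = []
    for term in item.enriched_terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:  # drop blanks and case-insensitive repeats
            seen.add(key)
            terms.append(cleaned)
    if not terms:
        return item.description
    # The label and separator are assumptions made for this sketch.
    return f"{item.description}\nEnriched metadata: {', '.join(terms)}"
```

The resulting string, rather than the raw product title, would then be the text passed to the embedding and sparse-indexing steps.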

The foundation of our system is a robust vector database representing the corporate product catalog. Unlike traditional relational databases used in ERPs, which rely on exact keyword matching, our system also uses Vector Space Models to capture semantic similarities, allowing relevant results even when wording differs.

We implemented a “Hybrid Search” architecture using Qdrant, an open-source vector database. This architecture combines two distinct retrieval mechanisms to maximize recall:

Dense Retrieval: Each enriched catalog entry is encoded with OpenAI’s text-embedding-3-large model, with the output dimensionality reduced from the default 3072 to 1024 for storage and latency efficiency. Retrieval is performed by cosine similarity in this 1024-dimensional space. This captures semantic context (e.g., understanding that “portable PC” and “laptop” are related).
Sparse Retrieval (BM25): In parallel, each catalog entry is indexed with a BM25 representation, produced by FastEmbed’s Qdrant/bm25 model, which preserves exact-keyword signal (e.g., SKU codes such as “X1-Carbon” or “M6 × 10”) that embeddings tend to smooth over.
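The text above does not state how the dense and sparse rankings are merged into a single result list. Purely as an illustration, the sketch below uses reciprocal rank fusion (RRF), one common choice for hybrid search; the function name, the constant k = 60, and the tie-breaking rule are assumptions and should not be read as the configuration used in this system.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked SKU lists from the two retrievers.
dense_hits = ["A", "B", "C"]
sparse_hits = ["B", "D", "A"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

In this toy input, items found by both retrievers ("A" and "B") rise above items found by only one, which is the behavior a hybrid design relies on when exact codes and semantic cues disagree.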

3.2.2. Phase 2: Real-Time Query Augmentation and Reranking

As shown in Figure 1, when an invoice line item is extracted at runtime, the system does not query the database immediately; instead, it employs a multi-step reconciliation process designed to mimic and augment the reasoning of a human accounts payable clerk. The process begins with the query side of the “augment-both-sides” strategy, in which an LLM acting as a “Query Synthesizer” analyzes the raw invoice text.

This step is particularly important in our operational context, where invoices mix Greek and English and raw descriptions are frequently noisy, abbreviated, or linguistically mixed. Consequently, the model (gpt-5-mini) generates five diverse search query variants rather than relying on a single query string that is susceptible to OCR errors. This approach effectively expands the search scope to account for cross-lingual phrasings, spelling errors, or formatting conventions unique to the vendor, ensuring that the semantic intent is preserved even when the syntax varies.

Following this augmentation, the system executes the five generated queries in parallel against the Qdrant vector store. For each query variant, the system retrieves the top seven matches using Hybrid search. These results are then aggregated and deduplicated to eliminate redundancy, producing a consolidated “shortlist” of unique candidate matches (up to a theoretical maximum of 35 items). Achieving high recall at this intermediate stage is crucial to ensuring the correct item remains available for the final validation phase. To conclude the process, this unique shortlist is passed to a second LLM module, the “Evaluator,” which acts as an automated internal control. This model functions as a logical reranker, analyzing the candidates against the original invoice line based on strict business logic, such as verifying brand consistency, matching product codes, and checking package quantities, to filter out noise and return the definitive Top-3 matches. The full system prompts and output schemas for all three LLM components (enricher, query synthesizer, and reranker) are reproduced verbatim in Appendix A.
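The fan-out, aggregation, and deduplication step described above can be sketched as follows. The `search` callable is a hypothetical stand-in for the Hybrid search call against the vector store, and both the query synthesizer and the reranker are omitted; only the shortlist construction is shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical signature: (query text, limit) -> ranked list of SKU identifiers.
SearchFn = Callable[[str, int], list[str]]


def build_shortlist(variants: list[str], search: SearchFn, per_query: int = 7) -> list[str]:
    """Run every query variant in parallel and merge the hits without duplicates."""
    with ThreadPoolExecutor(max_workers=len(variants) or 1) as pool:
        # pool.map preserves the order of the variants in its results.
        results = list(pool.map(lambda v: search(v, per_query), variants))
    shortlist: list[str] = []
    seen: set[str] = set()
    for hits in results:
        for sku in hits[:per_query]:
            if sku not in seen:  # keep the first occurrence only
                seen.add(sku)
                shortlist.append(sku)
    return shortlist
```

With five variants and seven hits per variant, the shortlist holds at most 35 unique candidates, and it shrinks toward seven as the variants converge on the same catalog entries.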

3.3. Data Preparation and Experimental Setup

To evaluate the robustness of our system across different domains and degrees of data noise, we used three standard entity resolution benchmark datasets (see Note 2): Abt-Buy, Amazon-Google, and Walmart-Amazon (Köpcke et al., 2010; Mudgal et al., 2018).

We structured the data to simulate a realistic corporate procurement scenario:

The Corporate Catalog (Master Data): The “right-side” dataset serves as the authorized product master file found in an ERP system.
The Invoice Stream (Query Set): The “left-side” dataset was filtered to create a stream of incoming “invoice line items” that require reconciliation against the catalog.

A critical contribution of our methodology is the domain-specific preprocessing applied to these datasets to simulate the information available in a real-world ERP catalog. Since raw product data is often incomplete, we implemented logic to construct informative “Product Descriptions” for the catalog side, before Phase 1 of our system. Table 2 summarizes the dataset characteristics and the preprocessing logic used to construct the catalog-side descriptions for each benchmark. In all experiments, the full right-side dataset was treated as the corporate catalog (‘Catalog Size’), while the left-side dataset was filtered to the linked/matched records used as the incoming invoice stream (‘Evaluated Queries’). Thus, each evaluated query is matched against the entire catalog. The resulting retrieval-ready inputs, together with the per-rank baseline and proposed-system outputs underlying Table 3 and Figure 2, are provided as Supplementary Materials (Dataset S1).

3.4. Evaluation Metrics

Consistent with the design of decision support systems (DSS) in accounting, we focus on Top-k Recall for positive matches only. In a production AP workflow, correctly classifying negative pairs (non-matches) adds no value; the operational goal is to retrieve the correct General Ledger (GL) code or SKU.

Top-1 Recall: Proxy for “Touchless Automation.” This measures the percentage of invoices where the system’s first choice is correct, allowing for automatic posting without human review.
Top-3 Recall: Proxy for “Decision Support Efficiency.” This measures how often the correct match appears in the top three suggestions. If the correct code is visible immediately, the AP clerk can validate it with a single click (taking seconds) rather than searching the catalog manually (taking minutes).
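For concreteness, Top-k Recall on positive pairs can be computed as in the following sketch. It assumes one ground-truth catalog ID per query, as in our single-target setup; the function and argument names are illustrative.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int) -> float:
    """Fraction of queries whose ground-truth catalog id appears in the top-k candidates."""
    if not truth:
        raise ValueError("no evaluated queries")
    hits = sum(1 for query, gold in truth.items() if gold in ranked.get(query, [])[:k])
    return hits / len(truth)
```

Calling `recall_at_k(ranked, truth, 1)` gives the Top-1 figure and `recall_at_k(ranked, truth, 3)` the Top-3 figure for the same ranked outputs.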

3.5. Methodological Choices

Vector store. We chose Qdrant because it supports Hybrid Dense–Sparse retrieval within a unified vector-search framework, reducing the engineering complexity of maintaining separate semantic and lexical retrieval components. Its open-source and self-hostable design also aligns with the data-residency requirements typical of corporate accounts-payable environments. In addition, keeping Dense retrieval and BM25-style Sparse retrieval within the same retrieval stack simplifies logging and auditability, because both semantic and lexical retrieval evidence can be inspected within one pipeline.
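To illustrate what fusing a Dense and a Sparse ranking involves, the sketch below implements reciprocal rank fusion (RRF), one common fusion rule. This is an illustrative assumption: the text above does not state which fusion rule the deployed Hybrid search uses, and the constant `k = 60` is the conventional default from the RRF literature rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(dense: List[str], sparse: List[str], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Items ranked well by both channels rise to the top, while an item found by only one channel can still enter the fused list, which is the behavior that makes a Hybrid baseline attractive for noisy descriptions.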

Embedding model. We selected OpenAI’s text-embedding-3-large model because it is a production-grade embedding model designed for retrieval-oriented applications, supports multilingual semantic matching in our deployment setting, and allows dimensionality reduction from 3072 to 1024, enabling a balance between retrieval quality, storage cost, and latency.

Reasoning model. The three LLM-based components, catalog enricher, Query Synthesizer, and reranker, use gpt-5-mini rather than a larger model. This choice reflects the operational constraint that invoices are processed in batches of hundreds or thousands of lines per day. The smaller model provides sufficient structured rewriting and reranking capability while keeping latency and cost compatible with production use. In our configuration, the average end-to-end cost is approximately $0.00067 per invoice line (Section 5.4).

Statistical procedure. We report bootstrap 95% confidence intervals based on 1000 resamples rather than relying only on point estimates. This non-parametric procedure is appropriate because Recall@k is a bounded metric and because the public benchmarks contain repeated or near-duplicate product descriptions that can affect apparent matching errors. We therefore use the confidence intervals as an uncertainty estimate around the reported Recall@k values, while separately reporting duplicate-aware results to distinguish true retrieval failures from benchmark labeling artifacts.
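A minimal sketch of such a bootstrap interval is given below. It assumes the percentile method applied to per-query 0/1 hit flags; the text above does not specify the interval construction, so the percentile choice, the fixed seed, and the function name are assumptions for illustration.

```python
import random
from typing import List, Tuple


def bootstrap_ci(hits: List[int], n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 per-query hit flags."""
    n = len(hits)
    if n == 0:
        raise ValueError("no evaluated queries")
    rng = random.Random(seed)
    # Resample queries with replacement and recompute Recall@k each time
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Here `hits` holds one flag per evaluated query (1 if the correct item is in the top-k, 0 otherwise), so the same routine serves Recall@1 and Recall@3.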

4. Results

4.1. Accuracy and Robustness

The experimental results, summarized numerically in Table 3 and graphically in Figure 2, demonstrate the system’s effectiveness across varying degrees of data noise and domain complexity. To isolate the contribution of the retrieval layer versus the full generative pipeline, we first evaluate candidate generation without query expansion or LLM reranking. Using each invoice line as a single search query, we compare Sparse retrieval (BM25), Dense retrieval (embeddings), and Hybrid retrieval (Dense + Sparse fusion) on both the raw and the LLM-enriched catalog. We report Recall@k on positive pairs, where a hit occurs if the correct catalog item appears in the top-k retrieved candidates.

In a production AP context, the consequences of a false positive (a confident but wrong Top-1 suggestion accepted by an operator) differ qualitatively from those of a false negative (a missed match routed to manual review); the downstream implications of this cost asymmetry are discussed in Section 5.2. We therefore complement Top-k Recall, which in our single-target setup coincides with Top-k accuracy since each query has exactly one ground-truth match, with a False Positive Rate (FPR). We operationalize the FPR for this single-target retrieval setting as the fraction of queries for which the system’s Top-1 suggestion is not a correct catalog match under duplicate-aware scoring, i.e., the rate at which a confidently displayed Top-1 candidate would, if accepted by an AP clerk, represent an incorrect posting. All reported metrics for the proposed system include bootstrap 95% confidence intervals (1000 resamples) to establish statistical robustness. Under this definition, the operational Top-1 error rate is 5.06% on Abt-Buy, 15.90% on Walmart-Amazon, and 26.22% on Amazon-Google, reflecting the inherent difficulty of fine-grained product disambiguation; this validates the Top-3 decision-support framing of Section 3, since at the Top-3 level the duplicate-aware Recall reaches 94.34–98.93% across the three benchmarks.
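The operational definition above can be written compactly as follows. The notation is ours and introduced only for illustration: $Q$ is the set of evaluated queries, $\hat{c}_1(q)$ is the Top-1 suggestion for query $q$, and $D(q)$ is the set of catalog rows that are text-identical to the ground-truth match of $q$ (including that match itself).

```latex
\mathrm{FPR}
  \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\!\left[\hat{c}_1(q) \notin D(q)\right]
  \;=\; 1 - \mathrm{Recall@1}_{\text{duplicate-aware}}
```

For example, the duplicate-aware Recall@1 of 94.94% on Abt-Buy corresponds to an FPR of 100% − 94.94% = 5.06%.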

4.2. Analysis of Retrieval Baselines

The results in Table 3 provide a nuanced view of how different retrieval strategies handle diverse data qualities.

Across all datasets, Dense retrieval (embeddings) consistently outperformed Sparse retrieval (BM25) in terms of Recall@1 and Recall@3, indicating that semantic matching is better suited than keyword matching for resolving vendor descriptions. Consistent with the literature, Hybrid retrieval (combining Dense and Sparse) outperforms pure Sparse (BM25) retrieval on every dataset. In most cases, it also outperforms pure Dense retrieval, serving as a robust baseline. A notable exception occurs in the Abt-Buy dataset. Here, pure Dense retrieval achieved an R@1 of 86.67% (Raw), surpassing the Hybrid approach (80.64%). The anomaly is plausibly tied to the nature of the Abt-Buy data, which consists of high-quality, descriptive product names. A possible explanation is that, in such “clean” environments, the semantic signal already carried by the embedding model is sufficient on its own, and the addition of a Sparse channel (BM25) shifts ranking weight toward exact-token overlap on common attribute words rather than the discriminative product-name span, an effect we did not observe on the noisier Walmart-Amazon and Amazon-Google catalogs.

The impact of Phase 1 (catalog enrichment) varies by domain. For Amazon-Google and Abt-Buy, enrichment primarily aids in widening the net, improving Recall@3. However, for the Walmart-Amazon dataset, enrichment did not improve accuracy; in fact, Hybrid R@1 dropped from 80.67% to 75.68%. This suggests that for highly heterogeneous, general retail datasets, LLM-generated keywords may introduce “semantic drift” or hallucinations that obscure the correct match, whereas they are highly effective in more specialized domains like electronics or software.

4.3. Performance of the Proposed System

Despite the variations in the baseline and enrichment layers, the full proposed system (which adds the Phase 2 Query Synthesizer and reranker) consistently delivers the highest Top-1 performance on every benchmark and the highest Top-3 performance under duplicate-aware scoring.

Correction of Retrieval Errors: Even where the retrieval layer struggled (e.g., the drop in Walmart-Amazon enrichment), the proposed system recovered significant ground, achieving an R@1 of 83.47% (standard) and 84.10% (duplicate-aware).
Top-1 accuracy: The system demonstrates its capability for automation, particularly in Abt-Buy, where it achieved 93.97% R@1 (standard) and 94.94% (duplicate-aware), outperforming the best baseline (Dense at 86.67%).
Top-3 accuracy: At Top-3, the picture is uniformly positive under duplicate-aware scoring: the proposed system improves over the best baseline by +1.71 points on Amazon-Google (94.34% vs. 92.63%), +2.92 points on Abt-Buy (98.93% vs. 96.01% Dense Raw), and +1.35 points on Walmart-Amazon (97.92% vs. 96.57% Hybrid Raw). Under standard scoring, the LLM reranker improves performance on Abt-Buy (+1.95 points) and Walmart-Amazon (+0.73 points), and marginally underperforms on Amazon-Google (91.60% vs. 92.63% Dense Enriched, a 1.03-point gap that is within the bootstrap confidence interval). This single-benchmark gap aligns with the labeling-artifact analysis in Section 5.4: on Amazon-Google, which contains the densest cluster of duplicate text rows (103 duplicate clusters spanning 300 catalog rows), a strict literal scorer penalizes the reranker when it selects a text-identical row that happens to have been assigned a different catalog ID. Under duplicate-aware scoring, the gap reverses to a +1.71 improvement.

This “generative lift” confirms that while retrieval models are effective at narrowing the search space, the LLM reranker is essential for the “last mile” of disambiguation. The reranker filters the noise introduced by enrichment, leveraging the generated synonyms for recall while applying strict logic for precision. A qualitative inspection of the residual Top-3 misses across all three benchmarks shows that the dominant causes, text-identical catalog rows assigned to different IDs, and near-duplicate SKUs distinguished only by attributes (color, capacity, licensing tier) absent from the invoice line, are properties of the benchmark catalogs themselves and affect every retrieval method comparably. We quantify this effect explicitly in Section 5.4.

5. Discussion

5.1. Synthesis of Findings

This study set out to address a persistent friction in the P2P cycle: the reconciliation of unstructured vendor invoice data against structured internal product catalogs. By designing and evaluating an “augment-both-sides” system, we demonstrated that LLMs can bridge the semantic gap that traditionally causes rule-based systems to fail.

Our results across three diverse datasets indicate that the proposed system functions as a high-reliability DSS. While “touchless” automation capability (Top-1 Recall) varied by domain, ranging from 73.78% in the highly ambiguous Amazon-Google dataset (software and technology products) to 94.94% in the clearer Abt-Buy electronics domain, the system consistently achieved a duplicate-aware Top-3 Recall between 94.34% and 98.93% across all three benchmarks (standard Top-3 Recall between 91.60% and 97.96%).

This finding is significant for AIS design: it suggests that even when the system cannot autonomously execute straight-through processing with sufficient confidence, it can successfully narrow the search space to a small set of options. This removes most of the search cost associated with manual reconciliation, transforming a complex retrieval task into a rapid validation task.

5.2. Implications for Accounting Practice and Internal Controls

Supporting Judgment and Decision-Making in Exception Handling: The high Top-3 Recall enables a shift in AP processing from manual catalog search to a constrained verification task. By consistently presenting the correct SKU within the immediate view of the operator, the system reduces the search component of the task. Prior work in intelligent decision aids (F. Huang & Vasarhelyi, 2019) argues that this reduction allows professionals to reallocate attention to non-standard cases. A formal behavioral evaluation of this reallocation, for instance, whether verification accuracy or speed improves when the correct item is surfaced in the Top-3, is outside the scope of the present study and is left to future work.

Enhancing Internal Controls: From an audit perspective, the “augment-both-sides” strategy acts as a robust preventative control. By standardizing the matching process through vector embeddings rather than relying on the subjective keyword searches of individual clerks, the system is intended to reduce the risk of misclassification (e.g., coding a capital asset as an expense). Furthermore, the retrieval-and-rerank architecture exposes natural points at which auditable evidence can be persisted: the five LLM-generated query variants, the deduplicated shortlist of up to 35 candidates retrieved by Hybrid search, and the final top three selections. Together these provide a structured record of how each invoice line was reconciled rather than a single opaque decision. In the present implementation only the top three selections are logged; persisting the upstream artifacts is a straightforward extension that we discuss as future work.

Cost Asymmetry and the Top-3 Framing: In a real AP workflow, the operational costs of a false negative and a false positive are asymmetric. A missed match simply routes the invoice to manual review and costs an AP clerk a few minutes; a confident but wrong Top-1 suggestion, if accepted automatically, can break the three-way match, drive an incorrect financial posting, and surface as an exception during internal or external audits. The FPR figures reported in Section 4.1, ranging from 5.06% on Abt-Buy to 26.22% on Amazon-Google, directly motivate the Top-3 framing rather than Top-1 automation; at the Top-3 level the duplicate-aware Recall reaches 94.34–98.93%, with an operational Top-3 error rate of 1.07–5.66%, meaning the architecture is best deployed as a decision-support filter that surfaces the right candidate within the operator’s field of view, rather than as a fully autonomous matcher. This positioning aligns the system’s output with the design philosophy of internal control over financial reporting, where decisions affecting the general ledger require human attestation rather than autonomous machine commitment.

Internal-control implications of the residual FPR: The FPR values reported in Section 4.1 (5.06–26.22% under duplicate-aware Top-1 scoring) carry direct implications for internal control over financial reporting. At the Top-1 level these error rates are not acceptable for autonomous posting: a confidently displayed but incorrect SKU, if accepted automatically, can break the three-way match, propagate to an incorrect general-ledger account, and surface as an audit exception affecting the completeness and accuracy assertions of the procurement cycle. The system should therefore be deployed as a decision-support filter for an AP clerk’s verification step, not as an autonomous posting agent, unless one of the following compensating controls is added: (i) a confidence threshold above which Top-1 suggestions are auto-posted and below which they route to manual review; (ii) a second-level reviewer for transactions above a materiality threshold; or (iii) a post-posting reconciliation control that samples auto-posted transactions for accuracy testing. In the Top-3 framing, the residual error rates fall to 1.07–5.66% under duplicate-aware scoring, a range at which the system’s output meets the same evidentiary standard that human-generated catalog look-ups produce, while preserving full auditability of the candidate set.

5.3. Limitations and Future Research

While the results are promising, this study is subject to limitations that frame our future research directions:

Linguistic Scope and Mixed-Script Complexity: A primary limitation of this study is the linguistic homogeneity of the standard benchmarks (Abt-Buy, Amazon-Google, Walmart-Amazon), which are exclusively English. This contrasts with our target operational environment, which is characterized by high linguistic entropy. In our real-world use case, data is not simply “translated”; it involves complex code-switching, where invoices and catalog entries frequently mix Greek and English terms within the same line item (e.g., an English brand name paired with a Greek functional description, or mixed-script abbreviations). While the “Query Synthesizer” demonstrated robust handling of synonyms in the benchmarks, its primary value lies in its ability to normalize this Hybrid Greek–English input, a capability not fully quantified by the current English-only datasets.
Enrichment Trade-offs: Our experiments revealed that the “augment-both-sides” strategy requires careful tuning. As observed in the Walmart-Amazon dataset, LLM-based catalog enrichment does not universally improve performance and can introduce noise (reducing R@1) in highly heterogeneous retail datasets. Future work should investigate governance mechanisms, such as confidence thresholds or “human-in-the-loop” review stages, to validate generated synonyms before they enter the vector index.
End-to-End Latency and Behavioral Validation: The per-line latency reported in Section 5.4 was measured on individual benchmark invocations and establishes technical feasibility. However, both a formal evaluation under production load (concurrent invoice streams, API rate-limit saturation, and end-to-end queue throughput) and a controlled behavioral study of AP-clerk verification accuracy and speed when supported by the system were out of scope for this study. The productivity claims associated with the decision-support framing remain to be empirically validated in a controlled setting.
Multimodal Integration: Our evaluation focused on textual line items. Emerging research in multimodal transformers suggests that incorporating visual layout features (e.g., the spatial coordinates of text on the invoice) could further enhance extraction and matching accuracy, particularly for invoices where the visual structure implies the category.

5.4. Benchmark Artifacts, Production Validation, and Operational Feasibility

Benchmark labeling artifacts: Across the three public benchmarks, manual inspection of the residual rank-3 failures shows that the dominant single cause is a labeling artifact rather than a retrieval limitation: the Abt-Buy, Amazon-Google and Walmart-Amazon catalogs contain multiple text-identical rows assigned to different IDs (most strikingly on Amazon-Google, with 103 duplicate text-clusters spanning 300 catalog rows). A second, smaller class of residual failures consists of fine-grained variant queries where the discriminating attribute (color, storage capacity, licensing tier) is absent from the invoice text and therefore unresolvable from the line item alone. These artifacts penalize every retrieval method in our comparison comparably, including the BM25, Dense and Hybrid baselines. To quantify this effect without modifying the underlying public datasets, we computed a duplicate-aware metric (counting a hit when any text-identical catalog row is retrieved). As shown in Table 3, this reveals a “labeling gap” of up to 2.74 percentage points (in Amazon-Google R@3, 94.34% duplicate-aware vs. 91.60% standard). Under duplicate-aware evaluation, the system’s FPR drops correspondingly, confirming that a large portion of apparent failures are simply valid retrievals of text-identical duplicate SKUs. We left the datasets unmodified rather than de-duplicating the catalogs so that our numbers remain directly comparable to prior work that uses the same splits.
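The duplicate-aware metric can be sketched as follows. The sketch assumes that “text-identical” means exact string equality of the catalog-row text; the function name and the `catalog_text` mapping from catalog ID to row text are illustrative assumptions.

```python
from typing import Dict, List


def duplicate_aware_hit(top_k_ids: List[str], gold_id: str,
                        catalog_text: Dict[str, str]) -> bool:
    """True if any retrieved row has exactly the same text as the ground-truth row."""
    gold_text = catalog_text[gold_id]
    # Assumption: text identity is plain string equality, with no normalization
    return any(catalog_text.get(candidate) == gold_text for candidate in top_k_ids)
```

Under standard scoring, a query counts as a hit only if `gold_id` itself is in `top_k_ids`; the duplicate-aware variant additionally accepts any row whose text matches, which is what separates labeling artifacts from true retrieval failures.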

Production validation: In the operational deployment on a corpus of real Greek-language vendor invoices reconciled against a production corporate catalog, where neither labeling artifact occurs, the system reached approximately 97% Top-3 reconciliation accuracy on a manually verified evaluation of approximately 200 invoice line items. We report this as an anecdotal production observation rather than as reproducible benchmark evidence, since the underlying dataset cannot be released due to commercial sensitivity. The English benchmarks should therefore be read as a reproducibility-oriented lower bound on the architecture’s behavior, while the Greek deployment provides a corroborating production signal.

Operational feasibility and cost: We instrumented our system to record end-to-end latency and token usage to demonstrate practical feasibility. Across the three benchmarks, the proposed pipeline (query expansion, Hybrid retrieval, and LLM reranking) achieves an average end-to-end latency of 20.1 s per invoice line, with per-benchmark values of 19.8 s on Abt-Buy, 18.4 s on Amazon-Google, and 22.0 s on Walmart-Amazon. We emphasize that 20.1 s is a per-line latency, not invoice-level wall-clock time. Because line items are independent retrieval queries, processing can be parallelized across all lines of an invoice, so invoice-level completion time is bounded by API concurrency rather than by the sum of per-line latencies. Using gpt-5-mini, the system consumes an average of 1024 input tokens and 206 output tokens per line item; at current OpenAI API pricing, this corresponds to an estimated operational cost of approximately $0.00067 per line item (or roughly 6.7 cents per 100 invoice lines). Catalog enrichment is a one-time offline operation performed when the catalog is updated. It costs approximately $0.115 per 1000 SKUs at current API pricing and requires roughly 22 s of parallelized processing time. This negligible cost is fully amortized across all subsequent invoice queries against that catalog version. The combination of low per-line transaction cost and high decision-support accuracy supports the commercial viability of this architecture for substantially reducing manual reconciliation effort in production AP environments. We note that the per-line latency is higher than would be acceptable for a real-time blocking control, but is well within the tolerance of an asynchronous AP queue, where invoices are typically batched and reviewed within hours rather than seconds.
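The per-line cost figure can be reproduced with simple arithmetic. The sketch below assumes list prices of $0.25 per million input tokens and $2.00 per million output tokens for gpt-5-mini; these prices are an assumption for illustration and are not stated in the text above, although they are consistent with the reported figure.

```python
# Assumed list prices in USD per one million tokens (not stated in the article)
INPUT_PRICE_PER_M = 0.25
OUTPUT_PRICE_PER_M = 2.00


def cost_per_line(input_tokens: int, output_tokens: int) -> float:
    """Estimated API cost in USD for one invoice line."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


per_line = cost_per_line(1024, 206)  # average token counts reported above
per_hundred = 100 * per_line         # cost of 100 invoice lines
```

Under these assumed prices, the output tokens account for more of the per-line cost than the input tokens, even though far fewer of them are consumed.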

5.5. Conclusions

This study addressed a persistent friction in the procure-to-pay cycle: matching unstructured vendor invoice descriptions to structured corporate catalogs at production scale. We pursued three objectives. First, we designed a Dual-Augmentation RAG architecture in which LLMs proactively enrich the catalog index and rewrite incoming queries, with an LLM reranker applied afterward to enforce AP-specific judgment. Second, evaluation on three public entity resolution benchmarks yielded duplicate-aware Top-3 Recall between 94.34% and 98.93% with bootstrap-confirmed statistical robustness, matching or exceeding the strongest non-LLM baseline on every benchmark at an operational cost of approximately $0.00067 per invoice line. Third, deployment in a real Greek-language AP environment produced approximately 97% Top-3 reconciliation accuracy, consistent with the benchmark results.

The principal implication for AIS practice is that intelligent decision-aid architectures of this kind can credibly replace the search component of manual AP reconciliation while preserving the inspectable, controllable properties that an internal audit requires. The architecture is best positioned as a Top-3 decision-support filter rather than as a fully autonomous matcher, reflecting the cost asymmetry between false negatives and false positives in the AP context. Future work will address three directions in turn: a controlled multilingual evaluation on a multilingual ER benchmark to formally validate the code-switching capability our production environment relies on; a hold-out experiment in which catalog entries are deliberately removed from the index to measure non-match detection under realistic distributional shift; and a behavioral study of AP-clerk verification accuracy and speed with and without the decision aid, to quantify the productivity and audit-quality gains observed informally in deployment.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jrfm19060402/s1, Dataset S1: Reproducibility data archive (supplementary_data/) of comma-separated (CSV) files containing, for each of the three public benchmarks (Abt-Buy, Amazon-Google, Walmart-Amazon), the preprocessed retrieval inputs, the per-rank baseline retrieval outputs (BM25, Dense, and Hybrid against raw and enriched catalogs), and the proposed system’s top-3 predictions with per-query token-usage and latency metrics; a README.txt documents the folder layout, the CSV column schemas, and the procedure for reproducing Table 3 and Figure 2.

Author Contributions

Conceptualization, M.D. and S.M.; methodology, M.D. and S.M.; software, M.D.; validation, M.D. and S.M.; formal analysis, M.D.; investigation, M.D.; data curation, M.D.; writing—original draft preparation, M.D.; writing—review and editing, M.D. and S.M.; visualization, M.D.; supervision, S.M.; project administration, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Three categories of data underlie this study and are released as follows. (i) Public benchmark datasets. The three entity resolution benchmarks evaluated in Section 4 (Abt-Buy, Amazon-Google, Walmart-Amazon) are obtained without modification from the official DeepMatcher dataset index at https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md (accessed on 25 May 2026) (Mudgal et al., 2018), specifically the Structured/Walmart-Amazon, Structured/Amazon-Google and Textual/Abt-Buy entries. (ii) Derived files used in our pipeline. The retrieval-ready inputs (catalog.csv, catalog_enriched.csv, queries_positive.csv, pairs_positive.csv per dataset) and the per-rank outputs of every retrieval method reported in Table 3 and Figure 2 (BM25, Dense and Hybrid against both raw and LLM-enriched catalogs, plus the proposed system’s Top-3 predictions and per-row correctness flags in queries_positive_with_top3_eval.csv) are provided as Supplementary Material accompanying this article. A README.txt in the Supplementary Material documents the column schema of every file and maps each cell of Table 3 to its source. These derived files were produced from the public benchmarks above by the procedures described in Section 3.2 and Section 3.3. (iii) Production deployment data. The Greek-language vendor invoice corpus and corresponding corporate catalog referenced in Section 1 and Section 5.3 cannot be released due to commercial confidentiality. Source code. The implementation of the pipeline (catalog enrichment, query synthesis, Hybrid retrieval and LLM reranking) is not redistributed with this submission. The methodology, prompts and models required to reproduce the system are documented in Section 3 and Appendix A; the source code is available from the corresponding author on reasonable request.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (GPT-5.5, OpenAI), Claude (Claude Opus 4.7, Anthropic) and Gemini (Gemini 3.1 Pro, Google) in order to help improve grammar and clarity, and to assist with structuring sections of the manuscript. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The system uses three LLM prompts, all served by the gpt-5-mini model with reasoning_effort set to “minimal” and verbosity set to “low” for both latency and cost control. The prompts are reproduced verbatim below from the source code; only line wrapping has been adjusted for readability.

Appendix A.1. Catalog Enricher (Phase 1—Offline)

“““You are a keyword generator that enriches clean catalog items for better fuzzy retrieval and matching against noisy invoice lines in a RAG system.

Task: For each catalog entry, return only concise keywords (no sentences, no labels, no duplicates or repeat of words). The goal is to expand the catalog entry with realistic variants and terms that could appear in invoices or receipts or other catalogs, improving recall during embedding or vector similarity search.

Guidelines:

NEVER invent attributes not present in the input. Do not guess colors, sizes, capacities, or brands.
Keep the brand EXACT if present; add common brand abbreviations only if widely used (e.g., “hewlett packard” → “hp”).
Include synonyms for function/category if present (e.g., “headphones”, “headset”; “tv”, “television”).
Add realistic invoice-style variations (e.g., abbreviations).
Expand common abbreviations (e.g., “Tabl → tablets”, “Inj → injection”, “Amp → ampoule”), but keep both expanded and abbreviated forms when relevant.
Keep all numeric attributes EXACT (capacity, size, version). Also add common unit variants (e.g., “gb” and “gbyte”). Expand or clarify where needed.
No sentences, no marketing, no stopwords, no explanations. Keep each keyword short (≤4 words).
Focus only on metadata that could plausibly exist for this product in catalog descriptions.

Return at least 5–10 concise, search-oriented keywords that maximize retrieval accuracy between catalog data and real catalog text.”””

The enricher is invoked in mini-batches of five catalog entries per LLM call to amortize system-prompt overhead. The structured output is constrained to the following Pydantic schema, which requires one keyword list per entry:

Output schema (Pydantic):

from pydantic import BaseModel

class EnrichmentMetadata(BaseModel):
    # One keyword list per catalog entry in the five-entry mini-batch
    entry1_enriched_metadata_keywords: list[str]
    entry2_enriched_metadata_keywords: list[str]
    entry3_enriched_metadata_keywords: list[str]
    entry4_enriched_metadata_keywords: list[str]
    entry5_enriched_metadata_keywords: list[str]
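The mini-batching described above can be sketched as a simple chunking helper. The function name is illustrative. How a final, shorter batch is reconciled with the five-field schema (for example, by padding) is not specified here, and the sketch makes no assumption about it.

```python
from typing import Iterator, List, Sequence


def mini_batches(entries: Sequence[str], size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` catalog entries."""
    for start in range(0, len(entries), size):
        yield list(entries[start:start + size])
```

Each yielded group would correspond to one enricher call, so a catalog of 1000 SKUs results in 200 calls.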

Appendix A.2. Query Synthesizer (Phase 2—Runtime)

“““You are a Query Synthesizer for Product Matching in a vector-search pipeline.

From a single product description, produce FIVE diverse, rich, high-recall queries to retrieve the best-matching product from a vector store of catalog lines.

Do NOT invent attributes. Use ONLY info present in the input.
Include every product/model code, sizes/dimensions, versions, and pack/count EXACTLY as they appear (when present). If absent, omit.
Exclude invoice meta: lots, discounts, prices, VAT, order numbers, dates, addresses, loyalty/offer text, etc.
Always preserve original script, casing, and diacritics for brand/model tokens; if you add an expansion, KEEP the original too.
Queries must be DENSE, INFORMATIVE (no minimal queries) and meaningfully different (avoid trivial rephrasings).

DIVERSITY POLICY (pick the most helpful variations based on the product). Across the five queries, ensure you cover several of the following:

Category and form synonyms.
Acronym expansions or contracted/long-form variants.
Units/number/symbol/notation formatting variants found in input (e.g., “500 mg”/“500 mg”, “2 × 500 g”/“2 × 500 g”).
Packaging synonyms (add 1–2 besides the original).”””

The output is constrained to the following Pydantic schema, which returns a list of exactly five query strings:

Output schema (Pydantic):

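from typing import List

from pydantic import BaseModel, Field


class SearchQueries(BaseModel):
    # `...` (Ellipsis) marks the field as required.
    search_queries: List[str] = Field(
        ...,
        description="A list of 5 different search queries",
    )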
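
How the five queries feed the next stage can be sketched as a fan-out followed by an order-preserving merge. The sketch rests on stated assumptions: `fan_out` and `toy_retrieve` are hypothetical names, the token-overlap retriever is only a stand-in for the Hybrid vector store, and `k = 7` per query is inferred from the "up to 35 retrieved catalog lines" mentioned in Appendix A.3 (35 / 5), not taken from the source.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]


def fan_out(queries: list[str], retrieve: Retriever, k: int = 7) -> list[str]:
    """Run every query, then merge hits in arrival order, dropping repeats."""
    seen: set[str] = set()
    shortlist: list[str] = []
    for query in queries:
        for line in retrieve(query, k):
            if line not in seen:
                seen.add(line)
                shortlist.append(line)
    return shortlist


# Toy stand-in retriever: ranks catalog lines by shared lowercase tokens.
CATALOG = [
    "Acme paracetamol 500 mg tablets",
    "Acme ibuprofen 200 mg",
    "Zeta paper towels 6 pack",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    q = set(query.lower().split())
    scored = [(len(q & set(line.lower().split())), line) for line in CATALOG]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [line for score, line in ranked[:k] if score > 0]


queries = ["acme paracetamol 500 mg", "paracetamol tablets", "acme tablets 500 mg"]
shortlist = fan_out(queries, toy_retrieve, k=2)
```

Because overlapping queries tend to return the same catalog lines, the merged shortlist is usually much shorter than the theoretical maximum of queries × k.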

Appendix A.3. Evaluator/Reranker (Phase 2—Runtime, Final Stage)

“““You are a Product Match Evaluator. Your goal is to identify and rank the best product matches for a given product line.

INPUTS:

(1) The original product line.

(2) Up to 35 retrieved catalog lines (duplicates possible).

TASK: Evaluate all candidates and return the top three most relevant catalog lines.

EVALUATION PRIORITY (in order of importance):

Exact matches of product type or name.
Exact matches of brand name (brand text must match exactly if present).
Exact matches of product code(s) (e.g., SKU, EAN, or internal codes).
High textual similarity in product type or product name.
Exact matches or compatible values for size/dimensions or pack/count.
Other contextual or descriptive similarity.

INSTRUCTIONS:

Always return the three best candidates, ranked from most to least relevant.
If no exact matches exist, return the three closest ones based on partial or semantic similarity.
If a candidate matches at least the product type or name, it is valid for ranking.
Never fabricate or modify product text; use the catalog lines as provided.
Focus on precision and meaningful relevance rather than sufficiency.”””

The output is constrained to the following Pydantic schema, which returns exactly three candidate strings; these are then mapped back to (id, title, description) tuples by exact match against the deduplicated shortlist, with a substring fallback for minor reformatting:

Output schema (Pydantic):

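from typing import List

from pydantic import BaseModel, Field


class Candidates(BaseModel):
    # `...` (Ellipsis) marks the field as required.
    top_3_candidates: List[str] = Field(
        ...,
        description="The top 3 candidates to the OCR-extracted "
        "product line from the vector store catalog",
    )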
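
The mapping step described before the schema (exact match against the deduplicated shortlist, then a substring fallback) can be sketched as below. This is an illustration under assumptions: `CatalogRow`, `normalize`, and `map_back` are hypothetical names, the shortlist is modelled as a dict keyed by the catalog line text shown to the reranker, and the case and whitespace normalization used in the fallback is a guess at what "minor reformatting" covers.

```python
from typing import NamedTuple, Optional


class CatalogRow(NamedTuple):
    id: str
    title: str
    description: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed notion of minor reformatting)."""
    return " ".join(text.lower().split())


def map_back(candidate: str, shortlist: dict[str, CatalogRow]) -> Optional[CatalogRow]:
    """Resolve one reranker output string to a catalog row, or None."""
    # 1) Exact match against the line text shown to the reranker.
    if candidate in shortlist:
        return shortlist[candidate]
    # 2) Substring fallback on normalized text, in either direction.
    needle = normalize(candidate)
    if not needle:
        return None
    for line, row in shortlist.items():
        hay = normalize(line)
        if needle in hay or hay in needle:
            return row
    return None


shortlist = {
    "Acme paracetamol 500 mg tablets": CatalogRow("1", "Acme paracetamol", "500 mg tablets"),
    "Acme ibuprofen 200 mg": CatalogRow("2", "Acme ibuprofen", "200 mg"),
}
exact = map_back("Acme ibuprofen 200 mg", shortlist)
fuzzy = map_back("acme  paracetamol 500 mg", shortlist)
missing = map_back("Zeta towels", shortlist)
```

Returning `None` for an unresolvable string keeps a fabricated or heavily rewritten candidate from being silently attached to the wrong SKU.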

Notes

1	The three-way match is the standard accounts-payable internal control requiring agreement between the purchase order, the goods-receipt note, and the supplier invoice on item, quantity, and price before payment is authorized.
2	The three benchmarks are obtained from the official DeepMatcher dataset index at https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md (accessed on 25 May 2026) (Mudgal et al., 2018). Specifically, we use the Structured/Walmart-Amazon, Structured/Amazon-Google and Textual/Abt-Buy entries from that page.

References

Abderrahman, A. S. M., & Makarem, N. (2026). The future of external audit: A systematic literature review of emerging technologies and their impact on external audit practices. Journal of Risk and Financial Management, 19(3), 216.
Adeyemi, M., Oladipo, A., Pradeep, R., & Lin, J. (2023). Zero-shot cross-lingual reranking with large language models for low-resource languages. arXiv.
Althaf, A. M., Mohammed, M. A., Milanova, M., Talburt, J., & Cakmak, M. C. (2025). Multi-agent RAG framework for entity resolution: Advancing beyond single-LLM approaches with specialized agent coordination. Computers, 14(12), 525.
Balsiger, D., Dimmler, H.-R., Egger-Horstmann, S., & Hanne, T. (2024). Assessing large language models used for extracting table information from annual financial reports. Computers, 13(10), 257.
Bardelli, C., Rondinelli, A., Vecchio, R., & Figini, S. (2020). Automatic electronic invoice classification using machine learning models. Machine Learning and Knowledge Extraction, 2(4), 617–629.
Bode, C., Burkhart, D., Schültken, R., & Vollmer, M. (2023). Future of procurement. In R. Merkert, & K. Hoberg (Eds.), Global logistics and supply chain strategies for the 2020s: Vital skills for the next generation (pp. 261–276). Springer International Publishing.
Chen, L.-C., Weng, H.-T., Pardeshi, M. S., Chen, C.-M., Sheu, R.-K., & Pai, K.-C. (2025). Evaluation of prompt engineering on the performance of a large language model in document information extraction. Electronics, 14(11), 2145.
Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003). A comparison of string distance metrics for name-matching tasks. IIWeb, 3, 73–78. Available online: https://www.bibsonomy.org/bibtex/b918a22c0ac156bcd7114e8361377773 (accessed on 25 May 2026).
Cristani, M., Bertolaso, A., Scannapieco, S., & Tomazzoli, C. (2018). Future paradigms of automated processing of business documents. International Journal of Information Management, 40, 67–75.
Dang, Q.-V., Nguyen, N.-S.-A., & Vo, T.-B.-D. (2026). HierFinRAG—Hierarchical multimodal RAG for financial document understanding. Informatics, 13(2), 30.
Flechsig, C., Anslinger, F., & Lasch, R. (2022). Robotic Process Automation in purchasing and supply management: A multiple case study on potentials, barriers, and implementation. Journal of Purchasing and Supply Management, 28(1), 100718.
Gao, L., Ma, X., Lin, J., & Callan, J. (2023). Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long papers) (pp. 1762–1777). Association for Computational Linguistics. Available online: https://aclanthology.org/2023.acl-long.99/ (accessed on 25 May 2026).
Grabski, S. V., Leech, S. A., & Schmidt, P. J. (2011). A review of ERP research: A future agenda for accounting information systems. Journal of Information Systems, 25(1), 37–78. Available online: https://publications.aaahq.org/jis/article-abstract/25/1/37/1563 (accessed on 25 May 2026).
Ha, H. T., & Horák, A. (2022). Information extraction from scanned invoice images using text analysis and layout features. Signal Processing: Image Communication, 102, 116601.
Huang, F., & Vasarhelyi, M. A. (2019). Applying robotic process automation (RPA) in auditing: A framework. International Journal of Accounting Information Systems, 35, 100433.
Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). LayoutLMv3: Pre-training for document AI with unified text and image masking. arXiv.
Iaroshev, I., Pillai, R., Vaglietti, L., & Hanne, T. (2024). Evaluating retrieval-augmented generation models for financial report question and answering. Applied Sciences, 14(20), 9318.
James, A., Trovati, M., & Bolton, S. (2025). Retrieval-augmented generation to generate knowledge assets and creation of action drivers. Applied Sciences, 15(11), 6247.
Jeong, Y.-B., Seo, H., Kim, Y.-H., & Kim, W.-Y. (2025). Retrieval-augmented visual parcel invoice understanding transformer for address correction. Engineering Applications of Artificial Intelligence, 158, 111542.
Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., & Park, S. (2022). OCR-free document understanding transformer. arXiv.
Kim, J.-I., & Shunk, D. L. (2004). Matching indirect procurement process with different B2B e-procurement systems. Computers in Industry, 53(2), 153–164.
Köpcke, H., Thor, A., & Rahm, E. (2010). Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1–2), 484–493.
Krieger, F., Drews, P., & Funk, B. (2023). Automated invoice processing: Machine learning-based information extraction for long tail suppliers. Intelligent Systems with Applications, 20, 200285.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th international conference on neural information processing systems, NIPS ’20 (pp. 9459–9474). ACM.
Li, Y., Li, J., Suhara, Y., Doan, A., & Tan, W.-C. (2020). Deep entity matching with pre-trained language models. Proceedings of the VLDB Endowment, 14(1), 50–60.
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. arXiv.
Luo, C., Shen, Y., Zhu, Z., Zheng, Q., Yu, Z., & Yao, C. (2024). LayoutLLM: Layout instruction tuning with large language models for document understanding. arXiv.
Luo, S., & Yu, J. (2024). SGFNet: A semantic graph-based multimodal network for financial invoice information extraction. Expert Systems with Applications, 258, 125156.
Ma, X., Gong, Y., He, P., Zhao, H., & Duan, N. (2023). Query rewriting in retrieval-augmented large language models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 5303–5315). Association for Computational Linguistics.
Maurya, C. K., Gantayat, N., Dechu, S., & Horvath, T. (2020). Online similarity learning with feedback for invoice line item matching. arXiv.
Mehrbod, A., Zutshi, A., Grilo, A., & Jardim-Gonsalves, R. (2018). Application of a semantic product matching mechanism in open tendering e-marketplaces. Journal of Public Procurement, 18(1), 14–30. Available online: https://www.emerald.com/insight/content/doi/10.1108/jopp-03-2018-002/full/html (accessed on 25 May 2026).
Mistiawan, A., & Suhartono, D. (2024). Product matching with two-branch neural network embedding. Journal Européen Des Systèmes Automatisés, 57(4), 1207.
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., & Raghavendra, V. (2018). Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 international conference on management of data, SIGMOD ’18 (pp. 19–34). ACM.
Ng, K. K. H., Chen, C.-H., Lee, C. K. M., Jiao, J., & Yang, Z.-X. (2021). A systematic literature review on intelligent automation: Aligning concepts from theory, practice, and future perspectives. Advanced Engineering Informatics, 47, 101246.
Nigam, P., Song, Y., Mohan, V., Lakshman, V., Ding, W., Shingavi, A., Teo, C. H., Gu, H., & Yin, B. (2019). Semantic product search. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 2876–2885). ACM.
O’Leary, D. E. (2000). Enterprise resource planning systems: Systems, life cycle, electronic commerce, and risk. Cambridge University Press. Available online: https://books.google.com/books?hl=en&lr=&id=7fzMFG-tCmkC&oi=fnd&pg=PP11&dq=Enterprise+Resource+Planning+Systems+O%27Leary,+D.+E.+(2000)&ots=9a4Vlr0Y9P&sig=dRJS6XswJReudxUtffBskikKTNA (accessed on 25 May 2026).
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., … Zoph, B. (2024). GPT-4 technical report. arXiv.
Palm, R. B., Winther, O., & Laws, F. (2017). CloudScan—A configuration-free invoice analysis system using recurrent neural networks. arXiv.
Peeters, R., & Bizer, C. (2022). Supervised contrastive learning for product matching. In Companion proceedings of the web conference 2022, WWW ’22 (pp. 248–251). ACM.
Peeters, R., Steiner, A., & Bizer, C. (2024). Entity matching using large language models. arXiv.
Raina, V., & Gales, M. (2024, May 20). Question-based retrieval using atomic units for enterprise RAG. arXiv.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1995). Okapi at TREC-3. Nist Special Publication Sp, 109, 109. Available online: https://books.google.com/books?hl=en&lr=&id=j-NeLkWNpMoC&oi=fnd&pg=PA109&dq=Okapi+at+TREC-3&ots=YkE6HhAsME&sig=kDgCD0Ysml73EXihKaq8_229ZBQ (accessed on 25 May 2026).
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Schlegel, D., Fundanovic, O., & Kraus, P. (2024). Rating risks in robotic process automation (RPA) projects: An expert assessment using an impact-uncontrollability matrix. Procedia Computer Science, 239, 185–192.
Strohmer, M. F., Easton, S., Eisenhut, M., Epstein, E., Kromoser, R., Peterson, E. R., & Rizzon, E. (2020). Digital in procurement. In M. F. Strohmer, S. Easton, M. Eisenhut, E. Epstein, R. Kromoser, E. R. Peterson, & E. Rizzon (Eds.), Disruptive procurement: Winning in a digital world (pp. 49–76). Springer International Publishing.
Šimsa, Š., Šulc, M., Uřičář, M., Patel, Y., Hamdi, A., Kocián, M., Skalický, M., Matas, J., Doucet, A., Coustaty, M., & Karatzas, D. (2023). DocILE benchmark for document information localization and extraction. arXiv.
Tang, G., Xie, L., Jin, L., Wang, J., Chen, J., Xu, Z., Wang, Q., Wu, Y., & Li, H. (2021). MatchVIE: Exploiting match relevancy between entities for visual information extraction. arXiv.
Tater, T., Gantayat, N., Dechu, S., Jagirdar, H., Rawat, H., Guptha, M., Gupta, S., Strak, L., Kiran, S., & Narayanan, S. (2022). AI driven accounts payable transformation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(11), 12405–12413.
Tiwari, A. K., Marak, Z. R., Paul, J., & Deshpande, A. P. (2023). Determinants of electronic invoicing technology adoption: Toward managing business information system transformation. Journal of Innovation & Knowledge, 8(3), 100366.
Wagner, R. A., & Fischer, M. J. (1974). The string-to-string correction problem. Journal of the ACM, 21(1), 168–173.
Wang, T., Chen, X., Lin, H., Chen, X., Han, X., Wang, H., Zeng, Z., & Sun, L. (2024). Match, compare, or select? An investigation of large language models for entity matching. arXiv.
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 1192–1200). ACM.
Zeakis, A., Papadakis, G., Skoutas, D., & Koubarakis, M. (2023). Pre-trained embeddings for entity resolution: An experimental analysis. Proceedings of the VLDB Endowment, 16(9), 2225–2238.
Zhang, J., Fang, J., Zhang, C., Zhang, W., Ren, H., & Xu, L. (2025). Geographic named entity matching and evaluation recommendation using multi-objective tasks: A study integrating a large language model (LLM) and retrieval-augmented generation (RAG). ISPRS International Journal of Geo-Information, 14(3), 95.

Figure 1. Two-phase “augment-both-sides” architecture for invoice-to-catalog reconciliation: (Phase 1) offline catalog enrichment and Hybrid (Dense + Sparse) indexing; (Phase 2) runtime query expansion, Hybrid retrieval, deduplication, and LLM reranking for Top-3 matches.

Figure 2. Recall@1 (left) and Recall@3 (right) for the three retrieval baselines under raw and LLM-enriched catalogs, and for the proposed system, across the three benchmarks. Paired colors show the effect of catalog enrichment for each retriever; the orange bar isolates the additional contribution of query expansion and LLM reranking on top of an enriched catalog. Numerical values for all bars are reproduced in Table 3.

Table 1. Methodological comparison: from fuzzy matching to Dual-Augmentation RAG.

Dimension	Edit-Distance (Levenshtein/Jaro-Winkler)	TF-IDF/BM25	Dense Embeddings	Hybrid (Dense + Sparse)	Proposed: Dual-Augmentation RAG
OCR/character noise	Handles minor typos only	Weak unless fuzzy/character analyzers are used	Variable	Moderate	Strong if rewrite validation and identifier protection are used
Synonyms	No	No natively; possible with expansion	Generally strong	Strong	Strong, with catalog and query augmentation
Abbreviations	Limited	Requires dictionary	Limited/domain-dependent	Moderate	Strong if expansions are validated
Cross-lingual/mixed script	No	Requires multilingual analyzer	Encoder-dependent	Encoder-dependent	Strong on a multilingual retrieval stack
SKU/model-code preservation	Strong	Strong	Weak	Strong	Strong with SKU locking/validation
Query-time cost	Very low, but poor scalability	Low	Medium	Medium-high	Highest if rewrite/reranking are online
Auditability	High	High	Low	Medium	Medium-high with full logging
Main failure mode	Synonyms/abbreviations	Vocabulary mismatch	Semantic drift/code loss	Mixed failure modes	Hallucinated expansions, code corruption, reranker bias
Best suited for	Small clean catalogs	Code/keyword-heavy catalogs	Clean descriptive catalogs	General production baseline	Noisy multilingual AP/product data

Table 2. Dataset specifications and preprocessing logic.

Dataset	Domain	Catalog Size (Rows)	Evaluated Queries	Catalog Construction Strategy (Preprocessing)
Amazon-Google	Software and Tech	3226	1167	Composite Field Construction: We concatenated the Title and Manufacturer fields. The logic explicitly checks if the manufacturer is already present in the title; if not, it is appended to ensure the embedding captures the brand identity, which is critical for matching software licenses.
Abt-Buy	Consumer Electronics	1092	1028	Single Field Indexing: This dataset contained high-quality, descriptive Names. We indexed the Name field directly as it contained sufficient signal (Brand + Model + Spec) to distinguish SKUs without additional concatenation.
Walmart-Amazon	General Retail	22,074	962	Conditional Attribute Injection: This was the most heterogeneous dataset. We implemented a conditional logic that analyzed the Title. If key attributes like Brand or Model Number were missing from the title string, they were injected from their respective columns. This mirrors the “data cleansing” phase often required in ERP migrations.
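
The two conditional construction rules in Table 2 can be sketched as small string helpers. The function names `composite_field` and `inject_missing` are hypothetical, the case-insensitive containment check is an assumption (the table only says the logic "checks if the manufacturer is already present in the title"), and appending missing values at the end of the title is likewise a guess at the injection position.

```python
def composite_field(title: str, manufacturer: str) -> str:
    """Amazon-Google style: append the manufacturer unless the title has it."""
    title, manufacturer = title.strip(), manufacturer.strip()
    if manufacturer and manufacturer.lower() not in title.lower():
        return f"{title} {manufacturer}".strip()
    return title


def inject_missing(title: str, attributes: dict[str, str]) -> str:
    """Walmart-Amazon style: add attribute values that the title lacks."""
    parts = [title.strip()]
    for value in attributes.values():
        value = value.strip()
        if value and value.lower() not in title.lower():
            parts.append(value)
    return " ".join(part for part in parts if part)


already_branded = composite_field("Adobe Photoshop Elements 6", "adobe")
branded = composite_field("Photoshop Elements 6", "Adobe")
injected = inject_missing("Logi Wireless Mouse", {"brand": "Logi", "modelno": "M185"})
```

Skipping values that are already present keeps the indexed string from repeating the brand, which would otherwise overweight that token in both the sparse and dense representations.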

Table 3. Recall@1 and Recall@3 of retrieval baselines (raw vs. LLM-enriched catalog) and the proposed system across the three benchmarks.

Dataset	Method	Metric	Raw Catalog	Enriched Catalog	Proposed System	Proposed (Dup-Aware)	Gap (Dup-Aware − Proposed System)
Amazon-Google	Dense	R@1	68.38%	70.01%
	Sparse	R@1	65.98%	63.58%
	Hybrid	R@1	68.98%	68.04%	71.89% [69.32%, 74.30%]	73.78% [71.12%, 76.26%]	+1.89%
	Dense	R@3	90.75%	92.63%
	Sparse	R@3	86.46%	86.29%
	Hybrid	R@3	90.83%	92.12%	91.60% [89.97%, 93.14%]	94.34% [92.97%, 95.72%]	+2.74%
Abt-Buy	Dense	R@1	86.67%	85.90%
	Sparse	R@1	68.39%	70.14%
	Hybrid	R@1	80.64%	79.38%	93.97% [92.41%, 95.43%]	94.94% [93.58%, 96.30%]	+0.97%
	Dense	R@3	96.01%	95.91%
	Sparse	R@3	84.24%	86.58%
	Hybrid	R@3	92.41%	95.33%	97.96% [96.98%, 98.74%]	98.93% [98.25%, 99.51%]	+0.97%
Walmart-Amazon	Dense	R@1	77.44%	75.47%
	Sparse	R@1	77.03%	74.84%
	Hybrid	R@1	80.67%	75.68%	83.47% [81.19%, 85.86%]	84.10% [81.60%, 86.17%]	+0.63%
	Dense	R@3	94.38%	93.76%
	Sparse	R@3	94.59%	93.76%
	Hybrid	R@3	96.57%	95.74%	97.30% [96.15%, 98.23%]	97.92% [96.99%, 98.75%]	+0.62%

Bold values indicate the best result in each row under standard scoring, among the directly comparable columns (Raw Catalog, Enriched Catalog, and Proposed System). The Proposed (Dup-Aware) column uses lenient scoring, is not directly comparable to the strictly scored baselines, and is reported as a diagnostic (see Section 5.4).
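
For readers who want to recompute the table, the metric and the Gap column can be sketched as follows. This assumes the standard strict definition of Recall@k with one gold catalog entry per query; the function name `recall_at_k` and the toy rankings are hypothetical. The Gap values are consistent with the Dup-Aware score minus the strictly scored Proposed System score, which the last lines check for two rows of Table 3.

```python
def recall_at_k(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """Share of queries whose gold catalog id appears in the top-k list."""
    if len(ranked) != len(gold):
        raise ValueError("one ranked list is required per gold label")
    hits = sum(1 for preds, target in zip(ranked, gold) if target in preds[:k])
    return hits / len(gold)


# Toy rankings for three queries (hypothetical ids).
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
gold = ["a", "z", "q"]
r_at_1 = recall_at_k(ranked, gold, k=1)  # only the first query hits at rank 1
r_at_3 = recall_at_k(ranked, gold, k=3)  # the first two queries hit within top 3

# Gap column: Dup-Aware minus strictly scored Proposed System (Table 3 values).
gap_amazon_google_r1 = round(73.78 - 71.89, 2)
gap_amazon_google_r3 = round(94.34 - 91.60, 2)
```

The gaps are differences between two percentages, so they are best read as percentage points.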
