LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework

Karras, Aristeidis; Theodorakopoulos, Leonidas; Karras, Christos; Krimpas, George A.; Giannaros, Anastasios; Bakalis, Charalampos-Panagiotis

doi:10.3390/a18120791

Open AccessArticle

LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework

by

Aristeidis Karras

^1,*

,

Leonidas Theodorakopoulos

²

,

Christos Karras

^1,*

,

George A. Krimpas

¹

,

Anastasios Giannaros

¹

and

Charalampos-Panagiotis Bakalis

²

¹

Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece

²

Department of Management Science and Technology, University of Patras, 26334 Patras, Greece

^*

Authors to whom correspondence should be addressed.

Algorithms 2025, 18(12), 791; https://doi.org/10.3390/a18120791

Submission received: 17 November 2025 / Revised: 4 December 2025 / Accepted: 6 December 2025 / Published: 15 December 2025

Download

Browse Figures

Versions Notes

Abstract

In this work, we present a principled framework for the deployment of Large Language Models (LLMs) in enterprise big data management across digital governance, marketing, and accounting domains. Unlike conventional predictive applications, our approach integrates LLMs as auditable, sector-adaptive components that robustly and directly enhance data curation, lineage, and regulatory compliance. The study contributes (i) a systematic evaluation of seven LLM-enabled functions—including schema mapping, entity resolution, and document extraction—that directly improve data quality and operational governance; (ii) a distributed architecture that deploys Apache Spark orchestration with Markov Chain Monte Carlo sampling to achieve quantifiable uncertainty and reproducible audit trails; and (iii) a cross-sector analysis demonstrating robust semantic accuracy, compliance management, and explainable outputs suited to diverse assurance requirements. Empirical evaluations reveal that the proposed architecture persistently attains elevated mapping precision, resilient multimodal feature extraction, and consistent human supervision. These characteristics collectively reinforce the integrity, accountability, and transparency of information ecosystems, particularly within compliance-driven organizational settings.

Keywords:

LLM; generative AI; data management; big data; spark; decision-making; digital governance; digital marketing; accounting; audit; digital services for citizens

1. Introduction

Within a relatively short period of time, Large Language Models (LLMs) have become major business artificial intelligence tools, especially since they can process, reason about, and generate insights related to large and occasionally unstructured datasets. The reason why they are especially powerful, at least in practice, is that they can generate fluent, context-sensitive outputs which can subsequently make otherwise complex analyses and operational problems simpler. These models are broadly applied in organizations in two ways. On the one hand, they favor prediction-related processes, including classification, summarization, or forecasting, which bear a direct impact to decision-making. Conversely, they are also finding apprehension in data management tasks that are concerned with the organization, control, and dependable access to enterprise data. The latter is, arguably, much more difficult since it requires greater reproducibility, the possibility to trace the origins of data, and rigorous audit of the results. These are, as can be imagined, particularly necessary in settings where the strictness of regulatory or compliance-based requirements is applied. Their realization is not a simple matter, and it can be associated with working with disjointed data formats, historical or semi-structured data, changing consent and policy requirements, and, most importantly, how to ensure that the results of model execution are succinct and readily understood and audited by an enterprise system (and hopefully in a manner which humans can feasibly validate).

We propose and apply a framework in this study, that can help solve these problems through a combination of transformer-based LLMs with Apache Spark orchestration, as well as estimating uncertainty with Markov Chain Monte Carlo (MCMC) sampling. Integrating these parts will help to facilitate scalable, transparent, and auditable data management. The design incorporates a number of important components: controlled data retrieval, deterministic validation, unidirectional lineage tracing, and hierarchical human control. The act of assembling these pieces brings us to the goal of restricting unpredictable behavior, and at the same time, this ensures that every output can be subject to verifiable digital artifacts in controlled business processes.

In order to test this framework, we use it in three representative areas of business, including digital governance, marketing, and accounting. Each area also has its data, functioning procedures, and conformity requirements. Demonstrating by these applications, we indicate that it becomes feasible, practically, to treat complex datasets in a compliant and efficient, and reproducible way, by probabilistic calibration and close monitoring of data origins. The findings indicate that, beyond their origins as predictive text generators, LLMs can function as reliable, accountable, policy-conscious units of enterprise information systems that address regulatory and operational needs.

1.1. Aim of This Study

The primary aim of this study is to design and validate a framework that is clear, auditable, and checkable for using large language models to enhance enterprise data management. We aim, in particular, to move beyond traditional predictive analytics and illustrate how LLMs can function as reliable, interpretable, and policy-aware tools that strengthen data curation, tracking, governance, and controlled access in systems involving multiple stakeholders.

More specifically, our objectives are threefold:

To identify how LLMs deliver measurable benefits in enterprise data workflows;
To examine portability and governance challenges across sectors with differing rules and operational requirements;
To develop distributed, uncertainty-aware systems that integrate Apache Spark orchestration with MCMC sampling, ensuring reproducibility, reliable data tracking, and institutional trust in automated data pipelines.

1.2. Contributions

The main contributions of this work can be summarized as follows:

We propose a scalable and adaptable framework for integrating large language models as governed and auditable elements within enterprise data pipelines, which, to our knowledge, has not previously been implemented in this specific combination.
We define and empirically verify seven LLM-enabled functions, spanning data integration, quality assurance, metadata lineage, compliance management, and access control, and demonstrate their applicability across digital governance, marketing, and accounting.
We design a distributed architecture that explicitly accounts for uncertainty, by combining Apache Spark orchestration with MCMC sampling, thereby enabling policy-compliant, reproducible, and transparent data management at scale.

This study addresses the following research questions:

RQ1 How can probabilistic calibration using MCMC sampling be combined with deterministic validation to reduce non-determinism while ensuring explainability and replayability?
RQ2 Which foundational components support the creation of scalable and managed LLM pipelines, including controlled retrieval, provenance anchoring, and risk-based human oversight across industries with differing regulatory requirements?
RQ3 How do these design choices balance semantic accuracy, complete provenance, decision speed, and computational or energy efficiency in real-world environments?
RQ4 Which evaluation methods most effectively connect accuracy, calibration, and governance metrics, in order to enable responsible deployment in business contexts?

The rest of the paper is organized in the following way. Section 2 reviews background and related work on enterprise data management, probabilistic calibration, and data governance. Section 3 describes the system architecture and orchestration patterns, including governed retrieval, validator design, and provenance instrumentation. Section 4 presents the main algorithms and MCMC diagnostics. Section 5 outlines the architecture patterns and orchestration strategies used in practice. Section 6 details sectoral implementations and failure modes. Section 7 describes the evaluation methodology and metrics. Section 8 reports empirical findings and transferability analysis. Section 9 discusses implications and limitations, and Section 10 concludes and outlines future directions.

1.3. Problem Framing & Scope

1.3.1. Defining LLMs for Data Management Versus Prediction

In enterprise information systems, LLM deployments bifurcate into (i) prediction-oriented uses—classifications, summaries, forecasts—and (ii) data-management-oriented uses that render data fit for purpose by curating, governing, and exposing it reliably across platforms in digital governance, marketing, accounting, and big data infrastructures [1,2,3,4]. Prediction-oriented work optimizes task metrics (accuracy, precision/recall, calibration) and tolerates stochasticity and distribution shift for downstream utility [5,6,7,8,9,10,11,12,13,14,15]. Data-management deployments, by contrast, prioritize process guarantees—reproducibility, provenance, auditability, policy conformance—because outputs (schemas, mappings, lineage, quality assertions, access decisions) become institutional controls [1,2,13,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44].

This distinction implies different stakeholders and validations. Prediction primarily serves analysts/marketers/policy teams who accept some opacity for metric lift. Data-management deployments primarily serve data stewards, auditors, and compliance officers, who must rely on evidence objects and replayable pipelines, as illustrated by recent work on provenance systems and audit-ready ML workflows [21,22,40,45]. In these settings, verification shifts from aggregate predictive accuracy to deterministic validation of lineage coverage, mapping reproducibility, and consent or policy enforcement, supported by provenance-tracking and control-oriented architectures [38,39,40,45,46]. Such guarantees are non-negotiable for public registries, financial reporting, and audit workflows, and increasingly shape responsible marketing where consent, transparency, and recourse are essential [39,40,41,42,43,45,46,47,48,49,50].

Related probabilistic foundations reinforce these guarantees. MCMC methods and distributed implementations (e.g., over Spark) enable scalable, reproducible posterior sampling in high-dimensional spaces [51,52,53]; Monte Carlo EM for GLMMs operationalizes iterative refinement of latent structures [54]. These methodologies [51,52,54,55] illustrate how uncertainty can be quantified while preserving audit-ready reproducibility, informing LLM-driven lifecycle controls across governance, marketing, and accounting.

1.3.2. Functional Boundaries and Scope

Guided by this delineation, we emphasize the capacities of large language models (LLMs) that facilitate adaptive big data governance rather than conventional analytic processing, wherein demonstrable value consistently emerges across multiple domains:

F1.: Data Ingestion & Integration. Language-grounded semantic alignment across heterogeneous sources; assistive rule generation with rationales; and schema evolution handling in lake and lakehouse settings, where LLMs act as integration co-pilots that propose, validate, and adapt mappings and transformation logic across complex enterprise data landscapes [16,17,23,56].
F2.: Data Quality & Constraint Management. Context-aware detection and repair (format, range, referential integrity) using domain semantics, with LLMs surfacing anomalies, proposing candidate fixes, and working in tandem with post-validators to preserve consistency [10,11,37,57].
F3.: Metadata & Lineage Management. Automated extraction of technical and business metadata, transformation introspection, and lineage completion to support replayability, impact analysis, and audit-ready provenance across complex pipelines [19,20,33,34].
F4.: Governed Data Access via RAG. Consent- and authorization-aware retrieval with citations and policy-conditioned rationales, where LLMs serve as constrained brokers over enterprise knowledge bases rather than unconstrained generators [2,13,17,36].
F5.: Document-to-Structure Extraction. High-fidelity parsing of invoices, tenders, contracts, and disclosures into normalized records with confidence scores, cross-links, and master-data validation, often combining LLM reasoning with vision or layout-aware models [14,27,58,59].
F6.: Compliance & Privacy Operations. PII/PCI detection and redaction, consent and purpose tagging, policy classification, and continuous violation monitoring with audit-ready justifications to support regulatory compliance across sectors [45,46,48,49].

Strictly predictive tasks (e.g., demand forecasting, propensity modeling, clustering) are excluded from the scope of this study unless they are directly instrumental to a control objective, such as anomaly surfacing or access-risk scoring [5,6,7,8,9,10,11,12,13,14,15]. Our emphasis is therefore on operational data management, where LLM-enabled pipelines support governance, provenance, and assurance functions in line with enterprise risk and audit practices as shown in works [12,29,32,33,34,35,36,37,56].

1.3.3. Corpus-Building Approach (Non-PRISMA, Rigor-Preserving)

To balance breadth and depth without formal PRISMA reporting, this study conducts a curated, targeted search with backward/forward snowballing from anchor works. The horizon was 2020–2025 across IEEE Xplore, ACM Digital Library, arXiv, SpringerLink, and sector outlets in information systems, accounting/audit, and public-sector informatics. Inclusion required: (i) an explicit data-management objective (any of F1–F6); (ii) architectural or empirical detail sufficient for reproduction or independent validation; (iii) governance/assurance considerations (provenance, controls, compliance, auditability); and (iv) either cross-context transferability or clear grounding in digital government, marketing, or accounting. We excluded purely predictive demonstrations lacking data-management substance and works without actionable designs.

The resulting corpus integrates peer-reviewed studies, design and architecture papers, and sector exemplars, combining socio-technical breadth (governance, policy, adoption) with technical depth (pipelines, guardrails, evaluation).

First, several works delineate the emerging roles of LLMs in data systems, distinguishing between analytic, governance, and co-pilot functions in enterprise architectures [1,2]. Related studies extend these perspectives to corpus-centric and search-augmented settings where LLMs mediate access to large knowledge bases [13,17]. Additional contributions explore sector-specific data platforms and proactive data systems that embed LLMs deeper into enterprise decision workflows [24,60]. A second group focuses on quality, lineage, and assurance instrumentation, detailing how LLMs assist in constraint checking, error detection, and provenance tracking [19,20]. Further work proposes low-overhead provenance frameworks and ML lifecycle transparency platforms tailored to regulated environments [21,22]. Classical big data governance and data quality frameworks remain important anchors for these LLM-enabled assurance pipelines [38,61]. A third line of work studies integration and ETL co-pilots, where LLMs help design or execute ingestion and transformation flows across heterogeneous sources [16,23]. Empirical studies show that language-based co-pilots can propose mappings and transformation logic for complex schemas while preserving human oversight [24,26]. Recent platforms integrate these co-pilots with distributed data processing engines to support large-scale ETL orchestration [12,56]. Separate studies investigate governed retrieval-augmented generation over enterprise knowledge bases, emphasizing policy-aware retrieval, citation, and justification mechanisms [2,17]. Extensions introduce consent-aware and risk-sensitive retrieval pipelines where LLMs act as constrained brokers rather than free generators [13,26]. Sectoral deployments further highlight governance challenges around attribution, hallucination control, and answerability in business and public-sector settings [35,36].

Another thread develops document-to-structure pipelines that convert semi-structured artefacts such as invoices, disclosures, or logs into normalized datasets with validation hooks [58,59]. Vision–language models and layout-aware architectures are used to improve extraction robustness and cross-document linking in enterprise repositories [27,31]. These pipelines frequently combine LLM reasoning with classical OCR and schema validation to satisfy audit and reporting requirements [14,28]. Finally, a growing body of work examines compliance and privacy operations across sectors, including data provenance, licensing, and risk detection for regulated domains [47,49]. Other contributions focus on watermarking, content tracing, and dataset-level governance to support responsible model training and deployment [48,50]. Sectoral case studies in public services, environmental domains, and platform governance illustrate how LLM-based controls interact with evolving regulatory frameworks [42,62]. We also synthesize emerging and adjacent perspectives, as well as gray or secondary indices, to support triangulation and horizon scanning around LLM use in prediction and sectoral operations [5,6]. Additional studies on time-series forecasting and financial or macroeconomic applications highlight how predictive LLM deployments intersect with risk and assurance considerations [7,9]. Complementary work on LLMOps and operational governance practices informs our view of production-ready, auditable LLM pipelines [18,25].

Collectively, this corpus supports cross-sector abstraction while preserving the assurance-by-design focus central to digital governance, marketing, and accounting [33,56]. Recent architectures for provenance-aware data platforms and controlled LLM integration further reinforce this emphasis on traceability and accountability across enterprise data lifecycles [34,37].

1.3.4. Cross-Sector Context Map

Across digital governance, digital marketing, and accounting/audit, LLM deployments converge on transparent, controllable, and auditable data pipelines that emphasize ingestion/integration, data quality, metadata/lineage, governed access (e.g., RAG), and document-to-structure extraction—rather than purely predictive use [12,19,22,46,63,64,65]. Common structural drivers include: (i) extreme heterogeneity and velocity (registries, case files, sensors; omnichannel telemetry and CRM; ledgers plus semi-structured contracts/disclosures) [1,3,16,25]; (ii) persistent needs for semantic mapping and entity resolution to support cross-agency harmonization, customer-journey stitching, and subsidiary consolidation [3,16,25]; (iii) pervasive quality, compliance, and explainability pressures—GDPR-style consent/purpose limits and SOX/ISA/IFRS documentation—favoring auditable, reproducible tooling [2,12,22,46,47]; (iv) provenance/lineage as first-class controls for trust and assurance, with LLMs extracting and completing evidence graphs [19]; and (v) governed, purpose-limited access to natural-language interfaces via policy- and consent-aware retrieval [2,12]. Sectoral stakes then differentiate priorities:

Digital governance. Citizen-centric records, identity/authentication logs, and multi-agency documents mix long-lived masters with episodic cases and IoT streams [1]. Statutory transparency and strict consent boundaries heighten requirements for provenance, auditability, and explainable access; errors affect rights and service equity [12,46,47]. LLMs should focus on policy-aligned ingestion/integration, documentation, and lineage extraction, and governed RAG that emits citations and oversight-ready rationales [2,19].
Digital marketing. High-velocity behavioral telemetry, adtech interactions, CRM histories, and social text—often blended with third-party data—operate under evolving contracts and privacy regimes [16,25]. Differential consent and purpose limitation intersect with rapid experimentation; misuse or scope creep creates regulatory and reputational risk, while weak identity resolution erodes ROI [2,12,47]. LLM priorities: consent-aware ER/identity mapping, contract/policy extraction, and governed RAG over CDPs and knowledge assets with policy filters by design [2,3,12,16].
Accounting/audit. Structured financial systems and semi-structured artifacts (invoices, receipts, contracts) must form durable, replayable evidence chains [3]. External assurance demands determinism, reproducibility, and consistent evidence; human-in-the-loop controls and verifiable logs must bound stochastic behavior [19,22]. LLMs fit best in deterministic document-to-structure extraction with validators, metadata/lineage completion, and access workflows preserving segregation of duties and audit trails [19,22]. All the above elements are analysed further in Table 1.

Synthesis. LLMs function best as assistive, governed components that (i) accelerate semantic alignment and documentation, (ii) fortify quality and lineage controls, and (iii) mediate policy- and consent-aware access. Public services prioritize transparency and recourse [12,46,47]; marketing emphasizes agility under consent and purpose limits [2,12]; and accounting requires determinism and replayable evidence [19,22]. Accordingly, evaluation must weight provenance coverage and control effectiveness alongside task accuracy to ensure inspectable, reproducible, and compliant pipelines across contexts [1,3,16,19,22].

2. Taxonomy of LLM-Enabled Data Management Functions

This section defines seven reusable functions through which LLMs deliver intelligent big data management across digital governance, marketing, and accounting [66,67,68,69,70]. The taxonomy proceeds from upstream alignment to downstream governed access, matching subsections F1–F7 and the pipeline summarizations in Table 2, Table 3, Table 4 and Table 5. For each function, we state the objective, outline typical realizations, and summarize reported evidence, with concise transitions to the next stage.

2.1. F1. Schema & Mapping Co-Pilot

Function. Assist schema understanding, cross-system mapping, and semantic reconciliation, producing governed transformations with human-reviewable rationales [71,72,73,74]. LLMs exploit natural-language metadata and glossaries to propose semantically faithful mappings beyond syntactic matchers. Realizations & evidence. Pretrained LLMs are coupled with knowledge bases, mapping histories, and catalog metadata; RAG retrieves prior decisions and style guides; program analysis exports versioned artifacts. Enterprise-scale routing/selection is demonstrated for federated repositories. Studies report high precision and reduced effort, with accelerated schema-evolution timelines via automated impact analysis [71,72,73,75,76,77,78]. Transition. Record-level consistency motivates F2.

2.2. F2. Entity Resolution Assistant

Function. Consolidate duplicates/relations by combining embedding-based blocking with LLM adjudication to handle ambiguity and contextual references [79,80,81,82,83]. Realizations & evidence. Candidates are staged via vector similarity; prompts encode schema, rules, and exemplars; batching and demo selection optimize cost/quality. Clustering with in-context demonstrations lowers pairwise complexity and enables uncertainty-driven verification. Reported results show F1 gains and multi-fold API-call reductions with competitive quality, including strong performance in geospatial/public-record settings [79,81,83]. Transition. With entities reconciled, fitness-for-use is addressed in F3.

2.3. F3. Data Quality & Constraint Repair

Function. Complement profilers and rule engines by contextualizing anomalies, distinguishing errors from edge cases, and proposing policy-consistent repairs that preserve integrity [84,85]. Realizations & evidence. Hybrid designs pair statistical detection with LLM-based explanation/remediation, followed by deterministic validators and rollback safeguards; iterative refinement emits machine-auditable justifications. Evidence indicates strong detection, practical auto-remediation when domain knowledge is injected, and improved blocking under semantic noise. Transition. Quality actions must be explainable and traceable, motivating F4.

2.4. F4. Metadata/Lineage Auto-Tagger

Function. Extract technical/business metadata and complete lineage across code, notebooks, and workflow logs to support governance, impact analysis, and assurance. Realizations & evidence. Program/workflow analysis feeds LLM summarization and semantic annotation; some systems anchor lineage in immutable ledgers [12,18,21,22]. Implementations report efficient tracking and end-to-end ML provenance with substantial documentation reduction and audit-ready enterprise trails. Transition. Operationalized policy follows from lineage; F5 focuses on privacy/consent and PII handling.

2.5. F5. Policy/Consent Classifier & PII Redactor

Function. Automate policy classification (e.g., GDPR/CCPA purpose), consent-state reasoning, and sensitive-data handling across structured/unstructured assets [86,87,88,89,90,91]. Realizations & evidence. Multi-model pipelines pair specialist PII detectors with LLM legal/policy reasoning and consent propagation, integrating knowledge graphs and registries; borderline cases are triaged by humans; monitors emit machine-checkable rationales. Reported systems achieve high compliance precision/recall and reduce manual review via KG augmentation and consent-platform integration. Transition. With processing controls established, F6 structures unstructured inputs.

2.6. F6. Document-to-Structure Extractor

Function. Convert invoices, contracts, receipts, and filings into normalized records with field-level confidence, cross-document links, and master-data validation [92,93,94,95]. Realizations & evidence. Vision–language pipelines combine layout analysis with LLM field semantics, domain validators, and memory/agentic learning from corrections. Benchmarks and cases show double-digit accuracy gains over single prompts/vanilla agents, with quantified robustness to rotation/layout and notable efficiency benefits. Transition. Finally, curated corpora require safe, explainable access (F7).

2.7. F7. Retrieval-Augmented Generation (RAG) for Governed Data Access

Function. Provide compliant, explainable, least-privilege natural-language access with authorization, consent, and minimization, emitting citations and logs [2,13,17]. Realizations & evidence. Vector indexing is combined with policy filters, consent checks, and PII redaction pre-synthesis; schema-routing co-pilots scale to large repositories; full trails support audit/incident response [13,17,73]. Studies show effective policy adherence without material loss of answer quality; search-augmented training stabilizes attribution/compliance.

Synthesis. F1–F7 form an auditable pipeline: semantics-aware integration (F1) and entity fidelity (F2) provide coherence; fitness-for-use is assured (F3) and documented (F4); processing is constrained by policy/consent (F5); unstructured inputs are normalized (F6); and access is governed (F7). Table 2, Table 3, Table 4 and Table 5 consolidate performance, sectoral fit, complexity, and risk mitigation.

Table 2. Reported performance characteristics of LLM-enabled data-management functions.

Function	Accuracy Range	Speed Improvement	Manual Effort Reduction	Primary Success Metrics (Refs.)
Schema & Mapping Co-Pilot	85–92%	60–75%	60–75%	Mapping precision; semantic correctness [71,72,73,75,76]
Entity Resolution Assistant	82–99%	Up to 5× fewer LLM calls	—	F1; clustering quality; cost/quality trade-offs [79,80,81,82,83]
Data Quality & Constraint Repair	65–80% detection; 50–70% auto-remediation	—	—	Detection rate; repair success; FP rate [84,85]
Metadata/Lineage Auto-Tagger	75–90% coverage (reported)	Real-time or near-real-time	75–90% documentation reduction	Lineage completeness; provenance readability [12,20,21,22]
Policy/Consent Classifier & PII Redactor	83–95%	60–80% review reduction	—	Compliance precision/recall; PII detection rate [86,87,88,90,91]
Document-to-Structure Extractor	75–95%	30–35%	—	Field-level F1; robustness to layout/rotation [92,93,95]
RAG for Governed Data Access	Query-dependent	Scales to large DBs	—	Response accuracy; compliance adherence [2,13,73]

Notes: RAG = retrieval-augmented generation; PII = personally identifiable information; FP = false positive; DBs = databases. Ranges depend on datasets, prompts, and evaluation protocols.

Table 3. Sector applicability of the seven functions with indicative differentiators.

Function	Gov.	Mkt.	Acct./Audit	Key Differentiators (Refs.)
Schema & Mapping Co-Pilot	High	High	High	Cross-agency, omnichannel, and subsidiary harmonization [72,73,75,76]
Entity Resolution Assistant	High	High	High	Citizen/customer/vendor identity fidelity [79,80,81,83]
Data Quality & Constraint Repair	Medium	High	High	Real-time cleaning vs. regulatory determinism [84,85]
Metadata/Lineage Auto-Tagger	High	Medium	High	Transparency and audit-trail depth [12,20,21,22]
Policy/Consent Classifier & PII Redactor	High	High	Medium	Consent and purpose limitation; policy verification [86,88,90,91]
Document-to-Structure Extractor	Medium	Medium	High	Volume and criticality of invoices/contracts [92,95]
RAG for Governed Data Access	High	High	Medium	Natural-language access under policy/consent [2,13,17,73]

Notes: Gov. = government; Mkt. = market-facing; Acct./Audit = accounting and audit.

Table 4. Implementation complexity and resource considerations.

Function	Technical Complexity	Infrastructure Requirements	Governance Overhead	Typical Time
Schema & Mapping Co-Pilot	Medium	Knowledge bases; catalog integration	Low	2–3 months
Entity Resolution Assistant	Medium–High	Vector/cluster infrastructure	Medium	3–4 months
Data Quality & Constraint Repair	Medium	Rule engines; validators	Medium	2–4 months
Metadata/Lineage Auto-Tagger	High	Lineage/graph stores; code parsing	High	4–6 months
Policy/Consent Classifier & PII Redactor	High	Privacy KG; consent registry; DLP	High	4–6 months
Document-to-Structure Extractor	Medium–High	Multi-modal models; RPA/ETL hooks	Medium	3–5 months
RAG for Governed Data Access	High	Vector DB; policy engine; secure gateways	High	4–8 months

Notes: KG = knowledge graph; DLP = data loss prevention; RPA = robotic process automation; ETL = extract–transform–load; DB = database.

Table 5. Primary risks, failure modes, and proven mitigations.

Function	Primary Risk Factors	Failure Modes	Mitigations and Monitoring (Refs.)
Schema & Mapping Co-Pilot	Schema hallucination; overgeneralization	Incorrect mappings; integrity breaks	Confidence thresholds; mandatory human validation; mapping telemetry [71,72,73,75,76]
Entity Resolution Assistant	Over/under-clustering; drift	Duplicate merges; missed links	Uncertainty-driven prompts; balanced demos; precision/recall dashboards [79,80,81,82]
Data Quality & Constraint Repair	Over-correction; rule conflicts	Data corruption; regressions	Deterministic validators; rollback; change logs [84,85]
Metadata/Lineage Auto-Tagger	Incomplete capture; overhead	Missing dependencies; stale lineage	Hybrid capture; coverage SLOs; spot audits [20,21,22]
Policy/Consent Classifier & PII Redactor	False negatives; policy drift	PII leakage; unlawful purpose	Multi-stage detection; human-in-the-loop; violation alerts [86,88,90,91]
Document-to-Structure Extractor	Layout/rotation variance	Field misreads; linkage errors	Multi-modal fusion; validators; layout/rotation pre-processing [92,95]
RAG for Governed Data Access	Hallucination; privilege creep	Unauthorized disclosure	Sandboxed retrieval; policy filters; full query audit [2,13,17]

Notes: SLO = service-level objective; PII = personally identifiable information; RAG = retrieval-augmented generation.

3. Methodology

In this section, a systematic and principled methodology is proposed to address the challenges of large-scale, data-driven intelligence across digital governance, marketing, and accounting applications. Special emphasis is placed on the integration of sectoral datasets, advanced distributed processing, probabilistic sampling via Markov Chain Monte Carlo (MCMC) methods, and robust evaluation within a reproducible and auditable framework.

3.1. Datasets and Data Generation

This study is based on a combination of real-world and synthetically generated datasets. For each evaluated sector

s \in {Governance, Marketing, Accounting}

, we denote the corresponding dataset as

D_{s}

.

Public-sector procurement analytics utilize the European Union Tenders Electronic Daily (TED) corpus: structured awards and tender notices with explicit entity and policy references [96].
Marketing analytics are grounded in the Olist e-commerce corpus, encompassing rich transactional and customer behavior attributes [97].
Accounting and audit scenarios source the SROIE and CORD financial document sets, which offer machine-readable ground-truth annotations.

To benchmark scalability, robustness, and methodological performance, synthetic datasets

D_{syn} (θ)

are generated under parametric templates parameterized by

θ

, allowing exact control over entity distributions, attribute sparsity, and cross-table dependencies:

D = ⋃_{s} D_{s} \cup D_{syn} (θ)

Table 6 summarizes the main characteristics and the functions (F1–F7) instantiated per dataset.

3.2. Distributed LLM and MCMC Framework

In this section, we present a distributed LLM architecture, harnessing the Apache Spark platform for scalable data partitioning and compute orchestration, and employing MCMC sampling for quantifiable uncertainty and adaptive provenance.

3.2.1. Data Preprocessing and Partitioning

Each dataset

D_{s}

is first loaded into the distributed environment and divided into logical partitions using a hash-based function that ensures balance and preserves both sector and entity semantics. Formally, we define the partitioning function

h : N \to N

and the corresponding partition operator as:

P (x_{i}) = Partition (D_{s}, h (i))

This strategy allows the system to process data in parallel across all partitions while maintaining sector-specific grouping, which is crucial for consistent governance analysis and domain-aligned model execution.

3.2.2. Model Invocation and Inference

We deploy large language models customized to each sectoral task—such as retrieval-augmented generation, consent-aware entity resolution, and multimodal document extraction—through Spark’s distributed UDF pipeline. Each model operates as an independent microservice, allowing elastic scaling and fault-tolerant inference. Let

f (x; ϕ)

represent a sector-specific LLM characterized by parameters

ϕ

:

y_{i} = f (x_{i}; ϕ_{s})

Parallel inference over partitioned inputs

x_{i}

ensures high throughput and extensive coverage across diverse workloads. This setup effectively couples model invocation with cluster-level scheduling, maximizing performance while maintaining deterministic task flow.

3.2.3. MCMC Posterior Sampling for Uncertainty Quantification

To evaluate model reliability under variable data quality and regulatory conditions, we adopt a Markov Chain Monte Carlo (MCMC) approach for posterior estimation. Given an input

x

and sector-specific model f we estimate the posterior distribution over outputs

y

as:

p (y ∣ x, D_{s}) \propto p (y ∣ f, x) \cdot p (f ∣ D_{s})

Each iteration of sampling progresses according to:

y^{(t + 1)} = K (y^{(t)}, η)

where

K

represents a proposal kernel that incorporates both task parameters and regulatory constraints, and

η

denotes sector-aware hyperparameters. The aggregated expectation over all samples yields the uncertainty-adjusted estimate:

E [q (y)] \approx \frac{1}{T} \sum_{t = 1}^{T} q (y^{(t)})

This process provides both interval-based confidence and a calibrated representation of decision uncertainty suitable for regulated enterprise use.

3.2.4. Provenance and Traceability

To ensure full reproducibility and end-to-end auditability, each inference and corresponding MCMC sample is tagged with a deterministic provenance identifier derived from the data stream and system state:

ProvID = hash (D_{s}, x_{i}, t)

This provenance identifier allows downstream traceability across Spark partitions, model checkpoints, and validation events. It also enables seamless alignment with post-hoc audits, lineage tracking, and cross-sectoral compliance assessments.

3.3. Evaluation Metrics and Uncertainty Reporting

We evaluate the system using a series of domain-relevant performance and reliability metrics, collectively represented as

Q (y)

. For each metric, results are reported with their corresponding 95% credible intervals derived from the MCMC posterior distribution. A 95% credible interval is adopted as it aligns with prevailing audit and risk-management practice, while remaining tight enough to inform operational thresholds in compliance-critical settings; alternative coverage levels can be configured where regulation demands. The posterior is given as:

Metric = Q (y) \pm {CI}_{95 %}

The credible interval is computed as:

{CI}_{95 %} = 1.96 \sqrt{\frac{1}{T} \sum_{t = 1}^{T} {(q (y^{(t)}) - \bar{q})}^{2}}

where

\bar{q} = \frac{1}{T} \sum_{t = 1}^{T} q (y^{(t)})

.

This formulation provides both point estimates and empirical uncertainty bounds, allowing the reported figures to be interpreted with statistical confidence while maintaining alignment with audit and reproducibility standards.

This principled approach enables not only superior predictive and extraction accuracy but also rigorous quantification of uncertainty and reproducibility, advancing the state-of-the-art for sectoral AI in real-world enterprise.

3.3.1. Baselines

To contextualize the performance of the proposed LLM-enabled functions, three classical baselines are considered. For F1 (schema mapping), a non-LLM matcher is used that combines token-based string similarity (e.g., Jaccard and edit distance on attribute names) with rule-based type checks to select mappings. For F2 (entity resolution), a blocking-plus-rules baseline is instantiated with deterministic blocking keys (normalized names and identifiers) and fixed similarity thresholds without LLM adjudication. For F7 (governed retrieval), a keyword/BM25-style retrieval model over the same indices is employed, without LLM reasoning or policy-aware re-ranking. Baseline scores are reported alongside the proposed metrics (e.g., SWMC, CERS, QPEG) in the evaluation tables, illustrating relative gains in mapping correctness, resolution quality, and governed answer accuracy.

3.3.2. Implementation Details

Text-based functions (F1–F5, F7) were implemented leveraging a GPT-4-class large language model accessed via an HTTP API, while document-to-structure extraction (F6) utilized a vision–language model combining OCR and layout-aware text understanding. Model invocations were orchestrated from Apache Spark using distributed user-defined functions (UDFs) that called per-partition microservices; each worker maintained a small pool of HTTP clients to amortize connection overhead and ensure fault-tolerant retries. Models were operated in a zero-shot or few-shot prompting regime, using sector-specific templates and small sets of in-context examples extracted from dataset training splits, without extensive fine-tuning. For MCMC sampling, a random-walk proposal kernel over discrete mapping or label spaces was employed, running 1000 iterations per batch, including a brief warm-up phase, with acceptance rates tuned between 0.3 and 0.5. Experiments were executed on a cluster composed of 8 worker nodes, each featuring 8 CPU cores and 16 GBs of RAM. Typical end-to-end runtimes for the largest synthetic datasets were on the order of tens of minutes.

3.3.3. MCMC Versus Simple Repetition

Beyond reporting posterior credible intervals via MCMC, we conducted a simple ablation to compare MCMC-based uncertainty with variance estimates obtained from repeated runs without MCMC. We evaluate metrics:

Q (y)

using Markov chain samples

{y_{t}}_{t = 1}^{T}

estimating the posterior expectation

\hat{Q} = \frac{1}{T} \sum_{t = 1}^{T} Q (y_{t})

and variance

\hat{Var} (Q) = \frac{1}{T} \sum_{t = 1}^{T} {(Q (y_{t}) - \hat{Q})}^{2}

The 95% credible interval is

\hat{Q} \pm 1.96 \sqrt{\hat{Var} (Q)} .

For comparisons, we perform an ablation estimating uncertainty from repeated runs without MCMC by running the pipeline

R

times, computing empirical variance

{\hat{σ}}_{rep}^{2} = \frac{1}{R} \sum_{r = 1}^{R} {(Q^{(r)} - \bar{Q})}^{2}

where

\bar{Q} = \frac{1}{R} \sum_{r = 1}^{R} Q^{(r)}

This contrasts with MCMC variance

{\hat{σ}}_{mcmc}^{2} = \hat{Var} (Q)

Repetition variance captures instability from stochastic pipeline elements only, whereas MCMC variance reflects also model and decision uncertainty given the probabilistic model. This makes MCMC intervals preferable for compliance, providing calibrated, interpretable uncertainty bounds aligned with regulatory coverage requirements.

4. Proposed Algorithms

4.1. ReMatch++: Schema & Mapping Co-Pilot

In this section, we propose the “ReMatch++” algorithm, a schema mapping co-pilot designed to maximize Semantic-Weighted Mapping Correctness (SWMC), while enabling transparent, human-in-the-loop data integration. The key innovation is the use of distributed LLM-based mapping proposals, scored and validated by both AI and deterministic policy checks, with all candidate rationales sampled for uncertainty using Markov Chain Monte Carlo (MCMC). The representation of the Algorithm is given in Figure 1, while the inner workings are given in Algorithm 1.

Algorithm 1 ReMatch++ Distributed Schema Mapping via LLM and MCMC

Require:: Source schema S, target schema T, distributed dataset partitions ${D_{j}}_{j = 1}^{J}$
1:: Retrieve distributed, sector-specific glossary $G$ via retrieval-augmented generation (RAG)
2:: for each partition $j = 1, \dots, J$ in parallel do
3:: Generate candidate mappings $M_{j} = LLMMap (G, S_{j}, T_{j})$
4:: for each mapping $m \in M_{j}$ do
5:: Compute semantic score $d_{sem} (m)$ and business weight $w_{biz} (m)$
6:: $S (m) \leftarrow α d_{sem} (m) + β w_{biz} (m)$
7:: if mapping m passes deterministic Spark validators then
8:: Propose m for human validation; log provenance $LineageID \leftarrow hash (D_{j}, m)$
9:: end if
10:: end for
11:: end for
12:: for $t = 1, \dots, T$ do ▹ MCMC sampling loop for uncertainty
13:: Sample alternative mapping proposals $M^{(t)}$ via proposal kernel $K$
14:: Compute SWMC for sample t:

${SWMC}^{(t)} = \frac{1}{N} \sum_{i = 1}^{N} w_{i} I [m_{i}^{(t)} valid] \cdot (1 - d_{sem} (m_{i}^{(t)}))$
15:: Store all accepted proposals and scores
16:: end for
17:: Aggregate final mapping, credible SWMC intervals, and produce auditable ETL artifacts

The procedure operates by first extracting relevant domain knowledge, then proposing mappings in parallel using LLMs and Spark computation. Scoring combines semantic and business criteria. All proposals are validated, versioned, and presented for expert review, with uncertainty sampling providing credible intervals.

Evaluation: The principal evaluation metric for ReMatch++ is the Semantic-Weighted Mapping Correctness (SWMC), formally defined as

SWMC = \frac{1}{N} \sum_{i = 1}^{N} w_{i} \cdot I [m_{i} is correct] \cdot (1 - d_{sem} (m_{i})),

where

w_{i}

denotes the business-critical importance of the i-th mapping,

I [\cdot]

is the indicator function for mapping correctness, and

d_{sem} (m_{i})

measures the semantic distance for mapping

m_{i}

. To rigorously estimate mapping reliability, we report the

95 %

credible interval for SWMC, as derived from MCMC posterior sampling over candidate mappings.

4.2. Consent-Aware Entity Resolution (C-ER) and Policy Classifier

In this section, we present the Consent-Aware Entity Resolution (C-ER) algorithm, which constructs high-fidelity identity graphs compliant with sector- and purpose-based consent policies. Our approach integrates distributed processing via Apache Spark and quantifies model uncertainty using Markov Chain Monte Carlo (MCMC) sampling. This framework is designed for scalability, explainability, and regulatory compliance across digital governance, marketing, and accounting datasets. The architecture of the algorithm is given in Figure 2 while the inner workings are given in Algorithm 2.

Algorithm 2 Consent-Aware Entity Resolution (C-ER)

Require:: Distributed entity dataset partitions ${E_{j}}_{j = 1}^{J}$ , purpose/consent policies $P$
1:: for each partition $j = 1, \dots, J$ in parallel do
2:: Block entities with composite feature hashing in Spark
3:: for each candidate pair $(e_{i}, e_{k}) \in BlockPairs (E_{j})$ do
4:: Compute feature similarity and context vector
5:: Use LLM to propose link probability $p_{i k}$ and rationale
6:: for $t = 1, \dots, T$ do ▹ MCMC calibration loop
7:: Sample link decision $z_{i k}^{(t)} \sim Bernoulli (p_{i k})$ (calibrated by in-context features)
8:: if $z_{i k}^{(t)} = 1$ and $Purpose (e_{i}, e_{k}) \in P_{allowed}$ then
9:: Link $(e_{i}, e_{k})$ ; tag with purpose/confidence; update graph
10:: else
11:: Flag or deny linkage; record for governance compliance
12:: end if
13:: end for
14:: end for
15:: end for
16:: Aggregate entity graph and audit logs; output confidence summaries from MCMC chains

The procedure efficiently partitions and blocks large datasets, and then proposes candidate links through a combination of text-based features and LLM predictions. It subsequently leverages MCMC sampling to quantify linkage uncertainty and calibration. All link proposals, denials, and rationales are transparently logged for audit and regulatory purposes. The residual risks from blocking, such as missed cross-block matches or over-dense blocks, are analyzed as failure modes in Section 6, along with mitigation strategies.

Evaluation: Algorithm performance is assessed using the Consent-aware Entity Resolution Score (CERS), defined as

CERS = α F_{1} + β {AUC}_{conf} + γ {BI}_{avg},

where

F_{1}

denotes pairwise clustering accuracy,

{AUC}_{conf}

is the area under the confidence calibration curve, and

{BI}_{avg}

is the average business impact of link decisions. Credible intervals for these metrics are derived from the posterior distribution sampled via MCMC:

{CI}_{95 %} (CERS) = 1.96 \sqrt{\frac{1}{T} \sum_{t = 1}^{T} {({CERS}^{(t)} - \bar{CERS})}^{2}} .

4.3. Doc2Ledger-LLM: Multimodal Extraction with Validators

In this section, we propose the “Doc2Ledger-LLM” algorithm, a robust multimodal document-to-ledger pipeline that is both distributed and uncertainty-aware. The design integrates Spark for large-scale document partitioning, large language models (LLMs) for field extraction, and MCMC sampling for quantifying extraction and validation reliability. This method is specifically tailored for accounting and audit workflows requiring traceable and reproducible evidence. The graphical representation of the proposed method is given in Figure 3 while the inner workings are given in Algorithm 3.

Algorithm 3 Doc2Ledger-LLM Multimodal Distributed Extraction

Require:: Distributed document partitions ${X_{j}}_{j = 1}^{J}$ , validator set $V$
1:: for each partition $j = 1, \dots, J$ in parallel do
2:: for each document $x \in X_{j}$ do
3:: Apply OCR and extract layout structure
4:: Use LLM to propose key field candidates $(f_{k})$ and content rationales
5:: for $t = 1, \dots, T$ do ▹ MCMC extraction chain
6:: Sample field extraction set $f^{(t)}$ using LLM context and layout priors
7:: for each field $f \in f^{(t)}$ do
8:: if field f passes deterministic checks in $V$ (e.g., totals, VAT, master data)
then
9:: Record f as valid; generate and log provenance hash
10:: else
11:: Propose auto-fix or flag f; route for human review
12:: end if
13:: end for
14:: end for
15:: Assemble and post full extraction $f^{*}$ to ledger with provenance
16:: end for
17:: end for
18:: Aggregate results; return confidence intervals from MCMC sampling

The Doc2Ledger-LLM process begins with distributed OCR and layout parsing of scanned or digital documents, followed by LLM-driven candidate field extraction for each document. MCMC sampling yields a chain of plausible field sets, allowing the estimation of extraction uncertainty and reliability. Each extracted data item is validated against deterministic business rules (totals, tax, and master entity lookups) coded in Spark, with failures prompting either an automatic correction or a request for human oversight. Ledger postings are recorded with hash-based provenance across all decision points, providing evidence-grade traceability that achieves near-complete coverage in our experiments.

Evaluation: The extraction- and ledger-writing performance is quantified using the Hierarchical Extraction Quality Metric (HEQM),

HEQM = \frac{1}{N} \sum_{i = 1}^{N} w_{i} F 1_{i} exp (- λ {var}_{i}),

where

w_{i}

is the field/business importance,

F 1_{i}

is field-level extraction accuracy, and

{var}_{i}

is the empirical variance from the MCMC samples. Provenance Fidelity Score (PFS) and other coverage assurance metrics are also reported, with all scores accompanied by

95 %

credible intervals estimated from the sample chains.

5. Architecture Patterns

This section formalizes three implementation blueprints through which Large Language Models (LLMs) enable intelligent big data management across digital governance, marketing, and accounting. The patterns are organized from access, to transformation, to assurance, thereby tracing how organizations first expose governed, explainable access to existing assets (Pattern A), then accelerate ingestion and transformation under human control (Pattern B), and finally institutionalize evidence-grade provenance (Pattern C). Each pattern is presented with a concise rationale, core components, governance considerations, sector-specific adaptations, and a compiler-safe TikZ illustration. Figure 4, Figure 5 and Figure 6 visualize the architectures.

5.1. Pattern A: RAG-over-Lakehouse for Governed Question Answering

Rationale. Retrieval-augmented generation (RAG) over a lakehouse foundation provides natural language access to enterprise data while enforcing least-privilege, consent, and policy constraints. This pattern marries open table formats with ACID guarantees and schema evolution [98,99] to a policy-aware RAG stack [100,101,102], augmented with enterprise trust controls for masking, grounding, and logging [103]. By separating retrieval from generation, the architecture improves factuality and auditability without duplicating data [101].

Core components. (i) A lakehouse foundation (e.g., Delta, Iceberg, Hudi) exposes versioned, governed tables and logs for both batch and streaming ingestion [98,99]. (ii) Semantic indexing and vector storage create embeddings of documents and structured records, binding privacy labels and access metadata to chunks for policy-time filtering [104]. (iii) A RAG orchestration engine routes queries, enforces consent and purpose-limitations, assembles grounded contexts, and prompts LLMs with citations [101]. (iv) A governance/audit layer provides immutable interaction logs, response rationales, and replayable retrieval traces for compliance and incident response [88,91,103]. Although, while the lakehouse architecture avoids duplication of primary data, semantic indexing necessarily introduces embedding replicas; these are governed via the same access, retention, and purpose-limitation controls as the underlying tables.

Governance and sector adaptations. Digital governance requires transparency-by-design and explainable responses suitable for public oversight [88,91]. Marketing emphasizes consent-aware retrieval across customer knowledge and partner documents [88,91]. Accounting prioritizes access segregation, sampling evidence, and documented rationales to satisfy external assurance [103].

Performance and risks. Empirical accounts show scalable querying with schema-aware routing in large repositories [104], while trust layers support masking and zero-retention to bound leakage. Residual risks include hallucination and privilege creep; mitigations include strict pre-generation filtering, deterministic post-validators, and full query audits [101,102].

5.2. Pattern B: ETL Co-Pilot at the Ingestion/Transform Stage (Human-in-the-Loop)

Rationale. Many benefits arise before analytics: accelerating ingestion, mapping, cleansing, and enrichment under explicit human control. The ETL Co-Pilot pattern positions LLMs as suggesters of transformations while preserving determinism through validators, approvals, and replayable execution [105,106,107,108,109,110,111,112]. This pattern complements Pattern A by improving upstream data fitness-for-use for downstream governed access.

Core components. (i) An orchestration engine triggers pipelines (batch/stream) and injects LLM checkpoints at planning and transform stages [106,109]. (ii) A transformation suggestion engine proposes cleanses and mappings from examples and documentation (“describe-and-suggest”), emitting machine-readable diffs [105,106,107]. (iii) A quality module profiles distributions, detects violations, and explains anomalies with recommended fixes and rollback plans [107,108,111]. (iv) A human-approval gateway routes high-impact changes to stewards, compliance officers, or technical reviewers, recording rationales for audit [108,110,111,112].

Governance and sector adaptations. Governments harmonize cross-agency records with statutory transparency and citizen rights in mind; marketing pipelines incorporate consent checks and preference management; accounting requires reviewer sign-off on financial transformations with segregation of duties [105,106,109,111].

Performance and risks. Organizations report substantial reductions in manual effort and high automation rates for routine transformations while retaining near-perfect accuracy for critical financial workflows [105,106,107,108]. Primary risks involve over-automation and validator gaps; mitigations include confidence thresholds, mandatory reviews at risk cutoffs, and change replay logs [109,110,111,112].

5.3. Pattern C: Lineage & Evidence Graph with LLM Annotation (Optional Blockchain Anchoring)

Rationale. To make access and transformation auditable, provenance must be captured, enriched, and attested. This pattern instruments data operations, uses LLMs to convert raw metadata into human-readable evidence, represents relationships in a graph, and optionally anchors critical events to a distributed ledger for tamper-evidence [12,19,20,22,100,113,114,115,116]. It closes the loop with Patterns A and B by providing replayability and independent verifiability.

Core components. (i) A provenance capture engine records parameters, timestamps, identities, and transformation code paths with minimal overhead [12,19,20]. (ii) An LLM annotation service explains why transformations occurred, mapping to business rules and regulatory obligations [45,117,118]. (iii) A graph database represents entities, transformations, and access trails for impact analysis and change management [12,19,20,22]. (iv) An optional blockchain anchoring layer timestamps hashes of evidence objects and critical events, enabling later integrity verification [100,113,114,115,116].

Governance and sector adaptations. Public agencies require explainable decision trails and citizen-data transparency; marketing benefits from dynamic consent lineage; accounting relies on replayable evidence chains and immutable trails for continuous audit [100,102,113,114,115].

Performance and risks. Prior work demonstrates enterprise-scale lineage with modest overhead and sub-second graph queries [12,20,22]. Optional blockchain anchoring adds latency but enhances immutability; gas- and throughput-optimized designs mitigate operational costs in practice [100,113,115]. Risks include incomplete capture and stale explanations; mitigations include hybrid capture (static + runtime), coverage SLOs, spot audits, and attestation schedules [12,20,45,117].

5.4. Cross-Pattern Integration and Deployment Progression

Patterns are intentionally composable. Many organizations begin with Pattern A to deliver immediate, governed access, then adopt Pattern B to reduce upstream friction in ingestion and transformation, and finally institutionalize Pattern C for durable assurance and continuous audit. Shared services—LLM serving, embedding indices, catalogs, policy/consent engines, and logging—enable cost-effective reuse across patterns [103]. In regulated environments, a prudent progression is to (i) stand up governed RAG with strict pre-generation filters and replayable logs [101,102], (ii) introduce ETL co-pilots with confidence thresholds and mandatory approvals at risk cutoffs [105,106,107,111,112], and (iii) deploy lineage/evidence graphs, optionally anchored to a ledger where immutability is paramount [12,20,22,113,114,115].

6. Sectoral Implementations: Evidence, Effective Practices, and Failure Modes

Building on Section 2 and Section 5, we examine implementations in digital governance, digital marketing, and accounting/audit. We shift from capability to constraint. First, we discuss clear practices that align with each sector. Then, we explore common failure modes that stem from legal, organizational, or technical mismatches. Across sectors, assurance focuses on four control points: consent, policy enforcement, provenance, and human oversight. These are applied at the access, transformation, and provenance layers. Figure 7, Figure 8 and Figure 9 illustrate these points of insertion.

6.1. Digital Governance

Effective practices. LLMs are highly flexible models, but in digital governance they add the most consistent value when that flexibility is constrained to policy-aligned data plumbing and explicitly bounded tasks [119,120]. Statutory corpora support high-precision policy tagging and regulatory classification; systems such as LegiLM seed automated workflows for GDPR-oriented detection [88,121,122]. “Citizen 360” views remain tenable only under strict consent and purpose limitation with policy-aware lakes and rationale-bearing retrieval logs [87,123,124,125]. Failure modes. (i) Insufficient provenance: reasoning not anchored in executable lineage rarely meets auditability; immutable logging and site-spanning evidence queries mitigate at non-trivial integration cost [101,114,122,126,127,128]. (ii) Audience-aligned explainability: authorities must justify outcomes across jurisdictions; cross-border harmonization elevates requirements engineering and validation burdens [122,125,129]. These motivate replayable, consent-aware access (Pattern A), human approval at transformation points (Pattern B), and evidence-grade provenance (Pattern C).

6.2. Digital Marketing

Effective practices. LLM-augmented CDPs improve identity resolution and enrichment as third-party signals recede [130]. Hybrid association (LLM-derived features + probabilistic clustering) yields higher segment fidelity and activation quality [131,132]. Large-scale consent/preference parsing is feasible when free-form requests are translated into enforceable access policies with low latency [87,131,133]. LLM-assisted feature engineering and simulation accelerates design while reducing reliance on risky online tests [132,133]. Failure modes. Profile drift from over-weighted inferred traits degrades personalization, amplified by aggressive compression/merging [134,135]. Manipulative content risk emerges as conversion-centric prompts drift toward dark patterns [136]. Consent scope creep arises when retrieval/enrichment extend beyond authorized purposes; progressive prompting must be fenced by policy engines with comprehensive logging [87,131,137]. These support governed access (Pattern A), co-piloted transformations with review (Pattern B), and explicit consent lineage (Pattern C).

6.3. Accounting & Audit

Effective practices. Accuracy improves when multimodal extraction (layout/OCR + LLM field semantics) is combined with domain validators; neural OCR and layout-aware models outperform legacy pipelines [92,100,138]. Memory-augmented and agentic designs raise domain-specific extraction performance and harden AP automation [94,100,139]. Automated control testing leverages LLMs for relation extraction and rule mapping under reviewer oversight [140,141,142,143]. Continuous evidence collection pairs governed retrieval with finance-specific QA over hierarchical/tabular reports (e.g., 10-K) [141,142,144,145]. Failure modes. Reproducibility vs. probabilism: auditors require identical outputs for identical inputs; randomness must be bounded and determinism enforced at control points [141,146]. Sampling bias/drift: historical training can miss emergent fraud or rules [141,147]. Evidence sufficiency: black-box rationales conflict with ISA/SOX documentation; justification artifacts, executable lineage, and human sign-off are mandatory [122,141]. These favor Pattern B (validators/approvals) and Pattern C (evidence graphs/attestation), with Pattern A reserved for read-only, consent- and role-governed queries.

Synthesis Across Sectors

All three domains adopt governed LLM components but differ in admissibility thresholds. Governance prioritizes transparency and legal interoperability, making consent-aware access and replayable provenance non-negotiable [87,114,121,122,123,125,126,129]. Marketing emphasizes agility under consent with explicit consent lineage and policy-constrained enrichment to avoid manipulation and scope creep [131,133,136,137,148]. Accounting requires determinism and evidentiary sufficiency via validators, sign-offs, and executable lineage [138,140,144,149,150]. In practice, the composition of Pattern A (governed access), Pattern B (human-in-the-loop transformation), and Pattern C (evidence-grade provenance) yields sector-fit assurance.

7. Evaluation and Metrics

7.1. Data Management Performance Indicators

7.1.1. Schema Mapping Accuracy

Schema mapping accuracy is a foundational indicator for LLM-enabled data integration. We propose the Semantic-Weighted Mapping Correctness (SWMC), which extends precision–recall-style measures by incorporating semantic distance:

SWMC = \frac{\sum_{i = 1}^{n} w_{i} \cdot I (m_{i} = m_{i}^{*}) \cdot (1 - d_{sem} (s_{i}, t_{i}))}{\sum_{i = 1}^{n} w_{i}},

(1)

where

w_{i}

denotes business-criticality weight for mapping i,

I (\cdot)

is the indicator of correctness,

m_{i}

and

m_{i}^{*}

are predicted and ground-truth mappings, and

d_{sem} (s_{i}, t_{i})

is an embedding-based semantic distance between source field

s_{i}

and target field

t_{i}

[72,73]. Contemporary implementations report SWMC in the range

0.78

–

0.92

across domains; for example, a ReMatch-like framework attains

SWMC \approx 0.89

without predefined training data. The semantic term captures partial-correctness cases in which mappings are semantically near despite syntactic deviation [72,78].

7.1.2. Entity Resolution Precision–Recall–F1

Entity resolution requires balanced evaluation of precision, recall, and clustering quality. We define the Contextual Entity Resolution Score (CERS):

CERS = α \cdot F_{1} + β \cdot {AUC}_{conf} + γ \cdot {BI}_{avg}, α + β + γ = 1,

(2)

with

F_{1} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

,

{AUC}_{conf}

the area under the confidence calibration curve, and

{BI}_{avg}

the average business impact score of resolved entities. Weights

(α, β, γ)

enable sector-specific prioritization [79,81]. State-of-the-art systems exceed

CERS \geq 0.85

; in-context clustering yields up to

150 %

relative gains over pairwise methods and reduces complexity

\sim 5 \times

, while calibration exposes uncertainty biases for risk-based oversight [79,81,82,110].

7.1.3. Constraint Repair Rate

To assess data-quality remediation beyond detection, we propose the Automated Quality Enhancement Index (AQEI):

AQEI = \frac{\sum_{j = 1}^{m} c_{j} \cdot r_{j} \cdot (1 - F P_{j})}{\sum_{j = 1}^{m} c_{j} \cdot d_{j}},

(3)

where

c_{j}

is a criticality weight for quality dimension j,

r_{j}

the successful repair rate,

F P_{j}

the false-positive intervention rate, and

d_{j}

the detected issue frequency. The numerator penalizes over-aggressive corrections that introduce defects [84,85]. Reported

AQEI

values range

0.60

–

0.78

; financial data typically achieves

0.72

–

0.78

(clearer constraints), while unstructured text yields

0.60

–

0.68

[84,85].

7.1.4. Lineage Coverage Completeness

Provenance evaluation should aggregate multiple facets. We define the Provenance Fidelity Score (PFS):

PFS = w_{c} C + w_{a} A + w_{s} S + w_{v} V, w_{c} + w_{a} + w_{s} + w_{v} = 1,

(4)

where C is coverage (fraction of transformations documented), A annotation accuracy against expert validation, S semantic richness (NLG quality), and V verifiability (e.g., cryptographic integrity). Weights reflect sector priorities [12,20]. State-of-the-art systems score

PFS = 0.82

–

0.91

; blockchain anchoring improves V (

\geq 0.95

) at modest latency (5–

10 %

) [20,22,114,126]. Governance often emphasizes C and V (e.g.,

w_{c} = 0.35

,

w_{v} = 0.35

); marketing may favor S for stakeholder comprehension (e.g.,

w_{s} = 0.40

).

7.1.5. Document Extraction F1

For document-to-structure extraction we propose the Hierarchical Extraction Quality Metric (HEQM):

HEQM = \sum_{k = 1}^{K} π_{k} \cdot F_{1, k} \cdot exp (- λ \cdot σ_{k}^{2}),

(5)

where

F_{1, k}

is field-level F1 for category k,

π_{k}

is business-importance weight,

σ_{k}^{2}

is extraction variance across instances, and

λ

a variance-penalty strength. The variance term rewards consistency under format drift [92,94,94]. Multimodal systems attain

HEQM = 0.81

–

0.93

on financial documents; memory-augmented designs improve

30.3 %

over single-LLM prompts, yet layout rotations beyond safe angles degrade

HEQM

by 15–

25 %

[93,94,94].

7.1.6. RAG Answer Faithfulness

For retrieval-augmented generation, we define the Grounded Response Integrity Score (GRIS):

GRIS = ω_{f} \cdot Faithfulness + ω_{a} \cdot Attribution + ω_{c} \cdot Completeness - ω_{h} \cdot Hallucination,

(6)

with weights

(ω_{f}, ω_{a}, ω_{c}, ω_{h})

tuned to sector risk [103,142,151]. Enterprise deployments report

GRIS > 0.87

on domain repositories; trust-layered frameworks exceed

90 %

confidence across sustainability, finance, and operations. Governance prioritizes faithfulness (e.g.,

ω_{f} = 0.45

), while marketing often prioritizes completeness (e.g.,

ω_{c} = 0.40

) [103,151,152].

7.2. Operational Efficiency Metrics

7.2.1. Time-to-Ingest Reduction

We define the Pipeline Acceleration Factor (PAF):

PAF = \frac{T_{baseline} - T_{LLM}}{T_{baseline}} \cdot (1 - \frac{E_{rework}}{E_{total}}),

(7)

where

T_{baseline}

and

T_{LLM}

are baseline and LLM-enhanced ingestion times, and the rework factor discounts shifted workload [153,154]. Deployments report

PAF = 0.55

–

0.82

; event-driven architectures sustain

\approx 750

KB/s for >100 MB payloads with <8% variance. Cloud-native designs reduce ingestion time 40–

60 %

via schema-agnostic processing and auto-configuration [109,153,154,155].

7.2.2. Rework Reduction Rate

We propose the Defect Prevention Index (DPI):

DPI = 1 - \frac{D_{downstream} + α \cdot C_{correction}}{D_{baseline}},

(8)

where

D_{downstream}

counts defects escaping downstream,

C_{correction}

is correction cost,

α

a cost-normalization factor, and

D_{baseline}

the historical defect baseline. Production systems yield

DPI = 0.48

–

0.74

; proactive anomaly detection attains

100 %

recall on injected pipeline faults with low false alarms, and ML-based monitors improve prevention 35–

50 %

over rules [156].

7.2.3. Human Review Minutes Saved

We define the Quality-Preserved Efficiency Gain (QPEG):

QPEG = \sum_{t = 1}^{T} (M_{saved, t} \cdot Q_{maintained, t} - λ_{esc} \cdot E_{escaped, t}),

(9)

where

M_{saved, t}

are review minutes saved,

Q_{maintained, t}

the quality maintenance ratio, and

E_{escaped, t}

quality escapes;

λ_{esc}

penalizes downstream impact [108,111,157]. Human-in-the-loop platforms report 40–

60 %

net productivity gains at ≥99% decision quality; adaptive learning improves QPEG 15–

25 %

over 12 months [107,108,110,158].

7.3. Governance Assurance Metrics

7.3.1. Audit Findings Resolution Rate

We introduce the Compliance Deficiency Closure Index (CDCI):

CDCI = \frac{\sum_{i = 1}^{N} s_{i} \cdot I (R_{i} \leq T_{i})}{N} \cdot (1 - \frac{F_{new}}{F_{baseline}}),

(10)

where

s_{i}

weights finding severity,

I (R_{i} \leq T_{i})

indicates on-time closure, and

\frac{F_{new}}{F_{baseline}}

penalizes newly introduced gaps [88,89,159]. Deployed systems score

CDCI = 0.71

–

0.88

with 60–

80 %

manual-review reduction; LegiLM achieves

83.9 %

precision in GDPR detection, enabling proactive prevention [86,88].

7.3.2. Privacy Incidents Avoided

We define the Privacy Risk Mitigation Score (PRMS):

PRMS = \sum_{j = 1}^{M} p_{j} \cdot (1 - e^{- τ D_{j}}) \cdot C_{j},

(11)

where

p_{j}

is incident probability without intervention,

D_{j}

the detection lead time,

τ

a steepness parameter, and

C_{j}

the consequence severity [86,87]. Advanced PII detectors prevent 85–

95 %

of potential incidents; multi-stage (rules+LLM) pipelines suppress false negatives to <5% while keeping false positives at 10–

15 %

[87].

7.3.3. Compliance Timeliness Achievement

We define the Regulatory Punctuality Index (RPI):

RPI = \frac{\sum_{k = 1}^{K} w_{k} \cdot exp (- β \cdot max (0, L_{k}))}{K},

(12)

where

w_{k}

weights requirement importance and

L_{k}

is lateness (negative if early);

β

controls penalty severity [89,89]. Automated preparation systems reach

RPI > 0.92

with 40–

60 %

reporting-effort reduction; NLP verification detects likely violations pre-deadline [89,160].

7.4. Strategic Decision Impact Indicators

7.4.1. Decision Cycle Time Reduction

We define the Information-to-Action Velocity (IAV):

IAV = \frac{1}{T_{decision}} \cdot Q_{decision} \cdot U_{utilization},

(13)

where

T_{decision}

is time from request to execution,

Q_{decision}

is outcome quality, and

U_{utilization}

is the deployment rate of LLM-derived insights [142,151]. Reported gains show 45–

65 %

cycle-time reduction with stable or improved quality; natural-language query broadens access for non-technical stakeholders [103,149,151].

7.4.2. Business Lift Proxies

We define the Aggregate Business Value Score (ABVS):

ABVS = \sum_{m = 1}^{M} θ_{m} \cdot \frac{Δ P_{m}}{σ_{P_{m}}} \cdot I (p_{m} < 0.05),

(14)

where

θ_{m}

weights metric importance,

Δ P_{m}

is improvement,

σ_{P_{m}}

the historical variance, and

I (p_{m} < 0.05)

enforces statistical significance [161]. Marketing reports ABVS gains of

1.8

–

2.4 σ

with 15–

25 %

conversion lift; governance shows

1.5

–

2.1 σ

on service metrics; accounting exhibits

2.2

–

2.8 σ

on audit-efficiency indicators [141].

7.5. Cross-Sector Pattern Portability Framework

We formalize Pattern Transfer Feasibility (PTF) across sectors

(i \to j)

:

{PTF}_{i \to j} = ω_{t} T_{i, j} + ω_{r} (1 - R_{Δ}) + ω_{o} O_{i, j} - ω_{c} C_{adapt},

(15)

where

T_{i, j}

measures technical compatibility,

R_{Δ}

regulatory divergence,

O_{i, j}

operational similarity, and

C_{adapt}

adaptation cost; weights reflect organizational priorities [150,162].

7.5.1. Technical Infrastructure Portability

Table 7 summarizes technical portability and typical adaptation requirements across sector pairs.

Overall, infrastructure elements are highly portable (mean

\bar{PTF} \approx 0.88

); base LLMs and vector stores transfer readily, while blockchain portability is tempered by sector-specific consensus and regulatory constraints [115,163].

7.5.2. Prompt Engineering Adaptation Requirements

Prompting is the major adaptation lever in cross-sector transfer. We distinguish three tiers [14,165]: Tier 1 (PTF

\geq 0.85

) universal prompts (data quality checks, format validation, basic entity extraction; ≤10% edits) [149]; Tier 2 (

0.60 \leq PTF < 0.85

) domain-contextual prompts (schema mapping, semantic relations, constraints; 25–

40 %

edits with terminology/examples) [72,73]; Tier 3 (PTF

< 0.60

) sector-specific prompts (compliance, risk, domain reasoning; ≥60% reconstruction with legal language and reasoning chains) [88,121]. We quantify effort as

PAE = α_{0} + \sum_{l = 1}^{L} β_{l} n_{l} c_{l},

(16)

where

α_{0}

is base evaluation effort,

n_{l}

the number of prompts at tier l, and

c_{l}

the average adaptation complexity. Empirical projects report 120–480 engineering hours depending on sector distance and pattern scope [163,165,166,167].

7.5.3. Governance Guardrail Configuration

Guardrail transferability varies with regulation and risk tolerance (Table 8) [150,168].

We capture regulatory alignment via

RAI = 1 - \frac{\sum_{g = 1}^{G} |F_{source} (g) - F_{target} (g)|}{\sum_{g = 1}^{G} max (F_{source} (g), F_{target} (g))},

(17)

where

F_{source}

and

F_{target}

encode guardrail stringency; higher

RAI

implies lower adaptation burden [121,125,167,171].

7.5.4. Human Oversight and Approval Workflow Adaptation

Human-in-the-loop transferability averages

PTF \approx 0.68

and varies with decision criticality (Table 9) [107,111].

Escalation thresholds follow a cost-risk optimum:

θ_{escalate} = arg min_{θ} (C_{review} P (Review ∣ θ) + C_{error} P (Error ∣ θ)),

(18)

with conservative

θ

(e.g., ≥0.95) in governance and lower thresholds in marketing (e.g., ≥0.75) [107,111,157,172].

7.5.5. Comprehensive Transferability Decision Matrix

Table 10 aggregates portability, prompt adaptation, guardrail reconfiguration, and approval workflow complexity into an overall PTF and duration estimate.

The overall score aggregates dimensions as

{PTF}_{overall} = \frac{w_{t} {PTF}_{tech} + w_{p} {PTF}_{prompt} + w_{g} {PTF}_{guard} + w_{a} {PTF}_{approval}}{w_{t} + w_{p} + w_{g} + w_{a}},

(19)

with weights reflecting sector priorities: governance often emphasizes guardrails and approvals (

w_{g} = 0.35

,

w_{a} = 0.30

); marketing emphasizes technical and prompt portability (

w_{t} = 0.35

,

w_{p} = 0.30

); accounting balances all with elevated guardrail emphasis (

w_{g} = 0.30

) [150,167,168,173].

7.6. Strategic Transfer Implementation Roadmap

A phased roadmap improves transfer success [150,167,168]:

Phase 1: Feasibility (2–4 weeks). Quantify PTF; elicit stakeholder requirements; map regulatory constraints [171].
Phase 2: Infrastructure (4–8 weeks). Fine-tune base LLMs; adapt vector schemas; model graph relationships [163,167].
Phase 3: Prompts (3–6 weeks). Adapt prompts by tiers; curate evaluation sets; establish benchmarks [165,166].
Phase 4: Guardrails (4–8 weeks). Integrate legal frameworks; configure bias monitors; author explanation templates [121,168].
Phase 5: Oversight (3–5 weeks). Implement approvals; calibrate thresholds; test interfaces [107,111].
Phase 6: Pilot & Refine (6–10 weeks). Limited production; monitor KPIs; iterate on operational feedback [167,174].

Typical end-to-end transfer completes in 22–41 weeks depending on sector distance and readiness, with 65–

85 %

of expected benefits realized within 12 months of full deployment [150,163,167].

8. Experimental Results

8.1. Architecture Evaluation

In this part we assess the suggested distributed data management architecture with the LLM and which we have tested within three enterprise domains, which are digital governance, marketing, and accounting. The system uses Apache Spark and Markov Chain Monte Carlo (MCMC) to determine the uncertainty and verify the consistency of performance. In order to obtain a clear vision of the situation, we are interested in three main metrics, namely Semantic-Weighted Mapping Correctness (SWMC), Consent-Aware Entity Resolution Score (CERS), and Hierarchical Extraction Quality Metric (HEQM).

As shown in Figure 10, the performance of ReMatch + algorithm is shown in terms of mapping in all three regions, and this study has indicated that the algorithm has consistently recorded high performance on mapping. Mean values of SWMC are not less than 0.90, which is an indication of well-developed semantic consistency, and high mapping accuracy. Interestingly, the small credible intervals found using MCMC sampling also provide an indication that the system gives stable and reproducible results, even when executed in a distributed manner. Within a single cluster configuration, repeated runs show stable aggregates; however, quantifying variance across heterogeneous nodes and clusters is left as future work, and the claim of distributed reproducibility should be interpreted in that scope.

Figure 11 gives a summary of the performance of the Consent-Aware Entity Resolution (C-ER) module. In this case, accuracy in terms of F1 is high in every sector, indicating the ability to recognize an entity accurately, and confidence calibration (AUCconf) is in line with the desired probabilities—a vital aspect of privacy and consent compliance. The business impact score (BIavg) is also high, implying that the module provides practical operational advantages. These probabilistic assessments are strong throughout because of the small uncertainty bands.

Figure 12 represents the Doc2Ledger-LLM system that has HEQM scores that are greater than 0.85 in most of the tested conditions. Performance only drops at a slow rate, even when layouts have been rotated or templates have been distorted considerably. This represents the power of multimodal feature fusion in combination with deterministic validation on our distributed architecture.

Collectively, these findings point to an impression that the architecture is both trustworthy and understandable: semantic accuracy in schema mapping (SWMC) is always high, entity resolution (CERS) is accurate and consent-sensitive, and multimodal extraction (HEQM) is strong. The resulting metrics, which are a combination of distributed computation and MCMC-based uncertainty calibration, are truly a measure of the actual uncertainty in the world, which provides transparency and accountability in the governance, marketing, and accounting environments.

The results in Table 11 demonstrate that the LLM-enabled pipeline consistently outperforms classical approaches across key metrics, with particularly strong gains in semantic understanding (F1, F2), multimodal extraction robustness (F6), and policy-aware retrieval accuracy (F7). These improvements validate the integration of LLMs within the Spark-orchestrated, MCMC-calibrated framework for enterprise data management tasks.

8.2. Distributed MCMC and Spark Diagnostics

We next examine how distributed MCMC samplers behave and converge when run across multiple Spark partitions. This analysis gives insight into how well the architecture handles large-scale uncertainty estimation while maintaining accuracy, stability, and computational efficiency.

Figure 13 shows that after a brief burn-in period, the chains mix quickly, stabilizing to produce meaningful posterior samples. This indicates that initialization effects are short-lived, and convergence is consistent across the distributed Spark environment.

Figure 14 complements this, showing that PSRF values remain below 1.05 and ESS values exceed 900. Together, these metrics indicate that distributed MCMC through Spark maintains both statistical soundness and reliable uncertainty estimates.

Figure 15 reveals how human reviewers respond to automated mappings across different confidence thresholds. Moderate thresholds yield over 85% acceptance, showing a good balance between reliability and efficiency, whereas stricter thresholds increase correctness at the cost of more human review effort. This emphasizes the familiar trade-off in governance-heavy systems: automation versus accountability.

As seen in Figure 16, the system achieves near-complete provenance coverage across all audit dimensions, integrating partition-level Spark operations with MCMC chain logs—critical for transparency in regulated environments.

Figure 17 highlights the architecture’s adaptability: patterns developed for one domain transfer effectively to others, underscoring flexibility beyond the initially evaluated sectors.

Performance scaling, summarized in Figure 18, shows that efficiency improves steadily as dataset size grows, confirming the distributed setup handles large enterprise workloads effectively.

Finally, Figure 19 compares the distributed LLM + MCMC system against traditional rule-based, standalone ML, and standard LLM approaches. The framework consistently outperforms alternatives, with the gradient visually emphasizing the gap in performance and cross-domain generalization.

8.3. Spark Cluster Resource Utilization Analysis

In order to analyze and optimize the workload distribution in the Spark-driven pipeline, a detailed resource utilization study was conducted across the cluster’s four main worker nodes. Each node’s utilization profile was visualized using a heatmap, highlighting CPU, memory, and network (bandwidth) usage during the main evaluation tasks. The following summary highlights key differences and roles of each node in the experiment:

Node1: High on CPU and memory, moderate network. Versatile, handles a balanced workload.
Node2: Peak memory (90%), moderate CPU. Suited for memory-intensive operations (e.g., large joins).
Node3: High usage across all resources; handles the heaviest workload and may be a bottleneck.
Node4: Lowest resource utilization; opportunity to assign more work or rebalance load.

Visual inspection of the utilization heatmap (see Figure 20) reveals distinct workload distribution patterns, with Node3 emerging as the cluster’s most heavily utilized worker, and Node4 as the least loaded. This suggests potential for Spark’s internal scheduler to further optimize load balancing in future runs. Additionally, the Spark Resource Utilization is show in Table 12.

The results in Figure 21 show that the Spark Job Stage Breakdown provides insights into time allocation for different computational stages across each workflow run. Each stage is visualized using a gradient of blue tones, with direct numeric annotation on every segment to facilitate rapid comparison. The distribution of time spent indicates that:

MCMC Sampling and UDF Compute are the dominant contributors to total job runtime across all jobs.
Shuffle Read and Shuffle Write exhibit moderate durations, reflecting typical Spark overhead.
The balance of stage durations is broadly consistent, with some variation in MCMC load and validation time between runs.

This visualization highlights where optimization efforts should be focused to reduce bottlenecks and accelerate the Spark-LLM pipeline. Overall, these results suggest that combining distributed MCMC with Spark orchestration allows scalable, interpretable, and reliable inference. Mapping precision, entity resolution, and multimodal extraction remain strong across varied enterprise datasets, while diagnostics confirm convergence integrity. At the same time, these findings raise broader questions around interpretability, adaptability, and long-term sustainability across evolving technical and regulatory landscapes. The next section explores these broader implications in greater detail.

9. Discussion

The interpretation and synthesis of the findings are presented in this section. Overall, the results of this paper suggest that the proposed distributed LLM-MCMC framework will result in improved performance on a number of different tasks, including schema mapping and entity resolution, and document structure extraction. Besides the raw accuracy, the system also appears to favor governance compliance and general reliability, which is encouraging when it comes to work in practice implementations of the system. The existence of probabilistic calibration, via Markov Chain Monte Carlo sampling, and deterministic checks, which are enforced by type- and rule-based validators, seem to be one of the most significant ones. This bilateral approach helps in trade-offs of a high accuracy and retention of the entire traceability.

Among the most interesting attributes is that uncertainty is expressed clearly via credible intervals and all the validated output is contained into detailed audit logs. Practically, this implies that these model outputs are not merely numbers or predictions, but verifiable and reproducible things that can be inspected and audited, that are vital in regulated processes. Notably, these advantages seem to be consistent between the seven functional dimensions (F1–F7) and hold well between the three architectural models (A–C), suggesting that similar methodological principles can be applied to dissimilar contexts of operations with only slight modifications made to them.

Key Findings

Across sectors and functions, three consistent patterns were observed:

LLM-enabled functions F1–F7 improved semantic correctness and coverage over classical baselines, particularly for schema mapping, entity resolution, and document extraction.
The Spark-orchestrated architecture scaled predictably with data volume and cluster size, keeping runtimes within operational bounds for large synthetic and real workloads.
MCMC-based uncertainty quantification yielded calibrated credible intervals better aligned with audit and compliance practices than simple repetition-based variance.

Methodological Implications

Methodologically, it appears to be more practical to undertake Apache Spark orchestration and MCMC sampling as a way to establish a sensible connection between statistical rigor and neural model flexibility. Posterior uncertainty. The credible intervals on significant metrics can be a useful control that e.g., enabling decisions regarding whether to automate further or whether to resample further.

Practically, probabilistic and deterministic calibration and validation are complementary to each other: MCMC sampling can be used to estimate the level of ambiguity and to focus on the cases that are uncertain, whereas the validators can be applied to implement the domain-specific rules and business logic. Reproducibility is also supported by fixed random seeds, versioned prompts, and retrieval datasets, and immutable provenance keys associated with every decision. That is, as opposed to remaining fixed performance figures, the ranges that are being reported in tables, like Table 2, become actionable governance and operational assurance benchmarks.

Taking a sectoral lens through the results, it can be seen that there are some interesting patterns. Although various fields such as digital governance, marketing, and accounting use different thresholds and oversight policies, overall, the structure underpinning these policies is quite similar. Policies of massive lineage and transparent access are very important in digital governance, with retrieval and generation highly policy-gated and traceable rationale logs traced back to particular decisions. In marketing, flexibility is given priority within consent limits, where consent-aware entity resolution and clause extraction allow personal campaigns to be made without breaching privacy standards. In auditing and accounting, strict determinism and reproducibility are essential; multimodal data extraction should not be compromised by using any type of segregation of duty as a way of ensuring the evidentiary integrity.

Regardless of such distinctions, the unified background of the governed retrieval, provenance anchoring, and human-in-the-loop validation seems to be generally applicable. This confirms the applicability of the architectural patterns A-C to sectors, and it is very consistent with the empirical data being summed up in Table 2, Table 3, Table 4 and Table 5.

The framework has a lot of potential, but a number of practical and methodological constraints can be identified. Originally, scalability and context management may remain difficult, especially when dealing with very large datasets or mixed-format data, or legacy and unstructured data. Second, the critical importance of preprocessing, metadata control, and consistent documentation is essential to get credible results; the lack of data hygiene may cause bias even with a powerful set of validators. Third, very technical fields might be beyond the overall competencies of existing LLMs and lead to plausible but not entirely accurate outputs. Fourth, distributed inference and posterior sampling are computationally and energy-prohibitive, and this might restrict their use in cost-sensitive or edge-based settings. Lastly, it is essential to have evaluation rigor: it is easy to induce perceived improvements on synthetic data or unfinished ground truth, and the automation bias of review can overestimate perceived accuracy. These considerations indicate that there is a necessity to automate controlled, repeated regression tests as well as systematic ablation studies to gather external validity across time.

According to our results, the following practices may be used in order to make organizations adopt such architectures successfully:

Manage non-determinism carefully. Fix random seeds, lock retrieval datasets before deployment, and require validators to approve each state-changing operation. Use uncertainty intervals to guide when human review is necessary.
Define and track provenance standards. Monitor lineage completeness, readability, and replay accuracy, and integrate provenance checks into CI/CD pipelines and version control processes.
Adopt risk-aware human review. Direct expert attention based on uncertainty, policy relevance, and potential operational impact, and store reviewer insights as part of the verifiable record.
Ensure policy and consent compliance. Implement policies as code, link consent lineage to models and embeddings, and maintain immutable pre-generation logs for traceability.
Optimize efficiency. Reduce latency and costs through smart routing, batching, and cache reuse. Modular adapters or mixture-of-experts setups can substitute when scalability and speed are priorities.

Collectively, these recommendations can be used to operationalize functional categories (F1–F7) in architectural patterns (A–C), which makes sure that the performance is in line with the governance expectations and regulatory ones.

10. Conclusions and Future Work

The current paper proposes an effective and flexible system of utilizing large language models (LLMs) as credible, responsible, and verifiable units in a large-scale data infrastructure. This architecture combines transformer-based inference, Apache Spark orchestration, and Markov Chain Monte Carlo (MCMC) sampling into one modular architecture. The integration is not only enabling to perform the usual prediction tasks but also provides the means of measuring uncertainty, decision-making, and keeping exhaustive provenance on the distributed data pipelines, features that are often overlooked or insufficiently tackled in many working environments.

The framework was tested in three different regulated areas, which are digital governance, marketing, and accounting. Every area has its own data structures, workflow, and compliance issues. We have found that the same basis, which involves controlled recollection, validator-driven production, and unchangeable ancestry, can, with few changes, be reconfigured to meet the particular policy, control mandates, and contextual delicacies of every industry. The framework is therefore shown to have a significant level of transferability without requiring large-scale re-engineering.

We have shown the effectiveness of the design in several dimensions with the use of empirical analysis. An example is the ReMatch++ schema-mapping module, which consistently scores high at Semantic-Weighted Mapping Correctness (SWMC) scores above 0.90, and has a high semantic accuracy even in the case of heterogeneous data sources. Likewise, the Consent-Aware Entity Resolution (C-ER) module attains high classification and well-calibrated predictive estimates, which translates into real business advantages via the Consent-Aware Resolution Score (CERS). The Doc2Ledger-LLM system also demonstrates a high level of performance using a wide variety of document layouts and templates, achieving high Hierarchical Extraction Quality Metric (HEQM) scores and demonstrating the resilience of the architecture to the variability of real-world data. Further, distributed MCMC diagnostics, implemented in Spark, provide support for the framework converging and being reproducible at scale, ensuring that outputs can be traced and audited reliably in the enterprise environment. Taken together, these findings indicate that the combination of uncertainty-aware modeling and deterministic validation would help overcome the gap between neural generalization and organizational responsibility- a consideration deemed very important when using LLMs in high-stakes, regulated settings.

However, although these encouraging findings were made, there still exist various challenges and practical issues. Large-scale sampling and distributed inference place large computational loads, frequently incurring large energy and resource overheads, and may be restricted to run on resource-constrained or edge environments. Inconsistency in data, such as legacy or semi-structured assets, requires careful preprocessing and a stringent check of the data to maintain reliable results. Also, the changing regulatory environment in terms of consent, data retention, fairness, and transparency requires the constant revision of governance structures and the implementation of policy-as-code. Lastly, although human supervision is a crucial measure to address automation bias, the series of review processes, escalation plans, and the definition of the tasks of the automated systems and human analysts is an unsolved issue.

In the future, several areas of research and development paths can be identified. To begin with, scalability will probably require more efficient methods of uncertainty-estimation, perhaps using amortized inference, sequential Monte Carlo (SMC), or variational models with diagnostics built in to streams or continuously updated data. Second, domain adaptation (through retrieval-tuned configuration and specialized adapters) would be able to maintain determinism with a benefit to contextual accuracy in a variety of operational environments. Third, policy-as-code and counterfactual audit systems that can formally verify compliance processes might be useful in automating compliance processes, which can be stress-tested under complex or adversarial conditions to apply in privacy, consent, and fairness. Lastly, evaluation protocols must be developed more than mere measures of accuracy, including measures of provenance completeness, latency, cost, and evidentary sufficiency, hence providing a more realistic and operationally significant view of system performance. The advancement of these spheres, we believe, will make intelligent data management more scalable and more reliable, and justifiable.

In addition to methodological extensions, there are a number of practical opportunities to this work. Computational budgeting and adaptive sampling strategies can minimize overhead without ensuring reduced reliability. Privacy-preserving learning systems, selective disclosure systems, and user-oriented uncertainty displays may make a system both more ethical and usable at the same time. Furthermore, we will explore usages in fields where traceability and compliance are especially important, including healthcare logistics, climate risk analysis, and procurement in the public sector. The addition of continuous learning, regression tests, and continuous lineage into operational pipelines could be an addition to reproducibility and confidence. Finally, formal techniques of assurance, such as specification mining and runtime verification, have the possibility of offering mathematically-based trust in scalable LLM-driven systems.

Overall, the work offers a framework that radically reinvents the use of large language models in an enterprise context: it can no longer be viewed as a predictive model, but as a structure that can be verified, held accountable, and audited when it comes to handling sensitive data. The combination of probabilistic calibration, deterministic guardrails, and strict provenance tracking provided by the system generates tangible value while meeting the high standards required of regulated industries. Going forward, we intend to keep building AI systems that are both accurate and efficient, as well as transparent, reproducible, and controlled in a manner that an organization can have confidence in to be operational, and that we consider to be a requirement to responsibly deploy AI into high-stakes decision-making.

Author Contributions

L.T., A.K., C.K., G.A.K., A.G. and C.-P.B. conceived the idea, designed and performed the experiments, analyzed the results, drafted the initial manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ABVS	Aggregate Business Value Score
ACID	Atomicity, Consistency, Isolation, Durability
AQEI	Automated Quality Enhancement Index
AUC	Area Under the Curve
AUCconf	Area Under the Confidence Calibration Curve
BI	Business Impact
BIavg	Average Business Impact
C-ER	Consent-Aware Entity Resolution
CCPA	California Consumer Privacy Act
CDP	Customer Data Platform
CDCI	Compliance Deficiency Closure Index
CERS	Consent-aware Entity Resolution Score
CRM	Customer Relationship Management
DB	Database
DLP	Data Loss Prevention
DPI	Defect Prevention Index
EM	Expectation-Maximization
ER	Entity Resolution
ESS	Effective Sample Size
ETL	Extract-Transform-Load
GDPR	General Data Protection Regulation
GLMM	Generalized Linear Mixed Model
GRIS	Grounded Response Integrity Score
HEQM	Hierarchical Extraction Quality Metric
IAV	Information-to-Action Velocity
IFRS	International Financial Reporting Standards
IIoT	Industrial Internet of Things
ISA	International Standards on Auditing
KG	Knowledge Graph
LSTM	Long Short-Term Memory
MCMC	Markov Chain Monte Carlo
ML	Machine Learning
MOE	Mixture of Experts
NLG	Natural Language Generation

PII	Personally Identifiable Information
PSRF	Potential Scale Reduction Factor
RAG	Retrieval-Augmented Generation
RBAC	Role-Based Access Control
RPI	Regulatory Punctuality Index
RPA	Robotic Process Automation
RQ	Research Question
SOTA	State-of-the-Art
SOX	Sarbanes-Oxley Act
SWMC	Semantic-Weighted Mapping Correctness
TED	Tenders Electronic Daily
UDF	User Defined Function
VAT	Value Added Tax
VL/HCC	Visual Languages and Human-Centric Computing

References

Zhang, M.; Ji, Z.; Luo, Z.; Wu, Y.; Chai, C. Applications and Challenges for Large Language Models: From Data Management Perspective. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 5530–5541. [Google Scholar] [CrossRef]
Fan, W.; Wu, P.; Ding, Y.; Ning, L.; Wang, S.; Li, Q. Towards Retrieval-Augmented Large Language Models: Data Management and System Design. In Proceedings of the 2025 IEEE 41st International Conference on Data Engineering (ICDE), Hong Kong, 19–23 May 2025; pp. 4509–4512. [Google Scholar] [CrossRef]
Trummer, I. From BERT to GPT-3 codex: Harnessing the potential of very large language models for data management. arXiv 2023, arXiv:2306.09339. [Google Scholar] [CrossRef]
Karras, A.; Theodorakopoulos, L.; Karras, C.; Theodoropoulou, A.; Kalliampakou, I.; Kalogeratos, G. LLMs for Cybersecurity in the Big Data Era: A Comprehensive Review of Applications, Challenges, and Future Directions. Information 2025, 16, 957. [Google Scholar] [CrossRef]
Jadhav, A.; Mirza, V. Large Language Models in Equity Markets: Applications, Techniques, and Insights. Front. Artif. Intell. 2025, 8, 1608365. [Google Scholar] [CrossRef] [PubMed]
Kim, Y.; Xu, X.; McDuff, D.; Breazeal, C.; Park, H.W. Health-llm: Large language models for health prediction via wearable sensor data. arXiv 2024, arXiv:2401.06866. [Google Scholar] [CrossRef]
Fang, X.; Xu, W.; Tan, F.A.; Zhang, J.; Hu, Z.; Qi, Y.; Nickleach, S.; Socolinsky, D.; Sengamedu, S.; Faloutsos, C. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding–A Survey. arXiv 2024, arXiv:2402.17944. [Google Scholar]
Su, J.; Jiang, C.; Jin, X.; Qiao, Y.; Xiao, T.; Ma, H.; Wei, R.; Jing, Z.; Xu, J.; Lin, J. Large language models for forecasting and anomaly detection: A systematic literature review. arXiv 2024, arXiv:2402.10350. [Google Scholar] [CrossRef]
Carriero, A.; Pettenuzzo, D.; Shekhar, S. Macroeconomic forecasting with large language models. arXiv 2024, arXiv:2407.00890. [Google Scholar] [CrossRef]
Im, J.; Lee, J.; Lee, S.; Kwon, H.Y. Data pipeline for real-time energy consumption data management and prediction. Front. Big Data 2024, 7, 1308236. [Google Scholar] [CrossRef]
Chang, C.; Wang, W.Y.; Peng, W.C.; Chen, T.F. Llm4ts: Aligning pre-trained llms as data-efficient time-series forecasters. ACM Trans. Intell. Syst. Technol. 2025, 16, 60. [Google Scholar] [CrossRef]
Jacques-Silva, G.; Kalyvianaki, E.; Cohn-Gordon, K.; Meguid, A.; Nguyen, H.; Ben-David, D.; Nayak, C.; Saravagi, V.; Stasa, G.; Papagiannis, I.; et al. Unified Lineage System: Tracking Data Provenance at Scale. In Proceedings of the Companion of the 2025 International Conference on Management of Data, Berlin, Germany, 22–27 June 2025; pp. 457–470. [Google Scholar]
Luo, H.; Chuang, Y.S.; Gong, Y.; Zhang, T.; Kim, Y.; Wu, X.; Fox, D.; Meng, H.; Glass, J. Sail: Search-augmented instruction learning. arXiv 2023, arXiv:2305.15225. [Google Scholar] [CrossRef]
Mao, K.; Dou, Z.; Mo, F.; Hou, J.; Chen, H.; Qian, H. Large language models know your contextual search intent: A prompting framework for conversational search. arXiv 2023, arXiv:2303.06573. [Google Scholar] [CrossRef]
Ma, M.D.; Wang, X.; Kung, P.N.; Brantingham, P.J.; Peng, N.; Wang, W. STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18751–18759. [Google Scholar]
Zhang, J.; Zhang, H.; Chakravarti, R.; Hu, Y.; Ng, P.; Katsifodimos, A.; Rangwala, H.; Karypis, G.; Halevy, A. CoddLLM: Empowering Large Language Models for Data Analytics. arXiv 2025, arXiv:2502.00329. [Google Scholar] [CrossRef]
Li, X.; Dou, Z.; Zhou, Y.; Liu, F. Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 26–37. [Google Scholar] [CrossRef]
Singh, S.; Vorster, L. LLM Supply Chain Provenance: A Blockchain-Based Approach. In Proceedings of the International Conference on AI Research, Lisbon, Portugal, 5–6 December 2024; Academic Conferences and Publishing Limited: London, UK, 2024. [Google Scholar]
Hoffmann, N.; Pour, N.E. A low overhead approach for automatically tracking provenance in machine learning workflows. In Proceedings of the 2024 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), Vienna, Austria, 8–12 July 2024; pp. 567–573. [Google Scholar]
Korolev, V.; Joshi, A. Crystalia: Flexible and Efficient Method for Large Dataset Lineage Tracking. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 3431–3440. [Google Scholar]
Padovani, G.; Anantharaj, V.; Fiore, S. yProv4ML: Effortless Provenance Tracking for Machine Learning Systems. arXiv 2025, arXiv:2507.01078. [Google Scholar] [CrossRef]
Spoczynski, M.; Melara, M.S.; Szyller, S. Atlas: A framework for ml lifecycle provenance & transparency. arXiv 2025, arXiv:2502.19567. [Google Scholar] [CrossRef]
Lu, L.; An, J.; Wang, Y.; Kong, C.; Liu, Z.; Wang, S.; Lin, H.; Fang, M.; Huang, Y.; Yang, E.; et al. From text to cql: Bridging natural language and corpus search engine. arXiv 2024, arXiv:2402.13740. [Google Scholar] [CrossRef]
Xu, X.; Yao, B.; Dong, Y.; Gabriel, S.; Yu, H.; Hendler, J.; Ghassemi, M.; Dey, A.K.; Wang, D. Mental-llm: Leveraging large language models for mental health prediction via online text data. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Melbourne, Australia, 5–9 October 2024; Volume 8, pp. 1–32. [Google Scholar]
Ozcan, F.; Chung, Y.; Chronis, Y.; Gan, Y.; Wang, Y.; Binnig, C.; Wehrstein, J.; Kakkar, G.T.; Abu-El-Haija, S. LLMs and Databases: A Synergistic Approach to Data Utilization. IEEE Data Eng. Bull. 2025, 49, 32–44. [Google Scholar]
Liu, Z.; He, X.; Tian, Y.; Chawla, N.V. Can we soft prompt llms for graph learning tasks? In Proceedings of the Companion Proceedings of the ACM Web Conference, Singapore, 13–17 May 2024; pp. 481–484. [Google Scholar]
Lee, G.; Yu, W.; Shin, K.; Cheng, W.; Chen, H. Timecap: Learning to contextualize, augment, and predict time series events with large language model (LLM) agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 18082–18090. [Google Scholar]
Nako, P.; Jatowt, A. Navigating Tomorrow: Reliably Assessing Large Language Models Performance on Future Event Prediction. arXiv 2025, arXiv:2501.05925. [Google Scholar] [CrossRef]
Soru, T.; Marshall, J. Leveraging Log Probabilities in Language Models to Forecast Future Events. arXiv 2025, arXiv:2501.04880. [Google Scholar] [CrossRef]
Vadlapati, P. LML-DAP: Language Model Learning a Dataset for Data-Augmented Prediction. arXiv 2024, arXiv:2409.18957. [Google Scholar] [CrossRef]
Wu, Z.; Zhao, Y.; Wang, H. Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification. arXiv 2025, arXiv:2506.01631. [Google Scholar] [CrossRef]
Tang, N.; Chen, M.; Ning, Z.; Bansal, A.; Huang, Y.; McMillan, C.; Li, T.J.J. A study on developer behaviors for validating and repairing llm-generated code using eye tracking and ide actions. arXiv 2024, arXiv:2405.16081. [Google Scholar] [CrossRef]
Tang, N.; Chen, M.; Ning, Z.; Bansal, A.; Huang, Y.; McMillan, C.; Li, T.J.J. Developer Behaviors in Validating and Repairing LLM-Generated Code Using IDE and Eye Tracking. In Proceedings of the 2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), Liverpool, UK, 2–6 September 2024; pp. 40–46. [Google Scholar] [CrossRef]
Li, L.; Wang, P.; Ren, K.; Sun, T.; Qiu, X. Origin tracing and detecting of llms. arXiv 2023, arXiv:2304.14072. [Google Scholar] [CrossRef]
Zhu, J.; Xiao, M.; Wang, Y.; Zhai, F.; Zhou, Y.; Zong, C. TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification. arXiv 2025, arXiv:2503.15289. [Google Scholar] [CrossRef]
Nikolic, I.; Baluta, T.; Saxena, P. Model Provenance Testing for Large Language Models. arXiv 2025, arXiv:2502.00706. [Google Scholar] [CrossRef]
Chen, S.; Kang, F.; Yu, N.; Jia, R. FASTTRACK: Fast and Accurate Fact Tracing for LLMs. arXiv 2024, arXiv:2404.15157. [Google Scholar] [CrossRef]
Wang, J.; Crawl, D.; Purawat, S.; Nguyen, M.; Altintas, I. Big data provenance: Challenges, state of the art and opportunities. In Proceedings of the 2015 IEEE international conference on big data (Big Data), Santa Clara, CA, USA, 29 October–1 November 2015; pp. 2509–2516. [Google Scholar]
Zhang, D.; Zhoubian, S.; Hu, Z.; Yue, Y.; Dong, Y.; Tang, J. Rest-mcts*: Llm self-training via process reward guided tree search. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 64735–64772. [Google Scholar]
Tang, X.; Yang, X.; Yao, Z.; Wen, J.; Zhou, X.; Han, J.; Hu, S. DS-GCG: Enhancing LLM Jailbreaks with Token Suppression and Induction Dual-Strategy. In Proceedings of the 2025 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Compiegne, France, 5–7 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 273–278. [Google Scholar]
Michail, A.; Clematide, S.; Sennrich, R. Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples. arXiv 2025, arXiv:2502.08638. [Google Scholar]
Ziegler, I.; Köksal, A.; Elliott, D.; Schütze, H. Craft your dataset: Task-specific synthetic dataset generation through corpus retrieval and augmentation. arXiv 2024, arXiv:2409.02098. [Google Scholar] [CrossRef]
Jia, P.; Liu, Y.; Zhao, X.; Li, X.; Hao, C.; Wang, S.; Yin, D. Mill: Mutual verification with large language models for zero-shot query expansion. arXiv 2023, arXiv:2310.19056. [Google Scholar]
Kong, X.; Gunter, T.; Pang, R. Large language model-guided document selection. arXiv 2024, arXiv:2406.04638. [Google Scholar] [CrossRef]
Adam, D.; Kliegr, T. Traceable LLM-based validation of statements in knowledge graphs. Inf. Process. Manag. 2025, 62, 104128. [Google Scholar] [CrossRef]
Wittner, R.; Holub, P.; Mascia, C.; Frexia, F.; Müller, H.; Plass, M.; Allocca, C.; Betsou, F.; Burdett, T.; Cancio, I.; et al. Toward a common standard for data and specimen provenance in life sciences. Learn. Health Syst. 2024, 8, e10365. [Google Scholar] [CrossRef] [PubMed]
Vieira, M.; de Oliveira, T.; Cicco, L.; de Oliveira, D.; Bedo, M. From Tracking Lineage to Enhancing Data Quality and Auditing: Adding Provenance Support to Data Warehouses with ProvETL. In Proceedings of the 26th International Conference on Enterprise Information Systems (ICEIS 2024), Angers, France, 28–30 April 2024; Volume 1. [Google Scholar] [CrossRef]
Lau, G.K.R.; Niu, X.; Dao, H.; Chen, J.; Foo, C.S.; Low, B.K.H. Waterfall: Scalable framework for robust text watermarking and provenance for llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 20432–20466. [Google Scholar]
Longpre, S.; Mahari, R.; Chen, A.; Obeng-Marnu, N.; Sileo, D.; Brannon, W.; Muennighoff, N.; Khazam, N.; Kabbara, J.; Perisetla, K.; et al. The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI. arXiv 2023, arXiv:2310.16787. [Google Scholar] [CrossRef]
Hu, Y.; Nguyen, T.P.; Ghosh, S.; Razniewski, S. Enabling LLM knowledge analysis via extensive materialization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 16189–16202. [Google Scholar]
Karras, C.; Karras, A.; Avlonitis, M.; Sioutas, S. An overview of mcmc methods: From theory to applications. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Hersonissos, Greece, 17–20 June 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 319–332. [Google Scholar]
Karras, C.; Theodorakopoulos, L.; Karras, A.; Krimpas, G.A.; Bakalis, C.P.; Theodoropoulou, A. MCMC Methods: From Theory to Distributed Hamiltonian Monte Carlo over PySpark. Algorithms 2025, 18, 661. [Google Scholar] [CrossRef]
Amasiadi, N.; Aslani-Gkotzamanidou, M.; Theodorakopoulos, L.; Theodoropoulou, A.; Krimpas, G.A.; Merkouris, C.; Karras, A. AI-Driven Bayesian Deep Learning for Lung Cancer Prediction: Precision Decision Support in Big Data Health Informatics. BioMedInformatics 2025, 5, 39. [Google Scholar] [CrossRef]
Karras, C.; Karras, A.; Avlonitis, M.; Giannoukou, I.; Sioutas, S. Maximum likelihood estimators on mcmc sampling algorithms for decision making. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Hersonissos, Greece, 17–20 June 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 345–356. [Google Scholar]
Vlachou, E.; Karras, C.; Karras, A.; Tsolis, D.; Sioutas, S. EVCA classifier: A MCMC-based classifier for analyzing high-dimensional big data. Information 2023, 14, 451. [Google Scholar] [CrossRef]
Jayasri, P.; Jeya, R.; Saritha, V. Service Provisioning and Data Management in IIoT using Improved Bi-LSTM: A Comprehensive Review. In Proceedings of the 2025 3rd International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), Erode, India, 6–8 August 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 456–465. [Google Scholar] [CrossRef]
Kodri, W.A.G.; Haris, M.; Fitriadi, R. Fine-Hybrid: Integration of BM25 And Finetuned SBERT to Enhance Search Relevance. Teknika 2025, 14, 213–222. [Google Scholar] [CrossRef]
Yuksel, K.A.; Gunduz, A.; Anees, A.B.; Sawaf, H. Efficient Machine Translation Corpus Generation: Integrating Human-in-the-Loop Post-Editing with Large Language Models. arXiv 2025, arXiv:2502.12755. [Google Scholar]
Quan, S. Automatically Generating Numerous Context-Driven SFT Data for LLMs Across Diverse Granularity. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-25), Philadelphia, PA, USA, 25 February–4 March 2025; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 25074–25082. [Google Scholar]
Zeighami, S.; Lin, Y.; Shankar, S.; Parameswaran, A. LLM-Powered Proactive Data Systems. arXiv 2025, arXiv:2502.13016. [Google Scholar] [CrossRef]
Changala, R.; Kaur, C.; Satapathy, N.R.; Vuyyuru, V.A.; Santosh, K.; Valavan, M.P. Healthcare Data Management Optimization Using LSTM and GAN-Based Predictive Modeling: Towards Effective Health Service Delivery. In Proceedings of the 2024 International Conference on Data Science and Network Security (ICDSNS), Tiptur, India, 26–27 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar] [CrossRef]
Kizilkaya, D.; Sermet, Y.; Demir, I. Towards HydroLLM: Approaches for building a domain-specific language model for hydrology. J. Hydroinform. 2025, 27, 1652–1666. [Google Scholar] [CrossRef]
Cohen, O.S.; Malul, E.; Meidan, Y.; Mimran, D.; Elovici, Y.; Shabtai, A. KubeGuard: LLM-Assisted Kubernetes Hardening via Configuration Files and Runtime Logs Analysis. arXiv 2025, arXiv:2509.04191. [Google Scholar]
Shan, R.; Shan, T. Enterprise LLMOps: Advancing Large Language Models Operations Practice. In Proceedings of the 2024 IEEE Cloud Summit, Washington, DC, USA, 27–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 143–148. [Google Scholar]
Reed, C.; Wynn, M.; Bown, R. Artificial Intelligence in Digital Marketing: Towards an Analytical Framework for Revealing and Mitigating Bias. Big Data Cogn. Comput. 2025, 9, 40. [Google Scholar] [CrossRef]
Li, W.; Liu, W.; Deng, M.; Liu, X.; Feng, L. The impact of large language models on accounting and future application scenarios. J. Account. Lit. 2025. [Google Scholar] [CrossRef]
Aghaei, R.; Kiaei, A.A.; Boush, M.; Vahidi, J.; Zavvar, M.; Barzegar, Z.; Rofoosheh, M. Harnessing the Potential of Large Language Models in Modern Marketing Management: Applications, Future Directions, and Strategic Recommendations. arXiv 2025, arXiv:2501.10685. [Google Scholar] [CrossRef]
Ao, S.I.; Hurwitz, M.; Palade, V. Cognitive computing and business intelligence applications in accounting, finance and management. Big Data Cogn. Comput. 2025, 9, 54. [Google Scholar] [CrossRef]
Tavasoli, A.; Sharbaf, M.; Madani, S.M. Responsible innovation: A strategic framework for financial LLM integration. arXiv 2025, arXiv:2504.02165. [Google Scholar] [CrossRef]
Kerr, D.; Smith, K.T.; Smith, L.M.; Xu, T. A Review of AI and Its Impact on Management Accounting and Society. J. Risk Financ. Manag. 2025, 18, 340. [Google Scholar] [CrossRef]
Bedagkar, A.; Mitra, S.; Medicherla, R.; Naik, R.; Pal, S. LLM Driven Smart Assistant for Data Mapping. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Ottawa, ON, Canada, 27 April–3 May 2025; IEEE/ACM: Piscataway, NJ, USA, 2025; pp. 181–191. [Google Scholar] [CrossRef]
Sheetrit, E.; Brief, M.; Mishaeli, M.; Elisha, O. Rematch: Retrieval enhanced schema matching with llms. arXiv 2024, arXiv:2403.01567. [Google Scholar] [CrossRef]
Wang, T.; Chen, X.; Lin, H.; Han, X.; Sun, L.; Wang, H.; Zeng, Z. DBCopilot: Natural Language Querying over Massive Databases via Schema Routing. In Proceedings of the 28th International Conference on Extending Database Technology (EDBT), Barcelona, Spain, 25–28 March 2025; pp. 707–721. [Google Scholar] [CrossRef]
Ma, C.; Chakrabarti, S.; Khan, A.; Molnár, B. Knowledge graph-based retrieval-augmented generation for schema matching. arXiv 2025, arXiv:2501.08686. [Google Scholar]
Gan, Y.; Chen, X.; Huang, Q.; Purver, M.; Woodward, J.R.; Xie, J.; Huang, P. Towards robustness of text-to-SQL models against synonym substitution. arXiv 2021, arXiv:2106.01065. [Google Scholar]
Fu, S.D.; Chen, X. Compound Schema Registry. arXiv 2024, arXiv:2406.11227. [Google Scholar] [CrossRef]
Yu, C.; Yao, Y.; Zhang, X.; Zhu, G.; Guo, Y.; Shao, X.; Shibasaki, M.; Hu, Z.; Dai, L.; Guan, Q.; et al. Monkuu: A LLM-powered natural language interface for geospatial databases with dynamic schema mapping. Int. J. Geogr. Inf. Sci. 2025, 38, 2533322. [Google Scholar] [CrossRef]
Wang, Y.; Liu, P.; Yang, X. Linkalign: Scalable schema linking for real-world large-scale multi-database text-to-sql. arXiv 2025, arXiv:2503.18596. [Google Scholar]
Fu, J.; Tang, H.; Khan, A.; Mehrotra, S.; Ke, X.; Gao, Y. In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration. Proc. ACM Manag. Data 2025, 3, 252. [Google Scholar] [CrossRef]
Bouabdelli, L.F.; Abdelhedi, F.; Hammoudi, S.; Hadjali, A. An Advanced Entity Resolution in Data Lakes: First Steps. In Proceedings of the 14th International Conference on Data Science, Technology and Applications—Volume 1: DATA, Bilbao, Spain, 10–12 June 2025; SciTePress: Setúbal, Portugal, 2025; pp. 661–668. [Google Scholar] [CrossRef]
Saengsiripaiboon, S.; Pacharawongsakda, E.; Jitkongchuen, D. Enhancing Efficiency in Entity Resolution Strategies for Batch Prompting. In Proceedings of the 2025 6th International Conference on Big Data Analytics and Practices (IBDAP), Chiang Mai, Thailand, 1–3 August 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 31–35. [Google Scholar] [CrossRef]
Wang, T.; Chen, X.; Lin, H.; Chen, X.; Han, X.; Wang, H.; Zeng, Z.; Sun, L. Match, compare, or select? an investigation of large language models for entity matching. arXiv 2024, arXiv:2405.16884. [Google Scholar] [CrossRef]
Zhang, J.; Fang, J.; Zhang, C.; Zhang, W.; Ren, H.; Xu, L. Geographic Named Entity Matching and Evaluation Recommendation Using Multi-Objective Tasks: A Study Integrating a Large Language Model (LLM) and Retrieval-Augmented Generation (RAG). ISPRS Int. J. Geo-Inf. 2025, 14, 95. [Google Scholar] [CrossRef]
Abd El Aziz, R.A.; Elzanfaly, D.; Farhan, M.S. Towards Semantic Layer for Enhancing Blocking Entity Resolution Accuracy in Big Data. In Proceedings of the 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), Victoria, Seychelles, 1–2 February 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–4. [Google Scholar] [CrossRef]
Xu, L.; Zhang, X.; Duan, F.; Wang, S.; Weng, R.; Wang, J.; Cai, X. FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training. arXiv 2025, arXiv:2502.00761. [Google Scholar] [CrossRef]
Asthana, S.; Zhang, B.; Mahindru, R.; DeLuca, C.; Gentile, A.L.; Gopisetty, S. Deploying Privacy Guardrails for LLMs: A Comparative Analysis of Real-World Applications. arXiv 2025, arXiv:2501.12456. [Google Scholar] [CrossRef]
Asthana, S.; Mahindru, R.; Zhang, B.; Sanz, J. Adaptive PII Mitigation Framework for Large Language Models. arXiv 2025, arXiv:2501.12465. [Google Scholar] [CrossRef]
Zhu, L.; Yang, L.; Li, C.; Hu, S.; Liu, L.; Yin, B. LegiLM: A Fine-Tuned Legal Language Model for Data Compliance. arXiv 2024, arXiv:2409.13721. [Google Scholar]
Cejas, O.A.; Azeem, M.I.; Abualhaija, S.; Briand, L.C. NLP-Based Automated Compliance Checking of Data Processing Agreements Against GDPR. IEEE Trans. Softw. Eng. 2023, 49, 4282–4303. [Google Scholar] [CrossRef]
Cory, T.; Rieder, W.; Krämer, J.; Raschke, P.; Herbke, P.; Küpper, A. Word-level Annotation of GDPR Transparency Compliance in Privacy Policies using Large Language Models. arXiv 2025, arXiv:2503.10727. [Google Scholar] [CrossRef]
Garza, L.; Elluri, L.; Kotal, A.; Piplai, A.; Gupta, D.; Joshi, A. Privcomp-kg: Leveraging knowledge graph and large language models for privacy policy compliance verification. arXiv 2024, arXiv:2404.19744. [Google Scholar]
Berghaus, D.; Berger, A.; Hillebrand, L.; Cvejoski, K.; Sifa, R. Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing. arXiv 2025, arXiv:2509.04469. [Google Scholar]
Biswas, A.; Talukdar, W. Robustness of structured data extraction from in-plane rotated documents using multi-modal large language models (llm). arXiv 2024, arXiv:2406.10295. [Google Scholar]
Liu, J.; Zeng, Y.; Højmark-Bertelsen, M.; Gadeberg, M.N.; Wang, H.; Wu, Q. Memory-Augmented Agent Training for Business Document Understanding. arXiv 2024, arXiv:2412.15274. [Google Scholar]
Abdellaif, O.H.; Hassan, A.N.; Hamdi, A. Erpa: Efficient rpa model integrating ocr and llms for intelligent document processing. In Proceedings of the 2024 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), Cairo, Egypt, 13–14 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 295–300. [Google Scholar]
Soylu, A.; Elvesæter, B.; Turk, P.; Roman, D.; Corcho, O.; Simperl, E.; Konstantinidis, G.; Lech, T.C. Towards an ontology for public procurement based on the open contracting data standard. In Proceedings of the Conference on e-Business, e-Services and e-Society, Trondheim, Norway, 18–20 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 230–237. [Google Scholar]
La Paz, A.I.; Ramaprasad, A.; Syn, T.; Vasquez, J. An ontology of E-commerce-mapping a relevant corpus of knowledge. J. Theor. Appl. Electron. Commer. Res. 2015, 10, i–ix. [Google Scholar] [CrossRef]
Saddad, E.; El-Bastawissy, A.; Mokhtar, H.M.; Hazman, M. Lake data warehouse architecture for big data solutions. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 417–424. [Google Scholar] [CrossRef]
Mazumdar, D.; Hughes, J.; Onofre, J. The data lakehouse: Data warehousing and more. arXiv 2023, arXiv:2310.08697. [Google Scholar] [CrossRef]
Li, T.; Hu, L. Audit as You Go: A Smart Contract-Based Outsourced Data Integrity Auditing Scheme for Multiauditor Scenarios with One Person, One Vote. Secur. Commun. Netw. 2022, 2022, 8783952. [Google Scholar] [CrossRef]
Francati, D.; Ateniese, G.; Faye, A.; Milazzo, A.M.; Perillo, A.M.; Schiatti, L.; Giordano, G. Audita: A blockchain-based auditing framework for off-chain storage. In Proceedings of the Ninth International Workshop on Security in Blockchain and Cloud Computing, Virtual Event, 7–11 June 2021; pp. 5–10. [Google Scholar]
Shi, Z.; Bergers, J.; Korsmit, K.; Zhao, Z. AUDITEM: Toward an automated and efficient data integrity verification model using blockchain. arXiv 2022, arXiv:2207.00370. [Google Scholar] [CrossRef]
Roychowdhury, S.; Krema, M.; Mahammad, A.; Moore, B.; Mukherjee, A.; Prakashchandra, P. ERATTA: Extreme RAG for enterprise-Table To Answers with Large Language Models. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4605–4610. [Google Scholar] [CrossRef]
Hamayat, F.; Ejaz, L.; Danish, M.; Nazir, A.; Ahadian, P.; Ahmad, R.F. SEEBot: Leveraging Open-Source LLMs and RAG for Secure and Economical Enterprise Chatbots. In Proceedings of the 2025 5th International Conference on Artificial Intelligence and Education (ICAIE), Suzhou, China, 14–16 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 147–151. [Google Scholar] [CrossRef]
Di Profio, M.; Zhong, M.; Sripada, Y.; Jaspars, M. FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering. arXiv 2025, arXiv:2507.23118. [Google Scholar]
Chakraborty, S. Beyond ETL: How AI Agents Are Building Self-Healing Data Pipelines. J. Comput. Sci. Technol. Stud. 2025, 7, 741–756. [Google Scholar] [CrossRef]
Andersen, J.S.; Maalej, W. Design patterns for machine learning-based systems with humans in the loop. IEEE Softw. 2023, 41, 151–159. [Google Scholar] [CrossRef]
Xin, D.; Ma, L.; Liu, J.; Macke, S.; Song, S.; Parameswaran, A. Helix: Accelerating human-in-the-loop machine learning. arXiv 2018, arXiv:1808.01095. [Google Scholar]
Pogiatzis, A.; Samakovitis, G. An event-driven serverless ETL pipeline on AWS. Appl. Sci. 2020, 11, 191. [Google Scholar] [CrossRef]
Yin, W.; Heinecke, S.; Li, J.; Keskar, N.S.; Jones, M.; Shi, S.; Georgiev, S.; Milich, K.; Esposito, J.; Xiong, C. Combining data-driven supervision with human-in-the-loop feedback for entity resolution. arXiv 2021, arXiv:2111.10497. [Google Scholar]
Wang, J.; Guo, B.; Chen, L. Human-in-the-loop machine learning: A macro-micro perspective. arXiv 2022, arXiv:2202.10564. [Google Scholar]
Wang, Z.J.; Choi, D.; Xu, S.; Yang, D. Putting humans in the natural language processing loop: A survey. arXiv 2021, arXiv:2103.04044. [Google Scholar] [CrossRef]
Hardin, T.; Kotz, D. Amanuensis: Provenance, privacy, and permission in TEE-enabled blockchain data systems. In Proceedings of the 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), Bologna, Italy, 10–13 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 144–156. [Google Scholar] [CrossRef]
Ahmad, A.; Saad, M.; Bassiouni, M.; Mohaisen, A. Towards blockchain-driven, secure and transparent audit logs. In Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, New York, NY, USA, 5–7 November 2018; pp. 443–448. [Google Scholar]
Amin, M.A.; Tummala, H.; Mohan, S.; Ray, I. Healthcare Policy Compliance: A Blockchain Smart Contract-Based Approach. arXiv 2023, arXiv:2312.10214. [Google Scholar] [CrossRef]
Pattengale, N.D.; Hudson, C.M. Decentralized genomics audit logging via permissioned blockchain ledgering. BMC Med. Genom. 2020, 13, 102. [Google Scholar] [CrossRef]
Lu, Y.; Wang, J. KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment. arXiv 2025, arXiv:2502.06472. [Google Scholar] [CrossRef]
Zhang, H.; Si, J.; Yan, G.; Qi, B.; Cai, P.; Mao, S.; Wang, D.; Shi, B. RAKG: Document-level Retrieval Augmented Knowledge Graph Construction. arXiv 2025, arXiv:2504.09823. [Google Scholar]
Heavner, S.F.; Kumar, V.K.; Anderson, W.; Al-Hakim, T.; Dasher, P.; Armaignac, D.L.; Clermont, G.; Cobb, J.P.; Manion, S.; Remy, K.E.; et al. Critical data for critical care: A primer on leveraging electronic health record data for research from society of critical care medicine’s panel on data sharing and harmonization. Crit. Care Explor. 2024, 6, e1179. [Google Scholar] [CrossRef] [PubMed]
Santos, A.; Pena, E.H.; Lopez, R.; Freire, J. Interactive Data Harmonization with LLM Agents: Opportunities and Challenges. arXiv 2025, arXiv:2502.07132. [Google Scholar]
McIntosh, T.R.; Susnjak, T.; Liu, T.; Watters, P.; Xu, D.; Liu, D.; Nowrozy, R.; Halgamuge, M.N. From cobit to iso 42001: Evaluating cybersecurity frameworks for opportunities, risks, and regulatory compliance in commercializing large language models. Comput. Secur. 2024, 144, 103964. [Google Scholar] [CrossRef]
Mökander, J.; Schuett, J.; Kirk, H.R.; Floridi, L. Auditing large language models: A three-layered approach. AI Ethics 2024, 4, 1085–1115. [Google Scholar] [CrossRef]
Jernite, Y.; Nguyen, H.; Biderman, S.; Rogers, A.; Masoud, M.; Danchev, V.; Tan, S.; Luccioni, A.S.; Subramani, N.; Johnson, I.; et al. Data governance in the age of large-scale data-driven language technology. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 2206–2222. [Google Scholar]
Thorogood, A. Policy-aware data lakes: A flexible approach to achieve legal interoperability for global research collaborations. J. Law Biosci. 2020, 7, lsaa065. [Google Scholar] [CrossRef]
Park, S. Bridging the global divide in AI regulation: A proposal for a contextual, coherent, and commensurable framework. Wash. Int. Law J. 2023, 33, 216. [Google Scholar] [CrossRef]
Akbarfam, A.J.; Maleki, H. SOK: Blockchain for Provenance. arXiv 2024, arXiv:2407.17699. [Google Scholar] [CrossRef]
Ozdayi, M.S.; Kantarcioglu, M.; Malin, B. Leveraging blockchain for immutable logging and querying across multiple sites. BMC Med Genom. 2020, 13, 82. [Google Scholar] [CrossRef]
Akbarfam, A.J.; Heidaripour, M.; Maleki, H.; Dorai, G.; Agrawal, G. Forensiblock: A provenance-driven blockchain framework for data forensics and auditability. In Proceedings of the 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), Atlanta, GA, USA, 1–3 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 136–145. [Google Scholar]
Shi, J.; Firmansyah, E.A.; Wang, Y.; Xu, W. Technological innovation and regulatory harmonization in Islamic finance: A systematic review and machine learning analysis (2000–2023). J. Islam. Account. Bus. Res. 2025. [Google Scholar] [CrossRef]
Boldt Sousa, T. Customer Data Platforms: A Pattern Language for Digital Marketing Optimization with First-Party Data. In Proceedings of the 27th European Conference on Pattern Languages of Programs, Irsee, Germany, 6–10 July 2022; pp. 1–5. [Google Scholar]
Shivampeta, S. AI-Augmented Customer Data Platforms: Engineering for Scale, Speed, and Compliance. J. Comput. Sci. Technol. Stud. 2025, 7, 837–845. [Google Scholar] [CrossRef]
Wen, Y.; Li, W.; Luo, J.; Xiao, J.; Jia, Y.; Wang, Z. Multi-domain Data Association Analysis: Research on Precise Customer Classification Based on LLM and GMM Models. In Proceedings of the 4th Asia-Pacific Artificial Intelligence and Big Data Forum, Ganzhou, China, 27–29 December 2024; pp. 920–925. [Google Scholar]
Kasuga, A.; Yonetani, R. Cxsimulator: A user behavior simulation using llm embeddings for web-marketing campaign assessment. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 3817–3821. [Google Scholar]
Tan, Z.; Zeng, Q.; Tian, Y.; Liu, Z.; Yin, B.; Jiang, M. Democratizing large language models via personalized parameter-efficient fine-tuning. arXiv 2024, arXiv:2402.04401. [Google Scholar] [CrossRef]
Zeldes, Y.; Zait, A.; Labzovsky, I.; Karmon, D.; Farkash, E. ComMer: A Framework for Compressing and Merging User Data for Personalization. arXiv 2025, arXiv:2501.03276. [Google Scholar] [CrossRef]
Embar, V.; Shrivastava, R.; Damodaran, V.; Mehlinger, T.; Hsiao, Y.C.; Raghunathan, K. LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment. arXiv 2025, arXiv:2503.19090. [Google Scholar]
Gan, C.; Yang, D.; Hu, B.; Liu, Z.; Shen, Y.; Zhang, Z.; Gu, J.; Zhou, J.; Zhang, G. Making large language models better knowledge miners for online marketing with progressive prompting augmentation. arXiv 2023, arXiv:2312.05276. [Google Scholar] [CrossRef]
Jena, P.K.; Dash, A.K.; Maharana, D.; Palai, C. A Novel Invoice Automation System. In Proceedings of the 2023 IEEE 5th International Conference on Cybernetics, Cognition and Machine Learning Applications (ICCCMLA), Hamburg, Germany, 7–8 October 2023; pp. 478–486. [Google Scholar] [CrossRef]
Akdoğan, A.; Kurt, M. Exttnet: A deep learning algorithm for extracting table texts from invoice images. arXiv 2024, arXiv:2402.02246. [Google Scholar] [CrossRef]
Aguda, T.; Siddagangappa, S.; Kochkina, E.; Kaur, S.; Wang, D.; Smiley, C.; Shah, S. Large language models as financial data annotators: A study on effectiveness and efficiency. arXiv 2024, arXiv:2403.18152. [Google Scholar] [CrossRef]
Dong, M.M.; Stratopoulos, T.C.; Wang, V.X. A scoping review of ChatGPT research in accounting and finance. Int. J. Account. Inf. Syst. 2024, 55, 100715. [Google Scholar] [CrossRef]
Kim, S.; Song, H.; Seo, H.; Kim, H. Optimizing retrieval strategies for financial question answering documents in retrieval-augmented generation systems. arXiv 2025, arXiv:2503.15191. [Google Scholar] [CrossRef]
Yue, C.; Xu, X.; Ma, X.; Du, L.; Ding, Z.; Han, S.; Zhang, D.; Zhang, Q. Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
Rajpoot, P.K.; Parikh, A. Nearest Neighbor Search over Vectorized Lexico-Syntactic Patterns for Relation Extraction from Financial Documents. arXiv 2023, arXiv:2310.17714. [Google Scholar] [CrossRef]
Bardelli, C.; Rondinelli, A.; Vecchio, R.; Figini, S. Automatic electronic invoice classification using machine learning models. Mach. Learn. Knowl. Extr. 2020, 2, 617–629. [Google Scholar] [CrossRef]
Sarmah, B.; Li, M.; Lyu, J.; Frank, S.; Castellanos, N.; Pasquali, S.; Mehta, D. How to choose a threshold for an evaluation metric for large language models. arXiv 2024, arXiv:2412.12148. [Google Scholar]
Cardei, M.A.; Lamp, J.; Derdzinski, M.; Bhatia, K. DexBench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management. arXiv 2025, arXiv:2510.00038. [Google Scholar]
Kumar, A.; Lakkaraju, H. Manipulating large language models to increase product visibility. arXiv 2024, arXiv:2404.07981. [Google Scholar] [CrossRef]
Zhu, J.; Bazaz, S.A.; Dutta, S.; Anuraag, B.; Haider, I.; Bandopadhyay, S. Talk to your data: Enhancing Business Intelligence and Inventory Management with LLM-Driven Semantic Parsing and Text-to-SQL for Database Querying. In Proceedings of the 2023 4th International Conference on Data Analytics for Business and Industry (ICDABI), Bahrain, 25–26 June 2023; pp. 321–325. [Google Scholar] [CrossRef]
Bamigbade, O.; Ejeofobiri, C.K.; Mayegun, K.O. Cross-sector AI framework for risk detection in national security, energy and financial networks. World J. Adv. Res. Rev. 2023, 18, 1307–1327. [Google Scholar] [CrossRef]
Palaniappan, S.; Mali, R.; Cuomo, L.; Vitale, M.; Youssef, A.; Madathil, A.P.; Murugesan, M.; Bettini, A.; De Magistris, G.; Veneri, G. Enhancing Enterprise-Wide Information Retrieval through RAG Systems Techniques, Evaluation, and Scalable Deployment. In Proceedings of the Abu Dhabi International Petroleum Exhibition and Conference (ADIPEC), Abu Dhabi, United Arab Emirates, 4–7 November 2024; Society of Petroleum Engineers: Abu Dhabi, United Arab Emirates, 2024; p. D011S020R002. [Google Scholar] [CrossRef]
Purwar, A.; Balakrishnan, G. Evaluating the efficacy of open-source llms in enterprise-specific rag systems: A comparative study of performance and scalability. arXiv 2024, arXiv:2406.11424. [Google Scholar]
Sserunjogi, R.; Ogenrwot, D.; Niwamanya, N.; Nsimbe, N.; Bbaale, M.; Ssempala, B.; Mutabazi, N.; Wabinyai, R.F.; Okure, D.; Bainomugisha, E. Design and Evaluation of a Scalable Data Pipeline for AI-Driven Air Quality Monitoring in Low-Resource Settings. In Proceedings of the International Conference on Software Engineering and Data Engineering, New Orleans, LA, USA, 20–21 October 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 212–231. [Google Scholar]
Rucco, C.; Longo, A.; Saad, M. Efficient Data Ingestion in Cloud-based architecture: A Data Engineering Design Pattern Proposal. arXiv 2025, arXiv:2503.16079. [Google Scholar] [CrossRef]
Wang, X.; Carey, M.J. An IDEA: An ingestion framework for data enrichment in AsterixDB. arXiv 2019, arXiv:1902.08271. [Google Scholar]
Singh, R.; V, A.; Mishra, S.; Singh, S.K. Streamlined Data Pipeline for Real-Time Threat Detection and Model Inference. In Proceedings of the 2025 17th International Conference on COMmunication Systems and NETworks (COMSNETS), Bengaluru, India, 6–10 January 2025; pp. 1148–1153. [Google Scholar] [CrossRef]
Jakubik, J.; Weber, D.; Hemmer, P.; Vössing, M.; Satzger, G. Improving the efficiency of human-in-the-loop systems: Adding artificial to human experts. In Proceedings of the International Conference on Wirtschaftsinformatik, Paderborn, Germany, 18–21 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 131–147. [Google Scholar]
Liao, Y.C.; Streli, P.; Li, Z.; Gebhardt, C.; Holz, C. Continual Human-in-the-Loop Optimization. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–26. [Google Scholar]
Amaral, O.; Abualhaija, S.; Briand, L. ML-Based Compliance Verification of Data Processing Agreements against GDPR. In Proceedings of the 2023 IEEE 31st International Requirements Engineering Conference (RE), Hannover, Germany, 4–8 September 2023; pp. 53–64. [Google Scholar] [CrossRef]
Hassani, S.; Sabetzadeh, M.; Amyot, D.; Liao, J. Rethinking legal compliance automation: Opportunities with large language models. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 432–440. [Google Scholar]
Wang, M.; Zhang, D.J.; Zhang, H. Large language models for market research: A data-augmentation approach. arXiv 2024, arXiv:2412.19363. [Google Scholar] [CrossRef]
Cao, H.; Gu, H.; Guo, X. Feasibility of transfer learning: A mathematical framework. arXiv 2023, arXiv:2305.12985. [Google Scholar] [CrossRef]
Witkowski, A.; Wodecki, A. A cross-disciplinary knowledge management framework for generative artificial intelligence in product management: A case study from the manufacturing sector. In Proceedings of the European Conference on Knowledge Management, Veszprem, Hungary, 5–6 September 2024; Academic Conferences International Limited: London, UK, 2024; pp. 921–929. [Google Scholar]
Jeong, C. A study on the implementation of generative ai services using an enterprise data-based llm application architecture. arXiv 2023, arXiv:2309.01105. [Google Scholar] [CrossRef]
Agarwal, A.; Chan, A.; Chandel, S.; Jang, J.; Miller, S.; Moghaddam, R.Z.; Mohylevskyy, Y.; Sundaresan, N.; Tufano, M. Copilot evaluation harness: Evaluating llm-guided software programming. arXiv 2024, arXiv:2402.14261. [Google Scholar] [CrossRef]
Szymanski, A.; Gebreegziabher, S.A.; Anuyah, O.; Metoyer, R.A.; Li, T.J.J. Comparing criteria development across domain experts, lay users, and models in large language model evaluation. arXiv 2024, arXiv:2410.02054. [Google Scholar] [CrossRef]
Lekadir, K.; Frangi, A.F.; Porras, A.R.; Glocker, B.; Cintas, C.; Langlotz, C.P.; Weicken, E.; Asselbergs, F.W.; Prior, F.; Collins, G.S.; et al. FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ 2025, 388, e081554. [Google Scholar] [CrossRef]
Tjondronegoro, D. TOAST framework: A multidimensional approach to ethical and sustainable ai integration in organizations. arXiv 2025, arXiv:2502.00011. [Google Scholar]
Bouchard, D.; Chauhan, M.S.; Skarbrevik, D.; Bajaj, V.; Ahmad, Z. LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases. arXiv 2025, arXiv:2501.03112. [Google Scholar] [CrossRef]
Brown, N.B. Enhancing trust in llms: Algorithms for comparing and interpreting llms. arXiv 2024, arXiv:2406.01943. [Google Scholar] [CrossRef]
Zhong, H.; Do, T.; Jie, Y.; Neuwirth, R.J.; Shen, H. Global AI Governance: Where the Challenge is the Solution-An Interdisciplinary, Multilateral, and Vertically Coordinated Approach. arXiv 2025, arXiv:2503.04766. [Google Scholar]
Natarajan, S.; Mathur, S.; Sidheekh, S.; Stammer, W.; Kersting, K. Human-in-the-loop or AI-in-the-loop? Automate or Collaborate? In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 28594–28600. [Google Scholar]
Wu, J.; He, J. Trustworthy Transfer Learning: A Survey. arXiv 2024, arXiv:2412.14116. [Google Scholar] [CrossRef]
Zhang, J.; Budhdeo, S.; William, W.; Cerrato, P.; Shuaib, H.; Sood, H.; Ashrafian, H.; Halamka, J.; Teo, J.T. Moving towards vertically integrated artificial intelligence development. NPJ Digit. Med. 2022, 5, 143. [Google Scholar] [CrossRef]

Figure 1. ReMatch++: stepwise flowchart for distributed schema mapping including glossary retrieval, candidate generation, scoring, validation, and MCMC sampling.

Figure 2. Consent-Aware Entity Resolution: process flow from partitioning, candidate blocking, LLM proposal, MCMC calibration, and consent validation, to auditable graph output.

Figure 3. Doc2Ledger-LLM: stepwise process flow from distributed OCR and layout extraction, LLM-driven candidate field proposals, MCMC extraction sampling and validation, to final ledger assembly with auditable provenance and confidence estimation.

Figure 4. Pattern A—RAG-over-Lakehouse for governed question answering: policy- and consent-aware retrieval with grounded LLM responses over versioned, auditable data.

Figure 5. Pattern B—ETL Co-Pilot with human-in-the-loop: LLM suggestions, deterministic validators, and risk-based approvals for governed ingestion and transformation.

Figure 6. Pattern C—Lineage & Evidence Graph with LLM annotation and optional blockchain anchoring: capture, explain, relate, and attest provenance for replayable assurance.

Figure 7. Governance: consent/purpose-gated access over harmonized registries with replayable provenance and immutable anchors.

Figure 8. Marketing: consent-centric enrichment/segmentation in a CDP, governed RAG for insight, and audit-ready activation.

Figure 9. Accounting/audit: deterministic extraction and control testing with reviewer sign-off; replayable evidence and governed, read-only QA.

Figure 10. ReMatch++ performance across sectors: Semantic-Weighted Mapping Correctness (SWMC) with 95% credible intervals estimated from MCMC sampling.

Figure 11. Breakdown of Consent-Aware Entity Resolution Score (CERS) across sectors. Includes F1 accuracy, confidence calibration (AUCconf), and average business impact (BIavg) with 95% credible intervals.

Figure 12. Heatmap of Hierarchical Extraction Quality Metric (HEQM) showing Doc2Ledger-LLM performance under layout rotation and template variations.

Figure 13. Trace plots for key model parameters (semantic and business weights) over 2000 MCMC iterations. Dashed lines mark burn-in; horizontal trends indicate mixing.

Figure 14. Convergence diagnostics: Gelman-Rubin PSRF near 1 indicates good convergence; high Effective Sample Size (ESS) confirms sufficient independent sampling.

Figure 15. Expert acceptance rates of automated mappings at varying confidence thresholds. The curve highlights the trade-off between automation efficiency and human oversight.

Figure 16. Radar chart of provenance coverage across major audit dimensions (‘Who’, ‘What’, ‘When’, ‘Where’, ‘Why’, ‘How’). High scores indicate strong traceability.

Figure 17. Pattern Transferability Factor (PTF): Heatmap showing how architecture patterns adapt from one domain to another. Higher values imply stronger cross-domain flexibility.

Figure 18. System scaling performance: Pipeline Acceleration Factor (PAF) increases with dataset size, demonstrating strong distributed scalability.

Figure 19. Comparison of F1 scores across governance, marketing, and accounting. The blue gradient indicates improvement from traditional approaches to the distributed LLM + MCMC framework.

Figure 20. Resource utilization heatmap for Spark cluster nodes (CPU, memory, and bandwidth).

Figure 21. Breakdown of stage durations for three Spark jobs, by computational stage. Each bar segment shows the exact duration in seconds for the corresponding stage, using a distinct blue tone for clarity.

Table 1. Cross-sector comparison of big-data characteristics and LLM data-management priorities.

Dimension	Digital Governance	Digital Marketing	Accounting/Audit
Data profile	Registries; case files; identity/authentication logs; multi-agency docs; mixed structured/semi-structured/IoT [1]	Web/app events; CRM; adtech; social text; high velocity/variety [16,25]	Ledgers; invoices; contracts; disclosures; structured and semi-structured [3]
Integration focus	Cross-agency harmonization/canonicalization [3,16,25]	Omnichannel stitching; identity/household resolution [16,25]	Subsidiary consolidation; intercompany reconciliation [3]
Compliance emphasis	Lawfulness and transparency; explainability [12,46,47]	Consent/purpose limitation; fairness and explainability [2,12,47]	External assurance; control testing; evidence sufficiency [19,22]
Provenance/lineage	Statutory audit trails; tamper-evidence [19]	Consent chain; data-source attribution [12]	Replayable financial/audit trails [19,22]
Access governance	Role/mandate-based; transparency by design [2,12]	Consent-aware segmentation; policy filters [2,12]	Segregation of duties; approvals and logging [19,22]
LLM priorities	Policy-aligned integration; lineage extraction; governed RAG with citations [2,19]	Consent-aware ER/identity mapping; contract/policy extraction; governed RAG over CDPs/knowledge assets [2,3,12,16]	Deterministic document extraction with validators; lineage completion; access-control evidence [19,22]

Abbreviations: ER—entity resolution; RAG—retrieval-augmented generation; CDP—customer data platform.

Table 6. Summary of datasets and instantiated tasks per sector.

Dataset	Domain	Approx. Size	Main Tasks/Functions
TED	Digital Governance	∼1 million contract notices and awards	Schema mapping (F1), entity resolution (F2), constraint validation (F3), metadata lineage (F4), governed retrieval (F7)
Olist	Digital Marketing	∼100,000 orders, customers	Cross-table integration (F1), identity resolution (F2), quality control (F3), consent-aware retrieval (F4–F5), policy-governed access (F7)
SROIE/CORD	Accounting	∼10,000 annotated scanned documents	Document-to-structure extraction (F6), metadata tagging (F4), governed access (F7)
Synthetic $D_{syn} (θ)$	Stress Tests	Up to ∼1 million records	Scaled versions of all sector tasks F1–F7 with controlled noise, sparsity, and dependencies

Table 7. Technical infrastructure portability across sectors (0 = incompatible, 1 = seamless).

Pattern Component	Gov→Mkt	Mkt→Gov	Gov→Acc	Acc→Gov	Mkt→Acc	Acc→Mkt	Adaptation Requirements
Base LLM Models	0.92	0.88	0.94	0.90	0.95	0.93	Domain fine-tuning (2–4 weeks); sector lexicon integration [163]
Vector Databases	0.96	0.95	0.97	0.96	0.98	0.97	Sectoral metadata schemas (1–2 weeks); index optimization [151,164]
RAG Orchestration	0.84	0.72	0.88	0.75	0.89	0.86	Query routing reconfiguration (3–5 weeks); retrieval strategy adaptation [152]
Graph Databases	0.91	0.89	0.93	0.91	0.94	0.92	Relationship schema redesign (2–3 weeks); traversal tuning [12,20]
Blockchain Integration	0.78	0.81	0.85	0.83	0.76	0.80	Consensus selection (4–6 weeks); smart-contract development [126]

Notes: Portability values are averaged from empirical cross-sector benchmarks. LLM = large language model; RAG = retrieval-augmented generation.

Table 8. Guardrail transferability and adaptation strategies.

Guardrail Type	Universal Components (PTF $\geq 0.80$ )	Sector-Specific Components (PTF $< 0.60$ )	Adaptation Strategy
PII Detection	Core entity patterns	Industry identifiers; context sensitivity	Vocabulary augmentation; threshold calibration (2–3 weeks) [86,87]
Consent Management	Preference tracking; opt-out	GDPR/CCPA/SOX mapping; purpose limitation	Legal framework integration (4–6 weeks) [88,91]
Access Control	RBAC; authentication	Purpose-based (gov), consent-driven (mkt), audit-logged (acct)	Policy engine rules (3–5 weeks) [2,124]
Bias Monitoring	Disparity detection	Protected-class definitions; fairness thresholds	Define criteria; customize dashboards (3–4 weeks) [169]
Explainability	Basic provenance; decision logs	Stakeholder-appropriate narratives; evidence sufficiency	Explanation templates; user testing (4–8 weeks) [122,170]

Notes: PTF = portability transfer factor; RBAC = role-based access control; GDPR = General Data Protection Regulation; CCPA = California Consumer Privacy Act; SOX = Sarbanes–Oxley Act.

Table 9. Human oversight patterns and transfer considerations across sectors.

Decision Type	Governance	Marketing	Accounting	Transfer Considerations
Low-Risk Routine	Auto + audit log (≥95%)	Auto + performance monitor (≥90%)	Auto + deterministic checks (≥85%)	High PTF; recalibrate thresholds (1–2 weeks) [108,157]
Medium-Risk Operational	Manager review (60–70%)	Team lead approval (70–80%)	Controller review (55–65%)	Moderate PTF; redesign approvals (3–4 weeks) [107,111]
High-Risk Strategic	Multi-stakeholder committee (20–30%)	Executive approval (30–40%)	Independent auditor (15–25%)	Low PTF; rebuild governance (6–10 weeks) [122,172]

Notes: PTF = portability transfer factor. Percentages indicate approximate human involvement in oversight workflows; durations estimate adaptation times for cross-sector deployment.

Table 10. Comprehensive transferability matrix across patterns.

Pattern	Tech. port.	Prompt	Guardrail	Approval	PTF	Duration
RAG-over-Lakehouse	0.91	0.72	0.68	0.74	0.76	12–18 weeks
ETL Co-Pilot	0.94	0.78	0.71	0.82	0.81	10–14 weeks
Lineage & Evidence Graph	0.89	0.68	0.85	0.70	0.78	14–20 weeks
Schema Mapping	0.93	0.81	0.76	0.88	0.85	8–12 weeks
Entity Resolution	0.92	0.75	0.72	0.85	0.81	9–13 weeks
Document Extraction	0.87	0.69	0.74	0.80	0.78	11–16 weeks

Notes: Tech. port. = technical portability; PTF = platform transferability factor (as defined in the text). RAG = retrieval-augmented generation.

Table 11. Performance comparison of proposed LLM-enabled functions versus classical baselines.

Task	Metric	Classical Baseline	LLM-Enabled (Ours)	Relative Gain
F1: Schema Mapping	Semantic Mapping Precision	79.2%	89.4%	+13.1%
F1: Schema Mapping	Semantic Mapping Recall	75.6%	86.8%	+14.8%
F2: Entity Resolution	F1 Score	82.3%	91.7%	+11.4%
F2: Entity Resolution	Runtime Efficiency	18.4 min	7.2 min	−60.9%
F6: Document Extraction	Field-Level F1	71.8%	84.6%	+17.8%
F6: Document Extraction	Robustness to Layout Variance	64.2%	82.1%	+27.9%
F7: Governed Retrieval	Policy-Compliant Accuracy	68.7%	85.3%	+24.2%
F7: Governed Retrieval	Citation Recall (Provenance)	72.1%	88.9%	+23.3%

Table 12. Spark Node Resource Utilization Summary.

Node	CPU (%)	Memory (%)	Bandwidth (%)	Core Role
Node1	75	80	60	Balanced, versatile workload
Node2	65	90	70	Memory-centric; ideal for data-heavy ops
Node3	80	85	75	Consistent high usage; top performer
Node4	50	65	55	Underutilized; potential for increased load

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Karras, A.; Theodorakopoulos, L.; Karras, C.; Krimpas, G.A.; Giannaros, A.; Bakalis, C.-P. LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework. Algorithms 2025, 18, 791. https://doi.org/10.3390/a18120791

AMA Style

Karras A, Theodorakopoulos L, Karras C, Krimpas GA, Giannaros A, Bakalis C-P. LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework. Algorithms. 2025; 18(12):791. https://doi.org/10.3390/a18120791

Chicago/Turabian Style

Karras, Aristeidis, Leonidas Theodorakopoulos, Christos Karras, George A. Krimpas, Anastasios Giannaros, and Charalampos-Panagiotis Bakalis. 2025. "LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework" Algorithms 18, no. 12: 791. https://doi.org/10.3390/a18120791

APA Style

Karras, A., Theodorakopoulos, L., Karras, C., Krimpas, G. A., Giannaros, A., & Bakalis, C.-P. (2025). LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework. Algorithms, 18(12), 791. https://doi.org/10.3390/a18120791

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework

Abstract

1. Introduction

1.1. Aim of This Study

1.2. Contributions

1.3. Problem Framing & Scope

1.3.1. Defining LLMs for Data Management Versus Prediction

1.3.2. Functional Boundaries and Scope

1.3.3. Corpus-Building Approach (Non-PRISMA, Rigor-Preserving)

1.3.4. Cross-Sector Context Map

2. Taxonomy of LLM-Enabled Data Management Functions

2.1. F1. Schema & Mapping Co-Pilot

2.2. F2. Entity Resolution Assistant

2.3. F3. Data Quality & Constraint Repair

2.4. F4. Metadata/Lineage Auto-Tagger

2.5. F5. Policy/Consent Classifier & PII Redactor

2.6. F6. Document-to-Structure Extractor

2.7. F7. Retrieval-Augmented Generation (RAG) for Governed Data Access

3. Methodology

3.1. Datasets and Data Generation

3.2. Distributed LLM and MCMC Framework

3.2.1. Data Preprocessing and Partitioning

3.2.2. Model Invocation and Inference

3.2.3. MCMC Posterior Sampling for Uncertainty Quantification

3.2.4. Provenance and Traceability

3.3. Evaluation Metrics and Uncertainty Reporting

3.3.1. Baselines

3.3.2. Implementation Details

3.3.3. MCMC Versus Simple Repetition

4. Proposed Algorithms

4.1. ReMatch++: Schema & Mapping Co-Pilot

4.2. Consent-Aware Entity Resolution (C-ER) and Policy Classifier

4.3. Doc2Ledger-LLM: Multimodal Extraction with Validators

5. Architecture Patterns

5.1. Pattern A: RAG-over-Lakehouse for Governed Question Answering

5.2. Pattern B: ETL Co-Pilot at the Ingestion/Transform Stage (Human-in-the-Loop)

5.3. Pattern C: Lineage & Evidence Graph with LLM Annotation (Optional Blockchain Anchoring)

5.4. Cross-Pattern Integration and Deployment Progression

6. Sectoral Implementations: Evidence, Effective Practices, and Failure Modes

6.1. Digital Governance

6.2. Digital Marketing

6.3. Accounting & Audit

Synthesis Across Sectors

7. Evaluation and Metrics

7.1. Data Management Performance Indicators

7.1.1. Schema Mapping Accuracy

7.1.2. Entity Resolution Precision–Recall–F1

7.1.3. Constraint Repair Rate

7.1.4. Lineage Coverage Completeness

7.1.5. Document Extraction F1

7.1.6. RAG Answer Faithfulness

7.2. Operational Efficiency Metrics

7.2.1. Time-to-Ingest Reduction

7.2.2. Rework Reduction Rate

7.2.3. Human Review Minutes Saved

7.3. Governance Assurance Metrics

7.3.1. Audit Findings Resolution Rate

7.3.2. Privacy Incidents Avoided

7.3.3. Compliance Timeliness Achievement

7.4. Strategic Decision Impact Indicators

7.4.1. Decision Cycle Time Reduction

7.4.2. Business Lift Proxies

7.5. Cross-Sector Pattern Portability Framework

7.5.1. Technical Infrastructure Portability

7.5.2. Prompt Engineering Adaptation Requirements

7.5.3. Governance Guardrail Configuration

7.5.4. Human Oversight and Approval Workflow Adaptation

7.5.5. Comprehensive Transferability Decision Matrix

7.6. Strategic Transfer Implementation Roadmap

8. Experimental Results

8.1. Architecture Evaluation

8.2. Distributed MCMC and Spark Diagnostics

8.3. Spark Cluster Resource Utilization Analysis

9. Discussion

Methodological Implications

10. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement