1. Introduction
In enterprise environments, data assets typically span multiple heterogeneous sources and complex integration pipelines [1,2]. Missing values, format anomalies, semantic conflicts, and cross-field inconsistencies arise naturally during data collection and integration. When left unaddressed, the resulting errors amplify bias during model training, degrade prediction accuracy, and trigger cascading costs in practical business operations [3]. Accurate error detection and reliable correction therefore constitute a foundational requirement for building trustworthy data-driven systems.
Integrity constraints, denial constraints, and functional dependencies have long served as the primary instruments for detecting and repairing data violations [1,4]. Well-structured fields benefit from the stability and controllability that constraint enforcement provides, yet errors involving contextual information, cross-field semantic relationships, or implicit domain knowledge remain difficult to capture through manually crafted rules. As data distributions shift across time and business scenarios, the cost of maintaining rule bases grows rapidly [5]. Learning-based systems such as HoloClean [6], Raha [7], and Baran [8] have advanced the state of the art by combining statistical signals with constraints. Still, most of them address detection and correction as loosely coupled steps and lack principled mechanisms for rejecting uncertain repairs.
Large language models have recently demonstrated an ability to leverage contextual understanding and world knowledge for detecting subtle errors and generating plausible repairs in tabular data [9,10,11]. Deploying LLMs directly for error correction, however, imposes heavy computational and latency burdens that conflict with enterprise requirements for high-throughput processing [12,13]. Knowledge distillation addresses this tension by transferring task-specific capabilities from a large teacher model to a compact student model, reducing inference cost while retaining the semantic representations essential for error correction [14,15,16].
Semantic representation alone does not suffice for correcting errors that involve multi-field dependencies or business-rule violations. Repair decisions in complex scenarios should be traceable and auditable, which calls for explicit reasoning over rules and relational constraints. Markov logic networks [17] and statistical relational learning frameworks [18] offer a principled way to combine hard constraints with soft evidence under uncertainty. Neuro-symbolic methods extend this idea by embedding symbolic constraints into neural networks in differentiable form [19,20,21], allowing models to satisfy logical requirements while maintaining representation capacity.
An automatic correction system must also control the risk of writing incorrect values back to the data source. When model confidence is insufficient, the system should produce set-valued suggestions or trigger rejection. Conformal prediction provides distribution-free coverage guarantees under mild exchangeability assumptions [22,23]. Recent work has begun applying conformal methods to data quality tasks [24,25]; most of these efforts, however, perform calibration at a single processing level, and the design of conformal schemes aligned with multi-stage correction pipelines has received limited attention.
The literature reviewed above leaves enterprise data error correction in an uncomfortable position. Learning-augmented systems that combine constraints with statistical signals still treat detection and correction as loosely coupled stages, with little principled support for rejecting uncertain repairs; direct LLM invocation closes part of this gap on semantic errors, yet at an inference cost that the throughput and latency budgets of enterprise pipelines cannot absorb. The picture becomes harder once cross-field repair is required, because traceable reasoning over explicit constraints is needed to write back values that pass auditing, and neither purely statistical repair nor single-model neural approaches deliver such reasoning. Conformal calibration has begun to address the reliability question, but the calibration is typically applied at a single processing level, which fits poorly with pipelines whose confidence profiles differ from one processing stage to another.
To address these difficulties, this paper constructs an enterprise-oriented data error correction framework that matches computational cost to error complexity at the record level, preserves semantic-correction capability without incurring LLM-scale inference cost, makes cross-field reasoning structurally explicit, and provides coverage-controlled prediction sets together with a principled rejection mechanism.
The main contributions of this paper are summarized as follows:
A complexity-aware progressive architecture is proposed that dynamically routes records across a rule-based layer, a distilled-model layer, and a probabilistic-reasoning layer to achieve a controllable balance between correction quality and inference latency.
A neuro-symbolic integration strategy is developed, incorporating a task-specialized distilled language model for semantic representations and a factor graph for structured probabilistic–logical reasoning to tighten cross-field consistency.
A hierarchical conformal prediction scheme is introduced to perform conditional calibration within each processing layer, providing a global coverage lower bound together with a data-driven rejection mechanism for high-risk corrections.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed framework. Section 4 describes the experimental setup and results. Section 5 concludes with a discussion of limitations and future directions.
2. Related Work
2.1. Error Detection and Repair in Tabular Data
Data cleaning has evolved from constraint-based engineering toward learning-augmented pipelines over the past two decades. Classical systems enforce integrity constraints such as functional dependencies and denial constraints to identify and repair violations. Foundational treatments of constraint modeling and repair semantics can be found in Ilyas and Chu [1] and Abiteboul et al. [5]. NADEEF [26] provides a unified programming interface for specifying heterogeneous quality rules. KATARA [27] leverages knowledge bases and crowdsourcing for repair candidate generation.
HoloClean [6] marked a major advance by compiling denial constraints, external data sources, and statistical signals into a single factor graph and performing probabilistic inference over dirty cells. PClean [28] extends this direction with a Bayesian generative model that explicitly separates latent clean values from observed noise. On the detection side, Raha [7] ensembles multiple detection strategies into a configuration-free system requiring minimal user labels, and Baran [8] addresses correction through a unified context representation combined with transfer learning. A systematic comparison by Abedjan et al. [2] concludes that no single detection method dominates across error types, which motivates multi-strategy approaches.
Despite the progress in individual components, most existing systems treat detection and correction as sequential yet loosely coupled steps. Explicit mechanisms for rejecting uncertain repairs are rarely provided, and the computational cost of joint detection–correction pipelines has received limited attention.
2.2. LLMs and Deep Learning for Data Quality
Deep learning has enabled representation learning and end-to-end optimization for several data governance tasks, including entity alignment, error localization, and repair suggestion [29]. Retrieval-augmented generation further enhances factual consistency by incorporating external knowledge sources at inference time [30].
Cocoon [9] decomposes data cleaning into manageable subtasks that leverage LLM semantic understanding alongside statistical error detection, outperforming prior systems on standard benchmarks. LLMClean [10] uses LLMs to automatically generate ontological functional dependencies for context-aware cleaning. Bendinelli et al. [11] investigate LLM agents with iterative feedback loops for systematic error correction. CleanAgent [31] employs autonomous multi-agent workflows for end-to-end data quality management.
All of the above approaches inherit practical limitations of LLMs, including high inference cost, output hallucination, non-deterministic behavior, and context window constraints [9]. The tension between semantic capability and enterprise-scale inference cost therefore remains an unresolved practical bottleneck for LLM-based cleaning systems.
2.3. Knowledge Distillation for Compact Models
Knowledge distillation transfers capabilities from large teacher models to smaller students through soft targets, feature alignment, and multi-task regularization [14]. Sanh et al. [15] demonstrated that a 60% smaller BERT retains 97% of its language understanding performance. Surveys by Gou et al. [16] and Xu and McAuley [32] systematize techniques including temperature scaling, sample reweighting, and structural constraints, providing methodological foundations for task-specific distillation.
In the era of large language models, distillation has evolved beyond output-level alignment. Rationale-based methods distill chain-of-thought reasoning from a teacher into a student model [33]. Probe-based approaches extract richer training signals from intermediate representations [34]. Recent work on uncertainty-aware distillation provides Bayesian frameworks that equip student models with calibrated confidence estimates [35].
Distillation specifically tailored for data error correction, where the student must simultaneously handle format validation, semantic repair, and constraint satisfaction, remains underexplored, and task-oriented distillation objectives with capability-preserving auxiliary losses have not been systematically investigated in this setting.
2.4. Conformal Prediction and Reliable Decision-Making
Conformal prediction constructs prediction sets with finite-sample coverage guarantees under the exchangeability assumption [22,23]. Split conformal prediction and its variants have been widely adopted for uncertainty quantification in classification and regression settings [36]. Recent surveys highlight the versatility of conformal methods across data modalities [37] and their promise for NLP applications where model calibration is often poor [38].
Applying conformal methods to data quality is a recent development. Jäger and Biessmann [24] combine imputation with conformal prediction to improve downstream ML performance, demonstrating that uncertainty-aware cleaning can outperform deterministic approaches. Zhan et al. [25] apply inductive conformal prediction to biomedical data cleaning, using reliability metrics to selectively correct mislabeled samples.
Existing conformal approaches for data cleaning operate at a single processing level, and calibration schemes aligned with the heterogeneous confidence profiles produced by the different processing stages of a multi-stage pipeline have not been systematically developed.
Reading these threads together, three observations stand out. Rule-based systems, language-model distillation, and uncertainty quantification each address part of the enterprise-cleaning problem, but the points where they should connect are largely unclaimed: rules pass to learned models without a budget on inference cost, and learned models return predictions without a budget on coverage. Cross-field consistency in particular tends to fall through the cracks, because constraint-driven and neural pipelines optimize different objectives. The framework developed in the next section is built around exactly these connection points, treating routing, reasoning, and calibration as a single design problem rather than three loosely linked components.
3. Proposed Framework
3.1. Problem Formulation
Consider a relational table with m attributes $A_1, \dots, A_m$. Each record $x = (x_1, \dots, x_m)$ is a tuple of field values drawn from the respective attribute domains. A subset of fields in x may contain errors, including format violations, missing entries, semantic conflicts, or cross-field inconsistencies.
The error correction task is defined at two levels. At the cell level, the system produces a binary error mask $e \in \{0,1\}^m$ indicating which fields are erroneous, together with a repair value $\hat{x}_i$ for each flagged field i. At the record level, the system outputs a candidate set of repair suggestions along with a confidence distribution over the candidates.
The framework aims to minimize an expected risk that jointly accounts for computational cost and correction error:
\[
\min_{r}\; \mathbb{E}_{x}\Big[\, C\big(x, r(x)\big) + \lambda\, \mathcal{L}\big(x, r(x)\big) \Big], \tag{1}
\]
where $r(x)$ denotes the routing decision that assigns x to one of three processing layers, $C$ measures the combined latency and resource consumption, $\mathcal{L}$ is the task loss for detection and correction, and $\lambda$ controls the trade-off between efficiency and accuracy.
3.2. Overall Architecture
The three-layer structure was settled by iterative empirical profiling rather than fixed in advance. Preliminary experiments showed that no single processing mechanism met both the accuracy and the latency requirements across all error types: the ablation in Section 4.4 reports that relying on the language model alone (No-Logic-Layer) degrades performance on cross-field inconsistencies, and applying factor-graph reasoning to every record (No-Routing) inflates latency without a corresponding repair gain. The error types observed in enterprise tables, namely format violations, semantic conflicts, and multi-field inconsistencies, mapped naturally to three different solvers, so the pipeline was decomposed accordingly. The decomposition keeps simple errors on the deterministic path and reserves probabilistic–logical reasoning for records where the cost is justified.
The framework consists of three processing layers connected through feature reuse and signal sharing, as illustrated in Figure 1.
The first layer (L1) applies deterministic rules, regular expressions, domain dictionaries, and cross-field constraints. It handles format and consistency errors with low latency and produces a binary rule-trigger vector that records which hard rules were activated. The second layer (L2) employs a task-specialized distilled language model to perform semantic error detection and correction on records of medium complexity. The third layer (L3) integrates the student model’s representation with explicit rules on a factor graph, performing structured probabilistic–logical reasoning for records with complex cross-field dependencies.
A complexity-aware router sits upstream of the three layers and assigns each incoming record to the appropriate layer. Figure 2 depicts the end-to-end processing workflow, including the routing decision and the information flow across layers.
3.3. Two-Stage Complexity Assessment and Routing
A central design requirement is that routing must be computationally cheaper than the downstream layers it serves. A two-stage routing strategy is therefore adopted so that the routing stage avoids any dependence on the full student model.
3.3.1. Stage 1: Lightweight Pre-Screening
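In the first stage, a lightweight complexity score is computed using only features that do not require running the student model. A pattern deviation score $s_{\mathrm{pat}}(x)$ is defined as
\[
s_{\mathrm{pat}}(x) = \mathrm{norm}\big(\mathrm{PPL}_{5}(x)\big) + \gamma \cdot \mathrm{norm}\big(n_{\mathrm{fail}}(x)\big), \tag{2}
\]
where $\mathrm{PPL}_{5}(x)$ is the perplexity of a 5-gram character-level language model trained on clean schema-conforming data, and $n_{\mathrm{fail}}(x)$ is the count of failed validation checks (e.g., regex mismatches) from the rule base. The function $\mathrm{norm}(\cdot)$ denotes min–max normalization computed on the training set, and $\gamma$ is a scaling coefficient.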
Three design choices in Equation (2) deserve explicit justification: how the two components are combined, how each is normalized, and what n-gram order is used. The two components are combined additively rather than multiplicatively or through a learned weighted sum. A multiplicative form collapses to zero whenever the failed-check count is zero, which would mask records whose formats pass all regex checks yet exhibit anomalously high character-level perplexity, a common pattern for context-dependent errors. A learned weighted sum was considered but rejected at this stage, since the routing score enters downstream threshold estimation (Equation (7)), and keeping the pattern deviation score in a fixed, interpretable form preserves the separability between score definition and threshold optimization. For normalization, min–max is applied rather than Z-score or robust scaling. Because the failed-check count is a small non-negative integer, min–max produces a dimensionless quantity in [0, 1] that is compatible with the normalized perplexity in the same additive expression. Heavy-tailed behavior of the perplexity across domains is addressed by clipping the training-set perplexity at the 99th percentile before normalization, which bounds the influence of isolated outliers without discarding them. The character-level model is set to order five rather than to a token-level or subword alternative. Field-level content in relational tables is short, frequently contains out-of-vocabulary strings (identifiers, codes, mixed-script names), and does not tolerate the segmentation errors introduced by subword tokenizers. Character-level modeling sidesteps tokenization entirely; the order of five balances context sensitivity against storage cost, and a sensitivity analysis over the n-gram order is reported in Section 4.4.
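The additive score and its normalization can be illustrated with a minimal Python sketch. It assumes that per-record perplexities and failed-check counts have already been computed; the helper names (`fit_minmax`, `minmax`, `pattern_deviation`), the toy training values, and the default `gamma=1.0` are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_minmax(values, clip_pct=None):
    """Return (lo, hi) bounds from training values; optionally cap hi at a percentile."""
    v = np.asarray(values, dtype=float)
    hi = np.percentile(v, clip_pct) if clip_pct is not None else v.max()
    return float(v.min()), float(hi)

def minmax(value, bounds):
    """Min-max normalize one value into [0, 1], clipping anything outside the bounds."""
    lo, hi = bounds
    if hi <= lo:
        return 0.0
    return float(np.clip((value - lo) / (hi - lo), 0.0, 1.0))

def pattern_deviation(ppl, n_failed, ppl_bounds, fail_bounds, gamma=1.0):
    """Additive combination: normalized perplexity plus scaled normalized failure count."""
    return minmax(ppl, ppl_bounds) + gamma * minmax(n_failed, fail_bounds)

# Toy training statistics (hypothetical): one heavy-tailed perplexity outlier.
train_ppl = [2.0, 3.0, 4.0, 5.0, 400.0]
train_fail = [0, 0, 1, 2, 4]
ppl_bounds = fit_minmax(train_ppl, clip_pct=99)   # clip at the 99th percentile
fail_bounds = fit_minmax(train_fail)

score_clean = pattern_deviation(2.0, 0, ppl_bounds, fail_bounds)
score_odd = pattern_deviation(390.0, 0, ppl_bounds, fail_bounds)
```

In this toy run the second record passes every validation check yet still receives a non-zero score through its perplexity term, which is the behavior the additive form is meant to preserve.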
A contextual dependency score
is also computed to approximate the statistical coupling between the target field and its neighbors, using pointwise mutual information with Laplace smoothing:
where
is the target attribute,
is the set of functionally dependent attributes, and
uses add-one smoothing on empirical counts.
Records whose pattern deviation score and contextual dependency score both fall below their respective thresholds are routed directly to L1. The remaining records proceed to Stage 2.
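A rough sketch of pointwise mutual information with add-one smoothing is given below. The way the per-attribute values are aggregated into a single contextual dependency score is not recoverable from the text, so the plain average in `context_score` is an assumption, as are the function names and the toy city/country pairs.

```python
import math
from collections import Counter

def smoothed_pmi(pairs, a, b):
    """PMI of values (a, b) from co-occurrence pairs, with add-one smoothing."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(p[0] for p in pairs)
    right = Counter(p[1] for p in pairs)
    va, vb = len(left), len(right)            # observed vocabulary sizes
    p_ab = (joint[(a, b)] + 1) / (n + va * vb)
    p_a = (left[a] + 1) / (n + va)
    p_b = (right[b] + 1) / (n + vb)
    return math.log(p_ab / (p_a * p_b))

def context_score(pmi_values):
    """Assumed aggregation: mean PMI over the functionally dependent attributes."""
    return sum(pmi_values) / len(pmi_values)

# Toy clean data (hypothetical): city and country columns.
pairs = [("Berlin", "DE")] * 8 + [("Paris", "FR")] * 8
pmi_consistent = smoothed_pmi(pairs, "Berlin", "DE")
pmi_conflict = smoothed_pmi(pairs, "Berlin", "FR")
```

A consistent pair yields positive PMI and a conflicting pair yields negative PMI; smoothing keeps the value finite for pairs never seen in the clean data.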
3.3.2. Stage 2: Entropy-Augmented Routing
For records that pass Stage 1, a single forward pass through a frozen, lightweight projection head of the student model produces a predictive distribution over each potentially erroneous field. The semantic uncertainty is then computed as
where
is the set of suspicious fields identified in Stage 1 by the rule layer,
is the candidate vocabulary, and
is the softmax distribution at temperature
.
The reliability of the semantic uncertainty as a complexity signal depends on whether the underlying predictive distribution is calibrated. The temperature used in Equation (4) is not a free hyper-parameter but the outcome of post hoc temperature scaling [39]: after distillation, a single scalar temperature is tuned on the validation set by minimizing negative log-likelihood, and the resulting value is frozen before routing thresholds are estimated. The empirical calibration quality of the resulting distribution and the monotone relation between the semantic uncertainty and downstream correction error are reported later in Section 4.3.
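Post hoc temperature scaling and the resulting entropy can be sketched as follows. This is a simplified illustration: the temperature is found by a grid search rather than a gradient-based optimizer, the entropy is averaged over fields as an assumption, and the logits and labels are toy values.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    p = softmax(logits, temperature)
    idx = np.arange(len(labels))
    return float(-np.mean(np.log(p[idx, labels] + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    """Pick the scalar temperature that minimizes validation NLL (grid search)."""
    grid = np.linspace(0.5, 5.0, 91) if grid is None else grid
    return float(min(grid, key=lambda t: nll(logits, labels, t)))

def mean_entropy(logits, temperature):
    """Average Shannon entropy of the tempered distributions (assumed aggregation)."""
    p = softmax(logits, temperature)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

# Toy validation set: an overconfident model that is wrong on one of four fields.
val_logits = [[8, 0, 0], [8, 0, 0], [8, 0, 0], [0, 8, 0]]
val_labels = np.array([0, 1, 0, 1])
T = fit_temperature(val_logits, val_labels)
```

Because the toy model is overconfident, the fitted temperature exceeds one and the tempered distributions carry higher entropy than the raw ones.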
The final complexity score combines all three components:
where
is the logistic function and the weights satisfy
.
Table 1 summarizes the domain, normalization, and approximate per-record cost of each component.
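The routing decision is then as follows:
\[
r(x) =
\begin{cases}
\mathrm{L1}, & c(x) < \tau_1, \\
\mathrm{L2}, & \tau_1 \le c(x) < \tau_2, \\
\mathrm{L3}, & c(x) \ge \tau_2,
\end{cases} \tag{6}
\]
where $c(x)$ is the final complexity score. The thresholds $\tau_1$ and $\tau_2$ are estimated by constrained empirical risk minimization on the validation set:
\[
(\tau_1, \tau_2) = \arg\min_{\tau_1 < \tau_2}\; T(\tau_1, \tau_2) \quad \text{s.t.} \quad \mathrm{Acc}(\tau_1, \tau_2) \ge A_{\min}, \tag{7}
\]
where $T$ is the measured latency, $\mathrm{Acc}$ is the validation accuracy, and $A_{\min}$ is a user-specified accuracy lower bound.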
To stabilize routing near the decision boundaries, a boundary regularization term is added that penalizes abrupt changes in the routing probability for records whose complexity scores lie within a margin of the thresholds.
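The threshold search can be pictured as a small grid search: among all threshold pairs that meet the accuracy bound on the validation set, keep the pair with the lowest mean latency. The sketch below is a simplification under stated assumptions: it omits the boundary regularization term, and the per-layer correctness flags, latencies, and grid are toy values.

```python
import itertools

def route(score, tau1, tau2):
    """Three-way routing by complexity score: layer 1, 2, or 3."""
    if score < tau1:
        return 1
    if score < tau2:
        return 2
    return 3

def select_thresholds(scores, correct_by_layer, latency_by_layer, acc_min, grid):
    """Return (latency, tau1, tau2) with minimal mean latency subject to accuracy >= acc_min."""
    best = None
    for tau1, tau2 in itertools.combinations(sorted(grid), 2):   # ensures tau1 < tau2
        layers = [route(s, tau1, tau2) for s in scores]
        acc = sum(correct_by_layer[l][i] for i, l in enumerate(layers)) / len(scores)
        lat = sum(latency_by_layer[l] for l in layers) / len(scores)
        if acc >= acc_min and (best is None or lat < best[0]):
            best = (lat, tau1, tau2)
    return best

# Toy validation data (hypothetical): correctness of each layer on five records.
scores = [0.1, 0.2, 0.5, 0.6, 0.9]
correct_by_layer = {1: [1, 1, 0, 0, 0], 2: [1, 1, 1, 1, 0], 3: [1, 1, 1, 1, 1]}
latency_by_layer = {1: 1.0, 2: 10.0, 3: 100.0}
best = select_thresholds(scores, correct_by_layer, latency_by_layer,
                         acc_min=1.0, grid=[0.0, 0.3, 0.7, 1.0])
```

With these toy values the search settles on thresholds 0.3 and 0.7, which send each record to the cheapest layer that still repairs it correctly.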
3.3.3. Routing Error Analysis
The quality of the final correction depends on both each layer’s repair capability and the router’s ability to assign every record to a layer that can handle it. To make this dependence explicit, an oracle routing baseline is introduced. For each record in the validation set, all three layers are executed and the oracle layer is defined as the lowest-cost layer that produces a correct repair. The routing error rate is then the fraction of records for which the router’s assignment differs from the oracle choice. Two failure modes are distinguished: under-routing, which assigns a record to a layer below the oracle and typically produces a wrong repair, and over-routing, which assigns a record to a layer above the oracle and wastes computation without harming quality. The two failure modes carry asymmetric costs, and the routing objective in Equation (7) explicitly penalizes under-routing through the accuracy constraint while tolerating bounded over-routing through the latency objective. Section 4.4 reports the empirical confusion matrix between the router’s assignment and the oracle, together with the downstream repair accuracy conditioned on each failure mode. A first-order regret argument further shows that when the two thresholds are chosen to satisfy the constraint in Equation (7), the excess risk relative to oracle routing is bounded by the weighted sum of the miscoverage rates on the two threshold bands plus the layer-wise repair-accuracy gaps, which in practice scales linearly with the routing error rate and vanishes as the complexity score becomes more discriminative.
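The oracle comparison reduces to a few lines of bookkeeping. In the sketch below, the function names are hypothetical, and records that no layer repairs correctly are excluded from the rates; that exclusion is an assumption, since the text does not say how such records are counted.

```python
def oracle_layer(correct_by_layer):
    """Lowest-cost layer (1 < 2 < 3) with a correct repair; None if no layer is correct."""
    for layer in (1, 2, 3):
        if correct_by_layer[layer]:
            return layer
    return None

def routing_report(routed, oracle):
    """Routing error rate split into under-routing and over-routing."""
    pairs = [(r, o) for r, o in zip(routed, oracle) if o is not None]
    n = len(pairs)
    under = sum(r < o for r, o in pairs) / n   # routed below the oracle layer
    over = sum(r > o for r, o in pairs) / n    # routed above the oracle layer
    return {"error": under + over, "under": under, "over": over}

# Toy example: four records, one under-routed and one over-routed.
oracle = [oracle_layer(c) for c in (
    {1: True, 2: True, 3: True},
    {1: False, 2: False, 3: True},
    {1: False, 2: True, 3: True},
    {1: True, 2: True, 3: True},
)]
report = routing_report([1, 2, 3, 1], oracle)
```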
3.4. Task-Specialized Knowledge Distillation
The second layer employs a distilled language model trained through task-oriented knowledge distillation.
GPT-4o is used as the teacher model and is invoked via API in a few-shot setting to generate cell-level correction labels on the training set.
The student model is BERT-base (110 M parameters), initialized from the pre-trained checkpoint and fine-tuned with a masked-field prediction objective.
3.4.1. Input Formulation
Each record is serialized as , with [MASK] iterating over the fields flagged as erroneous in Stage 1. The candidate set is populated differently by field type: the full domain vocabulary for categorical fields, the top-k () teacher predictions for free-text fields, and a constrained schema-compliant token set for open-ended string repair. This typed candidate construction, rather than the masked-prediction backbone itself, is the design element specific to the data-correction setting.
3.4.2. Distillation Loss
Let
and
denote the teacher and student predictive distributions, and let
and
denote the aligned intermediate representations at layer
ℓ. The total training loss is
where
is the temperature,
P is a learnable projection matrix,
re-weights samples to emphasize the medium-complexity interval, and
is the boundary regularization described in
Section 3.3.
Concretely, is implemented as a Gaussian centered at the midpoint with bandwidth equal to , normalized so that its mean over the training set equals one; this places higher weight on records whose complexity score sits within the L2 routing band, where the distillation signal is most informative.
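The reweighting can be sketched as a Gaussian bump over the complexity score, rescaled to mean one. The sketch assumes that the midpoint is the center of the L2 routing band, that is, the average of the two routing thresholds; the bandwidth passed in below is a placeholder value, and the function name is hypothetical.

```python
import math

def band_weights(scores, tau1, tau2, bandwidth):
    """Gaussian sample weights centered on the L2 band midpoint, normalized to mean one."""
    mid = 0.5 * (tau1 + tau2)                  # assumed center of the L2 routing band
    raw = [math.exp(-0.5 * ((s - mid) / bandwidth) ** 2) for s in scores]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

# Toy complexity scores; thresholds and bandwidth are illustrative placeholders.
weights = band_weights([0.1, 0.3, 0.5, 0.7, 0.9], tau1=0.3, tau2=0.7, bandwidth=0.2)
```

Normalizing to mean one keeps the overall loss scale unchanged, so the reweighting shifts emphasis toward medium-complexity records without altering the effective learning rate.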
The set indexes three capability-preserving auxiliary losses. The format loss is a cross-entropy objective on regex-checkable fields, training the student to replicate format validation. The semantic loss is a contrastive objective that aligns the student’s embedding of a clean field value with the teacher’s embedding and pushes apart the embeddings of erroneous values. The logic loss is a binary cross-entropy objective on whether a given pair of field values satisfies a sampled functional dependency, training the student to internalize basic constraint judgments.
After distillation, the student model produces a base semantic representation for each record, which serves as input to the third layer.
A subtle point regarding the teacher choice merits explicit clarification. Although GPT-4o serves as the teacher for L2 and is also retained as the GPT-4o-Direct baseline in Section 4.1, the teacher upper-bounds only the standalone semantic-correction capability of L2; it does not upper-bound the framework’s record-level repair quality. L1 handles deterministic format errors more reliably than free-form generation, L3 enforces relational and arithmetic constraints that GPT-4o operating field by field cannot guarantee, and the conformal rejection mechanism filters low-confidence student outputs that would otherwise propagate. The framework can therefore exceed the teacher on aggregate metrics even though L2 alone cannot, as confirmed by the per-layer and per-method breakdowns in Section 4.2 and Section 4.4.
3.4.3. Training Configuration
All distillation experiments use a batch size of 64, a learning rate of with linear warm-up over the first 10% of steps, the AdamW optimizer, a maximum sequence length of 256 tokens, and temperature . Training runs for 30 epochs with early stopping based on cell-level F1 on the validation set. All experiments are conducted on a single NVIDIA A100 GPU (40 GB).
3.5. Neural Probabilistic–Logical Reasoning
Records routed to the third layer require structured reasoning that jointly considers neural semantic representations and explicit relational constraints.
A factor graph is constructed whose variables correspond to the uncertain field values in a record and whose factors encode both data-driven features and domain rules.
3.5.1. Graph Construction
For a record x routed to L3, each field flagged as potentially erroneous becomes a variable node . The domain of is the candidate set produced by the student model (top-k predictions).
Two types of factors are defined. Rule factors instantiate hard and soft rules, where each contributes a weighted indicator.
Feature factors couple the neural representation to the variable assignments.
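The combined representation $h(x)$ is formed by concatenating the student encoder output $h_{\mathrm{stu}}(x)$, a task-specific incremental network output $h_{\mathrm{inc}}(x)$, and the rule-trigger vector $v_{\mathrm{rule}}(x)$:
\[
h(x) = h_{\mathrm{stu}}(x) \oplus h_{\mathrm{inc}}(x) \oplus v_{\mathrm{rule}}(x),
\]
where ⊕ denotes vector concatenation.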
3.5.2. Rule Templates
Hard rules are deterministic constraints derived from the schema. Examples include:
If , then .
.
Soft rules capture statistical regularities that hold in most cases. Examples include:
If , then is likely “VIP” (weight 0.8).
If and , then is likely “corporate” (weight 0.7).
Hard rules receive a fixed large weight; soft-rule weights are learned.
3.5.3. Joint Distribution and Inference
The energy-based joint distribution over the variables is
where
Z denotes latent variables and
collects all learnable parameters.
Loopy belief propagation is adopted for approximate marginal inference, with a maximum of 20 message-passing iterations, and model parameters are learned by maximizing the evidence lower bound under a variational posterior .
The novel component is the generator that produces soft-rule weights from the combined representation:
with Gumbel temperature
annealed linearly from 1.0 to 0.1 over 50 epochs. Factor-graph training is conducted sequentially after distillation has converged (
Section 3.4), so the 50-epoch schedule here is independent of the 30-epoch distillation budget and the student parameters are frozen except during the neural-parameter phase described below. This construction differs from fixed-weight Markov logic networks in that soft-rule weights are conditioned on each record through
, which allows the same soft rule to exert different influence across records depending on their neural evidence. Training alternates between optimizing the factor and neural parameters with soft-rule weights frozen and optimizing the generator with neural parameters frozen, and it stops when the validation F1 and explanation consistency improve by less than
for five consecutive epochs.
3.5.4. Evidence Output
For each repair decision, the third layer outputs the posterior probability of the chosen candidate, the set of activated rules with their weights, and the sequence of belief-propagation messages that led to the final marginal. An example evidence trace for a corrected city field might read: rule R1 (province-city constraint, weight 1.0) activated; rule R7 (address-zipcode soft rule, weight 0.82) activated; posterior .
3.6. Hierarchical Conformal Prediction and Rejection
After the three processing layers produce their corrections, the system must assess the reliability of each repair before writing it back to the data source.
Conformal prediction is applied independently at each layer, and the coverage guarantees are then aggregated at the global level.
Figure 3 illustrates the overall confidence quantification and rejection workflow.
3.6.1. Candidate Space
The candidate label space over which prediction sets are constructed differs by field type. For categorical fields, is the full set of valid domain values. For numerical fields, is a discretized set of candidate values centered on the model’s point prediction.
For free-text fields,
is restricted to the top-
k candidates generated by the student model (with
in the experiments reported in
Section 4).
In all cases, the conformal procedure operates over a finite candidate set.
3.6.2. Nonconformity Score
At each layer
ℓ, a nonconformity score
is defined as a weighted combination of three components:
where
captures the prediction confidence at layer
ℓ,
is the entropy of the predictive distribution, and
measures the representation distance between
x and its nearest neighbors in the calibration set. The weight vector
is non-negative, sums to one, and is estimated by minimizing the average prediction set size on the calibration set subject to the coverage constraint.
3.6.3. Layer-Wise Calibration
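After the routing function $r$ and all model parameters are fixed, an independent calibration set is constructed for each layer:
\[
\mathcal{D}^{\mathrm{cal}}_{\ell} = \big\{ (x_i, y_i) : r(x_i) = \ell \big\}.
\]
The calibration samples do not participate in training the model or the router. For each calibration sample, the nonconformity score is computed and the scores are sorted in ascending order. Given a nominal error level $\alpha \in (0, 1)$, the quantile threshold at layer ℓ is
\[
\hat{q}_{\ell} = s_{\ell,(k)}, \qquad k = \big\lceil (n_{\ell} + 1)(1 - \alpha) \big\rceil,
\]
where $n_{\ell} = |\mathcal{D}^{\mathrm{cal}}_{\ell}|$ and $s_{\ell,(k)}$ is the $k$-th smallest calibration score at layer ℓ. For a new record x routed to layer ℓ, the prediction set is
\[
\mathcal{C}_{\ell}(x) = \big\{ y \in \mathcal{Y} : s_{\ell}(x, y) \le \hat{q}_{\ell} \big\},
\]
where $\mathcal{Y}$ is the finite candidate set and $s_{\ell}$ is the layer-ℓ nonconformity score.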
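Layer-wise split conformal calibration can be sketched in a few lines. The code below follows the standard split conformal recipe (the quantile at rank ⌈(n+1)(1−α)⌉ of the sorted calibration scores); the function names, the toy calibration scores, and the candidate values are hypothetical.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Split conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    s = sorted(cal_scores)
    n = len(s)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the threshold is infinite (set = all candidates).
    return float("inf") if k > n else s[k - 1]

def prediction_set(candidate_scores, q):
    """Keep every candidate whose nonconformity score does not exceed the threshold."""
    return [y for y, score in candidate_scores.items() if score <= q]

# Toy calibration scores for records routed to one layer.
cal_scores_l2 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q_l2 = conformal_quantile(cal_scores_l2, alpha=0.2)
city_set = prediction_set({"Berlin": 0.10, "Bern": 0.70, "Bonn": 0.95}, q_l2)
```

Running the same two functions on each layer’s own calibration scores yields one threshold per layer, so a layer with noisier outputs automatically produces larger prediction sets.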
3.6.4. Assumptions and Global Coverage
The per-layer coverage guarantee relies on two assumptions:
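Assumption 1. Conditional exchangeability. Within each layer ℓ, the calibration samples and the test samples are exchangeable conditional on $r(x) = \ell$.
Assumption 2. Fixed routing. The routing function $r$ depends only on the input features and is fixed before calibration begins.
Under Assumption 1, each layer satisfies
\[
\Pr\big( y \in \mathcal{C}_{\ell}(x) \,\big|\, r(x) = \ell \big) \ge 1 - \alpha,
\]
where $\mathcal{C}_{\ell}(x)$ is the layer-ℓ prediction set and $\alpha$ is the nominal error level. By the law of total probability, the global coverage follows
\[
\Pr\big( y \in \mathcal{C}_{r(x)}(x) \big) = \sum_{\ell} \Pr\big( r(x) = \ell \big)\, \Pr\big( y \in \mathcal{C}_{\ell}(x) \,\big|\, r(x) = \ell \big) \ge 1 - \alpha.
\]
Conditional exchangeability may be violated in practice when enterprise data exhibit temporal drift or when the error distribution shifts between calibration and deployment. Section 4 reports additional experiments under a temporal split to evaluate the robustness of the coverage guarantee under mild distribution shift.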
3.6.5. Rejection Mechanism
On top of the conformal prediction sets, a rejection option is introduced based on maximum posterior confidence.
Let $p^{\max}_{\ell}(x)$ denote the highest posterior probability at layer ℓ. A rejection threshold is set to a prescribed quantile of the empirical distribution of $p^{\max}_{\ell}$ on the calibration set. When $p^{\max}_{\ell}(x)$ falls below this threshold, the system withholds a single-point repair and instead returns the full prediction set along with the associated evidence. The record is then routed to a human-review channel according to the deployment policy. Because the threshold is derived from a data-driven quantile, the rejection rate adapts to the confidence profile of each layer, and slow distributional shifts move the quantile rather than the rejection criterion itself.
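The rejection rule can be illustrated with a short sketch. The quantile level used below (0.2) is a placeholder rather than the paper’s setting, and the function names and the returned dictionary layout are hypothetical.

```python
import numpy as np

def rejection_threshold(cal_max_posteriors, level):
    """Empirical quantile of the maximum posterior on one layer's calibration set."""
    return float(np.quantile(np.asarray(cal_max_posteriors, dtype=float), level))

def decide(posterior, pred_set, threshold):
    """Write back the top candidate, or defer to review when confidence is too low."""
    best = max(posterior, key=posterior.get)
    if posterior[best] < threshold:
        return {"action": "review", "candidates": pred_set}
    return {"action": "write_back", "value": best}

# Toy calibration confidences for one layer; 0.2 is an illustrative quantile level.
threshold = rejection_threshold([0.5, 0.6, 0.7, 0.8, 0.9], level=0.2)
low = decide({"A": 0.55, "B": 0.45}, ["A", "B"], threshold)
high = decide({"A": 0.90, "B": 0.10}, ["A"], threshold)
```

Because the threshold is a quantile of the layer’s own calibration confidences, roughly the same share of records is deferred in each layer regardless of how confident that layer tends to be.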
3.7. Operational Considerations
The proposed framework maintains four synchronously updated components: the rule base at L1, the distilled student model at L2, the factor graph and its soft-rule generator at L3, and the per-layer conformal calibration sets. Each component has a distinct update cadence and failure mode, and the interactions among them determine the realistic maintenance burden.
Rule changes in the schema or business logic directly modify L1 output and therefore the rule-trigger vector that feeds L3. A change to a hard rule requires re-running the factor-graph inference on affected records and re-computing the L3 calibration set, but does not require re-training the student model. A change to a soft rule only requires re-training the soft-rule weight generator on L3 together with a refresh of the L3 calibration set. A change that introduces a new field or a new error type, by contrast, invalidates the student’s serialization template and requires a full re-distillation on the extended training set.
Model drift is handled through periodic re-calibration of the conformal quantiles rather than through re-training. A rolling calibration window (for instance, the most recent thirty days of labeled data) keeps the quantiles aligned with the current data distribution and preserves the coverage guarantee under slow shifts. The student model itself is re-distilled at a slower cadence, typically when the rolling miscoverage gap on the calibration window exceeds a deployment-specified tolerance.
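A minimal sketch of the rolling-window check, under stated assumptions: the record layout (`day`, `truth`, `prediction_set`), the absolute-value form of the miscoverage gap, and the tolerance of 0.05 are all hypothetical and chosen only to make the example concrete.

```python
from datetime import date, timedelta


def rolling_window(labeled_records, today, days=30):
    """Keep labeled calibration records from the most recent `days` days."""
    cutoff = today - timedelta(days=days)
    return [r for r in labeled_records if cutoff < r["day"] <= today]


def miscoverage_gap(window, nominal):
    """Absolute gap between nominal coverage and empirical coverage."""
    if not window:
        raise ValueError("calibration window is empty")
    covered = sum(1 for r in window if r["truth"] in r["prediction_set"])
    return abs(nominal - covered / len(window))


def needs_redistillation(window, nominal, tolerance):
    """Trigger the slower re-distillation cadence when the gap is too wide."""
    return miscoverage_gap(window, nominal) > tolerance


# Hypothetical labeled stream: one record every five days, one miss.
today = date(2024, 6, 30)
stream = [
    {"day": today - timedelta(days=5 * i),
     "truth": "a",
     "prediction_set": {"b"} if i == 2 else {"a"}}
    for i in range(10)
]
window = rolling_window(stream, today)
```

In this sketch the conformal quantiles would be recomputed on `window` at every refresh, while `needs_redistillation` is consulted only to decide whether the student itself should be retrained.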
Version coupling is enforced across components. Each deployed pipeline is pinned to a triple of artifacts: the student model checkpoint, the factor-graph parameter set including the soft-rule generator, and the per-layer conformal quantile table. These artifacts are versioned together and any mismatch triggers a deployment abort, which prevents silent coverage violations when an updated student is paired with a stale calibration set.
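The pinning check can be illustrated with a small guard that compares the loaded artifacts against the pinned triple. The class names, field names, and version strings below are hypothetical; the sketch only shows the abort-on-mismatch behaviour described above.

```python
from dataclasses import dataclass, fields


class DeploymentAbort(RuntimeError):
    """Raised when loaded artifacts do not match the pinned triple."""


@dataclass(frozen=True)
class ArtifactTriple:
    student_checkpoint: str
    factor_graph_params: str   # includes the soft-rule generator
    conformal_quantiles: str   # per-layer quantile table


def verify_artifacts(pinned, loaded):
    """Abort unless every loaded artifact matches its pinned version."""
    mismatched = [
        f.name for f in fields(ArtifactTriple)
        if getattr(pinned, f.name) != getattr(loaded, f.name)
    ]
    if mismatched:
        raise DeploymentAbort("version mismatch: " + ", ".join(mismatched))
    return True


pinned = ArtifactTriple("student-v7", "factor-graph-v7", "quantiles-v7")
stale = ArtifactTriple("student-v8", "factor-graph-v7", "quantiles-v7")
```

Calling `verify_artifacts(pinned, stale)` raises `DeploymentAbort`, which corresponds to the case of an updated student paired with a stale calibration set.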
Relative to simpler baselines, the framework’s maintenance cost is higher than that of a single fine-tuned encoder but lower than that of a hand-maintained rule base with comparable coverage, because the student model absorbs most of the semantic cases that would otherwise require hand-coded rules, and the calibration layer automates the confidence estimation that is typically implemented ad hoc in rule-based systems. The incremental engineering burden is concentrated in the versioning and re-calibration pipeline, which is standard infrastructure in production machine learning systems.
4. Experimental Validation and Analysis
4.1. Experimental Setup
Three datasets are used in this study, covering both real enterprise data and public benchmarks.
Table 2 summarizes the key statistics.
The first, ENT-Prod, is derived from the production database of an industry partner and spans three relational tables. It contains 15,247 records with 22 attributes per record and has an overall cell-level error rate of 7.8%, covering format violations, semantic conflicts, cross-field inconsistencies, and missing values. The second dataset, Hospital, is a widely used data-cleaning benchmark containing 1000 records and 19 attributes with approximately 5% erroneous cells, primarily caused by functional dependency violations. The original dirty version is used without additional error injection. The third dataset, Flights, contains 2136 flight delay records. Following standard injection protocols, a dirty version is constructed by applying value substitution, missing-value injection, and format corruption to yield an overall error rate of approximately 10%, while the original clean values serve as ground-truth labels.
Across all datasets, the data are partitioned into training, validation, calibration, and test sets in a 6:1:1:2 ratio. This ratio was chosen to balance competing statistical requirements for the progressive pipeline. First, a minimum of 10% calibration data (e.g., approximately 1500 records for ENT-Prod) was necessary to compute stable empirical quantiles for conformal prediction. Second, a 20% test set was required to maintain sufficient statistical power for record-level paired significance testing. Allocating the remaining 70% primarily to training (60%) rather than validation (10%) ensured stable convergence of the distilled student model. Alternative splits evaluated during preliminary checks (such as 7:1:1:1 or 5:1:1:3) either produced high variance in the conformal thresholds or lacked sufficient test samples to confirm significance, which supports 6:1:1:2 as the default configuration.
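As a concrete illustration of the split sizes, the sketch below partitions record indices in a 6:1:1:2 ratio. It assumes a uniformly shuffled, record-level split with a fixed seed; the paper does not state how the partition was drawn, so the shuffling and the rounding rule (remainder assigned to the test set) are assumptions.

```python
import random


def split_6_1_1_2(n_records, seed=0):
    """Shuffle record indices and cut them into a 6:1:1:2 partition."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = n_records * 6 // 10
    n_val = n_records // 10
    n_cal = n_records // 10
    a, b, c = n_train, n_train + n_val, n_train + n_val + n_cal
    return {
        "train": indices[:a],
        "validation": indices[a:b],
        "calibration": indices[b:c],
        "test": indices[c:],  # receives the rounding remainder
    }


splits = split_6_1_1_2(15247)  # ENT-Prod record count
sizes = {name: len(part) for name, part in splits.items()}
```

For ENT-Prod this gives 9148 training, 1524 validation, 1524 calibration, and 3051 test records, in line with the roughly 1500 calibration records mentioned above.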
Seven baseline methods are compared, as listed in Table 3. The selection is designed to span the four methodological families against which the proposed framework must be positioned: (i) purely rule-based repair (Rule-Only, Rule + Stat), which establishes the deterministic lower bound; (ii) learning-augmented constraint repair (HoloClean, Raha + Baran), which represents the state of the art for structured inconsistencies; (iii) a distillation-free encoder (BERT-FT), which isolates the contribution of the distillation objective from that of the BERT-base backbone shared with the student model; and (iv) an LLM-direct baseline (GPT-4o-Direct), which sets an empirical ceiling on semantic repair quality and an empirical floor on inference efficiency. HoloClean and Raha + Baran are run using their publicly released codebases on the same data splits. GPT-4o-Direct represents an LLM-direct baseline invoked via API in a five-shot setting. Latency figures reflect end-to-end wall-clock time including network overhead and are marked separately in the result tables.
Six quality metrics, two efficiency metrics, and four reliability metrics are evaluated. The quality metrics include cell-level detection precision, recall, F1, repair accuracy on erroneous cells, end-to-end corrected cell accuracy, and record-level complete repair rate. Efficiency is measured by average inference latency per record (p50) and throughput (records per second). Reliability is assessed via actual coverage, average prediction set size, rejection rate, and empirical miscoverage gap. Each configuration is repeated with five random seeds to account for training variance. Significance is assessed via paired t-tests at the record level on the shared test split, with * indicating p < 0.05 and ** indicating p < 0.01 relative to the best non-Ours baseline.
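Two of the quality metrics can be sketched as follows. The definitions are the conventional ones and are assumed here rather than taken from the paper: cells are identified by hypothetical `(record_id, attribute)` pairs, and repair accuracy is computed over the truly erroneous cells only.

```python
def detection_prf(flagged, erroneous):
    """Cell-level detection precision, recall, and F1 over sets of cells."""
    true_positives = len(flagged & erroneous)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(erroneous) if erroneous else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def repair_accuracy(repaired, truth, erroneous):
    """Fraction of erroneous cells whose repaired value equals the ground truth."""
    if not erroneous:
        return 0.0
    correct = sum(1 for cell in erroneous if repaired.get(cell) == truth[cell])
    return correct / len(erroneous)


# Hypothetical toy example: four erroneous cells, three flagged cells.
erroneous = {(1, "city"), (2, "zip"), (3, "phone"), (3, "state")}
flagged = {(1, "city"), (2, "zip"), (1, "name")}
truth = {(1, "city"): "Nanjing", (2, "zip"): "35233",
         (3, "phone"): "2055551234", (3, "state"): "AL"}
repaired = {(1, "city"): "Nanjing", (2, "zip"): "35223"}

precision, recall, f1 = detection_prf(flagged, erroneous)
accuracy = repair_accuracy(repaired, truth, erroneous)
```

In the toy example the zip code is detected but repaired to the wrong value, so it counts toward detection recall but not toward repair accuracy.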
All experiments are conducted on an Ubuntu 20.04 LTS operating system equipped with a single NVIDIA A100 GPU (40 GB), utilizing Python 3.9, PyTorch 2.0.1, and CUDA 11.8.
4.2. Main Results
Table 4 presents the cell-level and record-level metrics alongside efficiency indicators for all methods on the three datasets.
On ENT-Prod, Ours-Full achieves the highest cell-level F1 of 0.926, surpassing the second-best GPT-4o-Direct (0.914) by 1.2 absolute points. The record-level complete repair rate reaches 0.702, exceeding GPT-4o-Direct by 3.1 points and HoloClean by 6.3 points. The gap in record-level performance is substantially larger than the gap in cell-level F1, which indicates that the third-layer factor-graph reasoning resolves cross-field conflicts that other methods leave open: such methods can reach high cell-level recall without achieving full-record consistency. On E2E cell accuracy, Ours-Full (0.983) slightly outperforms GPT-4o-Direct (0.981), confirming that the progressive pipeline introduces few false-positive repairs on clean cells. The throughput of Ours-Full (94 rec/s) is 5.2 times that of GPT-4o-Direct (18 rec/s), making the framework viable for batch processing in enterprise pipelines.
On Hospital, HoloClean performs competitively (F1 = 0.895, Repair Acc. = 0.891) because the dataset is rich in explicit functional dependencies, which HoloClean’s denial-constraint inference directly exploits. Ours-Full still leads in record-level complete repair rate (0.706 vs. 0.672), because the student model captures semantic patterns in address and name fields that purely constraint-driven inference overlooks. BERT-FT lags behind both HoloClean and Raha + Baran on this dataset (Record Repair 0.638 vs. 0.672 and 0.657), which suggests that a fine-tuned encoder without explicit constraint integration is insufficient when functional dependencies dominate.
On Flights, where semantic errors account for the largest share, GPT-4o-Direct achieves a higher cell-level F1 (0.893) than Ours-Full (0.887). The difference arises because GPT-4o operates over an open vocabulary and can resolve low-frequency semantic errors through its broad world knowledge, whereas the student model’s top-k candidate set occasionally misses rare correct values. Despite this cell-level gap, Ours-Full leads on record-level complete repair rate (0.669 vs. 0.648) by 2.1 points, because the factor graph enforces multi-field consistency that GPT-4o, operating field by field, cannot guarantee. The latency advantage remains substantial (12.3 ms vs. 60.9 ms), representing an approximate 5× reduction.
4.3. Routing and Efficiency Analysis
On ENT-Prod, the routing thresholds are obtained by solving the constrained optimization in Equation (7) on the validation set; the corresponding values on Hospital and Flights are reported in parentheses hereafter.
Table 5 compares three routing strategies on ENT-Prod.
Under adaptive routing, approximately 57% of records are handled by L1, 31% by L2, and 12% by L3. Sending all records to L3 (All-to-L3) yields a marginal F1 gain of 0.003, at the expense of more than doubling the p50 latency from 11.9 ms to 27.3 ms. The quality gain is marginal because most records contain only format or simple semantic errors that L1 and L2 already resolve correctly; sending them through the full factor-graph pipeline adds cost without changing the repair outcome. Random routing assigns records uniformly regardless of complexity and produces a lower F1 of 0.903. The 2.3-point drop relative to adaptive routing confirms that the learned complexity function captures meaningful structure and that indiscriminate allocation to expensive layers degrades both efficiency and quality.
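The idea of threshold-based routing can be sketched with a hard two-threshold rule. This is a simplification under stated assumptions: the learned router in the paper is not reproduced here, and the threshold values 40 and 60 as well as the complexity scores are hypothetical.

```python
def route(score, low, high):
    """Assign a record to L1, L2, or L3 from its complexity score."""
    if score < low:
        return "L1"
    if score < high:
        return "L2"
    return "L3"


def routing_shares(scores, low, high):
    """Fraction of records sent to each layer under the given thresholds."""
    counts = {"L1": 0, "L2": 0, "L3": 0}
    for score in scores:
        counts[route(score, low, high)] += 1
    return {layer: count / len(scores) for layer, count in counts.items()}


# Hypothetical complexity scores on a 0-100 scale.
scores = [5, 10, 30, 45, 55, 70, 90, 15, 25, 35]
adaptive = routing_shares(scores, low=40, high=60)
all_to_l3 = routing_shares(scores, low=0, high=0)
```

Setting both thresholds to zero reproduces the All-to-L3 strategy, which makes the comparison in Table 5 a special case of the same rule.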
Figure 4 provides a finer view of routing behavior by partitioning records into five complexity bins.
In the lowest bin (0–20), 96% of records are routed to L1, which aligns with the expectation that simple format violations and dictionary mismatches concentrate at low complexity scores. The 20–40 bin shows the onset of a transition, with 72% of records still assigned to L1 and 24% forwarded to L2, reflecting cases where a slightly elevated character-level perplexity signals the need for semantic analysis. In the 40–60 range, L2 becomes the dominant destination (62%), indicating that records with moderate semantic uncertainty are handled by the distilled student model. Above a complexity score of 60, L3 absorbs the majority of records (62% in the 60–80 bin and 89% in the 80–100 bin), corresponding to records with strong cross-field dependencies and high entropy across multiple candidate values. The transition between adjacent bins is smooth, with no abrupt jumps at the threshold boundaries, which indicates that the boundary regularization term (Section 3.3) prevents oscillatory routing decisions.
Calibration of the Routing Entropy Signal
The use of the projection-head entropy as a routing feature presupposes that this entropy tracks downstream correction error. We verify this on the ENT-Prod validation set. After post hoc temperature scaling (Section 3.3), the expected calibration error of the projection head drops relative to the uncalibrated softmax, and the Spearman rank correlation between per-record entropy and actual correction error rises.
Figure 5 reports the corresponding diagnostics.
Figure 5a is a reliability diagram comparing the uncalibrated and calibrated curves against the perfect-calibration diagonal; the calibrated curve aligns visibly closer to the diagonal across the entire confidence range.
Figure 5b plots the empirical correction error rate across ten equal-frequency entropy bins together with bootstrap confidence intervals. The error rate increases monotonically from the lowest-entropy bin to the highest, and the intervals of adjacent bins are non-overlapping or nearly tangent. The monotone relation between the entropy signal and downstream correction error is therefore established empirically, which justifies its use as a complexity signal for records in the medium-complexity routing band.
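The three quantities used in this diagnostic (temperature-scaled softmax, Shannon entropy, and expected calibration error) can be sketched in plain Python. The equal-width binning for ECE, the logits, and the temperature of 2.0 are assumptions for illustration; the paper's fitted temperature is not reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


uncalibrated = softmax([2.0, 1.0, 0.0])
calibrated = softmax([2.0, 1.0, 0.0], temperature=2.0)
```

Raising the temperature leaves the ranking of candidates unchanged but increases the entropy of the distribution, which is why calibration changes the routing signal without changing the top prediction.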
4.4. Component Ablation
Table 6 presents the ablation results on ENT-Prod. Six variants are evaluated, each disabling a single component while keeping all other settings identical to Ours-Full.
Among the six ablation variants in Table 6, removing the rule-trigger vector (No-Rule-Feat) ranks first by the magnitude of the record-repair drop, followed closely by removing the logical-reasoning layer (No-Logic-Layer). For both variants, the cell-F1 drop and the corresponding record-repair drop are assessed with a paired t-test and Cohen’s d. The pairwise record-repair gap between the two variants is itself significant (paired t-test on the shared test split), but the two effects lie within the same order of magnitude, so the rule-trigger vector and the logic layer are best understood as jointly necessary contributors to cross-field repair quality rather than as a single dominant factor. The decline of No-Rule-Feat is concentrated on records routed to L3, where the factor graph relies on the rule-trigger vector to identify which hard constraints are relevant; without this vector, the factor graph treats all constraints uniformly, leading to repairs that appear plausible at the individual field level yet violate relational dependencies across fields.
Section 4.4.3 reports a complementary ablation restricted to cross-field error cells, which isolates the two components more directly.
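The paired statistics reported for the ablations can be computed as in the sketch below. It assumes that Cohen's d is taken on the paired differences (mean difference divided by the standard deviation of the differences); the paper does not state which variant of d is used, and the per-record scores are hypothetical.

```python
import math
import statistics


def paired_effect(full_scores, ablated_scores):
    """Paired t statistic and Cohen's d on per-record score differences."""
    if len(full_scores) != len(ablated_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(full_scores, ablated_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    return {
        "mean_drop": mean_diff,
        "t_statistic": mean_diff / (sd_diff / math.sqrt(n)),
        "cohens_d": mean_diff / sd_diff,
        "degrees_of_freedom": n - 1,
    }


# Hypothetical per-record scores for the full model and one ablated variant.
result = paired_effect([1.0, 1.0, 0.5, 1.0], [0.5, 0.5, 0.5, 0.5])
```

The p-value then follows from the t distribution with `degrees_of_freedom` degrees of freedom; that lookup is omitted here to keep the sketch free of third-party dependencies.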
No-Logic-Layer, which removes L3 entirely and routes all high-complexity records to L2, reduces the record repair rate by 3.0 points. The latency drops to 8.4 ms because the most expensive processing stage is eliminated. The quality loss confirms that the student model alone cannot enforce arithmetic and relational constraints when multiple fields in a record are simultaneously uncertain.
No-Boundary disables the boundary regularization term and leads to a 1.8-point decline in record repair rate (from 0.702 to 0.684). Inspection of the routing decisions reveals that, without regularization, records near the threshold boundaries oscillate between adjacent layers across training epochs, and the resulting instability degrades the quality of repairs for medium-complexity records.
No-Routing sends every record to L3 and achieves a slightly higher F1 (0.929) and record repair rate (0.705) than Ours-Full, because every record benefits from full-depth processing. The 0.3-point F1 advantage of No-Routing falls within one standard error of Ours-Full and does not reach significance under a paired t-test, whereas the latency increase is both significant and operationally prohibitive: the p50 latency rises from 11.9 ms to 27.3 ms, a 2.3× increase. The contribution of routing is therefore primarily in efficiency: it reduces latency by 56% relative to full-depth processing while incurring only a 0.3-point F1 trade-off that is statistically indistinguishable from noise.
Compared to GPT-4o-Direct, the end-to-end latency reduction averages approximately 80% across the three datasets (81.0%, 81.2%, and 79.8% on ENT-Prod, Hospital, and Flights, respectively).
Removing conformal prediction (No-CP) has a negligible impact on F1 and repair accuracy, yet it eliminates the coverage guarantee and the rejection mechanism. Single-Level-CP restores coverage control but achieves a lower actual coverage (0.941 vs. 0.949) at the 0.95 nominal level and produces larger prediction sets (see Section 4.5 for a detailed comparison).
4.4.1. Routing Score Design Ablation
To examine the design choices behind the pattern deviation score in Equation (2), three groups of ablations are conducted on ENT-Prod: the score-combination scheme, the normalization scheme, and the order of the character-level language model. Table 7 reports the downstream cell-level F1 and the routing error rate (relative to the oracle routing defined in Section 3.3) for each configuration, with all other components held identical to Ours-Full.
The additive combination outperforms both the multiplicative and the learned alternatives. The multiplicative form suffers precisely from the failure mode predicted in Section 3.3: records in which the non-perplexity factor is zero collapse the score to zero and are routed to L1 irrespective of their perplexity, inflating the under-routing rate. The learned weighted sum is competitive but introduces an additional optimization layer between the score and the threshold, which weakens interpretability without a commensurate quality gain. Min–max normalization leads over Z-score and robust scaling by a small margin, consistent with the dimensional-alignment argument given in Section 3.3. The LM-order ablation shows a monotonic trend as the order increases, with diminishing returns beyond trigram, confirming that the chosen order lies in the plateau region rather than being an outlier. Unigram is clearly inadequate because it ignores the character co-occurrence statistics on which detection of corrupted tokens relies.
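The zero-collapse failure mode of the multiplicative form is easy to reproduce numerically. The sketch below is illustrative only: Equation (2) is not reproduced here, so the second score component (`other`) and all values are hypothetical, and the additive form is shown unweighted.

```python
def min_max(values):
    """Min-max normalization to [0, 1]; a constant column maps to zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def additive(perplexity, other):
    """Additive combination: a zero in one component leaves the other intact."""
    return perplexity + other


def multiplicative(perplexity, other):
    """Multiplicative combination: a zero in either component zeroes the score."""
    return perplexity * other


# Hypothetical raw character-level perplexities for three records.
perplexity = min_max([1.2, 3.5, 9.8])
# Hypothetical second component; it is zero for the third record.
other = [0.0, 0.4, 0.0]

additive_scores = [additive(p, o) for p, o in zip(perplexity, other)]
multiplicative_scores = [multiplicative(p, o) for p, o in zip(perplexity, other)]
```

The third record has the highest perplexity in the batch, yet its multiplicative score is zero, so a threshold router would send it to L1; the additive score keeps it at the top of the ranking.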
4.4.2. Routing Error and Oracle Comparison
To isolate the impact of routing decisions from layer-level repair capability, each validation record is executed on all three layers. The oracle layer is the lowest-cost layer that yields a correct repair, and the routing confusion matrix between the learned router and the oracle is reported in Table 8. The agreement rate between the learned router and the oracle can be read from the diagonal of this matrix. Disagreements are sharply asymmetric. Over-routing (assignment to a higher layer than necessary) produces no repair-quality loss, while under-routing (assignment to a lower layer than necessary) lowers the cell F1 on the affected records. Overall, the contribution of under-routing to the aggregate cell-F1 gap relative to oracle routing is within the standard error of the reported Ours-Full score. This asymmetry confirms the design intent that the accuracy constraint in Equation (7) should suppress under-routing even at the cost of tolerating some over-routing.
To make the limited aggregate impact of under-routing concrete, Table 9 reports the cell-level F1 conditioned on each routing-error category. Over-routed records retain a cell F1 that is statistically indistinguishable from that of the agreement group, confirming that over-routing wastes computation without harming quality. Under-routed records suffer an absolute cell-F1 drop relative to the agreement group; weighted by the under-routing proportion, the contribution to the aggregate cell-F1 gap is below the standard error of the reported Ours-Full score. This supports the moderated claim that under-routing has a non-negligible local effect but only a small global effect.
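The oracle comparison and the weighting argument can be sketched as follows. The handling of records that no layer repairs correctly, and the share and drop used in the example, are assumptions and not values from Tables 8 or 9.

```python
LAYERS = ("L1", "L2", "L3")  # ordered by increasing cost


def oracle_layer(correct_by_layer):
    """Lowest-cost layer that yields a correct repair, or None if none does."""
    for layer in LAYERS:
        if correct_by_layer[layer]:
            return layer
    return None


def routing_category(routed, oracle):
    """Classify a routing decision against the oracle."""
    if oracle is None:
        return "unrepairable"  # assumption: excluded from the confusion matrix
    routed_rank, oracle_rank = LAYERS.index(routed), LAYERS.index(oracle)
    if routed_rank == oracle_rank:
        return "agree"
    return "over" if routed_rank > oracle_rank else "under"


def aggregate_gap(under_share, local_f1_drop):
    """Contribution of under-routing to the aggregate gap: share times local drop."""
    return under_share * local_f1_drop


# Hypothetical example: 4% of records under-routed, each losing 0.10 cell F1.
gap = aggregate_gap(0.04, 0.10)
```

The product form shows why a sizeable local drop can still be a small global effect: the aggregate contribution shrinks linearly with the share of under-routed records.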
4.4.3. Cross-Field-Only Ablation
The aggregate ablation in Table 6 averages over four error types of unequal frequency. To isolate the contribution of each component on the error category that motivated its design, Table 10 restricts the evaluation to cells in the ENT-Prod test split that participate in at least one cross-field constraint violation. All variants share the same routing decisions and conformal calibration as in Table 6; only the corresponding component is disabled.
Three observations follow. First, on the cross-field error subset, removing the rule-trigger vector produces the largest drop, ahead of removing the logic layer, and the gap between the two variants is itself significant under a paired t-test. The two components therefore contribute on the same order of magnitude, with the rule-trigger vector ranking first by a margin that exceeds the standard error but does not dominate by an order of magnitude. Second, the boundary regularization and the conformal layer have small or non-significant effects on the cross-field subset, confirming that their contribution operates through routing stability and reliability rather than through cross-field repair capability per se. Third, the non-significant gap of Single-Level-CP and No-CP indicates that the conformal layer affects which corrections are accepted (via rejection) rather than the underlying repair accuracy, consistent with the theoretical role assigned in Section 4.5.
The reading of these results is intentionally conservative: the rule-trigger vector is the largest single contributor on the cross-field subset, but the logic layer is a near-coequal contributor, and removing either alone reproduces only part of the gap. The framework’s cross-field repair quality therefore depends on the joint presence of explicit rule signals and structured probabilistic–logical reasoning, rather than on any single mechanism in isolation.
4.5. Reliability Analysis
Conformal reliability is reported on ENT-Prod because its calibration split (roughly 1500 records, partitioned across three layers) is large enough to yield stable per-layer quantile estimates; Hospital and Flights have an order of magnitude fewer calibration samples, under which any reported coverage gap would be dominated by finite-sample sampling noise rather than by the calibration procedure itself.
Table 11 compares the reliability metrics of Ours-Full and Single-Level-CP at three nominal coverage levels on ENT-Prod.
At all three nominal levels, Ours-Full achieves actual coverage that closely matches the target, with a maximum miscoverage gap of 0.004. Single-Level-CP consistently falls below the nominal target by a wider margin, with the gap reaching 0.009 at the 0.95 level. The reason is that a single calibration set conflates records of heterogeneous complexity. Simple records that L1 handles with near-certainty and complex records that L3 processes with high uncertainty are mixed in the same nonconformity score distribution, making a single quantile threshold either overly conservative on simple records or insufficiently conservative on complex ones. Per-layer calibration sidesteps this problem by constructing a separate, more homogeneous distribution within each layer.
Meanwhile, the average prediction set size of Ours-Full remains smaller than that of Single-Level-CP across all levels (e.g., 1.29 vs. 1.35 at the 0.95 level). Tighter sets translate directly to reduced human review burden, because a singleton prediction set can be accepted automatically whereas a set with two or more candidates requires manual adjudication.
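The difference between per-layer and pooled calibration can be illustrated with standard split conformal prediction. The sketch assumes the common nonconformity score of one minus the probability assigned to the true value and the usual finite-sample rank ceil((n + 1)(1 - alpha)); the paper's exact score is defined in Section 3.6 and may differ, and all calibration scores below are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal threshold at rank ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def calibrate_per_layer(calibration, alpha):
    """One threshold per layer, each from that layer's own calibration scores."""
    return {layer: conformal_quantile(scores, alpha)
            for layer, scores in calibration.items()}


def prediction_set(posterior, threshold):
    """All candidates whose nonconformity score is within the threshold."""
    return {value for value, p in posterior.items() if 1.0 - p <= threshold}


# Hypothetical nonconformity scores (1 - p(true value)) on the calibration set.
calibration = {
    "L1": [0.00, 0.00, 0.01, 0.00, 0.02, 0.00, 0.01, 0.00, 0.03],
    "L3": [0.05, 0.20, 0.35, 0.10, 0.50, 0.15, 0.40, 0.25, 0.65],
}
per_layer = calibrate_per_layer(calibration, alpha=0.1)
pooled = conformal_quantile(calibration["L1"] + calibration["L3"], alpha=0.1)

l3_set = prediction_set({"A": 0.60, "B": 0.38, "C": 0.02}, per_layer["L3"])
```

In this toy example the pooled threshold equals the L3 threshold, so applying it to L1 records would admit far more candidates than the L1-specific threshold does, which mirrors the conflation argument made above.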
Temporal Split Evaluation
To assess robustness under distribution shift, ENT-Prod is re-partitioned by timestamp, using the earliest 80% of records for training, validation, and calibration, and the most recent 20% for testing.
Under this temporal split at the 0.95 level, Ours-Full achieves an actual coverage of 0.938 with a miscoverage gap of 0.012, while Single-Level-CP drops to 0.921 with a gap of 0.029. The coverage of Ours-Full still approximately satisfies the nominal target, although the wider gap compared to the random split confirms that temporal drift weakens the conditional exchangeability assumption. Per-layer calibration retains an advantage because each layer receives a more homogeneous subset of records, reducing the distributional mismatch within each calibration set even when the global distribution shifts.
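A temporal split of this kind can be sketched in a few lines. The record layout and the ISO-formatted `timestamp` field are assumptions; only the 80/20 cut by time is taken from the text.

```python
def temporal_split(records, test_percent=20):
    """Earliest records for train/validation/calibration, most recent for test."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n_test = len(ordered) * test_percent // 100
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical records with ISO dates, deliberately out of order.
records = [
    {"id": i, "timestamp": stamp}
    for i, stamp in enumerate([
        "2024-03-01", "2024-01-15", "2024-05-20", "2024-02-10", "2024-04-05",
        "2024-01-02", "2024-06-11", "2024-03-22", "2024-02-28", "2024-05-01",
    ])
]
earlier, recent = temporal_split(records)
```

Because the cut is by time rather than at random, every test record postdates every calibration record, which is exactly the condition under which exchangeability can fail.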
Figure 6 shows the trade-off between rejection rate and overall error rate on ENT-Prod, obtained by varying the rejection quantile parameter.
When the rejection rate increases from 0% to approximately 8%, the overall error rate drops sharply from 9.9% to 5.1%, while the average set size grows moderately from 1.00 to 1.29. The steep initial decline reflects the fact that the most error-prone records, those with the lowest posterior confidence, are the first to be rejected. Removing a small fraction of high-risk decisions thus disproportionately improves overall accuracy. Beyond a rejection rate of 15%, the error rate curve flattens near 3.4% and further rejection yields diminishing returns while the set size continues to grow. The interval between 5% and 10% rejection offers the most favorable balance for practical deployment: the error rate is approximately halved relative to zero rejection, and the additional human review burden remains modest.
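The shape of this trade-off curve follows from rejecting the least confident records first. The sketch below assumes that the error rate is measured on the records that remain after rejection; the paper may instead count rejected records as resolved by human review. The confidences and correctness flags are hypothetical.

```python
def error_rate_after_rejection(confidences, is_correct, rejection_rate):
    """Error rate on accepted records after rejecting the least confident ones."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    n_reject = int(rejection_rate * len(order))
    accepted = order[n_reject:]
    if not accepted:
        raise ValueError("all records were rejected")
    errors = sum(1 for i in accepted if not is_correct[i])
    return errors / len(accepted)


# Hypothetical records: errors concentrate at low confidence.
confidences = [0.30, 0.40, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95, 0.97, 0.99]
is_correct = [False, False, True, False, True, True, True, True, True, True]

curve = {
    rate: error_rate_after_rejection(confidences, is_correct, rate)
    for rate in (0.0, 0.2, 0.4)
}
```

In this toy example the first 20% of rejections removes two of the three errors, while the next 20% removes only one more, reproducing the steep-then-flat pattern described above.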
4.6. Error Analysis and Case Study
Table 12 breaks down the cell-level repair accuracy by error type on ENT-Prod. The purpose of this table is to reveal each method’s characteristic strengths and weaknesses, providing complementary insight beyond the aggregate metrics in Table 4.
Rule-Only achieves the highest format-violation accuracy (0.961), marginally above Ours-Full (0.958), because deterministic regex matching and domain dictionaries provide near-complete coverage of well-defined format rules; however, it drops to 0.523 on semantic conflicts, where contextual understanding is required.
GPT-4o-Direct achieves the highest repair accuracy on semantic conflicts (0.891), benefiting from its broad world knowledge and open vocabulary. Ours-Full trails GPT-4o-Direct by 1.8 points on this category, because the student model’s candidate set is restricted to the top-k predictions and occasionally excludes rare but correct values. On cross-field inconsistencies, Ours-Full (0.896) outperforms all baselines by at least 3.4 points (vs. HoloClean at 0.862). The advantage arises from the explicit integration of hard rules in the factor graph, which enforces arithmetic and relational constraints that purely statistical or LLM-based methods cannot guarantee. BERT-FT achieves only 0.768 on cross-field errors, confirming that an encoder fine-tuned without explicit constraint signals struggles to maintain multi-field consistency.
Figure 7 shifts the perspective from methods to layers, showing how the three-layer architecture achieves its overall performance through specialization.
L1 attains 0.974 on format violations, the highest single-layer accuracy for any error type, because regex and dictionary matching provide near-perfect coverage for well-defined format constraints. L1’s accuracy on semantic conflicts, by contrast, is only 0.531, reflecting the intrinsic limitation of deterministic rules for context-dependent errors. L2 reaches 0.891 on semantic conflicts, closely matching GPT-4o-Direct’s method-level accuracy (0.891 in Table 12), which indicates that the distillation objective transfers the teacher’s semantic-correction capability to the compact student model. L3 attains 0.912 on cross-field inconsistencies, substantially higher than L2’s 0.796 on the same category. The 11.6-point gap between L2 and L3 on cross-field errors is the largest inter-layer difference across all error types, which is consistent with the design intent of routing high-complexity records to the factor graph. The routing mechanism directs each record to the layer best equipped to handle its dominant error type, which explains why the aggregate performance of Ours-Full exceeds what any single layer could achieve alone.
4.6.1. Case Studies
Case 1 (L1 repair). A customer record contains a phone number field with 12 digits. The rule layer detects the format violation through a regex check and removes the extraneous leading digit. The record is routed to L1 on the basis of its low complexity score; confidence 1.0; no conformal set needed.
Case 2 (L2 repair). A record lists city = “Naning” while province = “Jiangsu.” The rule layer does not reject the value because “Naning” is not in the explicit deny list, yet the character-level perplexity is elevated. The student model predicts “Nanjing” with probability 0.94 based on the provincial context. The record is routed to L2 on the basis of its complexity score; the conformal set is a singleton and the repair is accepted.
Case 3 (L3 repair). An order record shows order_total = 500, unit_price = 100, and quantity = 3, violating the arithmetic constraint total = price × quantity. The student model assigns comparable probabilities to repairing quantity and to repairing order_total, yielding high entropy. The record is routed to L3 on the basis of its complexity score. Factor graph inference activates hard rule R2 (arithmetic constraint, weight 1.0) and soft rule R11 (historical order-size regularity, weight 0.74). The posterior determines the final repair. Evidence trace: R2 activated → R11 supporting → posterior 0.91 → accepted.
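The ambiguity in Case 3 can be made concrete by enumerating the single-field repairs that satisfy the arithmetic constraint. This is an illustrative sketch that assumes integer-valued fields; it does not model the factor graph or the soft rule that selects among the candidates.

```python
def violates_total(record):
    """Check the hard constraint order_total = unit_price * quantity."""
    return record["order_total"] != record["unit_price"] * record["quantity"]


def single_field_repairs(record):
    """Repairs that change exactly one field and restore the constraint.

    Assumption: all three fields are integers, so a candidate is kept only
    when the division is exact.
    """
    price, quantity, total = (record["unit_price"], record["quantity"],
                              record["order_total"])
    candidates = [("order_total", price * quantity)]
    if price != 0 and total % price == 0:
        candidates.append(("quantity", total // price))
    if quantity != 0 and total % quantity == 0:
        candidates.append(("unit_price", total // quantity))
    return candidates


order = {"order_total": 500, "unit_price": 100, "quantity": 3}
candidates = single_field_repairs(order)
```

For the record in Case 3 the constraint alone admits two repairs, order_total = 300 or quantity = 5, which is why additional evidence such as the soft rule on historical order sizes is needed to choose between them.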
4.6.2. Failure Analysis
One failure case involves a record where customer_name = “Li Wei” should be “Li Weimin.” The truncated name is itself a valid entry in the domain vocabulary, so neither the rule layer nor the student model flags it as erroneous. The maximum posterior confidence at L2 is 0.48, which falls below the rejection threshold. The system correctly withholds a single-point repair and returns the prediction set, routing the record to human review. The failure illustrates a fundamental limitation: when the erroneous value is itself a valid domain entry, the model cannot distinguish correct from incorrect without external identity-resolution signals. The rejection mechanism is specifically designed to handle such boundary cases by deferring to human judgment rather than committing a potentially harmful write-back.
4.7. Limitations
Several limitations of the present study warrant explicit discussion. The enterprise dataset (ENT-Prod) cannot be publicly released due to confidentiality constraints; representative anonymized examples and schema descriptions are integrated into the main text to ensure clarity, and all reported numerical results on Hospital and Flights can be reproduced from the public repositories. The conformal coverage guarantee relies on conditional exchangeability, which may weaken under substantial temporal drift or abrupt schema changes. The temporal-split experiment in Section 4.5 showed graceful degradation, but stronger distributional shifts remain untested, and a formal robustness analysis under covariate shift would require adaptive conformal methods that are beyond the scope of this paper. Hard- and soft-rule templates require domain expertise for construction, which limits out-of-the-box applicability to new domains; the framework exposes the rule set as an explicit input and therefore inherits the maintenance burden of traditional rule-based systems on this dimension. On the Flights dataset, the student model’s top-k candidate set occasionally missed rare correct values, yielding a cell-level F1 slightly below that of GPT-4o-Direct; retrieval-augmented candidate generation is a natural extension but was not implemented here. All experiments were restricted to structured tabular data, and the framework’s performance on semi-structured formats such as JSON logs or nested XML records has not been evaluated. Finally, the operational considerations discussed in Section 3.7 describe the maintenance overhead qualitatively; a quantitative study of multi-component drift across a realistic production cycle is left to future work.
5. Conclusions and Future Work
This paper presented a complexity-aware progressive framework for enterprise data error correction that integrates deterministic rules, a task-specialized distilled language model, and neural probabilistic–logical reasoning within a unified architecture. A learnable router assigns each record to the processing layer matching its error complexity, and a hierarchical conformal prediction scheme provides per-layer calibrated prediction sets with a global coverage guarantee and a data-driven rejection mechanism. Experiments on ENT-Prod, Hospital, and Flights produced three principal findings. First, the framework improved the record-level complete repair rate by 2.1 to 3.1 percentage points over the strongest baseline (GPT-4o-Direct) while reducing inference latency by approximately 80% relative to direct GPT-4o invocation. Second, ablation studies confirmed the necessity of explicit symbolic signals for cross-field consistency. Third, hierarchical conformal calibration maintained tighter coverage and smaller prediction sets than single-level calibration across all nominal levels.
For enterprise data governance, the framework offers a deployable pattern for integrating language-model capabilities into latency-sensitive pipelines while keeping throughput and audit trails intact. For applied neuro-symbolic work, the results suggest that separating semantic representation from explicit logical constraints, and routing records based on error complexity, can yield better multi-field corrections than end-to-end neural pipelines on this type of data.
Building on these implications, several directions are identified as promising avenues for future work. The conformal layer could be made adaptive in time so that calibration tracks distribution shift without violating finite-sample coverage [40]. The fixed top-k student vocabulary could be replaced by retrieval-augmented candidate generation to recover rare correct values, which is the residual gap on Flights. The factor graph could be extended to semi-structured formats such as JSON or nested XML. Finally, the rejected-record stream is a natural input for closed-loop active learning, which would let the rule and student components improve on the records they currently defer.