Complexity-Aware Progressive Data Error Correction with Distilled Language Models and Conformal Reliability Control

Liu, Chao; Mu, Hong; Zhou, Jingjing; Wang, Enliang; Zhao, Xuejian

doi:10.3390/math14101599


by Chao Liu ¹, Hong Mu ¹, Jingjing Zhou ², Enliang Wang ² and Xuejian Zhao ²,*

¹ Nanjing Shurui Data Technology Co., Ltd., Nanjing 211100, China

² Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1599; https://doi.org/10.3390/math14101599

Submission received: 5 April 2026 / Revised: 6 May 2026 / Accepted: 6 May 2026 / Published: 8 May 2026


Abstract

Reliable tabular data correction is a prerequisite for trustworthy analytics in enterprise information systems. Tabular data in such environments frequently contain formatting errors, semantic conflicts, missing values, and cross-field inconsistencies that degrade downstream analytics and machine learning performance. Rule-based methods efficiently handle structural violations but miss context-dependent errors, whereas large language models (LLMs) offer strong semantic-correction capability at inference costs prohibitive for enterprise-scale deployment. This paper formulates data error correction as a progressive decision process and proposes a complexity-aware framework with three processing stages. The first stage applies deterministic rules for low-complexity structural errors. The second stage employs a task-specialized distilled language model for medium-complexity semantic correction. The third stage performs neural probabilistic–logical reasoning on a factor graph for high-complexity cross-field errors. A learnable routing mechanism assigns each record to the appropriate stage based on a lightweight complexity score. Layer-wise conformal prediction is further introduced to construct calibrated prediction sets with coverage guarantees at each stage, together with a rejection mechanism for low-confidence corrections. The framework is evaluated on one enterprise dataset and two public benchmarks (Hospital and Flights). It improves the record-level complete repair rate by 2.1 to 3.1 percentage points over the strongest baseline (GPT-4o-Direct) and by up to 16.8 points over purely rule-based repair, while reducing average inference latency by approximately 80% relative to direct GPT-4o invocation. Ablation studies confirm the critical role of complexity-aware routing and rule-trigger features, and reliability analyses show that hierarchical conformal calibration maintains tighter coverage than single-level alternatives across varying confidence requirements. 
These results indicate that complexity-aware progressive routing coupled with hierarchical conformal calibration provides a practical path toward high-throughput, auditable, and reliability-controlled data cleaning suitable for enterprise deployment.

Keywords:

data error correction; complexity-aware routing; knowledge distillation; probabilistic logical reasoning; conformal prediction; reliable data cleaning

MSC:

68T05; 68T50

1. Introduction

In enterprise environments, data assets typically span multiple heterogeneous sources and complex integration pipelines [1,2]. Missing values, format anomalies, semantic conflicts, and cross-field inconsistencies arise naturally during data collection and integration. When left unaddressed, the resulting errors amplify bias during model training, degrade prediction accuracy, and trigger cascading costs in practical business operations [3]. Accurate error detection and reliable correction therefore constitute a foundational requirement for building trustworthy data-driven systems.

Integrity constraints, denial constraints, and functional dependencies have long served as the primary instruments for detecting and repairing data violations [1,4]. Well-structured fields benefit from the stability and controllability that constraint enforcement provides, yet errors involving contextual information, cross-field semantic relationships, or implicit domain knowledge remain difficult to capture through manually crafted rules. As data distributions shift across time and business scenarios, the cost of maintaining rule bases grows rapidly [5]. Learning-based systems such as HoloClean [6], Raha [7], and Baran [8] have advanced the state of the art by combining statistical signals with constraints. Still, most of them address detection and correction as loosely coupled steps and lack principled mechanisms for rejecting uncertain repairs.

Large language models have recently demonstrated an ability to leverage contextual understanding and world knowledge for detecting subtle errors and generating plausible repairs in tabular data [9,10,11]. Deploying LLMs directly for error correction, however, imposes heavy computational and latency burdens that conflict with enterprise requirements for high-throughput processing [12,13]. Knowledge distillation addresses this tension by transferring task-specific capabilities from a large teacher model to a compact student model, reducing inference cost while retaining the semantic representations essential for error correction [14,15,16].

Semantic representation alone does not suffice for correcting errors that involve multi-field dependencies or business-rule violations. Repair decisions in complex scenarios should be traceable and auditable, which calls for explicit reasoning over rules and relational constraints. Markov logic networks [17] and statistical relational learning frameworks [18] offer a principled way to combine hard constraints with soft evidence under uncertainty. Neuro-symbolic methods extend this idea by embedding symbolic constraints into neural networks in differentiable form [19,20,21], allowing models to satisfy logical requirements while maintaining representation capacity.

An automatic correction system must also control the risk of writing incorrect values back to the data source. When model confidence is insufficient, the system should produce set-valued suggestions or trigger rejection. Conformal prediction provides distribution-free coverage guarantees under mild exchangeability assumptions [22,23].

Recent work has begun applying conformal methods to data quality tasks [24,25]; most of these efforts, however, perform calibration at a single processing level, and the design of conformal schemes aligned with multi-stage correction pipelines has received limited attention.

The literature reviewed above leaves enterprise data error correction in an uncomfortable position. Learning-augmented systems that combine constraints with statistical signals still treat detection and correction as loosely coupled stages, with little principled support for rejecting uncertain repairs; direct LLM invocation closes part of this gap on semantic errors, yet at an inference cost that the throughput and latency budgets of enterprise pipelines cannot absorb. The picture becomes harder once cross-field repair is required, because traceable reasoning over explicit constraints is needed to write back values that pass auditing, and neither purely statistical repair nor single-model neural approaches deliver such reasoning. Conformal calibration has begun to address the reliability question, but the calibration is typically applied at a single processing level, which fits poorly with pipelines whose confidence profiles differ from one processing stage to another.

To address these difficulties, this paper constructs an enterprise-oriented data error correction framework that matches computational cost to error complexity at the record level, preserves semantic-correction capability without incurring LLM-scale inference cost, makes cross-field reasoning structurally explicit, and provides coverage-controlled prediction sets together with a principled rejection mechanism.

The main contributions of this paper are summarized as follows:

A complexity-aware progressive architecture is proposed that dynamically routes records across a rule-based layer, a distilled-model layer, and a probabilistic-reasoning layer to achieve a controllable balance between correction quality and inference latency.
A neuro-symbolic integration strategy is developed, incorporating a task-specialized distilled language model for semantic representations and a factor graph for structured probabilistic–logical reasoning to tighten cross-field consistency.
A hierarchical conformal prediction scheme is introduced to perform conditional calibration within each processing layer, providing a global coverage lower bound together with a data-driven rejection mechanism for high-risk corrections.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed framework. Section 4 describes the experimental setup and results. Section 5 concludes with a discussion of limitations and future directions.

2. Related Work

2.1. Error Detection and Repair in Tabular Data

Data cleaning has evolved from constraint-based engineering toward learning-augmented pipelines over the past two decades. Classical systems enforce integrity constraints such as functional dependencies and denial constraints to identify and repair violations. Foundational treatments of constraint modeling and repair semantics can be found in Ilyas and Chu [1] and Abiteboul et al. [5]. NADEEF [26] provides a unified programming interface for specifying heterogeneous quality rules. KATARA [27] leverages knowledge bases and crowdsourcing for repair candidate generation.

HoloClean [6] marked a major advance by compiling denial constraints, external data sources, and statistical signals into a single factor graph and performing probabilistic inference over dirty cells. PClean [28] extends this direction with a Bayesian generative model that explicitly separates latent clean values from observed noise. On the detection side, Raha [7] ensembles multiple detection strategies into a configuration-free system requiring minimal user labels, and Baran [8] addresses correction through a unified context representation combined with transfer learning. A systematic comparison by Abedjan et al. [2] concludes that no single detection method dominates across error types, which motivates multi-strategy approaches.

Despite the progress in individual components, most existing systems treat detection and correction as sequential yet loosely coupled steps. Explicit mechanisms for rejecting uncertain repairs are rarely provided, and the computational cost of joint detection–correction pipelines has received limited attention.

2.2. LLMs and Deep Learning for Data Quality

Deep learning has enabled representation learning and end-to-end optimization for several data governance tasks, including entity alignment, error localization, and repair suggestion [29]. Retrieval-augmented generation further enhances factual consistency by incorporating external knowledge sources at inference time [30].

Cocoon [9] decomposes data cleaning into manageable subtasks that leverage LLM semantic understanding alongside statistical error detection, outperforming prior systems on standard benchmarks. LLMClean [10] uses LLMs to automatically generate ontological functional dependencies for context-aware cleaning. Bendinelli et al. [11] investigate LLM agents with iterative feedback loops for systematic error correction. CleanAgent [31] employs autonomous multi-agent workflows for end-to-end data quality management.

All of the above approaches inherit practical limitations of LLMs, including high inference cost, output hallucination, non-deterministic behavior, and context window constraints [9]. The tension between semantic capability and enterprise-scale inference cost therefore remains an unresolved practical bottleneck for LLM-based cleaning systems.

2.3. Knowledge Distillation for Compact Models

Knowledge distillation transfers capabilities from large teacher models to smaller students through soft targets, feature alignment, and multi-task regularization [14]. Sanh et al. [15] demonstrated that a 60% smaller BERT retains 97% of its language understanding performance. Surveys by Gou et al. [16] and Xu and McAuley [32] systematize techniques including temperature scaling, sample reweighting, and structural constraints, providing methodological foundations for task-specific distillation.

In the era of large language models, distillation has evolved beyond output-level alignment. Rationale-based methods distill chain-of-thought reasoning from a teacher into a student model [33]. Probe-based approaches extract richer training signals from intermediate representations [34]. Recent work on uncertainty-aware distillation provides Bayesian frameworks that equip student models with calibrated confidence estimates [35].

Distillation specifically tailored for data error correction, where the student must simultaneously handle format validation, semantic repair, and constraint satisfaction, remains underexplored, and task-oriented distillation objectives with capability-preserving auxiliary losses have not been systematically investigated in this setting.

2.4. Conformal Prediction and Reliable Decision-Making

Conformal prediction constructs prediction sets with finite-sample coverage guarantees under the exchangeability assumption [22,23]. Split conformal prediction and its variants have been widely adopted for uncertainty quantification in classification and regression settings [36]. Recent surveys highlight the versatility of conformal methods across data modalities [37] and their promise for NLP applications where model calibration is often poor [38].

Applying conformal methods to data quality is a recent development.

Jäger and Biessmann [24] combine imputation with conformal prediction to improve downstream ML performance, demonstrating that uncertainty-aware cleaning can outperform deterministic approaches. Zhan et al. [25] apply inductive conformal prediction to biomedical data cleaning, using reliability metrics to selectively correct mislabeled samples.

Existing conformal approaches for data cleaning operate at a single processing level, and calibration schemes aligned with the heterogeneous confidence profiles produced by the different processing stages of a multi-stage pipeline have not been systematically developed.

Reading these threads together, three observations stand out. Rule-based systems, language-model distillation, and uncertainty quantification each address part of the enterprise-cleaning problem, but the points where they should connect are largely unclaimed: rules pass to learned models without a budget on inference cost, and learned models return predictions without a budget on coverage. Cross-field consistency in particular tends to fall through the cracks, because constraint-driven and neural pipelines optimize different objectives. The framework developed in the next section is built around exactly these connection points, treating routing, reasoning, and calibration as a single design problem rather than three loosely linked components.

3. Proposed Framework

3.1. Problem Formulation

Consider a relational table $T$ with $m$ attributes $\{A_{1}, A_{2}, \dots, A_{m}\}$. Each record $x = (x_{1}, x_{2}, \dots, x_{m})$ is a tuple of field values drawn from the respective attribute domains. A subset of fields in $x$ may contain errors, including format violations, missing entries, semantic conflicts, or cross-field inconsistencies.

The error correction task is defined at two levels. At the cell level, the system produces a binary error mask $e(x) \in \{0, 1\}^{m}$ indicating which fields are erroneous, together with a repair value $\hat{x}_{i}$ for each flagged field $i$. At the record level, the system outputs a candidate set $S(x)$ of repair suggestions along with a confidence distribution $q(y \mid x)$ over the candidates.

The framework aims to minimize an expected risk that jointly accounts for computational cost and correction error:

$$J = \mathbb{E}_{x}\left[\mathrm{cost}(r(x)) + \lambda \cdot \mathrm{err}(f, r(x))\right], \tag{1}$$

where $r(x) \in \{L_{1}, L_{2}, L_{3}\}$ denotes the routing decision that assigns $x$ to one of three processing layers, $\mathrm{cost}(\cdot)$ measures the combined latency and resource consumption, $\mathrm{err}(\cdot)$ is the task loss for detection and correction, and $\lambda > 0$ controls the trade-off between efficiency and accuracy.
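To make the objective in Equation (1) concrete, the following minimal Python sketch computes its sample-average estimate over a handful of routed records. The per-layer cost values and the helper name `empirical_risk` are hypothetical placeholders chosen for illustration, not quantities reported in this paper.

```python
from statistics import mean

# Hypothetical per-layer costs (e.g., normalized latency); not values from the paper.
LAYER_COST = {"L1": 0.01, "L2": 0.2, "L3": 1.0}


def empirical_risk(routes, errors, lam):
    """Sample-average estimate of J = E[cost(r(x)) + lam * err(f, r(x))].

    routes: list of layer labels r(x) in {"L1", "L2", "L3"}
    errors: list of task losses err(f, r(x)), one per record
    lam:    trade-off weight lambda > 0
    """
    if len(routes) != len(errors):
        raise ValueError("routes and errors must have the same length")
    return mean(LAYER_COST[r] + lam * e for r, e in zip(routes, errors))


# Four records: two sent to L1 (one repaired wrongly), one to L2, one to L3.
routes = ["L1", "L1", "L2", "L3"]
errors = [0.0, 1.0, 0.0, 0.0]
risk = empirical_risk(routes, errors, lam=2.0)
print(risk)
```

Raising `lam` makes the single wrong L1 repair dominate the estimate, which is the mechanism by which the trade-off weight pushes records toward the more expensive layers.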

3.2. Overall Architecture

The three-layer structure was settled by iterative empirical profiling rather than fixed in advance. Preliminary experiments showed that no single processing mechanism met both the accuracy and the latency requirements across all error types: the ablation in Section 4.4 reports that relying on the language model alone (No-Logic-Layer) degrades performance on cross-field inconsistencies, and applying factor-graph reasoning to every record (No-Routing) inflates latency without a corresponding repair gain. The error types observed in enterprise tables, namely format violations, semantic conflicts, and multi-field inconsistencies, mapped naturally to three different solvers, so the pipeline was decomposed accordingly. The decomposition keeps simple errors on the deterministic path and reserves probabilistic–logical reasoning for records where the cost is justified.

The framework consists of three processing layers connected through feature reuse and signal sharing, as illustrated in Figure 1.

The first layer (L1) applies deterministic rules, regular expressions, domain dictionaries, and cross-field constraints. It handles format and consistency errors with low latency and produces a binary rule-trigger vector $r_{L1}(x) \in \{0, 1\}^{|R_{\mathrm{hard}}|}$ that records which hard rules were activated. The second layer (L2) employs a task-specialized distilled language model to perform semantic error detection and correction on records of medium complexity. The third layer (L3) integrates the student model’s representation with explicit rules on a factor graph, performing structured probabilistic–logical reasoning for records with complex cross-field dependencies.

A complexity-aware router sits upstream of the three layers and assigns each incoming record to the appropriate layer.

Figure 2 depicts the end-to-end processing workflow, including the routing decision and the information flow across layers.

3.3. Two-Stage Complexity Assessment and Routing

A central design requirement is that routing must be computationally cheaper than the downstream layers it serves. A two-stage routing strategy is therefore adopted so that the routing stage avoids any dependence on the full student model.

3.3.1. Stage 1: Lightweight Pre-Screening

In the first stage, a lightweight complexity score is computed using only features that do not require running the student model. A pattern deviation score $D_{p}(x)$ is defined as

$$D_{p}(x) = \mathrm{MinMax}\left(\mathrm{PPL}_{\mathrm{char}}(x)\right) + \kappa \cdot N_{\mathrm{fail}}(x), \tag{2}$$

where $\mathrm{PPL}_{\mathrm{char}}(x)$ is the perplexity of a 5-gram character-level language model trained on clean schema-conforming data, and $N_{\mathrm{fail}}(x)$ is the count of failed validation checks (e.g., regex mismatches) from the rule base. The function $\mathrm{MinMax}(\cdot)$ denotes min–max normalization computed on the training set, and $\kappa$ is a scaling coefficient.

Three design choices in Equation (2) deserve explicit justification: how the two components are combined, how each is normalized, and what n-gram order is used.

The two components are combined additively rather than multiplicatively or through a learned weighted sum. A multiplicative form collapses to zero whenever $N_{\mathrm{fail}}(x) = 0$, which would mask records whose formats pass all regex checks yet exhibit anomalously high character-level perplexity, a common pattern for context-dependent errors. A learned weighted sum was considered but rejected at this stage, since the routing score enters downstream threshold estimation (Equation (7)), and keeping $D_{p}$ in a fixed, interpretable form preserves the separability between score definition and threshold optimization.

For normalization, min–max is applied rather than Z-score or robust scaling. Because $N_{\mathrm{fail}}(x)$ is a small non-negative integer count, min–max produces a dimensionless $[0, 1]$ quantity compatible with $N_{\mathrm{fail}}$ in the same additive expression. Heavy-tailed behavior of $\mathrm{PPL}_{\mathrm{char}}$ across domains is addressed by clipping the training-set perplexity at the 99th percentile before normalization, which bounds the influence of isolated outliers without discarding them.

The character-level model is set to order five rather than to a token-level or subword alternative. Field-level content in relational tables is short, frequently contains out-of-vocabulary strings (identifiers, codes, mixed-script names), and does not tolerate the segmentation errors introduced by subword tokenizers. Character-level modeling sidesteps tokenization entirely; the order of five balances context sensitivity against storage cost, and a sensitivity analysis over $n \in \{1, 2, 3, 5\}$ is reported in Section 4.4.
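The pattern deviation score of Equation (2) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the add-one smoothing inside the n-gram model, the nearest-rank quantile, the value `kappa=0.5`, and the names `CharNgramLM`, `fit_minmax`, and `pattern_deviation` are all hypothetical.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram model with add-one smoothing (the smoothing is an assumption)."""

    def __init__(self, order=5):
        self.order = order
        self.ngrams = Counter()
        self.contexts = Counter()
        self.alphabet = set()

    def _pad(self, text):
        # "^" and "$" are start/end markers, assumed absent from field values.
        return "^" * (self.order - 1) + text + "$"

    def fit(self, clean_values):
        for text in clean_values:
            padded = self._pad(text)
            self.alphabet.update(padded)
            for i in range(self.order - 1, len(padded)):
                ctx = padded[i - self.order + 1:i]
                self.ngrams[ctx + padded[i]] += 1
                self.contexts[ctx] += 1
        return self

    def perplexity(self, text):
        padded = self._pad(text)
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        log_prob, n = 0.0, 0
        for i in range(self.order - 1, len(padded)):
            ctx = padded[i - self.order + 1:i]
            p = (self.ngrams[ctx + padded[i]] + 1) / (self.contexts[ctx] + vocab)
            log_prob += math.log(p)
            n += 1
        return math.exp(-log_prob / n)


def fit_minmax(train_ppl, clip_quantile=0.99):
    """Return (lo, hi), with hi taken at a nearest-rank quantile of the training perplexities."""
    ordered = sorted(train_ppl)
    idx = min(len(ordered) - 1, int(clip_quantile * len(ordered)))
    return ordered[0], ordered[idx]


def pattern_deviation(ppl, n_fail, lo, hi, kappa=0.5):
    """D_p(x) = MinMax(PPL_char(x)) + kappa * N_fail(x), following Equation (2)."""
    clipped = min(max(ppl, lo), hi)
    scaled = 0.0 if hi == lo else (clipped - lo) / (hi - lo)
    return scaled + kappa * n_fail


# Toy "clean" training values for a date field.
clean = ["2024-01-15", "2024-01-16", "2023-11-02", "2019-03-09"]
lm = CharNgramLM(order=5).fit(clean)
lo, hi = fit_minmax([lm.perplexity(v) for v in clean])

score_clean = pattern_deviation(lm.perplexity("2024-01-15"), n_fail=0, lo=lo, hi=hi)
score_dirty = pattern_deviation(lm.perplexity("15/01/24x"), n_fail=1, lo=lo, hi=hi)
print(score_clean, score_dirty)
```

Because the perplexity is clipped before scaling, a value far outside the training range contributes at most one to the score, and the additive rule-failure term is what separates it further from well-formed values.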

A contextual dependency score $K(x)$ is also computed to approximate the statistical coupling between the target field and its neighbors, using pointwise mutual information with Laplace smoothing:

$$K(x) = \frac{1}{|N(x)|} \sum_{j \in N(x)} \log \frac{P_{\mathrm{sm}}(x_{i}, x_{j})}{P_{\mathrm{sm}}(x_{i})\, P_{\mathrm{sm}}(x_{j})}, \tag{3}$$

where $x_{i}$ is the target attribute, $N(x)$ is the set of functionally dependent attributes, and $P_{\mathrm{sm}}$ uses add-one smoothing on empirical counts.
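A minimal Python sketch of Equation (3) follows. It assumes that the add-one smoothing denominators use the number of distinct observed values per attribute (and their product for the joint); this choice, the toy records, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter


def fit_counts(records, target, neighbors):
    """Collect marginal and joint value counts needed for smoothed PMI."""
    marg = {a: Counter(r[a] for r in records) for a in [target, *neighbors]}
    joint = {j: Counter((r[target], r[j]) for r in records) for j in neighbors}
    return marg, joint, len(records)


def smoothed_pmi(marg, joint, n, target, j, xi, xj):
    """log P_sm(xi, xj) / (P_sm(xi) P_sm(xj)) with add-one smoothing."""
    vi, vj = len(marg[target]), len(marg[j])
    p_i = (marg[target][xi] + 1) / (n + vi)
    p_j = (marg[j][xj] + 1) / (n + vj)
    p_ij = (joint[j][(xi, xj)] + 1) / (n + vi * vj)
    return math.log(p_ij / (p_i * p_j))


def contextual_dependency(record, marg, joint, n, target, neighbors):
    """K(x): mean smoothed PMI between the target field and its neighbors."""
    scores = [
        smoothed_pmi(marg, joint, n, target, j, record[target], record[j])
        for j in neighbors
    ]
    return sum(scores) / len(scores)


# Toy reference data: city is the target field, province its dependent neighbor.
records = (
    [{"city": "Nanjing", "province": "Jiangsu"}] * 3
    + [{"city": "Hangzhou", "province": "Zhejiang"}] * 3
)
marg, joint, n = fit_counts(records, "city", ["province"])

k_consistent = contextual_dependency(
    {"city": "Nanjing", "province": "Jiangsu"}, marg, joint, n, "city", ["province"]
)
k_conflict = contextual_dependency(
    {"city": "Nanjing", "province": "Zhejiang"}, marg, joint, n, "city", ["province"]
)
print(k_consistent, k_conflict)
```

In this toy example the consistent pair receives a positive score and the conflicting pair a negative one, which illustrates how the smoothed statistic reflects the coupling between a field and its neighbors.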

Records whose $D_{p}(x)$ falls below a threshold $\theta_{\mathrm{low}}$ and whose $K(x)$ is below a threshold $\theta_{K}$ are routed directly to L1. The remaining records proceed to Stage 2.

3.3.2. Stage 2: Entropy-Augmented Routing

For records that pass Stage 1, a single forward pass through a frozen, lightweight projection head of the student model produces a predictive distribution over each potentially erroneous field. The semantic uncertainty is then computed as

$$S(x) = -\sum_{i \in \hat{M}(x)} \sum_{v \in V} \hat{p}_{T}(v \mid x_{\mathrm{ctx}}) \log \hat{p}_{T}(v \mid x_{\mathrm{ctx}}), \tag{4}$$

where $\hat{M}(x)$ is the set of suspicious fields identified in Stage 1 by the rule layer, $V$ is the candidate vocabulary, and $\hat{p}_{T}$ is the softmax distribution at temperature $T = 2.0$.

The reliability of $S(x)$ as a complexity signal depends on whether the underlying predictive distribution is calibrated. The temperature $T = 2.0$ used in Equation (4) is not a free hyper-parameter but the outcome of post hoc temperature scaling [39]: after distillation, a single scalar temperature is tuned on the validation set by minimizing negative log-likelihood, and the resulting value is frozen before routing thresholds are estimated. The empirical calibration quality of the resulting distribution and the monotone relation between $S(x)$ and downstream correction error are reported later in Section 4.3.
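The entropy in Equation (4) can be illustrated with a short Python sketch. It assumes that per-field logits over the candidate vocabulary are already available from the projection head; the dictionary layout and the function names are hypothetical.

```python
import math


def tempered_softmax(logits, temperature=2.0):
    """Softmax of logits / T, computed stably by subtracting the maximum."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_uncertainty(field_logits, temperature=2.0):
    """S(x): sum over suspicious fields of the entropy of the tempered distribution."""
    total = 0.0
    for logits in field_logits.values():
        probs = tempered_softmax(logits, temperature)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total


# One suspicious field with three candidates in each toy record.
confident = {"city": [8.0, 0.0, 0.0]}   # one candidate clearly preferred
ambiguous = {"city": [1.0, 1.0, 1.0]}   # all candidates equally plausible

s_confident = semantic_uncertainty(confident)
s_ambiguous = semantic_uncertainty(ambiguous)
print(s_confident, s_ambiguous)
```

A uniform distribution over three candidates attains the maximum entropy $\log 3$, so the ambiguous record receives the larger score and is more likely to be routed beyond L1.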

The final complexity score combines all three components:

$$C(x) = 100 \cdot \sigma\left(\alpha D_{p}(x) + \beta S(x) + \gamma K(x)\right), \tag{5}$$

where $\sigma(\cdot)$ is the logistic function and the weights satisfy $\alpha + \beta + \gamma = 1$. Table 1 summarizes the domain, normalization, and approximate per-record cost of each component.

The routing decision is then as follows:

$$r(x) = \begin{cases} L_{1}, & C(x) < \theta_{\mathrm{low}}, \\ L_{2}, & \theta_{\mathrm{low}} \leq C(x) < \theta_{\mathrm{high}}, \\ L_{3}, & C(x) \geq \theta_{\mathrm{high}}. \end{cases} \tag{6}$$
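Equations (5) and (6) translate directly into code. The sketch below is illustrative only: the weight values and the thresholds 55 and 70 are hypothetical placeholders, since the fitted values are obtained from the validation set.

```python
import math


def complexity_score(d_p, s, k, alpha=0.4, beta=0.4, gamma=0.2):
    """C(x) = 100 * sigmoid(alpha*D_p + beta*S + gamma*K); weights are placeholders."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    z = alpha * d_p + beta * s + gamma * k
    return 100.0 / (1.0 + math.exp(-z))


def route(c, theta_low, theta_high):
    """Piecewise routing rule of Equation (6)."""
    if c < theta_low:
        return "L1"
    if c < theta_high:
        return "L2"
    return "L3"


THETA_LOW, THETA_HIGH = 55.0, 70.0  # hypothetical thresholds

c_simple = complexity_score(d_p=0.0, s=0.0, k=0.0)
c_hard = complexity_score(d_p=1.5, s=2.0, k=0.5)
print(c_simple, route(c_simple, THETA_LOW, THETA_HIGH))
print(c_hard, route(c_hard, THETA_LOW, THETA_HIGH))
```

Because the logistic function maps a zero argument to 0.5, a record with no deviation, no uncertainty, and no contextual coupling scores exactly 50 on this scale.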

The thresholds $\theta_{\mathrm{low}}$ and $\theta_{\mathrm{high}}$ are estimated by constrained empirical risk minimization on the validation set:

$$\{\theta_{\mathrm{low}}, \theta_{\mathrm{high}}\} = \arg\min_{\theta}\ \mathbb{E}_{x}\left[\mathrm{lat}(r_{\theta}(x))\right] \quad \text{s.t.} \quad \mathbb{E}_{x}\left[\mathrm{acc}(r_{\theta}(x))\right] \geq A_{\min}, \tag{7}$$

where $\mathrm{lat}(\cdot)$ is the measured latency, $\mathrm{acc}(\cdot)$ is the validation accuracy, and $A_{\min}$ is a user-specified accuracy lower bound.
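One simple way to approximate the constrained problem in Equation (7) is an exhaustive search over a threshold grid. The sketch below uses this grid search as a stand-in; the paper does not state which optimizer is used, and the toy validation data, latency values, and the name `fit_thresholds` are hypothetical.

```python
from itertools import combinations


def route(c, lo, hi):
    """Routing rule of Equation (6)."""
    return "L1" if c < lo else ("L2" if c < hi else "L3")


def fit_thresholds(scores, correct, latency, a_min, grid):
    """Grid search for (theta_low, theta_high): minimize mean latency s.t. accuracy >= a_min.

    scores:  complexity score C(x) per validation record
    correct: per record, dict layer -> bool (is that layer's repair correct?)
    latency: dict layer -> measured latency
    Returns (mean_latency, theta_low, theta_high), or None if no pair is feasible.
    """
    best = None
    for lo, hi in combinations(sorted(grid), 2):
        routes = [route(c, lo, hi) for c in scores]
        acc = sum(correct[i][r] for i, r in enumerate(routes)) / len(routes)
        lat = sum(latency[r] for r in routes) / len(routes)
        if acc >= a_min and (best is None or lat < best[0]):
            best = (lat, lo, hi)
    return best


# Toy validation set: harder records need higher layers to be repaired correctly.
scores = [10, 40, 60, 90]
correct = [
    {"L1": True, "L2": True, "L3": True},
    {"L1": False, "L2": True, "L3": True},
    {"L1": False, "L2": True, "L3": True},
    {"L1": False, "L2": False, "L3": True},
]
latency = {"L1": 1.0, "L2": 10.0, "L3": 100.0}

best = fit_thresholds(scores, correct, latency, a_min=1.0, grid=[0, 25, 50, 75, 100])
print(best)
```

With the accuracy constraint set to one, the search keeps only threshold pairs that send every record to a layer able to repair it, and among those it picks the pair with the lowest mean latency.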

To stabilize routing near the decision boundaries, a boundary regularization term $R_{\mathrm{bd}}$ is added that penalizes abrupt changes in the routing probability for records whose complexity scores lie within a margin of the thresholds.

3.3.3. Routing Error Analysis

The quality of the final correction depends on both each layer’s repair capability and the router’s ability to assign every record to a layer that can handle it. To make this dependence explicit, an oracle routing baseline is introduced. For each record in the validation set, all three layers are executed and the oracle layer is defined as the lowest-cost layer that produces a correct repair. The routing error rate is then the fraction of records for which $r(x)$ differs from the oracle choice. Two failure modes are distinguished: under-routing, which assigns a record to a layer below the oracle and typically produces a wrong repair, and over-routing, which assigns a record to a layer above the oracle and wastes computation without harming quality. The two failure modes carry asymmetric costs, and the routing objective in Equation (7) explicitly penalizes under-routing through the accuracy constraint $A_{\min}$ while tolerating bounded over-routing through the latency objective. Section 4.4 reports the empirical confusion matrix between $r(x)$ and the oracle, together with the downstream repair accuracy conditioned on each failure mode. A first-order regret argument further shows that when $\theta_{\mathrm{low}}$ and $\theta_{\mathrm{high}}$ are chosen to satisfy the constraint in Equation (7), the excess risk relative to oracle routing is bounded by the weighted sum of the miscoverage rates on the two threshold bands plus the layer-wise repair-accuracy gaps, which in practice scales linearly with the routing error rate and vanishes as the complexity score becomes more discriminative.
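The oracle comparison described above can be sketched as follows. The sketch assumes that records which no layer repairs correctly are excluded from the rates, a convention the text does not specify; the function names and the toy data are hypothetical.

```python
ORDER = {"L1": 0, "L2": 1, "L3": 2}


def oracle_layer(correct):
    """Lowest-cost layer whose repair is correct; None if no layer succeeds."""
    for layer in ("L1", "L2", "L3"):
        if correct[layer]:
            return layer
    return None


def routing_errors(routes, correct_per_record):
    """Return (error_rate, under_rate, over_rate) over records that have an oracle layer."""
    under = over = total = 0
    for r, correct in zip(routes, correct_per_record):
        oracle = oracle_layer(correct)
        if oracle is None:
            continue  # assumption: unrepairable records are excluded
        total += 1
        if ORDER[r] < ORDER[oracle]:
            under += 1  # routed below the oracle: likely a wrong repair
        elif ORDER[r] > ORDER[oracle]:
            over += 1   # routed above the oracle: wasted computation
    if total == 0:
        raise ValueError("no record has an oracle layer")
    return (under + over) / total, under / total, over / total


routes = ["L1", "L1", "L3", "L2"]
correct_per_record = [
    {"L1": True, "L2": True, "L3": True},    # oracle L1, routed L1: match
    {"L1": False, "L2": True, "L3": True},   # oracle L2, routed L1: under-routing
    {"L1": False, "L2": True, "L3": True},   # oracle L2, routed L3: over-routing
    {"L1": False, "L2": True, "L3": True},   # oracle L2, routed L2: match
]
error_rate, under_rate, over_rate = routing_errors(routes, correct_per_record)
print(error_rate, under_rate, over_rate)
```

Reporting the two rates separately matters because only under-routing is expected to reduce repair accuracy, whereas over-routing affects latency alone.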

3.4. Task-Specialized Knowledge Distillation

The second layer employs a distilled language model trained through task-oriented knowledge distillation.

GPT-4o is used as the teacher model and is invoked via API in a few-shot setting to generate cell-level correction labels on the training set.

The student model is BERT-base (110 M parameters), initialized from the pre-trained checkpoint and fine-tuned with a masked-field prediction objective.

3.4.1. Input Formulation

Each record is serialized as $[\mathrm{CLS}]\ A_{1} = v_{1}\ [\mathrm{SEP}]\ A_{2} = v_{2}\ [\mathrm{SEP}]\ \dots\ [\mathrm{MASK}]\ [\mathrm{SEP}]$, with [MASK] iterating over the fields flagged as erroneous in Stage 1. The candidate set $V$ is populated differently by field type: the full domain vocabulary for categorical fields, the top-k ($k = 20$) teacher predictions for free-text fields, and a constrained schema-compliant token set for open-ended string repair. This typed candidate construction, rather than the masked-prediction backbone itself, is the design element specific to the data-correction setting.
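The serialization can be illustrated with a small Python sketch. It assumes that the attribute name is kept in front of the masked value (the template above leaves this detail open) and that one input is produced per flagged field; the helper names are hypothetical.

```python
def serialize(record, masked_field):
    """Serialize a record as '[CLS] A1 = v1 [SEP] ...', replacing one value by [MASK]."""
    parts = []
    for attr, value in record.items():
        shown = "[MASK]" if attr == masked_field else value
        parts.append(f"{attr} = {shown}")
    return "[CLS] " + " [SEP] ".join(parts) + " [SEP]"


def masked_inputs(record, flagged_fields):
    """One serialized input per flagged field, in the order the fields were flagged."""
    return [serialize(record, f) for f in flagged_fields]


record = {"province": "Jiangsu", "city": "Nanjng"}  # "Nanjng" is a deliberate typo
inputs = masked_inputs(record, ["city"])
print(inputs[0])
```

In practice the special tokens would be inserted by the tokenizer of the student model rather than as literal strings; the literal form is used here only to keep the example self-contained.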

3.4.2. Distillation Loss

Let $p_{T}(\cdot \mid x)$ and $p_{S}(\cdot \mid x)$ denote the teacher and student predictive distributions, and let $T_{\ell}(x)$ and $S_{\ell}(x)$ denote the aligned intermediate representations at layer $\ell$. The total training loss is

$$\mathcal{L} = w(C(x)) \left( \tau^{2}\, \mathrm{KL}\left(p_{T}^{\tau}(\cdot \mid x) \,\|\, p_{S}^{\tau}(\cdot \mid x)\right) + \alpha_{d} \left\| P\, T_{\ell}(x) - S_{\ell}(x) \right\|_{2}^{2} \right) + \beta_{d} \sum_{u \in U} \mathcal{L}^{(u)} + \lambda_{d} R_{\mathrm{bd}}, \tag{8}$$

where $\tau > 0$ is the temperature, $P$ is a learnable projection matrix, $w(C(x))$ re-weights samples to emphasize the medium-complexity interval, and $R_{\mathrm{bd}}$ is the boundary regularization described in Section 3.3.

Concretely, $w(C(x))$ is implemented as a Gaussian centered at the midpoint $(\theta_{\mathrm{low}} + \theta_{\mathrm{high}}) / 2$ with bandwidth equal to $(\theta_{\mathrm{high}} - \theta_{\mathrm{low}}) / 2$, normalized so that its mean over the training set equals one; this places higher weight on records whose complexity score sits within the L2 routing band, where the distillation signal is most informative.
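The sample weight and the tempered KL term of Equation (8) can be sketched as follows. Two assumptions are made for illustration: the bandwidth is interpreted as the standard deviation of the Gaussian, and teacher logits (or a label-derived distribution) are available for the KL term. The thresholds 40 and 60 and all function names are hypothetical.

```python
import math


def gaussian_weights(scores, theta_low, theta_high):
    """w(C(x)): Gaussian at the band midpoint, bandwidth half the band width, mean one."""
    mid = (theta_low + theta_high) / 2.0
    bw = (theta_high - theta_low) / 2.0  # interpreted as a standard deviation
    raw = [math.exp(-0.5 * ((c - mid) / bw) ** 2) for c in scores]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def softmax(logits, tau):
    """Tempered softmax, computed stably."""
    m = max(z / tau for z in logits)
    exps = [math.exp(z / tau - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kd_term(teacher_logits, student_logits, tau=4.0):
    """tau^2 * KL(p_T^tau || p_S^tau), the first term inside the parentheses of Eq. (8)."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    return tau ** 2 * sum(
        pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0.0
    )


scores = [30.0, 50.0, 70.0, 95.0]
weights = gaussian_weights(scores, theta_low=40.0, theta_high=60.0)

kd_same = kd_term([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
kd_diff = kd_term([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
print(weights, kd_same, kd_diff)
```

The record at the band midpoint receives the largest weight, and records far outside the L2 band contribute almost nothing to the weighted distillation terms.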

The set $U = \{\mathrm{fmt}, \mathrm{sem}, \mathrm{log}\}$ indexes three capability-preserving auxiliary losses. The format loss $\mathcal{L}^{(\mathrm{fmt})}$ is a cross-entropy objective on regex-checkable fields, training the student to replicate format validation. The semantic loss $\mathcal{L}^{(\mathrm{sem})}$ is a contrastive objective that aligns the student’s embedding of a clean field value with the teacher’s embedding and pushes apart the embeddings of erroneous values. The logic loss $\mathcal{L}^{(\mathrm{log})}$ is a binary cross-entropy objective on whether a given pair of field values satisfies a sampled functional dependency, training the student to internalize basic constraint judgments.

After distillation, the student model produces a base semantic representation $h_{\mathrm{base}}(x) \in \mathbb{R}^{d}$ for each record, which serves as input to the third layer.

A subtle point regarding the teacher choice merits explicit clarification. Although GPT-4o serves as the teacher for L2 and is also retained as the GPT-4o-Direct baseline in Section 4.1, the teacher upper-bounds only the standalone semantic-correction capability of L2; it does not upper-bound the framework’s record-level repair quality. L1 handles deterministic format errors more reliably than free-form generation, L3 enforces relational and arithmetic constraints that GPT-4o operating field by field cannot guarantee, and the conformal rejection mechanism filters low-confidence student outputs that would otherwise propagate. The framework can therefore exceed the teacher on aggregate metrics even though L2 alone cannot, as confirmed by the per-layer and per-method breakdowns in Section 4.2 and Section 4.4.

3.4.3. Training Configuration

All distillation experiments use a batch size of 64, a learning rate of $2 \times 10^{-5}$ with linear warm-up over the first 10% of steps, the AdamW optimizer, a maximum sequence length of 256 tokens, and temperature $\tau = 4.0$. Training runs for 30 epochs with early stopping based on cell-level F1 on the validation set. All experiments are conducted on a single NVIDIA A100 GPU (40 GB).

3.5. Neural Probabilistic–Logical Reasoning

Records routed to the third layer require structured reasoning that jointly considers neural semantic representations and explicit relational constraints.

A factor graph is constructed whose variables correspond to the uncertain field values in a record and whose factors encode both data-driven features and domain rules.

3.5.1. Graph Construction

For a record $x$ routed to L3, each field flagged as potentially erroneous becomes a variable node $v_{i}$. The domain of $v_{i}$ is the candidate set produced by the student model (top-k predictions).

Two types of factors are defined. Rule factors $f_{r}(V)$ instantiate hard and soft rules, where each $r \in R = R_{\mathrm{hard}} \cup R_{\mathrm{soft}}$ contributes a weighted indicator. Feature factors $\phi_{i}(V, h_{\mathrm{comb}}(x))$ couple the neural representation to the variable assignments.

The combined representation $h_{\mathrm{comb}}(x)$ is formed by concatenating the student encoder output $h_{\mathrm{base}}(x)$, a task-specific incremental network output $h_{\mathrm{deep}}(x) \in \mathbb{R}^{d'}$, and the rule-trigger vector $r_{L1}(x)$:

$$h_{\mathrm{comb}}(x) = h_{\mathrm{base}}(x) \oplus h_{\mathrm{deep}}(x) \oplus r_{L1}(x), \tag{9}$$

where $\oplus$ denotes vector concatenation.

3.5.2. Rule Templates

Hard rules are deterministic constraints derived from the schema. Examples include:

If $\text{province} = \text{“Jiangsu”}$, then $\text{city} \in \{\text{“Nanjing”}, \text{“Suzhou”}, \dots\}$.
$\text{order\_total} = \text{unit\_price} \times \text{quantity}$.

Soft rules capture statistical regularities that hold in most cases. Examples include:

If $\text{order\_amount} > 100{,}000$, then $\text{customer\_level}$ is likely “VIP” (weight 0.8).
If $\text{registration\_channel} = \text{“enterprise”}$ and $\text{monthly\_order\_count} \geq 5$, then $\text{customer\_type}$ is likely “corporate” (weight 0.7).

Hard rules receive a fixed large weight; soft-rule weights are learned.
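The two hard-rule examples can be expressed as simple predicates whose outcomes form a rule-trigger vector. The sketch below is illustrative: the city dictionary contains only the two cities named above (the full list is elided in the example), a value of 1 is assumed to mean that the rule fired on a violation, and the function names are hypothetical.

```python
import math

# Only the cities named in the example; the full list for Jiangsu is elided there.
PROVINCE_CITIES = {"Jiangsu": {"Nanjing", "Suzhou"}}


def violates_province_city(rec):
    """True if the province is covered by the dictionary and the city is not listed."""
    cities = PROVINCE_CITIES.get(rec["province"])
    return cities is not None and rec["city"] not in cities


def violates_order_total(rec, tol=1e-6):
    """True if order_total differs from unit_price * quantity beyond a tolerance."""
    expected = rec["unit_price"] * rec["quantity"]
    return not math.isclose(rec["order_total"], expected, abs_tol=tol)


HARD_RULES = [violates_province_city, violates_order_total]


def rule_trigger_vector(rec):
    """Binary vector with one entry per hard rule (1 = rule fired on a violation)."""
    return [int(rule(rec)) for rule in HARD_RULES]


rec = {
    "province": "Jiangsu",
    "city": "Hangzhou",     # conflicts with the province-city dictionary
    "unit_price": 2.5,
    "quantity": 4,
    "order_total": 10.0,    # consistent with 2.5 * 4
}
triggers = rule_trigger_vector(rec)
print(triggers)
```

A vector of this form is cheap to compute and can be reused downstream as an input feature alongside the neural representations.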

3.5.3. Joint Distribution and Inference

The energy-based joint distribution over the variables is

p_{Θ} (V, Z ∣ x) \propto exp (\sum_{r \in R} w_{r} f_{r} (V) + \sum_{i} θ_{i} ϕ_{i} (V, h_{comb} (x))),

(10)

where Z denotes latent variables and $Θ$ collects all learnable parameters.

Loopy belief propagation is adopted for approximate marginal inference, with a maximum of 20 message-passing iterations, and model parameters are learned by maximizing the evidence lower bound under a variational posterior $q_{φ} (Z ∣ x)$.
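To make Equation (10) concrete, the following minimal sketch computes exact marginals on a two-variable toy graph by exhaustive enumeration. It is an illustration under stated assumptions, not the paper's inference routine: enumeration is swapped in for loopy belief propagation, the latent variables Z are omitted, and the domains, feature scores, rule weight, and the names `domains`, `feature_score`, `VALID_CITIES`, and `exact_marginals` are all hypothetical.

```python
import itertools
import math

# Hypothetical candidate domains (stand-ins for the student's top-k outputs).
domains = {
    "province": ["Jiangsu", "Zhejiang"],
    "city": ["Nanjing", "Hangzhou"],
}

# Hypothetical values of theta_i * phi_i for each (field, candidate) pair.
feature_score = {
    ("province", "Jiangsu"): 1.2,
    ("province", "Zhejiang"): 0.4,
    ("city", "Nanjing"): 0.9,
    ("city", "Hangzhou"): 1.0,
}

# Assumed province-to-city lookup used by the rule factor.
VALID_CITIES = {
    "Jiangsu": {"Nanjing", "Suzhou"},
    "Zhejiang": {"Hangzhou", "Ningbo"},
}


def province_city_rule(assign):
    """Indicator f_r(V): 1.0 when the city belongs to the province."""
    return 1.0 if assign["city"] in VALID_CITIES[assign["province"]] else 0.0


# (rule function, weight w_r); a large fixed weight stands in for a hard rule.
rules = [(province_city_rule, 10.0)]


def energy(assign):
    """Exponent of Equation (10) for one joint assignment (Z omitted)."""
    rule_term = sum(w * f(assign) for f, w in rules)
    feat_term = sum(feature_score[(k, v)] for k, v in assign.items())
    return rule_term + feat_term


def exact_marginals():
    """Enumerate all assignments, normalize, and sum out to per-field marginals."""
    names = list(domains)
    joint = {}
    for values in itertools.product(*(domains[n] for n in names)):
        joint[values] = math.exp(energy(dict(zip(names, values))))
    z = sum(joint.values())
    marg = {n: {v: 0.0 for v in domains[n]} for n in names}
    for values, weight in joint.items():
        for n, v in zip(names, values):
            marg[n][v] += weight / z
    return marg


marginals = exact_marginals()
```

In this toy setting the feature scores alone slightly favor “Hangzhou” for the city field, but the rule factor ties the city to the more likely province, so the marginal shifts toward “Nanjing”. Enumeration grows exponentially in the number of variables, which is why an approximate scheme is needed on real records.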

The novel component is the generator that produces soft-rule weights from the combined representation:

π = GumbelSoftmax (A h_{comb} (x) + b, τ_{g}), w_{k} = π_{k},

(11)

with Gumbel temperature $τ_{g}$ annealed linearly from 1.0 to 0.1 over 50 epochs. Factor-graph training is conducted sequentially after distillation has converged (Section 3.4), so the 50-epoch schedule here is independent of the 30-epoch distillation budget and the student parameters are frozen except during the neural-parameter phase described below. This construction differs from fixed-weight Markov logic networks in that soft-rule weights are conditioned on each record through $h_{comb} (x)$, which allows the same soft rule to exert different influence across records depending on their neural evidence. Training alternates between optimizing the factor and neural parameters with soft-rule weights frozen and optimizing the generator with neural parameters frozen, and it stops when the validation F1 and explanation consistency improve by less than 0.1% for five consecutive epochs.
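The weight generator in Equation (11) and its temperature schedule can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' implementation: the function names, the toy matrix `A`, the bias `b`, and the convention that the schedule reaches 0.1 exactly at the last of the 50 epochs are assumptions.

```python
import math
import random


def annealed_temperature(epoch, total_epochs=50, start=1.0, end=0.1):
    """Linear schedule: `start` at epoch 0, `end` at epoch total_epochs - 1 (assumed)."""
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + frac * (end - start)


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_rule_weights(h_comb, A, b, tau, rng):
    """pi = GumbelSoftmax(A h_comb + b, tau); one weight per soft rule."""
    logits = [sum(a * h for a, h in zip(row, h_comb)) + bias
              for row, bias in zip(A, b)]
    return gumbel_softmax(logits, tau, rng)


rng = random.Random(0)
weights = soft_rule_weights(
    h_comb=[0.5, -1.0, 2.0],                 # toy combined representation
    A=[[0.1, 0.2, 0.3], [0.0, -0.5, 0.4]],   # toy projection, two soft rules
    b=[0.0, 0.1],
    tau=annealed_temperature(10),
    rng=rng,
)
```

Lower temperatures push the sampled vector toward a one-hot choice, so late in the schedule a single soft rule tends to dominate for a given record.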

3.5.4. Evidence Output

For each repair decision, the third layer outputs the posterior probability of the chosen candidate, the set of activated rules with their weights, and the sequence of belief-propagation messages that led to the final marginal. An example evidence trace for a corrected city field might read: rule R1 (province-city constraint, weight 1.0) activated; rule R7 (address-zipcode soft rule, weight 0.82) activated; posterior $p (city = “Nanjing” ∣ x) = 0.93$.

3.6. Hierarchical Conformal Prediction and Rejection

After the three processing layers produce their corrections, the system must assess the reliability of each repair before writing it back to the data source.

Conformal prediction is applied independently at each layer, and the coverage guarantees are then aggregated at the global level. Figure 3 illustrates the overall confidence quantification and rejection workflow.

3.6.1. Candidate Space

The candidate label space $Y$ over which prediction sets are constructed differs by field type. For categorical fields, $Y$ is the full set of valid domain values. For numerical fields, $Y$ is a discretized set of candidate values centered on the model’s point prediction.

For free-text fields, $Y$ is restricted to the top-k candidates generated by the student model (with $k = 20$ in the experiments reported in Section 4).

In all cases, the conformal procedure operates over a finite candidate set.

3.6.2. Nonconformity Score

At each layer ℓ, a nonconformity score $α_{l} (x, y)$ is defined as a weighted combination of three components:

α_{l} (x, y) = w_{1}^{(l)} g_{conf}^{(l)} (x, y) + w_{2}^{(l)} g_{unc}^{(l)} (x) + w_{3}^{(l)} g_{sim}^{(l)} (x, y),

(12)

where $g_{conf}^{(l)} (x, y) = 1 - p_{l} (y ∣ x)$ captures the prediction confidence at layer ℓ, $g_{unc}^{(l)} (x)$ is the entropy of the predictive distribution, and $g_{sim}^{(l)} (x, y)$ measures the representation distance between x and its nearest neighbors in the calibration set. The weight vector $w^{(l)}$ is non-negative, sums to one, and is estimated by minimizing the average prediction set size on the calibration set subject to the coverage constraint.
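A minimal sketch of Equation (12) for a single layer is given below. It assumes that the entropy is taken in nats without normalization and that the neighbor distance is precomputed and passed in as a number; the paper does not fix either detail, and the function name and the example weights are hypothetical.

```python
import math


def nonconformity(probs, y, sim_distance, weights):
    """alpha(x, y) = w1 * (1 - p(y|x)) + w2 * H(p(.|x)) + w3 * sim_distance."""
    w1, w2, w3 = weights
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    g_conf = 1.0 - probs[y]                                         # low confidence -> high score
    g_unc = -sum(p * math.log(p) for p in probs.values() if p > 0)  # entropy in nats
    return w1 * g_conf + w2 * g_unc + w3 * sim_distance


probs = {"Nanjing": 0.9, "Suzhou": 0.1}   # toy predictive distribution
w = (0.6, 0.2, 0.2)                       # illustrative weights, not learned values
score_likely = nonconformity(probs, "Nanjing", sim_distance=0.3, weights=w)
score_unlikely = nonconformity(probs, "Suzhou", sim_distance=0.3, weights=w)
```

Because the entropy and distance terms do not depend on the candidate in this sketch, the two scores differ only through the confidence term, here by 0.6 × (0.9 − 0.1) = 0.48.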

3.6.3. Layer-Wise Calibration

After the routing function $r (\cdot)$ and all model parameters are fixed, an independent calibration set is constructed for each layer:

D_{cal}^{(l)} = {(x_{i}, y_{i}) : r (x_{i}) = l} .

(13)

The calibration samples do not participate in training the model or the router. For each calibration sample, the nonconformity score is computed and the scores are sorted in ascending order. Given a nominal error level $δ \in (0, 1)$, the quantile threshold at layer ℓ is

q_{l} (δ) = α_{(k_{l})}^{(l)}, k_{l} = ⌈ (1 - δ) (n_{l} + 1) ⌉,

(14)

where $n_{l} = | D_{cal}^{(l)} |$ and $α_{(k_{l})}^{(l)}$ is the $k_{l}$-th smallest calibration score at layer ℓ. For a new record x routed to layer ℓ, the prediction set is

{\hat{Γ}}_{l} (x) = {y \in Y : α_{l} (x, y) \leq q_{l} (δ)} .

(15)
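Equations (14) and (15) translate into a few lines of code. The sketch below is illustrative: the function names and the toy scores are assumptions, and the fallback to an infinite threshold when $k_{l}$ exceeds $n_{l}$ is a common convention in split conformal prediction rather than something stated in the text.

```python
import math


def conformal_threshold(cal_scores, delta):
    """Return the k-th smallest score, k = ceil((1 - delta) * (n + 1))."""
    n = len(cal_scores)
    k = math.ceil((1.0 - delta) * (n + 1))
    if k > n:
        # Assumed convention: too few calibration samples, keep every candidate.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """All candidates whose nonconformity score does not exceed the threshold."""
    return {y for y, score in candidate_scores.items() if score <= threshold}


# Toy calibration scores for one layer; delta = 0.25 is chosen only so that
# the arithmetic is easy to follow (n = 7, k = ceil(0.75 * 8) = 6).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
q = conformal_threshold(cal_scores, delta=0.25)
gamma = prediction_set({"Nanjing": 0.15, "Suzhou": 0.55, "Wuxi": 0.80}, q)
```

With these numbers the threshold is the sixth smallest score, 0.6, so “Nanjing” and “Suzhou” enter the prediction set and “Wuxi” is excluded.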

3.6.4. Assumptions and Global Coverage

The per-layer coverage guarantee relies on two assumptions:

Assumption 1.

Conditional exchangeability. Within each layer ℓ, the calibration samples and the test samples are exchangeable conditional on $r (X) = l$.

Assumption 2.

Fixed routing. The routing function $r (\cdot)$ depends only on the input features and is fixed before calibration begins.

Under Assumption 1, each layer satisfies

P (Y \in {\hat{Γ}}_{l} (X) ∣ r (X) = l) \geq 1 - δ .

(16)

By the law of total probability, the global coverage follows

P (Y \in \hat{Γ} (X)) = \sum_{l} P (Y \in {\hat{Γ}}_{l} (X) ∣ r (X) = l) P (r (X) = l) \geq 1 - δ .

(17)
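The inequality in Equation (17) follows in two short steps, written out here for completeness. The only additional fact used is that the router assigns every record to exactly one layer, so the routing probabilities sum to one.

```latex
\begin{aligned}
P (Y \in \hat{\Gamma} (X))
  &= \sum_{l} P (Y \in \hat{\Gamma}_{l} (X) \mid r (X) = l) \, P (r (X) = l) \\
  &\geq \sum_{l} (1 - \delta) \, P (r (X) = l)
     && \text{by Equation (16), applied to each layer} \\
  &= (1 - \delta) \sum_{l} P (r (X) = l) = 1 - \delta .
\end{aligned}
```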

Conditional exchangeability may be violated in practice when enterprise data exhibit temporal drift or when the error distribution shifts between calibration and deployment.

Section 4 reports additional experiments under a temporal split to evaluate the robustness of the coverage guarantee under mild distribution shift.

3.6.5. Rejection Mechanism

On top of the conformal prediction sets, a rejection option is introduced based on maximum posterior confidence.

Let $π_{max}^{(l)} (x) = \max_{y \in Y} p_{l} (y ∣ x)$ denote the highest posterior probability at layer ℓ. A rejection threshold $τ_{l}$ is set to the $ρ$-th quantile of the empirical distribution of $π_{max}^{(l)}$ on the calibration set. When $π_{max}^{(l)} (x) < τ_{l}$, the system withholds a single-point repair and instead returns the full prediction set ${\hat{Γ}}_{l} (x)$ along with the associated evidence. The record is then routed to a human-review channel according to the deployment policy. Because the threshold is derived from a data-driven quantile, the rejection rate adapts to the confidence profile of each layer, and slow distributional shifts move the quantile rather than the rejection criterion itself.
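The rejection rule can be sketched as below. The quantile convention (the smallest calibration value with at least a fraction ρ of the mass at or below it), the return format, and the names `empirical_quantile` and `decide` are assumptions made for illustration; the paper does not specify them.

```python
import math


def empirical_quantile(values, rho):
    """Smallest value with at least a fraction rho of the samples at or below it."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, math.ceil(rho * len(ordered)) - 1))
    return ordered[idx]


def decide(posterior, pred_set, tau):
    """Apply a single-point repair, or defer to human review with the full set."""
    best_label, best_prob = max(posterior.items(), key=lambda kv: kv[1])
    if best_prob < tau:
        return {"action": "review", "candidates": sorted(pred_set)}
    return {"action": "repair", "value": best_label}


# Toy calibration values of the maximum posterior at one layer.
cal_max_posterior = [0.55, 0.62, 0.71, 0.80, 0.88, 0.91, 0.95, 0.99]
tau = empirical_quantile(cal_max_posterior, rho=0.25)

uncertain = decide({"Nanjing": 0.58, "Suzhou": 0.42}, {"Nanjing", "Suzhou"}, tau)
confident = decide({"Nanjing": 0.93, "Suzhou": 0.07}, {"Nanjing"}, tau)
```

In this toy case the threshold is 0.62, so the first record is deferred with both candidates attached and the second is repaired automatically.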

3.7. Operational Considerations

The proposed framework maintains four interdependent components that must be kept mutually consistent: the rule base at L1, the distilled student model at L2, the factor graph and its soft-rule generator at L3, and the per-layer conformal calibration sets. Each component has a distinct update cadence and failure mode, and the interactions among them determine the realistic maintenance burden.

Rule changes in the schema or business logic directly modify L1 output and therefore the rule-trigger vector $r_{L 1} (x)$ that feeds L3. A change to a hard rule requires re-running the factor-graph inference on affected records and re-computing the L3 calibration set, but does not require re-training the student model. A change to a soft rule only requires re-training the soft-rule weight generator on L3 together with a refresh of the L3 calibration set. A change that introduces a new field or a new error type, by contrast, invalidates the student’s serialization template and requires a full re-distillation on the extended training set.

Model drift is handled through periodic re-calibration of the conformal quantiles rather than through re-training. A rolling calibration window (for instance, the most recent thirty days of labeled data) keeps the quantiles aligned with the current data distribution and preserves the coverage guarantee under slow shifts. The student model itself is re-distilled at a slower cadence, typically when the rolling miscoverage gap on the calibration window exceeds a deployment-specified tolerance.

Version coupling is enforced across components. Each deployed pipeline is pinned to a triple of artifacts: the student model checkpoint, the factor-graph parameter set including the soft-rule generator, and the per-layer conformal quantile table. These artifacts are versioned together and any mismatch triggers a deployment abort, which prevents silent coverage violations when an updated student is paired with a stale calibration set.
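One way to enforce this coupling is a manifest check at load time, sketched below. The class name, field names, version strings, and error message are hypothetical; the paper states only that the three artifacts are versioned together and that a mismatch aborts deployment.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ArtifactTriple:
    """Versions of the three artifacts a deployed pipeline is pinned to."""
    student_checkpoint: str
    factor_graph_params: str
    conformal_quantiles: str


def verify_pinned(pinned, loaded):
    """Raise if any loaded artifact version differs from the pinned one."""
    mismatched = [f.name for f in fields(pinned)
                  if getattr(pinned, f.name) != getattr(loaded, f.name)]
    if mismatched:
        raise RuntimeError(
            "deployment aborted, version mismatch: " + ", ".join(mismatched))
    return True


pinned = ArtifactTriple("student-v3", "fg-v3", "quantiles-v3")
ok = verify_pinned(pinned, ArtifactTriple("student-v3", "fg-v3", "quantiles-v3"))
```

Pairing an updated student with a stale quantile table would fail this check before any record is processed.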

Relative to simpler baselines, the framework’s maintenance cost is higher than that of a single fine-tuned encoder but lower than that of a hand-maintained rule base with comparable coverage, because the student model absorbs most of the semantic cases that would otherwise require hand-coded rules, and the calibration layer automates the confidence estimation that is typically implemented ad hoc in rule-based systems. The incremental engineering burden is concentrated in the versioning and re-calibration pipeline, which is standard infrastructure in production machine learning systems.

4. Experimental Validation and Analysis

4.1. Experimental Setup

Three datasets are used in this study, covering both real enterprise data and public benchmarks. Table 2 summarizes the key statistics.

The first, ENT-Prod, is derived from the production database of an industry partner and spans three relational tables. It contains 15,247 records with 22 attributes per record and has an overall cell-level error rate of 7.8%, covering format violations, semantic conflicts, cross-field inconsistencies, and missing values. The second dataset, Hospital, is a widely used data-cleaning benchmark containing 1000 records and 19 attributes with approximately 5% erroneous cells, primarily caused by functional dependency violations. The original dirty version is used without additional error injection. The third dataset, Flights, contains 2136 flight delay records. Following standard injection protocols, a dirty version is constructed by applying value substitution, missing-value injection, and format corruption to yield an overall error rate of approximately 10%, while the original clean values serve as ground-truth labels.

Across all datasets, the data are partitioned into training, validation, calibration, and test sets in a 6:1:1:2 ratio. This ratio was chosen to balance competing statistical requirements for the progressive pipeline. First, a minimum of 10% calibration data (e.g., approximately 1500 records for ENT-Prod) was necessary to compute stable empirical quantiles for conformal prediction. Second, a 20% test set was required to maintain sufficient statistical power for record-level paired significance testing. Allocating the remaining 70% primarily to training (60%) rather than validation (10%) ensured stable convergence of the distilled student model. Alternative splits evaluated during preliminary checks (such as 7:1:1:1 or 5:1:1:3) either produced high variance in the conformal thresholds or lacked sufficient test samples to confirm significance, which supports 6:1:1:2 as the default configuration.
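The partition sizes implied by the 6:1:1:2 ratio can be checked with a short sketch. A uniform random shuffle with a fixed seed is assumed here; the paper does not state whether the split is stratified, and the function name is hypothetical.

```python
import random


def split_indices(n, ratios=(6, 1, 1, 2), seed=0):
    """Shuffle record indices and cut them into train/val/cal/test partitions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = sum(ratios)
    bounds, acc = [], 0
    for r in ratios[:-1]:
        acc += r
        bounds.append(n * acc // total)  # cumulative cut points
    starts = [0] + bounds
    ends = bounds + [n]
    names = ("train", "val", "cal", "test")
    return {name: idx[s:e] for name, s, e in zip(names, starts, ends)}


parts = split_indices(15247)  # ENT-Prod record count
sizes = {name: len(members) for name, members in parts.items()}
```

For the 15,247 ENT-Prod records this yields 9148 training, 1524 validation, 1525 calibration, and 3050 test records, in line with the roughly 1500 calibration records quoted above.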

Seven baseline methods are compared, as listed in Table 3. The selection is designed to span the four methodological families against which the proposed framework must be positioned: (i) purely rule-based repair (Rule-Only, Rule + Stat), which establishes the deterministic lower bound; (ii) learning-augmented constraint repair (HoloClean, Raha + Baran), which represents the state of the art for structured inconsistencies; (iii) a distillation-free encoder (BERT-FT), which isolates the contribution of the distillation objective from that of the BERT-base backbone shared with the student model; and (iv) an LLM-direct baseline (GPT-4o-Direct), which sets an empirical ceiling on semantic repair quality and an empirical floor on inference efficiency. HoloClean and Raha + Baran are run using their publicly released codebases on the same data splits. GPT-4o-Direct is invoked via API in a five-shot setting; its latency figures reflect end-to-end wall-clock time including network overhead and are marked separately in the result tables.

Six quality metrics, two efficiency metrics, and four reliability metrics are evaluated. The quality metrics include cell-level detection precision, recall, F1, repair accuracy on erroneous cells, end-to-end corrected cell accuracy, and record-level complete repair rate. Efficiency is measured by average inference latency per record (p50) and throughput (records per second). Reliability is assessed via actual coverage, average prediction set size, rejection rate, and empirical miscoverage gap. Each configuration is repeated with five random seeds to account for training variance. Significance is assessed via paired t-tests at the record level on the shared test split, with * indicating p < 0.05 and ** indicating p < 0.01 relative to the best non-Ours baseline.
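For reference, the statistic behind a record-level paired t-test can be computed as in the sketch below. It is a generic illustration, not the authors' evaluation script: the per-record outcome vectors are toy values, the function name is hypothetical, and the p-value (which requires the Student t distribution with n − 1 degrees of freedom) is deliberately left out.

```python
import math


def paired_t_statistic(a, b):
    """t = mean(d) / (sd(d) / sqrt(n)) for per-record differences d = a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two records")
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased sample variance
    if var == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n)


# Toy per-record outcomes (1 = record fully repaired) for two methods.
ours = [1, 1, 0, 1, 1, 1, 0, 1]
baseline = [1, 0, 0, 1, 0, 1, 0, 1]
t_stat = paired_t_statistic(ours, baseline)
```

Pairing by record removes the variance that is shared between methods on the same test split, which is why the comparison is made record by record rather than on aggregate scores.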

All experiments are conducted on an Ubuntu 20.04 LTS operating system equipped with a single NVIDIA A100 GPU (40 GB), utilizing Python 3.9, PyTorch 2.0.1, and CUDA 11.8.

4.2. Main Results

Table 4 presents the cell-level and record-level metrics alongside efficiency indicators for all methods on the three datasets.

On ENT-Prod, Ours-Full achieves the highest cell-level F1 of 0.926, surpassing the second-best GPT-4o-Direct (0.914) by 1.2 absolute points ($p < 0.01$). The record-level complete repair rate reaches 0.702, exceeding GPT-4o-Direct by 3.1 points and HoloClean by 6.3 points. The gap in record-level performance is substantially larger than the gap in cell-level F1, which suggests that third-layer factor-graph reasoning resolves cross-field conflicts that the baselines leave partly unrepaired: fixing individual cells raises cell-level recall without yielding a fully consistent record. On E2E cell accuracy, Ours-Full (0.983) slightly outperforms GPT-4o-Direct (0.981), confirming that the progressive pipeline introduces few false-positive repairs on clean cells. The throughput of Ours-Full (94 rec/s) is 5.2 times that of GPT-4o-Direct (18 rec/s), making the framework viable for batch processing in enterprise pipelines.

On Hospital, HoloClean performs competitively (F1 = 0.895, Repair Acc. = 0.891) because the dataset is rich in explicit functional dependencies, which HoloClean’s denial-constraint inference directly exploits. Ours-Full still leads in record-level complete repair rate (0.706 vs. 0.672), because the student model captures semantic patterns in address and name fields that purely constraint-driven inference overlooks. BERT-FT lags behind both HoloClean and Raha + Baran on this dataset (Record Repair 0.638 vs. 0.672 and 0.657), which suggests that a fine-tuned encoder without explicit constraint integration is insufficient when functional dependencies dominate.

On Flights, where semantic errors account for the largest share, GPT-4o-Direct achieves a higher cell-level F1 (0.893) than Ours-Full (0.887). The difference arises because GPT-4o operates over an open vocabulary and can resolve low-frequency semantic errors through its broad world knowledge, whereas the student model’s top-k candidate set occasionally misses rare correct values. Despite this cell-level gap, Ours-Full leads on record-level complete repair rate (0.669 vs. 0.648) by 2.1 points, because the factor graph enforces multi-field consistency that GPT-4o, operating field by field, cannot guarantee. The latency advantage remains substantial (12.3 ms vs. 60.9 ms), representing an approximate 5× reduction.

Summary of the Performance Advantage

The lead of the proposed framework across the three datasets traces to three concrete mechanisms. Complexity-aware routing matches each record with its lowest-cost adequate solver, which keeps easy records away from the L3 budget and prevents hard records from being forced onto a semantic backbone that cannot enforce relational constraints; this is what produces the latency and throughput gains. The rule-trigger vector $r_{L 1} (x)$ injected into the factor graph makes the relevant symbolic signals explicit to the L3 reasoner, which is what produces the record-level repair gains on the two datasets where cross-field dependencies dominate (ENT-Prod and Hospital). The hierarchical conformal layer and its rejection mechanism filter low-confidence L2 outputs before they enter the end-to-end metric, which is what keeps E2E cell accuracy from being eroded by occasional student errors on rare semantic cases. The ablations in Section 4.4 test each of these mechanisms separately.

4.3. Routing and Efficiency Analysis

On ENT-Prod, solving the constrained optimization in Equation (7) with $A_{min} = 0.92$ on the validation set yields $θ_{low} = 35.0$ and $θ_{high} = 65.0$; the corresponding values on Hospital and Flights are reported in parentheses hereafter. Table 5 compares three routing strategies on ENT-Prod.

Under adaptive routing, approximately 57% of records are handled by L1, 31% by L2, and 12% by L3. Sending all records to L3 (All-to-L3) yields a marginal F1 gain of 0.003, at the expense of more than doubling the p50 latency from 11.9 ms to 27.3 ms. The quality gain is marginal because most records contain only format or simple semantic errors that L1 and L2 already resolve correctly; sending them through the full factor-graph pipeline adds cost without changing the repair outcome. Random routing assigns records uniformly regardless of complexity and produces a lower F1 of 0.903. The 2.3-point drop relative to adaptive routing confirms that the learned complexity function captures meaningful structure and that indiscriminate allocation to expensive layers degrades both efficiency and quality.

Figure 4 provides a finer view of routing behavior by partitioning records into five complexity bins.

In the lowest bin (0–20), 96% of records are routed to L1, which aligns with the expectation that simple format violations and dictionary mismatches concentrate at low complexity scores. The 20–40 bin shows the onset of a transition, with 72% of records still assigned to L1 and 24% forwarded to L2, reflecting cases where a slightly elevated character-level perplexity signals the need for semantic analysis. In the 40–60 range, L2 becomes the dominant destination (62%), indicating that records with moderate semantic uncertainty are handled by the distilled student model. Above a complexity score of 60, L3 absorbs the majority of records (62% in the 60–80 bin and 89% in the 80–100 bin), corresponding to records with strong cross-field dependencies and high entropy across multiple candidate values. The transition between adjacent bins is smooth, with no abrupt jumps at the threshold boundaries, which indicates that the boundary regularization term (Section 3.3) prevents oscillatory routing decisions.

Calibration of the Routing Entropy Signal

The use of $S (x)$ as a routing feature presupposes that the entropy of the projection-head distribution tracks downstream correction error. We verify this on the ENT-Prod validation set. After post hoc temperature scaling at $T = 2.0$ (Section 3.3), the expected calibration error of the projection head drops from 0.087 (uncalibrated softmax) to 0.021, and the Spearman rank correlation between per-record entropy and actual correction error rises from 0.42 to 0.61. Figure 5 reports the corresponding diagnostics. Figure 5a is a reliability diagram comparing the uncalibrated and calibrated curves against the perfect-calibration diagonal; the calibrated curve lies visibly closer to the diagonal across the entire confidence range. Figure 5b plots the empirical correction error rate across ten equal-frequency entropy bins together with 95% bootstrap confidence intervals. The error rate increases monotonically from approximately 4% in the lowest-entropy bin to 38% in the highest, and the intervals of adjacent bins either do not overlap or barely touch. The monotone relation between $S (x)$ and downstream correction error is therefore established empirically, which justifies the use of $S (x)$ as a complexity signal for records in the medium-complexity routing band.
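The two quantities used in this check, the entropy of a temperature-scaled softmax and the expected calibration error, can be sketched as follows. Equal-width confidence bins are assumed for the calibration error because the binning scheme is not stated, and the logits and outcomes are toy values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / T; T > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; plays the role of the routing signal S(x)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        members = [i for i, c in enumerate(confidences)
                   if lo < c <= hi or (b == 0 and c == 0.0)]
        if not members:
            continue
        acc = sum(correct[i] for i in members) / len(members)
        conf = sum(confidences[i] for i in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece


logits = [4.0, 1.0, 0.5]  # toy projection-head logits for one record
s_raw = entropy(softmax_with_temperature(logits, 1.0))
s_scaled = entropy(softmax_with_temperature(logits, 2.0))

ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
ece_overconfident = expected_calibration_error([0.9, 0.9], [1, 0])
```

Temperature scaling does not change which candidate ranks first; it only softens the probabilities, which is why it can improve calibration without altering the point prediction.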

4.4. Component Ablation

Table 6 presents the ablation results on ENT-Prod. Six variants are evaluated, each disabling a single component while keeping all other settings identical to Ours-Full.

Among the six variants in Table 6, removing the rule-trigger vector (No-Rule-Feat) ranks first by the magnitude of the record-repair drop, followed closely by removing the logical-reasoning layer (No-Logic-Layer). The cell-F1 drops are 1.8 points (No-Rule-Feat, paired t-test $p < 0.01$, Cohen’s $d = 1.42$) and 1.3 points (No-Logic-Layer, $p < 0.01$, $d = 1.05$), respectively; the corresponding record-repair drops are 4.1 and 3.0 points. The pairwise gap between the two variants is 1.1 record-repair points and is itself significant ($p = 0.03$, paired t-test on the shared test split), but the two effects lie within the same order of magnitude, so the rule-trigger vector and the logic layer are best understood as jointly necessary contributors to cross-field repair quality rather than as a single dominant factor. The decline of No-Rule-Feat is concentrated on records routed to L3, where the factor graph relies on the rule-trigger vector to identify which hard constraints are relevant; without this vector, the factor graph treats all constraints uniformly, leading to repairs that appear plausible at the individual field level yet violate relational dependencies across fields. Section 4.4.3 reports a complementary ablation restricted to cross-field error cells, which isolates the two components more directly.

No-Logic-Layer, which removes L3 entirely and routes all high-complexity records to L2, reduces the record repair rate by 3.0 points. The latency drops to 8.4 ms because the most expensive processing stage is eliminated. The quality loss confirms that the student model alone cannot enforce arithmetic and relational constraints when multiple fields in a record are simultaneously uncertain.

No-Boundary disables the boundary regularization term and leads to a 1.8-point decline in record repair rate (from 0.702 to 0.684). Inspection of the routing decisions reveals that, without regularization, records near the threshold boundaries oscillate between adjacent layers across training epochs, and the resulting instability degrades the quality of repairs for medium-complexity records.

No-Routing sends every record to L3 and achieves a slightly higher F1 (0.929) and record repair rate (0.705) than Ours-Full, because every record benefits from full-depth processing.

The 0.3-point F1 advantage of No-Routing falls within one standard error of Ours-Full and does not reach significance under a paired t-test ($p = 0.31$). The p50 latency, however, increases from 11.9 ms to 27.3 ms, a $2.3 \times$ increase that is both significant and operationally prohibitive. The contribution of routing is therefore primarily in efficiency: it reduces latency by 56% relative to full-depth processing while incurring only a 0.3-point F1 trade-off that is statistically indistinguishable from noise.

Compared to GPT-4o-Direct, the end-to-end latency reduction averages approximately 80% across the three datasets (81.0%, 81.2%, and 79.8% on ENT-Prod, Hospital, and Flights, respectively).
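The latency figures quoted in this section are mutually consistent, as the following arithmetic check shows. All inputs are numbers reported in the text; only the variable names are introduced here.

```python
# p50 latency on ENT-Prod: full-depth processing (No-Routing) vs. adaptive routing.
full_depth_ms, routed_ms = 27.3, 11.9
slowdown = full_depth_ms / routed_ms             # about 2.29, reported as 2.3x
routing_saving = 1.0 - routed_ms / full_depth_ms  # about 0.564, reported as 56%

# Flights: Ours-Full vs. GPT-4o-Direct latency from Section 4.2.
flights_reduction = 1.0 - 12.3 / 60.9            # about 0.798, reported as 79.8%

# Mean of the three per-dataset reductions relative to GPT-4o-Direct.
per_dataset_pct = [81.0, 81.2, 79.8]
mean_reduction_pct = sum(per_dataset_pct) / len(per_dataset_pct)  # about 80.7
```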

Removing conformal prediction (No-CP) has a negligible impact on F1 and repair accuracy, yet it eliminates the coverage guarantee and the rejection mechanism. Single-Level-CP restores coverage control but achieves a lower actual coverage (0.941 vs. 0.949) at the 0.95 nominal level and produces larger prediction sets (see Section 4.5 for a detailed comparison).

4.4.1. Routing Score Design Ablation

To examine the design choices behind the pattern deviation score $D_{p} (x)$ in Equation (2), three groups of ablations are conducted on ENT-Prod: the score-combination scheme, the normalization scheme, and the order of the character-level language model. Table 7 reports the downstream cell-level F1 and the routing error rate (relative to the oracle routing defined in Section 3.3) for each configuration, with all other components held identical to Ours-Full.

The additive combination outperforms both the multiplicative and the learned alternatives. The multiplicative form suffers precisely from the failure mode predicted in Section 3.3: records with $N_{fail} (x) = 0$ collapse the score to zero and are routed to L1 irrespective of their perplexity, inflating the under-routing rate. The learned weighted sum is competitive but introduces an additional optimization layer between the score and the threshold, which weakens interpretability without a commensurate quality gain. Min–max normalization leads over Z-score and robust scaling by a small margin, consistent with the dimensional-alignment argument given in Section 3.3. The LM-order ablation shows a monotonic trend up to $n = 5$ with diminishing returns beyond trigram, confirming that the choice of $n = 5$ is in the plateau region rather than an outlier. Unigram is clearly inadequate because it ignores character co-occurrence statistics on which detection of corrupted tokens relies.

4.4.2. Routing Error and Oracle Comparison

To isolate the impact of routing decisions from layer-level repair capability, each validation record is executed on all three layers. The oracle layer is the lowest-cost layer that yields a correct repair, and the routing confusion matrix between the learned router and the oracle is reported in Table 8. The learned router agrees with the oracle on 93.2% of records. The disagreements are sharply asymmetric: over-routing (assignment to a higher layer than necessary) accounts for 5.1% and produces no repair quality loss, while under-routing (assignment to a lower layer than necessary) accounts for 1.7% and causes an average cell F1 drop of 0.18 on the affected records. Overall, under-routing contributes at most 0.003 to the aggregate cell F1 gap relative to oracle routing, which is within the standard error of the reported Ours-Full score. This asymmetry confirms the design intent that the accuracy constraint $A_{min}$ in Equation (7) should suppress under-routing even at the cost of tolerating some over-routing.

To make the limited aggregate impact of under-routing concrete, Table 9 reports the cell-level F1 conditioned on each routing-error category. Records where the router agrees with the oracle attain a cell F1 of 0.939. Over-routed records retain a cell F1 of 0.937, statistically indistinguishable from the agreement group, confirming that over-routing wastes computation without harming quality. Under-routed records suffer a cell F1 of 0.759, an absolute drop of 0.180 relative to the agreement group; weighted by the under-routing proportion (1.7%), the contribution to the aggregate cell-F1 gap is $0.180 \times 0.017 \approx 0.003$, which is below the standard error of the reported Ours-Full score. Under-routing therefore has a non-negligible local effect on the affected records but only a small effect on the aggregate metric.
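The oracle comparison and the weighted-gap arithmetic can be expressed compactly. In the sketch below, the layer ordering by cost follows the text, while the function names and the "unresolvable" label for records that no layer repairs correctly are assumptions, since the paper does not say how such records are categorized.

```python
LAYERS = ("L1", "L2", "L3")  # ordered by increasing cost


def oracle_layer(correct_by_layer):
    """Lowest-cost layer whose repair is correct; None if no layer succeeds."""
    for layer in LAYERS:
        if correct_by_layer.get(layer, False):
            return layer
    return None


def routing_category(routed, oracle):
    """Classify a routing decision relative to the oracle layer."""
    if oracle is None:
        return "unresolvable"  # assumed label, not defined in the paper
    r, o = LAYERS.index(routed), LAYERS.index(oracle)
    if r == o:
        return "agree"
    return "over" if r > o else "under"


def aggregate_gap(under_rate, agree_f1, under_f1):
    """Contribution of under-routed records to the aggregate cell-F1 gap."""
    return under_rate * (agree_f1 - under_f1)


oracle = oracle_layer({"L1": False, "L2": True, "L3": True})
categories = [routing_category(r, oracle) for r in LAYERS]
gap = aggregate_gap(under_rate=0.017, agree_f1=0.939, under_f1=0.759)
```

For a record whose oracle layer is L2, routing to L1 counts as under-routing and routing to L3 as over-routing; the reported proportions and F1 values give a gap of about 0.003.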

4.4.3. Cross-Field-Only Ablation

The aggregate ablation in Table 6 averages over four error types of unequal frequency. To isolate the contribution of each component on the error category that motivated its design, Table 10 restricts the evaluation to cells participating in at least one cross-field constraint violation on ENT-Prod ($n = 1287$ such cells in the test split). All variants share the same routing decisions and conformal calibration as in Table 6; only the corresponding component is disabled.

Three observations follow. First, on the cross-field error subset, removing the rule-trigger vector produces the largest drop ($- 0.065$, $d = 1.51$), 0.021 ahead of removing the logic layer ($- 0.044$, $d = 1.12$); the gap between the two variants is itself significant ($p = 0.008$, paired t-test). The two components therefore contribute on the same order of magnitude, with the rule-trigger vector ranking first by a margin that exceeds the standard error but does not dominate by an order of magnitude. Second, the boundary regularization and the conformal layer have small or non-significant effects on the cross-field subset, confirming that their contribution operates through routing stability and reliability rather than through cross-field repair capability per se. Third, the non-significant gap of Single-Level-CP and No-CP indicates that the conformal layer affects which corrections are accepted (via rejection) rather than the underlying repair accuracy, consistent with the theoretical role assigned in Section 4.5.

The reading of these results is intentionally conservative: the rule-trigger vector is the largest single contributor on the cross-field subset, but the logic layer is a near-coequal contributor, and removing either alone reproduces only part of the gap. The framework’s cross-field repair quality therefore depends on the joint presence of explicit rule signals and structured probabilistic–logical reasoning, rather than on any single mechanism in isolation.

4.5. Reliability Analysis

Conformal reliability is reported on ENT-Prod because its calibration split (roughly 1500 records, partitioned across three layers) is large enough to yield stable per-layer quantile estimates; Hospital and Flights have an order of magnitude fewer calibration samples, under which any reported coverage gap would be dominated by finite-sample sampling noise rather than by the calibration procedure itself. Table 11 compares the reliability metrics of Ours-Full and Single-Level-CP at three nominal coverage levels on ENT-Prod.

At all three nominal levels, Ours-Full achieves actual coverage that closely matches the target, with a maximum miscoverage gap of 0.004. Single-Level-CP consistently falls below the nominal target by a wider margin, with the gap reaching 0.009 at the 0.95 level. The reason is that a single calibration set conflates records of heterogeneous complexity. Simple records that L1 handles with near-certainty and complex records that L3 processes with high uncertainty are mixed in the same nonconformity score distribution, making a single quantile threshold either overly conservative on simple records or insufficiently conservative on complex ones. Per-layer calibration sidesteps this problem by constructing a separate, more homogeneous distribution within each layer.

Meanwhile, the average prediction set size of Ours-Full remains smaller than that of Single-Level-CP across all levels (e.g., 1.29 vs. 1.35 at the 0.95 level). Tighter sets translate directly to reduced human review burden, because a singleton prediction set can be accepted automatically whereas a set with two or more candidates requires manual adjudication.

Temporal Split Evaluation

To assess robustness under distribution shift, ENT-Prod is re-partitioned by timestamp, using the earliest 80% of records for training, validation, and calibration, and the most recent 20% for testing.

Under this temporal split at the 0.95 level, Ours-Full achieves an actual coverage of 0.938 with a miscoverage gap of 0.012, while Single-Level-CP drops to 0.921 with a gap of 0.029. The coverage of Ours-Full still approximately satisfies the nominal target, although the wider gap compared to the random split confirms that temporal drift weakens the conditional exchangeability assumption. Per-layer calibration retains an advantage because each layer receives a more homogeneous subset of records, reducing the distributional mismatch within each calibration set even when the global distribution shifts.

Figure 6 shows the trade-off between rejection rate and overall error rate on ENT-Prod, obtained by varying the rejection quantile parameter $ρ$.

When the rejection rate increases from 0% to approximately 8%, the overall error rate drops sharply from 9.9% to 5.1%, while the average set size grows moderately from 1.00 to 1.29. The steep initial decline reflects the fact that the most error-prone records, those with the lowest posterior confidence, are the first to be rejected. Removing a small fraction of high-risk decisions thus disproportionately improves overall accuracy. Beyond a rejection rate of 15%, the error rate curve flattens near 3.4% and further rejection yields diminishing returns while the set size continues to grow. The interval between 5% and 10% rejection offers the most favorable balance for practical deployment: the error rate is approximately halved relative to zero rejection, and the additional human review burden remains modest.

4.6. Error Analysis and Case Study

Table 12 breaks down cell-level repair accuracy by error type on ENT-Prod, revealing each method’s characteristic strengths and weaknesses beyond the aggregate metrics in Table 4.

Rule-Only achieves the highest format-violation accuracy (0.961), marginally above Ours-Full (0.958), because deterministic regex matching and domain dictionaries provide near-complete coverage of well-defined format rules; however, it drops to 0.523 on semantic conflicts, where contextual understanding is required.

GPT-4o-Direct achieves the highest repair accuracy on semantic conflicts (0.891), benefiting from its broad world knowledge and open vocabulary. Ours-Full trails GPT-4o-Direct by 1.8 points on this category, because the student model’s candidate set is restricted to the top-k predictions and occasionally excludes rare but correct values. On cross-field inconsistencies, Ours-Full (0.896) outperforms all baselines by at least 3.4 points (vs. HoloClean at 0.862). The advantage arises from the explicit integration of hard rules in the factor graph, which enforces arithmetic and relational constraints that purely statistical or LLM-based methods cannot guarantee. BERT-FT achieves only 0.768 on cross-field errors, confirming that an encoder fine-tuned without explicit constraint signals struggles to maintain multi-field consistency.

Figure 7 shifts the perspective from methods to layers, showing how the three-layer architecture achieves its overall performance through specialization.

L1 attains 0.974 on format violations, the highest single-layer accuracy for any error type, because regex and dictionary matching provide near-perfect coverage for well-defined format constraints. L1’s accuracy on semantic conflicts, by contrast, is only 0.531, reflecting the intrinsic limitation of deterministic rules for context-dependent errors. L2 reaches 0.891 on semantic conflicts, matching GPT-4o-Direct’s method-level accuracy (0.891 in Table 12), which indicates that the distillation objective transfers the teacher’s semantic-correction capability to the compact student model. L3 attains 0.912 on cross-field inconsistencies, substantially higher than L2’s 0.796 on the same category. This 11.6-point gap between L2 and L3 is the largest inter-layer difference across all error types, which is consistent with the design intent of routing high-complexity records to the factor graph. The routing mechanism directs each record to the layer best equipped to handle its dominant error type, which explains why the aggregate performance of Ours-Full exceeds what any single layer could achieve alone.

4.6.1. Case Studies

Case 1 (L1 repair). A customer record contains a phone number field with 12 digits. The rule layer detects the format violation through a regex check and removes the extraneous leading digit. Complexity score $C(x) = 6.3$; routed to L1; confidence 1.0; no conformal set needed.
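A deterministic rule of this kind might look like the sketch below. The target format of exactly 11 digits is an assumption inferred from the case description (12 digits with one extraneous leading digit); the actual rule set is not given in the text, and `repair_phone` is a hypothetical name.

```python
import re

VALID_PHONE = re.compile(r"\d{11}")            # assumed valid format: exactly 11 digits
ONE_EXTRA_LEADING = re.compile(r"\d(\d{11})")  # 12 digits: one extraneous leading digit


def repair_phone(value):
    """Return (repaired_value, confidence), or (value, None) if no rule applies."""
    digits = value.strip()
    if VALID_PHONE.fullmatch(digits):
        return digits, 1.0       # already valid, nothing to repair
    match = ONE_EXTRA_LEADING.fullmatch(digits)
    if match:
        return match.group(1), 1.0   # drop the leading digit deterministically
    return value, None           # no rule fires: leave the value to later layers
```

Since the repair is deterministic, the rule reports a confidence of 1.0 and no prediction set is needed, in line with the case description.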

Case 2 (L2 repair). A record lists city = “Naning” while province = “Jiangsu.” The rule layer does not reject the value because “Naning” is not in the explicit deny list, yet the character-level perplexity is elevated ($D_{p} = 0.71$). The student model predicts “Nanjing” with probability 0.94 based on the provincial context. Complexity score $C(x) = 38.2$; routed to L2; conformal set $\hat{\Gamma} = \{\text{“Nanjing”}\}$; singleton, accepted.

Case 3 (L3 repair). An order record shows order_total = 500, unit_price = 100, and quantity = 3, violating the arithmetic constraint total = price × quantity. The student model assigns comparable probabilities to quantity $\in \{3, 5\}$ and order_total $\in \{300, 500\}$, yielding high entropy ($S(x) = 1.82$). Complexity score $C(x) = 76.5$; routed to L3. Factor graph inference activates hard rule R2 (arithmetic constraint, weight 1.0) and soft rule R11 (historical order-size regularity, weight 0.74). The posterior $p(\text{quantity} = 5 \mid x) = 0.91$ determines the final repair. Evidence trace: R2 activated → R11 supporting → posterior 0.91 → accepted.
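The interaction between the hard and soft rules in Case 3 can be illustrated by brute-force enumeration over the four joint candidates. This is a toy sketch, not the paper's factor graph inference: the student probabilities and the soft-rule indicator are invented for illustration, the hard rule is treated as a 0/1 factor, and the result is not intended to reproduce the reported posterior of 0.91.

```python
import math
from itertools import product


def posterior(unit_price, quantity_probs, total_probs, soft_weight, soft_rule):
    """Enumerate joint repairs; the hard arithmetic rule removes inconsistent pairs."""
    scores = {}
    for (q, pq), (t, pt) in product(quantity_probs.items(), total_probs.items()):
        if t != unit_price * q:      # hard rule: total = price * quantity
            continue
        # Soft rule enters as a log-linear factor exp(weight * indicator).
        scores[(q, t)] = pq * pt * math.exp(soft_weight * soft_rule(q, t))
    z = sum(scores.values())
    if z == 0:
        raise ValueError("no candidate satisfies the hard rule")
    return {assignment: s / z for assignment, s in scores.items()}


# Hypothetical inputs: near-uniform student probabilities, and a soft rule
# that (by assumption) favours quantity = 5 as the historically regular size.
post = posterior(
    unit_price=100,
    quantity_probs={3: 0.5, 5: 0.5},
    total_probs={300: 0.45, 500: 0.55},
    soft_weight=0.74,
    soft_rule=lambda q, t: 1.0 if q == 5 else 0.0,
)
```

Only the two arithmetically consistent assignments, (3, 300) and (5, 500), survive the hard rule, and the soft rule then tips the balance between them. This mirrors the evidence trace in the case, where R2 prunes the candidates and R11 supports the final choice.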

4.6.2. Failure Analysis

One failure case involves a record where customer_name = “Li Wei” should be “Li Weimin.” The truncated name is itself a valid entry in the domain vocabulary, so neither the rule layer nor the student model flags it as erroneous. The maximum posterior confidence at L2 is 0.48, which falls below the rejection threshold ($\tau_{2} = 0.55$). The system correctly withholds a single-point repair and returns a prediction set $\hat{\Gamma} = \{\text{“Li Wei”}, \text{“Li Weimin”}, \text{“Li Weidong”}\}$, routing the record to human review. The failure illustrates a fundamental limitation: when the erroneous value is itself a valid domain entry, the model cannot distinguish correct from incorrect without external identity-resolution signals. The rejection mechanism is specifically designed to handle such boundary cases by deferring to human judgment rather than committing a potentially harmful write-back.
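The accept-or-defer logic described in this failure case and in the reliability analysis can be summarized in a few lines. The sketch below is a simplified reading of the mechanism, not the authors' code: it assumes that a record is deferred when the maximum posterior is below the layer threshold or when the prediction set is not a singleton, and `decide` is a hypothetical name.

```python
def decide(candidate_probs, prediction_set, tau):
    """Return ("accept", value) or ("review", sorted candidates for the reviewer)."""
    best_prob = max(candidate_probs.values())
    if best_prob < tau:
        return "review", sorted(prediction_set)      # low confidence: defer
    if len(prediction_set) == 1:
        return "accept", next(iter(prediction_set))  # singleton: write back
    return "review", sorted(prediction_set)          # ambiguous set: defer


# Failure case: only the maximum posterior 0.48 and tau_2 = 0.55 come from the
# text; the other two probabilities are invented for illustration.
failure = decide(
    {"Li Wei": 0.48, "Li Weimin": 0.31, "Li Weidong": 0.12},
    {"Li Wei", "Li Weimin", "Li Weidong"},
    tau=0.55,
)
```

With these inputs the record is deferred to review together with its three candidates, whereas the singleton set of Case 2 (probability 0.94) would be accepted under the same threshold.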

4.7. Limitations

Several limitations of the present study warrant explicit discussion. The enterprise dataset (ENT-Prod) cannot be publicly released due to confidentiality constraints; representative anonymized examples and schema descriptions are integrated into the main text to ensure clarity, and all numerical results reported for Hospital and Flights can be reproduced from the public repositories. The conformal coverage guarantee relies on conditional exchangeability, which may weaken under substantial temporal drift or abrupt schema changes. The temporal-split experiment in Section 4.5 showed graceful degradation, but stronger distributional shifts remain untested, and a formal robustness analysis under covariate shift would require adaptive conformal methods that are beyond the scope of this paper. Hard- and soft-rule templates require domain expertise for construction, which limits out-of-the-box applicability to new domains; the framework exposes the rule set as an explicit input and therefore inherits the maintenance burden of traditional rule-based systems on this dimension. On the Flights dataset, the student model’s top-k candidate set occasionally missed rare correct values, yielding a cell-level F1 slightly below that of GPT-4o-Direct (0.887 vs. 0.893 in Table 4); retrieval-augmented candidate generation is a natural extension but was not implemented here. All experiments were restricted to structured tabular data, and the framework’s performance on semi-structured formats such as JSON logs or nested XML records has not been evaluated. Finally, the operational considerations discussed in Section 3.7 describe the maintenance overhead qualitatively; a quantitative study of multi-component drift across a realistic production cycle is left to future work.

5. Conclusions and Future Work

This paper presented a complexity-aware progressive framework for enterprise data error correction that integrates deterministic rules, a task-specialized distilled language model, and neural probabilistic–logical reasoning within a unified architecture. A learnable router assigns each record to the processing layer matching its error complexity, and a hierarchical conformal prediction scheme provides per-layer calibrated prediction sets with a global coverage guarantee and a data-driven rejection mechanism. Experiments on ENT-Prod, Hospital, and Flights produced three principal findings. First, the framework improved the record-level complete repair rate by 2.1 to 3.1 percentage points over the strongest baseline (GPT-4o-Direct) while reducing inference latency by approximately 80% relative to direct GPT-4o invocation. Second, ablation studies confirmed the necessity of explicit symbolic signals for cross-field consistency. Third, hierarchical conformal calibration maintained tighter coverage and smaller prediction sets than single-level calibration across all nominal levels.

For enterprise data governance, the framework offers a deployable pattern for integrating language-model capabilities into latency-sensitive pipelines while keeping throughput and audit trails intact. For applied neuro-symbolic work, the results suggest that separating semantic representation from explicit logical constraints, and routing records based on error complexity, can yield better multi-field corrections than end-to-end neural pipelines on this type of data.

Building on these implications, several directions are identified as promising avenues for future work. The conformal layer could be made adaptive in time so that calibration tracks distribution shift without violating finite-sample coverage [40]. The fixed top-k student vocabulary could be replaced by retrieval-augmented candidate generation to recover rare correct values, which is the residual gap on Flights. The factor graph could be extended to semi-structured formats such as JSON or nested XML. Finally, the rejected-record stream is a natural input for closed-loop active learning, which would let the rule and student components improve on the records they currently defer.

Author Contributions

Conceptualization, C.L. and X.Z.; methodology, C.L. and E.W.; software, C.L. and H.M.; validation, H.M., J.Z. and E.W.; formal analysis, C.L.; investigation, C.L. and J.Z.; resources, X.Z.; data curation, H.M. and J.Z.; writing—original draft preparation, C.L.; writing—review and editing, X.Z. and E.W.; visualization, C.L.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank the data governance team at the industry partner for providing annotated enterprise data.

Conflicts of Interest

Authors Chao Liu and Hong Mu were employed by the company Nanjing Shurui Data Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Ilyas, I.F.; Chu, X. Data Cleaning; ACM Books; Association for Computing Machinery: New York, NY, USA, 2019.
2. Abedjan, Z.; Chu, X.; Deng, D.; Fernandez, R.C.; Ilyas, I.F.; Ouzzani, M.; Papotti, P.; Stonebraker, M.; Tang, N. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow. 2016, 9, 993–1004.
3. Mohammed, S.; Naumann, F.; Harmouch, H. Step-by-step data cleaning recommendations to improve ML prediction accuracy. In Proceedings of the 28th International Conference on Extending Database Technology (EDBT), Barcelona, Spain, 25–28 March 2025; pp. 542–554.
4. Chu, X.; Ilyas, I.F.; Krishnan, S.; Wang, J. Data cleaning: Overview and emerging challenges. In Proceedings of the ACM SIGMOD International Conference on Management of Data; Association for Computing Machinery: New York, NY, USA, 2016; pp. 2201–2206.
5. Abiteboul, S.; Hull, R.; Vianu, V. Foundations of Databases; Addison-Wesley: Boston, MA, USA, 1995; ISBN 978-0201537710.
6. Rekatsinas, T.; Chu, X.; Ilyas, I.F.; Ré, C. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow. 2017, 10, 1190–1201.
7. Mahdavi, M.; Abedjan, Z.; Castro Fernandez, R.; Madden, S.; Ouzzani, M.; Stonebraker, M.; Tang, N. Raha: A configuration-free error detection system. In Proceedings of the ACM SIGMOD International Conference on Management of Data; Association for Computing Machinery: New York, NY, USA, 2019; pp. 865–882.
8. Mahdavi, M.; Abedjan, Z. Baran: Effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow. 2020, 13, 1948–1961.
9. Huang, Z. Data cleaning using large language models. arXiv 2024, arXiv:2410.15547.
10. Biester, F.; Abdelaal, M.; Bermbach, D. LLMClean: Context-aware tabular data cleaning via LLM-generated OFDs. arXiv 2024, arXiv:2404.18681.
11. Bendinelli, T.; Dox, A.; Holz, C. Exploring LLM agents for cleaning tabular machine learning datasets. arXiv 2025, arXiv:2503.06664.
12. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. Available online: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html (accessed on 8 April 2026).
13. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
14. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
15. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, NeurIPS 2019, Vancouver, BC, Canada, 13 December 2019.
16. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
17. Richardson, M.; Domingos, P. Markov logic networks. Mach. Learn. 2006, 62, 107–136.
18. Getoor, L.; Taskar, B. Introduction to Statistical Relational Learning; MIT Press: Cambridge, MA, USA, 2007.
19. Xu, J.; Zhang, Z.; Friedman, T.; Liang, Y.; Van den Broeck, G. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2018; Volume 80, pp. 5502–5511. Available online: https://proceedings.mlr.press/v80/xu18h.html (accessed on 8 April 2026).
20. Lamb, L.C.; Garcez, A.; Gori, M.; Prates, M.; Avelar, P.; Vardi, M. Graph neural networks meet neural-symbolic computing: A survey and perspective. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Virtual, 7–15 January 2021; pp. 4877–4884.
21. Manhaeve, R.; Dumancic, S.; Kimmig, A.; Demeester, T.; De Raedt, L. DeepProbLog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31, pp. 3749–3759. Available online: https://proceedings.neurips.cc/paper/2018/hash/dc5d637ed5e62c36ecb73b654b05ba2a-Abstract.html (accessed on 5 May 2026).
22. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: Berlin/Heidelberg, Germany, 2005.
23. Balasubramanian, V.; Ho, S.S.; Vovk, V. Conformal Prediction for Reliable Machine Learning: Theory, Adaptations and Applications; Morgan Kaufmann: San Francisco, CA, USA, 2014.
24. Jäger, S.; Biessmann, F. From data imputation to data cleaning: Automated cleaning of tabular data improves downstream predictive performance. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics; PMLR: Cambridge, MA, USA, 2024; Volume 238, pp. 3394–3402. Available online: https://proceedings.mlr.press/v238/jager24a.html (accessed on 8 April 2026).
25. Zhan, X.; Xu, Q.; Zheng, Y.; Lu, G.; Gevaert, O. Reliability-enhanced data cleaning in biomedical machine learning using inductive conformal prediction. PLoS Comput. Biol. 2025, 21, e1012803.
26. Dallachiesa, M.; Ebaid, A.; Eldawy, A.; Elmagarmid, A.; Ilyas, I.F.; Ouzzani, M.; Tang, N. NADEEF: A commodity data cleaning system. In Proceedings of the ACM SIGMOD International Conference on Management of Data; Association for Computing Machinery: New York, NY, USA, 2013; pp. 541–552.
27. Chu, X.; Morcos, J.; Ilyas, I.F.; Ouzzani, M.; Papotti, P.; Tang, N.; Ye, Y. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of the ACM SIGMOD International Conference on Management of Data; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1247–1261.
28. Lew, A.K.; Agrawal, M.; Sontag, D.; Mansinghka, V.K. PClean: Bayesian data cleaning at scale with domain-specific probabilistic programming. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics; PMLR: Cambridge, MA, USA, 2021; Volume 130, pp. 1927–1935. Available online: https://proceedings.mlr.press/v130/lew21a.html (accessed on 5 May 2026).
29. Thirumuruganathan, S.; Tang, N.; Ouzzani, M.; Doan, A. Data curation with deep learning. In Proceedings of the 23rd International Conference on Extending Database Technology (EDBT); OpenProceedings.org: Konstanz, Germany, 2020; pp. 277–286.
30. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474. Available online: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html (accessed on 8 April 2026).
31. Qi, D.; Miao, Z.; Wang, J. CleanAgent: Automating data standardization with LLM-based agents. arXiv 2024, arXiv:2403.08291.
32. Xu, C.; McAuley, J. A survey on model compression and acceleration for pretrained language models. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2023; Volume 37, pp. 10566–10575.
33. Ho, N.; Schmid, L.; Yun, S. Large language models are reasoning teachers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 14852–14882.
34. Burns, C.; Ye, H.; Klein, D.; Steinhardt, J. Discovering latent knowledge in language models without supervision. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
35. Fang, L.; Chen, Y.; Zhong, W.; Ma, P. Bayesian knowledge distillation: A Bayesian perspective of distillation with uncertainty quantification. In Proceedings of the 41st International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2024; Volume 235, pp. 12935–12956. Available online: https://proceedings.mlr.press/v235/fang24a.html (accessed on 8 April 2026).
36. Angelopoulos, A.N.; Bates, S. Conformal prediction: A gentle introduction. Found. Trends Mach. Learn. 2023, 16, 494–591.
37. Zhou, X.; Chen, B.; Gui, Y.; Cheng, L. Conformal prediction: A data perspective. ACM Comput. Surv. 2025, 57, 245.
38. Campos, M.; Farinhas, A.; Zerva, C.; Figueiredo, M.A.T.; Martins, A.F.T. Conformal prediction for natural language processing: A survey. Trans. Assoc. Comput. Linguist. 2024, 12, 1497–1516.
39. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2017; Volume 70, pp. 1321–1330. Available online: https://proceedings.mlr.press/v70/guo17a.html (accessed on 8 April 2026).
40. Gibbs, I.; Candès, E. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 1660–1672. Available online: https://proceedings.neurips.cc/paper/2021/hash/0d441de75945e5acbc865406fc9a2559-Abstract.html (accessed on 8 April 2026).

Figure 1. Overall architecture of the multi-layer progressive data error correction framework. Records enter through the complexity-aware router and are assigned to one of three processing layers. Feature signals flow from lower layers to higher layers through vector concatenation.

Figure 2. Processing workflow of the complexity-aware routing and the three-layer progressive pipeline. Low-complexity records are handled by L1, medium-complexity records by L2, and high-complexity records by L3. Rule-trigger signals and student-model representations are reused across layers.

Figure 3. Workflow of multi-level confidence quantification and reliability assurance. Each layer performs independent conformal calibration. Records whose maximum posterior falls below the rejection threshold are routed to human review. The global coverage guarantee is obtained by aggregating per-layer coverage via the law of total probability.

Figure 4. Routing distribution by complexity score bin on ENT-Prod. Stacked bars show the proportion of records assigned to L1, L2, and L3 within each interval.

Figure 5. Calibration analysis of the routing-stage entropy signal on the ENT-Prod validation set. (a) Reliability diagram before ($T = 1$) and after ($T = 2$) post hoc temperature scaling; the calibrated curve aligns more closely with the diagonal, with ECE reduced from 0.087 to 0.021. (b) Empirical correction error rate as a function of entropy bin (10 equal-frequency bins, low to high); the shaded band denotes the 95% bootstrap confidence interval. The monotone increase from 4% to 38% across bins, together with a Spearman rank correlation of $\rho = 0.61$, justifies the use of $S(x)$ as a complexity signal for routing.

Figure 6. Rejection rate versus overall error rate (solid) and average prediction set size (dashed) on ENT-Prod.

Figure 7. Repair accuracy of each processing layer by error type on ENT-Prod, measured by routing every record to the indicated layer.

Table 1. Complexity features: domain, normalization, and cost.

Feature	Domain	Normalization	Cost (ms)
$D_{p} (x)$	$[0, + \infty)$	Min–max	<1
$K (x)$	$(- \infty, + \infty)$	Z-score	<1
$S (x)$	$[0, + \infty)$	Min–max	≈2

Table 2. Dataset statistics.

Statistic	ENT-Prod	Hospital	Flights
Records	15,247	1000	2136
Attributes	22	19	12
Total cells	335,434	19,000	25,632
Error rate (%)	7.8	5.0	10.0
Dominant errors	Mixed	FD viol.	Semantic
Train/Val/Cal/Test	6:1:1:2	6:1:1:2	6:1:1:2

Table 3. Baseline methods.

Method	Description
Rule-Only	Deterministic rules and constraints only.
Rule + Stat	Rules plus frequency-based statistical repair.
HoloClean [6]	Probabilistic inference over a factor graph with denial constraints and external signals.
Raha + Baran [7,8]	Configuration-free detection (Raha, budget = 20) followed by context-based correction (Baran).
BERT-FT	BERT-base fine-tuned on the correction task without distillation or routing.
GPT-4o-Direct	GPT-4o invoked via API (5-shot) for cell-level correction. ^†
Ours-Full	Complete proposed framework.

^† Latency includes API network overhead.

Table 4. Main results across three datasets. Best values per dataset are in bold. * p < 0.05, ** p < 0.01 vs. best non-Ours baseline.

Dataset	Method	Cell Det. F1	Repair Acc.	E2E Cell Acc.	Record Repair	Latency p50 (ms)	Thru. (rec/s)
ENT-Prod	Rule-Only	0.818 ± 0.003	0.792 ± 0.004	0.968 ± 0.001	0.534 ± 0.008	4.2	238
	Rule + Stat	0.833 ± 0.003	0.811 ± 0.005	0.970 ± 0.001	0.562 ± 0.009	5.1	196
	HoloClean	0.882 ± 0.004	0.869 ± 0.005	0.976 ± 0.001	0.639 ± 0.011	16.8	56
	Raha + Baran	0.891 ± 0.005	0.878 ± 0.006	0.977 ± 0.001	0.648 ± 0.010	9.3	108
	BERT-FT	0.884 ± 0.004	0.861 ± 0.005	0.975 ± 0.001	0.614 ± 0.012	7.9	127
	GPT-4o-Direct ^†	0.914 ± 0.003	0.892 ± 0.004	0.981 ± 0.001	0.671 ± 0.009	62.5	18
	Ours-Full	0.926 ± 0.003 **	0.901 ± 0.004 *	0.983 ± 0.001 *	0.702 ± 0.008 **	11.9	94
Hospital	Rule-Only	0.842 ± 0.004	0.818 ± 0.005	0.972 ± 0.001	0.593 ± 0.010	3.8	263
	Rule + Stat	0.856 ± 0.004	0.834 ± 0.006	0.974 ± 0.001	0.611 ± 0.011	4.6	217
	HoloClean	0.895 ± 0.005	0.891 ± 0.005	0.981 ± 0.001	0.672 ± 0.012	15.4	62
	Raha + Baran	0.883 ± 0.006	0.876 ± 0.007	0.979 ± 0.001	0.657 ± 0.013	8.7	115
	BERT-FT	0.871 ± 0.005	0.868 ± 0.006	0.977 ± 0.001	0.638 ± 0.011	7.1	141
	GPT-4o-Direct ^†	0.901 ± 0.004	0.893 ± 0.005	0.982 ± 0.001	0.679 ± 0.010	55.2	20
	Ours-Full	0.908 ± 0.004 *	0.902 ± 0.004 *	0.984 ± 0.001	0.706 ± 0.009 **	10.4	102
Flights	Rule-Only	0.791 ± 0.005	0.763 ± 0.006	0.953 ± 0.002	0.518 ± 0.012	4.1	244
	Rule + Stat	0.808 ± 0.005	0.781 ± 0.006	0.956 ± 0.002	0.539 ± 0.013	5.0	200
	HoloClean	0.847 ± 0.006	0.835 ± 0.007	0.965 ± 0.002	0.591 ± 0.014	18.2	52
	Raha + Baran	0.864 ± 0.005	0.852 ± 0.006	0.968 ± 0.001	0.612 ± 0.012	8.9	112
	BERT-FT	0.858 ± 0.005	0.846 ± 0.006	0.966 ± 0.002	0.601 ± 0.013	7.6	132
	GPT-4o-Direct ^†	0.893 ± 0.004	0.886 ± 0.005	0.975 ± 0.001	0.648 ± 0.011	60.9	18
	Ours-Full	0.887 ± 0.004	0.879 ± 0.005	0.974 ± 0.001	0.669 ± 0.010 *	12.3	88

^† Latency reflects API wall-clock time including network overhead.
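The headline figures quoted in the abstract and conclusions follow directly from the Table 4 entries for Ours-Full and GPT-4o-Direct. The worked arithmetic below uses only values from the table: the first column is the relative reduction in median latency, and the second is the difference in record-level complete repair rate.

```latex
\begin{aligned}
\text{ENT-Prod:}\quad & 1 - 11.9/62.5 \approx 0.810, \qquad 0.702 - 0.671 = 0.031 \\
\text{Hospital:}\quad & 1 - 10.4/55.2 \approx 0.812, \qquad 0.706 - 0.679 = 0.027 \\
\text{Flights:}\quad  & 1 - 12.3/60.9 \approx 0.798, \qquad 0.669 - 0.648 = 0.021
\end{aligned}
```

These values correspond to the reported latency reduction of approximately 80% and the improvement of 2.1 to 3.1 percentage points.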

Table 5. Routing strategy comparison on the ENT-Prod test split. The L1, L2, and L3 columns give the fraction of records routed to each layer, and Lat. p50 is in ms; differences from the oracle results on the validation split (detailed in Section 4.4) reflect normal variation between the validation and test splits.

Strategy	L1	L2	L3	Cell F1	Lat. p50
Adaptive	0.573	0.312	0.115	0.926 ± 0.003	11.9
All-to-L3	0.000	0.000	1.000	0.929 ± 0.002	27.3
Random	0.334	0.333	0.333	0.903 ± 0.005	18.6

Table 6. Ablation results on ENT-Prod (nominal coverage 0.95).

Config.	Cell F1	Rec. Repair	Cov.	Lat. p50
Ours-Full	0.926 ± 0.003	0.702 ± 0.008	0.949 ± 0.004	11.9
No-Routing	0.929 ± 0.002	0.705 ± 0.007	0.951 ± 0.003	27.3
No-Boundary	0.919 ± 0.004	0.684 ± 0.010	0.944 ± 0.005	11.6
No-Rule-Feat	0.908 ± 0.005	0.661 ± 0.012	0.946 ± 0.005	11.8
No-Logic-Layer	0.913 ± 0.004	0.672 ± 0.011	0.947 ± 0.004	8.4
No-CP	0.927 ± 0.003	0.704 ± 0.008	N/A	11.3
Single-Level-CP	0.923 ± 0.003	0.695 ± 0.009	0.941 ± 0.006	12.0

Note: N/A indicates Not Applicable, as coverage is not defined for configurations without Conformal Prediction (CP).

Table 7. Routing score design ablation on ENT-Prod. Routing error rate is measured against the oracle layer assignment on the validation set.

Variant	Configuration	Cell F1	Routing Err. (%)
Score combination (normalization = min–max, LM = 5-gram)
Additive (default)	$D_{p} = MinMax (PPL) + κ N_{fail}$	0.926 ± 0.003	6.8
Multiplicative	$D_{p} = MinMax (PPL) \cdot (1 + κ N_{fail})$	0.912 ± 0.004	11.2
Learned weighted sum	$D_{p} = MLP ([PPL, N_{fail}])$	0.918 ± 0.005	8.4
Normalization (combination = additive, LM = 5-gram)
Min–max (default)	training-set [0, 1] scaling	0.926 ± 0.003	6.8
Z-score	zero-mean unit-variance	0.921 ± 0.004	7.6
Robust (IQR) scaling	median/IQR scaling	0.923 ± 0.004	7.2
Character-level LM order (combination = additive, normalization = min–max)
$n = 1$	unigram	0.903 ± 0.006	12.5
$n = 2$	bigram	0.917 ± 0.005	8.9
$n = 3$	trigram	0.923 ± 0.004	7.4
$n = 5$ (default)	5-gram	0.926 ± 0.003	6.8

Note: Bold values denote the best results.

Table 8. Routing confusion matrix on the ENT-Prod validation set. Rows are the oracle-optimal layers; columns are the layers chosen by the learned router. Values are record proportions (%).

Oracle∖Assigned	L1	L2	L3	Row Total
L1 (oracle)	56.1	2.9	0.4	59.4
L2 (oracle)	1.1	27.8	1.8	30.7
L3 (oracle)	0.2	0.4	9.3	9.9
Column total	57.4	31.1	11.5	100.0
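The category proportions used in Table 9 can be recovered from this confusion matrix. The short sketch below uses the Table 8 values directly; the function name `routing_categories` is hypothetical, and "over-routed" is taken to mean that the assigned layer is higher than the oracle layer.

```python
LAYERS = ["L1", "L2", "L3"]

# Rows: oracle layer; columns: assigned layer; values in % of records (Table 8).
CONFUSION = {
    "L1": {"L1": 56.1, "L2": 2.9, "L3": 0.4},
    "L2": {"L1": 1.1, "L2": 27.8, "L3": 1.8},
    "L3": {"L1": 0.2, "L2": 0.4, "L3": 9.3},
}


def routing_categories(confusion):
    """Split the matrix into agreement, over-routing, and under-routing shares."""
    agree = over = under = 0.0
    for i, oracle in enumerate(LAYERS):
        for j, assigned in enumerate(LAYERS):
            share = confusion[oracle][assigned]
            if j == i:
                agree += share       # diagonal: router matches the oracle
            elif j > i:
                over += share        # assigned layer above the oracle layer
            else:
                under += share       # assigned layer below the oracle layer
    return agree, over, under


agree, over, under = routing_categories(CONFUSION)
```

The diagonal sums to 93.2%, the upper triangle to 5.1%, and the lower triangle to 1.7%, so the routing error rate is 100 − 93.2 = 6.8%, matching the default configuration in Table 7.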

Table 9. Cell-level F1 conditioned on routing-error category, ENT-Prod validation set. The aggregate contribution of under-routing to the overall cell-F1 gap is the product of the local drop and the under-routing proportion.

Category	Proportion (%)	Cell F1 on Subset	Contribution to F1 Gap
Agree with oracle	93.2	0.939 ± 0.003	—
Over-routed (above oracle)	5.1	0.937 ± 0.005	≈0 (ns)
Under-routed (below oracle)	1.7	0.759 ± 0.014	≈0.003

ns: not significant ($p > 0.5$, paired t-test against the agreement group).
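As a check on the last column, the under-routing contribution can be worked out from the table entries, assuming the local drop is measured relative to the agreement subset:

```latex
\Delta F1_{\text{under}} \approx (0.939 - 0.759) \times 0.017 = 0.180 \times 0.017 \approx 0.003
```

Although under-routed records lose about 18 points of cell-level F1 locally, they make up only 1.7% of records, so their effect on the aggregate score is small.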

Table 10. Cross-field-only ablation on ENT-Prod. Cell repair accuracy is computed on cells flagged as participating in at least one cross-field constraint violation. Effect sizes are Cohen’s d against Ours-Full on paired record-level outcomes; significance is from paired t-tests on the same cells.

Variant	Cross-Field Acc.	Δ from Ours-Full	Cohen’s d	p-Value
Ours-Full	0.896 ± 0.006	—	—	—
No-Rule-Feat	0.831 ± 0.009	$- 0.065$	1.51	<0.001
No-Logic-Layer	0.852 ± 0.008	$- 0.044$	1.12	<0.001
No-Boundary	0.881 ± 0.007	$- 0.015$	0.43	0.018
Single-Level-CP	0.890 ± 0.007	$- 0.006$	0.18	0.142
No-CP	0.894 ± 0.006	$- 0.002$	0.07	0.561

Table 11. Reliability metrics at multiple nominal coverage levels on ENT-Prod.

$1 - δ$	Method	Cov.	Set Size	Rej.	Gap
0.90	Ours-Full	0.904 ± 0.003	1.18 ± 0.02	0.051 ± 0.004	0.004
0.90	Single-CP	0.897 ± 0.005	1.24 ± 0.03	0.037 ± 0.005	0.003
0.95	Ours-Full	0.949 ± 0.004	1.29 ± 0.03	0.082 ± 0.005	0.001
0.95	Single-CP	0.941 ± 0.006	1.35 ± 0.04	0.061 ± 0.006	0.009
0.99	Ours-Full	0.991 ± 0.002	1.43 ± 0.04	0.137 ± 0.007	0.001
0.99	Single-CP	0.986 ± 0.004	1.51 ± 0.05	0.112 ± 0.008	0.004

Table 12. Per-error-type cell repair accuracy on ENT-Prod. Best value per column is in bold; in the Format column, the bold entry appears on Rule-Only because it achieves the highest value for that error type.

Method	Format	Missing	Semantic	Cross-Field
Rule-Only	0.961	0.684	0.523	0.712
HoloClean	0.938	0.821	0.774	0.862
Raha + Baran	0.924	0.843	0.812	0.831
BERT-FT	0.912	0.836	0.805	0.768
GPT-4o-Direct	0.943	0.872	0.891	0.824
Ours-Full	0.958	0.879	0.873	0.896

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, C.; Mu, H.; Zhou, J.; Wang, E.; Zhao, X. Complexity-Aware Progressive Data Error Correction with Distilled Language Models and Conformal Reliability Control. Mathematics 2026, 14, 1599. https://doi.org/10.3390/math14101599
