1. Introduction
The transition toward smart water treatment plants (WTPs) aligns with low-carbon environmental policies and addresses the urgent need to enhance plant management while reducing energy consumption. Among all treatment units, the sedimentation tank plays a central role in solid–liquid separation in drinking water treatment plants, and its operational efficiency directly influences both overall plant energy use and treated water quality. Extensive studies have been conducted to improve sedimentation efficiency, leading to substantial gains in settling performance [1, 2, 3]. However, these advances have also introduced a new challenge: how to remove the settled sludge effectively under conditions of high sedimentation rates and strong process dynamics.
At present, the mainstream industrial practice is to employ sludge removal bridge cranes that travel across the entire tank at a constant speed while performing rapid suction [4]. Although this conservative strategy ensures the safety and reliability of the water treatment process, its operating cycle is largely determined by operator experience and lacks support from effective data analysis and quantitative modeling. As a result, considerable resources are wasted. In China alone, the water consumed during sludge discharge can account for up to 4.5% of a plant’s total treated water volume [5], placing substantial pressure on downstream treatment processes. Moreover, the discharged sludge-laden water often contains heavy metal concentrations far exceeding allowable standards [6], resulting in serious losses of both water resources and energy.
To improve the overall efficiency of sludge discharge, research in both academia and industry has mainly focused on adjusting chemical dosing based on effluent turbidity. This strategy aims to stabilize the removal of settled solids within a single discharge cycle while reducing reagent consumption, water waste, and pollution risk. The paradigm is feasible in engineering practice because variables such as turbidity and dosing rate can be acquired continuously through the Supervisory Control and Data Acquisition (SCADA) system and can directly support closed-loop dosing control. Consequently, related studies have gradually developed into a relatively systematic technical framework [7, 8].
With the rapid development of deep learning and artificial intelligence, an increasing number of neural network architectures have been applied to prediction tasks in real-world water treatment processes [9, 10]. For example, Tochio et al. developed an artificial neural network (ANN) model and calibrated it using full-scale WTP data, effectively reducing human–machine interaction in chemical dosing control [11]. Kim et al. considered the effects of extreme weather conditions and human operations, and integrated a convolutional neural network for feature extraction with a gated recurrent unit for time-series analysis. Their model achieved effective prediction of coagulant dosage under dynamically changing conditions and demonstrated that deep learning models can respond to sedimentation states in dynamic environments [12]. Zhao et al. treated effluent-turbidity optimization as a time-series forecasting task and proposed an integrated CNN–BiLSTM model, employing smoothing strategies and temporal redundancy analysis to address the complexity of raw-water data, and validated the model on real datasets from Xi’an and Shanghai [13]. Liu et al. focused on floc-volume dynamics and designed a multilayer convolutional neural network to extract turbidity-variation features from real-time sedimentation-tank monitoring; by introducing turbidity-corrected data, they mitigated the lag caused by hydraulic retention time and improved generalization [14]. To address the long-standing shortage of training data, Zhou et al. proposed a Transformer–TCN integrated architecture incorporating the time-window effect, achieving accurate turbidity prediction under limited short-term data through floc-morphology representation and self-learned lag representation [15]. In all of these studies, however, the modeling target is the water leaving the tank, not the sedimentation state inside it.
This shared design choice exposes a fundamental limitation when the operational question concerns sludge discharge rather than coagulant dosing. Effluent turbidity integrates multiple in-tank processes at the outlet and is inherently insensitive to the spatial location of sludge accumulation. The same outlet turbidity may arise from severe front-section accumulation, residual rear-section buildup, or a combination of both, yet these scenarios require different sludge-discharge ranges and cycles. Frameworks built around end-point variables therefore cannot tell the sludge-discharge gantry where to travel, how far to travel, or when to start; they can only confirm, after the fact, whether the resulting effluent quality remains acceptable. As a consequence, gantry operation in current practice continues to rely on fixed full-length cycles and operator experience, even at plants that have already deployed advanced turbidity-prediction models. What is missing is not a better end-point predictor, but a state-perception layer that observes the bottom of the tank itself.
To reveal the hydraulically driven mechanism of sludge layer thickness inside sedimentation tanks and to approximate the real process as closely as possible, researchers have widely adopted computational fluid dynamics (CFD) [16, 17] to numerically simulate key hydraulic behaviors, such as the three-dimensional flow field structure and the distribution of suspended particles in sedimentation tanks [18]. On this basis, the coupling between particle transport and settling has been further analyzed to explain the causes of local sludge accumulation and to guide the optimization of inlet and outlet structures [19]. For example, Ma et al. performed sedimentation tank simulations in ANSYS Fluent (version 2022 R2) using the Random - turbulence models together with a discrete phase model, and effectively revealed the relationship between changes in particle diameter and settling distance [20]. Bruno et al. used large eddy simulation to capture the major vortex structures of three-dimensional turbulence inside the settler. In the numerical solution, they discretized the governing equations with the finite volume method on a structured grid, and then introduced a scalar tracer to simulate the diffusion and transport of substances in water. This enabled them to analyze the motion and state of settling particles from a hydrodynamic perspective [21].
These studies consistently demonstrate that CFD models can effectively characterize the settling patterns of bottom sludge. However, substantial barriers still limit their engineering application. First, high-resolution three-dimensional simulations are computationally intensive and highly sensitive to boundary conditions, inlet turbulence, and particle parameters. As a result, parameter identification and uncertainty propagation are difficult to incorporate into an operational closed loop. Second, as reported by Zhao et al. and Soleimani et al. [22], sedimentation is a complex nonlinear process jointly governed by flow dynamics and operating conditions. Existing CFD-based methods cannot directly establish an intuitive mapping between tank operating conditions and sludge states. Therefore, they are used mainly for offline diagnosis and design optimization rather than real-time state estimation during operation.
At present, neither deep learning models nor numerical simulation methods can fully overcome the bottlenecks of online engineering deployment. These limitations arise mainly from two aspects: engineering implementation and scientific representation.
From the engineering perspective, the primary challenge lies in the lack of observation and annotation. Bottom sludge thickness in sedimentation tanks has long lacked an online ground-truth measurement that is continuously available, maintainable, and strictly aligned with SCADA data. As a result, the workflow for sample construction, model training, and performance validation based on in-tank state supervision remains incomplete, which limits the feasibility, verifiability, and reproducibility of supervised learning for process-state modeling.
From the scientific perspective, the primary challenge lies in the unclear mechanism governing sludge spatial distribution. The sedimentation process is affected by multiple coupled factors and nonlinear disturbances, including influent fluctuations, dosing strategies, sludge discharge events, and changes in hydraulic structure [23, 24]. Because bottom sludge distribution is strongly coupled with flow structure, particle transport, and floc changes, purely data-driven models often fail to maintain physical consistency under varying and disturbed conditions, thereby weakening their generalization ability and engineering reliability.
To address these challenges, this study focuses on the process state inside the sedimentation tank as the modeling target. By integrating multi-source operational data from the SCADA system with interface measurements obtained from an ultrasonic sludge–water interface meter [25], and based on two field experiments, each spanning three months, we define a soft-sensing task for the spatial sedimentation state under operating conditions. This task enables online estimation and prediction of the true in-tank sedimentation state from observable SCADA variables. Furthermore, because sedimentation tank states are influenced by hydraulic action and particle transport, purely data-driven models are prone to physical inconsistency under operating-condition switching, sparse observations, and out-of-distribution scenarios. To address this issue, CFD simulation results are introduced as hydrodynamic priors for settling particles. CFD characterizes the internal flow-field structure of the sedimentation tank and therefore provides soft constraints on spatial distribution and physical plausibility for data-driven models. In addition, this study explores an engineering-oriented mechanism for injecting hydraulic prior knowledge. We further conduct a systematic comparison of three knowledge-fusion methods for incorporating hydrodynamic priors of settling particles, thereby providing a new approach for exploiting implicit physical information in data-driven sedimentation-state estimation. The main contributions of this study are as follows.
First, we reframe the modeling target from end-point water quality to the in-tank sedimentation state. Existing data-driven studies in WTPs [11, 12, 13, 14, 15] model effluent turbidity or coagulant dosage and cannot resolve where bottom sludge has accumulated, which is the variable that actually governs sludge-discharge scheduling.
Second, we construct a closed-loop data foundation for HST soft sensing, combining synchronized SCADA records, ultrasonic sludge-interface profiles, and CFD-simulated hydrodynamic priors collected at a full-scale WTP.
Third, we propose and systematically compare three paradigms for injecting CFD-derived priors into data-driven predictors—parameter transfer, representation fusion, and knowledge distillation—operating, respectively, in the parameter, latent, and supervision spaces. Unlike single-mechanism hybrid CFD–DL approaches, this decomposition isolates where in the learning pipeline the physical prior is most effective, and clarifies the trade-off between in-distribution accuracy and out-of-distribution robustness under operating-cycle shift.
3. Experiments and Discussion
3.1. Experimental Benchmark Setup
The data collection process is described in Section 2.1. We collected 247 paired SCADA and ultrasonic sludge–water interface samples at a 24 h interval as in-distribution (ID) data, and split them into training, validation, and test sets at a ratio of 7:2:1. In addition, 13 OOD samples were collected by varying the sedimentation duration to evaluate model generalization. To avoid information leakage, all data were split chronologically, which is more consistent with online deployment. Missing values and outliers were processed, and continuous features were standardized using statistics from the training set only. To construct transferable physical priors, we further generated CFD data. Based on the average operating condition of the sedimentation tank in 2024 and prior experience [32], seven representative operating conditions were designed, covering different hydraulic loads, suspended solids concentrations, and sedimentation durations. CFD results were sampled every 20 min, yielding 40,222 samples, which were split into CFD training and test sets at a ratio of 8:2. Except for sedimentation duration, influent turbidity, and flow velocity, all other features were fixed at their annual mean values. The simulation parameters are shown in Table 3.
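The chronological 7:2:1 split and the training-set-only standardization described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the placeholder feature values, and the floor-division rounding of the split sizes are all assumptions.

```python
from statistics import mean, pstdev


def chronological_split(samples, ratios=(7, 2, 1)):
    """Split time-ordered samples into train/val/test without shuffling."""
    n = len(samples)
    total = sum(ratios)
    n_train = n * ratios[0] // total  # assumed rounding rule (floor)
    n_val = n * ratios[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test


def fit_standardizer(train_column):
    """Compute mean and standard deviation from the training split only."""
    mu = mean(train_column)
    sigma = pstdev(train_column) or 1.0  # guard against constant features
    return mu, sigma


def standardize(column, mu, sigma):
    """Apply training-set statistics to any split."""
    return [(x - mu) / sigma for x in column]


# Placeholder values standing in for one SCADA feature over 247 daily samples.
feature = [float(i) for i in range(247)]
train, val, test = chronological_split(feature)
mu, sigma = fit_standardizer(train)
val_scaled = standardize(val, mu, sigma)
print(len(train), len(val), len(test))
```

Because the validation and test splits reuse `mu` and `sigma` from the training split, no statistics from later time periods leak into model fitting.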
Conditions A and B represent lower flow velocity and lower influent turbidity, respectively, whereas Condition E represents an extreme suspended-solids concentration. The other conditions were designed to examine the effect of sedimentation time on the in-tank state.
The encoder and predictor were first pretrained on the CFD data, and the best weights on the CFD test set were used to initialize training on real-world data. All deep learning models were trained on an NVIDIA GeForce RTX 4090 GPU (Santa Clara, CA, USA) using Python 3.10 and PyTorch (version 2.1). The batch size was 16, the optimizer was AdamW, and the learning rate was 0.0001. Traditional machine learning baselines were implemented with scikit-learn and tuned by grid search on the same training and validation splits. For fairness, each experiment was repeated 10 times with different random seeds, and average results were reported. Unless otherwise noted, model selection and early stopping were both based on validation performance.
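The validation-based model selection and the averaging over random seeds can be illustrated with a short sketch. The patience value, the helper names, and the synthetic validation-loss curve are assumptions introduced for illustration only; the text does not state the early-stopping criterion in detail.

```python
import random
from statistics import mean


def train_with_early_stopping(val_losses, patience=5):
    """Return (best_epoch, best_loss) for a per-epoch validation-loss trace."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:  # stop after `patience` epochs without improvement
                break
    return best_epoch, best_loss


def average_over_seeds(run_once, seeds):
    """Run one experiment per seed and average the selected validation loss."""
    return mean(run_once(seed) for seed in seeds)


def synthetic_run(seed):
    """Synthetic stand-in for one training run: a noisy U-shaped loss curve."""
    rng = random.Random(seed)
    trace = [1.0 / (epoch + 1) + 0.01 * epoch + rng.uniform(0.0, 0.02)
             for epoch in range(40)]
    _, best = train_with_early_stopping(trace, patience=5)
    return best


averaged = average_over_seeds(synthetic_run, range(10))  # 10 seeds, as in the text
print(round(averaged, 3))
```

In a real run, `synthetic_run` would be replaced by a function that trains one model with the given seed and returns its best validation metric.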
3.2. Comparative Experiments with Baseline Models
The aim of this section is to compare traditional machine-learning methods with deep temporal models without introducing physical priors or additional constraints. Accordingly, all models were trained on the real-world training set and evaluated on the test set. We selected several widely used and effective machine-learning models in urban water systems, as well as a range of advanced deep learning predictors, including the physics-informed neural network (PINN) with explicit physical constraints and the diffusion model incorporating flow-related characteristics. For the PINN, both mass-consistency loss and partial differential equation loss were applied. The experimental results are presented in Table 4.
The baseline results show that conventional machine-learning models, including XGBoost, LightGBM, and Random Forest, cannot balance all three metrics well. For example, XGBoost and LightGBM yield relatively high PA values of 0.196 and 0.225, respectively. Although Random Forest reduces PA to 0.116, its CCE rises to 0.154, indicating unstable local shape prediction. This is likely because traditional ML models rely mainly on static or limited lag features [42, 43, 44] and cannot fully capture the delayed accumulation and nonlinear coupling of the sedimentation process.
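To make the distinction among the three metrics concrete, the sketch below uses assumed stand-in definitions, since the formal definitions are not restated in this section: PA is taken as the mean absolute pointwise error, CCE as the mean absolute gap between discrete second differences, and MCE as the absolute gap between trapezoid-integrated profiles. The two example predictions are synthetic and chosen so that they share the same pointwise error while differing in shape and mass error.

```python
def pointwise_error(pred, true):
    """Assumed PA: mean absolute error between predicted and observed profiles."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def second_difference(profile):
    """Discrete curvature proxy at the interior points of a profile."""
    return [profile[i - 1] - 2 * profile[i] + profile[i + 1]
            for i in range(1, len(profile) - 1)]


def curvature_error(pred, true):
    """Assumed CCE: mean absolute gap between discrete second differences."""
    dp, dt = second_difference(pred), second_difference(true)
    return sum(abs(a - b) for a, b in zip(dp, dt)) / len(dt)


def trapezoid_area(profile, dx=1.0):
    """Area under the profile, a proxy for sludge volume per unit width."""
    return dx * sum((profile[i] + profile[i + 1]) / 2
                    for i in range(len(profile) - 1))


def mass_error(pred, true, dx=1.0):
    """Assumed MCE: absolute gap between integrated profiles."""
    return abs(trapezoid_area(pred, dx) - trapezoid_area(true, dx))


# Synthetic profile: sharp drop near the inlet, plateau, rise near the outlet.
observed = [1.0, 0.5, 0.4, 0.4, 0.6]
shifted = [o + 0.1 for o in observed]  # uniform offset
wiggly = [o + d for o, d in zip(observed, [0.1, -0.1, 0.1, -0.1, 0.1])]

for name, pred in (("shifted", shifted), ("wiggly", wiggly)):
    print(name,
          round(pointwise_error(pred, observed), 3),
          round(curvature_error(pred, observed), 3),
          round(mass_error(pred, observed), 3))
```

Under these assumed definitions, the uniformly shifted prediction keeps the shape but accumulates mass error, whereas the oscillating prediction conserves mass but distorts curvature, which is why a single pointwise metric is not sufficient for profile evaluation.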
The trade-off between mass conservation (MCE) and shape consistency (CCE) is visualized in Figure 2. Models in the lower-left region of the plot achieve low values on both metrics; this region is occupied predominantly by the deep temporal models (LSTM, GRU, and Attention), with Attention achieving the lowest MCE (3.085) and LSTM and GRU achieving the lowest CCE (both 0.032). Conventional ML models cluster in the upper-left or upper-right regions, indicating that they sacrifice one metric for the other. PINN occupies an isolated position in the upper region of the plot (CCE 0.151), and Diffusion lies in the right-most region (MCE 5.428), confirming that neither generative modeling nor explicit physical-residual constraints achieved a balanced trade-off in this baseline setting. Overall, no model in Figure 2 falls into the joint low-CCE and low-MCE region; the visual gap between the deep-temporal cluster and the lower-left corner of the plot indicates the room for improvement that the three knowledge-fusion paradigms in the next section are designed to fill. The CFD output itself, evaluated as a candidate prior, exhibits absolute errors more than an order of magnitude higher than the best data-driven baselines (PA ≈ 0.80–0.85 at the selected threshold), confirming that it cannot serve as a stand-alone predictor and must be used as a structural prior rather than as ground truth. At the same time, its profile errors are remarkably stable across the five simulated operating conditions (PA varies by less than 6% and MCE by less than 5% at the selected threshold), indicating that the CFD output captures a hydrodynamic structure largely invariant to operating-condition perturbations and is therefore appropriate as a transferable prior.
Among the three tested thresholds, one yields the lowest profile-error metrics across all five conditions, while the other two systematically overestimate and underestimate the interface elevation, respectively.
3.3. Validation Experiments on the Effectiveness of Paradigm Design
This section evaluates whether the three hydraulic particle-transport prior fusion paradigms proposed in this study can systematically improve profile prediction performance, and examines their effects on the three metrics. The three paradigms are described in Sections 2.2.2, 2.2.3, and 2.2.4. To reduce model-specific bias, we compared all predictors under the three paradigms, and the results are shown in Table 5.
When averaged across the six predictors, the three paradigms produced different improvement profiles relative to the unfused baselines. Paradigm I reduced PA by 25.1%, CCE by 8.7%, and MCE by 47.6% on average; it was the only paradigm that improved all three metrics simultaneously when averaged across predictors. Paradigm III reduced PA by 36.9% and MCE by 50.0% on average—the largest gains in those two metrics—but CCE increased by 7.4% on average relative to the unfused baselines, indicating that distillation-based supervision in our task did not preserve fine-grained curvature structure as well as parameter-level injection. Paradigm II reduced PA by 4.0% and MCE by 15.2% on average, while CCE increased by 5.7%; this confirms that latent-level fusion alone produced smaller and less consistent gains in our experiments than the other two routes. Counted by predictor-level wins, Paradigm I produced the lowest PA in three of the six predictors and the lowest MCE in four; Paradigm III produced the lowest PA in three predictors and the lowest MCE in two. Among the eighteen paradigm–predictor combinations, the Attention predictor under Paradigm I produced the lowest PA (0.026, tied with MLP under Paradigm III) and the lowest MCE (1.052); the lowest CCE in the same comparison was produced instead by GRU under Paradigm I (0.032), with Attention under Paradigm I producing the third-lowest CCE (0.034). MLP also showed substantial improvement under Paradigm I, indicating that the benefit of CFD-informed initialization in our setting was not restricted to sequence-based predictors. Among individual large gains, GRU under Paradigm III reached PA 0.033, CCE 0.033, and MCE 1.364, while PINN improved from PA 0.086 (unfused) to PA 0.028 under Paradigm III. 
The Diffusion model showed limited improvement under all three paradigms, suggesting that, in the small-sample sedimentation setting examined here, generative-style modeling did not benefit from CFD prior injection to the same extent as discriminative architectures.
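The percentage improvements quoted in this section are relative reductions with respect to the unfused baseline. The arithmetic can be checked with a few lines; the function name is illustrative, and the inputs are values quoted in the text (PINN PA of 0.086 unfused versus 0.028 under Paradigm III, and Attention MCE of 3.085 unfused versus 1.052 under Paradigm I).

```python
def relative_reduction(baseline, fused):
    """Percentage reduction of an error metric relative to the unfused baseline."""
    return 100.0 * (baseline - fused) / baseline


pinn_pa = relative_reduction(0.086, 0.028)        # PINN, Paradigm III, PA
attention_mce = relative_reduction(3.085, 1.052)  # Attention, Paradigm I, MCE

print(round(pinn_pa, 1), round(attention_mce, 1))
```

A negative return value would correspond to an increase in the metric, as reported for the average CCE under Paradigms II and III.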
The visualization in Figure 3 is consistent with the quantitative results above. Under the ID setting, Paradigm I produced profiles whose global shape—including the rapid decay near the inlet, the plateau region, and the rising tail near the outlet—visually tracked the observed profile more closely than Paradigms II and III in most samples. The Attention predictor under Paradigm I produced the visually closest match in our sample selection, without large systematic deviation in the rising tail; this is consistent with its low PA and MCE in Table 5. Paradigm II produced visually stable profiles in some samples but exhibited residual fluctuations and slight drift in the plateau region in other samples, consistent with the higher MCE of Paradigm II across most predictors in Table 5; this issue is examined further in Section 3.4. Paradigm III produced visually smoother profiles in the plateau region with fewer high-frequency excursions, consistent with the lower CCE values reported for GRU and PINN under Paradigm III in Table 5, although the cross-predictor average CCE under Paradigm III was higher than the unfused baselines as noted above. Across predictors, LSTM and GRU exhibited local smoothness but showed cumulative systematic drift in the plateau region under long-step prediction. MLP produced occasional local bulges or overall over-estimation, consistent with its lack of explicit temporal modeling. PINN exhibited spikes and high-frequency fluctuations across multiple seeds under our small-sample, noisy real-data setting, indicating that, in this regime, the explicit physical-residual constraints did not consistently translate into smooth profile reconstruction; this is consistent with prior reports of optimization difficulty for PINNs under small-data and high-noise conditions [26].
Computational differences among the three paradigms arise mainly during training, whereas their inference costs are comparable. Because training time, GPU memory usage, and inference latency depend on both the fusion paradigm and the encoder complexity, the reported costs were averaged across encoder backbones rather than measured with a single predictor, which gives a fairer paradigm-level comparison. On a single NVIDIA RTX 4090 GPU, the average per-seed training time across encoder backbones was approximately 38 min for Paradigm I, 45 min for Paradigm II, and 62 min for Paradigm III. The corresponding peak training-stage GPU memory was 2.8 GB, 3.4 GB, and 5.1 GB, respectively. The larger memory footprint of Paradigm III is mainly due to teacher–student distillation. During inference, Paradigms I and III retain only the deployed predictor, while Paradigm II additionally retains the physical-prior encoder and fusion module. The average inference-stage GPU memory was 1.2 GB for Paradigms I and III and 1.5 GB for Paradigm II, with per-sample latencies of 18 ms, 23 ms, and 18 ms for Paradigms I, II, and III, respectively. These results indicate that all three paradigms satisfy the latency requirement for online operation, with Paradigm I having the lowest average training cost and Paradigms I and III having comparable inference efficiency.
3.4. Ablation of Physical Prior Mechanisms
This section conducts ablation studies on two key components of physical prior injection. First, we examine whether the CFD-based pretraining transfer in Paradigm I requires end-to-end fine-tuning for effective adaptation in the real domain. Second, we compare different fusion strategies in Paradigm II and analyze their effects on pointwise accuracy, shape consistency, and mass conservation. These experiments isolate the contribution of each fusion mechanism. The results are shown in Tables 6 and 7. The best-performing Attention model was used in all experiments.
The ablation results for Paradigm I show that, compared with training from scratch, pretraining without predictor fine-tuning did not improve performance and instead produced higher errors. This pattern is consistent with the interpretation that the CFD-derived initialization provides a useful starting point in parameter space, but that domain bias between the simulated and real distributions still requires correction through gradient updates on real data. The result therefore characterizes the operating regime under which Paradigm I is effective in our experiments rather than asserting a general advantage of pretraining-only initialization.
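The role of fine-tuning in correcting domain bias can be illustrated with a deliberately simple toy problem. The sketch below is not the ablation itself: it fits a one-dimensional linear model by gradient descent, and the synthetic "simulated" and "real" data, the learning rate, and the step counts are all assumptions chosen so that the two domains share a slope but differ by a constant offset.

```python
def fit_linear(xs, ys, w=0.0, b=0.0, lr=0.05, steps=2000):
    """Plain gradient descent on mean squared error for y = w * x + b."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the linear model on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic domains: same slope, but the simulated targets carry an offset bias.
xs = [i / 10 for i in range(11)]
sim_ys = [2.0 * x for x in xs]
real_ys = [2.0 * x + 0.5 for x in xs]

# Pretraining on simulated data only.
w_pre, b_pre = fit_linear(xs, sim_ys)
frozen_error = mse(xs, real_ys, w_pre, b_pre)

# Short fine-tuning on real data, starting from the pretrained parameters.
w_ft, b_ft = fit_linear(xs, real_ys, w=w_pre, b=b_pre, steps=200)
finetuned_error = mse(xs, real_ys, w_ft, b_ft)

print(round(frozen_error, 4), round(finetuned_error, 4))
```

In this toy setting the pretrained slope is already correct, so a short fine-tuning run mainly needs to remove the offset, whereas the frozen model retains the full domain bias.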
As shown in Table 7, the ablation results for different fusion strategies in Paradigm II indicate that direct concatenation performs worst on all three metrics. This suggests that treating physical priors as ordinary input features can damage the separability of latent representations, induce local nonphysical oscillations, and increase global bias, as reflected by the much higher MCE. In contrast, gating achieves the most balanced performance, followed by attention fusion. This indicates that selectively injecting physical priors through learnable weighting mechanisms can significantly improve stability and physical consistency. A possible reason is that, although the particle concentration distribution in CFD data follows physical laws, the representations extracted from CFD may still conflict with those learned from real SCADA data. Therefore, the information from different representations must be adaptively reorganized to reduce such conflicts.
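The difference between direct concatenation and gated fusion can be sketched in a few lines. The gate form used here, an element-wise sigmoid gate that forms a convex combination of the real-data and CFD representations, is an assumption for illustration; the exact fusion module is defined in Section 2.2.3, and the vectors and gate parameters below are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(h_real, h_cfd):
    """Direct concatenation: the prior is passed on as extra features, unweighted."""
    return h_real + h_cfd


def gated_fusion(h_real, h_cfd, gate_w, gate_b):
    """Element-wise gate g in (0, 1): fused = g * h_real + (1 - g) * h_cfd."""
    fused = []
    for i, (r, c) in enumerate(zip(h_real, h_cfd)):
        g = sigmoid(gate_w[i][0] * r + gate_w[i][1] * c + gate_b[i])
        fused.append(g * r + (1.0 - g) * c)
    return fused


# Hypothetical 2-D latent vectors from the real-data and CFD encoders.
h_real = [0.2, -1.0]
h_cfd = [0.8, 3.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]

balanced = gated_fusion(h_real, h_cfd, zero_w, [0.0, 0.0])      # g = 0.5 everywhere
suppressed = gated_fusion(h_real, h_cfd, zero_w, [20.0, 20.0])  # g close to 1

print(concat_fusion(h_real, h_cfd))
print([round(v, 3) for v in balanced], [round(v, 3) for v in suppressed])
```

With a learnable gate, the model can down-weight the CFD representation in dimensions where it conflicts with the real-data representation, which concatenation cannot do before the downstream layers see both inputs.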
3.5. Evaluation Experiments on Generalization and Out-of-Distribution Extrapolation Performance
This section evaluates model behavior under one form of OOD shift, namely a systematic change in the sludge-discharge cycle. The OOD scenario considered here is therefore a structural shift in operating policy rather than a perturbation in input features or a change in source-water properties. Since bottom sludge accumulation is strongly coupled with sedimentation duration, this distribution shift directly changes both the overall level and the spatial distribution of the sludge–water interface profile. To characterize task difficulty under discharge-cycle shift and examine model sensitivity to output structure, we first report the direct prediction performance of different decoders on the OOD data. Specifically, the best parameters obtained for each predictor in the ID setting were directly tested on the OOD data. The results are shown in Figure 4 and Table 8.
Compared with the ID results, the traditional machine learning models XGBoost, LightGBM, and Random Forest show MCE values of 18.813, 16.910, and 15.068, respectively, with substantially increased CCE. This indicates that, when the discharge cycle changes and the output distribution shifts as a whole, static regression models cannot adequately represent process dynamics or constrain profile shape and total mass, resulting in limited robustness. The results also show clear performance differences across decoders under discharge-cycle shift, together with strong error amplification. For example, MLP and Diffusion yield PA values of 1.317 and 2.981, respectively, while their MCE values rise to 112.519 and 27.538. This suggests that some models are prone to systematic mismatch in the overall profile level under target distribution shift. In contrast, Attention achieves the lowest PA and MCE, while GRU and LSTM obtain lower CCE values, indicating that temporal memory structures contribute to shape stability, although they still fail to control global mass error effectively. The radar plot shows that inter-model dispersion in the OOD setting is larger than in the ID setting. Traditional machine-learning models occupy a relatively compact region with lower mass error but moderate shape consistency. MLP and Diffusion extend further from the radar centroid across multiple metrics in the OOD setting, consistent with their elevated PA (1.317 and 2.981) and MCE (112.519 and 27.538) in Table 8. Attention, GRU, and LSTM occupy a more compact radar region, consistent with their lower OOD error metrics in the same table. Under direct OOD evaluation (i.e., without target-domain fine-tuning), the lowest observed PA across all baselines was 0.193 and the lowest observed MCE was 15.068, approximately 7-fold and 14-fold higher than the corresponding lowest values under the ID setting (Tables 4 and 5). This motivates the few-shot adaptation experiments described next.
We further introduced a small number of labeled OOD samples for fine-tuning and evaluated two adaptation settings, 4-shot and 6-shot: 4 or 6 of the 13 OOD samples were used for light fine-tuning, and the remaining samples were used for testing. The results are shown in Table 9.
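The few-shot protocol can be sketched as a simple partition of the 13 OOD samples. How the fine-tuning samples were chosen is not specified in the text, so taking the first k samples in order is an assumption, as are the function name and the placeholder identifiers.

```python
def few_shot_split(ood_samples, k):
    """Use the first k OOD samples for fine-tuning and hold out the rest for testing."""
    if not 0 <= k < len(ood_samples):
        raise ValueError("k must leave at least one held-out test sample")
    return ood_samples[:k], ood_samples[k:]


ood_ids = list(range(13))  # placeholder identifiers for the 13 OOD samples

for k in (0, 4, 6):
    support, held_out = few_shot_split(ood_ids, k)
    print(f"{k}-shot: {len(support)} fine-tuning samples, {len(held_out)} test samples")
```

Under this reading, the held-out test set shrinks as the number of shots grows, so metrics from different shot settings are computed on different numbers of test samples.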
The results show clear differences among the three paradigms under discharge-cycle shift. In the 0-shot setting, Paradigm III produced the lowest values on all three metrics: relative to the second-best paradigm in this setting (Paradigm I), its PA, CCE, and MCE were 33.3%, 50.9%, and 33.8% lower, respectively. This is consistent with the interpretation that, even without target-domain supervision, the distillation route biases the predictor toward the CFD-derived profile region, which mitigates both systematic plateau-level shifts and high-frequency local excursions under cycle change. Under few-shot fine-tuning, the three paradigms differed more strongly in shape consistency and mass conservation than in pointwise accuracy. In the 4-shot setting, Paradigm II produced the lowest PA (0.061) by a small margin over Paradigm III (0.062), while Paradigm III retained the lowest CCE (0.029, 34.1% lower than Paradigm II) and the lowest MCE (3.678, 19.8% lower than Paradigm II). When the number of fine-tuning samples increased to 6-shot, Paradigm III produced the lowest values on all three metrics: relative to the second-best paradigm in this setting (Paradigm II), its PA, CCE, and MCE were 7.6%, 29.3%, and 38.1% lower, respectively. By contrast, Paradigm I exhibited an increase in CCE from 0.124 (4-shot) to 0.142 (6-shot), suggesting that under structural distribution shift caused by discharge-cycle changes, a transfer mechanism based only on initialization bias may be more sensitive to target-domain sample distribution than the distillation route. To further illustrate the differences among paradigms, predictions on the OOD data are visualized in Figure 5.
Figure 5 shows that, under OOD conditions dominated by changes in sedimentation duration, the three physical-prior fusion paradigms affect sludge–water interface prediction in distinct ways. Compared with the ID setting, OOD samples show more pronounced systematic shifts in the overall level of the platform region, together with stronger and more irregular local disturbances. This is consistent with the physical mechanism that sludge accumulation increases with sedimentation time, and it also indicates that changes in the discharge cycle make both global mass control and local shape fitting more difficult. As shown in
Figure 5, the main advantage of Paradigm I does not lie in local smoothness. In several samples, obvious high-frequency oscillations and local fluctuations still appear in the platform region. However, compared with other methods, Paradigm I better preserves the overall contour of key structures, including the sharp drop near the inlet, the position of the platform region, and the rise near the outlet. As a result, its predicted curves are generally closer to the true profiles in terms of global level and main trend. Its strength should therefore be understood as better control of global structure and absolute bias, rather than suppression of local oscillations. Paradigm II jointly models CFD-derived physical information as latent conditional features together with real operating-condition representations. Under OOD conditions, its performance depends more strongly on the quality of cross-domain representation alignment. When the physical representation matches the target operating-state distribution well, the model can maintain a reasonable trend to some extent. However, in some samples, the later part of the platform region still shows persistent overestimation, underestimation, or local drift, indicating that feature enhancement alone is insufficient to provide stable output constraints in out-of-distribution settings. By contrast, the advantage of Paradigm III is most evident at the shape level. Its predicted curves are more continuous and smoother overall, with fewer high-frequency fluctuations and spike-like anomalies in the platform region. This suggests that, under teacher-model guidance, the solution space of the student model is constrained to a neighborhood closer to the feasible manifold characterized by CFD, which is more beneficial for preserving profile consistency and physical plausibility. 
Overall, under OOD conditions, the key challenge is not the sharp decline near the inlet, but the systematic drift of the platform region and local nonphysical oscillations. The former leads to cumulative errors in total sludge discharge estimation, while the latter increases the risk of misjudging local sludge accumulation or short-circuit flow. In this respect, Paradigm III shows the most stable shape consistency under zero-shot and few-shot adaptation.
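The difference between the two error types can be illustrated with a minimal numerical sketch. All values below are hypothetical (a flat 0.50 m platform sampled every 1 m, a 0.05 m constant offset, and a 0.05 m alternating disturbance) and are not taken from the study. The sketch only shows that a constant offset integrates into a proportional error in the profile area, whereas a zero-mean oscillation largely cancels in the integral while still distorting the local shape.

```python
import math


def trapezoid(y, dx):
    """Integrate equally spaced samples with the trapezoidal rule."""
    return dx * (sum(y) - 0.5 * (y[0] + y[-1]))


# Hypothetical platform region: 41 samples over 40 m, true height 0.50 m.
n, dx = 41, 1.0
true_profile = [0.50] * n

# Systematic drift: every point is overestimated by 0.05 m.
drift_profile = [h + 0.05 for h in true_profile]

# Local oscillation: zero-mean disturbance with 0.05 m amplitude.
osc_profile = [
    h + 0.05 * math.sin(2.0 * math.pi * i / 4.0)
    for i, h in enumerate(true_profile)
]

true_area = trapezoid(true_profile, dx)

# Error in the integrated profile area (m^2 per unit tank width).
drift_error = trapezoid(drift_profile, dx) - true_area
osc_error = trapezoid(osc_profile, dx) - true_area

# Largest pointwise deviation caused by the oscillation (m).
max_local_error = max(abs(p - t) for p, t in zip(osc_profile, true_profile))

print(f"drift area error:       {drift_error:.3f}")
print(f"oscillation area error: {osc_error:.3f}")
print(f"max local error:        {max_local_error:.3f}")
```

In this toy case the drift accumulates over the full platform length, while the oscillation leaves the integrated area almost unchanged but still produces a local deviation equal to its amplitude.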
4. Conclusions
To address the limited physical reliability of data-driven methods and the difficulty of using pure CFD methods for real-time state estimation, this study developed a soft-sensing framework for sedimentation-state prediction by integrating SCADA operational data, measured sludge–water interface data, and CFD-based hydraulic priors. The framework enables effective prediction of the true in-tank sedimentation state from observable operating variables. On this basis, we further proposed and systematically compared three paradigms for physical-knowledge integration, namely parameter transfer, representation fusion, and knowledge distillation. These paradigms inject the hydraulic structure and particle-transport priors encoded in CFD into the parameter space, latent representation space, and supervision-constraint space, respectively.
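As an illustration of injection at the supervision-constraint level, the following minimal sketch shows one common form of distillation objective: a weighted sum of a data-fit term and a teacher-guidance term. This weighted-sum form, the function names, and the weight `alpha` are assumptions made for illustration; the loss actually used in this study is not restated in this section.

```python
def mse(pred, target):
    """Mean squared error between two equally long profiles."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def distillation_loss(student, measured, teacher, alpha=0.5):
    """Blend the fit to measured data with the fit to the teacher output.

    alpha is a hypothetical weight: alpha = 0 ignores the CFD-based
    teacher, and alpha = 1 ignores the measured interface profile.
    """
    return (1.0 - alpha) * mse(student, measured) + alpha * mse(student, teacher)


# Hypothetical two-point profiles (interface height in m).
student = [0.4, 0.6]
measured = [0.5, 0.5]
teacher = [0.4, 0.8]

loss = distillation_loss(student, measured, teacher, alpha=0.25)
print(f"blended loss: {loss:.4f}")
```

Raising `alpha` pulls the student toward the teacher profile, which corresponds to constraining the solution space toward the CFD-characterized region at the possible expense of fit to the measured data.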
Averaged across the six predictors, Paradigm I reduced PA, CCE, and MCE by 25.1%, 8.7%, and 47.6%, respectively, relative to the unfused baselines, and was the only paradigm that improved all three metrics simultaneously. Paradigm III produced the largest average reductions in PA (36.9%) and MCE (50.0%) but increased CCE by 7.4%, indicating that distillation-based supervision did not preserve curvature structure as effectively as parameter-level injection. Paradigm II produced the smallest gains, with an average CCE increase of 5.7%. Under the in-distribution condition (24 h discharge cycle), Attention with Paradigm I produced the lowest PA (0.026, tied with MLP under Paradigm III) and the lowest MCE (1.052) among the eighteen paradigm–predictor combinations; the lowest CCE was instead produced by GRU under Paradigm I (0.032), with Attention under Paradigm I ranking third (0.034). Within the same architecture, Attention under Paradigm I reduced PA, CCE, and MCE by 50.0%, 10.5%, and 65.9% relative to unfused Attention; these figures quantify the benefit of CFD-informed initialization and do not constitute a cross-model ranking. Across combinations, the corresponding reductions relative to the second-best of the eighteen combinations were 7.1% in PA (vs. PINN–III, 0.028) and 2.9% in MCE (vs. MLP–III, 1.083). Under the out-of-distribution condition, Paradigm III produced the lowest CCE and MCE in all three shot configurations, while Paradigm II produced the lowest PA in the 4-shot setting only; relative to the second-best paradigm, Paradigm III reduced PA, CCE, and MCE by 33.3%, 50.9%, and 33.8% in the zero-shot setting, and by 7.6%, 29.3%, and 38.1% in the 6-shot setting. Within the scope of our experiments, the two routes therefore exhibit a complementary trade-off: parameter transfer benefits ID prediction through CFD-informed initialization, while distillation benefits OOD prediction by biasing the predictor toward the CFD-derived region, at the cost of weaker ID curvature consistency.
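The cross-combination percentages quoted above follow from the usual relative-reduction formula, (reference − value) / reference. The short check below reproduces the two figures from the metric values stated in this section; the function name is introduced here for illustration only.

```python
def relative_reduction(reference, value):
    """Percentage reduction of `value` relative to `reference`."""
    return 100.0 * (reference - value) / reference


# PA: Attention under Paradigm I (0.026) vs. PINN-III (0.028).
pa_gain = relative_reduction(0.028, 0.026)

# MCE: Attention under Paradigm I (1.052) vs. MLP-III (1.083).
mce_gain = relative_reduction(1.083, 1.052)

print(f"PA reduction:  {pa_gain:.1f}%")
print(f"MCE reduction: {mce_gain:.1f}%")
```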
The main contribution of this study lies not only in improving sedimentation-state prediction accuracy, but also in proposing and validating an engineering-oriented soft-sensing framework for hydraulic knowledge integration that is reusable across models and can provide complementary advantages under different operating conditions. This framework offers a unified and verifiable methodological basis for sludge discharge optimization, online state perception in smart water treatment plants, and digital-twin-driven fine-grained operation.
Although relative paradigm orderings were observed under both ID and OOD evaluation in this study, four boundaries of the present work should be acknowledged. First, the framework was validated at a single 100,000 m³/d plant with one tank geometry and one influent-water regime over approximately 240 days of campaign data and a subsequent two-month deployment. Cross-plant generalization to tanks with different aspect ratios, coagulation chemistries, and source-water characteristics therefore remains to be demonstrated, and is the most immediate direction for follow-up work. Second, the OOD evaluation was constructed from thirteen reference profiles collected under extended sludge-discharge cycles, with virtual interpolation used only for supplementary response analysis. A larger OOD set would strengthen the robustness claim for Paradigm III, but expanding it in full-scale operation is constrained by safety review, since each profile requires deliberately prolonging the discharge interval; lightweight reference-data acquisition that does not depend on extended cycles is therefore a key methodological extension for future work. Third, the CFD priors were generated under simplified conditions, with most operating features held at annual-mean values and only sedimentation duration, influent turbidity, and flow velocity varied across the five simulated regimes. Although the simulated flow regime was verified to be physically consistent with that of the study tank through an a priori Reynolds–Froude analysis, a systematic sensitivity study of how variations in turbulence-model parameters, inlet conditions, and grid resolution propagate into prediction performance was beyond the scope of this validation campaign and is identified here as a natural extension, particularly when porting the framework to plants with markedly different geometries. Fourth, in Paradigm II the alignment between the CFD-derived and SCADA-derived latent representations is learned implicitly from the supervision signal rather than enforced by an explicit alignment loss; while this avoids over-constraining the latent space toward a CFD-defined manifold, a principled comparison of supervision-driven and constraint-driven alignment under varying degrees of distribution shift remains an open methodological question. Taken together, these boundaries delineate the validated regime of the proposed framework rather than indicate fundamental obstacles, and each suggests a concrete and tractable next step.