Predicting Wastewater Influent Characteristics Using Data-Driven Modeling Approaches

El-Dakhakhni, Omar; Li, Zhong; Zhou, Pengxiao; Snowling, Spencer

doi:10.3390/w18111255

¹ Department of Civil Engineering, McMaster University, Hamilton, ON L8S 4L8, Canada

² Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA

³ Hatch Ltd., Sheridan Science & Technology Park, 2800 Speakman Drive, Mississauga, ON L5K 2R7, Canada

* Author to whom correspondence should be addressed.

Water 2026, 18(11), 1255; https://doi.org/10.3390/w18111255

Submission received: 26 March 2026 / Revised: 1 May 2026 / Accepted: 15 May 2026 / Published: 22 May 2026

(This article belongs to the Section Wastewater Treatment and Reuse)

Abstract

Accurate prediction of wastewater influent quality is critical for optimizing treatment plant operations, minimizing environmental impact, and enabling proactive management under dynamic conditions. However, the complex, nonlinear, and temporally dependent nature of influent processes poses significant challenges to traditional modeling approaches. This study introduces a robust stacked ensemble learning framework that integrates Long Short-Term Memory (LSTM), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost) to forecast three key influent quality parameters: biochemical oxygen demand (BOD₅), total phosphorus (TP), and total solids (TS) at a municipal wastewater treatment plant (WWTP) in Canada. Through sequential backward feature selection and SHapley Additive exPlanations (SHAP), the model achieves both high predictive accuracy and interpretability, providing insights into temporal, environmental, and process-based drivers of influent variability. The ensemble consistently outperforms individual models, delivering high generalization performance across all three influent quality targets. This work demonstrates that stacked ensemble models, when coupled with explainable AI techniques, can bridge the gap between black-box performance and operational transparency in wastewater forecasting. The proposed framework lays the groundwork for more resilient, data-driven decision-making in municipal WWTPs.

Keywords: wastewater influent; water quality; wastewater treatment plants; data-driven; machine learning; ensemble

1. Introduction

Effective operation and management of wastewater treatment plants (WWTPs) increasingly depend on the ability to forecast influent quality in advance. Predicting influent characteristics is essential for controlling operational costs, improving treatment efficiency, and mitigating environmental risks. Municipal wastewater influent is generated from residential, commercial, and, in some cases, industrial sources and is typically conveyed to WWTPs through extensive and complex sewer networks. It exhibits complex physicochemical characteristics, including highly variable organic loads, nutrient concentrations, suspended solids, and temperature. Temporal and spatial variations in water usage across residential, commercial, and industrial users, combined with the structural complexity of municipal sewer networks, significantly affect wastewater influent characteristics [1]. In addition, the configuration of the sewer system plays a critical role. Combined sewer systems are significantly affected by stormwater during precipitation events, which can lead to substantial fluctuations in influent flow and pollutant concentrations. Separated systems, although designed to isolate sanitary wastewater from stormwater, remain susceptible to infiltration and inflow, which can still introduce variability in both influent flow and composition. Overall, the complex, nonlinear, and temporally dependent nature of influent processes presents significant challenges to traditional modeling approaches, highlighting the need for advanced predictive approaches.

Influent forecasting has advanced in recent years, particularly with respect to flow rate prediction, where data-driven models have been successfully applied to capture temporal fluctuations in wastewater inflow. For example, Zhou et al. (2023) emphasized modeling influent flow rate rather than quality because flow dynamics are comparatively easier to monitor and exhibit more stable temporal patterns [2]. In contrast, influent quality characteristics, such as biochemical oxygen demand after 5 days (BOD₅), total phosphorus (TP), and total solids (TS), remain relatively understudied. Cheng et al. (2020) used recurrent neural networks, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), as soft sensors for predicting influent flow, temperature, and biochemical oxygen demand (BOD) [3]. Liu et al. (2026) developed a Fluctuation-injected Trend inverted Transformer for the prediction of BOD and chemical oxygen demand (COD) [4]. These studies demonstrated the capability of various data-driven techniques to model nonlinear and dynamic influent behavior, but the prediction of influent quality (e.g., BOD₅, TP, TS) remains far more complex and less explored.

Meanwhile, previous efforts to model influent quality have focused mainly on using a single model to predict a given parameter [5]. For example, one study [6] relied on single-algorithm machine learning models to predict parameters such as BOD₅, CBOD, or TSS, sometimes using surrogate variables (e.g., flow, temperature) within sliding-window frameworks. This highlights the lack of comprehensive ensemble-based frameworks for influent quality forecasting. Ensemble approaches have been shown to improve predictive accuracy and robustness in wastewater effluent forecasting [7], suggesting a possible pathway toward more reliable influent quality prediction. Despite this potential, ensemble modeling remains underutilized in influent quality prediction, leaving a clear methodological gap in leveraging multiple algorithms to capture the nonlinear and multivariate nature of influent dynamics. Another notable limitation of recent research on wastewater influent quality prediction is the low interpretability of machine learning models. Their black-box structure obscures the underlying decision logic and limits operational trust, creating a need for transparent feature attribution methods such as SHAP to support real-world implementation [8].

Therefore, the objective of this study is to develop a robust ensemble learning framework for short-term prediction of key wastewater influent quality parameters (BOD₅, TP, and TS) and to enhance model transparency through feature attribution techniques. Specifically, an ensemble influent prediction model is developed by integrating three machine learning algorithms, including LSTM, Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost); a sequential backward selection (SBS) process is used to identify optimal predictors, reduce redundancy, and enhance efficiency; and the model results are interpreted using both global feature importance scores and SHapley Additive exPlanations (SHAP). The proposed modeling framework and the results will provide valuable support for the daily operation and decision-making in WWTPs.

2. Study Area and Data Collection

The influent quality data in this study were collected from a single confidential WWTP in Canada and consist of daily BOD₅, TP, TS, and flow records from 1 January 2015 to 31 December 2017. Although site-specific, the methodology is transferable and is expected to generalize across facilities, with potential performance gains when applied to more recent, higher-resolution datasets. The input variables for model development, along with their statistics, are presented in Table 1. The dataset comprised 1095 samples corresponding to 1095 consecutive days. Missing observations were filled using linear interpolation, with 174 values missing for BOD₅, 14 for TP, and 15 for TS. Since the missing values were distributed across the time series rather than occurring in large contiguous gaps, interpolation was applied to preserve temporal continuity and avoid excessive data loss. The variables were then normalized using the Box-Cox transformation to reduce skewness, stabilize variance, and mitigate the influence of extreme values while preserving all observations. Multiple outlier handling strategies were initially considered, including data retention and removal-based approaches; however, the final model retained all data points to preserve rare but operationally meaningful extreme influent events.
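
The two preprocessing steps described above can be illustrated with a minimal sketch. It assumes NumPy, a hypothetical six-day BOD₅ excerpt, and an arbitrary Box-Cox exponent (lam = 0.5); the study does not report the fitted exponents, so none of these values should be read as the authors' settings.

```python
import numpy as np

def interpolate_missing(values):
    """Fill NaN gaps by linear interpolation between the nearest observed neighbours."""
    y = np.asarray(values, dtype=float)
    idx = np.arange(len(y))
    observed = ~np.isnan(y)
    # np.interp holds the first/last observed value constant outside the observed range.
    return np.interp(idx, idx[observed], y[observed])

def box_cox(x, lam):
    """Box-Cox transform; defined only for strictly positive inputs."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Map transformed values back to the original scale."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

# Hypothetical daily BOD5 record (mg/L) with scattered gaps.
bod = [210.0, np.nan, 230.0, np.nan, np.nan, 260.0]
filled = interpolate_missing(bod)           # gaps filled along straight lines
transformed = box_cox(filled, lam=0.5)      # lam = 0.5 is an illustrative assumption
restored = inverse_box_cox(transformed, lam=0.5)
print(filled)
```

In practice the exponent would be estimated from the data (for example by maximum likelihood) separately for each variable, and the same exponent must be reused when predictions are mapped back to mg/L.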

The corresponding meteorological data, including daily maximum temperature, minimum temperature, degree-day, snowfall amount, snowfall depth, and precipitation, were obtained from Dark Sky, a now-defunct weather forecasting service. To enhance model relevance and predictive capability, additional features were derived and adopted, including accumulated heating degree days (HDD), cooling degree days (CDD), and multi-day cumulative precipitation (for n = 2, …, 7 days), capturing the cumulative effects of snowmelt and other sewershed conditions. Additionally, the primary target chemical characteristics and their most closely associated co-target variables were indexed up to 7 days backward to create lagged variables (e.g., BOD₅, TP, TS n days ago), providing insights into the wastewater influent dynamics. The temporal structure of inputs follows a sliding window framework, which enables the model to learn from sequential patterns and short-term dependencies. The 7-day rolling means of BOD₅, TP, and TS were created as target variables to smooth short-term variability, capture meaningful trends, and improve model reliability.
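
The derived features described above (lags, multi-day cumulative precipitation, degree days, and rolling-mean targets) can be sketched as follows. This is an illustration only: the 18 °C degree-day base, the trailing (rather than centered) windows, and the short example series are assumptions, since the study does not state these details.

```python
import numpy as np

def lagged(series, n):
    """Value observed n days earlier; the first n entries have no history (NaN)."""
    s = np.asarray(series, dtype=float)
    out = np.full(len(s), np.nan)
    out[n:] = s[:len(s) - n]
    return out

def trailing_sum(series, n):
    """Sum over the current day and the n - 1 preceding days."""
    s = np.asarray(series, dtype=float)
    out = np.full(len(s), np.nan)
    csum = np.concatenate(([0.0], np.cumsum(s)))
    out[n - 1:] = csum[n:] - csum[:-n]
    return out

def trailing_mean(series, n):
    """Rolling mean over a trailing n-day window."""
    return trailing_sum(series, n) / n

def degree_days(t_max, t_min, base=18.0):
    """Daily heating and cooling degree days relative to an assumed base temperature."""
    mean_t = (np.asarray(t_max, dtype=float) + np.asarray(t_min, dtype=float)) / 2.0
    return np.maximum(base - mean_t, 0.0), np.maximum(mean_t - base, 0.0)

# Hypothetical five-day excerpt; short windows are used so the output is easy to check.
precip = [0.0, 5.0, 0.0, 10.0, 2.0]         # mm/day
tp = [5.0, 6.0, 7.0, 6.0, 5.0]              # mg/L

precip_2d = trailing_sum(precip, 2)         # 2-day cumulative precipitation
tp_lag1 = lagged(tp, 1)                     # TP one day ago
tp_mean3 = trailing_mean(tp, 3)             # rolling-mean target (the study uses 7 days)
hdd, cdd = degree_days([10.0, 20.0, 30.0], [0.0, 10.0, 20.0])
acc_hdd = np.cumsum(hdd)                    # accumulated HDD over the excerpt
print(precip_2d, tp_lag1, tp_mean3, acc_hdd)
```

Rows whose windows reach before the start of the record are left as NaN here; how such rows were handled in the study is not stated.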

3. Methodology

3.1. Ensemble Members

LSTM was selected based on its successful application in previous studies [9]. LSTM is a type of recurrent neural network (RNN) designed to capture long-term dependencies in sequential data. Unlike multilayer perceptron (MLP) models, where connections run strictly between adjacent layers, LSTM models add recurrent connections among nodes within the same layer and retain sequential information through specialized gates (input, forget, and output) and a memory cell. A grid search determined that six stacked layers, each containing 80 neurons, yielded the best performance. The Adam optimizer was employed for weight updates because its adaptive learning rates efficiently handle sparse gradients and non-stationary objectives, which are common in time series data. The loss function was mean squared error (MSE), a commonly adopted loss for regression tasks that penalizes larger errors more heavily and provides a smooth gradient for optimization.
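
To make the gate mechanism concrete, the sketch below implements a single forward step of a standard LSTM cell in NumPy and runs it over a 7-step window. The gate ordering, weight shapes, and random weights are illustrative assumptions; this is not the six-layer, 80-unit network trained in the study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    with the four blocks stacked in the order [input, forget, output, candidate].
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values for the memory cell
    c_t = f * c_prev + i * g       # updated memory cell
    h_t = o * np.tanh(c_t)         # updated hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 4, 3                        # hypothetical sizes: 4 input features, 3 hidden units
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

window = rng.normal(size=(7, D))   # one 7-day sliding window of inputs
h, c = np.zeros(H), np.zeros(H)
for x_t in window:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```

The final hidden state h summarizes the window and would normally be passed to further layers or a regression head.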

SVR is an extension of support vector machines (SVM) tailored for regression tasks, and it has been successfully applied to wastewater prediction tasks, including the chemical characteristics of wastewater effluent [7]. It seeks to identify a function that approximates target values within a specified tolerance margin, ϵ, while simultaneously minimizing model complexity. In this study, the radial basis function (RBF) kernel was selected due to its capacity to model complex, nonlinear interactions without requiring prior assumptions about the underlying data distribution. The RBF kernel measures similarity between input vectors based on their Euclidean distance, enabling the SVR model to adapt flexibly to localized variations in the feature space. This flexibility is particularly advantageous for environmental systems modeling, where variable relationships are often nonlinear, temporally dependent, and affected by external disturbances, making the RBF kernel well-suited for forecasting wastewater influent dynamics.
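
The two ingredients named above, the ϵ-insensitive tolerance margin and the RBF kernel, can be written out in a few lines. The sketch below uses only the Python standard library; the kernel width gamma, the value of eps, and the toy support vectors are hypothetical, as the study does not report its SVR hyperparameters.

```python
import math

def rbf_kernel(a, b, gamma):
    """Similarity that decays with the squared Euclidean distance between a and b."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def epsilon_insensitive_loss(y_true, y_pred, eps):
    """Errors inside the +/- eps tube cost nothing; larger errors grow linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)

def svr_predict(x, support_vectors, dual_coefs, intercept, gamma):
    """Kernel expansion of a fitted SVR: weighted similarities to the support vectors."""
    return intercept + sum(
        coef * rbf_kernel(sv, x, gamma) for sv, coef in zip(support_vectors, dual_coefs)
    )

near = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=0.5)    # exp(-0.5)
far = rbf_kernel([0.0, 0.0], [3.0, 0.0], gamma=0.5)     # exp(-4.5)
inside_tube = epsilon_insensitive_loss(10.0, 10.05, eps=0.1)
outside_tube = epsilon_insensitive_loss(10.0, 10.5, eps=0.1)
pred = svr_predict([1.0, 2.0], [[1.0, 2.0]], [2.0], intercept=1.0, gamma=0.5)
print(near, far, inside_tube, outside_tube, pred)
```

Because similarity falls off quickly with distance, each support vector influences predictions only in its own neighbourhood, which is the localized flexibility referred to in the text.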

XGBoost was also selected as a constituent algorithm due to its demonstrated success in modeling wastewater influent dynamics and predicting key water quality parameters such as BOD₅ [10]. XGBoost is an advanced implementation of gradient-boosting decision trees that builds an ensemble of weak learners stage-wise, optimizing a regularized objective function to prevent overfitting and enhance generalization. XGBoost minimizes a second-order Taylor expansion of the loss function in each boosting iteration, enabling more accurate approximations and faster convergence compared to traditional gradient boosting methods. In this study, hyperparameters were tuned using grid search with MSE as the loss function. The optimal learning rate was determined as 0.1 and the maximum tree depth as six. XGBoost’s ability to automatically handle missing data and capture nonlinear feature interactions makes it particularly well-suited for forecasting wastewater influent, where relationships between predictors and targets are often irregular, noisy, and seasonally influenced.
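
For reference, the second-order approximation mentioned above can be written out explicitly. The notation below follows the standard XGBoost formulation and is not taken from this article: f_t is the tree added at round t, T its number of leaves, w_j the weight of leaf j, I_j the set of samples assigned to leaf j, and γ and λ the regularization coefficients.

```latex
% Objective at boosting round t, expanded to second order around the
% previous prediction \hat{y}_i^{(t-1)} (terms constant in f_t dropped)
\[
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
\]

% Gradient and Hessian of the loss l with respect to the previous prediction
\[
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}
\]

% Optimal weight of leaf j for a fixed tree structure
\[
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
\]

% Special case: squared-error loss l(y, \hat{y}) = (y - \hat{y})^2
\[
g_i = 2\big(\hat{y}_i^{(t-1)} - y_i\big), \qquad h_i = 2
\]
```

With a squared-error loss the Hessian is constant, so each leaf weight reduces to a shrunken average of the residuals in that leaf, with λ controlling the amount of shrinkage.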

3.2. Ensemble Architecture

To enhance predictive robustness and leverage the complementary strengths of individual algorithms, a stacked ensemble architecture was implemented. As shown in Figure 1, this approach integrated three machine learning algorithms as base learners, each independently trained on optimized feature subsets selected using SBS for the prediction of BOD₅, TP, and TS. Each member model generated an individual prediction for a given observation, and these predictions served as input to a meta-learner trained to produce the final output. Ordinary least squares (OLS) regression was employed as the meta-learner due to its transparency, closed-form solution, and reduced risk of overfitting. A five-fold cross-validation scheme was employed during base model training. This stacking structure was selected over alternative ensemble methods such as bagging or boosting because it explicitly models inter-algorithm complementarity rather than relying solely on variance reduction. By combining the distinct modeling capabilities of deep learning (LSTM), kernel methods (SVR), and tree-based algorithms (XGBoost), this ensemble structure aimed to capture a wide range of dynamic relationships during wastewater influent processes.
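
The stacking procedure can be sketched as follows, under stated assumptions. Three toy linear learners, each restricted to one feature, stand in for LSTM, SVR, and XGBoost; the meta-learner is fitted on out-of-fold predictions from five contiguous folds, which is a common stacking convention and an assumption about how the five-fold scheme was used here. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def make_linear_learner(cols):
    """Toy base learner: OLS restricted to the given feature columns."""
    def fit(X, y):
        return fit_ols(X[:, cols], y)
    def predict(model, X):
        return predict_ols(model, X[:, cols])
    return fit, predict

def out_of_fold_predictions(learners, X, y, n_folds=5):
    """Each row is predicted by base models that never saw that row during fitting."""
    n = len(y)
    folds = np.array_split(np.arange(n), n_folds)   # contiguous blocks
    Z = np.zeros((n, len(learners)))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for j, (fit, predict) in enumerate(learners):
            Z[held_out, j] = predict(fit(X[train], y[train]), X[held_out])
    return Z

def fit_stack(learners, X, y, n_folds=5):
    Z = out_of_fold_predictions(learners, X, y, n_folds)
    meta = fit_ols(Z, y)                             # OLS meta-learner on base predictions
    base_models = [fit(X, y) for fit, _ in learners] # refit base learners on all training data
    return base_models, meta

def predict_stack(learners, base_models, meta, X):
    Z = np.column_stack([predict(m, X) for (_, predict), m in zip(learners, base_models)])
    return predict_ols(meta, Z)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data: the target depends on all three features, each learner sees only one.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

learners = [make_linear_learner([0]), make_linear_learner([1]), make_linear_learner([2])]
base_models, meta = fit_stack(learners, X_train, y_train)

stack_rmse = rmse(y_test, predict_stack(learners, base_models, meta, X_test))
base_rmses = [rmse(y_test, predict(m, X_test)) for (_, predict), m in zip(learners, base_models)]
print(f"stack RMSE = {stack_rmse:.3f}, base RMSEs = {[round(r, 3) for r in base_rmses]}")
```

Fitting the meta-learner on out-of-fold rather than in-sample predictions keeps it from rewarding a base model that has simply memorized the training data. The toy learners are deliberately complementary, so the gain shown here is larger than would be expected from real base models.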

3.3. Feature Selection

SBS was integrated into the ensemble to reduce the dimensionality of the input space while retaining the most informative predictors for influent forecasting. Previous research demonstrates SBS’s ability to reduce input dimensionality and improve model accuracy, especially in tasks involving high-dimensional or sparsely informative features [11]. In this study, SBS was initialized with the complete set of 42 candidate features, denoted by X = {x₁, x₂, …, x₄₂}, and iteratively refined to produce a reduced feature subset X_k = {x_j | j = 1, 2, …, k; x_j ∈ X}, where k < 42 and the target dimensionality p was defined a priori. At each step, the single feature x_j whose exclusion yielded the best model performance, measured as the lowest cross-validated root mean squared error (RMSE), was removed from X_k. This backward elimination process continued until the subset had been reduced to size p = 10, yielding a streamlined input set tailored to the modeling objectives of the study. The best-performing subsets for BOD₅, TP, and TS, respectively, were selected for further investigation and model development.
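
The backward elimination loop can be sketched as below. The sketch assumes a linear (OLS) scoring estimator with five-fold cross-validated RMSE, and it uses a small synthetic problem with six candidate features instead of the study's 42; the function names and data are hypothetical.

```python
import numpy as np

def cv_rmse(X, y, n_folds=5):
    """Cross-validated RMSE of an OLS model with intercept."""
    n = len(y)
    folds = np.array_split(np.arange(n), n_folds)
    sq_err = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(held_out)), X[held_out]]) @ coef
        sq_err += np.sum((y[held_out] - pred) ** 2)
    return float(np.sqrt(sq_err / n))

def sequential_backward_selection(X, y, p):
    """Start from all features and drop one per step until p remain (p >= 1)."""
    selected = list(range(X.shape[1]))
    history = [(tuple(selected), cv_rmse(X[:, selected], y))]
    while len(selected) > p:
        candidates = []
        for j in selected:
            remaining = [c for c in selected if c != j]
            candidates.append((cv_rmse(X[:, remaining], y), remaining))
        # Remove the feature whose exclusion leaves the lowest cross-validated RMSE.
        best_score, selected = min(candidates, key=lambda c: c[0])
        history.append((tuple(selected), best_score))
    return selected, history

# Synthetic check: only columns 1 and 4 carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=300)

selected, history = sequential_backward_selection(X, y, p=2)
for subset, score in history:
    print(len(subset), subset, round(score, 4))
```

The recorded history makes it possible to compare subsets of every size after the fact, which is one way a different optimal feature count could be chosen for each target.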

3.4. AI Use Statement

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-4) and Microsoft Copilot (Microsoft 365 Copilot, GPT-based large language model) for limited assistance with coding tasks (e.g., debugging and syntax support) during the study, and to assist with proofreading and editing the manuscript. The authors have reviewed and edited all outputs and take full responsibility for the content of this publication.

4. Results and Discussion

4.1. Model Performance

A series of ensemble model runs with different input feature sets, following the feature elimination process described in Section 3, were conducted to predict the 7-day rolling mean values of BOD₅, TP, and TS. Three performance metrics, including coefficient of determination (R²), RMSE, and mean absolute percentage error (MAPE), were chosen to assess model performance on both the training and testing datasets for each target variable (BOD₅, TP, and TS) [7,12]. R² quantifies the proportion of variance in the target variable explained by the model. RMSE measures the average magnitude of the differences between predicted and observed values in mg/L, providing an interpretable measure of the model’s absolute accuracy. MAPE complements RMSE by expressing errors as percentages, facilitating cross-variable comparisons despite differences in scale or magnitude [7].
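
The three metrics can be computed directly from their definitions. The sketch below uses only the standard library, and the observed and predicted values are hypothetical numbers chosen so that the results are easy to verify by hand.

```python
import math

def r2_score(y_true, y_pred):
    """Share of the variance in the observations explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (mg/L here)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any observation is zero."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

observed = [200.0, 250.0, 300.0, 350.0]     # hypothetical BOD5 rolling means (mg/L)
predicted = [210.0, 240.0, 310.0, 340.0]

print(f"R2   = {r2_score(observed, predicted):.3f}")    # 0.968
print(f"RMSE = {rmse(observed, predicted):.1f} mg/L")   # 10.0 mg/L
print(f"MAPE = {mape(observed, predicted):.2f} %")      # 3.80 %
```

Every prediction here is off by exactly 10 mg/L, so RMSE is 10 mg/L, while MAPE weights the same absolute error more heavily at low concentrations.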

Results of these ensemble runs are presented in Figure 2. R² values are color-coded in blue, with darker shades indicating higher performance. For MAPE and RMSE, the best performance is highlighted in bright yellow and the worst in red. The optimal number of features for each target was then determined from the best error metrics on the testing dataset. RMSE was selected as the primary metric; it typically aligned with the optimal MAPE and R² values, except for TS, where deviations were observed. The best-performing ensembles were defined as those with the optimal number of features, which was 10, 18, and 20 for BOD₅, TP, and TS, respectively. Although the MAPE for TS was not the lowest, it differed from the lowest MAPE by only 0.1%, and the corresponding R² and RMSE values for TS were significantly better than those of the other feature sets. Compared with similar models reported in the literature, the overall performance presented in Figure 2 can be considered more than satisfactory [2,3].

Figure 3 displays scatter plots of actual versus predicted values for BOD₅, TP, and TS, generated using the testing sets of the best-performing ensembles and their member models. Each row corresponds to one of the target variables. For BOD₅ (Figure 3A–D), the ensemble model showed the closest alignment with the 1:1 reference line, characterized by a dense, diagonally aligned cluster, indicating high predictive accuracy across the entire concentration range. In contrast, the SVR and LSTM models exhibited a greater spread, particularly at higher concentrations (>350 mg/L), where predictions tended to underestimate the actual values. XGBoost performed better than SVR and LSTM individually but still showed slight deviations at extreme BOD₅ values. The ensemble’s tighter clustering indicates that it effectively balances the strengths of its members. For TP (Figure 3E–H), the ensemble model exhibited minimal dispersion and the tightest clustering around the 1:1 line. SVR predictions were visibly more clustered toward the center and exhibited underestimation at both extremes of the concentration range (<5 mg/L and >7 mg/L). LSTM performed worse in this case, displaying wide vertical dispersion and a consistent downward bias relative to the 1:1 line. XGBoost showed a reasonably linear trend but a broader vertical scatter of points, indicating higher residual variance than the ensemble. For TS (Figure 3I–L), the ensemble model again showed the most compact point cloud. SVR and LSTM exhibited broader scatter and systematic bias, with SVR compressing high concentrations (>500 mg/L) toward the mean and LSTM underestimating across the range. XGBoost performed slightly worse than the ensemble, as seen in its larger vertical spread, but remained broadly comparable. The ensemble’s superior TS predictions may stem from its ability to integrate complementary model strengths: LSTM captures temporal dependencies and sequential patterns, XGBoost detects nonlinearities and abrupt variance shifts, and SVR contributes robust generalization in noisy environments [3,7,10]. By stacking these models, the ensemble adapts to both smooth trends and sharp fluctuations in TS, resulting in more accurate forecasts. Overall, the scatter plots confirm the ensemble model’s consistent advantage across all three variables, as it produces predictions that are well-clustered and less variable.

4.2. Model Interpretability

4.2.1. Feature Importance Score

Feature importance scores were calculated for the three best-performing ensemble models for BOD₅, TP, and TS (Figure 4). The feature importance scores provide valuable insights into the relative influence of each predictor on the ensemble model when predicting the chemical characteristics of wastewater influent. Temporal characteristics, including year and month, as well as accumulated HDD, are among the strongest predictors of BOD₅, reflecting the critical role of periodicity in predicting influent concentration. The influence of these predictors on BOD₅ influent is consistent with the thermally driven nature of microbial activity, which decomposes organic matter more rapidly at higher temperatures and more slowly during colder periods [13]. This observation is supported by later analysis from the SHAP plots (Figure 5), where accumulated HDD and CDD emerged as key drivers in predicting BOD₅ influent. For TP prediction, month, accumulated HDD, and precipitation are key predictors. This may be because the WWTP receives flow from a combined sewer system in which stormwater is conveyed alongside wastewater, making influent quality highly sensitive to wet-weather conditions. It also reflects the impact of hydrological conditions on phosphorus transport, with precipitation mobilizing phosphorus-rich sediments and organic matter [14,15]. At the same time, the influence of accumulated HDD may be explained by the fact that colder conditions can affect snowmelt and phosphorus release from deicing agents [16,17]. The most important predictors for TS were flow, precipitation, month, and accumulated HDD. Flow rate and precipitation act as the hydrological drivers of surface wash-off, month captures seasonal climate patterns, and accumulated HDD is likely to affect cold-season storage and pollutant retention, which vary significantly during thaw and snowmelt periods. These observations indicate that these features are key drivers of the temporal variability in TS.

4.2.2. SHAP

In addition to the feature importance scores for the best-performing ensemble models, SHAP values were calculated for each feature within each ensemble member to further enhance interpretability. The SHAP plots for BOD₅ (Figure 5A–C) highlight the influence of accumulated heating and cooling degree days, pointing to the significant impact of temperature changes on wastewater influent quality. This relationship may be due to the positive correlation between temperature and microbial activity, where elevated temperatures accelerate the decomposition of organic matter [18]. For TP, the SHAP plots (Figure 5D–F) validate the results of the previous feature importance analysis, underscoring the periodic nature of meteorological drivers (HDD and accumulated HDD) and urban hydrology (precipitation and accumulated precipitation) and their role in shaping phosphorus influent patterns. Accumulated precipitation and accumulated HDD show clear seasonal associations with TP influent concentrations, consistent with patterns reported in previous studies of Canadian municipal wastewater influent [19,20]. Precipitation facilitates surface runoff, which can transport phosphorus-rich material. In cold-season urban settings, phosphorus loading has been shown to increase during deicing-related runoff events. SHAP analysis for TS forecasting (Figure 5G–I) reveals monthly patterns of solids loading, as well as the impacts of climatic and environmental variables, including accumulated degree days and precipitation. Together with climatic and temporal variables, these factors reflect the complex interactions that shape TS influent variability. This multidimensional influence underscores the need to integrate diverse yet interrelated features to ensure robust and accurate predictive modeling of TS.
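
The attribution principle behind SHAP can be illustrated with an exact Shapley value computation on a toy model. This sketch enumerates all feature coalitions and replaces absent features with a single baseline value; the SHAP library itself relies on background data and model-specific approximations, so this is not the procedure used to produce Figure 5. The toy model and inputs are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions of predict(x) - predict(baseline) to each feature."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their actual value, the rest the baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    """One additive term and one interaction between the second and third features."""
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)    # approximately [2.0, 3.0, 3.0]
```

The attributions sum to the difference between the prediction (8) and the baseline output (0), and the interaction term is split evenly between the two features involved. The cost grows exponentially with the number of features, which is why practical SHAP implementations approximate this sum.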

4.3. Model Uncertainties

Although the developed ensemble modeling framework provided satisfactory predictions of wastewater influent quality, it is subject to several important sources of uncertainty. Input data are a primary contributor to uncertainty. In this study, several input variables exhibited strong intercorrelations, such as accumulated temperature or precipitation measures over successive days, which may introduce redundancy and obscure the true drivers of influent dynamics. To address this, the SBS approach was applied to reduce dimensionality and mitigate redundancy; however, because it relies on a linear estimator, it may have overlooked nonlinear or temporal interactions critical to models like LSTM [21]. Additionally, input data are susceptible to various forms of noise, time-lag inconsistencies, and unmodeled rare events, all of which can degrade model performance and reliability. External shocks and seasonal variability may also not be fully captured through engineered features alone, further contributing to uncertainty. Future work should incorporate sequence-aware feature selection methods, such as attention mechanisms, and probabilistic modeling techniques, such as Bayesian neural networks, to more effectively separate signal from noise in complex, temporally structured wastewater influent data [22].

Uncertainties may also arise from hyperparameter selection, which is typically based on the trade-off between model accuracy and generalization. In this study, the LSTM model employs a deep recurrent structure, whereas the XGBoost model comprises multiple decision trees with limited regularization; both can lead to overfitting if hyperparameters are not carefully tuned [23,24]. These uncertainties are particularly salient in environmental time series forecasting, where input features may be highly correlated or exhibit seasonal patterns. Future research should explore Bayesian approaches to quantify and mitigate parameter and hyperparameter uncertainties in influent forecasting.

Finally, uncertainties can be introduced during data preparation and processing. In this study, the targets were subject to Box-Cox normalization to stabilize variance and constrain the output range. While this transformation improves model training, it can introduce distortion when the inverse transformation is applied, especially under extreme or previously unseen conditions. Slight deviations in predicted normalized values may result in disproportionately large errors once back-transformed, complicating interpretability and real-world applicability. These challenges underscore the importance of meticulous post-processing and uncertainty quantification, particularly for applications in environmental management, where prediction accuracy at extremes is often most crucial.
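
The amplification effect can be shown numerically. The sketch below assumes a Box-Cox exponent of 0.5 and a fixed prediction error of 0.1 in transformed units; both values are hypothetical and chosen only to make the arithmetic transparent.

```python
def box_cox(x, lam):
    """Forward Box-Cox transform for a positive value and a nonzero exponent."""
    return (x ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Back-transform to the original scale for a nonzero exponent."""
    return (lam * z + 1.0) ** (1.0 / lam)

LAM = 0.5       # assumed exponent, not reported in the study
DELTA = 0.1     # identical error in transformed units at both concentrations

def back_transformed_error(x_true):
    """Error in original units caused by an error of DELTA in transformed units."""
    z_true = box_cox(x_true, LAM)
    return inverse_box_cox(z_true + DELTA, LAM) - x_true

err_low = back_transformed_error(100.0)     # typical concentration
err_high = back_transformed_error(900.0)    # extreme concentration
print(f"error at 100 mg/L: {err_low:.4f} mg/L")     # 1.0025
print(f"error at 900 mg/L: {err_high:.4f} mg/L")    # 3.0025
```

For this exponent the same transformed-space error is about three times larger at 900 mg/L than at 100 mg/L, and the ratio grows as the exponent approaches zero (the logarithmic case).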

5. Conclusions

In this study, an ensemble learning framework was developed for the short-term prediction of three key wastewater influent quality parameters: BOD₅, TP, and TS. The framework integrated LSTM, SVR, and XGBoost models as base learners with an OLS meta-learner. A sequential feature selection algorithm was employed to optimize input dimensionality for each target, and SHAP-based explainability methods were used to assess the contribution of individual predictors. Predictors such as flow rate, temperature, precipitation, and back-indexed co-target variables were used to forecast the 7-day rolling means of each target, capturing short-term dependencies. Performance was evaluated using RMSE, MAPE, and R² on independent test sets. The ensemble consistently outperformed the constituent base models across all three targets. SHAP analysis showed that short-term historical lags, along with precipitation, temperature, and influent flow rate, were key contributors to predictive performance, while the rolling-mean targets reflected smoothed short-term patterns captured through these inputs. These findings not only validated the predictive strength of the ensemble framework but also provided interpretable insights into the underlying drivers of influent variability.

The study introduced a technically grounded and interpretable framework for wastewater influent quality prediction. While the modeling work was demonstrated using site-specific data, the proposed methodology can be generalized across facilities: the developed model can be readily retrained and replicated at other facilities using local data to support wastewater treatment operations and optimization. Additionally, various uncertainties associated with the developed framework were discussed at the input, hyperparameter, and data processing levels, helping to identify key sources of variability and error in the prediction pipeline. To enhance the scalability and robustness of influent quality prediction, future research should explore probabilistic forecasting approaches, extend the framework to multiple facilities, and investigate nonlinear or sequence-aware feature selection methods. Additionally, testing model resilience under stress conditions, such as heavy rainfall, operational disruptions, or influent spikes, will be essential for assessing real-world applicability.

Author Contributions

Conceptualization, S.S. and Z.L.; methodology, O.E.-D. and P.Z.; validation, O.E.-D.; formal analysis, O.E.-D.; investigation, O.E.-D.; resources, S.S. and Z.L.; data curation, O.E.-D., P.Z. and S.S.; writing—original draft preparation, O.E.-D.; writing—review and editing, O.E.-D. and Z.L.; visualization, O.E.-D.; supervision, Z.L.; project administration, S.S., P.Z. and Z.L.; funding acquisition, S.S. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by funding from the Collaborative Research and Development (CRD) program of the Natural Sciences and Engineering Research Council of Canada (NSERC). Early-stage development of the work was also supported by Hydromantis Environmental Software Solutions, Inc., which was later acquired by Hatch Ltd.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy restrictions.

Acknowledgments

The authors used ChatGPT and Microsoft Copilot for limited assistance with coding tasks, as well as for proofreading and editing the manuscript.

Conflicts of Interest

Author Spencer Snowling was employed by the company Hatch Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from Hydromantis Environmental Software Solutions, Inc., which was subsequently acquired by Hatch Ltd. The funder was involved in problem identification and data collection.

References

Zhang, Q.; Li, Z.; Snowling, S.; Siam, A.; El-Dakhakhni, W. Predictive models for wastewater flow forecasting based on time series analysis and artificial neural network. Water Sci. Technol. 2019, 80, 243–253.
Zhou, P.; Li, Z.; Zhang, Y.; Snowling, S.; Barclay, J. Online machine learning for stream wastewater influent flow rate prediction under unprecedented emergencies. Front. Environ. Sci. Eng. 2023, 17, 152.
Cheng, T.; Harrou, F.; Kadri, F.; Sun, Y.; Leiknes, T. Forecasting of wastewater treatment plant key features using deep learning-based models: A case study. IEEE Access 2020, 8, 184475–184485.
Liu, Y.; Ren, S.; Fang, G. FiTiformer: A fluctuation–trend modulation model for multi-step forecasting of wastewater influent parameters. J. Water Process Eng. 2026, 83, 109600.
Andreides, M.; Dolejš, P.; Bartáček, J. The prediction of WWTP influent characteristics: Good practices and challenges. J. Water Process Eng. 2022, 49, 103009.
Verma, A.; Wei, X.; Kusiak, A. Predicting the total suspended solids in wastewater: A data-mining approach. Eng. Appl. Artif. Intell. 2013, 26, 1366–1372.
Nourani, V.; Elkiran, G.; Abba, S.I. Wastewater treatment plant performance analysis using artificial intelligence—An ensemble approach. Water Sci. Technol. 2018, 78, 2064–2076.
Wang, Y.; Li, T.; Bai, L.; Yu, H.; Qu, F. Comparison of interpretable machine learning models and mechanistic model for predicting effluent nitrogen in WWTP. J. Water Process Eng. 2025, 77, 108344.
Zhang, W.; Zhao, J.; Quan, P.; Wang, J.; Meng, X.; Li, Q. Prediction of influent wastewater quality based on wavelet transform and residual LSTM. Appl. Soft Comput. 2023, 148, 110858.
Ching, P.M.L.; Zou, X.; Wu, D.; So, R.H.Y.; Chen, G.H. Development of a wide-range soft sensor for predicting wastewater BOD₅ using an eXtreme gradient boosting (XGBoost) machine. Environ. Res. 2022, 210, 112953.
Aha, D.W.; Bankert, R.L. A comparative evaluation of sequential feature selection algorithms. In Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 4–7 January 1995; pp. 1–7.
Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250.
Metcalf & Eddy, Inc.; Tchobanoglous, G.; Stensel, H.D.; Tsuchihashi, R.; Burton, F.L. Wastewater Engineering: Treatment and Resource Recovery, 5th ed.; McGraw-Hill Education: New York, NY, USA, 2014.
Xiao, Y.; Zhang, C.; Zhang, T.; Luan, B.; Liu, J.; Zhou, Q.; Li, C.; Cheng, H. Transport processes of dissolved and particulate nitrogen and phosphorus over urban road surfaces during rainfall runoff. Sci. Total Environ. 2024, 948, 174905.
Vystavna, Y.; Hejzlar, J.; Kopáček, J. Long-term trends of phosphorus concentrations in an artificial lake: Socio-economic and climate drivers. PLoS ONE 2017, 12, e0186917.
Westerlund, C.; Viklander, M. Particles and associated metals in road runoff during snowmelt and rainfall. Sci. Total Environ. 2006, 362, 143–156.
Liu, J.; Elliott, J.A.; Lobb, D.A.; Flaten, D.N.; Yarotski, J. Critical factors affecting field-scale losses of nitrogen and phosphorus in spring snowmelt runoff in the Canadian Prairies. J. Environ. Qual. 2013, 42, 1509–1516.
McKinley, V.L.; Vestal, J.R. Biokinetic analyses of adaptation and succession: Microbial activity in composting municipal sewage sludge. Appl. Environ. Microbiol. 1984, 47, 933–941.
Zhou, P.; Li, Z.; Snowling, S.; Goel, R.; Zhang, Q. Multi-step ahead prediction of hourly influent characteristics for wastewater treatment plants: A case study from North America. Environ. Monit. Assess. 2022, 194, 389.
Zhou, P.; Li, Z.; Snowling, S.; Barclay, J. Unraveling the impact of COVID-19 lockdowns on Canadian municipal sewage. Environ. Sci. Water Res. Technol. 2023, 9, 2213–2218.
Venkatesh, B.; Anuradha, J. A review of feature selection and its methods. Cybern. Inf. Technol. 2019, 19, 3–26.
Pearce, T.; Leibfried, F.; Brintrup, A. Uncertainty in neural networks: Approximately Bayesian ensembling. Proc. Mach. Learn. Res. 2020, 108, 6817–6827.
Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.

Figure 1. Stacked ensemble framework for influent quality prediction with SBS selection and model interpretability outputs.

Figure 2. Results of ensemble model runs with different input feature combinations for the three target variables (BOD₅, TP, and TS). R² is shown in blue (darker indicates better performance); for MAPE and RMSE, best performance is yellow and worst is red. # represents the number of features.

Figure 3. Scatter plots for the best-performing ensembles and their member models: BOD₅ (A–D), TP (E–H), and TS (I–L).

Figure 4. Feature importance for the best-performing ensemble models for BOD₅ (10 features), TP (18 features), and TS (20 features).

Figure 5. SHAP feature importance for constituent models: BOD₅ (A–C), TP (D–F), and TS (G–I).

Table 1. List of input variables used in the ensemble model and their relevant statistical measures.

Variable	Acronym	Unit	Max	Min	Mean	SD
Year	Year	Year	N/A	N/A	N/A	N/A
Month	Month	Month	N/A	N/A	N/A	N/A
Weekday/weekend	Wkdy/Wknd	Binary	N/A	N/A	N/A	N/A
Daily maximum temperature [2 m elevation corrected]	MaxT	°C	35.18	−15.08	13.67	11.08
Daily minimum temperature [2 m elevation corrected]	MinT	°C	25.79	−26.33	5.08	10.28
Hourly based daily average temperature [2 m elevation corrected]	AvgT	°C	29.09	−21.48	9.14	10.54
Heating degree day (HDD)	HDD	Degree-days	21.48	0	1.36	3.38
Cooling degree day (CDD)	CDD	Degree-days	30	0	1.54	8.45
Accumulated HDD for n days (n = 2, …, 7)	AccHDDn	Degree-days	N/A	N/A	N/A	N/A
Accumulated CDD for n days (n = 2, …, 7)	AccCDDn	Degree-days	N/A	N/A	N/A	N/A
Precipitation	Precip	mm	72.50	0	10.56	4.58
n-day accumulated precipitation	AccPrecip_n	mm	N/A	N/A	N/A	N/A
Snow depth	Snow	mm	0.03	0	0	0.01
Water flow	Flow	m³/s	1413.03	143.88	253.20	137.08
Solids n days ago (n = 1, …, 7)	S_nd	mg/L	N/A	N/A	N/A	N/A
Phosphorus n days ago (n = 1, …, 7)	P_nd	mg/L	N/A	N/A	N/A	N/A
BOD₅ n days ago (n = 1, …, 7)	B_nd	mg/L	N/A	N/A	N/A	N/A
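
The derived rows of Table 1 (accumulated HDD, CDD, and precipitation, and the lagged solids, phosphorus, and BOD₅ values) can be illustrated with a short sketch. This is not the authors' code: the function names, the input keys ("HDD", "CDD", "Precip", "TS", "TP", "BOD5"), the choice to include the current day in each accumulation window, and the 2 to 7 day window for precipitation are all assumptions made for illustration. Only the output names follow the acronyms in Table 1.

```python
from typing import Dict, List, Optional, Sequence


def lagged(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Return the value observed n days earlier; None where no history exists."""
    if n < 1:
        raise ValueError("lag must be at least 1 day")
    keep = max(0, len(values) - n)  # number of values that have an n-day history
    return [None] * (len(values) - keep) + list(values[:keep])


def accumulated(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Sum over the current day and the n - 1 preceding days.

    Assumption: the window includes the current day. Days with fewer than
    n observations available are returned as None.
    """
    if n < 1:
        raise ValueError("window must be at least 1 day")
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < n:
            out.append(None)
        else:
            out.append(sum(values[i - n + 1 : i + 1]))
    return out


def build_features(
    daily: Dict[str, Sequence[float]],
    lags: Sequence[int] = tuple(range(1, 8)),     # n = 1, ..., 7 as in Table 1
    windows: Sequence[int] = tuple(range(2, 8)),  # n = 2, ..., 7 as in Table 1
) -> Dict[str, List[Optional[float]]]:
    """Build accumulated and lagged columns from daily series (hypothetical keys)."""
    features: Dict[str, List[Optional[float]]] = {}
    for n in windows:
        features[f"AccHDD{n}"] = accumulated(daily["HDD"], n)
        features[f"AccCDD{n}"] = accumulated(daily["CDD"], n)
        # Table 1 gives no range for precipitation; reusing 2..7 is an assumption.
        features[f"AccPrecip_{n}"] = accumulated(daily["Precip"], n)
    for n in lags:
        features[f"S_{n}d"] = lagged(daily["TS"], n)
        features[f"P_{n}d"] = lagged(daily["TP"], n)
        features[f"B_{n}d"] = lagged(daily["BOD5"], n)
    return features
```

Rows that contain None (the first few days of the record) would need to be dropped or imputed before model training; which of the two the study used is not stated in this section.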
