A Machine Learning-Based Data-Driven Model for Predicting Wastewater Quality Parameters in the Industrial Domain
Abstract
1. Introduction
- Implementation and customization of eight ML algorithms designed to address specific characteristics of water conductivity;
- Comparative analysis identifying the optimal approach using performance metrics;
- Development of the HGBRCond model for predicting final water conductivity (C2) during the biodegradation of synthetic wastewater, combining the speed and accuracy of gradient boosting with hyperparameter optimization;
- Rigorous statistical validation using 10-fold cross-validation;
- Sensitivity analysis and confidence intervals (CI) demonstrating model robustness and calibration;
- Multi-level interpretability using feature importance, Morris screening and SHAP analysis to quantify the contribution of water initial conductivity (C1), dissolved oxygen (O1) and flowrate (FR) features to final conductivity (C2) prediction.
2. Materials and Methods
2.1. Experimental Design and Wastewater Treatment System
2.1.1. Installation Description
2.1.2. Materials, Reagents, and Chemicals
2.1.3. Operating Conditions and Experimental Protocol
2.2. Machine Learning Framework
3. Results
3.1. Biodegradability Evolution
3.1.1. Biocenosis Evolution Analysis
3.1.2. Analysis of pH, BOD and COD Parameters Evolution
3.2. HGBRCond Mathematical Model Development
The HGBRCond prediction is built through the gradient-boosting recursion $F_j(x_i) = F_{j-1}(x_i) + \nu\, h_j(x_i)$, $j = 1, \dots, M$, with final prediction $F_M(x_i)$, where:
- $F_0(x)$ —the initial prediction, obtained through loss function minimization over the training data set (in this case, 323.7071, the mean of the C2 training values);
- $\nu$ —the learning rate (in this case, 0.01), which scales each estimator's contribution and thereby helps prevent overfitting;
- $M$ —the number of boosting iterations (sequential estimators), in this case set to 5000;
- $h_j(x_i)$ —the histogram-based estimators trained on pseudo-residuals (negative gradients): $h_j$ is the j-th histogram-based decision tree fitted on the negative gradient of the loss function (the residuals) from iteration j−1, so it predicts those residuals rather than the target conductivity directly;
- $x_i$ —observation $i$;
- $F_{j-1}(x_i)$ —the cumulative prediction after j−1 iterations (a code sketch follows this list).
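For concreteness, the additive scheme above can be sketched with scikit-learn's HistGradientBoostingRegressor, a histogram-based gradient boosting implementation of the kind the paper describes; the file name conductivity_dataset.csv and the column layout are hypothetical placeholders for the paper's dataset, while ν = 0.01 and M = 5000 follow the values listed above.

```python
# Minimal sketch of the boosting recursion F_j = F_{j-1} + nu * h_j, assuming
# scikit-learn's HistGradientBoostingRegressor; the file name and columns are
# hypothetical placeholders for the paper's dataset.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

df = pd.read_csv("conductivity_dataset.csv")      # hypothetical file
X, y = df[["FR", "C1", "O1"]], df["C2"]

f0 = y.mean()  # F_0(x): constant initial prediction (~323.71 on the training set)

# nu = 0.01 (learning_rate) and M = 5000 (max_iter), as defined above
model = HistGradientBoostingRegressor(learning_rate=0.01, max_iter=5000)
model.fit(X, y)           # internally builds h_1 ... h_M on pseudo-residuals
y_hat = model.predict(X)  # F_M(x_i): the final conductivity prediction
```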
3.2.1. HGBRCond Model Validation, Sensitivity Analysis and Morris Method Screening
- Data splitting: The dataset was initially split into training (70%, n = 424) and testing (30%, n = 182) sets. The test set was never used during model development, hyperparameter tuning, or any other evaluation;
- Parameter optimization: This stage was performed exclusively on the training dataset (70%) using 10-fold cross-validation. Within each fold, the training set was subdivided into 90% for training and 10% for validation, and the hyperparameters were optimized by averaging the performance across all 10 folds. Importantly, no information from the test set was used at any stage of parameter identification;
- Final model evaluation: Once the optimal parameters were identified through 10-fold cross-validation on the training set, the HGBRCond model was trained on the entire training set (70%) and then evaluated once on the reserved test dataset (30%). The performance metrics reported in Table 9 (R2 = 0.8772 ± 0.0110, RMSE = 10.2353 ± 0.5409 μS/cm, and MAE = 4.8599 ± 0.2388 μS/cm) reflect this final evaluation on unseen data (a sketch of the full protocol follows this list).
- R2 = 0.8772 ± 0.0110: the model explains 87.72% of the variance in the target variable (conductivity C2), and the low standard deviation (SD = 0.0110) confirms stable performance across folds; an R2 > 0.7 indicates excellent predictive capability for the HGBRCond model;
- RMSE = 10.2353 ± 0.5409 μS/cm: the root mean square error is 10.24 μS/cm; relative to the target variable (conductivity C2) range (285–360 μS/cm; range = 75 μS/cm), this represents 13.65% of the target scale, a good performance per empirical criteria (RMSE of 10–20% of scale: good; <10%: excellent); the low standard deviation (SD = 0.54) demonstrates consistent, stable behavior across validation sets;
- MAE = 4.8599 ± 0.2388 μS/cm: the mean absolute error is excellent at under 5 μS/cm, with minimal variability across folds.
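A minimal sketch of this protocol, assuming scikit-learn idioms, the hypothetical conductivity_dataset.csv from the previous sketch, and an illustrative random_state (the paper's split seed is not stated):

```python
# Sketch of the 70/30 split with 10-fold cross-validation on the training
# portion only; the test set is touched exactly once at the end.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate, train_test_split

df = pd.read_csv("conductivity_dataset.csv")      # hypothetical file
X, y = df[["FR", "C1", "O1"]], df["C2"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=42)

model = HistGradientBoostingRegressor(learning_rate=0.01, max_iter=5000)
cv = cross_validate(model, X_train, y_train, cv=10,
                    scoring=("r2", "neg_root_mean_squared_error",
                             "neg_mean_absolute_error"))
print(f"R2   = {cv['test_r2'].mean():.4f} ± {cv['test_r2'].std():.4f}")
print(f"RMSE = {-cv['test_neg_root_mean_squared_error'].mean():.4f} ± "
      f"{cv['test_neg_root_mean_squared_error'].std():.4f} uS/cm")
print(f"MAE  = {-cv['test_neg_mean_absolute_error'].mean():.4f} ± "
      f"{cv['test_neg_mean_absolute_error'].std():.4f} uS/cm")

model.fit(X_train, y_train)                         # refit on full training set
print("Held-out R2:", model.score(X_test, y_test))  # single final evaluation
```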
3.2.2. HGBRCond Model Feature Importance and SHAP Analysis
3.2.3. HGBRCond Model Predictions for Water Conductivity
4. Discussion
- While LSTM (R2 = 0.88), XGBoost (R2 = 0.82), and ANN combined with PCA (R2 = 0.88) deliver comparable or superior performance, they operate on batch, full-scale, or historical data, which limits their applicability to continuous monitoring; the HGBRCond model (R2 = 0.877), by contrast, uses pilot-plant data, which allows it to be tested and validated before possible implementation at industrial scale;
- The HGBRCond model’s main advantage is stability, highlighted through cross-validation (SD = 0.011) and sensitivity analysis;
- Unlike more complex models such as the hybrid CNN-LSTM (RMSE = 53.83 µS/cm) or LSTM, which require substantial computational resources, the HGBRCond model uses simple, directly measurable predictors (O1 and C1), eliminating the need for complex laboratory analyses or multiple parameters (such as DO, pH, BOD, COD, or NH4);
- Faster training (63 s vs. 150–400 s for comparable dataset sizes), enabling rapid model development and iterative optimization;
- It operates with 67% fewer parameters, requiring only two measurable predictors (O1 and C1) compared to 4–6 features (DO, pH, BOD, COD, NH4) typically needed by XGBoost and GBR, substantially reducing sensor infrastructure costs and system complexity;
- Its predictions are 50–78% more stable (SD = 0.011 vs. 0.02–0.04 for comparable ensemble methods), demonstrating superior robustness;
- Provides native missing value handling;
- Superior interpretability through SHAP screening, detecting O1 dominance (98% relative importance);
- It achieves 6.8% higher accuracy (R2 = 0.877 vs. 0.82) despite being trained on pilot-scale data rather than full industrial datasets, demonstrating a good performance.
5. Conclusions
- Model validation across multiple flowrates and extended conductivity ranges under controlled experimental conditions; testing on real industrial wastewater data to evaluate performance under real conditions; full-scale validation in operational industrial plants; evaluation of the model's computational performance under real production conditions (experimental validation plan);
- Extension of the model to predict additional wastewater parameters (such as COD, BOD, and pH) for a more detailed analysis of wastewater treatment efficiency; development of dedicated ML-based models for predicting the biodegradation of specific toxic pollutants (pesticides, phenols, cyanides, petroleum tars); comparison of computational time with alternative models under industrial operating conditions (model extension strategy).
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| ANN | Artificial Neural Network |
| BOD | Biochemical oxygen demand |
| COD | Chemical oxygen demand |
| CNN-LSTM | Convolutional Neural Network-Long Short-Term Memory |
| CI | Confidence Intervals |
| CV | Coefficient of variation |
| DTR | Decision Tree Regression |
| DO | Dissolved oxygen |
| EVS | Explained variance score |
| FR | Water flowrate |
| GB | Gradient Boosting model |
| GBR | Gradient Boosting Regression |
| CFU/mL | Total bacterial colony count |
| GPR | Gaussian process regression |
| HGBR | Histogram-based Gradient Boosting Regression |
| HGBRCond | Histogram-based Gradient Boosting Regression proposed mathematical model for water conductivity prediction |
| IDS | Intelligent Digital Sensor |
| KNR | KNeighbors Regression |
| KNN | k-nearest neighbors |
| LR | Linear Regression |
| LSTM | Long Short-Term Memory |
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
| MedAE | Median absolute error |
| ML | Machine Learning |
| MLP | Multilayer perceptron |
| MM | Mineral medium |
| OECD | Organisation for Economic Co-operation and Development |
| PCA | Principal Component Analysis |
| R2 Score | Coefficient of determination |
| RFR | Random Forest Regression |
| RMSE | Root Mean Square Error |
| RR | Ridge Regression |
| SD | Standard Deviation |
| SVR | Support Vector Regression |
| SVM | Support Vector Machine |
| SHAP | SHapley Additive exPlanations |
| TOC | Total organic carbon |
| TEM | Transmission electron microscopy |
| TDS | Total Dissolved Solids |
| WWTP | Wastewater treatment plants |
| WQI | Water quality index |
References
- Chen, M.; Li, Y.; Jiang, X.; Zhao, D.; Liu, X.; Zhou, J.; He, Z.; Zheng, C.; Pan, X. Study on soil physical structure after the bioremediation of Pb pollution using microbial-induced carbonate precipitation methodology. J. Hazard. Mater. 2021, 411, 125103.
- Chang, Y.C.; Peng, Y.-P.; Chen, K.-F.; Chen, T.-Y.; Tang, C.-T. The effect of different in situ chemical oxidation (ISCO) technologies on the survival of indigenous microbes and the remediation of petroleum hydrocarbon-contaminated soil. Process Saf. Environ. Prot. 2022, 163, 105–115.
- OECD. Guideline for Testing of Chemicals-301, Adopted by Council on 17 July 1992. Available online: https://www.google.ro/books/edition/OECD_Guidelines_for_the_Tesing_of_Chemi/7s5yoSa3vykC?hl=en&gbpv=1&printsec=frontcover (accessed on 15 October 2025).
- Su, Y.; Cheng, Z.; Hou, Y.; Lin, S.; Gao, L.; Wang, Z.; Bao, R.; Peng, L. Biodegradable and conventional microplastics posed similar toxicity to marine algae Chlorella vulgaris. Aquat. Toxicol. 2022, 244, 106097.
- Murdock, J.N.; Wetzel, D. FT-IR Microspectroscopy Enhances Biological and Ecological Analysis of Algae. Appl. Spectrosc. Rev. 2009, 44, 335–361.
- Traverso-Soto, J.M.; Figueredo, M.; Punta-Sánchez, I.; Campana, O.; Ciufegni, E.; Hampel, M.; Buoninsegni, J.; Quiñones, M.A.M.; Anfuso, G. Assessment of Organic Pollutants Desorbed from Plastic Litter Items Stranded on Cadiz Beaches (SW Spain). Toxics 2025, 13, 673.
- Davis, A.B.; Evans, M.; McKindles, K.; Lee, J. Co-Occurrence of Toxic Bloom-Forming Cyanobacteria Planktothrix, Cyanophage, and Symbiotic Bacteria in Ohio Water Treatment Waste: Implications for Harmful Algal Bloom Management. Toxins 2025, 17, 450.
- Renganathan, P.; Gaysina, L.A.; Gutiérrez, C.G.; Puente, E.O.R.; Sainz-Hernández, J.C. Harnessing Engineered Microbial Consortia for Xenobiotic Bioremediation: Integrating Multi-Omics and AI for Next-Generation Wastewater Treatment. J. Xenobiot. 2025, 15, 133.
- Wolff, D.; Krah, D.; Dötsch, A.; Ghattas, A.; Wick, A.; Ternes, T. Insights into the variability of microbial community composition and micropollutant degradation in diverse biological wastewater treatment systems. Water Res. 2018, 143, 313–324.
- Saini, S.; Tewari, S.; Dwivedi, J.; Sharma, V. Biofilm-mediated wastewater treatment: A comprehensive review. Mater. Adv. 2023, 4, 1415–1443.
- Xiong, H.; Zhou, X.; Cao, Z.; Xu, A.; Dong, W.; Jiang, M. Microbial biofilms as a platform for diverse biocatalytic applications. Bioresour. Technol. 2024, 386, 129396.
- Negri, F.; Galeazzi, A.; Gallo, F.; Manenti, F. Reshaping Industrial Maintenance with Machine Learning: Fouling Control Using Optimized Gaussian Process Regression. Ind. Eng. Chem. Res. 2025, 64, 6633–6654.
- Li, Y.; Xu, J.; Anastasiu, D.C. An Extreme-Adaptive Time Series Model Based on Probability-Enhanced LSTM Neural Networks. Proc. AAAI Conf. Artif. Intell. 2023, 37, 8684–8691.
- Karbasi, M.; Ali, M.; Bateni, S.M.; Jun, C.; Jamei, M.; Farooque, A.A.; Yaseen, Z.M. Multi-step ahead forecasting of electrical conductivity in rivers by using a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model enhanced by Boruta-XGBoost feature selection algorithm. Sci. Rep. 2024, 14, 1991.
- Hridoy, A.M.; Shawkat, A.I.; Bordin, C.; Acharjee, M.R.; Masood, A.; Baki, A.O.; Al Mamun, A. Advanced machine learning models for accurate water quality classification and WQI prediction: Implications for aquatic disease risk management. Sci. Total Environ. 2025, 1008, 180965.
- Cechinel, M.A.P.; Neves, J.; Fuck, J.V.R.; de Andrade, R.C.; Spogis, N.; Riella, H.G.; Padoin, N.; Soares, C. Enhancing wastewater treatment efficiency through machine learning-driven effluent quality prediction: A plant-level analysis. J. Water Process Eng. 2024, 58, 104758.
- Dikmen, F.; Demir, A.; Özkaya, B.; Raza, M.O.; Rasheed, J.; Asuroglu, T.; Alsubai, S. AI-driven wastewater management through comparative analysis of feature selection techniques and predictive models. Sci. Rep. 2025, 15, 25347.
- Dong, Z.; Wang, J.; Ye, G.; Wang, Y. Data-driven prediction of effluent quality in wastewater treatment processes: Model performance optimization and missing-data handling. J. Water Process Eng. 2025, 71, 107352.
- Lv, J.; Du, L.; Lin, H.; Wang, B.; Yin, W.; Song, Y.; Chen, J.; Yang, J.; Wang, A.; Wang, H. Enhancing effluent quality prediction in wastewater treatment plants through the integration of factor analysis and machine learning. Bioresour. Technol. 2024, 393, 130008.
- Yin, H.; Chen, Y.; Zhou, J.; Xie, Y.; Wei, Q.; Xu, Z. A probabilistic deep learning approach to enhance the prediction of wastewater treatment plant effluent quality under shocking load events. Water Res. X 2025, 26, 100291.
- Fitriyani, N.; Syafrudin, M.; Chamidah, N.; Rifada, M.; Susilo, H.; Aydin, D.; Qolbiyani, S.L.; Lee, S.W. A Novel Approach Utilizing Bagging, Histogram Gradient Boosting, and Advanced Feature Selection for Predicting the Onset of Cardiovascular Diseases. Mathematics 2025, 13, 2194.
- Zamfir, F.-S.; Carbureanu, M.; Mihalache, S.F. Application of Machine Learning Models in Optimizing Wastewater Treatment Processes: A Review. Appl. Sci. 2025, 15, 8360.
- Grbčić, L.; Druzeta, S.; Kranjčević, L. Water distribution network leak localization with histogram-based gradient boosting. J. Hydroinform. 2023, 25, 663–684.
- Makumbura, R.K.; Mampitiya, L.; Rathnayake, N.; Meddage, D.; Henna, S.; Dang, T.L.; Hoshino, Y.; Rathnayake, U. Advancing Water Quality Assessment and Prediction Using Machine Learning Models, Coupled with Explainable Artificial Intelligence (XAI) Techniques Like Shapley Additive Explanations (SHAP) for Interpreting the Black-Box Nature. Results Eng. 2024, 23, 102831.
- Bhuria, R.; Gill, K.S.; Upadhyay, D.; Devliyal, S. Predicting Water Purity by Riding the Ensemble Waves with Gradient Boosting Classification Technique. In Proceedings of the 2024 2nd International Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India, 10–12 July 2024; pp. 1365–1368.
- Nagarajan, G.; Reddy, N.K.; Kumar, Y.V.; Reddy, A.; Thota, C. Water Quality Classification Using XG Boost. In Proceedings of the 2024 4th International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT), Pune, India, 22–23 March 2024; Volume 190, pp. 1–3.
- Sharma, J.; Gill, K.S.; Kumar, M. Innovating Water Purity Analysis with Gradient Boosting Classification Techniques. In Applied Intelligence and Computing; SCRS: Delhi, India, 2023; pp. 159–168.
- Sattari, M.T.; Mirabbasi, R.; Shamsi Sushab, R.; Abraham, J. Prediction of Groundwater Level in Ardebil Plain Using Support Vector Regression and M5 Tree Model. Ground Water 2018, 56, 636–646.
- Ainapure, B.; Baheti, N.; Buch, J.; Appasani, B.; Jha, A.V.; Srinivasulu, A. Drinking water potability prediction using machine learning approaches: A case study of Indian rivers. Water Pract. Technol. 2023, 18, 3004–3020.
- Nguyen, T.T.; Le, H.T.T. Water Level Prediction at TICH-BUI River in Vietnam Using Support Vector Regression. In Proceedings of the 2019 International Conference on Machine Learning and Cybernetics (ICMLC), Kobe, Japan, 7–10 July 2019; pp. 1–6.
- Sarkar, H.; Goriwale, S.S.; Ghosh, J.K.; Ojha, C.S.P.; Ghosh, S.K. Potential of machine learning algorithms in groundwater level prediction using temporal gravity data. Groundw. Sustain. Dev. 2024, 25, 101114.
- Oliveira-Esquerre, K.P.; Mori, M.; Bruns, R. Simulation of an industrial wastewater treatment plant using artificial neural networks and principal components analysis. Braz. J. Chem. Eng. 2002, 19, 365–372.
- Tchobanoglous, G.; Burton, F.L.; Stensel, H.D. Wastewater Engineering: Treatment and Reuse, 4th ed.; McGraw-Hill: New York, NY, USA, 2003.
- Prabu, P.; Alluhaidan, A.S.; Aziz, R.; Basheer, S. AquaFlowNet: a machine learning based framework for real time wastewater flow management and optimization. Sci. Rep. 2025, 15, 19182.
- Rasool, J.M.; Somashekar, J.A. A Comprehensive Review of Machine Learning Applications in Wastewater Treatment: Current State, Comparative Analysis, and Future Directions. J. Innov. Technol. 2025, 2025, 1–14.
- Hossen, A.M.; Salam, T. Advancing Water Quality Assessment: Leveraging XGBoost for Precise Predictive Modeling; A Machine Learning Technique. In Proceedings of the 2024 IEEE International Conference on Computing, Applications and Systems (COMPAS), Chattogram, Bangladesh, 25–26 September 2024; pp. 1–6.
- Gheorghe, C.G.; Dusescu, C.; Carbureanu, M. Asphaltenes biodegradation in biosystems adapted on selective media. Rev. Chim. 2016, 67, 2106–2110.
- Popovici, D.R.; Gheorghe, C.G.; Dusescu Vasile, C.M. Assessment of the Active Sludge Microorganisms Population During Wastewater Treatment in a Micro-Pilot Plant. Bioengineering 2024, 11, 1306.
- Eshamuddin, M.; Zuccaro, G.; Nourrit, G.; Albasi, C. The influence of process operating conditions on the microbial community structure in the moving bed biofilm reactor at phylum and class level: A review. J. Environ. Chem. Eng. 2024, 12, 113266.
- Gheorghe, C.G.; Dusescu-Vasile, C.M.; Popovici, D.R.; Bombos, D.; Dragomir, R.E.; Dima, F.M.; Bajan, M.; Vasilievici, G. Monitoring the Biodegradation Progress of Naphthenic Acids in the Presence of Spirulina platensis Algae. Toxics 2025, 13, 368.
- Manga, M.; Boutikos, P.; Semiyaga, S.; Olabinjo, O.; Muoghalu, C.C. Biochar and its potential application for the improvement of the anaerobic digestion process: A critical review. Energies 2022, 16, 4051.
- Hassan, A.; Hamid, F.; Pariatamby, A.; Suhaimi, N.; Razali, N.; Ling, K.; Mohan, P. Bioaugmentation-assisted bioremediation and biodegradation mechanisms for PCB in contaminated environments: A review on sustainable clean-up technologies. J. Environ. Chem. Eng. 2023, 11, 110055.
- Chakraborty, S.; Talukdar, A.; Dey, S.; Bhattacharya, S. Role of fungi, bacteria and microalgae in bioremediation of emerging pollutants with special reference to pesticides, heavy metals and pharmaceuticals. Discov. Environ. 2025, 3, 91.
- Yang, Z.; Peng, C.; Cao, H.; Song, J.; Gong, B.; Li, L.; Wang, L.; He, Y.; Liang, M.; Lin, J.; et al. Microbial functional assemblages predicted by the FAPROTAX analysis are impacted by physicochemical properties, but C, N and S cycling genes are not in mangrove soil in the Beibu Gulf, China. Ecol. Indic. 2022, 139, 108887.
- Tyagi, I.; Tyagi, K.; Ahamad, F.; Bhutiani, R.; Kumar, V. Assessment of bacterial community structure, associated functional role, and water health in full-scale municipal wastewater treatment plants. Toxics 2024, 13, 3.
- La Cognata, R.; Piazza, S.; Freni, G. Pollutant Monitoring Solutions in Water and Sewerage Networks: A Scoping Review. Water 2025, 17, 1423.
- Carbureanu, M.; Roșca, C.-M. Evaluating Wastewater pH Prediction Solutions in Custom Python and C# Models. In Proceedings of the 5th International Conference on Emerging Trends and Technologies on Intelligent Systems, Noida, India, 27–28 March 2025; pp. 19–21.
- Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.
- Srisuradetchai, P.; Suksrikran, K. Random kernel k-nearest neighbors regression. Front. Big Data 2024, 7, 1402384.
- Schreiber-Gregory, D.N. Ridge Regression and Multicollinearity: An In-Depth Review. Model. Assist. Stat. Appl. 2018, 13, 359–365.
- Kassim, N.M.; Santhiran, S.; Alkahtani, A.A.; Islam, M.A.; Tiong, S.K.; Mohd Yusof, M.Y.; Amin, N. An Adaptive Decision Tree Regression Modeling for the Output Power of Large-Scale Solar (LSS) Farm Forecasting. Sustainability 2023, 15, 13521.
- Singh, U.; Rizwan, M.; Alaraj, M.; Alsaidan, I. A Machine Learning-Based Gradient Boosting Regression Approach for Wind Power Production Forecasting: A Step towards Smart Grid Environments. Energies 2021, 14, 5196.
- Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623.
- Chapter 2.6.12: Biological tests. Microbial examination of non-sterile products: total viable aerobic count (plate count methods). In European Pharmacopoeia 5.0; Council of Europe: Strasbourg, France, 2004; p. 154.
- Validation of microbial recovery from pharmacopeial articles, Chapter 1227: Estimating the number of colony-forming units. In USP Pharmacopeia 29; The United States Pharmacopeial Convention: Frederick, MD, USA, 2021.
- SR EN ISO 5667-15:2010; Water Quality. Sampling. Part 15: Guidance on the Preservation and Handling of Sludge and Sediment Samples. Asociația Română de Standardizare (ASRO): București, Romania, 2010.
- Friedman, J. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3147.
- Oliveira, R.I.; Orenstein, P.; Ramos, T.; Romano, J.V. Split conformal prediction and non-exchangeable data. J. Mach. Learn. Res. 2024, 25, 1–38.
- Morris, M.D. Factorial Sampling Plans for Preliminary Computational Experiments. Technometrics 1991, 33, 161–174.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Analysis period, given as adaptation period/test period:

| Microscopic Examinations | 2 h/4 days | 8 h/8 days | 10 h/12 days | 24 h/12 days | 30 h/5 days | 48 h/18 days | 56 h/20 days |
|---|---|---|---|---|---|---|---|
| CFU/mL | | | | | | | |
| Total aerobic bacteria | 2 × 10⁵/5 × 10⁷ | 3.5 × 10⁵/6 × 10⁷ | 5 × 10⁵/8 × 10⁷ | 7 × 10⁵/0.2 × 10⁸ | 8 × 10⁵/1 × 10⁸ | 0.5 × 10⁶/1.5 × 10⁸ | 2 × 10⁶/2 × 10⁸ |
| Abundance [%] | | | | | | | |
| Paramecium caudatum | 4/10 | 4/12 | 8/13 | 12/25 | 15/28 | 15/20 | 15/18 |
| Colpidium colpoda | -/2 | 1/4 | 2/7 | 4/8 | 4/6 | 2/5 | 2/8 |
| Stentor | -/- | -/- | -/- | -/- | -/- | -/5 | 2/7 |
| Aspidisca polystila | -/- | -/- | -/- | -/4 | -/4 | 1/7 | 2/5 |
| Vorticella microstoma | -/- | -/- | -/- | -/4 | -/2 | 1/2 | 1/4 |
| Litonotus setigerum | -/- | -/- | -/5 | 2/6 | 2/8 | 3/10 | 3/10 |
| Zoogloea ramigera | 5/10 | 5/10 | 20/25 | 20/40 | 20/40 | 20/45 | 25/50 |
| Rotifers sp. | -/- | -/- | -/- | -/4 | -/4 | -/8 | -/10 |
| Sample | FR [L/h] (Input) | C1 [μS/cm] (Input) | O1 [mg/L] (Input) | C2 [μS/cm] (Output) |
|---|---|---|---|---|
| 1 | 0.5 | 285 | 3 | 350 |
| 2 | 0.5 | 290 | 4 | 350 |
| 3 | 0.5 | 290 | 4 | 355 |
| … | … | … | … | … |
| 423 | 0.5 | 295 | 6.1 | 290 |
| 424 | 0.5 | 295 | 6.1 | 285 |
| Parameter | Range | Mean | SD | Mean ± SD |
|---|---|---|---|---|
| FR [L/h] | 0.5 (Constant) | 0.5 | 0 | [0.5; 0.5] |
| C1 [μS/cm] | [285; 300] | 294.7636 | 4.5863 | [290.1773; 299.3499] |
| O1 [mg/L] | [3; 6.49] | 5.4065 | 1.0108 | [4.3957; 6.4172] |
| C2 [μS/cm] | [285; 360] | 324.0449 | 30.5363 | [293.5086; 354.5812] |
| Model | Hyperparameter | Search Range | Selected Value |
|---|---|---|---|
| HGBR | learning_rate | [0.01, 0.05, 0.1, 0.2] | 0.2 |
| | max_iter | [100, 200, 400, 500] | 500 |
| | max_depth | [None, 3, 5, 7, 9] | 7 |
| | min_samples_leaf | [10, 20] | 20 |
| | max_bins | [255] | 255 |
| | L2_regularization | [0] | 0 |
| GBR | n_estimators | [100, 500, 1000, 1500] | 500 |
| | max_depth | [3, 4] | 3 |
| | min_samples_split | [2, 3, 5] | 2 |
| | learning_rate | [0.01, 0.02, 0.1] | 0.01 |
| | loss | [‘squared_error’, ‘absolute_error’] | ‘squared_error’ |
| LR | fit_intercept | [False, True] | True |
| | copy_X | [False, True] | True |
| | n_jobs | [None, −1, 1, 2] | None |
| | positive | [False, True] | False |
| RR | alpha | [0.01, 0.1, 1.0, 2.0, 5.0] | 1.0 |
| | copy_X | [False, True] | True |
| | fit_intercept | [False, True] | True |
| | positive | [False, True] | False |
| SVR | kernel | [‘rbf’] | ‘rbf’ |
| | degree | [3, 4] | 3 |
| | gamma | [‘scale’, ‘auto’] | ‘auto’ |
| | coef0 | [0.0] | 0.0 |
| | C | [0.1, 1.0, 6.01, 8.53, 32.2] | 8.53 |
| | epsilon | [0.0002, 0.0053, 0.0079, 0.1] | 0.0079 |
| DTR | max_depth | [None, 5, 10] | 10 |
| | min_samples_split | [2, 5, 10] | 5 |
| | min_samples_leaf | [1, 5] | 1 |
| | max_features | [None, ‘sqrt’] | ‘sqrt’ |
| | max_leaf_nodes | [None, 20, 50] | None |
| | criterion | [‘squared_error’, ‘absolute_error’] | ‘squared_error’ |
| | ccp_alpha | [0.0, 0.01] | 0.0 |
| KNR | n_neighbors | [3, 5, 10, 14, 15] | 15 |
| | weights | [‘uniform’, ‘distance’] | ‘distance’ |
| | p | [2] | 2 |
| | metric | [‘minkowski’, ‘manhattan’] | ‘minkowski’ |
| RFR | max_depth | [None, 10, 20] | 20 |
| | max_features | [1.0, 2, 5] | 2 |
| | min_samples_split | [2, 4, 5] | 4 |
| | min_samples_leaf | [1] | 1 |
| | n_estimators | [100, 300, 500, 1000] | 1000 |
| | criterion | [‘squared_error’, ‘absolute_error’] | ‘squared_error’ |
Best identified performance metrics for the analyzed ML models:

| Model | EVS | R2 Score | MAE [μS/cm] | MAPE [%] | MedAE [μS/cm] |
|---|---|---|---|---|---|
| HGBR | 0.9468 | 0.9468 | 3.6718 | 1.19 | 1.3608 |
| GBR | 0.8603 | 0.8602 | 6.1167 | 1.96 | 1.7135 |
| LR | 0.7665 | 0.7661 | 10.8594 | 3.47 | 5.9093 |
| RR | 0.7668 | 0.7664 | 10.8836 | 3.47 | 6.1132 |
| RFR | 0.8561 | 0.8561 | 6.0276 | 1.95 | 2.0543 |
| SVR | 0.8390 | 0.8375 | 6.2120 | 2.03 | 1.0392 |
| KNR | 0.9444 | 0.9442 | 3.7009 | 1.20 | 1.3273 |
| DTR | 0.8566 | 0.8566 | 5.9221 | 1.92 | 2.0833 |
| Test No. | Learning_Rate | Max_Iter | Max_Depth | Min_Samples_Leaf | Max_Bins | L2_Regularization |
|---|---|---|---|---|---|---|
| Test 1 | 0.1 | 100 | None | 20 | 255 | 0 |
| Test 2 | 0.05 | 500 | 9 | 20 | 255 | 0 |
| Test 3 | 0.2 | 500 | 7 | 20 | 255 | 0 |
| Test 4 | 0.01 | 200 | 3 | 10 | 255 | 0 |
| Test 5 | 0.05 | 400 | 5 | 20 | 255 | 0 |
| Selected | 0.2 | 500 | 7 | 20 | 255 | 0 |
| R2 Score | Adjusted R2 | MAE [μS/cm] | RMSE [μS/cm] |
|---|---|---|---|
| 0.95 | 0.94 | 3.74 | 7.97 |
| Hyperparameter | Tested Interval | Optimal Value |
|---|---|---|
| learning_rate | [0.001, 0.005, 0.01, 0.05, 0.1, 0.2] | 0.001 |
| max_iter | [500, 1000, 2000, 3000, 5000, 7000, 10,000] | 500 |
| max_depth | [5, 10, 15, 18, 20, 25, 30] | 5 |
| min_samples_leaf | [1, 2, 5, 10, 20, 30, 50] | 20 |
| max_bins | [32, 64, 128, 255] | 128 |
| L2_regularization | [0, 0.001, 0.01, 0.1, 0.5, 1, 2] | 2 |
| k-Fold | R2 Score | RMSE [μS/cm] | MAE [μS/cm] |
|---|---|---|---|
| 2-fold | 0.8755 | 10.5479 | 5.0227 |
| 3-fold | 0.8862 | 9.9946 | 4.7526 |
| 4-fold | 0.8678 | 10.5765 | 5.0011 |
| 5-fold | 0.8672 | 10.6579 | 5.0326 |
| 6-fold | 0.8909 | 9.7889 | 4.5984 |
| 7-fold | 0.8792 | 9.9775 | 4.8193 |
| 8-fold | 0.8891 | 9.6402 | 4.5545 |
| 9-fold | 0.8730 | 10.0580 | 4.8831 |
| 10-fold | 0.8862 | 9.7411 | 4.6224 |
| Mean ± SD | 0.8772 ± 0.0110 | 10.2353 ± 0.5409 | 4.8599 ± 0.2388 |
| Hyperparameter | Values Tested | Optimal Value | Best R2 Score | R2 Score Range | Sensitivity |
|---|---|---|---|---|---|
| learning_rate | [0.001, 0.005, 0.01, 0.05, 0.1, 0.2] | 0.001 | 0.868468 | 0.000687 | Low |
| max_depth | [5, 10, 15, 18, 20, 25, 30] | 5 | 0.874545 | 0.006765 | Low |
| max_iter | [500, 1000, 2000, 3000, 5000, 7000, 10,000] | 500 | 0.868453 | 0.000671 | Low |
| min_samples_leaf | [1, 2, 5, 10, 20, 30, 50] | 20 | 0.891068 | 0.023616 | Medium |
| L2_regularization | [0, 0.001, 0.01, 0.1, 0.5, 1, 2] | 2 | 0.868355 | 0.000669 | Low |
| max_bins | [32, 64, 128, 255] | 128 | 0.867781 | 0.013517 | Low |
| Hyperparameter | Values Tested | Optimal Value | Best RMSE | RMSE Range | Sensitivity |
|---|---|---|---|---|---|
| learning_rate | [0.001, 0.005, 0.01, 0.05, 0.1, 0.2] | 0.001 | 10.552453 | 0.024079 | Low |
| max_depth | [5, 10, 15, 18, 20, 25, 30] | 5 | 10.332666 | 0.243861 | Medium |
| max_iter | [500, 1000, 2000, 3000, 5000, 7000, 10,000] | 500 | 10.552935 | 0.023582 | Low |
| min_samples_leaf | [1, 2, 5, 10, 20, 30, 50] | 20 | 9.705647 | 0.882672 | High |
| L2_regularization | [0, 0.001, 0.01, 0.1, 0.5, 1, 2] | 2 | 10.557375 | 0.023294 | Low |
| max_bins | [32, 64, 128, 255] | 128 | 10.576482 | 0.479255 | Medium |
| Hyperparameter | Values Tested | Optimal Value | Best MAE | MAE Range | Sensitivity |
|---|---|---|---|---|---|
| learning_rate | [0.001, 0.005, 0.01, 0.05, 0.1, 0.2] | 0.001 | 5.001112 | 0.075435 | Low |
| max_depth | [5, 10, 15, 18, 20, 25, 30] | 5 | 4.921218 | 0.080026 | Low |
| max_iter | [500, 1000, 2000, 3000, 5000, 7000, 10,000] | 500 | 5.000653 | 0.074295 | Low |
| min_samples_leaf | [1, 2, 5, 10, 20, 30, 50] | 20 | 4.695790 | 0.970615 | High |
| L2_regularization | [0, 0.001, 0.01, 0.1, 0.5, 1, 2] | 2 | 4.998375 | 0.009763 | Low |
| max_bins | [32, 64, 128, 255] | 128 | 5.001112 | 1.026032 | High |
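The sensitivity values in the three tables above follow a one-at-a-time scan: each hyperparameter sweeps its tested interval while the others remain at their optimal values, and the spread (range) of the cross-validated metric is recorded. A sketch under the same assumptions as the earlier snippets (hypothetical file name, sklearn idioms), showing two of the six grids:

```python
# One-at-a-time hyperparameter sensitivity scan (cf. Tables 10-12); only two
# of the six grids are shown here. Data loading mirrors the earlier sketches.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("conductivity_dataset.csv")      # hypothetical file
X, y = df[["FR", "C1", "O1"]], df["C2"]
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.30, random_state=42)

# Optimal values from the tuning table; vary one parameter at a time
base = dict(learning_rate=0.001, max_iter=500, max_depth=5,
            min_samples_leaf=20, max_bins=128, l2_regularization=2)
grids = {"learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1, 0.2],
         "min_samples_leaf": [1, 2, 5, 10, 20, 30, 50]}

for name, values in grids.items():
    scores = []
    for v in values:
        m = HistGradientBoostingRegressor(**{**base, name: v})
        scores.append(cross_val_score(m, X_train, y_train,
                                      cv=10, scoring="r2").mean())
    print(f"{name}: best R2 = {max(scores):.6f}, "
          f"range = {max(scores) - min(scores):.6f}")
```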
| Parameter | Relative Importance [%] | μ* (Importance) [μS/cm] | σ (Non-Linearity) [μS/cm] | σ/μ* |
|---|---|---|---|---|
| C1 [μS/cm] | 1.20 | 0.8552 ± 0.3663 | 0.6701 | 0.7835 |
| O1 [mg/L] | 98.79 | 70.3164 ± 25.7647 | 42.4077 | 0.6030 |
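A sketch of how such μ* and σ values can be computed with the SALib package (assumed tooling; the paper does not name its Morris implementation). The bounds follow the descriptive-statistics table, FR stays at its constant 0.5 L/h, and `model` is a fitted regressor from the earlier sketches:

```python
# Morris elementary-effects screening (cf. Table 13) via the SALib package.
import pandas as pd
from SALib.sample import morris as morris_sample
from SALib.analyze import morris as morris_analyze

problem = {"num_vars": 2,
           "names": ["C1", "O1"],
           "bounds": [[285.0, 300.0], [3.0, 6.49]]}  # from the stats table

X_m = morris_sample.sample(problem, N=500, num_levels=4)  # Morris trajectories
X_eval = pd.DataFrame({"FR": 0.5, "C1": X_m[:, 0], "O1": X_m[:, 1]})
Y = model.predict(X_eval)                                 # fitted HGBR model

res = morris_analyze.analyze(problem, X_m, Y, num_levels=4)
for name, mu_star, sigma in zip(problem["names"], res["mu_star"], res["sigma"]):
    print(f"{name}: mu* = {mu_star:.4f} uS/cm, sigma = {sigma:.4f} uS/cm")
```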
| Real | Prediction | CI_Lower | CI_Upper | Belongs to CI (Yes/No) |
|---|---|---|---|---|
| 345 | 345.000 | 317.500 | 372.500 | Yes |
| 285 | 290.004 | 262.504 | 317.504 | Yes |
| 340 | 340.000 | 312.500 | 367.500 | Yes |
| 354 | 353.815 | 326.315 | 381.315 | Yes |
| 360 | 360.000 | 332.500 | 387.500 | Yes |
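The near-constant half-width of these intervals (about 27.5 μS/cm) is consistent with a split conformal construction, as in the cited Oliveira et al. reference; the sketch below is an assumed reconstruction under the earlier snippets' setup, not necessarily the authors' exact procedure.

```python
# Split conformal prediction intervals (assumed reconstruction of Table 14):
# a calibration split supplies one residual quantile q, and each prediction
# receives the interval [y_hat - q, y_hat + q].
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("conductivity_dataset.csv")      # hypothetical file
X, y = df[["FR", "C1", "O1"]], df["C2"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=42)
X_fit, X_cal, y_fit, y_cal = train_test_split(X_train, y_train,
                                              test_size=0.25, random_state=0)

model = HistGradientBoostingRegressor(learning_rate=0.01, max_iter=5000)
model.fit(X_fit, y_fit)

alpha = 0.05                                  # nominal 95% coverage
resid = np.abs(y_cal - model.predict(X_cal))  # calibration residuals
n = len(resid)
q = np.quantile(resid, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))

pred = model.predict(X_test)
ci_lower, ci_upper = pred - q, pred + q       # CI_Lower / CI_Upper columns
coverage = ((y_test >= ci_lower) & (y_test <= ci_upper)).mean()
print(f"q = {q:.2f} uS/cm, empirical coverage = {coverage:.1%}")
```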
| Feature | Importance | SD | Rank | Relative Importance [%] |
|---|---|---|---|---|
| O1 [mg/L] | 1.916400 | 0.254464 | 1 | 98.860796 |
| C1 [μS/cm] | 0.022083 | 0.014882 | 2 | 1.139204 |
| FR [L/h] | 0.000000 | 0.000000 | 3 | 0.000000 |
| Feature | Mean(\|SHAP\|) [μS/cm] | Relative Importance [%] | Mean_SHAP | Std_SHAP | Min_SHAP | Max_SHAP |
|---|---|---|---|---|---|---|
| O1 [mg/L] | 29.493465 | 93.114865 | 5.286152186 | 30.37775185 | −36.20218777 | 39.03759729 |
| C1 [μS/cm] | 2.180817 | 6.885135 | −0.640565601 | 2.849973929 | −6.441771443 | 7.032448483 |
| FR [L/h] | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 |
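A sketch of how the mean(|SHAP|) and relative-importance columns can be reproduced with the shap package's model-agnostic Explainer (assumed usage; the exact SHAP configuration is not specified in the paper). `model`, `X_train`, and `X_test` come from the earlier sketches:

```python
# SHAP attributions for the table above using the `shap` package.
import numpy as np
import shap

explainer = shap.Explainer(model.predict, X_train)  # background = training data
sv = explainer(X_test)                              # SHAP values, shape (n, 3)

mean_abs = np.abs(sv.values).mean(axis=0)           # mean(|SHAP|) per feature
rel = 100.0 * mean_abs / mean_abs.sum()             # relative importance [%]
for name, m, r in zip(X_test.columns, mean_abs, rel):
    print(f"{name}: mean|SHAP| = {m:.4f} uS/cm, relative = {r:.2f}%")
```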
| Validation/ Robustness Test | Method | Key Metrics | Quantitative Results |
|---|---|---|---|
| Cross-validation | 10-fold cross-validation on the training set | Coefficient of variation (CV) | R2 = 0.8772 ± 0.0110 (CV = 2.3%); RMSE = 10.2353 ± 0.5409 µS/cm (CV = 5.9%); MAE = 4.8599 ± 0.2388 µS/cm (CV = 4.91%) (Table 8) |
| Sensitivity analysis | Hyperparameter variations | ΔR2, ΔRMSE, ΔMAE, CV (%) | Moderate sensitivity: min_samples_leaf: CV = 2.65%, ΔR2 = 0.024 (Table 10); max_depth: CV = 2.36%, ΔRMSE = 0.243 µS/cm (Table 11); max_bins: CV = 4.53%, ΔRMSE = 0.479 µS/cm (Table 11). High sensitivity: min_samples_leaf: CV = 9.09%, ΔRMSE = 0.88 µS/cm (Table 11); min_samples_leaf: CV = 20.67%, ΔMAE = 0.971 µS/cm (Table 12); max_bins: CV = 20.51%, ΔMAE = 1.026 µS/cm (Table 12) |
| Morris screening | Morris method | μ*, σ, relative importance | O1: relative importance = 98.79%; μ* = 70.32 µS/cm (strong effect); σ = 42.41 µS/cm (moderate interactions) (Table 13). C1: relative importance = 1.20%; μ* = 0.86 µS/cm (negligible effect); σ = 0.78 µS/cm (low interactions) (Table 13) |
| SHAP analysis | SHAP method | mean(\|SHAP\|), relative importance | O1 [mg/L]: mean(\|SHAP\|) = 29.49 μS/cm; relative importance = 93.11% (Table 16). C1 [μS/cm]: mean(\|SHAP\|) = 2.18 μS/cm; relative importance = 6.88% (Table 16). FR [L/h]: mean(\|SHAP\|) = 0 μS/cm; relative importance = 0% (Table 16) |
| Confidence intervals | CI | CI_lower; CI_upper | 98.8% empirical coverage at the nominal 95% confidence level (Table 14) |
| FR [L/h] | C1 [μS/cm] | O1 [mg/L] | C2 [μS/cm] |
|---|---|---|---|
| 0.5 | 295 | 5.32 | 345 |
| 0.5 | 295 | 5.46 | 340 |
| 0.5 | 290 | 3.5 | 359.8333 |
| 0.5 | 295 | 5.26 | 355 |
| 0.5 | 295 | 5.27 | 345 |
| 0.5 | 290 | 4 | 354 |
| 0.5 | 295 | 6.1 | 290.9091 |
| 0.5 | 300 | 6.23 | 325 |
| Model | Data Type | Context | Predictors | R2/MAE/RMSE | HGBRCond Advantages |
|---|---|---|---|---|---|
| XGBoost [13] | Industrial | Full-scale | DO, pH, BOD, COD | R2 = 0.82 | Superior R2, better interpretability |
| GBR [12] | Industrial | Batch | COD, NH4 | R2 = 0.82 | Superior R2, faster inference |
| LSTM [13] | Industrial | Activated sludge conductivity | Temporal dynamics | R2 = 0.88 | Comparable R2, rapid processing, reduced computational complexity |
| GPR [12] | - | Membrane filtration process conductivity | - | MAE < 0.3 μS/cm | More scalable and faster |
| Hybrid CNN-LSTM [14] | Industrial | Surface water electrical conductivity prediction | Selected features | RMSE = 53.83 μS/cm | Comparable performance with more efficient handling of missing values |
| ANN + PCA [32] | Industrial | WWTP simulation | PCA | R2 = 0.88 | Eliminates the need for PCA by automatically selecting features |
| HGBRCond | Pilot | Real-time | O1, C1 | R2 = 0.877 | Stability (SD = 0.011) |
| Criterion | XGBoost | GBR | HGBRCond |
|---|---|---|---|
| Accuracy | R2 = 0.82 (industrial, full-scale data) | R2 = 0.82 (industrial, batch data) | R2 = 0.877 (pilot data); 6.8% improvement |
| Limitations | Multiple sensors (pH, BOD, DO); extensive hyperparameter tuning; limited real-time adaptability; high computational cost for large datasets; memory intensive for large datasets | High computational cost; exact split-finding algorithm; memory intensive for large datasets; susceptible to overfitting; sensitive to hyperparameter tuning | Limited to pilot-scale data; strong O1 sensor dependence; constant FR; sensitive to critical hyperparameters; requires recalibration for industrial scale |
| Training efficiency | 200–400 s (estimated for n = 424) | 150–300 s (estimated for n = 424) | 63 s (measured) |
| Parameter complexity | 4–6 features (DO, pH, BOD, COD, etc.) | 4–6 features (COD, NH4, etc.) | 2 features (O1, C1); 67% reduction |
| Stability (SD) | SD = 0.02–0.04 (typical for ensemble methods) | SD = 0.025–0.05 (typical for traditional GBR) | SD = 0.011 (cross-validation); 50–78% more stable |
| Missing value handling | External preprocessing | External preprocessing | Native handling (no preprocessing needed) |
| Interpretability | Moderate (feature importance available, but usually treated as black-box) | Moderate (feature importance available, limited mechanistic insight) | High (SHAP-validated: O1 dominance 98%, mechanistic interpretation) |
| 95% CI | - | - | [0.855, 0.899] (narrow, stable) |
| Scalability | Suitable for large datasets, but memory intensive | Limited by memory for very large datasets | Excellent (histogram compression enables efficient scaling) |
Carbureanu, M.; Gheorghe, C.G. A Machine Learning-Based Data-Driven Model for Predicting Wastewater Quality Parameters in the Industrial Domain. Appl. Sci. 2026, 16, 694. https://doi.org/10.3390/app16020694