Data Structure-Aware Ensemble Modeling for Time-Series Prediction: A Case Study of Sewage Generation
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Area and Data Structure
2.2. Ensemble Modeling Framework
2.3. Data Structural Characteristics and Quantification
2.3.1. Autocorrelation
2.3.2. Mean Absolute Change Rate
2.3.3. Coefficient of Variation (CV)
2.4. Comparative Analysis Between Structural Characteristics and Model Performance
3. Results
3.1. Model Performance Across Structural Differences
3.1.1. Coefficient of Determination (R2)
3.1.2. Root Mean Square Error (RMSE)
3.1.3. Mean Absolute Error (MAE)
3.1.4. Mean Absolute Percentage Error (MAPE)
3.2. Relationship Between Data Structure and Model Performance
3.3. Model Performance Evaluation Based on Observed–Estimated Relationships
3.3.1. Scatter Distribution Between Observed and Estimated Values
3.3.2. Effect of Model Structure on Distribution Characteristics
3.3.3. Comparison with the Previous Study
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| RF | Random Forest |
| VR | Voting Regressor |
| ET | Extra Trees Regressor |
| GBR | Gradient Boosting Regressor |
| R2 | Coefficient of Determination |
| RMSE | Root Mean Square Error |
| MAE | Mean Absolute Error |
| MAPE | Mean Absolute Percentage Error |
| CV | Coefficient of Variation |
References
- Tchobanoglous, G.; Stensel, H.D.; Tsuchihashi, R.; Burton, F. Wastewater Engineering: Treatment and Resource Recovery, 5th ed.; McGraw-Hill Education: New York, NY, USA, 2014.
- Bertanza, G.; Boiocchi, R. Interpreting per capita loads of organic matter and nutrients in municipal wastewater: A study on 168 Italian agglomerations. Sci. Total Environ. 2022, 819, 153236.
- Mesdaghinia, A.; Nasseri, S.; Mahvi, A.H.; Tashauoei, H.R.; Hadi, M. The estimation of per capita loadings of domestic wastewater in Tehran. J. Environ. Health Sci. Eng. 2015, 13, 21.
- Lee, J.-S.; Kim, C.-H.; Shin, D.-C. Machine learning-based estimation of sewage treatment facility capacity and design adequacy: A case study in Korea. Processes 2025, 13, 3995.
- Wan, K.-Y.; Guo, Z.-W.; Wang, J.-H.; Shen, Y.; Feng, D.; Du, B.-X.; Yu, K.-P. Deep learning-based intelligent management for sewage treatment plants. J. Cent. South Univ. 2022, 29, 1665–1676.
- Liu, T.; Zhang, H.; Wu, J.; Liu, W.; Fang, Y. Wastewater treatment process enhancement based on multi-objective optimization and interpretable machine learning. J. Environ. Manag. 2024, 364, 121430.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Mahanna, H.; El-Rashidy, N.; Kaloop, M.R.; El-Sapakh, S.; Alluqmani, A.; Hassan, R. Prediction of wastewater treatment plant performance through machine learning techniques. Desalination Water Treat. 2024, 318, 100424.
- Lee, J.-S.; Shin, D.-C. Prediction of waste generation using machine learning: A regional study in Korea. Urban Sci. 2025, 9, 297.
- Willard, J.D.; Varadharajan, C.; Jia, X.; Kumar, V. Time series predictions in unmonitored sites: A survey of machine learning techniques in water resources. Environ. Data Sci. 2025, 4, e7.
- Parmezan, A.R.S.; Souza, V.M.A.; Batista, G.E.A.P.A. Evaluation of statistical and machine learning models for time series prediction: Identifying the state-of-the-art and the best conditions for the use of each model. Inf. Sci. 2019, 484, 302–337.
- Cerqueira, V.; Torgo, L.; Mozetič, I. Evaluating time series forecasting models: An empirical study on performance estimation methods. Mach. Learn. 2020, 109, 1997–2028.
- Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
- Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2015.
- Srivastava, S.; Wang, J.; Jiang, P. A new loss function for enhancing peak prediction in time series data with high variability. Forecasting 2025, 7, 75.
- Chaudhari, K.; Thakkar, A. Neural network systems with an integrated coefficient of variation-based feature selection for stock price and trend prediction. Expert Syst. Appl. 2023, 219, 119527.
- Khoshvaght, H.; Permala, R.R.; Razmjou, A.; Khiadani, M. A critical review on selecting performance evaluation metrics for supervised machine learning models in wastewater quality prediction. J. Environ. Chem. Eng. 2025, 13, 119675.
- Hodson, T.O. Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not. Geosci. Model Dev. 2022, 15, 5481–5487.
- Plevris, V.; Solorzano, G.; Bakas, N.P.; Ben Seghier, M.E.A. Investigation of performance metrics in regression analysis and machine learning-based prediction models. In Proceedings of the 8th European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2022), Oslo, Norway, 5–9 June 2022.
- Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688.
- Mienye, I.D.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022, 10, 99129–99149.
- Schratz, P.; Muenchow, J.; Iturritxa, E.; Richter, J.; Brenning, A. Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Modell. 2019, 406, 109–120.
- Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316.
- Marcinkevičs, R.; Vogt, J.E. Interpretable and explainable machine learning: A methods-centric overview with concrete examples. WIREs Data Min. Knowl. Discov. 2023, 13, e1493.
- Uddin, S.; Lu, H. Dataset meta-level and statistical features affect machine learning performance. Sci. Rep. 2024, 14, 1670.
- Rane, N.; Choudhary, S.; Rane, J. Ensemble deep learning and machine learning: Applications, opportunities, challenges, and future directions. Stud. Med. Health Sci. 2024, 1, 18–41.
- Chen, Z.; Zheng, Y. RRMSE-enhanced weighted voting regressor for improved ensemble regression. PLoS ONE 2025, 20, e0319515.
- Mahajan, P.; Uddin, S.; Hajati, F.; Moni, M.A. Ensemble learning for disease prediction: A review. Healthcare 2023, 11, 1808.
- Kim, S.; Kim, H. A new metric of absolute percentage error for intermittent demand forecasts. Int. J. Forecast. 2016, 32, 669–679.
- Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250.
- Shumway, R.H.; Stoffer, D.S. Time Series Analysis and Its Applications; Springer: Berlin/Heidelberg, Germany, 2017.
- Brown, C.E. Coefficient of Variation. In Applied Multivariate Statistics in Geohydrology and Related Sciences; Springer: Berlin/Heidelberg, Germany, 1998.
- Bischl, B.; Mersmann, O.; Trautmann, H.; Weihs, C. Resampling methods for meta-model validation with recommendations for evolutionary computation. Evol. Comput. 2012, 20, 249–275.
- Kantz, H.; Schreiber, T. Nonlinear Time Series Analysis; Cambridge University Press: Cambridge, UK, 2004.
- Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227.
- Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
- Taylor, K.E. Summarizing multiple aspects of model performance in a single diagram. J. Geophys. Res. 2001, 106, 7183–7192.
- Zhang, G.; Patuwo, B.E.; Hu, M.Y. Forecasting with artificial neural networks: The state of the art. Int. J. Forecast. 1998, 14, 35–62.
- Dietterich, T.G. Ensemble methods in machine learning. In Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
- Hawkins, D.M. The problem of overfitting. J. Chem. Inf. Comput. Sci. 2004, 44, 1–12.
- Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87.
- Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms; Chapman & Hall/CRC: Boca Raton, FL, USA, 2012.




| Model | Hyperparameter | Baseline Voting Regressor (VR_Base) [4] | Structure-Aware Voting Regressor (VR_SA) |
|---|---|---|---|
| Random Forest (RF) | n_estimators | 500 | 500 |
| | max_depth | None | None |
| | min_samples_split | 2 | 2 |
| | min_samples_leaf | 1 | 1 |
| | max_features | sqrt | sqrt |
| | random_state | - | 42 |
| | bootstrap | True | True |
| Voting Regressor (VR) | Base learners | RF + LR | RF + ET + GB |
| | Weights | [0.6, 0.4] | [0.4, 0.3, 0.3] |
| | Learning strategy | Soft voting | Ensemble averaging |
| | GB learning_rate | - | 0.03 |
| | GB max_depth | - | 2 |
| | random_state | - | 42 |
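Assuming scikit-learn as the implementation library (the paper does not name one), the VR_SA configuration in the table above can be sketched as follows. The Extra Trees settings other than `random_state` are assumptions, since the table specifies hyperparameters only for RF and GB; the synthetic data exists only to show the interface.

```python
# Sketch of the structure-aware ensemble (VR_SA): RF + ET + GB combined by
# weighted averaging with weights [0.4, 0.3, 0.3], per the table above.
import numpy as np
from sklearn.ensemble import (
    RandomForestRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    VotingRegressor,
)

rf = RandomForestRegressor(
    n_estimators=500, max_depth=None, min_samples_split=2,
    min_samples_leaf=1, max_features="sqrt", bootstrap=True, random_state=42,
)
# ET settings beyond random_state are illustrative assumptions.
et = ExtraTreesRegressor(n_estimators=500, max_features="sqrt", random_state=42)
gb = GradientBoostingRegressor(learning_rate=0.03, max_depth=2, random_state=42)

# VotingRegressor averages the base learners' predictions using the weights.
vr_sa = VotingRegressor(
    estimators=[("rf", rf), ("et", et), ("gb", gb)],
    weights=[0.4, 0.3, 0.3],
)

# Toy fit on synthetic data to demonstrate the fit/predict interface.
rng = np.random.default_rng(0)
X = rng.random((40, 3))
y = X.sum(axis=1)
vr_sa.fit(X, y)
pred = vr_sa.predict(X[:5])
```

In practice the ensemble would be fitted on the lagged sewage-generation features described in Section 2.2 rather than on random data.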
| Region | Year | Actual SG (10⁶ m³/Year) | Structure-Aware Voting Regressor (VR_SA) SG (10⁶ m³/Year) | Error (%) |
|---|---|---|---|---|
| A | 2017 | 4.206 | 4.203 | 0.07 |
| | 2018 | 4.227 | 4.221 | 0.38 |
| | 2019 | 3.948 | 3.980 | 0.82 |
| | 2020 | 4.180 | 4.162 | 0.43 |
| | 2021 | 4.203 | 4.207 | 0.10 |
| | 2022 | 4.238 | 4.237 | 0.01 |
| | 2023 | 4.280 | 4.272 | 0.18 |
| B | 2017 | 5.012 | 5.033 | 0.42 |
| | 2018 | 5.233 | 5.220 | 0.25 |
| | 2019 | 5.169 | 5.181 | 0.24 |
| | 2020 | 5.378 | 5.363 | 0.28 |
| | 2021 | 5.419 | 5.425 | 0.12 |
| | 2022 | 5.462 | 5.465 | 0.05 |
| | 2023 | 5.553 | 5.540 | 0.22 |
| C | 2017 | 0.761 | 0.762 | 0.12 |
| | 2018 | 0.755 | 0.758 | 0.38 |
| | 2019 | 0.771 | 0.772 | 0.17 |
| | 2020 | 0.824 | 0.821 | 0.38 |
| | 2021 | 0.815 | 0.817 | 0.26 |
| | 2022 | 0.872 | 0.870 | 0.19 |
| | 2023 | 0.883 | 0.879 | 0.39 |
| D | 2017 | 0.202 | 0.205 | 1.07 |
| | 2018 | 0.221 | 0.221 | 0.12 |
| | 2019 | 0.233 | 0.232 | 0.45 |
| | 2020 | 0.234 | 0.234 | 0.03 |
| | 2021 | 0.245 | 0.244 | 0.35 |
| | 2022 | 0.241 | 0.242 | 0.26 |
| | 2023 | 0.252 | 0.251 | 0.49 |
| Region | Model | RMSE (m³), Baseline Voting Regressor (VR_Base) [4] | RMSE (m³), Structure-Aware Voting Regressor (VR_SA) |
|---|---|---|---|
| A | RF | 49,149.8 | 26,700 |
| | Voting | 24,574.9 | 12,300 |
| B | RF | 37,634 | 17,700.3 |
| | Voting | 18,817.2 | 15,800.4 |
| C | RF | 77,709.5 | 20,000 |
| | Voting | 3854.7 | 1950.5 |
| D | RF | 3282.2 | 2200 |
| | Voting | 1641.1 | 1500.8 |
| Region | Model | MAE (m³), Baseline Voting Regressor (VR_Base) [4] | MAE (m³), Structure-Aware Voting Regressor (VR_SA) |
|---|---|---|---|
| A | RF | 37,028.3 | 21,400 |
| | Voting | 18,514.2 | 9800 |
| B | RF | 32,931.3 | 14,200.2 |
| | Voting | 16,415.6 | 12,700.3 |
| C | RF | 6793.5 | 1600 |
| | Voting | 3396.7 | 1560.4 |
| D | RF | 2549.5 | 1800 |
| | Voting | 1274.8 | 1200.6 |
| Region | Model | MAPE (%), Baseline Voting Regressor (VR_Base) [4] | MAPE (%), Structure-Aware Voting Regressor (VR_SA) |
|---|---|---|---|
| A | RF | - | 2.16 |
| | Voting | 0.45 | 1.63 |
| B | RF | - | 2.78 |
| | Voting | 0.31 | 2.19 |
| C | RF | - | 5.23 |
| | Voting | 0.41 | 4.40 |
| D | RF | - | 5.01 |
| | Voting | 0.56 | 4.13 |
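The four evaluation metrics reported in Section 3.1 follow their standard definitions; a minimal sketch is below. The sample arrays are illustrative placeholders, not values from the paper.

```python
# Standard regression metrics: R², RMSE, MAE, and MAPE, computed over
# paired observed (obs) and estimated (est) values.
import numpy as np

def r2(obs, est):
    # Coefficient of determination: 1 minus residual over total variance.
    ss_res = np.sum((obs - est) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(obs, est):
    # Root mean square error, in the units of the target variable.
    return float(np.sqrt(np.mean((obs - est) ** 2)))

def mae(obs, est):
    # Mean absolute error, also in target units.
    return float(np.mean(np.abs(obs - est)))

def mape(obs, est):
    # Mean absolute percentage error, scale-free (%); undefined if obs has zeros.
    return float(np.mean(np.abs((obs - est) / obs)) * 100.0)

obs = np.array([4.20, 4.10, 3.90, 4.30])  # illustrative values
est = np.array([4.18, 4.15, 3.95, 4.22])
print(round(mape(obs, est), 2))  # ≈ 1.21
```

RMSE and MAE are scale-dependent, which is why the tables above report them in m³ per region, while MAPE allows comparison across regions of very different sewage volumes.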
| Region | Autocorrelation | Mean Absolute Change Rate | Coefficient of Variation |
|---|---|---|---|
| A | −0.041 | 0.026 | 0.026 |
| B | 0.776 | 0.021 | 0.035 |
| C | 0.833 | 0.032 | 0.064 |
| D | 0.870 | 0.043 | 0.072 |
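The three structural descriptors tabulated above can be sketched as follows. The lag-1 autocorrelation estimator and the sample (ddof = 1) standard deviation in the CV are assumptions about the paper's exact definitions; applied to Region A's actual series from the prediction table, the sketch reproduces the tabulated change rate and CV (both 0.026), while the tabulated autocorrelation (−0.041) suggests the paper's normalization differs slightly from this plain estimator.

```python
# Structural descriptors of a time series: lag-1 autocorrelation,
# mean absolute change rate, and coefficient of variation.
import numpy as np

def lag1_autocorr(x):
    # Lag-1 sample autocorrelation (assumed definition): covariance of
    # consecutive deviations, normalized by the total sum of squares.
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d ** 2))

def mean_abs_change_rate(x):
    # Mean of |x[t+1] - x[t]| / x[t]: average relative year-to-year change.
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(np.diff(x) / x[:-1])))

def coef_variation(x):
    # CV = sample standard deviation / mean (ddof = 1 assumed).
    x = np.asarray(x, dtype=float)
    return float(x.std(ddof=1) / x.mean())

# Region A actual sewage generation, 2017-2023 (10⁶ m³/year), from the
# prediction table above.
series = [4.206, 4.227, 3.948, 4.180, 4.203, 4.238, 4.280]
print(round(mean_abs_change_rate(series), 3))  # 0.026
print(round(coef_variation(series), 3))        # 0.026
```

Together the three metrics separate temporal dependence (autocorrelation) from short-term volatility (change rate) and overall dispersion (CV), which is the basis of the structure-performance comparison in Section 3.2.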
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Lee, J.-S.; Kim, C.-H.; Shin, J.-H.; Kim, D.-H.; Shin, D.-C. Data Structure-Aware Ensemble Modeling for Time-Series Prediction: A Case Study of Sewage Generation. Appl. Sci. 2026, 16, 4842. https://doi.org/10.3390/app16104842