Unlocking the Potential of Wastewater Treatment: Machine Learning Based Energy Consumption Prediction

Abstract: Wastewater treatment plants (WWTPs) are energy-intensive facilities that must meet stringent effluent quality standards. Energy consumption prediction in WWTPs is crucial for cost savings, process optimization, compliance with regulations, and reducing the carbon footprint. This paper evaluates and compares a set of 23 candidate machine-learning models to predict WWTP energy consumption using actual data from the Melbourne WWTP. To this end, Bayesian optimization has been applied to calibrate the investigated machine learning models. Random Forest and XGBoost (eXtreme Gradient Boosting) were applied to assess how the incorporated features influenced the energy consumption prediction. In addition, this study investigated whether information from past data improves prediction accuracy by incorporating time-lagged measurements. Results showed that the dynamic models using time-lagged data outperformed the static and reduced machine learning models. The study shows that including lagged measurements in the model improves prediction accuracy, and the results indicate that the dynamic K-nearest neighbors model outperforms the other state-of-the-art methods considered, reaching promising energy consumption predictions.


Introduction
Recycled water is a strategic alternative to mitigate water scarcity, particularly in arid regions. Notably, the treated water from wastewater treatment plants (WWTPs) can be used for different purposes, such as irrigation and aquariums, or discharged with a low level of pollution [1]. WWTPs are energy-intensive processes; therefore, improving their energy efficiency is needed from environmental and economic viewpoints [2]. It has been reported in refs. [3][4][5] that WWTP energy usage accounts for 4% of the national electricity in the United States and around 7% of the electrical energy worldwide [6]. One potential option for optimizing WWTPs and achieving energy savings is accurately predicting their energy consumption. By developing predictive models, operators and engineers can forecast the expected energy requirements of the treatment processes, enabling them to anticipate the energy demand associated with different operational scenarios and efficiently schedule equipment and processes. For example, by aligning energy-intensive activities, such as aeration or pumping, with periods of lower electricity rates or increased availability of renewable energy, WWTPs can strategically reduce their energy costs. This optimization approach ensures that energy-intensive processes are carried out when energy prices are more favorable or when renewable energy sources are abundant, leading to significant cost savings and a reduced environmental impact.
WWTPs currently operate conservatively, with high operating costs, and waste a large amount of energy. They are energy-intensive because they require significant energy to perform the various treatment processes necessary to clean the wastewater. Some of the major energy-intensive processes in a WWTP include the aeration, mixing, and pumping of water and solids for recirculation, filtration, and disinfection. Additionally, WWTPs require energy for processing biosolids, which may involve aerobic digestion, heat drying, and dewatering. The energy demand for disinfection processes can vary depending on the method employed, with chlorination being less energy-intensive than UV or ozone disinfection. While sedimentation is a necessary process in wastewater treatment, it is not considered highly energy-intensive. These energy-intensive activities and the mechanical and electrical equipment used within the WWTP contribute significantly to the overall energy consumption. The energy consumption of a WWTP can be reduced by implementing energy-efficient technologies, by using renewable energy sources, or by optimizing the treatment process. Machine learning methods are being used in WWTPs to improve their efficiency and reduce operational costs.
Enhancing the energy efficiency of WWTPs is essential to saving energy, reducing economic expenses, and preserving resources and the environment [7][8][9][10][11]. Over the years, numerous methods have been developed for modeling and predicting key characteristics of WWTPs, including analytical and data-derived techniques [12][13][14]. Analytical-based methods rely on a fundamental understanding of the physical process [15]. Developing a precise physical model for complex, high-dimensional, and nonlinear systems is challenging, expensive, and time-consuming [16]. On the other hand, data-based methods rely only on available historical data. Nowadays, data-driven methods, particularly machine learning methods, are more common in modeling and managing WWTP processes. For example, ref. [14] investigated the application of machine learning-based approaches to predict wastewater quality from WWTPs. They applied and compared six models, namely Random Forest (RF), Support Vector Machine (SVM), Gradient Tree Boosting (GTB), Adaptive Neuro-Fuzzy Inference System (ANFIS), Long Short-Term Memory (LSTM), and Seasonal Autoregressive Integrated Moving Average with eXogenous regressors (SARIMAX). Hourly data collected from three WWTPs were used to assess the investigated models. Results demonstrated that SARIMAX outperformed the other models in predicting wastewater effluent quality, with acceptable computation time. Recently, Andreides et al. presented an overview of data-driven techniques applied for influent characteristics prediction at WWTPs [17]. They showed that most reviewed works use machine learning-based approaches, particularly neural networks (NNs). Some studies achieved comparable or better outcomes using machine learning methods such as kNN and RF. This review concludes that no single approach dominates the others in the reviewed literature because the studies were conducted using different datasets and settings, making the comparison difficult. Overall, NNs and hybrid models exhibited satisfactory prediction performance.
Guo et al. considered ANN and SVM models to predict the total nitrogen concentration in a WWTP in Ulsan, Korea [18]. To this end, daily water quality and meteorological data were used as input variables for the machine-learning models. The pattern search algorithm was adopted to calibrate the ANN and SVM models. This study demonstrated the capability of these two models in predicting water quality and the superior performance of the SVM in this case study [18]. The study in [19] focused on predicting effluent water quality parameters of the Tabriz WWTP through a supervised committee fuzzy logic approach. The study in [2] evaluated the energy efficiency of a sample of Chilean WWTPs using a newly developed technique called stochastic non-parametric envelopment of data (StoNED). This technique combines non-parametric and parametric methods and allows for an exploration of the influence of the operating environment on the energy performance of WWTPs. The study found that the Chilean WWTPs were considerably inefficient, with an average energy efficiency score of 0.433 and significant opportunities to save energy (average savings were 203,413 MWh/year). The age of the facilities negatively affected energy efficiency, and WWTPs using suspended-growth processes, such as conventional activated sludge and extended aeration, had the lowest levels of energy efficiency. The authors suggest that this methodology could be used to support decision-making for regulation, to plan the construction of new facilities, and to measure the energy efficiency of other stages of the urban water cycle, such as drinking water treatment.
This study explores the potential of machine learning models for predicting energy consumption (EC) in WWTPs. The following key points summarize the contributions of this paper:
• First, we considered all input variables, including hydraulic, wastewater characteristics, weather, and time data, to predict energy consumption in a WWTP. This study compared twenty-three machine learning models, including support vector regression (SVR) with different kernels, Gaussian process regression (GPR) with different kernels, boosted trees, bagged trees, decision trees, neural networks (NNs), RF, k-nearest neighbors (KNN), eXtreme Gradient Boosting (XGBoost), and LightGBM. Bayesian optimization has been applied to calibrate and fine-tune the investigated machine-learning models to develop efficient energy consumption predictions. In addition, a 5-fold cross-validation technique has been used to construct these models based on training data. Five performance evaluation metrics are employed to assess the goodness of predictions. Results revealed that, when all input variables are used to predict EC, the machine learning models did not provide satisfactory predictions.
• Second, the aim is to construct reduced models by keeping only pertinent input variables to predict EC in WWTPs. To this end, Random Forest and XGBoost algorithms were applied to identify important variables that considerably influence the prediction capability of the considered models. Results showed that the reduced models obtained a slightly improved prediction of EC compared to the full models.
• Third, it is worthwhile to mention that the studied methods do not take into account the time-dependent nature of energy consumption in the prediction process. Our final contribution addresses this limitation by constructing dynamic models that incorporate lagged measurements as inputs to enhance the ML models' ability to perform effectively. Results demonstrated that using lagged data contributes to improving the prediction quality of the ML models and highlights the superior performance of the dynamic GPR and KNN.
The remainder of this study is organized as follows. Section 2 provides an overview of related works on energy consumption prediction. Section 3 presents the data from the Melbourne WWTP and the airport weather station, along with an exploratory data analysis. Furthermore, the investigated machine learning models are briefly described. Section 4 describes the proposed machine learning-based prediction framework. Section 5 contains the results and discussion of the machine learning algorithms applied to our datasets. Lastly, Section 6 recapitulates the paper and gives future directions for potential enhancements.

Related Works
Recently, many studies have explored the use of machine learning to control WWTPs by predicting how much energy they consume [20,21]. Machine learning-based models are flexible and rely only on historical data from the inspected process. For instance, ref. [22] used a Random Forest model to predict the energy consumption of WWTPs. They assessed this machine-learning approach using 2387 records from the China Urban Drainage Yearbook. Results indicate that Random Forest exhibited satisfactory prediction performance with an $R^2$ of 0.702. However, this study does not consider the effects of local climate and technology in building the predictive model, which is very important. Ref. [23] considered a logistic regression approach for predicting the energy consumption of a WWTP in Romania. The input variables, including the flowrate and wastewater characteristics, are used to predict energy consumption. Data were collected from a WWTP between 2015 and 2017 with 403 records to verify the efficiency of this approach. Results showed a reasonable prediction quality with an accuracy of 80%. Nevertheless, not all parameters that affect water quality were considered when constructing the logistic regression model. Ref. [24] investigated the capacity of the Artificial Neural Network (ANN), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Linear Regression in predicting the energy consumption of a WWTP located in Peninsular Malaysia. The energy consumption data were collected from the Tenaga Nasional Berhad (TNB) electrical bills from March 2011 to February 2015. The wastewater characteristics were collected to construct the predictive models. This study showed that the ANN model outperformed the other models. In ref. [25], the purpose was to save energy at a WWTP by performing a daily benchmark analysis. Torregrossa et al. examined Support Vector Regression (SVR), ANN, and RF algorithms on the Solingen-Burg WWTP dataset (designed for a connected population of 120,000 individuals). The RF was chosen as the most efficient algorithm based on an $R^2$ of 0.72 in the validation and an $R^2$ of 0.71 in the testing.
Furthermore, ref. [26] applied machine learning methods (ANN and RF) to predict energy costs in WWTPs. This study was conducted based on 279 WWTPs located in northwest Europe, including the Netherlands, France, Denmark, Belgium, Germany, Austria, and Luxembourg. Regarding average $R^2$, RF reached 0.82, followed by the ANN with 0.81. ANN could be extensively used to investigate larger WWTP databases. Qiao et al. proposed an approach to predict energy consumption and effluent quality based on a density peaks-based adaptive fuzzy neural network (DP-AFNN) [27]. They showed that this approach achieved high prediction accuracy compared to multiple linear regression, the FNN-EBP (error backpropagation), and the Dynamic FNN. Ref. [5] focused on the modeling and optimization of a wastewater pumping system to reduce energy consumption using the ANN model. Specifically, they applied neural networks to model pump energy consumption and wastewater flow rate. An artificial immune network algorithm was adopted to solve the optimization problem by minimizing energy consumption and maximizing the pumped wastewater flow rate. Results revealed that 6% to 14% of energy could be saved while maintaining pumping performance. Ref. [28] investigated the capability of ANN, Gradient Boosting Machine (GBM), RF, and the Long Short-Term Memory (LSTM) network in predicting energy consumption records from a Melbourne WWTP. The prediction has been performed by considering weather variables, wastewater characteristics, and hydraulic variables. Feature extraction has been considered to select important variables to construct the machine learning models. Results showed that the GBM model provided the best prediction in this case study. However, the model's performance may degrade when it is applied to data from subsequent months, as the characteristics of the test data change over time. In [29], a neural network model has been applied to predict pump energy consumption in a WWTP. This enables the generation of operational schedules for a pump system to decrease energy consumption. The ANN model showed satisfactory prediction by reaching MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error) of 0.78 and 0.02, respectively. Table 1 shows some recent studies on energy consumption prediction in WWTPs using various machine learning techniques.

Data Description and Analysis
This study uses multivariate data from the Melbourne water treatment plant and airport weather stations: https://data.mendeley.com/datasets/pprkvz3vbd/1 (accessed on 25 April 2023). The data contain 1382 records of nineteen variables, listed in Table 1, gathered over roughly five and a half years, from January 2014 to June 2019. The dataset includes power consumption data as well as biological, hydraulic, and climate variables. The data on water quality and biological characteristics are collected using sensors, while weather data are collected using the Melbourne airport weather station located near the water treatment plant. This dataset provides a comprehensive view of the various factors that can impact energy consumption at a wastewater treatment plant and allows researchers to build models that take into account the interplay between these factors. As shown in Figure 1, the data also contain time-domain information that will be considered in this study to improve prediction performance. More details about this data can be found in [28]. Additionally, we eliminated data points that had unusually low or high energy consumption, which were considered outliers. Roughly 5% of the data points were removed [1].
The evolution of energy consumption in WWTPs can be affected by several factors, including the volume and composition of the incoming wastewater, weather conditions such as temperature and precipitation, seasonal changes (e.g., the tourism season), the treatment processes used, and the maintenance of the equipment. WWTP operators can use real-time data on the incoming wastewater, weather conditions, and treatment processes to monitor the daily evolution of energy consumption and adjust the treatment process as needed. This can help reduce energy consumption and costs and improve the overall performance of the WWTP.
Before constructing predictive models, it is important to perform exploratory data analysis. At first, we plot the time-series data to get a visual idea of the variation and correlation in the data. Plotting time-series data is an important step in exploratory data analysis, as it can help identify patterns and trends that can be useful in building predictive models. Visualizing the data can also help identify any outliers or missing data, which may need to be handled when preprocessing the data before building the predictive models. Figure 2 displays the daily evolution of energy consumption, hydraulic and wastewater variables, and some weather time series collected in 2018. As WWTPs should handle highly dynamic influent, energy consumption and influent will vary accordingly (Figure 2). From Figure 3, we observe a high similarity in the variation between the inflow and the consumed energy. Essentially, the volume and composition of wastewater (inflow) that needs to be treated can affect the energy consumption of a WWTP. An increase in the volume of wastewater can increase the energy consumption of pumps and other equipment used to move the water through the treatment process. Additionally, changes in the composition of the wastewater, such as an increase in the amount of organic matter, can increase the energy consumption of the biological treatment process. Here, kernel density estimation (KDE) [35], a non-parametric method, is applied to estimate the underlying distribution of the dataset. KDE is a flexible method that does not assume a particular parametric form for the underlying distribution. Figure 4 allows us to see the shape of the distribution, the presence of outliers, and the general characteristics of the data. It is an effective way of understanding the nature of the data and the underlying patterns in it.
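As a minimal illustration of kernel density estimation, the sketch below fits a Gaussian KDE to synthetic values standing in for daily energy consumption. The synthetic data, the variable names, and the default bandwidth rule (Scott's rule in SciPy) are assumptions made for illustration and are not the settings used in this study.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-in for daily energy consumption (arbitrary units);
# this is NOT the Melbourne dataset.
energy = rng.normal(loc=275.0, scale=40.0, size=500)

# Gaussian KDE; SciPy selects the bandwidth with Scott's rule by default.
kde = gaussian_kde(energy)

# Evaluate the estimated density on a grid that extends past the data range.
grid = np.linspace(energy.min() - 50.0, energy.max() + 50.0, 400)
density = kde(grid)

# Sanity check: a density should integrate to approximately 1
# (simple Riemann sum over the evenly spaced grid).
area = float(np.sum(density) * (grid[1] - grid[0]))
```

Plotting `density` against `grid` gives a smooth curve comparable in spirit to a histogram, without having to choose bin edges.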
To visualize the distribution of energy consumption over the years, Figure 5 displays the boxplots of yearly WWTP power consumption during the studied period from 2014 to 2019. Note that data for 2019 are available for only six months, which is why the boxplot for that year is more compact than those of the other years. From Figure 5, we observe that both the average and the standard deviation of the annual energy consumption decreased slightly in 2018 compared to 2017. This decrease in energy consumption could be due to the operator's optimization and management of the WWTP. The boxplots in Figure 6 display the monthly energy consumption patterns, and it is observed that there is a significant increase in variance during the warm months of October, November, and December. This could be attributed to the high demand for water during these months and also to the increase in tourism. These months cover the Australian spring and the start of summer. The weather during this period is generally sunny, warm, and humid, which can lead to increased water usage for cooling and other purposes and also cause an influx of tourists, leading to increased strain on the WWTPs. This information could be used to improve energy consumption prediction for better resource allocation and planning.

Methodology
This section briefly describes the machine learning models considered for energy consumption prediction in this paper. As presented in Figure 7, twenty-three machine learning models are considered, including GPR, SVR, kNN, and ensemble learning models (RF, BT, BS, XGBoost, and LightGBM). Each model has its own strengths and weaknesses. Next, we provide a brief description of these popular machine-learning models.

SVR Model
Support vector regression (SVR) is a flexible data-based approach with good learning capacity via kernel tricks. The key idea underlying SVR consists of mapping the training data to a higher-dimensional space and conducting linear regression in that space. SVR can handle non-linear and non-separable data using the kernel trick, handle datasets with large numbers of features, and be robust to the presence of outliers. However, it requires a large amount of computational power and can be sensitive to the choice of kernel function and regularization parameter [36,37]. Moreover, the design of the SVR model is based on the concept of structural risk minimization. It has been demonstrated that SVR provides satisfactory performance with limited samples [38]. Thus, SVR models have been broadly exploited in various applications, such as solar irradiance prediction [39], wind power prediction [40], and anomaly detection [41]. This study will use SVR optimized via Bayesian optimization for comparison.
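To make the role of the kernel and the regularization parameter concrete, the following minimal sketch fits an RBF-kernel SVR with scikit-learn on synthetic data. The data and the values of `C` and `epsilon` are placeholders chosen for illustration, not the tuned settings of this study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic non-linear regression problem (illustrative only).
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# The kernel, C (regularization strength) and epsilon (width of the
# insensitive tube) are the hyperparameters one would normally tune;
# the values below are arbitrary placeholders.
svr = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.05),
)
svr.fit(X, y)

# Coefficient of determination on the training data.
r2_train = svr.score(X, y)
```

Standardizing the inputs before the SVR step is a common choice because the RBF kernel depends on distances between samples.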

GPR Model
Gaussian process regression (GPR) is a probabilistic machine learning technique; Gaussian processes can be used for both regression and classification problems [42]. GPR models the target variable as a Gaussian process and assigns a probability distribution to the predicted values [43]. GPR is non-parametric, flexible, and able to capture complex relationships between the input variables and the target variable [44]. A key feature of GPR is that it can also model the uncertainty in the predictions it makes, because each prediction is a probability distribution rather than a single value [42]. However, GPR is computationally expensive because it requires the calculation of the covariance matrix between all the data points, which becomes intensive as the number of data points increases. The computational complexity of GPR is typically $O(N^3)$, where $N$ is the number of data points. This cost is dominated by the inversion of the covariance matrix, which is also needed to calculate the likelihood of the model given the data.
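The uncertainty estimate described above can be illustrated with a short sketch using scikit-learn's `GaussianProcessRegressor`. The synthetic data and the kernel (an RBF term plus a white-noise term) are assumptions made for illustration; the kernels compared in this study may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Small synthetic dataset on the interval [0, 10] (illustrative only).
X = np.sort(rng.uniform(0.0, 10.0, size=(40, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

# Assumed kernel: smooth signal (RBF) plus observation noise (WhiteKernel).
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, y)

# Predict at one point inside the training range and one far outside it.
X_new = np.array([[5.0], [20.0]])
mean, std = gpr.predict(X_new, return_std=True)
# std[1] is expected to exceed std[0]: the model is less certain
# where it has seen no data.
```

Fitting requires factorizing an N-by-N covariance matrix, which is where the cubic cost mentioned above comes from.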

K-Nearest Neighbor
The k-nearest neighbors (k-NN) algorithm is a lazy machine-learning method that has no explicit training phase: it stores the training data and defers computation to prediction time. k-NN is a non-parametric method that can be used for regression by predicting the output of a new data point as the average of the outputs of its k nearest data points [45]. The value of k is a hyperparameter that needs to be set before the model is used, and it affects the model's sensitivity to outliers. However, the k-NN algorithm is sensitive to the scale of the features, so it may not perform well if the features have different scales. The k-NN algorithm may also perform poorly with high-dimensional data, as distances become less informative and it becomes more difficult to find meaningful nearest neighbors in high-dimensional space.
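The sensitivity of k-NN to feature scales can be seen in the following minimal sketch, which compares a k-NN regressor with and without standardization on synthetic data. The data, the choice of k = 5, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales; only the first drives the target.
x1 = rng.uniform(0.0, 1.0, size=300)
x2 = rng.uniform(0.0, 1000.0, size=300)
X = np.column_stack([x1, x2])
y = 10.0 * x1 + 0.1 * rng.normal(size=300)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Without scaling, distances are dominated by the large-scale feature x2.
raw_knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# With standardization, both features contribute comparably to distances.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=5)
).fit(X_train, y_train)

r2_raw = raw_knn.score(X_test, y_test)
r2_scaled = scaled_knn.score(X_test, y_test)
```

In this synthetic setting the scaled model recovers the dependence on the first feature, while the unscaled model effectively picks neighbors based on the irrelevant second feature.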

ANN Models
Artificial Neural Networks (ANNs) are advanced models capable of recognizing complex and non-linear relationships between inputs and outputs; however, they can be sensitive to the way they are structured and initialized and to the parameters used. They are frequently used in studies related to wastewater treatment plants (WWTPs). Each layer of an ANN contains multiple neurons connected to neurons in the next layer and designed to fulfill specific tasks. Different ANN models have been studied, such as Narrow, Wide, Medium, Bilayered, and Trilayered Neural Networks. Narrow Neural Networks have a small number of neurons in hidden layers, which makes them less complicated and quicker to train, but they may not be able to recognize complex relationships. Wide Neural Networks have many neurons in hidden layers, which makes them more complex and able to recognize more complex relationships. However, they can be costly to compute and susceptible to overfitting. Medium Neural Networks have an intermediate number of neurons in hidden layers and provide a balance between narrow and wide neural networks. Bilayered Neural Networks have two hidden layers, making them more complex than single-layered neural networks. Trilayered Neural Networks have three hidden layers, making them even more complex than Bilayered Neural Networks, but also more costly to compute and prone to overfitting.

Decision Tree Regression
Decision trees, or regression trees, are a method used to predict a continuous outcome. The method uses a tree-like structure to divide the data into subsets and makes predictions based on the average value of the target variable in each subset. The simplicity of the algorithm makes it easy to understand and implement [46]. However, the complexity increases as the tree grows deeper, and the model can overfit. To mitigate this, techniques such as pruning, limiting the tree depth, and using regularization can be used. The time needed to predict a single sample is typically O(log(n)) on average, where n is the number of training samples, but can be as high as O(n) in the worst-case scenario of a highly unbalanced tree. The space complexity is O(n) because the entire tree is stored in memory.
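The effect of limiting tree depth and pruning can be sketched as follows with scikit-learn's `DecisionTreeRegressor`. The synthetic data and the values of `max_depth` and `ccp_alpha` (cost-complexity pruning strength) are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Noisy synthetic regression problem (illustrative only).
X = rng.uniform(0.0, 10.0, size=(300, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Fully grown tree: keeps splitting until the leaves are pure,
# so it can memorize the noise in the training data.
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Constrained tree: depth limit plus cost-complexity pruning.
pruned = DecisionTreeRegressor(
    max_depth=4, ccp_alpha=0.001, random_state=0
).fit(X_train, y_train)

depths = (deep.get_depth(), pruned.get_depth())
r2_train = (deep.score(X_train, y_train), pruned.score(X_train, y_train))
r2_test = (deep.score(X_test, y_test), pruned.score(X_test, y_test))
```

Comparing `r2_train` with `r2_test` for each tree gives a quick indication of how much the unconstrained tree overfits relative to the constrained one.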

Ensemble Methods
Ensemble learning is a technique where multiple models are combined to make a single, more accurate prediction [47]. There are several ensemble methods, such as Bagging, Boosting, Random Forest, XGBoost, and LightGBM. Bagging, also known as Bootstrap Aggregating, trains several models independently on different subsets of the data and then combines the predictions by taking an average [48]. Boosting, on the other hand, trains multiple models sequentially, where each model tries to correct the prediction errors of the previous model [49]. Random Forest combines multiple decision trees, creating them on different subsets of the data and taking the average or majority vote of the predictions from the individual trees to make a final prediction [50]. XGBoost and LightGBM are open-source libraries for gradient boosting. XGBoost is highly efficient, scalable, and flexible and is used in various applications such as machine learning competitions, structured data, and time series [51]. LightGBM is efficient and scalable, particularly suitable for large datasets and high-dimensional data; it uses a histogram-based algorithm to build the trees, reducing computation time and memory usage [52,53]. Ensemble methods are considered more robust and accurate than single models; they reduce variance and bias in predictions by combining the strengths of multiple models [39].
When comparing the time complexity of Bagging, Boosting, Random Forest, XGBoost, and LightGBM, it is important to note that it can depend on the specific implementation and the size of the dataset. Generally speaking, the time complexity of Bagging and Boosting is O(mnT), where m is the number of samples, n is the number of features, and T is the number of estimators (models) used. Random Forest has a time complexity of O(mn log(m) T) with the same notation. XGBoost and LightGBM, which are based on gradient boosting, have a time complexity of O(Tm log(m)). In all of these expressions, the time complexity increases linearly with the number of estimators T; the logarithmic factor in the Random Forest and gradient-boosting expressions comes from the number of samples, not from the number of estimators.
Consequently, none of these methods scales better than the others with respect to the number of estimators alone. In practice, the main difference is that the estimators in Bagging and Random Forest are trained independently of one another and can therefore be built in parallel, whereas boosting methods, including XGBoost and LightGBM, train their estimators sequentially because each one depends on the errors of the previous ones.
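The ensemble methods discussed above can be compared in a few lines with scikit-learn, as in the following sketch. It uses scikit-learn's own `GradientBoostingRegressor` as a stand-in for gradient boosting (XGBoost and LightGBM are separate libraries and are not used here), and the synthetic data and settings are assumptions. The last lines also show how a fitted Random Forest exposes impurity-based feature importances, which is one common way to rank input variables.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# Only the first two columns influence the target; the rest are noise.
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + 0.1 * rng.normal(size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Same number of estimators (T = 100) for all three ensembles.
models = {
    "bagging": BaggingRegressor(n_estimators=100, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

# Test-set R^2 for each ensemble.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

# Impurity-based importances from the fitted Random Forest (they sum to 1).
importances = models["random_forest"].feature_importances_
top_feature = int(np.argmax(importances))
```

Here the first column dominates the target by construction, so it is expected to receive the largest importance.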

Models Calibration via Bayesian Optimization
Fine-tuning a machine learning model involves adjusting the hyperparameters of the model in order to improve its performance on a specific task. Choosing the best hyperparameter configuration for a given model has a direct effect on its performance. There are several ways to fine-tune a machine learning model, including Grid Search, Random Search, and Bayesian optimization [54]. Grid Search is an exhaustive method that tries all possible combinations of hyperparameters within a predefined grid [55]. Random Search is similar to Grid Search, but it randomly samples sets of hyperparameters from a predefined range [55]. Bayesian optimization, on the other hand, uses probabilistic models and previous evaluations to guide the search, which makes it more efficient than the other two methods [56,57].
Bayesian optimization aims to find the optimal set of hyperparameters more efficiently than Grid Search and Random Search. It uses the information gained from previous evaluations to guide the search, so it does not need to evaluate all possible combinations. In this study, we adopted Bayesian optimization to calibrate the models. For more details on Bayesian optimization, refer to refs. [57,58].
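The idea of using previous evaluations to guide the search can be sketched with a small, hand-written Bayesian optimization loop. This is a minimal illustration under stated assumptions, not the procedure used in this study: the surrogate is a Gaussian process with a Matern kernel, the acquisition function is expected improvement, the search is over a single hyperparameter (log10 of the SVR parameter C), and the data are synthetic. The function names `objective` and `expected_improvement` are hypothetical helpers.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(150, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=150)

def objective(log_c):
    """5-fold cross-validated RMSE of an RBF SVR for a given log10(C)."""
    model = SVR(kernel="rbf", C=10.0 ** log_c)
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    return float(-scores.mean())

def expected_improvement(candidates, surrogate, best):
    """Expected improvement over the best loss so far (minimization)."""
    mu, sigma = surrogate.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

low, high = -2.0, 3.0  # assumed search range for log10(C)
tried = [float(v) for v in rng.uniform(low, high, size=3)]  # initial points
losses = [objective(v) for v in tried]
candidates = np.linspace(low, high, 200).reshape(-1, 1)

for _ in range(10):
    # Fit the surrogate to all (hyperparameter, loss) pairs seen so far.
    surrogate = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True
    )
    surrogate.fit(np.array(tried).reshape(-1, 1), np.array(losses))
    # Evaluate next where the expected improvement is largest.
    ei = expected_improvement(candidates, surrogate, min(losses))
    next_log_c = float(candidates[int(np.argmax(ei)), 0])
    tried.append(next_log_c)
    losses.append(objective(next_log_c))

best_log_c = tried[int(np.argmin(losses))]
best_rmse = min(losses)
```

Only 13 objective evaluations are made here, whereas a grid over the same 200 candidates would require 200. Dedicated libraries implement the same idea with more robust surrogates and acquisition optimizers.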

Machine Learning-Based EC prediction framework
Energy consumption (EC) prediction is essential to optimally designing and operating sustainable energy-saving WWTPs. EC in WWTPs is influenced by diverse biological and environmental factors, making it complicated and challenging to build soft sensors. The techniques based on machine learning represent an appealing solution to predict the EC in WWTPs. In this paper, we investigated the prediction performance of commonly used machine learning methods to predict the WWTP's energy consumption. The general structure of the adopted machine learning framework is given in Figure 8. After the data are preprocessed by eliminating outliers and imputing missing values, they are subdivided into training and test sets. Here, 75% of the data are used to train the machine learning algorithms, and the trained models are evaluated using the testing data (i.e., the remaining 25% of data). We used the k-fold cross-validation technique to train the models. Cross-validation is a technique used to assess the performance of a model by dividing the data into train and test sets, and training and testing the model multiple times on different subsets of the data [59]. The benefit of using cross-validation is that it provides a more accurate estimate of the model's performance, reduces the risk of overfitting, and can also be used to tune the hyperparameters of a model. Note that the considered models are calibrated using Bayesian optimization during the training stage. Selecting the best model is an important step in machine learning. Here, we employed four commonly used metrics to evaluate the performance of a model: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and training time. Training time measures how long it takes to train the model. In addition, we computed the $J^2$ metric, which has recently been used in [28]. The model that performs best based on these metrics can be selected as the best model. The measured and the predicted energy consumption are denoted by $y_t$ and $\hat{y}_t$, respectively, and $n$ denotes the number of records.
• RMSE measures the differences between predicted and true values.
• MAE measures the average absolute difference between predicted and true values.
• MAPE measures the average percentage difference between predicted and true values.
• J2 is the ratio of the squared RMSE in testing to the squared RMSE in training [28].
In summary, lower values of these metrics indicate predictions that are closer to the true values and, in the case of training time, a model that is faster to train.
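For reference, the accuracy metrics listed above can be written out as follows. This is a sketch based on the standard textbook definitions, using the notation of this section (y_t measured, ŷ_t predicted, n records). Expressing MAPE in percent and the exact form of J2 are assumptions inferred from the verbal descriptions, not formulas taken from [28].

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)^2}, \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{y}_t\right|, \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t-\hat{y}_t}{y_t}\right|, \\
J^2 &= \frac{\mathrm{RMSE}_{\mathrm{test}}^{2}}{\mathrm{RMSE}_{\mathrm{train}}^{2}}.
\end{align}
```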

Results and Discussion
In this study, we performed three investigations to develop effective machine learning models for predicting WWTP energy consumption. In the first experiment, all available input variables are used to predict energy consumption with several machine learning models. In the second experiment, we considered only the most important variables, identified via variable selection. In the last experiment, we included lagged data to improve the prediction accuracy of the considered machine learning models.

EC Prediction Using Full Models
This first experiment considers hydraulic, weather, wastewater characteristics, and time variables for EC prediction. It assesses the capacity of machine learning methods to predict EC based on all variables in Table 1. Here, we investigate conventional static machine learning models, which are built using a fixed set of features and do not take into account any changes or updates in the data over time. These models are trained on a single dataset and make predictions based on that dataset.
As discussed, we used 75% of the data (from 1 January 2014 to 28 January 2018) for training and 25% (from 29 January 2018 to 27 June 2019) for testing. In this context, full prediction models are models that predict EC without considering previous energy consumption. To train the investigated models, we used a 5-fold cross-validation technique. As the prediction accuracy of machine learning methods depends on the values of their hyperparameters, we adopted Bayesian optimization to fine-tune the investigated methods during training.
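The data partitioning described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the records are placeholder integers rather than the Melbourne data, and the use of contiguous (unshuffled) folds is an assumption, since the text does not state how the five folds were formed.

```python
def chronological_split(records, train_fraction=0.75):
    """Split time-ordered records into train/test parts without shuffling."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Placeholder for time-ordered daily records (oldest first).
records = list(range(20))
train, test = chronological_split(records)          # first 75% / last 25%
folds = list(k_fold_indices(len(train), k=5))       # 5-fold CV on the training part
```

In such a setup, the hyperparameter search would be run on the cross-validation folds only, with the held-out final 25% used once for the reported test scores.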
Table 2 summarizes the prediction performance of the investigated machine learning models on the testing data. Neural networks tend to be the fastest models in terms of training time. We observe that the ensemble learning models (i.e., BST, BT, RF, XGBoost, and LightGBM) predicted WWTP EC with the lowest RMSE and MAE. For instance, BST achieved an RMSE of 35.77 on the training data and, on the test data, MAE = 31.97 MWh, RMSE = 41.37 MWh, and MAPE = 11.25. XGBoost also yielded good results, with RMSE = 41.38 MWh/day, MAE = 32.23 MWh/day, and J2 = 0.95. This could be due to the ability of ensemble methods to reduce prediction errors by combining several weak regressors. Based on the MAPE criterion, the top six models are QSVR, GPRE, GPRRQ, BT, BST, and XGBoost. Table 2 also shows that the GPR models are the most time-consuming to train, followed by the SVR models. Overall, no single approach dominates all the others across all considered statistical scores. The results in Table 2 indicate that there is still room for improvement in WWTP energy consumption prediction.

EC Prediction Using Reduced Models
Now, the aim is to build parsimonious predictive models for WWTP EC prediction. Feature importance identification is the process of determining which input variables in a dataset have the most impact on the outcome of a machine learning model. This knowledge can help to understand the relationship between inputs and outputs, simplify the model, and improve its performance. Importantly, non-informative and redundant input variables are discarded when building a predictive model, which reduces the number of input variables. Here, we used two popular machine learning algorithms, XGBoost and RF, to identify important features in the dataset. Using two methods gives a more robust understanding of feature importance and supports more informed decisions about which features to include or exclude. Figure 9 displays the selection results based on the RF and XGBoost methods. The larger the importance score of an input variable, the more significant its influence on the WWTP EC prediction.
In Figure 9, results from the two feature importance methods indicate that the month is the most important variable, followed by the year. The relationship between total energy consumption and the month may be due to seasonal and weather changes. Including the year as a predictor may not provide meaningful insights beyond capturing the overall trend of energy consumption over time. Factors such as population growth, infrastructure improvements, or policy changes occurring from year to year may influence energy consumption broadly. However, the year variable alone may not offer specific insights into the factors driving energy demand variations within a year. Therefore, we have focused primarily on the month as the key time predictor. Importantly, the month captures the seasonal and weather-related fluctuations that have a more immediate and direct impact on energy consumption patterns in WWTPs.
The two hydraulic variables, average daily inflow and outflow rates, correlate well with the target. As these hydraulic variables are correlated with each other, we selected only one of them (i.e., Q_in). Among the climate variables, Tavg, Tmax, Tmin, and the average humidity have the most effect on EC prediction. As the correlation is 0.55 between the average humidity and Tavg, 0.76 between Tmin and Tmax, and 0.92 between Tavg and Tmax, only a few of these variables need to be retained to construct machine learning models. For example, we can ignore two of the three variables Tavg, Tmax, and Tmin. From Figure 9, the other climate variables, such as WSmax, atmospheric pressure, and visibility, are not relevant and can be ignored. Figure 9 shows that, for wastewater variables, TN, COD, and BOD are correlated with the target [60]. As there is a relatively strong correlation between COD and TN (0.68), COD can be ignored.
Figure 9 also indicates that ammonia appears slightly more impactful than BOD in predicting energy consumption, which could be attributed to several factors. One possible explanation is that ammonia levels in the influent wastewater can indicate the presence of nitrogen-rich compounds, which require additional energy for their removal during the treatment process. The energy-intensive processes involved in nitrogen removal, such as nitrification and denitrification, may contribute to the higher impact of ammonia on energy consumption. On the other hand, BOD represents the amount of biodegradable organic matter in the wastewater. This could potentially explain the relatively lower impact of BOD compared to ammonia on energy consumption in the studied context.
Unexpectedly, influent water quality did not exhibit stronger predictive power for energy consumption than other factors, such as water quantity. This could be attributed to several factors. Firstly, it is important to consider the specific characteristics of the dataset and the operational context of the studied WWTP. The dataset used in this study may have had relatively stable and controlled influent water quality conditions, with limited fluctuations that could have a pronounced impact on energy consumption. Furthermore, energy consumption in WWTPs is influenced by a complex interplay of various factors, including hydraulic conditions, treatment processes, operational strategies, and system design. While water quality parameters are important in the overall treatment process, their individual contribution to energy consumption may be lower than that of other influential factors.
Here, we investigate the performance of the reduced machine learning models, which are built using a subset of features from the original dataset. These models are trained on a reduced dataset that contains only the relevant features and make predictions based on that reduced dataset. Table 3 summarizes the prediction results of the reduced models on the testing data. The comparison shows that KNN outperformed all other models, with the lowest RMSE, MAE, and MAPE errors at 37.33, 28.23, and 10.65, respectively. The reduced KNN model outperforms the static models across all criteria, and its training time decreases from 12 s to 10 s. It is followed by GSVR, BT, RF, and LightGBM, which had the next-lowest RMSE values. In terms of the J2 criterion, WNN has the lowest score of 0.38. Furthermore, neural networks are generally the fastest models to train. In terms of MAPE, the KNN model has the best prediction performance at 10.65, followed by the GSVR model with a MAPE of 11.17 and the GPR models with values between 11.23 and 11.25. Moreover, based on the RMSE criterion, Table 3 shows the six best models: KNN, LightGBM, RF, BT, XGBoost, and GSVR. The KNN and LightGBM models capture more fluctuations, with RMSE values of 37.33 and 41.27, respectively, followed by the RF, BT, XGBoost, and GSVR models with RMSEs between 41.6 and 41.9. Overall, compared with the static models, forecasting improves with fewer features and shorter training times.
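The correlation-based pruning described earlier in this subsection (keeping only one variable from each strongly correlated group) can be illustrated with a short sketch. It is not the authors' procedure: the function names, the 0.6 threshold, and the toy numbers are assumptions made for illustration, and the ranking passed in stands in for an importance order such as the one obtained from RF or XGBoost.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_correlated(columns, ranked_names, threshold=0.6):
    """Walk features from most to least important and keep a feature only if
    its absolute correlation with every already-kept feature is below threshold."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values, made up for illustration (not taken from the Melbourne dataset).
columns = {
    "Tavg": [10.0, 12.0, 15.0, 18.0, 21.0, 19.0],
    "Tmax": [14.0, 17.0, 20.0, 24.0, 27.0, 25.0],
    "Qin": [3.0, 5.0, 2.0, 4.0, 3.5, 6.0],
}
ranked = ["Tavg", "Tmax", "Qin"]  # hypothetical importance order
selected = prune_correlated(columns, ranked)
```

With these toy values, Tmax is dropped because it is almost perfectly correlated with the higher-ranked Tavg, while Qin is retained.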

EC Prediction Using Dynamic Models
The results obtained in the previous experiments are based on static models that do not account for the previous day's energy consumption. However, energy consumption data often exhibit a dynamic nature. This experiment examines how machine learning models perform when past data are incorporated into their construction. To this end, we introduce lagged data, namely the lag 1 energy consumption, when building prediction models to capture the dynamic and evolving nature of energy consumption. These dynamic reduced models are trained on a reduced dataset that contains only the relevant features but also incorporates time-lagged measurements. Figure 10 shows the results of variable importance identification based on the RF and XGBoost algorithms, that is, the effect of each feature on energy consumption. For both RF and XGBoost, the lag 1 energy consumption significantly impacts the target, with a large importance score of around 0.53. This confirms the need to take past data into account when building predictive models for energy consumption. As in the reduced models' experiment, we construct the machine learning models based on training data with selected variables. In dynamic models, such as autoregressive models, it is common to use lagged data to capture the time-dependent nature of the process being modeled. In this study, lag 1 energy consumption is a key predictor because it reflects the immediate past energy consumption, which significantly influences the current energy consumption in the wastewater treatment process. Note that we also investigated the inclusion of other lag orders, such as lag 2 and lag 3 data, and the analysis consistently showed that incorporating lag 1 data provided better prediction results than lag 2 and lag 3 data. This suggests a strong influence of immediate past energy consumption on the current energy consumption in the wastewater treatment process.
After building the models with lag 1 energy consumption as an input variable, we tested the dynamic models using data from 29 January 2018 to 27 June 2019. Table 4 lists the prediction results of the dynamic machine learning models. Results show that XGBoost achieved the best prediction results in terms of RMSE (37.14); it is followed by KNN, GPRE, LightGBM, and BT with RMSE values of 37.33, 37.36, 37.38, and 37.56, respectively (Table 4). However, the training time of XGBoost (398 s) is roughly 30 times that of the second-best model, KNN (13 s), whereas KNN gives up less than 1% in prediction performance with regard to RMSE.
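Constructing the lag 1 input described above amounts to pairing each day's features with the previous day's energy consumption. The sketch below shows one way to do this; the function name and the toy values are hypothetical, and it assumes the records are consecutive days in chronological order with no gaps.

```python
def add_lagged_target(features, target, lag=1):
    """Append target[t - lag] to the feature row at time t.

    The first `lag` rows are dropped because they have no lagged value.
    Returns (rows, aligned_target).
    """
    if lag < 1 or lag >= len(target):
        raise ValueError("lag must be between 1 and len(target) - 1")
    rows = [list(features[t]) + [target[t - lag]] for t in range(lag, len(target))]
    return rows, list(target[lag:])


# Toy example: each row is [month, inflow]; energy values are made up.
features = [[1, 0.5], [1, 0.6], [1, 0.4], [1, 0.7]]
energy = [250.0, 260.0, 240.0, 255.0]

X, y = add_lagged_target(features, energy, lag=1)
```

Here `X[0]` combines the second day's features with the first day's energy consumption, so a model is never given the value it is asked to predict.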
Overall, XGBoost takes approximately 30 times longer to train than KNN, while the difference in prediction performance between the two models is relatively small, with XGBoost performing slightly better in terms of RMSE. This suggests that XGBoost may be more suitable for applications where the highest level of accuracy is desired, while KNN may be more suitable for applications where computational efficiency is a priority. Hence, a trade-off between prediction accuracy and computational time may need to be considered when selecting a machine learning model for energy consumption prediction in WWTPs. Figure 11a,b shows heatmaps of the MAPE and RMSE values of the twenty-three models (static, reduced, and dynamic models). The dynamic reduced models that incorporate lagged energy consumption data show better prediction results in terms of RMSE and MAPE than the static and reduced models (Figure 11). This suggests that considering past energy consumption data leads to more accurate predictions of energy consumption in WWTPs. Among the dynamic models, the ensemble models, led by XGBoost, show good prediction performance in terms of RMSE and MAPE, although the training time of XGBoost is considerably higher than that of other models such as KNN.
In summary, machine learning is a powerful tool for predicting energy consumption in WWTPs. By analyzing the available input-output data, machine learning models can identify patterns and relationships that can be used to make accurate predictions about energy consumption. This can help WWTPs reduce energy costs by identifying opportunities for energy efficiency and optimizing the treatment process in several ways. By accurately forecasting energy consumption, operators can implement proactive measures to optimize energy usage. For example, if the prediction model indicates a peak in energy demand during a specific time period, operators can schedule the operation of energy-intensive equipment during off-peak hours or consider alternative energy sources to minimize costs. Furthermore, by analyzing the factors influencing energy consumption, such as influent characteristics, operational parameters, and treatment processes, WWTPs can identify specific areas where energy efficiency improvements can be made. This analysis may reveal opportunities to optimize process parameters, retrofit equipment with energy-saving technologies, or implement advanced control strategies to reduce energy waste. Additionally, predicting energy consumption can support decision-making in allocating resources and investments. WWTPs can prioritize projects and investments based on the predicted energy demands and potential energy savings. This allows for targeted interventions and resource allocation towards areas that yield the greatest energy efficiency improvements, resulting in long-term cost reductions. Moreover, by continuously monitoring and updating the predictive model, WWTPs can assess the effectiveness of energy-saving initiatives over time and fine-tune their energy management strategies. This iterative process enables the identification of further optimization opportunities and the implementation of adaptive measures to achieve sustained energy efficiency gains.
In the context of energy consumption prediction in WWTPs, dynamic models are more suitable, as they can capture the temporal dynamics of the data and make more accurate predictions about future energy consumption. It is important to note that while XGBoost performed well in terms of RMSE, other models such as KNN and GPR performed well in terms of other evaluation metrics. Additionally, the training time of the models should also be taken into consideration. Therefore, it is recommended to use a combination of different models and evaluation metrics to optimize energy consumption prediction in WWTPs.
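The accuracy versus training time trade-off between the two leading dynamic models can be checked with a few lines of arithmetic. The sketch below uses only the RMSE and training-time figures quoted in the text for the dynamic XGBoost and KNN models (Table 4); the variable names are hypothetical.

```python
# Figures quoted in the text for the dynamic models (Table 4).
xgb_rmse, xgb_train_s = 37.14, 398.0
knn_rmse, knn_train_s = 37.33, 13.0

# How many times longer XGBoost takes to train than KNN.
time_ratio = xgb_train_s / knn_train_s

# Relative RMSE penalty of choosing KNN over XGBoost, in percent.
rmse_gap_percent = 100.0 * (knn_rmse - xgb_rmse) / xgb_rmse

print(f"training time ratio: {time_ratio:.1f}x")
print(f"RMSE penalty for KNN: {rmse_gap_percent:.2f}%")
```

On these figures the RMSE penalty for KNN stays below 1%, while the training-time ratio is about 30, which is the basis for treating KNN as the more economical choice when retraining is frequent.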

Conclusions
This study investigates the application of machine learning techniques for predicting energy consumption in WWTPs. Real data from a WWTP in Melbourne are used, and a range of machine learning models, including kernel-based methods, ensemble learning methods, ANN models, decision trees, and k-nearest neighbors, are assessed. Feature importance scores from Random Forest and XGBoost are employed to select variables and enhance model efficiency. The findings demonstrate that incorporating past data through dynamic models, specifically time-lagged measurements, improves the accuracy of energy consumption predictions. The dynamic K-nearest neighbors model emerges as the top-performing model. It is important to highlight that while XGBoost excels in terms of RMSE, other models such as KNN and GPR exhibit strong performance on different evaluation metrics. Furthermore, considering the training time of the models is crucial. To optimize energy consumption prediction in WWTPs, it is recommended to employ a combination of diverse models and evaluation metrics.
The analysis of the Melbourne East WWTP data demonstrated that multiple variables significantly influenced EC. Among these variables, month, TN, ammonia, daily temperature, humidity, and influent flow showed the highest impact on EC in the WWTP. However, our investigation also revealed that factors such as rainfall, atmospheric pressure, and wind speed did not exhibit significant effects on EC in the WWTP. Furthermore, our findings indicated that incorporating lag 1 EC data improved the predictive performance of the models. These results provide valuable insights into the factors influencing EC in the Melbourne East WWTP and highlight the potential benefits of considering these variables and lagged energy consumption in future energy consumption prediction models.
The proposed framework presented in this study can be customized and implemented in other WWTPs by incorporating plant-specific data and relevant variables. While the specific conclusions drawn from our research may not directly translate to other plants due to differences in operational conditions and data characteristics, the underlying principles, methodologies, and insights gained from our study can serve as valuable references for other pollution treatment plants. By adapting the framework to their specific context, other WWTPs can leverage the knowledge and approaches developed in our study to enhance their understanding and prediction of energy consumption in their respective systems.
There is still room for improvement in energy consumption prediction for WWTPs using machine learning.

• Despite optimizing the prediction model through variable selection methods, it is important to acknowledge that our model's predictive capability could be influenced by other variables that were not included due to data limitations. Future research should focus on exploring and incorporating a broader range of variables to enhance the accuracy and comprehensiveness of energy consumption prediction models in WWTPs. This could involve considering additional variables related to process conditions, influent characteristics, operational parameters, and external factors such as climate and regulatory changes. By incorporating these variables, we can improve the predictive power of the models and gain a more comprehensive understanding of the factors impacting energy consumption in WWTPs.
• In future work, we will emphasize the need for additional studies that focus on validating the feasibility and utility of these models in real-world scenarios. This will involve considering factors such as computational requirements and operational constraints commonly encountered in real WWTP settings.
• Deep learning models, known for their ability to handle time-series data, present an intriguing avenue for further exploration in forecasting energy consumption in WWTPs. These models, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, have demonstrated promising capabilities in capturing temporal dependencies and patterns within time-series data [61,62]. By leveraging their strengths, deep learning models could improve the accuracy and precision of energy consumption forecasts in WWTPs.
• Another possibility for improvement is integrating wavelet-based multiscale data representation with machine learning models. This approach would take into account the temporal and frequency characteristics of the data and could potentially improve the accuracy of the prediction models. Wavelet-based multiscale representation can also be used to extract relevant features and patterns from the data, which could be used to improve the performance of the machine learning models. This approach could potentially provide more accurate predictions and lead to further optimization of energy consumption in WWTPs.

Figure 1. Measured variables in this study.

Figure 2. Distribution of some of the used variables.

Figure 3 illustrates the total inflow and energy consumption for each year from 2014 to 2018. The data in Figure 3 are based on the available dataset, which covers the period from January 2014 to June 2019. However, due to the limited data available for 2019 (only six months), we have excluded the 2019 data from Figure 3 to avoid any potential confusion. The figure specifically focuses on the years from 2014 to 2018 to provide a clear representation of the yearly variations in inflow and energy consumption. From Figure 3, we observe a high similarity in the variation between the inflow and the consumed energy. Essentially, the volume and composition of wastewater (inflow) that needs to be treated can affect the energy consumption of a WWTP. An increase in the volume of wastewater can increase the energy consumption of pumps and other equipment used to move the water through the treatment process. Additionally, changes in the composition of the wastewater, such as an increase in the amount of organic matter, can increase the energy consumption of the biological treatment process.

Figure 3. Yearly sum of inflow (a) and energy consumption (b) from 2014 to 2018.

Figure 4 depicts the distribution of the collected data, indicating that these datasets are non-Gaussian distributed. Here, kernel density estimation (KDE) [35], a non-parametric method, is applied to estimate the underlying distribution of each variable. KDE is a powerful and flexible method for estimating the distribution of a dataset; it does not make any assumptions about the underlying distribution and can be used to estimate any kind of distribution. Figure 4 allows us to see the shape of the distribution, the presence of outliers, and the general characteristics of the data. It is an effective way of understanding the nature of the data and the underlying patterns in it. To visualize the distribution of energy consumption over the years, Figure 5 displays the boxplots of yearly WWTP power consumption during the studied period from 2014 to 2019. Note that data for 2019 are available for only six months, which is why the boxplot for that year is more compact than those of the other years. From Figure 5, we observe that the annual distribution of energy consumption in 2018 has slightly decreased in average value and standard deviation compared to 2017. This decrease in energy consumption could be due to the operator's optimization and management of the WWTP.

Figure 4. Distribution of some of the used variables.

Figure 5. Distribution of annual energy consumption over the course of the study period.

Figure 6. Distribution of monthly energy consumption over the course of the study period.

Figure 7. The investigated machine learning methods in this study.

Figure 8. Illustration of the general framework of machine learning-based prediction procedure.

Figure 9. Feature importance identification using RF and XGBoost methods with all input variables.

Figure 10. Feature importance identification using RF and XGBoost methods with all original input variables and Lag 1 Energy consumption.

Figure 11. Heatmap of (a) MAPE and (b) RMSE values obtained using the twenty-three models.

Table 2. Prediction results of full machine learning models based on testing data.

Table 3. Prediction results of the reduced models based on testing data.

Table 4. Prediction results of the dynamic models based on testing data.