Gene Expression Programming Coupled with Unsupervised Learning : A Two-Stage Learning Process in Multi-Scale , Short-Term Water Demand Forecasts

This article proposes a new general approach in short-term water demand forecasting based on a two-stage learning process that couples time-series clustering with gene expression programming (GEP). The approach was tested on the real life water demand data of the city of Milan, in Italy. Moreover, multi-scale modeling using a series of head-time was deployed to investigate the optimum temporal resolution under study. Multi-scale modeling was performed based on rearranging hourly based patterns of water demand into 3, 6, 12, and 24 h lead times. Results showed that GEP should receive more attention among the emerging nonlinear modelling techniques if coupled with unsupervised learning algorithms in detailed spherical k-means.


Introduction
Water demand forecasting is a key predictive analytic among researchers in the field of water resource management.Due to a shortage of potable water, a lack of access to a mitigation of water resources in semi-arid and arid regions, climate change, and rapid worldwide urbanization [1], public awareness and concerned water authorities have led researchers to come up with novel techniques in this endeavor.Consumer satisfaction is the prime objective of water utility operators.However, population growth has caused infrastructures dealing with a significant amount of stress to meet sustainable states of engineered systems.A key parameter for water distribution systems (WDS) operators in decision making for pumping schedules, storage, treatment, and distribution of water is an accurate and reliable forecast of short-term water demand [2].Such a forecast can certainly aid the operators with an optimized water supply system that takes a fairly accurate demand of water into account for the future [3].Over recent years, there has been a boom in data analytics [4], which has brought to the attention of researchers and engineers that traditional approaches are not enough in designing a WDS for its future state.Predictive models are becoming increasingly popular, since more data are available now than in the past.This popularity is highlighted in the field of water demand even more due to a lack of records on the consumption of water in the past.Indeed, the water utilities' archives are relatively poor in regards to water demand data with higher sampling rates.Unfortunately, most utilities have readings of water consumption every few months.Therefore, short-term forecasts of water demand are usually based on the experience of the WDS operators in situations in which SCADA systems are not yet deployed.On the other hand, water usage is really different than actual production due to the high percentage (10-50%) of water loss or leakage through WDS [5].Indeed, there is a need for a comprehensive assessment of temporal resolutions through time-scale modelling, which can assess different time-scales for short term water demand.
A survey of scholarly research articles in the field of water demand forecasting reveals the novelty of these works was due to the consideration of temporal resolutions, the type/number of input variables fed into models, and modelling approaches.There is a common convention among water demand forecast modellers that short term forecasts are those targeting temporal resolutions hourly, daily, or weekly that are used for operational purposes of WDS [3].Furthermore, other researchers considered temperature, precipitation, and humidity in their analysis [6][7][8][9][10][11].The majority of the models in the literature are data-driven techniques, using water demand with a lead time to predict the future demand.This research should be categorized under short-term forecasts due to the temporal resolutions selected.Timescale modelling proposed in this research was aimed to help the decision makers tackle the data acquisition challenges of water utilities in their operations.Not all the water utilities use supervisory control and data acquisition (SCADA) systems to keep track of water consumption.Thus, this research targeted short term demand forecasting in a long time span (1 year) for the sake of operation, as well as management.Researchers have used multiple approaches for predictive analytics in the field of water demand forecasts, from very simple projections to the sophisticated data-driven and machine learning models used these days.Early stages of this field show linear regression was adopted by some researchers [12][13][14]; however, due to nonlinear trends in the characteristics of water demand, time series analysis or using the periodic pattern of data has been used by others [15,16].The majority of recent research uses data-driven techniques with learning algorithms for predictive analysis.Artificial neural networks (ANN) are probably the most popular ones [17][18][19], also most studies are done with slight changes in such models when pre-processing the data or changes in the structures of the defined ANN models.One study combined ANN with the wavelet bootstrapping machine learning approach as a hybrid model to improve performance of the models by pre-processing the data [20].In another study, performance of ANN was improved through a hybrid approach using the Fourier time series to model the water demand forecast [21].Another recent study proposes the coupling of the kernel partial least squares-autoregressive moving average with wavelet transformation as a hybrid approach for modeling annual urban water demand [22].On the other hand, support vector regression/machine (SVR/SVM)-based models have become increasingly popular recently [23][24][25][26].Other data-driven techniques, which are not that common, are random forests and multivariate adaptive regression splines [24].
Inspired by Darwin's theory of evolution, Ferreira introduced genetic expression programing (GEP), which brings the optimum selection of input variables in regressions/function findings [27].The GEP is used in many engineering disciplines [28][29][30][31], and its operating functions are subjected to a vigorous learning process to find the optimum ones to use in the gene structures.As a tool to perform data-driven or self-learning techniques, GEP has some advantages over the conventional predictive models.GEP defines individual block structures (input variables, response and function sets) and selects the optimized operating functions and multipliers through the process of learning algorithms.Furthermore, GEP has a built-in sensitivity analysis that selects the most important variables.The ability of GEP to propose a function/equation at the end of analysis is also unique, while most other data-driven techniques are considered as a black-box model that fails to provide a mathematical function.Therefore, this research investigated the performance of GEP in water demand forecast models for short-term time-scales, which has not yet been explored in this field.The increasing use of time series, also due to the adoption of high-frequency sensors and devices, has initiated many research and development attempts in time series data mining.Time series clustering is only a part of the effort in time series data mining research, but it has always aroused great research interest.The data mining approaches with regard to time-series data are often categorized into pattern recognition and clustering, classification, rule discovery, and summarization [32].This research proposed a coupled deployment of a two-stage learning process, with clustering as unsupervised learning followed by GEP as supervised learning process to forecast short-term water demand.The main outcome of this paper study is:

•
Evaluation of GEP as an alternative to other black-box models used in the literature that have not been explored by other researchers in the field of short-term water demand forecasting • Investigation of coupling time series clustering with GEP in short-term water demand forecasts to reduce the adverse effect of seasonality and holidays/working days on performance of proposed forecast models • Proposing a suitable sampling frequency for WDS operators through a time-scale modeling process

Model Development
The input design of the proposed models is based on reaching a broad understanding of the nature of input factors in the data-driven model, the self-interaction of the water demand, and the use of appropriate lag times in demand forecasting models (Figure 1).This is labelled as K a HT b OP c , in which a ∈ [1,2,3,4,5,6]  A total of 90 models (6 clusters × 5 head-times × 3 operators) were created.Through a two-stage spherical k-means clustering, data were divided into six different groups.Five different head-times were used to perform a time-scale modeling to obtain the optimum temporal resolution (1, 3, 6, 12, 24 h).Three types of mathematical operators [OP 1 , {+, −, ×}; OP 2 , {+, −, ×, ×2, ×3}; OP 3 , {+, −, ×, ×2, ×3, √ , ex, log, ln}] were used in the developing of the GEP models.All of these 90 models were used in partitions of 80% for train and 20% for test sets.
Water 2018, 10, x FOR PEER REVIEW 3 of 13 research proposed a coupled deployment of a two-stage learning process, with clustering as unsupervised learning followed by GEP as supervised learning process to forecast short-term water demand.The main outcome of this paper study is: • Evaluation of GEP as an alternative to other black-box models used in the literature that have not been explored by other researchers in the field of short-term water demand forecasting • Investigation of coupling time series clustering with GEP in short-term water demand forecasts to reduce the adverse effect of seasonality and holidays/working days on performance of proposed forecast models • Proposing a suitable sampling frequency for WDS operators through a time-scale modeling process

Model Development
The input design of the proposed models is based on reaching a broad understanding of the nature of input factors in the data-driven model, the self-interaction of the water demand, and the use of appropriate lag times in demand forecasting models (Figure 1).This is labelled as KaHTbOPc, in which a ∈ A total of 90 models (6 clusters × 5 head-times × 3 operators) were created.Through a two-stage spherical k-means clustering, data were divided into six different groups.Five different head-times were used to perform a time-scale modeling to obtain the optimum temporal resolution (1, 3, 6, 12, 24 h).Three types of mathematical operators [OP1, {+, −, ×}; OP2, {+, −, ×, ×2, ×3}; OP3, {+, −, ×, ×2, ×3, √, ex, log, ln}] were used in the developing of the GEP models.All of these 90 models were used in partitions of 80% for train and 20% for test sets.

Unsupervised Learning: K-Means Clusters
The same data were clustered in a study by one of our authors in [33], in which SVM regression is used as forecasting mechanism; therefore, further details on the clustering stage can be found in the related paper.The clustering procedure is briefly summarized here.The time-series used all have the same length: a (1 × 24) vector representing daily water demand.The following picture shows an example of a typical daily water demand pattern selected randomly for illustration.

Unsupervised Learning: K-Means Clusters
The same data were clustered in a study by one of our authors in [33], in which SVM regression is used as forecasting mechanism; therefore, further details on the clustering stage can be found in the related paper.The clustering procedure is briefly summarized here.The time-series used all have the same length: a (1 × 24) vector representing daily water demand.The following picture shows an example of a typical daily water demand pattern selected randomly for illustration.
As Figure 2 shows, the nature of water demand is bound to the temporal shifts, which are due to the habits of consumers in a 24 h time window.Therefore, triangle similarity (also known as cosine similarity) is used in the clustering algorithm, since it can effectively deal with "similarity in time" (temporal alignment of peaks and bursts) [33].More specifically, two time series are considered similar in time when they vary in a similar way on each time step.Other possible similarities for comparing time series are "similarity in shape", based on the occurrence of trends at different times or similar sub patterns, and "similarity in change", which identifies two series as similar according to their variations from time step to time step.As water consumption behaviours can be associated with the recurrent occurrence of peaks and bursts at some hours of the day, the triangle similarity, which is a similarity in time measure, was used to compare and cluster water demand time series.Triangle similarity is the cosine value of the angle between two vectors and is computed as: It is equal to 1 if x and y have the same orientation, 0 if they are orthogonal, and −1 if they have opposite orientations, independently of their magnitude.Since the components of the urban water demand vectors are not negative, triangle similarity varies in [0; 1].As Figure 2 shows, the nature of water demand is bound to the temporal shifts, which are due to the habits of consumers in a 24 h time window.Therefore, triangle similarity (also known as cosine similarity) is used in the clustering algorithm, since it can effectively deal with "similarity in time" (temporal alignment of peaks and bursts) [33].More specifically, two time series are considered similar in time when they vary in a similar way on each time step.Other possible similarities for comparing time series are "similarity in shape", based on the occurrence of trends at different times or similar sub patterns, and "similarity in change", which identifies two series as similar according to their variations from time step to time step.As water consumption behaviours can be associated with the recurrent occurrence of peaks and bursts at some hours of the day, the triangle similarity, which is a similarity in time measure, was used to compare and cluster water demand time series.Triangle similarity is the cosine value of the angle between two vectors and is computed as: It is equal to 1 if x and y have the same orientation, 0 if they are orthogonal, and −1 if they have opposite orientations, independently of their magnitude.Since the components of the urban water demand vectors are not negative, triangle similarity varies in [0; 1].The main objective of the clustering phase was to group the days in clusters that can consider the seasonality in water consumption behaviours.In order to reach the evidence of this seasonality, a two-level clustering was performed to group the months in the first stage, followed by clustering of the days with the "month clusters".
The unsupervised learning algorithm used in this paper was K-means, available as "skmeans" among R packages.The following equation shows how the cosine similarity concept is implemented through this algorithm.Calinski-Harabasz and Silhouette proposed methods to measure validity of the numbers of selected clusters [34].Therefore, they were used to find the optimum number of K in this study.More details on this approach are explained in [33].

Average Mutual Information
Performing data-driven models requires selection of an appropriate lag time, as it improves the computational efficiency by adding more valuable information for training, as inputs feeding the GEP models.Traditionally, lag times have been used up to the point where using more lag times would not result in a model's improvement.Some methods like autocorrelation function (ACF) or  The main objective of the clustering phase was to group the days in clusters that can consider the seasonality in water consumption behaviours.In order to reach the evidence of this seasonality, a two-level clustering was performed to group the months in the first stage, followed by clustering of the days with the "month clusters".
The unsupervised learning algorithm used in this paper was k-means, available as "skmeans" among R packages.The following equation shows how the cosine similarity concept is implemented through this algorithm.Calinski-Harabasz and Silhouette proposed methods to measure validity of the numbers of selected clusters [34].Therefore, they were used to find the optimum number of K in this study.More details on this approach are explained in [33].

Average Mutual Information
Performing data-driven models requires selection of an appropriate lag time, as it improves the computational efficiency by adding more valuable information for training, as inputs feeding the GEP models.Traditionally, lag times have been used up to the point where using more lag times would not result in a model's improvement.Some methods like autocorrelation function (ACF) or correlation Water 2018, 10, 142 5 of 13 integral (CI) methods could bring an educated guess about which lag time should be used in the development of such models [35][36][37].However, we have used average mutual information (AMI), which does not require large data sets, unlike the ACF and CI methods; it has also been widely used in the field of hydro-informatics in recent years [38].The following equation is the method for computing AMI for each one of the data sets designed as input variables of the GEP models.It simply uses the joint probability of two successive time series, as well as the marginal probability of them.It should be noted that it is similar to Shannon's entropy; therefore, it shows less entropy will be in the selection of the optimum lag time based on AMI.
In this equation, the joint probability of two successive time series P(X(i), X(i + τ)) and the product of their individual marginal probability are used to find the appropriate lag time.This delay can contribute to the maximum valuable information added on X(i) by the successive time series X(i + τ).

Gene Expression Programing
GEP's learning process begins with random generation of chromosomes for the given raw data/population.The generated populations work with two entities: chromosomes and expression trees.Environment selection or the fitness criteria will evaluate which of the offspring solutions can outclass the others.This repetitive process will eventually deliver a good candidate to be selected.In this study, the general settings of the learning/training algorithms were 30 chromosomes, 8 head size, and 3 numbers of genes as suggested.Selection of the head size determined the complexity of each one of the parameters.Each head of the genes was exposed to a variety of arrangements prior to feeding data into the models.Reproduction of the randomly generated populations could reach the superior model with the optimized stopping condition.Figure A1 (Appendix A) shows the expression or tree diagram of the proposed model.As shown, the model was based on 3 genes (sub-expression tree diagrams) linked together by addition function.The number of genes used in a chromosome characterized the different layers/blocks building the whole structure.Using a very big gene could result in more accurate models; however, in this study chromosomes are used in smaller units for simpler computation as a limited number of generations were used.The last part of GEP modelling is selection of stopping condition-function.Root mean square of error (RMSE) was used as fitness function to fit a curve to target values.The stopping condition was 3000 generations for all the runs or models in order to have a fair comparative assessment between all input designs.
Logical sequence of steps in function finding through nonlinear regression of GEP is explained below: Expressing chromosomes in a tree diagram with subsets • Comparing the new offspring solutions based on fitness criteria • Keeping the best solution, followed by reproduction methods like replication, mutation, recombination, etc.

Study Area and Data Collection
The proposed approach mentioned in the previous section was applied to the urban water demand in Milan, Italy, recorded between 1 October 2012 and 30 September 2013.The WDS in Milan serves drinking water to more than 5000 buildings in this city, which serves a population of approximately 1 million people.Other features of this WDS are listed below: Samples of water demand are recorded through a Supervisory Control and Data Acquisition system (SCADA).Quantity of water pumped into the network by each one of the pumping stations in the city is collected with a sampling rate of 1 sample/min.Collected data are then sent to the centralized SCADA, which sum measures over the different stations to provide the overall water demand of the city.Moreover, SCADA allows for modifying time scale; more specifically, data used in this study were retrieved from SCADA according to an hourly resolution.The multi-scale analysis of the data was performed by scaling the recorded data to head-time bases of hourly, 3 h, 6 h, 12 h, and 24 h.The scaled data were then prearranged into a time-series dataset D = {x 1 , x 2 , . . ., x n } consisting of n vectors, one for each day in the observation period, in which each vector x i is a set of 24 ordered values for hourly, followed by 8, 4, 2, and 1 for the rest of scaled data, which are under study for the i-th day.

Results
The main purpose of clustering in data mining is to illustrate the typical patterns of the trend in consumption of water that are inferred from recurrent peak/burst hours depending only on consumers' habits.This assignment is done without using any information about the data in the learning process (time of the day or week, and working/non-working days).Since the trend of water consumption follows a specific pattern shown earlier in this paper, "cosine similarity", known as triangle similarity, was opted for as a similarity index in the clustering process (spherical k-means).
Results of the time-series data clustering (i.e., a two-level clustering approach) allow us to identify 6 typical daily urban water demand patterns (i.e., consumption behaviours), in which the number of clusters was defined according to the best values of the Silhouette and Calinski-Harabasz indices (respectively, 0.74 and 97.87, averaged on the two-level clustering) obtained by varying the possible number of clusters from 2 to 24.More details regarding the detailed results of clustering can be found in [33].
The following Figure 3 shows a calendar with the cluster assignment for every day of the observation period.This kind of visualization makes seasonality and cyclic behaviours more evident.Looking into the 3 clusters of the first stage, one can identify these clusters as (1) Months of November, December, January, February, and March, which correspond to Fall and Winter; (2) Months of April, May, June and September, and October; and (3) July and August, which correspond to the period of largest consumption during summer holidays.The second level of clustering is to target the working/non-working days, which shows how holidays can affect the consumption behaviour of water demand.Samples of water demand are recorded through a Supervisory Control and Data Acquisition system (SCADA).Quantity of water pumped into the network by each one of the pumping stations in the city is collected with a sampling rate of 1 sample/min.Collected data are then sent to the centralized SCADA, which sum measures over the different stations to provide the overall water demand of the city.Moreover, SCADA allows for modifying time scale; more specifically, data used in this study were retrieved from SCADA according to an hourly resolution.The multi-scale analysis of the data was performed by scaling the recorded data to head-time bases of hourly, 3 h, 6 h, 12 h, and 24 h.The scaled data were then prearranged into a time-series dataset D = {x1, x2, …, xn} consisting of n vectors, one for each day in the observation period, in which each vector xi is a set of 24 ordered values for hourly, followed by 8, 4, 2, and 1 for the rest of scaled data, which are under study for the i-th day.

Results
The main purpose of clustering in data mining is to illustrate the typical patterns of the trend in consumption of water that are inferred from recurrent peak/burst hours depending only on consumers' habits.This assignment is done without using any information about the data in the learning process (time of the day or week, and working/non-working days).Since the trend of water consumption follows a specific pattern shown earlier in this paper, "cosine similarity", known as triangle similarity, was opted for as a similarity index in the clustering process (spherical k-means).
Results of the time-series data clustering (i.e., a two-level clustering approach) allow us to identify 6 typical daily urban water demand patterns (i.e., consumption behaviours), in which the number of clusters was defined according to the best values of the Silhouette and Calinski-Harabasz indices (respectively, 0.74 and 97.87, averaged on the two-level clustering) obtained by varying the possible number of clusters from 2 to 24.More details regarding the detailed results of clustering can be found in [33].
The following Figure 3 shows a calendar with the cluster assignment for every day of the observation period.This kind of visualization makes seasonality and cyclic behaviours more evident.Looking into the 3 clusters of the first stage, one can identify these clusters as (1) Months of November, December, January, February, and March, which correspond to Fall and Winter; (2) Months of April, May, June and September, and October; and (3) July and August, which correspond to the period of largest consumption during summer holidays.The second level of clustering is to target the working/non-working days, which shows how holidays can affect the consumption behaviour of water demand.Centroids of the 6 prototypes of daily water demand are shown in Figure 4; looking into this figure, one can easily recognize the differences in the peaks of the mornings and evenings that differentiate these 6 clustered prototypes.To be more precise, we can capture a meaningful definition that shows the peak in the mornings of holidays and weekends is always delayed by approximately 1 h compared to that of working days for each period of the year.
Water 2018, 10, x FOR PEER REVIEW 7 of 13 Centroids of the 6 prototypes of daily water demand are shown in Figure 4; looking into this figure, one can easily recognize the differences in the peaks of the mornings and evenings that differentiate these 6 clustered prototypes.To be more precise, we can capture a meaningful definition that shows the peak in the mornings of holidays and weekends is always delayed by approximately 1 h compared to that of working days for each period of the year.Results of the AMI code applied on the rescaled time-series for each head-time under study are shown in Figure 5.In this bar chart, X-axis represents all 30 models (6 clusters × 5 head-times) labelled accordingly.Y-axis is the computed AMI value, which is discussed earlier in this paper through equation 2. These values were used to define the data-driven GEP models in this study.It is important to note that the AMI values make sense in that they are always at their maximum value when the head-time is 1 h, except for K4, in which the maximum is in head-time 3.This exception is due to the fact that the pattern of consumer behaviours shifts temporally.Moreover, more data is acquired when time-scale is on an hourly basis.Results of the AMI code applied on the rescaled time-series for each head-time under study are shown in Figure 5.In this bar chart, X-axis represents all 30 models (6 clusters × 5 head-times) labelled accordingly.Y-axis is the computed AMI value, which is discussed earlier in this paper through equation 2. These values were used to define the data-driven GEP models in this study.It is important to note that the AMI values make sense in that they are always at their maximum value when the head-time is 1 h, except for K 4 , in which the maximum is in head-time 3.This exception is due to the fact that the pattern of consumer behaviours shifts temporally.Moreover, more data is acquired when time-scale is on an hourly basis.
Centroids of the 6 prototypes of daily water demand are shown in Figure 4; looking into this figure, one can easily recognize the differences in the peaks of the mornings and evenings that differentiate these 6 clustered prototypes.To be more precise, we can capture a meaningful definition that shows the peak in the mornings of holidays and weekends is always delayed by approximately 1 h compared to that of working days for each period of the year.Results of the AMI code applied on the rescaled time-series for each head-time under study are shown in Figure 5.In this bar chart, X-axis represents all 30 models (6 clusters × 5 head-times) labelled accordingly.Y-axis is the computed AMI value, which is discussed earlier in this paper through equation 2. These values were used to define the data-driven GEP models in this study.It is important to note that the AMI values make sense in that they are always at their maximum value when the head-time is 1 h, except for K4, in which the maximum is in head-time 3.This exception is due to the fact that the pattern of consumer behaviours shifts temporally.Moreover, more data is acquired when time-scale is on an hourly basis.Table 1 shows the performance of GEP models in the testing period separated by the performance indices used (MAE, RMSE, R 2 , and MAPE).It is important to note that these data are averaged on each cluster, since the overall performance is what matters to predictive modelers.However, the detailed results are shown in Table A1 (Appendix A) for further investigation of these performances.Results showed the hourly-based models could significantly outperform the other sampling rate/frequencies.It was expected that hourly-based models could come out on top, since data-driven methods perform better with larger data sets.It is known that aggregating the data temporally can improve forecasting due to the temporal averaging which reduces the noise; however, since averaging was not the focus of this study, higher head times did not improve the models.Another valid reason is "oversampling", which can increase the resolution of the data; consequently, there would be a better definition of the trend in the time series.Moreover, it is known that temporal aggregations can lead to loss of valuable information about the primary data process.MAE measures the mean absolute error between prediction and actual values.The GEP operators performed similarly, as MAE values are very close to each other (MAE of 0.2319 ± 0.0391 being the best using 2nd GEP operator Table 1a).The same story was repeated for RMSE, as 2nd GEP operator outperformed other models with RMSE of 0.3048 ± 0.0632 Table 1b.The third GEP operator was the best among all R 2 , with the highest value of 0.8962 ± 0.0569 averaged on clusters.Interestingly, the performance of 6-h head-time models was significantly better than 3-h models.This improved performance might be because the behaviour of consumer's demand or the temporal shifts of the consumption can be better defined with a frequency of 1 sample per 6 h rather than 3 h.In other words, resolution of the data might be better represented using this time-scale.Moreover, it is known that performance of these models deteriorated with less frequent data for 12 and 24 h due to much smaller data sets within these time-scales.Results showed that the mathematical operations used in GEP operators do not play a very significant role; however, since these data are not linear, the second {+, −, ×, ×2, ×3} and third operators {+, −, ×, ×2, ×3, √ , ex, log, ln} consistently outperformed the first one {+, −, ×} in all simulations.MAPE (minimum absolute percentage error) is used just for comparing testing sets, since it provides a relative magnitude in terms of percent values.Like other performances indices, the 1 h head time models could outperform others with a value of 0.8850 ± 0.0089 averaged on clusters.; however, all values were scaled between 0 and 1 for fair comparative assessment between the head times under study.
Water 2018, 10, 142 9 of 13 K 3 HT 1 OP 1 is further investigated as one of the superior models to show how GEP models could perform when hourly data is acquired.Table A1 (Appendix A) shows K 3 HT 1 OP 1 is selected with highest values of R 2 , 0.95, and 0.93 for training and testing data set accordingly.A model that is neither over nor under-trained should have similar performances in test and train periods, an important point overlooked by many scholars in our area.On the other hand, MAE and RMSE values were also the lowest compared to other models with 0.19 and 0.24 for training set, followed by 0.18 and 0.23 for testing set. Figure A2 illustrates how close the prediction and actual demand are for this particular model.

Conclusions
The prime objective of this paper was to propose a coupled deployment of supervised and unsupervised learning in short-term water demand forecast models.Time-scale modeling of water demand can lead to a comprehensive understanding of the temporal resolution of this complex system, mainly the pattern of consumption.Having access to hourly water consumption is not very common in many countries' WDS; therefore, this approach gives an idea about the magnitude of errors in the comparative assessments between the time-scale models, helping the operators come up with an appropriate data acquisition frequency.
In the proposed two stage approach, spherical k-means clustering was used to organize daily water demand patterns into six different clusters based on the computed Silhouette and Calinski-Harabasz indices.Gene expression programming was further used, as our supervised learning part of the approach, to model these six clusters separately.The measurement of errors in this paper was done using four performance indices on both training and testing data sets.MAE, RMSE, R 2 , and MAPE are the common methods of error measurement in the field of hydro-informatics, and they are widely adopted among researchers.
Results of this study proved GEP should receive more attention in this area due to the highly accurate predictions it can provide while coupled with unsupervised learning algorithms.It is not a black-box model like the majority of the proposed ANN or SVM models; therefore, meaningful self-interactive relations within the input water demand will be provided, as well as a mathematical equation (through nonlinear regression) to be used by operators of WDS (Appendix A, Figure A1 and Python code in Supplementary Materials for detailed equation).The proposed approach could have a profound impact on the operations of water utilities, as well as on managerial decisions.The frequency of the collected data is a major decision that is used to plan for the next hour, week, month, or even year.The seasonality of water demand patterns is not a new thing; however, the two-level clustering provided (in a completely unsupervised and data-driven paradigm) six groups of data that are not usually used in classified data of predictive models in this field.The positive impacts of this approach are a better understanding of how one can utilize the time series of water demand in short-term forecasting through a completely data-driven technique with a fair understanding of how to opt for the suitable temporal resolution, the proper lag time of the feeding inputs to the models, and an equation/function that can help the operators to use a wide range of models based on their desired duration or the time of year.

Figure 1 .
Figure 1.Schematic of the proposed approach.

Figure 1 .
Figure 1.Schematic of the proposed approach.

Figure 2 .
Figure 2.An example of time series data representing hourly water consumption in a day (L/h).

Figure 2 .
Figure 2.An example of time series data representing hourly water consumption in a day (L/h).

Figure 3 .
Figure 3. Cluster assignments over the observation period.

Figure 3 .
Figure 3. Cluster assignments over the observation period.

Figure 4 .
Figure 4.The k = 6 typical water demand patterns identified through the two-level clustering procedure.The cluster assignment is the same as the previous Figure 3.

Figure 4 .
Figure 4.The k = 6 typical water demand patterns identified through the two-level clustering procedure.The cluster assignment is the same as the previous Figure 3.

Figure 4 .
Figure 4.The k = 6 typical water demand patterns identified through the two-level clustering procedure.The cluster assignment is the same as the previous Figure 3.

Figure A1 .
Figure A1.Target vs Model in testing period for the selected model.

Figure A1 .
Figure A1.Target vs Model in testing period for the selected model.

1 Oct 2012
Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Nov 2012 Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Dec 2012 Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Jan 2013 Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Feb 2013 Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Mar 2013 Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Apr 2013 Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue May 2013 Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Jun 2013 Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Jul 2013 Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Aug 2013 Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sep 2013 Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon

. RMSE * (Mean ± Standard Deviation)
* MAE and RMSE should have the same unit as the estimated quantity (water demand (L))