Real-Time Scheduling of Pumps in Water Distribution Systems Based on Exploration-Enhanced Deep Reinforcement Learning

Effective ways to optimise real-time pump scheduling to maximise energy efficiency are being sought to meet the challenges in the energy market. However, the considerable number of evaluations required by popular metaheuristic optimisation methods causes significant delays in real-time pump scheduling, and the simplifications made by traditional deterministic methods may introduce bias into the solutions. To address these limitations, an exploration-enhanced deep reinforcement learning (DRL) framework is proposed for real-time pump scheduling problems in water distribution systems. The experimental results indicate that E-PPO can learn suboptimal scheduling policies for various demand distributions and can limit the application time to 0.42 s by transferring the online computation-intensive optimisation task offline. Furthermore, a penalty form for the tank level was found that can reduce energy costs by up to 11.14% without sacrificing the water level in the long term. Following the DRL framework, the proposed method makes it possible to schedule pumps in a more agile way as a timely response to changing water demand while still controlling the energy cost and the tank levels.


Introduction
Water distribution systems (WDSs) represent vast and complex infrastructures that are essential for residents' lives and industrial production. Water utilities are committed to providing customers with sufficient water of the required quality by operating WDSs. The corresponding energy cost of pumps constitutes the dominant expenditure in the operational cost of a WDS [1][2][3][4]. However, the energy market is experiencing great challenges. Extreme climate [5], economic crises [6], war [7], and public health events (such as the COVID-19 pandemic) [8,9] have produced huge negative shocks in the energy market, filling it with uncertainties and fluctuations. These challenges in the energy market have large implications for water utilities. On the one hand, the high operating cost caused by rising energy prices directly affects the financial health of water utilities. On the other hand, a significant energy shortage would make the pumping or treatment of water impossible [10]. Hence, it is an important issue for water utilities to improve the energy efficiency of pumps and to integrate water supply strategies with energy conservation goals.
The problem of finding the optimum pump schedule is far from simple; both the hourly water demand of consumers and electricity tariffs can vary greatly during the scheduling period. Minimum and maximum tank levels are hard constraints that must be satisfied to guarantee the reliability of the supply, and the desired pressures must be maintained for consumers. In addition to these factors, the hydraulic formulas of WDSs are highly nonlinear and complex, making computer modelling a difficult and very time-consuming process.
Various optimisation methods have been applied to pump scheduling problems. Deterministic methods were used initially, including linear [11], nonlinear [12][13][14], dynamic [15,16], and mixed-integer programming [17,18]. Most of these methods simplify the complexities and interdependencies of WDSs through assumptions, discretisation, or heuristic rules [1,13,19]. Although these simplifications can make the problem easier to address, they may introduce bias and exclude potentially good solutions. In the mid-1990s, stochastic optimisation methods (metaheuristics) were introduced to pump scheduling optimisation problems [20], such as the genetic algorithm (GA) [21][22][23][24], particle swarm optimisation (PSO) [25,26], and differential evolution (DE) [27]. These metaheuristics do not require simplification of the hydraulic models and have proven to be robust, even for highly nonlinear and nondifferentiable problems. However, metaheuristics require a large number of evaluations to achieve convergence, which takes too much time for real-time processing.
In recent years, the development of machine learning has introduced opportunities for the scientific management of water utilities. Various machine learning methods are used in a wide variety of applications, from anomaly detection [28,29] through system prediction [30][31][32] to system condition assessment [33] and system operation [34][35][36].
In scheduling problems, machine learning techniques are usually used as surrogate models of WDSs in metaheuristic optimisation to save computational load. Broad et al. [37] used an artificial neural network (ANN) as a metamodel, which can approximate the nonlinear functions of a WDS and provide a good approximation of the simulation model. However, how to reduce the error of surrogate models and ensure that the solution remains optimal compared with a full, complex network simulator remains an open question.
Deep reinforcement learning (DRL) is a promising method for nonlinear and nonconvex optimisation problems. The essence of DRL is the combination of reinforcement learning (RL) and deep learning. It has been explored widely in recent years since the appearance of AlphaGo [38]. However, its application to pump scheduling problems is still very limited. In 2020, Hajgató et al. [39] applied DRL to the single-step pump scheduling problem and took the results of the Nelder-Mead method as the reward standard. An essential contribution of Hajgató et al. is that their method models the single-step real-time pump scheduling problem as a Markov decision process (MDP) and considers multiple objectives, including the satisfaction of consumers, the efficiency of the pumps, and the feed ratio of the water network. However, the method sacrifices the regulation and storage capacity of the tanks and takes the pump speeds obtained by the Nelder-Mead method as the optimal setting, which makes the DRL results depend largely on the Nelder-Mead method.
Based on the above literature review of pump scheduling optimisation in WDSs, there are three main limitations for real-time pump scheduling problems: heavy computational loads, a lack of accuracy in surrogate models, and a lack of proper usage of the storage capacity of tanks. Real-time pump scheduling based on reinforcement learning is presented in this paper. The main contributions of this paper are as follows: First, an RL environment for the pump scheduling problem was constructed using a full network simulator, and the computation-intensive task was transferred from online to offline to save application time.
Second, by constructing a reward function, the penalty form of the tank level was explored to reduce the energy cost and maintain the tank level in the long term.
Finally, an exploration-enhanced reinforcement learning framework was proposed, adding an entropy bonus to the policy objective.The results demonstrate that compared with metaheuristics, the proposed method can obtain suboptimal scheduling policies under various demand distributions within one second.
The rest of the paper is organised as follows. In Section 2, we introduce the details of DRL, proximal policy optimisation (PPO), the exploration enhancement method, and the designs of the important factors for applying DRL to pump scheduling problems. In Section 3, the reinforcement learning method is applied to a WDS case, the results are presented, and the key findings are analysed. Section 4 concludes this paper.

Reinforcement Learning (RL)
The RL algorithm is used to solve sequential decision problems and can be formulated mathematically as a Markov decision process (MDP). The training process of RL is carried out through the interaction between the agent and the environment. At time step t, the agent executes an action (a_t) according to the current state (s_t) and the policy function, obtaining an immediate reward (r_t); then, the environment transitions to the state of the next time step (s_{t+1}). The agent adjusts its policy through the experience collected in the interaction process. After a large number of interactions with the environment, an agent with an optimal policy is obtained. The trained policy neural network can then be used for pump scheduling. Compared with other optimisation methods such as GA, the RL method divides the optimisation process into training and application, which transfers the computationally intensive training process from online to offline to achieve real-time scheduling applications. The learned policy of RL is also able to handle the uncertainty of the environment, such as the uncertainty of demand, as the learned policy neural network is obtained through interactions with the environment under a large number of different states with uncertainty.
The MDP can be represented by a tuple <S, A, P, R, γ>, where: S is the state space, the set of states; A is the action space, the set of executable actions for the agent; P is the transition distribution, which describes the probability distribution over the next state under different s_t and a_t; R is the reward function, where r_t is the step reward after the agent takes action a_t in state s_t; and γ is the discount factor used to calculate the cumulative reward (R_t), which is defined as:

R_t = Σ_{k=0}^{T−t} γ^k · r_{t+k}    (1)

where T is the number of steps in the episode, which corresponds to 24 h in this context, and γ ∈ [0, 1] is set to 0.9 in this study.
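The discounted cumulative reward above can be sketched in a few lines of Python. This is an illustrative helper, not the authors' code; the 24-step episode and γ = 0.9 follow the paper.

```python
# Sketch: discounted return R_t = sum_k gamma^k * r_{t+k} over a 24-step
# (24 h) episode, with gamma = 0.9 as in the paper.
def discounted_return(rewards, gamma=0.9):
    """Compute the return from a list of step rewards, iterating backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 24 hourly steps with a unit reward each
episode_return = discounted_return([1.0] * 24)
```

With a constant unit reward, the return is the geometric series (1 − 0.9^24)/(1 − 0.9) ≈ 9.2, well below the undiscounted sum of 24, showing how the discount factor weights near-term rewards.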

Design of Significant Factors for RL Application to Pump Scheduling
In the following, the most significant factors in RL, namely S, A, and R, are discussed in detail for application to optimal real-time pump scheduling in WDSs.

State Space
The state space is the set of all possible states in the environment. The state consists of the relevant information for the agent to learn the optimal policy. This means that the state should contain enough effective information about the current environment. However, excess information may confuse the agent during the process of assigning rewards to states. Therefore, it is important to select the state space properly in an RL application. For this work, the pump scheduling information in the WDS was divided into two categories: the water demand of consumers and the status information of the tank levels.
To make the built environment approach the actual WDS, the uncertainty of water demand in the real world was considered. The randomisation of water demand in the environment was carried out in two steps to mimic the time and space effects. Firstly, the general default demand pattern (shown in Figure 1b) was multiplied by hourly random multipliers to simulate the random fluctuation of hourly water demand. Secondly, the base demand was obtained as the product of the nodal random multiplier and the default base demand. The demand with uncertainty was generated at every general node as the product of the pattern and the base demand constructed above. Both the time and space random multipliers follow a truncated normal distribution, whose domain (the random multiplier) is restricted to a certain range of values, such as (1 − ∆, 1 + ∆). For simplicity, the time and space random multipliers are limited to the same range, and the water demand distribution whose space and time random multiplier ranges are (1 − ∆, 1 + ∆) is hereafter called demand (∆). The probability density distributions of the truncated normal distribution of the random multiplier for demand 0.3, demand 0.6, and demand 0.9 are shown as examples in Figure 1a. A large consumer is considered a node with less uncertainty than a general consumer, so the randomisation method is not applied to it.
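The two-step randomisation above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation; the node names, pattern values, and the rejection-sampling approach to the truncated normal are assumptions.

```python
import random

# Sketch: truncated-normal multipliers restricted to (1 - delta, 1 + delta),
# applied as hourly (time) and nodal (space) random multipliers.
def truncated_normal(mean=1.0, std=0.2, delta=0.3):
    """Sample N(mean, std) truncated to (mean - delta, mean + delta) by rejection."""
    while True:
        x = random.gauss(mean, std)
        if mean - delta < x < mean + delta:
            return x

def randomise_demand(pattern, base_demands, delta=0.3):
    """Demand at node n in hour t = randomised pattern[t] * randomised base[n]."""
    hourly = [m * truncated_normal(delta=delta) for m in pattern]   # time effect
    bases = {n: b * truncated_normal(delta=delta)                   # space effect
             for n, b in base_demands.items()}
    return {n: [b * h for h in hourly] for n, b in bases.items()}

random.seed(0)
demands = randomise_demand([0.8, 1.2, 1.0], {"J10": 5.0, "J20": 3.0}, delta=0.3)
```

For demand 0.3, every generated value is bounded between 0.7² and 1.3² times the nominal hourly demand, since both multipliers stay within (0.7, 1.3).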
The initial water levels of the tanks follow a uniform distribution over (0, 1). The subsequent tank levels are then calculated according to the state and action.

Action Space
The action is defined as the relative pump speed, which is the ratio of the pump speed to the nominal pump speed. The pumps in a WDS are considered variable-speed pumps. The trained agent selects the optimal pump speeds from all combinations of pump speeds for every time interval of the scheduling period under the guidance of the learned strategy. Each relative pump speed is treated as a discrete variable ranging from 0.7 to 1.0 in increments of 0.05 due to mechanical limitations. The size of the action space grows exponentially with the number of pumps.
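The discrete action space described above can be enumerated directly; a minimal sketch (the function name is illustrative):

```python
from itertools import product

# Relative pump speeds from 0.70 to 1.00 in steps of 0.05, as in the paper.
SPEEDS = [round(0.70 + 0.05 * i, 2) for i in range(7)]  # 0.70, 0.75, ..., 1.00

def action_space(n_pumps):
    """All speed combinations for n_pumps variable-speed pumps (7^n actions)."""
    return list(product(SPEEDS, repeat=n_pumps))

actions = action_space(2)  # two pump stations, as in EPANet Net3
```

With 7 speed settings per pump, two pumps already yield 49 discrete actions, and each additional pump multiplies the count by 7, which is the exponential growth noted in the text.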

Reward Function
The reward function is defined to motivate the agent to achieve its goal. The value of the reward represents the quality of the action. For this work, the reward function consists of three important parts: the energy cost of pumps (r_cost), the penalty for hydraulic constraints (r_h), and the penalty for tank-level variation (r_tank). The reward value is calculated according to Algorithm 1.

(1) Energy cost of pumps
The essential goal of the real-time pump scheduling optimisation method proposed in this work is to minimise the energy cost of pumps while fulfilling the system constraints. The reward design for energy cost should consider two key points. First, the lower the energy cost of the pumps, the higher the designed reward. Second, obtaining all-positive or all-negative rewards in the learning process should be avoided, as such a distribution is not conducive to agent training. According to the above requirements, the reward for energy cost is defined as the difference between the benchmark energy cost (C_bench) and the actual energy cost (C_t). The benchmark balances the positive and negative distribution of the energy cost rewards: when the energy cost is lower than the benchmark, the reward is positive; otherwise, the reward is negative. The lower the energy cost, the larger the reward. The benchmark is the average energy cost obtained by the agent interacting with the environment for 20,000 episodes while taking random actions.
(2) Hydraulic constraint
When the hydraulic constraint (such as the pressure at nodes) cannot be fulfilled under the current action, a penalty is added to the reward function. After receiving the penalty, the agent learns that this is a bad action and adjusts the corresponding policy.
(3) Tank-level variation
Tank-level variation (mainly referring to tank-level reduction within a day) should be as small as possible to prevent the agent from learning to reduce the energy cost by overconsuming water in the tanks, which may lead to water shortages and extra costs for replenishing the tanks. For these reasons, when the water volume in the tanks at the end of the scheduling period (V_{t=24}) is less than the initial volume (V_{t=0}), a negative reward is added to the reward function. Rather than strictly limiting the tank level within a day, we attempted to find a way to make full use of the storage and regulation capacity of the tanks to reduce the energy cost while maintaining the tank level in the long term. The form of the negative reward (r_tank) has a great impact on the learned policy, as explored in Section 3.1.

In this study, we applied the PPO algorithm [40] to determine the optimal real-time pump schedule. PPO is an on-policy RL method that updates the policy with a new batch of experiences collected over time. The policy gradient method estimates the policy gradient and feeds it into gradient ascent optimisation to improve the policy. The original policy gradient estimator has the following form:

L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) · Â_t ]    (2)
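The three-part reward described above can be sketched as a single step-reward function. This is an illustrative reading of Algorithm 1, not the authors' code; the penalty magnitudes and the proportional tank-penalty form (the "penalty 3" shape studied in Section 3.1) are assumptions for the example.

```python
# Sketch of the step reward: r = (C_bench - C_t) - r_h - r_tank, where the
# tank penalty is proportional to the reduction rate of the tank level.
def step_reward(energy_cost, c_bench, pressure_ok, v_start, v_end,
                hydraulic_penalty=200.0, tank_coeff=100.0):
    r = c_bench - energy_cost            # positive if cheaper than the benchmark
    if not pressure_ok:                  # hydraulic constraint violated
        r -= hydraulic_penalty
    if v_end < v_start:                  # tanks drained over the scheduling period
        r -= tank_coeff * (v_start - v_end) / v_start  # proportional penalty
    return r
```

For instance, a day costing USD 200 against a USD 230 benchmark with satisfied pressures and rising tanks earns +30, while draining the tanks by 10% with a USD 250 cost earns −30 under these illustrative coefficients.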

where π_θ is the policy, and Â_t is the estimated advantage at timestep t. This is a process that alternates between sampling and updating. To reuse the sampled data and make the largest possible improvement to the policy, the probability ratio r_t(θ) is used in PPO to support several off-policy steps. The definition of r_t(θ) is expressed as Formula (3):

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)    (3)

Moreover, to avoid excessively large policy updates, the PPO algorithm has a clipping mechanism in the objective function, as shown in Formulas (4) and (5):

L^CLIP(θ) = Ê_t[ min( r_t(θ) · Â_t, clip(r_t(θ), 1 − ε, 1 + ε) · Â_t ) ]
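The clipped surrogate above can be evaluated for a single sample as follows; this is a minimal sketch of the standard PPO clipping rule, with illustrative log-probability and advantage values.

```python
import math

# Sketch of the PPO clipped surrogate for one (state, action) sample:
# min(r * A, clip(r, 1-eps, 1+eps) * A), with r = exp(logp_new - logp_old).
def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)            # r_t(theta)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(r_t, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)
```

The min keeps the objective pessimistic: a ratio of 1.5 with a positive advantage is capped at 1.2, and a ratio of 0.5 with a negative advantage is floored at 0.8, so the policy cannot profit from moving far from the sampling policy.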

Exploration Enhancement
The PPO algorithm suggests adding an entropy bonus to the objective to ensure sufficient exploration when using a neural network architecture that shares parameters between the policy and value function [40]. The objective is obtained as follows:

L^{CLIP+VF+S}_t(θ) = Ê_t[ L^CLIP_t(θ) − c_1 · L^VF_t(θ) + c_2 · S[π_θ](s_t) ]    (6)

where L^VF_t is the squared-error loss between the target value and the estimated value, S is the entropy bonus, and c_1 and c_2 are coefficients.
In this paper, an exploration-enhanced PPO (E-PPO) based on an entropy bonus is proposed. The idea of the entropy bonus is extended to the PPO model that does not share parameters between the policy and value function. The policy objective is obtained as Formula (7), while the objective of the value function remains unchanged:

L^{CLIP+S}_t(θ) = Ê_t[ L^CLIP_t(θ) + c_2 · S[π_θ](s_t) ]    (7)

Moreover, as the initial entropy is expected to be as large as possible to reduce the probability of learning failures [41,42], all the dimensions of the state were normalised to maximise the initial entropy. We found normalisation to be a simple but efficient method for maximising the initial entropy.
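The entropy-augmented policy objective can be sketched for a discrete action distribution as follows. The function names and the c_2 value are illustrative, and the clipped-surrogate term is passed in as a precomputed number rather than recomputed here.

```python
import math

# Sketch of the E-PPO policy objective (Formula (7)): the clipped surrogate
# plus an entropy bonus c2 * S[pi](s_t) that rewards exploratory policies.
def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def eppo_policy_objective(clip_term, action_probs, c2=0.2):
    return clip_term + c2 * entropy(action_probs)
```

A uniform distribution over the actions maximises the entropy term (ln of the number of actions), which is why normalising the state dimensions to keep the initial policy near-uniform maximises the initial entropy bonus.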

Case Study: EPANet Net3
The EPANet Net3 water network was chosen as the test case to illustrate the applicability of the proposed method. This network is one of the most commonly used benchmark networks, owing to its data availability and its flexibility to be modified for different optimisation problems [2]. The numerical model of the EPANet Net3 water network is accessible online in an EPANET-compatible format from the web page of the Kentucky Hydraulic Model Database for applied water distribution systems research [43].
The EPANet Net3 water network is based on the North Marin Water District in Novato, CA. The network has 2 raw water sources, 2 pump stations, 3 elevated storage tanks, 92 nodes, and 117 pipes. The topology of the EPANet Net3 water network is depicted in Figure 2. The time horizon for the case study is 24 h, divided into 1 h intervals. In this study, the wire-to-water efficiency of pumps 10 and 335 was 0.75. It is worth noting that the efficiency of a pump depends on the flow rate and rotation speed; in this paper, it is simplified as a fixed value. The electricity tariffs and the peak and off-peak intervals are shown in Figure 3. The peak tariff is USD 0.1194/kWh, and the off-peak tariff is USD 0.0244/kWh [44]. Some modifications were made as follows: (1) All the control rules for the lake source, pipe 330, and pump 335 were removed, and pipe 330 was kept closed. This means that the two raw sources supply water only through pumps 10 and 335, which are controlled by the agent, rather than under specific control rules; (2) To simulate the stochastic water demand in a real-world WDS, the randomisation method described above was used.
The number of epochs for the metaheuristics is 100. The crossover probability and mutation probability of the GA are 0.95 and 0.1, respectively. The local coefficient of PSO is 1.2. For DE, the weighting factor and crossover rate are 0.1 and 0.9, respectively. The detailed settings of PPO and E-PPO in the experiments are listed in Table 1; for example, the negative reward for violating the hydraulic constraint is −200.

Effects of the Penalty Form of Tank-Level Variation
In pump scheduling problems with tanks in WDSs, the penalty form of tank-level variation must be reasonably designed to make full use of the regulation and storage capacity of the tanks to minimise the energy cost of pumps. According to Algorithm 1 described above, the penalty can be a large constant or a value that is proportional to the reduction rate of the tank level. To study the effects of the penalty forms of tank-level variation on energy cost, 6 agents with different penalty forms were trained and tested on 100 random test sets. The results are shown in Figure 4. The uncertainty setting of water demand in the environment is demand 0.3, and the agent used is the PPO algorithm. For the penalty forms with large constants, such as penalty 1 and penalty 2 in Figure 4a, almost no case lost any water from the tanks by the end of the scheduling period. The larger the penalty value, the more conservative the agent; that is, the agent tends to add as much water to the tanks as possible to avoid the penalty. However, this tendency generates a high energy cost, as shown in Figure 4b. The average energy costs of penalty 1 and penalty 2 are USD 241.96/day and USD 242.45/day, respectively. For the penalty forms (penalty 3 to penalty 6) that are proportional to the reduction rate of the tank level, the greater the coefficient of the penalty, the higher the energy cost, while the corresponding tank level increases only slightly.
As presented in Figure 4b, penalty 3 performs best, with an average energy cost of USD 215.45/day, representing energy cost savings of 6.08% and 11.14% relative to penalty 4 and penalty 2, respectively. For penalty 3, the volume variation ratios of the tanks are positive in most cases (Figure 4a), and the negative ratios of the few other cases are concentrated in a small range not exceeding −9.56%, whereas the positive variation ratio reaches 184.93%. Suppose the 100 test cases are the states of the water distribution system on 100 different days. On most of the 100 days, the tank level rises by the end of the day; on only a few days does the level drop slightly. However, this reduced water volume is replenished on the other days when the water level rises. It can be inferred that the form of penalty 3 can reduce energy costs by up to 11.14% compared with the other five penalties without sacrificing the water level in the long term. Hence, penalty 3 was used for tank-level variation in the subsequent studies in this work.

Effects of Cross Entropy
Due to the effects of the sampling limitation in the experiments, the exploration range, expressed by the entropy coefficient (c_2), has a significant impact on the training process. In this study, we compared and verified various settings of c_2 under different water demand distributions. Figure 5 shows the experimental results, which contain the PPO model and the E-PPO model with different values of c_2. The same experiment was conducted three times for each model. The solid line shows the average cumulative reward to eliminate the contingency of the results, and the shaded part represents the variance of the reward over the three runs.
The best performance in Figure 5a reached approximately 178, with an optimal c_2 value of 0.2. The PPO model achieved the worst performance, which may be due to the smaller exploration scope of the agent, leading to premature convergence to a suboptimal solution. When the uncertainty coefficient of water demand increases, the optimal performance of the agent decreases. The c_2 setting of 0.2 is the best, with a reward of approximately 170, for a demand uncertainty of 0.6 (Figure 5b), and the c_2 setting of 0.3 is the best, with a reward of approximately 160, for a demand uncertainty of 0.9 (Figure 5c). This is because when the environment has great uncertainty, it is difficult to learn an optimal policy network, as the state transition becomes blurred, and the selection of the agent tends to be conservative. When the coefficient exceeds the optimal setting, the performance decreases as the coefficient increases (Figure 5a-c). A c_2 setting of 0.5 for the demand uncertainties of 0.6 (Figure 5b) and 0.9 (Figure 5c) shows poor performance. It may be that an entropy coefficient that is too large leads to policy degradation, so that optimisation takes too much time or even fails. Exploration ability often affects the convergence rate and the final performance. Considering that tasks are often sensitive to exploration ability, it is necessary to adjust this parameter for specific tasks.

Comparison of Models
To better verify the performance and robustness of the exploration-enhanced PPO, 15 optimisation test cases were conducted under three different demand distributions in the environment. Each test case simulates 24 h of pump scheduling with a 1 h interval. The agent receives the current state of the WDS at the beginning of each hour and then executes the action for that hour online until the end of the day. The simulation period of 24 h guarantees the tank-level variation in a day, as shown in Sections 2.2.3 and 3.1. The results were compared with metaheuristics, including GA, PSO, and DE, as shown in Tables 2 and 3. Table 2 shows the energy cost during the scheduling period. Table 3 shows the training time and test time of the models. The PPO and E-PPO methods are trained in advance by interacting with the environment; in the application process, operators just need to call the trained model. The metaheuristic methods, however, need to optimise from scratch for every single scheduling case. Simulations were carried out on a computer with an NVIDIA GeForce RTX 3070 GPU and an Intel Core i7-11700K CPU. The RL models were built with Keras, and the WDS environment was built with WNTR, which is compatible with EPANET. All the models were written in Python version 3.9. GA converges after 100 epochs of training; therefore, the results of GA are considered the optimal solutions and benchmarks for the test cases in this paper. As shown in Table 2, E-PPO exhibited the best performance apart from GA, followed by PPO, DE, and PSO, with average costs of USD 207.15/day, USD 220.60/day, USD 223.59/day, and USD 253.87/day, respectively. Compared with the optimal solution of GA, E-PPO consumes only 5.03% more energy cost on average but saves approximately 6.10% of the energy cost compared with PPO, 7.35% compared with DE, and 18.40% compared with PSO. This may be because the metaheuristics had limited performance under the limited training time and parameter tuning. Although the average training time of the E-PPO
algorithm is 5495.55 s, which is longer than that of the metaheuristics, it needs only 0.42 s in the scheduling process, as shown in Table 3. This is an almost negligible time consumption for hourly scheduling. However, the computation time of the GA is nearly half an hour, which is impractical for hourly scheduling. The performance of E-PPO is only slightly worse than that of GA, but its application time is less than a second. As time is a very important factor in industrial production, E-PPO is a promising real-time scheduling method for obtaining suboptimal solutions.

Conclusions
In this paper, the real-time scheduling of pumps in water distribution systems based on exploration-enhanced deep reinforcement learning was proposed. By constructing a reward function, the penalty form of the tank level was explored. We found a form that can make full use of the storage and regulation capacity of the tanks and saves up to 11.14% of the energy cost without sacrificing the water level in the long term. Following the proposed framework, E-PPO learned suboptimal scheduling policies under various demand distributions and limited the application time to 0.42 s by transferring the computation-intensive optimisation task offline.

Figure 1 .
Figure 1. Probability density distributions of the time and space random multipliers (a) and the general default demand pattern (b).

Figure 3 .
Figure 3. Peak and off-peak tariffs and intervals.
Take action a_t for state s_t, collecting the energy cost of pumps C_t, the initial water volume in the tanks V_{t=0}, and the final water volume in the tanks V_{t=24}.

Table 1 .
Detailed settings of the agent.

Table 2 .
Comparison of E-PPO, PPO, and metaheuristics performance for the test cases.

Table 3 .
Computation time of the models.