Comparative Evaluation of Different Multi-Agent Reinforcement Learning Mechanisms in Condenser Water System Control

Abstract: Model-free reinforcement learning (RL) techniques are currently drawing attention in the control of heating, ventilation, and air-conditioning (HVAC) systems due to their few pre-conditions and fast online optimization. The simultaneous optimal control of multiple HVAC appliances is a high-dimensional optimization problem, which single-agent RL schemes can barely handle. Hence, it is necessary to investigate how to address high-dimensional control problems with multiple agents. To realize this, different multi-agent reinforcement learning (MARL) mechanisms are available. This study compares and evaluates three MARL mechanisms: Division, Multiplication, and Interaction. For the comparison, quantitative simulations are conducted based on a virtual environment established using measured data of a real condenser water system. The system operation simulation results indicate that (1) Multiplication is not effective for high-dimensional RL-based control problems in HVAC systems due to its low learning speed and high training cost; (2) the performance of Division is close to that of Interaction during the initial stage, but Division's neglect of mutual inference among agents limits its performance upper bound; and (3) compared to the other two, Interaction is more suitable for multi-equipment HVAC control problems given its performance in both the short term (10% annual energy conservation compared to the baseline) and the long term (over 11% energy conservation).


Introduction
As the main place for modern human daily activities [1], buildings account for over 30% of society's total energy consumption worldwide [2][3][4]. The optimal control of building energy systems, especially heating, ventilation, and air-conditioning (HVAC) systems, can help reduce both building energy consumption and carbon emissions [2,5,6]. As a sub-system of the HVAC system, the condenser water loop accounts for the heat rejection of the central chiller plant, and the operation quality of this loop can considerably influence the overall efficiency of the whole HVAC system [5][6][7][8][9][10][11]. The control subject of this study is the condenser water loop.

Value and Application of Reinforcement Learning (RL) Techniques in HVAC System Control
As previously mentioned, the optimal control of HVAC systems is necessary. Before 2013, model-based control (such as the optimal control methods proposed by Kang et al. [10], Huang et al. [8], and other researchers) was the mainstream of the building optimal control field [2]. Apart from this, model-free optimal control based on RL has been drawing increasingly more attention in the building system control domain since 2013 [2]. Generally, there are several reasons for this phenomenon: (1) As stated in Ref. [11], model-based control methods' heavy dependence on accurate system performance models is the main barrier between academic control approaches and practical engineering applications: an accurate model requires integral sensor systems, extensive manual labor, and time to build [8]; moreover, model uncertainty and inaccuracy may harm control performance [12,13]. (2) Model-free reinforcement learning is a discipline that concerns the fast training of self-learning agents for games and optimal control problems [14]. Its independence from embedded models is suitable to mitigate the "model dependency" issue. The model-free nature of this technique leads to fewer pre-conditions and faster online computation, which enhance its feasibility in building control applications.
Wang and Hong [2] reviewed RL-based building optimal control studies conducted in the last twenty years. In buildings, RL techniques have been used to control various subjects from windows and batteries to domestic hot water systems and HVAC systems. In the HVAC system control field, control variables include the indoor temperature setpoint [15], chilled water temperature setpoint [10,16], and the cooling tower fan frequency [11].

Multi-Agent Reinforcement Learning (MARL)
The studies reviewed in Ref. [2] mostly applied the single-agent RL scheme to optimize the operation of building energy systems; in the single-agent RL scheme, only one RL agent is set up and used in the control process. When using this technique, the single agent needs to optimize all targeted controllable equipment. When the number of optimized variables grows, the state space and action space of the single agent can grow exponentially, which may lead to unaffordable training costs [17].
To solve the training problem above, one practical method is to decompose the overall control task into several sub-tasks and then assign them to multiple RL agents; these agents then make a multi-agent system (MAS). For instance, Ref. [11] chose to establish two RL agents in order to control the condenser water pump frequency and cooling tower frequency. In doing so, the high-dimensional problem of the single-agent RL scheme could be solved; however, another problem occurred: mutual inference among multiple agents.
As we know, in every time step, an RL agent interacts with the environment: the agent observes the state s of the environment, takes an action a to control the targeted subject, and then acquires the reward r from the environment; in the meantime, the environment changes to a new state s′. In this process, the outputs (the consequent reward r and new state s′) are determined by the inputs (s, a), along with the transition probability of the environment p(s′, r | s, a) [14,18].
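As a minimal illustration of this interaction loop (the environment and agent classes here are trivial stand-ins for demonstration only, not this study's system):

```python
class ConstantAgent:
    """Trivial agent: always outputs the same action (e.g., a 50 Hz setpoint)."""
    def act(self, state):
        return 50.0

    def learn(self, state, action, reward, next_state):
        pass  # a real agent would update its value function here


class DummyEnv:
    """Placeholder environment: step() returns (next_state, reward)."""
    def reset(self):
        return 0

    def step(self, action):
        return 0, 1.0


def run_episode(env, agent, n_steps):
    """One episode of the s -> a -> (s', r) interaction cycle."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(n_steps):
        action = agent.act(state)                # agent chooses a from s
        next_state, reward = env.step(action)    # environment draws (s', r) ~ p(s', r | s, a)
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward
```

The loop makes explicit that the reward observed in one step feeds the agent's learning before the next decision is made.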
For an MAS, the input of the transition probability is composed of the former state and all agents' actions: p(s′, r_1, r_2, …| s, a_1, a_2, …) [18][19][20]. Hence, the influence of the other agent(s) on the environment needs to be considered, but how [21]? If other agents' actions are completely observed by Agent 1, then there will be a huge observable variable space for it, and the learning cost may be unacceptable, similar to the single-agent RL scheme.
If not, then it may be hard for Agent 1 to learn and evolve due to invisible inference from other agents.
To solve the MAS problem above, multi-agent reinforcement learning (MARL) has been proposed [22]. The relationship among four research domains is illustrated in Figure 1. "Distributed control" is designed to solve the system optimization problem with multiple optimizers (could be model based, RL based, etc.) instead of one centralized optimizer [23]. Moreover, MARL is the intersection of the optimal control problem, the MAS environment, game theory, and RL techniques.
MARL techniques are intended to make each agent learn fast and well in an MAS where mutual inference is inevitable. Moreover, scenarios can be divided into three types (fully cooperative, non-fully cooperative, and fully competitive) depending on whether the agents' objectives are consistent [18]. Only the fully cooperative problem is discussed in this article, because generally, all equipment in an HVAC system works collaboratively to maintain an acceptable built environment and energy efficiency.
Classic MARL algorithms include Nash Q-learning, Friend-or-Foe, Win or Learn Fast-Policy Hill Climbing (WoLF-PHC), and the multi-agent deep deterministic policy gradient (MADDPG) [18,19,24,25]. Generally, MARL mechanisms can be categorized into Division, Multiplication, and Interaction [26].
Division mechanism: this means that each agent in the MAS learns and works by itself on its own task without considering the potential inference caused by other agents. Independent Q-learning (IQL) is a typical Division algorithm [26][27][28][29]. For instance, Yang et al. [30] set up three RL agents to control three loops in a low-exergy building model. Although the three controlled loops were physically related, no measure was specifically taken for the mutual inference among the agents. However, since their targeted problem was a fully-cooperative game, choosing the Division mechanism still achieved acceptable performance [18,31].
Multiplication mechanism: this is somewhat of a brute-force approach. In this scheme, multiple agents are simply piled up in one MAS. Each agent needs to act like a generalist who is capable of carrying out all activities [26]. As far as we are concerned, the single-agent scheme (where one central controller undertakes everything) is simply a special case of the Multiplication mechanism. In this study, we realize Multiplication by merging all RL agents in the MAS into one single agent. In doing so, the action spaces of the former agents are multiplied into one joint action space, and the inference problem in the MAS is eradicated [32]. However, the action space of this single agent is enormous due to the dimension increase.
Interaction mechanism: this is specifically designed for the MARL problem; in this scheme, the mutual inference problem is positively handled, and each agent in the MAS actively adapts to this non-Markovian dynamic environment knowing that there are others out there [26]. Moreover, the MAS reaches its optimum (in a fully cooperative MARL problem) or equilibrium (in a non-fully cooperative MARL problem) faster [18,24,25]. For instance, Tao et al. [17] proposed a cooperative learning strategy based on WoLF-PHC. Their proposed strategy was tested on the Keepaway game of RoboCup, and their results suggest that a cooperative learning strategy can outperform the scheme of tag match learning (where each single agent takes turns to learn from the environment).
In this article, we quantitatively compare these three mechanisms in the optimal control problem of building condenser water loops. Moreover, the methodologies are elaborated on in Section 3.

Motivation and Overall Framework of this Research
To sum up, the optimal control of HVAC systems is important for building energy conservation and decarbonization; model-free RL control has been drawing attention in the HVAC optimal control domain due to its independence from models; when multiple HVAC appliances need to be controlled with RL-based methods, the mutual inference problem occurs; and this study investigates the feasibility of different MARL mechanisms in the RL-based HVAC optimal control problem.
The overall framework/workflow of this research is illustrated in Figure 2. This article first qualitatively discusses the targeted problem in the Introduction. Then, Section 2 demonstrates the establishment and verification of a virtual environment, which is used in the simulation case study. Based on this environment, Section 2 quantitatively analyzes the mutual inference problem in an MAS. Section 3 introduces the methodology of the three MARL mechanisms, and Section 4 presents and discusses the control performances of comparative controllers in the virtual environment.

Virtual Environment Establishment
In this study, the control performances of three different MARL mechanisms are evaluated using simulations. Moreover, the simulations are based on a virtual environment (i.e., a system performance model). To enhance the practical relevance of the simulations herein, a real HVAC system with measured operational data is selected as the case system. Due to thorough commissioning and maintenance before data collection, its operational data are of high quality; hence, the data are used for data-driven system modeling in this study. This section demonstrates the establishment of the system model, based on which a quantitative analysis is conducted on the mutual inference problem in an MAS.

Case System
The layout of the case condenser water system is shown in Figure 3, and the characteristics are listed in Table 1. Its 2019 cooling season field data are adopted to establish the virtual environment for the simulations. It should be noted that, generally, the Chinese cooling season is defined as June to September, which is when mechanical cooling is necessary in buildings. However, the actual cooling season of each city is related to the local climate. In this study, the operational data of the real case system from 1 June to 18 September 2019 are collected, during which the case system was almost continuously supplying cooling. Hence, this period is regarded as one cooling season in the simulations herein.


Field-Data-Based System Modeling
The virtual environment is established based on a black-box random forest regressor (with the default hyper-parameters of the scikit-learn Python package [33]) and the 2019 cooling season field data of the case system (from 1 June to 18 September, with a data sampling interval of 10 min). Equation (1) is proposed to model the real-time electrical power of the whole condenser water loop. The involved variables are introduced in Table 2.

P_system = RandomForest(CL_s, T_wet, f_pump, n_pump, f_tower, n_tower, status_chiller, T_chws) (1)

The model in Equation (1) is fitted with the measured field data. The running dataset (when the system is running) is selected and then randomly divided into a training set (80%) and a testing set (20%) to train and verify the black-box model, respectively. The coefficient of variation of the root mean square error (CV-RMSE, Equation (2)) and the coefficient of determination (R², Equation (3)) are used to evaluate the reliability of the trained model:

CV-RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²) / ȳ (2)

R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)² (3)

where n is the length of the dataset, y_i is the i-th measured value of the system power, ŷ_i is the corresponding estimated value, and ȳ is the mean value of all y_i. The detailed calculation process of the model error indicators can be found in ASHRAE Guideline 14 [34,35].
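As a sketch of this modeling step, assuming the eight inputs of Equation (1) are arranged as feature columns (the column order, data shapes, and random seed are illustrative, not from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def fit_power_model(X, y):
    """Fit the Equation (1) power model and report CV-RMSE and R² on a held-out set.

    X: array of shape (n_samples, 8) with the inputs of Equation (1);
    y: measured system power P_system (kW).
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor()          # default hyper-parameters, as in the text
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    # Equation (2): CV-RMSE = RMSE normalized by the mean measured power
    cv_rmse = np.sqrt(np.mean((y_te - y_hat) ** 2)) / np.mean(y_te)
    # Equation (3): coefficient of determination
    r2 = 1 - np.sum((y_te - y_hat) ** 2) / np.sum((y_te - np.mean(y_te)) ** 2)
    return model, cv_rmse, r2
```

An 80/20 split mirrors the training/testing division described above; the trained model can then serve as the virtual environment in the simulations.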
The time-series data of the estimated power, measured power, and absolute residuals are illustrated in Figure 4. The right side of the figure shows the distribution of the three variables; the estimation errors mainly lie within a narrow range around 0, which means that the estimation error is mild. The calculated indicator values are listed in Table 3. According to Page 97 of ASHRAE Guideline 14 [34], CV-RMSE below 30% is acceptable for hourly building energy modeling. Hence, the established model is reliable for the following simulations in 10 min time-step intervals.

Mutual Inference between Cooling Tower Action and Condenser Water Pump Action
The operation objective of a condenser water loop is to maximize its overall energy efficiency. Herein, we take the comprehensive coefficient of performance (COP, calculated using Equation (4)) of these appliances as our optimization objective and quantitatively analyze the MAS mutual inference problem between cooling towers and condenser water pumps.

Figure 5 shows the system modeling result under a typical working condition (CL_s = 1400 kW, T_wet = 25.8 °C, n_pump = n_tower = 2, status_chiller = 3, T_chws = 10 °C), with various combinations of tower and pump frequencies. Figure 5 suggests that both pieces of equipment can sufficiently influence the system's COP. If the system's COP is chosen as the common reward by the two RL agents (tower agent and pump agent), each agent's learning process will be influenced by the other one's movement, as mentioned in Section 1.2. This fact could lead to an uncertain, unstable RL process. Thus, if the condenser water loop is to be controlled by multiple RL agents, the mutual inference problem needs to be considered.
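This frequency sweep can be sketched as follows, assuming `model` is the trained regressor of Equation (1) (a hypothetical usage, with the fixed working-condition values taken from the Figure 5 caption):

```python
import numpy as np


def cop_surface(model, cl_s=1400.0, t_wet=25.8, n_pump=2, n_tower=2,
                status_chiller=3, t_chws=10.0):
    """Map system COP over all (pump, tower) frequency pairs at one fixed condition."""
    f_pump = np.arange(35, 51)    # pump action space: 35-50 Hz
    f_tower = np.arange(30, 51)   # tower action space: 30-50 Hz
    cop = np.empty((len(f_pump), len(f_tower)))
    for i, fp in enumerate(f_pump):
        for j, ft in enumerate(f_tower):
            x = [[cl_s, t_wet, fp, n_pump, ft, n_tower, status_chiller, t_chws]]
            p_system = model.predict(x)[0]   # modeled total power, kW
            cop[i, j] = cl_s / p_system      # Equation (4)
    return f_pump, f_tower, cop
```

Plotting the returned surface reproduces the kind of COP landscape shown in Figure 5, where both agents' actions visibly shift the common reward.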

Overview
As justified in Section 2.3, the inference problem between agents should be carefully handled when multiple RL agents function simultaneously in an MAS. Typical MARL solutions can be categorized into three types: Division, Multiplication, and Interaction [26]. The control performances of these three different MARL mechanisms are compared and evaluated in this study.
In this study, Policy Hill Climbing (PHC) and Win or Learn Fast-Policy Hill Climbing (WoLF-PHC) are selected as the specific algorithms for comparison because their complexity, basic thinking, and pre-conditions are similar [20]. In doing so, the performance gap between the MARL mechanisms, rather than the specific algorithms, can be better revealed. Figure 6 shows the common workflow of an MARL controller interacting with the virtual environment in this study. Every simulation time step proceeds as follows: (1) Input real-time CL_s and T_wet (two uncontrollable environmental variables [5]) to the virtual environment (i.e., the system model) and the controller. (2) Based on the inputs, the controller decides the proper control signals, including on-off control signals and operational signals (i.e., setpoints of f_pump, n_pump, f_tower, n_tower, status_chiller, and T_chws). Note that, in Figure 6, the solid lines indicate that data transmission occurs in the same time step, while the dashed lines indicate that data transmission occurs between two adjacent time steps (i.e., the reward calculated in the current time step is not used until the next time step for the agents' learning).
In this study, the RL agents only decide f_pump and f_tower, whereas the on-off statuses of all appliances and T_chws are controlled according to the following rules: (1) To protect chillers from a low partial load ratio (PLR) operation risk, the whole system only operates when CL_s is larger than 20% of a single chiller's cooling capacity [36][37][38]. (2) In order to fully utilize the heat exchange area of the cooling towers, two cooling towers operate simultaneously when the system is on. (3) Two chillers operate simultaneously when CL_s is larger than a single chiller's maximum cooling capacity; otherwise, only Chiller 1 operates to cover the cooling demand. (4) The number of running condenser water pumps is in accordance with the number of working chillers. (5) T_chws is constantly set to 11 °C, which is close to the chillers' nominal value.
The control logic above is used by all controllers in the case study section.
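A rough sketch of rules (1)-(5), with the single-chiller cooling capacity as a placeholder argument (the real figure comes from Table 1):

```python
def supervisory_control(cl_s, chiller_capacity):
    """Apply the fixed supervisory rules; cl_s and chiller_capacity in kW."""
    if cl_s <= 0.20 * chiller_capacity:              # rule (1): low-PLR protection
        return {"system_on": False}
    n_chillers = 2 if cl_s > chiller_capacity else 1  # rule (3): staging by load
    return {
        "system_on": True,
        "n_tower": 2,            # rule (2): both towers run when the system is on
        "n_chillers": n_chillers,
        "n_pump": n_chillers,    # rule (4): pump count follows chiller count
        "T_chws": 11.0,          # rule (5): constant setpoint, °C
    }
```

The RL agents then only need to choose f_pump and f_tower on top of these fixed decisions.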

Division and Multiplication MARL Controllers: Policy Hill Climbing
This section presents the details of Division and Multiplication. Firstly, for Division, Policy Hill Climbing (PHC) is adopted as the specific algorithm. In PHC, both the value function and the policy function are updated in every learning step. The value function is updated based on the real-time reward, which, in turn, directs the updating process of the policy function [39]. Hence, the PHC algorithm is a somewhat simplified version of the actor-critic algorithm, but without neural networks and gradient calculation. The two RL agents (the cooling tower agent and the condenser water pump agent) are formulated as follows:

State: Real-time T_wet and CL_s are discretized and combined into the state value, such as (CL_s = 1060 kW, T_wet = 26 °C). The discretization interval of T_wet is 1 °C, and the discretization interval of CL_s is 10% of a single chiller's cooling capacity, as shown in Table 1. The two agents share the same state value.
Action: The operating frequencies of the cooling tower fan(s) and the condenser water pump(s) are the action variables of the two agents. The action space of the tower agent is 30-50 Hz (1 Hz intervals), and the action space of the pump agent is 35-50 Hz (1 Hz intervals). Note that the on-off statuses and T_chws are controlled according to the rules addressed in Section 3.1.
Reward: the common reward (optimization objective) of both agents is the real-time system COP, which is calculated using Equation (4):

COP = CL_s / P_system (4)

where CL_s is the system cooling load (kW), and P_system refers to the modeled total electrical power (kW) of the chillers, condenser water pumps, and cooling towers.

Initialization: When applying PHC, the value function Q_i(s, a_i) and target policy π_i(s, a_i) need to be set up for every agent. The values of the value function should be initialized to 0, and the values of π_i(s, a_i) should be initialized to 1/|A_i|, where |A_i| is the total number of actions in the i-th agent's action space (21 for the tower agent and 16 for the pump agent).
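The off-line formulation above can be sketched as follows (the chiller capacity value is a placeholder for the Table 1 figure):

```python
import numpy as np

TOWER_ACTIONS = np.arange(30, 51)   # 21 fan frequencies, Hz
PUMP_ACTIONS = np.arange(35, 51)    # 16 pump frequencies, Hz


def discretize_state(t_wet, cl_s, chiller_capacity=2000.0):
    """Shared discrete state: T_wet binned at 1 °C, CL_s at 10% of chiller capacity."""
    return (int(t_wet), int(cl_s // (0.10 * chiller_capacity)))


def init_state(Q, pi, s, n_actions):
    """Lazily create the PHC tables for a newly visited state:
    Q-values start at 0; the target policy starts uniform at 1/|A_i|."""
    Q.setdefault(s, np.zeros(n_actions))
    pi.setdefault(s, np.full(n_actions, 1.0 / n_actions))
```

Both agents call `discretize_state` on the same (T_wet, CL_s) pair, so they always observe an identical state while keeping separate Q-tables and policies.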
After the off-line agent formulation, the online real-time control process (right side of Figure 6) is realized as follows: in every simulation time step, each agent updates its own value function and target policy with Equations (5)-(8). The subscript i refers to the agent code (the tower agent is 1, and the pump agent is 2):

Q_i(s, a_i) ← (1 − α) Q_i(s, a_i) + α (r + γ max_{a_i'} Q_i(s', a_i')) (5)

In Equation (5), Q_i(s, a_i) is the specific Q-value of the i-th agent corresponding to the last state s and its last action a_i; α is the agents' learning rate, which is set to 0.7 referring to the engineering application in Ref. [16]; r is the reward value from the last time step; γ is the weight of the expected future reward, which is set to 0.01 referring to Ref. [16]; and max_{a_i'} Q_i(s', a_i') is the maximum Q-value of the i-th agent under the current state s'.
π_i(s, a_{i,j}) ← π_i(s, a_{i,j}) + Δ_{s,a_{i,j}} (6)

Δ_{s,a_{i,j}} = −δ_{s,a_{i,j}} if a_{i,j} ≠ argmax_{a'} Q_i(s, a'), otherwise Σ_{a_{i,k} ≠ a_{i,j}} δ_{s,a_{i,k}} (7)

δ_{s,a_{i,j}} = min(π_i(s, a_{i,j}), δ/(|A_i| − 1)) (8)

Equations (6)-(8) are used for policy updating, the idea of which is to transfer the "probability to be chosen" from the non-optimal actions to the optimal action. The parameter δ determines the transfer amount in every optimization step. According to Ref. [20], this varies with the case game, and it is set to 0.03 herein.
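As a minimal sketch, one PHC learning step per agent (Equations (5)-(8)) could be implemented as follows; the Q-table and policy are assumed to be dicts keyed by discrete state, with the hyper-parameter values given in the text:

```python
import numpy as np


def phc_update(Q, pi, s, a, r, s_next, alpha=0.7, gamma=0.01, delta=0.03):
    """One PHC step: update Q and the target policy, then sample the next action."""
    n = len(pi[s])
    # Equation (5): Q-value update with the reward from the last time step
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * np.max(Q[s_next]))
    # Equations (6)-(8): shift probability mass toward the current greedy action
    a_star = int(np.argmax(Q[s]))
    for j in range(n):
        if j != a_star:
            step = min(pi[s][j], delta / (n - 1))   # Equation (8)
            pi[s][j] -= step                        # non-optimal actions give up mass
            pi[s][a_star] += step                   # ...transferred to the optimal one
    # choose the next action from the target policy at the new state
    return int(np.random.choice(n, p=pi[s_next] / pi[s_next].sum()))
```

Because the subtracted mass is added back to the greedy action, the policy remains a valid probability distribution after every update.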
For the Multiplication RL controller, its offline and online processes are similar to those of Division. The difference is that the Multiplication RL controller only builds one RL agent to control both the cooling towers and the condenser water pumps. Hence, its action space is composed of 16 × 21 joint actions: (pump 35 Hz, tower 30 Hz), (pump 36 Hz, tower 30 Hz), …, (pump 50 Hz, tower 30 Hz), …, (pump 50 Hz, tower 50 Hz). Moreover, this single agent works alone using Equations (5)-(8), with the same state variable, reward variable, and hyperparameters (α, γ, δ) as those of Division.
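For illustration, the Multiplication agent's joint action space can be enumerated directly:

```python
from itertools import product

# every (pump, tower) frequency pair becomes one action of the single merged agent
pump_actions = range(35, 51)    # 16 pump frequencies, Hz
tower_actions = range(30, 51)   # 21 tower frequencies, Hz
joint_actions = list(product(pump_actions, tower_actions))
# len(joint_actions) == 16 * 21 == 336, versus 16 + 21 = 37 actions under Division
```

The 336-action table makes concrete why Multiplication eliminates mutual inference at the cost of a much larger space to explore.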

Interaction MARL Controller: WoLF-PHC
Win or Learn Fast-Policy Hill Climbing (WoLF-PHC) is a classic MARL algorithm suitable for fully cooperative and non-fully cooperative problems [17,20,40]. This algorithm is derived from the PHC algorithm, and it is composed of three core functions: the value function Q_i(s, a_i), the target policy π_i(s, a_i), and the historical average policy π̄_i(s, a_i).
The core thinking of this algorithm is to determine whether one agent is performing better than before in the learning process by comparing the mathematical expectations of the value function under the target policy and historical average policy (Equation (11)). If the current policy is proven to be better than the historical average, then we can claim that this agent is winning/adapting in this MAS, and its learning pace is mild; otherwise, this agent is not adapting well in this MAS (suppressed by other agents), and this agent then needs to change/update its policy faster to manage. In WoLF-PHC, although the agents do not interact with each other explicitly, each agent is aware of the other agents' existence and deliberately adapts itself to the dynamic MAS.
When realizing the Interaction mechanism (WoLF-PHC), the offline formulation of the agents is similar to that in the Division mechanism: the two agents (the tower agent and pump agent) are set up with common state variables (joint discrete T_wet and CL_s) and separate action spaces (30-50 Hz for the tower agent and 35-50 Hz for the pump agent), as described in Section 3.2. The main difference between Interaction and Division in this study is the usage of the historical average policy π̄_i(s, a_i) and the dynamic δ. The detailed workflow of WoLF-PHC is presented in Table 4.

Table 4. Workflow of WoLF-PHC algorithm.

Off-line initialization:
For the tower agent and pump agent (the subscript i refers to the i-th agent), formulate their action spaces and common state spaces in the same way as Division. For each agent, initialize all Q_i(s, a_i) values to 0, initialize all π_i(s, a_i) values to 1/|A_i|, and initialize all π̄_i(s, a_i) values to 1/|A_i|. The number of occurrences of each state is recorded by C(s), which is initialized to 0.

Online decision-making procedure in every time step:
A. Receive the reward r and (s, a_1, a_2) of the last time step. a_1 and a_2 are the last actions of the tower agent and pump agent, respectively.
B. Receive real-time CL_s and T_wet. Then, both agents execute the following procedure in parallel:
C. Update the value function Q_i(s, a_i) with Equation (5), and increment the state counter: C(s) ← C(s) + 1.
D. For ∀a_{i,j} ∈ A_i, update its corresponding historical average policy π̄_i(s, a_{i,j}) with Equations (9) and (10).
E. Use the latest Q-table to update the target policy π_i(s, a_{i,j}) with Equation (11) (δ_win = 0.01, δ_lose = 0.05).
F. Decide the next action for the current state s' with the updated π_i(s, a_i).

For WoLF-PHC, the key pre-defined parameters include α, γ, δ_win, and δ_lose. In this study, α = 0.7 and γ = 0.01, the same as Division's parameters. δ_win and δ_lose are the special parameters of WoLF-PHC, and these two parameters determine how fast the agent changes its target policy. According to Refs. [20,39], δ_lose : δ_win = 4 is recommended to reach fast convergence. In this study, δ_win = 0.01 and δ_lose = 0.05, which means that, when an agent is adapting well in the MAS (winning), the probability of each non-optimal action being chosen decreases by 0.01/(|A_i| − 1) in every simulation time step (these decreases are transferred to the optimal action); however, when the agent is losing, the probability of each non-optimal action being chosen decreases by 0.05/(|A_i| − 1) in every simulation time step (learn fast). This reflects the idea of "Win or Learn Fast".
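A minimal sketch of the WoLF-PHC policy update described above (the average-policy update of Equations (9)-(10) and the win/lose test of Equation (11)); the table structures are assumed to be dicts keyed by discrete state:

```python
import numpy as np


def wolf_phc_policy_update(Q, pi, pi_bar, C, s,
                           delta_win=0.01, delta_lose=0.05):
    """Update the average policy, pick delta by the win/lose test, then adjust pi."""
    n = len(pi[s])
    C[s] += 1
    # Equations (9)-(10): incremental update of the historical average policy
    pi_bar[s] += (pi[s] - pi_bar[s]) / C[s]
    # Equation (11): "winning" if the target policy yields a higher expected
    # Q-value than the historical average policy
    winning = np.dot(pi[s], Q[s]) > np.dot(pi_bar[s], Q[s])
    delta = delta_win if winning else delta_lose
    # same hill-climbing step as PHC, but with the dynamic delta
    a_star = int(np.argmax(Q[s]))
    for j in range(n):
        if j != a_star:
            step = min(pi[s][j], delta / (n - 1))
            pi[s][j] -= step
            pi[s][a_star] += step
```

A losing agent thus moves its policy five times faster (δ_lose = 0.05) than a winning one, which is the "learn fast" half of the algorithm's name.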

Simulation Case Study and Discussion
The operation of the case system from 1 June to 18 September is simulated in this section, based on the established black-box system model as the virtual environment and the measured data of T_wet and CL_s as real-time inputs. Five simulation cases corresponding to five different controllers are conducted to evaluate the energy-saving performance of the three MARL mechanisms. The simulation process is illustrated in Figure 6, and the case characteristics are listed in Table 5. Note that, since the RL learning process is stochastic, Cases 2-4 are each simulated three times to obtain average results for analysis. In addition to the three described MARL controllers, two typical controllers are included, namely, the baseline constant-speed controller and the proportional-integral-derivative (PID) feedback controller. The constant-speed controller keeps the online cooling towers and condenser water pumps running at 50 Hz, and its performance is taken as the baseline in this study. The logic of the PID feedback controller is the same as the control logic used in the real case system: (1) adjust the cooling tower frequency to maintain the approach (tower outlet water temperature minus T_wet) at 2.5 °C, and (2) adjust the condenser water pump frequency to maintain the temperature difference between the supplied and returned condenser water at 3.3 °C.
Since the system model set up in this study does not represent the condenser water temperature, it is not practical to embed the PID logic into the system model in Case #5. Instead, the on-site control signal record of the real case system is directly reused to control the virtual environment in Case #5, because the recorded control signals were determined by the real PID controller and did maintain the monitored temperature variables at their set points in the real system. Table 6 lists the simulated system energy consumption under the five controllers in the first simulated cooling season, and Figure 7 illustrates the equipment frequency distributions under the different controllers. In the first cooling season, all MARL controllers learn to control from ground zero (all agents are initialized at the first simulation step). The simulated results in this scenario therefore reflect the learning speed of each mechanism, which is an important evaluation indicator of RL algorithms [14,41]. Table 6 suggests that the energy-saving performances of all MARL controllers are better than that of the original PID feedback controller.
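The feedback logic described above can be sketched as two independent PI loops. The set points (2.5 °C approach for the towers, 3.3 °C condenser water temperature difference for the pumps) and the 30-50 Hz / 35-50 Hz frequency ranges come from the text; the gains, output limits, and class name below are illustrative assumptions, since the paper reuses recorded control signals rather than implementing the PID loop.

```python
class PIController:
    """Minimal PI loop for the original feedback logic (illustrative sketch)."""

    def __init__(self, kp, ki, setpoint, out_min=30.0, out_max=50.0):
        self.kp, self.ki = kp, ki       # assumed gains, not from the paper
        self.setpoint = setpoint
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0

    def step(self, measurement, dt=1.0):
        # Positive error (measurement above the set point) raises the frequency
        error = measurement - self.setpoint
        self.integral += error * dt
        u = self.kp * error + self.ki * self.integral
        # Clamp the command to the equipment frequency range
        return min(self.out_max, max(self.out_min, u))

# Tower loop: hold the approach (tower outlet temperature minus T_wet) at 2.5 °C
tower_loop = PIController(kp=4.0, ki=0.5, setpoint=2.5)
# Pump loop: hold the supply/return temperature difference at 3.3 °C
pump_loop = PIController(kp=4.0, ki=0.5, setpoint=3.3, out_min=35.0)
```

A larger approach or temperature difference drives the corresponding frequency upward, which is the conservative behavior discussed below.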
This result is in accordance with the equipment frequency distribution: the control actions chosen by the original PID controller are more conservative than those of the MARL controllers. This is because the on-site engineers need to seriously consider system safety when configuring the PID control logic, which can result in higher equipment frequency and more energy waste. Different from the original PID logic, MARL controllers are designed to focus on the optimization task, which is system energy efficiency; therefore, they perform better than the original PID controller.

Short-Term Performance
Comparing the performance of the three MARL controllers, it can be seen that Multiplication performs worse than the other two. Moreover, the frequency data under Multiplication are more diffuse than those of the others. This is as anticipated in Section 1.2: the Multiplication mechanism typically implies a high-dimensional action space, which can lead to high exploration costs, long training periods, and poor performance at the initial stage.
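The dimensionality argument can be made concrete with a quick count. The 30-50 Hz and 35-50 Hz frequency ranges come from the text; the 1 Hz discretization step is an assumption for illustration only.

```python
# Per-agent action counts under an assumed 1 Hz discretization
tower_actions = len(range(30, 51))   # 21 tower fan frequencies, 30-50 Hz
pump_actions = len(range(35, 51))    # 16 pump frequencies, 35-50 Hz

# Multiplication: a single agent must explore the joint action space
joint_actions = tower_actions * pump_actions     # 21 * 16 = 336

# Division/Interaction: each agent explores only its own action space
separate_actions = tower_actions + pump_actions  # 21 + 16 = 37

print(joint_actions, separate_actions)
```

Under this assumed discretization, the joint space is roughly nine times larger than the combined per-agent spaces, and the gap widens multiplicatively with every additional equipment type.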
The initial performance gap between Division and Interaction is slight. It is inferred that this small gap is due to the following reasons: (1) Both Division and Interaction address the case problem with two agents (the action and solution spaces are limited); hence, the learning speeds of the two mechanisms are both high and close to each other. (2) At the initial stage of the RL agents' learning processes, the performance potential of the agents is still far from fully utilized; thus, Interaction's advantage of a higher upper limit cannot be revealed in this short-term scenario. (3) The case study is designed to investigate the differences among MARL mechanisms rather than among RL algorithms; hence, the theoretical gap among the three MARL controllers is not that evident in the first place.
Therefore, the mild performance difference between Division and Interaction in the first cooling season suggests that a short-term experiment may not be sufficient to compare these two MARL mechanisms in the HVAC control field when they are based on a similar theoretical basis (i.e., Policy Hill Climbing herein). To better reveal the advantages and disadvantages of each MARL mechanism, long-term simulations are conducted in the next section.

Long-Term Performance
The post-convergence performance (upper limit) is another critical indicator of RL algorithms [41]. To better analyze the long-term performance of each MARL controller, five-episode continuous simulations are conducted under the three MARL controllers. This is realized by continuously simulating the system's operation over the same five cooling seasons, without resetting the RL agents midway. In doing so, the long-term evolution of a MARL controller can be analyzed. For every MARL controller, the five-episode simulation is conducted three times to mitigate the influence of randomness on the results. Figure 8 illustrates the performance evolution of each MARL controller, which suggests the following:
Although Division can match Interaction's performance at the beginning, Interaction can reach an upper limit higher than that of Division, because Division does not consider the agents' mutual influence. In a MAS, each agent's learning process is inevitably influenced by the other agents (Figure 5), and Division's neglect of that fact limits its performance upper bound [28].
The performances of Division and Interaction basically converge after the second cooling season due to their multi-agent configurations and small action spaces. However, the learning process of Multiplication does not converge within the five simulated cooling seasons due to its large action space, which requires far more exploration. Moreover, the performance of Multiplication after five episodes is still inferior to that of the other two MARL controllers. Although Multiplication could theoretically reach a higher performance upper bound thanks to its complete optimization solution space, its low learning speed weakens its feasibility in engineering practice.
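The continuous multi-episode experiment described above can be sketched as follows. Only the environment restarts between episodes; the agents are created once and keep their learned Q-tables and policies. The Gym-style reset/step interface and all names here are illustrative assumptions; the paper's virtual environment is a data-driven black-box model.

```python
def simulate_episodes(agents, make_season, n_episodes=5):
    """Run the same cooling season n_episodes times back to back.

    `agents` maps names (e.g. "tower", "pump") to objects with act()/update();
    they are never re-initialized, so learning carries over across episodes.
    """
    seasonal_energy = []
    for _ in range(n_episodes):
        env = make_season()                       # restart the season only
        state, done, total = env.reset(), False, 0.0
        while not done:
            # Each agent picks its own action from the shared state
            actions = {name: ag.act(state) for name, ag in agents.items()}
            state_next, reward, energy, done = env.step(actions)
            # Both agents learn from the common reward signal
            for name, ag in agents.items():
                ag.update(state, actions[name], reward, state_next)
            total += energy
            state = state_next
        seasonal_energy.append(total)
    return seasonal_energy
```

Plotting the returned per-season energy totals over the five episodes reproduces the kind of performance-evolution comparison shown in Figure 8.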

Conclusions and Future Work
The application of model-free reinforcement learning (RL) techniques in the control of HVAC systems has been widely studied in recent years [2]. When multiple HVAC appliances are controlled, the dimensionality of the control signals increases rapidly, which leads to an enormous action space and an unacceptable training cost if a single RL agent is assigned the entire control task. Hence, it is necessary to investigate how to address high-dimensional control problems with multiple agents, and different multi-agent reinforcement learning (MARL) mechanisms are available for this purpose.
In this study, the measured data of a real condenser water system are adopted to establish a virtual environment in order to compare and evaluate three MARL mechanisms: Division, Multiplication, and Interaction. A static parameter analysis addresses the problem of mutual influence among agents in the condenser water loop with multiple controllable appliances (cooling towers and condenser water pumps). After that, simulations are conducted to quantitatively analyze the energy-saving performance of the different MARL controllers. The simulation results indicate that (1) Multiplication is not effective for high-dimensional RL-based control problems in HVAC systems due to its low learning speed and high training cost; (2) the performance of Division is close to that of the Interaction mechanism during the initial stage (10% annual energy saving compared to the baseline), while Division's neglect of the agents' mutual influence limits its performance upper bound; and (3) compared to the other two, Interaction is more suitable for a case problem with two agents corresponding to two types of equipment. As mentioned in Section 1.2, when the system scale increases, the state and action spaces grow rapidly [17]. In that case, the other two mechanisms face bigger challenges, which further strengthens the advantage of Interaction. Hence, the Interaction mechanism is more promising for multi-equipment HVAC control problems given its good performance in both short-term (10% annual energy saving compared to the baseline) and long-term scenarios (over 11% annual energy saving after convergence).
This article studies the simplest scenario of MARL in HVAC system control: only two agents are set up, the two agents share the same state information and the same optimization objective (reward), and discretized state and action spaces are used rather than continuous ones. Hence, this study should be regarded as a preliminary investigation of the MARL technique in HVAC system control. When more appliances, such as chillers and AHUs, are involved in the MARL control framework, potential challenges such as communication of observed information among agents and competition among agents based on game theory need to be considered and addressed [42].