1. Introduction
The buildings and construction sector accounts for 32% of global energy consumption, making improvements in energy efficiency crucial for mitigating the global energy crisis [
1]. However, improving energy efficiency through conventional approaches, such as code adoption and minimum performance standards, is difficult. The challenge underscores the importance of controlling building energy management systems (BEMS) [
2]. The optimal control of HVAC plays a critical role in advanced BEMS, as the end-use consumption for HVAC systems accounts for 50% of building operational energy [
3]. For large-scale commercial buildings, more than 50% of the energy consumed by HVAC systems is concentrated in the chiller plant systems. As the facilities are designed according to peak building load but always operate under partial load, the chiller plant system usually operates in a sub-optimal state [
4]. Therefore, a considerable amount of energy can be saved during the operation of the chiller plant system. A chiller plant system consists of two main loops, a chilled water loop and a condenser water loop, and includes core components such as chillers, water pumps, and cooling towers. The condenser water loop significantly impacts the overall operating energy cost of the chiller plant system [
5]. Maintaining optimal operating conditions for each device in the condenser water loop is worth studying.
Different control methods were applied to optimize the control strategies of cooling water systems, and it was found that energy and cost can be saved by controlling the operation parameters. Adjusting the speed of cooling tower fans, varying the condenser water flow rate, and setting the water outlet temperature of cooling towers are measures that can be used to optimize the performance of the condenser water loop. Kim et al. [
6] explored the performance of predictive control on optimizing condenser water setpoint temperature, and total cooling energy consumption was saved by 5.6%. Huang et al. [
7] also studied the optimization of the condenser water setpoint temperature of the cooling water system. Model predictive control (MPC) was applied to a real legacy chiller plant, and the annual energy savings achieved by the chillers and cooling towers reached about 9.67%. The performance of MPC relies on the model’s accuracy, but constructing a complex model is challenging. Moreover, a built model generalizes poorly and cannot easily be reused in another situation. Data-driven control methods reduce the dependence on physical models. Wang et al. [
8] applied random forest to achieve the ground optimization of a chiller plant system and proposed a stepwise optimization strategy. The strategy obtained an energy-saving rate of 6.41% and 13.56% on two research days. The parameters optimized on the cooling side consisted of the cooling water outlet temperature and the cooling water flow. Ma et al. [
9] proposed a hybrid programming particle swarm optimization (HP-PSO) algorithm to reduce the energy consumption of cooling water systems. By adjusting the number of chillers and pumps, the water mass flow rate of a single pump, and the air mass flow rate of the cooling tower, the energy consumption of the cooling water system was reduced by 15.3% compared with the rule-based constant temperature difference optimization. However, these data-driven control approaches require historical data of sufficient quality and quantity. Reinforcement learning (RL) is a model-free control method that does not depend on an accurate model and has low requirements for historical data and prior knowledge, which makes it suitable for optimizing the control of building energy systems. Qiu et al. [
10] studied the performance of RL in a chilled water system for a Guangzhou subway station. By adjusting the frequency of the pumps and the cooling tower fans in the cooling water loop, the control of the chilled water outlet temperature setpoint was realized, and the energy saving rate was stabilized to 12% after the second applied cooling season. Fu et al. [
11] proposed a multi-agent deep RL method for the building cooling water system; the performance of pumps, chillers, and cooling towers was optimized by adjusting the load distribution, cooling tower fan frequency, and cooling water pump frequency. Compared with the rule-based control method, the RL method showed an 11.1% improvement in energy-saving performance. These studies demonstrated that RL performs well in the optimal control of cooling water systems.
Increasingly advanced RL algorithms have been used to optimize the operation of HVAC systems. Q-learning is the most classic RL algorithm. Chen et al. [
12] optimized the on/off strategy of the air-conditioner and window by utilizing the Q-learning algorithm in two residential zones. In Miami and Los Angeles, the energy consumption saving rate compared with heuristic control reached 13% and 23%, respectively. The deep-Q-network (DQN) algorithm is an improvement of Q-learning. In the DQN algorithm, the traditional Q-table used for storing and updating values based on actions and states was replaced by a neural network. Ahn et al. [
13] used a DQN method to minimize the energy consumption of a building while maintaining the indoor CO2 concentration; an energy-saving potential of 14.9% was found for DQN relative to baseline operation. To overcome the limitations of DQN, such as its restriction to discrete action spaces, the deep deterministic policy gradient (DDPG) algorithm was developed. Du et al. [
14] utilized the DDPG algorithm to optimally control the HVAC system of a multi-zone residential building and verified its advantages over the DQN. Peng et al. [
15] proposed an enhanced DDPG to optimize the energy consumption predicted by a convolutional neural network and long short-term memory (CNN-LSTM) model; they compared the energy efficiency ratio with a proximal policy optimization (PPO) algorithm and reached a 49% performance improvement. More RL algorithms were proposed and used to optimize the control of building energy management systems by simulation. Fu et al. [
16] integrated MPC and twin delayed deep deterministic policy gradient (TD3) to optimize the energy consumption of an HVAC system and demonstrated cost savings 16% better than those of DDPG. These studies focused on developing more advanced algorithms to improve the convergence speed and control performance of RL. However, constructing advanced RL algorithms for different HVAC systems requires the researcher to have proficient programming skills and considerable specialized knowledge. Moreover, the algorithms are tied to the HVAC systems the researchers built, resulting in a lack of universality for other applications.
Many open-source RL algorithms developed in other fields are accessible through GitHub (https://github.com, accessed on 19 March 2025). Although these algorithms were not developed for optimal control in the HVAC field, they can still simplify the development of RL algorithms for HVAC systems. Biemann et al. [
17] demonstrated a way to use a library of open-source RL algorithms for optimal control of HVAC systems. They compared the performance of four different actor–critic algorithms (soft actor–critic (SAC), TD3, PPO, and trust-region policy optimization (TRPO)) in the HVAC system of a data center. The algorithms utilized by Biemann came from the Stable-Baselines3 library [
18].
OpenAI Gym provides a framework for constructing an RL environment and contains a collection of environments [
19]. With the OpenAI Gym Application Programming Interfaces (APIs), agents with different open-source RL algorithms can be deployed to interact with the environment. Moriyama et al. [
20] wrapped an EnergyPlus program into an OpenAI Gym environment and used the TRPO algorithm as the agent. The data center’s cooling system was optimized, and the controller performance obtained by RL was 22% higher than a built-in controller. Zhang et al. [
21] used the same method as Moriyama et al. to construct an OpenAI Gym environment. The RL controller was deployed for a radiant heating system in a one-floor office building and reduced heating demand by 16.7% compared with a rule-based controller. Arroyo et al. [
22] reported a framework coupling the OpenAI Gym environment and building optimization performance tests (BOPTEST). A floor heating system for a single-zone residential building was chosen as the testing case for the framework. Wang et al. [
23] and Chen et al. [
24] developed two platforms based on the OpenAI Gym interactive interface to analyze and evaluate the performance of regional H2-electricity network systems and building electric vehicle systems. These virtual testbeds, constructed based on OpenAI Gym, provide a more convenient way to analyze and optimize the control of BEMS. However, most environments are coupled with other software, such as EnergyPlus. Co-simulation during the RL training process consumes a large amount of computing power and requires a long simulation time. Moreover, when optimizing a sub-system of the HVAC system, it is not necessary to calculate all of the operating states of the whole building.
In summary, RL is an appropriate optimization control method for complex systems such as HVAC. However, developing specialized RL algorithms for HVAC systems is challenging. Utilizing open-source RL algorithms to control HVAC systems is a newer approach, but it usually requires co-simulation with other software, resulting in a long training cycle.
Motivated by the interactive environment framework of OpenAI Gym and the functions of the automated building performance simulation (AutoBPS) platform, this paper aims to develop a tool that can quickly generate interactive environments for HVAC systems and simplify the application of open-source RL algorithms to HVAC systems. The cooling water system of the chiller plant was selected as the study object, and a toolkit called AutoBPS-Gym was developed. Based on the building energy models generated by AutoBPS for different building types and climate zones, interactive environments were generated and combined with RL algorithms to test and explore the energy-saving potential of cooling water systems. In addition, the developed toolkit reduces the repetitive modeling work of traditional RL studies and alleviates the computational power demand of the traditional co-simulation process. Furthermore, it reduces the time consumed by RL training.
The rest of the paper is organized as follows: Section 2 describes the details of the development of AutoBPS-Gym, including parameters of the case building, physical modeling of equipment, and related information for the environment and RL. The results of energy-saving performance and control strategies obtained by the developed tool when optimizing cooling water systems are demonstrated in Section 3. Section 4 discusses some shortcomings in developing the tool, and Section 5 concludes the results achieved in this study.
  3. Results
This section presents the research results of this paper. First, the accuracy of the OpenAI Gym cooling water system environment is validated. Then, the performance of the DQN algorithm in optimizing the control of the cooling water system environment is demonstrated. In addition, the control performance of the two algorithms, DQN and DDQN, is compared. Finally, the control performance of the DQN algorithm is compared across different climate zones.
  3.1. Validation of the OpenAI Gym Environment Model Accuracy
According to [
31], the model accuracy of the cooling water system environment developed based on the OpenAI Gym can be verified through the “Comparison to other models” method. This paper used the EnergyPlus model generated by AutoBPS to validate the simulation results.
The energy consumed by the cooling water system model was verified when the control variables were the same as the EnergyPlus model. The approach temperature was set to 3 °C, and the condenser water flow rate ratio remained at 100% of the rated water flow rate (1085 kg/s). 
Figure 6 shows the results for 1 July. The solid line represents the results of the EnergyPlus simulation, while the dotted line represents the results calculated by the environment model. As seen in the figure, there is almost no difference in the energy consumption results of the chiller and water pump, although there is a slight error in the results of the cooling tower model.
To quantify the model error, the hourly energy consumption of each equipment model in the cooling water system was evaluated using two metrics: the normalized mean bias error (NMBE) and the coefficient of variation of the root mean square error (CVRMSE). They are defined in Equations (11) and (12), where $y$ is the energy consumption calculated by the Gym models, $\hat{y}$ is the EnergyPlus simulation result, $\bar{y}$ is the average hourly energy consumption for the cooling season, and $n$ is the number of simulated cooling season hours.
ASHRAE Guideline 14 provides limit values for these metrics for different scenarios, generally specifying that the NMBE should not be higher than 5% and the CVRMSE should not be higher than 15%; otherwise, the accuracy of the model may be considered insufficient.
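As an illustration of how the two metrics can be evaluated, the following minimal sketch computes NMBE and CVRMSE for hourly data and checks them against the 5% and 15% limits. It assumes a commonly used form of the metrics with $n$ in the denominator and the EnergyPlus mean as the normalizing value; the exact form of Equations (11) and (12) may differ (for example, by using $n-1$). The function names and the sample values are hypothetical.

```python
import math


def nmbe_percent(model, reference):
    """Normalized mean bias error in percent (assumed form, n in the denominator)."""
    n = len(reference)
    mean_ref = sum(reference) / n
    bias = sum(m - r for m, r in zip(model, reference))
    return 100.0 * bias / (n * mean_ref)


def cvrmse_percent(model, reference):
    """Coefficient of variation of the RMSE in percent (assumed form)."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mse = sum((m - r) ** 2 for m, r in zip(model, reference)) / n
    return 100.0 * math.sqrt(mse) / mean_ref


def within_limits(model, reference, nmbe_limit=5.0, cvrmse_limit=15.0):
    """Check both metrics against the limits quoted from ASHRAE Guideline 14."""
    return (abs(nmbe_percent(model, reference)) <= nmbe_limit
            and cvrmse_percent(model, reference) <= cvrmse_limit)


# Hypothetical hourly energy values in kWh, for illustration only.
energyplus_kwh = [100.0, 120.0, 140.0, 160.0]
gym_kwh = [102.0, 118.0, 141.0, 163.0]

print(round(nmbe_percent(gym_kwh, energyplus_kwh), 2))
print(round(cvrmse_percent(gym_kwh, energyplus_kwh), 2))
print(within_limits(gym_kwh, energyplus_kwh))
```

Because positive and negative hourly deviations cancel in the NMBE but not in the CVRMSE, a model can pass the NMBE limit while failing the CVRMSE limit, as reported for the cooling tower.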
The results of the accuracy validation of the equipment energy consumption data, calculated by the Gym model compared to the energy consumption results from the EnergyPlus simulation, are shown in 
Table 6.
The NMBE of all three types of cooling water system equipment is below the 5% limit, and the CVRMSE of the chiller and pump is below the 15% limit. However, the CVRMSE of the cooling tower exceeds 15%, which indicates a certain discrepancy between the cooling tower calculation model and the corresponding EnergyPlus model. In terms of the overall energy consumption of the cooling water system, the chiller and the cooling water pump dominate, accounting for about 95% of the system’s energy consumption, whereas the cooling tower accounts for only about 5%. The cooling tower error therefore has only a small impact on the simulation error of the entire cooling water system. For the total energy consumption of the cooling water system, the model shows an NMBE of 0.81% (<5%) and a CVRMSE of 1.65% (<15%), which indicates that the accuracy of the constructed Gym environment model is acceptable.
The energy consumption during the cooling season, as simulated by the environment model and EnergyPlus, is shown in 
Figure 7. The energy consumption of the cooling tower in the cooling season simulated by EnergyPlus was 0.22 GWh, while that calculated by the model was 0.21 GWh, corresponding to an error of 4.76% in the cooling tower energy consumption. The energy consumed by the chiller and pump was at the same level in EnergyPlus and the environment model, at 3.60 GWh and 0.83 GWh, respectively. Considering the total energy consumption of the cooling water system in the cooling season, the EnergyPlus simulation result was 4.65 GWh, while the model’s calculation was 4.64 GWh, with an error of 0.15%.
The verification shows that the error in the energy consumption calculated by the developed cooling water system environment model is within an acceptable range. Therefore, the results calculated by the environment model were used as the benchmark to compare and analyze the optimization effects of the two RL algorithms.
  3.2. Optimization Results Using the DQN Algorithm
DQN was first used as an RL agent to interact with the developed environment model. By analyzing the energy consumption and distribution of control actions between different training episodes, the optimization performance of DQN was demonstrated in this section.
Figure 8 illustrates the optimization process of DQN during the training episodes. The greedy factor (epsilon) was set to decay exponentially, starting at 1.0 and decreasing gradually to 0.001 as training progressed. At the beginning of optimization, the energy reward was concentrated near the maximum value (4.26 GWh). With the high epsilon value at the beginning of training, the agent explored the control action space widely to learn the energy-saving effect of different actions. As epsilon gradually decreased, the energy reward exhibited a clear downward trend, though the rate of decrease slowed over time. The agent gradually completed its exploration and began optimizing the control actions at each step using the learned experience. The minimum energy consumption was reached when epsilon was close to 0.001. After 4000 episodes, the energy reward had decreased from 4.26 GWh to 3.99 GWh. However, in the final 1000 episodes, the agent could not discover a more effective control strategy, suggesting that the optimization process had reached a plateau.
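The exploration schedule described above can be sketched as follows. This is a minimal illustration of an exponentially decaying epsilon combined with epsilon-greedy action selection, not the exact schedule used in the study: the time constant `DECAY_EPISODES` is a hypothetical value, and the sketch assumes that higher Q-values are better (for example, a reward defined as negative energy consumption).

```python
import math
import random

EPS_START = 1.0          # initial greedy factor, as stated in the text
EPS_END = 0.001          # final greedy factor, as stated in the text
DECAY_EPISODES = 600.0   # hypothetical time constant; not reported in the text


def epsilon_at(episode):
    """Exponentially decay epsilon from EPS_START toward EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-episode / DECAY_EPISODES)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random exploration
    # Greedy action: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


for episode in (0, 1000, 2223, 4000):
    print(episode, round(epsilon_at(episode), 4))
```

With such a schedule, almost every action in the first episodes is random, while late episodes are dominated by greedy choices, which matches the transition from exploration to exploitation seen in the training curve.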
To investigate the energy-saving potential of the DQN algorithm during the cooling season, the energy consumption of the cooling water system was analyzed during the optimization process. The cooling season spanned from 1 June to 30 September. A total of 47 episodes with reduced energy consumption were recorded in the training process. For a more intuitive presentation, 10 episodes were selected: the first five fell within the first 2000 episodes and represent the initial stage of training, while the last five represent relatively converged episodes. The indexes of the selected episodes were 0, 150, 423, 858, 1982, 2223, 2989, 3031, 3425, and 4008. The energy consumption of the chiller, cooling tower, and pump in these episodes is shown in 
Figure 9.
At the start of optimization, the agent fully explored the control actions in the action space. The energy consumption of the cooling water system was 4.26 GWh, which is 0.38 GWh lower than the baseline calculation. As the optimization process continued, the energy consumption of the cooling water system progressively decreased. After 4008 episodes, the energy consumption had been reduced to 3.99 GWh. Compared with the baseline model, the energy-saving rate increased from 8.19% to 14.16%. Further analysis of the energy consumption of each facility in the cooling water system revealed the following. The energy consumption of the cooling tower decreased continuously, from 0.20 GWh to 0.12 GWh, which means that in each selected episode, the control strategy had a positive effect on the operation of the cooling tower. The energy consumption of the condenser pump decreased sharply from 0.35 GWh to 0.16 GWh in the first 2000 episodes, while the energy consumption of the chiller increased from 3.71 GWh to 3.76 GWh. The decrease in the energy consumption of the water pump and cooling tower thus came at the expense of an increase in the chiller’s energy consumption. After 2000 episodes, the agent used its experience to gradually balance the energy consumption between the chiller and the pump, eventually returning the chiller’s energy consumption to the initial level of 3.71 GWh. Although the chiller’s energy consumption was higher than the 3.60 GWh of the baseline model, the energy consumption of the pumps and cooling towers was reduced significantly.
To demonstrate the tendency of the DQN algorithm to select actions during the training process, the distribution of control actions in each episode was analyzed. The percentage distribution of the control actions is shown in Figure 10. The color depth in the figure represents the selection frequency of different control actions in different training episodes: the deeper the color, the more often the action was selected in the corresponding episode. Based on the previous analysis of the change in energy consumption, the reasons for the change in control actions and its influence on energy consumption optimization are discussed here. In the early stage of training (episodes 0–858), the selection frequencies of the approach temperature and the water flow rate ratio were scattered uniformly over the action space. DQN was in the exploration stage and had not yet effectively judged the influence of different control strategies on energy consumption. In the middle of the training period (episodes 1982–2989), the selection frequency for approach temperatures of 2 °C and 3 °C increased significantly (up to 26–29%). The selection frequency for low flow ratios (0.3 to 0.5) also increased significantly, especially in episodes 1982 and 2223. This indicates that DQN began to learn that a lower approach temperature helps to reduce the condensing pressure and improve the operating efficiency of the chiller, and that a lower cooling water flow can reduce the pump’s energy consumption without significantly affecting the cooling effect, thus reducing the system’s total energy consumption. At the later stage of training (episodes 3031–4008), the approach temperature setting stabilized at about 2.5 °C, while flow ratios of 0.3 to 0.5 remained the main choice. It is worth noting that some control actions with high approach temperatures and high flow ratios were still selected, which may be due to the following reasons: (1) Under some low-load conditions, appropriately increasing the approach temperature can reduce the energy consumption of the cooling tower, thereby improving overall energy efficiency. (2) During some periods of high load, DQN still selected a flow ratio of 0.6–0.7 to optimize the overall system’s energy efficiency. (3) DQN tried other options at some exploratory steps to avoid falling into a local optimum.
The details of control action selection and energy consumption during the 24 h of 1 July were analyzed to study the impact of the approach temperature and condenser water flow rate. The control actions of four episodes and the baseline model are presented. 
Figure 11 shows the control actions of each episode. The approach temperature ranges from 1.5 to 5 °C, and the condenser water flow ratio ranges from 0.3 to 1.0. The approach temperature in the baseline model was set to 3 °C, and the condenser water flow was set to 100% of the rated flow. At the beginning of training, the agent selected control actions randomly from the action space with a high probability. Therefore, the control actions in episode 0 and episode 423 were random, and the actions chosen by the agent differed significantly from hour to hour. As the number of training episodes increased, the agent gradually reduced the probability of randomly selecting control actions and used the optimal action in each simulation step as the control strategy. After 2223 episodes, the greedy factor was less than 0.05, so the agent tended to choose the best control action in each simulation step. The approach temperature then mainly varied between 2 and 3 °C, and the flow rate ratio ranged between 0.4 and 0.6.
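A discrete action space of this kind can be represented as a lookup from a single action index to a pair of setpoints, which is the form a DQN agent requires. The sketch below uses the ranges given above (approach temperature 1.5 to 5 °C, flow ratio 0.3 to 1.0); the step sizes of 0.5 °C and 0.1 are assumptions made for illustration, and the names `ACTIONS` and `decode_action` are hypothetical.

```python
from itertools import product

# Assumed discretization: 0.5 degC steps and 0.1 flow-ratio steps.
APPROACH_TEMPS_C = [1.5 + 0.5 * i for i in range(8)]        # 1.5 ... 5.0
FLOW_RATIOS = [round(0.3 + 0.1 * i, 1) for i in range(8)]   # 0.3 ... 1.0

# Every combination of the two control parameters is one discrete action.
ACTIONS = list(product(APPROACH_TEMPS_C, FLOW_RATIOS))


def decode_action(index):
    """Map a DQN output index to (approach temperature in degC, flow ratio)."""
    return ACTIONS[index]


# Baseline operation from the text: 3 degC approach, 100% rated flow.
BASELINE = (3.0, 1.0)

print(len(ACTIONS))
print(decode_action(0))
print(ACTIONS.index(BASELINE))
```

Under these assumed step sizes the agent chooses among 64 actions per step; a finer discretization would enlarge the Q-network output layer accordingly.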
The agent selected the optimal control action for the corresponding observation state through the neural network at every simulation step. Therefore, applying the control strategy optimized by RL in actual engineering would lead to frequent adjustments of equipment control parameters. To avoid hourly control of the equipment, the control strategies generated by the algorithm need to be simplified. The average value of the control parameters was used as the actual control action to explore the energy savings of the cooling water system under this control action.
First, the energy management system (EMS) programs were added to the baseline input data file (IDF) of EnergyPlus to control the flow of the cooling water system. Then, the approach temperature was set to the average value by modifying the SetpointManager:FollowOutdoorAirTemperature field in the IDF. In this way, the baseline model was modified, and the energy consumption of the cooling water system under this control strategy was obtained through simulation.
The statistical analysis of the control strategies generated by DQN yielded the average values of the control actions. The approach temperature and cooling water flow rate ratio were calculated as 2.96 °C and 47.12%, respectively. The energy consumption comparison is shown in 
Figure 12.
By setting a rule-based control strategy (DQN-EMS) for the cooling water system through the EMS, the energy-saving performance of the feasible control scheme generated by DQN was demonstrated. There was a 10.29% reduction in energy consumption compared to the baseline model, with a significant reduction in the pump’s energy consumption and an increase in the energy consumption of both the chiller and the cooling tower. The reduction in condenser water flow directly reduced the energy consumption of the cooling water pump. However, the lower water flow made heat rejection from the chiller more costly, so the energy consumption of the chiller compressor increased slightly. The lower approach temperature setpoint caused the cooling tower fan to expend more energy to release heat into the ambient air. The overall effect was a reduction in the energy consumption of the cooling water system. The difference between DQN and DQN-EMS was mainly caused by the energy consumption of the cooling tower. The DQN control strategy adjusts the action according to the state of the cooling water system, whereas the simple EMS control strategy currently implemented fixes the control action at a constant value, so it adapts less well than DQN to the dynamic changes of the air conditioning system.
  3.3. Comparison of Optimization Performance Across Different Algorithms
This paper aimed to develop the AutoBPS-Gym toolkit to realize the rapid generation of an interactive environment for RL. However, there are many RL algorithms, and whether the built environment can adapt to different RL algorithms remains to be studied. Building on the previous study of the DQN algorithm, this section introduces the DDQN algorithm to verify the applicability of the generated interactive environment to different RL algorithms. In the same generated interactive environment, the DDQN algorithm was also trained for 5000 episodes for comparison with the DQN algorithm. The convergence of the two algorithms is shown in 
Figure 13.
The figure shows that DDQN converges more slowly than DQN, and it is difficult to judge from the final state whether the results have converged. This may be because DDQN is more conservative in evaluating actions and takes longer to find the high-quality actions that save more energy in the cooling water system. In addition, DDQN is more sensitive to parameters such as the learning rate and target network update frequency, so directly reusing the hyperparameters of DQN may cause unstable updating of the Q value. The exact cause was not confirmed in this study, and further exploration is needed in future work. Nevertheless, the algorithm gradually converges, indicating that the generated RL environment can effectively interact with DDQN and achieve certain energy-saving effects.
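The more conservative action evaluation mentioned above can be illustrated by comparing how the two algorithms form their learning targets. The sketch below assumes that DDQN refers to double DQN, in which the online network selects the next action and the target network evaluates it; the function names and the Q-values are hypothetical.

```python
def dqn_target(reward, gamma, q_target_next, done=False):
    """DQN target: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def ddqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best]


# Hypothetical Q-values of the next state for three actions.
q_online_next = [1.0, 2.0, 1.5]
q_target_next = [1.2, 1.1, 3.0]

print(dqn_target(-1.0, 0.9, q_target_next))
print(ddqn_target(-1.0, 0.9, q_online_next, q_target_next))
```

For the same transition, the double DQN target can never exceed the DQN target, because the maximum over the target network's values is at least as large as any single one of them. This reduces overestimation but can also slow the propagation of high value estimates.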
The final energy consumption of the cooling water system obtained by DDQN is close to the optimization result of the DQN algorithm. The statistical results are shown in 
Figure 14. The energy consumption of the chiller is 3.75 GWh, which is 0.04 GWh higher than with DQN, while the energy consumption of the cooling tower is 0.03 GWh lower than with DQN. The energy-saving rate of the DDQN algorithm reaches 14.01%.
The comparison of DDQN and DQN control strategies is shown in 
Figure 15. The approach temperature selected by DDQN is slightly higher than that of DQN, while the cooling water flow is lower than that of DQN. The higher approach temperature reduces the energy consumption of the cooling tower, and the lower cooling water flow increases the energy consumption of the chiller. Overall, the energy consumption of the cooling water system is similar for the two algorithms.
  3.4. Optimization Performance Across Different Climate Zones
The developed AutoBPS-Gym can generate environments for different climate zones and building types to interact with RL algorithms for optimal control of HVAC systems. This section takes shopping mall buildings in multiple climate zones generated by AutoBPS-Gym as a case to test the optimization performance of the DQN algorithm in different climate zones. Cooling water systems are usually used in climate zones that require cooling. According to the Standard of Climatic Regionalization for Architecture (GB50178-93), these mainly include three major climate zones: Hot Summer and Cold Winter Zone (HSCWZ), Hot Summer and Warm Winter Zone (HSWWZ), and Temperate Zone (TZ). AutoBPS defines a total of 20 sub-climate zones, seven of which can be used to analyze the optimal control of cooling water systems. More detailed information on the climate zones and representative cities can be found in [
29].
Three climate zones were selected in this study to test the performance of DQN: HSCWZ-3A (Shanghai, China), HSCWZ-3B (Changsha, China), and HSWWZ-4A (Shenzhen, China). Shanghai and Changsha are located in the same major climate zone, but one is a coastal city and the other is an inland city. Shenzhen and Changsha have similar longitudes but lie in different climate zones. In addition, to test the environment with different numbers of cooling water equipment, the generated models were modified so that the number of chillers and cooling towers in the cooling water system was increased to two. The condenser water pump was modified to a variable-speed pump.
Figure 16 illustrates the comparative analysis of cooling water system energy consumption between baseline models generated by AutoBPS-Gym and DQN-optimized control strategies across three climatic zones. The baseline energy consumption in Shenzhen exhibits the highest values among the three cities, with chiller, pump, and cooling tower consumptions reaching 3.15 GWh, 0.14 GWh, and 0.41 GWh, respectively. These values surpass those of Shanghai (2.09, 0.10, 0.31 GWh) and Changsha (2.18, 0.10, 0.31 GWh), a phenomenon attributed to Shenzhen’s elevated cooling demand, driven by its year-round high-temperature and high-humidity climate conditions.
 DQN-based control optimization reduced the total energy consumption of the cooling water system in all regions. Specifically, Shenzhen achieved a 4.05% energy saving (from 3.70 GWh to 3.55 GWh), while Shanghai and Changsha exhibited reductions of 0.11 GWh (4.40%) and 0.10 GWh (3.86%), respectively. At the component level, the pump energy consumption shows limited optimization potential (e.g., Shenzhen: 0.14 GWh to 0.14 GWh), primarily due to the inherent efficiency improvement from replacing fixed-speed pumps with variable-speed pumps in the baseline model. Furthermore, the DQN algorithm balances the trade-off between chiller and cooling tower energy consumption through dynamic optimization of the approach temperature. This strategy results in a notable reduction in cooling tower energy consumption (e.g., Shenzhen: 0.41 GWh to 0.20 GWh) at the expense of a marginal increase in chiller energy consumption (e.g., Shenzhen: 3.15 GWh to 3.21 GWh), highlighting the algorithm’s capability to prioritize system-level efficiency over individual component performance.
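The reported saving rates follow directly from the component values quoted above, which can be checked with a short calculation. In this sketch the baseline component values and the reductions are taken from the text (the Shenzhen reduction of 0.15 GWh is derived from 3.70 GWh to 3.55 GWh); the function and variable names are hypothetical.

```python
def saving_rate_percent(baseline_gwh, optimized_gwh):
    """Energy-saving rate relative to the baseline, in percent."""
    return 100.0 * (baseline_gwh - optimized_gwh) / baseline_gwh


# Baseline (chiller, pump, cooling tower) energy in GWh, as quoted in the text.
BASELINE_GWH = {
    "Shenzhen": (3.15, 0.14, 0.41),
    "Shanghai": (2.09, 0.10, 0.31),
    "Changsha": (2.18, 0.10, 0.31),
}
# Reduction in total energy after DQN optimization, in GWh.
REDUCTION_GWH = {"Shenzhen": 0.15, "Shanghai": 0.11, "Changsha": 0.10}


def city_saving(city):
    """Return (baseline total in GWh, saving rate in percent) for one city."""
    total = sum(BASELINE_GWH[city])
    optimized = total - REDUCTION_GWH[city]
    return total, saving_rate_percent(total, optimized)


for city in BASELINE_GWH:
    total, rate = city_saving(city)
    print(city, round(total, 2), round(rate, 2))
```

The same function applied to the optimized Shenzhen components (3.21, 0.14, and 0.20 GWh) gives a total of 3.55 GWh, consistent with the system-level figure.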
  3.5. Time Consumption Between Different Ways of Utilizing the Deep-Q-Network
Before the research in this paper was carried out, co-simulation between Python 3.7.16 and EnergyPlus V9-3-0 was attempted. The EnergyPlus model was packaged as a Functional Mock-Up Unit (FMU) and interacted with the DQN agent built in Python. All simulations were performed on a computer with an 11th Gen Intel(R) Core(TM) i7-1165G7 CPU @ 2.80 GHz and an NVIDIA GeForce MX450 GPU.
During the co-simulation process, each training round using the FMU module took about 44.3 s on average, so 5000 training rounds would require about 61 h in total. With the approach in this paper, 5000 rounds of training took about 11 h, with an average of about 7.92 s per round. The time comparison of RL training through the two methods is shown in 
Figure 17, and the results show that generating an interactive environment for the targeted subsystem through AutoBPS-Gym can greatly reduce the time required for RL training.
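The totals quoted above can be reproduced from the per-round averages. The short calculation below uses only the figures reported in the text; the speed-up factor is derived from them and the names are hypothetical.

```python
def total_hours(seconds_per_round, rounds):
    """Total training time in hours for a given average time per round."""
    return seconds_per_round * rounds / 3600.0


ROUNDS = 5000
cosim_hours = total_hours(44.3, ROUNDS)   # FMU co-simulation with EnergyPlus
gym_hours = total_hours(7.92, ROUNDS)     # AutoBPS-Gym environment
speedup = 44.3 / 7.92                     # ratio of average times per round

print(round(cosim_hours, 1), round(gym_hours, 1), round(speedup, 1))
```

On the reported hardware this corresponds to training that is roughly 5.6 times faster per round than the FMU-based co-simulation.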
  4. Discussion
The current research still has shortcomings that need to be addressed and further explored in follow-up research:
(a) The parameters contained in the state space may have an impact on the results of the research. Currently, the parameters contained in the state space mainly include the flow rate, the temperature in the cooling water system, and the outdoor wet-bulb temperature, among others. The timestamp and electricity price of the corresponding state can be added to the state space in the future.
(b) The current control action space contains only two control parameters, and the discretization of the control actions is coarse, so there is still a gap between the obtained control strategies and the optimal control. In addition, the current algorithms can only handle a discrete control action space. If the control actions are discretized more finely or more control parameters are added, the action space will become too large, and the optimization time will increase sharply. Algorithms such as DDPG can optimize over a continuous action space, and multi-agent RL algorithms can simplify the problem of controlling multiple actions. More RL algorithms can be implemented in future work.
(c) Currently, the feasibility study of the control strategy optimized by the RL algorithm is tested only using the average value of control actions. A finer control strategy can lead to higher energy savings in the actual control process. In the future, we will further optimize our feasibility strategy study and propose a more operational control strategy for the operation control of the cooling water system by using the AutoBPS-Gym tool.
(d) In addition, when testing the building models of different climate zones generated by AutoBPS-Gym, due to the various modeling methods of the devices in EnergyPlus, the device models in the current environment cannot cover all the device models. As a result, there are still some errors in the results. In the future, we will add more device models to the environment to ensure that errors are minimized when generating the environment.
(e) The current studies focus on optimization and testing in the virtual environment. Although the cooling water system can produce an energy-saving effect of 14.01% in theory, whether the control strategy can produce an effect if applied to a practical cooling water system still needs further evaluation. In the future, we will build a scaled experiment platform to verify the control strategy of RL and explore the possibility of applying RL to practice.
(f) During the development of AutoBPS-Gym, the coefficients of the empirical equations of the plant models were not validated by regression, and the coefficients in EnergyPlus were taken directly, which can lead to problems in the practical application of the model. In the future, we will optimize AutoBPS-Gym’s model to achieve more accurate cooling water system optimization.