Energy Dispatch for CCHP System in Summer Based on Deep Reinforcement Learning

Combined cooling, heating, and power (CCHP) system is an effective solution to solve energy and environmental problems. However, due to the demand-side load uncertainty, load-prediction error, environmental change, and demand charge, the energy dispatch optimization of the CCHP system is definitely a tough challenge. In view of this, this paper proposes a dispatch method based on the deep reinforcement learning (DRL) algorithm, DoubleDQN, to generate an optimal dispatch strategy for the CCHP system in the summer. By integrating DRL, this method does not require any prediction information, and can adapt to the load uncertainty. The simulation result shows that compared with strategies based on benchmark policies and DQN, the proposed dispatch strategy not only well preserves the thermal comfort, but also reduces the total intra-month cost by 0.13~31.32%, of which the demand charge is reduced by 2.19~46.57%. In addition, this method is proven to have the potential to be applied in the real world by testing under extended scenarios.


Introduction
The utilization of fossil fuels has caused environmental problems globally, such as air pollution and climate warming [1]. At the same time, energy consumption rises instead of falling with the rapid depletion of energy resources, of which about 40% is used for the production of cool, heat, and electricity [2]. Compared with the traditional energy supply system, the combined cooling, heating, and power (CCHP) system can promote the coordinated operation of the above three kinds of energy, which provides an effective way to solve environmental and energy problems with high energy efficiency. As a consequence, CCHP systems have been widely used in residential and office buildings, and hospitals in the past decade [3][4][5].
Although high energy efficiency has been verified in engineering applications, existing CCHP systems usually suffer from high operating costs, especially in summer. For example, Shanghai, a city in southern China, has a subtropical monsoon climate with outdoor temperatures up to 39 • C in summer, so a large amount of electricity is required to provide cooling to balance the demand-side user's high cooling demand, which leads to the surge of CCHP system's energy purchasing cost. Therefore, it is very necessary to optimize existing systems to achieve optimal economic operation in summer while meeting the demand-side energy load.
In the related literature, there are many studies focusing on optimizing the dispatch strategy of the CCHP system. Ref. [6] established a dispatch model of the CCHP system with the optimization goal of energy cost and carbon emission, and solved it by CPLEX solver. Ref. [7] established a two-stage dispatch model for the CCHP system based on quantity adjustment and proposed an iterative solution algorithm. Ref. [8] proposed an improved bee colony algorithm to solve the multi-objective dispatching model, so as to balance economic benefits and environmental friendliness. Ref. [9] built a load prediction model using neural networks, and proposed a feed-forward active operation optimization method considering load prediction for the CCHP system. Refs. [10][11][12] improved the traditional electric load following strategy and thermal load following strategy to improve the matching degree of energy output and demand. Ref. [13] used a genetic algorithm to optimize the dispatch strategy of the CCHP system to minimize the energy cost under different operating circumstances. In addition, because of human activities and outdoor weather conditions, there are dynamic uncertainties in the demand-side load, which brings a severe challenge to the energy dispatch of the CCHP system. To solve this problem, some studies tried to optimize the CCHP system dispatch strategy using model predictive control (MPC) [14][15][16] and robust optimization [17][18][19] separately, and the dispatch result showed that high-quality optimization depended on the accuracy of prediction.
Although the above studies have laid the solid foundation for the optimal energy dispatch of the CCHP system, the impact of demand charge on the operating cost is not considered. According to surveys, the so-called demand charge is already common among power industries in countries including China, the U.S, and Sweden, and is generally charged based on the electricity customers' monthly peak power demand to the grid, that is, the peak electric power purchase (in kW) regardless of timing, rather than the cumulatively purchased electricity (in kWh) [20,21]. Additionally, some studies suggest that introducing demand charge into the utility rate has two potential benefits: (1) incentivizing smarter demand-side management; (2) guaranteeing the stability of the grid, and avoiding power accidents that may endanger public safety, such as large-scale blackouts [22,23]. The presence of demand charge further increases the difficulty of CCHP system energy dispatching. Therefore, in view of the above two factors, this paper will achieve the optimal economic operation of the CCHP system in summer under the rate structure including time-of-use electricity price and demand charge.
On the other hand, the optimization methods applied in all the above study works can be categorized as model-based methods. Although the model-based method well reflects the thermodynamic performance of the CCHP system and the internal mechanism of demand-side load variation, it relies on the expert experience of modeling or the prediction information of future uncertainty, which is difficult to adapt to dynamic changes of the actual environment. Once the operating status and structure of the CCHP system change with time, the model-based method needs to remake the dispatch strategy, which increases the computational burden and greatly reduces the decision-making efficiency. Besides, the model-based method often suffers from a low intelligence level and a long solution time, which cannot meet the requirements for real-time operation.
Therefore, in view of the above limitations, this paper introduces deep reinforcement learning (DRL) into the CCHP system energy dispatching problem. DRL is an emerging technique that has received extensive attention in recent years. With its strong perception and decision-making ability, DRL not only ensures real-time requirements, but also provides a model-free optimization approach to generate adaptive strategies without prior knowledge of the environment, avoiding drawbacks of the model-based method. In this sense, the DRL-based method has more advantages.
Studies have demonstrated that DRL can be used to solve complex high-dimensional control and optimization problems such as in robots [24], games [25], and traffic controls [26]. In the field of energy dispatching, DRL-based methods have also attracted broad attention. Ref. [27] proposed a microgrid energy management method based on a deep Q network to achieve higher economic benefits under stochastic conditions. Ref. [28] achieved the optimal control of the heating, ventilation, and air conditioning (HVAC) system using deep deterministic policy gradient (DDPG) to reduce energy costs and thermal discomfort. After improving the traditional DDPG algorithm, ref. [29] proposed a dynamic dispatch method for an integrated energy system based on improved DDPG, and verified the superiority of this method. Ref. [30] used DRL to optimize the electricity dispatch of an all-electric ferry, so as to achieve the dual-objective optimization of energy cost and load expected loss. Ref. [31] developed a microgrid energy management method using proximal policy optimization (PPO) to deal with the uncertainties of renewable energy. Ref. [32] developed a CHP system dispatch method using PPO, efficiently handling the wind-turbine failure without rewriting constraints. Although these study works have made significant contributions, they only considered a simple optimization objective of minimizing costs due to grid exchange and power generation, ignoring the impact of the demand charge mechanism and the control of peak exchanging power, which may result in an extra ultra-high cost. In addition, there is not only no benchmark policy designed to further verify the superiority and generalization of the proposed method, but also no comprehensive discussion of its potential application in real scenarios from multiple aspects. Therefore, motivated by the above problems, this paper aims to develop a DRL-based energy dispatch method for the CCHP system to achieve optimal economic operation in summer. The innovation and main contributions are summarized as follows:

•
We focus on the energy dispatch for the CCHP system in summer (EDCS) and formulate the EDSC problem as a Markov decision process (MDP), in which the load uncertainty, energy cost, demand charge, and energy balance are considered. • DRL algorithm DoubleDQN is used to solve the formulated MDP and make dispatch strategies for the CCHP system. In contrast to previous study works, the proposed method directly makes decisions based on the current state, getting rid of the dependence on the accuracy of prediction information and model description.

•
By comparing with the DQN-based method and benchmark policies, the advantages of the proposed method in computational efficiency, total intra-month cost saving, and peak power purchase control are verified.

•
From the aspects of dealing with unseen physical environments, load fluctuation, and sudden unit failure, the potential of application in real scenarios is discussed.
The rest of the paper is organized as follows. The EDCS problem is mathematically formulated in Section 2; the proposed DRL-based method is introduced in Section 3; the case study is given in Section 4; finally, the conclusion is drawn in Section 5.

EDCS Problem Formulation
EDCS problem aims to efficiently manage the energy output of the CCHP system, to achieve optimal economic operation in summer. Therefore, this section first introduces a typical CCHP system, then establishes mathematical models of its key units, and finally designs the objective function and constraints of the EDCS problem.  Figure 1 shows a typical CCHP system, which consists of an internal combustion engine (ICE), absorption chiller (AC), electric chiller (EC), cooling tower (CT), gas boiler (GB), storage tank (ST) and auxiliary equipment (AE). The high-grade heat generated by the burning of natural gas is used to drive ICE to produce electricity, and the low-grade waste heat is used to produce cool through AC. If ICE fails to meet the electricity load, it will be supplemented by the grid. The cooling load is mainly satisfied by EC and AC, and the low-grade heat generated during the cooling process will be released through CT. The heating load is mainly satisfied by GB. Insufficient heating and cooling energy are supplied by ST, while the operation of this system is assisted by AE.
Additionally, the following assumptions are made for the typical CCHP system: (1) Demand-side user is directly connected to the grid, and thus the electricity load only consists of ECs, AEs, and CT.
(2) ICEs and ECs run at rated power.
(3) ICEs and ACs run in a one-to-one matching mode, meaning the number of running ICEs is equal to the number of running ACs at each time step. Additionally, the following assumptions are made for the typical CCHP system: (1) Demand-side user is directly connected to the grid, and thus the electricity load only consists of ECs, AEs, and CT.
(2) ICEs and ECs run at rated power.
(3) ICEs and ACs run in a one-to-one matching mode, meaning the number of running ICEs is equal to the number of running ACs at each time step.

Mathematical Model of Key Unit
Mathematical modeling is carried out for key units of the CCHP system under summer conditions, including ICE, AC, EC, ST, and AE: (1) ICE The ICE electric power output ( ) (kW) at time step t is defined according to the following equation: where, is the rated electric power generation of ICE (kW). Then, the ICE gas consumption ( ) (m 3 /h) and the low-grade waste heat (kW) can be calculated as follows: where, is the electric efficiency of ICE and is the low calorific value of natural gas (kWh/m 3 ).
(2) AC The AC cooling output ( ) (kW) at time step t is jointly decided by the absorbed waste heat ( ) and the coefficient of performance , that is: (3) EC The EC cooling output ( ) (kW) at time step t is defined according to the following equation:

Mathematical Model of Key Unit
Mathematical modeling is carried out for key units of the CCHP system under summer conditions, including ICE, AC, EC, ST, and AE: (1) ICE The ICE electric power output P ICE (t) (kW) at time step t is defined according to the following equation: where, P ICE rated is the rated electric power generation of ICE (kW). Then, the ICE gas consumption V ICE (t) (m 3 /h) and the low-grade waste heat Q waste (kW) can be calculated as follows: where, η ICE is the electric efficiency of ICE and ε LHV is the low calorific value of natural gas (kWh/m 3 ).
(2) AC The AC cooling output Q AC (t) (kW) at time step t is jointly decided by the absorbed waste heat Q waste (t) and the coefficient of performance COP AC , that is: (3) EC The EC cooling output Q EC (t) (kW) at time step t is defined according to the following equation: where, Q EC rated is the rated cooling generation of EC (kW). Then, the EC electric power consumption P EC e (t) (kW) can be calculated as the following equation: where, COP EC is the coefficient of performance of EC. (4) ST The energy storage relationship of ST between adjacent time steps is defined as the following equation: where, ∆t (h) is the time step interval, E ST (t) (kWh) and Q ST (t) (kW) are the remaining energy storage and the storing (Q ST (t) < 0) or releasing (Q ST (t) ≥ 0) power of ST at time step t, respectively. (5) CT and AE Since the running of CT and AE are closely related to the operating status of other energy supply units, the electricity consumption of CT and AE is allocated to ECs, ACs, ICEs, and ST for the convenience of calculation, where ICE and AC are allocated as a whole because of the matching mode, that is: where, P CT e (t) and P AE e (t) are the electric power consumption (kW) of CT and AE, respectively; P EC e,a (t), P ICE&AC e,a (t) and P ST e,a (t) are the allocated electric power consumption (kW) of ECs, ACs, ICEs and ST, respectively; n EC (t) and n ICE (t) are the number of running ECs and running ICEs, respectively; α, β, γ are the electric power consumption coefficients.

Objective Function
Specifically, the objective of EDCS is to minimize the total intra-month cost C CCHP total (RMB), which is composed of the total energy cost C ECT total (RMB) and the total demand charge C DC total (RMB), by optimally dispatching each unit of the CCHP system. So, the objective function can be defined as follows: C ECT total can be calculated by: where, T (h) is the total time steps in a dispatch period, c ECT t (RMB) is the energy costs at time step t, V ICE i (t) (m 3 /h) is the gas consumption of ith running ICE; µ gas (RMB/m 3 ), µ grid (t) (RMB/kWh) and µ sell (RMB/kWh) are the natural gas price, and the purchasing and selling electricity price, respectively. C DC total is obtained according to [33], and can be calculated by: Entropy 2023, 25, 544 6 of 20 where, max 1≤t≤T max P grid (t), 0 (kW) is the peak electric power purchase in a dispatch period, µ demand (RMB/kW) is the unit price of the demand charge. Further, to allocate C DC total to each time step [34], Equation (14) can be mathematically transformed into: where, c DCS t (RMB) is the demand charge at time step t; P peak t is the peak electric power purchase (kW) in the last t time steps and defines P peak 0 = 0. Therefore, the EDCS problem objective function can be eventually expressed as:

Energy balance Constraints
The energy balance constraints of the CCHP system under summer conditions can be listed as follows.
Cooling balance: where, Q EC i (t) (kW) and Q AC i (t) (kW) are the cooling output of ith running EC and ith running AC at time step t, respectively; Q d (t) is the cooling load (kW).
Electricity balance: where, P ICE i (t) (kW) and P EC e,i (t) (kW) are the electric power output and consumption of ith running ICE and ith running EC, respectively; P grid (t) (kW) is the exchanging of electric power with the grid, P grid (t) > 0, if power is purchased, otherwise P grid (t) ≤ 0; P d (t) is the electricity load (kW).

Operational Constraints
Besides energy balance constraints expressed in Equations (18) and (19), there are some operational constraints for energy supply units of the CCHP system. ECs, ICEs, and ACs are constrained by the quantity. Hence, where n EC max , n ICE max and n AC max are the maximum number of ECs, ICEs, and ACs, respectively. ST is constrained by the capacity and the storing/releasing power. Hence, (24) where, Q ST max is the maximum storing/releasing the power of ST; E ST rated is the rated capacity of ST and defines E ST (0) = 0.

DRL-Based EDSC Method
Since the EDCS objective function designed in Section 2.2 is to determine the output of energy supply units at each time step so as to minimize the total intra-month cost, the EDCS problem is essentially a sequential decision-making problem. In this section, we first convert the EDCS problem into a Markov decision process (MDP), in which the energy balance constraints and operational constraints of the CCHP system are considered. And the DRL algorithm is adopted to find the optimal strategy for this MDP.

Converting of EDCS Problem into MDP
An MDP is usually defined as a tuple S, A, p, r , where S is the state space, A is the action space, p is the state transition probability, r : S × A → r is the reward function. The agent observes the current environment state s ∈ S and chooses an action a ∈ A(s), where A(s) represents the set of all admissible actions at state s [35].
In this paper, the CCHP system is the environment where the agent is located, and the interaction between the agent and the environment is shown in Figure 2: at time step t, the agent observes the environment state s t , and generates an action a t based on the policy π (policy is a mapping from state s to action a, that is, a = π(s)) to make dispatch strategy so as to determine the output of energy supply units.
where, is the maximum storing/releasing the power of ST; is the rated capacity of ST and defines (0) = 0.

DRL-Based EDSC Method
Since the EDCS objective function designed in Section 2.2 is to determine the output of energy supply units at each time step so as to minimize the total intra-month cost, the EDCS problem is essentially a sequential decision-making problem. In this section, we first convert the EDCS problem into a Markov decision process (MDP), in which the energy balance constraints and operational constraints of the CCHP system are considered. And the DRL algorithm is adopted to find the optimal strategy for this MDP.

Converting of EDCS Problem into MDP
An MDP is usually defined as a tuple < , , , >, where is the state space, is the action space, is the state transition probability, : × → is the reward function.
The agent observes the current environment state ∈ and chooses an action ∈ ( ), where ( ) represents the set of all admissible actions at state [35].
In this paper, the CCHP system is the environment where the agent is located, and the interaction between the agent and the environment is shown in Figure 2: at time step t, the agent observes the environment state , and generates an action based on the policy (policy is a mapping from state to action , that is, = ( )) to make dispatch strategy so as to determine the output of energy supply units. Next, we convert the EDCS problem into an MDP, and the fundamental elements of which are defined as follows.
(1) State At time step t, the environment state information for the agent includes physical time, remaining ST storage, peak electric power purchase obtained so far, purchasing electricity price, and cooling load. Among them, the cooling load is a state variable with uncertainty, which can't be controlled by the agent. Thus, the state (5-dimension) is described as: (2) Action The aim of the EDCS problem is to decide the electric power output (∑ ( ) ( ) =1 + ( )) and the cooling output (∑ ( ) can be determined by Equations (1) and (4), respectively; after can be determined by Equation (5). Further, ( ) and ( ) can be calculated by Next, we convert the EDCS problem into an MDP, and the fundamental elements of which are defined as follows.
(1) State At time step t, the environment state information for the agent includes physical time, remaining ST storage, peak electric power purchase obtained so far, purchasing electricity price, and cooling load. Among them, the cooling load is a state variable with uncertainty, which can't be controlled by the agent. Thus, the state s t (5-dimension) is described as: (2) Action The aim of the EDCS problem is to decide the electric power output (∑ can be determined by Equations (1) and (4), respectively; after n EC (t) is decided, ∑ Q EC i (t) can be determined by Equation (5). Further, Q ST (t) and P grid (t) can be calculated by Equations (18) and (19), respectively. In other words, the output of other energy supply units can be obtained immediately after n EC (t) and n ICE (t) are jointly decided. Therefore, the agent's action a t at time step t can be represented by n EC (t) and n ICE (t), that is: (26) where, a t ∈ A, and A is the set of all admissible actions that satisfy the EDCS constraints. According to Equations (20) and (21), [0, n EC max ] and [0, n ICE max ] are the range of n EC (t) and n ICE (t), respectively.
(3) Reward function According to the EDCS objective function and constraints, the goal of the agent is to minimize the total cost which is composed of the energy cost and demand charge, while balancing the supply and demand of energy. In order to achieve this goal, the reward r t received by the agent consists of the above three parts, and can be defined as: where, λ 1 , λ 2 and λ 3 are the weighting factors, ∆Q d (t) is the cooling error (kW) (that is, the difference between supply and demand of cooling power); θ·max 0, P peak t − ε peak is the extra penalty obtained when the peak power purchase P peak t is larger than the threshold ε peak . The setting of r t transfers the total intra-month cost minimization problem to the reward maximization form of the MDP.
From the viewpoint of MDP, the quality of chosen action a at the given environment state s can be evaluated by the state-action value function Q π (s, a): where, τ [0, 1] is the discount factor used to balance the future reward and current received reward [36], E π [·] is the reward expectation under the policy π. The aim of the agent is to find an optimal policy π * , so as to maximize the function Q π (s, a), that is, π * = argmax a∈A Q π (s, a).
On the other hand, the above MDP doesn't define the state transition probability p. With a full knowledge of p, the model-based method can solve E π ∑ T k=0 τ k ·r t+k |s t = s, a t = a (that is, the Q π value of action a) through fully observing the environment [37]. However, due to the influence of dynamic uncertainty factors such as human activity, environmental weather, and unit failure, the establishment of p model becomes extremely difficult, which makes the model-based method, not a suitable solution.
Therefore, in this paper, the model-free DRL-based method is used to solve the EDCS problem under the MDP framework. By interacting with the environment, the DRL-based method can incrementally improve its decision strategy without any information on state transition probability p.

DRL Solution
where, ψ [0, 1] is the learning factor. DRL is the combination of deep neural network (DNN) and RL [39], the essence of which is to approximate Q(s, a) through the nonlinear function. When encountering problems with large state space, DRL utilizes DNN as the regression tool, which solves the Entropy 2023, 25, 544 9 of 20 potential dimension-explosion problem of the RL algorithm caused by the establishment of a huge Q table [40].
Additionally, DRL can be generally divided into the value-based DRL algorithm for discrete action space and the policy-based DRL algorithm for continuous action space [37,41]. Since the action space defined in Equation (25) is discrete, the value-based DRL algorithm is adopted to generate a dispatch strategy for the CCHP system.

Basic Principles of Value-Based DRL Algorithm
A general DNN structure for the value-based DRL algorithm is shown in Figure 3. Specifically, DNN is used to evaluate the Q value of each potential action corresponding to the state s, and the agent will select the action with the highest Q value. the potential dimension-explosion problem of the RL algorithm caused by the establishment of a huge Q table [40]. Additionally, DRL can be generally divided into the value-based DRL algorithm for discrete action space and the policy-based DRL algorithm for continuous action space [37,41]. Since the action space defined in Equation (25) is discrete, the value-based DRL algorithm is adopted to generate a dispatch strategy for the CCHP system.

Basic Principles of Value-Based DRL Algorithm
A general DNN structure for the value-based DRL algorithm is shown in Figure 3. Specifically, DNN is used to evaluate the Q value of each potential action corresponding to the state , and the agent will select the action with the highest Q value. A deep Q network (DQN) is a representative value-based DRL algorithm [42], which has two DNNs named Q network and the target network. The training objective of DQN is to minimize the loss function ( ): where, is the target Q value, − Q( , ; ) is the time-difference error; and ′ are weight parameters of the Q network and target network, respectively.
However, as shown in Equation (30), Q values used for action evaluation and selection in DQN are both outputs by the target network, which tends to cause overvaluation. In order to solve this problem, a greedy-policy-based double deep Q network (Dou-bleDQN) algorithm is proposed to decouple the evaluation and selection [43]. Dou-bleDQN evaluates the Q value using the Q network and selects the action to take using the target network. The target Q value is then: and the gradient ∇ ( ) provides the direction for DNN parameters updating, that is: where, weights is updated every step and copied to weights ′ every fixed number of steps. Additionally, a mechanism called experience replay is integrated into DoubleDQN [44]. In this mechanism, the agent stores the experience = ( , , , +1 ) at each time step, and randomly extracts a batch of experience samples for the off-line training.

Realizing EDCS with DoubleDQN
The framework of the DoubleDQN-based EDCS method is shown in Figure 4.  A deep Q network (DQN) is a representative value-based DRL algorithm [42], which has two DNNs named Q network and the target network. The training objective of DQN is to minimize the loss function L(ω): where, y t is the target Q value, y t − Q(s t , a t ; ω) is the time-difference error; ω and ω are weight parameters of the Q network and target network, respectively. However, as shown in Equation (30), Q values used for action evaluation and selection in DQN are both outputs by the target network, which tends to cause overvaluation. In order to solve this problem, a greedy-policy-based double deep Q network (DoubleDQN) algorithm is proposed to decouple the evaluation and selection [43]. DoubleDQN evaluates the Q value using the Q network and selects the action to take using the target network. The target Q value is then: and the gradient ∇ ω L(ω) provides the direction for DNN parameters updating, that is: where, weights ω is updated every step and copied to weights ω every fixed number of steps. Additionally, a mechanism called experience replay is integrated into Dou-bleDQN [44]. In this mechanism, the agent stores the experience e t = (s t , a t , r t , s t+1 ) at each time step, and randomly extracts a batch of experience samples for the off-line training.

Realizing EDCS with DoubleDQN
The framework of the DoubleDQN-based EDCS method is shown in Figure 4. For the Dou-bleDQN network, the input is a 5-dimensional vector s t = t, E ST (t), P peak t−1 , µ grid (t), Q d (t) , and the outputs are the Q values of all potential actions. We use historical data of the CCHP system as environment states to train the Dou-bleDQN algorithm offline. Its input includes physical time, ST storage, peak electric power purchase, electricity price, and cooling load. After the offline training process shown in Algorithm 1, the parameters of DoubleDQN will be fixed and used for the online decisionmaking of the CCHP system.

Algorithm 1
Offline-training process of the DoubleDQN algorithm 1: Initialize parameters of Q network ( ) and target network ( ′).

4:
for t = 1 to T do: 5: Select action at given using the greedy policy. 6: Execute in the CCHP environment and transit to the next state +1 . 7: Get reward . 8: Store the experience ( , , , +1 ) in the experience replay buffer. 9: Extract a mini-batch of experience ( , , , +1 ) with the size N from the experience replay buffer.  We use historical data of the CCHP system as environment states to train the Dou-bleDQN algorithm offline. Its input includes physical time, ST storage, peak electric power purchase, electricity price, and cooling load. After the offline training process shown in Algorithm 1, the parameters of DoubleDQN will be fixed and used for the online decisionmaking of the CCHP system.
The decision-making procedure of the proposed DoubleDQN-based EDCS method can be found in Algorithm 2. When the dispatch begins, the weights ω of the Q network trained by Algorithm 1 are loaded. At each time step t, the agent selects an action a t based on the current CCHP state s t . Next, the action is executed by energy supply units, and the CCHP environment transits to the next state s t+1 . Meanwhile, the agent receives the reward r t and observes s t+1 as the current state. This procedure repeats until the end of the dispatch period. From the procedure, it can be found that the proposed method requires no prediction information, realizing a direct mapping from the real-time state observation to the CCHP system energy dispatching. Algorithm 1 Offline-training process of the DoubleDQN algorithm 1: Initialize parameters of Q network (ω) and target network (ω ). 2: for episode = 1 to M do: 3: Initialize for t = 1 to T do: 5: Select action a t at given s t using the greedy policy. 6: Execute a t in the CCHP environment and transit to the next state s t+1 . 7: Get reward r t . 8: Store the experience (s t , a t , r t , s t+1 ) in the experience replay buffer.

12:
Copy the weights ω into the target network every fixed number of time steps: ω = ω 13: Load the weights ω of the Q network trained by Algorithm 1.

4:
Execute a t in the CCHP environment and transit to the next state s t+1 . 5: Get reward r t . 6: end for

Case Study
In this section, the effectiveness and superiority of the proposed method are verified by comparison with the designed DQN-based method and benchmark policies. In addition, the proposed method is further tested under extended scenarios.

Simulation Setup
In order to evaluate the proposed DoubleDQN-based EDCS method, a CCHP system located in EXPO Site (Shanghai, China) is taken as a subject for the case study, the structure of which is shown in Figure 1. The system provides cooling for 28 office buildings on the site, and their working hours of them are from 6:00 to 18:00 on weekdays. This paper uses the historical cooling load data of these buildings from 2018 to 2020 for the proposed method of training and testing. The cooling load data from May to July is used to train, and the cooling load data from August is used to test. The length of the dispatch period is 720 h (that is, a full month). In addition, considering the buildings' working hours, 19:00 on the previous month's last day and 18:00 on the current month's last day are regarded as the start and end indexes of the dispatch period, respectively.
The hyperparameters of DoubleDQN are shown in Table 2. Meanwhile, two common DRL algorithms DQN and dueling deep Q network (DuelingDQN) are introduced for use in subsequent subsections. Their hyperparameters are consistent with those of DoubleDQN. The proposed algorithm is implemented using the deep learning framework PyTorch 1.8.0, and the simulation experiments are carried out on a computer equipped with AMD Ryzen7 5700U CPU and 16 G RAM. Figure 5 shows the cumulative reward performance of DoubleDQN, DQN, and Du-elingDQN during the offline training process, where the full line represents the average. In the beginning, since the DRL agent is unfamiliar with the environment, the selected action results in a large variation in cumulative reward. As the training process continues, the agent begins to optimize the strategy to accumulate greater rewards. After about 32 episodes, DuelingDQN starts to converge, which is slightly prior to DoubleDQN (45 episodes) and DQN (68 episodes). However, DoubleDQN eventually obtains the greatest cumulative reward at convergence, which is far higher than that of DuelingDQN. This shows DoubleDQN's superior learning performance in exploring the optimal strategy. Additionally, due to DuelingDQN's poor cumulative reward performance, it is not considered in subsequent sections.  The proposed algorithm is implemented using the deep learning framework PyTorch 1.8.0, and the simulation experiments are carried out on a computer equipped with AMD Ryzen7 5700U CPU and 16G RAM. Figure 5 shows the cumulative reward performance of DoubleDQN, DQN, and Du-elingDQN during the offline training process, where the full line represents the average. In the beginning, since the DRL agent is unfamiliar with the environment, the selected action results in a large variation in cumulative reward. As the training process continues, the agent begins to optimize the strategy to accumulate greater rewards. After about 32 episodes, DuelingDQN starts to converge, which is slightly prior to DoubleDQN (45 episodes) and DQN (68 episodes). However, DoubleDQN eventually obtains the greatest cumulative reward at convergence, which is far higher than that of DuelingDQN. This shows DoubleDQN's superior learning performance in exploring the optimal strategy. Additionally, due to DuelingDQN's poor cumulative reward performance, it is not considered in subsequent sections.

Dispatch Result Evaluation
Well-trained DRL agents from both the DoubleDQN and DQN are run during the test month (August 2020) to evaluate their dispatch results. Additionally, two benchmark

Dispatch Result Evaluation
Well-trained DRL agents from both the DoubleDQN and DQN are run during the test month (August 2020) to evaluate their dispatch results. Additionally, two benchmark policies are designed for comparison. The benchmark policies are described as follows: (a) Rule-based policy: during the valley electricity price period, ECs are given the priority to providing cooling, followed by ST and ACs, and the priority is completely reversed at all other times; (b) Shortsighted policy (based on DRL): the reward function defined in Equation (27) only consists of the energy cost and cooling error, and DRL agents are constructed using DQN and DoubleDQN, respectively.
The dispatch results and computational time of each method above are shown in Table 3, where the cooling error is expressed by the ratio of the total error to the total load, and the shortsighted policy is identified by the letter "a". As observed from Table 3, the three DRL methods have a significant advantage in online running time. Compared with the rule-based policy, the average time for DoubleDQN, DoubleDQN-a, DQN, and DQN-a to generate a set of dispatch decisions is reduced by 92.89%, 92.79%, 93.04%, and 92.45%, respectively. That's because the rule-based policy has to take a few seconds at each time step to make a series of logical judgments based on different environment information, so as to output the dispatch strategy. On the contrary, the DRL-based methods can make decisions in less than 4.25 ms by using the DNN that is well trained during the off-line training phase, which is far less than the dispatch interval of the rule-based policy. So, they can meet the requirements for real-time operation better.
It can be also found that the rule-based policy outperforms all DRL-based methods in controlling the cooling error. However, since it can only rigidly make the dispatch strategy according to the electricity price signal, the highest total cost is obtained as a consequence. On the other hand, compared to the rule-based policy, DoubleDQN-a saves 34.69% in energy cost while DoubleDQN saves only 32.53%, and a similar difference holds comparing DQN-a to DQN. It suggests that the shortsighted policy can better save energy costs and is more suitable for scenes without demand charges. However, DoubleDQN shows a greater advantage in the demand charge, which is 2.19~46.57% lower than other methods, thus obtaining the lowest total intra-month cost. Additionally, DoubleDQN also controls the cooling error at a relatively low level, indicating the demand-side user's comfort zone is well preserved.
To explore DoubleDQN's advantage in the demand charge, we compare the electricity dispatch strategies of DoubleDQN, DoubleDQN-a, and rule-based policy for a typical week of the test month, as shown in Figure 6. As observed, under DoubleDQN-a's dispatch strategy, the grid mainly supplies the electricity load, and the electric power purchase in the flat electricity price period is far higher than other methods, especially reaching a peak of 5739 kW in 70 h. However, under DoubleDQN's dispatch strategy, the peak electric power purchase is reduced from 3578 kW (rule-based policy) and 5739 kW (DoubleDQN-a) to 3112 kW by increasing ICEs' total electric power output, which greatly reduces the demand charge and flattens the electric power purchase curve. Additionally, part of the electricity load under this strategy is shifted to the nighttime when both the cooling load and electricity price are relatively low, which saves energy costs and maintains electricity stability. Therefore, through the above comparison, it can be easily found that DoubleDQN exhibits better capabilities of demand response and peak demand management.  Figure 7 further illustrated the dispatch strategy of DoubleDQN for a typical day of the test month. It can be observed that the supply and demand of electric and cooling power are balanced throughout the day. As the electricity purchasing price is in the valley period (19:00~5:00), ECs consume the electric power purchased from the grid for providing cooling, and the redundant energy is stored in ST to release when requiring more cooling (6:00~17:00). When the cooling load and electricity price are both relatively high (9:00~16:00), ECs maintain the high total cooling output, while ICEs consume the natural gas for electric power generation to curb the peak electric power purchase, and the waste heat is used to produce cool through ACs in the corresponding period. This shows that DoubleDQN can flexibly manage the output of multiple units according to the environment information, so as to achieve the optimal energy dispatch.

Extending the Proposed Method to Different Scenarios
In this subsection, different scenarios are designed to further test the proposed Dou-bleDQN-based method, and the test month is still applied. The scenarios are: Scenario C1: The proposed method is tested in new CCHP system models generated with different system parameters to test its generalization.  Figure 7 further illustrated the dispatch strategy of DoubleDQN for a typical day of the test month. It can be observed that the supply and demand of electric and cooling power are balanced throughout the day. As the electricity purchasing price is in the valley period (19:00~5:00), ECs consume the electric power purchased from the grid for providing cooling, and the redundant energy is stored in ST to release when requiring more cooling (6:00~17:00). When the cooling load and electricity price are both relatively high (9:00~16:00), ECs maintain the high total cooling output, while ICEs consume the natural gas for electric power generation to curb the peak electric power purchase, and the waste heat is used to produce cool through ACs in the corresponding period. This shows that DoubleDQN can flexibly manage the output of multiple units according to the environment information, so as to achieve the optimal energy dispatch.  Figure 7 further illustrated the dispatch strategy of DoubleDQN for a typical day of the test month. It can be observed that the supply and demand of electric and cooling power are balanced throughout the day. As the electricity purchasing price is in the valley period (19:00~5:00), ECs consume the electric power purchased from the grid for providing cooling, and the redundant energy is stored in ST to release when requiring more cooling (6:00~17:00). When the cooling load and electricity price are both relatively high (9:00~16:00), ECs maintain the high total cooling output, while ICEs consume the natural gas for electric power generation to curb the peak electric power purchase, and the waste heat is used to produce cool through ACs in the corresponding period. This shows that DoubleDQN can flexibly manage the output of multiple units according to the environment information, so as to achieve the optimal energy dispatch.

Extending the Proposed Method to Different Scenarios
In this subsection, different scenarios are designed to further test the proposed Dou-bleDQN-based method, and the test month is still applied. The scenarios are: Scenario C1: The proposed method is tested in new CCHP system models generated with different system parameters to test its generalization.

Extending the Proposed Method to Different Scenarios
In this subsection, different scenarios are designed to further test the proposed DoubleDQN-based method, and the test month is still applied. The scenarios are: Scenario C 1 : The proposed method is tested in new CCHP system models generated with different system parameters to test its generalization.
Scenario C 3 : Three typical unit failures are designed to test the proposed method's effectiveness in handling sudden unit failure. Those three-unit failures occur randomly at each time step, and the probability distribution of which is shown in Table 4, where the value pair (A, B) represents the maximum runnable number of ECs and ICEs at time step t, respectively. Ten new CCHP system models are generated with different system parameters, and the variation of which follows a normal distribution N(ζ, 0.1ζ), where ζ is the raw parameter. The dispatch results for these 10 models under DoubleDQN, DQN and rule-based policy are compared in Figure 8. It can be observed that similar to the results in Table 3, the rulebased policy obtains the lowest cooling error, while the highest energy cost and demand charge are obtained at the same time; DQN obtains the lowest energy cost and highest cooling error. In contrast, DoubleDQN can properly balance the above three objectives, and thus stably bringing the lowest total cost and relatively lower cooling error to all CCHP system models. Therefore, it can be easily concluded that DoubleDQN can adapt to unseen physical environments after being trained offline in a fixed environment. Scenario C3: Three typical unit failures are designed to test the proposed method's effectiveness in handling sudden unit failure. Those three-unit failures occur randomly at each time step, and the probability distribution of which is shown in Table 4, where the value pair (A, B) represents the maximum runnable number of ECs and ICEs at time step t, respectively. Ten new CCHP system models are generated with different system parameters, and the variation of which follows a normal distribution ( , 0.1 ), where is the raw parameter. The dispatch results for these 10 models under DoubleDQN, DQN and rulebased policy are compared in Figure 8. It can be observed that similar to the results in Table 3, the rule-based policy obtains the lowest cooling error, while the highest energy cost and demand charge are obtained at the same time; DQN obtains the lowest energy cost and highest cooling error. In contrast, DoubleDQN can properly balance the above three objectives, and thus stably bringing the lowest total cost and relatively lower cooling error to all CCHP system models. Therefore, it can be easily concluded that DoubleDQN can adapt to unseen physical environments after being trained offline in a fixed environment.

Scenario C2
For each load fluctuation degree , 150 experiments are carried out respectively, and the MPC-based method, which is widely used to solve uncertainty problems, is introduced for the comparison. MPC-based method: (a) making dispatch decisions based on for the comparison. MPC-based method: (a) making dispatch decisions based on the rolling load-prediction, and the selected rolling horizon is set as 4 h; (b) the load-prediction error follows a normal distribution N(0, 0.2σ), where σ is the actual load value.
As shown in Figure 9, the dispatch results of DoubleDQN, DQN and MPC in a total of 600 experiments are plotted and compared by violin plot. It can be observed that with the increase of load fluctuation degree α, the three methods' distributions of total cost and cooling error become more divergent. However, compared with DQN and MPC, the results of DoubleDQN show no significant discretization and achieve the lowest mean instead. Especially when the load fluctuation degree α is 5%, the superiority of DoubleDQN is more prominent. This shows that, unlike the MPC-based method, which relies on load-forecast accuracy, DoubleDQN efficiently deals with the uncertain load by directly making dispatch decisions based on real-time observations, and thus provides the dispatch strategy with higher stability and economic benefits. So, the proposed method can be relatively easier applied in real-world scenarios, especially when the prediction information is noisy or even missing. the rolling load-prediction, and the selected rolling horizon is set as 4 h; (b) the load-prediction error follows a normal distribution (0, 0.2 ), where is the actual load value. As shown in Figure 9, the dispatch results of DoubleDQN, DQN and MPC in a total of 600 experiments are plotted and compared by violin plot. It can be observed that with the increase of load fluctuation degree , the three methods' distributions of total cost and cooling error become more divergent. However, compared with DQN and MPC, the results of DoubleDQN show no significant discretization and achieve the lowest mean instead. Especially when the load fluctuation degree is 5%, the superiority of Dou-bleDQN is more prominent. This shows that, unlike the MPC-based method, which relies on load-forecast accuracy, DoubleDQN efficiently deals with the uncertain load by directly making dispatch decisions based on real-time observations, and thus provides the dispatch strategy with higher stability and economic benefits. So, the proposed method can be relatively easier applied in real-world scenarios, especially when the prediction information is noisy or even missing.

Scenario C3
For the unit failures defined in Table 4, the corresponding DRL agents are constructed using DoubleDQN and DQN respectively to be called on-demand during the operation period of CCHP system. The dispatch results of DoubleDQN, DQN and rule-based policy in 150 experiments are compared in Table 5. It can be seen from the table that the average total cost of DoubleDQN is 2.72% and 31.51% lower than that of DQN and the rule-based policy, respectively. Meanwhile, Dou-bleDQN also achieves better performance than these methods in terms of minimum and maximum, maintaining relatively high economic benefits. On the other hand, compared with DQN, DoubleDQN has obtained a performance closer to that of the rule-based policy in controlling cooling error, well preserving the comfort zone for the demand-side user. Figure 10 further compares the dispatch strategies of DoubleDQN for a typical day both under normal and faulty conditions, i.e., sudden unit failures. Under the faulty condition, DoubleDQN increases the energy storage of ST by 42% by increasing the cool

Scenario C 3
For the unit failures defined in Table 4, the corresponding DRL agents are constructed using DoubleDQN and DQN respectively to be called on-demand during the operation period of CCHP system. The dispatch results of DoubleDQN, DQN and rule-based policy in 150 experiments are compared in Table 5. It can be seen from the table that the average total cost of DoubleDQN is 2.72% and 31.51% lower than that of DQN and the rule-based policy, respectively. Meanwhile, Dou-bleDQN also achieves better performance than these methods in terms of minimum and maximum, maintaining relatively high economic benefits. On the other hand, compared with DQN, DoubleDQN has obtained a performance closer to that of the rule-based policy in controlling cooling error, well preserving the comfort zone for the demand-side user. Figure 10 further compares the dispatch strategies of DoubleDQN for a typical day both under normal and faulty conditions, i.e., sudden unit failures. Under the faulty condition, DoubleDQN increases the energy storage of ST by 42% by increasing the cool production of ECs during the nighttime. Therefore, the higher storage allows ST to release more cooling energy than the normal condition, so as to reduce the cool production of ECs and ACs by 11% and 19% during the daytime, respectively. On the other hand, the electricity load shows a trend of shifting from the daytime to the nighttime accordingly, which reduces the purchasing electricity cost and the daytime ICE electricity generation. Therefore, it can be easily concluded that by rationally planning the energy storage and release of ST, DoubleDQN reduces the CCHP system's dependence degree of both ECs and ICEs while balancing the supply and demand, so as to efficiently handle unpredictable unit failures during the operation period, which can meet the requirements for practical application. production of ECs during the nighttime. Therefore, the higher storage allows ST to release more cooling energy than the normal condition, so as to reduce the cool production of ECs and ACs by 11% and 19% during the daytime, respectively. On the other hand, the electricity load shows a trend of shifting from the daytime to the nighttime accordingly, which reduces the purchasing electricity cost and the daytime ICE electricity generation. Therefore, it can be easily concluded that by rationally planning the energy storage and release of ST, DoubleDQN reduces the CCHP system's dependence degree of both ECs and ICEs while balancing the supply and demand, so as to efficiently handle unpredictable unit failures during the operation period, which can meet the requirements for practical application.

Conclusions
This paper focuses on the summer energy dispatching problem of the CCHP system. Aiming at minimizing the total intra-month cost and balancing the supply and demand of energy, a model-free DoubleDQN-based method is proposed to generate an optimal dispatch strategy. Different from the traditional method, this method makes dispatch decisions directly based on the real-time observed electricity price and cooling load, avoiding the suboptimal dispatching problem caused by prediction error. Through the simulation results, the following conclusions can be drawn: (1) Compared with DRL algorithms DQN and DuelingDQN, DoubleDQN shows better learning performance during off-line training and obtains the greatest cumulative reward at convergence.
(2) The proposed method shows good demand response and peak shift ability. By restraining the peak electric power purchase of the CCHP system to below 3112 kW, the total intra-month cost is further reduced by 0.13~31.32% compared with the designed DRL methods and the rule-based policy through greater demand charge advantage. In addition, the method also considers the decision speed and thermal comfort, which not only meets the requirements of real-time operation but also well preserves the comfort zone for the demand-side user.
(3) In dealing with unknown system parameters, load uncertainty and sudden unit failure, the proposed method provides the dispatch strategy with higher stability and economic benefits for the CCHP system, showing strong generalization and potential for application in real scenarios.
On the other hand, the future study work will focus on two directions: one is to focus on the improvement of the traditional DRL algorithm to reduce the time and computing resources in DRL training; the other is to include environmental pollution into optimization indicators.

Conclusions
This paper focuses on the summer energy dispatching problem of the CCHP system. Aiming at minimizing the total intra-month cost and balancing the supply and demand of energy, a model-free DoubleDQN-based method is proposed to generate an optimal dispatch strategy. Different from the traditional method, this method makes dispatch decisions directly based on the real-time observed electricity price and cooling load, avoiding the suboptimal dispatching problem caused by prediction error. Through the simulation results, the following conclusions can be drawn: (1) Compared with DRL algorithms DQN and DuelingDQN, DoubleDQN shows better learning performance during off-line training and obtains the greatest cumulative reward at convergence.
(2) The proposed method shows good demand response and peak shift ability. By restraining the peak electric power purchase of the CCHP system to below 3112 kW, the total intra-month cost is further reduced by 0.13~31.32% compared with the designed DRL methods and the rule-based policy through greater demand charge advantage. In addition, the method also considers the decision speed and thermal comfort, which not only meets the requirements of real-time operation but also well preserves the comfort zone for the demand-side user.
(3) In dealing with unknown system parameters, load uncertainty and sudden unit failure, the proposed method provides the dispatch strategy with higher stability and economic benefits for the CCHP system, showing strong generalization and potential for application in real scenarios.
On the other hand, the future study work will focus on two directions: one is to focus on the improvement of the traditional DRL algorithm to reduce the time and computing resources in DRL training; the other is to include environmental pollution into optimization indicators. Institutional Review Board Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest:
The authors declare no conflict of interest.