Co-Optimizing Battery Storage for Energy Arbitrage and Frequency Regulation in Real-Time Markets Using Deep Reinforcement Learning

Abstract: Battery energy storage systems (BESSs) play a critical role in eliminating uncertainties associated with renewable energy generation, to maintain stability and improve flexibility of power networks. In this paper, a BESS is used to provide energy arbitrage (EA) and frequency regulation (FR) services simultaneously to maximize its total revenue within the physical constraints. The EA and FR actions are taken at different timescales. The multitimescale problem is formulated as two nested Markov decision process (MDP) submodels. The problem is a complex decision-making problem with high-dimensional data and uncertainty (e.g., the price of electricity). Therefore, a novel co-optimization scheme is proposed to handle the multitimescale problem and to coordinate the EA and FR services. A triplet deep deterministic policy gradient with exploration noise decay (TDD-ND) approach is used to obtain the optimal policy at each timescale. Simulations are conducted with real-time electricity prices and regulation signal data from the American PJM regulation market. The simulation results show that the proposed approach performs better than other policies studied in the literature.


Introduction
With wider integration of renewable resources, energy storage has become a significant technology to help eliminate uncertainties associated with renewable energy generation, in order to maintain stability and improve flexibility of power networks. Among different kinds of energy storage technologies, battery energy storage systems (BESSs) have played an irreplaceable role in energy storage, grid synchronization, and other operation-assistance services [1,2] due to the following advantages: (1) BESSs can be flexibly configured depending on the power and energy requirements of system applications [3]; (2) BESSs have an instantaneous response nature [4,5]; (3) BESSs are not limited by external conditions such as geographical resources. Various studies have addressed the planning and design of battery energy storage systems for different applications. Optimized planning was proposed for a battery energy storage system considering battery degradation to reduce the operational costs of the nanogrid and microgrid [6]. In [7], the authors identified the optimal conditions for wireless charging of electric vehicles in motion to reduce energy consumption. A model was presented for a residential energy management system to dispatch battery energy storage in a market-based setting [8]. A privacy-aware framework was presented for utility-driven demand-side management with a realistic energy storage system model [9]. However, the economic viability of using BESSs to provide various services on a large scale is questionable due to their high investment costs [10].
One of the most discussed revenue sources for BESSs is to provide energy arbitrage (EA) services in a real-time electricity market by deliberately charging at low prices and discharging at higher prices to gain profit [11,12]. EA using BESSs was studied in [13], where the electricity price was assumed to be known before making storage decisions. More recent research took electricity price uncertainty into consideration, and thus many forecast methods were proposed to improve the quality of electricity price prediction, and a reinforcement learning method was proposed to maximize the profit of EA based on historical prices [14]. A stochastic dynamic programming method was used to optimize the BESS based on the forecast electricity price [15]. Neural networks were used to address the price prediction uncertainty by introducing a scenario-based optimal control framework [16]. Different models were presented in [17] to process the various price signals to optimize the price forecast.
In order to further increase the revenue of BESSs, some research work has considered a battery to provide EA and frequency regulation (FR) services simultaneously [18], since FR is a significant income source for energy storage [19-23]. For FR, BESSs are used to regulate the frequency of the power grid by charging or discharging based on the regulation signals sent by the power grid operator [5,19,24,25]. A comprehensive evaluation of the stacked revenue of a grid-connected BESS providing EA and FR services was introduced in [26]. A linear programming method was used to maximize the potential revenue of electrical energy storage from participation in EA and FR in the day-ahead market [27]. Co-optimizing EA and FR services simultaneously is considered a multitimescale problem, and a dynamic programming approach was proposed to solve the co-optimization problem [19,20]. These two existing works on co-optimizing EA and FR services assumed that the electricity prices, regulation signals, or their distributions were known in advance. However, the distributions or the values are hard to obtain in the real-time market. Furthermore, these two works did not consider the degradation cost of the BESS, a key factor in energy operational planning, without which there might be aggressive charging or discharging of the BESS [4].
Deep reinforcement learning (DRL), which combines deep neural networks (DNNs) with reinforcement learning (RL) techniques, can be a powerful tool for addressing BESS-related decision-making problems using the trial-and-error mechanism [28,29]. Compared to model-based methods, such as MILP methods, DRL approaches have the following advantages: the ability to learn from historical data, to be self-adaptable, and to learn a good control policy even under a very complex environment [12]. A novel continuous DRL algorithm was used for energy management of hybrid electric vehicles [30]. An expert-assistance deep deterministic policy gradient (DDPG) strategy was introduced to minimize the energy consumption and optimize the power allocation of hybrid electric buses [31]. A multiphysics-constrained fast-charging strategy was proposed for lithium-ion batteries in [32] based on an environmental perceptive DDPG. However, DDPG is not effective in avoiding overestimation in the actor-critic setting [33,34].
To address the above issues, a novel co-optimization scheme considering the degradation of the battery cell in the BESS is proposed for the multitimescale problem of co-optimizing EA and FR services. A novel deep reinforcement learning (DRL) approach, a triplet deep deterministic policy gradient with exploration noise decay (TDD-ND), is proposed to handle the uncertainty of the real-time electricity prices and frequency regulation signals in the multitimescale co-optimization problem for the following reasons: (1) TDD-ND does not rely on the knowledge of probability distributions; (2) TDD-ND can be used to solve the problem with continuous action space directly by using a deterministic policy in an actor-critic algorithm [34-36]; (3) the TDD-ND algorithm takes the weighted action value of triplet critics, which overcomes estimation bias in the deep deterministic policy gradient (DDPG) algorithm and the twin delayed deep deterministic policy gradient (TD3) algorithm [34]; (4) the TDD-ND algorithm adopts the exploration ND policy, which improves the exploration at the beginning of the training compared to DDPG and TD3.
The main contributions of this paper are as follows:

1. A novel co-optimization scheme is proposed to handle the multitimescale problem. The BESS decides an optimal EA action every five minutes to maximize its revenue due to the total amount of energy change, and every two seconds the BESS decides an optimal FR action to maximize the total reward, including the revenue due to energy change and the FR settlement reward. Given the FR action, the EA action has to be adjusted according to the power constraints of the BESS to maximize the total revenue of the day at the two-second level.

2. The TDD-ND algorithm is proposed to solve the co-optimization problem. To the best of our knowledge, this is the first time the TDD algorithm [34] has been used for energy storage. Our proposed method combines the TDD algorithm with the ND policy to improve the exploration during training, and thus to achieve a higher total revenue.

3. Real-time data are used to evaluate the performance of the proposed TDD-ND co-optimization approach. Simulation results show that our proposed DRL approach with the co-optimization scheme performs better than the other studied policies.

The rest of this paper is organized as follows. Section 2 explains the Pennsylvania-New Jersey-Maryland (PJM) frequency regulation market. Section 3 presents the nested system model used to formulate the co-optimization problem. Our proposed TDD-ND approach is described in Section 4. The simulation results are discussed in Section 5. Section 6 concludes the paper.

PJM Frequency Regulation Market
In the PJM frequency regulation market, generators and other devices (e.g., energy storage) can provide grid ancillary services in exchange for regulation credits [37]. PJM sends the regulation (RegD) signal every two seconds to the resources wishing to provide regulation services. Afterwards, PJM tracks the response from each resource and computes a performance score for each resource every two seconds based on the RegD signal and the regulation response. Every five minutes, the market also calculates the average performance score within the five-minute period. The performance score is a weighted sum of correlation, delay, and precision [38,39]. A BESS typically responds almost instantaneously, and hence its correlation and delay scores are close to 1. Therefore, the average performance score $SC$ of a BESS within a five-minute period can be calculated based on the precision score as follows [19]:

where $\lambda^F_t$, $rd_t$, and $ar$ denote the regulation response power taken by the BESS in response to the RegD signal at time $t$, the RegD signal at time $t$, and the maximum power capacity assigned for FR, respectively. $\Delta t$ is set to two seconds because the RegD signal $rd_t$ is sent every two seconds. $SC_t$ denotes the two-second performance score. When the BESS follows the RegD signal perfectly, $\lambda^F_t + rd_t = 0$ and $SC_t = 1$.

Every five minutes, the PJM market determines the eligibility of the resource for regulation based on its average performance score $SC$, and calculates the amount of the regulation credit settlement received by the eligible resource. If the average performance score is less than 40%, the resource will lose its regulation qualification and regulation credits during that time period [37]. The five-minute regulation credit settlement $R^C$ can be calculated as follows [37]:

where $P^C$, in \$/MW·5min, is the five-minute regulation clearance price for 1 MW of regulation capacity.
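The scoring and eligibility logic described above can be illustrated with a minimal Python sketch. The 40% threshold and the averaging of two-second scores over a five-minute window come from the text; the per-step score `1 - |response + signal| / ar` (floored at 0) and the settlement `price * capacity * score` are assumed forms chosen only so that a perfectly followed signal scores 1, and all function names are hypothetical.

```python
from typing import Sequence

ELIGIBILITY_THRESHOLD = 0.4  # from the text: below 40% the resource loses its credits


def step_score(response_mw: float, regd_mw: float, ar_mw: float) -> float:
    """Two-second precision score.

    ASSUMED form: 1 - |response + signal| / ar, floored at 0, so that
    response + signal == 0 (perfect following) gives a score of 1.
    """
    error = abs(response_mw + regd_mw) / ar_mw
    return max(0.0, 1.0 - error)


def average_score(responses_mw: Sequence[float],
                  signals_mw: Sequence[float],
                  ar_mw: float) -> float:
    """Mean of the two-second scores over one five-minute window."""
    scores = [step_score(r, s, ar_mw) for r, s in zip(responses_mw, signals_mw)]
    return sum(scores) / len(scores)


def regulation_credit(avg_score: float, price_per_mw: float, ar_mw: float) -> float:
    """Five-minute settlement.

    ASSUMED form: price * capacity * score when eligible, otherwise 0.
    """
    if avg_score < ELIGIBILITY_THRESHOLD:
        return 0.0
    return price_per_mw * ar_mw * avg_score
```

For example, a window in which half of the steps are followed perfectly and half are missed by the full capacity averages to 0.5 under these assumptions, which is still above the eligibility threshold.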

Nested System Model and Problem Formulation
The system model, illustrated in Figure 1, consists of two main parts, i.e., the power grid and the BESS, which includes a battery and an energy management system (EMS). The BESS participates in the energy and regulation markets. The power grid sends the real-time electricity locational marginal price (LMP), the FR signal, and the FR market clearance price to the EMS in the BESS. The EMS then generates the operation signal for the battery to take action. At the same time, the battery sends feedback with its real-time status to the EMS. Based on the real-time status of the battery and the information from the power grid, the EMS generates a new operation signal for the battery.

The BESS co-optimizes EA and FR services to maximize its total reward within a one-day time horizon in a real-time PJM market: EA acts every five minutes, and FR responds every two seconds [19]. Due to the nature of the problem, the timescale is divided into two dimensions: a large timescale with five-minute intervals and a small timescale with two-second intervals, where the two-second intervals are nested in the five-minute intervals. The two optimization problems are formulated as two nested MDP submodels in the following two subsections.

The Five-Minute MDP Submodel Formulation
The one-day horizon of the five-minute submodel, decomposed into 288 five-minute increments (i.e., $\Delta T = 5$ min) as illustrated in Figure 2, is denoted as $T^A = \{0, \Delta T, 2\Delta T, 3\Delta T, \ldots, 287\Delta T\}$. The BESS takes a charging or discharging action every five minutes based on its current state to maximize the cumulative reward within the one-day horizon. The state, action, and reward are defined as follows.
Figure 2. The nested timescales in a day.

State
The state of the BESS at time $T$ can be defined as $S^A_T = (E_T, P^A_T)$, where $E_T$ is the BESS energy level and $P^A_T$ is the real-time electricity locational marginal price (LMP) at time $T$.

Action
The action in the five-minute submodel, denoted as $\lambda_T$, is the total amount of power change due to EA and FR at time $T$ within the five-minute interval. $\lambda_T > 0$ represents that the BESS is charging, while $\lambda_T < 0$ implies that the BESS is discharging. The optimal action at time $T$ is denoted as $\lambda^*_T$. The action must not exceed the maximum power capacity $B$ of the BESS:

$$-B \le \lambda_T \le B. \tag{4}$$

The total amount of energy stored in the BESS at time $T$ should be within its maximum energy capacity $E_{max}$:

$$0 \le E_T \le E_{max}. \tag{5}$$

After taking action $\lambda_T$, state $S^A_T$ transitions to state $S^A_{T+\Delta T}$ at time $T + \Delta T$. The real-time price $P^A_T$ is updated to $P^A_{T+\Delta T}$, and the energy level $E_T$ evolves to $E_{T+\Delta T}$, which can be calculated as follows:

where $\eta_c$ and $\eta_d$ denote the charging and discharging efficiency, respectively.
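As a rough illustration of the state transition, the sketch below clips the action to the power capacity and updates the energy level. The efficiency model is an assumption, not taken from the paper: charging is assumed to store `eta_c * power * dt`, and discharging is assumed to draw `|power| * dt / eta_d` from the battery. Clamping the result to `[0, e_max]` is a simplification; a full environment would instead restrict the feasible action. All names are hypothetical.

```python
def clip_power(power_mw: float, b_max_mw: float) -> float:
    """Keep the action within the power capacity, -B <= power <= B."""
    return max(-b_max_mw, min(b_max_mw, power_mw))


def next_energy(e_mwh: float, power_mw: float, dt_hours: float,
                eta_c: float, eta_d: float, e_max_mwh: float) -> float:
    """Energy level after one interval (power > 0 charges, power < 0 discharges).

    ASSUMED efficiency model: charging stores eta_c * P * dt, while
    discharging removes |P| * dt / eta_d from the stored energy.
    """
    if power_mw >= 0:
        delta = eta_c * power_mw * dt_hours
    else:
        delta = power_mw * dt_hours / eta_d
    # Simplification: clamp to the energy capacity instead of limiting the action.
    return min(max(e_mwh + delta, 0.0), e_max_mwh)
```

With a five-minute interval, `dt_hours` is `5 / 60`; the same function can be reused for the two-second submodel with `dt_hours = 2 / 3600`.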

Degradation Cost and Reward
The degradation cost of the BESS is a key factor in energy operational planning [4], as the battery cells degrade under repeated charge/discharge cycles. The degradation cost of the BESS can be calculated as follows [4]:

$$f_T(b) = c_b |\lambda_T| \cdot \Delta T, \tag{7}$$

where $c_b$ is the degradation cost coefficient, which can be calculated as follows [4]:

where $P_{cell}$ is the price of the battery cell in the BESS and $N$ is the number of cycles that the BESS could be operated within the state of charge (SoC) constraint $[SOC_{min}, SOC_{max}]$.
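A minimal sketch of the linearized degradation cost follows. The product of the coefficient, the absolute power, and the interval length matches the form the paper uses for the two-second submodel; treating `c_b` as a given input in $/MWh and expressing the interval in hours are assumptions made here for illustration.

```python
from typing import Iterable


def degradation_cost(c_b: float, power_mw: float, dt_hours: float) -> float:
    """Linearized degradation cost c_b * |power| * dt for one interval.

    The absolute value means charging and discharging are penalized equally.
    ASSUMED units: c_b in $/MWh, power in MW, dt in hours.
    """
    return c_b * abs(power_mw) * dt_hours


def daily_degradation_cost(c_b: float, powers_mw: Iterable[float],
                           dt_hours: float) -> float:
    """Sum of per-interval degradation costs over a sequence of actions."""
    return sum(degradation_cost(c_b, p, dt_hours) for p in powers_mw)
```

Because the cost depends only on the magnitude of the power, a policy that charges and discharges aggressively accumulates cost in both directions, which is what discourages unnecessary cycling.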
After taking the action, the BESS will receive a reward. In order to avoid conservative actions caused by negative rewards in the learning process, an average electricity price $\bar{P}^A$ is introduced in the reward $R^A_T(S^A_T, \lambda_T)$ for performing action $\lambda_T$ in state $S^A_T$, based on the basic principle of EA [14], as follows:

The Two-Second MDP Submodel Formulation
The BESS needs to respond to the updated RegD signal every two seconds. Because the two-second intervals are nested in the five-minute horizon, the one-day time horizon in two-second increments is denoted as $T^F = \{0, \Delta t, 2\Delta t, 3\Delta t, \ldots, (150 \cdot 288 - 1)\Delta t\}$, where $\Delta t = 2$ s, as shown in Figure 2. The BESS takes a charging or discharging action every two seconds based on its current state to maximize the cumulative reward within the one-day horizon. The state, action, and reward are defined as follows.
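The nesting of the two time grids can be checked with a short sketch. The interval lengths and step counts (288 five-minute steps, each containing 150 two-second steps) come from the text; the helper names are hypothetical.

```python
DT_LARGE_S = 5 * 60  # five-minute EA interval, in seconds
DT_SMALL_S = 2       # two-second FR interval, in seconds

STEPS_PER_DAY_LARGE = 24 * 3600 // DT_LARGE_S               # 288 five-minute steps
SUBSTEPS_PER_LARGE = DT_LARGE_S // DT_SMALL_S               # 150 two-second steps each
STEPS_PER_DAY_SMALL = STEPS_PER_DAY_LARGE * SUBSTEPS_PER_LARGE  # 43,200 in total


def parent_step(small_index: int) -> int:
    """Index of the five-minute step that contains a given two-second step."""
    return small_index // SUBSTEPS_PER_LARGE


def nested_steps(large_index: int) -> range:
    """Two-second step indices nested inside a given five-minute step."""
    start = large_index * SUBSTEPS_PER_LARGE
    return range(start, start + SUBSTEPS_PER_LARGE)
```

In a simulation loop, `parent_step` gives the five-minute action that is in force while the two-second agent responds to each RegD sample.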

State
The state at time $t$ is denoted as $S^F_t = (E_t, rd_t)$, where $E_t$ is the energy level of the BESS and $rd_t$ is the received RegD signal at that time.

Action
The action is the regulation response power, denoted as $\lambda^F_t$ at time $t$, which is constrained by Equation (4), i.e., it must not exceed the maximum power capacity $B$ of the BESS. After performing an action $\lambda^F_t$ at time $t$, state $S^F_t$ transitions to state $S^F_{t+\Delta t}$ at time $t + \Delta t$, and the energy level $E_t$ is updated to $E_{t+\Delta t}$ based on $\lambda_t$, the total amount of power change at time $t$ due to EA and FR:

Reward
Based on the PJM market regulation policy, the reward $R_t(S^F_t, \lambda^F_t)$ for performing action $\lambda^F_t$ at state $S^F_t$ can be calculated as

where $f_t(b) = c_b |\lambda_t| \cdot \Delta t$ according to Equation (7) and $R^A_t$ is the reward due to the total amount of energy change caused by both EA and FR within the two-second interval:

Instead of calculating the FR settlement at the end of every five minutes, in the two-second submodel the FR reward needs to be evaluated every two seconds once an action $\lambda_t$ is chosen. Based on Equation (3), $R^F_t$ is the equivalent real-time regulation settlement reward within the two-second interval:

assuming that the maximum power capacity $ar$ assigned for FR is the power capacity $B$ of the BESS.

Proposed Co-Optimization Scheme
Solving the co-optimization problem for EA and FR means finding the optimal action selection policy for the BESS that obtains the maximum expected reward within a day. A co-optimization scheme is proposed to handle the multitimescale problem and coordinate the EA and FR services, as illustrated in Figure 3. Once $\lambda^F_t$ is derived, $\lambda_t$ can be calculated as follows:

where $\lambda_t$ is not always equal to $\lambda^*_T$, due to the power constraint (Equation (4)) of the BESS. The first case shows that when the optimal action for FR $\lambda^F_t$ is discharging (i.e., $\lambda^F_t < 0$) and the best action for EA is charging, the action for EA will be charging with the highest power capacity $B$. In this case, the charging value of $\lambda^*_T$ was set too high, and $\lambda_t$ is less than $\lambda^*_T$. For the second case, $\lambda_t$ is set to $\lambda^*_T$. For the third case, when the optimal action for FR $\lambda^F_t$ is charging and the best action for EA is discharging, the action for EA is discharging with the highest power capacity $-B$. In this case, the discharging value of $\lambda^*_T$ was set too low, and $\lambda_t$ is greater than $\lambda^*_T$.

Proposed Triplet Deep Deterministic Policy Gradient with Exploration Noise Decay Approach
A novel DRL approach, combining TDD [34] and ND, is proposed to address the co-optimization problem. TDD-ND is a model-free, off-policy actor-critic algorithm, in which the triplet critics are used to limit estimation bias, and the exploration ND policy is used to improve the exploration in the algorithm.

Triplet Deep Deterministic Policy Gradient Algorithm
The TDD algorithm [34] is an off-policy RL algorithm that can be applied to optimization problems with continuous state and action spaces [35,36]. TDD includes a single actor network (i.e., a deterministic policy network) $\pi_\phi$ and its actor target network $\pi_{\phi'}$. In addition, TDD adopts three critic networks $Q_{\theta_1}$, $Q_{\theta_2}$, and $Q_{\theta_3}$, together with their target networks $Q_{\theta'_1}$, $Q_{\theta'_2}$, and $Q_{\theta'_3}$, respectively. The target value $y_t$ is updated using the weighted minimum Q-value of the target Q-networks $Q_{\theta'_1}$ and $Q_{\theta'_2}$, combined with the weighted value of $Q_{\theta'_3}$, as follows [34]:

$$y_t = r_t + \gamma \left[ \beta \min_{i=1,2} Q_{\theta'_i}(s_{t+1}, \tilde{a}_{t+1}) + (1 - \beta)\, Q_{\theta'_3}(s_{t+1}, \tilde{a}_{t+1}) \right], \tag{15}$$

where $\beta \in (0, 1)$ is the weight of the pair of critics, $\gamma \in [0, 1]$ is a discount factor, and $\tilde{a}_{t+1}$ is the clipped target action, calculated as follows:

$$\tilde{a}_{t+1} = \pi_{\phi'}(s_{t+1}) + \epsilon, \quad \epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c), \tag{16}$$

where $\epsilon$ is the clipped Gaussian noise with standard deviation $\sigma$, and $c$ is the edge value. The parameters of the critic networks are updated by minimizing the following loss:

where $\mathcal{R}$ is a replay buffer used to store and replay experience transitions $(s_t, a_t, r_t, s_{t+1})$ consisting of states, actions, rewards, and next states. The deterministic policy network in the actor is updated using the sampled policy gradient, which is shown as follows:
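The weighted triplet target described above can be sketched for a single transition as follows. The combination of the minimum of the first two target critics (weight `beta`) with the third target critic (weight `1 - beta`) follows the verbal description in the text; the final clipping of the target action to the power limits `[-b_max, b_max]` is an assumption borrowed from common TD3 practice, and all function names are hypothetical.

```python
import random


def clipped_noise(sigma: float, c: float, rng: random.Random) -> float:
    """Gaussian noise with standard deviation sigma, clipped to [-c, c]."""
    return max(-c, min(c, rng.gauss(0.0, sigma)))


def target_action(mu_next: float, sigma: float, c: float,
                  b_max: float, rng: random.Random) -> float:
    """Target-policy output plus clipped noise.

    ASSUMPTION: the result is also clipped to the action bounds [-b_max, b_max].
    """
    return max(-b_max, min(b_max, mu_next + clipped_noise(sigma, c, rng)))


def triplet_target(reward: float, q1_next: float, q2_next: float,
                   q3_next: float, beta: float, gamma: float) -> float:
    """Weighted target: the min of the critic pair blended with the third critic."""
    pair_min = min(q1_next, q2_next)
    return reward + gamma * (beta * pair_min + (1.0 - beta) * q3_next)
```

Setting `beta = 1` recovers the pessimistic TD3-style target that uses only the minimum of two critics, while smaller values of `beta` shift weight toward the third critic.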

Proposed TDD-ND Co-Optimization Approach
The ND policy is combined with the TDD algorithm to address the co-optimization problem. For the ND policy, the standard deviation of the exploration noise $\epsilon$ is set to the maximum value $\sigma_{max}$ at the beginning of the training, gradually reduced to the minimum value $\sigma_{min}$ with a decay of $\sigma_{decay}$ as the number of training episodes increases, and kept at the minimum value $\sigma_{min}$ for the rest of the training. A TDD-ND algorithm for five-minute submodel optimization is presented in Algorithm 1.
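One possible reading of the noise decay policy is sketched below. The text fixes only the start value, the floor, and the existence of a decay parameter; the multiplicative per-episode schedule used here is an assumption (a linear schedule would fit the description equally well), and the function name is hypothetical.

```python
def noise_std(episode: int, sigma_max: float, sigma_min: float,
              sigma_decay: float) -> float:
    """Exploration noise standard deviation for a given training episode.

    ASSUMED schedule: sigma_max * sigma_decay**episode, floored at sigma_min,
    with 0 < sigma_decay < 1.
    """
    return max(sigma_min, sigma_max * sigma_decay ** episode)
```

Early episodes therefore explore with noise close to `sigma_max`, and once the schedule reaches `sigma_min` the noise level stays constant for the remainder of training.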

Algorithm 1: The TDD-ND training process for five-minute submodel optimization
Initialize the actor network $\pi_\phi$, the actor target network $\pi_{\phi'} \leftarrow \pi_\phi$, the size $R$ of replay buffer $\mathcal{R}$, and the mini-batch size $m$.
Initialize the three critic networks $Q_{\theta_1}$, $Q_{\theta_2}$, and $Q_{\theta_3}$, and the three critic target networks $Q_{\theta'_i} \leftarrow Q_{\theta_i}$, $i = 1, 2, 3$.
1: for each training episode do
2: for $T \in T^A$ do
3: Based on the state of the BESS $S^A_T$, including $E_T$ and $P^A_T$, choose action $\lambda_T$, and observe reward $R^A_T$ and the next state of the BESS $S^A_{T+1}$.
4: Store transition $(S^A_T, \lambda_T, R^A_T, S^A_{T+1})$ in $\mathcal{R}$.
5: Sample a batch of transitions $(S^A_j, \lambda_j, R^A_j, S^A_{j+1})$ from $\mathcal{R}$.
6: From the next state of the BESS $S^A_{T+1}$, the actor target plays the next charging or discharging action of the BESS $\lambda_{T+1}$ via Equation (16).
7: Add Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma)$ to this next action $\lambda_{T+1}$.
8: Decrease $\sigma$ from $\sigma_{max}$ to $\sigma_{min}$ with the decay of $\sigma_{decay}$ as the episode number increases.
9: Update the parameters of the three critic networks by minimizing the loss defined by Equation (17).
10: Update the weights of the critic target networks by: $\theta'_i \leftarrow \tau\theta_i + (1 - \tau)\theta'_i$, $i = 1, 2, 3$.
11: Update the actor network by performing a gradient step every 2 iterations based on Equation (18).
12: Update the weights of the actor target network by: $\phi' \leftarrow \tau\phi + (1 - \tau)\phi'$ every 2 iterations.
13: end for
14: end for

The flow chart of the proposed TDD-ND co-optimization approach is illustrated in Figure 3. The TDD-ND algorithm is used to train the neural networks for five-minute submodel optimization. The best actions of the five-minute submodel $\lambda^*_T$ are then input into the two-second submodel environment. The TDD-ND algorithm is then used to train the neural networks for two-second submodel optimization. For each training iteration, after action $\lambda^F_t$ is chosen, $\lambda_t$ is calculated based on Equation (14), and the reward $R_t(S^F_t, \lambda^F_t)$ is calculated using Equation (11) to maximize the accumulated reward within the one-day horizon. After each time step, a mini-batch of $m$ transitions is sampled uniformly from the replay buffer $\mathcal{R}$.
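Two mechanical pieces of the training loop, uniform sampling from a bounded replay buffer and the soft target update with rate τ, can be sketched in plain Python. Representing network weights as flat lists of floats is a simplification for illustration; the class and function names are hypothetical and not taken from the paper.

```python
import random
from collections import deque
from typing import List, Tuple

Transition = Tuple[tuple, float, float, tuple]  # (state, action, reward, next_state)


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped when full."""

    def __init__(self, capacity: int) -> None:
        self._data: deque = deque(maxlen=capacity)

    def store(self, state: tuple, action: float, reward: float,
              next_state: tuple) -> None:
        self._data.append((state, action, reward, next_state))

    def sample(self, m: int, rng: random.Random) -> List[Transition]:
        """Uniformly sample up to m transitions without replacement."""
        return rng.sample(list(self._data), min(m, len(self._data)))

    def __len__(self) -> int:
        return len(self._data)


def soft_update(target: List[float], online: List[float],
                tau: float) -> List[float]:
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]
```

A small τ keeps the target networks changing slowly, which is the usual reason for using a soft update rather than copying the online weights outright.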

Experimental Results
The performance of the proposed co-optimization approach is evaluated in a real-world scenario. The values of the parameters used in the simulations are listed in Table 1. Some of the parameters are varied in the simulation and will be noted accordingly. The parameter settings for the TDD-ND algorithm are listed in Table 2.

Based on the principle of EA, the BESS charges at low electricity prices and discharges at high electricity prices. The average price works as a simple indicator to determine whether the price $P^A_T$ is low or high compared to the historical values. The operations of the BESS in a day are illustrated in Figure 4. The figure shows that when $P^A_T$ is lower than the average price, the BESS actions are mainly larger than 0, which means the BESS is charging. However, when $P^A_T$ is higher than the average price, the BESS mainly discharges to gain profit. The figure matches well with the principle of EA.

The performance of the TDD-ND algorithm for co-optimizing EA and FR is studied by comparing it with another widely used DRL algorithm, the deep Q-learning (DQL) algorithm. The TDD-ND and DQL algorithms were used to train the five-minute and two-second submodels for 500 episodes. During the training using TDD-ND, the total revenue of a day was validated after every 10 episodes without adding exploration noise $\epsilon$ to see whether the results were close to the training results. The learning curves of the TDD-ND algorithm and the DQL algorithm are illustrated in Figures 5 and 6, respectively. These two figures show that the TDD-ND algorithm performs much better than the DQL algorithm in terms of the average performance score and the total reward. The reason is that the TDD-ND algorithm can choose more accurate continuous actions rather than the discretized actions used in DQL, and can thus obtain a higher average performance score and total reward.

The impact of different levels of power capacity $B$ and energy capacity $E_{max}$ on the performance of the TDD-ND algorithm and the DQL algorithm is studied. After training, the TDD-ND test results are slightly higher than their training values without the exploration noise. Figure 7 shows that the proposed TDD-ND algorithm always performs better than the DQL algorithm. The reason is that the DQL algorithm chooses discretized actions rather than continuous actions, which negatively impacts the total revenue. The figure also shows that the total revenue using both algorithms increases with power capacity $B$ in a similar trend. For both algorithms, the total revenue increases sharply with $B$ when $B$ is between 0.5 and 1.0. The reason is that when $B$ is 0.5, $SC_t$ is smaller than 0.4 in most time slots, and thus the regulation settlement reward $R^F_t$ becomes 0. When $B$ increases to 1, $SC_t$ is greater than 0.4 in many more time slots. Therefore, the total revenue is significantly increased. Between $B = 1$ and $B = 2.5$, the improvement in total revenue approximately follows the increase of $B$, since $R^F_t$ is the dominant factor in the total revenue and is a linear function of $B$.
How energy capacity $E_{max}$ impacts the total revenue using the proposed TDD-ND algorithm and the DQL algorithm is shown in Figure 8. The figure shows that the TDD-ND algorithm generates more total revenue than the DQL algorithm under each of the $E_{max}$ settings. Compared to the impact of power capacity $B$, the increase of energy capacity $E_{max}$ only makes a slight change to the total revenue. For both algorithms, the total revenue rises slowly with the increase of $E_{max}$ between $E_{max} = 2.5$ and 12.5, as increasing the energy capacity only improves $R^A_t$, while $R^F_t$ dominates the total revenue in Equation (11) when $B = 1$ MW. Compared to the TDD-ND algorithm, the DQL algorithm has a slightly higher improvement rate of the total reward with the increase of $E_{max}$, since a higher $E_{max}$ allows the DQL algorithm to choose better discretized actions for EA, and thus yields a higher improvement rate of $R^A_t$ compared to the TDD-ND algorithm with continuous actions.

Performance Comparison of Various Schemes
To demonstrate the effectiveness of our proposed TDD-ND co-optimization scheme, the following methods are compared: (1) Pure-EA scheme, in which the BESS only provides the EA service; (2) Pure-FR scheme, in which the BESS only provides the FR service; (3) Rule-based co-optimization scheme, in which the BESS provides the EA and FR services under the following rule: the action $\lambda_t$ is set to $\lambda^*_T$ to maximize the reward due to the total amount of energy change caused by both EA and FR; (4) TDD-ND co-optimization scheme, which combines our proposed TDD-ND algorithm with the proposed co-optimization scheme.
The total revenue using each scheme with different settings of $B$ and $E_{max}$ is illustrated in Figures 9 and 10, respectively. Figure 9 shows that the TDD-ND co-optimization scheme generates much more total revenue than the other three schemes at every setting of $B$, as the TDD-ND co-optimization scheme tries to maximize the total accumulated reward. The total revenue using the pure-EA scheme is very small, since FR is much more profitable than EA. For the TDD-ND scheme, the rule-based scheme, and the pure-FR scheme, the regulation settlement reward $R^F_t$ increases with $B$, and thus the total revenue increases with $B$. For the pure-EA scheme, a higher $B$ allows the BESS to charge more when $P^A_T$ is low and discharge more when $P^A_T$ is high. The increasing rates of the total revenue using the TDD-ND co-optimization scheme and the rule-based co-optimization scheme from $B = 0.5$ to $B = 1$ are higher than those from $B = 1$ to $B = 2.5$. The reason is that when $B = 0.5$, the performance score is smaller than 0.4 in most time slots, and thus the regulation settlement reward becomes 0. Therefore, the total revenue of the rule-based co-optimization scheme is close to that of the pure-FR scheme. When $B$ increases to 1, the performance score is greater than 0.4 in many more time slots, and with the coordination of EA, the total revenue is significantly increased. When $B$ is between 1 and 1.5, the total revenue of the pure-FR scheme is much lower than that of the rule-based co-optimization scheme and that of the TDD-ND co-optimization scheme. The reason is that the pure-FR scheme cannot follow the $rd_t$ signals closely due to the limitations of the energy capacity, while the rule-based scheme can coordinate the energy capacity for EA and FR. When $B$ reaches 2 or higher, the rule-based scheme has a total revenue similar to that of the pure-FR scheme, as this setting of $B$ allows the rule-based scheme to follow the $rd_t$ signal closely.

The total revenue of each of the four schemes under different settings of energy capacity $E_{max}$ is presented in Figure 10. The total revenue of the proposed TDD-ND co-optimization scheme is much higher than those of the other three schemes. FR is much more profitable than EA under all of the $E_{max}$ settings. The total revenue of the TDD-ND scheme and the rule-based scheme increases slightly with $E_{max}$, because the increase of $E_{max}$ only improves the value of $R^A_t$, which is a small portion of the total revenue $R_t$. For the pure-EA scheme, the total revenue increases with energy capacity $E_{max}$, as a higher energy capacity allows the BESS to charge more when $P^A_T$ is low and discharge more when $P^A_T$ is high.

Conclusions
A battery energy storage system (BESS) providing both energy arbitrage (EA) and frequency regulation (FR) services simultaneously to maximize its total revenue within a day was considered. The BESS takes an EA action every five minutes and an FR action every two seconds. The multitimescale co-optimization problem was formulated as two nested Markov decision process (MDP) submodels. A novel co-optimization scheme was proposed to handle the multitimescale problem and to coordinate the EA and FR services to maximize the total revenue. A novel deep reinforcement learning (DRL) algorithm, triplet deep deterministic policy gradient with exploration noise decay (TDD-ND), was proposed to determine the best actions to take to maximize the accumulated reward within the one-day horizon. The proposed TDD-ND algorithm achieved 22.8% to 32.9% higher total revenue than the deep Q-learning (DQL) algorithm under various power capacity settings of the BESS when its energy capacity was 5 MWh, and achieved 16.7% to 26.6% higher total revenue under various energy capacity settings when the power capacity was 1 MW. Additionally, our proposed TDD-ND co-optimization scheme achieved 37.7% to 148.8%, 41.8% to 156.3%, and 3507.8% to 15,583.2% higher total revenues compared to the rule-based co-optimization scheme, the pure-FR scheme, and the pure-EA scheme, respectively, under various power capacity settings when the energy capacity of the BESS was 5 MWh. When the power capacity was set to 1 MW, the proposed TDD-ND co-optimization scheme achieved total revenues 49.6% to 56.2%, 51.0% to 198.4%, and 7156.2% to 12,777.0% higher than the rule-based co-optimization scheme, the pure-FR scheme, and the pure-EA scheme, respectively, under the various energy capacity settings. In the future, the use of the co-optimization methods in multivector energy systems with different timescales can be investigated.

Nomenclature
$ar$: The maximum regulation capacity in MW assigned by PJM
$B$: The maximum power capacity of the BESS in MW
$c_b$: The linearized battery degradation cost coefficient
$E_T$: Energy level of the BESS in MWh at time $T$ in the five-minute submodel
$E_t$: Energy level of the BESS in MWh at time $t$ in the two-second submodel
$E_{max}$: The maximum energy capacity of the BESS in MWh
$f(b)$: The degradation cost
$m$: The mini-batch size
$N$: The number of cycles that the BESS could be operated within the SoC constraint
$P^A_T$: The real-time electricity price at time $T$
$\bar{P}^A$: The average value of electricity prices in the past day
$P_{cell}$: The price of the battery cell in the BESS
RegD: Dynamic signal for fast regulation, which is a measure of the imbalance between sources and uses of power in MW in the grid
$rd_t$: The regulation signal (RegD) sent by PJM at time $t$ to the BESS to provide regulation service
$R^C$: The five-minute regulation settlement
$R^A_T$: The reward for performing an action $\lambda_T$ at state $S^A_T$ in the five-minute submodel
$R_t$: The reward for performing action $\lambda^F_t$ at state $S^F_t$ in the two-second submodel
$R^F_t$: The real-time regulation settlement reward within the two-second interval
$SC$: Average performance score within a five-minute period indicating the performance of FR
$SC_t$: The two-second performance score at time $t$
$S^A_T$: The state of the five-minute submodel at time $T$
$S^F_t$: The state of the two-second submodel at time $t$
$T$: The time indicator in the five-minute submodel
$t$: The time indicator in the two-second submodel
$T^A$: The one-day horizon of the five-minute submodel
$T^F$: The one-day horizon of the two-second submodel
$\Delta T$: The five-minute time interval
$\Delta t$: The two-second time interval
$\lambda_T$: The action of the total amount of power change in MW due to EA and FR at time $T$ in the five-minute submodel
$\lambda^F_t$: The action in MW of the BESS response to the RegD signal at time $t$ in the two-second submodel
$\lambda^*_T$: The optimal action of the five-minute submodel at time $T$
$\lambda_t$: The total amount of power change at time $t$ due to EA and FR in the two-second submodel
$\eta_c$: The charging efficiency of the BESS
$\eta_d$: The discharging efficiency of the BESS
$\alpha_{actor}$: Learning rate for the actor
$\alpha_{critic}$: Learning rate for the critic
$\sigma_{max}$: The maximum standard deviation value in the exploration noise
$\sigma_{min}$: The minimum standard deviation value in the exploration noise
$\sigma_{decay}$: The decay of the standard deviation value in the exploration noise decay policy
$\gamma$: The discount factor for future rewards
$\mathcal{R}$: Replay buffer
$R$: The size of the replay buffer
$\epsilon$: Clipped Gaussian noise
$\pi_\phi$: The actor network in TDD-ND
$\pi_{\phi'}$: The actor target network in TDD-ND
$Q_\theta$: Critic networks in TDD-ND
$Q_{\theta'}$: Target networks in TDD-ND

Figure 1. The configuration of the system model.

Figure 3. The TDD-ND approach for the proposed co-optimization scheme.

Figure 4. The BESS operation in a one-day period after five-minute submodel optimization.

Figure 5. The learning curve of the average performance score of the day trained by TDD-ND and DQL.

Figure 7. The total revenue of the day with different levels of power capacity B.

Figure 9. The comparison of the total revenues between using our proposed TDD-ND co-optimization scheme, rule-based co-optimization scheme, pure-FR scheme, and pure-EA scheme under different levels of power capacity B.

Figure 10. The comparison of the revenues between following the proposed TDD-ND co-optimization, rule-based co-optimization, and pure-FR and pure-EA schemes under different settings of energy capacity E max.

Table 1. The values of the parameters used in the simulation.