Bidding a Battery on Electricity Markets and Minimizing Battery Aging Costs: A Reinforcement Learning Approach

Battery storage is emerging as a key component of intelligent green electricity systems. The battery is monetized through market participation, which usually involves bidding. Bidding is a multi-objective optimization problem, involving targets such as maximizing market compensation and minimizing penalties for failing to provide the service and costs for battery aging. In this article, battery participation is investigated on primary frequency reserve markets, and reinforcement learning (RL) is applied for the optimization. In previous research, only simplified formulations of battery aging have been used in the RL formulation, so it is unclear how the optimizer would perform with a real battery. In this article, a physics-based battery aging model is used to assess the aging. The contribution of this article is a methodology involving a realistic battery simulation to assess the performance of the trained RL agent with respect to battery aging, in order to inform the selection of the weighting of the aging term in the RL reward formula. The RL agent performs day-ahead bidding on the Finnish Frequency Containment Reserves for Normal Operation market, with the objectives of maximizing market compensation, minimizing market penalties, and minimizing aging costs.


Introduction
Battery storage is emerging as a key component of intelligent green electricity systems [1]. The investment into the battery needs to be justified by using the battery to provide services of financial value [2,3]. Four categories of such services can be identified. The first category is energy arbitrage, in which participants store energy in the battery during low prices and sell during high prices [4]. The key electricity markets for energy arbitrage are the day-ahead, intraday, balancing, and real-time markets. Energy arbitrage can involve a standalone battery or a battery paired with other energy resources such as photovoltaic (PV) generation [5]. The second category of services provided by batteries is reserves. Reserve market participants are compensated for adjusting power consumption or generation in response to power grid imbalances, such as grid frequency deviations from a nominal value. Reserve markets have different names and specifications in different parts of the world. In the literature, some commonly used names for these markets are frequency reserves [6], frequency regulation [7], ancillary services [8] and spinning reserves [9]. The third category is local markets, such as peer-to-peer trading [10] and the recently emerging nodal markets, in which prices vary according to the geographic location of the energy-producing or -consuming units, which gives the market tools to avoid congestion in the power grid [11,12]. The fourth category of services involves coordinating a battery and controllable loads [13] for the provision of demand response.

The above-mentioned services are monetized through market participation, which usually involves bidding. Bidding is a multi-objective optimization problem, involving targets such as maximizing market compensation and minimizing penalties for failing to provide the service and costs for battery aging. A subset of research in this field includes the minimization of battery aging costs by embedding those into the multi-objective optimization problem (e.g., [14,15]). However, the optimization problem is challenging to solve, because the aging phenomenon is non-linear [16] and dependent on the type of service provided, as some services, such as fast regulation services, may involve frequent charge-discharge cycling [17].
Recently, numerous researchers have proposed reinforcement learning (RL) as a multi-objective optimization technique for monetizing battery storage, and several of them consider aging in the reward formulation [18,19]. The RL problem formulation involves a reward, which gives the RL agent feedback about how advantageous or disadvantageous its actions have been. The reward does not need to be derived from physics, so the reward formula may include a term that is a simplification of the aging phenomena that ignores the non-linear dynamics of the battery. Such simplifications are commonly used by RL practitioners, and various formulations have been proposed by different authors, such as [20]. The shortcoming of such an approach is that any benefits for reducing battery aging are not demonstrated either with a physical battery or with an accurate battery model. This prevents researchers from validating effective formulations of battery aging in the RL reward function. Simplified aging cost models also prevent making direct comparisons between results reported by different research groups, preventing the identification of seminal works with respect to battery aging management. The contribution of this article is a methodology involving a realistic battery simulation to assess the performance of the trained RL agent with respect to battery aging, in order to inform the selection of the weighting of the aging term in the RL reward formula. Our research is conducted in the context of bidding battery storage on frequency reserves. A few works have investigated using RL for bidding a battery on frequency reserves, but without considering battery aging effects [21,22] or through simple approaches such as penalizing the agent for exceeding minimum and maximum state of charge (SoC) limits for the battery [23].

Literature Review

2.1. Financial Exploitation of Battery Storage
In Section 1, energy arbitrage, reserves, local markets, and demand response were introduced as general categories of battery services. Examples of applications to actual electricity markets are as follows. Solutions for prosumer buildings with PV generation and battery storage include energy arbitrage in the day-ahead Croatian Power Exchange market [24] and the Electric Reliability Council of Texas (ERCOT) nodal market [25]. Pre-studies to support sizing decisions when purchasing the battery are performed to estimate the revenue of a large-scale battery storage participating in the Finnish Frequency Containment Reserves for Normal Operation market [26], the South China Region frequency regulation market [27], and the real-time market in Queensland, Australia [28]. In the case of a nodal market, the pre-study should optimize the siting and sizing of a battery, and such a solution is demonstrated for the New Zealand nodal wholesale market [29]. Energy arbitrage solutions for standalone batteries are proposed for the Spanish real-time electricity market [30], the Italian balancing market [31] and the California Independent System Operator (CAISO) day-ahead market [32].
Other authors report increased financial benefits from bidding on more than one market. Such market combinations include frequency regulation and energy arbitrage [33], frequency regulation and real-time markets [34], energy arbitrage and reserve markets [35], energy arbitrage and spinning reserves [36], energy arbitrage and demand response [37], and demand response and nodal markets [38]. It is notable that some authors specify their targeted market in general terms, such as energy arbitrage, while others identify a more specific market, such as real-time electricity markets.
The emergence of Vehicle-to-Grid (V2G) solutions enables the provision of similar services with electric vehicle batteries when the vehicle is plugged in, if permitted by the vehicle owner. V2G applications are complicated by the fact that vehicle users have requirements on the battery SoC at the time when they intend to use the vehicle. This will increase the demand for fast charging, in which case thermal safety needs to be considered by the RL agent [39]. One approach is to define a penalty component in the reward function in case the temperature exceeds a threshold [40]. Further, V2G needs to cope with the vehicle being unplugged unexpectedly in violation of a contract made with the vehicle owner [41]. Despite these challenges, V2G has been used to exploit electric vehicle batteries for ancillary services [42], demand response [43,44], local markets [45] and frequency regulation [46]. In the absence of V2G capabilities, an electric vehicle charger may perform frequency regulation by curtailing or rescheduling the charging [47,48].

2.2. Reinforcement Learning Applications for Batteries
De Silva et al. [49] propose a machine learning architecture for exploiting distributed energy resources in various energy markets. Forecasts of market prices and energy resource availability are fed into a bidding optimizer. Such machine learning time series forecasts are readily available for markets such as real-time markets [50], frequency reserve markets [51] and day-ahead spot markets [52], as well as for the availability of resources such as PV generation [53], wind generation [54], and building energy consumption [55]. RL could emerge as the ideal technology for the bidding optimizer exploiting such forecasts. Currently, RL has been demonstrated as a potential multi-objective optimization technique for performing services such as the ones discussed in Section 2.1. The majority of the research investigates energy arbitrage and local markets. Only a minority of these works consider standalone batteries [56], whereas the rest optimize the battery alongside other distributed energy resources. In the context of a building with local PV generation, stationary batteries [57,58], one-way electric vehicle charging [59,60] and V2G [56] batteries have been used for energy arbitrage. Another form of energy arbitrage is achieved through pairing batteries and wind power generation [61]. Similar solutions are applicable to microgrids with local markets [62,63]. Only a few works consider frequency reserves [21-23] or ancillary services [64].
Battery aging is considered in some of the works applying RL to batteries. One approach is to define maximum and minimum permissible values for the SoC and the charging and discharging currents as constraints, which are enforced outside the RL problem formulation. For example, if the RL agent just turns the batteries on or off, a separate algorithm can determine the charging and discharging currents [65]. Another approach is to define a logic that overrides the action of the RL agent if these constraints are violated [66]. However, it is also possible to incorporate battery aging penalties into the RL reward function, in which case the agent can learn to mitigate aging. If the penalty is defined in terms of SoC, the following approaches are encountered in the literature: exceeding minimum or maximum SoC limits [18], being outside of an ideal SoC range [67], a deviation from a reference SoC value [68], or a penalty in terms of depth of discharge [69]. Other authors define the aging penalty in terms of the magnitude of the charging and discharging power [70,71]. Some authors have added an overheating component to the aging penalty. This can be defined as a temperature deviation above a maximum temperature value [72] or as the square of such a deviation [73].
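The penalty formulations surveyed above reduce to simple functions of the battery state. The following sketch is illustrative only: the thresholds and coefficients are hypothetical placeholders, not values taken from the cited works.

```python
def soc_limit_penalty(soc, soc_min=0.2, soc_max=0.8):
    """Penalty for being outside an ideal SoC band (cf. [18,67])."""
    if soc < soc_min:
        return soc_min - soc
    if soc > soc_max:
        return soc - soc_max
    return 0.0

def soc_reference_penalty(soc, soc_ref=0.5):
    """Penalty as the deviation from a reference SoC (cf. [68])."""
    return abs(soc - soc_ref)

def power_magnitude_penalty(p_kw, k=0.01):
    """Penalty proportional to charge/discharge power magnitude (cf. [70,71])."""
    return k * abs(p_kw)

def overheating_penalty(temp_c, temp_max=45.0, squared=True):
    """Penalty for temperature above a maximum, optionally squared (cf. [72,73])."""
    excess = max(0.0, temp_c - temp_max)
    return excess ** 2 if squared else excess
```

In practice, a reward function would subtract a weighted sum of terms like these from the market revenue; the choice of weights is exactly the tuning problem this article addresses.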
Each of these approaches can be expected to direct the agent to operate the battery in a range that is advantageous for mitigating aging. However, the actual impact on battery aging is not validated on physical batteries or with accurate battery simulation models. Validation on physical batteries is impractical, as there is no easy way to measure aging and the experiments could be very time-consuming. Thus, in this article, an aging penalty will be defined for the reward function building on the above-mentioned approaches, and additionally, a validation of aging effects will be performed with a simulated battery.

Methodology
The methodology is elaborated in this section. As frequency reserve market rules vary between countries, the Finnish Frequency Containment Reserve for Normal Operation (FCR-N) is taken as the case market. The symbols used to define the methodology are presented in Table 1. We propose to model the reward for the FCR-N market using three components: market revenue, market penalty, and aging penalty. The reward can then be defined as:

r = rev − pen − wA,    (1)

where the first two terms are the FCR-N market revenue rev and market penalty pen, respectively [21]. The market penalty is due to the battery SoC exceeding minimum or maximum limits, in which case the battery is not available for reacting to grid frequency deviations. The third term consists of an aging penalty A and a weight w. The value of w determines the weight of the aging penalty relative to the net revenues in the reward function. It is notable that the market prices in one hour can be much higher than in another hour, so a high reward can be due to good decisions by the agent, due to high prices in that particular hour, or both.
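The reward in (1) can be restated as a minimal sketch, assuming the revenue, penalty, and aging terms have already been computed for the market interval; the numeric values used below are placeholders.

```python
def fcrn_reward(revenue_eur, penalty_eur, aging_penalty, w):
    """Reward (1): market revenue minus market penalty minus weighted aging."""
    return revenue_eur - penalty_eur - w * aging_penalty

# With w = 0 the agent sees only net revenue; a larger w shifts
# the optimum toward operating points that age the battery less.
reward_no_aging = fcrn_reward(100.0, 10.0, 5.0, w=0.0)    # net revenue only
reward_weighted = fcrn_reward(100.0, 10.0, 5.0, w=2.63)   # aging-aware
```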
For the aging penalty in the reward, we model aging as a linear approximation of the non-linear battery dynamics. The aging over one market interval can be modelled as:

A = Σ_{t=1}^{T} (c1 SoC_t + c2 |i_t|),    (2)

where c1 and c2 are coefficients, T is the number of seconds in the market interval, SoC_t is the state of charge, and i_t is the charging/discharging current at second t. The approximation in (2) builds on top of the previous research reviewed in Section 2.2, but introduces two key differences. Firstly, the works discussed in Section 2.2 identify either SoC or i as significant factors impacting aging. Our formulation recognizes that aging depends on the SoC level as well as the magnitude of the charging/discharging current. Secondly, the step of our RL agent is the market interval, which for FCR-N is 1 h at the time of writing. Power grid frequency can change numerous times during this interval, resulting in a corresponding change in the current i, which will impact SoC and DoD. Thus, a much more accurate approximation of aging can be obtained by performing the approximation once per second and taking the sum over the market interval, which is the duration of the RL step. The reason for performing the approximation once per second is that a control step of one second is sufficient for meeting the dynamic and stability requirements of the FCR-N market [74].
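The per-second aging approximation described above, summed over the market interval, can be sketched as follows; the linear-in-SoC-and-current form follows the description in the text, and the coefficient values are illustrative placeholders.

```python
def aging_penalty(soc_series, current_series, c1=1e-6, c2=1e-6):
    """Aging approximation: a per-second term that is linear in the SoC level
    and in the magnitude of the charging/discharging current, summed over
    the market interval (one sample per second).
    The coefficients c1 and c2 are illustrative placeholders."""
    return sum(c1 * soc + c2 * abs(i)
               for soc, i in zip(soc_series, current_series))
```

For a one-hour FCR-N interval, `soc_series` and `current_series` would each hold 3600 samples produced by the 1 s battery simulation.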
Figure 1 presents an architecture for training the RL agent. Since the step of the RL agent is 1 h, the state and reward are updated once per hour, and the RL agent determines the bid capacity C once per hour. However, the calculation of pen in (1) requires a once-per-minute resolution [21], and the calculation of A in (2) requires a once-per-second resolution. Thus, the environment requires simulation at a 1 s time step. As presented in Figure 1, this is driven by time series data of the power grid frequency f, which has been obtained from the transmission system operator (TSO) Fingrid, which also operates the FCR-N market. The frequency data have been preprocessed to obtain one data point per second. A 'Power calculator' module determines the required momentary charging/discharging power P based on f, C, and the stationary requirement of FCR-N [74,75]. The battery simulation determines the required current i according to P and the battery voltage u, which is not assumed to be constant, as it is affected by the SoC. This current is fed as an input to the battery simulation model, which outputs the SoC and i. This information is sufficient for the calculations according to (1) and (2). These calculations are done based on the actual market price FCR_act. However, this price is not known at the time of making the bid, so it cannot be used as state information for the RL agent. Thus, the state includes the forecasted price FCR_fcast, which is obtained using the machine learning time series forecasting method for FCR-N [76]. In addition to this forecast, the state information includes R, which is an integer specifying the number of hours since the battery last rested. Resting is defined as not participating in the market, which occurs when the bid capacity C is 0 MW. During the rest, the battery is charged or discharged so that the SoC will reach 50%, reducing the likelihood of SoC out-of-bounds events that result in market penalties.
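The 'Power calculator' module can be sketched as below, assuming the FCR-N stationary requirement amounts to a linear droop with full activation at a ±0.1 Hz deviation from 50.00 Hz; the sign convention (positive = charging when frequency is high) is an assumption of this sketch.

```python
def fcrn_power(f_hz, bid_mw, f_nom=50.0, full_activation_hz=0.1):
    """Momentary FCR-N power P from grid frequency f and bid capacity C.
    Linear droop, saturated at the full bid beyond +/-0.1 Hz deviation.
    Positive output = charging (absorbing power at high frequency)."""
    activation = (f_hz - f_nom) / full_activation_hz
    activation = max(-1.0, min(1.0, activation))  # saturate at full bid
    return activation * bid_mw
```

In the training architecture this function would be evaluated once per second on the preprocessed Fingrid frequency series, feeding P to the battery simulation.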
Several RL algorithms are available for optimizing the agent. Many of the RL applications for battery management define continuous action spaces, which motivates the selection of algorithms capable of handling continuous as well as discrete spaces. Advantage Actor-Critic (A2C) [77], Proximal Policy Optimization (PPO) [22], Deep Deterministic Policy Gradient (DDPG) [78,79] and Twin Delayed DDPG (TD3) [80] have been applied in the context of batteries. However, the task of the RL agent in this paper is to select the value for C. This selection must be made from a discrete set of possible values, due to the rules of the FCR-N market. The range of bids is between 0.1 MW and 5 MW with a resolution of 0.1 MW [21]. Since our state and action spaces are discrete and not large, computationally heavy methods such as DDPG and TD3 will not be investigated. In this article, the suitability of the REINFORCE [81], A2C [81], and PPO algorithms will be experimentally evaluated.
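The discrete action space implied by the market rules can be encoded as an index-to-bid mapping. Treating rest (C = 0 MW) as a separate action index alongside the 50 permitted bid levels is an assumption of this sketch, consistent with the rest action described earlier.

```python
N_ACTIONS = 51  # rest action plus 50 bid levels

def action_to_bid_mw(action_index):
    """Map a discrete action index to an FCR-N bid capacity C.
    Index 0 is the rest action (no bid); indices 1..50 map to
    0.1 MW .. 5.0 MW in 0.1 MW steps, per the market rules [21]."""
    if action_index == 0:
        return 0.0
    if not 1 <= action_index < N_ACTIONS:
        raise ValueError("action index out of range")
    return round(0.1 * action_index, 1)
```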
Equation (2) is proposed as a reasonable approximation of the battery aging cost for the purpose of training the RL agent. As it does not capture a battery's non-linear dynamics, it cannot be used for an accurate evaluation of how the trained RL agent mitigates battery aging. A modification of the architecture of Figure 1 will be used for performance evaluation, in which the battery age is obtained from a battery simulation. The modified architecture is presented in Figure 2. No reward calculation is conducted at this stage, since the training process of the agent has already been completed. The setup in Figure 2 will determine the net revenue in EUR and the battery aging in terms of equivalent full cycles when the trained agent is run against historical market and grid frequency data for a period of several days. In our RL formulation, an episode is one day. Net revenue is defined as market compensation minus market penalties, and this calculation is performed in the 'Revenue calculator' of Figure 2. The 'Battery simulation' in Figure 2 includes the Matlab Simulink battery model, which implements the aging behavior modeled in [82].
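The 'Revenue calculator' logic can be sketched as follows, under the assumption that the hourly market compensation is the bid capacity times the hourly capacity price; the argument names and units are illustrative.

```python
def net_revenue_eur(bids_mw, prices_eur_per_mw, penalties_eur):
    """Net revenue over the evaluation period: market compensation minus
    market penalties, summed over hourly market intervals.
    Assumes compensation = bid capacity * hourly capacity price."""
    compensation = sum(c * p for c, p in zip(bids_mw, prices_eur_per_mw))
    return compensation - sum(penalties_eur)
```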
In the training phase (Figure 1), Equation (2) is used instead of the aging output of the Matlab Simulink battery model (Figure 2). The reason for this is that the Simulink battery model is not intended for applications in which the age needs to be updated frequently, such as every time the RL environment is stepped forward. The age output of the Simulink battery is updated once every half cycle, and it cannot be assumed that these updates would occur at the same time as the RL environment is stepped forward.


Implementation
The implementations of the two architectures in Figures 1 and 2 are presented in Figures 3 and 4, respectively. In both implementations, the battery voltage is dependent on the extracted capacity [83], which affects the SoC directly. The voltage of the battery is used as an input variable for the CurrentCalculator function, which generates the control signal to the controllable current source. The CurrentCalculator function is also responsible for preventing the SoC from exceeding the 5% and 95% limits. The PowerCalculator function is responsible for controlling the power in the case of the rest action (i.e., no bid). If the rest action is taken, the battery will be charged or discharged to 50% SoC at constant power. The blocks between the penaltyIn and penaltyOut variables keep track of penalty minutes, which are used to calculate pen.

The setup in Figure 3 is used to compute the reward function in (1). The aging output of the Simulink battery is not used in this context, since it is updated only every half cycle, so in general the age output is not up to date at the end of each RL step. For this reason, the simplified approximation of aging behavior was used, as defined in (2). However, it is important to assess how well the RL agent trained with this reward will perform against the more realistic battery dynamics. The performance evaluation setup in Figure 4 is used for this purpose. The age output of the Simulink battery in Figure 4 is used to quantify the aging in equivalent full cycles in the performance evaluation phase. The aging dynamics of the Simulink battery model are based on [82]. Equations (3)-(5) are from the Mathworks 'Battery' documentation [84]. The aging output is calculated as:

Age = ε(n) CL_100,    (3)

where CL_100 is the number of cycles when the battery is fully charged and discharged at the nominal charge and discharge current. CL_100 is an input parameter that determines how many full cycles the battery lasts, and ε(n) is the battery aging factor. The aging factor is calculated as:

ε(n) = ε(n−1) + (0.5 / N(n−1)) [2 − (DoD(n−2) + DoD(n)) / DoD(n−1)].    (4)

The half-cycle update occurs when the battery starts to discharge after charging, or when the battery is full, i.e., SoC = 100%. The DoD values are from the previous three half cycles. N(n) is the maximum number of cycles, and it is calculated as:

N(n) = H (DoD(n)/100)^(−ξ) exp(−ψ (1/T_ref − 1/T_a(n))) (I_dis_ave(n))^(−γ1) (I_ch_ave(n))^(−γ2).    (5)

The maximum number of cycles is dependent on the average currents during the latest half-cycle duration, the previous DoD, and the ambient and reference temperatures. The symbols in (5) are presented in Table 2. The constant values in N(n) are set by the Matlab battery model and are not available from the documentation.

The parameters of the assessed algorithms for training and validation are presented in Table 3. The parameters of the battery are presented in Table 4. The additional parameters for performance evaluation are presented in Table 5. For training and validation, the predicted and actual prices of the Finnish FCR-N market and the Finnish power grid frequency data from 2020 were used. One episode is one day, and one step of the RL agent is one hour, since the market interval is one hour. The RL environment was reset at the beginning of each episode. The days were shuffled and then split into training and validation datasets with a ratio of 9:1. In the data preprocessing phase, any days with missing data were excluded, resulting in 315 training days and 35 validation days; 10 random seeds were used to train 10 agents for each RL algorithm.
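The half-cycle aging update of (4) and (5) can be sketched in code as follows. The constants H, ξ, ψ, γ1, and γ2 are placeholders, since, as noted above, the real values are internal to the Matlab battery model; the numeric values in the usage example are therefore not physically meaningful.

```python
import math

def max_cycles(dod, i_dis_ave, i_ch_ave, t_a, t_ref=298.15,
               H=1.0, xi=1.0, psi=1.0, g1=1.0, g2=1.0):
    """Maximum number of cycles N(n) as in (5): depends on the current DoD,
    the average discharge/charge currents over the latest half cycle, and
    the ambient vs. reference temperature (in kelvin).
    H, xi, psi, g1, g2 are placeholder constants."""
    return (H * (dod / 100.0) ** (-xi)
            * math.exp(-psi * (1.0 / t_ref - 1.0 / t_a))
            * i_dis_ave ** (-g1) * i_ch_ave ** (-g2))

def aging_factor_step(eps_prev, n_prev, dod_hist):
    """Half-cycle update of the aging factor as in (4); dod_hist holds the
    DoD values of the previous three half cycles (n-2, n-1, n)."""
    d2, d1, d0 = dod_hist
    return eps_prev + 0.5 / n_prev * (2.0 - (d2 + d0) / d1)
```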
The tunable weight term w in the reward restricts the learning of the agents. The training is meaningful only if the aging penalty is significant but does not dominate the net revenue. If the net revenue dominates the reward, the agent is expected to ignore aging penalties, and if the aging penalty dominates, then there is no business case, since the costs outweigh the revenues. The different components of the reward function were plotted for several values of w, and it was determined that a value of 2.63 was in the meaningful range described above. The methodology is presented in detail using this value of w. The performance evaluation is performed for several values of w in Section 5.2.
Since the state space has only two variables, the mapping from the state space to the actions learned by the RL agent can be visualized as a heatmap, in which the horizontal and vertical axes are the values of the state variables and the color is the value of the bidding action. Figure 5 shows this mapping for each of the three algorithms: REINFORCE (a), A2C (b), and PPO (c). Since 10 random seeds are used, the action values are the mean of the actions selected by the 10 agents. The learned policies of the three algorithms display a triangular pattern, which can be intuitively explained. The longer the time since the previous rest, the higher the likelihood of the SoC going out of bounds and incurring market penalties, and the higher the market price, the higher the revenue. The agent learns to capitalize on this phenomenon by using higher bids toward the top right corner of the heatmap.
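The ensemble-averaged policy heatmap can be produced by querying each trained agent over a grid of the two state variables; the grid axes and the `agent(state) -> bid` interface are assumptions of this sketch.

```python
import numpy as np

def mean_policy_heatmap(agents, price_bins, rest_hours):
    """Average bid (MW) selected by an ensemble of trained agents over a grid
    of the two state variables (forecasted price, hours since last rest).
    Each agent is a callable mapping a state tuple to a bid in MW."""
    grid = np.zeros((len(rest_hours), len(price_bins)))
    for r, rest in enumerate(rest_hours):
        for p, price in enumerate(price_bins):
            grid[r, p] = np.mean([agent((price, rest)) for agent in agents])
    return grid
```

The resulting array can be passed directly to a plotting routine such as `matplotlib.pyplot.imshow` to reproduce a figure in the style of Figure 5.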


Results
The proposed method was evaluated experimentally to address the following questions: How does the selection of w impact market net revenue and battery aging, and how does the selection of the RL algorithm impact learning?

Training and Validation
The training dataset, consisting of 315 days, was presented to the agents 5 times. Figure 6 shows the exponentially weighted mean, with adjustment, of the training reward. Such filtering was needed since the rewards depend on FCR_act, and the fluctuation caused by the market price is not indicative of the training performance. The weight term 1 − α was set to 0.995. Figure 6 has a seasonal component that repeats 5 times, which is due to the dataset being presented to the agent 5 times before terminating the training. After the agent has seen the dataset 3 times, only minor improvements are seen in the reward. Figure 7 shows the same information as Figure 6, with the shaded areas being the standard deviations resulting from our use of 10 seed values. Notably, the training performance of all three algorithms is very similar, and after examining the standard deviations, none of the algorithms is statistically superior.
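The smoothing described above corresponds to pandas' exponentially weighted mean with bias adjustment. A minimal sketch, assuming the per-step training rewards are available as a series; the synthetic reward series here is illustrative, not the paper's data:

```python
import numpy as np
import pandas as pd

# Synthetic noisy training rewards standing in for the market-price-driven
# fluctuation: an upward learning trend plus Gaussian noise.
rng = np.random.default_rng(0)
raw = pd.Series(np.linspace(-10, 50, 1575) + rng.normal(0, 20, 1575))

# Weight term 1 - alpha = 0.995, i.e. alpha = 0.005; adjust=True applies the
# bias correction to the early part of the series.
smoothed = raw.ewm(alpha=0.005, adjust=True).mean()
```

The filtered curve suppresses the price-driven fluctuation while preserving the learning trend, which is what Figure 6 plots.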
Our dataset consisted of 350 days, and 10% of the data were reserved for validation. Since the FCR-N market has a strong seasonality, the dataset was shuffled and 35 days were selected for the validation dataset. The validation is performed after every 15 episodes of training. The validation reward is the cumulative reward over these 35 days. The purpose of the validation is to confirm that learning occurs and to identify the episode at which the learning plateaus. The validation is repeated for 10 random seeds for each algorithm. The standard deviation of the validation reward of each algorithm is plotted to determine whether the performance of any of the algorithms is statistically superior to the others.
Towards the last episode, only very minor improvements occur in the reward. Figure 8 shows this cumulative reward for the 35 days. Figure 9 shows the same as Figure 8, with standard deviations for the 10 seeds included. None of the algorithms is statistically better than the others.
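The split and validation cadence described above can be sketched as follows. This is a schematic, not the authors' implementation: `train_one_episode` and `evaluate` are hypothetical stand-ins for the actual training step and single-day roll-out.

```python
import random

def split_dataset(days, val_fraction=0.1, seed=0):
    """Shuffle the day indices to break FCR-N seasonality, then reserve
    a fraction of them for validation (35 of 350 days in the paper)."""
    idx = list(range(len(days)))
    random.Random(seed).shuffle(idx)
    n_val = round(val_fraction * len(days))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return [days[i] for i in train_idx], [days[i] for i in val_idx]

train_days, val_days = split_dataset(list(range(350)))

VALIDATE_EVERY = 15  # validate after every 15 training episodes

def run_training(n_episodes, train_one_episode, evaluate):
    """Alternate training with periodic validation; the validation reward is
    the cumulative reward over the whole validation set."""
    history = []
    for ep in range(1, n_episodes + 1):
        train_one_episode(train_days)
        if ep % VALIDATE_EVERY == 0:
            history.append(sum(evaluate(day) for day in val_days))
    return history
```

The `history` list is what Figures 8 and 9 plot: one cumulative 35-day reward per validation point, repeated over 10 seeds per algorithm.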

Performance Evaluation
The validation rewards indicate that learning occurs with respect to our reward function. As explained in Section 4, a simplified model of battery aging has been used in the reward function, since the battery simulation model is not intended to update the age output at each RL step. This approach is common in the literature. However, it is unclear how such RL agents would perform against a real battery. In this paper, our interest is to investigate how an RL agent trained and validated on such a simplified model of battery aging would perform if connected to a more detailed model that captures the aging dynamics. The setup in Figure 4 has been used for this performance evaluation. The agents trained as described in Section 5.1 are run for the full duration of the validation set (35 days). The initial value for the battery age is zero, and the age output is recorded at the end of the 35-day simulation. Since 10 seed values have been used for each of the three RL algorithms, the performance evaluation run is performed 30 times for each value of w.
Figure 10 is a scatter plot of net revenue (the sum of market compensation and market penalty) versus aging, obtained with the performance evaluation implementation in Figure 4. Each dot is the mean value over the 10 random seeds. The scatter plot illustrates the tradeoff involved in adjusting the aging penalty w: higher aging penalties result in lower net revenues and lower aging. To avoid clutter, the dots in Figure 10 are labeled with the value of w, but the standard deviations are not shown. Figure 11 repeats the same plot without these labels and with shaded rectangles showing the standard deviations.
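The evaluation loop behind these figures can be sketched as below. This is a schematic under stated assumptions: `run_agent` and `battery_factory` are hypothetical placeholders for rolling a trained agent through the 35-day validation set against the physics-based battery model of Figure 4.

```python
import statistics

def evaluate_agents(agents_by_algo, run_agent, battery_factory):
    """For each algorithm, run the seed-specific agents against the detailed
    aging model and collect (net revenue, equivalent full cycles) per run.
    Returns the per-algorithm mean and standard deviation of both quantities."""
    summary = {}
    for algo, agents in agents_by_algo.items():
        revenues, ages = [], []
        for agent in agents:              # 10 seeds per algorithm in the paper
            battery = battery_factory()   # battery age starts at zero
            revenue, age = run_agent(agent, battery, n_days=35)
            revenues.append(revenue)
            ages.append(age)
        summary[algo] = {
            "revenue_mean": statistics.mean(revenues),
            "revenue_std": statistics.stdev(revenues),
            "aging_mean": statistics.mean(ages),
            "aging_std": statistics.stdev(ages),
        }
    return summary
```

Plotting `aging_mean` against `revenue_mean` per value of w gives the dots of Figure 10; the standard deviations give the shaded rectangles of Figure 11.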


Discussion
The validation rewards in Figures 8 and 9 show that most of the learning occurs in the first 100 days. Only very minor improvements may be expected by continuing the training beyond the 1575 episodes used in this paper. The standard deviations in Figure 9 show that it is not possible to make any statistically significant statements about the superiority of any of the three RL algorithms.
Figures 10 and 11 illustrate the performance obtained by our agents with a realistic lithium-ion battery model. The dots lie on a diagonal from the lower left to the upper right corner, which illustrates the tradeoff involved in adjusting the aging penalty w: a lower penalty results in higher net revenues and faster aging. This is to be expected intuitively, since a lower value of w decreases the negative aging penalty term in the reward function without affecting the positive market compensation term. As w is lowered, the positive market compensation term dominates the reward function and encourages the agent to take actions that increase the compensation. In other words, the agent is encouraged to bid higher capacities. According to Figure 4, higher capacities result in higher charging and discharging currents, which cause faster aging. Figure 10 shows that a straight line could be fitted to the dots with w values in the range 1.1–3.3. For higher w values, a significant drop in net revenues is observed. Our RL agent is not intended to be used in situations in which the aging penalty is very large compared to the market revenues; in such a situation, the business case for participating in the frequency reserve market is questionable.
The standard deviations in Figure 11 show that it is not possible to make statistically meaningful statements about the superiority of any of the RL algorithms. However, the diagonal trend discussed in the previous paragraph is also evident when the shaded areas are considered. Further, it is noted that when w is larger than 3.3, i.e., outside the relevant range of 1.1–3.3 identified above, the shaded boxes are much larger. This may be due to the fact that with large w values, the aging cost dominates the reward.
The relevant value for w depends on the actual cost of an equivalent full cycle of a particular battery. This cost depends greatly on raw material costs, supply chain disruptions, and government subsidies, which in turn can change drastically in response to global events such as pandemics and military conflicts. Thus, in this article, the aging penalty w is treated as a parameter. If the aging cost of a specific battery is known in terms of EUR per equivalent full cycle, the horizontal axis of Figure 10 can be converted to EUR by multiplying by this aging cost. The value of w can then be adjusted so that the difference between net revenue and aging cost is maximized.
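The conversion and selection rule described above can be written down directly. A sketch under the assumption that the performance-evaluation results are available as (w, net revenue, equivalent full cycles) tuples; the sample points below are illustrative and are not read off Figure 10.

```python
def best_w(results, cost_per_efc):
    """Convert aging to EUR at cost_per_efc (EUR per equivalent full cycle)
    and return the w whose net revenue minus aging cost is largest."""
    def profit(row):
        w, net_revenue, efc = row
        return net_revenue - efc * cost_per_efc
    return max(results, key=profit)[0]

# Illustrative points on the revenue/aging tradeoff diagonal of Figure 10.
points = [(1.1, 900.0, 4.0), (2.63, 700.0, 2.0), (3.3, 600.0, 1.5), (5.0, 200.0, 1.0)]
```

A cheap battery (low EUR per equivalent full cycle) pushes the optimum toward low w and aggressive bidding; an expensive one pushes it toward high w and gentler cycling.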
The results are specific to the lithium-ion battery chemistry and the parameters of our case study battery presented in Tables 3 and 4. The methodology can be readily adapted to another lithium-ion battery by updating these parameters. It can also be easily adapted to other battery chemistries by replacing the battery simulation block in Figures 3 and 4.
The results are specific to the Finnish FCR-N market. It is straightforward to generalize the approach to other auction-based frequency reserve markets in Finland or in other countries; the following modifications are needed. In the simulation environment, the current calculation and the calculation of penalty minutes should follow the technical specification of the target market. The market price data and power grid frequency data used in our study were obtained from the open data portal of the Finnish transmission system operator Fingrid, so equivalent data need to be obtained from the relevant TSO in another country. Notably, the RL problem formulation does not need to be changed. Batteries can also be traded on other kinds of electricity markets; for example, a battery can perform energy arbitrage on day-ahead markets. However, it is not straightforward to generalize beyond frequency reserve markets to other auction-based electricity markets, since significant changes to the RL problem formulation would be needed.
It is notable that RL practitioners generally use unique reward formulations, so it is not possible to make performance comparisons between different works. In this article, a physics-based performance evaluation environment has been proposed, which enables direct comparisons even between works with different reward formulations.

Conclusions
For each of the algorithms, learning was observed in the form of a reward that increased and eventually plateaued (Figures 6 and 8). The main statistical findings are summarized in Table 6. For each of the algorithms, the reward is within the standard deviation of the other algorithms. It is concluded that each of the algorithms was successfully trained and that none of them was statistically superior to the others. As stated in Section 1, the contribution of this article is a methodology involving a realistic battery simulation to assess the performance of the trained RL agent with respect to battery aging, in order to inform the selection of the weighting of the aging term in the RL reward formula. The results presented in Section 5.1 demonstrate that learning occurs and that none of the investigated RL algorithms is statistically superior to the others. Since there is no statistically significant difference between the algorithms, we conclude that the optimization problem was successfully addressed by all of them. Such results are frequently presented in the RL literature.
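The "within the standard deviation" comparison summarized in Table 6 amounts to a pairwise overlap check. A minimal sketch with illustrative numbers (the rewards below are made up, not the values in Table 6):

```python
def overlapping(mean_a, std_a, mean_b, std_b):
    """True if each algorithm's mean reward lies within the other's
    one-standard-deviation band, i.e. no algorithm is clearly superior."""
    return abs(mean_a - mean_b) <= std_a and abs(mean_a - mean_b) <= std_b

# Illustrative final-episode rewards (mean, std) per algorithm.
finals = {"REINFORCE": (100.0, 12.0), "A2C": (104.0, 10.0), "PPO": (98.0, 11.0)}
```

When every pair overlaps in this sense, as in the paper's results, no algorithm can be declared statistically superior.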
Stopping the investigation at this point has two weaknesses. Firstly, it is not known whether the RL agent trained in the RL environment could generalize to realistic battery dynamics if it were tasked with managing a real battery. Secondly, since the results are quantified in terms of reward, they do not permit direct comparisons to other RL investigations of the same phenomenon, even with identical battery parameters, if the RL reward formulations differ. Further, it is not possible to make performance comparisons to non-RL methods, since the results are expressed in terms of the reward value.
To overcome these two weaknesses, a performance evaluation has been performed after confirming the learning on the validation dataset. The concept for the performance evaluation is presented in Figure 2, its implementation is presented in Figure 4, and the results are presented in Figures 10 and 11. The implementation in Figure 4 uses a realistic battery model and conforms to the technical specification of the Finnish FCR-N market. The results in Figures 10 and 11 are expressed in terms of net revenue and aging (equivalent full cycles). The battery parameters are presented in Tables 3 and 4, and the FCR-N market data and power grid frequency data are from the year 2020. Thus, any researcher is able to develop a battery FCR bidding optimizer, using either RL or non-RL methods, run it against this battery model and the openly available market and power grid data, and obtain results in terms of net revenue and aging that are directly comparable to those we have presented in Figures 10 and 11.

Figure 1 .
Figure 1. Architecture for training the RL agent.

Several RL algorithms are available for optimizing the agent. Many of the RL applications for battery management define continuous action spaces, which motivates the selection of algorithms capable of handling continuous as well as discrete spaces, such as Advantage Actor-Critic (A2C) [77], Proximal Policy Optimization (PPO) [22], Deep

Figure 2 .
Figure 2. Architecture for performance evaluation of the trained RL agent.

The penaltyIn and penaltyOut variables keep track of penalty minutes, which are used to calculate pen.

Figure 3 .
Figure 3. Simulation model for training and validation.

Figure 4 .
Figure 4. Simulation model for performance estimation.

days; 10 random seeds were used to train 10 agents for each RL algorithm.

Figure 7 .
Figure 7. Training reward with standard deviation included.



Figure 9 .
Figure 9. Validation reward with standard deviation.


Figure 10 .
Figure 10. Scatter plot of net revenue and aging.

Figure 11 .
Figure 11. Scatter plot of net revenue and aging with standard deviation.


Table 3 .
Parameters for training and validation.

Table 5 .
Parameters for battery aging.

Table 6 .
Rewards and standard deviation at the last episode.