# A Simulation Environment for Training a Reinforcement Learning Agent Trading a Battery Storage

^{1}

^{2}

^{3}

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Literature Review

#### 2.1. Batteries in Primary Frequency Reserves

#### 2.2. Reinforcement Learning Applications for Batteries

- •
- Real-time control.
- •
- Medium-term decision-making for optimizing some operational criteria such as electricity costs or photovoltaic self-consumption. In many cases, this involves decision-making once per electricity market interval, which in many cases is hourly.
- •
- Long-term studies to support investment decisions.

## 3. Battery Trading System

_{damage}is defined. This differs from the penalty in Equation (2) in two respects. Firstly, it is not dependent on the FCR-N price for the hour in question [50]. Secondly, since the market operator does not provide any quantitative criteria for banning the reserve provider [50], a reputation

_{factor}coefficient is defined, which can be adjusted by the reserve provider in order to make the tradeoff between increasing revenues versus bidding prudently to avoid penalties:

## 4. Implementation

#### 4.1. Enviroment

_{h}[s] in Equation (5) is the power grid frequency for the current second, which corresponds to sec in Figure 3. When the frequency is under 49.9 Hz, the full power capacity of the bid, negative of capacity(a), is the discharge power.

_{s}) function in Table 2.

#### 4.2. Bidding Agent

_{factor}. Out of these 61 days, 11 randomly selected days were used for validation and the rest were used for training. A random seed was defined to ensure the repeatability of the results. Figure 6 shows some insights into the training process after 2, 4, 6, 8 and 10 epochs.

_{fcast}on the horizontal axis). Analyzing Equations (2)–(4), it can be seen that the positive component of the reward is directly proportional to the price of the FCR-N market, which is approximated by FCR-N

_{fcast}. However, the negative component of the reward is only partially proportional to the price. Thus, at higher prices, the benefits should outweigh the penalties, so it is expected that the agent will learn to prefer resting during low-price hours. Accordingly, in Figure 6, it is observed that resting actions concentrate on the left of the price forecast axis. With respect to the vertical axis, it is expected that the likelihood of penalties increases when the battery has operated for several hours without resting, so it is expected that the agent will learn to prefer resting on the lower part of the vertical axis. The combined effect of these two learning outcomes is that the best states for resting are in the bottom-left corner and the best states for bidding are in the top-right corner. By observing the progression of the left-hand charts in Figure 6, it is evident that the agent has learned this behavior.

## 5. Result

_{factor}. Figure 12 shows how the results would change if the value of reputation

_{factor}is varied. The experiment described in this paper was repeated for the following values: 10, 30, 50, 70, 90, 110, 130, 150, 170 and 190. Each repetition of the experiment resulted in one dot in the figure, labeled with the value of reputation

_{factor}. According to Equations (3) and (4), a higher value of reputation

_{factor}will result in a large negative component in the reward whenever the battery is unavailable. The duration of this unavailability is penalty

_{min}on the x-axis. The compensation on the y-axis is according to Equation (1). As reputation

_{factor}is increased, it is expected that the RL agent learns to be more careful in avoiding penalties, either by resting more or by bidding a lower capacity, thus reducing the likelihood of the battery being unavailable. The result of this should be decreasing compensation and decreasing penalties as reputation

_{factor}increases. This trend is visible in Figure 12. The dots for 10, 30 and 50 are very close to each other and overlap in the figure. This is because the compensation outweighs the penalties, so the agent learns to ignore the penalties and only tries to maximize the compensation. At a value of 70, the penalties are drastically reduced, without a loss of compensation. In fact, the compensation is slightly higher, which can be understood from Equation (1): there is no compensation for the minutes during which the battery is unavailable. As reputation

_{factor}is increased to 90 and beyond, the trend that was mentioned above is observed: the agent bids more carefully, resulting in a slight decrease in compensation as well as in the penalties. Looking at the relative vertical positions of the dots, 130 is an outlier in this trend. Further, 110 and 150 do not fully fit into the trend. The validation set is 11 days, so a longer set would be expected to result in a clearer trend. From the results, it is concluded that it is advantageous to use a reputation

_{factor}of at least 70. The use of a higher value is a business decision, depending on whether a decrease in compensation is considered desirable in order to decrease the penalties. As has been explained in the context of Equation (3), the potential business impact of incurring excessive penalties is very severe, but the market operator does not publish any quantitative criteria for what it considers to be excessive penalties, so for that reason, the choice of value for reputation

_{factor}is left as a business decision.

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Peters, I.M.; Breyer, C.; Jaffer, S.A.; Kurtz, S.; Reindl, T.; Sinton, R.; Vetter, M. The role of batteries in meeting the PV terawatt challenge. Joule
**2021**, 5, 1353–1370. [Google Scholar] [CrossRef] - de Siqueira Silva, L.M.; Peng, W. Control strategy to smooth wind power output using battery energy storage system: A review. J. Energy Storage
**2021**, 35, 102252. [Google Scholar] [CrossRef] - Sepúlveda-Mora, S.B.; Hegedus, S. Making the case for time-of-use electric rates to boost the value of battery storage in commercial buildings with grid connected PV systems. Energy
**2021**, 218, 119447. [Google Scholar] [CrossRef] - Loukatou, A.; Johnson, P.; Howell, S.; Duck, P. Optimal valuation of wind energy projects co-located with battery storage. Appl. Energy
**2021**, 283, 116247. [Google Scholar] [CrossRef] - Akagi, S.; Yoshizawa, S.; Ito, M.; Fujimoto, Y.; Miyazaki, T.; Hayashi, Y.; Tawa, K.; Hisada, T.; Yano, T. Multipurpose control and planning method for battery energy storage systems in distribution network with photovoltaic plant. Int. J. Electr. Power Energy Syst.
**2020**, 116, 105485. [Google Scholar] [CrossRef] - Nefedov, E.; Sierla, S.; Vyatkin, V. Internet of energy approach for sustainable use of electric vehicles as energy storage of prosumer buildings. Energies
**2018**, 11, 2165. [Google Scholar] [CrossRef] [Green Version] - Ge, X.; Ahmed, F.W.; Rezvani, A.; Aljojo, N.; Samad, S.; Foong, L.K. Implementation of a novel hybrid BAT-Fuzzy controller based MPPT for grid-connected PV-battery system. Control. Eng. Pract.
**2020**, 98, 104380. [Google Scholar] [CrossRef] - Aldosary, A.; Ali, Z.M.; Alhaider, M.M.; Ghahremani, M.; Dadfar, S.; Suzuki, K. A modified shuffled frog algorithm to improve MPPT controller in PV System with storage batteries under variable atmospheric conditions. Control. Eng. Pract.
**2021**, 112, 104831. [Google Scholar] [CrossRef] - Ciupageanu, D.; Barelli, L.; Lazaroiu, G. Real-time stochastic power management strategies in hybrid renewable energy systems: A review of key applications and perspectives. Electr. Power Syst. Res.
**2020**, 187, 106497. [Google Scholar] [CrossRef] - Lin, L.; Jia, Y.; Ma, M.; Jin, X.; Zhu, L.; Luo, H. Long-term stable operation control method of dual-battery energy storage system for smoothing wind power fluctuations. Int. J. Electr. Power Energy Syst.
**2021**, 129, 106878. [Google Scholar] [CrossRef] - Ryu, A.; Ishii, H.; Hayashi, Y. Battery smoothing control for photovoltaic system using short-term forecast with total sky images. Electr. Power Syst. Res.
**2021**, 190, 106645. [Google Scholar] [CrossRef] - Subramanya, R.; Yli-Ojanperä, M.; Sierla, S.; Hölttä, T.; Valtakari, J.; Vyatkin, V. A virtual power plant solution for aggregating photovoltaic systems and other distributed energy resources for northern european primary frequency reserves. Energies
**2021**, 14, 1242. [Google Scholar] [CrossRef] - Koller, M.; Borsche, T.; Ulbig, A.; Andersson, G. Review of grid applications with the Zurich 1MW battery energy storage system. Electr. Power Syst. Res.
**2015**, 120, 128–135. [Google Scholar] [CrossRef] - Giovanelli, C.; Sierla, S.; Ichise, R.; Vyatkin, V. Exploiting artificial neural networks for the prediction of ancillary energy market prices. Energies
**2018**, 11, 1906. [Google Scholar] [CrossRef] [Green Version] - Lund, H.; Hvelplund, F.; Østergaard, P.A.; Möller, B.; Mathiesen, B.V.; Karnøe, P.; Andersen, A.N.; Morthorst, P.E.; Karlsson, K.; Münster, M.; et al. System and market integration of wind power in Denmark. Energy Strategy Rev.
**2013**, 1, 143–156. [Google Scholar] [CrossRef] [Green Version] - Bialek, J. What does the GB power outage on 9 August 2019 tell us about the current state of decarbonised power systems? Energy Policy
**2020**, 146, 111821. [Google Scholar] [CrossRef] - Papadogiannis, K.A.; Hatziargyriou, N.D. Optimal allocation of primary reserve services in energy markets. IEEE Trans. Power Syst.
**2004**, 19, 652–659. [Google Scholar] [CrossRef] - Pavić, I.; Capuder, T.; Kuzle, I. Low carbon technologies as providers of operational flexibility in future power systems. Appl. Energy
**2016**, 168, 724–738. [Google Scholar] [CrossRef] - Zecchino, A.; Prostejovsky, A.M.; Ziras, C.; Marinelli, M. Large-scale provision of frequency control via V2G: The Bornholm power system case. Electr. Power Syst. Res.
**2019**, 170, 25–34. [Google Scholar] [CrossRef] - Malik, A.; Ravishankar, J. A hybrid control approach for regulating frequency through demand response. Appl. Energy
**2018**, 210, 1347–1362. [Google Scholar] [CrossRef] - Borsche, T.S.; de Santiago, J.; Andersson, G. Stochastic control of cooling appliances under disturbances for primary frequency reserves. Sustain. Energy Grids Netw.
**2016**, 7, 70–79. [Google Scholar] [CrossRef] - Herre, L.; Tomasini, F.; Paridari, K.; Söder, L.; Nordström, L. Simplified model of integrated paper mill for optimal bidding in energy and reserve markets. Appl. Energy
**2020**, 279, 115857. [Google Scholar] [CrossRef] - Ramírez, M.; Castellanos, R.; Calderón, G.; Malik, O. Placement and sizing of battery energy storage for primary frequency control in an isolated section of the Mexican power system. Electr. Power Syst. Res.
**2018**, 160, 142–150. [Google Scholar] [CrossRef] - Killer, M.; Farrokhseresht, M.; Paterakis, N.G. Implementation of large-scale li-ion battery energy storage systems within the EMEA region. Appl. Energy
**2020**, 260, 114166. [Google Scholar] [CrossRef] - Oudalov, A.; Chartouni, D.; Ohler, C. Optimizing a battery energy storage system for primary frequency control. IEEE Trans. Power Syst.
**2007**, 22, 1259–1266. [Google Scholar] [CrossRef] - Andrenacci, N.; Pede, G.; Chiodo, E.; Lauria, D.; Mottola, F. Tools for life cycle estimation of energy storage system for primary frequency reserve. In Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), Amalfi, Italy, 20–22 June 2018; pp. 1008–1013. [Google Scholar] [CrossRef]
- Karbouj, H.; Rather, Z.H.; Flynn, D.; Qazi, H.W. Non-synchronous fast frequency reserves in renewable energy integrated power systems: A critical review. Int. J. Electr. Power Energy Syst.
**2019**, 106, 488–501. [Google Scholar] [CrossRef] - Srinivasan, L.; Markovic, U.; Vayá, M.G.; Hug, G. Provision of frequency control by a BESS in combination with flexible units. In Proceedings of the 5th IEEE International Energy Conference (ENERGYCON), Limassol, Cyprus, 3–7 June 2018; pp. 1–6. [Google Scholar] [CrossRef]
- Phan, B.C.; Lai, Y. Control strategy of a hybrid renewable energy system based on reinforcement learning approach for an isolated microgrid. Appl. Sci.
**2019**, 9, 4001. [Google Scholar] [CrossRef] [Green Version] - Li, W.; Cui, H.; Nemeth, T.; Jansen, J.; Ünlübayir, C.; Wei, Z.; Zhang, L.; Wang, Z.; Ruan, J.; Dai, H.; et al. Deep reinforcement learning-based energy management of hybrid battery systems in electric vehicles. J. Energy Storage
**2021**, 36, 102355. [Google Scholar] [CrossRef] - Chen, Z.; Hu, H.; Wu, Y.; Xiao, R.; Shen, J.; Liu, Y. Energy management for a power-split plug-in hybrid electric vehicle based on reinforcement learning. Appl. Sci.
**2018**, 8, 2494. [Google Scholar] [CrossRef] [Green Version] - Sui, Y.; Song, S. A multi-agent reinforcement learning framework for lithium-ion battery scheduling problems. Energies
**2020**, 13, 1982. [Google Scholar] [CrossRef] [Green Version] - Muriithi, G.; Chowdhury, S. Optimal energy management of a grid-tied solar pv-battery microgrid: A reinforcement learning approach. Energies
**2021**, 14, 2700. [Google Scholar] [CrossRef] - Kim, S.; Lim, H. Reinforcement learning based energy management algorithm for smart energy buildings. Energies
**2018**, 11, 2010. [Google Scholar] [CrossRef] [Green Version] - Lee, S.; Choi, D. Reinforcement learning-based energy management of smart home with rooftop solar photovoltaic system, energy storage system, and home appliances. Sensors
**2019**, 19, 3937. [Google Scholar] [CrossRef] [Green Version] - Lee, S.; Choi, D. Energy management of smart home with home appliances, energy storage system and electric vehicle: A hierarchical deep reinforcement learning approach. Sensors
**2020**, 20, 2157. [Google Scholar] [CrossRef] [Green Version] - Roesch, M.; Linder, C.; Zimmermann, R.; Rudolf, A.; Hohmann, A.; Reinhart, G. Smart grid for industry using multi-agent reinforcement learning. Appl. Sci.
**2020**, 10, 6900. [Google Scholar] [CrossRef] - Kim, J.; Lee, B. Automatic P2P Energy trading model based on reinforcement learning using long short-term delayed reward. Energies
**2020**, 13, 5359. [Google Scholar] [CrossRef] - Wang, N.; Xu, W.; Shao, W.; Xu, Z. A q-cube framework of reinforcement learning algorithm for continuous double auction among microgrids. Energies
**2019**, 12, 2891. [Google Scholar] [CrossRef] [Green Version] - Mbuwir, B.V.; Ruelens, F.; Spiessens, F.; Deconinck, G. Battery energy management in a microgrid using batch reinforcement learning. Energies
**2017**, 10, 1846. [Google Scholar] [CrossRef] [Green Version] - Zsembinszki, G.; Fernández, C.; Vérez, D.; Cabeza, L.F. Deep Learning optimal control for a complex hybrid energy storage system. Buildings
**2021**, 11, 194. [Google Scholar] [CrossRef] - Lee, H.; Ji, D.; Cho, D. Optimal design of wireless charging electric bus system based on reinforcement learning. Energies
**2019**, 12, 1229. [Google Scholar] [CrossRef] [Green Version] - Oh, E. Reinforcement-learning-based virtual energy storage system operation strategy for wind power forecast uncertainty management. Appl. Sci.
**2020**, 10, 6420. [Google Scholar] [CrossRef] - Tsianikas, S.; Yousefi, N.; Zhou, J.; Rodgers, M.D.; Coit, D. A storage expansion planning framework using reinforcement learning and simulation-based optimization. Appl. Energy
**2021**, 290, 116778. [Google Scholar] [CrossRef] - Sidorov, D.; Panasetsky, D.; Tomin, N.; Karamov, D.; Zhukov, A.; Muftahov, I.; Dreglea, A.; Liu, F.; Li, Y. Toward zero-emission hybrid AC/DC power systems with renewable energy sources and storages: A case study from Lake Baikal region. Energies
**2020**, 13, 1226. [Google Scholar] [CrossRef] [Green Version] - Xu, B.; Shi, J.; Li, S.; Li, H.; Wang, Z. Energy consumption and battery aging minimization using a q-learning strategy for a battery/ultracapacitor electric vehicle. Energy
**2021**, 229, 120705. [Google Scholar] [CrossRef] - Zhang, G.; Hu, W.; Cao, D.; Liu, W.; Huang, R.; Huang, Q.; Chen, Z.; Blaabjerg, F. Data-driven optimal energy management for a wind-solar-diesel-battery-reverse osmosis hybrid energy system using a deep reinforcement learning approach. Energy Convers. Manag.
**2021**, 227, 113608. [Google Scholar] [CrossRef] - Fingrid. The Technical Requirements and the Prequalification Process of Frequency Containment Reserves (FCR). Available online: https://www.fingrid.fi/globalassets/dokumentit/en/electricity-market/reserves/appendix3---technical-requirements-and-prequalification-process-of-fcr.pdf (accessed on 6 July 2021).
- Fingrid. Fingridin reservikaupankäynti ja tiedonvaihto -ohje. Available online: https://www.fingrid.fi/globalassets/dokumentit/fi/sahkomarkkinat/reservit/fingridin-reservikaupankaynti-ja-tiedonvaihto--ohje.pdf (accessed on 6 July 2021).
- Fingrid. Ehdot ja edellytykset taajuudenvakautusreservin (FCR) toimittajalle. Available online: https://www.fingrid.fi/globalassets/dokumentit/fi/sahkomarkkinat/reservit/fcr-liite1---ehdot-ja-edellytykset.pdf (accessed on 6 July 2021).
- MathWorks. Battery—Generic Battery Model. Available online: https://se.mathworks.com/help/physmod/sps/powersys/ref/battery.html (accessed on 6 July 2021).
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv
**2016**, arXiv:1606.01540. [Google Scholar] - Avila, L.; De Paula, M.; Trimboli, M.; Carlucho, I. Deep reinforcement learning approach for MPPT control of partially shaded PV systems in Smart Grids. Appl. Soft Comput.
**2020**, 97, 106711. [Google Scholar] [CrossRef] - Zhang, Z.; Chong, A.; Pan, Y.; Zhang, C.; Lam, K.P. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build.
**2019**, 199, 472–490. [Google Scholar] [CrossRef] - Azuatalam, D.; Lee, W.; de Nijs, F.; Liebman, A. Reinforcement learning for whole-building HVAC control and demand response. Energy AI
**2020**, 2, 100020. [Google Scholar] [CrossRef] - Brandi, S.; Piscitelli, M.S.; Martellacci, M.; Capozzoli, A. Deep reinforcement learning to optimise indoor temperature control and heating energy consumption in buildings. Energy Build.
**2020**, 224, 110225. [Google Scholar] [CrossRef] - Nakabi, T.A.; Toivanen, P. Deep reinforcement learning for energy management in a microgrid with flexible demand. Sustain. Energy Grids Netw.
**2021**, 25, 100413. [Google Scholar] [CrossRef] - Schreiber, T.; Eschweiler, S.; Baranski, M.; Müller, D. Application of two promising reinforcement learning algorithms for load shifting in a cooling supply system. Energy Build.
**2020**, 229, 110490. [Google Scholar] [CrossRef] - He, X.; Zhao, K.; Chu, X. AutoML: A survey of the state-of-the-art. Knowl. Based Syst.
**2021**, 212, 106622. [Google Scholar] [CrossRef] - Franke, J.K.; Köhler, G.; Biedenkapp, A.; Hutter, F. Sample-efficient automated deep reinforcement learning. arXiv
**2020**, arXiv:2009.01555. [Google Scholar]

**Figure 3.**Procedure for training the bidding agent using the environment in Figure 2. The shaded area is the step (a) function discussed in Section 4.1.

**Figure 5.**Power as a function of frequency (Equations (5) and (6)) for the time period 4 September 2020 00:32:59–00:52:59. Red horizontal lines are the deadband limits. Black horizontal line is the nominal frequency and the battery idle state.

**Figure 6.**The actions taken by the trained reinforcement learning agent (left) and the resulting bidding behavior on 4 September 2020 (right) after epochs 2 (

**a**), 4 (

**b**), 6 (

**c**), 8 (

**d**) and 10 (

**e**).

**Figure 9.**The cumulative reward for all of the hours for one of the validation days (4 September 2020).

**Figure 10.**Bids and predicted price (

**a**), reward and actual market price (

**b**) and penalties (

**c**) for the validation day 4 September 2020 at 9 epochs.

**Figure 11.**Bids and predicted price (

**a**), reward and actual market price (

**b**) and penalties (

**c**) for the validation day 4 September 2020 at 35 epochs.

**Figure 12.**Compensation versus penalty

_{min}. The dots show the result of running the experiment with different values of reputation

_{factor}. Each dot is labeled with the corresponding value of reputation

_{factor}.

Symbol | Data Type | Description |
---|---|---|

SoC | Float | State of charge of the battery expressed as percentage of full charge |

OoB_{min} | Boolean | True if (SoC) was out of bounds (OoB) at any time during a specific minute |

OoB_{s} | Boolean | True if SoC was out of bounds for a one-second timestep of the battery simulation |

R | Integer | Hours since the battery last rested. Resting is defined as not participating in the frequency reserve market and charging/discharging to bring the SoC to 50%. E.g., if the battery rested most recently on the previous hour, R = 0 |

day | Date | The current day corresponding to the current state of the environment |

day_{start} | Date | The first day of the training set |

day_{end} | Date | The last day of the training set |

h | Integer in range 0–23 | The current hour (the current day is stored in the symbol day) |

FCR-N_{fcast}[h] | Float | The forecasted FCR-N market price in EUR per megawatt (EUR/MW) for hour h |

FCR-N[h] | Float | The actual FCR-N market price (EUR/MW) for hour h |

S[h] | [Integer, Float] | State of the environment at the hour h, i.e., [R, FCR-N_{fcast}[h]] |

f_{h} | Float [3600] | Power grid frequency time series for hour h. One data point per second |

a | Integer in range 0–3 | Action to be taken by the bidding agent. 3 = rest (no bid), 2 = bid with 600 kilowatt (kW) capacity, 1 = bid with 800 kW capacity, 0 = bid with 1 MW capacity |

r | float | Reward |

trace | Array with elements of type [S[h], a, r, S[h + 1]] | An experience trace consisting of all of the experiences collected during one epoch. A single experience consists of the following: [S[h], a, r, S[h + 1]] |

maxEp | Integer | The maximum number of epochs used to train the reinforcement learning agent |

penalty_{min} | Integer | The number of minutes during the current hour in which the battery was not available for providing frequency reserves and thus incurred a financial penalty from the frequency reserve market |

compensation[h] | float | The compensation in EUR for participating in FCR-N for the hour h |

penalty[h] | float | The penalty in EUR for the battery being unavailable while participating in FCR-N for the hour h |

reputation_{damage} | float | A quantification in EUR of the damage to the reputation of the FCR-N reserve provider (i.e., the battery operator), due to failures to provide the reserve |

reputation_{factor} | float | A coefficient in EUR that can be adjusted to train the bidding agent to avoid penalties |

**Table 2.**Functions used in the procedure for training the reinforcement learning agent (Figure 3).

Function | Description |
---|---|

reset(S[0]) | Resets the state variables at the beginning of the epoch |

pow = ctrl(a) | The parameter a is the frequency reserve capacity in kW that the bidding agent decided to bid on the frequency reserve market. The output pow is the discharge/charge power command to the battery from the battery controller. The output is determined according to the frequency data from f_{h} and the FCR-N market technical specification [48]. |

OoB_{s} = sim(pow) | This function runs the battery simulation for one second, according to the pow charge/discharge command from ctrl(a). SoC is an internal state variable of the battery simulator. If the SoC goes OoB during this second, the OoB_{s} output is true, otherwise false. |

OoB_{min} = bounds(OoB_{s}) | If OoB_{s} is true, OoB_{min} is set true. Otherwise, no action. |

capacity(a) | The capacity in MW of the bid corresponding to the action a taken by the agent. The capacity is 0 if a = 3, 0.6 MW if a = 2, 0.8 MW if a = 1 and 1.0 MW if a = 0. |

S[h] = state(h) | Construct the state data structure S[h] with the current value of R and h. |

Parameter | Value |

Type | Lithium-ion |

Nominal voltage | 1200 V |

Rated capacity | 1400 Ah |

Battery response time | 0.1 s |

Simulate temperature effects | No |

Simulate aging effects | No |

Discharge parameters: determine from the nominal parameters of the battery | Yes |

Hyperparameter | Value |

Number of hidden layers | 1 |

Number of nodes in input layer | 2 |

Number of nodes in hidden layer | 20 |

Number of nodes in output layer | 4 |

Epsilon decay | 0.998 |

Learning rate | 0.01 |

Discount factor | 0.5 |

Hidden layer activation function | Sigmoid |

Output layer activation function | Softmax |

Dropout | Not used |

Optimizer | Adam |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Aaltonen, H.; Sierla, S.; Subramanya, R.; Vyatkin, V.
A Simulation Environment for Training a Reinforcement Learning Agent Trading a Battery Storage. *Energies* **2021**, *14*, 5587.
https://doi.org/10.3390/en14175587

**AMA Style**

Aaltonen H, Sierla S, Subramanya R, Vyatkin V.
A Simulation Environment for Training a Reinforcement Learning Agent Trading a Battery Storage. *Energies*. 2021; 14(17):5587.
https://doi.org/10.3390/en14175587

**Chicago/Turabian Style**

Aaltonen, Harri, Seppo Sierla, Rakshith Subramanya, and Valeriy Vyatkin.
2021. "A Simulation Environment for Training a Reinforcement Learning Agent Trading a Battery Storage" *Energies* 14, no. 17: 5587.
https://doi.org/10.3390/en14175587