Reinforcement-Learning-Based Virtual Energy Storage System Operation Strategy for Wind Power Forecast Uncertainty Management

Uncertainties related to wind power generation (WPG) restrict its usage. Energy storage systems (ESSs) are key elements employed in managing this uncertainty. This study proposes a reinforcement learning (RL)-based virtual ESS (VESS) operation strategy for WPG forecast uncertainty management. The VESS logically shares a physical ESS among multiple units, which lowers the cost barrier of the ESS. In this study, a VESS operation model is proposed that considers not only a unit's own operation but also the operation of other units, and the VESS operation problem is formulated as a decision-making problem. To solve this problem, a policy-learning strategy is proposed based on an expected state-action-reward-state-action (SARSA) approach that is robust to variations in uncertainty. Moreover, multi-dimensional clustering is performed on the WPG forecast data of multiple units to enhance performance. Simulation results using real datasets recorded by a U.S. National Renewable Energy Laboratory project demonstrate that the proposed strategy provides near-optimal performance, with a gap of less than 2 percentage points from the optimal solution. In addition, the performance of the VESS operation is enhanced by multi-user diversity gain in comparison with individual ESS operation.


Introduction
The use and development of renewables have grown continuously in the power sector owing to climate change, with renewable power generation becoming the second-largest source in the electricity mix in 2018 [1]. More than 200 gigawatts of renewable capacity were installed in 2019, which is the largest increase to date [2]. It is expected that renewable-based power capacity will grow by 50% between 2019 and 2024 [3]. In particular, forecasts predict that wind capacity installations will triple by 2024 [3].
Wind power generation (WPG) is subject to high fluctuations and intermittent properties. The characteristics of WPG make it difficult to ensure power system reliability [4,5]. Although various wind power forecasting methods such as the ensemble method [6], aggregated probabilistic method [7], and machine learning-based method [8] have been researched, uncertainty cannot be completely eliminated owing to the nature of wind-resource phenomena. An energy storage system (ESS) plays an essential role in managing the uncertainty of WPG [9]. ESSs for WPG are used in various applications such as frequency regulation [10], ramp rate mitigation [11], and demand response [12]. The basic role of an ESS is to charge the surplus energy and discharge the stored energy according to the operational objective. Therefore, the primary issue of the usage of ESS is the effective operation of the ESS. Gomes et al. proposed a stochastic mixed-integer linear programming approach to manage the mismatching of the renewable power generation uncertainty.
1. This study proposes a VESS operation strategy based on an expected state-action-reward-state-action (SARSA)-based RL approach that is a more robust solution for WPG forecast fluctuation. To the best of our knowledge, this is the first work to apply the RL approach for the VESS operation;
2. The proposed strategy is also combined with multi-dimensional data clustering for enhancing the policy learning performance of the RL approach;
3. The effect of VESS and clustering is carefully discussed, and the usage case of the proposed strategy is suggested.
The rest of this paper is organized as follows. In Section 2, the system description, including the forecasting uncertainty model and the VESS system, and the VESS operation problem formulation are described, and in Section 3, the design method of the proposed RL-based VESS operation strategy is discussed. In Section 4, measurement studies using real WPG profiles applied to the proposed strategy are demonstrated, and in Section 5, a conclusion of the study is presented.

Uncertainty Model
In this study, a group of WPGs $\mathcal{U} = \{1, \cdots, u, \cdots, U\}$, such as a wind farm, was considered to be connected to the grid, as shown in Figure 1. To connect to the grid, the WPG operator forecasts the power generation over $\mathcal{T} = \{1, \cdots, t, \cdots, T\}$, e.g., $T = 24$ h for day-ahead operation. Let $g^u_t$ and $\hat{g}^u_t$ be the actual WPG and its forecast for the $u$-th WPG at time $t$, respectively. The $u$-th WPG forecast uncertainty at time $t$ is defined by Equation (1). An ESS is operated to manage the uncertainty by charging or discharging energy. When the ESS is applied, the uncertainty is calculated by Equation (2), where $q^u_t$ is the ESS charging/discharging quantity for the $u$-th WPG at time $t$.
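The display forms of Equations (1) and (2) are not reproduced in this text. One reading that is consistent with the definitions above is sketched here. The symbols $\epsilon^u_t$ and $\tilde{\epsilon}^u_t$ and the sign convention (a positive $q^u_t$ meaning that surplus generation is charged into the ESS) are assumptions made for illustration only.

```latex
\epsilon^u_t = g^u_t - \hat{g}^u_t ,
\qquad
\tilde{\epsilon}^u_t = g^u_t - \hat{g}^u_t - q^u_t .
```

Under this reading, a unit that generates more than forecast reduces its residual uncertainty by charging, and a unit that generates less than forecast reduces it by discharging.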

VESS System
The ESS is constructed in two parts: a power subsystem (PS) and an energy subsystem (ES). The PS capacity limits instantaneous charging and discharging power. The ES stores the energy, and its capacity determines the ESS service time. The ESS operation is performed within these two constraint regions in the individual ESS [21]. However, the VESS is operated by logically dividing one physical ESS over several units. Therefore, the VESS operation region is limited not only by the ESS capacity of the PS and the ES, but also by the operation of each unit.
Let $a^u_t$ be the ESS charging or discharging action for the $u$-th WPG at decision time $t$. The action is first restricted by the PS capacity $C_{PS}$, as expressed in Equation (3). Moreover, in the VESS, the action is limited by the actions of other WPGs. Including this constraint, the action range in Equation (3) is modified as given in Equation (4). Considering the charging/discharging efficiency $\eta$, the actual operation quantity $q^u_t$ in Equation (2) is given by Equation (5).
Appl. Sci. 2020, 10, 6420
Second, the action is performed within the "energy stored" range, referred to as the state-of-charge (SoC). The SoC at the decision time $t$, $s_t$, is given by Equation (6), where $\Delta T$ is the operation time interval. The SoC is limited by the ES capacity $C_{ES}$, as expressed in Equation (7). Herein, $C^{\min}_{ES}$ and $C^{\max}_{ES}$ express the minimum and maximum operable ES capacity, respectively. These values are determined by considering ESS characteristics, such as the depth of discharge (DoD).
Note that power system requirements such as line flow and frequency should be considered for implementation in practical systems. However, this study focuses on the forecasting error management aspect of the WPG; hence, a simple model used in [23,24] is considered herein. The VESS operation considering power system requirements is an open problem.
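As a rough illustration of how the shared constraints interact, the following Python sketch checks whether a set of per-unit actions is feasible. It assumes symmetric PS limits (each $|a^u_t|$ and the magnitude of the sum bounded by $C_{PS}$), an SoC update of the form $s_{t+1} = s_t + \Delta T \sum_u a^u_t$, and it omits the efficiency $\eta$. These forms, the function name `is_feasible`, and the numbers are illustrative assumptions and are not the exact expressions of Equations (3)-(7).

```python
def is_feasible(actions, soc, c_ps, c_es_min, c_es_max, dt=1.0):
    """Check per-unit actions against assumed PS and ES constraints.

    actions: per-unit charging (+) or discharging (-) power requests.
    soc: current state-of-charge of the shared physical ESS.
    """
    # Assumed per-unit PS limit (in the spirit of Equation (3)).
    if any(abs(a) > c_ps for a in actions):
        return False
    # Assumed aggregate PS limit: the units share one power subsystem.
    total = sum(actions)
    if abs(total) > c_ps:
        return False
    # Assumed SoC update without efficiency losses, then the ES range check.
    next_soc = soc + total * dt
    return c_es_min <= next_soc <= c_es_max


# Opposite requests offset each other inside the shared ESS.
print(is_feasible([0.3, -0.3], soc=0.5, c_ps=0.5, c_es_min=0.1, c_es_max=0.9))
# Two simultaneous charging requests can exceed the shared power limit.
print(is_feasible([0.3, 0.3], soc=0.5, c_ps=0.5, c_es_min=0.1, c_es_max=0.9))
```

The first call shows why the feasible range of one unit depends on what the other units do: only the accumulated action reaches the physical ESS.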

Problem Formulation
The aim of this study is to determine the VESS operation action required to manage WPG forecast uncertainty. Particularly, this study only deals with the WPG forecast error minimization problem. Therefore, the mean absolute error (MAE) is considered to be the performance metric of uncertainty management.
The MAE is calculated as in Equation (8), where $\mathbf{a} = \left[a^1_1, \cdots, a^u_t, \cdots, a^U_T\right]$. Considering the VESS constraints, the VESS operation problem for managing WPG forecast uncertainty can be formulated as Equation (9), subject to (4) and (7).
If perfect information is known, such as the actual WPG properties about a future time, the problem in Equation (9) can be optimally solved using iteration-based search algorithms such as the gradient descent method and Newton method [32]. However, this non-causal assumption cannot be implemented in the real world. In this study, the solution of this problem using perfect information, including the future time, is used as the optimal solution that is compared to the performance of the proposed VESS operation strategy.
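The performance metric can be illustrated with a short Python sketch. It assumes that the residual error of unit $u$ at time $t$ is $g^u_t - \hat{g}^u_t - q^u_t$ and that the MAE averages the absolute residuals over all units and time steps; the function name `mae_after_ess` and the sample values are hypothetical.

```python
def mae_after_ess(actual, forecast, quantity):
    """Mean absolute residual forecast error over all units and time steps.

    Each argument is a list over units of lists over time steps.
    Assumed residual: actual - forecast - ESS quantity.
    """
    total, count = 0.0, 0
    for g_u, g_hat_u, q_u in zip(actual, forecast, quantity):
        for g, g_hat, q in zip(g_u, g_hat_u, q_u):
            total += abs(g - g_hat - q)
            count += 1
    return total / count


actual = [[10, 12], [8, 6]]      # two units, two time steps (made-up values)
forecast = [[9, 13], [8, 7]]
no_ess = [[0, 0], [0, 0]]
ideal_ess = [[1, -1], [0, -1]]   # charges surplus, discharges shortfall

print(mae_after_ess(actual, forecast, no_ess))     # 0.75
print(mae_after_ess(actual, forecast, ideal_ess))  # 0.0
```

The second call corresponds to the perfect-information case, in which the operator knows the actual generation and has enough capacity to cancel every error.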

Method
As shown in Equation (9), ESS operation is a sequential decision-making (SDM) problem. An SDM problem is mathematically formulated as a state-action space model, and the transition probability among states is required to solve the problem optimally. The RL approach, however, estimates the transition probability using a learning algorithm, so it requires only the state-action space model to solve the SDM problem in Equation (9) [25].

State-Action Space
The state-action space for the individually operated ESS is described using a one-dimensional model [31]. The VESS is operated as a physical ESS, so the state-action space for the VESS is also presented as a one-dimensional model. However, whereas the action of an individually operated ESS is limited only by the ESS capacity, the action range of the VESS is determined by the accumulated actions of all units.
When the VESS is operated over the decision sequence $\mathcal{T}$, the state-action space has $T + 1$ decision stages, including the initial stage, as shown in Figure 2. The RL approach is solved only over a discrete state-action space. Therefore, the ESS operation is quantized by the unit action step $\delta$, as shown in Figure 2. Accordingly, the state and action sets are expressed as in Equation (10), where $a_t$ is the accumulated action for all units, $a_t = \sum_{u \in \mathcal{U}} a^u_t$. Considering the PS and ES capacity constraints in Equations (4) and (7), $\kappa_s$ and $\kappa_a$ are calculated as in Equation (11), where $\lfloor \cdot \rfloor$ is the floor operation. The quantization of the ESS operation introduces a quantization error, but the error is bounded according to the step size [33].
At each stage $t$, the current state is defined as the current SoC, $s_t$. The next state, $s_{t+1}$, is determined by the current state and the selected action $a_{t+1}$ according to Equation (6), as expressed in Equation (12). Herein, the action $a_t$ should be selected from within the feasible action range according to the current state, $\mathcal{A}_t = \{a^{j_{\min}}, \cdots, a^{j}, \cdots, a^{j_{\max}}\}$ (Equation (13)), where $j_{\min} = \max(-\kappa_a, 1 - i)$ and $j_{\max} = \min(\kappa_a, \kappa_s - i)$. As an example, in Figure 2, the feasible action range for the first action $a_1$ and the second action $a_2$ become $\mathcal{A}_1 = \{a^{-2}, a^{-1}, a^{0}, a^{1}, a^{2}\}$ and
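The index bounds of the feasible action range can be sketched in a few lines of Python. The sketch assumes that $i$ is the index of the current discretized state, with states indexed from 1 to $\kappa_s$, so that an action index $j$ moves the state from $i$ to $i + j$. The function name and the values of $\kappa_s$ and $\kappa_a$ are hypothetical and are not taken from Figure 2.

```python
def feasible_action_indices(i, kappa_s, kappa_a):
    """Return the action indices j allowed at state index i.

    j_min and j_max follow the bounds stated in the text:
    j_min = max(-kappa_a, 1 - i), j_max = min(kappa_a, kappa_s - i).
    """
    j_min = max(-kappa_a, 1 - i)
    j_max = min(kappa_a, kappa_s - i)
    return list(range(j_min, j_max + 1))


# Hypothetical grid: 5 SoC levels, at most 2 action steps per stage.
print(feasible_action_indices(3, kappa_s=5, kappa_a=2))  # [-2, -1, 0, 1, 2]
print(feasible_action_indices(1, kappa_s=5, kappa_a=2))  # [0, 1, 2]
print(feasible_action_indices(5, kappa_s=5, kappa_a=2))  # [-2, -1, 0]
```

At the lowest state only charging or idling remains possible, and at the highest state only discharging or idling, which is how the ES capacity limits enter the action set.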

Decision Policy Design
An RL-based VESS operation strategy determines the decision rule of the current action among the feasible action ranges at each stage in the state-action space. The decision rule is designed to maximize the reward provided by the VESS operation.
The goal of the VESS operation is to minimize the forecast uncertainty presented in the objective function in (9). At stage $t$, the operator does not know the uncertainty at future times beyond $t$. Therefore, the uncertainty considered in the VESS operation at stage $t$ is expressed as in Equation (14), where the values with a hat represent the expected values. As shown in Equation (14), the uncertainty comprises the current uncertainty and the expected uncertainties at future times.
In the RL approach, the current uncertainty performance of the VESS operation is defined as the instantaneous reward obtained from the current decision action at each stage; the reward at stage $t$ is given by Equation (15). Moreover, the accumulated uncertainty is defined as the return, that is, the accumulated reward from time $t$ onward, and is calculated as in Equation (16), where $\gamma \in (0, 1]$ is the discount factor, which reduces the risk of the expected values from future decision times. The return in Equation (16) becomes the weighted uncertainty performance of the VESS operation in Equation (14).
The RL-based decision-making approach is used to determine the VESS operation action that minimizes the reward, which is the uncertainty performance. For this, the state-action value function is defined to represent the quality of an action $a_t$ at a given state $s_t$, as given in Equation (17). If the transition probability of an action at each state, $\pi = \Pr(a_t \mid s_t)$, $\forall t \in \mathcal{T}$, $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}_t$, is known, the optimal state-action value function $Q^*(s_t, a_t)$ is obtained using the Bellman optimality equation [34] in Equation (18), and the optimal action is determined as in Equation (19). However, it is impractical to attempt to determine the transition probability of an action at each state.
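The role of the discount factor can be illustrated with the standard discounted return, $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$. The Python sketch below assumes this textbook form, which may differ in detail from Equation (16); the function name and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma=0.95):
    """Accumulate rewards from the current stage onward, discounting later ones.

    rewards[0] is the current reward; rewards[k] is weighted by gamma**k.
    """
    g = 0.0
    # Work backward so each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three stages with the same forecast error of 1.0 (made-up values).
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # 3.0
```

A smaller $\gamma$ gives less weight to the expected errors of later stages, which are the less reliable part of the return when the forecast is uncertain.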
In the RL approach, the state-action value function is estimated by learning. In widely used Q-learning-based RL approaches [26-29], the state-action value function is estimated as in Equation (20), where $\alpha \in (0, 1]$ is the learning rate, and the action is determined as in Equation (21). However, the WPG has a high uncertainty variance [19]. This variance reduces the reliability of the expected value at future times, such as $Q^{QL}(s_{t+1}, a_{t+1})$. Therefore, the Q-learning-based RL approach cannot guarantee uncertainty management performance in WPG environments [31].
Instead of employing the minimum value as in the Q-learning-based approach, the expected SARSA-based RL approach uses the expected state-action value to decide the action. The expectation of the value reduces the risk of variance at future times [25]. In the expected SARSA-based RL approach, the action is determined as in Equation (22), and the state-action value function is updated as in Equation (23).
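The difference from Q-learning can be made concrete with a tabular sketch of the textbook expected SARSA update, written for a reward that is minimized. It assumes an epsilon-greedy policy over the feasible actions and a dictionary-of-dictionaries value table; the policy choice, the function names, and the numbers are assumptions for illustration and are not necessarily the forms of Equations (22) and (23).

```python
def epsilon_greedy_probs(q_row, actions, epsilon):
    """Action probabilities of an epsilon-greedy policy that prefers the
    smallest value, because the reward here is an error to be minimized."""
    best = min(actions, key=lambda a: q_row[a])
    n = len(actions)
    return {a: epsilon / n + (1.0 - epsilon if a == best else 0.0)
            for a in actions}


def expected_sarsa_update(q, s, a, r, s_next, next_actions,
                          alpha=0.1, gamma=0.95, epsilon=0.1):
    """One tabular expected SARSA update of q[s][a]; returns the new value."""
    probs = epsilon_greedy_probs(q[s_next], next_actions, epsilon)
    # Expectation over next actions instead of the single minimum value.
    expected = sum(p * q[s_next][b] for b, p in probs.items())
    q[s][a] += alpha * (r + gamma * expected - q[s][a])
    return q[s][a]


# Hypothetical two-state table: q[state][action].
q = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 3.0}}
new_value = expected_sarsa_update(q, s=0, a=1, r=2.0, s_next=1,
                                  next_actions=[0, 1],
                                  alpha=0.5, gamma=1.0, epsilon=0.5)
print(new_value)  # 1.75
```

In this example the Q-learning target would use only the minimum next value (1.0), whereas the expected target (0.75 x 1.0 + 0.25 x 3.0 = 1.5) also accounts for the less favorable action, which is the averaging effect the text refers to.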

Multi-Dimensional Clustering
The determined action for the VESS operation in (22) is an accumulated action set to manage the uncertainty of each WPG, such as $a_t = \{a^1_t, \cdots, a^U_t\}$. The expected SARSA approach reduces the expected risk at future times. However, the multi-dimensional action renders convergence difficult, and also reduces the uncertainty management performance of the VESS operation.
To mitigate this effect of multi-dimensional action, data classification is considered. Data classification is a technique that involves the categorization of data to enable organization for effective operation [35]. With RL approaches, data classification can enhance the learning performance of the state-action value function [36].

RL-Based VESS Operation Strategy
The proposed strategy comprises data clustering for enhancing the learning performance and policy learning to determine the VESS operation action. The proposed strategy is described as follows (Algorithm 1).

Algorithm 1. Proposed RL-based VESS operation algorithm
Data clustering
1: Initialization
2: Set the number of clusters to $K$.
3: Initialize centroids $c_k$ using historical WPG forecasting data.
4: Data clustering
5: Set cluster $k$ as $k = \arg\min_{k} \sum_{u \in \mathcal{U}} \lVert \hat{g}^u - c_k \rVert^2$.
6: Update the centroid $c_k$ of the selected cluster.
Policy learning
7: Initialization
8: Load the $k$-th state-action value function as $Q^{eSARSA}$.
9: Set $s_1$ as the current SoC and $\mathcal{A}_1$ using (13).
10: Policy learning
11: For $t = \{1, \cdots, T\}$,
12: Set $a^{eSARSA}_t$ in $\mathcal{A}_t$ using (22).
13: Update $s_{t+1}$, $\mathcal{A}_{t+1}$ and $Q^{eSARSA}$ using (12) and (23).
14: end for

First, to apply k-means clustering, the number of clusters is set to $K$, and the centroids are initialized using the historical WPG forecast data to solve (24) (steps 2 and 3). In step 5, the cluster number $k$ of a dataset is selected as the cluster whose centroid has the minimum Euclidean distance to the data. The cluster number is used to select the active state-action value function for the policy learning process. Moreover, the centroid of the selected cluster is updated with the dataset in step 6.
Because the data are grouped into $K$ clusters, $K$ state-action value functions are required. In the policy learning process, the $k$-th state-action value function is loaded as the active state-action value function, $Q^{eSARSA}$, according to the cluster number $k$ in step 8. The initial state $s_1$ is set as the current VESS condition, and the feasible action range $\mathcal{A}_1$ is determined by the current state $s_1$ in step 9. During the operation time horizon $\mathcal{T}$, the VESS operation action is selected using (22), and the state and the state-action value function are updated according to the selected action in steps 12 and 13.
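The clustering steps of Algorithm 1 (steps 5 and 6) can be sketched as follows. The sketch assumes that each centroid stores one forecast vector per unit, that the distance is the squared Euclidean distance summed over units, and that the centroid is updated as a running mean; the running-mean rule, the function names, and the sample data are assumptions for illustration.

```python
def nearest_cluster(forecasts, centroids):
    """Index of the centroid with the smallest summed squared distance.

    forecasts: list over units of forecast vectors.
    centroids: list over clusters, each with the same shape as forecasts.
    """
    def dist(centroid):
        return sum((x - c) ** 2
                   for f_u, c_u in zip(forecasts, centroid)
                   for x, c in zip(f_u, c_u))
    return min(range(len(centroids)), key=lambda k: dist(centroids[k]))


def update_centroid(centroid, forecasts, n_members):
    """Running-mean update after adding one sample to a cluster that
    already holds n_members samples (assumed update rule)."""
    return [[c + (x - c) / (n_members + 1) for x, c in zip(f_u, c_u)]
            for f_u, c_u in zip(forecasts, centroid)]


# Two units, two time steps, three hypothetical centroids.
forecasts = [[1.0, 1.0], [2.0, 2.0]]
centroids = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [2.0, 2.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
k = nearest_cluster(forecasts, centroids)
print(k)  # 1
print(update_centroid(centroids[0], forecasts, n_members=1))
# [[0.5, 0.5], [1.0, 1.0]]
```

The returned index `k` would then select which of the $K$ state-action value tables is loaded for policy learning in step 8.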

Results and Discussion
To verify the performance of the proposed strategy, the simulation results were evaluated. In the simulation, five WPG datasets that were recorded by the National Renewable Energy Laboratory to develop eastern wind resources in the United States of America from 2004 to 2006 were employed [38].
Each WPG had a capacity of 20 megawatts (MW). Day-ahead forecasting data were provided with a 1-h time resolution. Therefore, the operation time horizon was set to 24 h, $\mathcal{T} = \{1, \cdots, 24\}$.
The simulation results were measured using the data from the first 14 days of December 2006, and the other datasets were used for RL training. Moreover, for policy learning, the learning rate and discount factor were set to $\alpha = 0.1$ and $\gamma = 0.95$, respectively. The number of clusters was set to three; the effect of the number of clusters is also discussed in this section.
A lithium-ion-based ESS was considered, as it is widely used with renewable energy systems [39]. The charging/discharging efficiency $\eta$ was set to 0.95, which gives a round-trip efficiency of approximately 0.9, and the DoD margin that restricts the minimum and maximum operable ES capacity was 0.1. The ES capacity was expressed relative to the normalized WPG capacity, that is, in per unit (p.u.), and the service time was 2 h, which corresponds to a charging rate (C-rate) of 0.5.
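The derived quantities in this setup follow from simple arithmetic, reproduced in the short Python sketch below. It assumes that the round-trip efficiency is the product of the charging and discharging efficiencies, that the C-rate is the reciprocal of the service time in hours, and that the DoD margin is applied symmetrically at both ends of the ES capacity; the helper name `operable_range` is hypothetical.

```python
eta = 0.95                       # one-way charging/discharging efficiency
round_trip = eta * eta           # 0.9025, reported as about 0.9

service_time_h = 2.0             # hours of service at rated power
c_rate = 1.0 / service_time_h    # 0.5

dod_margin = 0.1                 # margin kept at both ends (assumed symmetric)


def operable_range(es_capacity):
    """Minimum and maximum operable stored energy for a given ES capacity."""
    return dod_margin * es_capacity, (1.0 - dod_margin) * es_capacity


print(round_trip)            # about 0.9025
print(c_rate)                # 0.5
print(operable_range(1.0))   # approximately (0.1, 0.9)
```

With these assumptions, an ES capacity of 1 p.u. leaves about 0.8 p.u. of usable energy between the lower and upper limits.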
The simulations in this study were implemented on a 64-bit PC with a 4 GHz Quad-Core Intel Core i7 CPU and 32 GB RAM, using MATLAB R2020a with an IBM CPLEX optimization studio.

Performance Results of Proposed Strategy
Figure 3 shows the uncertainty management performance as MAE, with varying VESS size. The black line with circles, the red line with squares, and the blue line with diamonds present the results obtained when applying the optimal solution, the proposed method, and the stochastic method, respectively. The optimal solution is the solution to problem (9) given perfect information, including future times, and the stochastic method is the VESS operation according to the probabilistic information of the WPG suggested in [40]. The VESS size is the available operation room to manage uncertainty. Therefore, the MAE decreases as the size increases, as shown in Figure 3a. In particular, in Figure 3a the optimal solution and the proposed method have a similar slope with increasing VESS size, whereas the result obtained with the stochastic method shows a less significant decrease. This implies that the optimal solution and the proposed method operate effectively according to the environment, while the stochastic method does not. The stochastic method is designed according to the Markov decision process, similar to the proposed method. However, the stochastic method, which applies the backward induction approach in [40], predetermines the reserved capacity for future decision stages according to the probabilistic information of the WPG, so its operational diversity is lower than those of the optimal solution and the proposed method.
Figure 3b shows the optimal gap, which represents the difference from the optimal solution. The optimal gap of the proposed method is less than 2% and decreases with increasing size. The proposed method can effectively consider environmental characteristics through learning and clustering, and therefore achieves a gain as the size increases. The stochastic method cannot reflect these characteristics, so its optimal gap increases with increasing size.
Figure 4 compares the MAE of the individual ESS operation and the proposed VESS operation. The results for individual ESS operations are the optimal solutions of problem (9) reformulated for each WPG. Each WPG presents different uncertainties, so the decrease in the slope with increasing size also differs, as shown in Figure 4a. However, the results of the proposed VESS operation demonstrate that it outperforms all individual operation results. The individual ESS operation works by using only its own information, whereas the proposed VESS operation uses information from multiple units. Therefore, the proposed VESS operation achieves multi-user diversity gain [41]. Figure 4b verifies the diversity gain. By increasing the size, the operation availability also increases. The proposed VESS operation effectively exploits this availability with multi-user diversity. Therefore, the VESS operation gain compared to the individual ESS operation is enhanced with increasing operation availability.

Effect of Clustering
Figure 5 shows the optimal gap of the proposed method for the 1-, 3-, and 5-cluster cases. As shown in Figure 5, the optimal gap is reduced as the number of clusters increases. In particular, with five clusters, an optimal gap enhancement of more than 1.5% can be obtained when the ESS size is 0.6 p.u. This indicates that the clustering method is an effective way to enhance the performance of the proposed method. However, the additional improvement obtained by moving from three to five clusters is smaller than that obtained by moving from one to three clusters. This is because the distance between the centroids is reduced as the number of clusters increases. Moreover, increasing the number of clusters also increases the number of state-action value functions required for policy learning, which increases the system complexity of the implementation. Therefore, it is important to set an appropriate number of clusters by considering both performance enhancement and system complexity. As an example, in this study operating with five WPG units, three clusters are efficient considering the performance enhancement, as shown in Figure 5.

Usage of the Proposed Strategy
The VESS operation applying the proposed strategy achieves a higher forecast error management performance than the individual ESS operation. For example, when the MAE target of each WPG is set to 1.5, individual ESS operation requires an ESS size larger than 1 p.u. for each WPG, as shown in Figure 4a. This is not economically viable. However, in the proposed VESS case, an ESS size of 0.2 p.u. is required for each WPG with the same target. This enables a business model, such as a VESS service, that obtains an economic benefit by reducing the ESS size. Moreover, by increasing the number of clusters, the ESS size can be reduced, as shown in Figure 5. The number of clusters determines the number of state-action value functions, which affects the memory size and the computational complexity. Therefore, the VESS service provider can select the ESS size and the number of clusters considering the ESS cost, the memory cost, and the computational complexity, as well as the WPG forecast error management target.

Conclusions
This study proposed an RL-based VESS operation strategy to manage WPG forecasting uncertainty. The VESS operation model is the first to consider not only its own uncertainty management requirement, but also the requirements of other units. Applying the VESS model, the expected SARSA-based learning policy is suggested to solve the sequential decision-making problem of the VESS operation. Moreover, the k-means data clustering method is employed to enhance the performance of the proposed strategy by reducing uncertainty variance. The simulation results demonstrate that the proposed strategy provides a near-optimal performance, with a less than 2%-point gap to the optimal solution that requires information including the future time. Moreover, the MAE improvement when applying the proposed method has a similar slope to that of the optimal method according to the storage size. This shows that the proposed method obtains a similar operational diversity to that of the optimal method and can achieve near-optimal performance generally. In addition, we evaluated the performance achieved by the VESS operation in terms of multi-user diversity and the effect of the clustering method according to cluster size.
Research on VESS operation is at an early stage. This study shows that VESS operation can outperform individual ESS operation. However, the performance enhancement obtained from the VESS operation differs for each unit. Therefore, VESS operation that considers the performance balance among units will be the subject of further research. Moreover, this study only considers a simple system model. By including power system requirements, the system model can be extended toward practical use. Finally, the forecast error management of WPG is highly related to revenue, and the VESS operation is more cost-efficient than the individual ESS operation. Therefore, this study can be extended to research on the economic aspect, such as a revenue maximization problem considering ESS costs.