Exploring Reward Strategies for Wind Turbine Pitch Control by Reinforcement Learning

Abstract: In this work, a pitch controller of a wind turbine (WT) inspired by reinforcement learning (RL) is designed and implemented. The control system consists of a state estimator, a reward strategy, a policy table, and a policy update algorithm. Novel reward strategies related to the energy deviation from the rated power are defined. They are designed to improve the efficiency of the WT. Two new categories of reward strategies are proposed: “only positive” (O-P) and “positive-negative” (P-N) rewards. The relationship of these categories with the exploration-exploitation dilemma, the use of ε-greedy methods and the learning convergence are also introduced and linked to the WT control problem. In addition, an extensive analysis of the influence of the different rewards in the controller performance and in the learning speed is carried out. The controller is compared with a proportional-integral-derivative (PID) regulator for the same small wind turbine, obtaining better results. The simulations show how the P-N rewards improve the performance of the controller, stabilize the output power around the rated power, and reduce the error over time.


Introduction
Wind energy gains strength year after year. This renewable energy is becoming one of the most used clean energies worldwide due to its high efficiency, its competitive payback, and the growth in investment in sustainable policies in many countries [1]. Despite its recent great development, there are still many engineering challenges regarding wind turbine (WT) technology that must be addressed [2].
From the control perspective, one of the main goals is to stabilize the output power of the WT around its rated value. This should be achieved while the efficiency is maximized, and vibrations and fatigue are minimized. Moreover, safety must be guaranteed under all operation conditions. This may be even more critical for floating offshore wind turbines (FOWT), as it has been proved that the control system can affect the stability of the floating device [3,4]. This general and ambitious control objective is implemented in many different control actions, depending on the type of WT. The pitch angle control is normally used to maintain the output power close to its rated value once the wind speed exceeds a certain threshold. The control of the generator speed is intended to track the optimum rotor velocity when the wind speed is below the rated output speed. Finally, the yaw angle control is used to optimize the attitude of the nacelle to follow the wind stream direction.
The Mean Squared Error Reward Strategy (MSE-RS), the Mean error Reward Strategy (Mean-RS), and the corresponding increments, ∆MSE-RS and ∆Mean-RS, have been proposed, implemented, and combined.
In addition, many of the previous works based on RL execute ε-greedy methods in order to increase the exploration level and avoid leaving actions unexplored. The ε-greedy methods select an action either randomly or considering the previous experiences, the latter being called greedy selection. The greedy selection is carried out trying to maximize the future expected rewards, based on the previous rewards already received. The probability of selecting a random action is ε and the probability of performing a greedy selection is (1 − ε). This approach tends to improve the convergence of the learning, but its main drawback is that it introduces higher randomness in the process and makes the system less deterministic. To avoid the use of ε-greedy methods, the concept of Positive-Negative (P-N) rewards, and its relationship with the exploration-exploitation dilemma, is introduced here and linked to the WT control problem. An advantage of P-N rewards, observed in this work, is that the behavior is generally more deterministic than with ε-greedy methods, allowing more replicable results with fewer iterations.
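As a concrete illustration of the ε-greedy mechanism described above, the selection over one row of a policy table can be sketched as follows (a generic sketch, not the authors' code; the function and variable names are ours):

```python
import random

def epsilon_greedy(q_row, epsilon):
    """Select an action index from one row of the policy table.

    With probability epsilon a random action is drawn (forced exploration);
    otherwise the action with the highest estimated long-term reward is
    chosen (greedy exploitation).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_row))
    # Greedy selection: argmax over the estimated rewards of the row.
    return max(range(len(q_row)), key=lambda a: q_row[a])
```

With epsilon = 0 the choice is purely greedy; with epsilon = 1 it is purely random, which makes explicit the randomness the P-N rewards are meant to avoid.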
Moreover, a deep study of the influence of the type of reward in the performance of the system response and in the learning speed has also been carried out. As it will be shown in the simulation experiments, P-N rewards work better than Only-Positive (O-P) rewards for all the policy update algorithms. However, the combination of P-N with O-P rewards helps to soften the variance of the output power.
The rest of the paper is organized as follows. Section 2 describes the model of the small wind turbine used. Section 3 explains the RL-based controller architecture, the policy update algorithms and the reward strategies implemented. The results for different configurations are analyzed and discussed in Section 4. The paper ends with the conclusions and future works.

Wind Turbine Model Description
A model of a small 7 kW wind turbine is used. The equations of the model are summarized in Equations (1)-(6). The development of these equations can be found in Sierra-García et al. [18].
where L_a is the armature inductance (H), K_g is a dimensionless constant of the generator, K_φ is the magnetic flux coupling constant (V·s/rad), R_a is the armature resistance (Ω), R_L is the resistance of the load (Ω), considered in this study as purely resistive, w is the angular rotor speed (rad/s), I_a is the armature current (A), the values of the coefficients c_1 to c_9 depend on the characteristics of the wind turbine, J is the rotational inertia (kg·m²), R is the radius or blade length (m), ρ is the air density (kg/m³), v is the wind speed (m/s), K_f is the friction coefficient (N·m/(rad/s)), θ_ref is the reference for the pitch (rad), and θ is the pitch (rad). The state variables of the control system are the armature current and the angular speed of the rotor, [I_a, w]. On the other hand, the manipulated or control input variable is the pitch angle reference, θ_ref, and the controlled variable is the output power, P_out, unlike other works where the rotor speed is the controlled variable.
The RL controller proposed in this paper is applied to generate a pitch reference signal, θ_ref, in order to stabilize the output power, P_out, of the wind turbine around its rated value. Equations (1)-(6) are used to simulate the behavior of the wind turbine and thus allow us to evaluate the performance of the controller.
The values of the parameters used during the simulations (Table 1) are taken from Mikati et al. [37].
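Since Equations (1)-(6) are not reproduced here, the following sketch only illustrates the aerodynamic part of a turbine model of this kind. It uses a commonly used exponential power-coefficient parameterization with coefficients c_1 to c_9; both the functional form and the sample coefficients are assumptions for illustration, not the authors' exact model:

```python
import math

# Illustrative sketch only: the exponential Cp form and the coefficients used
# below are common textbook choices, NOT the paper's Equations (1)-(6).

def power_coefficient(lam, theta, c):
    """Cp(lambda, theta) for tip-speed ratio lam, pitch theta and
    coefficients c = (c1, ..., c9). Generic exponential form (assumed)."""
    c1, c2, c3, c4, c5, c6, c7, c8, c9 = c
    lam_i = 1.0 / (1.0 / (lam + c8 * theta) - c9 / (1.0 + theta ** 3))
    return c1 * (c2 / lam_i - c3 * theta - c4 * theta ** c5 - c6) * math.exp(-c7 / lam_i)

def aero_power(rho, radius, v, cp):
    """Mechanical power captured from the wind: 0.5 * rho * pi * R^2 * v^3 * Cp."""
    return 0.5 * rho * math.pi * radius ** 2 * v ** 3 * cp
```

For example, with the textbook coefficients c = (0.5176, 116, 0.4, 0, 2, 5, 21, 0.08, 0.035) and a tip-speed ratio around 8.1 at zero pitch, Cp comes out around 0.42, in the usual range for three-bladed rotors; increasing the pitch reduces Cp, which is the mechanism the pitch controller exploits.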

RL-Inspired Controller
The reinforcement learning approach consists of an environment, an agent and an interpreter. The agent, based on the state s_t perceived by the interpreter and the previous rewards r_1, ..., r_t provided by the interpreter, selects the best action to be carried out. This action, a_t, produces an effect on the environment. This is observed by the interpreter, which provides information to the agent about the new state, s_{t+1}, and the reward of the previous action, r_{t+1}, closing the loop [38,39]. Some authors consider that the interpreter is embedded in either the environment or the agent; in any case, the function of the interpreter is always present.
Discrete reinforcement learning can be expressed as follows [40]:
• S is a finite set of states perceived by the interpreter. This set is made with variables of the environment, which must be observable by the interpreter and may be different from the state variables of the environment.
• A is a finite set of actions to be conducted by the agent.
• s_t is the state at time t.
• a_t is the action performed by the agent when the interpreter perceives the state s_t.
• r_{t+1} is the reward received after action a_t is carried out.
• s_{t+1} is the state after action a_t is carried out.
• The environment or world is a Markov decision process: MDP = s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, ...
• π : S × A → [0, 1] is the policy; this function provides the probability of selecting an action a for every pair (s, a).
• p^a_{ss′} = Pr{s_{t+1} = s′ | s_t = s ∧ a_t = a} is the probability that the state changes from s to s′ under action a.
• p^π(s′, a′) is the probability of selecting action a′ at state s′ under policy π.
• r^a_s = E{r_{t+1} | s_t = s ∧ a_t = a} is the expected one-step reward.
• Q^π(s, a) = r^a_s + γ Σ_{s′} p^a_{ss′} Σ_{a′} p^π(s′, a′) Q^π(s′, a′) is the expected sum of discounted rewards.

Appl. Sci. 2020, 10, 7462

The scheme of the designed controller inspired by this RL approach is presented in Figure 1. It is composed of a state estimator, a reward calculator, a policy table, an actuator, and a method to update the policy. The state estimator receives the output power error (P_err), defined as the difference between the rated power P_ref and the current output power P_out, and its derivative, Ṗ_err. These signals are discretized and define the state s_t ∈ S, where t is the current time. The interpreter is implemented by the state estimator and the reward calculator. The agent includes the policy, the policy update algorithm and the actuator. Both interpreter and agent form the controller (Figure 1).
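The discretization performed by the state estimator can be sketched as follows. The bin edges and the encoding of the two bin indices into a single state id are illustrative assumptions of this sketch, since the paper does not specify the discretization grid:

```python
def discretize(value, edges):
    """Map a continuous value to a bin index given sorted bin edges.

    With n edges there are n + 1 bins: (-inf, e0), [e0, e1), ..., [e_{n-1}, inf).
    """
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def estimate_state(p_err, dp_err, err_edges, derr_edges):
    """Encode the discretized pair (P_err, dP_err/dt) as a single state id,
    row-major over the two bin indices."""
    i = discretize(p_err, err_edges)
    j = discretize(dp_err, derr_edges)
    return i * (len(derr_edges) + 1) + j
```

Each distinct pair of bins maps to a distinct row of the policy table, which is what allows the table T^π(s, a) to stay finite even though P_err and its derivative are continuous.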
The policy is defined as a function π : S → A which assigns an action a_t ∈ A to each state in S. This action a_t is selected in a way that maximizes the long-term expected reward. The actuator transforms the discrete action a_t into a control signal for the pitch θ_ref in the range [0, π/2]. Each time an action is executed, in the next iteration, the reward calculator observes the new P_err and Ṗ_err and calculates a reward/penalty r_t for action a_{t−1}. The policy update algorithm uses this reward to modify the policy for the state s_{t−1}.

The policy has been implemented as a table T^π(s, a) : S × A → R together with a function f_a : S → A. The table relates each pair (s, a) ∈ S × A with a real number that represents an estimation of the long-term expected reward, that is, the one that will be received when action a is executed in the state s, also known as Q. The estimate depends on the policy update algorithm. The table has as many rows as states and as many columns as actions. Given a state s, the function f_a searches for the action with the maximum value of Q in the table.


Policy Update Algorithm
The policy update algorithm calculates, each control cycle, the estimate of T^π(s, a) corresponding to the previous pair (s, a) ∈ S × A. At t_i, the entry for the last state and action, (s_{t−1}, a_{t−1}), is updated by the policy function f_π, using the previous estimation of the long-term expected reward T^π(s_{t−1}, a_{t−1}) and the current reward r_t, Equation (7).
Once the table T^π is updated, it is searched for the action that maximizes the reward, Equation (8). The different policy update algorithms f_π that have been implemented and compared in the experiments are the following:

i. One-reward (OR): only the last reward received is kept. As it takes into account just the last reward (smallest memory), it may be very useful when the system to be controlled changes frequently, Equation (9):

OR: T^π(s_{t−1}, a_{t−1})(t_i) := r_t (9)

ii. Summation of all previous rewards (SAR). It may cause an overflow in the long term, which could be solved by saturating the values within some limits, Equation (10).

iii. Mean of all previous rewards (MAR). This policy gives more opportunities to not-yet-selected actions than SAR, especially when there are many rewards with the same sign, Equation (11).

iv. Only learning with learning rate (OL-LR). It accumulates a percentage of each previous reward, Equation (12), given by the learning rate parameter α ∈ [0, 1].

v. Learning and forgetting with learning rate (LF-LR). The previous methods do not forget any previous reward; this may be effective for steady systems, but for changing models it might be advantageous to forget some previous rewards, Equation (13). The forgetting factor is modelled as the complementary learning rate (1 − α).
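The five rule families above can be sketched over the policy table as follows. Since Equations (10)-(13) are not reproduced in the text, the update forms below (beyond OR, whose Equation (9) is given) are inferred from the rule names and descriptions and should be read as assumptions:

```python
from collections import defaultdict

table = defaultdict(float)   # T[(s, a)] -> estimated long-term reward
counts = defaultdict(int)    # visit counts, needed only by MAR

def update(s, a, r, rule, alpha=0.1):
    """Update T[(s, a)] with reward r. Forms other than OR are inferred
    sketches, not the paper's exact Equations (10)-(13)."""
    q = table[(s, a)]
    if rule == "OR":          # keep only the last reward, Eq. (9)
        table[(s, a)] = r
    elif rule == "SAR":       # running sum of all previous rewards
        table[(s, a)] = q + r
    elif rule == "MAR":       # running mean of all previous rewards
        counts[(s, a)] += 1
        table[(s, a)] = q + (r - q) / counts[(s, a)]
    elif rule == "OL-LR":     # accumulate a fraction alpha of each reward
        table[(s, a)] = q + alpha * r
    elif rule == "LF-LR":     # learn with alpha, forget with (1 - alpha)
        table[(s, a)] = (1 - alpha) * q + alpha * r

def best_action(s, actions):
    """f_a: greedy lookup of the action with the maximum estimate."""
    return max(actions, key=lambda a: table[(s, a)])
```

Note how LF-LR is the only rule whose estimate is bounded by the reward range regardless of how many updates occur, which is consistent with its role for changing systems.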

Exploring Reward Strategies
Once the table T π and the function f a are implemented, and a policy update algorithm f π is selected, it is necessary to define the reward strategy of the reinforcement learning procedure. Although so far the definition of the policy update algorithm is general, the design of the rewards and punishments requires expert knowledge about the specific system.
In this work, the target is to stabilize the output power of the WT around its nominal value, thus reducing the error between the output power and the rated power. The error will therefore be the key to defining the reward.

Only-Positive (O-P) Reward Strategies
The most intuitive approach seems to be rewarding the relative position of the system output with respect to the rated (reference) value: the closer the output to the desired value, the bigger the reward. However, if the reward were simply the distance (absolute value of the error), the reward would grow with the error. To avoid this problem, a maximum error is defined, P_err^MAX, and the absolute error is subtracted from it. This is called the "Position Reward Strategy" (PRS), Equation (16).
PRS: r_{t_i} = P_err^MAX − |P_err(t_i)| (16)

This strategy only provides positive rewards and no punishments; thus, it belongs to the only-positive (O-P) category of reinforcement. As will be seen in the results section, this is the cause of its lack of convergence when applied individually. The main drawback of O-P rewards is that the same actions are selected repeatedly, and many others are never explored. This means that the optimal actions are rarely visited. To solve this, exploration can be externally forced by ε-greedy methods [40], or O-P rewards can be combined with positive-negative reinforcement (P-N rewards).
To illustrate the problem, let T^π(s, a) be initialized to 0 for all states and actions, T^π(s, a)(t_0) = 0, ∀(s, a). At t_0 the system is in the state s_0 = s. Since all actions have the same value in the table, an action a_0 = a_{s_0} is selected at random. At the next control time t_1 the state is s_1, which can be different from or equal to s_0. The reward received is r_1 > 0. The policy update algorithm modifies the value of the table associated with the previous (state, action) pair, T^π(s_0, a_0)(t_1) = f_π(r_1) > 0. Now a new action a_1 associated with state s_1 must be selected. If s_1 ≠ s_0, the action is randomly selected again because all the actions in that row have the value 0. However, if the state is the same as the previous one, the selected action is the same as at t_0, a_1 = a_0 = a_{s_0}, because f_π(r_1) > 0 is the maximum value of the row. In that case, at the next control time t_2 the table is updated, T^π(s_0, a_0)(t_2) = f_π(r_1, r_2). With O-P rewards this value always tends to be greater than 0, forcing the same actions to be selected; only some specific QL configurations may give negative values. This process is repeated every control period. If the state is different from all previous states, a new cell in the table is populated. Otherwise, the selected action will be the first action selected in that state.
If the initialization of the table T^π(s, a) is high enough (regarding the rewards), we can ensure that all actions will be visited at least once for the OR, MAR and QL update policies. This can be a solution if the system is stationary, because the best action for each state does not change, so that once all the actions have been tested, the optimum one has necessarily been found. However, if the system is changing, this method is not enough. In these cases, ε-greedy methods have shown successful results [40]. In each control period, the new action is randomly selected with a probability ε (forced exploration) or selected from the table T^π(s, a) with a probability (1 − ε) (exploitation).

Another possible measure that can be used with the O-P strategy to calculate the reward is the previous MSE. In this case, the reward is calculated by applying a time window to capture the errors prior to the current moment. We have called this strategy the "MSE reward strategy" (MSE-RS), and it is calculated by Equation (17), where T_w is the length of the time window, T_s(j) is the variable step size at t_j, and k is calculated so that the error samples included cover the window T_w. As may be observed, greater errors produce smaller rewards. In a similar way it is possible to use the mean value of the error, the "Mean reward strategy" (Mean-RS), defined in Equation (18). In the results section it will be shown how the Mean-RS strategy reduces the mean of the output error and cuts down the error when it is combined with a P-N reward.
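The three O-P strategies can be sketched as follows. PRS follows Equation (16); the MSE-RS and Mean-RS bodies only illustrate the described inverse relationship between error and reward over a sample window, since Equations (17) and (18) are not reproduced here, so their exact forms are assumptions:

```python
def prs(p_err, p_err_max):
    """Position Reward Strategy, Equation (16): non-negative whenever
    |p_err| <= p_err_max."""
    return p_err_max - abs(p_err)

def mse_rs(err_window):
    """MSE reward strategy (sketch): greater windowed MSE -> smaller reward.
    The reciprocal form is an assumption, not the paper's Equation (17)."""
    mse = sum(e * e for e in err_window) / len(err_window)
    return 1.0 / (1.0 + mse)

def mean_rs(err_window, p_err_max):
    """Mean reward strategy (sketch): reward shrinks as the magnitude of the
    windowed mean error grows. An assumption, not Equation (18)."""
    mean = sum(err_window) / len(err_window)
    return p_err_max - abs(mean)
```

All three return values of one sign only, which is precisely what makes them O-P strategies: the table entries of visited actions can only grow, so unvisited actions are never preferred.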

Positive-Negative (P-N) Reward Strategies
Unlike O-P reinforcement, P-N reward strategies encourage the natural exploration of actions that enables the convergence of learning. The positive and negative rewards compensate the values of the table T^π(s, a), which makes it easier to carry out different actions even if the states are repeated. An advantage of natural exploration over ε-greedy methods is that its behavior is more deterministic. This provides more repeatable results with fewer iterations. However, the disadvantage is that if the rewards are not well balanced, the exploration may be insufficient.
To ensure that the rewards are well balanced, it is helpful to calculate them with some measure of the error variation. PRS, MSE-RS and Mean-RS perform an error measurement over a specific period of time. They provide neither a measure of the variation nor of how quickly the error changes. A natural evolution of PRS is to use velocity rather than position to measure whether we are getting closer to or farther from the rated power, and how fast we are doing so. We call this the "velocity reward strategy" (VRS), Equations (19) and (20).
The VRS calculation is divided into two parts. First, r_v is calculated to indicate whether we are getting closer to or farther from the nominal power, Equation (19). If the error is positive and decreases, we are getting closer to the reference. The second part detects when the error changes sign while the new absolute error is smaller than the previous one. Such an action would be punished according to Equation (19), but when this case is detected, the punishment becomes a reward, Equation (20).
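A sketch of this two-part calculation, following only the verbal description above (Equations (19) and (20) are not reproduced in the text, so the exact expressions are assumptions of this sketch):

```python
def vrs(prev_err, curr_err, dt):
    """Velocity Reward Strategy (illustrative sketch only).

    Part 1 (cf. Eq. (19)): reward the error moving toward zero, i.e. a
    negative derivative while the error is positive, or vice versa.
    Part 2 (cf. Eq. (20)): if the error changes sign but its magnitude
    still shrinks, turn the would-be punishment into a reward.
    """
    d_err = (curr_err - prev_err) / dt
    r_v = -d_err if curr_err >= 0 else d_err   # positive when approaching
    if prev_err * curr_err < 0 and abs(curr_err) < abs(prev_err):
        r_v = abs(r_v)                         # sign-change case of Eq. (20)
    return r_v
```

Because the sign of the reward flips depending on whether the output approaches or leaves the rated power, the table entries can both grow and shrink, which is what makes VRS a P-N strategy.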
A change in the MSE can also be measured, starting from the measure of Equation (17). This produces a new P-N reward, ∆MSE-RS, which is calculated by Equation (21). Comparing Equations (17) and (21), it is possible to observe that the reward is calculated by subtracting the inverse of MSE-RS at t_i from the inverse of MSE-RS at t_{i−1}. In this way, if the MSE is reduced, the reward is positive; otherwise, it is negative.
Similar to the MSE, a P-N reward strategy can be obtained based on the mean value of the error. It is calculated with Equation (22) and is called ∆Mean-RS.
As will be shown in the results, ∆Mean-RS improves the mean value of the error compared to other reward strategies.
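Both incremental rewards can be sketched from their descriptions. Equations (21) and (22) are not reproduced in the text, so the forms below only capture the stated behavior (positive when the windowed measure of the error decreases) and are assumptions of this sketch:

```python
def mse(window):
    """Mean squared error over a window of error samples."""
    return sum(e * e for e in window) / len(window)

def mean(window):
    """Mean error over a window of error samples."""
    return sum(window) / len(window)

def delta_mse_rs(prev_window, curr_window):
    """Sketch of Equation (21): positive when the windowed MSE decreases
    from one control period to the next, negative otherwise."""
    return mse(prev_window) - mse(curr_window)

def delta_mean_rs(prev_window, curr_window):
    """Sketch of Equation (22): positive when the magnitude of the windowed
    mean error decreases."""
    return abs(mean(prev_window)) - abs(mean(curr_window))
```

The difference between the two is the one discussed in Section 4: in ∆Mean-RS, positive and negative errors inside the window cancel each other, while in ∆MSE-RS every error contributes with the same sign.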
In addition to these P-N rewards, it is possible to define new reward strategies by combining O-P with P-N rewards, such as PRS·VRS. However, combining P-N rewards with each other does not generally provide better results; in some cases, such a combination can even degenerate into an effective O-P reward if the combined rewards always have the same sign.

Simulation Results and Discussion
An in-depth analysis of the performance of the RL controller under different configurations and reward strategies has been carried out. The algorithm has been coded by the authors using Matlab/Simulink software. The duration of each simulation is 100 s. To reduce the discretization error, a variable step size has been used for simulations, with a maximum step size of 10 ms. The control sampling period T_c has been set to 100 ms. In all the experiments, the wind speed is randomly generated between 11.5 m/s and 14 m/s. For comparison purposes, a PID is also designed with the same goal of stabilizing the output power around the rated value of the wind turbine. Thus, the input of the PID regulator is the output power and its output is the pitch angle reference. In order to make a fair comparison, the PID output has been scaled to adjust its range to [0, π/2] rad and biased by the term π/4. The output of the PID is saturated for values below 0° and above 90°. The parameters of the PID have been tuned by trial and error, and they have been set to KP = 0.9, KD = 0.2 and KI = 0.5.

Figure 2 compares the power output, the generator torque and the pitch signal obtained with different control strategies. The blue line represents the output power when the angle of the blades is 0°, that is, when the wind turbine collects the maximum energy from the wind. As expected, this action provides the maximum power output. The red line represents the opposite case: the pitch angle is set to 90° (feather position). In this position, the blades offer minimal resistance to the wind, so the energy extracted is also minimal. The pitch angle reference values are fixed for the open-loop system in both cases, without using any external controller. In a real wind turbine, there is a controller that regulates the current of the blade rotor in order to adjust the pitch angle; in our work this is simulated by Equation (5).
The yellow line is the output obtained with the PID, the purple line is the output when the RL controller is used, and the green line is the rated power. In this experiment, the policy update algorithm is SAR and the reward strategy is VRS. It can be observed that the response of the RL controller is much better than that of the PID, with a smaller error and less variation. As expected, the pitch signal is smoother with the PID regulator than with the RL controller. However, in return, the PID reacts more slowly, producing a bigger overshoot and a longer stabilization time of the power output.
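The baseline regulator described above can be sketched as follows. The gains, the π/4 bias and the [0, π/2] saturation follow the text; the discrete-time form and the way the raw PID output maps into the pitch range are assumptions of this sketch:

```python
import math

class PID:
    """Discrete PID baseline with the paper's gains (KP=0.9, KI=0.5, KD=0.2)
    and control period of 100 ms. The output is biased by pi/4 and saturated
    to [0, pi/2] rad as described; the normalization of the input signal is
    an assumption of this sketch."""

    def __init__(self, kp=0.9, ki=0.5, kd=0.2, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, err):
        """One control cycle: err is the (normalized) power error."""
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        u = self.kp * err + self.ki * self.integral + self.kd * deriv
        # Bias into the pitch range and saturate to [0, pi/2] rad (0..90 deg).
        theta_ref = u + math.pi / 4
        return min(max(theta_ref, 0.0), math.pi / 2)
```

With zero error the controller rests at the π/4 bias point (45° pitch), and large errors of either sign drive the pitch reference into one of the two saturation limits.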
Several experiments have been carried out comparing the performance of the RL controller when using different policy update algorithms and reward strategies. The quantitative results are presented in Tables 2-4 and confirm the graphical results of Figure 2. These data were extracted at the end of iteration 25. The reward window was set to 100 ms.
In these tables, the best results per column (policy) have been boldfaced and the best results per row (reward) have been underlined. The smallest MSE error is obtained by combining SAR and VRS. Overall, SAR provides the best results, closely followed by MAR and OL-LR. As expected, the worst results are produced by OR; this can be explained because it only considers the last reward, which limits the learning capacity. For almost all policy update algorithms, the MSE is lower when VRS is applied; the only exception is QL, which performs better with ∆MSE-RS.
Another interesting result is that the performance of O-P rewards is much worse than that of P-N rewards. The reason may be that exploration with O-P rewards is very low, and the best actions for many of the states are never exploited. The exploration can be increased by changing the initialization of the Q-table. Finally, the P-N rewards provide better performance than the PID even with OR.

Table 3 shows the mean value of the power output obtained in these experiments. The best value is obtained by combining OL-LR and ∆MEAN-RS. OL-LR is the best policy update algorithm, followed by SAR. OR again provides the worst results. In this case, the best reward strategies are ∆MEAN-RS and ∆MSE-RS. This may be because the mean value measurement is intrinsically considered in the reward calculation.

Table 4 presents the variance of the power output in the previous experiments. Unlike in Table 2, P-N rewards in general produce worse results than O-P rewards. This is logical because P-N rewards produce more changes in the selected actions and a more varying output, and therefore more variation. However, it is notable that the combination of VRS and SAR provides a good balance between MSE, mean value, and variance.

Figure 3 represents the evolution of the saturated error and its derivative, iteration by iteration. In this experiment, the combination of SAR and ∆MSE-RS is used. In Figure 3 it is possible to observe an initial peak of −1000 W in the error (horizontal axis). This error corresponds to an output power of 8 kW (the rated power is 7 kW). It has not been possible to avoid it at the initial stage with any of the tested control strategies, even forcing the pitch to feather. A remarkable result is that, in each iteration, the errors are merged and centered around a cluster. This explains how the mean value of the output power approaches the nominal power over time.
One way to measure the center of this cluster is to use the radius of the centroid of the error, calculated by Equation (23). It is possible to observe how, in general, the MSE and the radius decrease with time, although the rate is quite different depending on the combination of policy update algorithm and reward strategy that is used. The MSE and the radius decrease to a minimum, which is typically reached between iterations 5 and 10. The minimum is greater than 260 for the MSE and is 0 for the radius. The MSE minimum is high because, as explained, the first peak in the power output cannot be avoided; it cannot be improved by learning. As expected, the smallest MSE errors correspond to the smallest values of the radius.
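Since Equation (23) is not reproduced in the text, the following sketch only illustrates one natural reading of the measure: the distance from the origin of the error plane to the centroid of the (error, error-derivative) cloud, so that a radius of 0 means the cluster is centered on zero error:

```python
import math

def centroid_radius(errors, derrors):
    """Sketch of a centroid-radius measure (Equation (23) is not reproduced
    in the text, so this exact form is an assumption): Euclidean distance
    from the origin to the centroid of the (error, d(error)/dt) points."""
    n = len(errors)
    centroid_e = sum(errors) / n    # mean error
    centroid_d = sum(derrors) / n   # mean error derivative
    return math.hypot(centroid_e, centroid_d)
```

Under this reading, a cloud of points symmetric around zero gives a radius of 0 even if the individual errors are large, which matches the reported behavior of the radius reaching 0 while the MSE stays above 260.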
Another remarkable result is that iteration by iteration learning is not observed when O-P rewards are applied. This is because, as stated, these strategies do not promote exploration and optimal actions are not discovered. As will be shown in Section 4.3, this problem is solved when O-P rewards are combined with P-N rewards. It can also be highlighted how all the P-N rewards converge at approximately the same speed up to the minimum value, but this speed is different for each policy update algorithm. From this point on, there are major differences between the P-N reward strategies. For some policy update algorithms (OR, LF-LR, and QL), these differences increase over time, while for the rest, they decrease.
The OR strategy is the one that converges the fastest to the minimum, but from this point, the MSE grows and becomes more unstable. Therefore, it is not recommended in the long term. However, SAR provides a good balance between convergence speed and stability. When it is used, the MSE for the three P-N reward strategies converges to the same value.
As expected, OL-LR and SAR produce very similar results because the only difference between them is that in the former, the rewards are multiplied by a constant. As the rewards are higher, the actions are reinforced more and there are fewer jumps between actions. This can be seen in Figures 5 and 7.

Influence of the Reward Window
Several of the reward strategies calculate the reward by applying a time window, that is, considering N previous samples of the error signal; specifically: MSE-RS, MEAN-RS, ∆MSE-RS and ∆MEAN-RS. To evaluate the influence of the size of this window, several experiments have been carried out varying this parameter. In all of them, the policy update algorithm has been SAR. Figure 10 (left) shows the results when ∆MSE-RS is applied and Figure 10 (right) those for ∆MEAN-RS. Each line is associated with a different window size and is represented in a different color. The reward is a dimensionless parameter, as the table T^π(s_t, a_t) is dimensionless. The legend shows the size of the window in seconds. The value −1 indicates that all the previous values, from instant 0 of the simulation, have been taken into account to obtain the reward; that is, the size of the window is variable, increases in each control period, and covers from the start of the simulation to the current moment. MSE-RS and MEAN-RS have not been included since, as explained, they are O-P reward strategies and do not converge without forced exploration.
In general, a small window size results in a faster convergence to the minimum, but if the size is too small it can cause oscillations after the absolute minimum. This happens with a window of 0.01 s: the MSE oscillates and is even less stable for ∆MEAN-RS. A small window size produces noisy rewards. This parameter seems to be related to the control period; a size smaller than the control period produces oscillations.
For the ∆MSE-RS strategy, the convergence speed decreases with the size of the window up to 1 s. For smaller window sizes it does not converge. This can be explained as if the window is longer than the control period, the window can be divided into two parts: the value of a control period In general, a small window size results in a faster convergence to the minimum, but if the size is too small it can cause oscillations after the absolute minimum. This happens with a window of 0.01s, the MSE oscillates and is even less stable for ∆MEAN-RS. A small window size produces noisy rewards. This parameter seems to be related to the control period; a size smaller than the control period produces oscillations.
For the ∆MSE-RS strategy, the convergence speed decreases with the size of the window up to 1 s. For smaller window sizes it does not converge. This can be explained as if the window is longer than the control period, the window can be divided into two parts: the value of a control period preceding the end of the Tw 2 window, and the remaining part from the beginning of the Tw 1 window (Figure 11). An action performed at t i−1 produces an effect that is evaluated when the reward is calculated at t i . When the size of the window grows during the control period, the Tw 1 part also grows, but Tw 2 remains invariant. To produce positive rewards, it is necessary to reduce the MSE, therefore, during Tw 2 , the increases in Tw 1 should be compensated. A larger Tw 1 would give a larger accumulated error in this part, which would be more difficult to compensate during Tw 2 since only the squared error can be positive. It can then be concluded that the optimal window size for ∆MSE-RS is the control period, in this case, 100 ms.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 15 of 22 preceding the end of the Tw2 window, and the remaining part from the beginning of the Tw1 window ( Figure 11). An action performed at produces an effect that is evaluated when the reward is calculated at . When the size of the window grows during the control period, the Tw1 part also grows, but Tw2 remains invariant. To produce positive rewards, it is necessary to reduce the MSE, therefore, during Tw2, the increases in Tw1 should be compensated. A larger Tw1 would give a larger accumulated error in this part, which would be more difficult to compensate during Tw2 since only the squared error can be positive. It can then be concluded that the optimal window size for ∆MSE-RS is the control period, in this case, 100 ms. The behavior of ∆MEAN-RS with respect to the size of the window is similar up to a size of around 20 s; from this value increasing the window size accelerates the convergence and decreases the MSE. This is because a larger window size implies a longer Tw1 part ( Figure 11). However, unlike ∆MSE-RS, a longer Tw1 produces less accumulated error in Tw1 since, in this case, the positive errors compensate for the negative ones, and the accumulated error tends to 0. Therefore, Tw2 has a greater influence on the window, and learning is faster. Figure 12 shows the variation of the MSE with the size of the reward window, at iteration 5. It is possible to observe how the MSE grows until the size of the window is 1 s, decreases until a size around 10 s, and then it grows again until around 25 s, which continues to grow for ∆MSE-RS and decreases for ∆MEAN-RS. The numerical values of these local minima and maxima are related to the duration of the initial peak (Figure 2). The O-P rewards have also been represented with different reward windows. It is possible to observe how for long windows, the ∆MSE-RS tends to behave like the O-P reward strategies and reaches the same values. 
The behavior of ∆MEAN-RS with respect to the size of the window is similar up to a size of around 20 s; from this value increasing the window size accelerates the convergence and decreases the MSE. This is because a larger window size implies a longer Tw 1 part ( Figure 11). However, unlike ∆MSE-RS, a longer Tw 1 produces less accumulated error in Tw 1 since, in this case, the positive errors compensate for the negative ones, and the accumulated error tends to 0. Therefore, Tw 2 has a greater influence on the window, and learning is faster. Figure 12 shows the variation of the MSE with the size of the reward window, at iteration 5. It is possible to observe how the MSE grows until the size of the window is 1 s, decreases until a size around 10 s, and then it grows again until around 25 s, which continues to grow for ∆MSE-RS and decreases for ∆MEAN-RS. The numerical values of these local minima and maxima are related to the duration of the initial peak ( Figure 2). The O-P rewards have also been represented with different reward windows. It is possible to observe how for long windows, the ∆MSE-RS tends to behave like the O-P reward strategies and reaches the same values. around 20 s; from this value increasing the window size accelerates the convergence and decreases the MSE. This is because a larger window size implies a longer Tw1 part ( Figure 11). However, unlike ∆MSE-RS, a longer Tw1 produces less accumulated error in Tw1 since, in this case, the positive errors compensate for the negative ones, and the accumulated error tends to 0. Therefore, Tw2 has a greater influence on the window, and learning is faster. Figure 12 shows the variation of the MSE with the size of the reward window, at iteration 5. It is possible to observe how the MSE grows until the size of the window is 1 s, decreases until a size around 10 s, and then it grows again until around 25 s, which continues to grow for ∆MSE-RS and decreases for ∆MEAN-RS. 
The numerical values of these local minima and maxima are related to the duration of the initial peak ( Figure 2). The O-P rewards have also been represented with different reward windows. It is possible to observe how for long windows, the ∆MSE-RS tends to behave like the O-P reward strategies and reaches the same values.
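The windowed reward strategies discussed above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the class and method names, as well as the exact sign convention of ∆MSE-RS (here "previous windowed MSE minus current one", so that reducing the error yields a positive reward, consistent with the discussion of Tw1 and Tw2), are assumptions.

```python
from collections import deque

class WindowedReward:
    """Sliding-window reward sketch. The window stores the last N samples
    of the power error e = P_out - P_rated, with N = window_size / control_period.
    A negative window_size mimics the "-1" case: all samples since t = 0."""

    def __init__(self, window_size_s, control_period_s=0.1):
        # maxlen=None makes the deque unbounded (the variable-size window)
        n = None if window_size_s < 0 else max(1, round(window_size_s / control_period_s))
        self.errors = deque(maxlen=n)
        self.prev_mse = None

    def update(self, error):
        self.errors.append(error)

    def mse(self):
        # O-P style MSE-RS: always >= 0, so it never punishes an action
        return sum(e * e for e in self.errors) / len(self.errors)

    def delta_mse(self):
        # P-N style dMSE-RS: positive when the windowed MSE decreases,
        # negative when it grows (sign convention assumed from the text)
        current = self.mse()
        reward = 0.0 if self.prev_mse is None else self.prev_mse - current
        self.prev_mse = current
        return reward
```

With a 1 s window and a 100 ms control period the window holds 10 samples, which is consistent with the observation that useful window sizes are tied to the control period.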

Influence of the Size of the Reward
Up to this subsection, the reward mechanism provides a variable-size reward or punishment depending on how good the previous action was: better (worse) actions give larger positive (negative) rewards. In this section, the case in which all rewards and punishments have the same size is analyzed. To do so, the P-N reward strategies are binarized, that is, the value +r is assigned if the reward is positive and −r if it is negative. Several experiments have been carried out varying the parameter r to check its influence. In all experiments the policy update algorithm is SAR and the window size is 100 ms.
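A minimal sketch of this binarization follows; the function name is illustrative, and the handling of an exactly zero reward is not specified in the text, so passing it through unchanged is an assumption.

```python
def binarize_reward(reward, r=1.0):
    """Fixed-size P-N reward: +r for any positive reward, -r for any
    negative one. A zero reward is passed through unchanged (assumption)."""
    if reward > 0:
        return r
    if reward < 0:
        return -r
    return 0.0
```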
The results are shown in Figures 13–15: the evolution of the MSE on the left and the evolution of the variance on the right. Each line represents a different reward size, with a color code; the legend indicates the size of the reward. "Var" indicates that the reward strategy is not binarized and, therefore, the size of the reward is variable.
It can be seen how, in all cases, the MSE is much better when the reward is not binarized; the variance, however, is similar or even worse. It has already been explained how better MSE performance typically leads to greater variance. Another interesting result is that the speed of convergence does not depend on the size of the reward: all the curves provide similar results up to the absolute minimum. From this point on, however, oscillations appear in the MSE and vary with the reward size.
The VSE reward strategy is the least susceptible to variations in the reward size. For ∆MSE-RS, the oscillations seem to depend on the size of the reward (the larger the size, the greater the oscillations), but their amplitude decreases with time. For ∆MEAN-RS this relationship is not so clear and, what is worse, the amplitude of the oscillations seems to increase with time. Therefore, it is not recommended to use a fixed reward size with ∆MSE-RS and ∆MEAN-RS; a variable reward size is preferable.

Combination of Individual Reward Strategies
As discussed above, O-P reward strategies do not converge due to the lack of exploration of the entire space of possible actions. To solve this, ε-greedy methods can be applied [40], or they can be combined with P-N reward strategies. This last option is explored in this section. Different experiments have been carried out combining O-P with P-N reward strategies, and their performance is studied. In all experiments, the policy update algorithm is SAR and the window size is 100 ms.
Figure 16 shows the results of applying PRS (blue), VRS (red), PRS·VRS (yellow), and PRS+K·VRS (purple), with K = 2. It is possible to see how the multiplication makes PRS converge. However, this is not true for the addition, as this operator cannot convert PRS into a P-N reward strategy in all cases; it depends on the size of each individual reward and on the value of K. Therefore, the addition operator will not be used from now on to combine the rewards. Another interesting result is that the PRS·VRS combination smoothens the VRS curve: the result converges at a slightly slower speed but is more stable beyond iteration 15. In general, this combination presents less variance than VRS.
In the following experiment, the P-N ∆MSE-RS reward strategy is combined with each O-P reward strategy. Figure 18 shows the results.
In this last experiment, the P-N ∆Mean-RS strategy is combined with each O-P strategy. Figure 19 shows the results, with PRS (dark blue), MSE-RS (red), ∆Mean-RS (purple), PRS·∆Mean-RS (green), Mean-RS·∆MSE-RS (light blue), and Mean-RS·∆Mean-RS (magenta).
Again, it is possible to observe how all the combined strategies and ∆Mean-RS converge at the same speed until approximately iteration 5, where PRS·∆Mean-RS improves the MSE. The combination of these strategies provides a better result than their individual application. Furthermore, the variance also decreases with the iterations. The combination of ∆Mean-RS with Mean-RS and MSE-RS only offers an appreciable improvement in variance.
Table 5 compiles the numerical results of the previous experiments. The KPIs have been measured at iteration 25. The best MSE is obtained by the combination PRS·VRS, and the best mean value and variance by MSE-RS·∆MSE-RS. In general, it is possible to observe how the combination with PRS decreases the MSE, whereas the combination with MEAN-RS and MSE-RS improves the mean value and the variance.
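The sign argument behind preferring the product over the sum can be made explicit with a small sketch (function names are illustrative, not from the paper): since an O-P reward is non-negative, the sign of the product is the sign of the P-N factor, whereas the sum can remain positive even when the P-N term punishes.

```python
def combine_mul(r_op, r_pn):
    # r_op >= 0 (O-P reward), so the result has the sign of r_pn:
    # the combination is still a P-N strategy and can converge
    # without forced exploration
    return r_op * r_pn

def combine_add(r_op, r_pn, k=2.0):
    # r_op + k*r_pn may stay positive even when r_pn < 0, so the
    # combination is not guaranteed to be P-N; this is why the
    # addition operator is discarded in the experiments
    return r_op + k * r_pn
```

For example, with r_op = 3 and r_pn = −0.5, the product is −1.5 (the punishment is preserved), while the sum with K = 2 is 2.0 (the punishment is masked).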

Evolution of the Variance
In view of these results, it is possible to conclude that the combination of O-P and P-N individual rewards is beneficial: it makes learning with O-P rewards converge without worsening the speed of convergence, and learning is more stable.
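For contrast, the forced exploration that O-P strategies would otherwise require can be illustrated with a standard ε-greedy action selector over a tabular policy; this is a generic sketch of the well-known method, not the SAR algorithm used in this work.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest stored value
    (exploitation). P-N rewards make this forcing unnecessary,
    while O-P rewards need it to converge."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])
```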

Conclusions and Future Works
In this work, an RL-inspired pitch control strategy for a wind turbine has been presented. The controller is composed of a state estimator, a policy update algorithm, a reward strategy, and an actuator. The reward strategies are specifically designed to consider the energy deviation from the rated power, aiming to improve the efficiency of the WT.
The performance of the controller has been tested in simulation on a 7 kW wind turbine model, varying different configuration parameters, especially those related to the rewards. The RL-inspired controller has been compared to a tuned PID, giving better results in terms of system response.
The relationship of the rewards with the exploration-exploitation dilemma and the ε-greedy methods has been studied. On this basis, two novel categories of reward strategies have been proposed: O-P (only-positive) and P-N (positive-negative) rewards. The performance of the controller has been analyzed for different reward strategies and different policy update algorithms. The individual behavior of these methods and their combinations have also been studied. It has been shown that the P-N rewards improve the learning convergence and the performance of the controller.
The influence of the control parameters and the RL configuration on the turbine response has been thoroughly analyzed, and different conclusions regarding learning speed and convergence have been drawn. It is worth noting the relationship between the size of the reward and the need for forced exploration for the convergence of learning.
A potential challenge is to extend this proposal to design model-free, general-purpose tracking controllers. Another research line would be to incorporate risk detection into the P-N reward mechanisms, so as to perform safe, non-forced exploration for systems that must fulfill safety requirements during the learning process.
As further future work, it would be desirable to test the proposal on a real prototype of a wind turbine. It would also be interesting to apply this control strategy to a larger turbine, and to study whether this control action affects the stability of a floating offshore wind turbine.