Digital Commons @ Michigan Tech

Towards real-time reinforcement learning control of a wave energy converter

Abstract: The levelised cost of energy of wave energy converters (WECs) is not yet competitive with fossil fuel-powered stations. To improve the feasibility of wave energy, it is necessary to develop effective control strategies that maximise energy absorption in mild sea states, whilst limiting motions in high waves. Due to their model-based nature, state-of-the-art control schemes struggle to deal with model uncertainties, adapt to changes in the system dynamics with time, and provide real-time centralised control for large arrays of WECs. Here, an alternative solution is introduced to address these challenges, applying deep reinforcement learning (DRL) to the control of WECs for the first time. A DRL agent is initialised from data collected in multiple sea states under linear model predictive control in a linear simulation environment. The agent outperforms model predictive control for high wave heights and periods, but suffers close to the resonant period of the WEC. DRL also has a much lower computational cost at deployment time, as the computational effort is diverted from deployment to training. This provides confidence in the application of DRL to large arrays of WECs, enabling economies of scale. Additionally, model-free reinforcement learning can autonomously adapt to changes in the system dynamics, enabling fault-tolerant control.


Introduction
Ocean wave energy is a type of renewable energy with the potential to contribute significantly to the future energy mix. Despite an estimated global resource of 146 TWh/yr [1], the wave energy industry is still in its infancy. In 2014, there were less than 10 MW of installed capacity worldwide due to the high levelised cost of energy (LCoE) of approximately EUR 330-630/MWh [1]. The main operational challenge is the maximisation of energy extraction in the common, low-energy sea states, whilst ensuring the survival of the wave energy converters (WECs) in storms [2]. Two major contributors to the lowering of the LCoE to EUR 150/MWh by 2030 are expected to be the achievement of economies of scale and the development of effective control strategies [3].
Over the past decade, model predictive control (MPC) has attracted much research interest, as it can offer improved performance over the control strategies developed in the 1970s and 1980s, based on hydrodynamic principles. Assuming knowledge of the wave excitation force, MPC computes the control action, typically the force applied by the power take-off system (PTO), that results in optimal energy absorption over a future time horizon using a model of the WEC dynamics. The controller applies only the first value of the PTO force, recomputing the optimal control action at the next time step. The iterative procedure enables the controller to reduce the negative impact of inaccuracies in the prediction of the future excitation force and of modelling errors. Additionally, the MPC framework enables the inclusion of constraints on both the control action and the system dynamics. A good review of MPC for WEC control can be found in [4]. Li and Belmont first proposed a fully convex implementation, which trades off the energy absorption, the energy consumed by the actuator and safe operation [5]. An even more efficient implementation cast in a quadratic programming form has been proposed by Zhong and Yeung [6]. Fundamentally, the convex form enables the strategy to be generalised to the control of multiple WECs in real-time [7,8]. However, linear MPC relies on a linear model of the WEC dynamics. In energetic waves, nonlinearities in the static and dynamic Froude-Krylov forces (i.e., the hydrostatic restoring and wave incidence forces) and viscous drag effects can become significant [9]. Although nonlinear MPC strategies have been proposed [10,11] and even tested experimentally [12], achieving a successful real-time, centralised control implementation for multiple WECs is expected to be challenging [13].
As described in [14,15], alternative strategies have been developed for the control of WECs. Some, like simple-and-effective control [16], present similar performance to MPC at a much lower computational cost. Alternative solutions based on machine learning have been recently considered thanks to the advancements in the field of artificial intelligence. Most commonly, the neural networks are used to provide a data-based, nonlinear model of the system dynamics, i.e., for system identification. After training, the identified model is coupled with standard strategies used for the control of WECs, e.g., resistive or damping control in [17], reactive or impedance-matching in [18] and latching control in [19,20]. On the one hand, some studies have proposed the use of neural networks to find the optimal parameters for impedance-matching control on a time-averaged basis [21], thus being readily applicable to the centralised control of multiple WECs [22]. On the other hand, other works have focused on real-time control [19,20,23], exploiting the capability of neural networks to handle the predicted wave elevation over a future time horizon, similar to MPC. The main advantage of machine learning models for the system identification of WECs is that the same method can be used for different WEC technologies and is potentially adaptive to changes in the system dynamics, e.g., to subsystem failures or biofouling.
A promising solution for developing an optimal, real-time nonlinear controller for WECs, inclusive of constraints on both the state and action, is to cast the problem in a dynamic programming framework, similarly to MPC for WEC problems. In nonlinear dynamic programming solutions, a neural network is used as a critic to approximate the time-dependent optimal cost value expressed as a Hamilton-Jacobi-Bellman equation. Numerical studies have shown the effectiveness and robustness of this approach for the control of a single WEC [24][25][26]. In particular, dynamic programming, also classified as model-based reinforcement learning (RL), is much more data-efficient than model-free RL [27]. Using a machine learning model trained on the collected data, such as Gaussian processes [27] or neural networks [24][25][26], enables the controller to plan off-line, thus significantly speeding up the learning of a suitable control policy even from a small set of samples. Conversely, model-free RL methods, which learn from direct interactions with the environment, require a much larger number of samples (on the order of 10^8 as opposed to 10^4 for complex control tasks [28]). For this reason, to date model-free RL has been applied only to the time-averaged resistive and reactive control of WECs with discrete PTO damping and stiffness coefficients [29][30][31], with lower-level controllers necessary to ensure constraint satisfaction [32]. However, model-free RL schemes are known to find the optimal control policy, even for real-time applications and very complex systems [28].
This paper introduces the first deep reinforcement learning (DRL) control method for WECs. The novel approach enables the real-time, nonlinear optimal control of WECs based on model-free RL. Deep learning allows the method to treat continuous control inputs and outputs efficiently at deployment time.
To avoid unpredictable behaviour during the initial learning stage, WECs are expected to be controlled with model-based, robust methods when first deployed in the future. For a real-time implementation on complex WECs, these are likely to rely on linear models considering current technologies. After sufficient data samples are collected, the controller can move to the proposed data-driven model, whose computational cost at deployment does not increase in the presence of nonlinearities and which can adapt to changes in the system dynamics or noncritical faults if retrained regularly. Hence, first of all, the WEC is operated in a range of representative sea states under the convex MPC proposed in [6,8]. Data samples are collected from 15-minute-long wave traces in each sea state. Subsequently, the dataset is used to train a deep neural network (DNN), defined as a neural network with more than one hidden layer according to [33], which mimics the controller behaviour. The DNN thus corresponds to the actor of an actor-critic RL strategy. The actor is then used to initialise a model-free RL controller, as in [28]. The agent is then further trained to optimise its behaviour, as in [34].
In this article, the analysis is limited to the simulation of a standard spherical point absorber constrained to heaving motions [9]. The simulation environment is currently based on a linear model as presented in Section 2. The new DRL-based control method for WECs is described in Section 3. Finally, the performance of the trained actor is assessed directly against the original linear MPC developed in [6] in Section 4, with conclusions drawn in Section 5.

Linear Model of a Heaving Point Absorber
A heaving point absorber is shown schematically in Figure 1. Assuming linear potential wave theory, the equation of motion of a heaving point absorber can be expressed in the time domain as [35]

m \ddot{z}(t) = f_e(t) + f_r(t) + f_h(t) + f_m(t) + f_{PTO}(t),    (1)

where t indicates time, z the heave displacement, f_e the wave excitation force inclusive of wave incidence and diffraction effects (or dynamic Froude-Krylov and scattering forces), f_r the wave radiation force, f_h the hydrostatic restoring force (or static Froude-Krylov force), f_m the mooring force, f_PTO the PTO or control force, and m the mass of the buoy. In this study, the mooring forces are ignored for simplicity. The linear excitation wave force can be obtained using the excitation impulse response function K_e:

f_e(t) = \int_{-\infty}^{+\infty} K_e(t - \tau) \zeta(\tau) \, d\tau,    (2)

where ζ is the wave elevation. Similarly, the linear hydrostatic restoring force is

f_h(t) = -\rho g A_w z(t),    (3)

where ρ is the water density, g the gravitational acceleration and A_w the waterplane area. Using Cummins' equation [36], the radiation force can be expressed as

f_r(t) = -A(\infty) \ddot{z}(t) - \int_0^t K(t - \tau) \dot{z}(\tau) \, d\tau,    (4)

where A(∞) is the heave added mass at infinite wave frequency and K is the radiation impulse response function. The convolution integral in (4) causes significant challenges for control tasks. Hence, it is common practice to approximate the convolution integral with a state-space model to improve the computational performance and ensure controllability. Here, the approach based on moment matching proposed in [37] is followed. Hence, (4) is reformulated as

\dot{x}_r(t) = A_r x_r(t) + B_r \dot{z}(t), \qquad \int_0^t K(t - \tau) \dot{z}(\tau) \, d\tau \approx C_r x_r(t).    (5)

Substituting (2)-(5) into (1) allows the equation of motion of the heaving point absorber to be expressed in state-space form:

\dot{x}(t) = A x(t) + B u(t) + B_e f_e(t),    (6)

with M = m + A(∞). The net useful energy that can be absorbed from the waves between times t_0 and t_f is given by

E = -\int_{t_0}^{t_f} f_{PTO}(t) \dot{z}(t) \, dt.    (7)
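For illustration, the equations above can be integrated numerically. The sketch below uses a semi-implicit Euler scheme with placeholder parameter values (not those of Table 1) and a single-state stand-in for the radiation state-space approximation:

```python
import numpy as np

# Placeholder parameters (illustrative only, not the values of Table 1)
m = 2.0e5          # buoy mass [kg]
A_inf = 1.0e5      # added mass at infinite frequency A(inf) [kg]
k_h = 8.0e5        # hydrostatic stiffness rho*g*A_w [N/m]
M = m + A_inf

# Single-state stand-in for the radiation state-space model:
# x_r' = a_r*x_r + b_r*zdot, with the convolution term approximated by c_r*x_r
a_r, b_r, c_r = -0.5, 1.0, 5.0e4

dt = 0.01                                   # integration time step [s]
t = np.arange(0.0, 20.0, dt)
f_e = 1.0e5 * np.sin(2 * np.pi * t / 6.0)   # toy excitation force [N]
f_pto = 0.0                                 # uncontrolled response here

z, zdot, x_r = 0.0, 0.0, 0.0
zs = []
for fe in f_e:
    # M*zddot = f_e - rho*g*A_w*z - (radiation term) + f_PTO
    zddot = (fe - k_h * z - c_r * x_r + f_pto) / M
    x_r += dt * (a_r * x_r + b_r * zdot)    # radiation state update
    zdot += dt * zddot                      # semi-implicit Euler: velocity first,
    z += dt * zdot                          # then position with the new velocity
    zs.append(z)
zs = np.array(zs)
```

The semi-implicit update keeps the lightly damped oscillation bounded, which a fully explicit Euler step would not guarantee over long simulations.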

Real-Time Reinforcement Learning Control of a Wave Energy Converter
RL is a decision-making framework in which an agent learns a desired behaviour, or policy π, from direct interactions with the environment [38]. As shown in Figure 2, at each time step, the agent is in a state s and takes an action a, thus landing in a new state s' while receiving a reward r. A Markov decision process is used to model the action selection depending on the value function Q(s, a), which represents an estimate of the future reward. By interacting with the environment for a long time, the agent learns an optimal policy, which maximises the total expected reward.

Problem Formulation
As a decision-making framework, RL is typically used to train an agent, or system, to perform a task that is particularly challenging to express in a standard control setting, e.g., walking for a legged robot. These tasks are usually described as episodic, i.e., the experience can be subdivided into discrete trials whose end is determined by either success in achieving the desired task, e.g., the robot has correctly made a walking step, or failure, e.g., the robot has fallen over and needs to start again. However, WEC control is clearly continuous, which will require a reformulation of the RL schemes as shown in [29][30][31][32].
In a feedforward configuration, the control of WECs is dependent on the wave excitation force and its predicted value over a future time horizon. Here, a simple state space is selected, which includes only the WEC displacement and velocity, and captures the wave excitation information from the wave elevation and its rate. Therefore, the RL state space for a heaving point absorber is defined as

s = [z, \dot{z}, \zeta, \dot{\zeta}]^T.    (8)

The action space is identical to the control input u:

a = u = f_{PTO}.    (9)

Although RL originated with the treatment of discrete actions, as for instance shown in [29][30][31][32], successful strategies with continuous action spaces have been recently proposed [34,39,40]. With either solution, constraints on the action can be easily imposed, so that |a| < f_max.
Specifying an appropriate reward function is fundamental to having the agent learn the desired behaviour. Note that in RL, the optimisation problem is typically cast as a maximisation rather than a minimisation. Taking inspiration from [24][25][26], the reward function can be defined as

r = -f_{PTO} \dot{z} - w_u f_{PTO}^2 - w_z p_z,    (10)

where the weights w_u and w_z can be used to tune the penalty on the control action and heave displacement, respectively. Whilst a constraint can be placed on the PTO force, as it coincides with the control action, it is not possible to impose proper limits on the heave displacement. Therefore, a discontinuous function is used to determine the penalty term p_z to produce an aggressive controller:

p_z = 0 if |z| <= z_max,  p_z = 1 otherwise,    (11)

where z_max defines the displacement limit. Another difference between the RL and MPC frameworks consists of the way the information on the future incoming waves is treated, i.e., the prediction step. In feedforward MPC, an external method, e.g., autoregressive techniques and the excitation impulse response function [41], is used to predict the incoming wave force and the information is included in the cost function to select the control action. In the RL framework, the agent learns an optimal policy for the maximisation of the total reward, which is a function of the current reward as well as discounted future rewards deriving from following either the current or the optimal policy. This means that the reward function should be specified for the current time step rather than include information from predicted future time steps. The prediction step is embedded within the RL system in a probabilistic setting.
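The reward structure described above can be sketched as a single function. The sign convention for absorbed power and the unit magnitude of the discontinuous penalty are assumptions; the default weights match the values reported later for the case study:

```python
def reward(z, zdot, f_pto, w_u=1e-5, w_z=1e6, z_max=2.5):
    """Instantaneous RL reward: useful absorbed power minus penalties on
    the control effort and on exceeding the displacement limit (a sketch)."""
    p_abs = -f_pto * zdot                   # power absorbed from the waves
    p_z = 0.0 if abs(z) <= z_max else 1.0   # discontinuous displacement penalty
    return p_abs - w_u * f_pto**2 - w_z * p_z
```

With these weights, any excursion beyond z_max dominates the reward, strongly discouraging the agent from violating the displacement limit.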

RL Real-Time WEC Control Framework
Although trust region policy optimisation is used in [28] for a model-free controller initialised with samples obtained from an MPC controller, here actor-critic strategies are considered for the real-time control of a WEC. Continuous state and action spaces are necessary for improved control performance, and soft actor-critic (SAC) [34], the most advanced actor-critic DRL algorithm at the time of writing, supports both. SAC is therefore selected as the control framework for the point absorber.
As shown in Figure 3, the controller is split into an actor and a critic. The function of the critic is to evaluate the policy, thus updating the action-value function, which is a measure of the total discounted reward, using the samples collected from observations of the environment. The discounted reward estimated by the critic is then fed to the actor. Using the estimated action-value function, the actor selects an action based on the current state, directly interacting with the environment. The policy is then improved by learning from the collected observations. As SAC is an off-line, off-policy algorithm, the critic and actor can be updated using batches of data samples drawn from a memory known as the experience replay buffer. The agent seeks to maximise not only the expected reward from the environment, but also the entropy of the policy. The entropy term ensures that the agent selects random actions to explore the environment, weighted by a parameter α defined as the entropy temperature. The parameter is automatically adjusted with gradient descent to ensure sufficient exploration at the start of learning, and subsequently a greater emphasis on the maximisation of the expected reward. A DNN is used to model the mean and the log of the standard deviation of the policy. For the policy improvement step, the policy distribution is updated towards the softmax distribution for the current Q-function by minimising the Kullback-Leibler divergence.
In SAC, two DNNs are used to approximate the critic's policy evaluation to mitigate positive bias in the policy evaluation step. The DNNs are trained off-line using batches of data collected by the actor during deployment. The minimum value of the two soft Q-functions is used in the gradient descent during training, which has been found to significantly speed up convergence. Additionally, target networks that are obtained as an exponentially moving average of the soft Q-function weights are used to smooth out the effects of noise in the sampled data.
The SAC algorithm is summarised in Algorithm 1. For a full explanation, the reader is referred to [34]. As compared with the RL solutions presented in [29][30][31][32], once trained, the new DRL implementation can run in real time, with a control time step similar to that used by other control methods, e.g., MPC.

A spherical point absorber as shown in Figure 1 is selected as a case study for the development of the real-time RL WEC control scheme. The spherical buoy represents a standard case study based on the Wavestar prototype WEC, which has also been used in [9,37,42] among other studies. Additionally, the simple geometry enables the inclusion of computationally efficient nonlinear Froude-Krylov and viscous forces in the future. The properties of the point absorber in the simulation environment can be found in Table 1. The hydrodynamic coefficients have been computed with the panel code WAMIT. However, the same matrices as in [37] have been used for the state-space approximation of the radiation convolution integral. The problem has been programmed in the Python/PyTorch framework for the SAC controller.
In addition, the robust and computationally efficient MPC strategy described in [6,8] is selected to initialise the training and benchmark the results of the DRL scheme. The method is implemented in the MATLAB/Simulink framework using the quadratic-programming solver quadprog, after discretising the matrix equation in (6) with a zero-order hold. Similarly to [6,8], the future wave elevation is assumed here to be known exactly, as prediction methods with 90% accuracy up to 10 s into the future have been proposed [41,43].
For both control methods, a first-order Euler scheme is used for the time integration of the simulations with a time step of 0.01 s. Typical ocean waves have a wave energy period approximately ranging from 5 s to 20 s [44]. Hence, it is clear that the selected point absorber will need significant control effort to extract energy from realistic ocean waves, since its resonant period is lower, as shown in Table 1. In this work, the peak wave period is considered to range from 4 s to 10 s, which is expected to be realistic for the small point absorber. Additionally, as the simulation environment is based here on a linear model, only small wave amplitudes up to 1 m are analysed. As a result, no constraints on either the buoy displacement or the PTO force are set on the MPC controller. The penalty on the slew rate, set to r_MPC = 10^-5 to ensure convexity according to [8], is expected to be sufficient for the achievement of a suitable WEC response.
Zhong and Yeung [6] have shown that, in regular waves, the mean absorbed power does not increase with the time horizon duration once the horizon is one wave period long. Hence, here H = 10 s is set, since it is the longest wave period that is analysed and matches realistic prediction timeframes [41,43]. The control time step length is set to δt = 0.2 s.
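The receding-horizon mechanism, optimising over the horizon H, applying only the first PTO force, then re-solving at the next control step, can be sketched with the convex QP of [6,8] abstracted behind a placeholder solver (the proportional-damping stand-in below is purely illustrative, not the actual QP):

```python
import numpy as np

H, dt_c = 10.0, 0.2        # prediction horizon [s] and control time step [s]
n_h = int(H / dt_c)        # number of steps in the prediction horizon

def solve_horizon_qp(state, fe_pred):
    """Placeholder for the convex QP: returns a PTO-force trajectory over
    the horizon (here simply proportional damping on the heave velocity)."""
    z, zdot = state
    return -5.0e4 * zdot * np.ones(n_h)

def mpc_step(state, fe_pred):
    u_traj = solve_horizon_qp(state, fe_pred)
    return u_traj[0]       # apply only the first value, then re-solve next step
```

Only `mpc_step` runs at deployment; the quality of the controller rests entirely on what `solve_horizon_qp` optimises.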
A maximum PTO force f_max = 10^5 N and displacement z_max = 2.5 m are selected. Additionally, the weights of (10) are set to w_u = 10^-5 and w_z = 10^6. The hyperparameters used for the SAC agent are the same as in [34] and are reported in Table 2 for greater clarity.

Results in Irregular Waves
To generate sufficient data samples for the training of the actor DNN, 28 wave traces of irregular waves lasting 15 min each are produced, with the significant wave height ranging from 0.5 m to 2 m in steps of 0.5 m and the peak wave period from 4 s to 10 s in steps of 1 s. A Bretschneider spectrum is used [44]. The controller is started only after 100 s from the start of the wave trace to avoid numerical instabilities during the initial transient. The wave trace is logged after an additional 50 s for 900 s.
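The synthesis of the irregular wave traces can be sketched with the standard Bretschneider spectral density and random-phase superposition; the frequency range and discretisation below are assumptions:

```python
import numpy as np

def bretschneider(omega, hs, tp):
    """Bretschneider spectral density S(omega) [m^2 s/rad] for significant
    wave height hs [m] and peak period tp [s]."""
    wp = 2.0 * np.pi / tp
    return 0.3125 * hs**2 * wp**4 / omega**5 * np.exp(-1.25 * (wp / omega) ** 4)

def wave_trace(hs, tp, duration=900.0, dt=0.01, n_freq=400, seed=0):
    """Irregular wave elevation as a superposition of harmonics with
    amplitudes from the spectrum and uniformly random phases."""
    rng = np.random.default_rng(seed)
    omega = np.linspace(0.2, 4.0, n_freq)        # assumed frequency range [rad/s]
    d_omega = omega[1] - omega[0]
    amp = np.sqrt(2.0 * bretschneider(omega, hs, tp) * d_omega)
    phase = rng.uniform(0.0, 2.0 * np.pi, n_freq)
    t = np.arange(0.0, duration, dt)
    zeta = np.zeros_like(t)
    for a, w, p in zip(amp, omega, phase):
        zeta += a * np.cos(w * t + p)
    return t, zeta
```

A quick sanity check is that 4*sqrt(m0), with m0 the area under the spectrum, recovers the target significant wave height.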

Training
The sampled data is used to initialise the experience replay buffer of the SAC agent. Each episode consists of a randomly initialised wave trace whose significant wave height and peak wave period are selected at random in the 1.5-2 m and 6-8 s ranges, respectively. The wave trace lasts for 200 s and is initialised with no control force for the first 100 s to avoid numerical instabilities. Hence, each episode lasts a total of 2001 steps for the selected control time step of 0.2 s. The same control time step is selected for the DRL controller. To ensure the robustness of the algorithm, the agent is trained with five different seeds for the random number generator.
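Prefilling the experience replay buffer from the logged MPC data can be sketched as below; `mpc_transitions` is a placeholder for the logged (state, action, reward, next state) tuples:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO experience replay: the oldest transition is dropped when full."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

# Prefill with transitions logged under MPC (placeholder data here)
buffer = ReplayBuffer()
mpc_transitions = [((0.0, 0.0, 0.0, 0.0), 0.0, 0.0, (0.0, 0.0, 0.0, 0.0))] * 1000
for tr in mpc_transitions:
    buffer.push(*tr)
batch = buffer.sample(256)  # mini-batch for the SAC updates
```

Because SAC is off-policy, the agent can learn from these MPC-generated transitions exactly as it learns from its own.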
As can be seen in Figure 4, after the initialisation with the samples collected by the MPC controller, the agent learns a policy that maximises the expected reward after approximately 50 episodes (or approximately 10^5 steps). Note that in Figure 4, the total reward per episode is highly dependent on the randomly selected significant wave height and peak period; hence, large variations are possible even after training, due to the different levels of energy in the waves. The convergence time corresponds to approximately 5000 s of experience in addition to the previous 25,200 s under MPC control, for a total of approximately 8.4 h. This is a very short time compared with the lifetime of the WEC and provides confidence in the controller being able to deliver adaptive control in practical implementations. However, the negative absorbed powers shown during the first episodes are a serious concern. At the start of learning, the agent prefers random actions to ensure exploration. However, during exploration the device, and in particular the PTO, may fail. Therefore, in the future, a fixed entropy temperature may reduce exploration at the start, and thus its associated risks, if the controller is already initialised with data from a robust controller. This solution is, however, likely to slow down training.

Comparison between SAC and MPC
To assess the performance of the DRL control, MPC and SAC are tested in unseen waves. The traces have a Bretschneider spectrum in the same range of significant wave height and peak wave period, but with different random seeds from those of the training set. They last 1050 s, with the controller initialised after 100 s and the averaging to compute the mean power started after a further 50 s.
The mean useful or net absorbed power for MPC is shown with dashed lines in Figure 5. The reactive power, P_rea, is defined as the power transferred from the PTO to the point absorber, whilst the active or resistive power, P_act, is the power transferred from the absorber to the PTO. The net useful power is thus P_u = P_act - P_rea. For the MPC, the extracted power at higher wave periods is curtailed by the penalty on the slew rate. In Figure 6, the ratio of the reactive to active power for the MPC is shown with dashed lines. Reactive power is primarily used to speed the WEC up in shorter waves, i.e., when the wave period is shorter than the resonant period, whereas passive damping can be used to slow the device down for longer wave periods. The steady increase in the ratio of the reactive to active power at higher wave periods for the MPC is thus unexpected. The greater control effort the further from the resonance period (3.17 s for the point absorber) is, however, also visible in [6] and can be explained by the decrease of the absolute value of the active power in longer waves. Furthermore, a comparison with the case studies in [6,8] shows that the selected value of r_MPC provides a stronger influence in this example.

Figures 5 and 6 also display the performance of the SAC agent before and after training. The dot-dash lines correspond to the SAC agent before training, with the entropy set to zero. In this case, the agent replicates closely the MPC behaviour, even though there are differences in the actual time-domain response. After training, the SAC agent (shown with thick continuous lines) improves the energy absorption for the higher peak wave period values (T_p > 6 s) and the higher significant wave height values (H_s > 1 m). These ranges correspond with the ranges used during training and show poor generalisation ability.
The negative mean absorbed power values for the lower periods close to the WEC's resonant period are particularly worrying. From Figure 6, it is clear that the main cause of this behaviour is the aggressive policy that the SAC agent selects. The large flows of reactive power are useful for periods smaller than the resonant period, i.e., in shorter waves, but unhelpful close to the resonant period or for long wave periods, where damping is more useful to slow the WEC down. The problem may be caused by the low resonant period of the point absorber. A larger device, whose resonant period is within the typical ocean wave period range, should be selected in the future to assess the behaviour of the controller in both short and long waves.
In Figure 7, the magnitude of the maximum heave displacement and PTO force can be seen. Although the SAC algorithm presents higher displacements than MPC, the maximum value of 2.5 m is not exceeded. This hints at the efficacy of the discontinuous penalty term in (10). However, designing a method to guarantee constraint handling for the displacement is critical for the DRL controller to find an industrial application in the future. The aggressive behaviour of the SAC agent is further underlined in Figure 7b, where the peak PTO force is hit in all sea states. The aggressiveness of the controller can be reduced by increasing the penalty on the PTO force (through w_u). The poor ability of the SAC agent to generalise to unseen wave conditions is problematic and symptomatic of a possibly over-simplistic selection of the state space. In [34], the information from a number of past time steps is captured in the state space to ensure convergence for the control of a walking robot. A similar approach will be needed for the control of a WEC to capture the oscillatory nature of gravity waves, similarly to MPC. Furthermore, it is clear that the experience replay buffer should include data samples from a broad range of sea states, in particular with regard to periods both below and above the resonance period of the device. Currently, the memory buffer is updated with new samples by removing the oldest sample when the memory is full. This technique will be changed by binning the data by wave period and height and ensuring a minimum number of samples per bin.

Table 3 shows the computational time required to train the SAC controller (over 100 episodes) and to run a simulation of the WEC lasting 1050 s using the MPC and trained SAC schemes. The mean from the 28 simulations employed to compare the two strategies is used. Note that the first 150 s are needed to initialise the WEC dynamics and power averaging and that the control time step is 0.2 s for both algorithms.
Hence, there are 4501 control time steps per simulation, leading to the values for the computational time per time step shown in Table 3. The simulations are run on a laptop with an Intel i5, 2.3 GHz, dual-core processor and 16 GB of RAM. As can be seen in Table 3, the SAC algorithm requires approximately 15 min to train over 100 episodes for the analysed point absorber. The large computational time prevents an on-line application, although training can happen regularly off-line in practice, once significantly large new batches of data are collected. Conversely, once trained, the computational effort is 40 times lower than the simulation control time step, thus enabling a real-time implementation with ease. Additionally, the computational effort associated with SAC is one order of magnitude smaller than for linear MPC. In fact, the computational time per control time step shown in Table 3 is overly conservative for SAC, as it includes the overhead from the dynamic simulation in Python. Conversely, the linear MPC code is implemented in a very efficient MATLAB/Simulink script with C-coded S-functions. Therefore, a gain in performance as high as one additional order of magnitude is expected from compiled solutions if the system is to be implemented on an actual WEC [45].
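The binned experience replay proposed earlier could take a form like the following sketch, where each (significant wave height, peak period) bin keeps its own FIFO memory so that no sea state is completely evicted; the bin edges and capacity are assumptions:

```python
import random
from collections import defaultdict, deque

class BinnedReplayBuffer:
    """Replay buffer binned by sea state, guaranteeing that old sea states
    are never fully displaced by newer samples (a sketch of the proposal)."""
    def __init__(self, per_bin_capacity=5000):
        self.bins = defaultdict(lambda: deque(maxlen=per_bin_capacity))

    def push(self, hs, tp, transition):
        # Bin edges of 0.5 m in height and 1 s in period, matching the
        # discretisation of the training sea states (an assumption)
        key = (round(hs / 0.5) * 0.5, round(tp))
        self.bins[key].append(transition)

    def sample(self, batch_size):
        pooled = [tr for b in self.bins.values() for tr in b]
        return random.sample(pooled, min(batch_size, len(pooled)))
```

Each bin caps its own size, so a long run in one sea state cannot push every sample from rarer sea states out of memory.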
For reproducibility, the results of the SAC algorithm in the unseen test wave traces, including the wave elevation, vertical velocity and excitation force, can be accessed on GitHub.

Conclusions
In this article, an established convex MPC has been used to generate observations for a heaving point absorber in a range of irregular waves in a linear simulation environment. The samples have been used to initialise a DRL agent, which learns an optimal policy from direct interactions with the environment for the maximisation of the energy absorption. By being off-line and off-policy, the SAC algorithm enables the training to be decoupled from deployment, thus shifting the computational effort to training. This is a fundamental trait, as the control of large groups of WECs in the future to achieve economies of scale is reliant on having an effective, real-time centralised strategy.
The DRL control improves the energy absorption of the point absorber over convex MPC for wave periods higher than the resonant period of the device, whilst meeting the displacement and force constraints. This is achieved by adopting a more aggressive policy with higher slew rate. However, poorer performance is shown for lower wave height and period values. These problems will be addressed by reformulating the state-space, updating the reward function and the sampling of data for the experience replay. Additionally, the exploration will be reduced from the start of training to prevent the controller from taking actions that damage the PTO. Furthermore, the DRL controller will be tested in a simulation environment inclusive of nonlinear effects, e.g., nonlinear static and dynamic Froude-Krylov forces as in [42] and viscous drag. A sensitivity analysis will be run to assess the impact of modelling errors on the performance of the DRL and MPC algorithms.