Refined Continuous Control of DDPG Actors via Parametrised Activation

Continuous action spaces pose a serious challenge for reinforcement learning agents. While several off-policy reinforcement learning algorithms provide a universal solution to continuous control problems, the real challenge lies in the fact that different actuators feature different response functions due to wear and tear (in mechanical systems) and fatigue (in biomechanical systems). In this paper, we propose enhancing actor-critic reinforcement learning agents by parameterising the final layer in the actor network. This layer produces the actions and accommodates the behaviour discrepancy of different actuators under different load conditions during interaction with the environment. To achieve this, the actor is trained to learn the tuning parameters controlling the activation layer (e.g., Tanh and Sigmoid). The learned parameters are then used to create tailored activation functions for each actuator. We ran experiments on three OpenAI Gym environments, i.e., Pendulum-v0, LunarLanderContinuous-v2 and BipedalWalker-v2. Results showed an average increase of 23.15% and 33.80% in total episode reward on the LunarLanderContinuous-v2 and BipedalWalker-v2 environments, respectively. There was no apparent improvement in the Pendulum-v0 environment, but the proposed method produced a more stable actuation signal compared to the state-of-the-art method. The proposed method allows the reinforcement learning actor to produce more robust actions that accommodate the discrepancy in the actuators' response functions. This is particularly useful for real-life scenarios where actuators exhibit different response functions depending on the load and the interaction with the environment. It also simplifies the transfer learning problem by fine-tuning the parameterised activation layers instead of retraining the entire policy every time an actuator is replaced.
Finally, the proposed method would allow better accommodation of biological actuators (e.g., muscles) in biomechanical systems.


Introduction
Deep reinforcement learning (DRL) has been used in different domains and has achieved good results on tasks such as robotic control, natural language processing and biomechanical control of human models (Kidziński et al., 2018, 2020; Kober et al., 2013; Mnih et al., 2015).
While DRL has proven to handle discrete problems effectively and efficiently, continuous control remains a challenging task. This is because it relies on physical systems which are prone to noise due to wear and tear, overheating, and altered actuator response functions depending on the load each actuator bears; this is most apparent in robotic and biomechanical control problems. In the robotic control domain, for instance, bipedal robots robustly perform articulated motor movements under complex environments and limited resources. These robust movements are achieved using highly sophisticated model-based controllers. However, the motor characteristics are highly dependent on the load and the interaction with the environment. Thus, adaptive continuous control is required to adapt to new situations.
Biomechanical modelling and simulation present an even clearer example. In a biomechanical system, human movement is performed using muscle models (Millard et al., 2013; Thelen, 2003). These models simulate muscle functions, which are complex and depend on multiple parameters, such as maximum contraction velocity, optimal muscle length and maximum isometric force, to name a few (Zajac, 1989).
The common challenge facing the training of DRL agents on continuous action spaces is the flow of the gradient update throughout the network. The current state of the art relies on a single configuration of the activation function producing the actuation signals. However, different actuators exhibit different transfer functions, and noisy feedback from the environment propagates through the entire actor neural network, imposing drastic changes on the learned policy. The solution we propose in this work is to use multiple actuation transfer functions that allow the actor neural network to adaptively modify the actuation response functions to the needs of each actuator.
In this paper, we present a modular perspective of the actor in actor-critic DRL agents and propose modifying the actuation layer to learn the parameters defining the actuation-producing activation functions (e.g., Tanh and Sigmoid). It is important to emphasise the difference between parameterised action spaces and parameterised activation functions. In reinforcement learning, a parameterised action space commonly refers to a discrete action space in which each action is accompanied by one or more continuous parameters (Masson et al., 2016). It has been used to solve problems such as RoboCup (Kitano et al., 1997), a robot world-cup soccer competition (Hausknecht and Stone, 2015). On the other hand, parameterised activation functions, such as PReLU (He et al., 2015) and SELU (Klambauer et al., 2017), were introduced to combat overfitting and saturation problems. In this paper, we adopt parameterised activation functions to improve the performance of the deep deterministic policy gradient (DDPG) agent and accommodate the complex nature of real-life scenarios. The rest of this paper is organised as follows. Related work is discussed in Section 2. The proposed method is presented in Section 3. Experiments and results are presented in Section 4. Finally, Section 5 concludes and introduces future advancements.

Background
Deep deterministic policy gradient (DDPG) is a widely adopted deep reinforcement learning method for continuous control problems (Lillicrap et al., 2015). A DDPG agent relies on three main components: the actor, the critic and the experience replay buffer (Lillicrap et al., 2015).
In the actor-critic approach (Sutton et al., 1998), the actor neural network reads observations from the environment and produces actuation signals. After training, the actor neural network serves as the controller which allows the agent to navigate the environment safely and perform the desired tasks. The critic network estimates the anticipated reward based on the current observation and the actor's action. In control terms, the critic network serves as a black-box system identification module that provides guidance for tuning the parameters of a PID controller. The observation, action, received reward and next-state observation are stored as an experience in a circular buffer. This buffer serves as a pool of experiences from which samples are drawn to train the actor and the critic neural networks to produce the correct action and estimate the correct reward, respectively.
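The circular buffer described above can be sketched as a fixed-capacity pool of experiences; this is an illustrative implementation under our own naming, not the authors' code:

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular pool of (observation, action, reward, next_observation) experiences."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest experience once full,
        # giving the circular behaviour described in the text.
        self.buffer = deque(maxlen=capacity)

    def store(self, obs, action, reward, next_obs):
        self.buffer.append((obs, action, reward, next_obs))

    def sample(self, batch_size):
        # Uniformly draw a batch to train the actor and critic networks.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Drawing uniformly from such a pool decorrelates consecutive experiences, which is what makes off-policy training of the actor and critic stable.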
There are different DDPG variations in the literature. In (Fujimoto et al., 2018), a twin delayed DDPG (TD3) agent was proposed to limit overestimation by using the minimum value between a pair of critics instead of a single critic. In (Barth-Maron et al., 2018), it was proposed to extend DDPG into a distributed process to allow better accumulation of experiences in the experience replay buffer. Other off-policy deep reinforcement learning agents, such as soft actor-critic (SAC), are inspired by DDPG although they rely on stochastic parameterisation (Haarnoja et al., 2018a,b). In brief, SAC adopts the reparameterisation trick to learn a statistical distribution of actions from which samples are drawn based on the current state of the environment.

DDPG Challenges
Perhaps the most critical challenge of DDPG, and of off-policy agents in general, is sample inefficiency. The main reason behind this challenge is that the actor is updated using gradients calculated by the continuously trained critic neural network. These gradients are noisy because they rely on the outcomes of the simulated episodes. Therefore, the presence of outlier scenarios impacts the training of the actor and constantly changes the learned policy instead of refining it. This is the main reason off-policy DRL training algorithms require maintaining a copy of the actor and critic neural networks to avoid divergence during training.
While radical changes in the learned policy may provide good exploratory behaviour, they come at the cost of requiring many more episodes to converge. Additionally, it is often recommended to keep controllable exploratory noise parameters separate from the policy, either by injecting actuation noise such as Ornstein-Uhlenbeck noise (Uhlenbeck and Ornstein, 1930) or by maximising the entropy of the learned actuation distribution (Haarnoja et al., 2018a,b). Practically, however, for very specific tasks, and most continuous control tasks are very specific, faster convergence is often a critical aspect to consider. Another challenge, which stems from practical applications, is the fact that actuators are physical systems and are susceptible to having differently characterised transfer functions in response to the supplied actuation signals. These characterisation discrepancies are present in almost every control system due to wear and tear, fatigue, overheating, and manufacturing factors. While minimal discrepancies are easily accommodated by a PID controller, they impose a serious problem for deep neural networks. This problem, in turn, imposes a serious challenge during deployment and scaling operations.

Proposed Method
In order to address the aforementioned challenges, we propose parameterising the final activation function to include scaling and translation parameters k and x_0. In our case, we used tanh(kx − kx_0) instead of tanh(x) to allow the actor neural network to accommodate the discrepancies of the actuator characteristics by learning k and x_0. The added learnable parameters empower the actor with two additional degrees of freedom.
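As a minimal pure-Python illustration (not the training code), the proposed activation simply rescales and shifts the argument of tanh:

```python
import math

def parameterised_tanh(x, k=1.0, x0=0.0):
    """tanh(k*x - k*x0): k controls the slope, x0 translates the curve."""
    return math.tanh(k * (x - x0))

# With k = 1 and x0 = 0 this reduces to the standard tanh.
assert parameterised_tanh(0.5) == math.tanh(0.5)

# A larger k steepens the response around x0 ...
steep = parameterised_tanh(0.2, k=10.0)
# ... while x0 shifts the point at which the output crosses zero.
shifted = parameterised_tanh(0.3, k=1.0, x0=0.3)
```

Learning k and x_0 per actuator lets each output channel adopt its own slope and operating point without changing the rest of the policy.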

Modular Formulation
In a typical DRL agent, an actor consists of several neural network layers. While the weights of all layers collectively serve as a policy, they serve different purposes based on their interaction with the environment. The first layer encodes observations from the environment, and thus we propose to call it the observer layer. The encoded observations are then fed into several subsequent layers, which we call the policy layers. Finally, the output of the policy layers is usually fed to a single activation function. Throughout this paper, we denote the observer, policy and action parts of the policy neural network as π_O, π_P and π_A, respectively. We also denote the observation space, the encoded observation space, the pre-mapped action space and the final action space as O, Õ, Ã and A, respectively. To that end, the data flow of an observation o_t ∈ O through the policy π to produce an action a_t ∈ A can be summarised as a_t = π_A(π_P(π_O(o_t))), where π_O : O → Õ, π_P : Õ → Ã and π_A : Ã → A.
In a typical actor, there is no distinction between the observer and policy layers. Also, the actuation layer is simply regarded as the final activation function π_A(x) = tanh(x), and thus the actor is typically modelled as one multi-layer perceptron (MLP). The problem with having π_A as tanh is that it assumes that all actuators in the system exhibit the same actuation-torque characterisation curves under all conditions. Allowing extra degrees of freedom empowers the actor neural network to accommodate outlier scenarios with minimal updates to the actual policy.
[Figure: results from the proposed actor trained and tested on the bipedal walker environment; colour brightness indicates different stages throughout the episode, from start (bright) to finish (dark).]

Parameterising π A
Because actuation characterisation curves differ based on each actuator's role and interaction with the environment, using a single activation function forces the feedback provided by the environment to propagate through the gradients of the entire policy. Therefore, we chose a parameterised π_A(kx − kx_0) to model the scaling and translation of the activation function, and thus the data flow in Eq. 2 can be expanded as ã_t = π_P(π_O(o_t)), k_t = π_k(ã_t), x_0,t = π_x0(ã_t), a_t = π_A(k_t ã_t − k_t x_0,t), where π_x0 and π_k are simple fully connected layers and π_A remains an activation function (i.e. tanh), as shown in Fig. 1.
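A sketch of this architecture in PyTorch; the layer sizes, the hidden-feature input to the k and x_0 branches, and the sigmoid bound on k are our assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ParameterisedActor(nn.Module):
    """Actor split into observer, policy and action parts, with the final
    tanh parameterised per actuator by learned k and x0."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.observer = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # pi_O
        self.policy = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, act_dim))               # pi_P
        # Two parallel fully connected branches inferring k and x0 per actuator.
        self.k_head = nn.Linear(hidden, act_dim)                              # pi_k
        self.x0_head = nn.Linear(hidden, act_dim)                             # pi_x0

    def forward(self, obs):
        h = self.observer(obs)
        x = self.policy(h)
        # Keep k positive and bounded (assumed range) so the activation is
        # neither flat nor square-wave shaped.
        k = 0.1 + 5.0 * torch.sigmoid(self.k_head(h))
        x0 = torch.tanh(self.x0_head(h))
        return torch.tanh(k * (x - x0))                                       # pi_A = tanh(kx - k*x0)
```

Because k and x0 are produced per actuator and per observation, each output channel gets its own tailored activation curve while the policy layers stay shared.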
Adjusting the activation curves based on the interaction with the environment allows the policy to remain intact and thus leads to more stable training, as discussed in the following section. Figure 2 shows the learned parameterised tanh activation functions for the bipedal walker problem.
While automatic differentiation engines are capable of adjusting the flow of gradient updates, there are two implementation considerations to factor into the design of the actor. First, the scale degree of freedom, parameterised by k, affects the slope of the activation function in the case of tanh and sigmoid. A very small k < 0.1 renders π_A almost constant, while a very high k > 25 produces a square signal. Both extreme cases pose problems for the gradient calculations. On one hand, a constant signal produces zero gradients and prevents the policy neural network from learning. On the other hand, a square signal produces unstable, exploding gradients. Another problem occurs when k < 0, which reverses the behaviour of the produced signals. Therefore, we recommend using a bounded activation function after π_k when estimating k_t.
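One way to realise the recommended bounded activation after π_k is a scaled sigmoid; a minimal sketch, where treating the discussed limits as hard bounds is our assumption:

```python
import math

K_MIN, K_MAX = 0.1, 25.0  # limits discussed above, assumed here as hard bounds

def bounded_k(z):
    """Map an unbounded pre-activation z to a slope k in (K_MIN, K_MAX)."""
    return K_MIN + (K_MAX - K_MIN) / (1.0 + math.exp(-z))
```

This keeps k strictly positive, so the sign of the produced signal never flips, and keeps it away from both the flat (k < 0.1) and square-wave (k > 25) regimes regardless of what the fully connected layer outputs.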
Second, the translation degree of freedom, parameterised by x_0, allows translating the activation function to an acceptable range, which prevents saturation. However, this may, at least theoretically, allow the gradients of the policy π_P and observer π_O layers to grow monotonically for as long as x_0,t can compensate. This, in turn, may cause an exploding gradient problem. In order to prevent this problem, we recommend employing weight normalisation after calculating the gradients (Salimans and Kingma, 2016).

Experiments and Results
In order to test the efficacy and stability of the proposed method, we trained a DDPG agent with and without the proposed learnable activation parameterisation.
Both models were trained and tested on three OpenAI Gym environments, shown in Fig. 3: Pendulum-v0, LunarLanderContinuous-v2 and BipedalWalker-v2. For each environment, six checkpoint models were saved (the best model for each seed). The saved models were then tested for 20 trials with new random seeds (10 episodes with 500 steps each). The total number of test episodes is 1,200 for each environment. The results of the three environments are tabulated in Tab. 1 and Tab. 2.

Models and Hyperparameters
The action mapping network is where the proposed and classical models differ. The proposed model branches the final layer into two parallel fully connected layers to infer the parameters k and x_0 of the tanh(kx − kx_0) activation function. The classical model adds two more fully connected layers separated by a tanh activation function. The added layers ensure that the number of learnable parameters is the same in both models, guaranteeing a fair comparison.
Both models were trained on the three environments for the same number of episodes (200 steps each); however, the number of steps may vary due to early termination. The models were trained with 5 different pseudo-random number generator (PRNG) seeds. We set the experience replay buffer to 10^6 samples. We chose the Adam optimiser for the back-propagation optimisation and set the learning rate of both the actor and the critic to 1e-3, with the first and second moments set to 0.9 and 0.999, respectively. We set the reward discount γ = 0.99 and the soft update coefficient of the target neural networks τ = 0.005. We also added simple Gaussian noise with σ = 0.25 to allow exploration. During training we saved the best model (i.e. checkpoint). DDPG hyper-parameter tuning is thoroughly articulated in (Lillicrap et al., 2015).
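The soft target-network update with τ = 0.005 mentioned above follows the standard Polyak averaging rule used by DDPG; a minimal sketch over plain Python lists (a real implementation would operate on the network parameter tensors):

```python
def soft_update(target_params, source_params, tau=0.005):
    """Polyak averaging: target <- tau * source + (1 - tau) * target.

    With a small tau, the target networks trail the trained networks
    slowly, which damps divergence during off-policy training.
    """
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

For example, a target parameter of 0.0 tracking a source parameter of 1.0 moves only to 0.005 after one update, illustrating how slowly the targets drift.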

Inverted Pendulum Results
In the inverted pendulum problem (Fig. 4), the improvement is insignificant because the environment features only one actuator. However, the policy adopted by the proposed agent features a fine balance of actuation signals. In contrast, the classical MLP/Tanh model exerts additional oscillating actuation signals to maintain the achieved optimal state, as shown in Fig. 4(e). This oscillation imposes a wear-and-tear challenge on mechanical systems and fatigue risks in biomechanical systems. While this difference is reflected only minimally in the environment reward, it is often a critical consideration in practical applications.

Lunar Lander Results
Figure 5 shows the training and reward curves of the lunar landing problem. The instant reward curve of the lunar landing problem demonstrates an interesting behaviour in the first 100 steps. The classical method adopts an energy-conservative policy by shutting down the main throttle and engaging in free fall for 25 steps to a safe margin, then hovering above ground to figure out a safe landing. The conserved energy contributes to the overall reward at each time step. While this allows for faster reward accumulation, this policy becomes less effective under different initial conditions. Depending on the speed and the attack angle, the free-falling policy requires additional effort for manoeuvring the vehicle to the landing pad. The proposed agent, on the other hand, accommodates the initial conditions and slows down the vehicle in the direction opposite to the entry angle to maintain a stable orientation, thus allowing for smoother lateral steering towards a safe landing, as shown in Fig. 6 (a and c). It is worth noting that neither agent performed any type of reasoning or planning. The main difference is the additional degrees of freedom the proposed parametrised activation function offers.
These degrees of freedom allow the actor neural network to adopt different response functions to accommodate the initial conditions.

Bipedal Walker Results
Training and reward curves of the bipedal walking problem are illustrated in Fig. 7. In general, the agent with the proposed action mapping outperforms the classical agent in the training, step and episode reward curves, as shown in Fig. 7(a, b, c). The spikes in the step reward curves show instances where agents lost stability and failed to complete the task. The episode reward curve shows that the proposed method allows the agent to run and complete the task faster. This is due to better coordination between the left and right legs while taking advantage of gravity to minimise effort. This is demonstrated in Fig. 7-d, where the proposed agent maintains pelvis angular velocity and vertical velocity close to zero. This, in turn, dedicates the majority of the spent effort towards moving forward. It is also reflected in Fig. 7-e, where the actuation of the proposed agent stabilises faster around zero and thus exploits the gravitational force. In contrast, the classical agent spends more effort balancing the pelvis and thus takes longer to stabilise its actuation. Finally, the locomotion actuation patterns in Fig. 7-e demonstrate the difference between the adopted policies. The classical agent relies more on locomoting using Knee2, while the proposed agent provides more synergy between joint actuators. This difference in exploiting gravity during locomotion is an essential key to successful bipedal locomotion as "controlled falling" (Novacheck, 1998).

Conclusions
In this paper, we discussed the advantages of adding parameterisation degrees of freedom to the actor in the DDPG actor-critic agent. The proposed method is simple and straightforward, yet it outperforms the classical actor which utilises the standard tanh activation function. The main advantage of the proposed method lies in producing stable actuation signals, as demonstrated in the inverted pendulum and bipedal walker problems. Another advantage, apparent in the lunar landing problem, is the ability to accommodate changes in initial conditions. This research highlights the importance of parameterised activation functions. While the discussed advantages may be minimal for the majority of supervised learning problems, they are essential for dynamic problems addressed by reinforcement learning. This is because reinforcement learning methods, especially off-policy ones, rely on previous experiences during training.
The advantage of the proposed method in the bipedal walking problem and the wide variety of activation functions demonstrated in Fig. 2 suggest promising potential for solving several biomechanics problems where different muscles have different characteristics and response functions. Potential applications include fall detection and prevention (Abobakr et al., 2018), ocular motility and the associated cognitive load and motion sickness (Attia et al., 2018a; Iskander et al., 2018a, 2019), as well as intent prediction of pedestrians and cyclists (Saleh et al., 2018, 2020). The stability of training using the parameterised tanh in an actor-critic architecture also shows potential for advancing Generative Adversarial Network (GAN) research for image synthesis (Attia et al., 2018b).
This research can be expanded in several directions. First, the parameterisation of tanh can be extended from deterministic (presented in this paper) to stochastic by inferring the distributions of k and x_0. Second, the separation between the policy and action parts of the actor neural network allows preserving the policy part while fine-tuning only the action part to accommodate actuator characterisation discrepancies due to wear and tear during operation. Finally, the modular characterisation of the actor neural network into observer, policy and action parts invites investigating scheduled training that locks and unlocks these parts alternately to maximise the dedicated function each part of the actor carries out.