Just Don’t Fall: An AI Agent’s Learning Journey Towards Posture Stabilisation

Abstract: Learning to maintain postural balance while standing requires significant, fine coordination between the neuromuscular system and the sensory system. It is one of the key contributing factors towards fall prevention, especially in the older population. Using artificial intelligence (AI), we can similarly teach an agent to maintain a standing posture, and thus teach the agent not to fall. In this paper, we investigate the learning progress of an AI agent and how it maintains a stable standing posture through reinforcement learning. We used the Deep Deterministic Policy Gradient (DDPG) method and the OpenSim musculoskeletal simulation environment based on OpenAI Gym. During training, the AI agent learnt three policies. First, it learnt to maintain the Centre-of-Gravity and Zero-Moment-Point in front of the body. Then, it learnt to shift the load of the entire body onto one leg while using the other leg to fine-tune the balancing action. Finally, it started to learn the coordination between the two pre-trained policies. This study shows the potential of using deep reinforcement learning in human movement studies. The learnt AI behaviour also exhibited attempts to achieve an unplanned goal because it correlated with the set goal (e.g. walking in order to prevent falling). The failed attempts to maintain a standing posture are an interesting by-product which can enrich fall detection and prevention research efforts.


Introduction
Postural balance is one of the key contributing factors towards fall prevention [1]. Contrary to a common misconception, maintaining a standing posture requires a huge coordination effort between the neuromuscular and sensory systems. It also involves coordinated activity of muscles [2,3]. In contrast, walking has been referred to as controlled falling [4]. Walking, which is similar to a swinging pendulum, utilises the potential energy preserved by the upper body and actuated by gravity, where efficiency in walking is maintained by the effective interchange between potential and kinetic energy [4][5][6]. In addition, from a standing posture, gait is initiated by the de-innervation of the muscles responsible for maintaining balance, which causes the body to fall forwards; a series of coordinated protective steps is then performed [7,8]. In postural balance, walking, running and other human movements, the coordinated motor control of the neuro-musculoskeletal units (muscles) plays an essential role [9].
Recently, machine learning techniques have been used alongside statistical analysis tools in human movement studies, with emphasis on classification, prediction and estimation tasks [25]. Bipedal and quadrupedal robotic movement has also been studied extensively, and the use of different machine learning techniques has been investigated [26][27][28]. Reinforcement learning (RL) [29], and especially deep reinforcement learning (DRL), which utilises deep learning techniques, is slowly but steadily providing solutions to many simulation studies [30][31][32][33]. Reinforcement learning consists of observing an environment and then learning to perform actions that maximise a score [29]. Learning to maximise the score is done through exploration of the environment and exploitation of past experience, where the challenge is how to balance exploration and exploitation. Continuous action spaces still pose a bigger challenge for agents, but more sophisticated learners have been developed to address it [34][35][36][37]. Recently, DRL was used to train a human musculoskeletal model, based on OpenSim models, to walk and run [31][32][33] and even to walk with a prosthetic leg [30]. Pilotto et al. discussed the importance of technology in the advancement of geriatrics and the prevention of falls in the elderly population [38]. While the results reported by Lee et al. demonstrate posture stability, their solution presented in [39] relied on a reference motion collected from participants. In our work, we focused our efforts on self-induced motion leading to posture stabilisation. In this paper, we discuss the use of artificial intelligence (AI), and especially DRL, to learn how not to fall. More specifically, we investigate the learning progress of an AI agent towards maintaining a standing posture. We adopted a modular design of the control neural network by separating the observation from the policy. 
We also used multiple policies, each trained separately, as well as a coordination policy to coordinate between the learnt policies. The rest of this paper is organised as follows. Section 2 describes the materials and methods used in this study. Section 3 reports the results. Section 4 discusses the behaviour learnt by the AI agent. Finally, Section 5 draws conclusions and outlines future directions.

Materials and Methods
In this work, we focus our efforts on studying the muscle control strategies an AI agent can learn in order to prevent falling. This is achieved via biomechanical simulation using OpenSim and reinforcement learning via deep neural networks. The biomechanics simulation serves as the environment and the deep neural network serves as the motor control part of the brain.

Biomechanical Simulation Environment
The environment used is based on OpenSim [31] and OpenAI Gym [40]. The human musculoskeletal model is based on the previous works presented in [41][42][43], and is made up of seven body parts. The head, neck, torso and pelvis are represented as a single body for simplicity; in addition, each leg is represented by three body parts: upper leg, lower leg and foot. The model has 14 degrees of freedom (DoF): 6 DoFs (3 rotational and 3 translational) for the pelvis, 2 rotational DoFs for each hip, 1 rotational DoF for each knee and, finally, 1 rotational DoF for each ankle. The model is actuated by 22 muscles [44], 11 muscles for each leg. The muscles actuating each leg include the hip adductor, hip abductor, hamstrings, biceps femoris, gluteus maximus, iliopsoas, rectus femoris, vastus intermedius, gastrocnemius, soleus, and tibialis anterior. Contact with the ground is modelled using the Hunt-Crossley model [45]. As shown in Fig. 1, for each foot, contact spheres are positioned at the heel and toes, and a rectangular contact plane is placed over the ground. A force is generated when the objects come into contact, which depends on the velocity of the collision and the depth of penetration of the contact objects [31].
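As a rough illustration of how a contact force can depend on both penetration depth and collision velocity, the sketch below evaluates a Hunt-Crossley-style normal force of the form F = k·δ^n + λ·δ^n·δ̇. The function name, the parameter values and the clamping at zero are assumptions made for illustration; they are not the settings of the simulation environment described here.

```python
def hunt_crossley_force(depth, depth_rate, stiffness=1.0e6, damping=1.0e6, exponent=1.5):
    """Normal contact force (N) for a penetration depth (m) and its rate (m/s).

    All default parameter values are hypothetical placeholders.
    """
    if depth <= 0.0:
        return 0.0  # the bodies are not in contact
    elastic = stiffness * depth ** exponent                   # depends on depth only
    dissipative = damping * depth ** exponent * depth_rate    # depends on depth and velocity
    # Assumption: contact can push the bodies apart but never pull them together.
    return max(0.0, elastic + dissipative)
```

For example, `hunt_crossley_force(0.01, 0.5)` is larger than `hunt_crossley_force(0.01, 0.0)`, reflecting the stronger force produced by a faster impact at the same penetration depth.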
The observation fed to the AI agent included 100 values covering ground reaction forces, pelvis velocities, joint angles and muscle state. These readings were grouped as listed in Table 1. Three random values from the environment, summarising a velocity vector field, were incorporated to introduce a randomisation factor to the AI training. Typically, the AI agent should learn how to progress within the environment by observing the score provided by the environment. This score reflects how well or badly the AI agent behaved towards achieving the desired goal. It is worth noting that the AI agent must be completely oblivious to what the desired goal is and thus should learn based on the praise it receives from the environment (much like a toddler learning to stand and walk). The scoring function was designed to provide a positive reward for having contact with the ground via $F_g$ while penalising undesired actions resulting in a change in pelvis height $H$ and in the pelvis velocity terms $\|\dot{p}\|$ and $\|\theta_p\|$. In order to maintain a standing posture, the score was penalised with the magnitude of, and the change in, the joint angles, $\|\theta_J\|$ and $\|\Delta\theta_J\|$. The score also includes a penalty on the adduction angles $\theta_J^{add}$ of both legs in order to prevent the scissor legs posture described in [30]. The muscular state was not incorporated in the score function. The scoring function is then formulated from these terms together with a normalised cost term (Norm. Cost),
where $H_s$ is the baseline pelvis height when standing and $\|\cdot\|$ is the Euclidean norm. All angles were normalised by $\pi$. The Norm. Cost term is the cost normalised by the sum of the weights.
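The structure of such a score can be sketched as a contact reward minus a weighted, normalised cost. The sketch below is only one plausible reading of the description above: the function name, the equal weights, the unit contact reward and the exact way the terms are combined are all assumptions, not the authors' formula.

```python
import math

def norm(values):
    """Euclidean norm of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values))

def standing_score(f_g, h, h_s, p_dot, theta_p, theta_j, d_theta_j, theta_add, weights=None):
    """Hypothetical standing score: contact reward minus a normalised cost."""
    # Assumption: equal weights; the weights used in the study are not given in the text.
    w = weights or {"height": 1.0, "lin_vel": 1.0, "rot": 1.0,
                    "angles": 1.0, "angle_change": 1.0, "adduction": 1.0}
    costs = {
        "height": abs(h - h_s),                                      # change in pelvis height
        "lin_vel": norm(p_dot),                                      # pelvis velocity term
        "rot": norm(theta_p),                                        # pelvis rotational term
        "angles": norm([a / math.pi for a in theta_j]),              # joint angles, normalised by pi
        "angle_change": norm([a / math.pi for a in d_theta_j]),      # change in joint angles
        "adduction": norm([a / math.pi for a in theta_add]),         # discourages scissor legs
    }
    # Cost normalised by the sum of the weights.
    norm_cost = sum(w[k] * costs[k] for k in costs) / sum(w[k] for k in costs)
    reward = 1.0 if f_g > 0.0 else 0.0  # assumption: unit reward for ground contact
    return reward - norm_cost
```

With these assumptions, a motionless agent standing at the baseline height with neutral joints and ground contact scores 1.0, and any deviation in height, velocity or joint angles lowers the score.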

Artificial Neural Network Model
For any AI agent, the choice of an artificial neural network (ANN) is largely affected by the inputs provided from the environment, for several reasons. First, the dynamic range of input values from the environment differs based on the measured phenomenon. For example, joint angles range over [−π, +π] and increase on a spherical scale, while muscle length is normalised to [0, 1] and increases linearly. Even with normalisation of the joint angles by π (θ_i/π), the rate of change remains different. The second challenge was information leakage between different measured values during the training of the ANN. To address these two challenges, we designed smaller neural networks to act as mini-observers, each trained to encode the input values of the phenomena sharing the same dynamic range and purpose in the simulation. The outputs of these mini-observers were then concatenated into one encoded output. The complexity of the task also affected the design of the ANN. Specifically, during training, the neural network attempts to minimise the error between the estimated output and the target output via calculated update gradients. These gradients update all the parameters of the neural network at once, which prevents the network from solving sub-tasks (e.g. joint flexion and extension) on the way to solving the desired task (e.g. do not fall). There are two solutions to this problem. The first is to enable only certain parts of the neural network to train while locking the rest of the network. The other solution, which is bio-inspired, is to expand the network as needed. In this work, we manually adopted the second solution by training each policy separately with different initialisation parameters and then training a coordination policy to derive a mixture of the actuation signals from the different policies. 
Ideally, this expansion should be done automatically using the NeuroEvolution of Augmenting Topologies (NEAT) algorithm [46]. NEAT is an evolutionary algorithm which relies on generating several MLP architectures and harnessing the power of mutation and cross-over for exploration over thousands of generations. However, the NEAT algorithm is more suitable for figuring out the MLP topology (i.e. the policy). To the best of our knowledge, there is no evidence that it can be extended to an entire deep neural network architecture, especially given the computationally expensive training and the vast amount of data and/or trials required. While Uber is now investing in extending the NEAT algorithm to deep learning [47], meta-learning research [48] is perhaps the closest approach to extending the NEAT algorithm to an entire deep learning architecture.
To that end, as shown in Fig. 1, the AI model proposed in this work consists of three types of neural networks: mini-observer networks (one for each modality provided by the environment), policy networks (one for each sub-task) and a coordination network that combines the actions from the different policies into a final actuation signal.
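The data flow through the three kinds of networks can be sketched as a single forward pass. This is a minimal, untrained illustration: the split of the observation into modalities, the layer sizes, the single-layer networks and the softmax mixing of policy outputs are all assumptions chosen for brevity, not the architecture in Fig. 1.

```python
import math
import random

rng = random.Random(0)

def dense(n_in, n_out):
    """Randomly initialised weight matrix (n_out rows) and a zero bias vector."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(layer, x, activation):
    weights, bias = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    peak = max(zs)
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical split of the observation into modalities (not the grouping of Table 1).
MODALITIES = {"ground_forces": 6, "pelvis": 9, "joints": 19, "muscles": 66}
ENCODING, N_MUSCLES, N_POLICIES = 8, 22, 2

observers = {name: dense(size, ENCODING) for name, size in MODALITIES.items()}
policies = [dense(ENCODING * len(MODALITIES), N_MUSCLES) for _ in range(N_POLICIES)]
coordinator = dense(ENCODING * len(MODALITIES), N_POLICIES)

def actuate(observation):
    """Map a dict of per-modality readings to 22 muscle activations in [0, 1]."""
    encoded = []
    for name in MODALITIES:  # each mini-observer encodes only its own modality
        encoded += forward(observers[name], observation[name], math.tanh)
    proposals = [forward(p, encoded, sigmoid) for p in policies]
    mix = softmax(forward(coordinator, encoded, lambda z: z))  # one weight per policy
    return [sum(m * prop[i] for m, prop in zip(mix, proposals)) for i in range(N_MUSCLES)]
```

Keeping each modality behind its own encoder means that inputs with different dynamic ranges never share first-layer weights, which is the separation the mini-observers are meant to provide.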

Reinforcement Training Procedure
Due to the dynamic nature of the problem, we chose to utilise deep reinforcement learning. We adopted the Deep Deterministic Policy Gradient (DDPG) method because of its impressive results in continuous action spaces [34]. The DDPG model consists of an actor network (described above) and a critic network to evaluate the actions produced by the actor in relation to the environment. The critic network takes the actions produced by the actor network and the values obtained from the environment and produces a score [34,49,50]. In our setup, the actor is the entire coordinated multi-policy shown in Fig. 1 while the critic is a classical multi-layer-perceptron (MLP) neural network model which takes the observations and coordinated action as input and produces an estimated score. The estimated score is then compared to the actual score reported from the environment.
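The comparison between the critic's estimated score and the score reported by the environment can be sketched with the standard DDPG ingredients: a bootstrapped target, a mean squared error loss and slowly updated target networks. The helper names are hypothetical, and the default discount factor and update rate are values commonly used with DDPG rather than the ones used in this study.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Target for the critic: observed reward plus discounted estimate of the next state."""
    return reward + (0.0 if done else gamma * next_q)

def critic_loss(q_estimates, targets):
    """Mean squared error between the critic's estimates and the targets."""
    return sum((q - y) ** 2 for q, y in zip(q_estimates, targets)) / len(targets)

def soft_update(target_params, online_params, tau=0.001):
    """Move the target network a small step towards the online network."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

Here `next_q` stands for the target critic's estimate for the next observation and the target actor's action; when an episode ends (for example, after a fall), `done` is true and only the observed reward is used.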
In our experiments, we altered the training algorithm to suit the incremental expansion of the policy neural networks. The proposed training was carried out in two stages: action policy training and policy aggregation. We started by training two policy networks for 10000 steps (roughly 100 episodes). Both policies were initialised with different seeds in order to obtain diversity in the outcome. During the training, the neural network model with the best score was stored. In the second stage, both action policies were combined and a new coordination policy was constructed. The coordination policy works as a switch that chooses the weight of the actuation signals of different muscles and produces the final action. During this stage, the aggregated policies and the coordination policy were trained with a fresh experience replay buffer and a new critic neural network. The rationale behind this is that the environment has changed from the policies' point of view and thus new experiences must be gathered.
When expanding the policy pool with a new policy network in the second stage, the trained weights of the previous policy network are copied to the new untrained policy network instead of using the standard random initialisation. This gives the newly added policy a head start from the current training state and protects the previously trained policies from being drastically altered while adapting to the newly added policy.
During the coordination refinement stage of the training, we prevent all policy networks from training and use the experience to fine tune the coordination network. This allows the coordination network to adapt to the distinctive postural stability strategies adopted by different policies. Locking the policies is also essential to preserve their trained strategies without leaking information from other policies and the coordinator network.
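The two mechanisms described above, initialising a new policy from a trained one and locking the policies during coordination refinement, can be sketched with a toy parameter container. The class and function names are hypothetical, and the plain gradient step stands in for whatever optimiser is actually used.

```python
import copy

class Module:
    """Toy stand-in for a neural network: a flat list of weights and a lock flag."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.trainable = True

    def apply_gradient(self, grads, lr=0.1):
        if not self.trainable:
            return  # locked modules ignore updates
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]

def expand_pool(pool):
    """Add a new policy whose weights start as a copy of the latest trained policy."""
    new_policy = copy.deepcopy(pool[-1])
    new_policy.trainable = True
    pool.append(new_policy)
    return new_policy

def lock_policies(pool):
    """Freeze every policy so that only the coordination network keeps learning."""
    for policy in pool:
        policy.trainable = False
```

In a coordination refinement step, `lock_policies(pool)` would be called before applying updates, so that gradients computed from new experience change only the coordinator's weights.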

Results
During the training of the AI agent, the final goal was clearly defined by the scoring function implemented in the environment. However, in order to achieve this goal, the AI agent had to discover and solve two intermediate tasks: (1) identifying the importance of the centre of gravity (COG); and (2) identifying and exploiting the dominant leg concept.
The training took place in two stages, with the number of policies increasing incrementally from one policy towards a total of three policies.
As illustrated in Fig. 2-top, in the first few training episodes the AI agent explored the extremes of de-innervating (10 episodes) and randomly innervating (100 episodes) the muscle set controlling the body before it discovered the concept of the centre of gravity (COG). This allowed the AI agent to attempt to maintain the COG and Zero-Moment-Point (ZMP) in front of the body. This, in turn, allowed it to fall bottom-first instead of head-first. This resembles the behaviour toddlers exhibit when falling from a standing posture. It is worth noting that neither the COG nor the ZMP was provided as an input to the AI agent. In contrast to toddlers, the newly discovered concept (from the AI agent's perspective) was derived from the need to balance. Toddlers, on the other hand, already grasp this concept during the early stages of locomotion, which are sitting without assistance, crawling, standing with assistance and standing without assistance, with an average of 1.43 ± 2.1 months between different milestones [51]. This is achieved via inputs from the vestibular, visual, and somatosensory systems [52]. During the second stage of the training (two policies), the AI agent learnt to exploit the concept of using a dominant leg [53]. This allowed it to shift the load of the entire body onto one leg while using the other leg to fine-tune the balancing action.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 8 June 2020 doi:10.20944/preprints202006.0046.v2
Figure 2. The AI agent adopted a knee locking strategy while thrusting the pelvis forward to control the centre of gravity. Each milestone was trained for 500000 simulation steps. The final agent was able to maintain balance for 4 seconds on average. The maximum training episode length was 500 simulation steps or 5 seconds (1 step = 0.01 seconds). The solid blue line is the average of 35 test trials and the light blue envelope is the estimated standard error at p < 0.05 using random bootstrapping [54]. Annotations in the figure: "WOW!! There is forward too!! Easy though!! Finally Standing!!!"
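The bootstrapped standard error mentioned above can be estimated by resampling the test trials with replacement and measuring the spread of the resampled means. The sketch below is a generic version of that idea; the function names, the number of resamples and the fixed seed are assumptions, and it does not reproduce the exact procedure of [54].

```python
import random
import statistics

STEP_SECONDS = 0.01  # one simulation step, as stated in the text

def steps_to_seconds(steps):
    """Convert a number of simulation steps to seconds."""
    return steps * STEP_SECONDS

def bootstrap_standard_error(samples, n_resamples=1000, seed=0):
    """Standard deviation of the mean over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    means = [statistics.mean(rng.choices(samples, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(means)
```

Applied to the standing durations of the 35 test trials at a given milestone, this yields the half-width from which an envelope around the mean curve can be drawn; `steps_to_seconds(500)` recovers the 5 second episode limit.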
The first policy tried to prevent backward falls by exploiting the dominant leg and thrusting the pelvis forward. The second policy, however, opted to perform a forward-leaning action by pivoting on the heels and adjusting the ZMP via shifting the weight of the upper body anteriorly, as shown in Fig. 2-bottom. While it abused the newly discovered capability, the fine-tuning of the coordination neural network allowed it to maintain the balance between the new action and the previously learnt actions. Finally, during the fine-tuning and coordination between the two trained policies, the AI agent explored the possibility of expanding the leg base and finally managed to stand.
As shown in Fig. 3, the muscle actuation pattern changed over four milestones. Each milestone was trained for 500000 simulation steps. The maximum training episode length was 500 simulation steps or 5 seconds (1 step = 0.01 seconds). Each milestone was evaluated via 35 test trials. First, the AI agent adopted a strategy with three actuation levels (no, medium and full actuation). While this strategy does not actually maintain balance, it did serve as a foundation for subsequent milestones. In the next milestone, the AI agent adopted a left dominant leg strategy by maximising the actuation of the left gluteus maximus (glut_max_l) muscle to thrust the pelvis forward while locking the left knee in full extension by fully actuating the vastus intermedius (vasti_l) muscle. Accordingly, the right gluteus maximus and vastus intermedius muscles were actuated to achieve dexterous traction with the ground.
This strategy was further refined in the third milestone, which allowed the agent to prolong the balancing action by engaging the hamstrings (hamstrings) and the iliopsoas (iliopsoas) muscles for finer hip and knee control while exploiting alternating foot tapping, as shown in the actuation graphs of the gastrocnemius (gastroc) and tibialis anterior (tib_ant) muscles. In the fourth and final milestone, the AI agent further improved the actuation strategy to maintain a longer standing duration. Because the AI agent adopted a locked knee strategy, it did not attempt to actuate the soleus muscle (soleus). This can be considered a local minimum caused by the negligible weight of the effort penalty term in the score function (Eq. 3).

Discussion
Because each policy is self-contained, the proposed approach is expected to work with other off-policy reinforcement learning algorithms such as Soft Actor Critic (SAC) [55] and Distributed Distributional DDPG (D4PG) [56]. However, while early tests using SAC did show similar behaviour, we anticipate technical challenges with D4PG because of the asynchronous update to the experience replay buffer. In this section we discuss an interesting behaviour of the AI agent, limitations of the study, and future directions.

An Interesting Behaviour
An interesting AI behaviour emerged during an early training stage of the coordination neural network. Because the ultimate goal remains not to fall, the AI agent explored the option of taking a protective step to maintain a better balance. In doing so, the AI agent learnt to take a few coordinated steps, as shown in Fig. 4. This behaviour took place when we injected noise into the AI agent to increase exploration. However, because the score function was designed for standing, the learnt behaviour did not constitute a proper gait cycle. Also, the limited capacity of the AI model may have limited the dexterity of the learnt gait cycle. That being said, the ambition of the AI agent to engage in locomotion as a way to prevent falling remains an interesting behaviour. Considering Novacheck's conclusion that walking is controlled falling [4], the AI behaviour here cannot be considered walking because it is not self-induced. This behaviour is known as the value misalignment problem in the computer science literature. It occurs when the score function is not specific enough to excite the AI agent to learn the desired task, and instead causes the AI agent to engage in an obsessive behaviour of maximising the score by any means necessary. This problem poses a paradox, because a very specific score function may lead the neural networks to overfit on the observed training scenarios and fail to generalise to other variations in the environment. It is worth noting that this locomotion attempt (3 steps) could not be reproduced with the same coordination demonstrated in the video attached in the supplementary material.
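Noise injection of the kind mentioned above is commonly implemented by perturbing the actor's output before it is sent to the environment. The text does not say which noise process was used; the sketch below assumes independent Gaussian noise and muscle activations bounded to [0, 1], and the function name and default noise scale are hypothetical.

```python
import random

def noisy_action(action, sigma=0.1, rng=None):
    """Add Gaussian exploration noise to each activation and clip to [0, 1]."""
    rng = rng or random.Random()
    return [min(1.0, max(0.0, a + rng.gauss(0.0, sigma))) for a in action]
```

A larger `sigma` makes the agent try actuation patterns further from its current policy, which is one way a standing-oriented agent could stumble upon stepping behaviour.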

Limitations
It is worth noting that, theoretically, the AI agent should be able to achieve the same result using only two policies (ideally a much deeper single policy). This limitation is usually addressed via model pruning, which discards the redundant parts of the neural network. However, when applying model pruning, the performance (measured by standing duration) dropped by 50% and the model could not sustain a standing posture for more than two seconds. This suggests that the two policies do indeed contribute to posture stabilisation. We also noticed that after allowing training on a third policy, the AI agent discovered a new sub-task of slowly spreading the feet laterally to achieve a wider base. This behaviour opens the door for further research into rearranging the policy neural networks into a chain or a pipeline, in which case there would be no need for the coordination network. That being said, distributing the load over multiple neural networks does make the behaviour of the AI model easier to explain, which is an important step towards explainable AI (XAI). The main challenge with training a single neural network on such a complex task is the lack of control over the flow of the gradient update. Not only does it update the entire policy at once, which alters the policy and the first few observation layers interacting with the environment, but it also deprives the agent of the chance to perfect any of the sub-tasks required to solve the problem. This was observed as an oscillation between two policies favouring leaning forward and backward in the 2D planar setup, i.e. with no lateral movement. In the 3D setup used in this work, the lateral movement became a problem not only because of the added dimension but also because the hip adduction and hip abduction muscles are now engaged. The maximum isometric force of these muscles is approximately 10 times that of muscles such as the biceps femoris, which flexes the knee joint. 
This highlights the discrepancy in the characteristics of different actuators, which is now being investigated using learnable parameterised activation functions in [57].

Conclusions
In this paper, we followed the learning journey of an AI agent attempting to assume a stable standing posture. We used the OpenSim biomechanics simulation environment [30]. We adopted the DDPG reinforcement learning technique to derive coordinated continuous muscle actuation signals in order to stabilise a standing posture [34,49,50]. The AI agent learnt to maintain a standing posture for 4 seconds by learning the two sub-tasks of leaning backward and leaning forward, together with the coordination between the two actions. While this is a short duration for maintaining a standing posture, it is worth noting that maintaining a standing posture for prolonged periods requires recurrent backtracking through different standing states. Such recursive behaviour would require utilising Long Short-Term Memory (LSTM) modules. Nevertheless, it was very interesting to witness the evolution of learnt sub-tasks as we allowed the AI agent to train on new policies.
The behaviour witnessed in this study highlights two more research points to be investigated. The first is related to training AI models using synthetic data. The main motivation driving this area forward is the expanding gap in available data for training AI models. This issue becomes more significant when considering sensitive applications where collecting realistic data is difficult or may raise safety and ethical concerns. Fall detection and prevention is a growing public health concern, and there is a shortage of datasets of realistic fall posture sequences. These datasets are usually recorded by stunt actors who can fall safely or generated by 3D artists [58]. However, neither solution provides data that is a real representation of fall occurrences. Fortunately, the presented work does derive the coordinated actuation of muscles that causes a realistic fall. The AI agent's failed attempts to maintain a standing posture can provide a comprehensive dataset of falling posture sequences that can advance fall detection and prevention research.
The second research point to be investigated is the discovered AI ambition to explore and exploit locomotion as a means to prolong not falling. This problem is known in AI research as the value misalignment problem, and it has sparked a huge debate among computer scientists and philosophers. The reason for this is that relying solely on maximising the score can excite the AI agent to achieve this by exploiting the environment. While the argument for designing tighter score functions is sound, there is a very fine line to walk between resorting to classical rule-based AI and the modern aspirations towards Artificial General Intelligence (AGI). The latter, in turn, may introduce much feared scenarios of intelligent machines running critical aspects of human lives [59]. While these scenarios are exaggerated in the media and in dystopian literature, they are unlikely to actually occur in the near future due to the limitation of compute power. However, not only does this debate raise good questions regarding AI safety, ethics and even rights, it also raises questions about our societal rights and duties.

Abbreviations
The following abbreviations are used in this manuscript: