6.1. Overall Framework
As depicted in Figure 5, the overall framework for gait adjustment consists of two parts: state reconstruction and SD3. First, the simulated environment is initialized. Second, we reconstruct the initial observation by extracting features from existing states based on the attention mechanism, where the states, paired with their actions, are randomly sampled from the replay buffer.
In the second part, the reconstructed state is taken as the input of SD3. The actor networks then select actions $a_i$ according to the observation, where $i$ refers to the index of the actor network producing each action, and the critic network evaluates the value of the state-action pair $(s, a_i)$. The final action $a$ is determined by comparing the action-values estimated by the two critic networks. It is worth noting that we add noise directly to the actor network parameters for state-dependent exploration, which ensures a dependency between the sampled state and the corresponding selected action.
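A minimal sketch of this selection step, assuming PyTorch-style modules; the names (`actors`, `critics`, `noise_std`) and the conservative min-of-two-critics scoring rule are illustrative assumptions on our part, not the exact implementation:

```python
import copy
import torch

def select_action(obs, actors, critics, noise_std=0.01):
    """Choose the candidate action with the highest estimated action-value.

    obs:     reconstructed state, shape (1, state_dim)
    actors:  actor networks, each proposing one candidate action a_i
    critics: the two critic networks Q1, Q2
    """
    q1, q2 = critics
    best_a, best_q = None, float("-inf")
    with torch.no_grad():
        for actor in actors:
            # Parameter-space noise: perturb a copy of the actor's weights,
            # so exploration remains a deterministic function of the state.
            noisy = copy.deepcopy(actor)
            for p in noisy.parameters():
                p.add_(noise_std * torch.randn_like(p))
            a = noisy(obs)
            # Compare the action-values given by the two critics and score
            # each candidate conservatively with the smaller of the two.
            q = torch.min(q1(obs, a), q2(obs, a)).item()
            if q > best_q:
                best_a, best_q = a, q
    return best_a
```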
6.2. Key-Value Attention-Based State Reconstruction
In this work, the initial observation is a 339-dimensional state consisting of a 97-dimensional body state and a 242-dimensional target velocity map. Because of the large amount of redundant information in this high-dimensional observation, the RL agent cannot easily extract effective information and thus choose better actions. Moreover, in RL, the observed state $s$ and the selected action $a$ often play a significant role in the training of RL algorithms, and the information in each state usually plays an important role in the choice of the action. For example, under the same policy but different states, the RL agent takes different actions without active exploration. As shown in Figure 6, the actions taken to reach the states $s_1$ and $s_2$ are shown by arrows. Although $s_1$ and $s_2$ are very close in space, they are functionally different, and each contains the self-dependent feature information the agent needs to perform the corresponding action. In other words, the self-dependent feature information in a state such as $s_1$ is different from the shared information that exists in all states, and it is necessary for decision making: the action $a_1$ taken to reach $s_1$ differs from the action $a_2$. In our work, the musculoskeletal model moves according to the target velocity map; when the model reaches the target position, a new target position is randomly generated. Immediately, the RL agent makes a new action, for example turning right, to move towards the new target position. In this case, we refer to the specific information contained in the state, signaling that the musculoskeletal model has reached the target position, as the self-dependent information, which drives the agent to make a specific action.
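A toy sketch of this target-resampling mechanic; the threshold, bounds, and function name are illustrative assumptions, not values from the paper:

```python
import numpy as np

def step_target(position, target, radius=0.3, bounds=(-5.0, 5.0)):
    """Resample the goal once the musculoskeletal model reaches it."""
    if np.linalg.norm(position - target) < radius:
        # Reaching the target triggers a new, randomly generated target;
        # the agent must then take a new action (e.g., turning) toward it.
        target = np.random.uniform(bounds[0], bounds[1], size=target.shape)
    return target
```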
The attention mechanism is introduced to focus on the information that is critical to the current task among all input information. Therefore, on the one hand, based on the key-value attention mechanism, we reconstruct the current observation by capturing the self-dependent feature information in each sampled state. Specifically, we first randomly sample $n$ state-action pairs $(s_i, a_i)$ from the replay buffer. Here, the role of the sampled state-action pairs $(s_i, a_i)$ in our proposed framework is equivalent to that of the key-value pairs $(K, V)$ in the key-value attention mechanism: the state $s_i$ and the action $a_i$ are used to calculate the attention distribution and to aggregate information, respectively. Moreover, we adopt state-dependent exploration to preserve the dependency between the sampled state $s_i$ and the sampled action $a_i$; in other words, under the same policy, the selected action is related only to the state input to the policy. Secondly, considering the advantage of the critic network in dealing with continuous action spaces, such as the simulated environment in our work, and since the critic network is usually used to approximate the action-value function [25], we take the critic network as the attention evaluation function. Thus, we calculate the action-value $Q(s, a_i)$ of each sampled action with the critic network, which takes the current observation $s$ and the sampled action $a_i$ as input.
Based on the above method, a set of action-values $Q(s, a_i)$ for the sampled actions is obtained, which serves as the basis for weighting the corresponding sampled states and reconstructing the initial observation. Next, Softmax is used to normalize these action-values, where the normalized action-value $\alpha_i$ represents the proportion of the sampled state $s_i$ in the reconstructed state. Significantly, the computed proportions can be seen as the attention distribution in the key-value attention mechanism. Then, based on the attention distribution $\alpha_i$, the sampled states are fused with the initial observation proportionally. In a word, the self-dependent feature information in a sampled state whose corresponding action has a higher action-value accounts for a larger proportion of the reconstructed state. It is worth noting that the feature fusion is performed by element-wise addition. In this way, the reconstructed state is influenced by the agent's action and accordingly contains the information necessary for that action, so the RL agent can select the corresponding action based on this information.
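One consistent formalization of this description is $\alpha_i = \exp(Q(s,a_i)) / \sum_{j=1}^{n} \exp(Q(s,a_j))$ with reconstructed state $\tilde{s} = s + \sum_{i=1}^{n} \alpha_i s_i$. A minimal sketch of the step under that reading (function and argument names are ours; `critic` stands for the action-value network described above):

```python
import torch
import torch.nn.functional as F

def reconstruct_state(obs, sampled_states, sampled_actions, critic):
    """Fuse sampled states into the observation, weighted by action-value.

    obs:             current observation, shape (state_dim,)
    sampled_states:  n states s_i from the replay buffer, (n, state_dim)
    sampled_actions: the actions a_i paired with those states, (n, action_dim)
    critic:          network approximating Q(s, a), used here as the
                     attention evaluation function
    """
    with torch.no_grad():
        # Score every sampled action against the current observation:
        # q[i] = Q(obs, a_i).
        obs_batch = obs.unsqueeze(0).expand(sampled_states.shape[0], -1)
        q = critic(obs_batch, sampled_actions).squeeze(-1)       # (n,)
        # Softmax turns the action-values into the attention distribution;
        # alpha[i] is the proportion of s_i in the reconstructed state.
        alpha = F.softmax(q, dim=0)                               # (n,)
        # Feature fusion by element-wise addition: states paired with
        # higher-valued actions contribute more self-dependent information.
        fused = (alpha.unsqueeze(-1) * sampled_states).sum(dim=0)
        return obs + fused
```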
On the other hand, the autoencoder [26] is an unsupervised neural network, and dimensionality reduction can be achieved by adjusting the sizes of the hidden layers in its two modules, the encoder and the decoder. Therefore, we use an autoencoder to overcome the curse of dimensionality caused by the high-dimensional musculoskeletal model. The specific process is depicted in Figure 7.
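As a sketch of this step, a minimal fully connected autoencoder over the 339-dimensional observation; the hidden and latent widths are illustrative choices, not values from the paper:

```python
import torch.nn as nn

class StateAutoencoder(nn.Module):
    """Unsupervised dimensionality reduction for the 339-dim observation."""

    def __init__(self, state_dim=339, latent_dim=64):
        super().__init__()
        # The encoder compresses the observation to a low-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # The decoder reconstructs the observation from the code; it is only
        # needed during training with a reconstruction (e.g., MSE) loss.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, s):
        z = self.encoder(s)
        return self.decoder(z), z
```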