In the first phase, the RL algorithms are trained with the emulator to generate an optimum policy, which is then plugged into the actual controller. In the second phase, the controller is executed in the live environment. The state combiner encodes the state of the model from the system variables, the optimum policy is then consulted to select the best action for this state, and the action interpreter applies this action as meaningful velocity commands to the robotic arm. The formulation of the Markov model, the selection of RL algorithms, and their training are discussed in the following subsections.
4.1. Markov Decision Process of Reinforcement Learning
The model of the environment created for use with the Markov decision process is referred to as a Markov model in this paper. The five main elements to be defined for such a model are the states, actions, state transition probability distribution, reward function, and discount factor [22]. These are now explained in the context of this paper:
States: The two variables of the environment that affect the decision-making process are the current pose of the robot and the force input from the user. These two sub-states combine to form the states of the model.
The first sub-state is the pose of the robot arm. Considering the maximum upper ($\theta_{max}$) and lower ($\theta_{min}$) limits of the roll angle for the end effector due to practical constraints, the complete space of end effector roll angle manipulation is divided into $n_a$ equal parts, i.e., the arm sub-states ($n_a$ was empirically set to 10 for best results). The size of each part ($\Delta\theta$) can easily be expressed as follows:

$$\Delta\theta = \frac{\theta_{max} - \theta_{min}}{n_a}$$

Thus, $\Delta\theta$ also defines the magnitude of rotation of the end effector roll angle per step. The arm sub-states are defined in this manner for two reasons. First, the absolute values of roll angles cannot be used directly, since this would cause a state explosion and would also suffer from the drift and inaccuracies of the robot. As long as $\Delta\theta$ is small enough, each sub-state encapsulates these errors. Second, the limit $\theta_{max}$ varies for each execution, depending on the starting position of the user, causing inconsistencies in the definition of each individual sub-state. Defining a dynamic roll space that is always divided into $n_a$ equal parts solves this problem. The values that the arm sub-state $s_a$ can take are in the limits [0,9].
The second sub-state is the force input from the user. The entire range of force values is likewise divided into parts, for the same reasons given above. The three parts are directly derived from the control strategy, i.e., one part each for raising the cup, holding the cup, and lowering the cup. The values that the force sub-state $s_f$ can take are in the limits [0,2].
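The two discretizations above can be sketched as follows; note that the specific angle limits and force thresholds (`f_lower`, `f_raise`) in this sketch are hypothetical placeholders, since the paper derives the actual force regions from the control strategy:

```python
N_ARM = 10  # number of arm sub-states (set empirically in the paper)

def arm_substate(roll, roll_min, roll_max, n=N_ARM):
    """Map the end-effector roll angle onto one of n equal parts
    of the dynamic roll space [roll_min, roll_max]."""
    delta = (roll_max - roll_min) / n        # size of each part
    idx = int((roll - roll_min) / delta)     # index of the part the angle falls in
    return min(max(idx, 0), n - 1)           # clamp to the limits [0, n-1]

def force_substate(force, f_lower=-1.0, f_raise=1.0):
    """Map the user's force input onto the three control regions:
    0 = lower cup, 1 = hold cup, 2 = raise cup.
    The thresholds f_lower/f_raise are hypothetical placeholders."""
    if force < f_lower:
        return 0
    if force > f_raise:
        return 2
    return 1
```

Because the roll space is re-divided per execution, the encoder takes the current limits as arguments rather than fixed constants.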
As a result of these definitions, the total number of states $S$, which is simply the combination of both sub-states, is easily shown to be

$$S = n_f \times n_a$$

where $n_f$ and $n_a$ are the number of force and arm sub-states respectively. Since $n_f$ is 3 and $n_a$ is selected as 10, the model has 30 states.
The final Markov state is simply a composition of the arm sub-state and force sub-state. This can be done in any number of ways as desired, but the following empirical formula was defined for this implementation:
$$s_m = n_f \cdot s_a + s_f$$

where $s_m$ = Markov state value, $s_a$ = arm sub-state value, and $s_f$ = force sub-state value. The values of Markov states derived per this definition are in the limits [0,29]. To understand the virtue of this formulation with an example: the sub-states 2 ($s_a$) and 1 ($s_f$) compose into the Markov state $s_m = 7$, whereas the swapped sub-states 1 ($s_a$) and 2 ($s_f$) compose into the distinct Markov state $s_m = 5$, so each sub-state pair maps to a unique Markov state.
Actions: The three actions defined for this model were to rotate the cup up, rotate it down, or do nothing (see Table 1). These three actions give the user complete control to drink from the cup. The additional fallback movement for safety is considered a special case and is not modeled as an action.
Transition Probability Distribution: For a given arm sub-state, an action can either transition it one step higher or lower in the arm sub-state table or make no change; for a given action, however, this change is deterministic. The user's force input sub-state, on the other hand, transitions randomly, since the user can choose to give any input desired. It is still limited in its possible values, since there are only three force regions; it is therefore random with discrete values. As a result, the total transition of the Markov states (which combine the two sub-states) is random with a finite set of possible next states. Assuming the user is equally inclined to choose any of the possible inputs, the transition probability to each possible next state is equal.
An example from this distribution: if the raise cup action is taken on state 3 (arm sub-state = 1, force sub-state = 0), the possible next states are 6, 7, and 8 (arm sub-state = 2, combined with the 3 possible force sub-states), each with equal probability, i.e., $P(s' \mid s = 3, a = \text{raise}) = 1/3$ for $s' \in \{6, 7, 8\}$.
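Under these assumptions, the successor distribution of any state can be enumerated directly. The sketch below hard-codes the uniform 1/3 split over the three force sub-states; the action encoding (-1/0/+1) is hypothetical, since the paper numbers its actions in Table 1:

```python
N_FORCE = 3

def next_states(s_m, action, n_arm=10):
    """Enumerate the successor distribution of a Markov state.

    action: -1 = lower cup, 0 = do nothing, +1 = raise cup.
    The arm sub-state transitions deterministically (clamped at the
    limits); the force sub-state is chosen freely by the user, so each
    of the three successors has probability 1/3.
    """
    s_a, _ = divmod(s_m, N_FORCE)
    s_a2 = min(max(s_a + action, 0), n_arm - 1)
    return {N_FORCE * s_a2 + s_f: 1.0 / N_FORCE for s_f in range(N_FORCE)}
```

Applied to the example above, `next_states(3, +1)` yields states 6, 7, and 8, each with probability 1/3.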
Reward Function: The reward function is highly shaped. The assignment of rewards to a state–action pair is based on the force sub-state, as this signifies the relationship between the intent of the user and the response of the agent. A positive reward is given to the action that follows the user intent, for example, if the user provides force input to raise the cup and the action selected is to raise the cup. Other cases receive a negative reward. The border cases are special: if the robot is already at a maximum angle limit but the user intends to exceed it, the hold cup action still needs to be incentivized regardless of the user input. To ensure this, a positive reward is assigned to the hold cup action in these cases, so that the cup never exceeds the safety angle limits.
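One way to write this shaped reward is as a function of the force sub-state, the chosen action, and the border-case flags; the ±1 reward magnitudes and the action encoding here are illustrative assumptions, not the paper's values:

```python
def reward(s_f, action, at_upper_limit=False, at_lower_limit=False):
    """Shaped reward based on the force sub-state (user intent).

    s_f: 0 = lower, 1 = hold, 2 = raise; action: -1 = lower cup,
    0 = do nothing (hold), +1 = raise cup. The +/-1 reward magnitudes
    are illustrative, not the paper's values.
    """
    if at_upper_limit and s_f == 2:           # user intends to exceed the upper limit:
        return 1.0 if action == 0 else -1.0   # incentivize holding the cup
    if at_lower_limit and s_f == 0:           # same for the lower limit
        return 1.0 if action == 0 else -1.0
    intended = {0: -1, 1: 0, 2: +1}[s_f]      # action the user's force asks for
    return 1.0 if action == intended else -1.0
```

The two border-case branches are what keep the learned policy from ever driving the cup past the safety angle limits, even when the user pushes toward them.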
Discount Factor: It is selected to be low (0.1) in order to prioritize immediate rewards. This is very important because the agent has to select the action that gives the best reward right away, as opposed to trying to accumulate a greater reward over time and choosing undesirable actions.
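The effect of the low discount factor can be seen by comparing discounted returns: with $\gamma = 0.1$, a reward arriving two steps later is scaled by $\gamma^2 = 0.01$, so the agent cannot profitably defer to a larger future reward. A minimal illustration:

```python
GAMMA = 0.1   # discount factor from the paper

def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 0.1, an immediate +1 outweighs a +10 arriving two steps later:
immediate = discounted_return([1.0, 0.0, 0.0])   # 1.0
deferred = discounted_return([0.0, 0.0, 10.0])   # 0.1
```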
A Markov model can be designed in order to train an RL algorithm that either works perpetually or for a limited period of time/trials. To experiment with both approaches, two special cases of the model were developed: terminating and non-terminating. As the names suggest, the terminating model takes the three final states (i.e., max upper angle of rotation) as the terminating states and the non-terminating model has no such definition. Therefore, the terminating model describes a time-limited model in that the target is for the user to reach a terminating state to finish drinking from the cup within a finite amount of time. This is easier to comprehend and hence better during development. The non-terminating model is perpetual because the user may not always want to completely finish drinking from the cup, and the agent needs to function for an infinite amount of time. This is closer to practical requirements and is more desirable for the final solution.
With the full formal definition of each element of the 5-tuple, a complete Markov model is prepared for the robotic drinking assistant environment, and is summarized in Figure 3.
4.3. Reinforcement Learning Algorithms
Five reinforcement learning algorithms were trained with the previously described environment. In what follows, "online" means that the live environment/emulator is required for training, while "offline" means that the mathematical Markov model alone is sufficient. The algorithms applied are all iterative and are stopped once convergence has been achieved. Convergence is determined by slightly different factors for each algorithm, but generally involves minimizing a so-called convergence factor, which is the primary variable iteratively calculated by the algorithm. This method is referred to as variation minimization in this paper. The algorithms are briefly explained as follows:
Value iteration (VI) and policy iteration (PI) [24]. The approach of VI is to optimize a so-called value function over many iterations until convergence, and then extract the optimum policy from it. PI functions in a similar manner by computing the value function, but optimizes the policy directly. Since both are offline, the training emulator was not used, and the training process is drastically simpler and faster compared to the other algorithms. The resulting optimum policy served not only as a benchmark for the other algorithms but also as a reference for the rapid prototyping of the PDC for live testing. The value function is iteratively calculated in both cases, and convergence is said to occur when its change per iteration drops below 0.00001. Convergence was consistently achieved after five episodes (non-terminating model) and six episodes (terminating model) for PI, and after two episodes for both models for VI.
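The variation-minimization stopping rule described above can be sketched generically; the following is a minimal value-iteration sketch, with a toy two-state MDP at the end that is purely illustrative and much smaller than the paper's 30-state model:

```python
def value_iteration(states, actions, P, R, gamma=0.1, theta=1e-5):
    """Generic value iteration with the variation-minimization stopping
    rule: iterate until the largest change in V drops below theta.

    P[s][a] is a dict {next_state: probability}; R[s][a] is the reward
    for taking action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:   # convergence factor below threshold
            break
    # Extract the optimum policy greedily from the converged value function.
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma * sum(p * V[s2]
                                                         for s2, p in P[s][a].items()))
              for s in states}
    return V, policy

# Toy two-state MDP (purely illustrative): only action 1 in state 0 pays +1.
P = {0: {0: {0: 1.0}, 1: {1: 1.0}}, 1: {0: {1: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}}
V, policy = value_iteration([0, 1], [0, 1], P, R)
```

Because the transition and reward model are passed in explicitly, this style of training is offline in the sense used above: no emulator is queried during the iterations.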
Q-learning and state–action–reward–state–action (SARSA) [25]. These two algorithms are similar in that they both compute and optimize the Q table as the convergence factor. The difference is that the SARSA update takes into account one more state–action pair than Q-learning (QL). These two are of greater interest since they are online and avoid the need for a full model, while remaining simple and fast enough to compute solutions efficiently. Being model-free also helps with rapid prototyping and development whenever the environment is modified. The learning rate was selected to be 0.8, the epsilon-greedy approach to the exploration-vs.-exploitation trade-off was used with a decay rate of 0.1, and the convergence threshold was set to 0.001.
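The "one more state–action pair" distinction is easiest to see in the two update rules side by side; this sketch uses the paper's learning rate and discount factor, with `Q` as a plain table indexed by state and action:

```python
ALPHA = 0.8   # learning rate from the paper
GAMMA = 0.1   # discount factor from the paper

def q_learning_update(Q, s, a, r, s2):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + GAMMA * max(Q[s2])
    Q[s][a] += ALPHA * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2):
    """On-policy update: bootstrap from the action a2 actually taken next,
    i.e., SARSA consumes one more state-action pair than Q-learning."""
    target = r + GAMMA * Q[s2][a2]
    Q[s][a] += ALPHA * (target - Q[s][a])
```

Both updates are driven by live transitions (s, a, r, s') rather than by the transition model, which is what makes them online and model-free.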
Deep Q network (DQN) [27]. This final algorithm belongs to the deep reinforcement learning class, which combines the advantages of deep learning and RL. The Q values are still computed, but they are used to train a neural network instead of being saved in a lookup table. This allows a much larger number of states to be modeled (which could be useful to capture data from the environment at a greater resolution), at the cost of increased training time. The learning rate and epsilon factor were the same as before. The neural network used had four fully connected layers of 24 neurons each with ReLU activation, 30 inputs (states), and 3 outputs (actions). The loss function was the mean squared error, which serves as the convergence factor, with a threshold of 0.0001.
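The essential DQN training step (one-hot state inputs, Q targets, MSE as the convergence factor) can be sketched as follows; to keep the sketch self-contained, a single linear layer `W` stands in for the paper's four-layer ReLU network, so this is an assumption-laden illustration rather than the actual architecture:

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 30, 3, 0.1

def one_hot(s, n=N_STATES):
    """Encode a Markov state index as a one-hot input vector."""
    x = np.zeros(n)
    x[s] = 1.0
    return x

def q_targets(W, batch):
    """Q targets r + gamma * max_a' Q(s', a') for a batch of (s, a, r, s')
    transitions. W is a single linear layer standing in for the paper's
    four-layer ReLU network."""
    return np.array([r + GAMMA * (one_hot(s2) @ W).max()
                     for s, a, r, s2 in batch])

def mse_loss(W, batch):
    """Mean squared error between predicted Q(s, a) and the targets; this
    is the convergence factor, thresholded at 0.0001 in the paper."""
    preds = np.array([(one_hot(s) @ W)[a] for s, a, r, s2 in batch])
    return float(np.mean((preds - q_targets(W, batch)) ** 2))
```

Training then consists of minimizing `mse_loss` by gradient descent on the network weights until the loss falls below the stated threshold.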
These five pre-existing algorithms were chosen because they represent different, readily available classes of RL algorithms. PI and VI belong to the dynamic programming class, which is offline and faster, providing a good benchmark; however, both require a complete model. Q-learning and SARSA are model-free approaches, providing more flexibility, but both require online training. DQN allows systems with a very large number of states to be trained, at the cost of more training time. Together, these were considered a good platform for comparing the training processes of the algorithms. The resulting optimum policy was expected, and later experimentally verified, to be the same for all five algorithms, since the same Markov model was used.