4.2.1. Programming Environment
The whole program is written in Python 2.7, using the Python Application Programming Interface (API) developed by Aldebaran Robotics to communicate with the NAOqi framework.
We designed and trained the artificial neural networks with the Python library Keras, a high-level neural network API written in Python that can run on top of either TensorFlow or Theano and was developed with a focus on enabling fast experimentation (retrieved from https://keras.io/).
The library contains implementations of commonly used neural network building blocks such as layers, objectives, activation functions, and optimizers, as well as many tools that make working with image and text data easier. From these, we selected and loaded only the required modules in order to use memory efficiently.
Each time a new action is taken, the reward that it yields must be evaluated and the model must be trained again on this information. Because of this, Graphics Processing Unit (GPU) optimization is needed. We used Keras with the Theano backend to take advantage of the GPU of an NVIDIA GeForce GTX 580; otherwise, we would not have been able to compute the results fast enough. The motors of the NAO take 0.1 s to complete a movement instruction, while computing the update of a Q-network on the Central Processing Unit (CPU) took approximately 2 s (implemented on an Intel Core i5 computer with 8 GB of RAM; see Figure 10). Evidently, this is not fast enough. Using the GPU, we could compute the network update in around 0.002 s (see Figure 11).
4.2.3. Algorithm Description
First, we needed to load the fall module to detect whether the robot had fallen (see Figure 14). Every time the simulated robot fell down, we had to restore the simulation manually because the simulated robot could not rise by itself. For this reason, we could not train the Q-network in a single run; instead, we had to resume training where the previous run ended. We therefore saved the network model at the end of each run and loaded it at the beginning of the next.
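The save-at-end, load-at-start pattern can be sketched as follows. This is our sketch, not the authors' code: pickle and the file name `qnetwork_run.pkl` are stand-ins for Keras's own model persistence (e.g., `model.save` and `load_model` in recent Keras versions).

```python
import os
import pickle

MODEL_PATH = "qnetwork_run.pkl"  # hypothetical file name

def load_or_init(path, init_weights):
    """Load the model saved by the previous run, or start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return init_weights

def save_model(path, weights):
    """Save the model at the end of a run so the next run can resume."""
    with open(path, "wb") as f:
        pickle.dump(weights, f)

weights = load_or_init(MODEL_PATH, [0.0, 0.0])
weights[0] += 1.0          # stand-in for one run of training
save_model(MODEL_PATH, weights)
```

Each invocation of the script then continues from the state the previous run left behind, exactly as described for the interrupted training runs.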
We needed to initialize the NAO proxies: the motion proxy, which allows us to send signals to the NAO motors; the posture proxy, which allows us to set the robot in a “home position”; the sonar proxy, which provides the information of the ultrasonic sensors; and the memory proxy, which gives access to the recorded data, e.g., the joint angles. After that, the robot had to be activated, which we did by setting the stiffness of the whole body to 1 with the instruction motion.setStiffnesses("Body", 1.0).
We initialized the number of epochs to 50. Note that training requires many more iterations in total, but because the robot cannot always stand up by itself after falling and the simulation then has to be restored manually, long runs are impractical. For instance, if we set the number of epochs to 2000 and the robot fell down in the second iteration, we would have to wait until the 2000th iteration to restore the simulation, which would take considerably more time.
Afterwards, we initialized gamma to 0.9, which means that experience is taken into account; the closer gamma is to zero, the less the experience of previous trials is taken into consideration.
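The role of gamma can be seen directly in the bootstrapped target used during training (immediate reward plus the discounted best value of the next state). The function below is our illustration, not the authors' code.

```python
def q_target(reward, max_q_next, gamma):
    """Q-learning target: immediate reward plus the discounted
    best estimated value of the next state."""
    return reward + gamma * max_q_next

# gamma = 0.9: the learned value of the next state weighs heavily
print(q_target(1.0, 10.0, 0.9))   # 10.0
# gamma = 0.0: only the immediate reward counts
print(q_target(1.0, 10.0, 0.0))   # 1.0
```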
In Reference [31], the authors used a technique called experience replay to help the algorithm converge. It works in the following way: during the run of the algorithm, all the experiences <s_t, a, r, s_t+1> are stored in a replay memory. When training the network, random mini-batches drawn from the replay memory are used instead of the most recent transition. Before training, we needed to define a variable replay for this purpose, normally initialized as an empty list; however, since we ran the algorithm multiple times, in our case it was not empty but held the list filled during the previous run, i.e., replay = lastreplay.

We opened a loop from i = 0 to i = epochs and set the robot in the “home position,” which was the NAO posture “StandInit.” We then read the sonars with the memory proxy, stored the readings in sonar, and read the state; recall that the state in the first level is a vector of 4 values, while in the second level it is a vector of 11 values.
Then, we opened a while loop with the condition that the NAO had not reached a terminal state. A terminal state occurred when the robot fell or when it reached the goal; that is, in the first level, when the ZMP lay within a specific interval, and in the second level, when a distance of 8 cm had been covered.
Later, we ran the Q-network forward and stored the result in qval (a vector). Suppose we obtain the vector (1.2, −0.36, 2.2, 0.0); action_t+1 would then be action 3 because the third value is the greatest. However, we wanted to explore more options in order not to get stuck in a local minimum, so with probability epsilon we chose a random action instead of the max action. This epsilon gradually decreased so that after many iterations it became 0.
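The epsilon-greedy choice just described can be sketched as follows (our function names; note that Python uses 0-based indexing, so the text's "action 3" corresponds to index 2):

```python
import random

def choose_action(qval, epsilon):
    """Epsilon-greedy: with probability epsilon take a random action,
    otherwise take the action with the greatest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(qval))
    return max(range(len(qval)), key=lambda i: qval[i])

qval = [1.2, -0.36, 2.2, 0.0]
print(choose_action(qval, 0.0))   # 2 (0-based index of 2.2, i.e., the "third" action)
```

Annealing epsilon toward 0 over the epochs then shifts the policy from exploration to pure exploitation.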
Then, we applied action_t+1 and observed the reward and the new state, with the reward values as described in the previous section. At this point, we had the tuple [state, action, reward, new state]. We stored this tuple in replay and repeated the process until the length of replay equaled buffer, which we set to 60. Once the buffer was filled, we drew a mini-batch, a random sample of length 30 from the replay list.
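A minimal sketch of the replay buffer and mini-batch sampling, using the sizes given in the text (the helper names are ours, and overwriting the oldest entry once the buffer is full is a common convention we assume here):

```python
import random

BUFFER = 60   # replay memory capacity (as in the text)
BATCH = 30    # mini-batch size (as in the text)

replay = []

def store(experience):
    """Store <state, action, reward, new_state>; once the buffer is
    full, drop the oldest entry to make room."""
    if len(replay) >= BUFFER:
        replay.pop(0)
    replay.append(experience)

def sample_minibatch():
    """Random mini-batch from the replay memory, used for training
    instead of the most recent transition."""
    return random.sample(replay, BATCH)

for t in range(100):                       # dummy experiences
    store(("s%d" % t, 0, 0.0, "s%d" % (t + 1)))
print(len(replay))                         # 60
```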
Thereafter, we looped over each element of the mini-batch. Each element was a list of four values that we unpacked into (old state, action, reward2, new state2). We ran our Q-network forward with old state as input and stored the result in old qval; we also ran it forward with new state2 as input and stored the greatest value of that output in maxQ. Later, we defined a vector X with the same values as old state, and a vector Y with the same values as old qval.
At this point, we needed to check whether reward2 corresponded to a terminal state; if so, the variable update was set to the value of reward2; if not, update = (reward2 + (gamma × maxQ)). This variable update implemented the rule used to compute the update of the neural network.
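The terminal/non-terminal branching can be written compactly as follows (our function name; gamma = 0.9 as in the text):

```python
GAMMA = 0.9

def compute_update(reward, max_q, terminal):
    """Target value for the chosen action: the bare reward at a
    terminal state, the bootstrapped estimate otherwise."""
    if terminal:
        return reward
    return reward + GAMMA * max_q

print(compute_update(10.0, 2.2, True))            # 10.0
print(round(compute_update(-1.0, 2.2, False), 2)) # 0.98
```

Cutting the bootstrap at terminal states is what anchors the value estimates to the true rewards.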
The value of the variable update then had to replace the value of the Y vector at position index, the position of the action taken. For instance, suppose we have Y = [1.2, −0.36, 2.2, 0.0], index = 3, and update = 10; after the replacement, Y = [1.2, −0.36, 10, 0.0]. This is the target for the Q-network, and it must be computed for all the elements in the mini-batch. Once we finished the loop, we used the entire X and Y vectors to train the Q-network using backpropagation. The Q-network algorithm is shown in Figure 15.
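The per-element target construction described above can be sketched as follows (our function name; 0-based indexing, so the text's index = 3 corresponds to position 2):

```python
GAMMA = 0.9

def build_target(old_qval, action_index, reward, max_q_next, terminal):
    """Copy the network's current output and overwrite only the entry
    of the action taken with the computed update value."""
    update = reward if terminal else reward + GAMMA * max_q_next
    y = list(old_qval)
    y[action_index] = update
    return y

y = build_target([1.2, -0.36, 2.2, 0.0], 2, 10.0, 0.0, True)
print(y)   # [1.2, -0.36, 10.0, 0.0]
```

Leaving the other entries of Y unchanged means backpropagation only pushes the network's output toward the new estimate for the action that was actually taken.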