# Decentralized Multi-Agent Control of a Manipulator in Continuous Task Learning


## Abstract


## 1. Introduction

#### 1.1. Context

#### 1.2. State of the Art and Related Works

#### 1.3. Paper Contribution

#### 1.4. Paper Layout

## 2. Problem Description

- Reaching: the agents receive a reward when the robot’s end-effector approaches the cube;
- Grasping: a grasping reward is given when both fingers of the gripper touch the cube, completing the grasping phase;
- Lifting: when the cube is lifted off the surface, the lifting reward is provided; this constitutes a successful episode.
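The staged reward logic above can be sketched as follows. The threshold and bonus values, as well as the function and argument names, are illustrative assumptions rather than the paper's tuned constants:

```python
import numpy as np

# Assumed constant -- the paper does not state the exact lift threshold.
LIFT_HEIGHT = 0.04  # metres above the table surface

def staged_reward(gripper_pos, cube_pos, fingers_touching, cube_height):
    """Return (reward, done) for the reach -> grasp -> lift progression."""
    reward = 0.0
    done = False
    # Reaching: dense reward that grows as the end-effector nears the cube.
    dist = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(cube_pos))
    reward += 1.0 - np.tanh(10.0 * dist)
    # Grasping: bonus only if both fingers are in contact with the cube.
    if all(fingers_touching):
        reward += 0.25
        # Lifting: the episode succeeds once the cube leaves the surface.
        if cube_height > LIFT_HEIGHT:
            reward += 1.0
            done = True
    return reward, done
```

A successful grasp-and-lift step thus accumulates all three components, while a distant gripper earns only a small fraction of the reaching term.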

## 3. Methodology

#### 3.1. Preliminaries

- Value-based methods, which learn a value function that is then used to derive a policy;
- Policy-based methods, which have a parameterized policy and directly search for the optimal policy;
- Actor-critic methods, which learn both the value function and the policy. The critic evaluates the actions taken by the actor using the learned value function, and the actor then updates its policy parameters based on the critic’s feedback [35].
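As a concrete illustration of the actor-critic interplay described above, a minimal tabular actor-critic update (a sketch for intuition, not the paper's deep-RL implementation) might look like:

```python
import numpy as np

def actor_critic_step(V, theta, s, a, r, s_next,
                      alpha=0.1, beta=0.1, gamma=0.99):
    """One actor-critic update: V is a state-value table (the critic),
    theta a table of action preferences defining a softmax policy (the actor).
    All names and step sizes are illustrative assumptions."""
    # Critic: the TD error judges the action the actor just took.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    # Actor: shift the preference for the taken action by the critic's feedback
    # (softmax policy-gradient for a single sampled action).
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
    grad = -probs
    grad[a] += 1.0
    theta[s] += beta * td_error * grad
    return td_error
```

A positive TD error raises both the state's value estimate and the probability of repeating the chosen action; a negative one does the opposite.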

#### 3.2. Modular Framework for Decentralized Learning

## 4. Implementation of Proposed Approach

#### 4.1. Learning Environment

#### 4.2. State-Space

- Robot information: the data about the robot and its state is a 36-dimensional vector containing the joint positions, the gripper finger positions, the joint velocities, and the end-effector pose and twist. Each joint position is encoded trigonometrically, i.e., by its sine and cosine; with 7 DoFs, this gives a 14-dimensional vector. The gripper is described by a 2-dimensional vector with the positions of the left and right fingers. The joint velocities form a 7-dimensional vector, one value per joint. The end-effector pose is a 7-dimensional vector: the Cartesian position followed by a quaternion for the orientation. The end-effector state also includes a 6-dimensional vector with its linear and angular velocities.
- Object information: the object to be picked up is described by its position and orientation in the world reference frame. To facilitate reaching the cube, a relative position vector between the object and the robot’s gripper is added. The object state is thus a 10-dimensional vector.
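A sketch of how the 46-dimensional observation could be assembled from these pieces; the function name and argument layout are assumptions, not the paper's actual code:

```python
import numpy as np

def build_observation(q, dq, finger_pos, ee_pos, ee_quat, ee_vel,
                      obj_pos, obj_quat):
    """Assemble the 46-dim state vector described above.

    q, dq: 7 joint positions / velocities; finger_pos: 2 finger positions;
    ee_pos: 3, ee_quat: 4, ee_vel: 6 (linear + angular);
    obj_pos, obj_quat: object pose in the world frame.
    """
    q = np.asarray(q)
    robot = np.concatenate([
        np.sin(q), np.cos(q),        # 14: trigonometric joint encoding
        finger_pos,                  #  2: left and right finger positions
        dq,                          #  7: joint velocities
        ee_pos, ee_quat,             #  7: end-effector pose (pos + quaternion)
        ee_vel,                      #  6: end-effector linear/angular velocity
    ])                               # -> 36 robot values
    obj = np.concatenate([
        obj_pos, obj_quat,                        # 7: object pose
        np.asarray(obj_pos) - np.asarray(ee_pos)  # 3: gripper-to-object offset
    ])                               # -> 10 object values
    return np.concatenate([robot, obj])           # 46 total
```

The component sizes (14 + 2 + 7 + 7 + 6 and 7 + 3) reproduce the 36- and 10-dimensional sub-vectors described in the text.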

#### 4.3. Action Space

#### 4.4. Training Details

#### 4.5. Reward Shaping

- ${r}_{dist}$: the distance reward, computed from the relative position of the gripper w.r.t. the object;
- ${r}_{vel}$: the velocity reward, which takes into account the end-effector velocity vector when approaching the object;
- ${r}_{grip}$: the gripper-open reward, which depends on the gripper action.
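The three shaping terms could be combined as in the following sketch; the weights and the exact functional form of each term are assumptions, not the paper's tuned values:

```python
import numpy as np

def shaped_reward(rel_pos, ee_vel, gripper_action,
                  w_dist=1.0, w_vel=0.1, w_grip=0.05):
    """Weighted sum of the r_dist, r_vel, and r_grip shaping terms."""
    dist = np.linalg.norm(rel_pos)
    # r_dist: closer gripper -> larger reward.
    r_dist = 1.0 - np.tanh(10.0 * dist)
    # r_vel: reward velocity directed toward the object (positive projection).
    direction = rel_pos / dist if dist > 1e-8 else np.zeros(3)
    r_vel = float(np.dot(ee_vel, direction))
    # r_grip: encourage keeping the gripper open while approaching
    # (negative action = open is an assumed convention).
    r_grip = 1.0 if gripper_action < 0 else 0.0
    return w_dist * r_dist + w_vel * r_vel + w_grip * r_grip
```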

#### 4.6. Software Implementation

## 5. Results

- One meta-PPO agent and two low-level SAC agents;
- Two PPO agents;
- Two SAC agents;
- One PPO agent and one SAC agent.

- One SAC agent;
- One PPO agent.

#### 5.1. Decentralized Multi-Agent Approach

#### 5.1.1. Meta-PPO Agent with Two Low-Level SAC Agents

#### 5.1.2. Two PPO Agents

#### 5.1.3. Two SAC Agents

#### 5.1.4. One SAC Agent and One PPO Agent

#### 5.2. Decentralized Single-Agent

#### 5.2.1. Decentralized Single-Agent SAC

#### 5.2.2. Decentralized Single-Agent PPO

#### 5.3. Comparison of Centralized and Decentralized Approach

## 6. Discussion

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Rajan, K.; Saffiotti, A. Towards a Science of Integrated AI and Robotics. Artif. Intell. **2017**, 247, 1–9.
- Van Roy, V.; Vertesy, D.; Damioli, G. AI and robotics innovation. In Handbook of Labor, Human Resources and Population Economics; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–35.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533.
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature **2016**, 529, 484–489.
- Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. **2013**, 32, 1238–1274.
- Cruz, F.; Parisi, G.I.; Wermter, S. Learning contextual affordances with an associative neural architecture. In Proceedings of the 24th European Symposium on Artificial Neural Networks, Bruges, Belgium, 27–29 April 2016; pp. 665–670.
- Yen-Chen, L.; Zeng, A.; Song, S.; Isola, P.; Lin, T.Y. Learning to See before Learning to Act: Visual Pre-training for Manipulation. arXiv **2021**, arXiv:2107.00646.
- Cruz, F.; Dazeley, R.; Vamplew, P.; Moreira, I. Explainable robotic systems: Understanding goal-driven actions in a reinforcement learning scenario. Neural Comput. Appl. **2021**, 1–18.
- Cruz, F.; Wüppen, P.; Fazrie, A.; Weber, C.; Wermter, S. Action selection methods in a robotic reinforcement learning scenario. In Proceedings of the 2018 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Guadalajara, Mexico, 7–9 November 2018; pp. 1–6.
- Cruz, F.; Wüppen, P.; Magg, S.; Fazrie, A.; Wermter, S. Agent-advising approaches in an interactive reinforcement learning scenario. In Proceedings of the 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Lisbon, Portugal, 18–21 September 2017; pp. 209–214.
- Rahmatizadeh, R.; Abolghasemi, P.; Bölöni, L.; Levine, S. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 3758–3765.
- Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3389–3396.
- Joshi, S.; Kumra, S.; Sahin, F. Robotic grasping using deep reinforcement learning. In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), Hong Kong, China, 20–21 August 2020; pp. 1461–1466.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv **2015**, arXiv:1509.02971.
- Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. **2018**, 37, 421–436.
- Busoniu, L.; De Schutter, B.; Babuska, R. Decentralized reinforcement learning control of a robotic manipulator. In Proceedings of the 2006 9th International Conference on Control, Automation, Robotics and Vision, Singapore, 5–8 December 2006; pp. 1–6.
- Leottau, D.L.; Ruiz-del Solar, J.; Babuška, R. Decentralized reinforcement learning of robot behaviors. Artif. Intell. **2018**, 256, 130–159.
- Panait, L.; Luke, S. Cooperative multi-agent learning: The state of the art. Auton. Agents Multi-Agent Syst. **2005**, 11, 387–434.
- Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight experience replay. arXiv **2017**, arXiv:1707.01495.
- Lauer, M.; Riedmiller, M. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000.
- Kapetanakis, S.; Kudenko, D. Reinforcement learning of coordination in heterogeneous cooperative multi-agent systems. In Adaptive Agents and Multi-Agent Systems II; Springer: Berlin/Heidelberg, Germany, 2004; pp. 119–131.
- Matignon, L.; Laurent, G.J.; Le Fort-Piat, N. Hysteretic Q-learning: An algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 29 October–2 November 2007; pp. 64–69.
- Tuyls, K.; Hoen, P.J.; Vanschoenwinkel, B. An evolutionary dynamical analysis of multi-agent learning in iterated games. Auton. Agents Multi-Agent Syst. **2006**, 12, 115–153.
- Laurent, G.J.; Matignon, L.; Fort-Piat, L. The world of independent learners is not Markovian. Int. J. Knowl.-Based Intell. Eng. Syst. **2011**, 15, 55–64.
- Vidhate, D.; Kulkarni, P. Cooperative machine learning with information fusion for dynamic decision making in diagnostic applications. In Proceedings of the 2012 International Conference on Advances in Mobile Network, Communication and Its Applications, Bangalore, India, 1–2 August 2012; pp. 70–74.
- Theodorou, E.; Buchli, J.; Schaal, S. Reinforcement learning of motor skills in high dimensions: A path integral approach. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 2397–2403.
- Kuba, J.G.; Wen, M.; Yang, Y.; Meng, L.; Gu, S.; Zhang, H.; Mguni, D.H.; Wang, J. Settling the Variance of Multi-Agent Policy Gradients. arXiv **2021**, arXiv:2108.08612.
- Schwager, M.; Rus, D.; Slotine, J.J. Decentralized, adaptive coverage control for networked robots. Int. J. Robot. Res. **2009**, 28, 357–375.
- Sartoretti, G.; Paivine, W.; Shi, Y.; Wu, Y.; Choset, H. Distributed learning of decentralized control policies for articulated mobile robots. IEEE Trans. Robot. **2019**, 35, 1109–1122.
- Lee, Y.; Yang, J.; Lim, J.J. Learning to coordinate manipulation skills via skill behavior diversification. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Ha, H.; Xu, J.; Song, S. Learning a decentralized multi-arm motion planner. arXiv **2020**, arXiv:2011.02608.
- Shahid, A.A.; Roveda, L.; Piga, D.; Braghin, F. Learning continuous control actions for robotic grasping with reinforcement learning. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020; pp. 4066–4072.
- Littman, M.L. Value-function reinforcement learning in Markov games. Cogn. Syst. Res. **2001**, 2, 55–66.
- Lazaric, A.; Restelli, M.; Bonarini, A. Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. Adv. Neural Inf. Process. Syst. **2007**, 20, 833–840.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv **2017**, arXiv:1707.06347.
- Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv **2018**, arXiv:1812.05905.
- Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv **2017**, arXiv:1706.02275.
- Todorov, E.; Erez, T.; Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 5026–5033.
- Erez, T.; Tassa, Y.; Todorov, E. Simulation tools for model-based robotics: Comparison of Bullet, Havok, MuJoCo, ODE and PhysX. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 4397–4404.
- Zhu, Y.; Wong, J.; Mandlekar, A.; Martín-Martín, R. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning. arXiv **2020**, arXiv:2009.12293.
- Massa, D.; Callegari, M.; Cristalli, C. Manual guidance for industrial robot programming. Ind. Robot Int. J. **2015**, 42, 457–465.
- Martín-Martín, R.; Lee, M.A.; Gardner, R.; Savarese, S.; Bohg, J.; Garg, A. Variable impedance control in end-effector space: An action space for reinforcement learning in contact-rich tasks. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1010–1017.
- Moreira, I.; Rivas, J.; Cruz, F.; Dazeley, R.; Ayala, A.; Fernandes, B. Deep Reinforcement Learning with Interactive Feedback in a Human–Robot Environment. Appl. Sci. **2020**, 10, 5574.
- Sesin, J.S.V.; Pecioski, D. GitHub Repository: Software for Decentralized and Multi Agent Control of Franka Emika Panda Robot. Available online: https://github.com/jvidals09/Decentralized-and-multi-agent-control-of-Franka-Emika-Panda-robot-in-continuous-task-execution (accessed on 29 September 2021).
- Shahid, A.A. Continuous Control Actions Learning with Performance Specifications through Reinforcement Learning. Master’s Thesis, Politecnico di Milano, Milan, Italy, 2020. Available online: http://hdl.handle.net/10589/164660 (accessed on 3 October 2021).

**Figure 1.** Main conceptual scheme of the proposed multi-agent, decentralized reinforcement learning approach.

**Figure 5.** Schematic representation of: (**top**) policy network with 46-dim input (environment states) and 8-dim output (actions); (**bottom**) value network with 46-dim input and 1-dim output.
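The two networks in Figure 5 can be sketched as plain fully connected models. Only the 46-dim input and the 8-/1-dim outputs come from the paper; the hidden-layer sizes, activations, and initialization below are assumptions:

```python
import numpy as np

def mlp(sizes, rng):
    """Randomly initialised fully connected net as a list of (W, b) pairs.
    Hidden sizes other than input/output dims are assumed, not the paper's."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass with tanh hidden activations and a linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

rng = np.random.default_rng(0)
policy_net = mlp([46, 64, 64, 8], rng)  # 46 states -> 8-dim action
value_net = mlp([46, 64, 64, 1], rng)   # 46 states -> scalar state value

state = np.zeros(46)
action = forward(policy_net, state)     # shape (8,)
value = forward(value_net, state)       # shape (1,)
```

Both networks share the same input; only their output heads differ, matching the top/bottom split in the figure.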

**Figure 6.** Progression of mean accumulated reward for the meta-agent with two low-level SAC agents. Training results are reported for three experiments with different random seeds. The standard deviation of the three runs is shown in the plot.

**Figure 7.** Progression of mean accumulated reward for the two PPO agents. Training results are reported for four experiments with different random seeds. The standard deviation of the four runs is shown in the plot.

**Figure 8.** Progression of mean accumulated reward for the two SAC agents. Training results are reported for four experiments with different random seeds. The standard deviation of the four runs is shown in the plot.

**Figure 12.** Progression of the mean accumulated reward for the decentralized single-agent PPO. Training results are reported for three experiments with different random seeds. The standard deviation of the three runs is shown in the plot.

| Approach | Maximum Accumulated Reward |
|---|---|
| One meta-PPO agent and two low-level SAC agents | 346 |
| Two PPO agents | 580 |
| Two SAC agents | 607 |
| One PPO agent and one SAC agent | 130 |
| Decentralized single SAC agent | 1203 |
| Decentralized single PPO agent | 480 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Shahid, A.A.; Sesin, J.S.V.; Pecioski, D.; Braghin, F.; Piga, D.; Roveda, L. Decentralized Multi-Agent Control of a Manipulator in Continuous Task Learning. *Appl. Sci.* **2021**, *11*, 10227.
https://doi.org/10.3390/app112110227
