# Federated Reinforcement Learning for Training Control Policies on Multiple IoT Devices


## Abstract


## 1. Introduction

- We propose a new federated reinforcement learning scheme to allow multiple agents to control their own devices of the same type but with slightly different dynamics.
- We verify that the proposed scheme can expedite the learning process overall when the control policies are trained for the multiple devices.

## 2. Related Work

#### 2.1. Federated Reinforcement Learning

#### 2.2. Actor–Critic PPO

## 3. System Architecture & Overall Procedure

## 4. Federated Reinforcement Learning Algorithm

## 5. Experiments

Algorithm 1: Federated RL (Chief)

Algorithm 2: Federated RL (Worker w)
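The full pseudocode of Algorithms 1 and 2 is not reproduced here. As a rough illustration of the gradient-sharing pattern they follow, the sketch below shows a chief that averages per-parameter gradients received from its workers; the function name and the flat-list gradient layout are our own assumptions for illustration, not the paper's notation.

```python
def chief_aggregate(worker_grads):
    """Average per-parameter gradients collected from all workers.

    worker_grads: one flat list of floats per worker. Element-wise
    averaging is one common federated aggregation pattern; the paper's
    exact update rule is given in Algorithm 1.
    """
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers
            for i in range(n_params)]

# Two workers, two parameters each: the chief returns the element-wise mean.
avg = chief_aggregate([[1.0, 2.0], [3.0, 4.0]])  # → [2.0, 3.0]
```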

#### 5.1. Experiment Configuration

The device used in our experiments is a rotary inverted pendulum (Quanser QUBE™-Servo 2 [26]). It is a highly unstable nonlinear IoT device that is widely used as a benchmark in the nonlinear control engineering field.

#### 5.2. State, Action, and Reward Formulation

#### 5.3. Effect of Gradient Sharing & Transfer Learning

To check how similar the dynamics of the three devices are, we apply the same force to each RIP device (Quanser QUBE™-Servo 2) 100 times in different directions and measure the Pearson correlation between the changes of the motor and pendulum angles across the three devices. The Pearson correlation coefficient [33] is commonly used to quantify the linear relationship between two random variables: it is $+1$ if the two variables X and Y are perfectly positively correlated, 0 if they are uncorrelated, and $-1$ if they are perfectly negatively correlated. Table 2 shows the results of the homogeneity test for the dynamics of the three RIP devices of the same type. As the two tables show, the motor and pendulum angles change differently across the devices even though the same force is applied in each direction over all 100 trials. In particular, for each RIP device, the change in motor angle varies more than the change in pendulum angle. As a result, the dynamics of multiple RIP devices of the same type differ slightly from each other, even though the devices are produced on the same manufacturing line. This means that additional learning at Workers I and III is still needed even after they receive the mature model of Worker II.
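As a concrete reference for the statistic reported in Table 2, a minimal Pearson correlation computation can be sketched as follows; the angle-change samples are hypothetical values for illustration, not measurements from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Identical series give r = +1; a sign-flipped copy gives r = -1.
angles_a = [0.10, 0.12, 0.09, 0.11]
print(round(pearson(angles_a, angles_a), 4))                # 1.0
print(round(pearson(angles_a, [-v for v in angles_a]), 4))  # -1.0
```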

## 6. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: London, UK, 2018.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. *Nature* **2015**, 518, 529–533.
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. *Nature* **2016**, 529, 484–489.
- Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. *arXiv* **2017**, arXiv:1712.01815.
- Vinyals, O.; Babuschkin, I.; Chung, J.; Mathieu, M.; Jaderberg, M.; Czarnecki, W.; Dudzik, A.; Huang, A.; Georgiev, P.; Powell, R.; et al. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. Available online: https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ (accessed on 2 February 2019).
- Wang, L.; Liu, Y.; Zhai, X. Design of Reinforce Learning Control Algorithm and Verified in Inverted Pendulum. In Proceedings of the 34th Chinese Control Conference (CCC), Hangzhou, China, 28–30 July 2015; pp. 3164–3168.
- Chen, M.; Lam, H.; Shi, Q.; Xiao, B. Reinforcement Learning-based Control of Nonlinear Systems using Lyapunov Stability Concept and Fuzzy Reward Scheme. *IEEE Trans. Circuits Syst. II Express Briefs* **2019**, 99, 1.
- Puriel-Gil, G.; Yu, W.; Sossa, H. Reinforcement Learning Compensation based PD Control for Inverted Pendulum. In Proceedings of the 15th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 5–7 September 2018; pp. 1–6.
- Kersandt, K.; Muñoz, G.; Barrado, C. Self-training by Reinforcement Learning for Full-autonomous Drones of the Future. In Proceedings of the IEEE/AIAA 37th Digital Avionics Systems Conference (DASC), London, UK, 23–27 September 2018; pp. 1–10.
- Kim, J.; Lim, H.; Kim, C.; Kim, M.; Hong, Y.; Han, Y. Imitation Reinforcement Learning-Based Remote Rotary Inverted Pendulum Control in OpenFlow Network. *IEEE Access* **2019**, 7, 36682–36690.
- Bonawitz, K.; Eichner, H.; Grieskamp, W. Towards Federated Learning at Scale: System Design. *arXiv* **2019**, arXiv:1902.01046.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. *arXiv* **2017**, arXiv:1707.06347.
- Konda, V. Actor-Critic Algorithms. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2002.
- Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 17 December 2017; pp. 6382–6393.
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1889–1897.
- Konecný, J.; McMahan, H.B.; Ramage, D. Federated Optimization: Distributed Optimization Beyond the Datacenter. *arXiv* **2015**, arXiv:1511.03575.
- McMahan, H.B.; Moore, E.; Ramage, D.; y Arcas, B.A. Federated Learning of Deep Networks using Model Averaging. *arXiv* **2016**, arXiv:1602.05629.
- Konecný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated Learning: Strategies for Improving Communication Efficiency. *arXiv* **2016**, arXiv:1610.05492.
- Torrey, L.; Shavlik, J. Handbook of Research on Machine Learning Applications; IGI Global: Hershey, PA, USA, 2009.
- Glatt, R.; Silva, F.; Costa, A. Towards Knowledge Transfer in Deep Reinforcement Learning. In Proceedings of the 5th Brazilian Conference on Intelligent Systems (BRACIS), Recife, Brazil, 9–12 October 2016; pp. 91–96.
- Nair, A.; Srinivasan, P.; Blackwell, S.; Alcicek, C.; Fearon, R.; Maria, A.D.; Panneershelvam, V.; Suleyman, M.; Beattie, C.; Petersen, S.; et al. Massively Parallel Methods for Deep Reinforcement Learning. *arXiv* **2015**, arXiv:1507.04296.
- Zhuo, H.H.; Feng, W.; Xu, Q.; Yang, Q.; Lin, Y. Federated Reinforcement Learning. *arXiv* **2019**, arXiv:1901.08277.
- Liang, X.; Liu, Y.; Chen, T.; Liu, M.; Yang, Q. Federated Transfer Reinforcement Learning for Autonomous Driving. *arXiv* **2019**, arXiv:1910.06001.
- Liu, B.; Wang, L.; Liu, M. Lifelong Federated Reinforcement Learning: A Learning Architecture for Navigation in Cloud Robotic Systems. *IEEE Robot. Autom. Lett.* **2019**, 4, 4555–4562.
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016.
- Quanser. QUBE-Servo 2. Available online: https://www.quanser.com/products/qube-servo-2/ (accessed on 2 March 2016).
- Strom, N. Scalable Distributed DNN Training using Commodity GPU Cloud Computing. In Proceedings of INTERSPEECH, Dresden, Germany, 6–10 September 2015; pp. 1488–1492.
- Zheng, S.; Meng, Q.; Wang, T.; Chen, W.; Yu, N.; Ma, Z.M.; Liu, T.Y. Asynchronous Stochastic Gradient Descent with Delay Compensation. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
- Srinivasan, A.; Jain, A.; Barekatain, P. An Analysis of the Delayed Gradients Problem in Asynchronous SGD. In Proceedings of the ICLR 2018 Workshop, Vancouver, BC, Canada, 30 April–3 May 2018.
- Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.I.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. *arXiv* **2015**, arXiv:1506.02438.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
- Lindley, D.V. Information Theory and Statistics. Solomon Kullback. *J. Am. Stat. Assoc.* **1959**, 54, 825–827.
- Cohen, J. Statistical Power Analysis for the Behavioral Sciences; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 1988.
- Lim, H.K.; Kim, J.B.; Han, Y.H. Learning Performance Improvement Using Federated Reinforcement Learning Based on Distributed Multi-Agents. In Proceedings of the KICS Fall Conference 2019, Seoul, Korea, 23–24 January 2019; pp. 293–294.

**Figure 4.** Our experiment configuration with multiple rotary inverted pendulum (RIP) devices, multiple workers, and one chief.

**Figure 5.** Effectiveness of the proposed federated reinforcement learning scheme. The blue line represents the score for each round, and the red line represents the weighted moving average (WMA) of the scores from the last 10 rounds. The green dotted line indicates the loss value for each round. (**a**) Change of score and loss values without the proposed scheme. (**b**) Change of score and loss values with the proposed scheme.

Hyper-parameter | Value
---|---
Clipping parameter ($\epsilon$) | 0.9
Model optimization algorithm | Adam
GAE parameter ($\lambda$) | 0.99
Learning rate for the critic model (${\eta}_{\mu}$) | 0.001
Learning rate for the actor model (${\eta}_{\theta}$) | 0.001
Trajectory memory size | 200
Batch size ($U$) | 64
Number of model optimizations in one round ($K$) | 10
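The clipping parameter $\epsilon$ from the table above enters PPO's clipped surrogate objective. The scalar sketch below is our own illustration of that objective, not the authors' implementation; the function name is hypothetical.

```python
def ppo_clip_loss(ratio, advantage, eps=0.9):
    """Negative PPO clipped surrogate for a single sample.

    ratio: pi_new(a|s) / pi_old(a|s); eps matches the table's clipping
    parameter. Taking the min of the unclipped and clipped terms means
    a large policy step earns no extra surrogate reward.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return -surrogate  # negated so an optimizer can minimize it

# With eps = 0.9, a ratio of 2.5 is clipped to 1.9 for a positive advantage.
loss = ppo_clip_loss(2.5, 1.0)  # → -1.9
```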

**Table 2.** Homogeneity test for the dynamics of three RIP devices of the same type.

(a) Pearson correlation matrix of motor angle changes

Motor Angle | RIP I | RIP II | RIP III
---|---|---|---
RIP I | 1 | 0.77 | 0.86
RIP II | - | 1 | 0.75
RIP III | - | - | 1

(b) Pearson correlation matrix of pendulum angle changes

Pendulum Angle | RIP I | RIP II | RIP III
---|---|---|---
RIP I | 1 | 0.98 | 0.96
RIP II | - | 1 | 0.98
RIP III | - | - | 1

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lim, H.-K.; Kim, J.-B.; Heo, J.-S.; Han, Y.-H. Federated Reinforcement Learning for Training Control Policies on Multiple IoT Devices. *Sensors* **2020**, *20*, 1359.
https://doi.org/10.3390/s20051359
