A Multitasking-Oriented Robot Arm Motion Planning Scheme Based on Deep Reinforcement Learning and Twin Synchro-Control
Abstract
1. Introduction
2. Related Work
2.1. Humanoid Robot Control Algorithms Based on Reinforcement Learning
2.2. Combining RL with Demonstrations
2.3. Neural Network Control Learning of Manipulator
2.4. Digital Twin Technology with Synchro-Control
3. Preliminaries
3.1. Humanoid Robot
3.2. Twin Synchro-Control Scheme
3.2.1. Design of Data Acquisition Shelf
3.2.2. Control System of Data Acquisition System
- (a) The lengths of the upper and lower arms of the data acquisition shelf were not exactly the same as those of the demonstrator.
- (b) There were some calibration errors.
- (c) The precision of the resistor disc deteriorated due to mechanical friction and oxidation.
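Because of these error sources, the joint angles read from the resistor discs are noisy, and Algorithm 1 below feeds posterior estimates from a Kalman filter into the reward calculation. As an illustration only, a minimal scalar Kalman filter for smoothing one joint-angle channel could look like the following sketch; the noise variances are assumed values, not parameters from the paper.

```python
import numpy as np

class ScalarKalmanFilter:
    """Minimal 1-D Kalman filter for smoothing a noisy joint-angle signal."""

    def __init__(self, initial_angle, process_var=1e-4, measurement_var=1e-2):
        self.x = initial_angle      # posterior angle estimate (degrees)
        self.p = 1.0                # posterior estimate variance
        self.q = process_var        # process noise variance (assumed value)
        self.r = measurement_var    # measurement noise variance (assumed value)

    def update(self, measurement):
        # Predict: the joint angle is modeled as locally constant.
        self.p += self.q
        # Correct with the new resistor-disc reading.
        k = self.p / (self.p + self.r)       # Kalman gain
        self.x += k * (measurement - self.x)
        self.p *= (1.0 - k)
        return self.x

# Example: smooth a short stream of noisy readings for one joint.
readings = np.array([51.2, 50.7, 51.9, 50.4, 51.1])
kf = ScalarKalmanFilter(initial_angle=readings[0])
smoothed = [kf.update(z) for z in readings]
```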
3.3. Deep Deterministic Policy Gradient with TSC
3.3.1. DDPG Algorithm
3.3.2. Design of Reward Function
Algorithm 1 DDPG with TSC

Randomly initialize critic network Q(s, a | θ^Q) and actor μ(s | θ^μ) with weights θ^Q and θ^μ.
Initialize target networks Q′ and μ′ with weights θ^Q′ ← θ^Q and θ^μ′ ← θ^μ.
Initialize replay buffer M.
Import a group of joint angles recorded by TSC.
Generate posterior estimates of the recorded joint angles with a Kalman filter.
for ep = 1, EPISODE do
  Initialize a random process χ for action exploration.
  Get initial state s1.
  for t = 1, STEP do
    Select action at = μ(st | θ^μ) + χt according to the current policy and exploration noise.
    Execute at to get reward rt, update the joint angle values (θ1t, θ2t, …, θnt), and observe new state st+1.
    Calculate rp from (θ1t, θ2t, …, θnt) and the Kalman posterior estimates.
    Store the transition (st, at, Rt, st+1) in M.
    Sample a random mini-batch of N transitions (si, ai, Ri, si+1) from M.
    Calculate yi with the critic target network and actor target network.
    Update the critic by minimizing the loss.
    Update the actor policy using the sampled policy gradient.
  end for
end for
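To make the update steps above concrete, the following is a minimal sketch of the DDPG core (actor, critic, replay buffer, soft target replacement) with a TSC-style shaping term added to the environment reward. The learning rate, discount factor, soft-replacement rate, batch size, and replay capacity mirror the first-experiment hyper-parameter table below; the network sizes, state/action dimensions, and the quadratic form of the shaping term are illustrative assumptions rather than the authors' implementation.

```python
import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA, TAU = 6, 3, 0.9, 0.01  # dims are assumptions

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())        # mu(s | theta_mu)
critic = mlp(STATE_DIM + ACTION_DIM, 1)              # Q(s, a | theta_Q)
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay = deque(maxlen=80_000)                        # replay buffer M

def tsc_reward(joint_angles, demo_angles, tsc_factor=20.0):
    """Shaping term r_p: penalize deviation of the executed joint angles from
    the TSC/Kalman reference angles. The quadratic form is an assumption."""
    diff = np.asarray(joint_angles) - np.asarray(demo_angles)
    return -tsc_factor * float(np.sum(diff ** 2))

def ddpg_update(batch_size=32):
    batch = random.sample(replay, batch_size)
    s, a, r, s2 = (torch.tensor(np.array(x), dtype=torch.float32) for x in zip(*batch))
    r = r.unsqueeze(1)
    # y_i = R_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) from the target networks.
    with torch.no_grad():
        y = r + GAMMA * critic_tgt(torch.cat([s2, actor_tgt(s2)], dim=1))
    # Update the critic by minimizing the TD loss.
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Update the actor with the sampled deterministic policy gradient.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft replacement of the target network weights.
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1.0 - TAU).add_(TAU * p.data)

# Toy usage: fill the buffer with random transitions and run one update.
for _ in range(64):
    s = np.random.randn(STATE_DIM).astype(np.float32)
    a = np.random.uniform(-1, 1, ACTION_DIM).astype(np.float32)
    r = np.float32(np.random.randn() + tsc_reward(np.random.rand(3), np.random.rand(3)))
    replay.append((s, a, r, np.random.randn(STATE_DIM).astype(np.float32)))
ddpg_update()
```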
4. Implementation
4.1. Trajectory Planning for a Planar 3-DOF Robotic Arm
Algorithm 2 First experiment

Initialize the hyper-parameters of the DDPG with TSC scheme.
Select 50 track points and use B-spline interpolation to obtain the joint trajectory curves of the three joints.
Import the reference joint angles produced by the B-spline interpolation algorithm.
for ep = 1, EPISODE do
  Initialize a random process χ for action exploration.
  Get initial state s1.
  for k = 1, max track points do
    Update the arm target point.
    for t = 1, STEP do
      Select and execute action at to get rt and st+1, and update the joint angle values (θ1t, θ2t, θ3t).
      Calculate rp from (θ1t, θ2t, θ3t) and the reference joint angles.
      Calculate yi with the critic target network and actor target network.
      Update the actor and critic online networks.
      Update the target networks by the soft-replacement parameter.
    end for
  end for
end for
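The first experiment interpolates 50 selected track points into smooth joint trajectory curves with B-splines before training. As a small, hedged example of that preprocessing step, SciPy's `make_interp_spline` can build a cubic B-spline through sparse joint-angle waypoints; the waypoint values below are placeholders loosely based on the track-point table in the Results section, not the full recorded data.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Placeholder joint-angle waypoints (degrees) for the planar 3-DOF arm;
# in the paper these come from the 50 TSC-recorded track points.
t_key = np.linspace(0.0, 1.0, 5)
waypoints = np.array([
    [-50.0,  84.0, 51.0],
    [-45.7,  89.9, 40.8],
    [  0.0,  90.0, 60.0],
    [ 80.0,  90.5, 70.0],
    [130.0,  91.0, 79.0],
])

# Cubic B-spline through the waypoints, one curve per joint (columns).
spline = make_interp_spline(t_key, waypoints, k=3, axis=0)

# Densely sampled joint trajectory used as the reference during training.
t_dense = np.linspace(0.0, 1.0, 200)
trajectory = spline(t_dense)   # shape (200, 3): theta1, theta2, theta3
```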
4.2. Humanoid Robot Arm Simulation
5. Results
5.1. First Experiment
5.2. Second Experiment
6. Discussion and Future Work
Author Contributions
Funding
Conflicts of Interest
References
Ti | Joint 1 | Joint 2 | Joint 3 |
---|---|---|---|
T1 | −50° | 84° | 51° |
T2 | −45.7° | 89.9° | 40.8° |
… | … | … | … |
T50 | 130° | 91° | 79° |
DDPG Setup Hyper-Parameters | |
---|---|
Actor/Critic learning rate | 1 × 10⁻³ |
Reward discount factor | 0.9 |
Soft replacement | 0.01 |
Batch size | 32 |
Running episodes | 300 |
Number of track points | 50 |
Training steps per update | 200 |
Memory capacity | 80,000 |
Updates | episodes × points × steps |
Range constraints of the joints:

Arm | Part | Joint | Axis | Range |
---|---|---|---|---|
Left arm | Shoulder | Joint 1 | Pitch | −20°~20° |
Left arm | Shoulder | Joint 2 | Roll | −5°~10° |
Left arm | Elbow | Joint 3 | Pitch | −10°~60° |
Right arm | Shoulder | Joint 1 | Pitch | −20°~20° |
Right arm | Shoulder | Joint 2 | Roll | −5°~10° |
Right arm | Elbow | Joint 3 | Pitch | −10°~60° |
Steering wheel | – | – | – | −90°~90° |
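When mapping actor outputs to joint commands, the commands must respect the range constraints listed above. A minimal sketch of such a clipping step is shown below; the joint names and the dictionary-based interface are assumptions made for illustration only.

```python
import numpy as np

# Joint limits (degrees) from the table above; the naming and ordering of the
# joints are illustrative assumptions.
JOINT_LIMITS_DEG = {
    "left_joint1_pitch":  (-20.0, 20.0),
    "left_joint2_roll":   (-5.0, 10.0),
    "left_joint3_pitch":  (-10.0, 60.0),
    "right_joint1_pitch": (-20.0, 20.0),
    "right_joint2_roll":  (-5.0, 10.0),
    "right_joint3_pitch": (-10.0, 60.0),
    "steering_wheel":     (-90.0, 90.0),
}

def clip_to_limits(command_deg):
    """Clip a dict of joint commands (degrees) to the ranges in the table."""
    return {name: float(np.clip(angle, *JOINT_LIMITS_DEG[name]))
            for name, angle in command_deg.items()}

# Example: an out-of-range shoulder roll command is clipped to 10°.
safe = clip_to_limits({"left_joint2_roll": 14.2, "steering_wheel": -30.0})
```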
Algorithm Setup Hyper-Parameters | |
---|---|
Actor/Critic learning rate | 0.001 |
Reward discount factor | 0.9 |
Soft replacement | 0.001 |
Batch size | 32 |
Running episodes | 500 |
Steps per episode | 200 |
Memory capacity | 15,000 |
Updates | episodes × steps |
Angle factor | 0.5 |
Distance factor | 0.5 |
TSC factor | 20 |
Control cycle (s) | 0.125 |
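The angle, distance, and TSC factors listed above weight the terms of the shaped reward used in the second experiment. The exact functional form is not reproduced here; the sketch below assumes a simple weighted sum, with `angle_error`, `ee_distance`, and `tsc_deviation` as hypothetical inputs.

```python
def shaped_reward(angle_error, ee_distance, tsc_deviation,
                  angle_factor=0.5, distance_factor=0.5, tsc_factor=20.0):
    """Hypothetical weighted-sum reward combining the three factors in the
    hyper-parameter table; the combination actually used may differ."""
    return -(angle_factor * angle_error
             + distance_factor * ee_distance
             + tsc_factor * tsc_deviation)

# Example: small end-effector distance and small deviation from the
# TSC-demonstrated joint angles yield a reward close to zero.
r = shaped_reward(angle_error=0.1, ee_distance=0.05, tsc_deviation=0.01)
```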
Setting No. | Actor/Critic Learning Rate | Reward Discount Factor | Batch Size |
---|---|---|---|
1 | 0.001 | 0.6 | 32 |
2 | 0.001 | 0.8 | 16 |
3 | 0.001 | 0.8 | 32 |
4 | 0.001 | 0.8 | 64 |
5 | 0.001 | 0.8 | 128 |
6 | 0.001 | 0.8 | 256 |
7 | 0.001 | 0.9 | 16 |
8 | 0.001 | 0.9 | 32 |
9 | 0.001 | 0.9 | 64 |
10 | 0.001 | 0.9 | 128 |
11 | 0.001 | 0.9 | 256 |
12 | 0.005 | 0.9 | 16 |
13 | 0.005 | 0.9 | 32 |
14 | 0.005 | 0.9 | 64 |
15 | 0.005 | 0.8 | 64 |
16 | 0.005 | 0.9 | 128 |
17 | 0.005 | 0.9 | 256 |