# Feasibility Analysis and Application of Reinforcement Learning Algorithm Based on Dynamic Parameter Adjustment


## Abstract


## 1. Introduction

## 2. Temporal-Difference Learning

## 3. The Method of Dynamic Adjustment Learning Rate and Convergence Proof

#### 3.1. Dynamic Regulation Method Based on Temporal-Difference

#### 3.2. Mathematics Model and Convergence of Temporal-Difference

**Theorem 1.**

**Proof.**

**Theorem 2.**

**Theorem 3.**

- (1) Finite state space;
- (2) $\sum_{i=1}^{\infty}\alpha_{i}=\infty,\ \sum_{i=1}^{\infty}\alpha_{i}^{2}<+\infty$.
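The classic schedule satisfying both step-size conditions is the harmonic sequence $\alpha_i = 1/i$: its partial sums grow without bound, while its squared partial sums stay bounded (converging to $\pi^2/6 \approx 1.6449$). A minimal numerical illustration (the choice $\alpha_i = 1/i$ is ours, for illustration only, not a schedule prescribed by the paper):

```python
# Check the two step-size conditions numerically for alpha_i = 1/i:
# sum(alpha_i) should keep growing, sum(alpha_i^2) should level off.
def partial_sums(n):
    s, s2 = 0.0, 0.0
    for i in range(1, n + 1):
        a = 1.0 / i
        s += a
        s2 += a * a
    return s, s2

s_small, s2_small = partial_sums(1_000)
s_large, s2_large = partial_sums(1_000_000)

# The plain sum diverges (roughly ln(n), so ~7.49 vs ~14.39) ...
print(s_small, s_large)
# ... while the squared sum converges toward pi^2/6 ~= 1.6449.
print(s2_small, s2_large)
```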

**Proof.**

#### 3.3. Convergence Relation between Approximation Method and Dynamic Regulation Learning Rate

- (1) $\alpha_{k}\geq 0,\ k=1,2,3,\dots$;
- (2) $\sum_{k=1}^{\infty}\alpha_{k}=\infty$, so that an arbitrary convergence point can be reached regardless of the initial value;
- (3) $\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty$, so that the accumulated noise stays bounded and does not prevent convergence.
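Under step-size conditions of this kind, a stochastic-approximation iterate $x_{k+1} = x_k + \alpha_k(\text{sample}_k - x_k)$ converges to the target despite observation noise. A small sketch, assuming a Gaussian noise model and a target value of our choosing (neither is taken from the paper):

```python
import random

random.seed(0)

# Robbins-Monro-style iteration estimating a fixed target (here 5.0)
# from noisy samples, using the diminishing schedule alpha_k = 1/k,
# which satisfies all three conditions above.
target = 5.0
x = 0.0
for k in range(1, 200_001):
    sample = target + random.gauss(0.0, 1.0)  # noisy observation
    alpha = 1.0 / k                           # diminishing step size
    x += alpha * (sample - x)

print(x)  # close to 5.0
```

With $\alpha_k = 1/k$ the iterate is exactly the running sample mean, which makes the convergence easy to see; a constant step size would instead leave a persistent noise floor.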

## 4. Experiment

#### 4.1. Learning Rate Order of Magnitude Initial Determination

#### 4.2. Convergence and Rationality Are Combined to Determine the Learning Rate

#### 4.3. Experimental Results and Analysis

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References


**Figure 2.** A new dynamic learning-rate adjustment framework that is better suited to deep reinforcement learning.

| Hyperparameter | Value |
|---|---|
| Replay buffer size | 20,000 |
| Batch size | 32 |
| Discount factor $\gamma$ | 0.99 |
| Learning rate | Static or dynamic |
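The settings in the table translate directly into a configuration block. A sketch assuming a typical DQN-style trainer (the field names are illustrative, not from the paper's code):

```python
# Shared experiment settings from the hyperparameter table.
config = {
    "replay_buffer_size": 20_000,
    "batch_size": 32,
    "gamma": 0.99,  # discount factor
}

# The learning rate is the only quantity that varies between runs:
# a fixed float for the static baselines, or a per-update value
# chosen from the TD error for the dynamic method.
static_lr = 1e-3

print(config["gamma"], config["batch_size"])
```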

| Learning Rate | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $lr=0.1$ | −198.52 | −197.91 | −200 | −199.64 | −200 | −200 | −200 | −200 | −200 | −200 | −199.607 |
| $lr=0.01$ | −170.57 | −153.68 | −168.26 | −168.82 | −159.43 | −169.99 | −172.21 | −183.17 | −166.16 | −182.69 | −169.498 |
| $lr=0.001$ | −118.23 | −116.03 | −114.3 | −120.0 | −124.69 | −123.25 | −120.29 | −126.11 | −110.97 | −130.76 | −120.463 |
| $lr=0.0001$ | −119.94 | −123.38 | −125.16 | −130.38 | −126.16 | −129.12 | −110.19 | −122.44 | −117.18 | −118.78 | −122.273 |
| $lr=0.00001$ | −200 | −200 | −200 | −200 | −179.53 | −114.09 | −200 | −200 | −200 | −200 | −189.362 |

| TD Error | $lr$ |
|---|---|
| $TD \geq 0$ | 0.0001 |
| $-0.005 \leq TD < 0$ | 0.0001 |
| $-0.05 \leq TD < -0.005$ | 0.0002 |
| $TD < -0.05$ | 0.001 |
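The schedule in the table maps each TD-error band to a learning rate and can be written as a simple selection function. This is a direct transcription of the table; the function name itself is ours:

```python
def dynamic_learning_rate(td_error: float) -> float:
    """Pick the learning rate from the TD-error band in the table above."""
    if td_error >= 0:
        return 0.0001
    if td_error >= -0.005:   # -0.005 <= TD < 0
        return 0.0001
    if td_error >= -0.05:    # -0.05 <= TD < -0.005
        return 0.0002
    return 0.001             # TD < -0.05

print(dynamic_learning_rate(0.1))    # 0.0001
print(dynamic_learning_rate(-0.01))  # 0.0002
print(dynamic_learning_rate(-0.1))   # 0.001
```

Larger negative TD errors select a larger learning rate, so the update steps grow when the value estimates are furthest from their targets and shrink again as training stabilizes.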

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Li, M.; Gu, X.; Zeng, C.; Feng, Y.
Feasibility Analysis and Application of Reinforcement Learning Algorithm Based on Dynamic Parameter Adjustment. *Algorithms* **2020**, *13*, 239.
https://doi.org/10.3390/a13090239

**AMA Style**

Li M, Gu X, Zeng C, Feng Y.
Feasibility Analysis and Application of Reinforcement Learning Algorithm Based on Dynamic Parameter Adjustment. *Algorithms*. 2020; 13(9):239.
https://doi.org/10.3390/a13090239

**Chicago/Turabian Style**

Li, Menglin, Xueqiang Gu, Chengyi Zeng, and Yuan Feng.
2020. "Feasibility Analysis and Application of Reinforcement Learning Algorithm Based on Dynamic Parameter Adjustment" *Algorithms* 13, no. 9: 239.
https://doi.org/10.3390/a13090239