Optimizing Reinforcement Learning Using a Generative Action-Translator Transformer
Abstract
1. Introduction
- A novel sequence-decision-based reinforcement learning method: We introduce the ATT network model, built on the Sequence-to-Sequence architecture used in text translation. It predicts actions from observable scene information, uses the Transformer to identify trajectories with the maximum reward return across different game environments, and is, to our knowledge, the first to cast the reinforcement learning task in a translation-task framework.
- Encoding form adapted for text translation models: We devise an encoding of the basic reinforcement learning elements (state, action, reward) and add positional information to tailor it to the language model’s training process (a minimal encoding sketch follows this list).
- Based on a review of the existing literature, we analyze future development directions and challenges from the perspectives of reinforcement learning algorithms and other tasks in the field of natural language processing. The purpose of this discussion is to help researchers better comprehend the key aspects of combining reinforcement learning with large models and encourage the application of more language models in reinforcement learning tasks.
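To make the encoding idea concrete, the following is a minimal sketch assuming a DT-style interleaving of (return-to-go, state, action) tokens that share a per-step timestep index. The function name `encode_trajectory` and the exact token layout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def encode_trajectory(states, actions, rewards):
    """Interleave (return-to-go, state, action) tokens with shared timestep indices.

    Assumption: ATT consumes DT-style triples plus a per-step position index;
    the paper's exact token layout may differ.
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    # Return-to-go at step t: sum of rewards from step t to the end of the episode.
    returns_to_go = np.flip(np.cumsum(np.flip(rewards)))
    tokens, positions = [], []
    for t, (g, s, a) in enumerate(zip(returns_to_go, states, actions)):
        tokens += [("return", float(g)), ("state", s), ("action", a)]
        positions += [t, t, t]  # all three tokens of one step share a timestep index
    return tokens, positions
```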
2. Background
2.1. Markov Decision Process (MDP)
- S: a finite set of states, where s_i denotes the state at step i.
- A: a finite set of actions, where a_i denotes the action at step i.
- P(s' | s, a): the state-transition probability, i.e., the probability of transitioning to state s' after taking action a in the current state s.
- R(s, a): the immediate or expected reward obtained from the state transition.
- γ: the discount factor that weights the influence of future rewards on the current state (a minimal code sketch of this tuple follows the list).
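As a concrete illustration of the five-tuple (S, A, P, R, γ), here is a minimal, self-contained sketch of a tabular MDP and the discounted return; the class and function names are ours, chosen for illustration only.

```python
import random
from dataclasses import dataclass
from typing import Dict, Tuple

State, Action = str, str

@dataclass
class MDP:
    """A toy tabular MDP (S, A, P, R, gamma); names and structure are illustrative only."""
    transitions: Dict[Tuple[State, Action], Dict[State, float]]  # P(s' | s, a)
    rewards: Dict[Tuple[State, Action], float]                   # R(s, a)
    gamma: float                                                 # discount factor

    def step(self, s: State, a: Action) -> Tuple[State, float]:
        dist = self.transitions[(s, a)]
        s_next = random.choices(list(dist), weights=list(dist.values()))[0]
        return s_next, self.rewards[(s, a)]

def discounted_return(reward_sequence, gamma):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(reward_sequence))

mdp = MDP(
    transitions={("s0", "a"): {"s0": 0.2, "s1": 0.8}, ("s1", "a"): {"s1": 1.0}},
    rewards={("s0", "a"): 1.0, ("s1", "a"): 0.0},
    gamma=0.99,
)
next_state, reward = mdp.step("s0", "a")
```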
2.2. Reinforcement Learning Algorithms
2.3. Transformer
2.4. Transformer in RL
- The Transformer has a large number of parameters and demands substantial data and computational resources to converge, while RL itself suffers from low sample efficiency.
- RL receives state observations sequentially in chronological order, yet this ordering is not given to the Transformer explicitly and must be learned.
- RL algorithms are highly sensitive to the architecture of deep neural networks.
2.5. Decision Transformer (DT)
3. Action-Translator Transformer
3.1. Problem Description
3.2. Model Formulation
3.2.1. Basic Introduction of the Model
3.2.2. Overall Structure of the Model
3.2.3. Model Details
3.2.4. Training and Inference
Algorithm 1 ATT training: optimize network parameters based on offline trajectories.

Algorithm 2 ATT inference: generate full trajectories based on target rewards.
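The sketch below illustrates the kind of supervised training loop that Algorithm 1's caption describes, assuming a PyTorch model that maps (return-to-go, state, action, timestep) context to predicted actions and an MSE loss for continuous control. The batch layout and model signature are assumptions for illustration; ATT's actual optimization details are given in the paper.

```python
import torch

def train_att(model, dataloader, optimizer, epochs=10, device="cpu"):
    """Sketch of the loop named in Algorithm 1: supervised action prediction on offline data.

    Assumes each batch holds (returns_to_go, states, actions, timesteps) tensors and that
    `model` predicts the action sequence from them; both are illustrative assumptions.
    """
    loss_fn = torch.nn.MSELoss()  # MSE for continuous MuJoCo actions; cross-entropy if discrete
    model.train()
    for _ in range(epochs):
        for returns_to_go, states, actions, timesteps in dataloader:
            returns_to_go, states, actions, timesteps = (
                x.to(device) for x in (returns_to_go, states, actions, timesteps)
            )
            pred_actions = model(returns_to_go, states, actions, timesteps)
            loss = loss_fn(pred_actions, actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```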
4. Experimental Validation
4.1. Experimental Environment
4.2. Dataset
- Medium dataset: A policy is trained online with Soft Actor-Critic (SAC), training is terminated prematurely, and 1M samples are collected from this partially trained policy.
- Medium-replay dataset: All samples recorded in the replay buffer during training until the policy reaches a “medium” performance level.
- Medium-expert dataset: An equal mix of expert demonstrations and sub-optimal data, the latter generated either by a partially trained policy or by unrolling a uniform random policy (a loading example follows this list).
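For reference, these datasets can be loaded through the `d4rl` package, which registers them as Gym environments. The snippet below is a minimal sketch; the exact version suffix (`-v0`, `-v2`) depends on the installed d4rl release.

```python
import gym
import d4rl  # noqa: F401  (importing d4rl registers the offline datasets with gym)

# Dataset names combine environment and quality level, e.g. 'hopper-medium-v2',
# 'walker2d-medium-expert-v2', 'halfcheetah-medium-replay-v2'.
env = gym.make("hopper-medium-replay-v2")
data = env.get_dataset()  # dict of numpy arrays: observations, actions, rewards, terminals
print(data["observations"].shape, data["actions"].shape, data["rewards"].shape)
```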
4.3. Experimental Results
4.4. Discussion
5. Conclusions
- Target Reward Setting: In both ATT and DT, training follows a supervised learning setup that requires a target reward (return) value to serve as the trajectory label on the offline dataset. Fluctuations in the experimental results may be attributed to how accurately this target reward is set. Future research should explore methods to set the target reward precisely or to update it dynamically during training (a return-to-go sketch follows this list).
- Efficient Language Model Training: The language model serving as the underlying training architecture demands a large amount of data to train effectively. While reinforcement learning can generate trajectory data through interaction, doing so is too inefficient to be practical for online reinforcement learning. Investigating how to train the language model in an online, interactive manner is therefore a focal point for future work and would expand the model’s usability.
- Exploration of Different NLP Models: The field of NLP boasts numerous powerful models, each with distinct characteristics and adaptive capabilities. Future endeavors can involve experimenting with other high-performing NLP models, aligning them with various reinforcement learning tasks, and further exploring the potential of combining these technologies.
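Regarding the target-reward point above, a common DT-style practice is to train on per-step returns-to-go computed from the offline data and to set the initial target return at inference as a multiple of the best return in the dataset. The sketch below illustrates that heuristic; it is an assumption for illustration, not ATT's exact rule.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Per-step return-to-go, used as the conditioning label during training."""
    rtg = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Toy offline data: reward sequences of two trajectories.
trajectories = [[1.0, 0.0, 2.0], [0.5, 0.5]]
best_return = max(returns_to_go(r)[0] for r in trajectories)
target_return = 1.0 * best_return  # the multiplier is a tunable design choice, not a fixed rule
```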
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Error Analysis of Reinforcement Learning Based on MDP
Appendix A.1. Maximization
Appendix A.2. Bootstrapping
References
- Lu, Y.; Li, W. Techniques and Paradigms in Modern Game AI Systems. Algorithms 2022, 15, 282. [Google Scholar] [CrossRef]
- Prudencio, R.F.; Maximo, M.R.O.A.; Colombini, E.L. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Trans. Neural Netw. Learn. Syst. 2023. [Google Scholar] [CrossRef] [PubMed]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems 33, Online, 6–12 December 2020; pp. 1877–1901. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Dey, R.; Salem, F.M. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits And Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017; pp. 1597–1600. [Google Scholar]
- Parisotto, E.; Song, F.; Rae, J.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.; Jaderberg, M.; Kaufman, R.L.; Clark, A.; Noury, S.; et al. Stabilizing transformers for reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 7487–7498. [Google Scholar]
- Banino, A.; Badia, A.P.; Walker, J.; Scholtes, T.; Mitrovic, J.; Blundell, C. CoBERL: Contrastive BERT for reinforcement learning. In Proceedings of the Tenth International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022. [Google Scholar]
- Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. In Proceedings of the Advances in Neural Information Processing Systems 34, Online, 6–14 December 2021; pp. 15084–15097. [Google Scholar]
- Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems 26, Lake Tahoe, NV, USA, 5–10 December 2013. [Google Scholar]
- Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the Advances in Neural Information Processing Systems 12, Denver, CO, USA, 29 November–4 December 1999. [Google Scholar]
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937. [Google Scholar]
- Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A simple neural attentive meta-learner. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar]
- Kumar, S.; Parker, J.; Naderian, P. Adaptive transformers in RL. arXiv 2020, arXiv:2004.03761. [Google Scholar]
- Goodman, S.; Ding, N.; Soricut, R. TeaForN: Teacher-forcing with n-grams. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 8704–8717. [Google Scholar]
- Todorov, E.; Erez, T.; Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 5026–5033. [Google Scholar]
- Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; Levine, S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv 2020, arXiv:2004.07219. [Google Scholar]
- Kumar, A.; Zhou, A.; Tucker, G.; Levine, S. Conservative q-learning for offline reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems 33, Online, 6–12 December 2020; pp. 1179–1191. [Google Scholar]
- Kumar, A.; Fu, J.; Tucker, G.; Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019; pp. 11761–11771. [Google Scholar]
- Wu, Y.; Tucker, G.; Nachum, O. Behavior regularized offline reinforcement learning. In Proceedings of the Asian Conference on Machine Learning (ACML 2021), Virtual Event, 17–19 November 2021; pp. 204–219. [Google Scholar]
- Peng, X.; Kumar, A.; Zhang, G.; Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv 2019, arXiv:1910.00177. [Google Scholar]
- Torabi, F.; Warnell, G.; Stone, P. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 4950–4957. [Google Scholar]
| Environment | Dataset | DT () | DT () |
|---|---|---|---|
| Walker | Medium | 74.0 | 73.1 |
| Walker | Medium-Expert | 108.1 | 108.3 |
| Walker | Medium-Replay | 66.6 | 67.2 |
| Hopper | Medium | 67.6 | 67.0 |
| Hopper | Medium-Expert | 107.6 | 108.0 |
| Hopper | Medium-Replay | 82.7 | 78.4 |
| Dataset | Environment | ATT (Ours) | DT | CQL | BEAR | BRAC-v | AWR | BC |
|---|---|---|---|---|---|---|---|---|
| Medium-expert | Hopper | 109.8 | 107.6 | 111.0 | 96.3 | 0.8 | 27.1 | 76.9 |
| Medium-expert | Walker | 110.2 | 108.1 | 98.7 | 40.1 | 81.6 | 53.8 | 36.6 |
| Medium-expert | HalfCheetah | 88.9 | 86.8 | 62.4 | 53.4 | 41.9 | 52.7 | 59.5 |
| Medium | Hopper | 68.4 | 67.6 | 58.0 | 52.1 | 31.1 | 35.9 | 63.9 |
| Medium | Walker | 82.0 | 74.0 | 79.2 | 59.1 | 81.1 | 17.4 | 77.4 |
| Medium | HalfCheetah | 40.3 | 42.6 | 44.4 | 41.7 | 46.3 | 37.4 | 43.1 |
| Medium-replay | Hopper | 79.3 | 82.7 | 48.6 | 33.7 | 0.6 | 28.4 | 27.6 |
| Medium-replay | Walker | 68.7 | 66.6 | 26.7 | 19.2 | 0.9 | 15.5 | 36.9 |
| Medium-replay | HalfCheetah | 36.3 | 36.6 | 46.2 | 38.6 | 47.7 | 40.3 | 4.3 |
| Environment | ATT | DT | CQL |
|---|---|---|---|
| Maze2D-umaze | 42.2 | 31.0 | 94.7 |
| Maze2D-medium | 13.7 | 8.2 | 41.8 |
| Maze2D-large | 10.2 | 2.3 | 49.6 |