# LLM-Informed Multi-Armed Bandit Strategies for Non-Stationary Environments


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Multi-Armed Bandit

## 4. Methodology

#### 4.1. Non-Stationary Multi-Armed Bandit

#### 4.2. Strategies

- Comparative simplicity: Both epsilon-greedy and UCB strategies are simpler in their implementation compared to Thompson sampling. These strategies provide clear baselines for comparison, allowing us to measure the impact of the LLM-informed strategy against well-understood and straightforward mechanisms [2].
- Demonstrated effectiveness: While Thompson sampling has its advantages, epsilon-greedy [26] and UCB strategies [15] have been extensively studied and proven effective in a wide variety of scenarios. They provide solid and reliable benchmarks, against which the novel LLM-informed strategy can be compared.
- Novelty of LLM-informed strategy: The main goal of our study was to explore and demonstrate the potential of leveraging LLMs [9] in the multi-armed bandit problem. By focusing on comparing this novel strategy with simpler, well-known strategies, we aimed to isolate and highlight the impact of LLM advice on problem solving.
- Computational resources: Thompson sampling [27] often requires more computational resources than the epsilon-greedy and UCB strategies, because it must sample from posterior probability distributions at every decision step. As our study included large-scale experiments, we excluded Thompson sampling to limit computational resource consumption.
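For reference, the arm-selection rules of the two baselines can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual code; the function names and the exploration constant `c` are our own choices.

```python
import math
import random

def epsilon_greedy_select(estimates, epsilon=0.1):
    """With probability epsilon explore a random arm; otherwise exploit
    the arm with the highest estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])

def ucb_select(estimates, counts, t, c=2.0):
    """UCB1-style rule: pick the arm maximizing estimate + confidence bonus.
    t is the current time step; counts[a] is how often arm a was pulled."""
    for a, n in enumerate(counts):
        if n == 0:  # play every arm once before the bonus is well-defined
            return a
    return max(range(len(estimates)),
               key=lambda a: estimates[a] + c * math.sqrt(math.log(t) / counts[a]))
```

Both rules are cheap to evaluate: a constant number of arithmetic operations per arm, with no sampling from probability distributions, which is the computational contrast with Thompson sampling noted above.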

#### 4.2.1. Strategy Epsilon-Greedy

#### 4.2.2. UCB Strategy

#### 4.2.3. LLM-Informed Strategy

#### 4.2.4. Quantized Low-Rank Adapters

## 5. Experiments and Results

#### 5.1. Epsilon-Greedy

#### 5.2. Alternative Strategies: UCB and Thompson Sampling

#### 5.3. Parametrized Distributions of Bandits

#### 5.4. Non-Stationary Bandits

#### 5.4.1. Graphical Display

#### 5.4.2. Performance Metrics beyond Average Rewards

#### 5.4.3. Convergence Analysis

#### 5.5. LLM-Informed Strategy

- Game state definition: The game state could encapsulate an array of information, including the total rewards accrued from each bandit, the frequency with which each bandit is selected, and the average reward obtained from each bandit. These data must be translated into a format that can be readily comprehended by the LLM.
- Strategy recommendation request: This game state information can be utilized to request a strategy recommendation from the LLM. It is crucial to structure the prompt in a manner that clearly articulates the game state and seeks a specific output (e.g., the designation of a strategy).
- Output interpretation: The LLM output must then be translated back into a form that can be interpreted by the bandit selection algorithm. This could be as straightforward as mapping strategy names to corresponding functions within the code.
- Recommended strategy implementation: The final step entails utilizing the strategy recommended by the model to decide the next bandit to be selected.
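The four steps above can be sketched as follows. This is an illustrative Python sketch only: the prompt wording, the restricted 'explore'/'exploit' answer vocabulary, and all function names are our assumptions, not the exact prompts used in the experiments.

```python
def format_game_state(counts, totals):
    """Step 1: serialize pull counts and average rewards into text the
    LLM can read."""
    lines = []
    for a, (n, tot) in enumerate(zip(counts, totals)):
        avg = tot / n if n else 0.0
        lines.append(f"arm {a}: pulls={n}, avg_reward={avg:.3f}")
    return "\n".join(lines)

def build_prompt(state):
    """Step 2: articulate the game state and request a specific output."""
    return ("You are advising a multi-armed bandit agent.\n"
            f"Current state:\n{state}\n"
            "Answer with exactly one word: 'explore' or 'exploit'.")

def interpret(reply):
    """Step 3: map free-form model output back onto an action the bandit
    selection algorithm understands."""
    return "explore" if "explore" in reply.lower() else "exploit"
```

Step 4 then simply dispatches on the interpreted string, e.g. mapping `"explore"` to a random arm and `"exploit"` to the current best-estimate arm.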

- Observe the state of the bandits.
- Decide on a strategy, either to explore (choose a bandit randomly) or exploit (choose the bandit currently known to give the highest reward).
- Apply the chosen strategy, i.e., pull the arm of the bandit selected in step 2.
- Receive a reward from the bandit that was chosen.
- Update the knowledge about the bandit that was chosen, based on the reward received.
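The loop above can be written as a minimal epsilon-greedy agent. This is a sketch under our own naming assumptions; `pull` stands for the environment, which returns a reward for the chosen arm.

```python
import random

def run_bandit(pull, n_arms, steps=1000, epsilon=0.1):
    """One pass of the observe / decide / apply / receive / update loop.
    pull(arm) is the environment callback returning a reward."""
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        # Decide: explore with probability epsilon, otherwise exploit.
        if random.random() < epsilon:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = pull(arm)                 # apply the choice, receive reward
        counts[arm] += 1                   # update knowledge of that arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Usage: two Bernoulli arms with success probabilities 0.3 and 0.7.
# est, cnt = run_bandit(lambda a: float(random.random() < [0.3, 0.7][a]), 2)
```

The incremental-mean update in the last line of the loop is what makes step 5 ("update the knowledge") cheap: no reward history needs to be stored.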

- Firstly, we incorporate a caching mechanism to store the previous LLM recommendations. By doing so, we eliminate the need for redundant API calls when the state of the game has not changed significantly, thereby conserving resources and increasing efficiency. The state of the game is represented as a string summarizing the pull count and estimated average reward for each bandit, which is then used as the key in the recommendation cache.
- Secondly, our implementation is designed to handle potential exceptions that may occur during interaction with the OpenAI API. Specifically, we implement an exponential backoff strategy, which essentially means that if an API call fails, the system waits for a certain amount of time before retrying, with the wait time increasing exponentially after each consecutive failure. This mechanism provides robustness against temporary network issues or API rate-limiting, enhancing the overall reliability of the system.
- Lastly, we introduce a threshold (δ) for determining significant changes in the bandit state. This parameter is important because it governs when a new strategy recommendation must be requested from the LLM. If the change in the bandit state falls below the threshold, the system reuses the previous recommendation, once again avoiding unnecessary API calls. The threshold can be fine-tuned to balance responsiveness to changes against the number of API requests.

#### 5.6. Utilizing QLoRA with A100 GPU

- In the first step, we defined the state of the game, converting the relevant data into a format comprehensible to QLoRA.
- Then, we made a strategy recommendation request, using the game state information to prompt QLoRA for a strategy.
- After receiving the QLoRA output, we interpreted it, translating it into a form that the bandit selection algorithm could understand and act upon.
- Finally, we implemented the recommended strategy to determine the next bandit to choose.

## 6. Applications of the LLM-Informed Strategy in Various Fields

#### 6.1. Digital Marketing

#### 6.2. Healthcare

#### 6.3. Reinforcement Learning

#### 6.4. Robotics

#### 6.5. Biology and Life Sciences

#### 6.6. Finance

#### 6.7. Challenges and Discussion

## 7. Conclusions and Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| AI | Artificial Intelligence |
| RL | Reinforcement Learning |
| GPT | Generative Pretrained Transformer |
| QLoRA | Quantized Low-Rank Adapters |
| GPU | Graphics Processing Unit |
| MAB | Multi-Armed Bandit |
| LLM | Large Language Model |
| UCB | Upper Confidence Bound |

## References

1. Robbins, H. Some aspects of the sequential design of experiments. *Bull. Am. Math. Soc.* **1952**, *58*, 527–535.
2. Sutton, R.S.; Barto, A.G. *Reinforcement Learning: An Introduction*; MIT Press: Cambridge, MA, USA, 2018.
3. Besbes, O.; Gur, Y.; Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014.
4. Russac, Y.; Vernade, C.; Cappé, O. Weighted linear bandits for non-stationary environments. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. *arXiv* **2020**, arXiv:2010.11929.
7. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. *arXiv* **2022**, arXiv:2204.14198.
8. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. *arXiv* **2022**, arXiv:2206.07682.
9. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. *Adv. Neural Inf. Process. Syst.* **2020**, *33*, 1877–1901.
10. Muglich, D.; de Witt, C.S.; Pol, E.V.; Whiteson, S.; Foerster, J. Equivariant networks for zero-shot coordination. *Adv. Neural Inf. Process. Syst.* **2022**, *35*, 6410–6423.
11. Shah, D.; Osiński, B.; ichter, B.H.; Levine, S. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Proceedings of the 6th Conference on Robot Learning, Proceedings of Machine Learning Research, PMLR, Atlanta, GA, USA, 6–9 November 2023; Volume 205, pp. 492–504.
12. Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023.
13. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-rank adaptation of large language models. *arXiv* **2021**, arXiv:2106.09685.
14. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. *arXiv* **2023**, arXiv:2305.14314.
15. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. *Mach. Learn.* **2002**, *47*, 235–256.
16. Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. *Biometrika* **1933**, *25*, 285–294.
17. Silva, N.; Werneck, H.; Silva, T.; Pereira, A.C.M.; Rocha, L. Multi-armed bandits in recommendation systems: A survey of the state-of-the-art and future directions. *Expert Syst. Appl.* **2022**, *197*, 116669.
18. Cavenaghi, E.; Sottocornola, G.; Stella, F.; Zanker, M. Non stationary multi-armed bandit: Empirical evaluation of a new concept drift-aware algorithm. *Entropy* **2021**, *23*, 380.
19. Zhao, P.; Zhang, L.; Jiang, Y.; Zhou, Z. A simple approach for non-stationary linear bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Online, 26–28 August 2020; pp. 746–755.
20. Garivier, A.; Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, JMLR, Budapest, Hungary, 9–11 June 2011.
21. Cesa-Bianchi, N.; Lugosi, G. *Prediction, Learning, and Games*; Cambridge University Press: Cambridge, UK, 2017.
22. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
23. Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
24. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. *Improving Language Understanding by Generative Pre-Training*; Princeton University: Princeton, NJ, USA, 2018.
25. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. *OpenAI Blog* **2019**, *1*, 9.
26. Tokic, M. Adaptive ϵ-greedy exploration in reinforcement learning based on value differences. In *KI 2010: Advances in Artificial Intelligence*; Dillmann, R., Beyerer, J., Hanebeck, U.D., Schultz, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 203–210.
27. Russo, D.; Roy, B.V.; Kazerouni, A.; Osband, I.; Wen, Z. A tutorial on Thompson sampling. *Found. Trends Mach. Learn.* **2018**, *11*, 1–96.
28. Rosin, C.D.; Belew, R.K. New methods for competitive coevolution. *Evol. Comput.* **1997**, *5*, 1–29.
29. Wooldridge, M. *An Introduction to MultiAgent Systems*; John Wiley & Sons: Hoboken, NJ, USA, 2009.
30. Oroojlooy, A.; Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. *arXiv* **2022**, arXiv:1908.03963.
31. Dettmers, T.; Lewis, M.; Shleifer, S.; Zettlemoyer, L. 8-bit optimizers via block-wise quantization. In Proceedings of the 9th International Conference on Learning Representations, ICLR, Virtual, 25 April 2022.
32. Wortsman, M.; Dettmers, T.; Zettlemoyer, L.; Morcos, A.; Farhadi, A.; Schmidt, L. Stable and low-precision training for large-scale vision-language models. *arXiv* **2023**, arXiv:2304.13013.

**Figure 1.** Cumulative average of the rewards over time, on a logarithmic scale, for the epsilon-greedy strategy.

**Figure 2.** Cumulative average of the rewards over time for the epsilon-greedy, UCB, and Thompson sampling strategies.

**Figure 4.** Average reward over time for the epsilon-greedy and UCB strategies with non-stationary bandits.

**Figure 11.** Cumulative average rewards over time for the epsilon-greedy, UCB, and LLM-informed strategies with δ = 0.1.

**Figure 12.** Temporal progression of cumulative average rewards for the epsilon-greedy, UCB, and QLoRA-driven LLM-informed strategies with δ = 0.1.


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

de Curtò, J.; de Zarzà, I.; Roig, G.; Cano, J.C.; Manzoni, P.; Calafate, C.T.
LLM-Informed Multi-Armed Bandit Strategies for Non-Stationary Environments. *Electronics* **2023**, *12*, 2814.
https://doi.org/10.3390/electronics12132814
