Gaussian Process with Vine Copula-Based Context Modeling for Contextual Multi-Armed Bandits
Abstract
1. Introduction
2. Motivation
3. Statistical Methods
3.1. Vine Copula: Definition and Background
3.2. Context Generation via Vine Copulas
3.3. Reward Generation via Inverse Beta Transformation
3.4. Gaussian Process Regression for Reward Estimation
- Squared Exponential (RBF) Kernel:
- Matérn Kernel (e.g., ):
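- $\mathbf{k}_*$ is the covariance vector between the training inputs and the test input $\mathbf{x}_*$;
- $K$ is the covariance matrix with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$;
- $I$ is the identity matrix;
- $\sigma_n^2$ is the noise variance capturing observation noise;
- $\mathbf{y}$ is the vector of observed rewards.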
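The quantities listed above enter the standard GP regression predictive equations (Rasmussen and Williams, 2006): the posterior mean at a test input is $\mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ and the posterior variance is $k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$. The block below is a minimal pure-Python sketch of those two equations. It is not the `newGPsep()`/`predGPsep()` code path used in the article. The function names, the default hyperparameters, and the choice $\nu = 5/2$ for the Matérn kernel are illustrative assumptions.

```python
import math

def rbf_kernel(x, z, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel: s^2 * exp(-||x - z||^2 / (2 l^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return signal_var * math.exp(-0.5 * sq_dist / length_scale ** 2)

def matern52_kernel(x, z, length_scale=1.0, signal_var=1.0):
    """Matern kernel with nu = 5/2 (the value of nu is an assumption here)."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z))) / length_scale
    s5 = math.sqrt(5.0)
    return signal_var * (1.0 + s5 * r + 5.0 * r * r / 3.0) * math.exp(-s5 * r)

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L z = b."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def solve_lower_transposed(L, b):
    """Back substitution: solve L^T z = b."""
    n = len(b)
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (b[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z

def gp_posterior(X, y, x_star, noise_var=0.1, kernel=rbf_kernel):
    """Posterior mean and variance of the latent reward at x_star."""
    n = len(X)
    # K + sigma_n^2 I: kernel matrix with the noise variance on the diagonal.
    K_noisy = [[kernel(X[i], X[j]) + (noise_var if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    # k_*: covariances between each training input and the test input.
    k_star = [kernel(x_i, x_star) for x_i in X]
    L = cholesky(K_noisy)
    # alpha = (K + sigma_n^2 I)^{-1} y via two triangular solves.
    alpha = solve_lower_transposed(L, solve_lower(L, y))
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = kernel(x_star, x_star) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)  # clip tiny negative values from round-off
```

The Cholesky factorization is the step that scales cubically in the number of training points, which is the usual source of the GP training cost. In a per-arm setup, one such model would be fitted to the contexts and rewards observed for each arm.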
3.5. Bandit Policies
- Thompson sampling (TS):
- Epsilon-Greedy ($\epsilon$-Greedy):
- Upper confidence bound (UCB):
3.6. Performance Metrics
3.7. Computational Complexity and Overhead Analysis
3.7.1. Gaussian Process Inference Overhead
3.7.2. Vine Copula Fitting Overhead
- Selecting a vine structure (e.g., C-vine or D-vine) over the $d$ context dimensions.
- Estimating bivariate copula parameters at each tree level.
3.8. Comparison with Simpler Policies
- $\epsilon$-greedy chooses between exploration and exploitation based on a uniform draw and maintains simple running averages of rewards.
- UCB updates its estimates using a closed-form expression involving logarithmic scaling.
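To make the contrast with the GP-based policies concrete, the following is a minimal sketch of the two simpler policies described above. It assumes the closed-form UCB expression is the UCB1 index of Auer et al. (2002), namely the running mean plus $\sqrt{2 \ln t / n_a}$, which the text does not confirm. The class and function names are hypothetical.

```python
import math
import random

class RunningAverageBandit:
    """Tracks a pull count and a running mean reward for each arm."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def update(self, arm, reward):
        # Incremental mean: constant-time update, no reward history stored.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def epsilon_greedy_choice(bandit, epsilon, rng):
    """Explore uniformly with probability epsilon, else pick the best mean."""
    if rng.random() < epsilon:  # the uniform draw
        return rng.randrange(len(bandit.means))
    return max(range(len(bandit.means)), key=lambda a: bandit.means[a])

def ucb1_choice(bandit, t):
    """UCB1 index (assumed form): mean + sqrt(2 ln t / n_a); t = rounds so far."""
    # Play every arm once before the index is defined.
    for arm, n in enumerate(bandit.counts):
        if n == 0:
            return arm
    return max(
        range(len(bandit.means)),
        key=lambda a: bandit.means[a]
        + math.sqrt(2.0 * math.log(t) / bandit.counts[a]),
    )
```

Each decision costs time linear in the number of arms and each update is constant time, which is why these policies carry far less overhead than refitting a GP per arm.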
4. Simulation Study
- Total rounds: ;
- Number of arms: ;
- Context dimension: ;
- Training proportion: 80% (i.e., );
- Block correlation parameters: , ;
- Exploration parameter: .
- TS.
- Epsilon-Greedy.
- UCB.
- Final Cumulative Reward: Total accumulated reward obtained by each policy at the end of the test period. A higher value indicates better performance in maximizing rewards. Epsilon-Greedy achieved the highest final cumulative reward (101.08), followed by TS (60.63) and UCB (57.42).
- Final Cumulative Regret: The difference between the cumulative reward of an oracle policy (always choosing the best arm) and that of the actual policy. Lower values indicate better decision-making. Epsilon-Greedy showed the lowest final regret (9.72, versus 50.18 for TS and 53.39 for UCB), suggesting it approximated optimal arm selection more closely than the other two policies.
- Mean Cumulative Reward and Regret: Average cumulative reward and regret across the entire test period, reflecting overall policy performance over time. Epsilon-Greedy again outperformed others with the highest mean cumulative reward (50.95) and lowest mean cumulative regret (5.13).
- Standard Deviation (SD) of Cumulative Reward and Regret: Measures variability in cumulative reward and regret over time. Lower values indicate more stable performance. While Epsilon-Greedy had the highest reward variability (29.06), it exhibited the lowest regret variability (2.71), indicating relatively stable regret despite fluctuations in rewards.
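The four summaries above can be computed from a per-round reward sequence as in the sketch below. It assumes that per-round regret is the oracle reward minus the received reward, and that the SD is the sample standard deviation (denominator $n - 1$, as in R's `sd()`). Neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
import statistics
from itertools import accumulate

def summarize_policy(rewards, oracle_rewards):
    """Final, mean, and SD of cumulative reward and cumulative regret."""
    # Cumulative reward after each round of the test period.
    cum_reward = list(accumulate(rewards))
    # Cumulative regret: running sum of (oracle reward - received reward).
    cum_regret = list(accumulate(o - r for o, r in zip(oracle_rewards, rewards)))
    return {
        "final_cum_reward": cum_reward[-1],
        "final_cum_regret": cum_regret[-1],
        "mean_cum_reward": statistics.mean(cum_reward),
        "mean_cum_regret": statistics.mean(cum_regret),
        "sd_cum_reward": statistics.stdev(cum_reward),  # sample SD (n - 1)
        "sd_cum_regret": statistics.stdev(cum_regret),
    }
```

Because the mean and SD are taken over the cumulative curves rather than over per-round values, a policy whose reward grows steadily will show a large SD of cumulative reward even when its behavior is stable, which is consistent with the Epsilon-Greedy figures reported above.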
5. Illustrated Real Data Analysis
5.1. Wine Quality
- TS.
- Epsilon-Greedy.
- UCB.
5.2. Boston Housing
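- TS: For each context $x_t$, a reward is sampled from the posterior predictive distribution of each arm’s GP model, and the arm with the highest sampled reward is selected: $a_t = \arg\max_{a} \tilde{r}_{a}(x_t)$, where $\tilde{r}_{a}(x_t) \sim \mathcal{N}\big(\mu_{a}(x_t), \sigma_{a}^{2}(x_t)\big)$.
- Epsilon-Greedy (EG): With probability $\epsilon$, a random arm is selected uniformly (exploration); otherwise, the arm with the highest posterior mean reward is chosen (exploitation): $a_t = \arg\max_{a} \mu_{a}(x_t)$.
- UCB: The arm maximizing the UCB is selected: $a_t = \arg\max_{a} \left[\mu_{a}(x_t) + \kappa\, \sigma_{a}(x_t)\right]$, where $\kappa$ is the exploration parameter.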
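The three selection rules can be sketched as follows, given each arm's GP posterior mean and standard deviation at the current context. This is a hedged illustration rather than the article's code: the function names and the exploration weight `kappa` are hypothetical, and the UCB index is assumed to take the common form of posterior mean plus `kappa` times posterior standard deviation.

```python
import random

def thompson_choice(means, sds, rng):
    """Draw one reward per arm from N(mean, sd^2) and pick the largest draw."""
    samples = [rng.gauss(m, s) for m, s in zip(means, sds)]
    return max(range(len(samples)), key=lambda a: samples[a])

def gp_epsilon_greedy_choice(means, epsilon, rng):
    """Uniform random arm with probability epsilon, else highest posterior mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def gp_ucb_choice(means, sds, kappa=2.0):
    """Pick the arm maximizing mean + kappa * sd (assumed UCB form)."""
    return max(range(len(means)), key=lambda a: means[a] + kappa * sds[a])
```

For example, with posterior means `[0.2, 0.9, 0.5]` and standard deviations `[1.0, 0.1, 0.1]`, the greedy rule picks the second arm, while the UCB rule with `kappa=2.0` picks the first arm because its large uncertainty raises its index to 2.2.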
6. Discussion and Conclusions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Russo, D.; Van Roy, B. Learning to Optimize via Information-Directed Sampling. arXiv 2017, arXiv:1403.5556.
- Srinivas, N.; Krause, A.; Kakade, S.M.; Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 1015–1022.
- Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 661–670.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Nelsen, R.B. An Introduction to Copulas, 2nd ed.; Springer: New York, NY, USA, 2006.
- Aas, K.; Czado, C.; Frigessi, A.; Bakken, H. Pair-copula constructions of multiple dependence. Insur. Math. Econ. 2009, 44, 182–198.
- Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006.
- Joe, H. Multivariate Models and Dependence Concepts; Chapman & Hall: London, UK, 1997.
- Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, Volume 2; Wiley-Interscience: Hoboken, NJ, USA, 1995.
- Niederreiter, H. Random Number Generation and Quasi-Monte Carlo Methods; SIAM: Philadelphia, PA, USA, 1992.
- Czado, C. Analyzing Dependent Data with Vine Copulas: A Practical Guide with R; Springer: Berlin/Heidelberg, Germany, 2019.
- Quiñonero-Candela, J.; Rasmussen, C.E. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 2005, 6, 1939–1959.
- Russo, D.; Van Roy, B.; Kazerouni, A.; Osband, I.; Wen, Z. A tutorial on Thompson sampling. Found. Trends Mach. Learn. 2018, 11, 1–96.
- Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256.
- Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, 2020.
- Dua, D.; Graff, C. UCI Machine Learning Repository. 2019. Available online: http://archive.ics.uci.edu/ml (accessed on 1 May 2025).
- Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 2009, 47, 547–553.
- Nagler, T.; Schepsmeier, U.; Stober, J.; Brechmann, E.C.; Graeler, B.; Erhardt, T.; Almeida, C.; Min, A.; Czado, C.; Hofmann, M.; et al. The R Package VineCopula, version 2.6.1; Statistical Inference of Vine Copulas; CRAN: Vienna, Austria, 2025.
- Gramacy, R.B.; Lee, H.K.H. Bayesian treed Gaussian process models with an application to computer modeling. J. Am. Stat. Assoc. 2008, 103, 1119–1130.
Component | Operation | Time Complexity | Relative Cost |
---|---|---|---|
GP Model Training (per arm) | newGPsep() | | High |
GP Prediction (per arm) | predGPsep() | | Moderate |
Vine Copula Fitting | Structure + MLE | to | High |
$\epsilon$-greedy | which.max(…) | | Low |
UCB/Greedy | Closed-form index | | Low |
Policy | Final Cumulative Reward | Final Cumulative Regret | Mean Cumulative Reward | Mean Cumulative Regret | SD Cumulative Reward | SD Cumulative Regret |
---|---|---|---|---|---|---|
Epsilon-Greedy | 101.08 | 9.72 | 50.95 | 5.13 | 29.06 | 2.71 |
TS | 60.63 | 50.18 | 30.02 | 26.06 | 17.53 | 14.20 |
UCB | 57.42 | 53.39 | 28.87 | 27.21 | 16.11 | 15.62 |
Variable | Description | Units |
---|---|---|
fixed acidity | Tartaric acid content; contributes to wine stability and taste | g/dm³ |
volatile acidity | Acetic acid content; high levels lead to unpleasant sourness | g/dm³ |
citric acid | Enhances freshness and flavor; contributes to acidity balance | g/dm³ |
residual sugar | Remaining sugar after fermentation; influences sweetness | g/dm³ |
chlorides | Salt content; may affect taste and preservation | g/dm³ |
free sulfur dioxide | Free form of SO₂; inhibits microbial growth | mg/dm³ |
total sulfur dioxide | Combined free and bound SO₂ used as preservative | mg/dm³ |
density | Density of wine, influenced by alcohol and sugar content | g/cm³ |
pH | Acidity level; lower pH indicates higher acidity | – |
sulphates | Sulfate compounds acting as antioxidants and preservatives | g/dm³ |
alcohol | Alcohol content of the wine | % vol. |
quality | Sensory quality score from wine tasters | Ordinal (0–10) |
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kim, J.-M. Gaussian Process with Vine Copula-Based Context Modeling for Contextual Multi-Armed Bandits. Mathematics 2025, 13, 2058. https://doi.org/10.3390/math13132058