  • This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

10 January 2026

Integrating Contextual Causal Deep Networks and LLM-Guided Policies for Sequential Decision-Making

1 Statistics Discipline, Division of Science and Mathematics, University of Minnesota-Morris, Morris, MN 56267, USA
2 EGADE Business School, Tecnológico de Monterrey, Ave. Rufino Tamayo, Monterrey 66269, Mexico
Mathematics 2026, 14(2), 269; https://doi.org/10.3390/math14020269 (registering DOI)
This article belongs to the Special Issue Computational Methods and Machine Learning for Causal Inference

Abstract

Sequential decision-making is critical for applications ranging from personalized recommendations to resource allocation. This study evaluates three decision policies—Greedy, Thompson Sampling (via Monte Carlo Dropout), and a zero-shot Large Language Model (LLM)-guided policy (Gemini-1.5-Pro)—within a contextual bandit framework. To address covariate shift and assess subpopulation performance, we utilize a Contextual Causal Deep Network (CCDN) in which covariates are partitioned into B = 10 homogeneous blocks. Evaluating these policies across a high-dimensional treatment space (K = 5 binary treatments, yielding 2^5 = 32 actions), we test performance in a simulated environment and three benchmark datasets: Boston Housing, Wine Quality, and Adult Income. Our results demonstrate that the Greedy strategy achieves the highest Model-Relative Optimal (MRO) coverage, reaching 1.00 in the Wine Quality and Adult Income datasets, though its performance drops sharply to 0.05 in the Boston Housing environment. Thompson Sampling maintains competitive regret and, in the Boston Housing dataset, marginally outperforms Greedy in action selection precision. Conversely, the zero-shot LLM-guided policy consistently underperforms in numerical tabular settings, exhibiting the highest median regret and near-zero MRO coverage across most tasks. Furthermore, Wilcoxon tests reveal that differences in empirical outcomes between policies are often not statistically significant (ns), suggesting an optimization ceiling in zero-shot tabular settings. These findings indicate that, while traditional model-driven policies are robust, LLM-guided approaches currently lack the numerical precision required for high-dimensional sequential decision-making without further calibration or hybrid integration.
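As a rough illustration of the decision setup described in the abstract (a K = 5 binary treatment space yielding 2^5 = 32 candidate actions, scored by a reward network whose dropout layers double as an approximate posterior), the following Python sketch contrasts the Greedy and Thompson Sampling (Monte Carlo Dropout) policies. The network architecture, context dimension, and all hyperparameters are illustrative assumptions rather than the authors' implementation, and the LLM-guided policy is omitted because it requires an external model call.

```python
# Minimal sketch of the combinatorial action space and two of the
# evaluated policies (Greedy and Thompson Sampling via Monte Carlo
# Dropout). Network sizes and hyperparameters are illustrative
# assumptions, not the authors' implementation.
import itertools
import numpy as np
import torch
import torch.nn as nn

K = 5                                                 # binary treatments
ACTIONS = list(itertools.product([0, 1], repeat=K))   # 2^5 = 32 actions

class RewardNet(nn.Module):
    """Predicts reward from a (context, action) pair; the dropout layer
    can be left active at selection time to approximate posterior draws."""
    def __init__(self, context_dim: int, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim + K, 64), nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def _score_all_actions(model: RewardNet, context: np.ndarray) -> list:
    """Score every candidate action for the given context."""
    with torch.no_grad():
        return [model(torch.tensor(np.concatenate([context, a]),
                                   dtype=torch.float32)).item()
                for a in ACTIONS]

def greedy_action(model: RewardNet, context: np.ndarray) -> tuple:
    """Pick the action with the highest mean predicted reward."""
    model.eval()   # dropout off: deterministic point estimate
    return ACTIONS[int(np.argmax(_score_all_actions(model, context)))]

def thompson_action(model: RewardNet, context: np.ndarray) -> tuple:
    """One stochastic forward pass per action with dropout left on,
    treated as a draw from the approximate reward posterior."""
    model.train()  # dropout on: stochastic forward pass
    return ACTIONS[int(np.argmax(_score_all_actions(model, context)))]

rng = np.random.default_rng(0)
model = RewardNet(context_dim=8)
ctx = rng.normal(size=8)
print("greedy:  ", greedy_action(model, ctx))
print("thompson:", thompson_action(model, ctx))
```

Keeping dropout active at selection time (model.train()) is what turns a point-estimate network into a Thompson-style sampler: each stochastic forward pass acts as one draw from an approximate posterior over rewards, so the argmax over actions varies across rounds and induces exploration.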
