Dynamic Credit Decision-Making with Continuous Risk Preference: A Unified Framework of Entropy-Regularized HJB and Soft Actor-Critic
Abstract
1. Introduction
2. Related Work
2.1. Evolution and Static Limitations of Credit Scoring Models
2.2. Exploration and Shortcomings of Reinforcement Learning in Financial Decision-Making
2.3. Theoretical Bridge Between Maximum-Entropy Reinforcement Learning and the HJB Equation
2.4. Summary of Literature Comparison
2.5. Problem Formulation
3. Theoretical Foundations
3.1. Entropy-Regularized Markov Decision Process (ER-MDP)
3.2. Entropy-Regularized HJB Equation (ER-HJB) and Properties of Its Solution
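- Step 1 (Monotonicity). If $V_1 \le V_2$ pointwise, then $r(s,a) + \gamma\,\mathbb{E}[V_1(s')] \le r(s,a) + \gamma\,\mathbb{E}[V_2(s')]$ for all $(s,a)$. Since the exponential and logarithm functions are strictly increasing, and $\alpha > 0$, we obtain $\mathcal{T}V_1 \le \mathcal{T}V_2$.
- Step 2 (Constant shift). For any constant $c$ and $V$, $\mathcal{T}(V + c) = \mathcal{T}V + \gamma c$. Applying Step 1 to $V_1 \le V_2 + \|V_1 - V_2\|_\infty$ and exchanging the roles of $V_1$ and $V_2$ gives $\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$.
- Step 3 (Existence and uniqueness). Since $\gamma \in (0,1)$, $\mathcal{T}$ is a strict contraction on a complete metric space. By the Banach fixed-point theorem, there exists a unique $V^*$ such that $\mathcal{T}V^* = V^*$; i.e., Equation (4) holds. □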
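The contraction argument in the steps above can be checked numerically on a small tabular problem. The following is a minimal sketch, not the paper's credit environment: it assumes a finite MDP with randomly generated rewards `R` and transition probabilities `P` (both hypothetical), and it assumes the soft Bellman operator takes the usual log-sum-exp form $(\mathcal{T}V)(s) = \alpha \log \sum_a \exp\big((r(s,a) + \gamma\,\mathbb{E}[V(s')])/\alpha\big)$.

```python
import numpy as np

def soft_bellman(V, R, P, gamma, alpha):
    """Soft backup: (T V)(s) = alpha * log sum_a exp((R + gamma * P V)[s, a] / alpha)."""
    z = (R + gamma * P @ V) / alpha          # shape (n_states, n_actions)
    m = z.max(axis=1, keepdims=True)         # subtract the row max for numerical stability
    return alpha * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

def make_mdp(n_states=6, n_actions=4, seed=0):
    """Random finite MDP (illustrative assumption, not the credit environment)."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)        # each P[s, a, :] is a probability vector
    return R, P

def fixed_point(R, P, gamma, alpha, tol=1e-10, max_iter=10_000):
    """Banach iteration V <- T V until the sup-norm update falls below tol."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        V_next = soft_bellman(V, R, P, gamma, alpha)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V

R, P = make_mdp()
gamma, alpha = 0.9, 0.1
rng = np.random.default_rng(1)
V1, V2 = rng.normal(size=R.shape[0]), rng.normal(size=R.shape[0])

# Empirical contraction ratio ||T V1 - T V2|| / ||V1 - V2|| in the sup norm.
ratio = (np.max(np.abs(soft_bellman(V1, R, P, gamma, alpha)
                       - soft_bellman(V2, R, P, gamma, alpha)))
         / np.max(np.abs(V1 - V2)))

V_star = fixed_point(R, P, gamma, alpha)
# Constant-shift property evaluated at the fixed point: T(V* + c) = V* + gamma * c.
shift_gap = np.max(np.abs(soft_bellman(V_star + 2.0, R, P, gamma, alpha)
                          - (V_star + gamma * 2.0)))
print(f"contraction ratio = {ratio:.4f}, gamma = {gamma}, shift gap = {shift_gap:.2e}")
```

The ratio should never exceed `gamma`, and the shift gap should be at the level of the iteration tolerance; both are properties of this toy instance rather than evidence about the credit data.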
3.3. Limiting Behavior as the Entropy Coefficient Vanishes
3.4. Theoretical Connection Between the SAC Algorithm and the ER-HJB Equation
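- Soft policy evaluation: For a fixed policy $\pi$, repeatedly apply the following operator to compute the soft Q-function: $(\mathcal{T}^{\pi} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\mathbb{E}_{a' \sim \pi(\cdot \mid s')}[Q(s',a') - \alpha \log \pi(a' \mid s')]\big]$.
- Soft policy improvement: Update the policy by minimizing the KL divergence to the Boltzmann distribution of the current Q-function: $\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\big(\pi'(\cdot \mid s)\,\big\|\,\exp(Q^{\pi_{\text{old}}}(s,\cdot)/\alpha)/Z^{\pi_{\text{old}}}(s)\big)$, where $Z^{\pi_{\text{old}}}(s)$ is the normalizing constant.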
- Part (a): Exact tabular setting (finite state/action spaces, no function approximation). The soft Bellman operator $\mathcal{T}$ defined in Equation (3) coincides with the optimal soft Bellman operator in this setting. Exact soft policy iteration, which alternates exact soft policy evaluation and exact soft policy improvement, generates a sequence $\{V_k\}$ satisfying $V_{k+1} = \mathcal{T}V_k$. Since $\mathcal{T}$ is a $\gamma$-contraction (Theorem 1), this sequence converges linearly to the unique fixed point of the ER-HJB Equation (4).
- Part (b): Practical deep RL implementation (neural networks, stochastic gradients, replay buffer). The practical SAC algorithm can be interpreted as an asynchronous stochastic approximation of the exact fixed-point iteration described in Part (a). However, a rigorous global convergence proof for nonlinear function approximation remains an open problem in reinforcement learning theory. The algorithm has been empirically shown to perform well under standard assumptions (bounded rewards, compact action space, sufficient exploration). We therefore adopt it as a numerically effective heuristic for the ER-HJB equation, without claiming a general convergence guarantee.
- Part 1 (Optimal soft Bellman operator). Define the optimal soft Bellman operator by
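- Part 2 (Exact soft policy iteration). Lemma 2 of [18] shows that the policy improvement step (13) has the closed-form solution $\pi_{\text{new}}(a \mid s) = \exp\big(Q^{\pi_{\text{old}}}(s,a)/\alpha\big)/Z^{\pi_{\text{old}}}(s)$, where $Z^{\pi_{\text{old}}}(s)$ is the normalizing constant.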
- Part 3 (Practical deep RL implementation). In practical SAC, the exact backups are replaced by stochastic gradient descent using samples from a replay buffer and function approximators (neural networks). This results in an asynchronous stochastic approximation of the fixed-point iteration. While a global convergence guarantee for nonlinear function approximation is not available, the method satisfies the usual RL assumptions: rewards are bounded, the action space is compact, and the Gaussian policy with bounded variance ensures sufficient exploration. Under these conditions, the algorithm has been empirically shown to work well [18], and we adopt it as an effective numerical heuristic for the ER-HJB equation. □
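The exact tabular case discussed above can be illustrated with a short sketch of soft policy iteration. It is a toy example under stated assumptions: the MDP (`R`, `P`) is randomly generated and hypothetical, policy evaluation is done with a linear solve (which returns the fixed point of the repeated evaluation backup), and policy improvement uses the Boltzmann form $\pi(a \mid s) \propto \exp(Q(s,a)/\alpha)$ from [18]. The reference solution `Q_star` comes from plain soft value iteration.

```python
import numpy as np

def soft_value(Q, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    m = Q.max(axis=1)
    return m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))

def boltzmann_policy(Q, alpha):
    """Soft policy improvement: pi(a|s) proportional to exp(Q(s, a) / alpha)."""
    w = np.exp((Q - Q.max(axis=1, keepdims=True)) / alpha)
    return w / w.sum(axis=1, keepdims=True)

def evaluate_policy(pi, R, P, gamma, alpha):
    """Exact soft policy evaluation: solve V = r_pi + gamma * P_pi V, then form Q."""
    log_pi = np.log(np.maximum(pi, 1e-300))            # guard against log(0)
    r_pi = np.sum(pi * (R - alpha * log_pi), axis=1)   # entropy-augmented expected reward
    P_pi = np.einsum("sa,sat->st", pi, P)              # state-to-state transitions under pi
    V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
    return R + gamma * P @ V

# Hypothetical random MDP, used only for illustration.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.2
R = rng.uniform(-1.0, 1.0, (n_states, n_actions))
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# Reference fixed point from soft value iteration on Q.
Q_star = np.zeros((n_states, n_actions))
for _ in range(3000):
    Q_star = R + gamma * P @ soft_value(Q_star, alpha)

# Exact soft policy iteration, starting from the uniform policy.
pi = np.full((n_states, n_actions), 1.0 / n_actions)
errors = []
for _ in range(300):
    Q = evaluate_policy(pi, R, P, gamma, alpha)
    errors.append(np.max(np.abs(Q - Q_star)))
    pi = boltzmann_policy(Q, alpha)

print("sup-norm error after the first five iterations:", np.round(errors[:5], 8))
```

In this instance the sup-norm error to `Q_star` should shrink by at least a factor `gamma` per iteration, which is the linear rate that the contraction argument predicts. Nothing here carries over automatically to the neural-network setting of Part 3.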
3.5. Linear-Quadratic Analytical Verification
3.6. SAC Algorithm Pseudocode and Overall Research Framework
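| Algorithm 1: SAC Algorithm Pseudocode | |
|---|---|
| Input: environment env, temperature parameter $\alpha$, discount factor $\gamma$, soft update rate $\tau$, learning rate $\lambda$, replay buffer capacity $N$, batch size $B$ | |
| Output: optimal policy network parameters $\phi$ | |
| 1 | Initialize policy network parameters $\phi$, two Q-function network parameters $\theta_1$, $\theta_2$ |
| 2 | Initialize target Q-network parameters $\bar{\theta}_1 \leftarrow \theta_1$, $\bar{\theta}_2 \leftarrow \theta_2$ |
| 3 | Initialize replay buffer $\mathcal{D}$ with capacity $N$ |
| 4 | for each environment step do |
| 5 | Observe current state $s_t$ from the environment and sample action $a_t$ according to $\pi_\phi(\cdot \mid s_t)$ |
| 6 | Execute action $a_t$ in the environment, observe reward $r_t$ and next state $s_{t+1}$ |
| 7 | Store transition tuple $(s_t, a_t, r_t, s_{t+1})$ in replay buffer $\mathcal{D}$ |
| 8 | if number of samples in $\mathcal{D} \geq B$ then |
| 9 | for each gradient step do |
| 10 | Sample a random mini-batch of $B$ transitions $(s, a, r, s')$ from $\mathcal{D}$ |
| 11 | Compute target value $y = r + \gamma \big( \min_{j=1,2} Q_{\bar{\theta}_j}(s', a') - \alpha \log \pi_\phi(a' \mid s') \big)$, where $a' \sim \pi_\phi(\cdot \mid s')$ |
| 12 | for $j = 1, 2$ do |
| 13 | Update $\theta_j$: $\theta_j \leftarrow \theta_j - \lambda \nabla_{\theta_j} \frac{1}{B} \sum \big( Q_{\theta_j}(s, a) - y \big)^2$ |
| 14 | end for |
| 15 | Update $\phi$: $\phi \leftarrow \phi - \lambda \nabla_\phi \frac{1}{B} \sum \big( \alpha \log \pi_\phi(\tilde{a} \mid s) - \min_{j=1,2} Q_{\theta_j}(s, \tilde{a}) \big)$, where $\tilde{a} \sim \pi_\phi(\cdot \mid s)$ |
| 16 | (Optional) Automatically tune $\alpha$ |
| 17 | Soft update target network parameters: $\bar{\theta}_j \leftarrow \tau \theta_j + (1 - \tau) \bar{\theta}_j$ for $j = 1, 2$ |
| 18 | end for |
| 19 | end if |
| 20 | end for |
| 21 | return $\phi$ |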
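Two steps of Algorithm 1, the target computation (line 11) and the soft update of the target networks (line 17), can be sketched in a few lines of NumPy. This is a minimal illustration in the standard SAC form of [18], not the authors' implementation: network outputs are replaced by plain arrays, and the `done` flag for terminal transitions is an added assumption that does not appear in the pseudocode.

```python
import numpy as np

def sac_target(reward, done, q1_next, q2_next, log_prob_next, gamma, alpha):
    """Soft TD target: y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s')).

    q1_next, q2_next: target-network Q-values at (s', a') with a' sampled from the policy.
    log_prob_next:    log pi(a'|s') for the same sampled actions.
    done:             1.0 for terminal transitions (assumption, not in Algorithm 1).
    """
    soft_next = np.minimum(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_next

def polyak_update(target_params, online_params, tau):
    """Soft update: theta_bar <- tau * theta + (1 - tau) * theta_bar, per parameter array."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]

# Toy mini-batch of two transitions (made-up numbers for illustration).
y = sac_target(
    reward=np.array([1.0, 0.5]),
    done=np.array([0.0, 1.0]),
    q1_next=np.array([2.0, 3.0]),
    q2_next=np.array([1.5, 4.0]),
    log_prob_next=np.array([-1.0, -2.0]),
    gamma=0.99,
    alpha=0.2,
)
new_target = polyak_update([np.zeros(2)], [np.ones(2)], tau=0.005)
print(y)           # first entry: 1.0 + 0.99 * (1.5 + 0.2) = 2.683; second entry is terminal
print(new_target)  # each target weight moves 0.5% of the way toward the online weight
```

Taking the minimum of the two Q-estimates counteracts overestimation bias, and a small `tau` keeps the regression target slowly moving; both are standard SAC design choices rather than anything specific to the credit setting.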
4. Experimental Design and Methodology
4.1. From ER-MDP to the Credit Environment: Connecting Theory and Experiment
4.2. Dataset Description
4.3. Model Architecture
4.4. Evaluation Metrics
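- Average Reward (AR) measures the expected net profit per loan in the test set, serving as a direct measure of profitability: $\mathrm{AR} = \frac{1}{N}\sum_{i=1}^{N} r_i$, where $r_i$ is the reward of loan $i$ and $N$ is the number of loans in the test set.
- Total Reward (TR) measures the cumulative net profit over all loans in the test set, providing an aggregate view of profitability: $\mathrm{TR} = \sum_{i=1}^{N} r_i$.
- Standard Deviation of Reward (Std) quantifies the overall uncertainty of single-loan returns, defined as the standard deviation of rewards: $\mathrm{Std} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (r_i - \mathrm{AR})^2}$.
- Sortino Ratio (Sortino) evaluates downside risk-adjusted return, considering only volatility below a target return (set to 0 in this work) as risk [33]: $\mathrm{Sortino} = \mathrm{AR} \big/ \sqrt{\frac{1}{N}\sum_{i=1}^{N} \min(r_i, 0)^2}$.
- Conditional Value-at-Risk (CVaR95%) measures extreme tail risk, defined as the average loss in the worst 5% of the return distribution (expressed as a negative number) [34]: $\mathrm{CVaR}_{95\%} = \mathbb{E}[\, r \mid r \le q_{0.05} \,]$, where $q_{0.05}$ is the 5th percentile of the rewards.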
- High-Risk Credit Ratio (HCR) captures the strategy’s propensity to grant credit to high-risk borrowers. The high-risk group consists of borrowers whose unemployment rate exceeds the 75th percentile and who have a positive number of historical delinquencies in the past two years:
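The metrics listed above can be computed from a vector of per-loan rewards. The sketch below is one hedged reading of the verbal definitions, not the authors' evaluation code: the population standard deviation, the quantile interpolation, the argument names `unemployment`, `delinq_2yrs`, and `actions`, and in particular the choice to compute HCR as the mean granted credit ratio within the high-risk group are all assumptions.

```python
import numpy as np

def evaluation_metrics(rewards, target=0.0, tail=0.05):
    """AR, TR, Std, Sortino (target return 0 by default) and CVaR over the worst `tail` share."""
    rewards = np.asarray(rewards, dtype=float)
    ar = rewards.mean()
    # Downside deviation: root mean square of shortfalls below the target.
    downside = np.sqrt(np.mean(np.minimum(rewards - target, 0.0) ** 2))
    cutoff = np.quantile(rewards, tail)               # 5th percentile of rewards
    return {
        "AR": ar,
        "TR": rewards.sum(),
        "Std": rewards.std(),                         # population std (ddof=0), an assumption
        "Sortino": (ar - target) / downside if downside > 0 else np.inf,
        "CVaR95": rewards[rewards <= cutoff].mean(),  # mean reward in the worst tail
    }

def high_risk_mask(unemployment, delinq_2yrs):
    """High-risk group: unemployment above its 75th percentile and delinquencies > 0."""
    unemployment = np.asarray(unemployment, dtype=float)
    delinq_2yrs = np.asarray(delinq_2yrs)
    return (unemployment > np.quantile(unemployment, 0.75)) & (delinq_2yrs > 0)

# Toy data (made up) to show the calls.
metrics = evaluation_metrics([-2.0, 1.0, 1.0, 2.0])
mask = high_risk_mask(unemployment=[3.0, 4.0, 5.0, 9.0], delinq_2yrs=[1, 0, 0, 2])
actions = np.array([0.2, 0.5, 0.9, 0.8])              # hypothetical granted credit ratios
hcr = actions[mask].mean()                            # assumed HCR definition
print(metrics, hcr)
```

With the toy rewards the downside deviation is 1, so the Sortino ratio equals the average reward of 0.5, and the single worst loan determines the CVaR.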
4.5. Experimental Environment
5. Experimental Results and Analysis
5.1. Theoretical Verification
5.2. Continuous Regulation of Risk–Return Trade-Off
5.3. Systematic Comparison with Static Baselines
5.4. Credit-Granting Behavior Towards High-Risk Groups
6. Discussion
6.1. Economic Interpretation of the Temperature Parameter
6.2. The Advantage Mechanism of the Dynamic Strategy
6.3. The Theoretical Roots of Interpretability
6.4. Practical Implications
6.5. Theoretical Limitations and Practical Scope
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Dastile, X.; Celik, T.; Potsane, M. Statistical and machine learning models in credit scoring: A systematic literature survey. Appl. Soft Comput. 2020, 91, 106263.
- Lessmann, S.; Baesens, B.; Seow, H.V.; Thomas, L.C. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur. J. Oper. Res. 2015, 247, 124–136.
- Chen, T.Q.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
- Ke, G.L.; Meng, Q.; Finley, T.; Wang, T.F.; Chen, W.; Ma, W.D.; Ye, Q.W.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Li, Y.H.; Chen, W.D. A Comparative Performance Assessment of Ensemble Learning for Credit Scoring. Mathematics 2020, 8, 1756.
- Sousa, M.R.; Gama, J.; Brandao, E. A new dynamic modeling framework for credit risk assessment. Expert Syst. Appl. 2016, 45, 341–351.
- Gunnarsson, B.R.; Broucke, S.V.; Baesens, B.; Oskarsdóttir, M.; Lemahieu, W. Deep learning for credit scoring: Do or don’t? Eur. J. Oper. Res. 2021, 295, 292–305.
- Chen, Y.J.; Calabrese, R.; Martin-Barragan, B. Interpretable machine learning for imbalanced credit scoring datasets. Eur. J. Oper. Res. 2024, 312, 357–372.
- Moos, J.; Hansel, K.; Abdulsamad, H.; Stark, S.; Clever, D.; Peters, J. Robust Reinforcement Learning: A Review of Foundations and Recent Advances. Mach. Learn. Knowl. Extr. 2022, 4, 276–315.
- Paul, S.; Gupta, A.; Kar, A.K.; Singh, V. An Automatic Deep Reinforcement Learning based Credit Scoring Model using Deep-Q Network for Classification of Customer Credit Requests. In Proceedings of the 29th Annual IEEE International Symposium on Technology and Society, Swansea, UK, 13–15 September 2023.
- Wang, Y.D.; Jia, Y.L.; Fan, S.; Xiao, J. Deep reinforcement learning based on balanced stratified prioritized experience replay for customer credit scoring in peer-to-peer lending. Artif. Intell. Rev. 2024, 57, 93.
- Barbierato, E.; Gatti, A. The Challenges of Machine Learning: A Critical Review. Electronics 2024, 13, 416.
- Zhang, H.; Chen, H.G.; Xiao, C.W.; Li, B.; Liu, M.Y.; Boning, D.; Hsieh, C.J. Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual Event, 6–12 December 2020.
- Crandall, M.G.; Lions, P.L. Viscosity solutions of Hamilton-Jacobi equations. Trans. Am. Math. Soc. 1983, 277, 1–42.
- Bardi, M.; Capuzzo-Dolcetta, I. Continuous viscosity solutions of Hamilton-Jacobi equations. In Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations; Birkhäuser Boston: Boston, MA, USA, 1997; pp. 25–96.
- Munos, R. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Mach. Learn. 2000, 40, 265–299.
- Tang, W.P.; Zhang, Y.P.; Zhou, X.Y. Exploratory HJB Equations and Their Convergence. SIAM J. Control Optim. 2022, 60, 3191–3216.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Xu, Y.; Kou, G.; Peng, Y.; Ding, K.X.; Ergu, D.; Alotaibi, F.S. Profit- and risk-driven credit scoring under parameter uncertainty: A multiobjective approach. Omega-Int. J. Manag. Sci. 2024, 125, 103004.
- Ziebart, B.D.; Maas, A.; Bagnell, J.A.; Dey, A.K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008; Volume 3.
- Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905.
- Kim, J.; Yang, I. Hamilton-Jacobi-Bellman Equations for Maximum Entropy Optimal Control. arXiv 2020, arXiv:2009.13097.
- Stoorvogel, A.A.; Saberi, A. The discrete algebraic Riccati equation and linear matrix inequality. Linear Algebra Appl. 1998, 274, 317–365.
- All Lending Club Loan Data. Available online: https://www.kaggle.com/datasets/wordsforthewise/lending-club (accessed on 15 March 2026).
- Unemployment Rate (UNRATE). Available online: https://fred.stlouisfed.org/series/UNRATE (accessed on 15 March 2026).
- Federal Funds Effective Rate (FEDFUNDS). Available online: https://fred.stlouisfed.org/series/FEDFUNDS (accessed on 15 March 2026).
- Consumer Price Index for All Urban Consumers: All Items in U.S. City Average (CPIAUCSL). Available online: https://fred.stlouisfed.org/series/CPIAUCSL (accessed on 15 March 2026).
- Industrial Production: Total Index (INDPRO). Available online: https://fred.stlouisfed.org/series/INDPRO (accessed on 15 March 2026).
- 30-Year Fixed Rate Mortgage Average in the United States (MORTGAGE30US). Available online: https://fred.stlouisfed.org/series/MORTGAGE30US (accessed on 15 March 2026).
- Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8.
- Columba, F.; Cugliari, M.; Di Virgilio, S. Credit risk assessment with stacked machine learning. In Corporate Credit Analysis and AI: Advancing the Rating System at a Central Bank; Springer Nature Switzerland: Cham, Switzerland, 2026; pp. 263–292.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Sortino, F.A.; van der Meer, R. Downside risk. J. Portf. Manag. 1991, 17, 27–31.
- Rockafellar, R.T.; Uryasev, S. Optimization of conditional value-at-risk. J. Risk 2000, 3, 21–41.






| Work | Method | Dynamic? | Risk-Tunable? | Theoretical Optimality | Main Limitation |
|---|---|---|---|---|---|
| [2] | Logistic regression | No | No | No | Static classification, not aligned with long-term profit |
| [3,4] | Gradient boosting | No | No | No | Single-period, no intertemporal optimization |
| [6] | Sliding window | Yes | No | No | Heuristic forgetting, no optimal control |
| [7] | Deep neural networks | No | No | No | High computational cost, not profit-aligned |
| [11] | DQN + BSPER | Yes | No | No | Handcrafted reward, not risk-theory driven |
| [21] | Maximum-entropy RL | Yes | Yes | Yes (RL) | No theoretical equivalence to HJB |
| This paper | ER-HJB + SAC | Yes | Yes | Yes (HJB) | — |
| $\alpha$ | AR | TR | Std | Sortino | CVaR95% | HCR |
|---|---|---|---|---|---|---|
| 0.01 | 1.6773 | 870,036.4111 | 2.0434 | 1.5679 | −1.3383 | 0.9095 |
| 0.05 | 1.4841 | 769,829.1321 | 1.9077 | 1.5868 | −1.1217 | 0.7465 |
| 0.10 | 1.3235 | 686,503.9132 | 1.7329 | 1.6123 | −0.9744 | 0.6608 |
| 0.15 | 1.2255 | 635,658.0914 | 1.6075 | 1.6199 | −0.8988 | 0.6176 |
| 0.20 | 1.1731 | 608,513.7459 | 1.5302 | 1.6312 | −0.8579 | 0.5971 |
| 0.50 | 1.0178 | 527,947.7472 | 1.2608 | 1.6308 | −0.7577 | 0.5396 |
| 1.00 | 0.9606 | 498,271.8749 | 1.1530 | 1.6302 | −0.7217 | 0.5211 |
| Model | AR | TR | Std | Sortino | CVaR95% |
|---|---|---|---|---|---|
| LR | 1.3651 | 708,066.8276 | 1.6022 | 1.6109 | −1.0454 |
| RF | 1.3521 | 701,327.8073 | 1.5931 | 1.5963 | −1.0435 |
| XGB | 1.3541 | 702,378.4029 | 1.5851 | 1.6241 | −1.0316 |
| LGBM | 1.3533 | 701,940.5208 | 1.5844 | 1.6230 | −1.0317 |
| MLP | 1.3661 | 708,588.9227 | 1.6009 | 1.6205 | −1.0397 |
| Stacked_Meta-Model | 1.3236 | 686,540.0049 | 1.6813 | 1.3463 | −1.1189 |
| Model | Mean | Min | Q1 (25%) | Median (50%) | Q3 (75%) | Max |
|---|---|---|---|---|---|---|
| SAC ($\alpha = 0.01$) | 1.6773 | −31.6424 | 0.5505 | 1.2061 | 2.4063 | 12.2055 |
| SAC ($\alpha = 0.05$) | 1.4841 | −28.1988 | 0.4173 | 0.9668 | 2.1135 | 11.9399 |
| SAC ($\alpha = 0.10$) | 1.3235 | −25.2654 | 0.3822 | 0.8349 | 1.8201 | 11.5482 |
| SAC ($\alpha = 0.15$) | 1.2255 | −23.4186 | 0.3655 | 0.7802 | 1.6622 | 11.1560 |
| SAC ($\alpha = 0.20$) | 1.1731 | −22.2460 | 0.3604 | 0.7557 | 1.5821 | 10.8789 |
| SAC ($\alpha = 0.50$) | 1.0178 | −18.1451 | 0.3467 | 0.7040 | 1.3959 | 8.7703 |
| SAC ($\alpha = 1.00$) | 0.9606 | −17.0367 | 0.3421 | 0.6884 | 1.3344 | 7.5645 |
| LR | 1.3651 | −30.2107 | 0.4958 | 1.0102 | 1.9372 | 10.2371 |
| RF | 1.3521 | −32.0278 | 0.4906 | 1.0004 | 1.9187 | 10.3183 |
| XGB | 1.3541 | −31.7479 | 0.4921 | 1.0017 | 1.9216 | 10.2592 |
| LGBM | 1.3533 | −31.5754 | 0.4918 | 1.0011 | 1.9204 | 10.3152 |
| MLP | 1.3661 | −31.7889 | 0.4957 | 1.0102 | 1.9381 | 10.5684 |
| Stacked_Meta-Model | 1.3236 | −33.5188 | 0.4656 | 0.9683 | 1.8790 | 12.0437 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Jin, L.; Zhang, R. Dynamic Credit Decision-Making with Continuous Risk Preference: A Unified Framework of Entropy-Regularized HJB and Soft Actor-Critic. Mathematics 2026, 14, 1980. https://doi.org/10.3390/math14111980

