Dynamic Credit Decision-Making with Continuous Risk Preference: A Unified Framework of Entropy-Regularized HJB and Soft Actor-Critic

Jin, Lei; Zhang, Runchi

doi:10.3390/math14111980


by Lei Jin * and Runchi Zhang

School of Economics, Nanjing University of Posts and Telecommunications, No.9 Wen Yuan Road, Nanjing 210023, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(11), 1980; https://doi.org/10.3390/math14111980

Submission received: 27 April 2026 / Revised: 28 May 2026 / Accepted: 2 June 2026 / Published: 3 June 2026

(This article belongs to the Special Issue Intelligent Financial Systems: Algorithms, Learning, and Decision Mechanisms)


Abstract

Traditional credit scoring treats lending as static classification and lacks the ability to adjust risk preferences dynamically. This paper develops a dynamic credit decision framework based on the entropy-regularized Hamilton–Jacobi–Bellman (ER-HJB) equation. Theoretically, we prove the existence and uniqueness of a solution to the ER-HJB equation, show that under exact tabular assumptions the soft policy iteration underlying Soft Actor-Critic (SAC) converges to this solution, and derive a closed-form analytical solution under linear-quadratic conditions. Empirically, using LendingClub loan panel data (2016–2018), we show that a single entropy coefficient continuously modulates the risk–return trade-off. As this coefficient increases from 0.01 to 1.00, tail risk (CVaR 95%) steadily improves, while the Sortino ratio peaks near 0.20. The dynamic SAC model outperforms static baselines (logistic regression, XGBoost, LightGBM) in average reward and, by tuning the entropy coefficient, achieves significant downside risk reduction without retraining. This framework transforms credit scoring into dynamic optimal control with continuously adjustable and interpretable risk preferences, offering a theoretically grounded tool for refined risk management.

Keywords:

entropy regularization; Hamilton–Jacobi–Bellman equation; Soft Actor-Critic; dynamic credit decision-making; risk preference regulation

MSC:

91G40; 90C40; 68T05

1. Introduction

Credit risk management is a cornerstone of modern financial stability. Since the 1950s, credit scoring models have undergone profound paradigm shifts, evolving from expert judgment to statistical models and, more recently, to machine learning models [1]. Despite continuous methodological advances, the core objective remains unchanged: to accurately assess a borrower’s probability of default and provide a quantitative basis for credit decisions. Traditional approaches—whether the robust and transparent logistic regression [2] or ensemble learning models such as XGBoost [3] and LightGBM [4,5]—essentially simplify decisions into a static classification problem based on current features. However, credit risk is not static; it evolves with macroeconomic cycles and borrower behavior, and the static perspective fundamentally ignores the intertemporal dependence of decisions. Empirical studies have shown that traditional static scoring models suffer from continuous performance degradation in real-world operations [6], deep neural networks do not outperform gradient boosting trees on structured credit data [7], and post-hoc interpretability methods become unstable under class imbalance [8]. These findings collectively point to a core dilemma: the prevailing “model once, use long-term” paradigm cannot adapt to the inherently dynamic nature of risk.

To capture the sequential dependence of decisions, reinforcement learning (RL)—a natural framework for sequential decision-making—has been introduced into the financial domain [9]. In credit risk applications, early work attempted to apply Deep Q-Networks (DQNs) to credit classification, designing handcrafted reward functions to address sample imbalance [10]. Wang et al. [11] further proposed Balanced Stratified Prioritized Experience Replay (BSPER) to improve DQN convergence on highly imbalanced credit data. However, these works share a fundamental paradigm limitation: they still view credit decisions as a sequential classification problem with terminal states, with reward functions that are handcrafted weightings of single-step classification outcomes. They offer neither explicit and continuous regulation of policy stochasticity—i.e., risk preference—nor an intrinsic connection to optimal control theory. The survey by Barbierato and Gatti [12] critically notes that current machine learning models generally suffer from “high predictive accuracy but lack of causal explanatory power,” indicating that pursuing classification accuracy alone cannot meet the regulatory demands for interpretability and causal logic in financial decisions.

Meanwhile, at the intersection of optimal control and reinforcement learning theory, the Hamilton–Jacobi–Bellman (HJB) equation serves as a core tool for characterizing dynamic programming in continuous decision-making [13]. Because such nonlinear partial differential equations generally lack classical solutions, their existence and uniqueness theory is developed in a weak solution framework [14,15], and Munos [16] further proved that the value function of a discrete MDP converges to the solution of the continuous HJB equation. Tang et al. [17] introduced entropy regularization into continuous-time control, proposing an exploratory HJB equation in which a temperature parameter controls the stochasticity of the policy. Building on this theoretical foundation, the present paper seeks to embed risk preference explicitly and continuously into the HJB optimal control framework in a mathematically interpretable and numerically stable manner.

To this end, we use the entropy of the policy—i.e., the degree of randomness in its behavior—as a proxy for risk preference: high entropy corresponds to more uniform, conservative behavior that hedges against uncertainty; low entropy approaches a deterministic, risk-neutral exploitative strategy. We introduce entropy regularization into the discrete-time Markov decision process (MDP), construct an entropy-regularized MDP (ER-MDP), and derive the corresponding ER-HJB equation. Theoretical analysis shows that this equation has a unique solution, and as the entropy coefficient tends to zero, the solution converges to that of the classical risk-neutral HJB equation. On this basis, we show that under the idealizing assumptions of the exact tabular setting, the soft policy iteration underlying the Soft Actor-Critic (SAC) algorithm [18] satisfies a fixed-point equation that coincides with the ER-HJB solution; the practical deep implementation can be interpreted as an asynchronous stochastic approximation of this fixed-point iteration, providing a numerically effective heuristic. This theoretical connection not only provides theoretical legitimacy for using SAC in risk-preference regulation but also reveals the intrinsic connection between maximum-entropy reinforcement learning and KL-divergence robust control.

The main contributions of this paper are threefold. Theoretically, we prove the existence and uniqueness of a solution to the ER-HJB equation, show that under exact tabular assumptions the soft policy iteration of SAC converges to this solution, and derive a closed-form analytical solution under linear-quadratic conditions as a numerical benchmark. Methodologically, we establish the entropy coefficient α as a single, interpretable parameter for continuous risk-preference regulation, constructing a complete pathway from optimal control theory to deep RL algorithms. Empirically, using real loan panel data from LendingClub, we verify for the first time the monotonic and continuous regulation effect of α on the risk–return trade-off, demonstrating that the dynamic strategy achieves Pareto improvements in profitability and tail risk control over strong static baselines such as XGBoost and LightGBM.

The remainder of this paper is organized as follows. Section 2 reviews related work and provides a problem formulation. Section 3 establishes the ER-MDP model and presents three core theorems with proofs. Section 4 describes the experimental setup. Section 5 reports experimental results and analysis. Section 6 provides an in-depth discussion from the perspectives of economic interpretation, informational mechanisms, and behavioral interpretability. Section 7 concludes and outlines future directions.

2. Related Work

2.1. Evolution and Static Limitations of Credit Scoring Models

Modern credit scoring systems have long been dominated by statistical methods, with logistic regression serving as the benchmark in industry and regulatory frameworks due to its transparency and robustness [2]. Gradient boosting tree models such as XGBoost [3] and LightGBM [4] have continuously pushed classification accuracy through their strong fitting capacity on high-dimensional data, but these methods reduce credit decisions to a single-period binary classification problem, and their objective functions do not directly align with the long-term profit and risk-adjusted returns that ultimately concern financial institutions [1]. Sousa et al. [6], using real credit card data from a Brazilian financial institution, showed that traditional static scoring models suffer from continuous performance degradation over a two-year operational horizon, and that a “short-memory” sliding window dynamic updating strategy significantly mitigates this issue. Gunnarsson et al. [7] systematically compared ten classifiers on ten datasets using Bayesian statistical tests, concluding that XGBoost is the best overall method, while deep neural networks do not outperform their shallow counterparts and incur significantly higher computational costs. From another critical perspective, Xu et al. [19] found that models selected via feature selection optimizing AUC perform significantly worse in deteriorating economic environments than those optimizing profit or risk—directly demonstrating the necessity of embedding business objectives into the model development process.

2.2. Exploration and Shortcomings of Reinforcement Learning in Financial Decision-Making

To overcome the fundamental limitations of the static paradigm, RL has been introduced into financial decision-making [9]. In credit risk, early work applied DQN to credit classification with handcrafted reward functions to handle sample imbalance [10]. Wang et al. [11] proposed BSPER, which separates the replay buffer by class and prioritizes samples, significantly improving DQN’s ability to recognize minority-class (default) samples and its convergence stability. However, these works still treat credit decisions as sequential classification with terminal states; their reward functions are handcrafted weightings of single-step outcomes, lacking explicit modeling of policy stochasticity (risk preference) and intrinsic connection to HJB optimal control theory. The survey by Barbierato and Gatti [12] further notes that current machine learning models generally suffer from “high predictive accuracy but lack of causal explanatory power”—under frameworks such as Basel III, models must not only accurately estimate risk but also explain the rationale behind decisions, which is a common shortcoming of existing RL credit scoring methods.

2.3. Theoretical Bridge Between Maximum-Entropy Reinforcement Learning and the HJB Equation

The existence and uniqueness theory of solutions proposed by Crandall and Lions [14] provides a weak solution framework for HJB equations that generally lack classical solutions. Munos [16] proved that, under appropriate discretization, the value function of an MDP converges to the solution of the continuous HJB equation, providing theoretical legitimacy for using RL algorithms to solve optimal control problems. On the RL side, starting with the maximum-entropy inverse reinforcement learning of Ziebart et al. [20], the maximum-entropy RL framework encourages the agent to maintain behavioral stochasticity by adding a policy entropy bonus to the objective function [21]. Tang et al. [17] extended entropy regularization to continuous time and space, proposing an exploratory control framework and deriving the corresponding exploratory HJB equation. On this theoretical foundation, the present paper introduces this framework into the discrete-time credit decision MDP, establishes the ER-HJB equation, and shows that under exact tabular conditions the soft policy iteration underlying SAC converges to the solution of the ER-HJB equation, while the practical deep implementation serves as an effective numerical heuristic. We emphasize that this convergence result is restricted to the exact setting and does not imply a general convergence guarantee for deep SAC.

2.4. Summary of Literature Comparison

A methodological comparison of representative works and the proposed framework is presented in Table 1.

2.5. Problem Formulation

Given a stream of loan applications arriving over time

t = 0, 1, \dots

, a credit decision-maker seeks a stochastic policy

π (a | s)

that maximizes the entropy-regularized objective

J_{π}

defined in Equation (2). This objective simultaneously achieves three goals: (i) maximize the expected discounted profit from lending; (ii) control downside risk via the entropy coefficient

α

—larger

α

penalizes deterministic exploitation in uncertain states, leading to more conservative behavior; (iii) provide a continuously parameterized spectrum of risk preferences between risk-neutral profit maximization (

α \to 0

) and maximal robustness.

3. Theoretical Foundations

3.1. Entropy-Regularized Markov Decision Process (ER-MDP)

We model dynamic credit decision-making as a discrete-time entropy-regularized Markov decision process (ER-MDP). Within a standard MDP quintuple

(S, A, P, r, γ)

, we introduce the entropy of the policy

π (\cdot | s)

:

H (π (\cdot | s)) = - \int_{A} π (a | s) \log π (a | s) d a

(1)

and define the entropy-regularized objective function:

J_{π} = E_{T \sim π} [\sum_{t = 0}^{\infty} γ^{t} (r (s_{t}, a_{t}) + α H (π (\cdot | s_{t})))]

(2)

where

α > 0

is the entropy coefficient (also referred to as the temperature parameter), the core parameter for regulating risk preference in this paper. A larger

α

encourages the policy to tend towards a uniform distribution, i.e., to be more conservative; as

α \to 0

, the objective reduces to the standard risk-neutral reinforcement learning objective [17].
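As a rough illustration of Equations (1) and (2), the sketch below evaluates the entropy of a policy and a truncated entropy-regularized return along one trajectory. It assumes a finite action set, so the integral in Equation (1) becomes a sum, and a finite horizon; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution pi(.|s)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

def entropy_regularized_return(rewards, entropies, alpha, gamma):
    """Finite-horizon truncation of J_pi in Equation (2) along one trajectory."""
    r = np.asarray(rewards, dtype=float)
    h = np.asarray(entropies, dtype=float)
    discounts = gamma ** np.arange(len(r))
    return float(np.sum(discounts * (r + alpha * h)))

# A uniform policy over four actions has maximal entropy; a deterministic one has none.
h_uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])
h_greedy = policy_entropy([1.0, 0.0, 0.0, 0.0])

# Same rewards, different policies: the entropy bonus scales with alpha.
j_greedy = entropy_regularized_return([1.0, 1.0], [h_greedy] * 2, alpha=0.5, gamma=0.9)
j_uniform = entropy_regularized_return([1.0, 1.0], [h_uniform] * 2, alpha=0.5, gamma=0.9)
```

With identical rewards, the uniform policy scores higher under the regularized objective, and the gap shrinks to zero as `alpha` approaches zero.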

3.2. Entropy-Regularized HJB Equation (ER-HJB) and Properties of Its Solution

Corresponding to the above ER-MDP, we define the soft Bellman operator

T_{α}

:

(T_{α} V) (s) = α \log \int_{A} \exp (\frac{1}{α} (r (s, a) + γ E_{s'} [V (s')])) d a

(3)

Theorem 1 (Contraction and Existence–Uniqueness of Solution).

When

γ \in [0, 1)

,

T_{α}

is a contraction mapping on the Banach space

(B (S), ∥ \cdot ∥_{\infty})

of bounded functions on

S

. Consequently, there exists a unique fixed point

V_{α}^{*} \in B (S)

satisfying the discrete-time ER-HJB equation:

V_{α}^{*} (s) = α \log \int_{A} \exp (\frac{1}{α} (r (s, a) + γ E_{s'} [V_{α}^{*} (s')])) d a

(4)

Proof.

We will show that

T_{α}

is a

γ

-contraction.

Step 1 (Monotonicity). If $V_{1} \leq V_{2}$ pointwise, then $E_{s'} [V_{1} (s')] \leq E_{s'} [V_{2} (s')]$. Since the exponential and logarithm functions are strictly increasing and $α > 0$, we obtain $T_{α} V_{1} \leq T_{α} V_{2}$.

Step 2 ($γ$-contraction). For any constant $c \in R$ and $V \in B (S)$,

\begin{array}{l} T_{α} (V + c) (s) = α \log \int_{A} \exp (\frac{1}{α} (r (s, a) + γ E_{s'} [V (s') + c])) d a \\ = α \log (e^{γ c / α} \int_{A} \exp (\frac{1}{α} (r (s, a) + γ E_{s'} [V (s')])) d a) \\ = T_{α} V (s) + γ c \end{array}

(5)

Now take any

V_{1}, V_{2} \in B (S)

and set

c = ∥ V_{1} - V_{2} ∥_{\infty}

. Then

V_{1} \leq V_{2} + c

. By monotonicity,

T_{α} V_{1} \leq T_{α} (V_{2} + c) = T_{α} V_{2} + γ c

(6)

Hence

T_{α} V_{1} - T_{α} V_{2} \leq γ ∥ V_{1} - V_{2} ∥_{\infty}

. By symmetry, the reverse inequality also holds, so

{‖T_{α} V_{1} - T_{α} V_{2}‖}_{\infty} \leq γ {‖V_{1} - V_{2}‖}_{\infty}

(7)

Step 3 (Existence and uniqueness). Since $γ < 1$ , $T_{α}$ is a strict contraction on a complete metric space. By the Banach fixed-point theorem, there exists a unique $V_{α}^{*}$ such that $V_{α}^{*} = T_{α} V_{α}^{*}$ ; i.e., Equation (4) holds. □
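The contraction property and the fixed-point iteration can be checked numerically on a toy problem. The following is a minimal sketch, assuming a finite action set so that the integral in Equation (3) becomes a sum over actions; the random MDP, its dimensions, and the names `P`, `r`, and `soft_bellman` are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
gamma, alpha = 0.9, 0.5

# Hypothetical random MDP: P[s, a] is a probability distribution over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

def soft_bellman(V):
    """(T_alpha V)(s) = alpha * log sum_a exp((r(s,a) + gamma * E[V(s')]) / alpha)."""
    q = r + gamma * (P @ V)               # shape (n_states, n_actions)
    z = q / alpha
    m = z.max(axis=1, keepdims=True)      # subtract the max for a stable log-sum-exp
    return alpha * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

# Contraction check on two arbitrary bounded functions V1, V2.
V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
ratio = np.max(np.abs(soft_bellman(V1) - soft_bellman(V2))) / np.max(np.abs(V1 - V2))

# Fixed-point iteration V_{k+1} = T_alpha V_k, starting from zero.
V = np.zeros(n_states)
for _ in range(500):
    V = soft_bellman(V)
residual = np.max(np.abs(soft_bellman(V) - V))
```

Here `ratio` should not exceed `gamma`, and `residual` should be at the level of floating-point round-off, in line with Equation (7) and the Banach fixed-point argument.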

Remark 1.

The discrete-time contraction argument presented above is elementary and self-contained. For continuous state spaces, analogous convergence results can be obtained via viscosity solution theory; see, e.g., Munos [16] (Theorems 5 and 6), which show that under appropriate discretization and monotonicity conditions, the value function of a discrete MDP converges to the unique viscosity solution of the continuous HJB equation. Our setting can be viewed as a discrete-time instance of that general framework.

3.3. Limiting Behavior as the Entropy Coefficient Vanishes

Theorem 2 (Convergence to the Risk-Neutral HJB Equation).

Let

V_{α}^{*}

be the unique solution of Equation (4). Then for every

s \in S

,

\lim_{α \to 0} V_{α}^{*} (s) = V_{0}^{*} (s)

(8)

where

V_{0}^{*}

satisfies the risk-neutral HJB equation:

V_{0}^{*} (s) = \sup_{a \in A} (r (s, a) + γ E_{s'} [V_{0}^{*} (s')])

(9)

Proof.

We use the Laplace principle (a standard result in large deviations theory): for a continuous function

g

on a compact set

A

,

\lim_{α \to 0} α \log \int_{A} e^{g (a) / α} d a = \sup_{a \in A} g (a)

(10)

Fix

s \in S

. By Theorem 1, the family

{\{V_{α}^{*}\}}_{α > 0}

is uniformly bounded. For a sequence

α_{n} \to 0

, the bounded sequence

\{V_{α_{n}}^{*} (s)\}

has a convergent subsequence (Bolzano–Weierstrass). Since the state space

S

is finite, we can iterate this subsequence extraction finitely many times to obtain a single subsequence

\{α_{n_{k}}\}

such that

V_{α_{n_{k}}}^{*} (s)

converges for all

s \in S

. Denote the pointwise limit by

V_{0} (s)

.

Taking the limit in Equation (4) along this subsequence and applying Equation (10) gives Equation (11).

V_{0} (s) = \sup_{a \in A} (r (s, a) + γ E_{s'} [V_{0} (s')])

(11)

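The risk-neutral Bellman operator on the right-hand side of Equation (11) is a $γ$-contraction on $B (S)$ by the standard argument (Lemma 1 of [18] treats the soft case, and the risk-neutral case corresponds to $α \to 0$), so Equation (11) has a unique bounded solution. Hence

V_{0} = V_{0}^{*}

. Because every sequence $α_{n} \to 0$ admits a subsequence converging to this same limit, the whole family converges pointwise to $V_{0}^{*}$. □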

Remark 2.

In continuous state spaces, Tang et al. [17] prove that the value function of the exploratory control problem converges to that of the classical problem as the exploration weight

λ \to 0

(see Theorem 3.9 in [17]). Their proof relies on viscosity solution theory and equicontinuity arguments. In our discrete-time finite-state setting, we provide a simpler self-contained proof using only the Laplace principle and finite subsequence extraction. The conclusion is consistent with the general result of Tang et al. [17].
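The limit used in the proof of Theorem 2 is easy to observe numerically. The sketch below assumes a finite action set (a sum replaces the integral in Equation (10)), for which the soft maximum lies between the hard maximum and the hard maximum plus $α \log |A|$; the values in `g` are hypothetical.

```python
import numpy as np

g = np.array([0.3, -1.2, 0.9, 0.85])      # hypothetical values g(a) on four actions

def soft_max_value(g, alpha):
    """alpha * log sum_a exp(g(a) / alpha), computed with a stable log-sum-exp."""
    z = g / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

alphas = [1.0, 0.5, 0.1, 0.01, 0.001]
# Gap between the soft maximum and the hard maximum for each temperature.
gaps = [soft_max_value(g, a) - g.max() for a in alphas]
```

The gaps shrink toward zero as `alpha` decreases, which is the discrete counterpart of the Laplace principle invoked above.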

3.4. Theoretical Connection Between the SAC Algorithm and the ER-HJB Equation

Soft Actor-Critic (SAC) is an efficient off-policy maximum-entropy reinforcement learning algorithm [18]. Its core iteration consists of:

Soft policy evaluation: For a fixed policy $π$ , repeatedly apply the following operator to compute the soft Q-function:

$T^{π} Q (s, a) = r (s, a) + γ E_{s'} [E_{a' \sim π (\cdot | s')} [Q (s', a') - α \log π (a' | s')]]$

(12)
Soft policy improvement: Update the policy by minimizing the KL divergence to the Boltzmann distribution of the current Q-function:

$π_{\text{new}} (\cdot | s) = \arg \min_{π} D_{KL} (π (\cdot | s) ‖ \frac{\exp (\frac{1}{α} Q^{π_{\text{old}}} (s, \cdot))}{Z^{π_{\text{old}}} (s)})$

(13)

Theorem 3 (Operator-theoretic correspondence under exact tabular policy iteration).

The relationship between the SAC framework and the ER-HJB equation is described in two distinct settings:

Part (a)—Exact tabular setting (finite state/action spaces, no function approximation). The soft Bellman operator $T_{α}$ defined in Equation (3) coincides with the optimal soft Bellman operator in this setting. Exact soft policy iteration—alternating exact soft policy evaluation and exact soft policy improvement—generates a sequence $\{V_{k}\}$ satisfying $V_{k + 1} = T_{α} V_{k}$ . Since $T_{α}$ is a $γ$ -contraction (Theorem 1), this sequence converges linearly to the unique fixed point $V_{α}^{*}$ of the ER-HJB Equation (4).
Part (b)—Practical deep RL implementation (neural networks, stochastic gradients, replay buffer). The practical SAC algorithm can be interpreted as an asynchronous stochastic approximation of the exact fixed-point iteration described in Part (a). However, a rigorous global convergence proof for nonlinear function approximation remains an open problem in reinforcement learning theory. The algorithm has been empirically shown to perform well under standard assumptions (bounded rewards, compact action space, sufficient exploration). We therefore adopt it as a numerically effective heuristic for the ER-HJB equation, without claiming a general convergence guarantee.

Proof.

We break the proof into three parts.

Part 1 (Optimal soft Bellman operator). Define the optimal soft Bellman operator $T_{α}^{\text{opt}}$ by

(T_{α}^{\text{opt}} V) (s) = α \log \int_{A} \exp (\frac{1}{α} (r (s, a) + γ E_{s'} [V (s')])) d a

(14)

which is exactly

T_{α}

in Equation (3). According to Lemma 1 of [18], for any fixed policy

π

, the soft policy evaluation operator

T^{π}

has a unique fixed point

Q^{π}

. The optimal soft Q-function satisfies

Q^{*} = T^{π} Q^{*}

with the optimal policy, and taking the logarithm and integrating yields Equation (4). Hence

V_{α}^{*}

is the fixed point of

T_{α}

.

Part 2 (exact soft policy iteration). Lemma 2 of [18] shows that the policy improvement step (13) has the closed-form solution

π_{\text{new}} (a | s) = \frac{\exp (\frac{1}{α} Q^{π_{\text{old}}} (s, a))}{\int_{A} \exp (\frac{1}{α} Q^{π_{\text{old}}} (s, a')) d a'}

(15)

Substituting this policy into the soft Bellman equation for the value function gives exactly the update

V_{\text{new}} = T_{α} V_{\text{old}}

. Therefore, the alternating evaluation–improvement process corresponds to iterating

T_{α}

. By Theorem 1,

T_{α}

is a

γ

-contraction, so the iteration converges linearly to the unique fixed point

V_{α}^{*}

.

Part 3 (Practical deep RL implementation). In practical SAC, the exact backups are replaced by stochastic gradient descent using samples from a replay buffer and function approximators (neural networks). This results in an asynchronous stochastic approximation of the fixed-point iteration. While a global convergence guarantee for nonlinear function approximation is not available, the method satisfies the usual RL assumptions: rewards are bounded, the action space $[0, 1]$ is compact, and the Gaussian policy with bounded variance ensures sufficient exploration. Under these conditions, the algorithm has been empirically shown to work well [18], and we adopt it as an effective numerical heuristic for the ER-HJB equation. □

Remark 3 (Scope of the theorem).

Theorem 3 establishes a theoretical correspondence only under the exact tabular setting. In that setting, the deterministic soft policy iteration converges to the fixed point of the ER-HJB equation. This correspondence provides a mathematical foundation for interpreting practical deep SAC as an approximate solver. It does not constitute a convergence proof for deep SAC with neural networks, which remains an open problem. The practical algorithm should be understood as a heuristic that performs well empirically, as demonstrated in Section 5.1.
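The link between the Boltzmann policy of Equation (15) and the log-sum-exp inside the soft Bellman operator can be verified directly in the tabular case. The sketch below assumes a finite action set and uses a hypothetical table of soft Q-values; it checks that the soft state value of the Boltzmann policy, $E_{a \sim π} [Q (s, a) - α \log π (a | s)]$, equals $α \log \sum_{a} \exp (Q (s, a) / α)$, and that a uniform policy does no better.

```python
import numpy as np

alpha = 0.5
Q = np.array([[1.0, 0.2, -0.5],
              [0.0, 0.3,  0.1]])          # hypothetical soft Q-values, shape (states, actions)

def boltzmann_policy(Q, alpha):
    """pi(a|s) proportional to exp(Q(s,a)/alpha), as in Equation (15)."""
    z = (Q - Q.max(axis=1, keepdims=True)) / alpha
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def soft_value(pi, Q, alpha):
    """E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for each state."""
    return (pi * (Q - alpha * np.log(pi))).sum(axis=1)

pi = boltzmann_policy(Q, alpha)
v_boltzmann = soft_value(pi, Q, alpha)

# Log-sum-exp form that appears inside the soft Bellman operator.
v_lse = alpha * np.log(np.exp(Q / alpha).sum(axis=1))

# Any other policy, for example the uniform one, attains a soft value that is no larger.
uniform = np.full_like(Q, 1.0 / Q.shape[1])
v_uniform = soft_value(uniform, Q, alpha)
```

This identity is what allows the improvement step to be folded into a single log-sum-exp backup in the exact setting.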

3.5. Linear-Quadratic Analytical Verification

To numerically verify the above theory, we consider a simplified linear-quadratic regulator (LQR) problem. Assume the system dynamics and reward function are both linear-quadratic:

s_{t + 1} = A s_{t} + B a_{t} + ω_{t}

(16)

r = - (s^{T} Q s + a^{T} R a)

(17)

Under this setting, the ER-HJB equation admits an analytical solution, as derived in the continuous-time maximum entropy LQR framework [22]. The optimal policy is a Gaussian distribution:

π^{*} (a | s) = N (- K s, α {(R + B^{T} P B)}^{- 1})

(18)

where the gain matrix

K

has exactly the same form as in the risk-neutral LQR, and the covariance matrix of the policy is linearly regulated by

α

[17]. The matrix

P

is the solution to the following Discrete Algebraic Riccati Equation (DARE) [23]:

P = Q + A^{T} P A - A^{T} P B {(R + B^{T} P B)}^{- 1} B^{T} P A

(19)

This analytical solution clearly illustrates the effect of entropy regularization: it does not alter the “direction” (mean) of the optimal policy, but solely controls the policy’s “exploration radius” (covariance) through the temperature parameter

α

. This result provides a rigorous ground-truth benchmark for verifying the effectiveness of the SAC algorithm as a numerical heuristic.
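A small numerical sketch of this benchmark is given below. It takes Equations (18) and (19) exactly as printed, solves the Riccati equation by plain fixed-point iteration, and compares the policy moments at two temperatures. The scalar system matrices, the state `s`, and the helper names are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical scalar-state, scalar-action system; names follow Equations (16)-(19).
A = np.array([[0.9]])
B = np.array([[0.5]])
Q = np.array([[1.0]])
R = np.array([[0.1]])

def riccati_map(P):
    """Right-hand side of the DARE in Equation (19)."""
    G = R + B.T @ P @ B
    return Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A)

def solve_dare(iters=1000, tol=1e-13):
    """Iterate the Riccati map from P = Q until successive iterates agree."""
    P = Q.copy()
    for _ in range(iters):
        P_next = riccati_map(P)
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P

P = solve_dare()
G = R + B.T @ P @ B
K = np.linalg.solve(G, B.T @ P @ A)          # risk-neutral LQR gain, independent of alpha

def policy_moments(s, alpha):
    """Mean and covariance of the Gaussian policy in Equation (18)."""
    return -K @ s, alpha * np.linalg.inv(G)

s = np.array([2.0])
mean_lo, cov_lo = policy_moments(s, alpha=0.1)
mean_hi, cov_hi = policy_moments(s, alpha=1.0)
```

The two means coincide while the covariance scales by the ratio of the temperatures, which is the behavior a trained SAC policy can be compared against.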

3.6. SAC Algorithm Pseudocode and Overall Research Framework

Based on the above theoretical analysis, we choose the SAC algorithm as the core numerical tool for solving the ER-HJB equation. As an off-policy Actor-Critic method, SAC approximates the solution to Equation (4) by maintaining a parameterized policy network

π_{ϕ}

, two soft Q-function networks

Q_{θ_{1}}

,

Q_{θ_{2}}

, and an optional automatic entropy tuning module. To facilitate understanding and implementation, we present the core pseudocode of the SAC algorithm in Algorithm 1 below.

Algorithm 1: SAC Algorithm Pseudocode
Input: environment env, temperature parameter $α$, discount factor $γ$, soft update rate $τ$, learning rate $λ$, replay buffer capacity $N$, batch size $B$
Output: learned policy network parameters $ϕ$
Initialize policy network parameters $ϕ$ and two Q-function network parameters $θ_{1}$, $θ_{2}$
Initialize target Q-network parameters ${\bar{θ}}_{1} \leftarrow θ_{1}$, ${\bar{θ}}_{2} \leftarrow θ_{2}$
Initialize replay buffer $D$ with capacity $N$
for each environment step $t = 1, \dots, T$ do
    Observe current state $s_{t}$ from the environment and sample action $a_{t}$ according to $π_{ϕ} (\cdot \mid s_{t})$
    Execute action $a_{t}$ in the environment, observe reward $r_{t}$ and next state $s_{t + 1}$
    Store transition tuple $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in replay buffer $D$
    if number of samples in $D \geq B$ then
        for each gradient step $g = 1, \dots, G$ do
            Sample a random mini-batch of $B$ transitions $(s, a, r, s')$ from $D$
            Compute target value $y = r + γ (\min_{i = 1, 2} Q_{{\bar{θ}}_{i}} (s', a') - α \log π_{ϕ} (a' \mid s'))$, where $a' \sim π_{ϕ} (\cdot \mid s')$
            for $i = 1, 2$ do
                Update $θ_{i}$: $θ_{i} \leftarrow θ_{i} - λ \nabla_{θ_{i}} \frac{1}{B} \sum {(Q_{θ_{i}} (s, a) - y)}^{2}$
            end for
            Update $ϕ$: $ϕ \leftarrow ϕ - λ \nabla_{ϕ} \frac{1}{B} \sum [α \log π_{ϕ} (a_{ϕ} \mid s) - \min_{i = 1, 2} Q_{θ_{i}} (s, a_{ϕ})]$, where $a_{ϕ} \sim π_{ϕ} (\cdot \mid s)$
            (Optional) Automatically tune $α$
            Soft update target network parameters: ${\bar{θ}}_{i} \leftarrow τ θ_{i} + (1 - τ) {\bar{θ}}_{i}$ for $i = 1, 2$
        end for
    end if
end for
return $ϕ$
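Two of the updates in Algorithm 1 are simple enough to sketch without a deep learning library. The snippet below follows the standard SAC formulation of [18], in which the critic target carries the entropy term $- α \log π (a' | s')$ alongside the minimum of the two target critics. The function names and the mini-batch values are hypothetical, terminal-state handling is omitted, and the network updates themselves are not shown.

```python
import numpy as np

def soft_td_target(r, q1_next, q2_next, logp_next, alpha, gamma):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')), elementwise over a batch."""
    return r + gamma * (np.minimum(q1_next, q2_next) - alpha * logp_next)

def polyak_update(target, online, tau):
    """Soft target update: theta_bar <- tau * theta + (1 - tau) * theta_bar."""
    return tau * online + (1.0 - tau) * target

# Hypothetical mini-batch of three transitions.
r = np.array([0.10, -0.40, 0.05])
q1_next = np.array([1.0, 0.8, 1.2])          # target critic 1 at (s', a')
q2_next = np.array([0.9, 1.1, 1.3])          # target critic 2 at (s', a')
logp_next = np.array([-0.5, -1.0, 0.2])      # log pi(a'|s') for the sampled a'
y = soft_td_target(r, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99)

# One soft update of a toy parameter vector with tau = 0.005.
theta_bar = polyak_update(target=np.zeros(4), online=np.ones(4), tau=0.005)
```

Taking the minimum of the two critics is the clipped double-Q device that SAC uses to limit overestimation; a larger `alpha` raises the target wherever the next action has low log-probability.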

The above pseudocode outlines the core alternating optimization process of the policy network and Q-networks in SAC. Under the idealizing assumptions of the tabular setting, the iterative updates would converge to

V_{α}^{*}

. In the practical deep implementation, the algorithm is designed to progressively approximate the optimal value function, but a rigorous convergence guarantee is not currently available. Empirically, the algorithm achieves stable and accurate approximations, as demonstrated in our LQR verification experiment (Section 5.1). The overall research framework of our dynamic credit decision system is illustrated in Figure 1 below.

Figure 1 provides a visual overview of the complete research design proposed in this paper. The upper block corresponds to the entropy-regularized Markov decision process (ER-MDP). The state

s_{t} \in R^{11}

consists of six borrower features (e.g., loan amount, interest rate, debt-to-income ratio) and five macroeconomic variables (unemployment rate, federal funds rate, etc.). The action

a_{t} \in [0, 1]

represents the credit granting ratio for the current loan, interpretable as the approval intensity or the degree of risk exposure reduction. The reward

r_{t}

is defined as interest income minus default loss minus a convex cost (see Equation (20) for details). The lower-left block is the SAC agent, whose core objective is to maximize the entropy-regularized return

E [\sum γ^{t} (r_{t} + α H (π (\cdot | s_{t})))]

, where

α

, as a single tunable parameter, continuously controls the risk preference of the policy: as

α

approaches zero, the policy tends to be risk-neutral and aggressive; as

α

increases, the policy becomes more conservative and stochastic. The lower-right block is the evaluation and output module, which employs downside risk-adjusted return (Sortino ratio), tail risk (CVaR 95%), and the high-risk credit ratio (HCR) to compare the trained dynamic policy against static baseline models such as XGBoost and LightGBM. The arrows in Figure 1 connect the data flows and decision feedback paths, fully illustrating the closed-loop research pipeline from environment interaction to policy optimization and performance evaluation.
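For readers who want to reproduce the evaluation metrics, the sketch below implements one common convention for CVaR at the 95% level (the mean of the worst 5% of per-step rewards) and for the Sortino ratio (mean excess reward over downside deviation). These definitions, the zero target, and the function names are assumptions made here for illustration; the exact formulas used in the experiments may differ.

```python
import numpy as np

def cvar(rewards, level=0.95):
    """Mean of the worst (1 - level) fraction of rewards, i.e. the lower tail."""
    x = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(round((1.0 - level) * len(x))))   # number of tail observations
    return float(x[:k].mean())

def sortino_ratio(rewards, target=0.0):
    """Mean excess reward over the target divided by the downside deviation."""
    x = np.asarray(rewards, dtype=float)
    downside = np.minimum(x - target, 0.0)           # only shortfalls count as risk
    dd = np.sqrt(np.mean(downside ** 2))
    return float((x.mean() - target) / dd) if dd > 0 else float("inf")

# Toy checks on hand-computable inputs.
tail = cvar([-3.0, -1.0, 0.0, 2.0], level=0.5)       # mean of the two worst outcomes
ratio = sortino_ratio([1.0, -1.0, 3.0])              # mean 1, downside deviation sqrt(1/3)
```

Under this convention a higher (less negative) CVaR means a milder tail, so an "improvement" in tail risk corresponds to the value moving toward zero.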

4. Experimental Design and Methodology

4.1. From ER-MDP to the Credit Environment: Connecting Theory and Experiment

To empirically test Theorems 1–3 and verify the $α$-tuning ability predicted by Theorem 2, we constructed a realistic credit decision environment based on the ER-MDP defined in Section 3.1. Specifically: the state space

S

corresponds to an 11-dimensional feature vector (6 borrower features + 5 macroeconomic variables); the action space

A = [0, 1]

encodes the credit granting ratio; the reward function defined in Equation (20) directly instantiates

r (s, a)

in the ER-MDP. Theorem 3 provides a theoretical interpretation that supports using SAC as an approximate numerical solver for the ER-HJB equation. Training SAC in this environment therefore yields critic and policy networks that approximate

V_{α}^{*}

and the corresponding optimal stochastic policy, and varying

α

allows us to observe the empirical manifestation of the “mean unchanged, covariance linearly scaled” behavior predicted by Equation (18) on real credit data.

4.2. Dataset Description

This study employs publicly available personal loan data issued by LendingClub from 2016 to 2018 for empirical validation [24]. After rigorous cleaning, a total of 518,706 valid loans are retained, comprising 402,449 fully repaid loans and 116,257 defaulted loans. To capture the dynamic evolution of risk, the data are expanded into a monthly panel, resulting in over 21 million observations. The dataset is strictly split by time: January 2016 to December 2017 serves as the training set (approximately 6.8 million time steps), and the full year 2018 serves as the test set (approximately 6 million time steps), precluding any future information leakage.

The input features consist of 6 borrower-level dimensions and 5 macroeconomic dimensions, including UNRATE [25], FEDFUNDS [26], CPIAUCSL [27], INDPRO [28], and MORTGAGE30US [29]. The reward function is designed as:

r_{t} = \text{reward\_scaled}_{t} \cdot a_{t} - 0.01 a_{t}^{2}

(20)

where

\text{reward\_scaled}_{t}

is the monthly interest (for fully repaid loans) or the net loss after deducting principal (for defaults), scaled by one-thousandth, and the action

a_{t} \in [0, 1]

represents the credit granting ratio. The quadratic term

- 0.01 a_{t}^{2}

has the economic interpretation of imposing a small convex cost that grows with the granting ratio: it is zero for a full rejection, largest for a full grant, and therefore discourages extreme exposure and smooths the decisions.
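Equation (20) can be written as a one-line function. The sketch below also computes the granting ratio that maximizes the single-step reward alone, which follows from setting the derivative of Equation (20) with respect to $a_{t}$ to zero and clipping to $[0, 1]$. The function names are hypothetical, and the myopic maximizer ignores both future value and the entropy bonus, so it is not the SAC policy.

```python
import numpy as np

COST_COEF = 0.01  # coefficient of the quadratic term in Equation (20)

def reward(reward_scaled, a):
    """r_t = reward_scaled_t * a_t - 0.01 * a_t**2 for a granting ratio a_t in [0, 1]."""
    return reward_scaled * a - COST_COEF * a ** 2

def myopic_best_ratio(reward_scaled):
    """Maximizer of the one-step reward over [0, 1]: clip(reward_scaled / 0.02, 0, 1)."""
    return np.clip(reward_scaled / (2.0 * COST_COEF), 0.0, 1.0)

# A performing loan month (positive scaled interest) versus a default month (negative).
r_good_full = reward(0.05, 1.0)      # interest earned minus the convex cost
r_bad_full = reward(-0.50, 1.0)      # loss incurred plus the convex cost
a_small_margin = myopic_best_ratio(0.01)
a_default = myopic_best_ratio(-0.50)
```

For a scaled payoff below 0.02 the one-step optimum is an interior ratio, which is how the quadratic term produces partial grants on thin margins.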

4.3. Model Architecture

This study adopts the Soft Actor-Critic (SAC) algorithm as the core dynamic decision model. SAC is an off-policy Actor-Critic algorithm based on the maximum-entropy framework and consists of five neural network modules: one policy network (Actor), two soft Q-function networks (Critic), and two target Q-function networks. The policy network outputs the probability distribution over actions; the Q-function networks estimate the value of state–action pairs; and the target Q-function networks are used to stabilize value estimation during training.

In the concrete implementation, both the policy network and the Q-function networks adopt a multilayer perceptron (MLP) architecture with an input dimension of 11, corresponding to the 6 borrower features and 5 macroeconomic variables. Following the default configuration of the MlpPolicy in the Stable-Baselines3 framework [30], the two networks share the feature extraction layer architecture. Specifically, the Actor-Critic network consists of two hidden layers, each with 64 neurons and ReLU activation functions. The policy network output layer contains two branches, a mean branch outputting the action mean

μ_{ϕ} (s)

and a log-standard-deviation branch outputting

\log σ_{ϕ} (s)

, which together parameterize a Gaussian distribution

π_{ϕ} (a | s) = \mathcal{N} (μ_{ϕ} (s), σ_{ϕ}^{2} (s))

. The Q-function network outputs a scalar

Q_{ϕ} (s, a)

that estimates the expected cumulative reward of taking action

a

in state

s

.

Regarding parameter configuration, the discount factor γ is set to 0.99, the soft update rate τ to 0.005, the learning rate λ to $3 \times 10^{-4}$, the replay buffer capacity N to $1 \times 10^{6}$, and the batch size B to 256. One gradient update is performed per environment step, and both the policy network and Q-function networks use the Adam optimizer. The entropy coefficient α, the core regulatory parameter of this paper, is explored across seven values: $\{0.01, 0.05, 0.10, 0.15, 0.20, 0.50, 1.00\}$. Each SAC agent is run for 2 million environment steps on the training set to ensure sufficient convergence.
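The reported settings can be collected into one configuration per entropy coefficient. The sketch below uses the keyword-argument names of `stable_baselines3.SAC`; mapping the reported values onto those arguments is an assumption about how the library was configured, and `env` in the closing comment is a hypothetical Gymnasium environment.

```python
# Hyperparameters as reported in the text, keyed by the keyword-argument names
# of stable_baselines3.SAC. How the authors actually passed them is assumed.
SAC_KWARGS = {
    "gamma": 0.99,             # discount factor
    "tau": 0.005,              # soft (Polyak) update rate for target networks
    "learning_rate": 3e-4,     # Adam step size
    "buffer_size": 1_000_000,  # replay buffer capacity
    "batch_size": 256,
    "train_freq": 1,           # one update round per environment step
    "gradient_steps": 1,       # one gradient update per round
}
POLICY_KWARGS = {"net_arch": [64, 64]}  # two hidden layers of 64 units
ALPHAS = [0.01, 0.05, 0.10, 0.15, 0.20, 0.50, 1.00]
TOTAL_TIMESTEPS = 2_000_000


def sac_config(alpha: float) -> dict:
    """Return the keyword arguments for one fixed-entropy-coefficient run."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return {**SAC_KWARGS, "ent_coef": alpha, "policy_kwargs": POLICY_KWARGS}


configs = {alpha: sac_config(alpha) for alpha in ALPHAS}
# With stable-baselines3 installed and a Gymnasium environment `env` (hypothetical):
#   model = SAC("MlpPolicy", env, **configs[0.20])
#   model.learn(total_timesteps=TOTAL_TIMESTEPS)
```

Passing a float as `ent_coef` keeps α fixed during training instead of letting the library tune it automatically, which is what a sweep over preset α values requires.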

For comparison, six static baseline models are trained: logistic regression (LR), random forest (RF), XGBoost [3], LightGBM [4], a multilayer perceptron (MLP), and a stacked meta-model [31]; all are implemented using standard libraries such as scikit-learn [32]. The default probability output by each static model is converted into a credit granting ratio via the linear mapping $a = 1 - P(\mathrm{default})$.

4.4. Evaluation Metrics

To ensure the rigor of the experimental conclusions and the validity of financial interpretation, we explicitly define the core evaluation metrics used in this paper. The evaluation system characterizes the decision quality of the models from multiple dimensions, including profitability, downside risk, and tail risk. The financial meaning of each metric is directly linked to the properties of

V_{α}^{*}

in Theorem 2; the Sortino ratio and CVaR 95% measure how the downside risk and tail properties of

V_{α}^{*}

change with

α

.

Average Reward (AR) measures the expected net profit level brought by the model for each loan in the test set, serving as a direct measure of profitability:

$\mathrm{AR} = \frac{1}{N} \sum_{i = 1}^{N} R_{i}$

(21)

Total Reward (TR) measures the cumulative net profit brought by the model for all loans in the test set, providing an aggregate view of profitability:

$\mathrm{TR} = \sum_{i = 1}^{N} R_{i}$

(22)

Standard Deviation of Reward (Std) quantifies the overall uncertainty of single-loan returns, defined as the standard deviation of rewards:

$\mathrm{Std} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(R_{i} - \mathrm{AR})}^{2}}$

(23)

Sortino Ratio (Sortino) evaluates downside risk-adjusted return, considering only volatility below a target return (set to 0 in this work) as risk [33]:

$\mathrm{Sortino} = \frac{E [R]}{σ_{\mathrm{down}}}, \quad σ_{\mathrm{down}} = \sqrt{E [{(\min \{0, R\})}^{2}]}$

(24)

Conditional Value-at-Risk (CVaR95%) measures extreme tail risk, defined as the average loss in the worst 5% of the return distribution (expressed as a negative number) [34]:

\mathrm{CVaR}_{95\%} = E [R \mid R \leq \mathrm{VaR}_{95\%}]

(25)

where VaR95% is the 5th percentile of the return distribution.
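Equations (21) to (25) can be computed from a list of per-loan rewards as sketched below. The empirical CVaR estimator used here (the mean of the worst 5% of rewards, rounded up to at least one sample) is an assumption, since the text does not specify one, and the toy rewards are illustrative only.

```python
import math


def evaluation_metrics(rewards, tail=0.05):
    """AR, TR, Std, Sortino (target 0) and CVaR 95% per Equations (21)-(25)."""
    n = len(rewards)
    if n == 0:
        raise ValueError("rewards must be non-empty")
    total = sum(rewards)
    ar = total / n
    std = math.sqrt(sum((r - ar) ** 2 for r in rewards) / n)  # population form
    downside = math.sqrt(sum(min(0.0, r) ** 2 for r in rewards) / n)
    sortino = ar / downside if downside > 0 else float("inf")
    # Assumed empirical estimator: average the worst ceil(tail * n) rewards.
    k = max(1, math.ceil(tail * n))
    cvar = sum(sorted(rewards)[:k]) / k
    return {"AR": ar, "TR": total, "Std": std, "Sortino": sortino, "CVaR95": cvar}


# Toy rewards for illustration only.
toy = [2.0, 1.0, 1.0, 0.0, -1.0, -3.0, 2.0, 1.0, 1.0, 0.0]
metrics = evaluation_metrics(toy)
```

Because the downside deviation averages over all N observations, not only the negative ones, a strategy with few losses is rewarded both for their rarity and for their small size.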

High-Risk Credit Ratio (HCR) captures the strategy’s propensity to grant credit to high-risk borrowers. The high-risk group consists of borrowers whose unemployment rate exceeds the 75th percentile and who have a positive number of historical delinquencies in the past two years:

\mathrm{HCR} = \frac{1}{N_{h}} \sum_{i \in H} a_{i}

(26)

where $H$ is the set of high-risk borrowers, $N_{h}$ is the number of such borrowers, and $a_{i} \in [0, 1]$ is the credit ratio assigned to borrower $i$. A higher HCR indicates a more lenient strategy towards high-risk groups.
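A minimal sketch of Equation (26) follows. The unemployment threshold is passed in explicitly because the text does not say how the 75th percentile is estimated; all argument names and toy values are hypothetical.

```python
def high_risk_credit_ratio(actions, unemployment, delinquencies, threshold):
    """Equation (26): mean granting ratio over the high-risk set H."""
    high_risk = [
        a
        for a, u, d in zip(actions, unemployment, delinquencies)
        if u > threshold and d > 0  # both conditions define membership in H
    ]
    if not high_risk:
        raise ValueError("no high-risk borrowers under this definition")
    return sum(high_risk) / len(high_risk)


# Toy inputs for illustration; the threshold stands in for the 75th percentile.
hcr = high_risk_credit_ratio(
    actions=[0.9, 0.8, 0.4, 0.2, 0.6],
    unemployment=[3.6, 4.9, 5.1, 5.3, 5.0],
    delinquencies=[0, 0, 1, 2, 0],
    threshold=4.8,
)
```

In the toy data only the third and fourth borrowers meet both conditions, so HCR is the mean of their granting ratios.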

4.5. Experimental Environment

All experiments in this study were conducted on a LlamaFactory instance image on the StarverseAI platform. The hardware configuration consisted of one NVIDIA GeForce RTX 4090 GPU, an AMD EPYC 7542 32-Core Processor CPU, 48 GB DDR4 memory, and a 150 GB NVMe SSD system disk. The software environment used Python 3.11 as the programming language, PyTorch 2.2.0 as the deep learning framework (CUDA 12.1), gymnasium 0.29.1 as the reinforcement learning environment library, and stable-baselines3 2.3.2 as the SAC algorithm implementation library.

5. Experimental Results and Analysis

5.1. Theoretical Verification

The theoretical verification experiment tests Theorem 3 numerically on a three-dimensional LQR system, namely whether SAC acts as an effective numerical solver for the ER-HJB equation. We computed the analytical solution of the entropy-regularized LQR according to Equation (19), obtaining the true optimal gain

K_{\mathrm{true}} = [0.4104, 0.2484, 0.0823]

(27)

as well as the policy covariance matrix. Subsequently, with

α = 0.1

fixed, we trained an SAC agent in the LQR environment corresponding to this analytical solution for a total of 200,000 environment steps. After training, the deterministic actions (the mean of the policy distribution) learned by SAC were compared with the true optimal actions given by the analytical solution on 1000 randomly generated states.

The scatter plot comparison of the two is presented in Figure 2. The mean squared error between the SAC policy outputs and the true optimal actions is MSE = 0.0267, with the data points clustered around the ideal 45-degree diagonal. This small error is numerical evidence consistent with Theorem 3: the iterative gradient updates of the SAC algorithm behave as an effective asynchronous stochastic approximation of the contraction mapping

T_{α}

defined by the ER-HJB equation. As a “proof of concept” bridging the theoretical framework and the subsequent credit application, this verification experiment provides numerical support for employing the SAC algorithm to solve the ER-HJB equation on real credit data.
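The comparison procedure can be sketched as follows. The sign convention a = -K s, the uniform state distribution, and the perturbed stand-in gain are all assumptions made for illustration; a real check would query the trained SAC policy for its mean action instead.

```python
import random

K_TRUE = [0.4104, 0.2484, 0.0823]  # gain reported in Equation (27)


def linear_action(gain, state):
    """Deterministic LQR action a = -K s (sign convention assumed)."""
    return -sum(k * s for k, s in zip(gain, state))


def policy_mse(policy, gain, states):
    """Mean squared error between a policy's actions and the analytical ones."""
    errors = [(policy(s) - linear_action(gain, s)) ** 2 for s in states]
    return sum(errors) / len(errors)


rng = random.Random(0)
states = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(1000)]

# Stand-in for the trained SAC mean action: the analytical gain with a small,
# arbitrary perturbation. It is not the authors' learned policy.
K_STAND_IN = [0.40, 0.25, 0.08]
mse = policy_mse(lambda s: linear_action(K_STAND_IN, s), K_TRUE, states)
exact = policy_mse(lambda s: linear_action(K_TRUE, s), K_TRUE, states)
```

The size of an MSE depends on the scale of the sampled states and actions, so it is most informative when read alongside the scatter plot.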

5.2. Continuous Regulation of Risk–Return Trade-Off

We investigate the regulatory capacity of the entropy coefficient

α

over the risk preference of decision strategies in a real credit environment, testing the core predictions of Theorem 2 and Equation (18)—that adjusting only

α

continuously changes policy stochasticity and risk preference, and that this regulation is monotonic, smooth, and bidirectional.

On the full LendingClub test set, we systematically evaluate SAC agents with seven different values of

α

; their core performance metrics are summarized in Table 2. A clear pattern emerges: as

α

increases from 0.01 (approaching risk neutrality) to 1.00 (emphasizing entropy maximization), strategies exhibit a monotonic and continuous trade-off between return and risk. Specifically, the average reward AR decreases monotonically from 1.6773 to 0.9606, and total reward TR declines correspondingly; meanwhile, risk metrics improve across the board—the standard deviation of reward Std narrows continuously from 2.0434 to 1.1530, the Sortino ratio climbs from 1.5679 to a peak of 1.6312 at

α = 0.20

and plateaus thereafter, and tail risk CVaR95% substantially improves from −1.3383 to −0.7217, while the high-risk credit ratio HCR decreases strictly monotonically from 0.9095 to 0.5211.

Figure 3 and Figure 4 visualize the above trade-off relationship from different perspectives. The α–Sortino ratio curve in Figure 3 rises rapidly at first and then flattens out, with the Sortino ratio reaching its maximum around

α = 0.20

, indicating the existence of a “golden temperature” that maximizes the robust return adjusted for downside risk. The subsequent plateau implies that further increasing

α

, while continuing to suppress tail risk, no longer enhances risk-adjusted return efficiency—the marginal benefit of risk aversion diminishes. The dual-axis plot in Figure 4 more directly reveals this “seesaw” effect: the red line representing average return and the blue line representing the absolute value of CVaR exhibit symmetric divergence as

α

increases. Between

α = 0.05

and

α = 0.20

, a “trade-off corridor” emerges between the two curves—within this interval, a decision-maker can clearly weigh how much return concession to accept in exchange for a specific degree of tail-risk improvement. This continuous and predictable regulatory effect holds profound practical significance: it endows financial institutions with a single, interpretable, and transparent “risk knob,” enabling them to precisely position themselves at any operating point shown in Figure 4 according to their capital adequacy and risk appetite.

5.3. Systematic Comparison with Static Baselines

To comprehensively evaluate the practical effectiveness of the ER-HJB/SAC dynamic framework—both its absolute profitability advantage and its flexibility in risk–return trade-offs—we systematically compare it with six representative static baseline models (LR, RF, XGBoost, LightGBM, MLP, and stacked meta-model) on the same test set. The evaluation results for all models are detailed in Table 2, Table 3 and Table 4. The analysis proceeds along two core dimensions: the absolute advantage in profitability and the fundamental difference in risk–return trade-off flexibility.

Absolute advantage in profitability. The SAC model operating with

α = 0.01

can be regarded as a near-risk-neutral dynamic strategy benchmark. Its average reward AR reaches 1.6773 and total reward TR exceeds 870,000, representing an approximately 22.8% improvement over the best result among static models. Even when the entropy coefficient is adjusted to a more conservative range, SAC can still maintain a partial advantage in return over static models: at

α = 0.05

, AR remains at 1.4841 and TR at approximately 770,000, both higher than all six static baselines. This advantage stems from the forward-looking decision capability of the dynamic strategy—by learning the gradient of the state-value function

V (s)

, SAC’s policy network can anticipate the intertemporal consequences of different credit-granting actions. For example, when macroeconomic indicators deteriorate or borrower credit quality declines, the model can proactively and progressively tighten exposure; during expansionary cycles, it moderately expands credit. This “smooth adjustment” mechanism is unavailable to static models, which score each loan once from its estimated default probability and do not account for the intertemporal consequences of the decision.

Risk–return adjustability represents a flexibility beyond the reach of the static paradigm. A finding of even greater theoretical and practical value is that, by sliding a single hyperparameter

α

, the SAC framework can continuously traverse multiple Pareto-optimal risk–return combinations—an achievement impossible for all static models. As shown in Table 3, XGBoost leads the static models with a Sortino ratio of 1.6241, but at this fixed risk level its AR is only 1.3541, and its CVaR95% is −1.0316. In contrast, SAC can enter distinct operating modes simply by tuning

α

: at

α = 0.05

, AR still exceeds all static models, the Sortino ratio rises to 1.5868, and CVaR95% improves to −1.1217; at

α = 0.10

, although AR slightly drops to 1.3235 (near the static model average), the Sortino ratio climbs further to 1.6123, already approaching the static best of XGBoost, while CVaR95% substantially improves to −0.9744, significantly better than all static models; at

α = 0.20

, the Sortino ratio reaches a global peak of 1.6312, surpassing all static models, while CVaR95% improves to −0.8579, roughly 16.8% better than that of the best static model, XGBoost. The financial essence of this evolution is that when XGBoost achieves its highest Sortino ratio, it does so at the cost of bearing extreme tail risk of approximately −1.03, and its risk–return combination is fixed and non-adjustable. In contrast, SAC, merely by adjusting the hyperparameter

α

, can consistently reduce tail risk while maintaining returns above or on par with the static best. This flexible adjustment capability has clear operational implications; for instance, a financial institution with high capital adequacy may choose a moderately aggressive strategy with

α = 0.01

or

α = 0.05

to maximize profit, whereas another institution subject to strict regulatory constraints could select

α = 0.20

, accepting a limited return concession in exchange for a substantial reduction in tail risk. Table 4 and Figure 5 further corroborate the above conclusions from the perspective of return distributions; the SAC models represented by

α = 0.01

and

α = 0.05

exhibit an overall upward shift in profit distribution and display longer, thicker right tails (high-profit regimes) than all static models; meanwhile, the narrowing trend of the box heights with increasing

α

intuitively reflects the process of progressively controlled risk.
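The tail-risk comparison with XGBoost quoted above can be reproduced with a few lines of arithmetic. The helper name is illustrative; the CVaR figures are those reported in this section.

```python
def relative_cvar_improvement(cvar_new: float, cvar_ref: float) -> float:
    """Fractional reduction in tail-loss magnitude relative to a reference."""
    return (abs(cvar_ref) - abs(cvar_new)) / abs(cvar_ref)


XGB_CVAR = -1.0316  # best static model by Sortino ratio
SAC_CVAR = {0.05: -1.1217, 0.10: -0.9744, 0.20: -0.8579}

improvement = {
    alpha: relative_cvar_improvement(cvar, XGB_CVAR)
    for alpha, cvar in SAC_CVAR.items()
}
```

With these figures, α = 0.20 gives about 16.8% and α = 0.10 about 5.5%, while α = 0.05 comes out negative (about −8.7%): at that setting SAC's tail loss is still larger than XGBoost's, even though it improves on α = 0.01.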

5.4. Credit-Granting Behavior Towards High-Risk Groups

Post hoc interpretability of model behavior is crucial for adoption in financial scenarios with stringent risk control requirements. We perform a micro-level stress test of Theorem 2—examining whether entropy regularization causes the policy to exhibit stronger conservatism preferentially in states with “high decision uncertainty.”

To this end, we focus on the high-risk borrower group and examine the average credit ratio (HCR) they receive under the SAC framework (defined in Section 4.4). As shown in Figure 6, as

α

increases from 0.01 to 1.00, HCR decreases strictly monotonically, dropping from 0.9095 to 0.5211. This result intuitively demonstrates the practical effect of the core design logic at the micro-behavioral level: an increase in

α

amplifies the weight of the entropy term in the objective function, driving the strategy towards conservatism, which manifests as an automatic and substantial tightening of credit towards borrowers with higher uncertainty.

This behavioral characteristic is not the product of manually encoded rules but a strategy feature spontaneously formed by the agent in the process of maximizing its entropy-regularized long-term return. Precisely because it originates from rigorous theoretical derivation and numerical optimization, this behavioral function possesses inherent consistency, transparency, and auditability—regulators can clearly observe at which risk levels and to what extent the model tightens credit, and this degree of tightening is entirely and quantitatively controlled by a single parameter

α

. Compared with the “black-box” nature of decision-making in some static models, SAC’s entropy regulation mechanism offers a naturally interpretable solution for dynamic credit decision-making.

6. Discussion

The empirical results of this study systematically reveal the multi-dimensional advantages of the dynamic decision framework based on the entropy-regularized HJB equation over the traditional static paradigm. In what follows, we provide a deeper discussion from four aspects: the economic interpretation of the parameter, the informational mechanism of the strategy, the theoretical roots of behavioral interpretability, and the practical significance of the framework.

6.1. Economic Interpretation of the Temperature Parameter

One of the core findings of this study is that the entropy coefficient

α

exerts a monotonic, continuous, and predictable regulatory effect on the risk preference of the strategy. This empirical regularity is not an incidental algorithmic phenomenon, but rather a natural mapping onto the real credit environment of the mathematical fact revealed by Theorem 2, namely that

V_{α}^{*}

varies continuously with

α

and converges to

V_{0}^{*}

. At the theoretical level,

α

appears in the entropy regularization term of the objective function, and its magnitude determines the trade-off an agent makes between “maximizing expected return” and “maintaining policy stochasticity (i.e., refraining from overly certain decisions).” As

α

approaches zero, the effect of the entropy regularization term vanishes, the objective degenerates back to the standard risk-neutral MDP, and the strategy exhibits high-return, high-volatility aggressive behavior. As

α

gradually increases, the entropy term carries more weight, forcing the agent to maintain policy diversity while optimizing returns; the resulting behavioral consequence is an automatic avoidance of high-risk decisions. Thus,

α

essentially plays the role of a continuous risk aversion index: without altering the system dynamics or the reward function, merely by adjusting the relative weight between “determinism” and “stochasticity” in the optimization objective, it realizes a strategic spectrum shift from “aggressive exploitation” to “conservative robustness.” The “seesaw” effect in Figure 4 and the narrowing trend of box heights with increasing

α

in Figure 5 are both direct projections of this index regulation mechanism at the data level. This ability to directly map an algorithmic hyperparameter onto a financial risk concept is a fundamental feature that distinguishes our framework from static classification models and early RL credit scoring works.
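The effect of α on policy stochasticity can be illustrated with the standard maximum-entropy result that, for a finite action set, the soft-optimal policy is a Boltzmann distribution over soft Q-values, π(a|s) ∝ exp(Q(s, a)/α). The sketch below uses three hypothetical Q-values; it shows only the stochasticity side of the mechanism, not the credit results.

```python
import math


def boltzmann_policy(q_values, alpha):
    """Soft-optimal policy pi(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical soft Q-values for three granting ratios (0, 0.5, 1).
q = [0.0, 0.6, 1.0]
for alpha in (0.01, 0.20, 1.00):
    pi = boltzmann_policy(q, alpha)
    print(alpha, [round(p, 3) for p in pi], round(entropy(pi), 3))
```

At small α nearly all probability sits on the highest-valued action, and as α grows the distribution spreads towards uniform, which is the sense in which α trades expected return against policy stochasticity.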

6.2. The Advantage Mechanism of the Dynamic Strategy

The global advantage of the SAC model in profitability—especially when

α \leq 0.05

, where both central returns and distributional right tails significantly surpass all static baselines (Table 4, Figure 5)—deserves deeper understanding from its underlying learning mechanism. Static models essentially learn a mapping

P (d e f a u l t | x)

from the feature space to default probability on the training set, which is merely a statistical summary of historical data and contains no causal inference about decision consequences. In contrast, SAC learns in the same environment the state-value function

V (s)

and its gradient with respect to features,

\nabla_{x} V (s)

. This gradient contains rich forward-looking information; it not only reflects the goodness of the current state but also implicitly encodes “how the expected future return from the current state onward will change as each risk factor changes.” When borrower characteristics or macroeconomic indicators undergo marginal changes, the SAC policy network can perceive the long-term impact of these changes on future returns through

\nabla_{x} V (s)

and adjust the credit granting ratio proactively and smoothly in response. This enables the dynamic strategy to achieve “soft-landing” risk management: for instance, when unemployment rises moderately, SAC does not immediately reject subsequent loans but gradually reduces the credit granting ratio, striking a dynamic balance between maintaining customer relationships and controlling risk exposure. This learned, value-driven forward-looking adjustment is not available to static models, whose single-period scoring logic carries no information about future returns.

6.3. The Theoretical Roots of Interpretability

In the credit scoring literature, model interpretability usually relies on post hoc attribution tools such as SHAP to approximate the contribution of each feature to a single prediction. However, as our analysis in Section 5.4 reveals, the dynamic framework presented here offers a different and more fundamental path to interpretability. The core phenomenon illustrated in Figure 6 is that the average credit ratio HCR received by the high-risk group decreases strictly monotonically with increasing

α

. This relationship is not the product of post hoc attribution but stems directly from the construction of the objective function; the larger

α

is, the more the policy is forced to possess high entropy. Without being able to reduce stochasticity uniformly across all states, the agent must preferentially lower concentration in those states where “decision uncertainty is high”—which precisely correspond to high-risk borrowers whose future returns are difficult to predict. Therefore, the variation of HCR with

α

is an inevitable property of the optimal control solution with entropy regularization, a “built-in” global behavioral function that can be theoretically predicted. Compared with post hoc attribution methods such as SHAP, this interpretability has two fundamental advantages: first, it is global rather than local, describing the systematic behavioral shift of the entire strategy as

α

changes, rather than an approximate attribution on a single instance; second, it is forward-looking rather than retrospective because it originates directly from the entropy term in the optimization objective, which rewards policy stochasticity, not from a post hoc statistical decomposition of decisions already made. This holds significant compliance value for financial institutions that need to demonstrate the consistency and auditability of their model behavior to regulators.

6.4. Practical Implications

Taken together, the above analysis points methodologically to a conclusion with far-reaching practical significance; the SAC + ER-HJB framework constructed in this paper essentially provides a unified “one model, multiple strategies” solution for credit decision-making. Under the traditional architecture, if a financial institution wishes to adjust its risk strategy under different macroeconomic environments or regulatory cycles, it typically needs to retrain or recalibrate multiple models (e.g., multiple scorecards targeting different customer segments, or multiple XGBoost instances with different risk weights), which not only increases the complexity of model governance but may also lead to inconsistencies and hidden risks during strategy switching. By contrast, in our framework, different risk preferences can be realized merely by changing a single global parameter

α

. The selection of

α

can be conducted at the risk assessment committee level and directly linked to the capital adequacy ratio, stress test results, or regulatory guidance, without touching the underlying model architecture or retraining pipeline. This property of “separation of control and optimization” significantly reduces the operational cost of model governance and provides financial institutions with a transparent, auditable, and mathematically self-consistent operational interface for dynamic risk pricing and forward-looking capital allocation.
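One way a committee-level rule for choosing α could look is sketched below: among already evaluated operating points, pick the one with the highest average reward whose CVaR 95% stays at or above a tolerated floor. The selection rule and the floor values are assumptions for illustration; the operating points are the (α, AR, CVaR95%) figures quoted in Sections 5.2 and 5.3, and only settings for which both numbers are quoted there are included.

```python
# (alpha, AR, CVaR95) operating points quoted in Sections 5.2 and 5.3.
OPERATING_POINTS = [
    (0.01, 1.6773, -1.3383),
    (0.05, 1.4841, -1.1217),
    (0.10, 1.3235, -0.9744),
    (1.00, 0.9606, -0.7217),
]


def select_alpha(points, cvar_floor):
    """Pick the highest-AR operating point whose CVaR95 is at or above the floor."""
    feasible = [p for p in points if p[2] >= cvar_floor]
    if not feasible:
        raise ValueError("no operating point satisfies the CVaR floor")
    return max(feasible, key=lambda p: p[1])


# Hypothetical risk tolerances for two institutions.
aggressive = select_alpha(OPERATING_POINTS, cvar_floor=-1.5)
constrained = select_alpha(OPERATING_POINTS, cvar_floor=-1.0)
```

A looser floor selects α = 0.01, while a floor of −1.0 selects α = 0.10, so the risk constraint alone determines the operating point.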

6.5. Theoretical Limitations and Practical Scope

The theoretical analysis in Section 3 relies on several idealizing assumptions: a finite state space, exact policy evaluation and improvement, and no function approximation. These assumptions are standard in establishing the convergence of exact soft policy iteration to the ER-HJB fixed point. However, in our practical implementation, we use neural-network function approximators, a replay buffer, and stochastic gradient updates. For such settings, a global convergence guarantee is not currently available in the reinforcement learning literature. Our empirical results therefore complement the theory: they demonstrate that the SAC algorithm, despite the lack of a full convergence proof in the nonlinear function approximation regime, performs well on real-world credit data and provides a useful adjustable risk-preference mechanism. This combination of theoretical grounding and empirical validation is the main contribution of the present work. Consequently, the theoretical guarantees proven in this paper apply strictly to the exact tabular setting, and the practical deep SAC should be understood as a heuristic but empirically effective numerical method for the credit decision problem.

7. Conclusions

By systematically introducing entropy regularization into the Hamilton–Jacobi–Bellman (HJB) equation framework and establishing a theoretical connection between the exact soft policy iteration of SAC (under tabular assumptions) and the solution of the ER-HJB equation, this paper provides a mathematically grounded and empirically validated methodology for dynamic credit decision-making. At the theoretical level, we proved the existence and uniqueness of a solution to the ER-HJB equation, characterized its limiting behavior of converging to the risk-neutral solution as

α \to 0

, and derived a closed-form analytical solution under linear-quadratic conditions as a numerical benchmark. More crucially, we elucidated that under exact tabular assumptions the soft policy iteration underlying SAC converges to the solution of the ER-HJB equation; the practical deep implementation can be viewed as an asynchronous stochastic approximation of this iteration, serving as a numerically effective heuristic. This does not constitute a general convergence proof for deep SAC, but the empirical results demonstrate its practical usefulness. Together, the theoretical connection under exact assumptions and the empirical validation of the deep heuristic provide a solid foundation for dynamic credit decision-making. At the empirical level, based on real loan panel data from LendingClub for 2016–2018, we verified for the first time the continuous and bidirectional regulation of credit strategy risk preference via a single, interpretable temperature parameter

α

. The dynamic SAC model not only comprehensively surpasses strong static baselines such as logistic regression, XGBoost, and LightGBM in terms of average and total reward, but also, under an appropriately chosen

α

, enables the Sortino-ratio-measured downside risk-adjusted return to reach or exceed the best static level while continuously and substantially improving tail risk, achieving Pareto improvements unattainable by the static paradigm.

The academic value of this study lies in recasting credit scoring from a static paradigm long focused on single-period classification accuracy into a dynamic optimal control paradigm oriented towards long-term profit maximization with continuously adjustable risk preference. Risk is no longer an exogenous constraint requiring post hoc calibration, but an endogenous variable that can be continuously weighted within the optimization objective via a unified parameter. Looking ahead, this framework can be extended in several directions. First, under a multi-agent reinforcement learning architecture, the framework can be employed to model the dynamic competition and cooperation among multiple financial institutions in the credit market, providing a microfoundation for analyzing the tension between individual rationality and systemic risk. Second, coherent risk measures such as Conditional Value-at-Risk can be directly embedded as constraints or within the objective function of the ER-HJB equation to achieve more precise control over tail risk, thereby better meeting the stringent requirements on capital adequacy and stress testing under regulatory frameworks such as Basel III. Third, the strategy for selecting

α

itself could be formalized as a sequential decision problem, namely how to optimally adjust risk preference in a dynamically changing macro environment, thus incorporating strategy optimization and risk regulation into a unified meta-learning framework. Work along these directions could extend the framework's theoretical reach and practical value in financial risk management and regulatory technology.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing—original draft, visualization, L.J.; supervision, project administration, funding acquisition, writing—review and editing, R.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant No. 72401144.

Data Availability Statement

The data presented in this study are openly available in LendingClub at https://www.lendingclub.com/statistics (accessed on 15 March 2026), reference number [24], and Federal Reserve Bank of St. Louis at https://fred.stlouisfed.org/ (accessed on 15 March 2026), reference number [25,26,27,28,29].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Dastile, X.; Celik, T.; Potsane, M. Statistical and machine learning models in credit scoring: A systematic literature survey. Appl. Soft Comput. 2020, 91, 106263.
2. Lessmann, S.; Baesens, B.; Seow, H.V.; Thomas, L.C. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur. J. Oper. Res. 2015, 247, 124–136.
3. Chen, T.Q.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
4. Ke, G.L.; Meng, Q.; Finley, T.; Wang, T.F.; Chen, W.; Ma, W.D.; Ye, Q.W.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
5. Li, Y.H.; Chen, W.D. A Comparative Performance Assessment of Ensemble Learning for Credit Scoring. Mathematics 2020, 8, 1756.
6. Sousa, M.R.; Gama, J.; Brandao, E. A new dynamic modeling framework for credit risk assessment. Expert Syst. Appl. 2016, 45, 341–351.
7. Gunnarsson, B.R.; Broucke, S.V.; Baesens, B.; Oskarsdóttir, M.; Lemahieu, W. Deep learning for credit scoring: Do or don’t? Eur. J. Oper. Res. 2021, 295, 292–305.
8. Chen, Y.J.; Calabrese, R.; Martin-Barragan, B. Interpretable machine learning for imbalanced credit scoring datasets. Eur. J. Oper. Res. 2024, 312, 357–372.
9. Moos, J.; Hansel, K.; Abdulsamad, H.; Stark, S.; Clever, D.; Peters, J. Robust Reinforcement Learning: A Review of Foundations and Recent Advances. Mach. Learn. Knowl. Extr. 2022, 4, 276–315.
10. Paul, S.; Gupta, A.; Kar, A.K.; Singh, V. An Automatic Deep Reinforcement Learning based Credit Scoring Model using Deep-Q Network for Classification of Customer Credit Requests. In Proceedings of the 29th Annual IEEE International Symposium on Technology and Society, Swansea, UK, 13–15 September 2023.
11. Wang, Y.D.; Jia, Y.L.; Fan, S.; Xiao, J. Deep reinforcement learning based on balanced stratified prioritized experience replay for customer credit scoring in peer-to-peer lending. Artif. Intell. Rev. 2024, 57, 93.
12. Barbierato, E.; Gatti, A. The Challenges of Machine Learning: A Critical Review. Electronics 2024, 13, 416.
13. Zhang, H.; Chen, H.G.; Xiao, C.W.; Li, B.; Liu, M.Y.; Boning, D.; Hsieh, C.J. Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual Event, 6–12 December 2020.
14. Crandall, M.G.; Lions, P.L. Viscosity solutions of Hamilton-Jacobi equations. Trans. Am. Math. Soc. 1983, 277, 1–42.
15. Bardi, M.; Capuzzo-Dolcetta, I. Continuous viscosity solutions of Hamilton-Jacobi equations. In Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations; Birkhäuser Boston: Boston, MA, USA, 1997; pp. 25–96.
16. Munos, R. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Mach. Learn. 2000, 40, 265–299.
17. Tang, W.P.; Zhang, Y.P.; Zhou, X.Y. Exploratory HJB Equations and Their Convergence. SIAM J. Control Optim. 2022, 60, 3191–3216.
18. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
19. Xu, Y.; Kou, G.; Peng, Y.; Ding, K.X.; Ergu, D.; Alotaibi, F.S. Profit- and risk-driven credit scoring under parameter uncertainty: A multiobjective approach. Omega-Int. J. Manag. Sci. 2024, 125, 103004.
20. Ziebart, B.D.; Maas, A.; Bagnell, J.A.; Dey, A.K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008; Volume 3.
21. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905.
22. Kim, J.; Yang, I. Hamilton-Jacobi-Bellman Equations for Maximum Entropy Optimal Control. arXiv 2020, arXiv:2009.13097.
23. Stoorvogel, A.A.; Saberi, A. The discrete algebraic Riccati equation and linear matrix inequality. Linear Algebra Appl. 1998, 274, 317–365.
24. All Lending Club Loan Data. Available online: https://www.kaggle.com/datasets/wordsforthewise/lending-club (accessed on 15 March 2026).
Unemployment Rate (UNRATE). Available online: https://fred.stlouisfed.org/series/UNRATE (accessed on 15 March 2026).
Federal Funds Effective Rate (FEDFUNDS). Available online: https://fred.stlouisfed.org/series/FEDFUNDS (accessed on 15 March 2026).
Consumer Price Index for All Urban Consumers: All Items in U.S. City Average (CPIAUCSL). Available online: https://fred.stlouisfed.org/series/CPIAUCSL (accessed on 15 March 2026).
Industrial Production: Total Index (INDPRO). Available online: https://fred.stlouisfed.org/series/INDPRO (accessed on 15 March 2026).
30-Year Fixed Rate Mortgage Average in the United States (MORTGAGE30US). Available online: https://fred.stlouisfed.org/series/MORTGAGE30US (accessed on 15 March 2026).
Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8.
Columba, F.; Cugliari, M.; Di Virgilio, S. Credit risk assessment with stacked machine learning. In Corporate Credit Analysis and AI: Advancing the Rating System at a Central Bank; Springer Nature Switzerland: Cham, Switzerland, 2026; pp. 263–292.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Sortino, F.A.; van der Meer, R. Downside risk. J. Portf. Manag. 1991, 17, 27–31.
Rockafellar, R.T.; Uryasev, S. Optimization of conditional value-at-risk. J. Risk 2000, 3, 21–41.

Figure 1. Overall research framework for dynamic credit decision-making based on ER-HJB/SAC (upper block: Credit environment; lower-left block: SAC agent; lower-right block: Evaluation and comparison with static baselines. Arrows denote data flow and feedback).

Figure 2. Scatter plot for LQR theoretical verification (the SAC policy outputs almost perfectly match the analytical solution (MSE = 0.0267); the red dashed line indicates the ideal line of perfect agreement, numerically verifying Theorem 3—that SAC is an effective numerical heuristic for the ER-HJB equation).

Figure 3. $α$-Sortino ratio curve (the Sortino ratio first rises rapidly and then flattens, reaching a global peak near $α \approx 0.20$).

Figure 4. Dual-axis plot of $α$-return and CVaR 95% (return and tail risk exhibit a “seesaw” relationship; the interval $α \in [0.05, 0.20]$ forms a “trade-off corridor” where decision-makers can clearly weigh return concessions against tail-risk improvements).

Figure 5. Box plot of return distributions (the profit distributions of SAC models ($α \leq 0.05$) are shifted upward overall with thicker right tails; box heights narrow as $α$ increases, visually demonstrating the progressive risk-suppression effect of entropy regularization, consistent with the theoretical prediction of Theorem 2).

Figure 6. Average credit granting ratio for high-risk groups (HCR) as a function of $α$ (HCR decreases strictly monotonically, verifying the micro-level prediction of Theorem 2: larger $α$ causes the policy to preferentially reduce concentration in states with high uncertainty, i.e., it automatically tightens credit to high-risk borrowers).

Table 1. Methodological comparison of representative works and this paper.

Work	Method	Dynamic?	Risk-Tunable?	Theoretical Optimality	Main Limitation
[2]	Logistic regression	No	No	No	Static classification, not aligned with long-term profit
[3,4]	Gradient boosting	No	No	No	Single-period, no intertemporal optimization
[6]	Sliding window	Yes	No	No	Heuristic forgetting, no optimal control
[7]	Deep neural networks	No	No	No	High computational cost, not profit-aligned
[11]	DQN + BSPER	Yes	No	No	Handcrafted reward, not risk-theory driven
[21]	Maximum-entropy RL	Yes	Yes	Yes (RL)	No theoretical equivalence to HJB
This paper	ER-HJB + SAC	Yes	Yes	Yes (HJB)	—

Table 2. Core performance of SAC models with different entropy coefficients on the test set.

$α$	AR	TR	Std	Sortino	CVaR 95%	HCR
0.01	1.6773	870,036.4111	2.0434	1.5679	−1.3383	0.9095
0.05	1.4841	769,829.1321	1.9077	1.5868	−1.1217	0.7465
0.10	1.3235	686,503.9132	1.7329	1.6123	−0.9744	0.6608
0.15	1.2255	635,658.0914	1.6075	1.6199	−0.8988	0.6176
0.20	1.1731	608,513.7459	1.5302	1.6312	−0.8579	0.5971
0.50	1.0178	527,947.7472	1.2608	1.6308	−0.7577	0.5396
1.00	0.9606	498,271.8749	1.1530	1.6302	−0.7217	0.5211
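
As a quick consistency check on Table 2, the following Python sketch transcribes the table's columns and tests the patterns described in the abstract and figure captions: the Sortino ratio peaks at $α = 0.20$, CVaR 95% improves (becomes less negative) as $α$ grows, and both AR and HCR fall strictly. The variable and helper names are illustrative and are not taken from the authors' code.

```python
# Values transcribed from Table 2 (test-set results for each entropy coefficient).
alphas = [0.01, 0.05, 0.10, 0.15, 0.20, 0.50, 1.00]
ar = [1.6773, 1.4841, 1.3235, 1.2255, 1.1731, 1.0178, 0.9606]
sortino = [1.5679, 1.5868, 1.6123, 1.6199, 1.6312, 1.6308, 1.6302]
cvar95 = [-1.3383, -1.1217, -0.9744, -0.8988, -0.8579, -0.7577, -0.7217]
hcr = [0.9095, 0.7465, 0.6608, 0.6176, 0.5971, 0.5396, 0.5211]


def strictly_increasing(xs):
    """True if every element is larger than its predecessor."""
    return all(a < b for a, b in zip(xs, xs[1:]))


def strictly_decreasing(xs):
    """True if every element is smaller than its predecessor."""
    return all(a > b for a, b in zip(xs, xs[1:]))


# Entropy coefficient with the highest Sortino ratio in the table.
peak_alpha = alphas[max(range(len(alphas)), key=sortino.__getitem__)]

# CVaR 95% is reported as a negative return, so "improves" means "increases".
cvar_improves = strictly_increasing(cvar95)
ar_falls = strictly_decreasing(ar)
hcr_falls = strictly_decreasing(hcr)

print("Sortino peak at alpha =", peak_alpha)
print("CVaR 95% improves monotonically:", cvar_improves)
print("AR decreases monotonically:", ar_falls)
print("HCR decreases monotonically:", hcr_falls)
```

The check uses only the seven tabulated coefficients, so it says nothing about the behavior of the curves between those grid points.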

Table 3. Core performance of static baseline models on the test set.

Model	AR	TR	Std	Sortino	CVaR 95%
LR	1.3651	708,066.8276	1.6022	1.6109	−1.0454
RF	1.3521	701,327.8073	1.5931	1.5963	−1.0435
XGB	1.3541	702,378.4029	1.5851	1.6241	−1.0316
LGBM	1.3533	701,940.5208	1.5844	1.6230	−1.0317
MLP	1.3661	708,588.9227	1.6009	1.6205	−1.0397
Stacked_Meta-Model	1.3236	686,540.0049	1.6813	1.3463	−1.1189
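
To make the comparison between Tables 2 and 3 concrete, the sketch below counts, for each entropy coefficient, how many of the six static baselines the SAC model beats on average reward (AR) and on CVaR 95%. It uses only the values printed in the two tables; the dictionary layout and the name `wins` are illustrative assumptions.

```python
# (AR, CVaR 95%) pairs transcribed from Table 2, keyed by entropy coefficient.
sac = {
    0.01: (1.6773, -1.3383),
    0.05: (1.4841, -1.1217),
    0.10: (1.3235, -0.9744),
    0.15: (1.2255, -0.8988),
    0.20: (1.1731, -0.8579),
    0.50: (1.0178, -0.7577),
    1.00: (0.9606, -0.7217),
}

# (AR, CVaR 95%) pairs transcribed from Table 3, keyed by baseline name.
baselines = {
    "LR": (1.3651, -1.0454),
    "RF": (1.3521, -1.0435),
    "XGB": (1.3541, -1.0316),
    "LGBM": (1.3533, -1.0317),
    "MLP": (1.3661, -1.0397),
    "Stacked_Meta-Model": (1.3236, -1.1189),
}

# For each alpha: (number of baselines beaten on AR, number beaten on CVaR 95%).
# A higher (less negative) CVaR 95% counts as better tail risk.
wins = {}
for alpha, (sac_ar, sac_cvar) in sac.items():
    ar_wins = sum(sac_ar > b_ar for b_ar, _ in baselines.values())
    cvar_wins = sum(sac_cvar > b_cvar for _, b_cvar in baselines.values())
    wins[alpha] = (ar_wins, cvar_wins)

for alpha, (ar_wins, cvar_wins) in wins.items():
    print(f"alpha={alpha:.2f}: beats {ar_wins}/6 on AR, {cvar_wins}/6 on CVaR 95%")
```

On these tabulated values, the two lowest coefficients lead every baseline on AR, while coefficients of 0.10 and above lead every baseline on CVaR 95%, which is the trade-off that the entropy coefficient is meant to expose.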

Table 4. Return distribution of each model.

Model	Mean	Min	Q1 (25%)	Median (50%)	Q3 (75%)	Max
$α = 0.01$	1.6773	−31.6424	0.5505	1.2061	2.4063	12.2055
$α = 0.05$	1.4841	−28.1988	0.4173	0.9668	2.1135	11.9399
$α = 0.10$	1.3235	−25.2654	0.3822	0.8349	1.8201	11.5482
$α = 0.15$	1.2255	−23.4186	0.3655	0.7802	1.6622	11.1560
$α = 0.20$	1.1731	−22.2460	0.3604	0.7557	1.5821	10.8789
$α = 0.50$	1.0178	−18.1451	0.3467	0.7040	1.3959	8.7703
$α = 1.00$	0.9606	−17.0367	0.3421	0.6884	1.3344	7.5645
LR	1.3651	−30.2107	0.4958	1.0102	1.9372	10.2371
RF	1.3521	−32.0278	0.4906	1.0004	1.9187	10.3183
XGB	1.3541	−31.7479	0.4921	1.0017	1.9216	10.2592
LGBM	1.3533	−31.5754	0.4918	1.0011	1.9204	10.3152
MLP	1.3661	−31.7889	0.4957	1.0102	1.9381	10.5684
Stacked_Meta-Model	1.3236	−33.5188	0.4656	0.9683	1.8790	12.0437
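
The Sortino and CVaR 95% columns reported in Tables 2 and 3 summarize the same per-sample return distributions that Table 4 describes. A minimal sketch of how such metrics can be computed from a list of returns is given below. It assumes a target return of zero, a downside deviation taken over all observations, and CVaR defined as the mean of the worst $(1 - \text{level})$ fraction of returns; these conventions and the function names are assumptions for illustration, and the definitions in Section 4.4 may differ.

```python
import math


def sortino_ratio(returns, target=0.0):
    """Mean excess return over `target`, divided by the downside deviation."""
    n = len(returns)
    mean_excess = sum(r - target for r in returns) / n
    # Only shortfalls below the target contribute to the downside deviation.
    downside_sq = sum(min(r - target, 0.0) ** 2 for r in returns) / n
    if downside_sq == 0.0:
        return math.inf
    return mean_excess / math.sqrt(downside_sq)


def cvar(returns, level=0.95):
    """Mean of the worst (1 - level) fraction of returns (lower tail)."""
    ordered = sorted(returns)
    # Rounding guards against floating-point noise such as 20 * 0.05 > 1.
    k = max(1, math.ceil(round(len(ordered) * (1.0 - level), 9)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


# Toy data, not drawn from the paper's data set.
sample = [2.0, 1.0, 0.5, -1.0, 3.0, 0.8, 1.2, -4.0, 0.3, 1.5]
print("Sortino:", round(sortino_ratio(sample), 4))
print("CVaR 80%:", cvar(sample, level=0.8))
```

With this sign convention a less negative CVaR indicates a milder tail loss, which matches the direction in which the tabulated CVaR 95% values move as $α$ increases.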


Share and Cite

MDPI and ACS Style

Jin, L.; Zhang, R. Dynamic Credit Decision-Making with Continuous Risk Preference: A Unified Framework of Entropy-Regularized HJB and Soft Actor-Critic. Mathematics 2026, 14, 1980. https://doi.org/10.3390/math14111980

