Next Article in Journal
Tax Tightrope: The Perils of Foreign Ownership, Executive Incentives and Transfer Pricing in Indonesian Banking
Previous Article in Journal
Exploring the Affiliation of Corporate Social Responsibility, Innovation Performance, and CEO Gender Diversity: Evidence from the U.S.
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Exploratory Dividend Optimization with Entropy Regularization

School of Data Science, The Chinese University of Hong Kong, Shenzhen 518172, China
*
Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2024, 17(1), 25; https://doi.org/10.3390/jrfm17010025
Submission received: 29 November 2023 / Revised: 1 January 2024 / Accepted: 4 January 2024 / Published: 10 January 2024
(This article belongs to the Section Mathematics and Finance)

Abstract

:
This study investigates the dividend optimization problem in the entropy regularization framework in the continuous-time reinforcement learning setting. The exploratory HJB is established, and the optimal exploratory dividend policy is a truncated exponential distribution. We show that, for suitable choices of the maximal dividend-paying rate and the temperature parameter, the value function of the exploratory dividend optimization problem can be significantly different from the value function in the classical dividend optimization problem. In particular, the value function of the exploratory dividend optimization problem can be classified into three cases based on its monotonicity. Additionally, numerical examples are presented to show the effect of the temperature parameter on the solution. Our results suggest that insurance companies can adopt new exploratory dividend payout strategies in unknown market environments.

1. Introduction

The risk-management problem for insurance companies has been extensively investigated in the literature. This dates back to the Cramér–Lundberg (C-L) model of Lundberg (1903), which describes the surplus process of the insurance company in terms of two cash flows: premiums received and claims paid. Consider an insurance company with claims arriving at Poisson rate ν ; that is, the total number of claims N t up to time t is Poisson-distributed with parameter ν t . Denote using ξ i the size of the i-th claim, where { ξ i } s are independently and identically distributed with E [ ξ i ] = μ 1 and E [ ξ i 2 ] = μ 2 for some constants μ 1 , μ 2 > 0 . Let X ˜ t denote the surplus process of the insurance company. Then,
X ˜ t = x 0 + ζ t i = 1 N t ξ i ,
where x 0 is the initial surplus level, and ζ is the premium rate, which is the amount of premium received by the insurance company per unit of time.
De Finetti (1957) first proposed the dividend optimization problem: An insurance company maximizes the expectation of cumulative discounted dividends until the ruin time by choosing dividend strategies, that is, when and how much of the surplus should be distributed as dividends to the shareholders. De Finetti (1957) derived that the optimal dividend policy under a simple discrete random walk model should be a barrier strategy. Gerber (1969) then generalized the dividend problem from a discrete-time model to the classical C-L model and showed that the optimal dividend strategy should be a band strategy, which degenerates to a barrier strategy for an exponentially distributed claim size.
With the development of technical tools such as dynamic programming, the dividend optimization problem has been analyzed under the stochastic control framework. In particular, X ˜ t in the C-L model can be approximated by a diffusion process X t that evolves according to
d X t = μ d t + σ d W t ,
where μ : = ζ ν μ 1 , σ : = ν μ 2 , and { W t } is a standard Brownian motion; see, e.g., Schmidli (2007). Notably, the diffusion approximation for the surplus process works well for large insurance portfolios, where an individual claim is relatively small compared with the size of the surplus. Under the drifted Brownian motion model, the optimal dividend strategy is a barrier strategy, and if the dividend rate is further upper bounded, the optimal dividend strategy is of the threshold type; see, e.g., Jeanblanc-Picqué and Shiryaev (1995) and Asmussen and Taksar (1997). Other extensions of the dividend optimization problem include Jgaard and Taksar (1999); Asmussen et al. (2000); Azcue and Muler (2005, 2010); Gaier et al. (2003); Kulenko and Schmidli (2008); Yang and Zhang (2005); Choulli et al. (2003); Gerber and Shiu (2006); Avram et al. (2007) and Yin and Wen (2013), etc.
Previous studies have investigated the dividend optimization problem based on the complete information of the environment, that is, all the model parameter values are known. If the environment is a black box or the model parameter values are unknown, this assumption is no longer valid. One way to handle this issue is to use past information to estimate model parameters and to then use the estimated parameters to solve the problem. However, the optimal strategy in the classical dividend optimization problem is a barrier type or threshold type, which is extremely sensitive to model parameter values; a slight change in model parameters will lead to a totally different strategy.1
In contrast to the traditional approach that separates estimation and optimization, reinforcement learning (RL) aims to learn the optimal strategy through trial-and-error interactions with the unknown environment without estimating the model parameters. In particular, one takes different actions in the unknown territory and receives feedback to learn the optimal action and use it to further interact with the environment. In recent years, RL has been successfully applied in many fields, including healthcare, autonomous control, natural language processing, and video games; see, for example, Zhao et al. (2009); Komorowski et al. (2018); Mirowski et al. (2016); Zhu et al. (2017); Radford et al. (2017); Paulus et al. (2017); Mnih et al. (2015); Jaderberg et al. (2019); Silver et al. (2016) and Silver et al. (2017). RL has thus become one of the most popular and fastest-growing fields today.
Exploration and exploitation are key concepts in RL, and they proceed simultaneously. On the one hand, exploitation involves utilizing the so-far-known information to derive the current optimal strategy, which might not be optimal in the long-term view. On the other hand, exploration emphasizes learning from trial-and-error interactions with the environment to improve its knowledge for the sake of long-term benefit. While the optimal strategy of the classical dividend optimization problem is deterministic when the model parameter values are fully known, randomized strategies are considered to encourage the exploration of other actions in an unknown environment. Although exploration generates a short-term cost, it helps to learn the optimal (or near-optimal) strategy and bring long-term benefits.
Evidently, balancing the trade-off between exploitation and exploration is an important issue. The ε -greedy strategy is a frequently used randomized strategy in RL that balances exploration and exploitation by showing that the agent should stick with the current optimal policy most of the time, while the agent could sometimes randomly take other nonoptimal actions to explore the environment; see, for example, Auer et al. (2002). Boltzmann exploration is another randomized strategy that has received extensive attention in the RL literature. Instead of assigning constant probabilities to different actions based on current information, Boltzmann exploration uses the Boltzmann distribution to allocate probability to different actions, where the probability of each action is positively related to its reward. In other words, the agent should choose the action with higher expected rewards with higher probability; see, for example, Cesa-Bianchi et al. (2017).
Another way to introduce a randomized strategy is to intentionally include a regularization term to encourage exploration. Entropy is a frequently used criterion in the RL family that measures the level of exploration. The entropy regularization framework directly incorporates entropy as a regularization term into the original objective function to encourage exploration; see, e.g., Todorov (2006); Ziebart et al. (2008); and Nachum et al. (2017). In the entropy regularization framework, the weight of exploration is determined by the coefficient imposed on the entropy, which is called the temperature parameter. The larger the temperature parameter, the greater the weight of exploration. A temperature parameter that is too large may result in too much focus on exploring the environment and little effort in exploiting the current information. Conversely, if the temperature parameter is too small, one might stick with the current optimal strategy without the opportunity to explore better solutions. Therefore, the careful selection of the temperature parameter is important in RL algorithm design.
Although most RL literature focuses on the Markov decision process, recently, Wang et al. (2020) extended the entropy regularization framework to the continuous-time setting. They showed that the optimal distributional control is Gaussian distribution in the linear–quadratic stochastic control problem. Wang and Zhou (2020) studied a continuous-time mean-variance portfolio selection problem under the entropy-regularized RL framework and showed that the precommitted strategies are Gaussian distributions with time-decaying variance. Dai et al. (2023) considered the equilibrium mean-variance problem with a log return target and showed that the optimal control is a Gaussian distribution with the variance term not necessarily decaying in time.
This study investigates the dividend optimization problem in the entropy regularization framework to encourage exploration in the unknown environment. We follow the same setting as in Wang et al. (2020), which used Shannon’s differential entropy. The key idea is to use distribution as the control to solve the entropy-regularized dividend optimization problem. Consequently, the optimal dividend policy is a randomization over the possible dividend-paying rates. We derive a so-called exploratory HJB and establish theoretical results to guarantee the existence of the solution. We determine that the optimal exploratory dividend policy is a truncated exponential distribution, the parameter of which depends on the surplus level and the temperature parameter. We show that, for suitable choices of the maximal dividend-paying rate and the temperature parameter, the value function of the exploratory dividend optimization problem could be significantly different from the value function in the traditional problem. In particular, we classify the value function of the exploratory dividend optimization problem into three cases based on its monotonicity.
Recently, Bai et al. (2023) also studied the optimal dividend problem under a continuous-time diffusion model. The authors then used a policy improvement argument along with policy evaluation devices to construct approximate sequences of the optimal strategy. The difference is that in their study, the feasible controls were open-loop, whereas we only consider feedback controls. We show that the value function is decreasing when the maximal dividend-paying rate is relatively small compared with the temperature parameter, whereas in their study, the maximal dividend-paying rate was assumed to be larger than one and thus the value function was always increasing.
The rest of this paper is organized as follows. In Section 2, we introduce the formulation of the entropy-regularized dividend optimization problem. In Section 3, we present the exploratory HJB and the theoretical results to solve the exploratory dividend problem. We then discuss three cases of the value function for the exploratory dividend problem in Section 4. Section 5 presents some numerical examples to show the effect of the parameters on the optimal dividend policy and the value function. Section 6 concludes this paper.

2. Problem

2.1. The Model

Suppose an insurance company has surplus X t at time t, with
d X t = μ d t + σ d W t , X 0 = x ,
where μ > 0 , σ > 0 , and { W t } t 0 comprise a standard Brownian motion defined on the filtered probability space ( Ω , F , { F t } t 0 , P ) . As noted by Asmussen and Taksar (1997), such a surplus process (1) can be viewed as either direct modeling with drifted Brownian motion or as an approximation of the classical compound Poisson model.
A dividend strategy or policy is defined as a = { a t } t 0 , where a t is the dividend-paying rate at time t; that is, the cumulative amount of dividends paid from time t 1 to time t 2 is given by t 1 t 2 a t d t . We consider herein the Markov feedback controls—that is, a t = a ( X t ) —where a ( · ) is a function of the surplus level X t . Note that a t is non-negative for any t. Further, we assume that a t is upper bounded by a positive constant M, which is consistent with the assumption made in the literature. We provide the formal definition of the admissible dividend policy below.
Definition 1. 
A dividend policy a is said to be admissible if { a t } t 0 is F t -adapted and a t [ 0 , M ] for all t 0 .
Denote by A the set of admissible dividend policies. For an insurance company, the surplus process of which evolves according to (1) and pays the dividend according to policy a = { a t } t 0 A , the controlled surplus process for this insurance company is
d X t a = ( μ a t ) d t + σ d W t , X 0 a = x .
Define the ruin time as the first time the surplus level hits zero; that is,
τ x a : = inf { t 0 : X t a 0 X 0 a = x } .
For an insurance company starting with an initial surplus x [ 0 , ) , the problem is to find the optimal dividend policy that maximizes the expected value of the exponentially discounted dividends to be accumulated until the ruin time, that is,
J c l ( x , a ) : = E 0 τ x a e ρ t a ( X t a ) d t ,
where ρ > 0 is the discounting rate. Then, the optimal dividend problem is
sup a A J c l ( x , a ) .

2.2. Classical Optimal Dividend Problem

First, we briefly review the results of solving the dividend optimization problem (3) classically. Let V c l ( x ) be the value function of the dividend optimization problem:
V c l ( x ) : = sup a A J c l ( x , a ) .
Assume the value function V c l ( x ) is twice-continuously differentiable. The standard dynamic programming approach leads to the following Hamilton–Jacobi–Bellman equation:
ρ V c l ( x ) = sup a [ 0 , M ] a + ( μ a ) V c l ( x ) + 1 2 σ 2 V c l ( x ) ,
with boundary condition V c l ( 0 ) = 0 . We can easily see that the optimal dividend-paying rate at surplus level x is
a * ( x ) = 0 , if V c l ( x ) > 1 , M , if V c l ( x ) 1 .
Assume V c l ( x ) is a concave function. Then, there exists a non-negative constant x b such that V c l ( x ) 1 when x x b and V c l ( x ) > 1 when 0 x < x b . Substitute (5) into (4); then, it turns into the following ODEs:
1 2 σ 2 V c l ( x ) + μ V c l ( x ) ρ V c l ( x ) = 0 , if 0 x < x b , 1 2 σ 2 V c l ( x ) + ( μ M ) V c l ( x ) ρ V c l ( x ) + M = 0 , if x x b .
Combined with the boundary condition, we can derive that
V c l ( x ) = C 1 e θ 1 x e θ 2 x , if 0 x < x b , M ρ C 2 e θ 3 x , if x x b ,
where
θ 1 = μ + μ 2 + 2 ρ σ 2 σ 2 , θ 2 = μ + μ 2 + 2 ρ σ 2 σ 2 , θ 3 = ( μ M ) + ( μ M ) 2 + 2 ρ σ 2 σ 2 .
C 1 , C 2 , and x b are determined by the smooth pasting conditions; that is,
C 1 e θ 1 x b e θ 2 x b = M ρ C 2 e θ 3 x b , C 1 θ 1 e θ 1 x b + θ 2 e θ 2 x b = 1 , C 2 θ 3 e θ 3 x b = 1 .
If M ρ 1 θ 3 > 0 , there exists a unique solution to (8). In this case, V c l ( x ) is given by (7), where C 1 , C 2 , and x b are determined uniquely through (8). Consequently, the optimal dividend policy is to pay the maximal rate M when the surplus level x exceeds the threshold x b and to pay nothing otherwise. If M ρ 1 θ 3 0 , then V c l ( x ) = M ρ ( 1 e θ 3 x ) . In this case, the optimal dividend policy is always to pay the maximal rate M. Detailed proofs can be found in Asmussen and Taksar (1997). In addition, we can see that the optimal value function V c l ( x ) is concave on x and always smaller than M / ρ , which is the limit of V c l ( x ) as x going to infinity. Figure 1 illustrates the value function and corresponding optimal dividend policy under the following parameter values: μ = 1 , σ = 1 , ρ = 0.3 , M = 0.6 (left panels), M = 1.2 (middle panels), and M = 1.8 (right panels), respectively.

2.3. Exploratory Formulation

The above optimal dividend policy (5) is implemented based on the complete information; that is, the model parameters μ and σ are known. However, in reality, it is difficult to know exactly the values of μ and σ , owing to uncertainty in the premium rate, claim arrival process, and claim size. Therefore, we use the RL technique to learn the optimal (or near-optimal) dividend-paying strategy through trial-and-error interactions with the unknown territory.
Whereas most work on RL considers Markov decision processes in discrete time, we follow Wang et al. (2020), who modeled RL in continuous time as a relaxed stochastic control problem. At time t with a surplus level X t , the dividend-paying rate a is randomly sampled according to a distribution π t : = π ( a ; X t ) , where π ( · ; · ) : [ 0 , M ] × [ 0 , ) [ 0 , ) , satisfying 0 M π ( a ; x ) d a = 1 for any x [ 0 , ) . We call π : = { π t } t 0 the distributional dividend policy. Following the same procedure as in Wang et al. (2020), we derive the exploratory dynamic of the surplus process under π to be
d X t π = μ 0 M a π ( a ; X t π ) d a d t + σ d W t , X 0 π = x ,
and the expected value of the total discounted dividends under exploration to be
E 0 τ x π e ρ t 0 M a π ( a ; X t π ) d a d t ,
where the ruin time is
τ x π : = inf { t 0 : X t π 0 X 0 π = x } .
In addition to the expected value of the total discounted dividends under exploration, Shannon’s differential entropy is introduced into the objective to encourage exploration. For a given distribution π , entropy is defined as
H ( π ) : = 0 M π ( a ) ln π ( a ) d a .
Thus, the objective of the entropy-regularized exploratory dividend problem is
J ( x , π ) : = E 0 τ x π e ρ t 0 M a π ( a ; X t π ) d a + λ H ( π t ) d t = E 0 τ x π e ρ t 0 M a λ ln π ( a ; X t π ) π ( a ; X t π ) d a d t ,
where λ > 0 is the so-called temperature parameter. Note that λ controls the weight to be put on the exploration and is exogenously given. If λ = 0 , the distribution degenerates to the Dirac measure, which is the solution to the classical optimal dividend problem without exploration. The entropy-regularized exploratory dividend problem is
sup π Π J ( x , π ) ,
where Π is the set of admissible exploratory dividend policies. We provide the formal definition of the admissible exploratory dividend policy π below.
Definition 2. 
An exploratory dividend policy π is admissible if the following conditions are satisfied:
(i) 
π ( · ; x ) Π [ 0 , M ] for any x [ 0 , ) , where Π [ 0 , M ] is a set of probability density functions with support [ 0 , M ] ;
(ii) 
The stochastic differential Equation (9) has a unique solution { X t π } t 0 under π;
(iii) 
E 0 τ x π e ρ t 0 M a λ ln π ( a ; X t π ) π ( a ; X t π ) d a d t < .
The following proposition will be used later.
Proposition 1. 
For any distribution π on support [ 0 , M ] , entropy H ( π ) ln M . The proof of propositions and theorems are presented in Appendix A.

3. Exploratory HJB Equation

To solve the exploratory optimal dividend problem (12), we first derive the corresponding HJB equation, or the so-called exploratory HJB; see Wang et al. (2020) and Tang et al. (2022), etc.
Let V ( x ) be the value function of the entropy-regularized exploratory dividend problem, that is,
V ( x ) : = sup π Π J ( x , π ) .
Assume the value function V ( x ) is twice-continuously differentiable. Following the standard arguments in dynamic programming, we derive the exploratory HJB equation below:
ρ V ( x ) = sup π Π [ 0 , M ] 0 M a ( 1 V ( x ) ) λ ln π ( a ; x ) π ( a ; x ) d a + μ V ( x ) + 1 2 σ 2 V ( x ) ,
with boundary condition
V ( 0 ) = 0 .

3.1. Exploratory Dividend Policy

To solve the supremum in (13) together with the constraint that 0 M π ( a ; x ) d a = 1 , we introduce the Lagrange multiplier η :
sup π Π [ 0 , M ] 0 M a ( 1 V ( x ) ) λ ln π ( a ; x ) η π ( a ; x ) d a + η .
Maximizing the integrand above pointwisely and using the first-order condition leads to the solution
π * ( a ; x ) = exp a 1 V ( x ) λ 1 η λ , a [ 0 , M ] .
Because 0 M π * ( a ; x ) d a = 1 , we solve that
π * ( a ; x ) = 1 Z M ( ( 1 V ( x ) ) / λ ) exp a 1 V ( x ) λ , a [ 0 , M ] ,
where
Z M ( y ) : = e M y 1 y , y 0 M , y = 0 .
Recall that the classical optimal dividend policy given in (5) is a two-threshold strategy, i.e., it pays nothing, a * ( x ) = 0 , if V c l ( x ) > 1 or pays the maximal rate, a * ( x ) = M , if V c l ( x ) 1 . In contrast, the exploratory dividend policy is not restricted to two extreme actions only but gives the probability to take certain actions. This result is very similar to Gao et al. (2022) in which the authors study the temperature control problem for Langevin diffusions by incorporating the randomization of the temperature control and regularizing its entropy. The classical optimal control of such a problem is of the bang-bang type, whereas the exploratory control is a state-dependent, truncated exponential distribution. Likewise, the optimal distribution π * ( a ; x ) given in (15) is also a continuous version of the Boltzmann distribution or Gibbs measure, which is widely used in discrete reinforcement learning.
When V ( x ) > 1 , π * ( a ; x ) is decreasing in a so it has a large probability to take a small dividend payout rate close to 0; when V ( x ) < 1 , π * ( a ; x ) is increasing in a so it has a large probability to take a large dividend payout rate close to M; and when V ( x ) = 1 , it degenerates to a uniform distribution on [ 0 , M ] . In other words, the optimal exploratory dividend policy is an “exploration” of the classical dividend payout policy: it searches around the current optimal dividend rate given by the classical solution, 0 or M, with the probability to take a certain rate decreasing as it moves away from the classical solution.
The exploratory surplus process under the optimal policy is well posed. Note that the optimal distributional policy is π * = { π t * } t 0 , where
π t * : = π * ( a ; X t π * ) = 1 Z M ( ( 1 V ( X t π * ) ) / λ ) exp a 1 V ( X t π * ) λ .
Applying the optimal distributional policy (17) into the exploratory surplus process (9), we obtain that
d X t π * = μ 0 M a π * ( a ; X t π * ) d a d t + σ d W t = μ M λ 1 V ( X t π * ) + M e M ( 1 V ( X t π * ) ) / λ 1 1 V ( X t π * ) 1 M 2 1 V ( X t π * ) = 1 d t + σ d W t .
Because 0 M a π * ( a ; X t π * ) d a [ 0 , M ] , the SDE (18) has bounded drift and constant volatility. As a result, there exists a unique solution { X t π * } to (18).

3.2. Verification Theorem

Substituting the optimal distribution π * ( a ; x ) as shown in (15) into the HJB Equation (13), we have the following equation for V ( x ) :
ρ V ( x ) = μ V ( x ) + 1 2 σ 2 V ( x ) + λ ln Z M ( ( 1 V ( x ) ) / λ ) ,
or, equivalently,
ρ V ( x ) = μ V ( x ) + 1 2 σ 2 V ( x ) + λ ln λ 1 V ( x ) e M ( 1 V ( x ) ) / λ 1 1 V ( x ) 1 + M 1 V ( x ) = 1 .
The following verification theorem shows that the V ( x ) that solves (19) is indeed the value function of the exploratory dividend problem (12).
Theorem 1. 
Assume there exists a twice-continuously differentiable function V that solves (19) with boundary condition (14), and | V | , | V | are bounded. Then, V is the value function of the entropy-regularized exploratory dividend problem (12) under exponential discounting.
Theorem 1 shows that the solution to the exploratory HJB Equation (19) could be the value function of exploratory dividend problem (12). On the other hand, a similar argument could show that the value function shall also satisfy (19), while the optimal exploratory dividend strategy is given by (17). To establish a rigorous statement, we need the following result. The next proposition shows that the value function V ( x ) converges as x going to infinity.
Proposition 2. 
Let V be the value function of (12) and suppose the optimal exploratory dividend strategy is (17). Then, as x going to infinity, V ( x ) converges to a constant, i.e.,
lim x V ( x ) = λ ln λ + λ ln ( e M / λ 1 ) ρ .

3.3. Solution to Exploratory HJB

Compared with the differential Equation (6), which solves the classical value function, the exploratory HJB Equation (19) has a nonlinear term ln Z M ( ( 1 V ( x ) ) / λ ) , which makes it difficult to be solved explicitly. The theorem below guarantees the existence and uniqueness of the solution V ( x ) .
Theorem 2. 
There exists a unique twice-continuously differentiable function V ( x ) that solves (19) with boundary conditions (14) and (20). Moreover, lim λ 0 | V ( x ) V c l ( x ) | = 0 for all x [ 0 , ) , where V c l ( x ) is the value function of the classical dividend problem.
Theorem 2 follows from the results in (Tang et al. 2022, Theorems 3.9 and 3.10) and in (Strulovici and Szydlowski 2015, Proposition 1). It is straightforward to check that the conditions to guarantee the existence and uniqueness of the solution to (19) and its twice-continuous differentiability are satisfied.
Theorem 2 also states that when λ becomes smaller, the exploratory value function converges to the classical value function. Indeed, a stronger convergence is established by Tang et al. (2022) that V converges to V c l as locally uniformly as λ going to 0. Note that the parameter λ is the weight to be put on the exploration in contrast to the exploitation. If it is more close to 0, the entropy term has a smaller effect on the total objective value, and the optimal exploratory distribution π * ( a ; x ) in (15) is more concentrated and close to the Dirac distribution—the optimal solution to the classical dividend optimization problem. Then, not surprisingly, the exploratory value function V ( x ) also converges to the classical value function V c l ( x ) as λ going to 0.
Now, thanks to Theorem 2, we have V ( x ) that solves the exploratory HJB Equation (19). On the other hand, it is straightforward to show that according to (20), if M < λ ln 1 / λ + 1 , the limit of V ( x ) is negative; if M > λ ln 1 / λ + 1 , the limit of V ( x ) is positive; and if M = λ ln 1 / λ + 1 , the limit of V ( x ) is zero. The next theorem shows that, indeed, we classify V ( x ) into three cases based on its monotonicity.
Theorem 3. 
Let V ( x ) be the solution to (19) with boundary conditions (14) and (20). Then, V ( x ) is monotone. To be more specific, there are the following:
(i) 
If M < λ ln 1 / λ + 1 , V ( x ) is nonincreasing.
(ii) 
If M > λ ln 1 / λ + 1 , V ( x ) is nondecreasing.
(iii) 
If M = λ ln 1 / λ + 1 , V ( x ) 0 .
The following corollary is a direct result from the above theorem.
Corollary 1. 
Let V ( x ) be the solution to (19) with boundary conditions (14) and (20). Then, | V ( x ) | and | V ( x ) | are bounded.
Note that in Theorem 1, we need | V | and | V | to be bounded so that V—the solution to (19)—is indeed the value function of the exploratory optimal dividend problem. Corollary 1 verifies that the boundedness conditions are satisfied. In other words, the solution to the exploratory HJB Equation (19) is the value function of the exploratory dividend problem.

4. Discussion

In view of Theorem 3, value functions can be classified into three cases according to the monotonicity: (1) M < λ ln 1 / λ + 1 ; (2) M > λ ln 1 / λ + 1 ; and (3) M = λ ln 1 / λ + 1 . The following proposition will be useful in analyzing the properties of the value functions.
Proposition 3. 
(a) 
Define
d 1 ( λ ) : = λ ln 1 / λ + 1 1 λ > 0 + 0 · 1 λ = 0 , λ [ 0 , ) .
Then, d 1 ( λ ) is increasing. Therefore, lim λ 0 d 1 ( λ ) = d 1 ( 0 ) = 0 and lim λ d 1 ( λ ) = 1 ;
(b) 
Define
d 2 ( λ ) : = λ ln λ / λ 1 1 λ > 1 + · 1 λ [ 0 , 1 ] , λ [ 0 , ) .
Then, d 2 ( λ ) > d 1 ( λ ) , and d 2 ( λ ) is decreasing on λ > 1 . Therefore, lim λ 1 d 2 ( λ ) = and lim λ d 2 ( λ ) = 1 .
Case 1: M < d 1 ( λ ) .
The value function in this case is nonincreasing and thus non-positive, as a sharp contrast to the results of the classical dividend problem. To see the reason, on one hand, note that for λ > 0 ,
ln M < ln d 1 ( λ ) = ln λ ln 1 λ + 1 ln λ · 1 λ = 0 .
Then, due to Proposition 1, the entropy term is negative, that is, H ( π ) ln M < 0 . On the other hand, when d 1 ( λ ) is large, it implies that the exploration parameter λ is relatively large compared with the maximal dividend-paying rate M. Then, the negative entropy has a large weight in the total objective value, dominating the total expected dividends and leading to a negative value function.
Case 2: M > d 1 ( λ ) .
When M > d 1 ( λ ) , the value function is non-decreasing, which is closer to the increasing value function in the classical dividend optimization problem than in Case 1. This is because a relatively small λ compared with M decreases the weight of the entropy term in the total objective value. Note that in classical dividend optimization, the limit of the value function is M / ρ , while in the current exploratory dividend optimization, the limit of the value function is given in (20). Therefore, if (i) d 1 ( λ ) < M d 2 ( λ ) , the limit of V ( x ) is no larger than that of V c l ( x ) ; if (ii) d 2 ( λ ) < M , then V ( x ) asymptotically achieves a higher value than that of the classical dividend optimization. Then, if λ > 1 and M > λ ln λ λ ln ( λ 1 ) = d 2 ( λ ) , the limit of V ( x ) is larger than that of V c l ( x ) .
As shown in Proposition 3, for any λ 0 , d 1 ( λ ) < lim λ d 1 ( λ ) = 1 . Therefore, when M 1 , it always belongs to Case 2 for any λ 0 . On the other hand, for any λ 0 , d 2 ( λ ) > lim λ d 2 ( λ ) = 1 . Therefore, when M 1 or λ 1 , because d 2 ( λ ) > M , it cannot be Case 2 (ii); when λ 1 and M 1 , it is always Case 2 (i). Note that λ = 0 corresponds to the classical dividend optimization and d 1 ( 0 ) = 0 , d 2 ( 0 ) = by definition. Because d 1 ( 0 ) < M < d 2 ( 0 ) for any positive constant M, classical dividend optimization can be viewed as a special Case 2 (i). It implies that exploratory dividend optimization is a generalization of the classical dividend optimization.
Case 3: M = d 1 ( λ ) .
As shown in Theorem 3, the value function in this case should be constantly zero. This is because λ compared with M happens to strike a balance between exploitation and exploration such that the total expected dividend is offset by the entropy.
Figure 2 depicts the different cases of the value functions given different combinations of M and λ . When M < d 1 ( λ ) , the value function falls into the Case 1 area. When M > d 1 ( λ ) , the value function corresponds to Case 2, which can be further classified into two cases based on the comparison of M and d 2 ( λ ) , i.e., whether the value function asymptotically achieves a higher value than that of the classical problem. When M = d 1 ( λ ) , the value function should be of a Case 3 type.

5. Numerical Examples

In this section, we present the numerical examples of the optimal exploratory policy and the corresponding value function, which solves the exploratory HJB Equation (19) based on the theoretical results obtained in the previous sections.2 To have a clear vision on the weight of the cumulative dividends and that of the entropy in the total objective value, we further decompose V ( x ) into two parts: the expected total discounted dividends under the optimal exploratory dividend policy
D v ( x ) : = E 0 τ x π * e ρ t 0 M a π * ( a ; X t π * ) d a d t ;
and the expected total weighted discounted entropy under the optimal exploratory dividend policy
E n t r ( x ) : = λ E 0 τ x π * e ρ t H ( π t * ) d t ,
where the entropy of π * is derived via substituting the optimal distribution (15) into the definition of entropy (10), i.e.,
H ( π * ) = ln ( Z M ( ( 1 V ( x ) ) / λ ) ) M e M ( 1 V ( x ) ) / λ Z M ( ( 1 V ( x ) ) / λ ) + 1 .
Hence, V ( x ) = D v ( x ) + E n t r ( x ) . We show examples of three cases, respectively, with commonly used parameters: μ = 1 , σ = 1 , and ρ = 0.3 .
First, let λ = 1.5 and M = 0.6 . Then, M < d 1 ( λ ) and it belongs to Case 1. Note that V ( x ) in this case is decreasing and non-positive, which is a sharp contrast to the results of the classical dividend problem. The figure on the top row, left column, of Figure 3 plots the corresponding value function and its two components D v ( x ) and E n t r ( x ) .3 The figure on the middle row, left column, of Figure 3 plots the mean of the optimal distribution π * ( · ; x ) , which is decreasing on x. The figure on the bottom row, left column, of Figure 3 shows the density function of the optimal distribution with respect to the different surplus level x. Because V ( x ) 0 , the optimal distribution is a truncated exponential distribution with rate λ / 1 V ( x ) < 0 for any x 0 . Therefore, it is more likely to pay a high dividend rate. Furthermore, as the surplus x increases, the density function becomes more flat because V ( x ) is increasing to 0 and the rate λ / 1 V ( x ) is decreasing on x.
Second, let λ = 1.5 and M = 1.2 . Then, d 1 ( λ ) < M < d 2 ( λ ) and it belongs to Case 2 (i). The figure on the top row, middle column, of Figure 3 shows the corresponding value function, D v ( x ) and E n t r ( x ) . In contrast to Case 1, E n t r ( x ) in this case becomes positive because M is sufficiently large, making the value function V ( x ) positive. The figure on the middle row, middle column, of Figure 3 plots the mean of the optimal distribution π * ( · ; x ) , which is increasing on x. The figure on the bottom row, middle column, of Figure 3 shows the density function of the optimal distribution with respect to the different surplus level x. When x is small, it is more likely to choose a low dividend-paying rate, because paying a too high dividend rate would probably cause the insurance company to go bankrupt and harms the shareholder’s benefit in the long run. When x becomes larger, it is more likely to pay a high dividend rate.
Third, let λ = 1.5 and M = 1.8 . Then, M > d 2 ( λ ) and it belongs to Case 2 (ii). The figure on the top row, right column, of Figure 3 shows the corresponding value function, D v ( x ) and E n t r ( x ) . In this case, the limit of V ( x ) is higher than that of the classical value function V c l ( x ) , which is M / ρ = 6 . Note that the expected total discounted dividend under the exploratory policy D v ( x ) does not exceed that of the classical policy V c l ( x ) , because the classical optimal dividend policy fully exploits the known environment. For a sufficiently large M and λ , E n t r ( x ) is large enough to make V ( x ) larger than V c l ( x ) . The figure on the middle row, right column, of Figure 3 plots the mean of the optimal distribution π * ( · ; x ) and the figure on the bottom row, right column, of Figure 3 plots the density function of the optimal distribution, which are similar to that of Case 2 (i).
When λ = 1.5 and M = 0.7662 , it belongs to Case 3 and the value function in this case should be constantly zero.
We also vary the value of λ while keeping the other parameter values unchanged. Figure 4 shows the value function under different values of λ with M = 0.6 and M = 1.2 , respectively. Note that when λ = 0 , V ( x ) degenerates to the classical value function V c l ( x ) . For M = 0.6 , it is Case 2 (ii) when λ is small and then becomes Case 3 and Case 1 as λ becomes larger. As aforementioned, it cannot be Case 2 (ii) because M < 1 . Indeed, the left panel of Figure 4 shows the value function could not exceed the classical one as λ becomes smaller. On the other hand, for M = 1.2 , it can only be Case 2 and even Case 2 (ii) if λ is large enough. The right panel of Figure 4 shows the value function is always increasing on x for different values of λ and it can exceed the classical value function for a sufficiently large λ .

6. Conclusions

This study investigates the dividend optimization problem in the entropy regularization framework. In an unknown environment, entropy is incorporated into the objective function to encourage exploration, and an exploratory dividend policy is introduced. We establish an exploratory HJB equation and determine that a truncated exponential distribution is the optimal distributional control. In comparison with the classical value function, the value function in the exploratory dividend problem is classified into three cases. The monotonicity of the value function is determined by the maximal dividend-paying rate and the temperature parameter, which controls the weight of exploration. It indicates that when insurance companies adopt new exploratory dividend payout strategies in unknown market environments, the design of the weight of exploration is important.
One potential direction for future research would be to consider the exploratory dividend policy under nonexponential discounting, which renders the problem time-inconsistent. Furthermore, in addition to the dividend policy, reinsurance could be considered as part of the insurance company’s strategy, which is more technically challenging under the entropy regularization framework. Finally, one could use other definitions of entropy, instead of Shannon’s differential, as measures of the level of exploration in RL.

Author Contributions

Conceptualization, S.H.; methodology, S.H. and Z.Z.; formal analysis, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China (Grant No. 12271462 and No. 11901494) and Shenzhen Science and Technology Program (2022 College Stable Support Project).

Data Availability Statement

Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflict of interests.

Appendix A. Proof

Proof of Proposition 1. 
By definition (10),
H ( π ) = 0 M π ( a ) ln π ( a ) d a = 0 M π ( a ) ln 1 π ( a ) d a ln 0 M π ( a ) 1 π ( a ) d a = ln M ,
where the inequality is due to Jensen’s inequality. □
Proof of Theorem 1. 
Let π ˜ Π be an exploratory dividend policy. Because V solves (13), for any x [ 0 , ) ,
0 = sup π Π [ 0 , M ] 0 M a λ ln π ( a ; x ) a V ( x ) π ( a ; x ) d a + μ V ( x ) + 1 2 σ 2 V ( x ) ρ V ( x ) 0 M a λ ln π ˜ ( a ; x ) a V ( x ) π ˜ ( a ; x ) d a + μ V ( x ) + 1 2 σ 2 V ( x ) ρ V ( x ) .
This shows that
ρ V ( x ) + μ 0 M a π ˜ ( a ; x ) d a V ( x ) + 1 2 σ 2 V ( x ) 0 M a λ ln π ˜ ( a ; x ) π ˜ ( a ; x ) d a .
Applying Itô’s Lemma on e ρ t V ( X t π ˜ ) ,
V ( x ) = e ρ ( T τ x π ˜ ) V ( X T τ x π ˜ π ˜ ) 0 T τ x π ˜ e ρ t ρ V ( X t π ˜ ) + μ 0 M a π ˜ ( a ; X t π ˜ ) d a V ( X t π ˜ ) + 1 2 σ 2 V ( X t π ˜ ) d t 0 T τ x π ˜ σ e ρ t V ( X t π ˜ ) d W t e ρ ( T τ x π ˜ ) V ( X T τ x π ˜ π ˜ ) + 0 T τ x π ˜ e ρ t 0 M a λ ln π ˜ ( a ; X t π ˜ ) π ˜ ( a ; X t π ˜ ) d a d t 0 T τ x π ˜ σ e ρ t V ( X t π ˜ ) d W t ,
where the inequality is due to (A1). Then, taking the expectation on both sides,
V ( x ) E e ρ ( T τ x π ˜ ) V ( X T τ x π ˜ π ˜ ) + E 0 T τ x π ˜ e ρ t 0 M a λ ln π ˜ ( a ; X t π ˜ ) π ˜ ( a ; X t π ˜ ) d a d t E 0 T τ x π ˜ σ e ρ t V ( X t π ˜ ) d W t .
For the first term on the right hand side of (A2), noting that | V | is bounded, by using the bounded convergence theorem,
lim T E e ρ ( T τ x π ˜ ) V ( X T τ x π ˜ π ˜ ) = E e ρ ( τ x π ˜ ) V ( X τ x π ˜ π ˜ ) = 0 .
For the second term on the right hand side of (A2), because π ˜ is admissible and satisfies Definition 2 (iii),
E 0 T τ x π ˜ e ρ t 0 M a λ ln π ˜ ( a ; X t π ˜ ) π ˜ ( a ; X t π ˜ ) d a d t = E 0 T τ x π ˜ e ρ t 0 M a π ˜ ( a ; X t π ˜ ) d a d t λ E 0 T τ x π ˜ e ρ t 0 M ln π ˜ ( a ; X t π ˜ ) π ˜ ( a ; X t π ˜ ) + 1 d a d t + λ E 0 T τ x π ˜ e ρ t M d t .
Because 0 M a π ˜ ( a ; X t π ˜ ) d a is non-negative, by using monotone convergence theorem,
lim T E 0 T τ x π ˜ e ρ t 0 M a π ˜ ( a ; X t π ˜ ) d a d t = E 0 τ x π ˜ e ρ t 0 M a π ˜ ( a ; X t π ˜ ) d a d t .
Noting that y ln y + 1 y > 0 for any y ( 0 , ) , by using the monotone convergence theorem,
lim T E 0 T τ x π ˜ e ρ t 0 M ln π ˜ ( a ; X t π ˜ ) π ˜ ( a ; X t π ˜ ) + 1 d a d t = E 0 τ x π ˜ e ρ t 0 M ln π ˜ ( a ; X t π ˜ ) π ˜ ( a ; X t π ˜ ) + 1 d a d t .
and lim T E 0 T τ x π ˜ e ρ t M d t = E 0 τ x π ˜ e ρ t M d t .
For the third term on the right hand side of (A2), noting that | V | is bounded, the stochastic integral 0 s σ e ρ t V ( X t π ˜ ) d W t s 0 is a martingale, and then by using the optional sampling theorem,
E 0 T τ x π ˜ σ e ρ t V ( X t π ˜ ) d W t = 0 .
Thus, letting T on both sides of (A2),
V ( x ) E 0 τ x π ˜ e ρ t 0 M a λ ln π ˜ ( a ; X t π ˜ ) π ˜ ( a ; X t π ˜ ) d a d t = J ( x , π ˜ ) .
Because π ˜ is arbitrarily chosen, V ( x ) becomes an upper bound of the optimal value of J ( x ; · ) .
On the other hand, the above inequality becomes an equality if the supremum in (13) is achieved, that is, π ˜ = π * , where π * is given by (15). Thus, V ( x ) is the value function. □
Define function G λ , M to be
G λ , M ( y ) : = M 1 y + M e M y 1 1 y 0 + M 2 1 y = 0 ( 1 λ y ) + λ ln Z M ( y ) .
where function Z M is given in (16).
Lemma A1. 
The function G λ , M ( y ) defined in (A3) is maximized when y = 1 / λ , and
G λ , M ( 1 / λ ) = λ ln λ + λ ln ( e M / λ 1 ) .
Moreover, G λ , M ( 1 / λ ) < 0 when M < λ ln 1 / λ + 1 , G λ , M ( 1 / λ ) > 0 when M > λ ln 1 / λ + 1 , and G λ , M ( 1 / λ ) = 0 when M = λ ln 1 / λ + 1 .
Proof. 
Take the first-order derivative of function G λ , M :
G λ , M ( y ) = 1 y 2 M 2 e M y ( e M y 1 ) 2 λ M λ M ( e M y 1 ) M 2 y e M y ( e M y 1 ) 2 λ ( e M y 1 ) M y e M y y ( e M y 1 ) = ( 1 λ y ) e 2 M y ( 2 + M 2 y 2 ) e M y + 1 y 2 ( e M y 1 ) 2 = ( 1 λ y ) f 1 ( y ) y 2 ( e M y 1 ) 2 , y 0 ,
where f 1 ( y ) : = e 2 M y ( 2 + M 2 y 2 ) e M y + 1 , y 0 . Take the first-order derivative of f 1 :
f 1 ( y ) = 2 M e 2 M y 2 M e M y M 3 y 2 e M y 2 M 2 y e M y = M e M y f 2 ( y ) , y 0 ,
where f 2 ( y ) : = 2 e M y M 2 y 2 2 M y 2 , y 0 . Take the first-order derivative of f 2 :
f 2 ( y ) = 2 M e M y 2 M 2 y 2 M = 2 M f 3 ( y ) , y 0 ,
where f 3 ( y ) : = e M y M y 1 , y 0 . Take the first-order derivative of f 3 :
f 3 ( y ) = M e M y M , y 0 .
Note that f 3 ( y ) > 0 for y > 0 and f 3 ( y ) < 0 for y < 0 . Hence, f 3 ( y ) is increasing on y > 0 and decreasing on y < 0 , and f 3 ( y ) > 0 . Then, f 2 ( y ) > 0 , which means that f 2 ( y ) is increasing. As a result, f 2 ( y ) > 0 for y > 0 and f 2 ( y ) < 0 for y < 0 . Hence, f 1 ( y ) > 0 for y > 0 and f 1 ( y ) < 0 for y < 0 , which means that f 1 ( y ) is increasing on y > 0 and decreasing on y < 0 . As a result, f 1 ( y ) > 0 for y 0 .
The above analysis shows that G λ , M ( y ) is positive when 1 λ y > 0 , i.e., y < 1 / λ , and negative when 1 λ y < 0 , i.e., y > 1 / λ . Thus, the maximum is obtained at y = 1 / λ :
max G λ , M ( y ) = G λ , M ( 1 / λ ) = λ ln Z M ( 1 / λ ) = λ ln λ + λ ln ( e M / λ 1 ) .
Moreover, when M < λ ln 1 / λ + 1 , G λ , M ( 1 / λ ) < λ ln λ + λ ln ( e ln ( 1 / λ + 1 ) 1 ) = 0 ; when M > λ ln 1 λ + 1 , G λ , M ( 1 / λ ) > 0 ; and when M = λ ln 1 λ + 1 , G λ , M ( 1 / λ ) = 0 . □
Proof of Proposition 2. 
With the optimal distributional policy given in (17), substituting (17) into the objective (11) leads to
V ( x ) = J ( x , π * ) = E 0 τ x π * e ρ t 0 M a λ ln π * ( a ; X t π * ) π * ( a ; X t π * ) d a d t = E 0 τ x π * e ρ t 0 M a λ a 1 V ( X t π * ) λ + λ ln Z M 1 V ( X t π * ) λ π * ( a ; X t π * ) d a d t = E [ 0 τ x π * e ρ t { M λ 1 V ( X t π * ) + M e M ( 1 V ( X t π * ) ) / λ 1 1 V ( X t π * ) 1 + M 2 1 V ( X t π * ) = 1 1 λ 1 V ( X t π * ) λ + λ ln Z M 1 V ( X t π * ) λ } d t ] = E 0 τ x π * e ρ t G λ , M 1 V ( X t π * ) λ d t ,
where G λ , M is defined in (A3).
On one hand,
V ( x ) = E 0 τ x π * e ρ t G λ , M 1 V ( X t π * ) λ d t E 0 τ x π * e ρ t λ ln λ + λ ln ( e M / λ 1 ) d t ,
where the inequality follows from Lemma A1. Letting x and by using the dominated convergence theorem,
lim x V ( x ) lim x E 0 τ x π * e ρ t λ ln λ + λ ln ( e M / λ 1 ) d t = E 0 e ρ t λ ln λ + λ ln ( e M / λ 1 ) d t = λ ln λ + λ ln ( e M / λ 1 ) ρ .
On the other hand, consider an exploratory policy π ^ = { π ^ t } t 0 , where
π ^ t = π ^ ( a ; X t π ^ ) = e a / λ λ ( e M / λ 1 ) , a [ 0 , M ] .
Then,
V ( x ) J ( x , π ^ ) = E 0 τ x π ^ e ρ t 0 M a λ ln π ^ ( a ; X t π ^ ) π ^ ( a ; X t π ^ ) d a d t = E 0 τ x π ^ e ρ t λ ln λ + λ ln ( e M / λ 1 ) d t .
Letting x and by using the dominated convergence theorem,
lim x V ( x ) λ ln λ + λ ln ( e M / λ 1 ) ρ ,
which then together with the previous inequality leads to (20). □
Define a function h as
h ( x ) = ln e k ( 1 x ) 1 1 x , x 1 , ln k , x = 1 ,
where k > 0 is given.
Lemma A2. 
The function h defined in (A4) satisfies the following properties:
(i) 
h ( x ) is continuous and decreasing in x;
(ii) 
There exists a unique x 0 R such that h ( x 0 ) = 0 ;
(iii) 
| h ( x ) | < k | x | + c , for some constant c R , which depends on k only;
(iv) 
| h ( x 1 ) h ( x 2 ) | < k | x 1 x 2 | , x 1 , x 2 R .
Proof. 
We first show that function h ( x ) is continuous at x = 1 . By using the L’Hôpital rule, lim x 1 e k ( 1 x ) 1 1 x = k . Hence, lim x 1 h ( x ) = ln k = h ( 1 ) .
Taking the first-order derivative of h, for x 1 ,
h ( x ) = 1 x e k ( 1 x ) 1 k ( 1 x ) e k ( 1 x ) + e k ( 1 x ) 1 ( 1 x ) 2 = h 1 ( x ) ( 1 x ) ( e k ( 1 x ) 1 ) ,
where h 1 ( x ) = e k ( 1 x ) 1 k ( 1 x ) e k ( 1 x ) . Then,
h 1 ( x ) = k e k ( 1 x ) + k e k ( 1 x ) + k 2 ( 1 x ) e k ( 1 x ) = k 2 ( 1 x ) e k ( 1 x ) ,
which is positive when x < 1 and negative when x > 1 . Therefore, h 1 ( x ) is increasing on x < 1 and then decreasing on x > 1 and h 1 ( x ) < lim x 1 h 1 ( x ) = 0 . Combining with the fact that ( 1 x ) ( e k ( 1 x ) 1 ) > 0 for x 1 , we show that h ( x ) < 0 for x 1 . It then completes the proof of ( i ) that h ( x ) is decreasing in x.
To show ( i i ) , note that lim x h ( x ) > 0 and lim x h ( x ) < 0 . By the continuity and monotonicity of h ( x ) , there must exist a unique x 0 R such that h ( x 0 ) = 0 . In particular, when k = 1 , x 0 = 1 .
Note that for x 1 , e k ( 1 x ) 1 > k ( 1 x ) , which implies
e k ( 1 x ) 1 k ( 1 x ) e k ( 1 x ) > k ( 1 x ) ( e k ( 1 x ) 1 ) ,
Combining with the fact that ( 1 x ) ( e k ( 1 x ) 1 ) > 0 for x 1 ,
h ( x ) = e k ( 1 x ) 1 k ( 1 x ) e k ( 1 x ) ( 1 x ) ( e k ( 1 x ) 1 ) > k .
Based on the previous results, for x < x 0 ,
| h ( x ) | = h ( x ) = h ( x 0 ) x x 0 h ( y ) d y = x x 0 h ( y ) d y < x x 0 k d y = k ( x 0 x ) ;
similarly, for x x 0 ,
| h ( x ) | = h ( x ) = h ( x 0 ) x 0 x h ( y ) d y = x 0 x h ( y ) d y < x 0 x k d y = k ( x x 0 ) .
To show ( i i i ) ,
| h ( x ) | < k | x x 0 | k | x | + k | x 0 | , x R .
It remains to prove ( i v ) . Without the loss of generality, we assume x 1 x 2 . Then,
| h ( x 1 ) h ( x 2 ) | = h ( x 2 ) h ( x 1 ) = x 1 ˜ x 2 h ( y ) d y < x 2 x 1 k d y = k | x 1 x 2 | .
Proof of Theorem 2. 
It is straightforward to show that Assumption 3.8 in Tang et al. (2022) holds for our exploratory dividend problem. The well posedness of SDE (18) for the optimal exploratory surplus process is also established. Then, by applying the results of (Tang et al. 2022, Theorems 3.9 and 3.10), the existence and uniqueness of the solution to (19) and the convergence of V to V c l are established.
To show the twice-continuously differentiability of V ( x ) , we apply the results in (Strulovici and Szydlowski 2015, Proposition 1) (with the infinite domain). We rewrite the HJB Equation (19) into the following form:
V ( x ) + H ( V ( x ) , V ( x ) ) = 0 ,
where
H ( p , q ) : = 2 σ 2 ρ p + μ q + λ ln λ 1 q e M ( 1 q ) / λ 1 1 q 1 + M 1 q = 1 = 2 σ 2 ρ p + μ q + λ ln λ + λ h ( q ) ,
and h is defined in (A4) with k = M / λ . According to Proposition 1 in Strulovici and Szydlowski (2015), if H satisfies Conditions 1–3, then there exists a twice-continuously differentiable solution to the HJB equation.
To check Condition 1 in (Strulovici and Szydlowski 2015, Proposition 1), note that for p , q R ,
| H ( p , q ) | 2 σ 2 ρ | p | + μ | q | + λ | ln λ | + λ | h ( q ) | < 2 σ 2 ρ | p | + μ | q | + λ | ln λ | + M | q | + c ,
where the second inequality comes from Lemma A2 ( i i i ) , and c R is a constant. Taking L 1 : = 2 σ 2 max ( λ | ln λ | + c , ρ , μ + M ) , we have
| H ( p , q ) | L 1 ( 1 + | p | + | q | ) .
Secondly, for p , p ˜ , q , q ˜ R ,
| H ( p , q ) H ( p ˜ , q ˜ ) | 2 σ 2 ρ | p p ˜ | + μ | q q ˜ | + λ | h ( q ) h ( q ˜ ) | < 2 σ 2 ρ | p p ˜ | + μ | q q ˜ | + M | q q ˜ | ,
where the second inequality comes from Lemma A2 ( i v ) . Taking L 2 : = 2 σ 2 max ( ρ , μ + M ) , we have
| H ( p , q ) H ( p ˜ , q ˜ ) | L 2 ( | p p ˜ | + | q q ˜ | ) .
To check Condition 2, note that for all q R , H ( · , q ) is nonincreasing in p.
It remains to check Condition 3. For each K ¯ > 0 , choose K 1 , K 2 > K ¯ such that
K 1 max ( M + μ ) K 2 + λ ln λ + λ c ρ , ( M + μ ) K 2 λ ln λ + λ c ρ ,
where c is a constant satisfying Lemma A2 ( i i i ) . Then, for all p R , ϵ { 1 , 1 } ,
H ( K 1 + K 2 | p | , ϵ K 2 ) = 2 σ 2 ρ K 1 ρ K 2 | p | + μ ϵ K 2 + λ ln λ + λ h ( ϵ K 2 ) < 2 σ 2 ρ K 1 + μ K 2 + λ ln λ + λ h ( ϵ K 2 ) < 2 σ 2 ρ K 1 + μ K 2 + λ ln λ + M K 2 + λ c < 0 ,
where the third inequality is due to Lemma A2 ( i i i ) and the last inequality is due to (A5). Secondly,
H ( K 1 K 2 | p | , ϵ K 2 ) = 2 σ 2 ρ K 1 + ρ K 2 | p | + μ ϵ K 2 + λ ln λ + λ h ( ϵ K 2 ) > 2 σ 2 ρ K 1 μ K 2 + λ ln λ + λ h ( ϵ K 2 ) > 2 σ 2 ρ K 1 μ K 2 + λ ln λ M K 2 λ c > 0 ,
where the third inequality is due to Lemma A2 (iii) and the last inequality is due to (A5). □
Proof of Theorem 3. 
Note that (19) can be rewritten as
ρ V ( x ) = σ 2 2 V ( x ) + μ V ( x ) + λ h ( V ( x ) ) + λ ln λ ,
where h is defined in (A4) with k = M / λ .
First, suppose M < λ ln 1 / λ + 1 . Then, λ ln λ + λ ln ( e M / λ 1 ) ) < 0 . According to (20), lim x V ( x ) < 0 . Define x 0 : = inf { x 0 : V ( x + ) 0 } . Note that V ( x ) is not a constant in this case and hence V ( x ) does not always equal to 0, which implies that x 0 < .
Assume that V ( x 0 + ) > 0 . Because V ( x 0 ) = V ( 0 ) = 0 , there must exist some interval such that V ( x ) is decreasing in order to reach its negative limit, which means that there exists some point such that V ( x ) changes its sign from positive to negative. Define this point as
x 1 : = inf { x > x 0 : V ( x ) = 0 , V ( x + ) < 0 } .
Hence, V ( x 1 ) 0 . Then, according to (A6),
ρ V ( x 1 ) = σ 2 2 V ( x 1 ) + μ V ( x 1 ) + λ h ( V ( x 1 ) ) + λ ln λ = σ 2 2 V ( x 1 ) + λ ln e M λ 1 + λ ln λ < 0 ,
which implies that V ( x 1 ) < 0 . But a contradiction happens because V ( x ) is non-negative on [ 0 , x 1 ] , which leads to V ( x 1 ) > 0 .
Then, assume that V ( x 0 + ) < 0 and there exists some point such that V ( x ) > 0 . Define x 2 as
x 2 : = inf { x > x 0 : V ( x ) = 0 , V ( x + ) > 0 } .
Hence, V ( x 2 ) 0 . According to (A6),
ρ V ( x 2 ) = σ 2 2 V ( x 2 ) + μ V ( x 2 ) + λ h ( V ( x 2 ) ) + λ ln λ = σ 2 2 V ( x 2 ) + λ ln e M λ 1 + λ ln λ λ ln e M λ 1 + λ ln λ .
Therefore,
V ( x 2 ) λ ln λ + λ ln ( e M / λ 1 ) ρ = lim x V ( x ) .
Because V ( x 2 + ) > 0 , V ( x ) is strictly increasing in a local neighborhood after x 2 . Then, after point x 2 , there should exist some interval such that V ( x ) is strictly decreasing in order to achieve the limit. Define x 3 as
x 3 : = inf { x > x 2 : V ( x ) = 0 , V ( x + ) < 0 } .
Hence, V ( x 3 ) 0 . Note that V ( x ) is strictly positive in a local neighborhood after x 2 and non-negative on [ x 2 , x 3 ] ; thus, V ( x 3 ) > V ( x 2 ) . Then, according to (A6),
V ( x 2 ) = 2 σ 2 ρ V ( x 2 ) μ V ( x 2 ) λ h ( V ( x 2 ) ) λ ln λ < 2 σ 2 ρ V ( x 3 ) μ V ( x 3 ) λ h ( V ( x 3 ) ) λ ln λ = V ( x 3 ) ,
which is a contradiction. Therefore, V ( x ) 0 and V ( x ) is decreasing.
For the other two cases, the proof is similar. □
Proof of Corollary 1. 
Because according to Theorem 3 V ( x ) is monotone and its limit as shown in (20) is finite, it is straightforward that | V ( x ) | and | V ( x ) | are bounded. □
Proof of Proposition 3. 
(a) Taking the first-order derivative of d 1 , for λ > 0 ,
d 1 ( λ ) = ln 1 λ + 1 + λ · λ 2 1 / λ + 1 = ln λ λ + 1 + λ λ + 1 1 = ω λ λ + 1 ,
where ω ( x ) : = ln x + ( x 1 ) , x > 0 . Because ω ( x ) = 1 / x + 1 < 0 for x ( 0 , 1 ) , ω ( x ) is decreasing on x ( 0 , 1 ) . Therefore, ω ( x ) > ω ( 1 ) = 0 for x ( 0 , 1 ) , which shows that d 1 ( λ ) is increasing. By using the L’Hôpital rule,
lim λ 0 d 1 ( λ ) = lim λ 0 λ 2 / 1 / λ + 1 λ 2 = lim λ 0 1 1 / λ + 1 = 0 , lim λ d 1 ( λ ) = lim λ 1 1 / λ + 1 = 1 .
(b) Note that for λ > 1 , ( λ + 1 ) / λ < λ / ( λ 1 ) . Therefore, d 1 ( λ ) < d 2 ( λ ) .
Taking the first-order derivative of d 2 , for λ > 1 ,
d 2 ( λ ) = ln λ λ 1 + λ · λ 1 λ · λ 1 λ λ 1 2 = ln λ λ 1 λ λ 1 1 = ω λ λ 1 .
Because ω ( x ) = 1 / x + 1 > 0 for x > 1 , ω ( x ) is increasing on x > 1 . Therefore, ω ( x ) > ω ( 1 ) = 0 for x > 1 , which shows that d 2 ( λ ) is decreasing. By using the L’Hôpital rule,
lim λ 1 d 2 ( λ ) = lim λ 1 λ 1 λ · λ 1 λ λ 1 2 · 1 λ 2 = lim λ 1 λ λ 1 = , lim λ d 2 ( λ ) = lim λ λ λ 1 = 1 .

Notes

1
For example, the dividend-paying rate under the threshold strategy is the maximal rate if the surplus exceeds the threshold; otherwise, it pays nothing. Because the threshold is determined by the model parameters, the change in the estimated parameters may dramatically change the dividend-paying rate from zero to the maximal rate, or vice versa.
2
We apply the shooting method, which adjusts the initial value of the first-order derivative such that the boundary conditions (14) and (20) are satisfied and use the “ode45” function in Matlab to find the numerical solution to (19).
3
For each initial surplus x, we discretize the continuous time into small pieces ( Δ t = 0.0005 ) and sample 2000 independent surplus processes X t π * to simulate D v ( x ) and E n t r ( x ) .

References

  1. Asmussen, Søren, and Michael Taksar. 1997. Controlled diffusion models for optimal dividend pay-out. Insurance: Mathematics and Economics 20: 1–15. [Google Scholar] [CrossRef]
  2. Asmussen, Søren, Bjarne Højgaard, and Michael Taksar. 2000. Optimal risk control and dividend distribution policies. example of excess-of loss reinsurance for an insurance corporation. Finance and Stochastics 4: 299–324. [Google Scholar] [CrossRef]
  3. Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47: 235–56. [Google Scholar] [CrossRef]
  4. Avram, Florin, Zbigniew Palmowski, and Martijn R. Pistorius. 2007. On the optimal dividend problem for a spectrally negative lévy process. The Annals of Applied Probability 17: 156–80. [Google Scholar] [CrossRef]
  5. Azcue, Pablo, and Nora Muler. 2005. Optimal reinsurance and dividend distribution policies in the cramér-lundberg model. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics 15: 261–308. [Google Scholar] [CrossRef]
  6. Azcue, Pablo, and Nora Muler. 2010. Optimal investment policy and dividend payment strategy in an insurance company. The Annals of Applied Probability 20: 1253–302. [Google Scholar] [CrossRef]
  7. Bai, Lihua, Thejani Gamage, Jin Ma, and Pengxu Xie. 2023. Reinforcement learning for optimal dividend problem under diffusion model. arXiv arXiv:2309.10242. [Google Scholar]
  8. Cesa-Bianchi, Nicolò, Claudio Gentile, Gábor Lugosi, and Gergely Neu. 2017. Boltzmann exploration done right. Advances in Neural Information Processing Systems 30. [Google Scholar]
  9. Choulli, Tahir, Michael Taksar, and Xun Yu Zhou. 2003. A diffusion model for optimal dividend distribution for a company with constraints on risk control. SIAM Journal on Control and Optimization 41: 1946–79. [Google Scholar] [CrossRef]
  10. Dai, Min, Yuchao Dong, and Yanwei Jia. 2023. Learning equilibrium mean-variance strategy. Mathematical Finance 33: 1166–212. [Google Scholar] [CrossRef]
  11. De Finetti, Bruno. 1957. Su un’impostazione alternativa della teoria collettiva del rischio. In Transactions of the XVth International Congress of Actuaries. New York: International Congress of Actuaries, vol. 2, pp. 433–43. [Google Scholar]
  12. Gaier, Johanna, Peter Grandits, and Walter Schachermayer. 2003. Asymptotic ruin probabilities and optimal investment. The Annals of Applied Probability 13: 1054–76. [Google Scholar] [CrossRef]
  13. Gao, Xuefeng, Zuo Quan Xu, and Xun Yu Zhou. 2022. State-dependent temperature control for langevin diffusions. SIAM Journal on Control and Optimization 60: 1250–68. [Google Scholar] [CrossRef]
  14. Gerber, Hans U. 1969. Entscheidungskriterien für den zusammengesetzten Poisson-Prozess. Ph.D. thesis, ETH Zurich, Zürich, Switzerland. [Google Scholar]
  15. Gerber, Hans U., and Elias S. W. Shiu. 2006. On optimal dividend strategies in the compound Poisson model. North American Actuarial Journal 10: 76–93. [Google Scholar] [CrossRef]
  16. Jaderberg, Max, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, and et al. 2019. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364: 859–65. [Google Scholar] [CrossRef] [PubMed]
  17. Jeanblanc-Picqué, Monique, and Albert Nikolaevich Shiryaev. 1995. Optimization of the flow of dividends. Uspekhi Matematicheskikh Nauk 50: 25–46. [Google Scholar] [CrossRef]
  18. Højgaard, Bjarne, and Michael Taksar. 1999. Controlling risk exposure and dividends payout schemes: Insurance company example. Mathematical Finance 9: 153–82. [Google Scholar] [CrossRef]
  19. Komorowski, Matthieu, Leo A. Celi, Omar Badawi, Anthony C. Gordon, and A. Aldo Faisal. 2018. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine 24: 1716–20. [Google Scholar] [CrossRef] [PubMed]
  20. Kulenko, Natalie, and Hanspeter Schmidli. 2008. Optimal dividend strategies in a Cramér–Lundberg model with capital injections. Insurance: Mathematics and Economics 43: 270–78. [Google Scholar] [CrossRef]
  21. Lundberg, Filip. 1903. Approximerad framställning af sannolikhetsfunktionen. Återförsäkring af kollektivrisker. Akademisk afhandling. Stockholm: Almqvist & Wiksells. [Google Scholar]
  22. Mirowski, Piotr, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, and et al. 2016. Learning to navigate in complex environments. arXiv arXiv:1611.03673. [Google Scholar]
  23. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, and et al. 2015. Human-level control through deep reinforcement learning. Nature 518: 529–33. [Google Scholar] [CrossRef]
  24. Nachum, Ofir, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. 2017. Bridging the gap between value and policy based reinforcement learning. Advances in Neural Information Processing Systems 30. [Google Scholar]
  25. Paulus, Romain, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv arXiv:1705.04304. [Google Scholar]
  26. Radford, Alec, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv arXiv:1704.01444. [Google Scholar]
  27. Schmidli, Hanspeter. 2007. Stochastic Control in Insurance. Cham: Springer Science & Business Media. [Google Scholar]
  28. Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, and et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529: 484–89. [Google Scholar] [CrossRef] [PubMed]
  29. Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, and et al. 2017. Mastering the game of Go without human knowledge. Nature 550: 354–59. [Google Scholar] [CrossRef] [PubMed]
  30. Strulovici, Bruno, and Martin Szydlowski. 2015. On the smoothness of value functions and the existence of optimal strategies in diffusion models. Journal of Economic Theory 159: 1016–55. [Google Scholar] [CrossRef]
  31. Tang, Wenpin, Yuming Paul Zhang, and Xun Yu Zhou. 2022. Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization 60: 3191–216. [Google Scholar] [CrossRef]
  32. Todorov, Emanuel. 2006. Linearly-solvable markov decision problems. Advances in Neural Information Processing Systems 19. [Google Scholar]
  33. Wang, Haoran, and Xun Yu Zhou. 2020. Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance 30: 1273–308. [Google Scholar] [CrossRef]
  34. Wang, Haoran, Thaleia Zariphopoulou, and Xun Yu Zhou. 2020. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research 21: 1–34. [Google Scholar]
  35. Yang, Hailiang, and Lihong Zhang. 2005. Optimal investment for insurer with jump-diffusion risk process. Insurance: Mathematics and Economics 37: 615–34. [Google Scholar] [CrossRef]
  36. Yin, Chuancun, and Yuzhen Wen. 2013. Optimal dividend problem with a terminal value for spectrally positive Lévy processes. Insurance: Mathematics and Economics 53: 769–73. [Google Scholar] [CrossRef]
  37. Zhao, Yufan, Michael R. Kosorok, and Donglin Zeng. 2009. Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28: 3294–315. [Google Scholar] [CrossRef] [PubMed]
  38. Zhu, Yuke, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning. Paper presented at 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, May 29–June 3; pp. 3357–64. [Google Scholar]
  39. Ziebart, Brian D., Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. Paper presented at AAAI, Chicago, IL, USA, July 13–17; vol. 8, pp. 1433–38. [Google Scholar]
Figure 1. The classical value functions (top row) and the optimal dividend-paying rates (bottom row) for μ = 1, σ = 1, ρ = 0.3, with M = 0.6 (left panels), M = 1.2 (middle panels), and M = 1.8 (right panels).
Figure 2. Cases of the value function given M and λ.
Figure 3. Let μ = 1, σ = 1, ρ = 0.3, and λ = 1.5, with M = 0.6 (left column), M = 1.2 (middle column), and M = 1.8 (right column). Top row: the value function V(x), the expected total discounted dividends D_v(x), and the expected total weighted discounted entropy Entr(x). Middle row: the mean of the optimal distribution π*(·; x). Bottom row: the density function of the optimal distribution at different surplus levels x.
Figure 4. The value function V(x) for different values of λ, with M = 0.6 (left) and M = 1.2 (right).