Integrating Risk-Averse and Constrained Reinforcement Learning for Robust Decision-Making in High-Stakes Scenarios

Abstract: This paper considers a risk-averse Markov decision process (MDP) with non-risk constraints as a dynamic optimization framework to ensure robustness against unfavorable outcomes in high-stakes sequential decision-making situations such as disaster response. In this regard, strong duality is proved while making no assumptions on the problem's convexity. This is necessary for real-world problems that are non-convex, e.g., humanitarian relief allocation with deprivation costs.


Introduction
In certain sequential decision-making situations, e.g., disaster response or highly disruption-prone supply chains [1][2][3][4], decision outcomes are not only highly uncertain because of the gradual revelation of uncertainty, but their worst effects also cannot be tolerated. To be robust against these worst-case outcomes, we need constraints that eliminate from the feasible solution space those decisions that may lead to them, while, for additional robustness, we must use a dynamic risk-averse objective as a representation of decision makers' risk averseness (a behavioral consideration). In the literature, constraints are specifically used for robust decision-making, especially in the safe reinforcement learning domain [5][6][7], while the use of a risk-averse objective function is also prevalent for decision robustness [8][9][10][11][12]. Recently, the simultaneous use of constraints and risk aversion has also been regarded as an effective approach for robust decision-making [1,13,14]. This leads us towards multi-stage risk-averse constrained optimization models. Studies have used stochastic dual dynamic programming (SDDP) to solve these kinds of models [9,15]. However, SDDP assumes the problem's convexity and a finite number of random scenarios for its optimal convergence, which is restrictive, as many real-world problems are non-convex, e.g., humanitarian relief allocation with deprivation costs [16], requiring multi-stage solution approaches that can handle non-convexities. Although [17] did not presume the problem's convexity while using disciplined convex-concave programming for multi-stage risk-averse constrained optimization, optimal convergence was not established either theoretically or empirically. In this regard, as a model-free machine learning-based framework [18,19], reinforcement learning (RL) can be a better solution approach for non-convex problems. It has shown state-of-the-art performance in various real-world domains, e.g., cloud computing [20], finance [21,22], transportation [23,24], energy [25,26], inventory management [27,28], manufacturing or production scheduling [29], and routing [30]. However, none of the above applications requires robust, high-stakes decision-making, and thus expectation-based unconstrained RL is used. As we are considering high-stakes decision-making under uncertainty, risk-averse RL algorithms with constraint-handling abilities are required.
In this context, when it comes to constrained RL, besides other approaches, such as mapping any action chosen by the policy to a constraint-satisfying one, which may require domain knowledge to implement [31], the most common and, so far, the best approach for solving constrained MDPs is via Lagrangian dual (or augmented Lagrangian dual) optimization [32,33]. In this method, a constrained multi-stage optimization problem is converted into an unconstrained min-max problem, which is then solved by traditional RL algorithms. Besides its advantages, such as allowing non-convex constraints and not requiring any prior knowledge, it has theoretical guarantees for convergence as well. In this regard, it was proven that constrained MDPs have zero duality gap with no assumption on the problem's convexity [34]. This result explains the performance of primal-dual methods such as primal-dual policy optimization [35] and Lyapunov-based policy learning [36]. However, all of these methods consider cumulative constraints, the satisfaction of which does not imply the satisfaction of instantaneous constraints [37]. This is why an augmented Lagrangian-based mechanism for constrained RL was recently developed for instantaneously constrained MDPs by [37], which extended the theoretical results from [34]. However, all of the above constrained-RL developments are in the realm of expectation-based RL [12,38], while no research has been performed on handling constraints in risk-averse RL. The reason behind this paucity of research is that risk-averse RL is itself in its embryonic stage [12]; adding constraints to it is the next stage of investigation, which no one has entered yet.
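The Lagrangian dual recipe described above can be sketched on a toy problem (the bandit-style setting and all numbers are our own illustration, not from the cited works): the primal step picks the best action under the current Lagrangian, and the dual step performs projected subgradient ascent on the multiplier.

```python
import numpy as np

# Toy constrained problem: maximize reward r(a) subject to a cost
# budget c(a) <= d, solved via the Lagrangian dual.
r = np.array([1.0, 2.0, 3.0])   # reward of each action
c = np.array([0.0, 1.0, 4.0])   # constraint cost of each action
d = 1.5                          # constraint budget

lam, eta = 0.0, 0.05             # Lagrangian multiplier and dual step size
for _ in range(500):
    # Primal step: best action under the current Lagrangian r - lam * c.
    a = int(np.argmax(r - lam * c))
    # Dual step: projected subgradient ascent on the multiplier.
    lam = max(0.0, lam + eta * (c[a] - d))

best = int(np.argmax(r - lam * c))  # settles on the best feasible action
```

The multiplier oscillates around the value at which the budget binds; the key point, mirrored in the RL setting, is that the constrained problem is solved by alternating unconstrained primal optimization with a simple dual update.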

Research Contributions
Therefore, we have initiated the new research domain of constrained risk-averse RL to ensure optimal convergence of multi-stage risk-averse constrained mathematical programs, which can be easily reformulated into their stochastic decision-making counterpart, namely risk-averse Markov decision processes with constraints [39,40]. In doing so, we make no assumption on the problem's convexity. However, some not-so-restrictive assumptions are made on the constraints and the risk-averse objective. All the constraints used in this research are non-risk constraints. For the risk-averse MDPs with cumulative expectation-based constraints, we assume those conditional spectral risk measures that satisfy conditions related to the derivative and integral of their risk-aversion function, given on pages 7-9. Furthermore, for the incorporation of deterministic instantaneous constraints, the translation invariance property of coherent risk measures is made as one additional assumption on conditional spectral risk measures. This development will help solve real-world problems in many domains that are not yet considered solvable. For example, in humanitarian logistics, deprivation cost is an important metric for measuring human suffering due to deprivation of necessities. However, many researchers do not consider this cost, in order to keep their models tractable [41]. Despite the need for accurate and robust decision-making, no model in the disaster management literature has yet combined constrained optimization, risk aversion, and deprivation costs. This is due to the computational intractability issue, which is solved by this research [42].
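For concreteness, the translation-invariance property invoked above states that adding a deterministic quantity to a random cost shifts a coherent risk measure by exactly that quantity; in generic notation (with $\rho$ a coherent risk measure, $Z$ a random cost, and $c$ deterministic):

```latex
\rho(Z + c) = \rho(Z) + c .
```

This is the property that later allows deterministic instantaneous constraint costs to be moved outside the conditional risk operator when forming the dual objective.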
The main contributions of our research are summarized as follows:
• To prove the strong duality of spectral risk-averse MDPs with expectation-based cumulative and/or deterministic instantaneous constraints, without making any assumption on the problem's convexity.
• To propose a constraint-handling mechanism for risk-averse reinforcement learning based on the strong duality results.
• To theoretically and empirically establish the convergence of the proposed constraint-handling mechanism.
For ease of understanding, the graphical representation of this research is shown in Figure 1. Subsequently, we first specify the preliminaries and mathematical notations in Section 2, which are crucial for understanding our theoretical results in Sections 3 and 4 for risk-averse MDPs with cumulative and instantaneous constraints, respectively. Based on these results, we develop a constraint-handling mechanism in Section 5, where we also prove the convergence of this newly developed approach. After this, in Section 6, we empirically validate the convergence of the proposed mechanism using a multi-stage risk-averse constrained optimization problem. Findings are discussed in Section 7, and, finally, we conclude in Section 8.

Preliminaries
Let T be the index set of time periods (with indices starting from one) in a finite- or infinite-horizon MDP; then, the key elements of the MDP are as follows.

Action Space
Here, a_t ∈ A ⊆ ℝ^n is the feasible action the RL agent can take in period t ∈ T, which belongs to the action space A, with n being the action dimension, i.e., the number of decision or control variables (continuous or discrete). The action space is assumed to be compact.

State Space
Here, s_t ∈ S ⊆ ℝ^m is the state belonging to the state space S, with m being the state dimension, i.e., the number of state variables (continuous or discrete). Similar to the action space, the state space is also assumed to be compact. The state variables specify the state of the MDP in a particular period.

Policy Space
Here, π ∈ P is a feasible policy in the arbitrary policy space P. P forms a measure space, being a collection of Borel sets of the action space A ⊆ ℝ^n.

Reward and Constraint Functions
There are cost functions g_{0t} for the objective and g_{it}, i ∈ P, for the constraints in each period t ∈ T, and γ is the discount factor, which is assumed to lie in the open interval (0, 1). It specifies the importance of future rewards in the value of the current period's state: the greater its value, the more important the future rewards are for the current state's value.
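A small sketch (with toy numbers of our own) of how γ controls the weight of future rewards in a discounted return:

```python
# Discounted return over a four-period reward stream: each period-t
# reward is weighted by gamma**t (zero-based indexing), so a larger
# gamma makes future rewards matter more in the current state's value.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
myopic = discounted_return(rewards, 0.1)       # future barely counts
farsighted = discounted_return(rewards, 0.9)   # future dominates the value
```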

Reference Equations
Before diving into the theory, we first state three equations, (1)-(3), which will be referred to throughout this paper.
Here, E[·] and ρ[·] represent the expectation and the conditional (conditioned on the realized trajectory up to the current state in the MDP) spectral risk measure operators, respectively, and the corresponding two equations, (1) and (2), are the cumulative-expectation and dynamic-risk-measure objective functions, respectively. In both (1) and (2), the periodic costs g_{0t}(s_t, a_t) are accumulated with discounting. In (3), corresponding to every instantaneous constraint cost function g_{it}, i ∈ P, t ∈ T, the multipliers λ'_{it} are the components of the Lagrangian vector λ' for the dual of "Problem 2", referred to as "Dual 2" in Section "Strong Duality for 'Problem 2'", on page 13.
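To make the contrast between the expectation operator and a spectral risk operator concrete, here is a hedged sketch using the empirical CVaR of a cost sample; the sample values are illustrative, and this static one-period estimator is a simplification of the conditional, dynamic risk measures used in this paper:

```python
import numpy as np

def cvar(costs, alpha):
    """Empirical CVaR_alpha of a cost sample: the mean of the worst
    ceil((1 - alpha) * n) outcomes (higher cost = worse)."""
    costs = np.sort(np.asarray(costs, dtype=float))
    k = int(np.ceil((1.0 - alpha) * len(costs)))
    return float(costs[-k:].mean())

sample = [1.0, 2.0, 3.0, 4.0, 10.0]
expectation = float(np.mean(sample))   # risk-neutral value, as in (1)
risk_value = cvar(sample, 0.8)         # averse to the worst 20% of outcomes
```

Note that at α = 0 the CVaR collapses to the plain expectation, while larger α focuses the measure on the unfavorable tail, which is precisely the behavioral risk averseness the dynamic objective encodes.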
Lastly, we define M(S, A) as the set of all measures defined on the set of ordered pairs of states and actions S × A, and T as the set of occupation measures originating from the feasible policy space P, where the occupation measure is given as (A4) in Appendix A.
To initiate the field of constrained risk-averse RL, we will prove the following: risk-averse MDPs with a dynamic spectral risk objective and expectation-based cumulative constraints have zero duality gap in the generic policy space P (Theorem 1, in Section "Strong Duality for 'Problem 1'", on page 6). Based on Theorem 1, risk-averse MDPs with deterministic instantaneous constraints also have zero duality gap (Theorem 2, in Section "Strong Duality for 'Problem 2'", on page 12), without any assumption on the problem's convexity. These results imply that these two classes of problems can be solved exactly in the domain of the dual variables, where the problem becomes convex (in terms of the dual variables).

Risk-Averse Markov Decision Processes with Cumulative Constraints
In the literature [17], a constrained risk-averse MDP refers to an MDP in which both the objective function and the constraints are risk-averse. However, we will use a varied version of it in which the risk constraints [17] are substituted with cumulative expectation constraint(s) (9), which are the same as the cumulative constraints in constrained MDPs [34]. The mathematical program for this varied version is as follows, and we will refer to it as "Problem 1". In "Problem 1", (5)-(8) are the nested risk-averse objective, the transition function set, the policy inference, and the initial state being sampled from its probability distribution, respectively.

Strong Duality for "Problem 1"
Let λ be the Lagrangian vector, with its components being the Lagrangian multipliers corresponding to all the cumulative constraints (9) in "Problem 1" above. Then, (10) below is the Lagrangian function. With fixed Lagrangian multipliers λ_i, i ∈ P, if the Lagrangian function (10) is maximized over the policy space P, then it becomes the dual function, which forms a constrained optimization problem with (6)-(8). Its solution provides an upper bound (for a maximization problem, or a lower bound for a minimization problem) on the optimal value of "Problem 1" (the primal problem).
In addition to optimizing (maximizing) over the policy space P, if we also optimize (minimize) over the Lagrangian multipliers λ_i, i ∈ P (the dual variables), as in (11) below, the resulting min-max optimization problem is the dual problem of "Problem 1", which we will refer to as "Dual 1". Upon its solution, we obtain the tightest (best) upper bound on the optimal value of "Problem 1" [17], subject to (6)-(8).
In general, solving "Dual 1" does not yield the solution of "Problem 1", as there may not be any relation between the optimal values of the two problems. However, if strong duality holds for "Problem 1", i.e., the duality gap Δ between "Problem 1" (primal) and "Dual 1" (dual) is zero, then the optimal value P* of "Problem 1" equals the optimal value D* of "Dual 1", i.e., P* = D*. In Theorem 1 below, we prove just that (with certain assumptions on the risk-aversion function in (5), given in Theorem 1's proof) for both finite and infinite planning horizons with a discount factor γ ∈ (0, 1), using perturbation theory.
We define the perturbation function associated with "Problem 1" for any value of the perturbation components ξ_i, i ∈ P, of the perturbation vector ξ, subject to (6)-(8); we will refer to it as "The Perturbation function". Theorem 1 will be proved by connecting the strong duality of "Problem 1" with the concavity of "The Perturbation function", starting from "Proposition 1" below. Proposition 1. If Slater's condition holds for "Problem 1" and "The Perturbation function" is concave, then strong duality holds for "Problem 1".
Proof. "Proposition 1" was proven by [43] in Corollary 30.2.2, and, as Slater's condition holds by the assumption of Theorem 1, the only thing left to prove is the concavity of "The Perturbation function" for any two perturbation vectors ξ_1, ξ_2. In the case when either perturbation makes "The Perturbation function" infeasible, concavity holds trivially. So, we will prove the existence of a suitable mixing policy π_μ that is feasible for "Problem 1", in order to prove its strong duality.

Expectation-Based Constraints
The manipulations of the expectation-based constraints (9) that we follow in this theorem are from [34]. However, due to notational differences, we have included all of these manipulations in terms of our notation in Appendix A for the readers' ease of understanding.

Risk-Averse Objective
Now, for the spectral risk-averse objective, we can rewrite (5) as (16). In (16), the inner function is the mapping from a trajectory to its discounted sum of rewards.

F is the cumulative probability distribution function that maps the discounted sum of rewards to its cumulative probability, given a policy π ∈ P, with T being the cardinality of the set T (the number of periods in the planning horizon).

Since the cost functions g_{0t} are bounded (by the assumption of Theorem 1), the dominated convergence theorem (DCT) holds, allowing us to reverse the order of integration and summation in (16). With the use of conditional probabilities and the Markov property of MDPs, we can rewrite (16) as follows:

In (17), the transition kernel represents the probability of a state given the previous state and action. Then, (17) can be rewritten step by step up to (20). Now, using the integration-by-parts formula, we can expand (20), and, by applying the integration-by-parts formula recursively to the second term each time it is generated, we can expand the expression further. If we continue the recursive application of the integration-by-parts formula infinitely many times, we obtain (23), provided that derivatives of all orders exist for the risk-aversion function f_1 (with respect to s_1) and the remainder vanishes as n → ∞. This holds in the case of CVaR and mean-CVaR, given their piecewise-constant risk-aversion functions f_1 (which have vanishing higher-order derivatives on each piecewise component after splitting the integration interval), while for the rest of the spectral risk measures, it may or may not hold, depending on the risk-aversion function itself along with the cumulative probability distribution function [44]. As a side note, it is important to search for and develop risk-aversion functions for which (23) can be written for (22); this will be a direction of our future research. Now, to integrate all the individual terms with respect to the first action a_1, expression (23) can be rewritten accordingly. In the resulting expression, the terms can be expanded individually (similarly to the expansion of (20)) by recursive application of the integration-by-parts formula (infinitely many times), given that the derivatives of all orders of the function f_1 are well-defined with respect to a_1 and the remainder vanishes as m → ∞, where n is the order of the f_1 derivative with respect to s_1, and m is the order of the f_1 derivative with respect to a_1.
After the expansion of these terms, we will have terms similar to the individual terms present in (23). The integrals (with respect to the other states and actions, e.g., s_2, a_2, s_3, a_3, ...) of these resulting terms can be expanded similarly, using the mechanism of recursive by-parts integration as used above with s_1 and a_1.

After recursive integration (applying the by-parts formula recursively infinitely many times) with respect to the states and actions in all the time periods s_1, a_1, s_2, a_2, ..., s_T, a_T, and given the fact that all the resulting multiple-integral terms will always contain an integral with respect to each state and action, we obtain (26). Note that, in (26), each term has one integral (with respect to each state and action variable) fewer than the expression resulting from the infinitely recursive application of the by-parts formula to (20), since one integral (with respect to each state and action variable) is consumed by the expression of f_3 in (25). As we did while moving from (A2) to (A3) in Appendix A, expression (25) can be reduced further. From the DCT and (A4) in Appendix A, we can write the following form of (26) (similar to (A5) in Appendix A) by interchanging the positions of the integrals and the summation.
Using f_5 and the equation of the occupancy measure (A6) in Appendix A, we can write (29), which can then naturally be rewritten as the objective of "The Perturbation function".

Concavity of the Perturbation Function
From (A7) in Appendix A and (31), "The Perturbation function" itself can be written as the mathematical program below (given that (31) is to be optimized over the set of feasible occupation measures T), subject to (6)-(8). With the only consideration being the risk-averse objective, the occupation measure set T (in (4)) being convex and compact (compact meaning closed and bounded) follows directly from Theorem 3.1 in [45], which allows us to write (33). Now, for two feasible policies π_1 ∈ P and π_2 ∈ P, which are optimal for the perturbation vectors ξ_1 and ξ_2 considered above, respectively, let their corresponding occupation measures be given as in (A4). Then, from the convexity of the set T, there must be a feasible policy π_μ ∈ P whose occupation measure is a convex combination of the two, and this measure will satisfy the constraints with the correspondingly mixed slack. From (A7) in Appendix A and (34), we can write (35). For the risk-averse objective, from (31) and the linearity of the integrals, (36) can be rewritten as (37). It follows that "The Perturbation function" is concave, which then implies, from "Proposition 1", that strong duality holds for "Problem 1".

Risk-Averse Markov Decision Processes with Instantaneous Constraints
Derived from "Problem 1", the problem class that we consider in this section has deterministic instantaneous constraints (40) instead of the cumulative expectation constraints (9). This problem will be referred to as "Problem 2", and its mathematical program is given subject to (6)-(8). The assumption of deterministic constraints is justified in many real-world contexts. Not only do risk constraints [5] make the problem complex, but they are also unnecessary in many real-world problems, due to the very reason for using constraints in the first place (in the presence of risk averseness): to completely eliminate extreme realizations of certain stochastic variables, regardless of risk averseness or decision makers' behavioral aspects. In addition, if the robustification of the feasible solution space is required using box uncertainty sets, the constraints become deterministic in the robust counterpart program.

Strong Duality for "Problem 2"
Now, based on Theorem 1 in Section 3, we will prove two theorems (Theorems 2 and 3) establishing strong duality for "Problem 2" above. Based on these theorems, we will develop a constraint-handling mechanism in Section 5 for solving "Problem 2", where we will also perform a theoretical convergence analysis of the resulting algorithm in Section 5.5.
In this regard, we first make the following assumption, which we will refer to as "Assumption 1". Let λ' be the Lagrangian vector, with components λ'_{it}, i ∈ P, t ∈ T; then, the dual of "Problem 2", which we will refer to as "Dual 2", is given subject to (6)-(8). It is important to note here that λ (with components λ_i, i ∈ P) is the Lagrangian vector for "Problem 1", while λ' is the Lagrangian vector for "Problem 2", with components λ'_{it}, i ∈ P, t ∈ T. If we consider all the constraint cost functions g_{it}, i ∈ P, t ∈ T, in (9) as deterministic, and the risk measure in (5) satisfies translation invariance (a property of coherent risk measures [46]), we can rewrite (42) as (43) (referring to (3)), which can also be regarded as the min-max objective of "Dual 1" in (11). As a side note, the dual objective of "Problem 1" is cast in this form because it is what our model-free decision-making agent (heuristic) obtains during RL training (for the optimization of "Problem 2"). From the above, we can make the following implication: suppose λ*_{it}, i ∈ P, t ∈ T, are the optimal Lagrangian multipliers in "Dual 2"; then, given Theorem 1, we can write the next relation, and the subsequent one follows directly. Hence, from (45) and (47), the strong duality condition is satisfied for "Problem 2".


Due to the time-varying Lagrangian multipliers λ'_{it}, i ∈ P, t ∈ T, the number of Lagrangian multipliers grows with the number of time periods and, because of this, memory complexity may become a concern. Also, for infinite-time-horizon problems, time-varying Lagrangian multipliers become infeasible. To counter this effect, we will prove Theorem 3 below.
Note: It is worth noting that the time invariance of the Lagrangian multipliers (in the context of "Problem 2") follows directly from the fact that we used λ_{it} = λ_i for all i ∈ P, t ∈ T while proving Theorem 2 from Theorem 1 (on page 15), when we moved from (42) to (43). However, if we start from the statement of Theorem 2, we need to prove the time invariance of the Lagrangian multipliers λ'_{it} in "Dual 2", which we perform in Theorem 3 below (a theoretically redundant result). The corresponding problem, subject to (6)-(8), has time-invariant Lagrangian multipliers. Proof. By definition, if a policy optimal in "Dual 2" is also optimal for "Problem 2", strong duality will hold for the problem given in the statement of Theorem 3, which then implies, with Theorem 2, that the optimal solution of the above-mentioned problem, which has time-invariant Lagrangian multipliers, is also an optimal solution for "Dual 2".
Note: It is important to mention that there can be optimal solutions π ∈ P for "Dual 2" that have time-variant Lagrangian multipliers, in addition to the optimal solution with time-invariant Lagrangian multipliers given by Theorem 3 above.

Augmented Lagrangian-Based Constraint Handling Mechanism
In this section, based on Theorems 1 and 2, we develop a risk-averse RL constraint-handling algorithm (Algorithm 1) for optimizing "Problem 2" (a risk-averse MDP with instantaneous deterministic constraints). At the end, we also theoretically prove its convergence. However, to take advantage of Algorithm 1, "Assumption 1" must be satisfied during its execution, because Theorem 2 is based on this assumption. While the assumption can be automatically satisfied for some problems, we ensure its satisfaction during the execution of Algorithm 1 by utilizing the so-called "Clipping method" proposed by [47]. This method is described as follows.

Clipping Method
In this method, the values of the constraints for the states where the constraints are not violated are reduced to zero, i.e., if the constraints are satisfied, then we consider the constraint-violation penalty to be zero. For an illustration of the technique, consider a set of instantaneous constraints g_{it} ≤ 0, i ∈ P, t ∈ T; applying ReLU to each constraint value zeroes out the penalty wherever the constraint is satisfied. Other non-negative activation functions, like Softplus and Sigmoid, can also be used [47].
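A minimal sketch of the clipping method (with toy constraint values of our own): ReLU applied to the constraint value yields a zero penalty whenever the constraint g ≤ 0 is satisfied.

```python
import numpy as np

# "Clipping method": ReLU zeroes the penalty wherever the
# instantaneous constraint g <= 0 is satisfied.
def clipped_violation(g):
    return np.maximum(np.asarray(g, dtype=float), 0.0)

g_values = [-2.0, -0.1, 0.0, 0.5, 3.0]    # negative/zero = satisfied
penalties = clipped_violation(g_values)    # only violations are penalized
```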

Surrogate Objective
As RL is mostly used for non-convex multi-stage optimization, it is better, in order to handle constraints in this setting, to use the augmented Lagrangian instead of the plain Lagrangian formulation [48]. We use it in our constraint-handling algorithm by first designing a surrogate objective, and then a surrogate reward function, from the augmented Lagrangian formulation of "Problem 2". Let the time-invariant quadratic penalty coefficients correspond to each of the instantaneous constraints (40) in "Problem 2". Based on the strong duality result for "Problem 2" (Theorem 2) and the Lagrangian-based objective (41) of "Dual 2" (with time-invariant Lagrangian multipliers), the augmented Lagrangian-based surrogate objective for "Problem 2" with the clipping method (which will be directly optimized by an RL algorithm) is as follows:

Reward Function
Correspondingly, the instantaneous reward function (the stage-wise objective component) is given as follows: the first term is the discounted reward function, while the second and third terms are the discounted linear and quadratic penalty terms for constraint violations, respectively, which discourage the agent from violating the hard constraints.
In Equations (49) and (50) above, the ReLU function is used to implement the clipping method described above, so as to realize "Assumption 1" during the execution of the risk-averse RL algorithm.
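The per-step surrogate can be sketched as follows (a simplified, hypothetical form of (50): `lam` and `c` stand for the Lagrangian multipliers and quadratic penalty coefficients, and discounting is omitted for brevity):

```python
import numpy as np

def surrogate_reward(r, g, lam, c):
    """Simplified per-step surrogate: the reward minus the linear and
    quadratic (augmented Lagrangian) penalties on the clipped
    constraint violations ReLU(g)."""
    v = np.maximum(np.asarray(g, dtype=float), 0.0)   # clipping method
    return float(r - np.dot(lam, v) - 0.5 * np.dot(c, v ** 2))

lam = np.array([1.0, 2.0])   # Lagrangian multipliers (illustrative values)
c = np.array([4.0, 4.0])     # quadratic penalty coefficients (illustrative)
shaped = surrogate_reward(5.0, [-1.0, 0.5], lam, c)  # one violated constraint
```

A satisfied constraint contributes nothing to the penalty, so the agent's reward is untouched wherever it stays feasible.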

Developed Mechanism
Based on the surrogate reward function (50) above and [47], we have designed "Algorithm 1" for the solution of "Problem 2" in the presence of only deterministic instantaneous constraints. In the algorithm, the Lagrangian multipliers are updated while, in tandem, we also update the quadratic penalty coefficients. For ease of understanding, the block diagram of the workings of the risk-averse actor-critic algorithm [12] with the proposed augmented Lagrangian-based constraint-handling mechanism is shown in Figure 2. Here, the upper shaded region contains all the learning components of the actor-critic algorithm, while the shaded region at the bottom contains the proposed constraint-handling mechanism, which updates the Lagrangian multipliers and quadratic penalty factors (inside the environment) after every fixed number of the risk-averse actor-critic algorithm's training iterations. For elucidation on the conditionally elicitable dynamic risk measure and scoring function mentioned in the diagram, readers are referred to [12].
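The outer dual loop of the mechanism can be sketched as follows (a hypothetical minimal form: the step size `eta`, growth rate `rho`, and the use of average clipped violations are our own illustrative choices, not the exact update rules of Algorithm 1):

```python
import numpy as np

# After every block of inner actor-critic iterations, raise each
# multiplier by the average clipped violation and stiffen the quadratic
# penalty on any constraint that is still being violated.
def dual_update(lam, c, avg_violation, eta=0.1, rho=1.5):
    lam = lam + eta * avg_violation                 # multiplier ascent
    c = np.where(avg_violation > 0.0, rho * c, c)   # grow active penalties
    return lam, c

lam, c = np.zeros(2), np.ones(2)
for _ in range(3):                                  # three outer updates
    avg_violation = np.array([0.4, 0.0])            # measured mean ReLU(g)
    lam, c = dual_update(lam, c, avg_violation)
```

Early in training the penalties are small and the agent explores freely; as the multipliers and penalty factors grow, the violated constraint is penalized ever more heavily while the satisfied one is left alone.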

Theoretical Results for Convergence
Now, we will perform the convergence analysis of the above algorithm by proving Theorem 4; but first, we will prove a proposition, which will be used in that proof. Consider "Problem A" and "Problem B", each subject to (6)-(8). Let us fix the Lagrangian multipliers for all the constraints, except constraint i ∈ P, to arbitrary values (where these fixed values can differ between "Problem A" and "Problem B"), such that the Lagrangian multiplier λ_i is the only dual variable in both "Problem A" and "Problem B".

Suppose the optimal values of the two problems are as defined above. Also, let π* ∈ P be an optimal policy for "Problem 2" under which γ^(t-1) g_{it} ≤ 0, i ∈ P, t ∈ T, holds; then, the following relation will also hold. Proof. From the definitions, we can make the corresponding implication, where, instead of λ_i, we use the fixed multipliers. By assuming that π* ∈ P is the policy that is optimal for "Problem 2" and, thus, feasible, we can write the corresponding inequalities for "Problem A" and "Problem B" (given that strong duality holds for "Problem 2"). From (56), (57), and (58), we can write (59). Since we have made no assumption on the constraint i ∈ P (whose Lagrangian multiplier (dual variable) is not fixed), relation (59) holds for all the constraints indexed by the set P, i.e., (60). By direct implication from (60), we have proved (54).
Subject to (6)-(8), π* ∈ P is the optimal solution for "Problem 2" under which γ^(t-1) g_{it} ≤ 0, i ∈ P, t ∈ T, holds. Proof. Again, let us fix the Lagrangian multipliers for all the constraints to arbitrary values. Here, we consider two cases to prove our proposition. Consider the objective function of "Problem 1"; there are only two cases possible for the constraint i ∈ P. If, in Case 1, we take the policy optimal for "Problem B", the resulting expression contradicts Proposition 2 above. Hence, the claim is proved. From Proposition 2 and the strong duality of "Problem 2", the desired relation is implied, which completes the proof.

Numerical Example
In this section, we showcase the implications of our theoretical results. All the computational experiments were conducted with a 78.16 GB hard disk, 12.68 GB RAM, an Intel(R) Xeon(R) CPU @ 2.00 GHz dual-core processor, the Ubuntu operating system, and Python 3.10. For the deep learning implementation, feed-forward neural networks were used for both the actor and the critic, with two and six layers, respectively. They were implemented in PyTorch and trained on a Tesla T4 (CUDA version 12.0) graphical processing unit (GPU). A non-convex, non-differentiable, multi-stage disaster response relief allocation problem [49] with three affected areas was considered as the numerical example, the detailed formulation of which is given in Appendix B. It was assumed that all the affected areas have uniform demand stochasticity, and relief is supplied to them from a capacitated local response center. Dynamic conditional value at risk (DCVaR) is used as the risk-averse stochastic ordering for the stochastic objective. The problem has an upper bound (constraint) on the deprivation level of the third affected area, because this area is assumed to be relatively more vulnerable to disruption than the rest [49,50]. These disruptions can take the form of secondary disasters inside the affected areas or logistical blockages hindering relief supply, which would have devastating effects (an increase in deprivation of life-sustaining resources) on the affected population. So, as a preparation tactic against these disruptions, deprivation levels are kept low a priori by using the constraint (A18) in Appendix B (given at the end of Appendix B). Note that if we assume the quadratic penalty coefficient to be zero, the classical Lagrangian dual method is recovered [32,34,35,36]. In this regard, using our formulation, we train the agent to be risk-averse in the face of uncertainty while also requiring it to satisfy the hard constraint (A18) in the Appendix via Algorithm 1, and we find a locally optimal solution using the conditionally elicitable risk-averse actor-critic algorithm from [12]. We see that the CVaR of the episodic rewards converges, as shown in Figure 3a-c, and the constraint violations decrease until they vanish, as shown in Figure 4a-c, for all values of the risk-aversion confidence level (0.6, 0.8, and 0.99) considered in our simulations.

We also show constraint violations and convergence curves (with a risk-aversion confidence level of 0.99) obtained when negative reward shaping was used during training. Here, a fixed constraint violation penalty factor is multiplied by the constraint violation and then added to the reward function [51-53]. As we can see in Figure 5, with the larger fixed penalty factor of 10^0, the constraint violations vanish to zero, but, as shown in Figure 6, the CVaR of episodic rewards converges to a higher cost (red curve) relative to that of the augmented Lagrangian-based approach (blue curve). This is because we are not optimizing over the fixed penalty factors, so these penalty factors are not the optimal Lagrangian coefficients of the dual problem. Consequently, the optimal solution of the regularized unconstrained problem differs from that of the primal problem. Moreover, large constraint violation factors early in training may hinder the RL agent's exploration, leading it to converge to a local optimum of the primal and/or regularized unconstrained problem. On the other hand, with the lower fixed penalty factor of 10^-1, although the reward value converges to lower costs, as shown in Figure 7, the constraint violations do not reach zero, as shown in Figure 8. This issue is resolved by our proposed constraint-handling mechanism: at the beginning of training, the action space is less constrained and the agent can learn freely, but as training progresses, the constraint violations are gradually accentuated through the increase in the Lagrangian multipliers and quadratic penalty factors in (49) and (50), respectively.
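In contrast to a fixed penalty factor, our mechanism updates the multipliers and quadratic coefficients during training. A minimal sketch of one dual ascent step follows; the step size 0.005 matches the experiments, while the geometric growth factor and the cap on the quadratic coefficients are assumed values for illustration.

```python
def dual_update(lams, zetas, violations, step=0.005, growth=1.5, zeta_max=1e3):
    """One dual-ascent step for inequality constraints g_i <= 0.

    Multipliers grow in proportion to the observed violation (projected to
    stay non-negative); quadratic coefficients grow geometrically, gradually
    hardening the constraints as training progresses.
    """
    new_lams = [max(0.0, lam + step * g) for lam, g in zip(lams, violations)]
    new_zetas = [min(zeta * growth, zeta_max) for zeta in zetas]
    return new_lams, new_zetas
```

Because the multipliers start small, early training behaves like the low-penalty regime (free exploration), while late training behaves like the high-penalty regime (vanishing violations), combining the advantages of both fixed-penalty settings discussed above.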

Discussion
Throughout this work, we have developed a duality theory for risk-averse reinforcement learning problems with non-risk constraints. Specifically, in the case of an arbitrary policy space, the duality gap is zero, and thus the primal problem is perfectly solvable in its dual domain, where it becomes convex in the dual variables. This is significant because it establishes that the constrained problem and the regularized unconstrained problem trace the same Pareto optimal front and are therefore equivalent. Despite these theoretical results, however, such problems may be computationally intractable, because evaluating the dual function over an arbitrary policy space is practically infeasible. In place of arbitrary policy spaces, linear approximations or neural networks are used, which may weaken duality or even widen the duality gap. Despite this approximation issue, the theoretical results, the resulting algorithm, and the empirical demonstrations provide a way to solve the constrained risk-averse policy optimization problem without an exhaustive search over the penalty weights (or factors) for constraint violation, as was performed in [54-56].

Appendix B
In this appendix, the multi-stage optimization problem used for the numerical simulations is rigorously defined. This multi-stage optimization problem concerns relief allocation among a set of affected areas during disaster response and is adapted from the second nonlinear program (NLP2) in [49].
The problem is shown in the form of a diagram in Figure A1. Before discussing the abstract mathematical programming model, we first give its notation.

Sets:
K: Index set for affected areas (AAs)
K′: Index set for affected areas whose terminal deprivation level is constrained
Ω: Index set for scenarios
T: Index set for periods (or stages)
Note: The size (cardinality) of each of the above sets is denoted by placing vertical bars around its name; e.g., the size of the set of periods is denoted |T|.

Objectives: Three objectives are considered, corresponding to efficiency, effectiveness, and equity criteria, as discussed below. As the efficiency metric, logistics cost (LC) is used [16]. The deprivation cost (A9) is the effectiveness metric. Thirdly, the sum of all AAs' TPCs over all time periods and relief items is considered. The TPC (A10) is used as the equity metric: it ensures the AAs' population survival beyond the planning horizon (in a particular scenario) [61] and depends on the deprivation amount at the end of the planning horizon.

Single objective function: The above gives three objectives, namely, logistics cost (A8), deprivation cost (A9), and TPC (A10). To reduce the problem to a single objective, we use the weighted sum method (WSM), which yields the single objective in (A13).

Stochastic ordering: In our problem, objectives (A9) and (A10) are stochastic due to stochastic demand, which requires the use of a stochastic ordering. As we are investigating risk-averse decision-making situations, any spectral risk measure, such as CVaR or mean-CVaR, can be used as the dynamic risk measure M_0 in (A14), applied to (A13).
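As an illustration of the CVaR ordering used here, the following is a simple empirical tail-average estimator over a finite cost sample. This is only a sketch of the static one-step measure; the paper's DCVaR is a dynamic, nested composition of such measures across stages.

```python
import math

def cvar(costs, alpha):
    """Empirical CVaR_alpha of a cost sample: mean of the worst (1 - alpha) tail.

    costs : iterable of realized costs (higher = worse)
    alpha : confidence level in (0, 1), e.g., 0.8 keeps the worst 20% of outcomes
    """
    sorted_costs = sorted(costs, reverse=True)               # largest costs first
    k = max(1, math.ceil((1.0 - alpha) * len(sorted_costs))) # tail sample size
    tail = sorted_costs[:k]
    return sum(tail) / len(tail)
```

For alpha close to 1, CVaR approaches the worst-case cost; for alpha = 0, it reduces to the plain expectation, which is why sweeping the confidence level (0.6, 0.8, 0.99 in our experiments) interpolates between risk-neutral and highly risk-averse behavior.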

Constraints:
First are the (functional) capacity constraints, which restrict the solution (decision) space such that relief items are allocated only from LRCs that are installed, and the total allocation (the combined allocation of all relief items to all AAs) from each LRC is less than its capacity in each period. The second set of constraints is on the robustified value of the terminal state variables of the AAs in K′, for all scenarios ω ∈ Ω. These constraints are essential to ensure that affected populations that are particularly vulnerable to logistical disruptions or secondary disasters are not left with huge deprivation at the end of the response operations, as that would be unethical.
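A feasibility check for the per-period capacity constraints can be sketched as follows; the dictionary layout for allocations is a hypothetical choice for illustration, not the paper's data structure.

```python
def capacity_violation(allocations, capacity):
    """Per-period LRC capacity check: total allocation across AAs <= capacity.

    allocations : dict {period: {aa_name: allocated amount}}  (hypothetical layout)
    capacity    : the LRC's capacity in each period
    Returns {period: violation}, positive where capacity is exceeded.
    """
    return {
        t: max(0.0, sum(per_aa.values()) - capacity)
        for t, per_aa in allocations.items()
    }
```

In the RL formulation these violations are exactly the g_i values fed to the augmented Lagrangian terms, so the constraint is enforced softly during training and hardens as the multipliers grow.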

Figure 1. Graphical representation of this research.

ϕ(.) is the risk-aversion function of the spectral risk measure (it depends on the decision maker's risk aversion), mapping the output of F(.) to the interval [0, 1]. ϕ(.) and its single-variable and cross derivatives of all orders (with respect to any state and decision variables) are assumed to exist and to be piecewise C∞ functions of all state and decision variables. Expanding over the trajectory (s_1, a_1, ..., s_T, a_T), we obtain the expression for M_0(π), π ∈ P, in expanded form.
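A discrete sketch of how a spectral risk measure weights sorted outcomes by ϕ is given below; the midpoint discretization is our illustrative choice. CVaR is recovered as the special case where ϕ is a step function.

```python
def spectral_risk(costs, phi):
    """Empirical spectral risk measure: phi-weighted average of sorted outcomes.

    phi maps u in [0, 1] to a non-negative weight; risk aversion requires phi
    to be non-decreasing when costs are sorted ascending, so that worse
    (larger) costs receive more weight.
    """
    n = len(costs)
    ordered = sorted(costs)                          # ascending: worst cost last
    weights = [phi((i + 0.5) / n) for i in range(n)] # midpoint rule on [0, 1]
    total_w = sum(weights)
    return sum(w * c for w, c in zip(weights, ordered)) / total_w
```

For example, phi(u) = 1/(1 - alpha) for u > alpha and 0 otherwise reproduces CVaR at confidence level alpha on this discretization.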

Algorithm 1: Augmented Lagrangian-based constrained risk-averse RL. Take ζ_u ∈ [1, ∞) and a dual ascent step size Λ (E defines the number of iterations after which the Lagrangian multipliers and quadratic penalty coefficients are updated).
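The overall training loop of Algorithm 1 can be sketched as follows. The schedule values E = 25 and step size 0.005 are taken from our experimental setup; the geometric growth factor, the initial ζ, and the single-constraint simplification are assumptions made for brevity.

```python
def train_constrained_risk_averse(agent_update, eval_violations,
                                  n_epochs=100, E=25,
                                  lam_step=0.005, zeta_growth=1.5):
    """Sketch of the augmented Lagrangian training loop.

    agent_update(lams, zetas) : runs one algorithmic epoch of the risk-averse
                                actor-critic on the penalized objective
    eval_violations()         : returns current constraint values g_i
                                (positive => violated)
    """
    lams = [0.0]      # multipliers initialized at zero, as in the experiments
    zetas = [1.0]     # quadratic coefficients start at 1 (assumed)
    for epoch in range(1, n_epochs + 1):
        agent_update(lams, zetas)
        if epoch % E == 0:                        # dual ascent every E epochs
            g = eval_violations()
            lams = [max(0.0, l + lam_step * gi) for l, gi in zip(lams, g)]
            zetas = [z * zeta_growth for z in zetas]
    return lams, zetas
```

The primal (policy) updates run at the inner timescale, while the dual variables change only every E epochs; this timescale separation is what lets the actor-critic approximately solve each penalized subproblem before the constraints are hardened.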

Figure 2. Block diagram of the risk-averse actor-critic algorithm with the proposed augmented Lagrangian-based constraint-handling mechanism.

Proposition 2. Consider the following two optimization problems, both with time-discounted objectives. If the optimal policy of "Problem B" is π* ∈ P, then, as ζ_i → ∞ for all i ∈ P, the stated condition holds.

In the experiments, the Lagrangian multipliers are initialized as 0, and a dual ascent (with dual ascent step size Λ = 0.005) is performed every 25 algorithmic epochs (each epoch comprises 666,675 MDP episodes) of the risk-averse actor-critic algorithm.

Figure 6. Convergence to a higher CVaR of episodic costs (red curve) with a fixed penalty factor of 10^0, relative to that of the augmented Lagrangian-based approach (blue curve).

Figure 7. Convergence curves of the CVaR of episodic rewards with a fixed constraint violation penalty factor of 10^-1 and with the augmented Lagrangian-based approach.

Figure A1. Visual description of the resource allocation problem in disaster response.

Initial deprivation level (at the start of the planning horizon) at AA k ∈ K in period 1, for all ω ∈ Ω
C: Capacity (considered flexible) of the LRC
p_k: Deprivation parameter at AA k ∈ K (used in the ADC and TPC functions)
q_k: Deprivation parameter at AA k ∈ K (used in the ADC and TPC functions)
T_k^ω: Threshold that the final robustified state variable values must not exceed, to ensure population survival beyond the planning horizon
Len: Length of one time period
weg_1: Weight for logistics cost in the single-objective formulation
weg_2: Weight for deprivation cost in the single-objective formulation
weg_3: Weight for TPC in the single-objective formulation

TPC function of the state value of AA k ∈ K in scenario ω ∈ Ω at the end of the planning horizon. The sum of all AAs' ADCs over all time periods and relief items is considered, with the ADC defined as a piecewise non-convex function; the TPC is likewise defined as a piecewise non-convex function. A third set of constraints ensures that all decision variables are non-negative.

These costs are considered random variables, either due to the stochastic policy π ∈ P and/or uncertainty in the MDP environment dynamics. In this regard, P is the index set (with indices starting from one) of the cumulative or instantaneous constraint cost functions in a particular period t ∈ T, and the subscript "0" in (2) is the index of the reward function.
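To illustrate what "piecewise non-convex" means here, the following is a purely hypothetical cost shape in the deprivation level d; the exponential part is a common deprivation-cost form from the literature, but neither it nor the jump is the paper's exact (A9)/(A10) definition.

```python
import math

def deprivation_cost(d, p, q, threshold=1.0):
    """Hypothetical piecewise non-convex deprivation cost in deprivation level d.

    Below `threshold`, cost grows exponentially with parameters p, q; beyond
    it, a fixed jump is added. The discontinuity is what breaks convexity,
    which is why convexity-dependent solvers such as SDDP cannot be applied.
    """
    base = p * (math.exp(q * d) - 1.0)       # smooth exponential part
    jump = p if d > threshold else 0.0       # discontinuity => non-convex
    return base + jump
```

A model-free RL agent only ever evaluates such a function at sampled states, so its non-convexity and non-differentiability pose no difficulty for the actor-critic updates.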

Assumption 1. The feasible policy space P of "Problem 2" has a non-empty relative interior.
Amount of relief resources allocated from the LRC to AA k ∈ K in scenario ω ∈ Ω and period t ∈ T
Deprivation level at AA k ∈ K in scenario ω ∈ Ω and period t ∈ T

Parameters:
a_kt: Per-unit accessibility-based delivery cost for transporting a unit quantity of relief resource from the LRC to AA k ∈ K in period t ∈ T, in all scenarios
D_kt^ω: Demand for relief resource at AA k ∈ K in scenario ω ∈ Ω and period t ∈ T; it is considered a stochastic variable

In each period t ∈ T and scenario ω ∈ Ω, there are state variables on which the policy π ∈ P is conditioned to make decisions dynamically. These variables have an in-going component s_kt^ω, which is used by the transition function to determine the corresponding out-going component.
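The role of the in-going and out-going state components can be sketched with a simplified transition for the deprivation state of one affected area; this is an illustrative stand-in under assumed dynamics, not the paper's exact transition function.

```python
def deprivation_transition(s_in, demand, allocation):
    """Illustrative transition for one AA's deprivation state.

    The in-going state s_in (accumulated unmet demand) grows by this period's
    stochastic demand and shrinks by the relief allocated, floored at zero.
    The returned value is the out-going state, which becomes the next
    period's in-going state.
    """
    return max(0.0, s_in + demand - allocation)
```

Chaining this map over periods, with demand drawn per scenario ω, produces the scenario-dependent state trajectories on which the policy conditions its allocation decisions.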