1. Introduction
In the pricing and revenue management literature, understanding customers’ purchasing behavior, typically captured by demand functions, is central to deriving revenue-maximizing pricing policies. This is particularly important for perishable products with limited inventory and a finite selling horizon. Following the foundational work of Gallego and Van Ryzin [1,2], traditional methods obtain optimal pricing policies by assuming the underlying demand functions are known beforehand. Although assuming demand functions are known a priori greatly simplifies the analysis of optimal pricing policies, it is often unrealistic for firms to have complete information about customers’ purchasing behavior in modern, highly uncertain markets [3]. Consequently, dynamic pricing and learning has attracted growing attention as a framework that explicitly accounts for demand uncertainty. This line of work jointly optimizes prices for revenue while learning customers’ purchasing behavior from observed sales data [4]. Existing studies on dynamic pricing and learning typically fall into three streams. The first is the parametric approach, which assumes that demand belongs to a known functional family (e.g., linear, exponential, or logit) and estimates the unknown parameters from sales data. The second is the nonparametric approach, which avoids committing to a specific functional form and instead exploits structural properties of the demand or revenue functions, such as monotonicity or Lipschitz continuity, to construct effective pricing policies from sales observations. Both parametric and nonparametric approaches rely on modeling assumptions that may not hold in practice, and model misspecification can therefore lead to suboptimal pricing policies [5]. The third is the model-free reinforcement learning (RL) approach, which makes no assumptions about the demand function or market model and instead learns pricing policies directly from interaction data. With recent advances in the RL literature, this paradigm has shown strong potential for solving complex dynamic pricing and learning problems [6].
Meanwhile, customer heterogeneity allows a firm to segment customers into distinct groups using collected sales data. The firm can then implement discriminatory pricing by charging different prices for the same product across groups, thereby increasing revenue [7]. In the extreme case, each customer can be treated as a distinct group; discriminatory pricing then becomes personalized pricing and can, in principle, yield the highest attainable revenue for the firm [8]. By the same reasoning, when the underlying demand functions are uncertain and must be learned while setting group-specific prices to maximize revenue, the problem becomes discriminatory dynamic pricing and learning [9]. Such problems are inherently more complex than standard dynamic pricing and learning, as they must account for customer heterogeneity in addition to demand uncertainty. Nevertheless, this direction is an inevitable trend for revenue-maximizing firms, driven by the growing availability of sales data and rapid advances in information technology and machine learning. As with standard dynamic pricing and learning, discriminatory dynamic pricing and learning can be addressed using three main methodological approaches, parametric, nonparametric, and RL, which differ in the extent to which they rely on assumptions about the underlying demand model [10,11,12].
Although discriminatory pricing can further increase a firm’s revenue and may sometimes benefit certain customer groups, customers who perceive the pricing as unfair relative to others may feel deceived and exploited [13]. Ihlanfeldt and Mayock [14] demonstrate the existence of discriminatory pricing against Black and Asian individuals in the housing market. Such perceived unfairness can lead to customer dissatisfaction, thereby weakening trust and loyalty toward the firm. In addition, discriminatory pricing has raised concerns among regulators and public authorities due to its potential adverse impacts on protected groups defined by characteristics such as race, gender, age, or other legally protected attributes [15]. This concern is driven by the possibility that protected groups may have higher willingness to pay due to historical disadvantage or unobserved heterogeneity, and such differences may translate into systematically higher prices for these groups [16]. Consequently, regulators have begun to constrain or closely monitor price differentiation practices. For example, New York State prohibited gender-based price discrimination in 2020 [17]. More recently, California expanded enforcement against gender-based price differences for substantially similar consumer products [18]. In financial services, the UK Financial Conduct Authority introduced general insurance pricing reforms that restrict price walking [19]; for home and motor insurance, these reforms require renewal quotes to be no higher than equivalent new-business quotes. Regulatory attention has further expanded from traditional anti-discrimination rules to AI-enabled decision systems. The EU AI Act adopts a risk-based framework and phases in obligations, including the early application of bans on certain prohibited AI practices [20]. In the US, the FTC has repeatedly emphasized that there is no AI exemption from consumer-protection law [21], and firms can face liability for unfair or deceptive practices when deploying algorithms that mislead consumers, embed bias, or lack adequate safeguards. Taken together, these developments imply that fairness constraints in pricing are not merely an ethical add-on; rather, they can function as operational compliance mechanisms that prevent large and systematic price disparities across groups, thereby reducing both legal and reputational exposure. These compliance pressures pose a technical challenge in discriminatory dynamic pricing and learning. When demand is uncertain, the firm must learn from sales data while simultaneously setting group-specific prices. At the same time, it must ensure that deployed prices remain within an acceptable fairness region in every selling period. This requirement is fundamentally instantaneous and hard in nature: many regulatory and compliance interpretations focus on individual-level or transaction-level harm, so even occasional violations can be unacceptable, even if the policy is fair on average.
In this work, we study discriminatory dynamic pricing and learning for perishable products under limited inventory. We focus on a specific form of fairness, namely price fairness, which requires inter-group price differences to remain within acceptable limits. In this problem, the initial inventory is fixed and the selling horizon is finite, with no replenishment throughout the selling horizon. Consequently, the firm must make pricing decisions under joint inventory and time constraints while additionally enforcing price fairness constraints when designing discriminatory dynamic pricing strategies. Motivated by the model-free advantage of reinforcement learning, we adopt RL as the backbone to solve this fairness-constrained dynamic pricing problem without relying on any assumptions about the underlying demand functions. However, existing RL approaches for fairness-aware pricing typically incorporate fairness by converting it into a soft penalty in the objective or into an expected cumulative constraint. Such methods do not guarantee instantaneous feasibility during training, and even at convergence they may still output constraint-violating actions. Moreover, the revenue-optimal fairness-constrained pricing policy often exhibits boundary-seeking behavior, meaning that optimal prices frequently lie near the feasibility boundary. Existing RL-based fairness-aware pricing approaches seldom recover this boundary-seeking behavior in an efficient and reliable manner. Our approach is designed specifically to address this gap by enforcing instantaneous and hard price fairness constraints while preserving the ability to search near the constraint boundary to recover high revenue. For clarity, our main contributions are summarized as follows:
We formulate fairness-aware dynamic pricing for perishable products with instantaneous and hard price fairness constraints as an action-constrained Markov decision process (ACMDP). The feasible action set is governed by coupled multi-dimensional constraints on the price vector. Therefore, prices for different customer groups cannot be chosen independently and must satisfy pairwise price-gap requirements within each selling period.
We incorporate an optimization-based Shield module into the interaction loop between the pricing agent and the market. The Shield module maps an infeasible price vector to a nearby feasible one by solving a convex quadratic program, which guarantees step-wise feasibility during both training and deployment. It also facilitates learning when the optimal constrained pricing policy is boundary-seeking by enabling safe exploration near the feasibility boundary.
We develop Shield Soft Actor-Critic (Shield-SAC), a model-free deep reinforcement learning (DRL) algorithm built on neural network function approximation. It learns the optimal pricing policy through interaction and thus does not rely on any assumptions about the underlying demand functions. Shield-SAC recovers revenue-optimal boundary-seeking behavior and achieves strong revenue performance while consistently enforcing the instantaneous and hard price fairness constraints.
Our work also helps connect the fairness-aware pricing literature with the RL literature. It further illustrates the promise of hybrid optimization-and-DRL methods for complex operational decision-making and learning problems with instantaneous and hard constraints.
3. Problem Formulation
In this paper, we study a price-based revenue management (RM) problem in which a firm must determine an optimal discriminatory dynamic pricing policy for multiple customer groups when selling a single perishable product. The initial inventory, I, is fixed and cannot be replenished, and the product must be sold within a finite selling horizon of T periods; otherwise, any remaining units perish. We consider g heterogeneous customer groups that differ in their sensitivity to price. At each selling period, t, the firm posts a discriminatory price for each of the g customer groups. In line with standard assumptions in the price-based RM literature, at most one customer from each group arrives in each period and demands at most one unit of the product based on the posted price. Customer groups with higher valuations and lower price sensitivity are considered disadvantaged groups in our setting. These groups may be associated with characteristics such as race, gender, age, or other legally protected attributes, and a revenue-maximizing firm would naturally tend to charge them higher prices. We first model the revenue-maximizing discriminatory dynamic pricing problem without considering price fairness as a Markov decision process (MDP). This serves as a baseline and illustrates that, in the absence of price fairness constraints, price differences across customer groups can become substantial. We then incorporate instantaneous and hard price fairness constraints, which require that inter-group price differences remain within acceptable limits, and formulate the resulting fairness-aware discriminatory dynamic pricing problem as an action-constrained Markov decision process (ACMDP). The objective of this ACMDP is to maximize revenue while satisfying the price fairness constraints at every state and in every selling period.
3.1. Markov Decision Process
We formulate the revenue-maximizing discriminatory dynamic pricing problem without considering price fairness as a Markov decision process (MDP), defined by a tuple consisting of the state space, the action space, the transition function, the reward function, and the discount factor. Below, we explicitly define each component in our setting.
State space. At each selling period, t, the market state records the remaining inventory and the current period. Thus, the state space consists of all admissible inventory-period pairs, and a terminal state is reached when the remaining inventory is depleted or the selling horizon ends.
Action space. At each selling period, t, the firm needs to set discriminatory prices for the g customer groups based on the current market state. Each customer group i is associated with a continuous price interval, and the action space is the Cartesian product of these g group-specific price intervals.
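As a small illustration of this Cartesian-product structure, the snippet below encodes hypothetical per-group price intervals for three customer groups; the bounds p_min and p_max are purely illustrative and not taken from the paper.

```python
import numpy as np

# hypothetical per-group price intervals [p_min_i, p_max_i] for g = 3 groups
p_min = np.array([1.0, 1.0, 1.0])
p_max = np.array([10.0, 12.0, 8.0])

# an action is any price vector p with p_min <= p <= p_max component-wise,
# i.e., an element of the Cartesian product of the three price intervals
p = np.array([4.5, 7.0, 6.2])
assert np.all(p_min <= p) and np.all(p <= p_max)
```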
Transition function. The transition function specifies the probability of transitioning to the next state, given the current state and action. In our setting, when the market is in a non-terminal state and the firm posts discriminatory prices, the purchase decision of an arriving customer from group i is a Bernoulli random variable whose success probability is the unknown demand function of group i evaluated at that group’s posted price. In period t, the attempted total demand is the number of groups requesting a unit. Because sales cannot exceed the remaining inventory, the realized sales are the minimum of the attempted demand and the remaining inventory. If the attempted demand exceeds the remaining inventory, we ration inventory by selecting the served requesting groups uniformly at random and set the realized demands accordingly. The inventory then decreases by the realized sales, and the time index advances to the next period. Since the underlying demand functions are unknown, the transition function P induced by these stochastic demand processes is also unknown.
Reward function. After observing the realized demand from each customer group i under the posted discriminatory prices, the firm receives an immediate reward signal (revenue) at the end of selling period t, equal to the sum over groups of each posted price multiplied by the corresponding realized demand. The underlying reward function is unknown because the demand functions are unknown.
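To make the transition and reward dynamics above concrete, the following Python sketch simulates one selling period; the function name market_step, the demand_fns argument, and the specific demand curves are illustrative assumptions, while the Bernoulli purchases, the uniform rationing rule, and the revenue computation follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def market_step(inventory, t, prices, demand_fns, T):
    """Simulate one selling period: Bernoulli purchases, rationing, revenue."""
    g = len(prices)
    # each group buys at most one unit, with probability given by its
    # (unknown to the firm) demand function evaluated at the posted price
    requests = np.array([rng.random() < demand_fns[i](prices[i]) for i in range(g)])
    requesters = np.flatnonzero(requests)
    # ration: if attempted demand exceeds remaining inventory, serve a
    # uniformly random subset of the requesting groups
    if len(requesters) > inventory:
        served = rng.choice(requesters, size=inventory, replace=False)
    else:
        served = requesters
    sales = np.zeros(g, dtype=int)
    sales[served] = 1
    revenue = float(np.dot(sales, prices))   # immediate reward signal
    inventory -= int(sales.sum())
    t += 1
    done = (inventory == 0) or (t > T)       # terminal: sold out or horizon ends
    return inventory, t, revenue, done
```

For instance, with demand_fns = [lambda p: max(0.0, 1.0 - p / 10.0)] * 3 as a purely hypothetical market, an episode can be rolled out by repeatedly calling market_step until done is True.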
Discount factor. The discount factor is used to convert future rewards into their present value. Since our problem is a finite-horizon revenue-maximizing discriminatory dynamic pricing problem with no intertemporal discounting, we set the discount factor to 1 in our paper.
The goal of this MDP model is to find a discriminatory pricing policy that maximizes the expected total rewards over the entire selling horizon of T periods. Specifically, a discriminatory pricing policy maps each state to a distribution over discriminatory price vectors, and the MDP model aims to maximize the expected sum of the realized rewards generated under this policy. The optimal revenue-maximizing discriminatory dynamic pricing policy is the corresponding maximizer.
3.2. Action-Constrained Markov Decision Process
In this paper, we consider price fairness, which requires that inter-group price differences remain within acceptable limits in every selling period. We incorporate price fairness constraints into the discriminatory dynamic pricing problem with limited inventory and a finite selling horizon, and we formulate this problem as an action-constrained Markov decision process (ACMDP), defined by the MDP tuple above augmented with a feasible action space. Below, we explicitly define each component in our setting.
State space. We retain the same state definition as in the above MDP model. At each selling period, t, the market state records the remaining inventory and the current period, and a terminal state is reached when the inventory is depleted or the selling horizon ends.
Action space. As in the above MDP model, when price fairness constraints are not considered, the action space is the Cartesian product of the g group-specific price intervals.
Feasible action space. To capture price fairness constraints, the price dispersion across customer groups is bounded by corresponding upper limits. Specifically, at every state and in every selling period, the price difference between any two groups cannot exceed a given tolerance, which is fixed and independent of the current state. Formally, the price fairness constraints in our setting, given in (8), require the absolute price difference between each pair of groups to be no larger than the corresponding tolerance. This condition imposes coupled multi-dimensional constraints on the action space, meaning that the price for each customer group cannot be chosen independently but must satisfy pairwise relationships with all other group prices. Therefore, the constraints in (8) induce a state-independent feasible action space. Any discriminatory price vector that violates (8) is deemed infeasible and is not allowed to be chosen at any state or in any selling period.
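As a simple illustration of the coupled pairwise structure of the feasible action space, the sketch below checks whether a candidate price vector satisfies the box and price-gap constraints; the names p_min, p_max, and the tolerance matrix eps are illustrative placeholders for the group price bounds and the pairwise tolerances.

```python
import numpy as np

def is_feasible(p, p_min, p_max, eps):
    """Check box constraints and pairwise price-gap (fairness) constraints."""
    p = np.asarray(p, dtype=float)
    # box constraints: each group's price must lie in its allowed interval
    if np.any(p < p_min) or np.any(p > p_max):
        return False
    # pairwise fairness: |p_i - p_j| <= eps[i, j] for every pair of groups
    gaps = np.abs(p[:, None] - p[None, :])
    return bool(np.all(gaps <= eps))

# illustrative example with three groups and a uniform tolerance of 2.0
eps = np.full((3, 3), 2.0)
print(is_feasible([5.0, 6.5, 4.8], p_min=1.0, p_max=10.0, eps=eps))  # True
print(is_feasible([5.0, 9.0, 4.8], p_min=1.0, p_max=10.0, eps=eps))  # False (gap 4.2 > 2.0)
```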
Transition function. The transition function is the same as in the above MDP model. When the market is in a non-terminal state and the firm posts discriminatory prices, the realized demand of each customer group i is observed at the end of selling period t. The inventory then decreases by the realized sales, and the time index advances to the next period. Since the demand functions are unknown, the transition function induced by these stochastic demand processes is also unknown.
Reward function. The reward function is the same as in the above MDP model. After observing the realized demand from each customer group i under the posted discriminatory prices, the firm receives the immediate reward signal (revenue) at selling period t, as shown in (4). The underlying reward function, given in (5), is unknown because it depends on the unknown demand functions.
Discount factor. Since our problem is a finite-horizon revenue-maximizing discriminatory dynamic pricing problem with no intertemporal discounting, we set the discount factor to 1 in our paper.
The goal of this ACMDP is to find a fairness-aware discriminatory dynamic pricing policy that maps each state to a distribution over feasible discriminatory price vectors and maximizes the expected total rewards over the entire selling horizon of T periods. In particular, at every state, every price vector that receives positive probability under the policy must lie in the feasible action space. Given the sequence of realized rewards generated under a fairness-aware discriminatory pricing policy, the firm seeks the policy that maximizes their expected sum, and the optimal fairness-aware discriminatory dynamic pricing policy is the corresponding maximizer.
Here, we explain why the fairness-aware dynamic pricing problem with price fairness constraints should be modeled as an ACMDP rather than a classical constrained Markov decision process (CMDP). CMDPs introduce one or more cost functions and impose constraints on the expected cumulative costs: a typical CMDP requires the expected cumulative cost under the policy to stay below a given cost threshold. In this framework, constraint satisfaction is enforced in an expected sense over the whole horizon. In contrast, our price fairness requirements do not act as expected cumulative cost constraints. Instead, they specify instantaneous and hard constraints on the action: any discriminatory price vector that violates the pairwise price-gap constraint for any pair of groups is considered infeasible at any state and in any selling period. Therefore, the fairness-aware discriminatory dynamic pricing problem should be modeled as an ACMDP, in which the price fairness constraints directly modify the original action space into the feasible action space. This differs from approaches that introduce additional costs and bound their cumulative expectation.
4. Solution Methods
Due to the lack of information about the underlying demand functions, we adopt the model-free deep reinforcement learning (DRL) framework to solve the ACMDP model. In a classical DRL setting, a DRL agent learns an optimal policy by interacting with its environment through trial and error, with the objective of maximizing expected cumulative rewards. In our discriminatory dynamic pricing context, as illustrated in
Figure 1a, the DRL pricing agent observes the current market state,
, in each selling period,
t, and outputs a discriminatory price vector,
, for
g customer groups. At the end of selling period
t, it receives a reward signal,
, which reflects the realized revenue under its pricing decision. By repeatedly interacting with the market across many training episodes, the DRL pricing agent gradually learns a near-optimal discriminatory dynamic pricing policy that aims to maximize the expected cumulative rewards over the entire selling horizon. However, standard DRL algorithms are designed to solve unconstrained MDPs and cannot directly handle the coupled multi-dimensional action constraints present in our ACMDP model. Moreover, it is nontrivial to modify the policy network architecture of a standard DRL algorithm so that it always outputs actions,
, that lie strictly within the feasible action space,
. To address this issue, we adopt the idea of shielding from Alshiekh et al. [
56] to convert the ACMDP model into an unconstrained MDP from the DRL pricing agent’s perspective, allowing powerful DRL algorithms to be efficiently applied. As shown in
Figure 1b, a Shield module is incorporated into the interaction loop between the DRL pricing agent and the market. The DRL pricing agent still observes the current market state,
, and outputs a discriminatory price vector,
, without considering price fairness constraints. The Shield module then checks whether the proposed action,
, lies in the feasible action space,
. If
, the Shield module directly passes the original action,
, to the market. If
, the Shield module projects
into the feasible action space and produces a corrected feasible action,
. We define the Shield mapping
as the Euclidean projection
which is a convex quadratic program and can be solved efficiently with standard optimization tools.
Proposition 1 (Step-wise feasibility of the Shield). Assume the feasible action space is non-empty. Then, for any time step t, the executed action lies in the feasible action space. Consequently, the instantaneous and hard price fairness constraints hold for every pair of groups at every selling period, t, during both training and deployment.
Proof. The feasible action space is an intersection of box constraints and linear inequalities induced by the pairwise gap bounds; hence, it is closed and convex. Therefore, the projection problem (14) is a feasible convex quadratic program with a strictly convex objective, so a unique optimizer exists and lies in the feasible action space, yielding step-wise feasibility. □
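As a concrete, non-authoritative illustration of the projection in (14), the following sketch solves the Euclidean projection onto the feasible set as a convex quadratic program using cvxpy; the helper name shield_project is ours, the tolerance matrix eps is an illustrative placeholder, and any QP solver accessible to cvxpy could be used.

```python
import cvxpy as cp
import numpy as np

def shield_project(a, p_min, p_max, eps):
    """Euclidean projection of a proposed price vector onto the feasible set."""
    g = len(a)
    p = cp.Variable(g)
    constraints = [p >= p_min, p <= p_max]            # box constraints
    for i in range(g):
        for j in range(i + 1, g):                     # pairwise price-gap constraints
            constraints += [p[i] - p[j] <= eps[i, j],
                            p[j] - p[i] <= eps[i, j]]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(p - a)), constraints)
    problem.solve()                                   # convex QP with a unique optimum
    return np.asarray(p.value)

# example: an infeasible proposal is pulled back so all pairwise gaps are <= 2.0
eps = np.full((3, 3), 2.0)
print(shield_project(np.array([9.0, 3.0, 6.0]), 1.0, 10.0, eps))
```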
The quadratic program in (14) has g decision variables. The pairwise gap constraints can be written as two linear inequalities for each unordered pair of groups, resulting in g(g - 1) linear inequalities in total, plus 2g box constraints. Hence, the per-step computational cost of solving the quadratic program grows polynomially with g. When the number of customer groups, g, is small, the resulting quadratic program is lightweight and can be solved efficiently using off-the-shelf solvers. When g becomes large, solving a quadratic program at every period may become non-negligible. Promising directions include (i) clustering and hierarchical pricing that assigns a shared price within each cluster, reducing the action dimension from g to the number of clusters; (ii) exploiting problem structure to develop faster first-order or operator-splitting projection methods, combined with warm-starting from the previous feasible action; and (iii) learning a lightweight surrogate projector to approximate the projection for low-latency inference, followed by a final feasibility-check-and-correct step to preserve never-violate safety.
Through this optimization-based shield mechanism, the original DRL pricing agent is effectively replaced with a Shield DRL pricing agent from the market’s perspective. This Shield DRL pricing agent always outputs fairness-aware discriminatory prices,
, that lie within the feasible action space,
, at every state and in every selling period. At the same time, from the perspective of the DRL pricing agent, the market is transformed into a new market with a new transition function,
, and a new reward function,
, both induced via the Shield module. Specifically, the original ACMDP model is translated to a shield-induced unconstrained MDP model, denoted as
. In this formulation, the state space
and the original action space
remain unchanged, and the discount factor is still set to
. The effects of the instantaneous and hard price fairness constraints are instead captured through the modified transition function
and reward function
induced by the Shield module. The shield-induced transition function
is defined as
where
is the original transition function of the ACMDP model. Meanwhile,
is the corrected feasible action returned by the Shield module. It is obtained by solving the convex quadratic program (
14) whenever the DRL pricing agent outputs an infeasible action,
. Since the original transition function,
, of the ACMDP model is unknown, the shield-induced transition function,
, is also unknown. Similarly, the immediate reward received by the DRL pricing agent is determined by the pricing action actually executed in the market. When the DRL pricing agent’s output,
, is infeasible, the reward signal received by the DRL pricing agent is based on the corrected feasible action,
. At the end of selling period
t, the reward signal
received by the DRL pricing agent is given by
Accordingly, the shield-induced reward function
is defined as
which remains unknown due to the unknown demand functions
. This shield-induced unconstrained MDP can be directly solved using standard DRL algorithms. At the same time, the Shield module guarantees that all executed pricing actions satisfy the instantaneous and hard price fairness constraints at every state and in every selling period. Note that the Shield module is part of the new market from the very beginning. Therefore, from the DRL pricing agent’s perspective, it always interacts with and learns in the same shield-induced new market. During training, the DRL pricing agent samples and updates under this shield-induced new market. Therefore, the learned pricing policy naturally corresponds to the true dynamics and reward structure of the shield-induced new market, rather than those of the original market without the Shield module. Moreover, at deployment, the DRL pricing agent still faces exactly the same shield-induced new market as during training (the same original market augmented with the same Shield module). Therefore, there is no issue that the pricing policy learned during training is inconsistent with the actual market dynamics at deployment.
In this paper, we use Soft Actor-Critic (SAC) as the base DRL algorithm and incorporate the above shielding mechanism to obtain the Shield-SAC DRL pricing algorithm. Shield-SAC can efficiently reuse past experiences through experience replay, thereby improving data efficiency and policy learning performance. It learns a stochastic discriminatory pricing policy,
, which maps the current market state,
s, to a probability distribution over discriminatory price vectors
a. Compared with deterministic policies, stochastic policies naturally facilitate exploration. Randomized action selection enables the agent to explore a broader set of pricing decisions and reduces the risk of getting trapped in suboptimal behaviors early in training. The degree of randomness of a policy can be quantified by its entropy. For a given state,
s, the policy entropy is defined as
To encourage exploration and prevent the policy from converging to a poor local optimum, Shield-SAC maximizes an entropy-regularized objective that trades off cumulative rewards and policy entropy. It interacts with the shield-induced new market under the discriminatory pricing policy
and collects shield-induced trajectories
where
denotes the reward signal generated by the shield-induced reward function
. The shield-induced entropy-regularized objective function is given by
where
is the temperature parameter that controls the relative importance of reward maximization and exploration. The entropy term provides an exploration incentive at each time step: it is proportional to the policy entropy at the corresponding state and encourages stochasticity in the learned pricing policy. The goal of the Shield-SAC pricing agent is to find an optimal discriminatory pricing policy,
, that maximizes the shield-induced entropy-regularized objective function (
20), namely
To implement Shield-SAC, we define two fundamental value functions under the discriminatory pricing policy
: the shield-induced state-value function
and the shield-induced action-value function
. These functions measure, respectively, the expected desirability of being in a given market state,
s, and the expected desirability of selecting a particular discriminatory pricing action,
a, when the market is in state
s, under the shield-induced entropy-regularized objective function (
20). The shield-induced state-value function
is defined as
Similarly, the shield-induced action-value function
is defined as
With these definitions,
and
satisfy
Finally, the Bellman equation for
under the shield-induced market dynamics
and
is given by
In our Shield-SAC algorithm, we use a deep neural network parameterized by
to approximate the discriminatory pricing policy
. It adopts a fully connected feedforward architecture, as shown in
Figure 2, with
hidden layers each containing
neurons. This neural network, which we refer to as the pricing policy network, takes the market state
s as input. It outputs the parameters of a diagonal Gaussian distribution, namely the mean vector
and the standard deviation vector
. This parameterization induces a stochastic policy,
, over the
g-dimensional discriminatory price vector. To sample an action,
, from the stochastic policy
under state
s while keeping the sampling operation differentiable, we adopt the reparameterization trick. We first draw noise
and construct a squashed Gaussian sample,
where ⊙ denotes element-wise multiplication. When interacting with the market, we further need to map the squashed action
to the prescribed discriminatory price ranges
via an element-wise affine transformation:
Hence, the final sampled discriminatory pricing action,
, from pricing policy
is
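A minimal PyTorch sketch of this reparameterized sampling and the affine mapping to the group-specific price ranges is shown below; the tensor names mean, std, p_min, and p_max are illustrative stand-ins for the policy network outputs and the price bounds.

```python
import torch

def sample_price_action(mean, std, p_min, p_max):
    """Reparameterized squashed-Gaussian sample mapped to the price intervals."""
    noise = torch.randn_like(mean)        # noise drawn from a standard Gaussian
    u = mean + std * noise                # differentiable Gaussian sample
    squashed = torch.tanh(u)              # element-wise squashing into (-1, 1)
    # element-wise affine transformation into [p_min, p_max] for each group
    action = p_min + 0.5 * (squashed + 1.0) * (p_max - p_min)
    return action
```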
The policy
should act to maximize shield-induced state-value function,
, defined in (
22) at each state,
. According to (
24), the policy
is thus optimized according to
According to (
26)–(
28), it can be translated to
In order to optimize the pricing policy
according to (
30), we also need to construct a deep neural network parameterized by
to approximate the shield-induced action-value function
under
. Following the standard design, we use two action-value functions,
and
with parameters
to mitigate overestimation bias in Q-value estimation. These two neural networks, which are referred to as the Q-value networks, both take the state
s and the action
a as input and output the estimated value of
. They both adopt a fully connected feedforward architecture, as shown in
Figure 3, with
hidden layers, each containing
neurons. Then, (
30) can be translated to
Shield-SAC reuses past experiences to train the pricing policy network and the Q-value networks. Therefore, the collected experiences need to be stored in a replay buffer,
, to fuel the training process. The process of collecting experiences is as follows. At each time step,
t, the pricing agent observes the current state,
, and selects a discriminatory pricing action,
, according to the current pricing policy,
. The market then transitions to the next state,
, and returns a shield-induced reward signal,
Consequently, the agent adds an experience, defined as
at each time step,
t, to the replay buffer,
, where
if
is a terminal state, and
otherwise. Here, a terminal state is reached when either the inventory is depleted or the selling horizon ends. After collecting enough experiences in the replay buffer, we can update the pricing policy network and the Q-value networks. First, the parameters
of the pricing policy network are updated based on (
31) using samples
drawn from the replay buffer
by one step of gradient ascent, as follows:
where
is a sample from
, which is differentiable with respect to
via the reparameterization trick, as shown in (
26)–(
28). Then, the parameters
of the two Q-value networks are updated by minimizing the mean squared error between their output values,
, and their approximating target values,
. The target values are computed according to the Bellman optimality Equation (
25) by using samples
drawn from the replay buffer
, as follows:
The approximating target value
of both Q-value networks is defined as:
where
is a sample from
via the reparameterization trick, as shown in (
26)–(
28). Here, the role of
is to ensure that, when the next state in the sampled experience is terminal, the Q-value term is excluded when computing the approximating target value. However, computing
y according to (
34) is challenging, as it depends on the parameters
that we aim to update, potentially causing instability during learning. To mitigate this, two separate neural networks parameterized by
are introduced, which are referred to as the target Q-value networks. The target Q-value networks share the same architecture as the Q-value networks. Their role is to compute the approximating target value,
y, in a more stable way. This is achieved by updating their parameters,
, more slowly by Polyak-averaging the parameters
of the Q-value networks, respectively, over the course of training, as follows:
where
is a smoothing coefficient controlling the update rate. Then, the approximating target value
y in (
34) is now translated to:
Algorithm 1 presents the complete pseudocode of our Shield-SAC algorithm. At each time step, the pricing policy network samples a discriminatory price vector via the reparameterization trick. Then, the Shield module either executes it directly if feasible or projects it into
via the convex quadratic program in (
14). The resulting transition is stored in the replay buffer, and the Q-value networks and the pricing policy network are updated using samples drawn from the replay buffer. Finally, the target Q-value networks are updated via Polyak averaging. Moreover, the corresponding flowchart of our Shield-SAC is shown in
Figure 4 to make the algorithmic workflow explicit and easier to reproduce.
Algorithm 1: Shield-SAC
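To make the updates in Algorithm 1 easier to follow, the sketch below outlines one gradient step of Shield-SAC in PyTorch; the policy's sample method (returning a reparameterized action and its log-probability), the critic call signatures, and the optimizers are assumed interfaces, so this is a schematic of the update equations rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def shield_sac_update(batch, policy, q1, q2, q1_targ, q2_targ,
                      policy_opt, q_opt, alpha, gamma, tau):
    """One gradient step of Shield-SAC on a batch from the replay buffer."""
    s, a, r, s_next, done = batch   # tensors; done is a 0/1 float terminal mask

    # --- critic update: regress both Q-value networks toward the soft target ---
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)          # reparameterized sample
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_opt.zero_grad()               # q_opt is assumed to cover both critics
    q_loss.backward()
    q_opt.step()

    # --- actor update: maximize the entropy-regularized Q-value ---
    a_new, logp_new = policy.sample(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    policy_loss = (alpha * logp_new - q_new).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # --- target Q-value networks: Polyak averaging of the critic parameters ---
    with torch.no_grad():
        for targ, src in ((q1_targ, q1), (q2_targ, q2)):
            for p_t, p_s in zip(targ.parameters(), src.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p_s)
```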