1. Introduction
In the pricing and revenue management literature, understanding customers’ purchasing behavior, typically captured by demand functions, is central to deriving revenue-maximizing pricing policies. This is particularly important for perishable products with limited inventory and a finite selling horizon. Following the foundational work of Gallego and Van Ryzin [1,2], traditional methods obtain optimal pricing policies by assuming the underlying demand functions are known beforehand. Although assuming demand functions are known a priori greatly simplifies the analysis of optimal pricing policies, it is often unrealistic for firms to have complete information about customers’ purchasing behavior in modern, highly uncertain markets [3]. Consequently, dynamic pricing and learning has attracted growing attention as a framework that explicitly accounts for demand uncertainty. This line of work jointly optimizes prices for revenue while learning customers’ purchasing behavior from observed sales data [4]. Existing studies on dynamic pricing and learning typically fall into three streams. The first is the parametric approach, which assumes that demand belongs to a known functional family (e.g., linear, exponential, or logit) and estimates the unknown parameters from sales data. The second is the nonparametric approach, which avoids committing to a specific functional form and instead exploits structural properties of the demand or revenue functions, such as monotonicity or Lipschitz continuity, to construct effective pricing policies from sales observations. Both parametric and nonparametric approaches rely on modeling assumptions that may not hold in practice, and model misspecification can therefore lead to suboptimal pricing policies [5]. The third is the model-free reinforcement learning (RL) approach, which makes no assumptions about the demand function or market model and instead learns pricing policies directly from interaction data. With recent advances in the RL literature, this paradigm has shown strong potential for solving complex dynamic pricing and learning problems [6].
Meanwhile, customer heterogeneity allows a firm to segment customers into distinct groups using collected sales data. The firm can then implement discriminatory pricing by charging different prices for the same product across groups, thereby increasing revenue [7]. In the extreme case, each customer can be treated as a distinct group; discriminatory pricing then becomes personalized pricing and can, in principle, yield the highest attainable revenue for the firm [8]. By the same reasoning, when the underlying demand functions are uncertain and must be learned while setting group-specific prices to maximize revenue, the problem becomes discriminatory dynamic pricing and learning [9]. Such problems are inherently more complex than standard dynamic pricing and learning, as they must account for customer heterogeneity in addition to demand uncertainty. Nevertheless, this direction is an inevitable trend for revenue-maximizing firms, driven by the growing availability of sales data and rapid advances in information technology and machine learning. As with standard dynamic pricing and learning, discriminatory dynamic pricing and learning can be addressed using three main methodological approaches, parametric, nonparametric, and RL, which differ in the extent to which they rely on assumptions about the underlying demand model [10,11,12].
Although discriminatory pricing can further increase a firm’s revenue and may sometimes benefit certain customer groups, customers who perceive the pricing as unfair relative to others may feel deceived and exploited [13]. Ihlanfeldt and Mayock [14] demonstrate the existence of discriminatory pricing against Black and Asian individuals in the housing market. Such perceived unfairness can lead to customer dissatisfaction, thereby weakening trust and loyalty toward the firm. In addition, discriminatory pricing has raised concerns among regulators and public authorities due to its potential adverse impacts on protected groups defined by characteristics such as race, gender, age, or other legally protected attributes [15]. This concern is driven by the possibility that protected groups may have higher willingness to pay due to historical disadvantage or unobserved heterogeneity, and such differences may translate into systematically higher prices for these groups [16]. Consequently, regulators have begun to constrain or closely monitor price differentiation practices. For example, New York State prohibited gender-based price discrimination in 2020 [17]. More recently, California expanded enforcement against gender-based price differences for substantially similar consumer products [18]. In financial services, the UK Financial Conduct Authority introduced general insurance pricing reforms that restrict price walking [19]; for home and motor insurance, these reforms require renewal quotes to be no higher than equivalent new-business quotes. Regulatory attention has further expanded from traditional anti-discrimination rules to AI-enabled decision systems. The EU AI Act adopts a risk-based framework and phases in obligations, including the early application of bans on certain prohibited AI practices [20]. In the US, the FTC has repeatedly emphasized that there is no AI exemption from consumer-protection law [21], and firms can face liability for unfair or deceptive practices when deploying algorithms that mislead consumers, embed bias, or lack adequate safeguards. Taken together, these developments imply that fairness constraints in pricing are not merely an ethical add-on; rather, they can function as operational compliance mechanisms that prevent large and systematic price disparities across groups, thereby reducing both legal and reputational exposure. These compliance pressures pose a technical challenge in discriminatory dynamic pricing and learning. When demand is uncertain, the firm must learn from sales data while simultaneously setting group-specific prices. At the same time, it must ensure that deployed prices remain within an acceptable fairness region in every selling period. This requirement is fundamentally instantaneous and hard in nature: many regulatory and compliance interpretations focus on individual-level or transaction-level harm, so even occasional violations can be unacceptable, even if the policy is fair on average.
In this work, we study discriminatory dynamic pricing and learning for perishable products under limited inventory. We focus on a specific form of fairness, namely price fairness, which requires inter-group price differences to remain within acceptable limits. In this problem, the initial inventory is fixed and the selling horizon is finite, with no replenishment throughout the selling horizon. Consequently, the firm must make pricing decisions under joint inventory and time constraints while additionally enforcing price fairness constraints when designing discriminatory dynamic pricing strategies. Motivated by the model-free advantage of reinforcement learning, we adopt RL as the backbone to solve this fairness-constrained dynamic pricing problem without relying on any assumptions about the underlying demand functions. However, existing RL approaches for fairness-aware pricing typically incorporate fairness by converting it into a soft penalty in the objective or into an expected cumulative constraint. Such methods do not guarantee instantaneous feasibility during training, and even at convergence they may still output constraint-violating actions. Moreover, the revenue-optimal fairness-constrained pricing policy often exhibits boundary-seeking behavior, meaning that optimal prices frequently lie near the feasibility boundary. Existing RL-based fairness-aware pricing approaches seldom recover this boundary-seeking behavior in an efficient and reliable manner. Our approach is designed specifically to address this gap by enforcing instantaneous and hard price fairness constraints while preserving the ability to search near the constraint boundary to recover high revenue. For clarity, our main contributions are summarized as follows:
We formulate fairness-aware dynamic pricing for perishable products with instantaneous and hard price fairness constraints as an action-constrained Markov decision process (ACMDP). The feasible action set is governed by coupled multi-dimensional constraints on the price vector. Therefore, prices for different customer groups cannot be chosen independently and must satisfy pairwise price-gap requirements within each selling period.
We incorporate an optimization-based Shield module into the interaction loop between the pricing agent and the market. The Shield module maps an infeasible price vector to a nearby feasible one by solving a convex quadratic program, which guarantees step-wise feasibility during both training and deployment. It also facilitates learning when the optimal constrained pricing policy is boundary-seeking by enabling safe exploration near the feasibility boundary.
We develop Shield Soft Actor-Critic (Shield-SAC), a model-free deep reinforcement learning (DRL) algorithm built on neural network function approximation. It learns the optimal pricing policy through interaction and thus does not rely on any assumptions about the underlying demand functions. Shield-SAC recovers revenue-optimal boundary-seeking behavior and achieves strong revenue performance while consistently enforcing the instantaneous and hard price fairness constraints.
Our work also helps connect the fairness-aware pricing literature with the RL literature. It further illustrates the promise of hybrid optimization-and-DRL methods for complex operational decision-making and learning problems with instantaneous and hard constraints.
3. Problem Formulation
In this paper, we study a price-based revenue management (RM) problem in which a firm must determine an optimal discriminatory dynamic pricing policy for multiple customer groups when selling a single perishable product. The initial inventory, I, is fixed and cannot be replenished, and the product must be sold within a finite selling horizon of T periods; otherwise, any remaining units perish. We consider g heterogeneous customer groups that differ in their sensitivity to price. At each selling period, t, the firm posts a discriminatory price for each of the g customer groups. In line with standard assumptions in the price-based RM literature, at most one customer from each group arrives in each period and demands at most one unit of the product based on the posted price. Customer groups with higher valuations and lower price sensitivity are considered disadvantaged groups in our setting. These groups may be associated with characteristics such as race, gender, age, or other legally protected attributes, and a revenue-maximizing firm would naturally tend to charge them higher prices. We first model the revenue-maximizing discriminatory dynamic pricing problem without considering price fairness as a Markov decision process (MDP). This serves as a baseline and illustrates that, in the absence of price fairness constraints, price differences across customer groups can become substantial. We then incorporate instantaneous and hard price fairness constraints, which require that inter-group price differences remain within acceptable limits, and formulate the resulting fairness-aware discriminatory dynamic pricing problem as an action-constrained Markov decision process (ACMDP). The objective of this ACMDP is to maximize revenue while satisfying the price fairness constraints at every state and in every selling period.
3.1. Markov Decision Process
We formulate the revenue-maximizing discriminatory dynamic pricing problem without considering price fairness as a Markov decision process (MDP), defined by a tuple consisting of the state space, the action space, the transition function, the reward function, and the discount factor. Below, we explicitly define each component in our setting.
State space. At each selling period, t, the market state records the remaining inventory and the current period. Thus, the state space consists of all admissible inventory-period pairs, and a terminal state is reached when the remaining inventory is depleted or the selling horizon ends.
Action space. At each selling period, t, the firm needs to set discriminatory prices for the g customer groups based on the current market state. Each customer group i is associated with a continuous price interval, and the action space is the Cartesian product of these g group-specific price intervals.
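As a small illustration of this Cartesian-product structure, the snippet below encodes hypothetical per-group price intervals for three customer groups; the bounds p_min and p_max are purely illustrative and not taken from the paper.

```python
import numpy as np

# hypothetical per-group price intervals [p_min_i, p_max_i] for g = 3 groups
p_min = np.array([1.0, 1.0, 1.0])
p_max = np.array([10.0, 12.0, 8.0])

# an action is any price vector p with p_min <= p <= p_max component-wise,
# i.e., an element of the Cartesian product of the three price intervals
p = np.array([4.5, 7.0, 6.2])
assert np.all(p_min <= p) and np.all(p <= p_max)
```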
Transition function. The transition function specifies the probability of transitioning to the next state, given the current state and action. In our setting, when the market is in a non-terminal state and the firm posts discriminatory prices, the purchase decision of an arriving customer from group i is a Bernoulli random variable whose success probability is the unknown demand function of group i evaluated at that group’s posted price. In period t, the attempted total demand is the number of groups requesting a unit. Because sales cannot exceed the remaining inventory, the realized sales are the minimum of the attempted demand and the remaining inventory. If the attempted demand exceeds the remaining inventory, we ration inventory by selecting the served requesting groups uniformly at random and set the realized demands accordingly. The inventory then decreases by the realized sales, and the time index advances to the next period. Since the underlying demand functions are unknown, the transition function P induced by these stochastic demand processes is also unknown.
Reward function. After observing the realized demand from each customer group i under the posted discriminatory prices, the firm receives an immediate reward signal (revenue) at the end of selling period t, equal to the sum over groups of each posted price multiplied by the corresponding realized demand. The underlying reward function is unknown because the demand functions are unknown.
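To make the transition and reward dynamics above concrete, the following Python sketch simulates one selling period; the function name market_step, the demand_fns argument, and the specific demand curves are illustrative assumptions, while the Bernoulli purchases, the uniform rationing rule, and the revenue computation follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def market_step(inventory, t, prices, demand_fns, T):
    """Simulate one selling period: Bernoulli purchases, rationing, revenue."""
    g = len(prices)
    # each group buys at most one unit, with probability given by its
    # (unknown to the firm) demand function evaluated at the posted price
    requests = np.array([rng.random() < demand_fns[i](prices[i]) for i in range(g)])
    requesters = np.flatnonzero(requests)
    # ration: if attempted demand exceeds remaining inventory, serve a
    # uniformly random subset of the requesting groups
    if len(requesters) > inventory:
        served = rng.choice(requesters, size=inventory, replace=False)
    else:
        served = requesters
    sales = np.zeros(g, dtype=int)
    sales[served] = 1
    revenue = float(np.dot(sales, prices))   # immediate reward signal
    inventory -= int(sales.sum())
    t += 1
    done = (inventory == 0) or (t > T)       # terminal: sold out or horizon ends
    return inventory, t, revenue, done
```

For instance, with demand_fns = [lambda p: max(0.0, 1.0 - p / 10.0)] * 3 as a purely hypothetical market, an episode can be rolled out by repeatedly calling market_step until done is True.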
Discount factor. The discount factor is used to convert future rewards into their present value. Since our problem is a finite-horizon revenue-maximizing discriminatory dynamic pricing problem with no intertemporal discounting, we set the discount factor to 1 in our paper.
The goal of this MDP model is to find a discriminatory pricing policy that maximizes the expected total rewards over the entire selling horizon of T periods. Specifically, a discriminatory pricing policy maps each state to a distribution over discriminatory price vectors, and the MDP model aims to maximize the expected sum of the realized rewards generated under this policy. The optimal revenue-maximizing discriminatory dynamic pricing policy is the corresponding maximizer.
3.2. Action-Constrained Markov Decision Process
In this paper, we consider price fairness, which requires that inter-group price differences remain within acceptable limits in every selling period. We incorporate price fairness constraints into the discriminatory dynamic pricing problem with limited inventory and a finite selling horizon, and we formulate this problem as an action-constrained Markov decision process (ACMDP), defined by the MDP tuple above augmented with a feasible action space. Below, we explicitly define each component in our setting.
State space. We retain the same state definition as in the above MDP model. At each selling period, t, the market state records the remaining inventory and the current period, and a terminal state is reached when the inventory is depleted or the selling horizon ends.
Action space. As in the above MDP model, when price fairness constraints are not considered, the action space is the Cartesian product of the g group-specific price intervals.
Feasible action space. To capture price fairness constraints, the price dispersion across customer groups is bounded by corresponding upper limits. Specifically, at every state and in every selling period, the price difference between any two groups cannot exceed a given tolerance, which is fixed and independent of the current state. Formally, the price fairness constraints in our setting, given in (8), require the absolute price difference between each pair of groups to be no larger than the corresponding tolerance. This condition imposes coupled multi-dimensional constraints on the action space, meaning that the price for each customer group cannot be chosen independently but must satisfy pairwise relationships with all other group prices. Therefore, the constraints in (8) induce a state-independent feasible action space. Any discriminatory price vector that violates (8) is deemed infeasible and is not allowed to be chosen at any state or in any selling period.
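As a simple illustration of the coupled pairwise structure of the feasible action space, the sketch below checks whether a candidate price vector satisfies the box and price-gap constraints; the names p_min, p_max, and the tolerance matrix eps are illustrative placeholders for the group price bounds and the pairwise tolerances.

```python
import numpy as np

def is_feasible(p, p_min, p_max, eps):
    """Check box constraints and pairwise price-gap (fairness) constraints."""
    p = np.asarray(p, dtype=float)
    # box constraints: each group's price must lie in its allowed interval
    if np.any(p < p_min) or np.any(p > p_max):
        return False
    # pairwise fairness: |p_i - p_j| <= eps[i, j] for every pair of groups
    gaps = np.abs(p[:, None] - p[None, :])
    return bool(np.all(gaps <= eps))

# illustrative example with three groups and a uniform tolerance of 2.0
eps = np.full((3, 3), 2.0)
print(is_feasible([5.0, 6.5, 4.8], p_min=1.0, p_max=10.0, eps=eps))  # True
print(is_feasible([5.0, 9.0, 4.8], p_min=1.0, p_max=10.0, eps=eps))  # False (gap 4.2 > 2.0)
```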
Transition function. The transition function is the same as in the above MDP model. When the market is in a non-terminal state and the firm posts discriminatory prices, the realized demand of each customer group i is observed at the end of selling period t. The inventory then decreases by the realized sales, and the time index advances to the next period. Since the demand functions are unknown, the transition function induced by these stochastic demand processes is also unknown.
Reward function. The reward function is the same as in the above MDP model. After observing the realized demand from each customer group i under the posted discriminatory prices, the firm receives the immediate reward signal (revenue) at selling period t, as shown in (4). The underlying reward function, given in (5), is unknown because it depends on the unknown demand functions.
Discount factor. Since our problem is a finite-horizon revenue-maximizing discriminatory dynamic pricing problem with no intertemporal discounting, we set the discount factor to 1 in our paper.
The goal of this ACMDP is to find a fairness-aware discriminatory dynamic pricing policy that maps each state to a distribution over feasible discriminatory price vectors and maximizes the expected total rewards over the entire selling horizon of T periods. In particular, at every state, every price vector that receives positive probability under the policy must lie in the feasible action space. Given the sequence of realized rewards generated under a fairness-aware discriminatory pricing policy, the firm seeks the policy that maximizes their expected sum, and the optimal fairness-aware discriminatory dynamic pricing policy is the corresponding maximizer.
Here, we explain why the fairness-aware dynamic pricing problem with price fairness constraints should be modeled as an ACMDP rather than a classical constrained Markov decision process (CMDP). CMDPs introduce one or more cost functions and impose constraints on the expected cumulative costs: a typical CMDP requires the expected cumulative cost under the policy to stay below a given cost threshold. In this framework, constraint satisfaction is enforced in an expected sense over the whole horizon. In contrast, our price fairness requirements do not act as expected cumulative cost constraints. Instead, they specify instantaneous and hard constraints on the action: any discriminatory price vector that violates the pairwise price-gap constraint for any pair of groups is considered infeasible at any state and in any selling period. Therefore, the fairness-aware discriminatory dynamic pricing problem should be modeled as an ACMDP, in which the price fairness constraints directly modify the original action space into the feasible action space. This differs from approaches that introduce additional costs and bound their cumulative expectation.
4. Solution Methods
Due to the lack of information about the underlying demand functions, we adopt the model-free deep reinforcement learning (DRL) framework to solve the ACMDP model. In a classical DRL setting, a DRL agent learns an optimal policy by interacting with its environment through trial and error, with the objective of maximizing expected cumulative rewards. In our discriminatory dynamic pricing context, as illustrated in
Figure 1a, the DRL pricing agent observes the current market state,
, in each selling period,
t, and outputs a discriminatory price vector,
, for
g customer groups. At the end of selling period
t, it receives a reward signal,
, which reflects the realized revenue under its pricing decision. By repeatedly interacting with the market across many training episodes, the DRL pricing agent gradually learns a near-optimal discriminatory dynamic pricing policy that aims to maximize the expected cumulative rewards over the entire selling horizon. However, standard DRL algorithms are designed to solve unconstrained MDPs and cannot directly handle the coupled multi-dimensional action constraints present in our ACMDP model. Moreover, it is nontrivial to modify the policy network architecture of a standard DRL algorithm so that it always outputs actions,
, that lie strictly within the feasible action space,
. To address this issue, we adopt the idea of shielding from Alshiekh et al. [
56] to convert the ACMDP model into an unconstrained MDP from the DRL pricing agent’s perspective, allowing powerful DRL algorithms to be efficiently applied. As shown in
Figure 1b, a Shield module is incorporated into the interaction loop between the DRL pricing agent and the market. The DRL pricing agent still observes the current market state,
, and outputs a discriminatory price vector,
, without considering price fairness constraints. The Shield module then checks whether the proposed action,
, lies in the feasible action space,
. If
, the Shield module directly passes the original action,
, to the market. If
, the Shield module projects
into the feasible action space and produces a corrected feasible action,
. We define the Shield mapping
as the Euclidean projection
which is a convex quadratic program and can be solved efficiently with standard optimization tools.
Proposition 1 (Step-wise feasibility of the Shield). Assume the feasible action space is non-empty. Then, for any time step t, the executed action lies in the feasible action space. Consequently, the instantaneous and hard price fairness constraints hold for every pair of groups at every selling period, t, during both training and deployment.
Proof. The feasible action space is an intersection of box constraints and linear inequalities induced by the pairwise gap bounds; hence, it is closed and convex. Therefore, the projection problem (14) is a feasible convex quadratic program with a strictly convex objective, so a unique optimizer exists and lies in the feasible action space, yielding step-wise feasibility. □
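As a concrete, non-authoritative illustration of the projection in (14), the following sketch solves the Euclidean projection onto the feasible set as a convex quadratic program using cvxpy; the helper name shield_project is ours, the tolerance matrix eps is an illustrative placeholder, and any QP solver accessible to cvxpy could be used.

```python
import cvxpy as cp
import numpy as np

def shield_project(a, p_min, p_max, eps):
    """Euclidean projection of a proposed price vector onto the feasible set."""
    g = len(a)
    p = cp.Variable(g)
    constraints = [p >= p_min, p <= p_max]            # box constraints
    for i in range(g):
        for j in range(i + 1, g):                     # pairwise price-gap constraints
            constraints += [p[i] - p[j] <= eps[i, j],
                            p[j] - p[i] <= eps[i, j]]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(p - a)), constraints)
    problem.solve()                                   # convex QP with a unique optimum
    return np.asarray(p.value)

# example: an infeasible proposal is pulled back so all pairwise gaps are <= 2.0
eps = np.full((3, 3), 2.0)
print(shield_project(np.array([9.0, 3.0, 6.0]), 1.0, 10.0, eps))
```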
The quadratic program in (14) has g decision variables. The pairwise gap constraints can be written as two linear inequalities for each unordered pair of groups, resulting in g(g - 1) linear inequalities in total, plus 2g box constraints. Hence, the per-step computational cost of solving the quadratic program grows polynomially with g. When the number of customer groups, g, is small, the resulting quadratic program is lightweight and can be solved efficiently using off-the-shelf solvers. When g becomes large, solving a quadratic program at every period may become non-negligible. Promising directions include (i) clustering and hierarchical pricing that assigns a shared price within each cluster, reducing the action dimension from g to the number of clusters; (ii) exploiting problem structure to develop faster first-order or operator-splitting projection methods, combined with warm-starting from the previous feasible action; and (iii) learning a lightweight surrogate projector to approximate the projection for low-latency inference, followed by a final feasibility-check-and-correct step to preserve never-violate safety.
Through this optimization-based shield mechanism, the original DRL pricing agent is effectively replaced with a Shield DRL pricing agent from the market’s perspective. This Shield DRL pricing agent always outputs fairness-aware discriminatory prices,
, that lie within the feasible action space,
, at every state and in every selling period. At the same time, from the perspective of the DRL pricing agent, the market is transformed into a new market with a new transition function,
, and a new reward function,
, both induced via the Shield module. Specifically, the original ACMDP model is translated to a shield-induced unconstrained MDP model, denoted as
. In this formulation, the state space
and the original action space
remain unchanged, and the discount factor is still set to
. The effects of the instantaneous and hard price fairness constraints are instead captured through the modified transition function
and reward function
induced by the Shield module. The shield-induced transition function
is defined as
where
is the original transition function of the ACMDP model. Meanwhile,
is the corrected feasible action returned by the Shield module. It is obtained by solving the convex quadratic program (
14) whenever the DRL pricing agent outputs an infeasible action,
. Since the original transition function,
, of the ACMDP model is unknown, the shield-induced transition function,
, is also unknown. Similarly, the immediate reward received by the DRL pricing agent is determined by the pricing action actually executed in the market. When the DRL pricing agent’s output,
, is infeasible, the reward signal received by the DRL pricing agent is based on the corrected feasible action,
. At the end of selling period
t, the reward signal
received by the DRL pricing agent is given by
Accordingly, the shield-induced reward function
is defined as
which remains unknown due to the unknown demand functions
. This shield-induced unconstrained MDP can be directly solved using standard DRL algorithms. At the same time, the Shield module guarantees that all executed pricing actions satisfy the instantaneous and hard price fairness constraints at every state and in every selling period. Note that the Shield module is part of the new market from the very beginning. Therefore, from the DRL pricing agent’s perspective, it always interacts with and learns in the same shield-induced new market. During training, the DRL pricing agent samples and updates under this shield-induced new market. Therefore, the learned pricing policy naturally corresponds to the true dynamics and reward structure of the shield-induced new market, rather than those of the original market without the Shield module. Moreover, at deployment, the DRL pricing agent still faces exactly the same shield-induced new market as during training (the same original market augmented with the same Shield module). Therefore, there is no issue that the pricing policy learned during training is inconsistent with the actual market dynamics at deployment.
In this paper, we use Soft Actor-Critic (SAC) as the base DRL algorithm and incorporate the above shielding mechanism to obtain the Shield-SAC DRL pricing algorithm. Shield-SAC can efficiently reuse past experiences through experience replay, thereby improving data efficiency and policy learning performance. It learns a stochastic discriminatory pricing policy,
, which maps the current market state,
s, to a probability distribution over discriminatory price vectors
a. Compared with deterministic policies, stochastic policies naturally facilitate exploration. Randomized action selection enables the agent to explore a broader set of pricing decisions and reduces the risk of getting trapped in suboptimal behaviors early in training. The degree of randomness of a policy can be quantified by its entropy. For a given state,
s, the policy entropy is defined as
To encourage exploration and prevent the policy from converging to a poor local optimum, Shield-SAC maximizes an entropy-regularized objective that trades off cumulative rewards and policy entropy. It interacts with the shield-induced new market under the discriminatory pricing policy
and collects shield-induced trajectories
where
denotes the reward signal generated by the shield-induced reward function
. The shield-induced entropy-regularized objective function is given by
where
is the temperature parameter that controls the relative importance of reward maximization and exploration. The entropy term provides an exploration incentive at each time step: it is proportional to the policy entropy at the corresponding state and encourages stochasticity in the learned pricing policy. The goal of the Shield-SAC pricing agent is to find an optimal discriminatory pricing policy,
, that maximizes the shield-induced entropy-regularized objective function (
20), namely
To implement Shield-SAC, we define two fundamental value functions under the discriminatory pricing policy
: the shield-induced state-value function
and the shield-induced action-value function
. These functions measure, respectively, the expected desirability of being in a given market state,
s, and the expected desirability of selecting a particular discriminatory pricing action,
a, when the market is in state
s, under the shield-induced entropy-regularized objective function (
20). The shield-induced state-value function
is defined as
Similarly, the shield-induced action-value function
is defined as
With these definitions,
and
satisfy
Finally, the Bellman equation for
under the shield-induced market dynamics
and
is given by
In our Shield-SAC algorithm, we use a deep neural network parameterized by
to approximate the discriminatory pricing policy
. It adopts a fully connected feedforward architecture, as shown in
Figure 2, with
hidden layers each containing
neurons. This neural network, which we refer to as the pricing policy network, takes the market state
s as input. It outputs the parameters of a diagonal Gaussian distribution, namely the mean vector
and the standard deviation vector
. This parameterization induces a stochastic policy,
, over the
g-dimensional discriminatory price vector. To sample an action,
, from the stochastic policy
under state
s while keeping the sampling operation differentiable, we adopt the reparameterization trick. We first draw noise
and construct a squashed Gaussian sample,
where ⊙ denotes element-wise multiplication. When interacting with the market, we further need to map the squashed action
to the prescribed discriminatory price ranges
via an element-wise affine transformation:
Hence, the final sampled discriminatory pricing action,
, from pricing policy
is
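A minimal PyTorch sketch of this reparameterized sampling and the affine mapping to the group-specific price ranges is shown below; the tensor names mean, std, p_min, and p_max are illustrative stand-ins for the policy network outputs and the price bounds.

```python
import torch

def sample_price_action(mean, std, p_min, p_max):
    """Reparameterized squashed-Gaussian sample mapped to the price intervals."""
    noise = torch.randn_like(mean)        # noise drawn from a standard Gaussian
    u = mean + std * noise                # differentiable Gaussian sample
    squashed = torch.tanh(u)              # element-wise squashing into (-1, 1)
    # element-wise affine transformation into [p_min, p_max] for each group
    action = p_min + 0.5 * (squashed + 1.0) * (p_max - p_min)
    return action
```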
The policy
should act to maximize shield-induced state-value function,
, defined in (
22) at each state,
. According to (
24), the policy
is thus optimized according to
According to (
26)–(
28), it can be translated to
In order to optimize the pricing policy
according to (
30), we also need to construct a deep neural network parameterized by
to approximate the shield-induced action-value function
under
. Following the standard design, we use two action-value functions,
and
with parameters
to mitigate overestimation bias in Q-value estimation. These two neural networks, which are referred to as the Q-value networks, both take the state
s and the action
a as input and output the estimated value of
. They both adopt a fully connected feedforward architecture, as shown in
Figure 3, with
hidden layers, each containing
neurons. Then, (
30) can be translated to
Shield-SAC reuses past experiences to train the pricing policy network and the Q-value networks. Therefore, the collected experiences need to be stored in a replay buffer,
, to fuel the training process. The process of collecting experiences is as follows. At each time step,
t, the pricing agent observes the current state,
, and selects a discriminatory pricing action,
, according to the current pricing policy,
. The market then transitions to the next state,
, and returns a shield-induced reward signal,
Consequently, the agent adds an experience, defined as
at each time step,
t, to the replay buffer,
, where
if
is a terminal state, and
otherwise. Here, a terminal state is reached when either the inventory is depleted or the selling horizon ends. After collecting enough experiences in the replay buffer, we can update the pricing policy network and the Q-value networks. First, the parameters
of the pricing policy network are updated based on (
31) using samples
drawn from the replay buffer
by one step of gradient ascent, as follows:
where
is a sample from
, which is differentiable with respect to
via the reparameterization trick, as shown in (
26)–(
28). Then, the parameters
of the two Q-value networks are updated by minimizing the mean squared error between their output values,
, and their approximating target values,
. The target values are computed according to the Bellman optimality Equation (
25) by using samples
drawn from the replay buffer
, as follows:
The approximating target value
of both Q-value networks is defined as:
where
is a sample from
via the reparameterization trick, as shown in (
26)–(
28). Here, the role of
is to ensure that, when the next state in the sampled experience is terminal, the Q-value term is excluded when computing the approximating target value. However, computing
y according to (
34) is challenging, as it depends on the parameters
that we aim to update, potentially causing instability during learning. To mitigate this, two separate neural networks parameterized by
are introduced, which are referred to as the target Q-value networks. The target Q-value networks share the same architecture as the Q-value networks. Their role is to compute the approximating target value,
y, in a more stable way. This is achieved by updating their parameters,
, more slowly by Polyak-averaging the parameters
of the Q-value networks, respectively, over the course of training, as follows:
where
is a smoothing coefficient controlling the update rate. Then, the approximating target value
y in (
34) is now translated to:
Algorithm 1 presents the complete pseudocode of our Shield-SAC algorithm. At each time step, the pricing policy network samples a discriminatory price vector via the reparameterization trick. Then, the Shield module either executes it directly if feasible or projects it into
via the convex quadratic program in (
14). The resulting transition is stored in the replay buffer, and the Q-value networks and the pricing policy network are updated using samples drawn from the replay buffer. Finally, the target Q-value networks are updated via Polyak averaging. Moreover, the corresponding flowchart of our Shield-SAC is shown in
Figure 4 to make the algorithmic workflow explicit and easier to reproduce.
Algorithm 1: Shield-SAC
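To make the updates in Algorithm 1 easier to follow, the sketch below outlines one gradient step of Shield-SAC in PyTorch; the policy's sample method (returning a reparameterized action and its log-probability), the critic call signatures, and the optimizers are assumed interfaces, so this is a schematic of the update equations rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def shield_sac_update(batch, policy, q1, q2, q1_targ, q2_targ,
                      policy_opt, q_opt, alpha, gamma, tau):
    """One gradient step of Shield-SAC on a batch from the replay buffer."""
    s, a, r, s_next, done = batch   # tensors; done is a 0/1 float terminal mask

    # --- critic update: regress both Q-value networks toward the soft target ---
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)          # reparameterized sample
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_opt.zero_grad()               # q_opt is assumed to cover both critics
    q_loss.backward()
    q_opt.step()

    # --- actor update: maximize the entropy-regularized Q-value ---
    a_new, logp_new = policy.sample(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    policy_loss = (alpha * logp_new - q_new).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # --- target Q-value networks: Polyak averaging of the critic parameters ---
    with torch.no_grad():
        for targ, src in ((q1_targ, q1), (q2_targ, q2)):
            for p_t, p_s in zip(targ.parameters(), src.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p_s)
```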