Article

A Two-Stage Deep Reinforcement Learning-Driven Dynamic Discriminatory Pricing Model for Hotel Rooms with Fairness Constraints

1 School of Economics and Management, China University of Petroleum, Qingdao 266580, China
2 Sales Department, Crowne Plaza Hotel, Qingdao 266000, China
* Author to whom correspondence should be addressed.
J. Theor. Appl. Electron. Commer. Res. 2025, 20(4), 337; https://doi.org/10.3390/jtaer20040337
Submission received: 30 August 2025 / Revised: 13 October 2025 / Accepted: 7 November 2025 / Published: 2 December 2025
(This article belongs to the Special Issue Emerging Technologies and Marketing Innovation)

Abstract

Big data-driven discriminatory pricing not only creates opportunities to boost hotel profits but also amplifies consumers’ negative perceptions of price fairness. Developing a dynamic discriminatory pricing model with fairness constraints helps hotel room managers formulate optimal pricing strategies. This paper proposes a dynamic discriminatory pricing model with fairness constraints that unifies four pricing models: fixed pricing, dynamic pricing, discriminatory pricing, and dynamic discriminatory pricing. It further proposes a two-stage deep reinforcement learning algorithm to efficiently solve the model and generate optimal pricing strategies. Finally, a case study is conducted to validate the proposed model and algorithm. The results show that the two-stage deep reinforcement learning algorithm can instantaneously derive optimal pricing schemes that satisfy both group and temporal fairness constraints, following a reasonably time-efficient training process. By adjusting the fairness parameters, our model can be transformed into the four types of pricing models, and the performance of the algorithm is validated for the commonly used dynamic pricing and dynamic discriminatory pricing models. Compared to traditional nonlinear programming solution algorithms, this algorithm generates optimal daily prices based on real-time market changes, making it more practically applicable.

1. Introduction

Room revenue constitutes a vital income stream for hotels, making its pricing strategy a critical component of effective revenue management. Traditional hotel pricing approaches largely rely on dynamic pricing models driven by supply and demand dynamics [1]. However, such strategies often fail to incorporate differences in price sensitivity and perception across distinct customer segments, thereby constraining the hotel’s potential to fully maximize revenue. A growing number of scholars are investigating methods to segment customer groups based on their distinct characteristics, thereby facilitating the implementation of more refined pricing strategies [2,3]. This approach, referred to as discriminatory pricing, enables hotels to adopt tailored pricing policies aligned with the varying price expectations of different customer segments. With the advent of the big data and artificial intelligence era, big data-driven discriminatory pricing has introduced a more refined approach to hotel revenue management [4]. E-commerce platforms can now construct detailed customer profiles based on behavioral data, enabling them to tailor room prices to individual customers and thereby maximize revenue [5]. While dynamic discriminatory pricing has the potential to increase hotel profits, it also heightens consumers’ negative concerns regarding price fairness [3,6]. In response, many government agencies have implemented regulations aimed at curbing excessive price discrimination in hotel room rates. For instance, the UK’s Financial Conduct Authority has proposed the implementation of a relative price ceiling, while the United States’ Equal Credit Opportunity Act includes explicit provisions prohibiting price discrimination. Whether motivated by the long-term strategic goals of businesses or in response to governmental regulations, integrating fairness considerations into pricing research has become an inevitable trend.
Some researchers have proposed nonlinear optimization algorithms to derive optimal pricing strategies under fairness constraints [7,8,9,10]. However, these algorithms are time-consuming, and the solutions generated by solvers are static, as they rely on fixed assumptions regarding demand and customer traffic; consequently, they fail to adapt to daily fluctuations in market conditions. As a result, in practical settings, hotel managers continue to depend on historical demand data and experiential forecasting to formulate room pricing strategies [11]. With the rapid advancement of AI technology, Reinforcement Learning (RL) offers a new approach to solving hotel room pricing models efficiently. Through continuous interaction with its environment, an agent refines its behavioral policy to maximize the cumulative rewards it obtains. However, traditional RL methods often struggle to process high-dimensional inputs, owing to their limited ability to extract relevant features. Deep Reinforcement Learning (DRL), which integrates reinforcement learning with deep neural networks (DNNs), exhibits exceptional capabilities in addressing decision-making problems involving complex state spaces and high-dimensional action domains. DRL has been successfully applied across a wide range of real-world domains, including video game AI, competitive sports, autonomous driving, inventory management, and robotic control. In the field of revenue management, researchers have increasingly explored the application of reinforcement learning [12,13,14,15,16]. However, for hotel room pricing, although Tuncay et al. [17] applied Q-learning to a common model without constraints, the application of DRL incorporating fairness constraints remains an area requiring further research.
This paper presents a dynamic discriminatory pricing model for hotel rooms with fairness constraints, and proposes a two-stage deep reinforcement learning algorithm to derive revenue-maximizing pricing strategies. The main contributions of this paper are summarized as follows:
(1)
We propose a dynamic discriminatory pricing framework for hotel rooms with fairness constraints. By adjusting the group and temporal fairness parameters, this framework can be transformed into four distinct pricing models: fixed pricing, dynamic pricing, discriminatory pricing, and dynamic discriminatory pricing. This framework extends the theoretical foundations of hotel room pricing optimization. It can effectively address the limitation of traditional models that only solve pricing problems in a single scenario, significantly enhancing the model’s generality.
(2)
We revise the representation of the fairness gap in the fairness constraints, shifting from the price range of the optimal pricing strategy to customers’ acceptable real fairness perception. This revision is more aligned with the characteristics of the customer groups of a specific hotel. Compared with the methods [9,10] that determine the fairness gap based on the optimal solution of unconstrained models, the fairness gap derived from this method exhibits better determinacy and applicability. Hence, the model-derived solutions for pricing strategies can be more effective in practice.
(3)
To solve the dynamic discriminatory pricing model with fairness constraints, we propose a two-stage deep reinforcement learning algorithm. This algorithm can generate optimal pricing strategies that satisfy fairness constraints based on the dynamic changes in the market environment, and it is more applicable and faster-running than traditional nonlinear programming algorithms. This algorithm provides a new solution method for intelligent pricing problems with temporal constraints, expanding the application scope of deep reinforcement learning algorithms in the pricing field.
The rest of the paper is organized as follows. Section 2 reviews the literature. Section 3 details our proposed dynamic discriminatory pricing model, and Section 4 presents the corresponding two-stage deep reinforcement learning algorithm. Section 5 presents a case study. Lastly, Section 6 concludes the paper.

2. Literature Review

2.1. Hotel Room Pricing Strategy

Hotel room pricing strategy is a core component of hotel revenue management, and discriminatory pricing as well as dynamic pricing are the most frequently used methods by hotel managers. Specifically, these strategies address the needs of diverse customer segments by offering differentiated discounts tailored to each group [18]. Driven by a differentiation strategy, hotel managers primarily implement price discrimination based on their experience and observable customer characteristics. To further boost hotel revenue, managers collect and integrate large volumes of multi-channel data into Revenue Management Systems (RMSs), thereby enabling more effective implementation of discriminatory pricing in practice [19,20]. Dynamic pricing involves continuously adjusting prices over time based on changes in supply and demand. Such strategies are widely adopted in hotel revenue management [5,21,22]. Meanwhile, numerous studies have focused on dynamic pricing approaches [8,23,24,25,26]. Consequently, both discriminatory pricing and dynamic pricing are attracting growing attention from both practitioners and academics.
True customer demand is ultimately determined by a complex interplay of various factors. To develop more effective pricing strategies, existing research has focused on demand forecasting and the optimization of pricing models. Customer demand forecasting serves as a prerequisite for formulating pricing strategies and holds significant strategic value and theoretical importance [27]. Currently, research on hotel room demand forecasting is primarily based on historical data. For instance, Zhang and Niu [28] used two years of weekly data derived from online reviews to forecast hotel demand. High-frequency data, such as hourly and daily records, are commonly used for short-term demand forecasting, as in Huang et al. [29]. However, a growing body of research has highlighted the significant limitations of forecasts that rely on a single data source, and recent studies have explored the integration of multiple data sources to enhance forecasting accuracy [30]. By incorporating diverse data that capture various factors influencing future hotel demand, integrated models offer substantial improvements in forecasting performance.

2.2. Pricing Models for Hotel Rooms

Optimization methods are another key factor influencing pricing strategies. Traditional hotel pricing optimization methods primarily utilize approaches such as game theory and nonlinear programming, among others. For instance, Yang and Xia [24] introduced a dynamic pricing game model that demonstrates the existence of an equilibrium pricing strategy under competitive conditions, along with a method for generating viable dynamic prices contingent on the number of competing firms. Aziz et al. [7] and Fadly et al. [8] developed nonlinear programming models to tackle dynamic room pricing challenges. On another front, Vives and Jacob [22] optimized hotel room pricing by proposing a stochastic dynamic pricing model that incorporates an online demand function. Additionally, Bayoumi et al. [25] devised a dynamic pricing approach based on price multipliers and employed a Monte Carlo simulator to identify optimal multiplier values, thereby assisting hotels in maximizing revenue.
However, as research advances and aligns more closely with practical realities, room pricing models have grown in complexity. Reinforcement learning, a cutting-edge area of machine learning known for its strong performance in complex and dynamic environments, has been increasingly applied to complex dynamic pricing problems [31]. Lu et al. [32] introduced a deep reinforcement learning algorithm for dynamic pricing in energy management within hierarchical electricity markets; they formulated the pricing process as a discrete finite Markov decision process (MDP) and employed Q-learning to solve the resulting decision-making problem. Qiao et al. [15] proposed a distributed pricing framework by modeling the MPPDP problem as a fully cooperative Markov game, which was then addressed using multi-agent reinforcement learning (MARL); they developed two efficient distributed dynamic pricing algorithms based on MARL: Counterfactual Q-learning and Counterfactual DQN. Lawhead and Gosavi [13], Bondoux et al. [14], and Lange et al. [16] applied reinforcement learning algorithms to revenue management, substantially increasing ticket revenues for a new airline. In the field of hotel room pricing, Tuncay et al. [17] applied Q-learning to a Turkish hotel, creating the Q-table from data on hotels with different characteristics and densities. The Q-table method is only applicable in environments with simple state and action spaces, making it unsuitable for more complex problems, such as those involving continuous pricing or large state spaces. Currently, there remains a scarcity of research that integrates hotel pricing models with deep reinforcement learning to develop more efficient pricing strategies.

2.3. Pricing Fairness

Price discrimination often raises concerns regarding pricing fairness. When customers perceive unfairness, they may feel a sense of betrayal, which can ultimately erode trust in the company. Judgments of price fairness are generally influenced by the reasons provided for price changes [33]. Some studies have explored the causes and consequences of perceived unfairness in pricing. In the context of hotel room pricing, research focuses on the relationship between social welfare and dynamic price discrimination strategies. For example, Li and Jain [34] utilized a game-theoretic model to analyze the interactions among fairness perceptions across different customer segments, retailer pricing strategies, and their impacts on profit, consumer surplus, and social welfare. Similarly, Kallus and Zhou [35] examined how fairness concerns and welfare outcomes interact in the context of personalized pricing based on customer characteristics. An additional perspective on fairness in pricing is the concept of “inter-temporal fairness”, introduced by Gupta and Kamble [36]. This principle asserts that individuals should be treated fairly across time—accounting for both their past interactions and future engagements. Pricing fairness is increasingly being integrated into optimization models to achieve more practical and sustainable revenue outcomes. For instance, Cohen et al. [9,10] incorporated fairness constraints into their model and developed an algorithm based on the Upper Confidence Bound (UCB) method. This fairness-aware algorithm provides relative price stability for each customer segment under the given time frames and customer traffic conditions; however, it is not well-suited for highly uncertain and rapidly evolving future markets. Additionally, they defined the fairness gap within the fairness constraints as the price range of the optimal pricing strategy, which may not align with customers’ perceptions of fairness. Further research is still needed to develop a hotel room pricing method that can adapt to rapid market changes while considering price fairness constraints.

3. Hotel Room Pricing Model

3.1. Model Assumptions

To help hotels formulate appropriate pricing strategies for revenue maximization and explore the impact of fair pricing on hotel room revenue, we propose a hotel room pricing model considering fairness constraints; this model can achieve optimal dynamic discriminatory pricing based on distinct customer segments. Hotel room managers make price decisions based on customers’ demand for rooms and the number of available rooms, while customers decide whether to check in according to the room price. Therefore, the room price is set as the decision variable in the model. To facilitate model construction, we select a hotel as the research object, and make the following assumptions:
(1)
All rooms in the hotel are homogeneous, that is, we only consider a single room type.
(2)
Customers can be segmented into distinct groups, where individuals within the same group are assumed to be homogeneous. This homogeneity means that all members of a group share similar preferences for hotel room prices and demonstrate consistent levels of concern about price fairness.
(3)
For customers staying consecutively, room prices remain consistent with the initial booking price throughout their entire stay and will not fluctuate.
(4)
In the initial state of our simulation, all the hotel rooms are unoccupied.

3.2. Notations and Variable Definitions

Assume the hotel has $M$ rooms, each with a cost of $C$. Based on their spending characteristics, customers are segmented into $N$ groups indexed by $i \in \{1, 2, \ldots, N\}$. The time horizon is a fixed period of $T$ days (e.g., 30 days), indexed by $t \in \{1, 2, \ldots, T\}$. Hotel managers set daily room prices for the different customer groups, denoted $p_{t,i} \in [\underline{p}, \overline{p}]$, where $[\underline{p}, \overline{p}]$ is a predetermined price range. The price that group $i$ customers most desire at time $t$ often differs from the actual price $p_{t,i}$; therefore, with a certain probability, some customers may give up booking this hotel because of the price. The probability that customers in group $i$ accept the room price $p_{t,i}$ is denoted $F_i(p_{t,i})$, which conforms to the probability distribution $F_i$. The customer traffic $Q_{t,i}$, representing the total number of group $i$ customers on day $t$ who consider this hotel as their intended destination, follows a Poisson distribution $P_i$. Thus, the number of group $i$ customers on day $t$ who actually plan to check in is $Q_{t,i} \cdot F_i(p_{t,i})$. The room accommodation demand generated on day $t$ is
$$D_t = \sum_{i=1}^{N} Q_{t,i} \cdot F_i(p_{t,i}). \tag{1}$$
When the total room demand exceeds the number of rooms available for booking on a given day, the hotel will stop accepting new reservations. Hence, the actual number of check-in customers U t can be expressed as
$$U_t = \begin{cases} M_t, & D_t \ge M_t \\ D_t, & D_t < M_t, \end{cases} \tag{2}$$
where $M_t$ denotes the number of remaining rooms on day $t$. Customers may stay for $d$ consecutive days with probability $Z(d)$, where $d \in \{1, 2, \ldots, L\}$ and $L$ denotes the maximum number of days customers may stay. If we consider the case of consecutive customer stays, Expression (2) is updated as follows:
$$U_t = \begin{cases} \min\{M, D_t\}, & t = 1 \\ \min\left\{M,\; D_t + \sum_{j=1}^{t-1}\sum_{d=j+1}^{t} Z(d)\, U_{t-j}\right\}, & 2 \le t < L \\ \min\left\{M,\; D_t + \sum_{j=1}^{L-1}\sum_{d=j+1}^{L} Z(d)\, U_{t-j}\right\}, & t \ge L. \end{cases} \tag{3}$$
The expressions $\sum_{j=1}^{t-1}\sum_{d=j+1}^{t} Z(d)\, U_{t-j}$ and $\sum_{j=1}^{L-1}\sum_{d=j+1}^{L} Z(d)\, U_{t-j}$ represent the cumulative number of rooms occupied by customers who checked in prior to day $t$. The profit earned by the hotel on day $t$ is
$$R_t = \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C), \tag{4}$$
where $U_{t,i}$ is the actual number of check-in customers in group $i$ on day $t$.
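To make the bookkeeping in Expressions (1)–(4) concrete, the following minimal sketch simulates one day of traffic, demand, capped check-ins, and profit. It is an illustration, not the paper's simulator: the Poisson means and prices are placeholders, the fairness factor (introduced in Expression (5) below) is set to one, and rationing demand proportionally across groups when capacity binds is our own simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

M, C = 150, 167.0                      # rooms and marginal cost (case-study values)
Z = [0.46, 0.34, 0.15, 0.035, 0.015]   # P(stay = d nights), d = 1..5

def acceptance(p, k=0.0366, p0=400.0):
    # Logistic acceptance probability F_i(p); calibration of group 1, Section 5.1
    return 1.0 / (1.0 + np.exp(k * (np.asarray(p) - p0)))

def day_profit(prices, lam, poisson_means, occupied_carryover=0.0):
    """One day of Expressions (1)-(4): traffic, demand, capped check-ins, profit."""
    Q = rng.poisson(poisson_means)                        # customer traffic Q_{t,i}
    demand = Q * acceptance(prices) * np.asarray(lam)     # per-group demand
    D_t = demand.sum()
    capacity = max(M - occupied_carryover, 0.0)           # rooms free today
    scale = min(1.0, capacity / D_t) if D_t > 0 else 0.0  # cap total check-ins
    U = demand * scale                                    # per-group check-ins U_{t,i}
    profit = float(np.sum(U * (np.asarray(prices) - C)))  # Expression (4)
    return U, profit

U, R = day_profit(prices=[380.0, 420.0, 470.0], lam=[1.0, 1.0, 1.0],
                  poisson_means=[60, 45, 30], occupied_carryover=40)
print(f"check-ins per group: {np.round(U, 1)}, profit: {R:.0f}")
```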
Our model integrates the fairness constraints with a specific focus on price fairness. Drawing on the definition of price fairness in [36], we adopt two types of price fairness: price fairness across groups and across time periods. For brevity, we call them group fairness and temporal fairness, respectively.
Definition 1 
(group fairness). For day $t$, the offered hotel room prices for customer groups $i$ and $j$ satisfy $|p_{t,i} - p_{t,j}| < \delta_t$ almost surely for $i \ne j$, where $\delta_t$ is a constant group fairness parameter.
Definition 2 
(temporal fairness). For customers of group $i$, the offered hotel room prices on days $t$ and $s$ satisfy $|p_{t,i} - p_{s,i}| < \sigma_i$ almost surely for $t \ne s$, where $\sigma_i$ is a constant temporal fairness parameter.
Li and Jain [34] studied pricing games between firms and consumers and found that price fairness concerns affect firms' pricing strategies and consumers' purchase decisions. When consumers perceive price unfairness, they may switch to purchasing products from other firms [6]. Similarly, price unfairness may lead customers to refrain from booking hotel rooms. Therefore, we incorporate the impact of price fairness on customer traffic into the model. Typically, the perception of price fairness is manifested in two ways [34]: (1) peer-induced fairness, in which customers compare their own prices with those of peer customers; and (2) experience-induced fairness, in which customers compare current prices with historical prices they have previously experienced. We define an influence factor $\lambda_i \in [0, 1]$ to represent the impact of price unfairness on the customer traffic of group $i$. The influence factor $\lambda_i$ incorporates both effects: the impact of peer-induced price unfairness and that of experience-induced price unfairness. Then, $D_t$ in Expressions (1)–(3) is transformed into
$$D_t = \sum_{i=1}^{N} Q_{t,i} \cdot F_i(p_{t,i}) \cdot \lambda_i. \tag{5}$$

3.3. Optimization Model

When hotel managers aim to maximize profits through daily pricing strategies without considering any constraints, the optimal pricing model can be formulated as an unconstrained optimization problem:
$$\max R = \sum_{t=1}^{T} \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C). \tag{6}$$
When fairness constraints are added to the unconstrained optimization problem (6), a constrained optimization problem is immediately obtained. However, the parameters $\delta_t$ and $\sigma_i$ defined in Definitions 1 and 2 are difficult to set. We introduce a fairness parameter $\alpha \in [0, 1]$ to constrain the daily room price differences both across customer groups and across time periods. In this paper, we define $\delta_t$ and $\sigma_i$ as
$$\delta_t = (1 - \alpha_g) \sup_i \{\Delta p_{t,i}^*\}, \tag{7}$$
and
$$\sigma_i = (1 - \alpha_t) \sup_t \{\Delta p_{t,i}^*\}. \tag{8}$$
$\Delta p_{t,i}^*$ denotes the absolute unfair price difference perceived by customer group $i$ at time $t$. $\sup_i \{\Delta p_{t,i}^*\}$ is the supremum of the absolute unfair price differences across all customer groups at time $t$, and $\sup_t \{\Delta p_{t,i}^*\}$ is the supremum for customer group $i$ over the entire time period. The absolute unfair price difference can also be referred to as the fairness gap. While Cohen et al. [9] used the price range of the optimal pricing strategy to represent the fairness gap, that approach may not align with customers' actual perceptions, thus hindering the practical application of the pricing model.
The parameters $\alpha_g$ and $\alpha_t$ denote the group fairness and temporal fairness parameters, respectively. By leveraging $\alpha_g$ and $\alpha_t$, hotel room managers can implement dynamic discriminatory pricing. When $\alpha_g = \alpha_t = 1$, the hotel room prices are uniform across all customer groups and time periods, corresponding to the fixed pricing model. Conversely, when $\alpha_g = 0$ or $\alpha_t = 0$, the respective fairness constraint is absent. If $\alpha_g = 0$ and $\alpha_t = 1$, the model reduces to the discriminatory pricing model, where each customer group is assigned a distinct yet time-invariant price across all periods. When $\alpha_g = 1$ and $\alpha_t = 0$, the model corresponds to the dynamic pricing model, which treats all customers as a single group whose price varies over time; and when $\alpha_g = \alpha_t = 0$, both constraints vanish and the model becomes the dynamic discriminatory pricing model. Thus, by introducing the group fairness and temporal fairness parameters, the four pricing models can be defined as a unified model.
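As a quick check of how $\alpha_g$ and $\alpha_t$ collapse the framework into the four models, the sketch below evaluates Expressions (7) and (8) at the four corner settings; the fairness gaps used here are made-up values, not the surveyed ones from Section 5.

```python
def fairness_bounds(alpha_g, alpha_t, sup_group_gap, sup_time_gap):
    """Expressions (7)-(8): price-gap bounds implied by the fairness parameters."""
    delta_t = (1 - alpha_g) * sup_group_gap   # max gap across groups on one day
    sigma_i = (1 - alpha_t) * sup_time_gap    # max gap across days for one group
    return delta_t, sigma_i

# Illustrative gaps: sup_i = 120, sup_t = 100 (not the surveyed case-study values)
for a_g, a_t, model in [(1, 1, "fixed pricing"),
                        (0, 1, "discriminatory pricing"),
                        (1, 0, "dynamic pricing"),
                        (0, 0, "dynamic discriminatory pricing")]:
    d, s = fairness_bounds(a_g, a_t, 120, 100)
    print(f"alpha_g={a_g}, alpha_t={a_t}: delta_t={d:>5}, sigma_i={s:>5}  ({model})")
```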
The influence of price fairness on customer traffic stems from perceptions of both group and temporal fairness. Thus, to characterize the overall influence of price fairness on the customer traffic of group $i$, we define $\lambda_i$ as a function of $\alpha_g$ and $\alpha_t$ rather than as a function of price differences.
The problem of the optimization of hotel room pricing can be formulated as follows:
$$\begin{aligned} \max\; & R(p_{t,i}) = \sum_{t=1}^{T} \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C) \\ \text{s.t.}\; & |p_{t,i} - p_{t,j}| \le (1 - \alpha_g) \sup_i \{\Delta p_{t,i}^*\}, \quad \forall\, i \ne j, \\ & |p_{t,i} - p_{s,i}| \le (1 - \alpha_t) \sup_t \{\Delta p_{t,i}^*\}, \quad \forall\, t \ne s, \\ & \underline{p} \le p_{t,i} \le \overline{p}, \quad \forall\, i, t. \end{aligned} \tag{9}$$
In optimization problem (9), the subscripts range over $i, j \in \{1, 2, \ldots, N\}$ and $t, s \in \{1, 2, \ldots, T\}$. $U_t$ is re-expressed as
$$U_t = \begin{cases} \min\left\{M,\; \sum_{i=1}^{N} Q_{t,i}\, F_i(p_{t,i})\, \lambda_i(\alpha_g, \alpha_t)\right\}, & t = 1 \\ \min\left\{M,\; \sum_{i=1}^{N} Q_{t,i}\, F_i(p_{t,i})\, \lambda_i(\alpha_g, \alpha_t) + \sum_{j=1}^{t-1}\sum_{d=j+1}^{t} Z(d)\, U_{t-j}\right\}, & 2 \le t < L \\ \min\left\{M,\; \sum_{i=1}^{N} Q_{t,i}\, F_i(p_{t,i})\, \lambda_i(\alpha_g, \alpha_t) + \sum_{j=1}^{L-1}\sum_{d=j+1}^{L} Z(d)\, U_{t-j}\right\}, & t \ge L. \end{cases} \tag{10}$$
It is worth noting that the model does not distinguish between room types; all rooms are treated as homogeneous. In real-world operations, hotel managers categorize rooms based on their configuration to meet different customer preferences. One solution is to train separate models for different room types and their corresponding customer groups. Another is to treat all rooms as homogeneous (as assumed in our model) and match customers with different room types based on the prices they pay for bookings, thereby meeting their preferences. This can also serve as a marketing measure to mitigate customers' perception of unfairness.

4. Two-Stage Deep Reinforcement Learning Algorithm

4.1. The MDP Model for Hotel Room Pricing

The proposed optimization model (9) is a nonlinear programming problem, and solving it with traditional solvers or heuristic algorithms becomes time-consuming for dynamic discriminatory pricing over an extended period. Moreover, even when an optimal solution is derived, the daily randomness in customer arrivals may render it suboptimal in practice. Therefore, a fast-solving algorithm that copes with random changes in customer traffic is urgently needed. Deep reinforcement learning addresses this problem effectively: although training takes a relatively long time, the trained model offers fast inference and can quickly compute an optimal decision scheme for any given state.
The hotel’s operational process over a given period can be formulated as an episodic Markov Decision Process (MDP). This process comprises five components: S (state), A (action), R (reward), P (state transition probability), and π (pricing strategy adopted by the hotel room manager). The hotel room manager can be regarded as an agent.
State: The agent formulates pricing schemes based on the remaining available rooms $M_t$ and the expected arrivals of each customer group $Q_{t,i}$ for day $t$. Therefore, the state on day $t$ is denoted $s_t = (M_t, Q_{t,1}, Q_{t,2}, \ldots, Q_{t,N})$.
Action: The action $a_t$ taken by the agent on day $t$ is the set of room prices for each customer group, i.e., $a_t = (p_{t,1}, p_{t,2}, \ldots, p_{t,N})$.
Reward: When the agent takes the pricing action $a_t$ based on state $s_t$, an immediate reward $r_{t+1}$ is received from the environment. Specifically, this reward is defined as the room profit earned by the hotel on day $t$, i.e., $r_{t+1} = \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C)$.
State transition: $P(s_{t+1} \mid s_t, a_t)$ represents the probability of transitioning to the next state $s_{t+1}$ given the current state $s_t$ and pricing action $a_t$. Given the stochastic nature of customer arrivals, precisely describing the state transition probability is challenging. To address this issue, we assume equal probabilities of customer arrival and refusal for each group in the environment when the hotel rooms are fully occupied.
Policy: The policy π determines the action to be taken in a specific state s t . Specifically, the agent decides on the room pricing strategy for different customer groups for the next day based on the state during period T. This policy is defined as a policy network that predicts an action.
The framework of the DRL-driven hotel room pricing model is shown in Figure 1.
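For illustration, the MDP above can be rendered as a Gym-style environment. The following is a minimal sketch under stated assumptions: the acceptance curve is a placeholder logistic shared by all groups, stays last one night (so the multi-day recursion of Expression (10) is omitted), demand is rationed proportionally when rooms run out, and the class and method names are our own.

```python
import numpy as np

class HotelPricingEnv:
    """Episodic MDP of Section 4.1 (simplified sketch, not the paper's code)."""

    def __init__(self, M=150, C=167.0, T=30,
                 poisson_means=(60, 45, 30), price_range=(300.0, 640.0)):
        self.M, self.C, self.T = M, C, T
        self.means = np.array(poisson_means, dtype=float)
        self.p_low, self.p_high = price_range
        self.rng = np.random.default_rng()

    def reset(self):
        self.t = 0
        self.rooms_free = self.M               # assumption (4): all rooms empty
        self.Q = self.rng.poisson(self.means)  # expected arrivals Q_{t,i}
        return self._state()

    def _state(self):
        # s_t = (remaining rooms, expected traffic of each group)
        return np.concatenate(([self.rooms_free], self.Q)).astype(np.float32)

    def step(self, action):
        # a_t = one price per group, clipped to the admissible range
        prices = np.clip(np.asarray(action, dtype=float), self.p_low, self.p_high)
        accept = 1.0 / (1.0 + np.exp(0.0366 * (prices - 400.0)))  # placeholder F_i
        demand = self.Q * accept
        total = demand.sum()
        scale = min(1.0, self.rooms_free / total) if total > 0 else 0.0
        U = demand * scale                              # check-ins capped by free rooms
        reward = float(np.sum(U * (prices - self.C)))   # r_{t+1}: daily profit
        # Simplification: one-night stays only, so every room frees up again;
        # the paper instead carries multi-day stays via Expression (10).
        self.t += 1
        self.rooms_free = self.M
        self.Q = self.rng.poisson(self.means)
        done = self.t >= self.T
        return self._state(), reward, done, {}
```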

4.2. Algorithm

The supremum of price differences, $\sup_t \{\Delta p_{t,i}^*\}$, depends on the generated price trajectory. Consequently, the conventional exploration methods employed in deep reinforcement learning algorithms cannot handle the temporal constraint. To solve the proposed hotel room pricing model incorporating both group and temporal fairness constraints, we design a two-stage deep reinforcement learning algorithm. In Stage I, we train a deep reinforcement learning algorithm under the group fairness constraint and derive an optimal solution via inference with the trained policy network. The temporal fairness constraint restricts the price variation range over the time period; therefore, in Stage II, the deep reinforcement learning algorithm is trained further, starting from the Stage I optimum, with actions clipped to the constraint range. Once the hotel room pricing environment (Line 1 of Algorithm 1) is constructed, a deep reinforcement learning algorithm, such as Proximal Policy Optimization (PPO) [37], Deep Deterministic Policy Gradient (DDPG) [38], or Twin Delayed Deep Deterministic policy gradient (TD3) [39], may be selected for training.
The steps of our two-stage deep reinforcement learning algorithm are specified in Algorithm 1.
The embedding representation (Lines 6, 12 and 24) maps the integers in the state to multi-dimensional vectors, which serve as the input for network training. As a commonly used technique in neural network training, the embedding function maps discrete integers into dense, low-dimensional vectors. It essentially functions as a lookup table with dimensions (vocabulary size, embedding dimension). When an integer sequence is provided as input, the embedding function retrieves the corresponding vectors by index, producing an output tensor of shape (batch size, sequence length, embedding dimension). These vectors are initialized randomly and learned during training. In effect, the embedding layer is equivalent to a fully connected layer applied to one-hot encoded integers: it outputs dense vectors and trains an embedding matrix that serves as the embedding representations.
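As a concrete example of this embedding step (the vocabulary size and embedding dimension below are illustrative, not the paper's settings), PyTorch's `nn.Embedding` performs exactly this lookup:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 200, 8        # illustrative: integers 0..199 -> 8-dim vectors
embed = nn.Embedding(vocab_size, embed_dim)

# A batch of two states: (remaining rooms, Q_1, Q_2, Q_3) as integers
states = torch.tensor([[150, 62, 41, 28],
                       [ 97, 55, 47, 31]])          # shape (batch, seq_len) = (2, 4)

vectors = embed(states)                             # shape (2, 4, 8): one vector per index
policy_input = vectors.flatten(start_dim=1)         # (2, 32), fed to the policy network
print(policy_input.shape)
```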
Algorithm 1: Two-stage deep reinforcement learning algorithm
(Algorithm 1 is presented as a pseudocode figure in the original publication; the line numbers referenced in this subsection follow that listing.)
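Since the pseudocode itself is not reproduced here, the following structural sketch outlines the two-stage control flow as described above. It is not the authors' exact listing: `scale_to_gap` is the repair rule sketched after Expressions (11)–(13) below, `env` and `agent` are duck-typed placeholders (e.g., the environment of Section 4.1 and a PPO learner), and centering the Stage II clipping band on the per-group mean of the Stage I trajectory is our reading of 'clipping actions to the constraint range'.

```python
import numpy as np

def two_stage_train(env, agent, delta_t, sigma_i,
                    episodes_stage1=2000, episodes_stage2=500):
    """Structural sketch of the two-stage algorithm (not the authors' listing)."""
    # Stage I: learn under the group fairness constraint only; actions that
    # violate it are repaired by a normalized scaling rule (Expr. (11)-(13)).
    for _ in range(episodes_stage1):
        state, done = env.reset(), False
        while not done:
            prices = agent.act(state)
            if prices.max() - prices.min() > delta_t:
                prices = scale_to_gap(prices, delta_t, anchor="min")
            next_state, reward, done, _ = env.step(prices)
            agent.store(state, prices, reward)
            state = next_state
        agent.update()

    # Inference with the Stage I policy yields a reference trajectory; its
    # per-group mean anchors a band of width sigma_i, and any two prices
    # inside that band satisfy the temporal fairness constraint.
    reference = np.mean([agent.act(s, deterministic=True)
                         for s in env.rollout_states()], axis=0)
    low, high = reference - sigma_i / 2, reference + sigma_i / 2

    # Stage II: continue training with actions clipped into the band.
    for _ in range(episodes_stage2):
        state, done = env.reset(), False
        while not done:
            prices = np.clip(agent.act(state), low, high)
            next_state, reward, done, _ = env.step(prices)
            agent.store(state, prices, reward)
            state = next_state
        agent.update()
    return agent
```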
When the action fails to satisfy the group fairness constraint in Stage I, the normalized scaling method (Line 8) is applied. Three candidate selection methods are defined, with their corresponding expressions presented as follows:
$$p_{t,i} \leftarrow p_{t,i} + \left(\frac{\sum_{i=1}^{N} p_{t,i}}{N} - p_{t,i}\right)\left(1 - \frac{\delta_t}{\max_i p_{t,i} - \min_i p_{t,i}}\right), \tag{11}$$
$$p_{t,i} \leftarrow \min_i p_{t,i} + \left(p_{t,i} - \min_i p_{t,i}\right) \frac{\delta_t}{\max_i p_{t,i} - \min_i p_{t,i}}, \tag{12}$$
$$p_{t,i} \leftarrow \max_i p_{t,i} - \left(\max_i p_{t,i} - p_{t,i}\right) \frac{\delta_t}{\max_i p_{t,i} - \min_i p_{t,i}}. \tag{13}$$
Expressions (11)–(13) can be defined as mean-normalized, min-normalized, and max-normalized scaling methods, respectively. The criterion for method selection depends primarily on the model’s training performance.
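A direct implementation of the three rules might look as follows; each rescales a price vector whose spread exceeds $\delta_t$ onto one whose spread equals exactly $\delta_t$, anchored at the mean, minimum, or maximum price respectively (a sketch consistent with Expressions (11)–(13), not production code).

```python
import numpy as np

def scale_to_gap(p, delta, anchor="min"):
    """Rescale a day's prices so max(p) - min(p) == delta (cf. Expr. (11)-(13))."""
    p = np.asarray(p, dtype=float)
    spread = p.max() - p.min()
    if spread <= delta:                  # already satisfies the group constraint
        return p
    k = delta / spread
    if anchor == "mean":                 # Expression (11): preserves the mean price
        return p.mean() + (p - p.mean()) * k
    if anchor == "min":                  # Expression (12): preserves the lowest price
        return p.min() + (p - p.min()) * k
    if anchor == "max":                  # Expression (13): preserves the highest price
        return p.max() - (p.max() - p) * k
    raise ValueError(f"unknown anchor: {anchor}")

prices = [380.0, 450.0, 560.0]           # spread 180 violates, e.g., delta = 60
for a in ("mean", "min", "max"):
    print(a, scale_to_gap(prices, 60.0, anchor=a).round(1))
```

Consistent with Section 5.1, the min-anchored rule keeps the lowest price fixed and pulls the others down toward it, which is why it tends to generate lower prices overall.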
The customer information table (Line 10), $info\_C_t$, stores the stay durations and group information of all customers at time $t$. Using $info\_C_t$, we can calculate the remaining rooms for state $s_{t+1}$.
After the training process, hotel room managers can perform inference using the policy network to generate pricing schemes. Once the customer group traffic is predicted and the number of unoccupied rooms for the next day is calculated, the state vector s t is obtained. Next, the embedding function converts the input into a state embedding vector, which serves as the actual input to the policy network. Following linear computation and non-linear activation in the hidden layers, the model ultimately outputs the prediction results. This process is also referred to as the forward propagation of the policy network. The forward propagation process of a neural network (i.e., inference) is extremely fast. In contrast, traditional solvers and heuristic algorithms, after time-consuming computations, can only provide fixed optimal solutions over the entire time period, rather than an optimal action tailored to the daily variations in customer group traffic and the remaining rooms.
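A minimal sketch of this inference step, assuming a tanh-bounded policy head whose output is rescaled to the admissible price range (the shapes, scaling, and function names are our assumptions, not the paper's code):

```python
import torch

@torch.no_grad()
def infer_prices(policy_net, embed, rooms_free, traffic_forecast,
                 p_low=300.0, p_high=640.0):
    """One forward pass of the trained policy network -> tomorrow's prices."""
    # Integer state (remaining rooms, per-group traffic forecast) -> embedding
    state = torch.tensor([[rooms_free, *traffic_forecast]], dtype=torch.long)
    x = embed(state).flatten(start_dim=1)
    raw = policy_net(x)                           # assumed tanh head in [-1, 1]
    prices = p_low + (raw + 1.0) / 2.0 * (p_high - p_low)
    return prices.squeeze(0).tolist()             # one price per customer group
```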

5. Case Study

5.1. Target Hotel and Model Parameters

In this section, we validate the proposed model and algorithm through a case study of a hotel located in Qingdao, China. As a large five-star non-downtown hotel in a tourist city, it exhibits a marked contrast between the off-season and peak-season, and its customer traffic is significantly influenced by the peak and off-peak tourist seasons. Accordingly, this paper examines both scenarios to analyze the fairness characteristics of pricing strategies. Our study is based on the hotel's historical operational data and market research from 2019, 2023, and 2024, excluding the period affected by the COVID-19 pandemic. During that period, the hotel room market was subject to government intervention and was not a rational competitive market, as both hotel room managers and consumers exhibited numerous irrational decision-making behaviors; this exceptional period therefore falls outside the applicable scope of the model.
Both dynamic discriminatory pricing and dynamic pricing are commonly applied in practice at the target hotel; therefore, we use both to simulate the effects of the pricing model established in Section 3. To compare the effects of the fairness parameters $\alpha_g$ and $\alpha_t$, we define four models: dynamic discriminatory pricing neglecting the fairness influence on customer traffic (DDP-N), dynamic discriminatory pricing considering the fairness influence on customer traffic (DDP-C), dynamic pricing neglecting the fairness influence on customer traffic (DP-N), and dynamic pricing considering the fairness influence on customer traffic (DP-C).
This hotel has 150 standard rooms, each with a marginal cost of ¥167. Based on historical data, we set the pricing ranges as [¥300, ¥640] for the off-season and [¥600, ¥1400] for the peak-season. Customers can stay for a maximum of 5 days, with the probability distribution of their stay duration given by $Z(d) = \{0.46, 0.34, 0.15, 0.035, 0.015\}$. Customers are classified into three groups according to their price sensitivity. Referring to the studies by [40,41], we define customer traffic as following a Poisson distribution $P_i$. In the training process of deep reinforcement learning, customer traffic is sampled according to $P_i$. In demand models, three functions are most commonly used to describe the relationship between demand and price: linear, logistic, and exponential [9]. In this paper, we define the probability $F_i$ that customers in group $i$ accept the price as a logistic function. For example, for customer group 1 on weekdays during the off-season, $F_1(p_{t,1}) = 1/(1 + e^{0.0366 (p_{t,1} - 400)})$. Once a price is given, the probability that customers in group 1 accept it can be calculated from the distribution function $F_1$.
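This calibration can be sanity-checked directly: at the reference price of ¥400 the acceptance probability of group 1 is exactly 0.5, and it decays quickly as the price rises toward the ceiling.

```python
import math

def F1(p):
    """Group 1 off-season weekday acceptance probability (Section 5.1)."""
    return 1.0 / (1.0 + math.exp(0.0366 * (p - 400.0)))

for p in (300, 400, 500, 640):
    print(f"price {p}: acceptance {F1(p):.3f}")
# -> 0.975 at ¥300, 0.500 at ¥400, 0.025 at ¥500, ~0.000 at ¥640
```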
In the studies by [34,42], consumption utility is treated as a linear function of the degree of consumers' fairness concerns. Within the context of hotel room pricing, customers' utility directly influences their decisions regarding room bookings. Accordingly, we define the linear fairness degree function $\lambda_i$ as a coefficient that affects customer traffic. Based on customer surveys, historical data, and the operational experience of the hotel's room managers, we evaluate and formulate $P_i$, $F_i$, and $\lambda_i$ for each group across both the off-season and peak-season, encompassing weekdays and weekends, as specified in Table 1. The probability distribution functions $F_i$ are shown in Figure 2. The values of $\sup_i \{\Delta p_{t,i}^*\}$ and $\sup_t \{\Delta p_{t,i}^*\}$ are evaluated based on customer survey data. During the off-season, the value of $\sup_i \{\Delta p_{t,i}^*\}$ on weekdays is ¥120, while on weekends the values are ¥160 and ¥130. Correspondingly, the values of $\sup_t \{\Delta p_{t,i}^*\}$ for the customer groups on weekdays and weekends are [¥90, ¥100, ¥120] and [¥180, ¥200, ¥240], respectively. In the peak-season, the weekday value of $\sup_i \{\Delta p_{t,i}^*\}$ is ¥200, with the weekend values being ¥250 and ¥220. Similarly, the values of $\sup_t \{\Delta p_{t,i}^*\}$ on weekdays and weekends are [¥180, ¥200, ¥240] and [¥240, ¥260, ¥320], respectively.
PPO is selected as the primary training algorithm for our hotel room pricing model. Both the actor network and the critic network consist of two hidden layers, each containing 64 neurons. The actor network uses the tanh activation function, while the critic network employs ReLU. For the PPO algorithm, the parameters are set as follows: discount rate $\gamma = 1$; learning rate of the actor network $\phi_a = 0.0001$; learning rate of the critic network $\phi_c = 0.0002$; both update steps are set to 10; the update batch size is 32; the clipping hyperparameter $\epsilon = 0.2$; and the number of training episodes is 2000 and 4000 in Stage I and 500 and 1000 in Stage II, corresponding to the off-season and peak-season, respectively.
In selecting the normalized scaling method for actions, the min-normalized scaling method (Expression (12)) tends to generate lower prices, which helps boost customer traffic; therefore, this method is adopted during the off-season training process. For the peak-season, we use the max-normalized scaling method (Expression (13)); with this setting, the profit performance is better than with the mean-normalized scaling method (Expression (11)).
The time horizon is set to 30 days; that is, each episode runs for 30 timesteps. Actions are continuous values; therefore, we utilize the continuous-action variant of the PPO algorithm, in which the policy network typically outputs the mean (action) and variance of a normal distribution. In our work, the policy network outputs only the action mean; the variance, which controls exploration, is scheduled to vary within a fixed range as training proceeds.
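A brief sketch of such an action head, assuming a linear decay schedule for the exploration standard deviation (the schedule and bounds are illustrative, not the paper's settings):

```python
import torch
from torch.distributions import Normal

def sample_action(mean_action, step, total_steps, std_max=0.5, std_min=0.05):
    """Continuous PPO action: the network outputs only the mean; the std
    follows a fixed schedule to control exploration (illustrative decay)."""
    frac = min(step / total_steps, 1.0)
    std = std_max + (std_min - std_max) * frac     # linear decay over training
    dist = Normal(mean_action, std)
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)   # log-prob for the PPO ratio

mu = torch.tensor([0.2, -0.1, 0.4])                # normalized price per group
action, logp = sample_action(mu, step=1000, total_steps=60_000)
```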

5.2. Results Analysis

Four pricing models (DDP-N, DDP-C, DP-N, and DP-C) are trained using deep reinforcement learning algorithms; the convergence curves of training rewards are shown in Figure 3. Comparing the normalized scaling methods shows that the mean-normalized method performs worse in both the off-season and peak-season. Therefore, the min-normalized scaling method for the off-season and the max-normalized scaling method for the peak-season are adopted as the primary results of this study. For DDP-C, $\alpha_g$ and $\alpha_t$ are both set to 0.5. For DP-C, $\alpha_g = 1$ and $\alpha_t = 0.5$. Due to the absence of fairness constraints, DDP-N and DP-N are trained with classical PPO, whereas DDP-C and DP-C use the two-stage PPO algorithm proposed in Section 4.2. The vertical red dotted lines at episodes 2000 and 4000 mark the boundary between the two training stages. The PPO algorithm demonstrates favorable reward convergence for all four models after a sufficient number of training episodes. Stage II training can improve the reward to some extent while ensuring the temporal fairness constraint is satisfied (see the curves to the right of the vertical red dotted lines).
Whether in the off-season or peak-season, the DDP-N (the red line) achieves a better reward than the DP-N (the yellow line). This indicates that dynamic discriminatory pricing contributes to higher profits compared to dynamic pricing alone. When considering the impact of fairness constraints on customer traffic, dynamic discriminatory pricing (the green line) outperforms dynamic pricing (the blue line) slightly. This suggests that the benefits of discriminatory pricing can offset the effects of group fairness on customer traffic. The gaps between the curves in the off-season are larger than those in the peak-season, indicating that fairness has a more significant impact on hotel profits when customer traffic is scarce.
Upon completion of the training process, the optimal pricing strategy can be inferred via the policy network. Figure 4 shows the optimal pricing strategies for all customer groups during both the off-season and peak-season under different normalized scaling methods. This example is based on the parameter settings $\alpha_g = 0.5$ and $\alpha_t = 0.5$. The black lines represent the optimal pricing strategies generated by the Stage I trained model: the black dotted lines indicate the optimal pricing strategy for each customer group, while the black solid line shows the average room price across all groups. Similarly, the red lines correspond to the results from Stage II: the red dotted lines denote the optimal pricing strategy for each customer group, and the red solid line represents the average room price. Prior to Stage II training, for customer group 1 on weekdays during the off-season, the minimum price under the mean-normalized action scaling method is ¥327 on day 4, with 11 days exceeding the temporal constraint; during the peak-season, 6 days exceed this constraint. Under the min-normalized action scaling method during the peak-season, only day 14 exceeds the temporal constraint, and the prices are lower than those of the mean-normalized scaling method. Following the two-stage training process, every developed pricing strategy satisfies the specified group and temporal constraints. The prices generated by the Stage I model fluctuate more dramatically over time than those from Stage II, which demonstrates the constraining effect on the optimal solution achieved by our two-stage reinforcement learning algorithm.
We employ key performance indicators, including Mean Room Rate (MRR), Profit, Daily Guest Acquisition (DGA), and Room Occupancy Percentage (ROP), to evaluate and compare performance across the models. The results are shown in Table 2. When fairness considerations are excluded, the dynamic discriminatory pricing model DDP-N achieves a hotel profit of ¥0.835 million during the off-season, compared to ¥0.783 million for the dynamic pricing model DP-N, a 6.64% profit increase under the DDP-N framework. In the peak-season, the DDP-N model yields a 3.96% higher profit than the DP-N model. This shows that differential pricing is more effective in optimizing revenue under off-season conditions, where room supply exceeds demand. When the impact of fairness pricing is considered, the decline in customer traffic directly translates into reduced room profitability. When the fairness parameters are set to $\alpha_g = 0.3$ and $\alpha_t = 0.5$, the profit drops to ¥0.734 million, and the MRR falls sharply from ¥371 to ¥348. Lowering room rates to increase customer traffic can effectively offset the traffic loss caused by fairness factors, thereby boosting profits. Both profit and ROP improve as fairness increases. Specifically, when the fairness parameters are set to $\alpha_g = 0.5$ and $\alpha_t = 0.7$, the profit rises to ¥0.770 million, while the ROP increases from 86.5% to 90.9%. During the peak-season, all models achieve 100% ROP while demonstrating significantly higher DGA than in the off-season. The DDP-N model generates optimal profits of ¥2.913 million, maintaining its superior performance over both the DP-N and DDP-C models, consistent with the off-season. When the impact of fairness is considered, both MRR and profit exhibit trends similar to those of the off-season.

5.3. Algorithm Comparison

In addition to PPO, we also utilize three other deep reinforcement learning algorithms—Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Actor-Critic (AC)—to validate the proposed two-stage deep reinforcement learning algorithm. For the comparative algorithms DDPG, TD3, and AC, the same network architecture is adopted. The hyperparameters are listed in Table 3. The training results are presented in Figure 5. For the DDP-N model, PPO, DDPG, and TD3 exhibit the same convergence trend and converge to a similar reward level during both the off-season and peak-season. In contrast, AC fluctuates significantly. Although it shows an improving trend, its instability limits its applicability. Therefore, we only implement it for the DDP-N model during the off-season. For the DDP-C model, the comparative algorithms exhibit certain differences in performance.
Deep reinforcement learning algorithms often require substantial computational time for training. Under a computer configuration equipped with an Intel(R) Core(TM) i7-1165G7 CPU (2.80 GHz) and 16 GB of RAM, the training times for the PPO, DDPG, TD3, and AC algorithms are about 20, 35, 46, and 22 min for off-season of DDP-N, respectively. Each algorithm requires approximately 0.1 s of inference time for a 30-day period. Overall, PPO outperforms other comparative algorithms; therefore, PPO is selected as the primary algorithm for this study.
Nonlinear programming (NLP) solvers and heuristic algorithms are widely used to address optimization problems. However, solvers have difficulty handling non-continuous problems and uncertain problems involving random variables. In our model, Equation (10) is markedly non-continuous, and the customer length of stay follows a discrete probability distribution. These characteristics render the problem intractable for standard solvers. Consequently, we employ a heuristic algorithm, specifically Particle Swarm Optimization (PSO) [23], as a benchmark against the reinforcement learning approach. The solution results and historical data are shown in Table 4. For the DDP-C model with $\alpha_g = 0.5$ and $\alpha_t = 0.5$, PSO outperforms PPO in terms of profit, because the fairness constraints narrow the search range and accelerate convergence. Owing to differences in their design mechanisms, PSO is more aggressive in the optimization process, and its results are slightly better than those of PPO, a common phenomenon in solving optimization problems. In contrast, for the DDP-N model, PSO performs significantly worse than PPO; its off-season and peak-season profits reach only ¥0.638 million and ¥2.398 million, respectively. When using the DDP-C model, the PPO algorithm achieves a 7% and 4.8% profit increase during the off-season and peak-season, respectively, compared to the corresponding historical profit (i.e., the average monthly profit).

5.4. Supplementary Experiments

To further validate the effectiveness of our model, we conducted experiments on two additional hotels located in Qingdao, China. Hotel BH is a four-star hotel with 110 rooms whose scale and infrastructure are inferior to those of the case hotel. Hotel HJ is a large hotel meeting five-star standards with 280 rooms, whose infrastructure and customer groups are similar to those of the case hotel. The training results of both hotels obtained with the PPO algorithm are presented in Figure 6; the four models (for the off-season and peak-season, respectively) exhibit excellent convergence characteristics. Additionally, we compare the profit results against the PSO algorithm and historical data, as shown in Table 5. For the DDP-N model, the PPO algorithm performs significantly better than the PSO algorithm; in contrast, the PSO algorithm is slightly superior to PPO for the DDP-C model. Both algorithms achieve better results than the historical data, consistent with the findings for the case hotel.
To analyze the impact of the number of customer segments on training performance, we split the three customer groups of the case hotel (Section 5.1) into 5 and 7 groups, respectively. To facilitate comparison, we only split the customer traffic data while keeping all other parameters unchanged. Simulation experiments based on the DDP-C model with the PPO algorithm were then conducted for both the off-season and peak-season. The training results after 5000 episodes are presented in Figure 7. The reward curves of the three segmentations exhibit a consistent convergence trend. During the off-season, the optimal profits of the predicted pricing strategy, derived from the trained policy network, reach ¥0.752 million, ¥0.784 million, and ¥0.776 million for the 3-group, 5-group, and 7-group segmentations, respectively; during the peak-season, the corresponding profits are ¥2.785 million, ¥2.831 million, and ¥2.746 million. In theory, more segments enable more refined pricing and thus higher profits; however, they also increase the dimensionality of the action space, making model training more complex. The 5-group segmentation achieves the best performance during both the off-season and peak-season. In the off-season, the 7-group segmentation outperforms the 3-group model, whereas during the peak-season, the 3-group segmentation performs better than the 7-group model. In practice, a balance must be struck between the number of customer segments, expected profit performance, and model training time.

5.5. Discussion

Typically, pricing problems are formulated as either single-objective or multi-objective optimization models. Cohen et al. [9] proposed a price discrimination model with fairness constraints, and in 2025, Cohen et al. [10] further developed a dynamic pricing model; both are structured as single-objective optimization problems. In contrast, Kallus and Zhou [35] formulated a personalized pricing model as a multi-objective optimization problem. To facilitate solving such models, the objective functions must satisfy the assumptions of differentiability and quasi-concavity. However, these classic pricing models, which apply to typical seller-customer scenarios, differ from those tailored to hotel room pricing. In our optimization model (9), the objective function fails to satisfy the assumptions of differentiability and quasi-concavity due to the limitation imposed by room quantity (see Expression (10)). Therefore, hotel room pricing models require specialized techniques. For instance, Bayoumi et al. [25] proposed a hotel room pricing model in which a reference price is multiplied by five price multipliers. Ref. [8], building on Aziz et al. [7], defined hotel room pricing models as single-objective nonlinear optimization problems and identified the elasticity range that guarantees a feasible solution to the model. Our model differs from those of [7,8], which categorize hotel rooms into different groups with varying prices; in contrast, we segment customers based on their consumption ability without classifying rooms. Our model is specifically designed for a pricing scenario in which customer profiling based on behavioral data analysis is used to implement differential pricing.
In terms of solution methods for hotel room pricing models, ref. [25] employed a Monte Carlo simulation-based optimization algorithm to test various multiplier values and determine the optimal prices. When dealing with a large number of optimization variables, the efficiency of the Monte Carlo method becomes very low, and it may even fail to produce a solution. The models in [7,8] were solved using nonlinear solvers available in optimization software toolboxes. These solvers employ an iterative approach to converge toward an approximate optimal solution that satisfies the problem constraints. However, this method becomes computationally expensive as the number of variables grows. For instance, solving the model proposed by [7] takes approximately 30 min. Meanwhile, the optimal solution yields fixed prices throughout the entire period. When customer numbers fluctuate and deviate from predictions, this optimal solution may well become non-optimal. Consequently, such solution methods have limitations in practical applications. Reinforcement learning algorithms are capable of addressing these issues. Tuncay et al. [17] proposed a Q-learning algorithm to optimize room pricing across multiple hotels. Different from our model, ref. [17] did not consider the fairness constraints and discriminatory pricing problem. In our model, the action is treated as a continuous price variable for each customer group, unlike the discrete action values used in [12,17]. Consequently, our model exhibits stronger representational power for room pricing.
Optimal results from the model neglecting fairness perception (DDP-N) indicate that dynamic discriminatory pricing can yield higher profits, which aligns with the findings of [43,44]. However, when the impact of fairness on customer traffic is considered, optimal profits decrease; this result is consistent with Li and Jain [34], who also observed that fairness concerns lead to lower profits in the second game period. Abrate et al. [45] found that prices tend to increase when hotel availability is scarce; therefore, the higher prices and larger suprema of the absolute unfair price differences during weekends and the peak-season are reasonable and expected, and our results are highly consistent with this characteristic.
The pricing strategies of competitors serve as important references for hotel room managers. Competitors' price changes alter consumers' price sensitivity, thereby affecting the hotel's actual customer traffic. If the price sensitivity of a customer group (i.e., $F_i$ in the model) changes significantly, the previously trained model will exhibit inference bias, resulting in ineffective pricing. Therefore, hotel room managers need to closely monitor changes in consumers' price sensitivity and fine-tune the model when $F_i$ changes significantly. Managers should adjust the $F_i$ parameter based on price sensitivity predictions [26], fine-tune the model starting from the pre-trained network structure and parameters, and then compute the pricing strategy. The model's pre-training can be conducted offline; the well-trained model is then integrated into the hotel room revenue management system for price prediction. When fine-tuning is required, a fixed number of episodes should be set before performing online fine-tuning. The time consumed by fine-tuning depends on the server configuration and the number of episodes, which requires case-specific configuration. In extreme cases, the model needs to be fine-tuned daily; however, based on the historical operational data of the three case hotels, the price elasticity of the various user groups remains basically unchanged within a week, with significant changes occurring in cycles of 1 to 2 months.

6. Conclusions

To address the challenge of setting hotel room prices to maximize revenue under fairness constraints, this paper proposes a unified pricing framework that integrates four distinct pricing models: fixed pricing, discriminatory pricing, dynamic pricing, and dynamic discriminatory pricing. Subsequently, a two-stage deep reinforcement learning algorithm is designed to train a policy network for predicting optimal prices. This model and algorithm are applied and validated using a case study. The PPO algorithm is employed to validate the efficacy of the proposed two-stage deep reinforcement learning approach over simulated 30-day off-season and peak-season periods. Furthermore, alternative algorithms, including DDPG and TD3, are also tested and proven effective, achieving similarly high levels of optimal revenue. The deep reinforcement learning approach generates optimal pricing decisions by responding to fluctuations in customer demand and room availability, making it more adaptable to real-world operational environments than conventional methods such as nonlinear programming solvers and heuristic algorithms.
For hotel room managers, training a well-performing two-stage deep reinforcement learning algorithm tailored to their specific operational needs is a highly rewarding and efficiency-boosting endeavor. A large quantity of high-quality data, encompassing operational, market, and customer features, is required to serve as the foundation for the algorithmic environment. Based on this data, hotel room managers must accurately categorize customers into specific segments. They then need to calculate the corresponding customer traffic, price acceptance probability, and perceived fairness of pricing differences for weekdays and weekends across different seasons. Once sufficiently trained, the deep reinforcement learning policy network can be deployed as an agent to automatically generate pricing strategies. It is recommended that hotel managers mitigate perceived price unfairness among group customers by enhancing service differentiation and room amenities, particularly during the off-season.
Our study focuses on the dynamic discriminatory pricing of a single hotel and does not account for the pricing game among multiple hotels. This limitation may yield an incomplete picture of market dynamics, restricting our analysis of the macro-level characteristics of the hotel room market within a given area. Future research should incorporate competitors' pricing strategies to study optimal pricing models within a dynamic game framework. Moreover, under big data-based discriminatory pricing, every customer with unique features can be treated as an individual customer segment; dynamic discriminatory pricing based on swarm-agent deep reinforcement learning therefore represents a valuable research direction.

Author Contributions

Conceptualization, X.W. and W.L. (Wei Liu); methodology, X.W. and W.L. (Wei Liu); software, X.W. and Y.X.; validation, L.J. and W.L. (Wenting Lv); formal analysis, X.W. and L.J.; investigation, X.W., Y.X. and W.L. (Wenting Lv); resources, L.J. and W.L. (Wenting Lv); data curation, X.W. and Y.X.; writing—original draft preparation, X.W. and Y.X.; writing—review and editing, X.W. and L.J.; visualization, X.W.; supervision, L.J. and W.L. (Wei Liu); project administration, L.J.; funding acquisition, L.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Laboratory Project of Institute for Digital Transformation at China University of Petroleum; Laboratory Project of Higher Education Institutions in Shandong Province—Energy System Intelligent Management and Policy Simulation Laboratory at China University of Petroleum; Youth Innovation Team of Higher Education Institutions in Shandong Province—Data Intelligence Innovation Team at China University of Petroleum (Grant No. 2021RW041); Teaching Case Collection Construction Project on Postgraduate Education of Shandong Province (Grant No. SDYAL2023030); and Teaching Research and Reform Project at China University of Petroleum (Grant No. CM2024050).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for this study is available on GitHub: https://github.com/wxmwer/hotel-room-pricing (accessed on 15 October 2025).

Conflicts of Interest

Author Wenting Lv was employed by the Crowne Plaza Hotel in Qingdao, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. den Boer, A.V. Dynamic pricing and learning: Historical origins, current research, and new directions. Surv. Oper. Res. Manag. Sci. 2015, 20, 1–18. [Google Scholar] [CrossRef]
  2. Masiero, L.; Viglia, G.; Nieto-Garcia, M. Strategic consumer behavior in online hotel booking. Ann. Tour. Res. 2020, 83, 102947. [Google Scholar] [CrossRef]
  3. Chen, Y.F.; Pang, T.T.; Kuslina, B.H. The effect of price discrimination on fairness perception and online hotel reservation intention. J. Theor. Appl. Electron. Commer. Res. 2023, 18, 1320–1337. [Google Scholar] [CrossRef]
  4. Steinberg, E. Big data and personalized pricing. Bus. Ethics Q. 2020, 30, 97–117. [Google Scholar] [CrossRef]
  5. Nowak, M.; Pawłowska-Nowak, M. Dynamic pricing method in the E-commerce industry using machine learning. Appl. Sci. 2024, 14, 11668. [Google Scholar] [CrossRef]
  6. Wu, Z.; Yang, Y.; Zhao, J.; Wu, Y. The impact of algorithmic price discrimination on consumers’ perceived betrayal. Front. Psychol. 2022, 13, 825420. [Google Scholar] [CrossRef]
  7. Aziz, H.A.; Saleh, M.; Rasmy, M.H.; ElShishiny, H. Dynamic room pricing model for hotel revenue management systems. Egypt. Inform. J. 2011, 12, 177–183. [Google Scholar] [CrossRef]
  8. Fadly, M.; Ridwan, A.Y.; Akbar, M.D. Hotel room price determination based on dynamic pricing model using nonlinear programming method to maximize revenue. In Proceedings of the 2nd International Conference on Applied Information Technology and Innovation (ICAITI), Bali, Indonesia, 21–22 September 2019; pp. 190–196. [Google Scholar] [CrossRef]
  9. Cohen, M.C.; Elmachtoub, A.N.; Lei, X. Price discrimination with fairness constraints. Manag. Sci. 2022, 68, 8536–8552. [Google Scholar] [CrossRef]
  10. Cohen, M.C.; Miao, S.; Wang, Y. Dynamic pricing with fairness constraints. Oper. Res. 2025; ahead of print. [Google Scholar] [CrossRef]
  11. Ivanov, S.; Del Chiappa, G.; Heyes, A. The research-practice gap in hotel revenue management: Insights from Italy. Int. J. Hosp. Manag. 2021, 95, 102924. [Google Scholar] [CrossRef]
  12. Maestre, R.; Duque, J.; Rubio, A.; Arevalo, J. Reinforcement learning for fair dynamic pricing. In Proceedings of the 2018 Intelligent Systems Conference (IntelliSys), London, UK, 6–7 September 2018; Volume 868, pp. 120–135. [Google Scholar] [CrossRef]
  13. Lawhead, R.; Gosavi, A. A bounded actor-critic reinforcement learning algorithm applied to airline revenue management. Eng. Appl. Artif. Intell. 2019, 82, 252–262. [Google Scholar] [CrossRef]
  14. Bondoux, N.; Nguyen, A.Q.; Fiig, T.; Acuna-Agost, R. Reinforcement learning applied to airline revenue management. J. Revenue Pricing Manag. 2020, 19, 332–348. [Google Scholar] [CrossRef]
  15. Qiao, W.; Huang, M.; Gao, Z.; Wang, X. Distributed dynamic pricing of multiple perishable products using multi-agent reinforcement learning. Expert Syst. Appl. 2024, 237, 121252. [Google Scholar] [CrossRef]
  16. Lange, F.; Dreessen, L.; Schlosser, R. Reinforcement learning versus data-driven dynamic programming: A comparison for finite horizon dynamic pricing markets. J. Revenue Pricing Manag. 2025; ahead of print. [Google Scholar] [CrossRef]
  17. Tuncay, G.; Kaya, K.; Yilmaz, Y.; Yaslan, Y.; Ögüdücü, S. A reinforcement learning based dynamic room pricing model for hotel industry. INFOR Inf. Syst. Oper. Res. 2024, 62, 211–231. [Google Scholar] [CrossRef]
  18. Nicolini, M.; Piga, C.; Pozzi, A. From uniform to bespoke prices: Hotel pricing during EURO 2016. Quant. Mark. Econ. 2023, 21, 333–355. [Google Scholar] [CrossRef]
  19. Saito, T.; Takahashi, A.; Koide, N.; Ichifuji, Y. Application of online booking data to hotel revenue management. Int. J. Inf. Manag. 2019, 46, 37–53. [Google Scholar] [CrossRef]
  20. Martins, A.; Silva, L.; Marques, J. Data Science in supporting hotel management: Application of predictive models to booking.com guest evaluations. In Proceedings of the Advances in Tourism, Technology and Systems, ICOTTS 2023, Bacalar, Mexico, 2–4 November 2023; Volume 384, pp. 51–59. [Google Scholar] [CrossRef]
  21. Ye, P.; Qian, J.; Chen, J.; Wu, C.-H. Customized regression model for Airbnb dynamic pricing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 19–23 August 2018; pp. 932–940. [Google Scholar] [CrossRef]
  22. Vives, A.; Jacob, M. Dynamic pricing for online hotel demand: The case of resort hotels in Majorca. J. Vacat. Mark. 2020, 26, 268–283. [Google Scholar] [CrossRef]
  23. Mullen, P.B.; Monson, C.K.; Seppi, K.D.; Warnick, S.C. Particle swarm optimization in dynamic pricing. In Proceedings of the 2006 IEEE International Conference on Evolutionary Computation, Vancouver, BC, Canada, 16–21 July 2006; pp. 1232–1239. [Google Scholar] [CrossRef]
  24. Yang, J.; Xia, Y. A nonatomic-game approach to dynamic pricing under competition. Prod. Oper. Manag. 2013, 22, 88–103. [Google Scholar] [CrossRef]
  25. Bayoumi, A.E.M.; Saleh, M.; Atiya, A.F.; Aziz, H.A. Dynamic pricing for hotel revenue management using price multipliers. J. Revenue Pricing Manag. 2013, 12, 271–285. [Google Scholar] [CrossRef]
  26. Zhu, F.; Xiao, W.; Yu, Y.; Wang, Z.; Chen, Z.; Lu, Q.; Liu, Z.; Wu, M.; Ni, S. Modeling price elasticity for occupancy prediction in hotel dynamic pricing. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM '22), Atlanta, GA, USA, 17–21 October 2022; pp. 4742–4746. [Google Scholar] [CrossRef]
  27. Huang, L.; Zheng, W. Hotel demand forecasting: A comprehensive literature review. Tour. Rev. 2023, 78, 218–244. [Google Scholar] [CrossRef]
  28. Zhang, D.; Niu, B. Leveraging online reviews for hotel demand forecasting: A deep learning approach. Inf. Process. Manag. 2024, 61, 103527. [Google Scholar] [CrossRef]
  29. Huang, L.; Li, C.; Zheng, W. Daily hotel demand forecasting with spatiotemporal features. Int. J. Contemp. Hosp. Manag. 2025, 35, 26–45. [Google Scholar] [CrossRef]
  30. Wu, J.; Li, M.; Zhao, E.; Sun, S.; Wang, S. Can multi-source heterogeneous data improve the forecasting performance of tourist arrivals amid COVID-19? Mixed-data sampling approach. Tour. Manag. 2023, 98, 104759. [Google Scholar] [CrossRef]
  31. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  32. Lu, R.; Hong, S.H.; Zhang, X. A dynamic pricing demand response algorithm for smart grid: Reinforcement learning approach. Appl. Energy 2018, 220, 220–230. [Google Scholar] [CrossRef]
  33. Tarrahi, F.; Eisend, M.; Dost, F. A meta-analysis of price change fairness perceptions. Int. J. Res. Mark. 2016, 33, 199–203. [Google Scholar] [CrossRef]
  34. Li, K.J.; Jain, S. Behavior-based pricing: An analysis of the impact of peer-induced fairness. Manag. Sci. 2016, 62, 2705–2721. [Google Scholar] [CrossRef]
  35. Kallus, N.; Zhou, A. Fairness, welfare, and equity in personalized pricing. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event, 3–10 March 2021; pp. 296–314. [Google Scholar] [CrossRef]
  36. Gupta, S.; Kamble, V. Individual fairness in hindsight. J. Mach. Learn. Res. 2021, 22, 1–35. Available online: http://jmlr.org/papers/v22/19-658.html (accessed on 29 August 2025).
  37. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017. [Google Scholar] [CrossRef]
  38. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N. Continuous control with deep reinforcement learning. arXiv 2015. [Google Scholar] [CrossRef]
  39. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in Actor-Critic methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1587–1596. [Google Scholar] [CrossRef]
  40. Lee, M. Modeling and forecasting hotel room demand based on advance booking information. Tour. Manag. 2018, 66, 62–71. [Google Scholar] [CrossRef]
  41. Vajpai, G.N. Managing overbooking in hotels: A probabilistic model using Poisson distribution. Int. J. Adv. Res. Ideas Innov. Technol. 2018, 4, 1376–1379. [Google Scholar]
  42. Huang, H.; Wu, D.; Xu, H. Signaling or not? The pricing strategy under fairness concerns and cost information asymmetry. Eur. J. Oper. Res. 2025, 321, 789–799. [Google Scholar] [CrossRef]
  43. Pazgal, A.; Soberman, D. Behavior-based discrimination: Is it a winning play, and if so, when? Mark. Sci. 2008, 27, 977–994. [Google Scholar] [CrossRef]
  44. Shin, J.; Sudhir, K. A customer management dilemma: When is it profitable to reward one’s own customers? Mark. Sci. 2010, 29, 671–689. [Google Scholar] [CrossRef]
  45. Abrate, G.; Fraquelli, G.; Viglia, G. Dynamic pricing strategies: Evidence from European hotels. Int. J. Hosp. Manag. 2012, 31, 160–168. [Google Scholar] [CrossRef]
Figure 1. Framework of the hotel room pricing model with DRL (PPO algorithm).
Figure 2. Probability distribution functions $F_i$. The vertical dashed lines represent the prices that the corresponding groups accept with a 0.5 probability. (a) Off-season weekdays; (b) off-season weekends; (c) peak-season weekdays; (d) peak-season weekends.
Figure 3. Comparison of PPO training rewards among different pricing models. (a) Off-season with min-normalized scaling method; (b) peak-season with max-normalized scaling method; (c) off-season with mean-normalized scaling method; (d) peak-season with mean-normalized scaling method.
Figure 4. The optimal room prices of customer groups over the 30-day period. (a) Off-season with min-normalized scaling method; (b) peak-season with max-normalized scaling method; (c) off-season with mean-normalized scaling method; (d) peak-season with mean-normalized scaling method.
Figure 5. Comparison of reinforcement learning algorithms. The red vertical dashed line marks the boundary between the two training stages. (a) Off-season of DDP-N; (b) peak-season of DDP-N; (c) off-season of DDP-C; (d) peak-season of DDP-C.
Figure 6. Training results of hotel BH and hotel HJ via the PPO algorithm. The red vertical dashed line marks the boundary between the two training stages. (a) Off-season of hotel BH; (b) peak-season of hotel BH; (c) off-season of hotel HJ; (d) peak-season of hotel HJ.
Figure 7. Training results for multiple customer groups based on the DDP-C model and PPO algorithm. The red vertical dashed line marks the boundary between the two training stages. (a) Off-season; (b) peak-season.
Table 1. Formulas for $P_i$, $F_i$, and $\lambda_i$ in the case hotel.

| Season | Day type | Group | $P_i$ | $F_i$ | $\lambda_i$ |
|---|---|---|---|---|---|
| Off-season | Weekdays | 1 | Poisson(27) | $1/(1+e^{0.0366(x-400)})$ | $0.895+0.105\alpha_t$ |
| Off-season | Weekdays | 2 | Poisson(39) | $1/(1+e^{0.0275(x-460)})$ | $0.835+0.1\alpha_t+0.065\alpha_g$ |
| Off-season | Weekdays | 3 | Poisson(25) | $1/(1+e^{0.0220(x-520)})$ | $0.862+0.08\alpha_t+0.058\alpha_g$ |
| Off-season | Weekends | 1 | Poisson(32) | $1/(1+e^{0.0549(x-360)})$ | $0.848+0.122\alpha_t$ |
| Off-season | Weekends | 2 | Poisson(46) | $1/(1+e^{0.0366(x-420)})$ | $0.813+0.114\alpha_t+0.073\alpha_g$ |
| Off-season | Weekends | 3 | Poisson(24) | $1/(1+e^{0.0275(x-480)})$ | $0.824+0.094\alpha_t+0.082\alpha_g$ |
| Peak-season | Weekdays | 1 | Poisson(73) | $1/(1+e^{0.0220(x-770)})$ | $0.95+0.05\alpha_t$ |
| Peak-season | Weekdays | 2 | Poisson(52) | $1/(1+e^{0.0146(x-880)})$ | $0.926+0.043\alpha_t+0.031\alpha_g$ |
| Peak-season | Weekdays | 3 | Poisson(37) | $1/(1+e^{0.0105(x-990)})$ | $0.958+0.02\alpha_t+0.022\alpha_g$ |
| Peak-season | Weekends | 1 | Poisson(90) | $1/(1+e^{0.0220(x-750)})$ | $0.925+0.075\alpha_t$ |
| Peak-season | Weekends | 2 | Poisson(64) | $1/(1+e^{0.0146(x-850)})$ | $0.871+0.083\alpha_t+0.046\alpha_g$ |
| Peak-season | Weekends | 3 | Poisson(42) | $1/(1+e^{0.0105(x-950)})$ | $0.918+0.05\alpha_t+0.032\alpha_g$ |
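To make the table concrete, the snippet below evaluates its three quantities for one segment (off-season weekday group 2): daily arrivals drawn from $P_2$, the logistic acceptance probability $F_2(x)$, and the fairness perception $\lambda_2$ for given fairness parameters $\alpha_t$ and $\alpha_g$. It is a minimal reading of the table's formulas, not the authors' simulation code.

```python
# Direct evaluation of Table 1's formulas for off-season weekday group 2.
import math
import numpy as np

alpha_t, alpha_g = 0.5, 0.5        # temporal and group fairness parameters

def acceptance_prob(x: float) -> float:
    """F_2(x) = 1 / (1 + exp(0.0275 * (x - 460))): booking probability at price x."""
    return 1.0 / (1.0 + math.exp(0.0275 * (x - 460.0)))

def fairness_perception() -> float:
    """lambda_2 = 0.835 + 0.1 * alpha_t + 0.065 * alpha_g (Table 1)."""
    return 0.835 + 0.1 * alpha_t + 0.065 * alpha_g

rng = np.random.default_rng(42)
arrivals = rng.poisson(39)          # P_2: daily arrivals ~ Poisson(39)
price = 460.0                       # midpoint price, accepted with probability 0.5
expected_bookings = arrivals * acceptance_prob(price)

print(f"arrivals={arrivals}, F_2({price:.0f})={acceptance_prob(price):.2f}, "
      f"expected bookings={expected_bookings:.1f}, lambda_2={fairness_perception():.3f}")
# F_2(460) = 0.5 by construction, matching the dashed line in Figure 2a.
```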
Table 2. Performance metrics under optimal pricing strategies of different models.

| ($\alpha_g$, $\alpha_t$) / Model | Off-season MRR | Off-season Profit (M) | Off-season DGAR | Off-season OP | Peak-season MRR | Peak-season Profit (M) | Peak-season DGAR | Peak-season OP |
|---|---|---|---|---|---|---|---|---|
| 0.5, 0.3 | 348 | 0.734 | 74 | 86.5% | 779 | 2.782 | 83 | 100% |
| 0.5, 0.5 | 350 | 0.752 | 74 | 88.6% | 778 | 2.785 | 85 | 100% |
| 0.5, 0.7 | 350 | 0.770 | 76 | 90.9% | 779 | 2.798 | 84 | 100% |
| 0.3, 0.5 | 361 | 0.748 | 71 | 83.1% | 773 | 2.769 | 84 | 100% |
| 0.7, 0.5 | 350 | 0.771 | 76 | 91.4% | 781 | 2.803 | 84 | 100% |
| DDP-N | 371 | 0.835 | 75 | 89.3% | 822 | 2.913 | 82 | 100% |
| DP-N | 372 | 0.783 | 69 | 83.1% | 783 | 2.802 | 83 | 100% |
Table 3. Hyperparameter settings for the comparative algorithms. Entries marked "/" do not apply to that algorithm.

| Hyperparameter | PPO | DDPG | TD3 | AC |
|---|---|---|---|---|
| Actor learning rate | 0.0001 | 0.0001 | / | 0.0001 |
| Critic learning rate | 0.0002 | 0.0002 | / | 0.0002 |
| Soft update coefficient | / | 0.01 | 0.01 | / |
| Q-network learning rate | / | / | 0.0003 | / |
| Policy network learning rate | / | / | 0.0003 | / |
| Reward discount rate | 1 | 1 | 1 | 1 |
| Batch size | 32 | 32 | 32 | 32 |
| Training episodes | 2000 | 2000 | 2000 | 2000 |
| Days per episode | 30 | 30 | 30 | 30 |
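For reproducibility, the PPO column of Table 3 can be collected into a small configuration object such as the sketch below; the field names are ours, and the repository linked in the Data Availability Statement may organize these settings differently.

```python
# PPO settings from Table 3 gathered in a dataclass (illustrative naming).
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    actor_lr: float = 1e-4        # actor learning rate
    critic_lr: float = 2e-4       # critic learning rate
    gamma: float = 1.0            # reward discount rate (undiscounted horizon)
    batch_size: int = 32
    training_episodes: int = 2000
    days_per_episode: int = 30    # one episode simulates a 30-day pricing horizon

config = PPOConfig()
```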
Table 4. Comparison of algorithm results and historical data.

| Algorithm | Profit (M) | Training (Running) Time | Episodes (Iterations) |
|---|---|---|---|
| PPO (Off-season, DDP-C) | 0.752 | 25 min | 2500 |
| PPO (Peak-season, DDP-C) | 2.785 | 50 min | 5000 |
| PPO (Off-season, DDP-N) | 0.835 | 20 min | 2000 |
| PPO (Peak-season, DDP-N) | 2.913 | 40 min | 4000 |
| PSO (Off-season, DDP-C) | 0.789 | 35 min | 5000 |
| PSO (Peak-season, DDP-C) | 2.822 | 35 min | 5000 |
| PSO (Off-season, DDP-N) | 0.638 | >1 h | 10,000 |
| PSO (Peak-season, DDP-N) | 2.398 | >1 h | 10,000 |
| Historical Data (Off-season) | 0.703 | / | / |
| Historical Data (Peak-season) | 2.658 | / | / |
Table 5. Comparison of algorithm results and historical data for two hotels.

| Hotel | Algorithm | Off-Season Profit (M) | Peak-Season Profit (M) |
|---|---|---|---|
| Hotel BH | PPO (DDP-C) | 0.631 | 1.418 |
| Hotel BH | PPO (DDP-N) | 0.741 | 1.624 |
| Hotel BH | PSO (DDP-C) | 0.661 | 1.486 |
| Hotel BH | PSO (DDP-N) | 0.680 | 1.555 |
| Hotel BH | Historical Data | 0.630 | 1.400 |
| Hotel HJ | PPO (DDP-C) | 1.819 | 3.125 |
| Hotel HJ | PPO (DDP-N) | 1.954 | 3.514 |
| Hotel HJ | PSO (DDP-C) | 1.824 | 3.250 |
| Hotel HJ | PSO (DDP-N) | 1.771 | 3.377 |
| Hotel HJ | Historical Data | 1.670 | 2.970 |