Article

A Two-Stage Deep Reinforcement Learning-Driven Dynamic Discriminatory Pricing Model for Hotel Rooms with Fairness Constraints

1 School of Economics and Management, China University of Petroleum, Qingdao 266580, China
2 Sales Department, Crowne Plaza Hotel, Qingdao 266000, China
* Author to whom correspondence should be addressed.
J. Theor. Appl. Electron. Commer. Res. 2025, 20(4), 337; https://doi.org/10.3390/jtaer20040337
Submission received: 30 August 2025 / Revised: 13 October 2025 / Accepted: 7 November 2025 / Published: 2 December 2025
(This article belongs to the Special Issue Emerging Technologies and Marketing Innovation)

Abstract

Big data-driven discriminatory pricing not only creates opportunities to boost hotel profits but also amplifies consumers’ negative perceptions of price fairness. Developing a dynamic discriminatory pricing model with fairness constraints helps hotel room managers formulate optimal pricing strategies. This paper proposes a dynamic discriminatory pricing model with fairness constraints that unifies four pricing models: fixed pricing, dynamic pricing, discriminatory pricing, and dynamic discriminatory pricing. It further proposes a two-stage deep reinforcement learning algorithm to efficiently solve the model and generate optimal pricing strategies. Finally, a case study is conducted to validate the proposed model and algorithm. The results show that the two-stage deep reinforcement learning algorithm can instantaneously derive optimal pricing schemes that satisfy both group and temporal fairness constraints, following a reasonably time-efficient training process. By adjusting the fairness parameters, our model can be transformed into the four types of pricing models, and the performance of the algorithm is validated for the commonly used dynamic pricing and dynamic discriminatory pricing models. Compared to traditional nonlinear programming solution algorithms, this algorithm generates optimal daily prices based on real-time market changes, making it more practically applicable.

1. Introduction

Room revenue constitutes a vital income stream for hotels, making its pricing strategy a critical component of effective revenue management. Traditional hotel pricing approaches largely rely on dynamic pricing models driven by supply and demand dynamics [1]. However, such strategies often fail to incorporate differences in price sensitivity and perception across distinct customer segments, thereby constraining the hotel’s potential to fully maximize revenue. A growing number of scholars are investigating methods to segment customer groups based on their distinct characteristics, thereby facilitating the implementation of more refined pricing strategies [2,3]. This approach, referred to as discriminatory pricing, enables hotels to adopt tailored pricing policies aligned with the varying price expectations of different customer segments. With the advent of the big data and artificial intelligence era, big data-driven discriminatory pricing has introduced a more refined approach to hotel revenue management [4]. E-commerce platforms can now construct detailed customer profiles based on behavioral data, enabling them to tailor room prices to individual customers and thereby maximize revenue [5]. While dynamic discriminatory pricing has the potential to increase hotel profits, it also heightens consumers’ negative concerns regarding price fairness [3,6]. In response, many government agencies have implemented regulations aimed at curbing excessive price discrimination in hotel room rates. For instance, the UK’s Financial Conduct Authority has proposed the implementation of a relative price ceiling, while the United States’ Equal Credit Opportunity Act includes explicit provisions prohibiting price discrimination. Whether motivated by the long-term strategic goals of businesses or in response to governmental regulations, integrating fairness considerations into pricing research has become an inevitable trend.
Some researchers have proposed nonlinear optimization algorithms to derive optimal pricing strategies under fairness constraints [7,8,9,10]. However, these algorithms are time-consuming, and the solutions generated by solvers are static, as they rely on fixed assumptions regarding demand and customer traffic; consequently, they fail to adapt to daily fluctuations in market conditions. As a result, in practical settings, hotel managers continue to depend on historical demand data and experiential forecasting to formulate room pricing strategies [11]. With the rapid advancement of AI technology, Reinforcement Learning (RL) offers a new approach to solving hotel room pricing models efficiently. Through continuous interaction with its environment, an agent refines its behavioral policy to maximize the cumulative rewards it obtains. However, traditional RL methods often struggle to process high-dimensional inputs, owing to their limited ability to extract relevant features. Deep Reinforcement Learning (DRL), which integrates reinforcement learning with deep neural networks (DNNs), exhibits exceptional capabilities in addressing decision-making problems involving complex state spaces and high-dimensional action domains. DRL has been successfully applied across a wide range of real-world domains, including video game AI, competitive sports, autonomous driving, inventory management, and robotic control. In the field of revenue management, researchers have increasingly explored the application of reinforcement learning [12,13,14,15,16]. However, for hotel room pricing, although Tuncay et al. [17] applied Q-learning to a common model without constraints, the application of DRL incorporating fairness constraints remains an area requiring further research.
This paper presents a dynamic discriminatory pricing model for hotel rooms with fairness constraints, and proposes a two-stage deep reinforcement learning algorithm to derive revenue-maximizing pricing strategies. The main contributions of this paper are summarized as follows:
(1)
We propose a dynamic discriminatory pricing framework for hotel rooms with fairness constraints. By adjusting the group and temporal fairness parameters, this framework can be transformed into four distinct pricing models: fixed pricing, dynamic pricing, discriminatory pricing, and dynamic discriminatory pricing. This framework extends the theoretical foundations of hotel room pricing optimization. It can effectively address the limitation of traditional models that only solve pricing problems in a single scenario, significantly enhancing the model’s generality.
(2)
We revise the representation of the fairness gap in the fairness constraints, shifting from the price range of the optimal pricing strategy to customers’ acceptable real fairness perception. This revision is more aligned with the characteristics of the customer groups of a specific hotel. Compared with the methods [9,10] that determine the fairness gap based on the optimal solution of unconstrained models, the fairness gap derived from this method exhibits better determinacy and applicability. Hence, the model-derived solutions for pricing strategies can be more effective in practice.
(3)
To solve the dynamic discriminatory pricing model with fairness constraints, we propose a two-stage deep reinforcement learning algorithm. This algorithm can generate optimal pricing strategies that satisfy fairness constraints based on the dynamic changes in the market environment, and it is more applicable and faster-running than traditional nonlinear programming algorithms. This algorithm provides a new solution method for intelligent pricing problems with temporal constraints, expanding the application scope of deep reinforcement learning algorithms in the pricing field.
The rest of the paper is organized as follows. Section 2 reviews the literature. Section 3 details our proposed dynamic discriminatory pricing model, and Section 4 presents the corresponding two-stage deep reinforcement learning algorithm. Section 5 presents a case study. Lastly, Section 6 concludes the paper.

2. Literature Review

2.1. Hotel Room Pricing Strategy

Hotel room pricing strategy is a core component of hotel revenue management, and discriminatory pricing as well as dynamic pricing are the most frequently used methods by hotel managers. Specifically, these strategies address the needs of diverse customer segments by offering differentiated discounts tailored to each group [18]. Driven by a differentiation strategy, hotel managers primarily implement price discrimination based on their experience and observable customer characteristics. To further boost hotel revenue, managers collect and integrate large volumes of multi-channel data into Revenue Management Systems (RMSs), thereby enabling more effective implementation of discriminatory pricing in practice [19,20]. Dynamic pricing involves continuously adjusting prices over time based on changes in supply and demand. Such strategies are widely adopted in hotel revenue management [5,21,22]. Meanwhile, numerous studies have focused on dynamic pricing approaches [8,23,24,25,26]. Consequently, both discriminatory pricing and dynamic pricing are attracting growing attention from both practitioners and academics.
True customer demand is ultimately determined by a complex interplay of various factors. To develop more effective pricing strategies, existing research has focused on demand forecasting and the optimization of pricing models. Customer demand forecasting serves as a prerequisite for formulating pricing strategies and holds significant strategic value and theoretical importance [27]. Currently, research on hotel room demand forecasting is primarily based on historical data. For instance, Zhang and Niu [28] used two years of weekly data derived from online reviews to forecast hotel demand. High-frequency data, such as hourly and daily records, are commonly used for short-term demand forecasting, as in Huang et al. [29]. However, a growing body of research has highlighted the significant limitations of forecasts that rely on a single data source, and recent studies have explored the integration of multiple data sources to enhance forecasting accuracy [30]. By incorporating diverse data that capture various factors influencing future hotel demand, integrated models offer substantial improvements in forecasting performance.

2.2. Pricing Models for Hotel Rooms

Optimization methods are another key factor influencing pricing strategies. Traditional hotel pricing optimization methods primarily utilize approaches such as game theory and nonlinear programming, among others. For instance, Yang and Xia [24] introduced a dynamic pricing game model that demonstrates the existence of an equilibrium pricing strategy under competitive conditions, along with a method for generating viable dynamic prices contingent on the number of competing firms. Aziz et al. [7] and Fadly et al. [8] developed nonlinear programming models to tackle dynamic room pricing challenges. On another front, Vives and Jacob [22] optimized hotel room pricing by proposing a stochastic dynamic pricing model that incorporates an online demand function. Additionally, Bayoumi et al. [25] devised a dynamic pricing approach based on price multipliers and employed a Monte Carlo simulator to identify optimal multiplier values, thereby assisting hotels in maximizing revenue.
However, as research advances and aligns more closely with practical realities, room pricing models have grown in complexity. Reinforcement learning, a cutting-edge area of machine learning known for its strong performance in complex and dynamic environments, has been increasingly applied to complex dynamic pricing problems [31]. Lu et al. [32] introduced a deep reinforcement learning algorithm for dynamic pricing in energy management within hierarchical electricity markets; they formulated the pricing process as a discrete finite Markov decision process (MDP) and employed Q-learning to solve the resulting decision-making problem. Qiao et al. [15] proposed a distributed pricing framework by modeling the MPPDP problem as a fully cooperative Markov game, which was then addressed using multi-agent reinforcement learning (MARL); they developed two efficient distributed dynamic pricing algorithms based on MARL: Counterfactual Q-learning and Counterfactual DQN. Lawhead and Gosavi [13], Bondoux et al. [14], and Lange et al. [16] applied reinforcement learning algorithms to revenue management, substantially increasing ticket revenues for a new airline. In the field of hotel room pricing, Tuncay et al. [17] applied Q-learning to a Turkish hotel, creating the Q-table from data on hotels with different characteristics and densities. The Q-table method is only applicable in environments with simple state and action spaces, making it unsuitable for more complex problems, such as those involving continuous pricing or large state spaces. Currently, there remains a scarcity of research that integrates hotel pricing models with deep reinforcement learning to develop more efficient pricing strategies.

2.3. Pricing Fairness

Price discrimination often raises concerns regarding pricing fairness. When customers perceive unfairness, they may feel a sense of betrayal, which can ultimately erode trust in the company. Judgments of price fairness are generally influenced by the reasons provided for price changes [33]. Some studies have explored the causes and consequences of perceived unfairness in pricing. In the context of hotel room pricing, research focuses on the relationship between social welfare and dynamic price discrimination strategies. For example, Li and Jain [34] utilized a game-theoretic model to analyze the interactions among fairness perceptions across different customer segments, retailer pricing strategies, and their impacts on profit, consumer surplus, and social welfare. Similarly, Kallus and Zhou [35] examined how fairness concerns and welfare outcomes interact in the context of personalized pricing based on customer characteristics. An additional perspective on fairness in pricing is the concept of “inter-temporal fairness”, introduced by Gupta and Kamble [36]. This principle asserts that individuals should be treated fairly across time—accounting for both their past interactions and future engagements. Pricing fairness is increasingly being integrated into optimization models to achieve more practical and sustainable revenue outcomes. For instance, Cohen et al. [9,10] incorporated fairness constraints into their model and developed an algorithm based on the Upper Confidence Bound (UCB) method. This fairness-aware algorithm provides relative price stability for each customer segment under the given time frames and customer traffic conditions; however, it is not well-suited for highly uncertain and rapidly evolving future markets. Additionally, they defined the fairness gap within the fairness constraints as the price range of the optimal pricing strategy, which may not align with customers’ perceptions of fairness. Further research is still needed to develop a hotel room pricing method that can adapt to rapid market changes while considering price fairness constraints.

3. Hotel Room Pricing Model

3.1. Model Assumptions

To help hotels formulate appropriate pricing strategies for revenue maximization and explore the impact of fair pricing on hotel room revenue, we propose a hotel room pricing model considering fairness constraints; this model can achieve optimal dynamic discriminatory pricing based on distinct customer segments. Hotel room managers make price decisions based on customers’ demand for rooms and the number of available rooms, while customers decide whether to check in according to the room price. Therefore, the room price is set as the decision variable in the model. To facilitate model construction, we select a hotel as the research object, and make the following assumptions:
(1)
All rooms in the hotel are homogeneous, that is, we only consider a single room type.
(2)
Customers can be segmented into distinct groups, where individuals within the same group are assumed to be homogeneous. This homogeneity means that all members of a group share similar preferences for hotel room prices and demonstrate consistent levels of concern about price fairness.
(3)
For customers staying consecutively, room prices remain consistent with the initial booking price throughout their entire stay and will not fluctuate.
(4)
In the initial state of our simulation, all the hotel rooms are unoccupied.

3.2. Notations and Variable Definitions

Assume the hotel has $M$ rooms, each with a cost of $C$. Based on their spending characteristics, customers are segmented into $N$ groups indexed by $i \in \{1, 2, \ldots, N\}$. The time horizon is a fixed period of $T$ days (e.g., 30 days), indexed by $t \in \{1, 2, \ldots, T\}$. Hotel managers set daily room prices for the different customer groups, denoted $p_{t,i} \in [\underline{p}, \overline{p}]$, where $[\underline{p}, \overline{p}]$ is a predetermined price range. The price that group $i$ customers most desire at time $t$ often differs from the actual price $p_{t,i}$; therefore, with a certain probability, some customers may give up booking this hotel because of the price. The probability that customers in group $i$ accept the room price $p_{t,i}$ is denoted $F_i(p_{t,i})$, which conforms to the probability distribution $F_i$. The customer traffic $Q_{t,i}$, representing the total number of group $i$ customers on day $t$ who consider this hotel as their intended destination, follows a Poisson distribution $P_i$. Thus, the number of group $i$ customers on day $t$ who actually plan to check in is $Q_{t,i} \cdot F_i(p_{t,i})$. The room accommodation demand generated on day $t$ is
$$D_t = \sum_{i=1}^{N} Q_{t,i} \cdot F_i(p_{t,i}). \tag{1}$$
When the total room demand exceeds the number of rooms available for booking on a given day, the hotel will stop accepting new reservations. Hence, the actual number of check-in customers U t can be expressed as
$$U_t = \begin{cases} M_t, & D_t \ge M_t \\ D_t, & D_t < M_t, \end{cases} \tag{2}$$
where $M_t$ denotes the number of remaining rooms on day $t$. Customers may stay for $d$ consecutive days with probability $Z(d)$, where $d \in \{1, 2, \ldots, L\}$ and $L$ denotes the maximum number of days customers may stay. If we consider the case of consecutive customer stays, Expression (2) is updated as follows:
$$U_t = \begin{cases} \min\{M, D_t\}, & t = 1 \\ \min\left\{M,\; D_t + \sum_{j=1}^{t-1}\sum_{d=j+1}^{t} Z(d)\, U_{t-j}\right\}, & 2 \le t < L \\ \min\left\{M,\; D_t + \sum_{j=1}^{L-1}\sum_{d=j+1}^{L} Z(d)\, U_{t-j}\right\}, & t \ge L. \end{cases} \tag{3}$$
The expressions $\sum_{j=1}^{t-1}\sum_{d=j+1}^{t} Z(d)\, U_{t-j}$ and $\sum_{j=1}^{L-1}\sum_{d=j+1}^{L} Z(d)\, U_{t-j}$ represent the cumulative number of rooms occupied by customers who checked in prior to day $t$. The profit earned by the hotel on day $t$ is
$$R_t = \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C), \tag{4}$$
where $U_{t,i}$ is the actual number of check-in customers in group $i$ on day $t$.
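To make the bookkeeping in Expressions (1)–(4) concrete, the following minimal sketch simulates one day of traffic, demand, capped check-ins, and profit. It is an illustration, not the paper's simulator: the Poisson means and prices are placeholders, the fairness factor (introduced in Expression (5) below) is set to one, and rationing demand proportionally across groups when capacity binds is our own simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

M, C = 150, 167.0                      # rooms and marginal cost (case-study values)
Z = [0.46, 0.34, 0.15, 0.035, 0.015]   # P(stay = d nights), d = 1..5

def acceptance(p, k=0.0366, p0=400.0):
    # Logistic acceptance probability F_i(p); calibration of group 1, Section 5.1
    return 1.0 / (1.0 + np.exp(k * (np.asarray(p) - p0)))

def day_profit(prices, lam, poisson_means, occupied_carryover=0.0):
    """One day of Expressions (1)-(4): traffic, demand, capped check-ins, profit."""
    Q = rng.poisson(poisson_means)                        # customer traffic Q_{t,i}
    demand = Q * acceptance(prices) * np.asarray(lam)     # per-group demand
    D_t = demand.sum()
    capacity = max(M - occupied_carryover, 0.0)           # rooms free today
    scale = min(1.0, capacity / D_t) if D_t > 0 else 0.0  # cap total check-ins
    U = demand * scale                                    # per-group check-ins U_{t,i}
    profit = float(np.sum(U * (np.asarray(prices) - C)))  # Expression (4)
    return U, profit

U, R = day_profit(prices=[380.0, 420.0, 470.0], lam=[1.0, 1.0, 1.0],
                  poisson_means=[60, 45, 30], occupied_carryover=40)
print(f"check-ins per group: {np.round(U, 1)}, profit: {R:.0f}")
```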
Our model integrates the fairness constraints with a specific focus on price fairness. Drawing on the definition of price fairness in [36], we adopt two types of price fairness: price fairness across groups and across time periods. For brevity, we call them group fairness and temporal fairness, respectively.
Definition 1 
(group fairness). For day $t$, the offered hotel room prices for customer groups $i$ and $j$ satisfy $|p_{t,i} - p_{t,j}| < \delta_t$ almost surely for $i \ne j$, where $\delta_t$ is a constant group fairness parameter.
Definition 2 
(temporal fairness). For customers of group $i$, the offered hotel room prices on days $t$ and $s$ satisfy $|p_{t,i} - p_{s,i}| < \sigma_i$ almost surely for $t \ne s$, where $\sigma_i$ is a constant temporal fairness parameter.
Li and Jain [34] studied pricing games between firms and consumers and found that price fairness concerns affect firms' pricing strategies and consumers' purchase decisions. When consumers perceive price unfairness, they may switch to purchasing products from other firms [6]. Similarly, price unfairness may lead customers to refrain from booking hotel rooms. Therefore, we incorporate the impact of price fairness on customer traffic into the model. Typically, the perception of price fairness is manifested in two ways [34]: (1) peer-induced fairness, in which customers compare their own prices with those of peer customers; and (2) experience-induced fairness, in which customers compare current prices with historical prices they have previously experienced. We define an influence factor $\lambda_i \in [0, 1]$ to represent the impact of price unfairness on the customer traffic of group $i$. The influence factor $\lambda_i$ incorporates both effects: the impact of peer-induced price unfairness and that of experience-induced price unfairness. Then, $D_t$ in Expressions (1)–(3) is transformed into
$$D_t = \sum_{i=1}^{N} Q_{t,i} \cdot F_i(p_{t,i}) \cdot \lambda_i. \tag{5}$$

3.3. Optimization Model

When hotel managers aim to maximize profits through daily pricing strategies without considering any constraints, the optimal pricing model can be formulated as an unconstrained optimization problem:
$$\max R = \sum_{t=1}^{T} \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C). \tag{6}$$
When fairness constraints are added to the unconstrained optimization problem (6), a constrained optimization problem is immediately obtained. However, the parameters $\delta_t$ and $\sigma_i$ defined in Definitions 1 and 2 are difficult to set. We introduce a fairness parameter $\alpha \in [0, 1]$ to constrain the daily room price differences both across customer groups and across time periods. In this paper, we define $\delta_t$ and $\sigma_i$ as
$$\delta_t = (1 - \alpha_g) \sup_i \{\Delta p_{t,i}^*\}, \tag{7}$$
and
$$\sigma_i = (1 - \alpha_t) \sup_t \{\Delta p_{t,i}^*\}. \tag{8}$$
$\Delta p_{t,i}^*$ denotes the absolute unfair price difference perceived by customer group $i$ at time $t$. $\sup_i \{\Delta p_{t,i}^*\}$ is the supremum of the absolute unfair price differences across all customer groups at time $t$, and $\sup_t \{\Delta p_{t,i}^*\}$ is the supremum for customer group $i$ over the entire time period. The absolute unfair price difference can also be referred to as the fairness gap. While Cohen et al. [9] used the price range of the optimal pricing strategy to represent the fairness gap, that approach may not align with customers' actual perceptions, thus hindering the practical application of the pricing model.
The parameters $\alpha_g$ and $\alpha_t$ denote the group fairness and temporal fairness parameters, respectively. By leveraging $\alpha_g$ and $\alpha_t$, hotel room managers can implement dynamic discriminatory pricing. When $\alpha_g = \alpha_t = 1$, the hotel room prices are uniform across all customer groups and time periods, corresponding to the fixed pricing model. Conversely, when $\alpha_g = 0$ or $\alpha_t = 0$, the respective fairness constraint is absent. If $\alpha_g = 0$ and $\alpha_t = 1$, the model reduces to the discriminatory pricing model, where each customer group is assigned a distinct yet time-invariant price across all periods. When $\alpha_g = 1$ and $\alpha_t = 0$, the model corresponds to the dynamic pricing model, which treats all customers as a single group whose price varies over time; and when $\alpha_g = \alpha_t = 0$, both constraints vanish and the model becomes the dynamic discriminatory pricing model. Thus, by introducing the group fairness and temporal fairness parameters, the four pricing models can be defined as a unified model.
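As a quick check of how $\alpha_g$ and $\alpha_t$ collapse the framework into the four models, the sketch below evaluates Expressions (7) and (8) at the four corner settings; the fairness gaps used here are made-up values, not the surveyed ones from Section 5.

```python
def fairness_bounds(alpha_g, alpha_t, sup_group_gap, sup_time_gap):
    """Expressions (7)-(8): price-gap bounds implied by the fairness parameters."""
    delta_t = (1 - alpha_g) * sup_group_gap   # max gap across groups on one day
    sigma_i = (1 - alpha_t) * sup_time_gap    # max gap across days for one group
    return delta_t, sigma_i

# Illustrative gaps: sup_i = 120, sup_t = 100 (not the surveyed case-study values)
for a_g, a_t, model in [(1, 1, "fixed pricing"),
                        (0, 1, "discriminatory pricing"),
                        (1, 0, "dynamic pricing"),
                        (0, 0, "dynamic discriminatory pricing")]:
    d, s = fairness_bounds(a_g, a_t, 120, 100)
    print(f"alpha_g={a_g}, alpha_t={a_t}: delta_t={d:>5}, sigma_i={s:>5}  ({model})")
```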
The influence of price fairness on customer traffic stems from perceptions of both group and temporal fairness. Thus, to characterize the overall influence of price fairness on the customer traffic of group $i$, we define $\lambda_i$ as a function of $\alpha_g$ and $\alpha_t$ rather than as a function of price differences.
The problem of the optimization of hotel room pricing can be formulated as follows:
$$\begin{aligned} \max\; & R(p_{t,i}) = \sum_{t=1}^{T} \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C) \\ \text{s.t.}\; & |p_{t,i} - p_{t,j}| \le (1 - \alpha_g) \sup_i \{\Delta p_{t,i}^*\}, \quad \forall\, i \ne j, \\ & |p_{t,i} - p_{s,i}| \le (1 - \alpha_t) \sup_t \{\Delta p_{t,i}^*\}, \quad \forall\, t \ne s, \\ & \underline{p} \le p_{t,i} \le \overline{p}, \quad \forall\, i, t. \end{aligned} \tag{9}$$
In optimization problem (9), the subscripts range over $i, j \in \{1, 2, \ldots, N\}$ and $t, s \in \{1, 2, \ldots, T\}$. $U_t$ is re-expressed as
$$U_t = \begin{cases} \min\left\{M,\; \sum_{i=1}^{N} Q_{t,i}\, F_i(p_{t,i})\, \lambda_i(\alpha_g, \alpha_t)\right\}, & t = 1 \\ \min\left\{M,\; \sum_{i=1}^{N} Q_{t,i}\, F_i(p_{t,i})\, \lambda_i(\alpha_g, \alpha_t) + \sum_{j=1}^{t-1}\sum_{d=j+1}^{t} Z(d)\, U_{t-j}\right\}, & 2 \le t < L \\ \min\left\{M,\; \sum_{i=1}^{N} Q_{t,i}\, F_i(p_{t,i})\, \lambda_i(\alpha_g, \alpha_t) + \sum_{j=1}^{L-1}\sum_{d=j+1}^{L} Z(d)\, U_{t-j}\right\}, & t \ge L. \end{cases} \tag{10}$$
It is worth noting that the model does not distinguish between room types; all rooms are treated as homogeneous. In real-world operations, hotel managers categorize rooms based on their configuration to meet different customer preferences. One solution is to train separate models for different room types and their corresponding customer groups. Another is to treat all rooms as homogeneous (as assumed in our model) and match customers with different room types based on the prices they pay for bookings, thereby meeting their preferences. This can also serve as a marketing measure to mitigate customers' perception of unfairness.

4. Two-Stage Deep Reinforcement Learning Algorithm

4.1. The MDP Model for Hotel Room Pricing

The proposed optimization model (9) is a nonlinear programming problem, and solving it with traditional solvers or heuristic algorithms becomes time-consuming for dynamic discriminatory pricing over an extended period. Moreover, even when an optimal solution is derived, the daily randomness in customer arrivals may render it suboptimal in practice. Therefore, a fast-solving algorithm that copes with random changes in customer traffic is urgently needed. Deep reinforcement learning addresses this problem effectively: although training takes a relatively long time, the trained model offers fast inference and can quickly compute an optimal decision scheme for any given state.
The hotel’s operational process over a given period can be formulated as an episodic Markov Decision Process (MDP). This process comprises five components: S (state), A (action), R (reward), P (state transition probability), and π (pricing strategy adopted by the hotel room manager). The hotel room manager can be regarded as an agent.
State: The agent formulates pricing schemes based on the remaining available rooms $M_t$ and the expected arrivals of each customer group $Q_{t,i}$ for day $t$. Therefore, the state on day $t$ is denoted $s_t = (M_t, Q_{t,1}, Q_{t,2}, \ldots, Q_{t,N})$.
Action: The action $a_t$ taken by the agent on day $t$ is the set of room prices for each customer group, i.e., $a_t = (p_{t,1}, p_{t,2}, \ldots, p_{t,N})$.
Reward: When the agent takes the pricing action $a_t$ based on state $s_t$, an immediate reward $r_{t+1}$ is received from the environment. Specifically, this reward is defined as the room profit earned by the hotel on day $t$, i.e., $r_{t+1} = \sum_{i=1}^{N} U_{t,i} \cdot (p_{t,i} - C)$.
State transition: $P(s_{t+1} \mid s_t, a_t)$ represents the probability of transitioning to the next state $s_{t+1}$ given the current state $s_t$ and pricing action $a_t$. Given the stochastic nature of customer arrivals, precisely describing the state transition probability is challenging. To address this issue, we assume equal probabilities of customer arrival and refusal for each group in the environment when the hotel rooms are fully occupied.
Policy: The policy π determines the action to be taken in a specific state s t . Specifically, the agent decides on the room pricing strategy for different customer groups for the next day based on the state during period T. This policy is defined as a policy network that predicts an action.
The framework of the DRL-driven hotel room pricing model is shown in Figure 1.
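For illustration, the MDP above can be rendered as a Gym-style environment. The following is a minimal sketch under stated assumptions: the acceptance curve is a placeholder logistic shared by all groups, stays last one night (so the multi-day recursion of Expression (10) is omitted), demand is rationed proportionally when rooms run out, and the class and method names are our own.

```python
import numpy as np

class HotelPricingEnv:
    """Episodic MDP of Section 4.1 (simplified sketch, not the paper's code)."""

    def __init__(self, M=150, C=167.0, T=30,
                 poisson_means=(60, 45, 30), price_range=(300.0, 640.0)):
        self.M, self.C, self.T = M, C, T
        self.means = np.array(poisson_means, dtype=float)
        self.p_low, self.p_high = price_range
        self.rng = np.random.default_rng()

    def reset(self):
        self.t = 0
        self.rooms_free = self.M               # assumption (4): all rooms empty
        self.Q = self.rng.poisson(self.means)  # expected arrivals Q_{t,i}
        return self._state()

    def _state(self):
        # s_t = (remaining rooms, expected traffic of each group)
        return np.concatenate(([self.rooms_free], self.Q)).astype(np.float32)

    def step(self, action):
        # a_t = one price per group, clipped to the admissible range
        prices = np.clip(np.asarray(action, dtype=float), self.p_low, self.p_high)
        accept = 1.0 / (1.0 + np.exp(0.0366 * (prices - 400.0)))  # placeholder F_i
        demand = self.Q * accept
        total = demand.sum()
        scale = min(1.0, self.rooms_free / total) if total > 0 else 0.0
        U = demand * scale                              # check-ins capped by free rooms
        reward = float(np.sum(U * (prices - self.C)))   # r_{t+1}: daily profit
        # Simplification: one-night stays only, so every room frees up again;
        # the paper instead carries multi-day stays via Expression (10).
        self.t += 1
        self.rooms_free = self.M
        self.Q = self.rng.poisson(self.means)
        done = self.t >= self.T
        return self._state(), reward, done, {}
```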

4.2. Algorithm

The supremum of price differences, $\sup_t \{\Delta p_{t,i}^*\}$, depends on the generated price trajectory. Consequently, the conventional exploration methods employed in deep reinforcement learning algorithms cannot handle the temporal constraint. To solve the proposed hotel room pricing model incorporating both group and temporal fairness constraints, we design a two-stage deep reinforcement learning algorithm. In Stage I, we train a deep reinforcement learning algorithm under the group fairness constraint and derive an optimal solution via inference with the trained policy network. The temporal fairness constraint restricts the price variation range over the time period; therefore, in Stage II, the deep reinforcement learning algorithm is trained further, starting from the Stage I optimum, with actions clipped to the constraint range. Once the hotel room pricing environment (Line 1 of Algorithm 1) is constructed, a deep reinforcement learning algorithm, such as Proximal Policy Optimization (PPO) [37], Deep Deterministic Policy Gradient (DDPG) [38], or Twin Delayed Deep Deterministic policy gradient (TD3) [39], may be selected for training.
The steps of our two-stage deep reinforcement learning algorithm are specified in Algorithm 1.
The embedding representation (Lines 6, 12 and 24) maps the integers in the state to multi-dimensional vectors, which serve as the input for network training. As a commonly used technique in neural network training, the embedding function maps discrete integers into dense, low-dimensional vectors. It essentially functions as a lookup table with dimensions (vocabulary size, embedding dimension). When an integer sequence is provided as input, the embedding function retrieves the corresponding vectors by index, producing an output tensor of shape (batch size, sequence length, embedding dimension). These vectors are initialized randomly and learned during training. In effect, the embedding layer is equivalent to a fully connected layer applied to one-hot encoded integers: it outputs dense vectors and trains an embedding matrix that serves as the embedding representations.
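As a concrete example of this embedding step (the vocabulary size and embedding dimension below are illustrative, not the paper's settings), PyTorch's `nn.Embedding` performs exactly this lookup:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 200, 8        # illustrative: integers 0..199 -> 8-dim vectors
embed = nn.Embedding(vocab_size, embed_dim)

# A batch of two states: (remaining rooms, Q_1, Q_2, Q_3) as integers
states = torch.tensor([[150, 62, 41, 28],
                       [ 97, 55, 47, 31]])          # shape (batch, seq_len) = (2, 4)

vectors = embed(states)                             # shape (2, 4, 8): one vector per index
policy_input = vectors.flatten(start_dim=1)         # (2, 32), fed to the policy network
print(policy_input.shape)
```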
Algorithm 1: Two-stage deep reinforcement learning algorithm
(Algorithm 1 is presented as a pseudocode figure in the original publication; the line numbers referenced in this subsection follow that listing.)
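Since the pseudocode itself is not reproduced here, the following structural sketch outlines the two-stage control flow as described above. It is not the authors' exact listing: `scale_to_gap` is the repair rule sketched after Expressions (11)–(13) below, `env` and `agent` are duck-typed placeholders (e.g., the environment of Section 4.1 and a PPO learner), and centering the Stage II clipping band on the per-group mean of the Stage I trajectory is our reading of 'clipping actions to the constraint range'.

```python
import numpy as np

def two_stage_train(env, agent, delta_t, sigma_i,
                    episodes_stage1=2000, episodes_stage2=500):
    """Structural sketch of the two-stage algorithm (not the authors' listing)."""
    # Stage I: learn under the group fairness constraint only; actions that
    # violate it are repaired by a normalized scaling rule (Expr. (11)-(13)).
    for _ in range(episodes_stage1):
        state, done = env.reset(), False
        while not done:
            prices = agent.act(state)
            if prices.max() - prices.min() > delta_t:
                prices = scale_to_gap(prices, delta_t, anchor="min")
            next_state, reward, done, _ = env.step(prices)
            agent.store(state, prices, reward)
            state = next_state
        agent.update()

    # Inference with the Stage I policy yields a reference trajectory; its
    # per-group mean anchors a band of width sigma_i, and any two prices
    # inside that band satisfy the temporal fairness constraint.
    reference = np.mean([agent.act(s, deterministic=True)
                         for s in env.rollout_states()], axis=0)
    low, high = reference - sigma_i / 2, reference + sigma_i / 2

    # Stage II: continue training with actions clipped into the band.
    for _ in range(episodes_stage2):
        state, done = env.reset(), False
        while not done:
            prices = np.clip(agent.act(state), low, high)
            next_state, reward, done, _ = env.step(prices)
            agent.store(state, prices, reward)
            state = next_state
        agent.update()
    return agent
```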
When the action fails to satisfy the group fairness constraint in Stage I, the normalized scaling method (Line 8) is applied. Three candidate selection methods are defined, with their corresponding expressions presented as follows:
$$p_{t,i} \leftarrow p_{t,i} + \left(\frac{\sum_{i=1}^{N} p_{t,i}}{N} - p_{t,i}\right)\left(1 - \frac{\delta_t}{\max_i p_{t,i} - \min_i p_{t,i}}\right), \tag{11}$$
$$p_{t,i} \leftarrow \min_i p_{t,i} + \left(p_{t,i} - \min_i p_{t,i}\right) \frac{\delta_t}{\max_i p_{t,i} - \min_i p_{t,i}}, \tag{12}$$
$$p_{t,i} \leftarrow \max_i p_{t,i} - \left(\max_i p_{t,i} - p_{t,i}\right) \frac{\delta_t}{\max_i p_{t,i} - \min_i p_{t,i}}. \tag{13}$$
Expressions (11)–(13) can be defined as mean-normalized, min-normalized, and max-normalized scaling methods, respectively. The criterion for method selection depends primarily on the model’s training performance.
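A direct implementation of the three rules might look as follows; each rescales a price vector whose spread exceeds $\delta_t$ onto one whose spread equals exactly $\delta_t$, anchored at the mean, minimum, or maximum price respectively (a sketch consistent with Expressions (11)–(13), not production code).

```python
import numpy as np

def scale_to_gap(p, delta, anchor="min"):
    """Rescale a day's prices so max(p) - min(p) == delta (cf. Expr. (11)-(13))."""
    p = np.asarray(p, dtype=float)
    spread = p.max() - p.min()
    if spread <= delta:                  # already satisfies the group constraint
        return p
    k = delta / spread
    if anchor == "mean":                 # Expression (11): preserves the mean price
        return p.mean() + (p - p.mean()) * k
    if anchor == "min":                  # Expression (12): preserves the lowest price
        return p.min() + (p - p.min()) * k
    if anchor == "max":                  # Expression (13): preserves the highest price
        return p.max() - (p.max() - p) * k
    raise ValueError(f"unknown anchor: {anchor}")

prices = [380.0, 450.0, 560.0]           # spread 180 violates, e.g., delta = 60
for a in ("mean", "min", "max"):
    print(a, scale_to_gap(prices, 60.0, anchor=a).round(1))
```

Consistent with Section 5.1, the min-anchored rule keeps the lowest price fixed and pulls the others down toward it, which is why it tends to generate lower prices overall.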
The customer information table (Line 10), $info\_C_t$, stores the stay durations and group information of all customers at time $t$. Using $info\_C_t$, we can calculate the remaining rooms for state $s_{t+1}$.
After the training process, hotel room managers can perform inference using the policy network to generate pricing schemes. Once the customer group traffic is predicted and the number of unoccupied rooms for the next day is calculated, the state vector s t is obtained. Next, the embedding function converts the input into a state embedding vector, which serves as the actual input to the policy network. Following linear computation and non-linear activation in the hidden layers, the model ultimately outputs the prediction results. This process is also referred to as the forward propagation of the policy network. The forward propagation process of a neural network (i.e., inference) is extremely fast. In contrast, traditional solvers and heuristic algorithms, after time-consuming computations, can only provide fixed optimal solutions over the entire time period, rather than an optimal action tailored to the daily variations in customer group traffic and the remaining rooms.
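A minimal sketch of this inference step, assuming a tanh-bounded policy head whose output is rescaled to the admissible price range (the shapes, scaling, and function names are our assumptions, not the paper's code):

```python
import torch

@torch.no_grad()
def infer_prices(policy_net, embed, rooms_free, traffic_forecast,
                 p_low=300.0, p_high=640.0):
    """One forward pass of the trained policy network -> tomorrow's prices."""
    # Integer state (remaining rooms, per-group traffic forecast) -> embedding
    state = torch.tensor([[rooms_free, *traffic_forecast]], dtype=torch.long)
    x = embed(state).flatten(start_dim=1)
    raw = policy_net(x)                           # assumed tanh head in [-1, 1]
    prices = p_low + (raw + 1.0) / 2.0 * (p_high - p_low)
    return prices.squeeze(0).tolist()             # one price per customer group
```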

5. Case Study

5.1. Target Hotel and Model Parameters

In this section, we validate the proposed model and algorithm through a case study of a hotel located in Qingdao, China. As a large five-star non-downtown hotel in a tourist city, it exhibits a marked contrast between the off-season and peak-season, and its customer traffic is significantly influenced by the peak and off-peak tourist seasons. Accordingly, this paper examines both scenarios to analyze the fairness characteristics of pricing strategies. Our study is based on the hotel's historical operational data and market research from 2019, 2023, and 2024, excluding the period affected by the COVID-19 pandemic. During that period, the hotel room market was subject to government intervention and was not a rational competitive market, as both hotel room managers and consumers exhibited numerous irrational decision-making behaviors; this exceptional period therefore falls outside the applicable scope of the model.
Both dynamic discriminatory pricing and dynamic pricing are commonly applied in practice at the target hotel; therefore, we use both to simulate the effects of the pricing model established in Section 3. To compare the effects of the fairness parameters $\alpha_g$ and $\alpha_t$, we define four models: dynamic discriminatory pricing neglecting the fairness influence on customer traffic (DDP-N), dynamic discriminatory pricing considering the fairness influence on customer traffic (DDP-C), dynamic pricing neglecting the fairness influence on customer traffic (DP-N), and dynamic pricing considering the fairness influence on customer traffic (DP-C).
This hotel has 150 standard rooms, each with a marginal cost of ¥167. Based on historical data, we set the pricing ranges as [¥300, ¥640] for the off-season and [¥600, ¥1400] for the peak-season. Customers can stay for a maximum of 5 days, with the probability distribution of their stay duration given by $Z(d) = \{0.46, 0.34, 0.15, 0.035, 0.015\}$. Customers are classified into three groups according to their price sensitivity. Referring to the studies by [40,41], we define customer traffic as following a Poisson distribution $P_i$. In the training process of deep reinforcement learning, customer traffic is sampled according to $P_i$. In demand models, three functions are most commonly used to describe the relationship between demand and price: linear, logistic, and exponential [9]. In this paper, we define the probability $F_i$ that customers in group $i$ accept the price as a logistic function. For example, for customer group 1 on weekdays during the off-season, $F_1(p_{t,1}) = 1/(1 + e^{0.0366 (p_{t,1} - 400)})$. Once a price is given, the probability that customers in group 1 accept it can be calculated from the distribution function $F_1$.
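This calibration can be sanity-checked directly: at the reference price of ¥400 the acceptance probability of group 1 is exactly 0.5, and it decays quickly as the price rises toward the ceiling.

```python
import math

def F1(p):
    """Group 1 off-season weekday acceptance probability (Section 5.1)."""
    return 1.0 / (1.0 + math.exp(0.0366 * (p - 400.0)))

for p in (300, 400, 500, 640):
    print(f"price {p}: acceptance {F1(p):.3f}")
# -> 0.975 at ¥300, 0.500 at ¥400, 0.025 at ¥500, ~0.000 at ¥640
```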
In the studies by [34,42], consumption utility is treated as a linear function of the degree of consumers' fairness concerns. Within the context of hotel room pricing, customers' utility directly influences their decisions regarding room bookings. Accordingly, we define the linear fairness degree function $\lambda_i$ as a coefficient that affects customer traffic. Based on customer surveys, historical data, and the operational experience of the hotel's room managers, we evaluate and formulate $P_i$, $F_i$, and $\lambda_i$ for each group across both the off-season and peak-season, encompassing weekdays and weekends, as specified in Table 1. The probability distribution functions $F_i$ are shown in Figure 2. The values of $\sup_i \{\Delta p_{t,i}^*\}$ and $\sup_t \{\Delta p_{t,i}^*\}$ are evaluated based on customer survey data. During the off-season, the value of $\sup_i \{\Delta p_{t,i}^*\}$ on weekdays is ¥120, while on weekends the values are ¥160 and ¥130. Correspondingly, the values of $\sup_t \{\Delta p_{t,i}^*\}$ for the customer groups on weekdays and weekends are [¥90, ¥100, ¥120] and [¥180, ¥200, ¥240], respectively. In the peak-season, the weekday value of $\sup_i \{\Delta p_{t,i}^*\}$ is ¥200, with the weekend values being ¥250 and ¥220. Similarly, the values of $\sup_t \{\Delta p_{t,i}^*\}$ on weekdays and weekends are [¥180, ¥200, ¥240] and [¥240, ¥260, ¥320], respectively.
PPO is selected as the primary training algorithm for our hotel room pricing model. Both the actor network and the critic network consist of two hidden layers, each containing 64 neurons. The actor network uses the tanh activation function, while the critic network employs ReLU. For the PPO algorithm, the parameters are set as follows: discount rate $\gamma = 1$; learning rate of the actor network $\phi_a = 0.0001$; learning rate of the critic network $\phi_c = 0.0002$; both update steps are set to 10; the update batch size is 32; the clipping hyperparameter $\epsilon = 0.2$; and the number of training episodes is 2000 and 4000 in Stage I and 500 and 1000 in Stage II, corresponding to the off-season and peak-season, respectively.
In selecting the normalized scaling method for actions, the min-normalized scaling method (Expression (12)) tends to generate lower prices, which helps boost customer traffic; therefore, this method is adopted during the off-season training process. For the peak-season, we use the max-normalized scaling method (Expression (13)); with this setting, the profit performance is better than with the mean-normalized scaling method (Expression (11)).
The time horizon is set to 30 days; that is, each episode runs for 30 timesteps. Actions are continuous values; therefore, we utilize the continuous-action variant of the PPO algorithm, in which the policy network typically outputs the mean (action) and variance of a normal distribution. In our work, the policy network outputs only the action mean; the variance, which controls exploration, is scheduled to vary within a fixed range as training proceeds.
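A brief sketch of such an action head, assuming a linear decay schedule for the exploration standard deviation (the schedule and bounds are illustrative, not the paper's settings):

```python
import torch
from torch.distributions import Normal

def sample_action(mean_action, step, total_steps, std_max=0.5, std_min=0.05):
    """Continuous PPO action: the network outputs only the mean; the std
    follows a fixed schedule to control exploration (illustrative decay)."""
    frac = min(step / total_steps, 1.0)
    std = std_max + (std_min - std_max) * frac     # linear decay over training
    dist = Normal(mean_action, std)
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)   # log-prob for the PPO ratio

mu = torch.tensor([0.2, -0.1, 0.4])                # normalized price per group
action, logp = sample_action(mu, step=1000, total_steps=60_000)
```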

5.2. Results Analysis

Four pricing models (DDP-N, DDP-C, DP-N, and DP-C) are trained using deep reinforcement learning algorithms; the convergence curves of training rewards are shown in Figure 3. Comparing the normalized scaling methods shows that the mean-normalized method performs worse in both the off-season and peak-season. Therefore, the min-normalized scaling method for the off-season and the max-normalized scaling method for the peak-season are adopted as the primary results of this study. For DDP-C, $\alpha_g$ and $\alpha_t$ are both set to 0.5. For DP-C, $\alpha_g = 1$ and $\alpha_t = 0.5$. Due to the absence of fairness constraints, DDP-N and DP-N are trained with classical PPO, whereas DDP-C and DP-C use the two-stage PPO algorithm proposed in Section 4.2. The vertical red dotted lines at episodes 2000 and 4000 mark the boundary between the two training stages. The PPO algorithm demonstrates favorable reward convergence for all four models after a sufficient number of training episodes. Stage II training can improve the reward to some extent while ensuring the temporal fairness constraint is satisfied (see the curves to the right of the vertical red dotted lines).
Whether in the off-season or peak-season, the DDP-N (the red line) achieves a better reward than the DP-N (the yellow line). This indicates that dynamic discriminatory pricing contributes to higher profits compared to dynamic pricing alone. When considering the impact of fairness constraints on customer traffic, dynamic discriminatory pricing (the green line) outperforms dynamic pricing (the blue line) slightly. This suggests that the benefits of discriminatory pricing can offset the effects of group fairness on customer traffic. The gaps between the curves in the off-season are larger than those in the peak-season, indicating that fairness has a more significant impact on hotel profits when customer traffic is scarce.
Upon completion of the training process, the optimal pricing strategy can be inferred via the policy network. Figure 4 shows the optimal pricing strategies for all customer groups during both the off-season and peak-season under different normalized scaling methods. This example is based on the parameter settings $\alpha_g = 0.5$ and $\alpha_t = 0.5$. The black lines represent the optimal pricing strategies generated by the Stage I trained model: the black dotted lines indicate the optimal pricing strategy for each customer group, while the black solid line shows the average room price across all groups. Similarly, the red lines correspond to the results from Stage II: the red dotted lines denote the optimal pricing strategy for each customer group, and the red solid line represents the average room price. Prior to Stage II training, for customer group 1 on weekdays during the off-season, the minimum price under the mean-normalized action scaling method is ¥327 on day 4, with 11 days exceeding the temporal constraint; during the peak-season, 6 days exceed this constraint. Under the min-normalized action scaling method during the peak-season, only day 14 exceeds the temporal constraint, and the prices are lower than those of the mean-normalized scaling method. Following the two-stage training process, every developed pricing strategy satisfies the specified group and temporal constraints. The prices generated by the Stage I model fluctuate more dramatically over time than those from Stage II, which demonstrates the constraining effect on the optimal solution achieved by our two-stage reinforcement learning algorithm.
We employ key performance indicators, including Mean Room Rate (MRR), Profit, Daily Guest Acquisition (DGA), and Room Occupancy Percentage (ROP), to evaluate and compare performance across the models. The results are shown in Table 2. When fairness considerations are excluded, the dynamic discriminatory pricing model DDP-N achieves a hotel profit of ¥0.835 million during the off-season, compared to ¥0.783 million for the dynamic pricing model DP-N, a 6.64% profit increase under the DDP-N framework. In the peak-season, the DDP-N model yields a 3.96% higher profit than the DP-N model. This shows that differential pricing is more effective in optimizing revenue under off-season conditions, where room supply exceeds demand. When the impact of fairness pricing is considered, the decline in customer traffic directly translates into reduced room profitability. When the fairness parameters are set to $\alpha_g = 0.3$ and $\alpha_t = 0.5$, the profit drops to ¥0.734 million, and the MRR falls sharply from ¥371 to ¥348. Lowering room rates to increase customer traffic can effectively offset the traffic loss caused by fairness factors, thereby boosting profits. Both profit and ROP improve as fairness increases. Specifically, when the fairness parameters are set to $\alpha_g = 0.5$ and $\alpha_t = 0.7$, the profit rises to ¥0.770 million, while the ROP increases from 86.5% to 90.9%. During the peak-season, all models achieve 100% ROP while demonstrating significantly higher DGA than in the off-season. The DDP-N model generates optimal profits of ¥2.913 million, maintaining its superior performance over both the DP-N and DDP-C models, consistent with the off-season. When the impact of fairness is considered, both MRR and profit exhibit trends similar to those of the off-season.

5.3. Algorithm Comparison

In addition to PPO, we also utilize three other deep reinforcement learning algorithms—Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Actor-Critic (AC)—to validate the proposed two-stage deep reinforcement learning algorithm. For the comparative algorithms DDPG, TD3, and AC, the same network architecture is adopted. The hyperparameters are listed in Table 3. The training results are presented in Figure 5. For the DDP-N model, PPO, DDPG, and TD3 exhibit the same convergence trend and converge to a similar reward level during both the off-season and peak-season. In contrast, AC fluctuates significantly. Although it shows an improving trend, its instability limits its applicability. Therefore, we only implement it for the DDP-N model during the off-season. For the DDP-C model, the comparative algorithms exhibit certain differences in performance.
Deep reinforcement learning algorithms often require substantial computational time for training. Under a computer configuration equipped with an Intel(R) Core(TM) i7-1165G7 CPU (2.80 GHz) and 16 GB of RAM, the training times for the PPO, DDPG, TD3, and AC algorithms are about 20, 35, 46, and 22 min for off-season of DDP-N, respectively. Each algorithm requires approximately 0.1 s of inference time for a 30-day period. Overall, PPO outperforms other comparative algorithms; therefore, PPO is selected as the primary algorithm for this study.
Nonlinear programming (NLP) solvers and heuristic algorithms are widely used to address optimization problems. However, solvers have difficulty handling non-continuous problems and uncertain problems involving random variables. In our model, Equation (10) is markedly non-continuous, and the customer length of stay follows a discrete probability distribution. These characteristics render the problem intractable for standard solvers. Consequently, we employ a heuristic algorithm, specifically Particle Swarm Optimization (PSO) [23], as a benchmark against the reinforcement learning approach. The solution results and historical data are shown in Table 4. For the DDP-C model with $\alpha_g = 0.5$ and $\alpha_t = 0.5$, PSO outperforms PPO in terms of profit, because the fairness constraints narrow the search range and accelerate convergence. Owing to differences in their design mechanisms, PSO is more aggressive in the optimization process, and its results are slightly better than those of PPO, a common phenomenon in solving optimization problems. In contrast, for the DDP-N model, PSO performs significantly worse than PPO; its off-season and peak-season profits reach only ¥0.638 million and ¥2.398 million, respectively. When using the DDP-C model, the PPO algorithm achieves a 7% and 4.8% profit increase during the off-season and peak-season, respectively, compared to the corresponding historical profit (i.e., the average monthly profit).

5.4. Supplementary Experiments

To further validate the effectiveness of our model, we conducted experiments on two additional hotels located in Qingdao, China. Hotel BH is a four-star hotel with 110 rooms whose scale and infrastructure are inferior to those of the case hotel. Hotel HJ is a large hotel meeting five-star standards with 280 rooms, whose infrastructure and customer groups are similar to those of the case hotel. The training results of both hotels obtained with the PPO algorithm are presented in Figure 6; the four models (for the off-season and peak-season, respectively) exhibit excellent convergence characteristics. Additionally, we compare the profit results against the PSO algorithm and historical data, as shown in Table 5. For the DDP-N model, the PPO algorithm performs significantly better than the PSO algorithm; in contrast, the PSO algorithm is slightly superior to PPO for the DDP-C model. Both algorithms achieve better results than the historical data, consistent with the findings for the case hotel.
To analyze the impact of the number of customer segments on training performance, we split the three customer groups of the case hotel (Section 5.1) into 5 and 7 groups, respectively. To facilitate comparison, we only split the customer traffic data while keeping all other parameters unchanged. Simulation experiments based on the DDP-C model with the PPO algorithm were then conducted for both the off-season and peak-season. The training results after 5000 episodes are presented in Figure 7. The reward curves of the three segmentations exhibit a consistent convergence trend. During the off-season, the optimal profits of the predicted pricing strategy, derived from the trained policy network, reach ¥0.752 million, ¥0.784 million, and ¥0.776 million for the 3-group, 5-group, and 7-group segmentations, respectively; during the peak-season, the corresponding profits are ¥2.785 million, ¥2.831 million, and ¥2.746 million. In theory, more segments enable more refined pricing and thus higher profits; however, they also increase the dimensionality of the action space, making model training more complex. The 5-group segmentation achieves the best performance during both the off-season and peak-season. In the off-season, the 7-group segmentation outperforms the 3-group model, whereas during the peak-season, the 3-group segmentation performs better than the 7-group model. In practice, a balance must be struck between the number of customer segments, expected profit performance, and model training time.

5.5. Discussion

Typically, pricing problems are formulated as either single-objective or multi-objective optimization models. Cohen et al. [9] proposed a price discrimination model with fairness constraints, and in 2025, Cohen et al. [10] further developed a dynamic pricing model; both are structured as single-objective optimization problems. In contrast, Kallus and Zhou [35] formulated a personalized pricing model as a multi-objective optimization problem. To facilitate solving such models, the objective functions must satisfy the assumptions of differentiability and quasi-concavity. However, these classic pricing models, which apply to typical seller-customer scenarios, differ from those tailored to hotel room pricing. In our optimization model (9), the objective function fails to satisfy the assumptions of differentiability and quasi-concavity due to the limitation imposed by room quantity (see Expression (10)). Therefore, hotel room pricing models require specialized techniques. For instance, Bayoumi et al. [25] proposed a hotel room pricing model in which a reference price is multiplied by five price multipliers. Ref. [8], building on Aziz et al. [7], defined hotel room pricing models as single-objective nonlinear optimization problems and identified the elasticity range that guarantees a feasible solution to the model. Our model differs from those of [7,8], which categorize hotel rooms into different groups with varying prices; in contrast, we segment customers based on their consumption ability without classifying rooms. Our model is specifically designed for a pricing scenario in which customer profiling based on behavioral data analysis is used to implement differential pricing.
In terms of solution methods for hotel room pricing models, ref. [25] employed a Monte Carlo simulation-based optimization algorithm to test various multiplier values and determine the optimal prices. When dealing with a large number of optimization variables, the efficiency of the Monte Carlo method becomes very low, and it may even fail to produce a solution. The models in [7,8] were solved using nonlinear solvers available in optimization software toolboxes. These solvers employ an iterative approach to converge toward an approximate optimal solution that satisfies the problem constraints. However, this method becomes computationally expensive as the number of variables grows. For instance, solving the model proposed by [7] takes approximately 30 min. Meanwhile, the optimal solution yields fixed prices throughout the entire period. When customer numbers fluctuate and deviate from predictions, this optimal solution may well become non-optimal. Consequently, such solution methods have limitations in practical applications. Reinforcement learning algorithms are capable of addressing these issues. Tuncay et al. [17] proposed a Q-learning algorithm to optimize room pricing across multiple hotels. Different from our model, ref. [17] did not consider the fairness constraints and discriminatory pricing problem. In our model, the action is treated as a continuous price variable for each customer group, unlike the discrete action values used in [12,17]. Consequently, our model exhibits stronger representational power for room pricing.
Optimal results from the model neglecting fairness perception (DDP-N) indicate that dynamic discriminatory pricing can yield higher profits, which aligns with the findings of [43,44]. However, when the impact of fairness on customer traffic is considered, optimal profits decrease; this result is consistent with Li and Jain [34], who also observed that fairness concerns lead to lower profits in the second game period. Abrate et al. [45] found that prices tend to increase when hotel availability is scarce; therefore, the higher prices and larger suprema of the absolute unfair price differences during weekends and the peak-season are reasonable and expected, and our results are highly consistent with this characteristic.
The pricing strategies of competitors serve as important references for hotel room managers. Competitors' price changes alter consumers' price sensitivity, thereby affecting the hotel's actual customer traffic. If the price sensitivity of a customer group (i.e., $F_i$ in the model) changes significantly, the previously trained model will exhibit inference bias, resulting in ineffective pricing. Therefore, hotel room managers need to closely monitor changes in consumers' price sensitivity and fine-tune the model when $F_i$ changes significantly. Managers should adjust the $F_i$ parameter based on price sensitivity predictions [26], fine-tune the model starting from the pre-trained network structure and parameters, and then compute the pricing strategy. The model's pre-training can be conducted offline; the well-trained model is then integrated into the hotel room revenue management system for price prediction. When fine-tuning is required, a fixed number of episodes should be set before performing online fine-tuning. The time consumed by fine-tuning depends on the server configuration and the number of episodes, which requires case-specific configuration. In extreme cases, the model needs to be fine-tuned daily; however, based on the historical operational data of the three case hotels, the price elasticity of the various user groups remains basically unchanged within a week, with significant changes occurring in cycles of 1 to 2 months.

6. Conclusions

To address the challenge of setting hotel room prices to maximize revenue under fairness constraints, this paper proposes a unified pricing framework that integrates four distinct pricing models: fixed pricing, discriminatory pricing, dynamic pricing, and dynamic discriminatory pricing. Subsequently, a two-stage deep reinforcement learning algorithm is designed to train a policy network for predicting optimal prices. This model and algorithm are applied and validated using a case study. The PPO algorithm is employed to validate the efficacy of the proposed two-stage deep reinforcement learning approach over simulated 30-day off-season and peak-season periods. Furthermore, alternative algorithms, including DDPG and TD3, are also tested and proven effective, achieving similarly high levels of optimal revenue. The deep reinforcement learning approach generates optimal pricing decisions by responding to fluctuations in customer demand and room availability, making it more adaptable to real-world operational environments than conventional methods such as nonlinear programming solvers and heuristic algorithms.
For hotel room managers, training a well-performing two-stage deep reinforcement learning algorithm tailored to their specific operational needs is a highly rewarding and efficiency-boosting endeavor. A large quantity of high-quality data, encompassing operational, market, and customer features, is required to serve as the foundation for the algorithmic environment. Based on this data, hotel room managers must accurately categorize customers into specific segments. They then need to calculate the corresponding customer traffic, price acceptance probability, and perceived fairness of pricing differences for weekdays and weekends across different seasons. Once sufficiently trained, the deep reinforcement learning policy network can be deployed as an agent to automatically generate pricing strategies. It is recommended that hotel managers mitigate perceived price unfairness among group customers by enhancing service differentiation and room amenities, particularly during the off-season.
Our study focuses on the dynamic discriminatory pricing of a single hotel and does not account for the pricing game among multiple hotels. This limitation may yield an incomplete picture of market dynamics, restricting our analysis of the macro-level characteristics of the hotel room market within a given area. Future research should incorporate competitors' pricing strategies to study optimal pricing models within a dynamic game framework. Moreover, under big data-based discriminatory pricing, every customer with unique features can be treated as an individual customer segment; dynamic discriminatory pricing based on swarm-agent deep reinforcement learning therefore represents a valuable research direction.

Author Contributions

Conceptualization, X.W. and W.L. (Wei Liu); methodology, X.W. and W.L. (Wei Liu); software, X.W. and Y.X.; validation, L.J. and W.L. (Wenting Lv); formal analysis, X.W. and L.J.; investigation, X.W., Y.X. and W.L. (Wenting Lv); resources, L.J. and W.L. (Wenting Lv); data curation, X.W. and Y.X.; writing—original draft preparation, X.W. and Y.X.; writing—review and editing, X.W. and L.J.; visualization, X.W.; supervision, L.J. and W.L. (Wei Liu); project administration, L.J.; funding acquisition, L.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Laboratory Project of Institute for Digital Transformation at China University of Petroleum; Laboratory Project of Higher Education Institutions in Shandong Province—Energy System Intelligent Management and Policy Simulation Laboratory at China University of Petroleum; Youth Innovation Team of Higher Education Institutions in Shandong Province—Data Intelligence Innovation Team at China University of Petroleum (Grant No. 2021RW041); Teaching Case Collection Construction Project on Postgraduate Education of Shandong Province (Grant No. SDYAL2023030); and Teaching Research and Reform Project at China University of Petroleum (Grant No. CM2024050).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for this study is available on GitHub: https://github.com/wxmwer/hotel-room-pricing (accessed on 15 October 2025).

Conflicts of Interest

Author Wenting Lv was employed by the Crowne Plaza Hotel in Qingdao, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. den Boer, A.V. Dynamic pricing and learning: Historical origins, current research, and new directions. Surv. Oper. Res. Manag. Sci. 2015, 20, 1–18. [Google Scholar] [CrossRef]
  2. Masiero, L.; Viglia, G.; Nieto-Garcia, M. Strategic consumer behavior in online hotel booking. Ann. Tour. Res. 2020, 83, 102947. [Google Scholar] [CrossRef]
  3. Chen, Y.F.; Pang, T.T.; Kuslina, B.H. The effect of price discrimination on fairness perception and online hotel reservation intention. J. Theor. Appl. Electron. Commer. Res. 2023, 18, 1320–1337. [Google Scholar] [CrossRef]
  4. Steinberg, E. Big data and personalized pricing. Bus. Ethics Q. 2020, 30, 97–117. [Google Scholar] [CrossRef]
  5. Nowak, M.; Pawłowska-Nowak, M. Dynamic pricing method in the E-commerce industry using machine learning. Appl. Sci. 2024, 14, 11668. [Google Scholar] [CrossRef]
  6. Wu, Z.; Yang, Y.; Zhao, J.; Wu, Y. The impact of algorithmic price discrimination on consumers’ perceived betrayal. Front. Psychol. 2022, 13, 825420. [Google Scholar] [CrossRef]
  7. Aziz, H.A.; Saleh, M.; Rasmy, M.H.; ElShishiny, H. Dynamic room pricing model for hotel revenue management systems. Egypt. Inform. J. 2011, 12, 177–183. [Google Scholar] [CrossRef]
  8. Fadly, M.; Ridwan, A.Y.; Akbar, M.D. Hotel room price determination based on dynamic pricing model using nonlinear programming method to maximize revenue. In Proceedings of the 2nd International Conference on Applied Information Technology and Innovation (ICAITI), Bali, Indonesia, 21–22 September 2019; pp. 190–196. [Google Scholar] [CrossRef]
  9. Cohen, M.C.; Elmachtoub, A.N.; Lei, X. Price discrimination with fairness constraints. Manag. Sci. 2022, 68, 8536–8552. [Google Scholar] [CrossRef]
  10. Cohen, M.C.; Miao, S.; Wang, Y. Dynamic pricing with fairness constraints. Oper. Res. 2025; ahead of print. [Google Scholar] [CrossRef]
  11. Ivanov, S.; Del Chiappa, G.; Heyes, A. The research-practice gap in hotel revenue management: Insights from Italy. Int. J. Hosp. Manag. 2021, 95, 102924. [Google Scholar] [CrossRef]
  12. Maestre, R.; Duque, J.; Rubio, A.; Arevalo, J. Reinforcement learning for fair dynamic pricing. In Proceedings of the 2018 Intelligent Systems Conference (IntelliSys), London, UK, 6–7 September 2018; Volume 868, pp. 120–135. [Google Scholar] [CrossRef]
  13. Lawhead, R.; Gosavi, A. A bounded actor-critic reinforcement learning algorithm applied to airline revenue management. Eng. Appl. Artif. Intell. 2019, 82, 252–262. [Google Scholar] [CrossRef]
  14. Bondoux, N.; Nguyen, A.Q.; Fiig, T.; Acuna-Agost, R. Reinforcement learning applied to airline revenue management. J. Revenue Pricing Manag. 2020, 19, 332–348. [Google Scholar] [CrossRef]
  15. Qiao, W.; Huang, M.; Gao, Z.; Wang, X. Distributed dynamic pricing of multiple perishable products using multi-agent reinforcement learning. Expert Syst. Appl. 2024, 237, 121252. [Google Scholar] [CrossRef]
  16. Lange, F.; Dreessen, L.; Schlosser, R. Reinforcement learning versus data-driven dynamic programming: A comparison for finite horizon dynamic pricing markets. J. Revenue Pricing Manag. 2025; ahead of print. [Google Scholar] [CrossRef]
  17. Tuncay, G.; Kaya, K.; Yilmaz, Y.; Yaslan, Y.; Ögüdücü, S. A reinforcement learning based dynamic room pricing model for hotel industry. INFOR Inf. Syst. Oper. Res. 2024, 62, 211–231. [Google Scholar] [CrossRef]
  18. Nicolini, M.; Piga, C.; Pozzi, A. From uniform to bespoke prices: Hotel pricing during EURO 2016. Quant. Mark. Econ. 2023, 21, 333–355. [Google Scholar] [CrossRef]
  19. Saito, T.; Takahashi, A.; Koide, N.; Ichifuji, Y. Application of online booking data to hotel revenue management. Int. J. Inf. Manag. 2019, 46, 37–53. [Google Scholar] [CrossRef]
  20. Martins, A.; Silva, L.; Marques, J. Data Science in supporting hotel management: Application of predictive models to booking.com guest evaluations. In Proceedings of the Advances in Tourism, Technology and Systems, ICOTTS 2023, Bacalar, Mexico, 2–4 November 2023; Volume 384, pp. 51–59. [Google Scholar] [CrossRef]
  21. Ye, P.; Qian, J.; Chen, J.; Wu, C.-H. Customized regression model for Airbnb dynamic pricing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 19–23 August 2018; pp. 932–940. [Google Scholar] [CrossRef]
  22. Vives, A.; Jacob, M. Dynamic pricing for online hotel demand: The case of resort hotels in Majorca. J. Vacat. Mark. 2020, 26, 268–283. [Google Scholar] [CrossRef]
  23. Mullen, P.B.; Monson, C.K.; Seppi, K.D.; Warnick, S.C. Particle swarm optimization in dynamic pricing. In Proceedings of the 2006 IEEE International Conference on Evolutionary Computation, Vancouver, BC, Canada, 16–21 July 2006; pp. 1232–1239. [Google Scholar] [CrossRef]
  24. Yang, J.; Xia, Y. A nonatomic-game approach to dynamic pricing under competition. Prod. Oper. Manag. 2013, 22, 88–103. [Google Scholar] [CrossRef]
  25. Bayoumi, A.E.M.; Saleh, M.; Atiya, A.F.; Aziz, H.A. Dynamic pricing for hotel revenue management using price multipliers. J. Revenue Pricing Manag. 2013, 12, 271–285. [Google Scholar] [CrossRef]
  26. Zhu, F.; Xiao, W.; Yu, Y.; Wang, Z.; Chen, Z.; Lu, Q.; Liu, Z.; Wu, M.; Ni, S. Modeling price elasticity for occupancy prediction in hotel dynamic pricing. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM '22), Atlanta, GA, USA, 17–21 October 2022; pp. 4742–4746. [Google Scholar] [CrossRef]
  27. Huang, L.; Zheng, W. Hotel demand forecasting: A comprehensive literature review. Tour. Rev. 2023, 78, 218–244. [Google Scholar] [CrossRef]
  28. Zhang, D.; Niu, B. Leveraging online reviews for hotel demand forecasting: A deep learning approach. Inf. Process. Manag. 2024, 61, 103527. [Google Scholar] [CrossRef]
  29. Huang, L.; Li, C.; Zheng, W. Daily hotel demand forecasting with spatiotemporal features. Int. J. Contemp. Hosp. Manag. 2025, 35, 26–45. [Google Scholar] [CrossRef]
  30. Wu, J.; Li, M.; Zhao, E.; Sun, S.; Wang, S. Can multi-source heterogeneous data improve the forecasting performance of tourist arrivals amid COVID-19? Mixed-data sampling approach. Tour. Manag. 2023, 98, 104759. [Google Scholar] [CrossRef]
  31. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  32. Lu, R.; Hong, S.H.; Zhang, X. A dynamic pricing demand response algorithm for smart grid: Reinforcement learning approach. Appl. Energy 2018, 220, 220–230. [Google Scholar] [CrossRef]
  33. Tarrahi, F.; Eisend, M.; Dost, F. A meta-analysis of price change fairness perceptions. Int. J. Res. Mark. 2016, 33, 199–203. [Google Scholar] [CrossRef]
  34. Li, K.J.; Jain, S. Behavior-based pricing: An analysis of the impact of peer-induced fairness. Manag. Sci. 2016, 62, 2705–2721. [Google Scholar] [CrossRef]
  35. Kallus, N.; Zhou, A. Fairness, welfare, and equity in personalized pricing. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event, 3–10 March 2021; pp. 296–314. [Google Scholar] [CrossRef]
  36. Gupta, S.; Kamble, V. Individual fairness in hindsight. J. Mach. Learn. Res. 2021, 22, 1–35. Available online: http://jmlr.org/papers/v22/19-658.html (accessed on 29 August 2025).
  37. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017. [Google Scholar] [CrossRef]
  38. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N. Continuous control with deep reinforcement learning. arXiv 2015. [Google Scholar] [CrossRef]
  39. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in Actor-Critic methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1587–1596. [Google Scholar] [CrossRef]
  40. Lee, M. Modeling and forecasting hotel room demand based on advance booking information. Tour. Manag. 2018, 66, 62–71. [Google Scholar] [CrossRef]
  41. Vajpai, G.N. Managing overbooking in hotels: A probabilistic model using Poisson distribution. Int. J. Adv. Res. Ideas Innov. Technol. 2018, 4, 1376–1379. [Google Scholar]
  42. Huang, H.; Wu, D.; Xu, H. Signaling or not? The pricing strategy under fairness concerns and cost information asymmetry. Eur. J. Oper. Res. 2025, 321, 789–799. [Google Scholar] [CrossRef]
  43. Pazgal, A.; Soberman, D. Behavior-based discrimination: Is it a winning play, and if so, when? Mark. Sci. 2008, 27, 977–994. [Google Scholar] [CrossRef]
  44. Shin, J.; Sudhir, K. A customer management dilemma: When is it profitable to reward one’s own customers? Mark. Sci. 2010, 29, 671–689. [Google Scholar] [CrossRef]
  45. Abrate, G.; Fraquelli, G.; Viglia, G. Dynamic pricing strategies: Evidence from European hotels. Int. J. Hosp. Manag. 2012, 31, 160–168. [Google Scholar] [CrossRef]
Figure 1. Framework of the hotel room pricing model with DRL (PPO algorithm).
Figure 2. Probability distribution functions $F_i$. The vertical dashed lines represent the prices that the corresponding groups accept with a 0.5 probability. (a) Off-season weekdays; (b) off-season weekends; (c) peak-season weekdays; (d) peak-season weekends.
Figure 3. Comparison of PPO training rewards among different pricing models. (a) Off-season with min-normalized scaling method; (b) peak-season with max-normalized scaling method; (c) off-season with mean-normalized scaling method; (d) peak-season with mean-normalized scaling method.
Figure 4. The optimal room prices of customer groups over the 30-day period. (a) Off-season with min-normalized scaling method; (b) peak-season with max-normalized scaling method; (c) off-season with mean-normalized scaling method; (d) peak-season with mean-normalized scaling method.
Figure 5. Comparison of reinforcement learning algorithms. The red vertical dashed line marks the boundary between the two training stages. (a) Off-season of DDP-N; (b) peak-season of DDP-N; (c) off-season of DDP-C; (d) peak-season of DDP-C.
Figure 6. Training results of hotel BH and hotel HJ via the PPO algorithm. The red vertical dashed line marks the boundary between the two training stages. (a) Off-season of hotel BH; (b) peak-season of hotel BH; (c) off-season of hotel HJ; (d) peak-season of hotel HJ.
Figure 7. Training results for multiple customer groups based on the DDP-C model and PPO algorithm. The red vertical dashed line marks the boundary between the two training stages. (a) Off-season; (b) peak-season.
Table 1. Formulas for $P_i$, $F_i$, and $\lambda_i$ in the case hotel.

| Season | Day type | Group | $P_i$ | $F_i$ | $\lambda_i$ |
|---|---|---|---|---|---|
| Off-season | Weekdays | 1 | Poisson(27) | $1/(1+e^{0.0366(x-400)})$ | $0.895+0.105\alpha_t$ |
| Off-season | Weekdays | 2 | Poisson(39) | $1/(1+e^{0.0275(x-460)})$ | $0.835+0.1\alpha_t+0.065\alpha_g$ |
| Off-season | Weekdays | 3 | Poisson(25) | $1/(1+e^{0.0220(x-520)})$ | $0.862+0.08\alpha_t+0.058\alpha_g$ |
| Off-season | Weekends | 1 | Poisson(32) | $1/(1+e^{0.0549(x-360)})$ | $0.848+0.122\alpha_t$ |
| Off-season | Weekends | 2 | Poisson(46) | $1/(1+e^{0.0366(x-420)})$ | $0.813+0.114\alpha_t+0.073\alpha_g$ |
| Off-season | Weekends | 3 | Poisson(24) | $1/(1+e^{0.0275(x-480)})$ | $0.824+0.094\alpha_t+0.082\alpha_g$ |
| Peak-season | Weekdays | 1 | Poisson(73) | $1/(1+e^{0.0220(x-770)})$ | $0.95+0.05\alpha_t$ |
| Peak-season | Weekdays | 2 | Poisson(52) | $1/(1+e^{0.0146(x-880)})$ | $0.926+0.043\alpha_t+0.031\alpha_g$ |
| Peak-season | Weekdays | 3 | Poisson(37) | $1/(1+e^{0.0105(x-990)})$ | $0.958+0.02\alpha_t+0.022\alpha_g$ |
| Peak-season | Weekends | 1 | Poisson(90) | $1/(1+e^{0.0220(x-750)})$ | $0.925+0.075\alpha_t$ |
| Peak-season | Weekends | 2 | Poisson(64) | $1/(1+e^{0.0146(x-850)})$ | $0.871+0.083\alpha_t+0.046\alpha_g$ |
| Peak-season | Weekends | 3 | Poisson(42) | $1/(1+e^{0.0105(x-950)})$ | $0.918+0.05\alpha_t+0.032\alpha_g$ |
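To make the table concrete, the snippet below evaluates its three quantities for one segment (off-season weekday group 2): daily arrivals drawn from $P_2$, the logistic acceptance probability $F_2(x)$, and the fairness perception $\lambda_2$ for given fairness parameters $\alpha_t$ and $\alpha_g$. It is a minimal reading of the table's formulas, not the authors' simulation code.

```python
# Direct evaluation of Table 1's formulas for off-season weekday group 2.
import math
import numpy as np

alpha_t, alpha_g = 0.5, 0.5        # temporal and group fairness parameters

def acceptance_prob(x: float) -> float:
    """F_2(x) = 1 / (1 + exp(0.0275 * (x - 460))): booking probability at price x."""
    return 1.0 / (1.0 + math.exp(0.0275 * (x - 460.0)))

def fairness_perception() -> float:
    """lambda_2 = 0.835 + 0.1 * alpha_t + 0.065 * alpha_g (Table 1)."""
    return 0.835 + 0.1 * alpha_t + 0.065 * alpha_g

rng = np.random.default_rng(42)
arrivals = rng.poisson(39)          # P_2: daily arrivals ~ Poisson(39)
price = 460.0                       # midpoint price, accepted with probability 0.5
expected_bookings = arrivals * acceptance_prob(price)

print(f"arrivals={arrivals}, F_2({price:.0f})={acceptance_prob(price):.2f}, "
      f"expected bookings={expected_bookings:.1f}, lambda_2={fairness_perception():.3f}")
# F_2(460) = 0.5 by construction, matching the dashed line in Figure 2a.
```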
Table 2. Performance metrics under optimal pricing strategies of different models.

| ($\alpha_g$, $\alpha_t$) / Model | Off-season MRR | Off-season Profit (M) | Off-season DGAR | Off-season OP | Peak-season MRR | Peak-season Profit (M) | Peak-season DGAR | Peak-season OP |
|---|---|---|---|---|---|---|---|---|
| 0.5, 0.3 | 348 | 0.734 | 74 | 86.5% | 779 | 2.782 | 83 | 100% |
| 0.5, 0.5 | 350 | 0.752 | 74 | 88.6% | 778 | 2.785 | 85 | 100% |
| 0.5, 0.7 | 350 | 0.770 | 76 | 90.9% | 779 | 2.798 | 84 | 100% |
| 0.3, 0.5 | 361 | 0.748 | 71 | 83.1% | 773 | 2.769 | 84 | 100% |
| 0.7, 0.5 | 350 | 0.771 | 76 | 91.4% | 781 | 2.803 | 84 | 100% |
| DDP-N | 371 | 0.835 | 75 | 89.3% | 822 | 2.913 | 82 | 100% |
| DP-N | 372 | 0.783 | 69 | 83.1% | 783 | 2.802 | 83 | 100% |
Table 3. Hyperparameter settings for the comparative algorithms. Entries marked "/" do not apply to that algorithm.

| Hyperparameter | PPO | DDPG | TD3 | AC |
|---|---|---|---|---|
| Actor learning rate | 0.0001 | 0.0001 | / | 0.0001 |
| Critic learning rate | 0.0002 | 0.0002 | / | 0.0002 |
| Soft update coefficient | / | 0.01 | 0.01 | / |
| Q-network learning rate | / | / | 0.0003 | / |
| Policy network learning rate | / | / | 0.0003 | / |
| Reward discount rate | 1 | 1 | 1 | 1 |
| Batch size | 32 | 32 | 32 | 32 |
| Training episodes | 2000 | 2000 | 2000 | 2000 |
| Days per episode | 30 | 30 | 30 | 30 |
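For reproducibility, the PPO column of Table 3 can be collected into a small configuration object such as the sketch below; the field names are ours, and the repository linked in the Data Availability Statement may organize these settings differently.

```python
# PPO settings from Table 3 gathered in a dataclass (illustrative naming).
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    actor_lr: float = 1e-4        # actor learning rate
    critic_lr: float = 2e-4       # critic learning rate
    gamma: float = 1.0            # reward discount rate (undiscounted horizon)
    batch_size: int = 32
    training_episodes: int = 2000
    days_per_episode: int = 30    # one episode simulates a 30-day pricing horizon

config = PPOConfig()
```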
Table 4. Comparison of algorithm results and historical data.

| Algorithm | Profit (M) | Training (Running) Time | Episodes (Iterations) |
|---|---|---|---|
| PPO (Off-season, DDP-C) | 0.752 | 25 min | 2500 |
| PPO (Peak-season, DDP-C) | 2.785 | 50 min | 5000 |
| PPO (Off-season, DDP-N) | 0.835 | 20 min | 2000 |
| PPO (Peak-season, DDP-N) | 2.913 | 40 min | 4000 |
| PSO (Off-season, DDP-C) | 0.789 | 35 min | 5000 |
| PSO (Peak-season, DDP-C) | 2.822 | 35 min | 5000 |
| PSO (Off-season, DDP-N) | 0.638 | >1 h | 10,000 |
| PSO (Peak-season, DDP-N) | 2.398 | >1 h | 10,000 |
| Historical Data (Off-season) | 0.703 | / | / |
| Historical Data (Peak-season) | 2.658 | / | / |
Table 5. Comparison of algorithm results and historical data for two hotels.

| Hotel | Algorithm | Off-Season Profit (M) | Peak-Season Profit (M) |
|---|---|---|---|
| Hotel BH | PPO (DDP-C) | 0.631 | 1.418 |
| Hotel BH | PPO (DDP-N) | 0.741 | 1.624 |
| Hotel BH | PSO (DDP-C) | 0.661 | 1.486 |
| Hotel BH | PSO (DDP-N) | 0.680 | 1.555 |
| Hotel BH | Historical Data | 0.630 | 1.400 |
| Hotel HJ | PPO (DDP-C) | 1.819 | 3.125 |
| Hotel HJ | PPO (DDP-N) | 1.954 | 3.514 |
| Hotel HJ | PSO (DDP-C) | 1.824 | 3.250 |
| Hotel HJ | PSO (DDP-N) | 1.771 | 3.377 |
| Hotel HJ | Historical Data | 1.670 | 2.970 |