Algorithms
  • Article
  • Open Access

5 November 2025

Product Recommendation with Price Personalization According to Customer’s Willingness to Pay Using Deep Reinforcement Learning

1 Department of Electrical and Computer Engineering, University of Tehran, Tehran 14179-35840, Iran
2 Tehran Institute for Advanced Studies, Khatam University, Tehran 19916-33357, Iran
* Author to whom correspondence should be addressed.

Abstract

Integrating recommendation systems with dynamic pricing strategies is essential for enhancing product sales and optimizing revenue in modern business. This study proposes a novel product recommendation model that uses Reinforcement Learning to tailor pricing strategies to customer purchase intentions. While traditional recommendation systems focus on identifying products customers prefer, they often neglect the critical factor of pricing. To improve effectiveness and increase conversion, it is crucial to personalize product prices according to the customer’s willingness to pay (WTP). Businesses often use fixed-budget promotions to boost sales, emphasizing the importance of strategic pricing. Designing intelligent promotions requires recommending products aligned with customer preferences and setting prices reflecting their WTP, thus increasing the likelihood of purchase. This research advances existing recommendation systems by integrating dynamic pricing into the system’s output, offering a significant innovation in business practice. However, this integration introduces technical complexities, which are addressed through a Markov Decision Process (MDP) framework and solved using Reinforcement Learning. Empirical evaluation using the Dunnhumby dataset shows promising results. Due to the lack of direct comparisons between combined product recommendation and pricing models, the outputs were simplified into two categories: purchase and non-purchase. This approach revealed significant improvements over comparable methods, demonstrating the model’s efficacy.

1. Introduction

The rapid growth of the Internet, the Fourth Industrial Revolution, and advances in Artificial Intelligence (AI) have made business process automation essential for survival. In particular, the retail and e-commerce sectors rely heavily on data, with customer information and purchase histories representing key business assets. Among the most effective strategies for engaging customers in these industries are intelligent and data-driven promotions.
Retaining existing customers is considerably more cost-effective than acquiring new ones, with studies suggesting that acquiring a new customer can cost five to seven times more than retaining an existing one [1,2,3]. Consequently, intelligent and personalized promotions that strengthen customer loyalty are vital. These initiatives not only enhance attachment to brands and organizations but also improve overall profitability [4]. Automating promotions through AI-based recommendation systems that consider customer interests, preferences, and purchase behavior has therefore become a strategic necessity.
AI’s influence on marketing is transforming how businesses understand and engage with customers. AI-driven systems not only automate and optimize marketing processes but also provide insights into behavioral patterns, enabling more informed decisions. For example, AI-powered marketing platforms personalize promotions by analyzing large volumes of customer data and adapting recommendations in real time. This approach increases customer satisfaction, optimizes resource allocation, and enhances conversion rates [5].
Personalization is a key driver of customer satisfaction and loyalty, forming the basis for advanced strategies such as Dynamic Pricing (DP) [6]. DP has been extensively applied in e-commerce to optimize discounts and manage inventory [7]. In highly competitive markets, these strategies often require sophisticated modeling of spatial and temporal dynamics. For instance, Kim et al. [8] proposed a Spatial–Temporal Attention-based Dynamic Graph Convolutional Network (STAD-GCN) for predicting retail gasoline prices by integrating spatial relationships and temporal variations.
Empirical studies indicate that personalized strategies enhance both satisfaction and loyalty by addressing rational and emotional aspects of customer decision-making [9]. Such approaches strengthen long-term relationships, encourage repeat purchases, and foster positive word-of-mouth communication. Moreover, leveraging behavioral data allows companies to continuously refine personalization strategies to remain competitive and profitable in dynamic market environments.
Recommendation systems (RSs) are among the most successful applications of machine learning [10], widely adopted across domains such as e-commerce, online streaming, education, and social media [11]. In e-commerce, RSs help identify and recommend products aligned with users’ preferences and behaviors, assist with product ranking during searches, and propose similar or complementary items. These systems personalize the shopping experience and significantly improve decision-making efficiency.
However, traditional RSs typically focus only on identifying products that users may like, neglecting the critical influence of price. This omission can lead to suboptimal outcomes, as recommended products may fall outside a customer’s preferred price range or willingness to pay (WTP), reducing the likelihood of purchase and overall profitability. To be fully effective, an RS must integrate pricing factors so that recommendations reflect both customer preferences and economic constraints. When profit and discount considerations are ignored, the RS remains incomplete and suboptimal [12]. Traditional systems tend to optimize metrics such as click-through rates or ranking quality but often overlook DP factors that directly affect sales conversion and revenue [13].
As illustrated in Figure 1, price personalization allows a seller to offer a product at a discounted price (price B) to attract additional buyers who might not have purchased at the original price (price A). While customers willing to pay price A continue to generate revenue, the inclusion of price B results in additional sales and higher overall profit [14]. Consequently, personalized pricing can substantially enhance business revenue.
Figure 1. Increase in product sales at lower prices.
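As a purely illustrative calculation (the figures are hypothetical and not taken from the dataset): if 100 customers are willing to pay price A = 10, selling at price A alone yields revenue of 100 × 10 = 1000; if a further 50 customers would buy only at price B = 7, offering B to them adds 50 × 7 = 350, for a total of 1350, provided the discounted price still covers the product’s marginal cost.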
Incorporating price personalization into RS is therefore a crucial objective of this research. Recommending products at prices aligned with each customer’s WTP can significantly strengthen marketing strategies and improve promotional efficiency. This study introduces a reinforcement learning (RL)-based framework that jointly models product recommendations and price optimization. The proposed price-based promotion approach aims to increase customer retention and profitability by aligning product offerings with customers’ purchase propensities.
In traditional promotional campaigns, businesses often allocate a fixed budget to increase sales through uniform discounts—for example, offering 20% off all products. This method disregards individual differences in WTP and is suboptimal from both business and customer perspectives. Customers who are willing to pay the full price may become conditioned to expect discounts, reducing future margins, whereas those requiring higher discounts may remain unmotivated to purchase.
A more effective strategy involves tailoring discounts and prices to each customer’s WTP. For instance, offering a product at full price to a customer ready to pay it, and the same product at a 40% discount to another customer who requires that incentive, results in successful purchases by both. This targeted approach maximizes both customer satisfaction and efficient budget utilization.
As illustrated in Figure 2, offering personalized prices aligned with individual customer characteristics supports both marketing personalization and operational optimization objectives.
Figure 2. Smart promotion design tailored to the customer’s purchase intention for a specific product.
In marketing theory, the concept of the marketing mix—commonly referred to as the 4Ps (Product, Price, Promotion, and Place)—is fundamental. To maximize customer impact, it is essential to deliver the right product at the right price, supported by effective promotional strategies and efficient distribution channels. This interdependence highlights the critical importance of integrating promotion and pricing strategies within the overall marketing framework.
As shown in Figure 3, effective pricing should be tailored to the specific characteristics of each customer. Offering a product at different prices to different customers is not only a common commercial practice but also aligns with personalization and optimization principles central to modern marketing.
Figure 3. Personalizing product prices for customers.
Similarly, Figure 4 illustrates how a customer may not be inclined to purchase a product at full price but may complete the transaction if a discount is applied according to their profile or behavioral characteristics.
Figure 4. Customer’s purchase intention for a specific product.
To design a model capable of recommending both products and prices aligned with customers’ purchase propensities, various statistical and machine learning techniques can be employed. After reviewing prior research and evaluating different methods, RL emerged as a particularly effective approach for this purpose.
RL offers significant advantages for RS by optimizing long-term user engagement and satisfaction. Traditional RS methods, such as collaborative or content-based filtering, typically optimize for short-term objectives (e.g., clicks or immediate purchases). In contrast, RL maximizes cumulative rewards over time, providing several benefits.
First, RL supports sequential decision-making by considering the temporal order of user interactions, allowing the system to learn from past behaviors and anticipate future preferences [15]. Second, it adapts dynamically to real-time feedback, continuously improving its performance as new data are observed [16]. Third, RL enhances personalization by optimizing for long-term rewards, thereby tailoring recommendations more effectively and fostering user loyalty [17]. Additionally, it balances exploration and exploitation—discovering new items while leveraging known user preferences [15]—and enables holistic optimization of multiple metrics such as retention, lifetime value, and engagement [15].
RL has also demonstrated strong performance in sequential decision-making through prioritized experience replay, which accelerates convergence and improves profitability by focusing on high-impact interactions [18]. In summary, RL enables RS to evolve from static, one-step predictors into adaptive systems that optimize engagement and value across time.
Deep Reinforcement Learning (DRL) extends these advantages by combining RL with the representational power of deep neural networks. DRL is particularly well suited for recommendation and pricing applications due to the following capabilities:
  • Handling High-Dimensional Data: DRL uses deep neural networks to manage large and complex state–action spaces, which are common in RS owing to the vast number of potential user–item interactions. This allows the model to capture intricate patterns and nonlinear relationships in user data [16].
  • Robustness to Sparse Data: DRL methods are resilient to the data sparsity that often affects collaborative filtering techniques. Their ability to generalize from limited data makes them particularly valuable for addressing cold-start problems involving new users or products [15].
Accordingly, this research proposes a DRL-based model that recommends products together with optimal prices tailored to customers’ purchase propensities. This integration supports the design of intelligent, price-aware promotions that maximize both customer satisfaction and business profitability.
After formulating the problem as a Markov Decision Process (MDP), the Deep Q-Network (DQN) architecture was employed to solve it using DRL techniques.
Recent developments in DRL have expanded its application to large-scale recommendation and DP scenarios. For example, Wang et al. [19] introduced a deep Q-learning-based pricing framework that adapts dynamically to market fluctuations, achieving significant revenue gains. Similarly, Kavoosi et al. [20] applied DRL to joint inventory and pricing optimization in retail, demonstrating scalability across thousands of products. In the domain of personalized recommendations, Tanveer et al. [21] proposed a graph-attention-augmented DRL model (PGA-DRL) that captures complex relationships between users and items to improve personalization. Together, these studies underscore how DRL enables scalable, adaptive, and context-aware pricing strategies—supporting the motivation for our DQN-based framework that integrates product recommendation and pricing in a unified model.
In this study, the proposed framework was implemented using the Dunnhumby sales dataset, which contains detailed records of product sales, prices, and applied discounts. Model evaluation on the test dataset yielded strong results in the multi-class classification task, with a Macro Average Precision of 0.80, Macro Average Recall of 0.82, Macro F-score of 0.81, Micro F-score of 0.85, Macro Averaged AUC of 0.87, NDCG@5 of 0.89 and Weighted Multi-class Accuracy of 0.80. These results confirm the effectiveness of the proposed reinforcement learning framework in accurately recommending both products and corresponding prices according to customers’ purchase tendencies.
Building upon these advances, this study contributes to the literature by introducing an integrated framework that unifies personalized product recommendation and price optimization within a single RL environment. The model extends existing RS research by explicitly incorporating customers’ willingness to pay, thereby bridging the gap between DP and intelligent recommendation. By embedding pricing into the recommendation output, the model enhances the scope, accuracy, and practical relevance of traditional RS. Its applicability spans various industries—particularly retail and e-commerce—where product diversity, price sensitivity, and customer heterogeneity are prominent.
  • Contributions and Innovations.
This research addresses a novel and practically significant problem by integrating personalized product recommendation and DP within a unified DRL framework. Unlike traditional RSs that optimize short-term engagement or rely on static pricing models, the proposed approach jointly models customer preferences and WTP, enabling adaptive promotional strategies. The framework leverages recent advances in DRL to design intelligent, price-based promotions that simultaneously optimize customer satisfaction and long-term profitability. Additionally, it introduces problem-specific evaluation metrics that capture both financial relevance and predictive performance. Overall, this study combines theoretical rigor with practical application, contributing to the expanding body of DRL-based marketing research while offering measurable improvements in operational efficiency and profit optimization.

3. Proposed Method

This section presents the proposed model for product recommendation aligned with each customer’s WTP, developed within an RL framework.
Each customer is represented by a history of interactions that include purchases across different product categories and price ranges, both with and without discounts, as well as expressed preferences and product reviews. In addition to behavioral data, demographic attributes such as age, gender, education level, and income provide essential context for personalization. Correspondingly, each product is characterized by attributes including price range, category, subcategory, discount rate, and sales volume.
The primary objective of the proposed model is to enhance product sales and increase business profitability through personalized recommendations, thereby strengthening customer loyalty. From a marketing perspective, designing effective promotional strategies requires delivering offers that maximize both customer satisfaction and retention. The proposed framework addresses precisely this goal by learning optimal recommendation and pricing strategies tailored to individual purchasing behaviors.
Specifically, the model seeks to recommend the most appropriate product at a price consistent with the customer’s WTP. It achieves this by leveraging the joint information contained in customer and product attributes as well as their interaction histories. The system is designed to maximize the likelihood of purchase, thereby improving both organizational profitability and customer experience.
In line with the concept of value-aware recommender systems (VARSs), which explicitly integrate business value and user utility into the recommendation process [13], the proposed model incorporates DP strategies within an RL framework. This integration enables the system to simultaneously optimize for customer satisfaction and business profitability in real time.

3.1. Problem Definition

This paper presents a model for product recommendation together with pricing tailored to the customer’s purchase propensity, using an RL approach.
The problem specification for the proposed model involves several groups of features. Customer features encompass purchase history (the variety of purchases, the frequency of discounts used, and preferred categories) along with behavioral metrics such as discount rates, purchase frequency, and time intervals between discounted purchases. Product features cover a product’s price rank within its category, sales metrics such as the ratio of discounted sales to total sales, and the diversity of customers who purchase it. WTP insights involve analyzing customer data to infer price sensitivity and purchasing power, and understanding product positioning so that offers can be aligned with different customer segments.
Unlike traditional RSs that primarily focus on suggesting products based on user preferences, our system integrates pricing strategies. It considers the customer’s WTP, ensuring that recommendations are not only relevant but also economically appealing. This approach addresses both the demand and pricing sensitivity, providing a more comprehensive solution.
Our system can be applied in various marketing strategies, such as:
  • Personalized Promotions: Offering discounts and promotions tailored to individual customers, increasing the likelihood of purchase.
  • DP: Adjusting prices in real-time based on customer behavior and market trends.
  • Customer Retention Programs: Creating loyalty programs that offer personalized discounts and exclusive deals, enhancing customer retention.
Price is a critical factor in marketing as it directly influences purchase decisions. For instance, a customer who frequently buys products at discounted prices is less likely to purchase items at full price. By analyzing this behavior, businesses can offer tailored discounts to encourage purchases, thus increasing sales volume and customer loyalty [12].
Given customer and product features, it might be inferred that a customer who buys 90% of their products at a discount is very unlikely to purchase a full-priced product. Similarly, products priced at the higher end of the price range may not appeal to customers who usually shop in a lower price range. Another example, as depicted in Figure 6, focuses on price-related features: for a customer characterized by a 35% discount rate and a preferred price range of 50 to 70, offering a product priced at 100 with a 40% discount might have a higher likelihood of a successful purchase, and this customer is likely to engage more frequently in the future.
Figure 6. Product Offer Aligned with Customer’s Preferred Price Range.
Additionally, if a product has typically sold at a 30% discount, then offering this product at full price would likely have a lower chance of success (Figure 7).
Figure 7. Customized Discount Rate Offer Aligned with Customer’s WTP.
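To make these heuristics concrete, the following sketch (with hypothetical feature names and thresholds, not the exact rules used by the model) checks whether a discounted offer lands in a customer’s preferred price band and whether the discount is deep enough given the customer’s and the product’s discount history:

```python
def offer_is_plausible(customer, product, offered_discount):
    """Rough plausibility check for a (product, price) offer.

    customer: dict with 'preferred_price_min', 'preferred_price_max', 'avg_discount_rate'
    product:  dict with 'list_price', 'typical_discount_rate'
    offered_discount: fraction in [0, 1], e.g. 0.4 for 40% off
    """
    offered_price = product["list_price"] * (1.0 - offered_discount)

    # Figure 6: the discounted price should land in the customer's preferred range,
    # e.g. a 100-priced product at 40% off -> 60, inside a 50-70 band.
    in_price_band = (customer["preferred_price_min"] <= offered_price
                     <= customer["preferred_price_max"])

    # Figure 7: a product that usually sells at ~30% off is unlikely to sell at full price,
    # so the offer should not be much shallower than the product's typical discount.
    deep_enough_for_product = offered_discount >= product["typical_discount_rate"] - 0.05

    # A customer who mostly buys on discount is unlikely to pay full price.
    deep_enough_for_customer = offered_discount >= customer["avg_discount_rate"] - 0.10

    return in_price_band and deep_enough_for_product and deep_enough_for_customer


# Example corresponding to the scenario in Figure 6 (values are illustrative).
customer = {"preferred_price_min": 50, "preferred_price_max": 70, "avg_discount_rate": 0.35}
product = {"list_price": 100, "typical_discount_rate": 0.30}
print(offer_is_plausible(customer, product, offered_discount=0.40))  # True
```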
In addition to the problem definition, it is important to highlight the generalization capability of the proposed framework. The reinforcement learning approach employed in this study is domain-agnostic, as it is based on fundamental state–action–reward dynamics rather than assumptions specific to the retail sector. With appropriate retraining on sector-specific features and reward functions, the model can be adapted to other industries such as fashion, electronics, or digital services, where consumer behavior may vary in terms of purchase frequency, price sensitivity, and seasonal effects. Prior research has shown that RL-based recommendation systems can successfully generalize across different domains when exposed to new interaction data  [15,77].
Furthermore, even within a single sector, product categories display diverse behavioral patterns. During the feature extraction process, a wide range of features was initially considered—derived from both business knowledge and the academic literature—including category-related features such as the number of purchases across different product types. However, statistical analysis and importance testing revealed that these category-level features exhibited relatively lower discriminative power compared to more fundamental behavioral variables such as purchase frequency, discount usage, and average spending. Therefore, the final feature set prioritized features with higher predictive strength, enabling a more robust modeling of customer behavior and willingness to pay. Nonetheless, the model remains flexible and can be retrained with category-specific features to capture finer distinctions when required.
We propose an RL-based approach to develop a dynamic recommender system that adjusts product prices in real-time according to each customer’s WTP. The system uses historical interaction data to learn optimal pricing strategies that maximize long-term customer engagement and revenue.

3.2. Problem Modeling in the RL Framework

In the proposed model, the problem of simultaneous product recommendation and pricing is addressed through an RL approach. In this setting, the optimization process is conducted within the RL paradigm, enabling the model to recommend both the most suitable product and its corresponding price in a unified manner. Owing to the characteristics of RL algorithms—such as their ability to incorporate user preferences into the objective function, maintain a long-term perspective, accumulate rewards over time, and continuously update through learning—the proposed method is expected to achieve superior performance compared to traditional techniques.
The overall schematic of the model, from input to output, is summarized in Figure 8 and described below:
Figure 8. Product Recommendation and Pricing Model with RL Approach.
  • Inputs: Customer and product features are extracted from the dataset, capturing behavioral interactions between customers and products.
  • Processing: Based on customer, product, and interaction data, multiple behavioral and transactional features are computed. The most influential variables are then identified and selected for modeling.
  • RL Framework: Using the extracted features, the RL environment is formulated in terms of states, actions, and rewards. The agent is trained using the DQN algorithm to learn the optimal pricing strategy.
  • Outputs: The trained agent generates personalized product and price recommendations, including optimal discount levels for each customer.
To model the problem within the RL framework, the MDP formulation is adopted. This involves defining the set of states, actions, transition probabilities, rewards, and the discount factor. In this study, the state space represents the joint information of the customer and product, including their respective attributes and price levels. In other words, a state reflects whether a specific customer, characterized by certain features, purchases a particular product at a given price.
Within this framework, the state captures the combination of customer and product attributes, while actions correspond to recommending specific product–price pairs. The reward function evaluates the success of each recommendation based on purchase likelihood and customer satisfaction, guiding the agent to maximize long-term profitability and customer loyalty.
The RL approach allows the model to adapt and improve continuously by accounting for both immediate and long-term effects of pricing and product recommendations. This dynamic capability effectively addresses the limitations of traditional supervised learning methods, which typically optimize short-term outcomes.
To better understand the influence of price on customer behavior and improve the interpretability of the system, the first step involves creating comprehensive profiles for both customers and products. These profiles facilitate exploratory data analysis, providing deeper insights into purchasing behavior and serving as the foundation for subsequent modeling stages.
Using customer, product, and interaction data, a diverse set of behavioral and transactional features is computed to describe relationships between customers and products. To identify the most informative variables, clustering is first applied to customer and product features, followed by feature importance analysis using decision trees. The resulting subset of high-impact variables constitutes the state space of the RL environment.
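A simplified sketch of this selection step is shown below, using a decision tree from scikit-learn on placeholder data; the candidate feature names are illustrative placeholders rather than the exact variables used in the final model:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree on candidate customer/product features and keep the features
# with the highest importance scores. Data and feature names are placeholders.
rng = np.random.default_rng(0)
candidate_features = ["purchase_frequency", "avg_discount_rate", "avg_spending",
                      "price_rank_in_category", "discounted_sales_ratio"]
X = pd.DataFrame(rng.random((1000, len(candidate_features))), columns=candidate_features)
y = rng.integers(0, 5, size=1000)  # 5 classes: no purchase / 0% / 10% / 20% / >=30% discount

tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)
importance = pd.Series(tree.feature_importances_, index=candidate_features)
selected = importance.sort_values(ascending=False).head(4).index.tolist()
print(selected)  # features retained for the RL state space
```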
The list of features included in the state space is shown in Table 1.
Table 1. List of features forming the state space.
To mitigate the uncertainty inherent in implicit interaction data, the feature engineering and selection process was designed to enhance the model’s ability to infer customer preferences and WTP with higher precision. Beyond individual purchase or viewing actions, the selected features capture broader behavioral patterns such as discount sensitivity, spending consistency, and purchase frequency across product categories. By integrating these aggregated and statistically validated indicators into the RL state space, the framework minimizes the influence of noise or ambiguity arising from isolated events. Consequently, the RL agent learns from stable and behaviorally meaningful representations of customer activity rather than from raw, potentially misleading signals.
Scalability in the proposed framework is achieved through efficient state and action design. The model defines state and action representations based on aggregated and feature-engineered behavioral attributes instead of product-specific embeddings, allowing it to remain computationally efficient as the catalog size increases. Moreover, using a discrete action space with four discount levels and a no-purchase option reduces the computational burden compared to continuous or product-specific pricing, enabling faster convergence and stable training performance.
The action space includes five actions: no purchase, purchase with 0% discount, purchase with 10% discount, purchase with 20% discount, and purchase with more than 30% discount. Table 2 illustrates the action space.
Table 2. Action Space.
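For illustration, the five actions can be encoded as integers, for example as follows (a sketch; the exact encoding used in the implementation is not specified in the text):

```python
from enum import IntEnum

class Action(IntEnum):
    NO_PURCHASE = 0            # no purchase
    BUY_FULL_PRICE = 1         # purchase with 0% discount
    BUY_DISCOUNT_10 = 2        # purchase with 10% discount
    BUY_DISCOUNT_20 = 3        # purchase with 20% discount
    BUY_DISCOUNT_30_PLUS = 4   # purchase with more than 30% discount

N_ACTIONS = len(Action)  # 5 discrete actions
```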
The reward function is defined as follows: if the agent correctly predicts a purchase or no purchase, it receives a reward of 4 units. If a purchase is made and the discount percentage is also correctly predicted, an additional reward of 6 units is given. Otherwise, the agent receives a penalty of −10 units. The reward function is summarized in Table 3.
Table 3. Reward Function Corresponding to Agent Actions.
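Under one reading of this rule, the reward can be sketched as follows; the handling of a correctly predicted purchase with an incorrect discount level is an assumption here, and Table 3 fixes the exact values:

```python
NO_PURCHASE = 0  # action index from the encoding sketched after Table 2

def reward(true_action: int, predicted_action: int) -> float:
    """Reward under one reading of the rule above (a sketch, not the exact Table 3)."""
    true_purchase = true_action != NO_PURCHASE
    predicted_purchase = predicted_action != NO_PURCHASE

    if true_purchase != predicted_purchase:
        return -10.0              # wrong purchase/no-purchase decision
    if not true_purchase:
        return 4.0                # no purchase, correctly predicted
    if true_action == predicted_action:
        return 4.0 + 6.0          # purchase and discount level both correct
    # Purchase correctly predicted but with the wrong discount level: assumed here to
    # keep only the base reward of +4; Table 3 defines the exact value for this case.
    return 4.0
```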
The reward function in our framework was designed heuristically but with explicit grounding in business logic. The numerical reward and penalty values were selected through an iterative process, tested across several configurations to identify those that maximized the simulated business profit. The values were assigned asymmetrically to reflect real-world business costs: false negatives (missed purchases) received stronger penalties, correct purchases were rewarded more heavily, and intermediate outcomes were scaled proportionally. This design ensures that the agent’s learning incentives align with long-term profit maximization.
The selection of the final reward structure was conducted through a comprehensive sensitivity analysis and profit-based evaluation, in which multiple reward tables—each representing different trade-offs between sales volume and profit margin—were tested, and the configuration that achieved the highest cumulative profit and most stable learning convergence was identified as optimal.
This reward structure incentivizes the agent to accurately predict both the occurrence of a purchase and the correct discount rate, thus optimizing the RS’s effectiveness.

3.3. Training the Model

This model uses an RL approach to simultaneously recommend products and their prices. The framework described includes defining states, actions, transition probabilities, and rewards associated with specific actions in specific states. The dataset used for training and evaluating the model comprises 6,489,330 samples in total, of which 1,946,799 are held out for testing (see Section 4.1); it includes both positive and negative interactions. Positive samples indicate successful transactions where the recommended product was purchased at the suggested price, while negative samples represent unsuccessful recommendations. This large dataset ensures that the model can learn diverse patterns and generalize well to new data. The neural network architecture for the RL agent is based on a DQN. The DQN model utilizes deep learning to approximate the Q-value function, which represents the expected cumulative reward of taking a given action in a particular state.
  • Input Layer: The input layer receives the state representation, which includes customer and product features. These features are encoded into a fixed-size vector to be processed by the neural network.
  • Hidden Layers: Four hidden layers with ReLU activation functions process the input data. ReLU is chosen for its ability to introduce non-linearity while maintaining computational efficiency, preventing vanishing gradient problems common in deep networks.
  • Output Layer: The output layer produces the Q-values for each possible action, representing the expected rewards for each action given the current state. This layer provides the basis for the decision-making process in the RL framework.
The detailed training procedure for the DQN-based DP model is presented in Algorithm 2.
The final DQN configuration was determined based on established guidelines and empirical testing. A learning rate of 1 × 10⁻³ (Adam optimizer), discount factor γ = 0.95, replay buffer of 50,000 transitions, mini-batch size of 64, and target network updates every 1000 steps ensured stable convergence. The MSE loss was selected over alternative formulations such as the Huber loss, as it provided smoother convergence in our DP environment with relatively stable reward distributions. This setup yielded consistent results across multiple runs.
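A minimal PyTorch sketch of this configuration is given below; the hidden-layer width and the state dimensionality are assumptions, while the learning rate, discount factor, buffer size, batch size, target-update period, and MSE loss follow the values reported above:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q-network sketch: state vector in, one Q-value per action out.
    Four hidden ReLU layers as described above; the widths (128) are assumptions."""
    def __init__(self, state_dim: int, n_actions: int = 5, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q-values for the 5 actions
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Hyperparameters reported in the text.
STATE_DIM = 16                      # placeholder; equals the number of state features in Table 1
GAMMA = 0.95
LEARNING_RATE = 1e-3
REPLAY_BUFFER_SIZE = 50_000
BATCH_SIZE = 64
TARGET_UPDATE_EVERY = 1_000

online_net = DQN(STATE_DIM)
target_net = DQN(STATE_DIM)
target_net.load_state_dict(online_net.state_dict())
optimizer = torch.optim.Adam(online_net.parameters(), lr=LEARNING_RATE)
loss_fn = nn.MSELoss()
```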
The trained model is evaluated using the test dataset. The results show significant improvements in recommendation accuracy and pricing strategy compared to traditional methods. The RL approach effectively adapts to customer behavior, providing personalized recommendations and optimal pricing strategies. One of the key advantages of using RL in this context is the model’s ability to continuously learn and adapt. As new data becomes available, the model can be retrained to incorporate recent trends and changes in customer behavior, ensuring that the recommendations and pricing strategies remain relevant and effective. Continuous learning allows the model to stay up-to-date with the latest customer preferences and market conditions, providing a dynamic and responsive RS.
Algorithm 2 DQN Training for DP
1: Initialize online network Q_θ, target network Q_θ̄ ← Q_θ, replay buffer D, discount factor γ, exploration rate ε, minibatch size B, and target update period C.
2: for each episode do
3:     Initialize state s_0.
4:     for t = 0, 1, … do
5:         Action (ε-greedy): with probability ε choose a random a_t; otherwise a_t = argmax_a Q_θ(s_t, a).
6:         Environment step: execute a_t (price), observe reward r_t, next state s_{t+1}, and terminal flag done_t.
7:         Store transition: push (s_t, a_t, r_t, s_{t+1}, done_t) into D.
8:         Minibatch update: sample {(s_i, a_i, r_i, s′_i, done_i)}, i = 1, …, B, from D.
9:         Targets: for each i, set y_i = r_i + γ (1 − done_i) max_{a′} Q_θ̄(s′_i, a′).
10:        Optimize: update θ by minimizing (1/B) Σ_{i=1}^{B} ℓ(y_i − Q_θ(s_i, a_i)).
11:        if t mod C = 0 then
12:            Target update: θ̄ ← θ.
13:        end if
14:        Exploration schedule: decay ε.
15:        if done_t then
16:            break
17:        end if
18:    end for
19: end for
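A compact PyTorch rendering of the minibatch update and target synchronization (steps 8–12 of Algorithm 2) might look as follows; it is a sketch in which the networks, optimizer, and loss function are passed in as arguments (for example, the objects from the configuration sketched earlier):

```python
import torch

def dqn_update(batch, online_net, target_net, optimizer, loss_fn, gamma=0.95):
    """One minibatch update (steps 8-10 of Algorithm 2); a sketch, not the exact code.
    `batch` is assumed to be a tuple of tensors already sampled from the replay buffer:
    states (B, state_dim), actions (B,) int64, rewards (B,), next_states (B, state_dim),
    dones (B,) float in {0, 1}."""
    states, actions, rewards, next_states, dones = batch

    # Q_theta(s_i, a_i) for the actions that were actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets y_i = r_i + gamma * (1 - done_i) * max_a' Q_theta_bar(s'_i, a')  (step 9).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = loss_fn(q_values, targets)  # MSE loss, as reported in the text (step 10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps (step 12 of Algorithm 2), the target network is synchronized:
# target_net.load_state_dict(online_net.state_dict())
```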

3.4. Evaluation Criteria

Evaluation metrics are defined and selected according to the problem structure and its outputs. In this problem, product recommendations along with pricing are provided, and the output is defined as multi-class. Therefore, multi-class classification metrics such as macro average precision, macro average recall, macro F-score, micro F-score [78], One-vs-Rest (OvR) macro averaged AUC [79], and Normalized Discounted Cumulative Gain (NDCG) are used. Additionally, a weighted multi-class accuracy metric is proposed for evaluation based on different preferences.
Since the model output is a multi-class classification output, multi-class evaluation methods are applied. The confusion matrix for one sample class is shown in Table 4, and the metrics of macro average precision, macro average recall, macro F-score, micro F-score, macro averaged AUC, and NDCG are used to evaluate this model.
Table 4. Metrics in Multi-Class Classification for the Class ’Purchase with 10% Discount’.
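For concreteness, these standard multi-class metrics can be computed with scikit-learn as sketched below; the labels and scores here are random placeholders rather than model outputs, and NDCG@5 would additionally require graded relevance derived from the ordered discount classes:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Placeholder labels and pseudo class probabilities for the 5 classes of Table 2.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=1000)
y_score = rng.random((1000, 5))
y_score /= y_score.sum(axis=1, keepdims=True)   # rows sum to 1, as roc_auc_score expects
y_pred = y_score.argmax(axis=1)

macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
ovr_macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
```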
A new evaluation metric is proposed in which, based on the weights given in Table 5, each sample is assigned the coefficient corresponding to the condition it falls into, and the model is then evaluated accordingly.
Table 5. Weights for Predicting Different Classes to Calculate Weighted Multi-Class Accuracy.
The coefficients in Table 5 are defined such that a correct class prediction, whether a non-purchase or a purchase at any discount level, receives a coefficient of +10. If an actual purchase is predicted as a non-purchase, the largest negative coefficient, −4, is applied. If a purchase occurs and the predicted discount category differs from the actual one, a negative coefficient is applied whose magnitude decreases as the predicted category gets closer to the actual one. For example, if the purchase belongs to the 20% discount category, a non-purchase prediction receives the largest negative coefficient, while a correct 20% discount prediction receives +10. Predicting a higher discount category carries a smaller negative coefficient, because offering a deeper discount only slightly reduces profit, whereas predicting a lower discount category is more likely to result in no purchase and therefore carries a larger negative coefficient.
When the number of samples in each of the above conditions is obtained and multiplied by the corresponding coefficients in Table 5, the ‘weighted multi-class accuracy’ metric is defined and calculated as follows:
Weighted Multi-Class Accuracy = (sum of all positive and negative weighted values) / (total number of samples × 10)
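A minimal sketch of this computation is given below. The exact off-diagonal coefficients belong to Table 5 and are not reproduced in the text, so the matrix here only illustrates the stated properties (+10 for correct predictions, −4 for purchases predicted as non-purchases, and smaller penalties for near-miss discount levels):

```python
import numpy as np

# Illustrative weight matrix W[true, pred]; the exact off-diagonal values are placeholders.
W = np.array([
    # pred:  none  0%   10%  20%  30%+
    [ 10,  -1,  -1,  -1,  -1],   # true: no purchase
    [ -4,  10,  -1,  -2,  -3],   # true: purchase, 0% discount
    [ -4,  -2,  10,  -1,  -2],   # true: purchase, 10% discount
    [ -4,  -3,  -2,  10,  -1],   # true: purchase, 20% discount
    [ -4,  -3,  -2,  -1,  10],   # true: purchase, >30% discount
])

def weighted_multiclass_accuracy(y_true, y_pred, weights=W):
    """Weighted multi-class accuracy as defined above: sum of per-sample weights
    divided by (number of samples x 10), so a perfect model scores 1.0."""
    total = weights[np.asarray(y_true), np.asarray(y_pred)].sum()
    return total / (len(y_true) * 10)

# Example: three samples, all predicted correctly -> 30 / (3 * 10) = 1.0
print(weighted_multiclass_accuracy([0, 2, 4], [0, 2, 4]))
```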
Because no prior studies report comparable five-class results for joint product recommendation and pricing, the model outputs were converted to binary outputs, purchase and non-purchase, and these binary results were then compared with the reference studies.

3.5. Challenges and Innovations

The key challenges encountered during the development of the proposed model, along with its main innovations, are outlined below.
  • Challenges
    1.
    Defining the Problem by Simultaneously Considering Price and Product Recommendation: A major challenge involved formulating the problem in a way that simultaneously incorporates both product recommendation and pricing. Traditional RSs typically treat these two elements independently; however, our framework required their joint optimization within a single model. This integration ensures that recommendations are not only relevant but also appropriately priced to align with customers’ WTP and purchasing intent.
    2.
    Selecting an Appropriate Solution Based on Problem Characteristics: Choosing the most suitable methodological approach was critical due to the multifaceted nature of the problem. The selected solution needed to effectively manage the complexities of personalized pricing, heterogeneous customer preferences, diverse product attributes, and DP strategies, while maintaining scalability and interpretability.
    3.
    Challenges of Implementing a RL Approach: Developing an RL-based system introduced additional technical challenges, including the design of an appropriate state space, action space, and reward function tailored to the specific requirements of the problem. Moreover, the iterative learning process inherent in RL requires large volumes of high-quality training data, computational resources, and careful tuning of hyperparameters to ensure convergence and stability.
  • Innovations
    1.
    Novel and Business-Relevant Problem Definition: The proposed research addresses a novel and practically significant problem by integrating pricing into the recommendation process. This addition directly enhances the ability of RS to influence purchasing decisions, thus overcoming a key limitation of conventional RS.
    2.
    High Business Applicability and Marketing Optimization: The approach provides a practical and scalable solution for businesses by optimizing marketing expenditures and improving promotional effectiveness. Through tailored product and price recommendations, companies can enhance customer loyalty, increase retention rates, and generate sustainable long-term value.
    3.
    Financial Evaluation and Direct Economic Impact: The model establishes a direct link between its outputs and financial metrics, enabling businesses to evaluate the monetary impact of their recommendation strategies. This connection facilitates evidence-based decision-making and allows for precise assessment of profitability gains resulting from the system’s deployment.
    4.
    Problem Formulation and Modeling Aligned with Domain Characteristics: The problem was carefully formulated and modeled to reflect its unique structural and behavioral characteristics. The proposed framework was explicitly designed to handle the complexity of personalized pricing and recommendation, ensuring realistic and data-driven optimization.
    5.
    Integration of Reinforcement Learning within an MDP Framework: By framing the problem as a MDP and solving it using RL techniques, the model achieves adaptive, feedback-driven learning based on customer interactions. This enables continuous improvement in recommendation quality and pricing precision.
    6.
    Development of Problem-Specific Evaluation Metrics: New evaluation metrics were designed to capture both recommendation accuracy and pricing effectiveness. These customized indicators allow a comprehensive assessment of the model’s performance in terms of predictive precision and economic relevance.
    7.
    Adaptability to Business Objectives through Reward Function Adjustments: The model demonstrates high adaptability to changing business strategies by allowing modifications to the reward function. For instance, the system can be configured to discourage low-margin recommendations, prioritize high-value purchases, or balance discount levels according to marketing objectives. This flexibility ensures alignment with evolving business priorities and operational contexts.
By addressing these challenges and introducing the above innovations, the proposed RL-based model provides a comprehensive and adaptable solution for product recommendation and pricing aligned with customers’ WTP. The framework not only meets current business needs but also offers scalability and flexibility to accommodate future market changes and strategic developments.

4. Experiments

This section provides a comprehensive analysis of the experimental results, starting with a detailed description of the dataset used for training and evaluation. The section then outlines the evaluation methodology used to assess the model’s effectiveness. The methodology includes training the DQN model, validating its performance, and testing it on a separate dataset. Performance metrics such as precision, recall, and F1 score are used to evaluate the model’s accuracy and relevance of recommendations. Finally, we analyze the experimental results, demonstrating significant improvements in recommendation accuracy and pricing strategy compared to traditional methods, and highlight the model’s ability to adapt to customer behavior through continuous learning.

4.1. Dataset Description

The Dunnhumby dataset is selected due to its relevance to the problem at hand and its extensive use in numerous studies. This dataset adequately covers features related to discounts, pricing, and promotional attributes. The Dunnhumby dataset is an extensive collection of retail transaction data, encompassing 2,595,732 purchase transactions from 2500 unique customers over 711 days. It includes detailed customer demographics, product attributes, and transactional specifics such as pricing and discounts. This dataset is instrumental for developing and testing machine learning algorithms, particularly in customer segmentation, product recommendation, DP, and market basket analysis. Its comprehensive nature allows for robust real-world applications, enhancing retail strategies and customer satisfaction through data-driven insights (https://www.dunnhumby.com/, accessed on 9 September 2025).
Some specifications of the Dunnhumby dataset are presented in Table 6.
Table 6. Dunnhumby dataset specifications.
To the 2,595,732 positive samples in the dataset, 3,893,598 negative samples (or non-purchase instances) are added, making the total dataset size 6,489,330 samples. Thus, the ratio of positive samples to the total is 2 to 5. In other words, 60% of the samples are non-purchase instances, and 40% are purchase instances with various discount percentages. Considering 30% of the samples as test samples, the number of test samples will be 1,946,799, the details of which are shown in Table 7.
Table 7. Train and test split.
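A brief sketch of this construction is shown below, using a tiny placeholder table in place of the real 6,489,330-row dataset; the stratified split preserves the 40/60 purchase/non-purchase ratio described above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder interaction table (the real dataset has 2,595,732 purchases plus
# 3,893,598 sampled non-purchases, 6,489,330 rows in total).
interactions = pd.DataFrame({
    "customer_id": range(10),
    "product_id": range(10),
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],   # 1 = purchase, 0 = non-purchase
})

train_df, test_df = train_test_split(
    interactions,
    test_size=0.30,                     # 30% held out, as in Table 7
    stratify=interactions["label"],     # preserve the purchase / non-purchase ratio
    random_state=42,
)
```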

4.2. Experimental Results

The evaluation of this model utilizes multi-class classification metrics and the proposed weighted multi-class accuracy based on Table 5. Figure 9 shows the confusion matrix for this modeling approach.
Figure 9. Confusion Matrix for the RL-Based Pricing Recommendation Model.
The confusion matrix presented in our study showcases the performance of an RL model designed for product recommendation and pricing tailored to customers’ purchase propensity. The true labels on the Y-axis represent the actual customer behaviors, while the predicted labels on the X-axis indicate the model’s predictions. Each cell’s value represents the count of instances for each combination of true and predicted labels.
Our model demonstrates a high degree of accuracy in predicting non-purchasing behavior, with a substantial true positive count of 1,063,240 in the “Don’t Purchase” category. This high accuracy is critical as it allows the model to effectively identify customers who are unlikely to make a purchase, thereby enabling the business to minimize marketing expenditures on these non-responsive segments.
The ability to predict high discount categories accurately is a crucial strength of the model. This accuracy ensures that the model can effectively identify customers who are most likely to respond only when offered significant incentives. In the context of marketing and sales, different customers exhibit varying levels of price sensitivity. Some customers might make a purchase without any discount, while others might require substantial discounts to be persuaded to buy.
When the model accurately predicts that a customer falls into a high discount category (such as 20% or 30%), it indicates that the customer has a high price sensitivity. These customers are less likely to convert without significant incentives. By correctly identifying these customers, the business can tailor its marketing efforts more effectively. Offering substantial discounts to these customers can lead to successful conversions that might not have occurred otherwise.
Moreover, precise identification of high discount customers helps in optimizing the allocation of marketing resources. Instead of blanket discount offers to all customers, the business can target high-discount incentives specifically to those who need them. This targeted approach not only improves conversion rates but also enhances overall profitability. It ensures that the business does not erode its margins by offering unnecessary discounts to customers who would have purchased at a lower discount or even at full price.
Additionally, accurate prediction of high discount categories can improve customer satisfaction. Customers who receive personalized offers that match their price sensitivity are more likely to feel valued and understood, enhancing their overall experience with the brand. This positive experience can lead to increased customer loyalty and long-term customer relationships.
In summary, the model’s ability to accurately predict high discount categories is vital for identifying customers who need significant incentives to convert. It allows for targeted marketing strategies, optimized resource allocation, improved conversion rates, enhanced profitability, and better customer satisfaction. This precision in prediction supports the overall effectiveness of the personalized pricing strategy, making it a valuable asset for the business.
The evaluation results for the proposed RL-based pricing and product recommendation model are presented in Table 8.
Table 8. Evaluation Results for the RL-Based Pricing Recommendation Model.
To comprehensively assess the performance of the proposed RL model for product recommendation and pricing, several key metrics were analyzed on the test data, including Macro Average Precision, Macro Average Recall, Macro and Micro F-scores, Macro Averaged AUC, NDCG@5, and Weighted Multi-class Accuracy. Together, these indicators provide a holistic view of the model’s capability to handle multi-class classification and ranking tasks.
A Macro Average Precision of 0.8059 indicates that, on average, 80.59% of the predicted instances were correctly classified. This high precision demonstrates that the model effectively minimizes false positives across all discount classes, ensuring that both recommendations and price adjustments are accurately tailored to the appropriate customer segments.
The Macro Average Recall of 0.8243 shows that, on average, 82.43% of the actual instances for each class were correctly identified. This strong recall value suggests that the model successfully captures most relevant cases across different classes, minimizing the number of customers whose purchasing intentions are overlooked by the RS.
The Macro F-score of 0.8108, representing the harmonic mean of Macro Precision and Recall, reflects the model’s balanced and consistent performance across all discount categories. This equilibrium between precision and recall is crucial in RS, where both the accuracy of suggested products and the inclusiveness of relevant options are important.
The Micro F-score of 0.8560 aggregates performance across all classes and accounts for class imbalance, reflecting the model’s overall accuracy at the dataset level. This result confirms that the model performs robustly across diverse customer behaviors and varying discount preferences.
The Macro Averaged AUC of 0.8743 further validates the strong discriminative ability of the proposed model across the five discount classes. Because AUC is a threshold-independent measure, this result indicates that the model consistently distinguishes between high and low purchase probabilities, regardless of classification boundaries. The macro-averaged computation ensures that all classes contribute equally, confirming a reliable ability to rank customers’ purchase likelihoods across multiple discount levels.
The NDCG@5 score of 0.8947 highlights the model’s exceptional ranking performance across the ordered discount categories. This metric measures how well the model prioritizes desirable outcomes—such as purchases at lower discount rates—while penalizing suboptimal ones, such as unnecessary deep discounts or missed opportunities. The high NDCG@5 score demonstrates that the model ranks pricing actions in a manner consistent with business profitability goals.
Finally, the Weighted Multi-class Accuracy of 0.8082 reflects the model’s predictive effectiveness when accounting for predefined class importance. This confirms that the system achieves reliable performance even when different classes (e.g., discount levels) carry different business priorities.
Collectively, these results demonstrate that the proposed RL model performs at a high level of accuracy, consistency, and business relevance. The strong Macro and Micro F-scores confirm balanced predictive capability across classes, while the high AUC and NDCG@5 values underline the model’s superior ranking and discriminative power. These findings validate the model’s effectiveness in jointly optimizing product recommendations and pricing decisions in alignment with customer purchasing behavior. Consequently, the proposed RL-based framework represents a valuable and practical tool for enhancing marketing decision-making, increasing profitability, and improving customer satisfaction.
The detailed precision, recall, and macro F-score values for the five output classes are presented in Table 9, Table 10, and Table 11, respectively.
Table 9. Precision Evaluation for Classes in the RL-Based Pricing Recommendation Model.
Table 10. Recall Evaluation for Classes in the RL-Based Pricing Recommendation Model.
Table 11. Macro F-score Evaluation for Classes in the RL-Based Pricing Recommendation Model.
The model exhibits high precision for predicting non-purchase behavior (94.09%) and purchases with a 20% discount (94.71%), indicating that it accurately identifies customers in these categories. This precision is critical for minimizing false positives and ensuring that marketing resources are effectively utilized. However, the precision for predicting full-price purchases (72.7%), purchases with a 10% discount (71.3%), and high-discount purchases (70.02%) is moderate, suggesting areas for improvement to enhance the accuracy of these predictions.
In terms of recall, the model demonstrates strong performance in identifying non-purchasers (88.2%), full price purchases (81.6%) and customers who will purchase with a 10% discount (83.52%) or a high discount (81.53%). This indicates that the model effectively captures the majority of relevant instances in these categories. The recall for 20% discount purchases (77.21%) also shows good coverage, though further refinement could help in capturing more instances within these groups.
The Macro F-score, which balances precision and recall, is high for non-purchase (91.07%) and 20% discount (85.07%) categories, reflecting robust overall performance. The F-scores for full-price (76.94%), 10% discount (76.98%), and high-discount purchases (75.34%) indicate a balanced but moderate performance, highlighting potential areas for model enhancement to achieve better accuracy.
In summary, the model performs exceptionally well in predicting non-purchases and 20% discount purchases, with high precision and balanced performance. There are opportunities to improve the precision and recall for full-price, 10% discount, and high-discount categories to optimize resource allocation and marketing efforts. These metrics validate the model’s capability to effectively tailor product recommendations and pricing strategies while identifying specific areas for further refinement to enhance customer satisfaction and conversion rates.
Because no authoritative studies have implemented the proposed method of recommending products together with prices using a multi-class output, one approach for evaluating it is to simplify the obtained results into two classes, purchase and non-purchase, and compare them with studies that have applied RL-based methods to product recommendation in RSs.
Studies [74,75,76] examined various DRL methods for RSs, and their experiments report the strongest published results for a two-class RS that predicts purchase or non-purchase. According to the results presented in Table 12, our RL-based approach matches or exceeds the performance of Wang et al. [74], DiffRec [76], and XSimGCL [75] across all key metrics, demonstrating its robustness and effectiveness for two-class purchase prediction.
Table 12. Comparison with recent state-of-the-art methods. Bold indicates the best score per metric (ties bolded).
The confusion matrix of results from the 5-class modeling is shown in Figure 9. When converting the 5-class results into 2-class results, all purchases, whether at full price or with any discount, are treated as a purchase, and non-purchase remains non-purchase; that is, classes 2 to 5 are merged into a single purchase class. With this conversion, the confusion matrix for the 2-class purchase and non-purchase problem appears as in Figure 10.
Figure 10. Confusion Matrix for the RL-Based Pricing Recommendation Model—converted to 2-Class.
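The class-merging step can be expressed compactly; the sketch below assumes labels encoded as 0 for non-purchase and 1–4 for the four purchase/discount classes, as in Table 2:

```python
import numpy as np

def to_binary(labels):
    """Collapse the 5-class output into 2 classes, as described above:
    class 0 stays 'don't purchase' (0); classes 1-4 (purchase at any
    discount level) are merged into 'purchase' (1)."""
    labels = np.asarray(labels)
    return (labels != 0).astype(int)

# The 2-class confusion matrix in Figure 10 is obtained by applying the same mapping
# to both the true and the predicted labels before recomputing the matrix, e.g.:
# confusion_matrix(to_binary(y_true), to_binary(y_pred))
```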
The evaluation metrics obtained after converting to the 2-class purchase/non-purchase problem, together with a comparison against the results of studies [74,75,76], are shown in Figure 11.
Figure 11. Evaluation results for product recommendation with pricing using an RL approach, with the converted 2-class results compared against studies [74,75,76].
Our RL-based pricing recommendation model demonstrates superior or comparable performance across all key evaluation metrics when compared with three representative state-of-the-art baselines: Wang et al. [74], DiffRec [76], and XSimGCL [75]. In terms of precision, our model achieves 0.883, which is higher than Wang et al.’s 0.821, DiffRec’s 0.865, and XSimGCL’s 0.872, indicating that our recommendations more accurately capture true purchases with fewer false positives. For recall, our model attains 0.896, exceeding the 0.868 of Wang et al., 0.882 of DiffRec, and 0.884 of XSimGCL, highlighting its strength in minimizing missed opportunities for conversions. The balanced effectiveness of our approach is reflected in the F1-Score of 0.888, which surpasses the best competing result of 0.878 (XSimGCL). Finally, our model achieves an accuracy of 0.893, which is on par with Wang et al. (0.893) and slightly ahead of DiffRec (0.888) and XSimGCL (0.890). Overall, these improvements underscore the advantages of explicitly modeling pricing as an action and optimizing a profit-aligned reward under willingness-to-pay constraints—capabilities that are not natively supported by diffusion-based or contrastive-learning recommenders.

4.3. Analysis of Experimental Results

The proposed RL-based model for product recommendation and pricing demonstrates consistently strong performance in both multi-class and binary evaluation settings. In the multi-class configuration, the model accurately predicts non-purchases as well as purchases across different discount levels, confirming that the integration of pricing into the recommendation output is both feasible and effective. When simplifying the five-class output into a binary system distinguishing between “Purchase” and “Don’t Purchase,” the model maintains robust performance across all key metrics, reinforcing its reliability in capturing customer purchasing behavior.
When compared with three representative state-of-the-art baselines—Wang et al. [74], DiffRec [76], and XSimGCL [75]—our model consistently outperforms them. While these prior methods primarily focus on ranking accuracy, preference modeling, or contrastive representation learning, the proposed framework explicitly integrates pricing decisions into the recommendation process and optimizes a profit-aware reward function aligned with customers’ WTP. This dual focus enables the model to capture both purchase likelihood and price sensitivity, resulting in more accurate, profit-oriented, and business-relevant recommendations than the existing baselines.
From a marketing perspective, accurately predicting customer purchasing behavior enables more precise and effective promotional campaigns. By identifying which customers are most responsive to specific discount levels, the model supports finer segmentation of the customer base. This targeted approach reduces marketing inefficiency while improving customer satisfaction by delivering personalized discounts that are more likely to convert. The model’s high recall ensures that a greater proportion of the potential customer base is reached, maximizing the overall impact of marketing initiatives. Furthermore, customers who receive tailored offers that align with their price expectations are more likely to perceive the brand as attentive and customer-centric, thereby increasing purchase intent and loyalty.
In terms of revenue optimization, the model’s predictive strength under various discount scenarios is critical. By identifying the optimal discount level required to convert each customer segment, the model helps businesses establish pricing strategies that maximize revenue and profitability. Delivering the right discount to the right customer at the right time can substantially increase sales volume while preserving profit margins. For instance, by minimizing unnecessary high discounts and reserving deeper incentives for highly price-sensitive customers, the firm can optimize its overall promotional expenditure. The model’s balanced performance, reflected in the F1-score of 0.888 and its strong precision and recall values, confirms its ability to drive revenue growth through improved customer targeting and personalized pricing strategies. This capability not only enhances short-term sales but also fosters long-term customer retention and lifetime value.
Although the model effectively predicts both non-purchasing behavior and high-discount purchase scenarios, some challenges remain in accurately classifying intermediate discount categories. Future improvements could focus on refining the state and action space representations—such as by clustering similar products and customers—to enhance contextual differentiation. Additionally, tuning the reward function to better reflect long-term user engagement and purchase satisfaction could further improve predictive accuracy. Incorporating larger and more diverse training datasets, performing hyperparameter optimization, and extending episode lengths may also enhance model convergence and generalization. These refinements would improve the system’s capacity to distinguish between closely related discount levels, resulting in even more precise and effective personalized pricing strategies.
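As one concrete direction for refining the state representation, customers could be grouped into behavioral segments and the segment index used as a coarse state feature. The sketch below illustrates the idea; it assumes scikit-learn and uses a hypothetical, randomly generated feature matrix rather than the actual dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer feature matrix (rows = customers; columns could be
# recency, frequency, monetary value, and discount affinity). Values are
# random placeholders used only to illustrate the idea.
rng = np.random.default_rng(0)
customer_features = rng.normal(size=(1000, 4))

# Cluster customers into coarse behavioral segments; the segment index can
# then be appended to the RL state to sharpen contextual differentiation
# between closely related discount levels.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(customer_features)
customer_segment = kmeans.labels_  # one segment label per customer
print(np.bincount(customer_segment))
```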

5. Conclusions and Future Works

This research presented an RL-based framework for personalized product recommendation and pricing, demonstrating that incorporating customers’ WTP into the recommendation process can substantially improve both customer satisfaction and business profitability. Experimental results confirm this conclusion, showing strong performance across multiple evaluation metrics—including Precision, Recall, F1-score, AUC, and NDCG—thus validating the model’s ability to accurately predict purchase behavior and rank offers according to profitability. Collectively, these findings indicate that the proposed framework effectively aligns recommendation and pricing strategies with long-term business objectives.
This study contributes to the RS and marketing literature by bridging traditional product recommendation and DP within a unified RL framework. The integration of behavioral and transactional features, combined with a reward function optimized for profit maximization, provides both theoretical rigor and practical value. From a managerial perspective, the model enhances marketing efficiency by enabling adaptive, data-driven promotional strategies that respond dynamically to customer behavior and market conditions.
Future work will focus on expanding the framework through enhanced pre-processing and post-processing techniques. Improved pre-processing methods will ensure cleaner, more structured input data, while advanced post-processing tools will aid in interpreting model outputs and transforming them into actionable business insights. This will bridge the gap between technical outputs and strategic decision-making, enhancing the practical relevance of the model.
A key limitation of the current framework lies in its discrete action space, which includes four fixed discount levels plus a no-purchase option. These thresholds, derived from the Dunnhumby dataset, reflect real-world retail practices where stepwise discounts are prevalent. Although this approach promotes stability and interpretability, it does not fully represent the continuous nature of DP. Future extensions could incorporate continuous action spaces using algorithms such as Deep Deterministic Policy Gradient (DDPG) [80] or Soft Actor–Critic (SAC) [81], enabling more granular and realistic price recommendations.
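As an illustration of such an extension, the sketch below defines a toy environment with a continuous discount action in [0, 0.4] and trains an SAC agent on it. It assumes the Gymnasium and Stable-Baselines3 libraries and a deliberately simplified purchase and profit model; it is not the implementation used in this study.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import SAC  # DDPG is available through the same interface

class ContinuousPricingEnv(gym.Env):
    """Toy environment: the action is a continuous discount in [0, 0.4]."""

    def __init__(self, state_dim: int = 16):
        super().__init__()
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(state_dim,), dtype=np.float32)
        self.action_space = spaces.Box(low=0.0, high=0.4, shape=(1,), dtype=np.float32)
        self.state_dim = state_dim

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._state = self.np_random.normal(size=self.state_dim).astype(np.float32)
        return self._state, {}

    def step(self, action):
        discount = float(action[0])
        # Placeholder purchase model: deeper discounts raise conversion probability.
        p_buy = 1.0 / (1.0 + np.exp(-(self._state[0] + 5.0 * discount - 1.0)))
        bought = self.np_random.random() < p_buy
        reward = (1.0 - discount) if bought else 0.0  # profit-style reward with unit margin
        self._state = self.np_random.normal(size=self.state_dim).astype(np.float32)
        return self._state, reward, True, False, {}  # single-step (bandit-like) episodes

model = SAC("MlpPolicy", ContinuousPricingEnv(), verbose=0)
model.learn(total_timesteps=5000)
```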
Another promising direction involves refining the design of the reward function. Instead of relying solely on heuristic definitions, future studies could apply systematic reward modeling approaches, such as Inverse Reinforcement Learning (IRL) [82] or Multi-Objective Reinforcement Learning (MORL) [83], to infer or optimize reward structures directly from observed business profit signals. This would allow for more adaptive, profit-sensitive, and objective-driven learning outcomes.
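A simple starting point in this direction is a weighted scalarization of conversion and margin objectives, as sketched below with hypothetical weights; a full MORL treatment would instead learn policies across the weight space or approximate the Pareto front.

```python
from dataclasses import dataclass

@dataclass
class ProfitAwareReward:
    """Weighted scalarization of two objectives: conversion and profit margin."""
    w_conversion: float = 0.3  # hypothetical weight on whether a purchase occurred
    w_margin: float = 0.7      # hypothetical weight on the realized margin

    def __call__(self, bought: bool, price: float, cost: float) -> float:
        conversion = 1.0 if bought else 0.0
        margin = (price - cost) / max(price, 1e-9) if bought else 0.0
        return self.w_conversion * conversion + self.w_margin * margin

reward_fn = ProfitAwareReward()
print(reward_fn(bought=True, price=8.0, cost=6.0))   # 0.3*1 + 0.7*0.25 = 0.475
print(reward_fn(bought=False, price=8.0, cost=6.0))  # 0.0
```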
Temporal dynamics also represent a key area for advancement. Considering the timing between purchases and sequences of consumer behavior could significantly enhance predictive accuracy and allow the model to better anticipate future actions. Incorporating temporal dependencies and seasonality effects would enable the framework to adapt more effectively to evolving market conditions and customer life cycles.
Further extensions may explore hybrid or alternative modeling techniques. For instance, integrating transformer-based architectures could be particularly effective given their proven ability to model sequential dependencies. Leveraging transformers may help capture complex, long-range relationships within customer interactions, further improving predictive power and personalization quality.
Ensuring scalability and adaptability remains critical for practical deployment. Future work should test the model in diverse business environments to validate robustness and generalizability. A scalable framework capable of maintaining performance across multiple contexts will enhance its potential for real-world application.
Finally, developing mechanisms for real-time learning and evaluation constitutes an essential step toward operational implementation. Building systems capable of continuous adaptation to incoming data would enable the model to respond rapidly to changing customer behavior and market dynamics. Real-time decision support would empower businesses to execute data-driven strategies that maximize profitability and customer satisfaction in fast-moving retail environments.
In summary, this research establishes a foundation for advanced pricing and RS based on reinforcement learning. Future studies will aim to address current limitations, refine the methodological components, and explore innovative extensions to further improve model performance and applicability. By continuously enhancing the framework with state-of-the-art techniques, this line of research seeks to deliver a powerful, adaptive, and profit-oriented tool capable of supporting sustainable marketing and pricing strategies in dynamic commercial ecosystems.

Author Contributions

Conceptualization, A.M. and B.B.; methodology, A.M. and B.B.; software, A.M.; validation, A.M., B.B. and H.M.; investigation, A.M. and B.B.; writing—original draft preparation, A.M.; writing—review and editing, A.M.; visualization, A.M.; supervision, B.B. and H.M.; project administration, B.B. and H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The Dunnhumby dataset used in this study is publicly available, as described in the body of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. StartUp Guru Lab. Customer Acquisition vs Retention Costs. 2025. Available online: https://startupgurulab.com/customer-acquisition-vs-retention-costs (accessed on 6 August 2025).
  2. Optimove. Customer Acquisition vs Retention Costs: Why Retention is More Profitable. 2025. Available online: https://www.optimove.com/resources/learning-center/customer-acquisition-vs-retention-costs (accessed on 6 August 2025).
  3. Gallo, A. The Value of Keeping the Right Customers. Harvard Business Review. 2014. Available online: https://hbr.org/2014/10/the-value-of-keeping-the-right-customers (accessed on 6 August 2025).
  4. Chen, K.; Chen, T.; Zheng, G.; Jin, O.; Yao, E.; Yu, Y. Collaborative personalized tweet recommendation. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, OR, USA, 12–16 August 2012; pp. 661–670. [Google Scholar]
  5. Kumar, V.; Ashraf, A.R.; Nadeem, W. AI-powered marketing: What, where, and how? Int. J. Inf. Manag. 2024, 77, 102783. [Google Scholar] [CrossRef]
  6. Kshetri, N.; Dwivedi, Y.K.; Davenport, T.H.; Panteli, N. Generative artificial intelligence in marketing: Applications, opportunities, challenges, and research agenda. Int. J. Inf. Manag. 2024, 75, 102716. [Google Scholar] [CrossRef]
  7. Nouri-Harzvili, M.; Hosseini-Motlagh, S.M. Dynamic discount pricing in online retail systems: Effects of post-discount dynamic forces. Expert Syst. Appl. 2023, 232, 120864. [Google Scholar] [CrossRef]
  8. Kim, S.; Park, E. STAD-GCN: Spatial–Temporal Attention-based Dynamic Graph Convolutional Network for retail market price prediction. Expert Syst. Appl. 2024, 255, 124553. [Google Scholar] [CrossRef]
  9. Lu, C.C.; Wu, I.L.; Hsiao, W.H. Developing customer product loyalty through mobile advertising: Affective and cognitive perspectives. Int. J. Inf. Manag. 2019, 47, 101–111. [Google Scholar] [CrossRef]
  10. Quadrana, M.; Cremonesi, P.; Jannach, D. Sequence-aware recommender systems. Acm Comput. Surv. 2018, 51, 1–36. [Google Scholar] [CrossRef]
  11. Zheng, Y.; Wang, D. A survey of recommender systems with multi-objective optimization. Neurocomputing 2022, 474, 141–153. [Google Scholar] [CrossRef]
  12. Jiang, Y.; Liu, Y. Optimization of online promotion: A profit-maximizing model integrating price discount and product recommendation. Int. J. Inf. Technol. Decis. Mak. 2012, 11, 961–982. [Google Scholar] [CrossRef]
  13. De Biasio, A.; Montagna, A.; Aiolli, F.; Navarin, N. A systematic review of value-aware recommender systems. Expert Syst. Appl. 2023, 226, 120131. [Google Scholar] [CrossRef]
  14. Cantador, I.; Bellogín, A.; Castells, P. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, Chicago, IL, USA, 27 October 2011.
  15. Afsar, M.M.; Crump, T.; Far, B. Reinforcement Learning based Recommender Systems: A Survey. ACM Comput. Surv. 2022, 55, 1–38. [Google Scholar] [CrossRef]
  16. Lin, Y.; Liu, Y.; Lin, F.; Zou, L.; Wu, P.; Zeng, W.; Chen, H.; Miao, C. A Survey on Reinforcement Learning for Recommender Systems. arXiv 2021, arXiv:2109.10665. [Google Scholar] [CrossRef]
  17. Chen, X.; Yao, L.; McAuley, J.; Zhou, G.; Wang, X. Deep reinforcement learning in recommender systems: A survey and new perspectives. Knowl.-Based Syst. 2023, 264, 110335. [Google Scholar] [CrossRef]
  18. Wen, S.; Shu, Y.; Rad, A.; Wen, Z.; Guo, Z.; Gong, S. A deep residual reinforcement learning algorithm based on Soft Actor-Critic for autonomous navigation. Expert Syst. Appl. 2025, 259, 125238. [Google Scholar] [CrossRef]
  19. Wang, J.; Karatzoglou, A.; Arapakis, I.; Jose, J.M. Reinforcement Learning-based Recommender Systems with Large Language Models for State Reward and Action Modeling. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), Washington, DC, USA, 14–18 July 2024; pp. 1–11. [Google Scholar] [CrossRef]
  20. Kavoosi, A.; Tavakkoli-Moghaddam, R.; Sajedi, H.; Tajik, N.; Tafakkori, K. Dynamic pricing and inventory control of perishable products by a deep reinforcement learning algorithm. Expert Syst. Appl. 2025, 291, 128570. [Google Scholar] [CrossRef]
  21. Tanveer, J.; Lee, S.-W.; Rahmani, A.M.; Aurangzeb, K.; Alam, M.; Zare, G.; Malekpour Alamdari, P.; Hosseinzadeh, M. PGA-DRL: Progressive graph attention-based deep reinforcement learning for recommender systems. Inf. Fusion 2025, 121, 103167. [Google Scholar] [CrossRef]
  22. Sitar-Tăut, D.A.; Mican, D.; Mare, C. Customer behavior in the prior purchase stage – information search versus recommender systems. Econ. Comput. Econ. Cybern. Stud. Res. 2020, 54, 59–76. [Google Scholar] [CrossRef]
  23. Zhang, H.; Zhao, L.; Gupta, S. The role of online product recommendations on customer decision making and loyalty in social shopping communities. Int. J. Inf. Manag. 2018, 38, 150–166. [Google Scholar] [CrossRef]
  24. Ramampiaro, H.e.a. New Ideas in Ranking for Personalized Fashion Recommender Systems. In Business and Consumer Analytics: New Ideas; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar] [CrossRef]
  25. Umberto, P. Developing a price-sensitive recommender system to improve accuracy and business performance of ecommerce applications. Int. J. Electron. Commer. Stud. 2015, 6, 1–18. [Google Scholar] [CrossRef]
  26. Bai, T.; Wang, X.; Zhang, Z.; Song, W.; Wu, B.; Nie, J.Y. GPR-OPT: A Practical Gaussian optimization criterion for implicit recommender systems. Inf. Process. Manag. 2024, 61, 103525. [Google Scholar] [CrossRef]
  27. Pujahari, A.; Sisodia, D.S. Modeling users’ preference changes in recommender systems via time-dependent Markov random fields. Expert Syst. Appl. 2023, 234, 121072. [Google Scholar] [CrossRef]
  28. Beheshti, A.; Yakhchi, S.; Mousaeirad, S.; Ghafari, S.M.; Goluguri, S.R.; Edrisi, M.A. Towards Cognitive Recommender Systems. Algorithms 2020, 13, 176. [Google Scholar] [CrossRef]
  29. Lerner, J.S.; Li, Y.; Valdesolo, P.; Kassam, K.S. Emotion and decision making. Annu. Rev. Psychol. 2015, 66, 799–823. [Google Scholar] [CrossRef]
  30. Alfaifi, Y.H. Recommender Systems Applications: Data Sources, Features, and Challenges. Information 2024, 15, 660. [Google Scholar] [CrossRef]
  31. Livne, A.; Unger, M.; Shapira, B.; Rokach, L. Deep Context-Aware Recommender System Utilizing Sequential Latent Context. In Proceedings of the 13th ACM Conference on Recommender Systems—CARS Workshop, Copenhagen, Denmark; ACM: New York, NY, USA, 2019. [Google Scholar]
  32. Zhao, W.X.; Guo, Y.; He, Y.; Jiang, H.; Wu, Y.; Li, X. We Know What You Want to Buy: A Demographic-Based System for Product Recommendation on Microblogs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY, USA; ACM: New York, NY, USA, 2014; pp. 1935–1944. [Google Scholar] [CrossRef]
  33. Tkalčič, M.; De Carolis, B.; de Gemmis, M.; Odić, A.; Košir, A. Emotions and Personality in Personalized Services: Models, Evaluation and Applications; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar] [CrossRef]
  34. Ricci, F.; Rokach, L.; Shapira, B. Introduction to recommender systems handbook. In Recommender Systems Handbook; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1–35. [Google Scholar] [CrossRef]
  35. Beladev, M.; Rokach, L.; Shapira, B. Recommender Systems for Product Bundling. Knowl.-Based Syst. 2016, 111, 193–206. [Google Scholar] [CrossRef]
  36. Kouki, P.; Fountalis, I.; Vasiloglou, N.; Yan, N.; Ahsan, U.; Al Jadda, K.; Qu, H. Product Collection Recommendation in Online Retail. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys ’19), Copenhagen, Denmark; ACM: New York, NY, USA, 2019; pp. 486–490. [Google Scholar] [CrossRef]
  37. Wu, S.; Sun, F.; Zhang, W.; Xie, X.; Cui, B. Graph Neural Networks in Recommender Systems: A Survey. arXiv 2020, arXiv:2011.02260. [Google Scholar] [CrossRef]
  38. Zhang, P.; Niu, Z.; Ma, R.; Zhang, F. Multi-view graph contrastive representation learning for bundle recommendation. Inf. Process. Manag. 2025, 62, 103956. [Google Scholar] [CrossRef]
  39. Dara, S.; Chowdary, C.R.; Kumar, C. A Survey on Group Recommender Systems. J. Intell. Inf. Syst. 2019, 54, 271–295. [Google Scholar] [CrossRef]
  40. Hwangbo, H.; Kim, Y.S.; Cha, K.J. Recommendation system development for fashion retail e-commerce. Electron. Commer. Res. Appl. 2018, 28, 94–101. [Google Scholar] [CrossRef]
  41. Kompan, M.; Gaspar, P.; Macina, J.; Cimerman, M.; Bielikova, M. Exploring Customer Price Preference and Product Profit Role in Recommender Systems. IEEE Intell. Syst. 2022, 37, 89–98. [Google Scholar] [CrossRef]
  42. Shin, D.; Zhong, B.; Biocca, F.A. Beyond user experience: What constitutes algorithmic experiences? Int. J. Inf. Manag. 2020, 52, 102061. [Google Scholar] [CrossRef]
  43. Wakil, K.; Alyari, F.; Ghasvari, M.; Lesani, Z.; Rajabion, L. A new model for assessing the role of customer behavior history, product classification, and prices on the success of the recommender systems in e-commerce. Kybernetes 2020, 49, 1325–1346. [Google Scholar] [CrossRef]
  44. Zheng, Y.; Gao, C.; He, X.; Li, Y.; Jin, D. Price-aware recommendation with graph convolutional networks. In Proceedings of the Proceedings—International Conference on Data Engineering, Dallas, TX, USA, 20–24 April 2020; pp. 133–144. [Google Scholar] [CrossRef]
  45. Lopatovska, I.; Mokros, H.B. Willingness to pay and experienced utility as measures of affective value of information objects: Users’ accounts. Inf. Process. Manag. 2008, 44, 92–104. [Google Scholar] [CrossRef]
  46. Chen, J.; Jin, Q.; Zhao, S.; Bao, S.; Zhang, L.; Su, Z.; Yu, Y. Does product recommendation meet its Waterloo in unexplored categories? No, price comes to help. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, Gold Coast, QLD, Australia, 6–11 July 2014; pp. 667–676. [Google Scholar] [CrossRef]
  47. Kent, R.J.; Monroe, K.B. The Effects of Framing Price Promotion Messages on Consumers’ Perceptions and Purchase Intentions. J. Retail. 1998, 74, 353–372. [Google Scholar] [CrossRef]
  48. Greenstein-Messica, A.; Rokach, L. Personal price aware multi-seller recommender system: Evidence from eBay. Knowl.-Based Syst. 2018, 150, 14–26. [Google Scholar] [CrossRef]
  49. Terui, N.; Dahana, W.D. Price customization using price thresholds estimated from scanner panel data. J. Interact. Mark. 2006, 20, 58–70. [Google Scholar] [CrossRef]
  50. Sato, M.; Izumo, H.; Sonoda, T. Discount sensitive recommender system for retail business. In Proceedings of the EMPIRE ’15: 3rd Workshop on Emotions and Personality in Personalized Systems 2015, Vienna, Austria, 16–20 September 2015; pp. 33–40. [Google Scholar] [CrossRef]
  51. Jannach, D.; Adomavicius, G. Price and Profit Awareness in Recommender Systems. arXiv 2017, arXiv:1707.08029. [Google Scholar] [CrossRef]
  52. Sato, M.; Izumo, H.; Sonoda, T. Model of Personal Discount Sensitivity in Recommender Systems. arXiv. 2016. Available online: https://ixdea.org/28_6/ (accessed on 28 October 2025).
  53. Das, A.; Mathieu, C.; Ricketts, D. Maximizing profit using recommender systems. arXiv 2009, arXiv:0908.3633. [Google Scholar] [CrossRef]
  54. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  55. LiChun, C.; ZhiMin, Z. An overview of deep reinforcement learning. In Proceedings of the 2019 4th International Conference on Automation, Control and Robotics Engineering, Shenzhen, China, 19–21 July 2019. [Google Scholar] [CrossRef]
  56. Wang, C.; Dong, T.; Chen, L.; Zhu, G.; Chen, Y. Multi-objective optimization approach for permanent magnet machine via improved soft actor–critic based on deep reinforcement learning. Expert Syst. Appl. 2025, 264, 125834. [Google Scholar] [CrossRef]
  57. Xin, X.; Karatzoglou, A.; Arapakis, I.; Jose, J.M. Self-supervised reinforcement learning for recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 931–940. [Google Scholar] [CrossRef]
  58. Wu, Y.; MacDonald, C.; Ounis, I. Partially observable reinforcement learning for dialog-based interactive recommendation. In Proceedings of the RecSys 2021—15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September–1 October 2021; pp. 241–251. [Google Scholar] [CrossRef]
  59. Luo, W.; Yu, X.; Yen, G.G.; Wei, Y. Deep reinforcement learning-guided coevolutionary algorithm for constrained multiobjective optimization. Inf. Sci. 2025, 692, 121648. [Google Scholar] [CrossRef]
  60. Lei, Y.; Wang, Z.; Li, W.; Pei, H. Social attentive deep Q-network for recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 1189–1192. [Google Scholar] [CrossRef]
  61. Farris, L. Optimized recommender systems with deep reinforcement learning. arXiv 2021, arXiv:2110.03039. [Google Scholar] [CrossRef]
  62. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  63. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
  64. Li, J.; Chen, B. A Deep Q-Learning Optimization Framework for Dynamic Pricing in E-Commerce. In Proceedings of the 2025 4th International Conference on Cyber Security, Artificial Intelligence and the Digital Economy (CSAIDE 2025), Kuala Lumpur, Malaysia; ACM: New York, NY, USA, 2025; pp. 367–371. [Google Scholar] [CrossRef]
  65. Fraija, A.; Henao, N.; Agbossou, K.; Kelouwani, S.; Fournier, M.; Nagarsheth, S.H. Deep Reinforcement Learning Based Dynamic Pricing for Demand Response Considering Market and Supply Constraints. Smart Energy 2024, 14, 100139. [Google Scholar] [CrossRef]
  66. Nomura, Y.; Liu, Z.; Nishi, T. Deep Reinforcement Learning for Dynamic Pricing and Ordering Policies in Perishable Inventory Management. Appl. Sci. 2025, 15, 2421. [Google Scholar] [CrossRef]
  67. Wang, G.; Ding, J.; Hu, F. Deep Reinforcement Learning Recommendation System Algorithm Based on Multi-Level Attention Mechanisms. Electronics 2024, 13, 4625. [Google Scholar] [CrossRef]
  68. Rossiiev, O.D.; Shapovalova, N.N.; Rybalchenko, O.H.; Striuk, A.M. A Comprehensive Survey on Reinforcement Learning-based Recommender Systems: State-of-the-Art, Challenges, and Future Perspectives. In Proceedings of the 7th Workshop for Young Scientists in Computer Science and Software Engineering (CSSE@SW 2024). CEUR Workshop Proceedings, Kryvyi Rih, Ukraine, 27 December 2025; pp. 428–440. [Google Scholar]
  69. Gupta, R.; Lin, J.; Meng, F. Multi-agent Deep Reinforcement Learning for Interdependent Pricing in Supply Chains. arXiv 2025, arXiv:2507.02698. [Google Scholar]
  70. Zhang, D.; Zhao, Y.; Sun, L. Reinforcement Learning in Recommender Systems: Progress, Challenges, and Opportunities. In Proceedings of the 18th ACM Conference on Recommender Systems (RecSys), Bari, Italy, 14–18 October 2024; ACM: New York, NY, USA, 2024; pp. 112–129. [Google Scholar]
  71. Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; Coppin, B. Deep Reinforcement Learning in Large Discrete Action Spaces. arXiv 2015, arXiv:1512.07679. [Google Scholar] [CrossRef]
  72. Liu, S.; Cai, Q.; Sun, B.; Wang, Y.; Jiang, J.; Zheng, D.; Gai, K.; Jiang, P.; Zhao, X.; Zhang, Y. Exploration and Regularization of the Latent Action Space in Recommendation. arXiv 2023, arXiv:2302.03431. [Google Scholar] [CrossRef]
  73. Liu, D.; Li, J.; Wu, J.; Du, B.; Chang, J.; Li, X. Interest Evolution-driven Gated Neighborhood aggregation representation for dynamic recommendation in e-commerce. Inf. Process. Manag. 2022, 59, 102982. [Google Scholar] [CrossRef]
  74. Wang, K.; Zou, Z.; Zhao, M.; Deng, Q.; Shang, Y.; Liang, Y.; Wu, R.; Shen, X.; Lyu, T.; Fan, C. Rl4rs: A real-world dataset for reinforcement learning based recommender system. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 2935–2944. [Google Scholar]
  75. Yu, J.; Xia, X.; Chen, T.; Cui, L.; Hung, N.Q.V.; Yin, H. XSimGCL: Towards Extremely Simple Graph Contrastive Learning for Recommendation. IEEE Trans. Knowl. Data Eng. 2023, 36, 913–926. [Google Scholar] [CrossRef]
  76. Wang, W.; Xu, Y.; Feng, F.; Lin, X.; He, X.; Chua, T. Diffusion Recommender Model. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), Taipei, Taiwan, 23–27 July 2023; pp. 832–841. [Google Scholar] [CrossRef]
  77. Zhao, X.; Xia, L.; Zhang, L.; Ding, Z.; Yin, D.; Tang, J. Deep Reinforcement Learning for Page-wise Recommendations. In Proceedings of the Twelfth ACM Conference on Recommender Systems (RecSys ’18), Vancouver, BC, Canada, 2–7 October 2018; pp. 95–103. [Google Scholar] [CrossRef]
  78. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756. [Google Scholar] [CrossRef]
  79. Hand, D.J.; Till, R.J. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach. Learn. 2001, 45, 171–186. [Google Scholar] [CrossRef]
  80. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  81. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1861–1870. [Google Scholar]
  82. Ng, A.Y.; Russell, S.J. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), Stanford, CA, USA, 29 June–2 July 2000; pp. 663–670. [Google Scholar]
  83. Roijers, D.M.; Vamplew, P.; Whiteson, S.; Dazeley, R. A Survey of Multi-Objective Sequential Decision-Making. J. Artif. Intell. Res. 2013, 48, 67–113. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
