Dynamic Pricing for Multi-Modal Meal Delivery Using Deep Reinforcement Learning
Abstract
1. Introduction
- We formulate a food delivery platform model with multiple transportation modes serving heterogeneous, selfish customers. The model captures the impact of pricing across modalities on customer behavior while accounting for the stochasticity in orders and customers’ value of time.
- We employ a deep reinforcement learning approach to derive a pricing mechanism for a meal delivery platform using a real-world transportation network.
- We evaluate the learned RL pricing policy against three baseline heuristic policies based on the total profit achieved.
Related Work
2. System Model and Problem Setup
2.1. Orders’ Model
2.2. Couriers’ Model
- 1. Completing the remaining delivery tasks already assigned to the courier;
- 2. Traveling from the previous drop-off location to the new pickup location and then delivering the order to its drop-off location.
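To make this decomposition concrete, the following minimal Python sketch computes a courier's total service time for a new order under the model above. The `Courier` fields and the `travel_time` lookup are hypothetical stand-ins for the paper's notation and the network travel times, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Courier:
    modality: str             # e.g., "drone" or "ground vehicle"
    next_location: int        # node at which the courier becomes available next
    remaining_latency: float  # time (minutes) until the courier is free at next_location

def service_time(courier: Courier, pickup: int, dropoff: int, travel_time) -> float:
    """Total time until this courier could complete a new order:
    (1) finish the remaining delivery tasks, then
    (2) travel from the previous drop-off point to the new pickup and on to the new drop-off.
    `travel_time(a, b)` is a hypothetical travel-time lookup on the transportation network."""
    return (courier.remaining_latency
            + travel_time(courier.next_location, pickup)
            + travel_time(pickup, dropoff))

# Toy usage: 10 minutes between any two distinct nodes.
toy_tt = lambda a, b: 0.0 if a == b else 10.0
c = Courier(modality="drone", next_location=3, remaining_latency=4.0)
print(service_time(c, pickup=7, dropoff=12, travel_time=toy_tt))  # 4 + 10 + 10 = 24 minutes
```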
2.3. Customers’ Behavior
- 1. The price offered by the platform for the modality does not exceed the customer’s maximum acceptable price; formally, for modality m at time t, the posted price must be at most this maximum acceptable price.
- 2. The generalized cost associated with the modality does not exceed the customer’s maximum acceptable generalized cost.
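As an illustration of these two acceptance conditions, the sketch below assumes the generalized cost takes the common additive form of price plus VoT times latency (the exact definition is the one given in Equation (3) of the paper) and that, among acceptable modalities, the customer picks the one with the lowest generalized cost; `p_max` and `gc_max` are placeholders for the maximum acceptable price and generalized cost.

```python
import math

def generalized_cost(price: float, latency: float, vot: float) -> float:
    # Assumed form: monetary cost plus time cost weighted by the customer's value of time (VoT).
    return price + vot * latency

def choose_modality(prices, latencies, vot, p_max, gc_max):
    """Return the index of the chosen modality, or None if the customer accepts no offer.

    A modality is acceptable only if its price is within p_max and its generalized
    cost is within gc_max; among acceptable options the customer picks the one with
    the lowest generalized cost."""
    best, best_gc = None, math.inf
    for m, (p, latency) in enumerate(zip(prices, latencies)):
        gc = generalized_cost(p, latency, vot)
        if p <= p_max and gc <= gc_max and gc < best_gc:
            best, best_gc = m, gc
    return best

# Toy usage: a customer with a low VoT prefers the cheaper, slower option.
print(choose_modality(prices=[12.0, 8.0], latencies=[15.0, 40.0], vot=0.1, p_max=20.0, gc_max=15.0))  # -> 1
```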
2.4. Transportation Network
2.5. Meal Delivery Platform Problem
3. The Problem Formulation as a Markov Decision Process (MDP)
3.1. State Space
- The queue of waiting orders. Each order is characterized by three parameters: its pickup location, its drop-off location, and the remaining time steps before it expires. The order at the front of the queue is the one considered for service at time t.
- The set of all couriers, where each courier is characterized by three parameters: its modality; the next location at which the courier will become available; and the remaining latency, i.e., the time until the courier becomes available at that location.
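A minimal sketch of one way this state could be represented in code; the class and field names are hypothetical and do not follow the paper's symbols.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Order:
    pickup: int    # pickup (restaurant) node
    dropoff: int   # drop-off (customer) node
    ttl: int       # remaining time steps before the order expires

@dataclass
class Courier:
    modality: str             # transportation mode, e.g., "drone" or "ground vehicle"
    next_location: int        # node at which the courier becomes available next
    remaining_latency: float  # time until the courier is available at next_location

@dataclass
class State:
    queue: List[Order]        # waiting orders; queue[0] is the order considered at time t
    couriers: List[Courier]   # all couriers and their availability information
```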
3.2. Action Space
3.3. State Transition Model
- Order queue: Given the pricing vector and the order at the front of the queue, the customer evaluates the generalized cost of each available modality m (defined in (3)). The customer then either chooses a modality and receives the service, or exits the queue once the order's remaining time has expired, indicating that the customer is no longer willing to wait for a new pricing option; otherwise, the order remains at the front of the queue. Among acceptable options, the selected modality is the one with the lowest generalized cost. Further, for every waiting order, the remaining time until expiration is decremented by one at each time step, and any order whose remaining time reaches zero is removed from the queue. At each time step, new orders arrive according to a Poisson distribution whose rate determines the number of orders added to the queue at time t; the pickup and drop-off locations of each new order are sampled independently according to the regional pickup and drop-off rate vectors.
- Couriers: For each courier, the remaining time until availability is decremented by one at each time step; once it reaches zero, the courier has completed its delivery and is available at its next location for a new order. If a courier of the chosen modality is assigned to the current order, that courier's state is updated by setting its next location to the order's drop-off location and resetting its remaining time until availability to the corresponding service time.
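Putting the two update rules together, the sketch below advances the environment by one time step. It reuses the `Order`, `choose_modality`, and `service_time` helpers sketched earlier, assumes every modality has at least one courier, and takes placeholder inputs `modalities`, `max_wait`, and a NumPy random generator `rng`; courier selection within a modality and several bookkeeping details of the paper's model are simplified.

```python
def step(state, prices, vot, p_max, gc_max,
         arrival_rate, pickup_probs, dropoff_probs,
         max_wait, modalities, travel_time, rng):
    """One illustrative transition step of the MDP: order queue first, then couriers."""
    # 1. The customer at the front of the queue reacts to the posted prices.
    if state.queue:
        order = state.queue[0]
        # Fastest courier per modality (assumes each modality has at least one courier).
        best = {m: min((c for c in state.couriers if c.modality == m),
                       key=lambda c: service_time(c, order.pickup, order.dropoff, travel_time))
                for m in modalities}
        latencies = [service_time(best[m], order.pickup, order.dropoff, travel_time)
                     for m in modalities]
        chosen = choose_modality(prices, latencies, vot, p_max, gc_max)
        if chosen is not None:
            courier = best[modalities[chosen]]
            courier.remaining_latency = latencies[chosen]  # busy until the delivery is complete
            courier.next_location = order.dropoff          # courier ends up at the drop-off node
            state.queue.pop(0)

    # 2. Waiting orders get one step closer to expiring; expired orders leave the queue.
    for o in state.queue:
        o.ttl -= 1
    state.queue = [o for o in state.queue if o.ttl > 0]

    # 3. New orders arrive according to a Poisson distribution, with pickup and drop-off
    #    locations sampled independently from the regional rate vectors.
    for _ in range(rng.poisson(arrival_rate)):
        pickup = int(rng.choice(len(pickup_probs), p=pickup_probs))
        dropoff = int(rng.choice(len(dropoff_probs), p=dropoff_probs))
        state.queue.append(Order(pickup=pickup, dropoff=dropoff, ttl=max_wait))

    # 4. All couriers move one time step closer to becoming available.
    for c in state.couriers:
        c.remaining_latency = max(0.0, c.remaining_latency - 1.0)
```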
3.4. The Initial State
3.5. Reward Signal, r
4. Reinforcement Learning
- 1. The computational complexity per iteration of value iteration is $O(|\mathcal{A}||\mathcal{S}|^2)$, and since both the action space and the state space are continuous (prices and latencies take continuous values), the cardinalities of the action and state spaces, $|\mathcal{A}|$ and $|\mathcal{S}|$, are not finite.
- 2. In practice, a food delivery platform may not have complete knowledge of the distribution parameters of customers’ value of time.
5. Heuristic Policies
- Max Price: Since the profit is monotonically increasing in the posted price, this policy always offers the maximum allowable price to maximize profit (a code sketch of all three heuristics is given at the end of this section). Additionally, since customers tend to select the courier with the lowest latency, this behavior also helps reduce the delivery cost for the platform. However, in systems with generally high latencies, this policy may lead customers to frequently leave the platform without placing an order because of the high generalized cost.
- Max Order: In this policy, the platform adjusts the prices of all modalities so that the expected generalized cost remains below the customer’s maximum acceptable threshold, in order to maximize the number of confirmed orders. The price is lowered only as long as the platform’s profit at time t for the offered modality remains non-negative; if condition (14) does not hold, the platform instead benefits from setting a higher price, leading the customer to reject the service, so that the platform receives zero profit rather than incurring a loss. Since the platform does not have access to customers’ real-time VoT values, it adjusts prices such that the expected generalized cost for each modality m and its corresponding courier c satisfies the customer’s acceptance threshold, and the prices of this policy are set accordingly.
- Zone-Based Price: Given that the mean value of customers’ VoT varies across regions, this policy exploits these regional differences to adjust pricing accordingly. To motivate the policy, we consider a pair of modalities and examine how pricing signals can be designed to encourage customers to select one modality over the other; in systems with more than two modalities, this pairwise analysis can be extended across all modality combinations to construct a pricing policy that accounts for regional variations in VoT. Without loss of generality, consider modalities m and m′ with corresponding couriers c and c′, where the operational cost of courier c′ is greater than that of courier c. The platform then reduces the price of modality m relative to m′ so that the customer is encouraged to choose m over m′ while the platform still maintains a higher profit. Here, we consider two cases:
- 1. When the latency of modality m′ is greater than the latency of modality m, Inequality (20) always holds, since both the price and the latency of modality m′ exceed those of modality m. In this case, the platform can simply set both prices to the maximum acceptable price for an order.
- 2. When the latency of modality m′ is not greater than that of modality m, a pricing strategy satisfying all the conditions exists provided the VoT-weighted latency gap between m and m′ is relatively small compared to the difference in operational costs between the two modalities. Although the platform does not observe the customer’s real-time VoT, it uses the expected VoT (defined in (18)): for customers in regions with a lower expected VoT, the platform reduces the price of modality m just enough to satisfy this condition.
This analysis suggests that the platform can adopt a pricing policy that reduces the price of modalities with lower operational cost but higher latency. This makes these options more attractive to customers, especially those with a lower VoT. As a result, the platform can benefit from the lower operational cost while still meeting customer preferences. This trade-off allows the platform to increase overall profit by aligning customer preferences with its own objectives. Additionally, it helps distribute customer demand more effectively across different modalities.
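Under the same assumptions as the earlier sketches (generalized cost equals price plus VoT times latency), the three heuristics could be coded roughly as follows. `P_MAX`, the operational costs, and the regional mean VoT are placeholder inputs, and the zone-based rule is written only for the pairwise case analyzed above; the authors' exact expressions (14), (18), and (20) may differ in detail.

```python
import numpy as np

P_MAX = 30.0  # hypothetical maximum acceptable price (USD)

def max_price_policy(latencies):
    """Max Price: post the price cap for every modality."""
    return np.full(len(latencies), P_MAX)

def max_order_policy(latencies, mean_vot, gc_max, op_costs):
    """Max Order: price each modality so that the *expected* generalized cost
    (computed with the regional mean VoT) meets the acceptance threshold; if that
    price would fall below the operational cost, post the cap instead so the
    customer rejects and the platform avoids a loss."""
    prices = np.clip(gc_max - mean_vot * np.asarray(latencies), 0.0, P_MAX)
    return np.where(prices >= np.asarray(op_costs), prices, P_MAX)

def zone_based_policy(latencies, mean_vot, op_costs):
    """Zone-Based: start from the cap, then discount the cheaper-to-operate modality
    just enough (given the regional mean VoT) to offset its extra latency."""
    latencies, op_costs = np.asarray(latencies, float), np.asarray(op_costs, float)
    prices = np.full(len(latencies), P_MAX)
    cheap, costly = int(np.argmin(op_costs)), int(np.argmax(op_costs))
    extra_time = max(0.0, latencies[cheap] - latencies[costly])
    prices[cheap] = max(op_costs[cheap], P_MAX - mean_vot * extra_time)
    return prices

# Toy usage: a slow, cheap modality versus a fast, costly one.
print(zone_based_policy(latencies=[40.0, 15.0], mean_vot=0.3, op_costs=[3.0, 9.0]))
```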
6. Numerical Experiment
6.1. Sioux Falls Transportation Network
- The Suburban areas consist of the regions .
- The Inner Suburbs (Inner Ring) areas consist of the regions .
- The Downtown areas consist of the regions .
- 1. All arriving orders have their pickup locations in the Downtown region. Therefore, the regional pickup rate is positive only for the Downtown regions and zero for all other regions. This assumption reflects the fact that restaurants are more concentrated in Downtown areas, making it more likely for customers to place orders from the diverse and popular options available there rather than from the limited local options elsewhere. New orders arrive at a fixed Poisson rate.
- 2. Orders may be placed by customers in any region. The regional drop-off rate takes one value for the Downtown regions and another value for all other regions, with the drop-off rates over all regions summing to one.
- 3. For region i, the VoT (measured in $ per minute) is drawn from a region-dependent distribution, defined as follows:
- For the Suburban areas, where , it holds that .
- For the Inner Suburbs areas, where , it holds that .
- For the Downtown areas, where , it holds that .
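The configuration sketch below illustrates assumptions 1 through 3 with placeholder numbers; the region indices, rates, and VoT parameters shown here are hypothetical and are not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical partition of the 24 Sioux Falls nodes and placeholder parameters.
ALL_REGIONS = np.arange(24)
DOWNTOWN = [9, 10, 15, 16]   # example Downtown node indices
ARRIVAL_RATE = 2.0           # example Poisson order-arrival rate per time step

# Assumption 1: pickups are concentrated in Downtown, zero elsewhere.
pickup_probs = np.zeros(len(ALL_REGIONS))
pickup_probs[DOWNTOWN] = 1.0 / len(DOWNTOWN)

# Assumption 2: one drop-off rate Downtown, another elsewhere, normalized to sum to one.
dropoff_probs = np.where(np.isin(ALL_REGIONS, DOWNTOWN), 2.0, 1.0)
dropoff_probs = dropoff_probs / dropoff_probs.sum()

# Assumption 3: region-dependent VoT draws ($ per minute); the example means only
# encode that Downtown customers tend to value their time the most.
MEAN_VOT = {"suburb": 0.3, "inner": 0.5, "downtown": 0.8}   # hypothetical

def sample_vot(region_type: str) -> float:
    """Draw a non-negative VoT for a customer in the given region type."""
    return max(0.0, rng.normal(MEAN_VOT[region_type], 0.1))
```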
6.1.1. Scenario 1: Complete Graph
6.1.2. Scenario 2: Sioux Falls Network Graph
6.1.3. Hyperparameters for PPO
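Since the experiments are built on Gymnasium and trained with Stable-Baselines3 (see the References), a PPO run might be configured roughly as follows. The environment id and every hyperparameter value shown here are placeholders; the tuned settings actually used are those reported in Appendix A.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Hypothetical custom environment implementing the MDP of Section 3
# (it must be registered with Gymnasium before gym.make can find it).
env = gym.make("MealDeliveryPricing-v0")

model = PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=3e-4,   # placeholder
    n_steps=1024,         # placeholder rollout length per update
    batch_size=128,       # placeholder minibatch size
    clip_range=0.2,       # placeholder PPO clipping parameter
    ent_coef=0.01,        # placeholder entropy-bonus coefficient
    verbose=1,
)
model.learn(total_timesteps=200_000)   # placeholder training budget
model.save("ppo_meal_delivery_pricing")
```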
6.2. Results
6.2.1. Evaluation Metrics
- Order: Total number of orders served during an episode.
- Latency: Average latency of served orders within an episode.
- Price: Average price paid for orders during an episode.
- GC (Generalized Cost): Average generalized cost of served orders during an episode.
- Profit: Cumulative reward collected over an episode (200 time steps).
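A small helper showing how these five metrics could be computed from a per-episode log of served orders; the `ServedOrder` record is a hypothetical data structure, not part of the paper's code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ServedOrder:
    price: float             # price paid (USD)
    latency: float           # delivery latency (minutes)
    generalized_cost: float  # price + VoT * latency for the serving customer (USD)

def episode_metrics(served: List[ServedOrder], profit: float) -> dict:
    """Aggregate the five evaluation metrics for one 200-step episode."""
    n = len(served)
    return {
        "Order": n,
        "Latency": sum(o.latency for o in served) / n if n else 0.0,
        "Price": sum(o.price for o in served) / n if n else 0.0,
        "GC": sum(o.generalized_cost for o in served) / n if n else 0.0,
        "Profit": profit,   # cumulative reward over the episode
    }
```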
6.2.2. Scenario 1
6.2.3. Scenario 2
7. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Comparison of Reinforcement Learning Algorithms
Policy | Learning Rate | Batch Size | Clip Range | Entropy Coefficient | |
---|---|---|---|---|---|
PPO1 | 1024 | 128 | |||
PPO2 | 1024 | 128 | |||
PPO3 | 1024 | 128 | |||
PPO4 | 1024 | 64 | |||
PPO5 | 1024 | 128 | |||
PPO6 | 2048 | 64 | |||
PPO7 | 2048 | 128 |
Policy | PPO1 | PPO2 | PPO3 | PPO4 | PPO5 | PPO6 | PPO7 |
---|---|---|---|---|---|---|---|
Scenario 1 | |||||||
Scenario 2 |
Policy | PPO1 | PPO6 | SAC | TD3 |
---|---|---|---|---|
Scenario 1 | ||||
Scenario 2 |
References
- DoorDash. DoorDash Releases Fourth Quarter and Full Year 2024 Financial Results, February 2025. Available online: https://ir.doordash.com/news/news-details/2025/DoorDash-Releases-Fourth-Quarter-and-Full-Year-2024-Financial-Results/default.aspx (accessed on 12 May 2025).
- Uber Technologies Investment. Uber Announces Results for Fourth Quarter and Full Year 2024, February 2025. Available online: https://investor.uber.com/news-events/news/press-release-details/2025/Uber-Announces-Results-for-Fourth-Quarter-and-Full-Year-2024/default.aspx (accessed on 12 May 2025).
- Moshref-Javadi, M.; Winkenbach, M. Applications and Research avenues for drone-based models in logistics: A classification and review. Expert Syst. Appl. 2021, 177, 114854.
- Garg, V.; Niranjan, S.; Prybutok, V.; Pohlen, T.; Gligor, D. Drones in last-mile delivery: A systematic review on Efficiency, Accessibility, and Sustainability. Transp. Res. Part D Transp. Environ. 2023, 123, 103831.
- Thiels, C.A.; Aho, J.M.; Zietlow, S.P.; Jenkins, D.H. Use of Unmanned Aerial Vehicles for Medical Product Transport. Air Med. J. 2015, 34, 104–108.
- Hwang, J.; Kim, I.; Gulzar, M.A. Understanding the eco-friendly role of drone food delivery services: Deepening the theory of planned behavior. Sustainability 2020, 12, 1440.
- Goodchild, A.; Toy, J. Delivery by drone: An evaluation of unmanned aerial vehicle technology in reducing CO2 emissions in the delivery service industry. Transp. Res. Part D Transp. Environ. 2018, 61, 58–67.
- Kim, J.J.; Kim, I.; Hwang, J. A change of perceived innovativeness for contactless food delivery services using drones after the outbreak of COVID-19. Int. J. Hosp. Manag. 2021, 93, 102758.
- Abbasi, G.A.; Rodriguez-López, M.E.; Higueras-Castillo, E.; Liébana-Cabanillas, F. Drones in food delivery: An analysis of consumer values and perspectives. Int. J. Logist. Res. Appl. 2024, 1–21.
- Koay, K.Y.; Leong, M.K. Understanding consumers’ intentions to use drone food delivery services: A perspective of the theory of consumption values. Asia-Pac. J. Bus. Adm. 2023, 16, 1226–1240.
- Waris, I.; Ali, R.; Nayyar, A.; Baz, M.; Liu, R.; Hameed, I. An empirical evaluation of customers’ adoption of drone food delivery services: An extended technology acceptance model. Sustainability 2022, 14, 2922.
- Liébana-Cabanillas, F.; Rodríguez-López, M.E.; Abbasi, G.A.; Higueras-Castillo, E. A behavioral study of food delivery service by drones: Insights from urban and rural consumers. Int. J. Hosp. Manag. 2025, 127, 104098.
- Beliaev, M.; Mehr, N.; Pedarsani, R. Pricing for multi-modal pickup and delivery problems with heterogeneous users. Transp. Res. Part C Emerg. Technol. 2024, 169, 104864.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Dorling, K.; Heinrichs, J.; Messier, G.G.; Magierowski, S. Vehicle routing problems for drone delivery. IEEE Trans. Syst. Man Cybern. Syst. 2016, 47, 70–85.
- Attenni, G.; Arrigoni, V.; Bartolini, N.; Maselli, G. Drone-based delivery systems: A survey on route planning. IEEE Access 2023, 11, 123476–123504.
- Beliaev, M.; Mehr, N.; Pedarsani, R. Congestion-aware bi-modal delivery systems utilizing drones. Future Transp. 2023, 3, 329–348.
- Chen, C.; Demir, E.; Hu, X.; Huang, H. Transforming last mile delivery with heterogeneous assistants: Drones and delivery robots. J. Heuristics 2025, 31, 8.
- Samouh, F.; Gluza, V.; Djavadian, S.; Meshkani, S.; Farooq, B. Multimodal Autonomous Last-Mile Delivery System Design and Application. In Proceedings of the 2020 IEEE International Smart Cities Conference (ISC2), Piscataway, NJ, USA, 28 September–1 October 2020; pp. 1–7.
- Liu, Y. An optimization-driven dynamic vehicle routing algorithm for on-demand meal delivery using drones. Comput. Oper. Res. 2019, 111, 1–20.
- Jahanshahi, H.; Bozanta, A.; Cevik, M.; Kavuk, E.M.; Tosun, A.; Sonuc, S.B.; Kosucu, B.; Başar, A. A deep reinforcement learning approach for the meal delivery problem. Knowl.-Based Syst. 2022, 243, 108489.
- Bozanta, A.; Cevik, M.; Kavaklioglu, C.; Kavuk, E.M.; Tosun, A.; Sonuc, S.B.; Duranel, A.; Basar, A. Courier routing and assignment for food delivery service using reinforcement learning. Comput. Ind. Eng. 2022, 164, 107871.
- Zou, G.; Tang, J.; Yilmaz, L.; Kong, X. Online food ordering delivery strategies based on deep reinforcement learning. Appl. Intell. 2022, 56, 6853–6865.
- Mehra, A.; Saha, S.; Raychoudhury, V.; Mathur, A. DeliverAI: Reinforcement Learning Based Distributed Path-Sharing Network for Food Deliveries. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–9.
- Li, M.; Qin, Z.; Jiao, Y.; Yang, Y.; Wang, J.; Wang, C.; Wu, G.; Ye, J. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 983–994.
- Bi, Z.; Guo, X.; Wang, J.; Qin, S.; Liu, G. Deep reinforcement learning for truck-drone delivery problem. Drones 2023, 7, 445.
- Chen, X.; Ulmer, M.W.; Thomas, B.W. Deep Q-learning for same-day delivery with vehicles and drones. Eur. J. Oper. Res. 2022, 298, 939–952.
- Towers, M.; Kwiatkowski, A.; Terry, J.K.; Balis, J.U.; de Cola, G.; Deleu, T.; Goulão, M.; Kallinteris, A.; Krimmel, M.; KG, A.; et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments. Available online: https://github.com/Farama-Foundation/Gymnasium (accessed on 15 May 2025).
- Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
- Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8.
- Biewald, L. Experiment Tracking with Weights and Biases. 2020. Available online: https://www.wandb.com/ (accessed on 9 August 2025).
- Cabannes, T. TransportationNetworks, SiouxFalls, October 2021. Available online: https://github.com/bstabler/TransportationNetworks (accessed on 17 May 2025).
- Abdulaal, M.; LeBlanc, L.J. Continuous equilibrium network design models. Transp. Res. Part B Methodol. 1979, 13, 19–32.
Symbol | Description
---|---
 | Set of regions
 | The vector of pickup location rates
 | The vector of drop-off location rates
 | The rate of order arrivals
 | The maximum time that an order will stay in the queue
t | The current time step
T | The total number of time steps
 | The queue of orders at time t
 | The order to be served at time t, which is at the front of the queue
 | The pickup location (origin) of the order
 | The drop-off location (destination) of the order
 | The remaining time to expiration for the order
 | The set of modalities
 | The normalized cost of operation for modality m
 | The price of modality m at time t
 | The maximum acceptable price for customers
 | Set of all the couriers
 | Set of all the couriers of modality m
 | The drop-off location at which courier c will become available
 | The latency of courier c until it becomes available at that location
 | The latency of courier c to arrive at the drop-off location of the current order
 | The service time for the current order under courier c
 | The chosen modality to serve the current order
 | The chosen courier to serve the current order
 | The maximum acceptable generalized cost of the customers
Policy | Profit (USD) | Order | Price (USD) | GC (USD) | Latency (min) |
---|---|---|---|---|---|
RL | |||||
Max Order | |||||
Max Price | |||||
Zone-Based |
Policy | Profit (USD) | Order | Price (USD) | GC (USD) | Latency (min) |
---|---|---|---|---|---|
RL | |||||
Zone-Based | |||||
Max Price | |||||
Max Order |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).