Article

Dynamic Pricing for Multi-Modal Meal Delivery Using Deep Reinforcement Learning

Electrical and Computer Engineering, University of California Santa Barbara, Santa Barbara, CA 93106, USA
* Author to whom correspondence should be addressed.
Future Transp. 2025, 5(3), 112; https://doi.org/10.3390/futuretransp5030112
Submission received: 30 June 2025 / Revised: 9 August 2025 / Accepted: 13 August 2025 / Published: 1 September 2025

Abstract

In this paper, we develop a dynamic pricing mechanism for a meal delivery platform that offers multiple transportation modes for order deliveries. We consider orders from heterogeneous customers who select their preferred delivery mode based on individual generalized cost (GC) functions, where GC captures the trade-off between price and delivery latency for each transportation option. Given the logistics of the underlying transportation network, the platform can utilize a pricing mechanism to guide customer choices toward delivery modes that optimize resource allocation across available transportation modalities. By accounting for variability in the latency and cost of modalities, such pricing aligns customer preferences with the platform’s operational objectives and enhances overall satisfaction. Due to the computational complexity of finding the optimal policy, we adopt a deep reinforcement learning (DRL) approach to design the pricing mechanism. Our numerical results demonstrate up to 143 % higher profits compared to heuristic pricing strategies, highlighting the potential of DRL-based dynamic pricing to improve profitability, resource efficiency, and service quality in on-demand delivery services.

1. Introduction

Ongoing advancements in digital and logistics technologies have contributed to the growing popularity of e-commerce services, including food delivery and online retail. According to 2024 reports, DoorDash reported a 19 % year-over-year (YoY) increase in total orders [1], while Uber’s delivery gross bookings increased by 18 % YoY [2]. The rapid growth of this market has introduced new logistical and operational challenges, such as improving delivery times in congested urban areas and enhancing service accessibility in rural regions—issues that service providers are actively working to address. As a result, the replacement of traditional delivery methods with more efficient alternatives has become an area of increasing interest in both industry and academic research [3,4].
Recent advances in unmanned aerial vehicles (UAVs), known as drones, have made them a promising candidate for food delivery services. Drones offer several advantages, including being unaffected by road congestion and enabling faster delivery times, as they can maintain steady, relatively high speeds [5]. Because they run on electricity, they have a comparatively smaller environmental footprint and produce fewer emissions than fuel-based delivery vehicles [6,7]. Their autonomous operation further reduces labor costs by eliminating the need for human drivers, while integrated GPS technology enables more precise and efficient routing. Furthermore, drones have gained attention as a contactless delivery option, particularly during the COVID-19 pandemic [8]. Collectively, these attributes make drones a valuable addition to last-mile delivery systems, providing an innovative alternative to traditional modalities. In response to their growing relevance, several studies have investigated customer attitudes and behavioral responses toward drone-based meal delivery services [9,10,11]. For instance, Liébana-Cabanillas et al. [12] studied behavioral differences between urban and rural customers regarding the use of drones for meal delivery.
Given the continued expansion of the e-commerce market, integrating emerging delivery modalities—such as drones—alongside existing options requires delivery platforms to adopt optimal pricing mechanisms to manage high order volumes while meeting diverse customer expectations. Due to the operational differences across transportation modes and the heterogeneity in customer preferences, pricing can be used as a strategic tool to guide resource allocation, reduce operational costs, increase gross profit, and improve overall customer satisfaction [13].
In this work, we consider a meal delivery platform that serves a population of customers located across different regions within a transportation network, using multiple delivery modalities. Customers are heterogeneous in their value of time (VoT), which influences their preferences over delivery modes. To capture the differences across modalities, we assume that the transportation networks associated with each modality differ in the number of links connecting various pickup and drop-off locations, referred to as Origin–Destination (OD) pairs, and in their latencies. In particular, our numerical simulations use drone delivery, an emerging and operationally distinct modality that offers flexible aerial routing, as a relevant test case within our framework (Figure 1).
We formulate a sequential decision-making problem in which, based on the system state, the platform presents each customer with a set of available courier modes, along with their corresponding prices and delivery latencies determined by the Origin–Destination pair of the customer’s order. Each customer then makes a selfish decision based on their individual value of time: they either select the delivery modality that minimizes their generalized cost—which captures the trade-off between price and latency—or wait for a better option with a lower GC. Given the randomness in customer orders and their VoT, the system state evolves through a stochastic transition process. The platform’s objective is to set prices for different modalities in a way that enables efficient resource allocation for order delivery and maximizes profit, taking into account that modalities differ in their operational cost-to-latency ratios.
Since static pricing policies are unable to capture the inherent stochasticity of the system, we model the meal delivery problem as a Markov Decision Process (MDP) to derive a real-time pricing strategy. Given the complexity and high dimensionality of the state and action spaces, we adopt a deep reinforcement learning (RL) approach instead of traditional dynamic programming methods to obtain a near-optimal pricing policy. Specifically, we employ the Proximal Policy Optimization (PPO) algorithm [14] to improve the platform’s profit by dynamically assigning prices to each delivery modality based on incoming orders. We then evaluate the performance of the learned RL policy against three baseline heuristic strategies, comparing the total profit achieved by each approach.
Our main contributions can be summarized as follows:
  • We formulate a food delivery platform model with multiple transportation modes serving heterogeneous, selfish customers. The model captures the impact of pricing across modalities on customer behavior while accounting for the stochasticity in orders and customers’ value of time.
  • We employ a deep reinforcement learning approach to derive a pricing mechanism for a meal delivery platform using a real-world transportation network.
  • We evaluate the learned RL pricing policy against three baseline heuristic policies based on the total profit achieved.

Related Work

Drones have become a popular topic of study for last-mile delivery systems [4]. While many works have focused on the social acceptance of drones in last-mile delivery—given that this stage of the system involves direct interaction with customers [9,12]—others have examined the associated routing challenges [15,16]. In addition, several studies have considered multi-modal delivery systems that incorporate drones. In Beliaev et al. [17], the authors consider optimal routing and allocation policies for a bi-modal delivery system that accounts for road congestion. Similarly, refs. [18,19] examine hybrid delivery systems involving both delivery robots and drones, highlighting the benefits of combining modalities with different characteristics for last-mile delivery. Further, focusing specifically on food delivery services, Liu [20] presents a real-time routing optimization for drones, accounting for randomness in arrival times as well as pickup and delivery locations. In Beliaev et al. [13], the authors formulate the multi-modal meal delivery problem as a congestion game to analyze the impact of pricing mechanisms on efficient resource allocation and latency optimization. Although their work addresses the pricing problem in multi-modal meal delivery, our problem formulation, objective, and methodology differ significantly.
Recently, several studies have explored the application of reinforcement learning techniques to the meal delivery problem [21]. Many of these works focus on courier routing and order assignment tasks [22]. For instance, Zou et al. [23] formulate the order dispatching problem as a Markov Decision Process and employ a Double Deep Q-Network (DQN)-based RL agent to assign newly arriving orders to couriers based on pickup and delivery locations and the real-time state of all couriers. In most of these studies, ride-sharing is incorporated into food delivery systems to improve overall efficiency [24,25].
A few recent works also address hybrid delivery systems using RL approaches, combining different modalities such as ground vehicles and drones [26,27].
The organization of this paper is as follows: In Section 2, we present a high-level description of the system model and define the objective of the meal delivery platform, followed by a detailed formulation of the problem as a Markov Decision Process in Section 3. In Section 4, we propose a deep reinforcement learning approach as a method for developing a real-time pricing policy. Then, in Section 5, we introduce the heuristic policies used as baselines to evaluate the RL-based policy. In Section 6, we present numerical experiments on our case study, the Sioux Falls transportation network, under two drone scenarios, and report the performance of the resulting real-time pricing policies. We provide a discussion of the results and future directions in Section 7.

2. System Model and Problem Setup

2.1. Orders’ Model

We consider a meal delivery platform that operates in discrete time steps, indexed by $t \in \mathbb{Z}^+ = \{1, 2, \dots\}$. At each time step $t$, the platform observes a queue of pending orders, denoted by $Q_t$, and is allowed to serve only the first order in the queue. These orders are placed by heterogeneous customers located in different regions of a transportation network, where the set of all regions is denoted by $\mathcal{R}$. Each order involves a request for meal delivery from a pickup region to a drop-off region and is characterized by an Origin–Destination pair $(o_t, d_t)$, where $o_t, d_t \in \mathcal{R}$, along with the remaining time until expiration, $\tau_t \in \mathbb{Z}^+$. The arrival and characteristics of orders (e.g., OD pair) are assumed to be stochastic. If an order expires, it leaves the queue without being served. We assume the queue operates under a First-In/First-Out (FIFO) policy; therefore, whenever an order arrives, it is added to the end of the queue.
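To make the queue bookkeeping concrete, below is a minimal sketch of the order queue in Python; the `Order` fields and the expiration handling are illustrative assumptions rather than the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Order:
    origin: int       # pickup region o_t
    destination: int  # drop-off region d_t
    ttl: int          # remaining time steps before expiration, tau_t

def age_queue(queue: deque) -> deque:
    """Decrement every order's expiration timer and drop expired orders,
    preserving FIFO order for the rest."""
    survivors = deque()
    for order in queue:
        order.ttl -= 1
        if order.ttl > 0:
            survivors.append(order)
    return survivors

queue = deque([Order(origin=9, destination=3, ttl=2)])
queue.append(Order(origin=10, destination=21, ttl=2))  # new orders join the back
queue = age_queue(queue)   # the platform serves only queue[0] at each step
```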

2.2. Couriers’ Model

In this paper, we study a meal delivery platform under the assumption of a fixed time horizon during which couriers remain continuously available on the platform. This assumption allows us to capture the impact of dynamic pricing on courier latency as the primary limited resource, without introducing additional complexity from modeling variations in courier availability or guaranteeing their presence. Since the system supports multiple delivery modalities, each courier is associated with a modality, which corresponds to one of the available transportation options (e.g., car, drone, bike, etc.). The set of all couriers is denoted by $\mathcal{C}$, and the set of all available modalities is denoted by $\mathcal{M}$. Additionally, for each modality $m \in \mathcal{M}$, $\mathcal{C}(m) \subseteq \mathcal{C}$ denotes the set of couriers of modality $m$.
Each courier may be assigned multiple delivery tasks at a time but can only pick up a new order after completing its previous delivery. For each courier $c \in \mathcal{C}$ at time $t$, $d_t(c)$ denotes the location where the courier becomes available for a new delivery after $\ell_t(c)$ units of time.
For each modality $m$, the platform offers the incoming customer the courier with the minimum latency among all couriers of that modality. Given order $U_t$, the latency $\ell_t^+(c)$ of courier $c$, if assigned to the order, is defined as the total time required to complete the delivery at location $d_t$, which consists of
1. Completing the remaining delivery tasks, $\ell_t(c)$;
2. Traveling from the previous drop-off location to the new pickup location $o_t$ and then delivering the order to the drop-off location $d_t$, denoted by $\tilde{\ell}_t(c)$:
$$\ell_t^+(c) = \ell_t(c) + \tilde{\ell}_t(c). \qquad (1)$$
For each modality, the platform selects the courier with the smallest latency:
$$\ell_t^*(m) = \min_{c \in \mathcal{C}(m)} \ell_t^+(c). \qquad (2)$$
Note that for a given courier $c$, we refer to $\tilde{\ell}_t(c)$ as the service time for order $U_t$, representing the time required for the courier to pick up and deliver the order after it becomes available.
For each courier of modality $m$, the platform incurs an operational cost per unit of latency, denoted by $\eta_m$.
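As a rough illustration of this bookkeeping, the sketch below computes the service time and the per-modality courier selection; the `Courier` fields and the `travel_time` callable are our own assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Courier:
    modality: str      # e.g., "car" or "drone"
    next_free_at: int  # region d_t(c) where the courier next becomes available
    backlog: float     # remaining latency ell_t(c), in time steps

def service_time(courier, origin, dest, travel_time):
    """tilde-ell_t(c): travel to the new pickup location, then to the drop-off."""
    return (travel_time(courier.modality, courier.next_free_at, origin)
            + travel_time(courier.modality, origin, dest))

def best_courier_per_modality(couriers, origin, dest, travel_time):
    """For each modality m, return the courier minimizing
    ell_t^+(c) = ell_t(c) + tilde-ell_t(c), i.e., Equations (1) and (2)."""
    best = {}
    for c in couriers:
        total = c.backlog + service_time(c, origin, dest, travel_time)
        if c.modality not in best or total < best[c.modality][1]:
            best[c.modality] = (c, total)
    return best  # modality -> (c_t^*, ell_t^*(m))
```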

2.3. Customers’ Behavior

In this model, customers can either select a delivery modality from the available options or decline service, depending on the delivery cost. After submitting a delivery request, the platform responds with a set of available modality options along with their corresponding latencies and prices. The customer can either select an available option and pay the delivery price or choose to wait for a new set of modality options and prices. However, each customer has a limited waiting time and will withdraw their service request if they stay in the queue longer than a predefined number of time steps without receiving an acceptable option. An option is considered acceptable to a customer if it satisfies the following conditions:
1. The price offered by the platform for the modality does not exceed the customer’s maximum acceptable price, $p_{\max}$. Formally, for modality $m$ at time $t$, $p_t(m) \le p_{\max}$.
2. The generalized cost associated with the modality does not exceed the customer’s maximum acceptable cost, denoted by $\bar{G}$.
In this work, we define the Generalized Cost Function (GCF) for a customer located in region $d_t \in \mathcal{R}$, given the modality $m$, as follows:
$$G_t(m) = p_t(m) + \alpha_t \times \ell_t^*(m), \qquad (3)$$
where $\alpha_t$ is a random variable defined as $\alpha_t = \max\{\epsilon, X_t\}$, with $X_t \sim \mathcal{N}(\mu_{d_t}, \sigma_{d_t}^2)$. Here, $\alpha_t$ represents the customer’s value of time, which reflects their sensitivity to delivery latency at time $t$, and $\epsilon > 0$ is a small constant ensuring that the VoT remains strictly positive, avoiding cases where a negative VoT would imply a preference for longer waiting times. Note that customers are heterogeneous, meaning they have different values of time. However, for customers from the same region $d$, VoT values are modeled as independent random variables drawn from a distribution with region-specific parameters $\mu_d$ and $\sigma_d$. This allows the model to capture localized socio-economic differences, such as variations in income levels or service expectations across regions. Additionally, the time-varying nature of $\alpha_t$ enables the model to reflect behavioral variation over time for customers who continue to wait in the system.
Given these conditions, at time $t$, if no modality satisfies the customer’s acceptance criteria, the customer either waits for an updated set of options at time $t+1$ or leaves the queue if their waiting time exceeds the predefined limit. However, if multiple modalities satisfy the acceptance conditions, the customer selects the modality with the lowest generalized cost, denoted by
$$m_t^* = \arg\min_{m \in \mathcal{M}} G_t(m). \qquad (4)$$
The courier corresponding to modality $m_t^*$ is denoted by $c_t^*$.
In this model, we assume that parameters such as $p_{\max}$, $\bar{G}$, and the pair $(\mu_d, \sigma_d)$ for each region $d \in \mathcal{R}$ are known to the platform at each decision step. However, the exact realization of $\alpha_t$ is not revealed to the platform; instead, the platform only observes the customer’s final decision.
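The following sketch summarizes the resulting choice rule; the function signature is our own, and the default thresholds anticipate the $p_{\max}$ and $\bar{G}$ values used later in Section 6.

```python
import numpy as np

def customer_choice(prices, latencies, alpha, p_max=30.0, g_bar=35.0):
    """Return the modality minimizing G_t(m) = p_t(m) + alpha * ell_t^*(m) among the
    acceptable options, or None if nothing is acceptable (the customer waits or leaves)."""
    best_m, best_gc = None, float("inf")
    for m in prices:
        gc = prices[m] + alpha * latencies[m]
        if prices[m] <= p_max and gc <= g_bar and gc < best_gc:
            best_m, best_gc = m, gc
    return best_m

# VoT draw for a customer in region d: alpha_t = max(eps, X_t), X_t ~ N(mu_d, sigma_d^2)
rng = np.random.default_rng(0)
alpha_t = max(1e-3, rng.normal(loc=1.2, scale=0.1))
print(customer_choice({"car": 12.0, "drone": 18.0}, {"car": 25.0, "drone": 10.0}, alpha_t))
```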

2.4. Transportation Network

In this work, we assume that couriers operate within transportation networks that connect all regions, such that there exists at least one path between any Origin–Destination pair.
To capture differences across delivery modalities, we consider a separate transportation network for each modality, while keeping the number of regions (i.e., the nodes of the network) the same across all networks. Each network includes weighted links representing the latency of the corresponding modality, which is the time it takes for a courier of that modality to travel along a link. We assume that couriers have a negligible impact on road congestion; in other words, the number of couriers using the same road at a given time does not affect its latency. As a result, each courier selects the shortest path between the pickup and delivery locations to complete its assigned task.
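A per-modality network with free-flow link latencies and shortest-path routing can be sketched, for example, with networkx; the three links below are made-up placeholders, not the Sioux Falls data.

```python
import networkx as nx

def build_network(links):
    """links: iterable of (region_u, region_v, latency_minutes); undirected, congestion-free."""
    g = nx.Graph()
    g.add_weighted_edges_from(links, weight="latency")
    return g

def travel_time(net, origin, dest):
    """Free-flow travel time along the shortest path between two regions."""
    return nx.shortest_path_length(net, origin, dest, weight="latency")

car_net = build_network([(1, 2, 6.0), (2, 3, 4.0), (1, 3, 9.0)])
print(travel_time(car_net, 1, 3))  # 9.0: the direct link beats the 10-minute detour via node 2
```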

2.5. Meal Delivery Platform Problem

In this model, the platform’s objective is to maximize profit by setting the prices for the available modalities, given the current order. At each time step $t$, the platform offers a price vector $\mathbf{p}_t \in \mathbb{R}^{|\mathcal{M}|}$, where each entry $p_t(m)$ corresponds to a delivery mode $m \in \mathcal{M}$.
Given the current order $U_t$, the customer selects the modality $m_t^*$ (as defined in (4)) or chooses not to proceed with the delivery request. Based on the customer’s decision, the platform earns a profit defined as follows:
$$r_t = \begin{cases} 0 & \text{if } m_t^* = \text{None}, \\ p_t(m^*) - \eta_{m^*} \times \tilde{\ell}_t(c^*) & \text{otherwise}, \end{cases} \qquad (5)$$
where $\eta_{m^*}$ denotes the operational cost per unit of latency for modality $m_t^*$. Note that $\tilde{\ell}_t(c^*)$ represents the service time for the order, which is lower than the latency experienced by the customer, $\ell_t^*(m^*)$ (defined in (2)), for modality $m^*$ and courier $c^*$.
Given Equation (5), the objective of the food delivery platform is to maximize its total profit over a time horizon of $T$ steps:
$$R = \max_{\mathbf{p}_{1:T}} \sum_{t=1}^{T} r_t, \qquad (6)$$
where $\mathbf{p}_{1:T}$ denotes the sequence of pricing vectors $\mathbf{p}_t$ for all time steps $t \in \{1, 2, \dots, T\}$.

3. The Problem Formulation as a Markov Decision Process (MDP)

In this section, we formulate the food delivery problem as an MDP, represented by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{I}, r)$. Here, $\mathcal{S}$ denotes the state space and $\mathcal{A}$ represents the action space. The state transition function is defined by $\mathcal{T}$, while $\mathcal{I}$ specifies the initial state distribution. The reward signal is denoted by $r$. The notations used are listed in Table 1. Note that the food delivery platform operates in discrete time, indexed by the set of positive integers $t \in \{1, 2, \dots\}$.

3.1. State Space S

At each time step $t$, the state of the system, $s_t \in \mathcal{S}$, is defined by the following components:
  • The queue of waiting orders, denoted by $Q_t$. Each order $u \in Q_t$ is characterized by three parameters: the pickup location $o_u$, the drop-off location $d_u$, and the remaining time steps before expiration $\tau_u$. The top order in the queue is denoted by $U_t$, which is the one considered for service at time $t$, with parameters $o_t$, $d_t$, and $\tau_t$.
  • The set of all available couriers, denoted by $\mathcal{C}$, where each courier $c \in \mathcal{C}$ is characterized by three parameters: the modality $m_c \in \mathcal{M}$; the next location $d_t(c) \in \mathcal{R}$ where the courier will become available; and the remaining latency $\ell_t(c)$, which represents the time until the courier becomes available at location $d_t(c)$.

3.2. Action Space A

The action space $\mathcal{A}$ consists of all possible pricing vectors that the platform can offer, given the number of available modalities $|\mathcal{M}|$. Therefore, the action at time $t$ is defined as follows:
$$\mathbf{p}_t \in \mathcal{A} = [0, p_{\max})^{|\mathcal{M}|}, \qquad (7)$$
where $p_{\max}$ is the maximum allowable price for any modality.
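As an illustration, the action space (and one possible flattened observation encoding of the state in Section 3.1) could be declared with Gymnasium `Box` spaces as below; the observation layout and the courier count are our assumptions, and `Box` uses closed intervals, so $[0, p_{\max}]$ approximates $[0, p_{\max})$.

```python
import numpy as np
from gymnasium import spaces

P_MAX = 30.0
N_MODALITIES = 2                      # e.g., cars and drones

# Action: one price per modality, p_t in [0, p_max)^{|M|}.
action_space = spaces.Box(low=0.0, high=P_MAX, shape=(N_MODALITIES,), dtype=np.float32)

# Observation (assumed encoding): (o, d, tau) for the first K visible orders plus
# (modality id, next location, remaining latency) for every courier, flattened.
K_ORDERS, N_COURIERS = 2, 25
obs_dim = 3 * K_ORDERS + 3 * N_COURIERS
observation_space = spaces.Box(low=0.0, high=np.inf, shape=(obs_dim,), dtype=np.float32)
```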

3.3. State Transition Model T

The state transition model $\mathcal{T}$ describes how the system evolves from state $s_t$ to $s_{t+1}$ based on the pricing decision $\mathbf{p}_t$ taken at time $t$. Given the current state $s_t$ and the action $\mathbf{p}_t$, the state transition function $s_{t+1} = \mathcal{T}(s_t, \mathbf{p}_t)$ is defined as follows (a code sketch of the stochastic components is given after the list):
  • $Q_{t+1}$: Given the pricing vector $\mathbf{p}_t$ and the current order $U_t$ at the front of $Q_t$, the customer evaluates the generalized cost of each available modality $m$, $G_t(m)$ (defined in (3)). The customer then either chooses a modality $m_t^*$ and receives the service or exits the queue $Q_t$ if $\tau_t = 0$, indicating the customer is no longer willing to wait for a new pricing option. However, if $\tau_t > 0$, the order $U_t$ remains at the front of $Q_{t+1}$. The selected modality $m_t^*$ is determined as follows:
    $$m_t^* = \begin{cases} \arg\min_{m \in \mathcal{M}} G_t(m) & \text{if } \min_{m \in \mathcal{M}} G_t(m) \le \bar{G}, \\ \text{None} & \text{otherwise}, \end{cases} \qquad (8)$$
    Further, for all $u \in Q_t$, the remaining time until expiration, denoted by $\tau_u$, is decremented by one at each time step, i.e., $\tau_u \leftarrow \tau_u - 1$. Any order with $\tau_u = 0$ is removed from the queue.
    At each time step, new orders arrive according to a Poisson distribution with rate $\lambda_q$, which determines the number of orders added to the queue at time $t$. The pickup and drop-off locations of the new orders are generated randomly, where the regional pickup and drop-off rates are represented by $\Lambda_o \in \mathbb{R}^{|\mathcal{R}|}$ and $\Lambda_d \in \mathbb{R}^{|\mathcal{R}|}$, respectively. Therefore, for each new order $u$, the pickup and drop-off locations are independently sampled with respect to $\Lambda_o$ and $\Lambda_d$, i.e.,
    $$\mathbb{P}(o_u = i) = \lambda_o(i), \quad \mathbb{P}(d_u = i) = \lambda_d(i), \qquad (9)$$
    where $\lambda_o(i)$ and $\lambda_d(i)$ are the $i$th elements of $\Lambda_o$ and $\Lambda_d$, respectively. Additionally, for each new order added to the queue, the expiration timer is set to $\tau_{\max}$.
  • $\mathcal{C}$ at time $t+1$: For each courier $c \in \mathcal{C}$, the remaining time until availability is updated according to
    $$\ell_{t+1}(c) \leftarrow \max\{\ell_t(c) - 1, 0\}. \qquad (10)$$
    If $\ell_t(c) = 0$, the courier has completed its delivery and is available at location $d_t(c)$ for the next order.
    If, for order $U_t$, the courier $c_t^*$ with modality $m_t^*$ is assigned to the delivery, then the state of this courier is updated by setting the next drop-off location to $d_{t+1}(c^*) = d_t$ and updating the remaining time until availability as follows:
    $$\ell_{t+1}(c^*) = \ell_t(c^*) + \tilde{\ell}_t(c^*). \qquad (11)$$
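The sketch below illustrates the stochastic components of this transition (Poisson arrivals, categorical OD sampling, and the courier updates of Equations (10) and (11)); the data structures and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def sample_new_orders(lambda_q, lambda_o, lambda_d, tau_max, regions):
    """Poisson number of new orders; OD pairs sampled from Lambda_o and Lambda_d (Eq. 9)."""
    n_new = rng.poisson(lambda_q)
    orders = []
    for _ in range(n_new):
        o = rng.choice(regions, p=lambda_o)   # pickup region
        d = rng.choice(regions, p=lambda_d)   # drop-off region
        orders.append({"origin": int(o), "destination": int(d), "ttl": tau_max})
    return orders

def age_couriers(backlogs):
    """ell_{t+1}(c) = max(ell_t(c) - 1, 0) for every courier (Eq. 10)."""
    return {c: max(b - 1, 0) for c, b in backlogs.items()}

def assign_order(backlogs, next_free_at, courier, service_time, dropoff):
    """Eq. (11): extend the chosen courier's backlog and move its availability point."""
    backlogs[courier] += service_time
    next_free_at[courier] = dropoff
```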

3.4. The Initial State I

The initial queue of orders, $Q_0$, is empty.
Given the list of couriers and their modalities, each courier is initialized by being assigned to a delivery region uniformly at random, i.e., $d_0(c) \sim \text{Uniform}(\mathcal{R})$. Each courier is also assigned an initial remaining latency $\ell_0(c) = \max\{0, L_c\}$, where $L_c$ is sampled uniformly from an interval $[a, b]$ with $a, b > 0$.

3.5. Reward Signal, r

Given $c_t^*$ and its modality $m_t^*$, the reward signal at time $t$ is
$$r_t = \begin{cases} 0 & \text{if } m_t^* = \text{None}, \\ p_t(m^*) - \eta_{m^*} \times \tilde{\ell}_t(c^*) & \text{otherwise}. \end{cases} \qquad (12)$$

4. Reinforcement Learning

Given the platform’s objective, $R = \max_{\mathbf{p}_{1:T}} \sum_{t=1}^{T} r_t$, and Equation (12), one could in principle solve this optimization problem with a real-time policy that accounts for the current state of the system by formulating a dynamic programming algorithm, assuming full knowledge of the system parameters. Note that the effective latency of an order is a random variable. However, this approach is not feasible for the following reasons:
1. The computational complexity per iteration of value iteration is $O(|\mathcal{A}|\,|\mathcal{S}|^2)$, and since both the action space and the state space are continuous—due to continuous price and latency values—the cardinalities $|\mathcal{A}|$ and $|\mathcal{S}|$ are not finite.
2. In practice, a food delivery platform may not have complete knowledge of the distribution parameters of customers’ value of time.
Therefore, we employ a model-free reinforcement learning algorithm that implements the pricing policy by training a neural network on observed states and rewards, given the selected pricing signals.
Although static policies do not account for full state realizations, we evaluate the performance of the RL agent by first formalizing three heuristic pricing policies and then comparing the profit achieved by the RL agent to that of these baselines.

5. Heuristic Policies

We introduce three heuristic pricing policies for comparison in this problem setup (a code sketch of all three follows the list):
  • Max Price: Since the profit, $r_t = p_t(m^*) - \eta_{m^*}\,\tilde{\ell}_t(c^*)$, is monotonically increasing in the price $p_t(m^*)$, this policy always offers the maximum allowable price $p_{\max}$ to maximize the profit. Additionally, since customers tend to select the courier with the lowest latency—and $\eta_{m^*} > 0$—this behavior helps minimize delivery cost for the platform. However, in systems with generally high latencies, this policy may lead customers to frequently leave the platform without placing an order due to the high generalized cost.
  • Max Order: In this policy, the platform adjusts the prices of all modalities such that the expected generalized cost remains below the customer’s maximum acceptable threshold, in order to maximize the number of confirmed orders.
    Considering the platform’s profit at time $t$, given modality $m_t^*$,
    $$r_t = p_t(m^*) - \eta_{m^*}\,\tilde{\ell}_t(c^*) = G_t(m^*) - \alpha_t\,\ell_t^*(m^*) - \eta_{m^*}\,\tilde{\ell}_t(c^*) \le \bar{G} - \alpha_t\,\ell_t^*(m^*) - \eta_{m^*}\,\tilde{\ell}_t(c^*), \qquad (13)$$
    and if the generalized cost satisfies $G_t(m^*) = \bar{G}$ and it holds that
    $$\bar{G} > \alpha_t\,\ell_t^*(m^*) + \eta_{m^*}\,\tilde{\ell}_t(c^*), \qquad (14)$$
    then the platform’s profit $r_t$ attains its maximum positive value.
    However, if condition (14) does not hold and $r_t < 0$, then the platform benefits from setting a higher price, leading the customer to reject the service and the platform to receive zero profit rather than incurring a loss.
    Since the platform does not have access to customers’ real-time VoT values, it adjusts prices such that the expected generalized cost for each modality $m$ and its corresponding courier $c$ satisfies
    $$\mathbb{E}\{G_t(m) \mid p_t(m), \ell_t^*(m)\} = p_t(m) + \mathbb{E}(\alpha_t) \times \ell_t^*(m) = \bar{G}, \qquad (15)$$
    subject to
    $$r_t = p_t(m) - \eta_m\,\tilde{\ell}_t(c) \ge 0. \qquad (16)$$
    Therefore, in this policy,
    $$p_t(m) = \max\{\bar{G} - \mathbb{E}(\alpha_t)\,\ell_t^*(m),\ \eta_m\,\tilde{\ell}_t(c)\}, \qquad (17)$$
    with
    $$\mathbb{E}(\alpha_t) = \mathbb{E}(\max\{\epsilon, X_t\}) = \epsilon \cdot \Phi\!\left(\frac{\epsilon - \mu_i}{\sigma_i}\right) + \mu_i \cdot \Phi\!\left(\frac{\mu_i - \epsilon}{\sigma_i}\right) + \sigma_i \cdot \phi\!\left(\frac{\epsilon - \mu_i}{\sigma_i}\right), \qquad (18)$$
    where $\phi(x)$ and $\Phi(x)$ are the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF) of the standard normal distribution, respectively.
  • Zone-Based Price: Given that the mean value of customers’ VoT varies across regions, this policy utilizes regional differences to adjust pricing accordingly.
    To motivate this policy, we consider a pair of modalities and examine how pricing signals can be designed to encourage customers to select one modality over the other. In systems with more than two modalities, this pairwise analysis can be extended across all modality combinations to construct a pricing policy that accounts for regional variations in VoT.
    Without loss of generality, consider modalities $m$ and $m'$, with corresponding couriers $c$ and $c'$, such that the operational cost of courier $c'$ is greater than that of courier $c$, i.e., $\eta_{m'}\,\tilde{\ell}_t(c') > \eta_m\,\tilde{\ell}_t(c)$. If the platform reduces the price of modality $m$ relative to $m'$ so that the customer is encouraged to choose $m$ over $m'$ while the platform maintains a higher profit, i.e.,
    $$p_t(m') - p_t(m) < \eta_{m'}\,\tilde{\ell}_t(c') - \eta_m\,\tilde{\ell}_t(c), \qquad (19)$$
    then the platform benefits from this strategy. However, for the customer to select modality $m$ over $m'$, the following condition must hold:
    $$p_t(m) + \alpha_t \times \ell_t^*(m) < p_t(m') + \alpha_t \times \ell_t^*(m'). \qquad (20)$$
    Here, we consider two cases:
    1. $\ell_t^*(m') > \ell_t^*(m)$: When the latency of modality $m'$ is greater than that of modality $m$, Inequality (20) always holds as long as the price of $m'$ is at least that of $m$. In this case, the platform can simply set both $p_t(m)$ and $p_t(m')$ to $p_{\max}$, the maximum acceptable price for an order.
    2. $\ell_t^*(m') < \ell_t^*(m)$: In this case, if $\alpha_t\,(\ell_t^*(m) - \ell_t^*(m')) < \eta_{m'}\,\tilde{\ell}_t(c') - \eta_m\,\tilde{\ell}_t(c)$, then there exists a pricing strategy that satisfies all the conditions. This condition holds when $\alpha_t$ is relatively small compared to the difference in operational costs between the two modalities.
    Although the platform does not observe the real-time value of $\alpha_t$, it can use the expected VoT of the customer, $\mathbb{E}(\alpha_t)$ (defined in (18)). For customers in regions with a lower expected VoT, the platform reduces the price of modality $m$ just enough to satisfy
    $$p_t(m') - p_t(m) = \mathbb{E}(\alpha_t)\,(\ell_t^*(m) - \ell_t^*(m')) + \epsilon_0 < \eta_{m'}\,\tilde{\ell}_t(c') - \eta_m\,\tilde{\ell}_t(c), \qquad (21)$$
    for a small value of $\epsilon_0 > 0$, to encourage the selection of modality $m$ over $m'$.
    This analysis suggests that the platform can adopt a pricing policy that reduces the price of modalities with lower operational cost but higher latency. This makes these options more attractive to customers, especially those with a lower $\mathbb{E}(\alpha_t)$. As a result, the platform can benefit from the lower operational cost while still meeting customer preferences. This trade-off allows the platform to increase overall profit by aligning customer preferences with its own objectives. Additionally, it helps distribute customer demand more effectively across different modalities.
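A minimal sketch of the three heuristics, written for a two-modality example, is given below; the function names and inputs are our own, and the truncated-normal mean implements Equation (18).

```python
import numpy as np
from scipy.stats import norm

P_MAX, G_BAR, EPS = 30.0, 35.0, 1e-3

def expected_vot(mu, sigma, eps=EPS):
    """E[max(eps, X)] for X ~ N(mu, sigma^2), as in Equation (18)."""
    z = (eps - mu) / sigma
    return eps * norm.cdf(z) + mu * norm.cdf(-z) + sigma * norm.pdf(z)

def max_price_policy(modalities):
    """Max Price: always quote the maximum allowable price."""
    return {m: P_MAX for m in modalities}

def max_order_policy(latencies, service_times, eta, mu, sigma):
    """Max Order: drive the expected GC to G_bar, floored at the operational cost (Eq. 17)."""
    e_alpha = expected_vot(mu, sigma)
    return {m: max(G_BAR - e_alpha * latencies[m], eta[m] * service_times[m])
            for m in latencies}

def zone_based_policy(latencies, service_times, eta, mu, sigma, eps0=0.01):
    """Zone-Based: discount the cheaper-to-operate but slower modality in low-VoT regions."""
    prices = {m: P_MAX for m in latencies}
    e_alpha = expected_vot(mu, sigma)
    m, m_prime = sorted(latencies, key=lambda k: eta[k] * service_times[k])  # m: lower op. cost
    gap = eta[m_prime] * service_times[m_prime] - eta[m] * service_times[m]
    if latencies[m] > latencies[m_prime]:            # case 2: m is slower but cheaper to operate
        discount = e_alpha * (latencies[m] - latencies[m_prime]) + eps0     # Eq. (21)
        if discount < gap:
            prices[m] = max(P_MAX - discount, 0.0)
    return prices

# Illustrative inputs with the Section 6 cost parameters and a low-VoT suburban region:
lat = {"car": 14.0, "drone": 12.0}   # ell_t^*(m), minutes
svc = {"car": 8.0, "drone": 10.0}    # tilde-ell_t(c), minutes
eta = {"car": 1.0, "drone": 2.0}     # $ per minute of latency
print(zone_based_policy(lat, svc, eta, mu=0.8, sigma=0.1))  # car slightly discounted below p_max
```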

6. Numerical Experiment

In this section, we analyze the performance of a reinforcement learning (RL) policy as a pricing mechanism within an instance of our meal delivery platform. The trained RL agent is compared against a set of heuristic policies to evaluate its effectiveness. To implement the dynamics of the meal delivery environment, we use the Gymnasium Python package (version 0.29.1) [28], which provides a standardized API for single-agent RL environments. We adopt the Proximal Policy Optimization algorithm to train the RL policy [14]. PPO is a model-free algorithm, known for its stability and sample efficiency, and it supports continuous state and action spaces. These features make PPO particularly suitable for our system. While algorithms such as Twin Delayed Deep Deterministic Policy Gradient (TD3) [29] and Soft Actor–Critic (SAC) [30] also support continuous action spaces, our primary goal in this work is to demonstrate how deep reinforcement learning can improve platform performance. Therefore, we focus on PPO as the main algorithm in our experiments. Nonetheless, we include a comparison of agents trained with different algorithms and default hyperparameters in Appendix A. For the PPO implementation, we utilize Stable Baselines3 (SB3) (version 2.3.2), a set of reliable implementations of reinforcement learning algorithms in PyTorch [31]. All experiments are tracked using the Weights and Biases platform (version 0.19.1) [32].

6.1. Sioux Falls Transportation Network

In this work, we implement our platform using the real-world transportation network, the Sioux Falls network [33] (Figure 2 and Figure 3), originally introduced in [34], which has since become a standard benchmark for transportation network simulations.
The dataset consists of the free-flow travel times for vehicles (e.g., cars) along 38 undirected links connecting 24 regions; both the free-flow travel times and the discrete time steps are expressed in minutes. Note that if a link exists between two nodes, vehicles are assumed to be able to travel in both directions along that link.
To simplify the problem, we partition the 24 regions into three categories:
  • The Suburban areas consist of the regions $i \in \{1, 2, 3, 6, 7, 12, 13, 20, 21, 24\}$.
  • The Inner Suburbs (Inner Ring) areas consist of the regions $i \in \{4, 5, 8, 14, 15, 18, 19, 22, 23\}$.
  • The Downtown areas consist of the regions $i \in \{9, 10, 11, 16, 17\}$.
Then we make the following assumptions:
1. All arriving orders have their pickup locations in the Downtown region. Therefore, the regional pickup rate is set to $\lambda_o(i) = 0.2$ for all $i \in \{9, 10, 11, 16, 17\}$ and $\lambda_o(i) = 0$ for all other regions, so that $\sum_{i \in \mathcal{R}} \lambda_o(i) = 1$. This assumption reflects the fact that restaurants are more concentrated in Downtown areas, making it more likely for customers to place orders from the diverse and popular options available there rather than from the limited local options elsewhere. Additionally, new orders arrive at a rate of $\lambda_q = 2$.
2. Orders may be placed by customers in any region. The regional drop-off rate is defined as $\lambda_d(i) = 0.01$ for Downtown regions ($i \in \{9, 10, 11, 16, 17\}$) and $\lambda_d(i) = 0.05$ for all other regions. Note that $\sum_{i \in \mathcal{R}} \lambda_d(i) = 1$.
3. For region $i$, the VoT is $\alpha_t = \max\{\epsilon, X_t\}$ (measured in \$ per minute) with $\epsilon = 0.001$, and the distribution of $X_t \sim \mathcal{N}(\mu_i, \sigma_i^2)$ is defined as follows:
  • For the Suburban areas, where $i \in \{1, 2, 3, 6, 7, 12, 13, 20, 21, 24\}$, $X_t \sim \mathcal{N}(0.8, 0.01)$.
  • For the Inner Suburbs areas, where $i \in \{4, 5, 8, 14, 15, 18, 19, 22, 23\}$, $X_t \sim \mathcal{N}(1.2, 0.01)$.
  • For the Downtown areas, where $i \in \{9, 10, 11, 16, 17\}$, $X_t \sim \mathcal{N}(5, 0.01)$.
In addition, the maximum acceptable price for customers is set to $p_{\max} = \$30$, and the maximum acceptable generalized cost is $\bar{G} = \$35$. The length of the queue $Q_t$ may vary over time; however, to limit the size of the state space, the agent observes only the first two orders in the queue at each time step. Similarly, the maximum waiting time in the queue before expiration is set to $\tau_{\max} = 2$ time steps. We consider two delivery modalities, cars and drones, where each episode includes a total of 15 cars and 10 drones. The operational cost is defined per unit of latency (\$ per minute), with $\eta_c = 1$ for cars and $\eta_d = 2$ for drones. The initial latency for each courier is set to $\max\{0, L_c\}$ minutes, where $L_c$ is uniformly sampled from the interval $[1, 40]$.
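For reference, the experimental parameters above can be collected into a single configuration, as sketched below; the dictionary layout is our own organizational choice, not part of the released code.

```python
DOWNTOWN = {9, 10, 11, 16, 17}
INNER_RING = {4, 5, 8, 14, 15, 18, 19, 22, 23}
SUBURBAN = {1, 2, 3, 6, 7, 12, 13, 20, 21, 24}

CONFIG = {
    "lambda_q": 2,                                                   # mean new orders per step
    "lambda_o": {i: (0.2 if i in DOWNTOWN else 0.0) for i in range(1, 25)},
    "lambda_d": {i: (0.01 if i in DOWNTOWN else 0.05) for i in range(1, 25)},
    "vot": {                                                         # (mu_i, sigma_i) in $/min
        **{i: (0.8, 0.1) for i in SUBURBAN},
        **{i: (1.2, 0.1) for i in INNER_RING},
        **{i: (5.0, 0.1) for i in DOWNTOWN},
    },
    "eps": 0.001,
    "p_max": 30.0,
    "g_bar": 35.0,
    "tau_max": 2,
    "visible_orders": 2,
    "fleet": {"car": 15, "drone": 10},
    "eta": {"car": 1.0, "drone": 2.0},                               # $ per minute of latency
    "initial_backlog_range": (1, 40),                                # minutes, then max(0, .)
}
```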
Given that the Sioux Falls dataset specifies link latency only for cars, we consider two different scenarios for modeling drone logistics within the same network and train a separate RL policy for each scenario.

6.1.1. Scenario 1: Complete Graph

Given that drones travel through the air, they are capable of following direct paths between regions. In this scenario, we model the drone network as a complete graph (Figure 4), where drones can move directly between any pair of regions. Assuming a constant drone speed [5], and using the coordinate data of the nodes from [33], we calibrate drone speeds so that their travel latency is approximately equal to that of cars in areas where vehicle speed limits are typically around 35–40 mph.
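A sketch of how such a complete drone graph could be built from node coordinates is given below; the coordinates and the drone speed shown are placeholders, whereas the paper calibrates the speed against car links with 35–40 mph speed limits.

```python
import itertools
import math
import networkx as nx

coords = {1: (0.0, 0.0), 2: (3.0, 0.0), 3: (0.0, 4.0)}   # node -> (x, y), placeholder units
DRONE_SPEED = 0.6                                         # distance units per minute (assumed)

drone_net = nx.Graph()
for u, v in itertools.combinations(coords, 2):
    dist = math.dist(coords[u], coords[v])
    drone_net.add_edge(u, v, latency=dist / DRONE_SPEED)  # direct aerial link latency
```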

6.1.2. Scenario 2: Sioux Falls Network Graph

In this scenario, drone routing is restricted to the same transportation network used by cars. This setting reflects environments where tall buildings or flight restrictions prevent drones from taking direct aerial paths. As a result, drones are constrained to follow the same road-based network as cars.
To capture logistical differences between the two modalities, we assume equal travel speeds for cars and drones along all links, except for those connected to at least one Downtown region—i.e., links where at least one endpoint is a node $i \in \{9, 10, 11, 16, 17\}$. For these links, we apply relative speed factors of 0.7 for cars and 0.9 for drones (both less than 1). These adjustments account for the fact that vehicles like cars are more affected by traffic congestion in dense urban areas.

6.1.3. Hyperparameters for PPO

For both scenarios, to train the RL agent using the PPO algorithm, we employ a neural network with two hidden layers, each containing 128 neurons. Training is performed using eight parallel environments, with a time limit of 200 steps per episode.
We set the learning rate to $1 \times 10^{-4}$ and use $n_{\text{step}} = 1024$, which defines the number of steps collected across the eight environments before each policy update. A discount factor of $\gamma = 0.99$ is used, reflecting the assumption that the importance of future profits remains consistent over time. The mini-batch size is set to 128, and training is carried out over a total horizon of 5,000,000 steps. All other hyperparameters are kept at their default values, as provided in Stable Baselines3 [31].
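A sketch of this training setup with Stable Baselines3 is shown below; the environment id "MealDelivery-v0" is a hypothetical name for our Gymnasium environment, and the clip range and entropy coefficient follow the PPO1 row of Table A1.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# "MealDelivery-v0" must first be registered with Gymnasium; the id here is hypothetical.
vec_env = make_vec_env("MealDelivery-v0", n_envs=8)   # eight parallel environments

model = PPO(
    "MlpPolicy",
    vec_env,
    learning_rate=1e-4,
    n_steps=1024,                                # rollout length per update (Section 6.1.3)
    batch_size=128,
    gamma=0.99,
    clip_range=0.2,                              # PPO1 row of Table A1
    ent_coef=0.001,                              # PPO1 row of Table A1
    policy_kwargs=dict(net_arch=[128, 128]),     # two hidden layers of 128 neurons
    verbose=1,
)
model.learn(total_timesteps=5_000_000)
model.save("ppo_meal_delivery")
```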
Figure 5 illustrates the trajectory of episode-level profits over the training horizon for both scenarios.

6.2. Results

In this section we discuss the performance of the employed RL agents and evaluate them by comparing their performance with that of the heuristic policies introduced in Section 5.

6.2.1. Evaluation Metrics

To evaluate the RL policies, we compare the total profit achieved by each RL policy to that of the heuristic policies within each scenario. Additionally, we analyze and compare the following aspects:
  • Order: Total number of orders served during an episode.
  • Latency: Average latency of served orders within an episode.
  • Price: Average price paid for orders during an episode.
  • GC (Generalized Cost): Average generalized cost of served orders during an episode.
  • Profit: Cumulative reward collected over an episode (200 time steps).
Note that we conduct the evaluation over 100 episodes, each consisting of 200 time steps. The values reported in each column represent the mean of each metric across the 100 episodes. Since each episode spans 200 time steps, the maximum possible number of orders per episode is 200.
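The evaluation protocol can be sketched as follows, assuming a Gymnasium-style environment and an SB3-style `predict` interface; only the profit metric is shown, and the other metrics would be accumulated analogously from the environment's info.

```python
import numpy as np

def evaluate(policy, env, n_episodes=100, horizon=200):
    """Average episode profit of a policy over n_episodes rollouts of at most `horizon` steps."""
    profits = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        total = 0.0
        for _ in range(horizon):
            action, _ = policy.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            if terminated or truncated:
                break
        profits.append(total)
    return float(np.mean(profits)), float(np.std(profits))
```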

6.2.2. Scenario 1

In this case, we evaluate and record the performance of the RL and heuristic policies in an environment where drones are allowed to follow direct paths between regions. Figure 6 illustrates the profit trajectories for each policy over 100 episodes, each consisting of 200 time steps. The results show that the RL policy consistently outperforms all heuristic policies by a significant gap. In Table 2, we observe that the RL policy achieves a mean profit of 1267.3 , which is 143 % higher than the best-performing baseline policy, Max Order. Comparing these two policies reveals that the RL agent learns to offer prices that lead to a more efficient allocation of resources. Specifically, the RL policy discourages customers from accepting deliveries associated with very high latencies, resulting in lower overall latency for the platform.
In contrast, baseline policies such as Max Price and Zone-Based set prices so high that a significant portion of orders are rejected—only around 10 % of orders are served. Furthermore, for the customers whose orders were accepted, the average generalized cost under these baseline policies was higher than under the RL policy. This indicates that the RL policy not only achieves greater profitability but also maintains lower generalized costs for the customers it serves.

6.2.3. Scenario 2

In this case, we follow the same evaluation procedure to assess the RL policy in an environment where drone movement is restricted to the same routing network as cars. Figure 7 shows the profit trajectories of all four policies in this scenario. The results demonstrate a notable improvement in profit when using the RL policy compared to the baselines.
Table 3 shows that the RL policy achieves a mean profit of 198.8 , which is approximately 56 % higher than that of the best-performing heuristic policy, the Zone-Based policy. Comparing their performance, the RL agent strategically lowers prices to increase the number of accepted orders, compensating for the reduced per-order revenue by achieving greater overall profit. This also results in lower generalized costs for the customers it serves. In this scenario, the Zone-Based policy slightly outperforms the Max Price policy by encouraging customers with a lower VoT to choose cars over drones through adjusted pricing.
Compared to the Max Order baseline, we observe a similar pattern as in the previous scenario: the RL agent learns a pricing mechanism that discourages customers with high-latency orders from placing requests, thereby helping to maintain lower overall latency across the platform. However, in this scenario, the restriction on drone movement results in slightly higher average latencies compared to the first scenario. This contributes to the poor performance of the Max Order policy, as it attempts to serve as many orders as possible without accounting for their high latencies.

7. Discussion

In this work, we address the problem of pricing mechanisms for meal delivery platforms that employ multiple courier modalities to serve a heterogeneous population of customers. We developed a deep reinforcement learning (RL) policy for real-time pricing using a real-world transportation network. To evaluate the effectiveness of our approach, we also introduced three heuristic baseline policies for comparison. We trained two separate RL policies under different assumptions regarding drone transportation logistics. The results suggest that, given the inherent stochasticity and complexity of the problem, an RL-based pricing mechanism can lead to significantly higher platform profits compared to heuristic approaches.
While our work evaluates two distinct scenarios capturing different drone movement dynamics, one limitation of the current approach is its simplified assumption regarding drone-related regulatory and operational constraints. In practice, drones can face limitations such as no-fly zones, weather-related operational variability, and designated landing areas. Incorporating these restrictions into the transportation network and courier availability modeling would result in more realistic and operationally feasible models in future works.
Additionally, several modeling simplifications were introduced to isolate and study specific interactions between customers and the platform. For example, we assume an always-available fleet of couriers throughout each episode. Although this assumption can represent a meal delivery platform that maintains fixed courier availability throughout a work shift (one episode), in reality, the number of available couriers can vary both within a single episode and across different episodes. Introducing a probabilistic model of courier availability could result in more robust pricing policies.
We also assume that the platform has full knowledge of parameters (e.g., value of time distribution, maximum price threshold, and generalized cost threshold) when implementing heuristic policies. However, in real-world systems, these parameters may be unknown and can vary over time. Future work could explore context-sensitive modeling of these parameters or apply online learning approaches such as bandit algorithms to estimate customers’ behavioral preferences.
Furthermore, our performance evaluation is limited to a single transportation network. Extending the model to a broader range of urban networks could be the next step in assessing its generalization. However, many publicly available datasets lack features (e.g., geographic coordinates or road classification) required for modeling a multi-modal meal delivery platform. Enhancing these datasets and underlying transportation models can support scaling the trained policy to larger urban areas.
Considering that this study demonstrates the potential of using deep reinforcement learning for dynamic pricing in meal delivery platforms, further work is required to address the current modeling limitations and to adapt the approach for more realistic and complex environments.

Author Contributions

Conceptualization, A.Z., M.B., M.A., and R.P.; methodology, A.Z. and M.B.; software, A.Z. and M.B.; validation, A.Z., M.A., and R.P.; formal analysis, A.Z.; investigation, A.Z.; resources, A.Z., M.B., M.A., and R.P.; data curation, A.Z.; writing—original draft preparation, A.Z.; writing—review and editing, A.Z., M.A., and R.P.; visualization, A.Z.; supervision, M.A. and R.P.; funding acquisition, M.A. and R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by NSF grant #2419982.

Data Availability Statement

The original data presented in this study are openly available at https://github.com/ArghavanZ/RL_FoodDelivery.git (accessed on 9 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Comparison of Reinforcement Learning Algorithms

In this section, we compare the performance of various RL agents trained using different algorithms and hyperparameter configurations. While the agents differ in their learning algorithms and several hyperparameters, all are trained over a fixed horizon of 5,000,000 time steps, with a discount factor of γ = 0.99 , and a neural network architecture with two hidden layers of 128 units each, as described in Section 6.1.3.
To do so, we first evaluate seven PPO agents, each trained with distinct hyperparameter settings, to examine the sensitivity of performance to these settings. For each scenario, all agents are trained using the same random seed, and the trajectories are collected using eight parallel environments. The list of agents and their corresponding hyperparameters is provided in Table A1. Using the same evaluation method as in Section 6.2, Table A2 illustrates how performance varies across different hyperparameter configurations.
Among the agents, PPO1 and PPO3 achieve relatively higher performance across Scenarios 1 and 2. Therefore, we selected the hyperparameters of PPO1 for our main proposed model. Note that the model used to generate the results in Table 2 is trained using a different random seed than PPO1.
Figure A1 illustrates the training curves of all PPO agents. PPO1, PPO3, and PPO5 exhibited the most stable performance during training for Scenario 1. In Scenario 2, however, the stability in performance does not vary significantly across agents. Note that the horizontal axis in Figure A1 represents the number of policy updates rather than the total number of environment steps. Therefore, PPO6 and PPO7, which use twice the value of n steps , have only half as many updates compared to the other agents.
Table A1. List of PPO agents with different hyperparameters. PPO6 uses the default values provided in the Stable Baselines3 (SB3) implementation.
Policy | Learning Rate | n_step | Batch Size | Clip Range | Entropy Coefficient
PPO1 | 0.0001 | 1024 | 128 | 0.2 | 0.001
PPO2 | 0.0001 | 1024 | 128 | 0.5 | 0.001
PPO3 | 0.0001 | 1024 | 128 | 0.2 | 0.005
PPO4 | 0.0001 | 1024 | 64 | 0.2 | 0.001
PPO5 | 0.0003 | 1024 | 128 | 0.2 | 0.001
PPO6 | 0.0003 | 2048 | 64 | 0.2 | 0.001
PPO7 | 0.0003 | 2048 | 128 | 0.2 | 0.001
Table A2. Comparison of profit among PPO agents with different hyperparameters for each scenario. PPO1 and PPO3 perform relatively better than the others in both scenarios.
Policy | PPO1 | PPO2 | PPO3 | PPO4 | PPO5 | PPO6 | PPO7
Scenario 1 | 1255.1 ± 130 | 752.4 ± 162 | 1295.5 ± 126 | 813.8 ± 130 | 1328.8 ± 114 | 1303.2 ± 119 | 725.2 ± 137
Scenario 2 | 198.8 ± 42 | 197.3 ± 38 | 194.3 ± 36 | 181.2 ± 41 | 166.2 ± 42 | 168.1 ± 44 | 183 ± 42
Figure A1. Training curves for all PPO agents. (a) Shows the training curves for Scenario 1 and (b) illustrates the curves for Scenario 2.
Next, we compare the performance of three RL algorithms: PPO, SAC, and TD3. SAC and TD3 are trained using the default hyperparameter settings provided in the Stable Baselines3 implementation [31]. PPO6 is also trained using the default PPO configuration, while PPO1 is trained with our proposed hyperparameters, as described in Section 6.1.3. Table A3 presents the average profit achieved by each agent across the two scenarios. While SAC performs comparably to PPO in both scenarios, PPO outperforms SAC in Scenario 1, whereas under the more congested conditions of Scenario 2, SAC slightly outperforms PPO. TD3 shows the weakest performance in both scenarios, with results closely aligned with the heuristic policy Max Price. Figure A2 shows the training curves for each algorithm.
Additionally, since SAC and TD3 do not support sample collection from parallel environments in their standard implementations, the PPO agents train noticeably faster than the SAC and TD3 agents. These findings further support the decision to use PPO as the primary algorithm for our proposed pricing mechanism.
Table A3. Comparison of profit among different RL algorithms. PPO6, SAC, and TD3 were each trained under the default implementation parameters from SB3, while PPO1 was trained using our proposed hyperparameters.
Policy | PPO1 | PPO6 | SAC | TD3
Scenario 1 | 1255.1 ± 130 | 1303.2 ± 119 | 1239.1 ± 132 | 512.5 ± 104
Scenario 2 | 198.8 ± 42 | 168.1 ± 44 | 203.4 ± 43 | 125.5 ± 45
Figure A2. Training curves for PPO, SAC, and TD3 agents. (a) Shows the training curves for Scenario 1 and (b) illustrates the curves for Scenario 2.

References

  1. DoorDash. DoorDash Releases Fourth Quarter and Full Year 2024 Financial Results, February 2025. Available online: https://ir.doordash.com/news/news-details/2025/DoorDash-Releases-Fourth-Quarter-and-Full-Year-2024-Financial-Results/default.aspx (accessed on 12 May 2025).
  2. Uber Technologies Investment. Uber Announces Results for Fourth Quarter and Full Year 2024, February 2025. Available online: https://investor.uber.com/news-events/news/press-release-details/2025/Uber-Announces-Results-for-Fourth-Quarter-and-Full-Year-2024/default.aspx (accessed on 12 May 2025).
  3. Moshref-Javadi, M.; Winkenbach, M. Applications and Research avenues for drone-based models in logistics: A classification and review. Expert Syst. Appl. 2021, 177, 114854. [Google Scholar] [CrossRef]
  4. Garg, V.; Niranjan, S.; Prybutok, V.; Pohlen, T.; Gligor, D. Drones in last-mile delivery: A systematic review on Efficiency, Accessibility, and Sustainability. Transp. Res. Part D Transp. Environ. 2023, 123, 103831. [Google Scholar] [CrossRef]
  5. Thiels, C.A.; Aho, J.M.; Zietlow, S.P.; Jenkins, D.H. Use of Unmanned Aerial Vehicles for Medical Product Transport. Air Med. J. 2015, 34, 104–108. [Google Scholar] [CrossRef]
  6. Hwang, J.; Kim, I.; Gulzar, M.A. Understanding the eco-friendly role of drone food delivery services: Deepening the theory of planned behavior. Sustainability 2020, 12, 1440. [Google Scholar] [CrossRef]
  7. Goodchild, A.; Toy, J. Delivery by drone: An evaluation of unmanned aerial vehicle technology in reducing CO2 emissions in the delivery service industry. Transp. Res. Part D Transp. Environ. 2018, 61, 58–67. [Google Scholar] [CrossRef]
  8. Kim, J.J.; Kim, I.; Hwang, J. A change of perceived innovativeness for contactless food delivery services using drones after the outbreak of COVID-19. Int. J. Hosp. Manag. 2021, 93, 102758. [Google Scholar] [CrossRef]
  9. Abbasi, G.A.; Rodriguez-López, M.E.; Higueras-Castillo, E.; Liébana-Cabanillas, F. Drones in food delivery: An analysis of consumer values and perspectives. Int. J. Logist. Res. Appl. 2024, 1–21. [Google Scholar] [CrossRef]
  10. Koay, K.Y.; Leong, M.K. Understanding consumers’ intentions to use drone food delivery services: A perspective of the theory of consumption values. Asia-Pac. J. Bus. Adm. 2023, 16, 1226–1240. [Google Scholar] [CrossRef]
  11. Waris, I.; Ali, R.; Nayyar, A.; Baz, M.; Liu, R.; Hameed, I. An empirical evaluation of customers’ adoption of drone food delivery services: An extended technology acceptance model. Sustainability 2022, 14, 2922. [Google Scholar] [CrossRef]
  12. Liébana-Cabanillas, F.; Rodríguez-López, M.E.; Abbasi, G.A.; Higueras-Castillo, E. A behavioral study of food delivery service by drones: Insights from urban and rural consumers. Int. J. Hosp. Manag. 2025, 127, 104098. [Google Scholar] [CrossRef]
  13. Beliaev, M.; Mehr, N.; Pedarsani, R. Pricing for multi-modal pickup and delivery problems with heterogeneous users. Transp. Res. Part C Emerg. Technol. 2024, 169, 104864. [Google Scholar] [CrossRef]
  14. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  15. Dorling, K.; Heinrichs, J.; Messier, G.G.; Magierowski, S. Vehicle routing problems for drone delivery. IEEE Trans. Syst. Man Cybern. Syst. 2016, 47, 70–85. [Google Scholar] [CrossRef]
  16. Attenni, G.; Arrigoni, V.; Bartolini, N.; Maselli, G. Drone-based delivery systems: A survey on route planning. IEEE Access 2023, 11, 123476–123504. [Google Scholar] [CrossRef]
  17. Beliaev, M.; Mehr, N.; Pedarsani, R. Congestion-aware bi-modal delivery systems utilizing drones. Future Transp. 2023, 3, 329–348. [Google Scholar] [CrossRef]
  18. Chen, C.; Demir, E.; Hu, X.; Huang, H. Transforming last mile delivery with heterogeneous assistants: Drones and delivery robots. J. Heuristics 2025, 31, 8. [Google Scholar] [CrossRef]
  19. Samouh, F.; Gluza, V.; Djavadian, S.; Meshkani, S.; Farooq, B. Multimodal Autonomous Last-Mile Delivery System Design and Application. In Proceedings of the 2020 IEEE International Smart Cities Conference (ISC2), Piscataway, NJ, USA, 28 September–1 October 2020; pp. 1–7. [Google Scholar] [CrossRef]
  20. Liu, Y. An optimization-driven dynamic vehicle routing algorithm for on-demand meal delivery using drones. Comput. Oper. Res. 2019, 111, 1–20. [Google Scholar] [CrossRef]
  21. Jahanshahi, H.; Bozanta, A.; Cevik, M.; Kavuk, E.M.; Tosun, A.; Sonuc, S.B.; Kosucu, B.; Başar, A. A deep reinforcement learning approach for the meal delivery problem. Knowl.-Based Syst. 2022, 243, 108489. [Google Scholar] [CrossRef]
  22. Bozanta, A.; Cevik, M.; Kavaklioglu, C.; Kavuk, E.M.; Tosun, A.; Sonuc, S.B.; Duranel, A.; Basar, A. Courier routing and assignment for food delivery service using reinforcement learning. Comput. Ind. Eng. 2022, 164, 107871. [Google Scholar] [CrossRef]
  23. Zou, G.; Tang, J.; Yilmaz, L.; Kong, X. Online food ordering delivery strategies based on deep reinforcement learning. Appl. Intell. 2022, 56, 6853–6865. [Google Scholar] [CrossRef]
  24. Mehra, A.; Saha, S.; Raychoudhury, V.; Mathur, A. DeliverAI: Reinforcement Learning Based Distributed Path-Sharing Network for Food Deliveries. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–9. [Google Scholar]
  25. Li, M.; Qin, Z.; Jiao, Y.; Yang, Y.; Wang, J.; Wang, C.; Wu, G.; Ye, J. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 983–994. [Google Scholar]
  26. Bi, Z.; Guo, X.; Wang, J.; Qin, S.; Liu, G. Deep reinforcement learning for truck-drone delivery problem. Drones 2023, 7, 445. [Google Scholar] [CrossRef]
  27. Chen, X.; Ulmer, M.W.; Thomas, B.W. Deep Q-learning for same-day delivery with vehicles and drones. Eur. J. Oper. Res. 2022, 298, 939–952. [Google Scholar] [CrossRef]
  28. Towers, M.; Kwiatkowski, A.; Terry, J.K.; Balis, J.U.; de Cola, G.; Deleu, T.; Goulão, M.; Kallinteris, A.; Krimmel, M.; KG, A.; et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments. Available online: https://github.com/Farama-Foundation/Gymnasium (accessed on 15 May 2025).
  29. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  30. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  31. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  32. Biewald, L. Experiment Tracking with Weights and Biases. 2020. Available online: https://www.wandb.com/ (accessed on 9 August 2025).
  33. Cabannes, T. TransportationNetworks, SiouxFalls, October 2021. Available online: https://github.com/bstabler/TransportationNetworks (accessed on 17 May 2025).
  34. Abdulaal, M.; LeBlanc, L.J. Continuous equilibrium network design models. Transp. Res. Part B Methodol. 1979, 13, 19–32. [Google Scholar] [CrossRef]
Figure 1. Illustration of the meal delivery service. Heterogeneous customers are distributed across different regions of a transportation network (indicated by different colors) and may place orders from various regions. Delivery is performed through multiple transportation modalities, each operating within a distinct transportation network, resulting in different routing options. The arrows indicate the links between the regions (nodes), while dotted lines connect identical nodes across the networks.
Figure 2. The Sioux Falls transportation network map.
Figure 3. Sioux Falls graph for cars. The green nodes represent Downtown regions, the orange nodes indicate the Inner Ring, and the blue nodes represent Suburban areas. Each link represents a connection between two nodes and is labeled with the corresponding travel time.
Figure 4. Sioux Falls graph for drones. The green nodes represent Downtown regions, the orange nodes indicate the Inner Ring, and the blue nodes represent Suburban areas.
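To illustrate how the modality-specific graphs in Figures 3 and 4 can be represented in code, the sketch below builds one directed graph per modality with travel-time edge weights and queries the fastest route for each. This is a minimal sketch: the node labels, edges, and travel times are illustrative placeholders rather than the actual Sioux Falls data [33].

```python
import networkx as nx

# Build one directed graph per modality; edge weights are travel times in minutes.
# The nodes, edges, and times below are placeholders, not the Sioux Falls network.
car_graph = nx.DiGraph()
car_graph.add_weighted_edges_from(
    [(1, 2, 6.0), (2, 3, 4.0), (1, 3, 12.0)], weight="travel_time"
)

drone_graph = nx.DiGraph()
drone_graph.add_weighted_edges_from(
    [(1, 3, 5.0), (3, 1, 5.0)], weight="travel_time"  # direct aerial link
)

graphs = {"car": car_graph, "drone": drone_graph}

def fastest_travel_time(modality: str, origin: int, destination: int) -> float:
    """Shortest travel time between two regions for a given modality."""
    return nx.shortest_path_length(
        graphs[modality], origin, destination, weight="travel_time"
    )

print(fastest_travel_time("car", 1, 3))    # 10.0, via node 2
print(fastest_travel_time("drone", 1, 3))  # 5.0, direct aerial link
```

Keeping a separate graph per modality makes it straightforward to compare routing options for the same origin-destination pair across cars and drones.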
Figure 5. The reward trajectories of the RL agent during the training phase. In (a), the RL agent was trained in the environment where drones take direct paths between regions (scenario 1). In (b), the RL policy was trained in the environment where drone movement is restricted (scenario 2).
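Training curves like those in Figure 5 come from running a DRL algorithm on a Gymnasium-style environment [28] with Stable-Baselines3 [31]. The sketch below shows one plausible setup using SAC [30]; the environment class MealDeliveryEnv, its scenario argument, and the hyperparameter values are hypothetical placeholders, not the configuration reported in the paper.

```python
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.common.monitor import Monitor

# MealDeliveryEnv is a hypothetical Gymnasium-compatible environment exposing the
# pricing problem (observation: queue/courier state, action: prices per modality).
from meal_delivery_env import MealDeliveryEnv  # placeholder module name

env = Monitor(MealDeliveryEnv(scenario=1))  # Monitor records per-episode rewards

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,    # illustrative hyperparameters, not the paper's values
    buffer_size=100_000,
    verbose=1,
)
model.learn(total_timesteps=200_000)

# Per-episode training rewards, e.g., for plotting curves like those in Figure 5.
episode_rewards = env.get_episode_rewards()
print("mean reward over last 10 episodes:", np.mean(episode_rewards[-10:]))
```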
Figure 6. Profit trajectories for the first scenario over 100 episodes, each consisting of 200 time steps. (a) Shows the cumulative profit of the meal delivery platform under different policies. (b) Illustrates the profit aggregated per episode.
Figure 7. Profit trajectories for the second scenario over 100 episodes, each consisting of 200 time steps. (a) Shows the cumulative profit of the meal delivery platform under different policies. (b) Illustrates the profit aggregated per episode.
Table 1. Table of notations.
Symbol: Description
$\mathcal{R}$: Set of regions
$\Lambda_o \in \mathbb{R}^{|\mathcal{R}|}$: The vector of pickup location rates
$\Lambda_d \in \mathbb{R}^{|\mathcal{R}|}$: The vector of drop-off location rates
$\lambda_q \in \mathbb{R}^{|\mathcal{R}|}$: The rate of order arrivals
$\tau_{\max}$: The maximum time that an order will stay in the queue
$t$: The current time step
$T$: The total number of time steps
$Q_t$: The queue of orders at time $t$
$U_t$: The order to be served at time $t$, which is at the top of $Q_t$
$o_t$: The pickup location (origin) of the order $U_t$
$d_t$: The drop-off location (destination) of the order $U_t$
$\tau_t$: The remaining time to expire for the order $U_t$
$\mathcal{M}$: The set of modalities
$\eta_m$: The normalized cost of operation for each modality $m$
$p_t(m)$: The price of modality $m$ at time $t$
$p_{\max}$: The maximum acceptable price for customers
$\mathcal{C}$: Set of all the couriers
$\mathcal{C}(m)$: Set of all the couriers of modality $m$
$d_t(c)$: The drop-off location where the courier $c$ will become available
$\ell_t(c)$: Latency of courier $c$ to be available at $d_t(c)$
$\ell_t^{+}(c)$: The latency of courier $c$ to arrive at the drop-off location of order $U_t$
$\tilde{\ell}_t(c)$: The service time for order $U_t$ under courier $c$
$m_t^{*}$: The chosen modality to serve $U_t$
$c_t^{*}$: The chosen courier to serve $U_t$
$\bar{G}$: The maximum acceptable generalized cost of the customers
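To make the notation in Table 1 concrete, the sketch below maps the order and courier quantities to simple Python containers. The field names and modality labels are illustrative assumptions; the paper's actual implementation (e.g., how the latencies $\ell_t(c)$ are updated each step) is not reproduced here.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Order:
    origin: int          # o_t: pickup region
    destination: int     # d_t: drop-off region
    time_to_expire: int  # tau_t: remaining steps before the order leaves the queue

@dataclass
class Courier:
    modality: str        # element of M, e.g., "car" or "drone" (illustrative labels)
    location: int        # d_t(c): region where the courier becomes available
    latency: float       # l_t(c): time until the courier is available there

@dataclass
class PlatformState:
    queue: deque = field(default_factory=deque)   # Q_t: orders waiting to be served
    couriers: list = field(default_factory=list)  # C: all couriers across modalities
    prices: dict = field(default_factory=dict)    # p_t(m): current price per modality

# Example: the order at the head of the queue (U_t) and the couriers of one modality.
state = PlatformState()
state.queue.append(Order(origin=3, destination=10, time_to_expire=5))
state.couriers.append(Courier(modality="drone", location=10, latency=2.0))
u_t = state.queue[0]
drones = [c for c in state.couriers if c.modality == "drone"]
```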
Table 2. Comparison of different policies for the first scenario. All policies were evaluated over 100 episodes, each consisting of 200 time steps. Each column reports the mean value of the corresponding metric across all episodes. The RL policy achieves a profit that is 143 % higher than the best heuristic policy.
Policy | Profit (USD) | Orders | Price (USD) | GC (USD) | Latency (min)
RL | 1267.3 ± 127 | 107.3 ± 11 | 26.2 ± 0.3 | 32.8 ± 0.2 | 7.5 ± 0.3
Max Order | 520.2 ± 83 | 162.3 ± 5.1 | 18.5 ± 0.4 | 27.7 ± 0.6 | 9.5 ± 0.3
Max Price | 512.5 ± 104 | 23.9 ± 4.8 | 30.0 | 33.8 ± 0.2 | 4.6 ± 0.3
Zone-Based | 495.2 ± 103 | 24.8 ± 5.3 | 29.2 ± 0.6 | 33.9 ± 0.2 | 5.7 ± 0.8
Table 3. Comparison of different policies for the second scenario. All policies were evaluated over 100 episodes, each consisting of 200 time steps. Each column reports the mean value of the corresponding metric across all episodes. The RL policy achieves a profit that is 56 % higher than the best heuristic policy.
Policy | Profit (USD) | Orders | Price (USD) | GC (USD) | Latency (min)
RL | 198.8 ± 42 | 14.2 ± 3.5 | 24.5 ± 0.8 | 32.1 ± 0.7 | 8.9 ± 1.3
Zone-Based | 126.8 ± 46 | 5.1 ± 1.9 | 29.9 ± 0.4 | 32.6 ± 0.9 | 3.2 ± 1.2
Max Price | 125.5 ± 45.2 | 5.0 ± 1.8 | 30.0 | 32.6 ± 0.9 | 3.1 ± 1.1
Max Order | 87.7 ± 33 | 32.6 ± 4.9 | 17.8 ± 0.9 | 29.8 ± 1 | 13.2 ± 0.7
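The values in Tables 2 and 3 are means (± standard deviation) of per-episode metrics over 100 evaluation episodes of 200 time steps each. The sketch below shows one way such statistics could be gathered for a trained agent; the contents of the info dictionary and the deterministic-evaluation choice are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def evaluate(model, env, n_episodes=100, max_steps=200):
    """Roll out a trained policy and aggregate per-episode profit."""
    episode_profits = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        profit = 0.0
        for _ in range(max_steps):
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            # Assumes the environment reports the platform's profit in `info`;
            # otherwise the reward itself can be accumulated instead.
            profit += info.get("profit", reward)
            if terminated or truncated:
                break
        episode_profits.append(profit)
    return np.mean(episode_profits), np.std(episode_profits)

# Usage (hypothetical): mean_profit, std_profit = evaluate(model, env)
```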
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
