Learning-Based Routing for Autonomous Shuttles Under Stochastic Demand Using Generative Adversarial Imitation Learning and Reinforcement Learning

Kim, Hyun; Dimitrijevic, Branislav

doi:10.3390/urbansci10050287

Open Access Article


by Hyun Kim * and Branislav Dimitrijevic

Department of Civil and Environmental Engineering, New Jersey Institute of Technology, Newark, NJ 07103, USA

* Author to whom correspondence should be addressed.

Urban Sci. 2026, 10(5), 287; https://doi.org/10.3390/urbansci10050287

Submission received: 9 February 2026 / Revised: 24 March 2026 / Accepted: 28 March 2026 / Published: 20 May 2026

(This article belongs to the Special Issue Smart City Transportation and Electric Vehicles: Innovations for Sustainable Urban Mobility)


Abstract

Extensive research has been conducted to develop technologies that enable paratransit systems to operate autonomously, including advanced sensing technologies and associated software. However, there remains a gap in research addressing adaptive operational algorithms for such systems under stochastic and dynamically evolving demand. To address this gap, this study develops an imitation-learning-assisted deep reinforcement learning (DRL) approach for autonomous shuttle routing. The proposed framework integrates generative adversarial imitation learning with proximal policy optimization to enable sequential pickup and drop-off decision-making under stochastic passenger demand without centralized re-optimization. The DRL agent was trained over approximately 1.5 million training steps and evaluated across 1000 episodes with stochastic passenger generation. Its performance was benchmarked against a deterministic dial-a-ride problem (DARP) solver implemented using Google’s OR-Tools, as well as online heuristic baselines. Results indicate that while heuristic methods achieve lower average time-based performance metrics, the proposed approach is capable of learning adaptive routing policies and demonstrates consistent behavior across diverse demand realizations. These findings highlight the feasibility of learning-based routing in controlled environments and provide a foundation for extending such approaches to more complex and realistic autonomous mobility systems.

Keywords:

autonomous mobility-on-demand (AMoD); autonomous shuttles; dial-a-ride problem (DARP); intelligent transportation systems (ITS); reinforcement learning (RL); stochastic and dynamic vehicle routing problem (SDVRP)

1. Introduction

Autonomous vehicles (AVs), in particular shared autonomous shuttles, have received increasing attention from industry, government agencies, and academia as a potential next-generation public transportation solution. Shared autonomous mobility systems are expected to improve operational efficiency, expand service coverage, and enhance accessibility for transportation-disadvantaged populations by enabling flexible, demand-responsive transit operations [1,2,3]. As a result, autonomous shuttles have been actively piloted in campus environments, first–last mile services, and paratransit applications, including recent deployments in New Jersey and other regions worldwide [4].

From an urban systems perspective, autonomous shuttles represent a form of demand-responsive public transportation that must continuously adapt to uncertain passenger arrivals, heterogeneous trip patterns, and real-time network conditions. Unlike conventional fixed-route transit, such services operate as sequential decision-making systems embedded within complex urban environments, where routing policies directly influence service reliability, passenger equity, and overall network efficiency. Recent studies have demonstrated the applicability of reinforcement learning for urban mobility control problems such as transit operations and shared mobility coordination, highlighting the potential of learning-based methods to support adaptive transportation systems [5,6]. These control-oriented applications can be viewed as special cases of sequential decision-making under uncertainty, with routing emerging as a higher-dimensional extension that requires jointly optimizing pickup, drop-off, and movement decisions over time. However, most existing approaches focus on isolated operational tasks or simplified service settings, leaving a gap in end-to-end routing frameworks that explicitly model both pickup and drop-off decisions under stochastic urban demand.

Despite this growing interest, much of the existing research has focused on vehicle technology, safety, and system-level impacts, while comparatively less attention has been devoted to the development of operational routing and scheduling algorithms suitable for shared autonomous shuttles operating under stochastic and dynamically evolving passenger demand. Most demand-responsive transit and paratransit systems rely on variants of the dial-a-ride problem (DARP), which typically assume deterministic or fully known demand and often require centralized re-optimization as new requests arrive [7,8]. These assumptions limit their applicability to real-time autonomous shuttle operations, where passenger requests occur continuously, and routing decisions must be updated online.

Recent advances in machine learning, particularly deep reinforcement learning (DRL), offer a promising alternative for addressing dynamic routing problems. By learning policies through interaction with the environment, DRL agents can adapt routing decisions in real time without repeatedly solving large-scale combinatorial optimization problems [9,10]. However, training stable and effective DRL policies for routing under stochastic demand remains challenging due to sparse rewards, high-dimensional state spaces, and the need to coordinate pickup and drop-off decisions across time.

To address these challenges, this study proposes a learning-based routing framework for autonomous shuttles operating under stochastic passenger demand. The proposed approach integrates generative adversarial imitation learning (GAIL) [11] with proximal policy optimization (PPO) [10] to accelerate policy convergence and improve training stability. The learned routing policy is evaluated in a dynamic simulation environment and benchmarked against a deterministic DARP solution implemented using Google’s OR-Tools. Performance is assessed using passenger waiting time, in-vehicle time, service completion time, and overall service efficiency.

Rather than targeting full-scale deployment, this work focuses on evaluating the feasibility of learning-based routing under stochastic demand in a controlled setting. The proposed framework enables sequential decision-making without centralized re-optimization and is evaluated against both deterministic and heuristic baselines. This formulation serves as a foundational step toward more complex and realistic autonomous shuttle routing problems, including larger networks, multiple vehicles, and additional operational constraints.

The main contributions of this paper are threefold: (1) formulation of a stochastic autonomous shuttle routing problem within a reinforcement learning framework; (2) development of an imitation-learning-assisted DRL training pipeline for dynamic pickup and drop-off routing; and (3) systematic benchmarking of the learned policy against a conventional DARP solver under controlled stochastic demand scenarios.

2. Literature Review

The routing and scheduling of shared autonomous shuttles is closely related to the classical dial-a-ride problem (DARP), which seeks to determine optimal pickup and drop-off sequences subject to time windows, vehicle capacity constraints, and passenger service quality objectives. Early formulations of DARP focused on deterministic settings with fully known demand, using exact methods such as dynamic programming and branch-and-bound, as well as heuristic approaches including insertion heuristics, local search, and interchange methods [12,13,14]. These methods have been successfully applied to applications such as paratransit services, school bus routing, and company fleet operations.

Subsequent research extended DARP formulations to incorporate stochastic elements such as travel time variability, request cancellations, and vehicle disruptions [15]. Although these extensions improved robustness, most approaches continue to rely on centralized optimization and assume that passenger requests are known in advance or arrive at discrete re-optimization intervals. As a result, they remain computationally expensive and poorly suited for continuous, real-time routing decisions required by autonomous shuttle systems operating in highly dynamic environments.

In parallel, studies on shared autonomous vehicles (SAVs) and dynamic ride-sharing have explored simulation-based and optimization-based frameworks to assess system-level impacts and operational feasibility. Many of these studies impose simplifying assumptions, such as restricting ride-sharing to passengers with closely aligned spatiotemporal characteristics or aggregating pickup and drop-off locations into zones [16,17]. While these assumptions reduce computational complexity, they may compromise service equity and limit applicability to paratransit populations requiring door-to-door service [18].

More recently, reinforcement learning and Markov decision process (MDP) formulations have been investigated for stochastic vehicle routing problems. Learning-based approaches offer the advantage of producing real-time decisions without repeated global optimization [3,19]. However, many existing applications focus primarily on pickup decisions, neglect explicit modeling of passenger destinations, or are limited to single-vehicle or simplified service scenarios. Moreover, training instability and sparse reward signals remain significant barriers to practical deployment.

Recent advances in transportation systems have introduced new challenges that extend beyond classical vehicle routing formulations. In particular, electric vehicle routing problems (EVRP) incorporate battery limitations and charging strategies, significantly increasing operational complexity [20,21]. These constraints require routing approaches that jointly consider energy consumption, charging infrastructure, and service efficiency.

Furthermore, the emergence of connected vehicle technologies and communication infrastructures such as 5G enables real-time data exchange and coordination across transportation systems, supporting dynamic and adaptive decision-making [22]. These developments further emphasize the need for routing strategies that operate effectively under uncertainty and evolving system states.

Reinforcement learning (RL) has been increasingly explored as a framework for decision-making in transportation systems. Prior work has applied RL to vehicle-level control problems, including cooperative velocity planning and lane-changing in connected electric vehicles [23]. These studies demonstrate the effectiveness of RL for adaptive and cooperative decision-making but remain focused on vehicle-level control rather than system-level routing and dispatching under stochastic passenger demand.

At the system level, dynamic ride-sharing and mobility-on-demand problems have been studied to address real-time matching and routing of vehicles and passengers. For example, Alonso-Mora et al. [24] proposed a dynamic trip-vehicle assignment framework for high-capacity ride-sharing, highlighting the complexity of real-time fleet coordination under stochastic demand.

Despite these advances, existing approaches either rely on deterministic optimization with strong assumptions (e.g., full demand knowledge) or focus on localized control problems. There remains a need for learning-based routing frameworks that can operate under stochastic demand, limited information, and dynamic environments. This gap motivates the proposed integration of imitation learning and reinforcement learning for autonomous shuttle routing.

Despite growing interest in learning-based routing, a critical gap remains in the development of reinforcement learning frameworks that explicitly model both pickup and drop-off decisions, operate under stochastic passenger arrivals, and are evaluated against established optimization-based benchmarks. This study addresses this gap by integrating imitation learning with deep reinforcement learning to train an autonomous shuttle routing policy and by systematically comparing its performance to a deterministic DARP solver under controlled experimental conditions.

3. Research Motivation

Routing and scheduling shared autonomous shuttles under stochastic and dynamically evolving passenger demand represents a fundamental challenge for future demand-responsive transit systems. In its most general form, this challenge can be characterized as an MSDVRP (multi-agent stochastic and dynamic vehicle routing problem), in which multiple autonomous shuttles must make coordinated, real-time routing and pickup–drop-off decisions as new ride requests continuously arrive. Unlike conventional dial-a-ride formulations that assume deterministic or pre-known demand, such systems require online decision-making under uncertainty and evolving system states.

Despite recent advances in optimization and learning-based methods, developing stable and scalable solutions for this class of problems remains difficult. The combination of high-dimensional state spaces, delayed and sparse rewards, and non-stationarity arising from interactions among multiple vehicles poses significant challenges for DRL. Moreover, limited prior research exists on learning-based routing frameworks that explicitly address these issues in autonomous shuttle settings.

Motivated by these challenges, this study adopts a progressive research strategy in which the core decision-making problem is first examined in a simplified setting. By focusing on a relaxed formulation of the autonomous shuttle routing problem, this work seeks to establish a robust methodological foundation that can be extended to multi-agent and large-scale scenarios in future research.

4. Scope and Problem Definition

This study addresses a relaxed formulation of the autonomous shuttle problem (ASP), which concerns the dynamic routing of autonomous shuttles operating under stochastic passenger demand. While the broader ASP aligns with a multi-agent stochastic and dynamic vehicle routing problem, the present work focuses on a deliberately constrained setting to enable systematic analysis and stable policy learning.

Specifically, the scope of this study is limited to a single autonomous shuttle agent operating within a 10 × 10 grid-based network under stochastic passenger arrivals. This abstraction represents a simplified street network and allows for reproducible experimentation and step-wise evaluation of routing decisions in a dynamic environment. By excluding multi-agent coordination, the formulation avoids non-stationarity introduced by agent interactions and isolates the fundamental challenges associated with dynamic pickup and drop-off routing.

The objective of the agent is to learn adaptive routing policies that minimize passenger waiting time, in-vehicle time, and overall service completion time. Policy performance is evaluated using operational metrics relevant to paratransit and shared autonomous mobility systems and is benchmarked against a conventional dial-a-ride problem (DARP) formulation implemented using Google’s OR-Tools.

While this formulation omits several real-world complexities—such as road-network constraints, travel speeds, time windows, service times, and multi-vehicle coordination—it provides a controlled environment for evaluating learning-based routing under stochastic demand. The proposed framework is designed to be extensible, and this simplified setting serves as a foundational step toward more complex and realistic autonomous shuttle routing scenarios.

5. Materials and Methods

Existing DARP heuristics and optimization algorithms produce an optimal or near-optimal route and schedule (sequence) for picking up passengers, taking into account their desired pickup times and arrival times at their destinations. They do so by optimizing the DARP objective function subject to constraints, with all variables known in advance, such as the number of passengers, the time windows of their desired pickup and drop-off times, and their pickup and drop-off locations. To illustrate the limitation of traditional DARP algorithms, consider a simple 10 by 10 grid representing a street network, with known passenger origin and destination information, as shown in Figure 1. In Figure 1, the small circles represent the origins of the passengers requesting rides, and the crosses represent their destinations. The matching origin and destination (i.e., for the same passenger trip) are shown in the same color. For example, P1 represents the origin of passenger 1, located at intersection H3 and shown as a red circle; the destination of passenger 1 is at intersection C7, shown as a red cross. The origin and destination pairs for different passengers are shown in different colors. As mentioned previously, given the time windows of the passengers’ desired pickup and arrival times, traditional DARP methods can determine the optimal departure point in space and time and the sequence of pickups and drop-offs for these passengers, as illustrated in Figure 1.

Now, let us consider a scenario in which the passenger trip requests and trip information are not available in advance. The three grid plots in Figure 2 show the locations of passenger trip requests and the location of an autonomous shuttle (AS1) in the street network at three different times: the left plot at T = 0, the middle plot at T = 1, and the right plot at T = 2. At T = 0, only one passenger, P1, has requested service, and the shuttle AS1 is just two vertices away. At T = 1, AS1 is moving towards P1. However, at T = 1, another passenger, P2, requests service and is four vertices away from the origin of P1. Then, at T = 2, another passenger, P3, requests service. At T = 2, AS1 has to decide between different options for servicing the passengers. For example, one option (let us call it Option 1) would be to pick up P2, drop off P1, drop off P2, pick up P3, and then drop off P3. Another option (say, Option 2) would be to pick up P2, then drop off P2 on the way to P3, pick up P3, then drop off P1, and lastly drop off P3. Because this is a combinatorial problem, the passenger service plan has multiple feasible solutions, but traditional route-planning optimization methods cannot dynamically add and remove passengers while the shuttle is on the move, let alone do so for multiple shuttles at once.

5.1. ASP Problem Formulation as a Markov Decision Process (MDP)

Though not directly focused on routing optimization, the work of Nijs et al. [25] on the constrained multiagent Markov decision process (CMMDP) provides a useful foundation for designing and conceptualizing the ASP within an MDP framework. CMMDPs support decision-making for multiple agents that share resources under uncertainty. The authors identify which CMMDP variant is best suited to a given objective, type of constraint, and observability of the domain. An MDP can be formulated as a tuple $(S, A, P, R, \gamma)$ representing the state space, action space, transition function, reward function, and discount factor, respectively, as shown in Equation (1), with the components defined in Equations (2)–(6):

$$ M = \langle S, A, P, R, \gamma \rangle \tag{1} $$

$$ S : \text{state space}, \tag{2} $$

$$ A : \text{action space}, \tag{3} $$

$$ P(s' \mid s, a) : \text{state transition probability}, \tag{4} $$

$$ R(s, a, s') : \text{reward function}, \tag{5} $$

$$ \gamma \in [0, 1] : \text{discount factor}. \tag{6} $$

In an MDP, the agent must select, given the current state, the action that yields the highest expected cumulative reward. The discount factor controls how future rewards are weighted: a value close to 0 makes the agent value immediate rewards far more than future rewards, whereas a value close to 1 makes it value future rewards almost as much as immediate ones.
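The role of the discount factor can be illustrated with the standard discounted return, $G = \sum_{t} \gamma^{t} r_{t}$. The following is a minimal sketch only; the function name `discounted_return` and the example reward sequence are illustrative assumptions and are not taken from the study.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: a single reward arrives two steps in the future.
rewards = [0.0, 0.0, 1.0]
myopic = discounted_return(rewards, gamma=0.5)      # 0.25
farsighted = discounted_return(rewards, gamma=0.9)  # approximately 0.81
```

With the smaller discount factor, the delayed reward contributes far less to the return, which is why a low value of γ makes delayed drop-off rewards harder to learn from.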

In the context of the ASP, the state comprises the positions of all autonomous shuttles (AS), whether each shuttle has passengers onboard, and the locations, elapsed wait times, travel times, and destinations of all passengers, including both those waiting for pickup and those already onboard. The action space consists of the movements that maneuver the shuttle on the grid, namely moving forward or turning left or right.

In a conventional MDP, the transition probability models the likelihood of reaching each possible next state when a single action can lead to multiple outcomes. This probabilistic outcome introduces randomness, which is appropriate for modeling stochastic scenarios. In this research problem, however, an explicit transition probability is not needed for the shuttle's own actions, because movement, pickup, and drop-off are deterministic: if a shuttle picks up passengers, they are picked up, with no other possible outcome. Stochasticity enters the environment through passenger arrivals rather than through the outcomes of actions.

The reward function is set to guide the agent toward the overall objective. A careful and thorough reward-shaping process is necessary to prevent erratic behavior while promoting actions that align with objectives such as following the shortest path and routing so as to maximize the efficiency of transporting passengers. The state information, action space, and reward function for the ASP are described in more detail in the subsequent sections.

5.2. Proposed Method: Imitation Learning–Assisted Deep Reinforcement Learning

This study proposes a two-stage learning framework that integrates imitation learning and deep reinforcement learning (DRL) to train an autonomous shuttle agent for dynamic routing under stochastic passenger demand. The proposed approach combines the rapid policy initialization capabilities of imitation learning with the long-term optimality and adaptability of reinforcement learning, resulting in a more stable and efficient training process than using reinforcement learning alone.

5.2.1. Overview of the Learning Pipeline

Training begins with imitation learning using generative adversarial imitation learning (GAIL), where the agent learns an initial routing policy by imitating expert behavior. This stage provides the agent with a structured understanding of the task objective and reduces the exploration burden associated with sparse and delayed rewards. Once the policy converges under imitation learning, the learned network weights are transferred to a deep reinforcement learning agent. The agent then continues training using direct environmental feedback, allowing further policy refinement through exploration and exploitation. This two-stage pipeline leverages the strengths of both learning paradigms while mitigating their individual limitations.

5.2.2. Imitation Learning via Generative Adversarial Imitation Learning (GAIL)

GAIL formulates imitation learning as an adversarial optimization problem involving two components: a policy (agent) and a discriminator (adversary) [11]. The discriminator is trained to distinguish between state–action pairs generated by the agent and those obtained from expert demonstrations recorded by a human operator. Concurrently, the agent learns to generate trajectories that are indistinguishable from expert behavior.

Formally, GAIL solves the following minimax optimization problem:

$$ \min_{\pi_{\theta}} \max_{D_{\phi}} \; \mathbb{E}_{(s, a) \sim \pi_{E}} \left[ \log D_{\phi}(s, a) \right] + \mathbb{E}_{(s, a) \sim \pi_{\theta}} \left[ \log \left( 1 - D_{\phi}(s, a) \right) \right], \tag{7} $$

where $\pi_{E}$ denotes the expert policy, $\pi_{\theta}$ is the agent policy, and $D_{\phi}(s, a)$ is the discriminator output indicating whether a state–action pair is expert-generated.

During training, the discriminator provides a learned reward signal to the agent, defined as

$$ r^{GAIL}(s, a) = -\log \left( 1 - D_{\phi}(s, a) \right), \tag{8} $$

which replaces the environment reward. By learning from this adversarial feedback, the agent rapidly acquires expert-like routing behavior without relying on the environment reward signal; it still interacts with the environment to generate the trajectories that the discriminator evaluates. While this mechanism provides an effective initialization, GAIL training is inherently limited by the quality of expert demonstrations and may lead to premature convergence if used alone.
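The learned reward in Equation (8) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function name `gail_reward` is hypothetical, the discriminator output is passed in as a plain probability, and the clipping constant `eps` is an added numerical safeguard that the text does not mention.

```python
import math


def gail_reward(d_prob, eps=1e-8):
    """Equation (8): r = -log(1 - D(s, a)).

    d_prob is the discriminator's estimate that (s, a) is expert-generated.
    Clipping to [eps, 1 - eps] is an assumption added here to avoid log(0).
    """
    d = min(max(d_prob, eps), 1.0 - eps)
    return -math.log(1.0 - d)


# The more expert-like the discriminator judges a pair, the larger the reward.
r_unlike = gail_reward(0.1)  # approximately 0.105
r_unsure = gail_reward(0.5)  # approximately 0.693
r_expert = gail_reward(0.9)  # approximately 2.303
```

Because the reward grows without bound as the discriminator output approaches 1, practical implementations usually bound it in some way; the clipping above is one simple option.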

5.2.3. Policy Refinement via Proximal Policy Optimization (PPO)

To overcome the limitations of imitation learning and enable further performance improvements, the GAIL-trained policy is transferred to a deep reinforcement learning agent and refined using proximal policy optimization (PPO) [10]. PPO is selected due to its robustness, sample efficiency, and strong empirical performance in stochastic and dynamic environments.

PPO constrains policy updates by optimizing a clipped surrogate objective, which limits the deviation between successive policy iterations. The probability ratio used to compare the updated policy with the previous policy is defined as

$$ r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{old}}(a_{t} \mid s_{t})}. \tag{9} $$

The PPO clipped surrogate objective is given by

$$ L^{CLIP}(\theta) = \mathbb{E}_{t} \left[ \min \left( r_{t}(\theta) \hat{A}_{t}, \; \operatorname{clip}\left( r_{t}(\theta), 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_{t} \right) \right], \tag{10} $$

where $\hat{A}_{t}$ is the advantage function measuring the relative quality of an action compared to the expected value of the current state, and $\varepsilon$ is a hyperparameter that controls the allowable magnitude of policy updates. By clipping the probability ratio, PPO prevents excessively large updates that may destabilize learning while maintaining sufficient exploration.
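To make the clipping in Equation (10) concrete, the sketch below evaluates the surrogate objective for a small batch in plain Python. It is illustrative only: the function name `ppo_clip_objective` is hypothetical, and the default ε = 0.2 is a commonly used value assumed here, since the clipping parameter used in the study is not stated in this section.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r_t from Equation (9), computed from log-probabilities.
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A = 1.2.
capped_gain = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 0.5 with a negative advantage: the objective is held at (1 - eps) * A = -0.8.
capped_loss = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])
```

In both cases the objective stops improving once the ratio leaves the interval [1 − ε, 1 + ε] in the favorable direction, which removes the incentive for large policy changes in a single update.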

In addition to policy optimization, PPO jointly updates a value function to improve estimates of future returns and incorporates entropy regularization to prevent premature convergence to deterministic policies. Through continued interaction with the environment, the agent refines the imitation-initialized policy and learns to adapt to situations beyond the expert demonstrations. This combined imitation learning and reinforcement learning framework enables efficient convergence and robust policy learning for autonomous shuttle routing in dynamic environments.

5.3. ASP Environment Set Up

A custom environment for simulating the ASP in a discrete simulation space was set up in Python 3.8.10 using packages including Gym and Stable Baselines3, on a computer with an AMD Ryzen 9 5900HS processor, an NVIDIA GeForce RTX 3080 Laptop GPU, and 32 GB of RAM. Gym, developed by OpenAI, is an open-source, standardized toolkit for developing and comparing DRL algorithms in customizable environments.

The custom ASP simulation environment was developed as a 10 × 10 grid-based street network, with an agent controlling a shuttle initially placed at coordinate (5,5), facing north, with three randomly positioned passengers. After every 10 steps (where a step represents one unit of simulation time), an additional passenger is generated at a random location until a total of six passengers is reached. Each simulation, equivalent to an episode, terminates when either 500 steps are reached or all six passengers have been successfully transported to their destinations.
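The passenger generation schedule and the termination rule described above can be sketched as follows. The constant and function names are hypothetical, and the sketch assumes that new passengers appear exactly at steps 10, 20, and 30 counted from step 0; the precise timing in the authors' environment is not specified beyond "after every 10 steps".

```python
INITIAL_PASSENGERS = 3   # passengers present at the start of an episode
MAX_PASSENGERS = 6       # generation stops once six passengers exist
SPAWN_INTERVAL = 10      # steps between newly generated passengers
MAX_STEPS = 500          # episode step limit


def passengers_generated(step):
    """Number of passengers generated so far at a given step (assumed schedule)."""
    return min(MAX_PASSENGERS, INITIAL_PASSENGERS + step // SPAWN_INTERVAL)


def episode_done(step, delivered):
    """Episode ends at the step limit or when all six passengers are delivered."""
    return step >= MAX_STEPS or delivered >= MAX_PASSENGERS


# Cumulative passenger counts at a few illustrative steps.
schedule = [passengers_generated(s) for s in (0, 9, 10, 20, 30, 100)]
```

Under this assumed schedule, all six passengers exist by step 30, so most of a 500-step episode is spent serving a fixed set of requests.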

The simulation environment incorporates a predefined orientation constraint, such that the shuttle must maintain a heading (north, south, east, or west) and cannot move freely in all directions without adjusting its orientation, reflecting realistic vehicle movement behavior. The environment is modeled as a discrete 10 × 10 grid, where each node represents a location and movement is restricted to adjacent nodes. Distance between two locations is measured using the Manhattan distance metric, and time is represented in discrete steps. At each time step, the vehicle moves exactly one grid unit, corresponding to a constant speed of one grid unit per time step.

As a result, travel time between any two locations is equal to their Manhattan distance. All reported performance metrics—including passenger waiting time, in-vehicle time, and service time—are therefore expressed in units of discrete time steps, which are directly equivalent to grid-based travel distance.
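As a small worked example of this equivalence between distance and time, the sketch below computes waiting and in-vehicle time for a single passenger served directly. The function names and coordinates are hypothetical, and the sketch deliberately ignores the shuttle's heading and any other passengers.

```python
def manhattan(a, b):
    """Manhattan distance between two grid nodes given as (x, y) tuples."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def direct_trip_times(shuttle, origin, destination):
    """Waiting and in-vehicle time, in steps, for one passenger served directly.

    Assumes the request exists at step 0 and no detours are made.
    """
    wait = manhattan(shuttle, origin)
    ride = manhattan(origin, destination)
    return wait, ride


# Shuttle at (5, 5); hypothetical passenger travelling from (7, 2) to (2, 6).
wait, ride = direct_trip_times((5, 5), (7, 2), (2, 6))  # wait = 5, ride = 9
```

When several passengers share the shuttle, the actual waiting and in-vehicle times exceed these direct values by the length of any detours, which is what the routing policy has to trade off.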

Despite this structured environment design, early training experiments indicated that the agent struggled to learn coherent routing behaviors when exposed to the full state space. In particular, the agent frequently exhibited unstable behaviors such as circling, idling near passenger locations, or failing to follow shortest paths toward pickup and drop-off points, even after extensive training. This suggests that the agent was unable to reliably infer short-horizon navigational structure solely from sparse reward signals in a stochastic environment.

To address this challenge, a lightweight environment-level guidance mechanism based on the breadth-first search (BFS) algorithm was incorporated. BFS computes shortest-path distances between the shuttle’s current position and candidate pickup or drop-off locations, allowing the environment to dynamically identify relevant pickup and drop-off targets at each decision step.

Importantly, this mechanism does not prescribe a fixed trajectory. Instead, it constrains the action space to feasible shortest-path movements while still allowing multiple valid actions due to the existence of multiple shortest paths in the grid network. As a result, the agent is presented with a set of candidate actions rather than a single deterministic choice.

Within this structured action space, the DRL agent retains full decision-making responsibility, including selecting which target to prioritize (e.g., pickup versus drop-off) and which path to follow among available shortest-path options. This formulation therefore extends beyond simple task sequencing and enables routing decisions under constrained but non-trivial action choices.

Overall, this design reduces effective state complexity and stabilizes learning while preserving agent autonomy, striking a balance between exploration feasibility and decision-making flexibility without incurring the combinatorial complexity of fully unconstrained routing.
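One way such a guidance mechanism could work is sketched below: a BFS over (x, y, heading) states finds the minimum number of actions to a target cell, and the candidate set keeps only the actions that lie on some shortest path. This is an assumption-laden illustration, not the authors' code. The heading encoding, the function names, and the use of the three actions from Equation (11) are choices made for this sketch. Because it searches over headings as well as positions, its step counts can exceed the Manhattan distance when the target lies behind the shuttle; whether the authors' BFS accounts for heading is not stated.

```python
from collections import deque

GRID = 10
# Assumed heading encoding: 0 = north, 1 = east, 2 = south, 3 = west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]
# Actions: 0 = forward, 1 = turn left then forward, 2 = turn right then forward.
TURN = {0: 0, 1: -1, 2: 1}


def step(state, action):
    """Apply one action to (x, y, heading); return None if it leaves the grid."""
    x, y, h = state
    h = (h + TURN[action]) % 4
    dx, dy = HEADINGS[h]
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID and 0 <= ny < GRID:
        return (nx, ny, h)
    return None


def steps_to_target(start, target):
    """BFS over (x, y, heading) states: minimum number of actions to reach target."""
    if start[:2] == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        for action in (0, 1, 2):
            nxt = step(state, action)
            if nxt is None or nxt in seen:
                continue
            if nxt[:2] == target:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


def shortest_path_actions(state, target):
    """Actions whose successor state is exactly one step closer to the target."""
    best = steps_to_target(state, target)
    if not best:  # already at the target, or target unreachable
        return []
    candidates = []
    for action in (0, 1, 2):
        nxt = step(state, action)
        if nxt is not None and steps_to_target(nxt, target) == best - 1:
            candidates.append(action)
    return candidates


# Shuttle at (5, 5) facing north; target to the north-east admits two shortest-path actions.
ahead = shortest_path_actions((5, 5, 0), (7, 8))   # [0, 2]
# A target directly behind the shuttle can be reached by turning either way.
behind = shortest_path_actions((5, 5, 0), (5, 3))  # [1, 2]
```

In both examples more than one action remains available, which matches the point that the guidance narrows the choice without reducing it to a single deterministic move.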

5.4. DRL Observation Space

The observation space available to the agent includes the location, distance, and orientation toward the nearest passenger; the location, distance, and orientation toward the nearest drop-off location; and the current shuttle position and orientation. Although this observation space may seem overly simplified for the ASP, this configuration empirically yielded the best performance. Until the problem was relaxed with the custom guidance system and the simplified observation space, the agent was unable to learn or produce coherent actions, contrary to expectations. The ASP DRL environment was first configured in the way a DARP would be solved, providing as much information as possible and leaving the agent to develop a solution. In the initial stage of model development, the observation space included all passengers’ elapsed waiting time and elapsed in-vehicle time, the upper limits on all passengers’ waiting time and arrival time, and the number of passengers onboard the shuttle, in addition to the current observations, for a total of 60 dimensions in the observation array. As the number of observations was reduced incrementally, the current observation space showed signs of improvement over time and was therefore adopted.

5.5. DRL Action Space

To prevent the agent from simply traversing up and down the grid environment, an orientation mechanism was implemented to introduce more realistic shuttle movement. The discrete action space for the agent can therefore be formulated as Equation (11).

A = \begin{cases} 0, & \text{move forward}, \\ 1, & \text{turn left, then move forward}, \\ 2, & \text{turn right, then move forward}. \end{cases}

(11)

As described, the agent can move forward, turn left and then move forward, or turn right and then move forward. The forward movement was added to the turning actions because trials showed the agent turning in place without producing coherent maneuvers within the grid to pick up or drop off passengers. To provide a more robust learning opportunity and an action space aligned with the reward function, turning without moving forward was not included as an action choice.
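The action semantics of Equation (11) can be illustrated with a minimal sketch. The heading encoding, the coordinate convention, and the handling of moves that would leave the grid are assumptions made for illustration; the source does not specify them.

```python
from enum import IntEnum

# Headings in clockwise order: north, east, south, west.
# (dx, dy) assumes x grows eastward and y grows northward (an assumed convention).
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


class Action(IntEnum):
    FORWARD = 0        # move forward
    LEFT_FORWARD = 1   # turn left, then move forward
    RIGHT_FORWARD = 2  # turn right, then move forward


def step(x, y, heading, action, size=10):
    """Apply one action from Equation (11) and return the new (x, y, heading).

    A move that would leave the grid applies the turn but keeps the position
    (assumed behavior; the source does not describe boundary handling).
    """
    if action == Action.LEFT_FORWARD:
        heading = (heading - 1) % 4
    elif action == Action.RIGHT_FORWARD:
        heading = (heading + 1) % 4
    dx, dy = HEADINGS[heading]
    nx, ny = x + dx, y + dy
    if 0 <= nx < size and 0 <= ny < size:
        x, y = nx, ny
    return x, y, heading
```

Because every action includes a forward move, the shuttle cannot spin in place, and reversing direction takes at least two turning moves.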

5.6. DRL Reward Function

The reward function guides the agent toward the objective through rewards and penalties. For effective learning, it must be consistent with the observation space and with the objective of the simulation environment. In the ASP DRL environment, the reward function is therefore engineered to minimize overall passenger waiting and travel times by following an optimal passenger pickup and drop-off sequence and traveling along the shortest path to each pickup or drop-off location. Four types of rewards are used in the ASP DRL model: base rewards, vicinity rewards, inverse travel time rewards, and early completion rewards. The base rewards for picking up and dropping off are set to 0.2 and 0.15, respectively, to encourage pickups more than drop-offs, in line with the transit user cost theory that people value wait time up to two to three times more than in-vehicle time [26]. In this initial experiment, the pickup base reward was set only slightly higher than the drop-off reward; these values can be adjusted after examining the performance of the trained agent. Vicinity rewards are calculated by defining a vicinity of 5 distance units around each passenger pickup and drop-off location and giving an incrementally greater reward as the agent gets closer to the location. The vicinity reward calculation is shown in Equation (12).

\text{vicinity reward} = \frac{\text{VICINITY\_REWARDS}}{\text{distance} + 1}

(12)

where:

\text{VICINITY\_REWARDS} = \begin{cases} 0.5, & \text{pickup}, \\ 0.5, & \text{drop-off}. \end{cases}

As shown in Equation (12), the vicinity reward is inversely proportional to the distance (plus one) between the agent and a pickup or drop-off location, so the closer the agent gets to one of these locations, the larger the reward it receives. Since traveling along the shortest path is equally important for pickups and drop-offs, their vicinity rewards are the same. In a transit system, it is important to transport all passengers as quickly as possible; the agent is therefore rewarded for completing passenger pickups and drop-offs quickly. This is accomplished with the inverse travel time reward, or speed bonus, which is calculated using Equation (13).

\text{speed bonus} = \text{base reward} \times \left( 2 - \frac{\text{elapsed time}}{\text{current time}} \right)

(13)

where:

\text{elapsed time} = \begin{cases} \text{elapsed wait time}, & \text{if status} = \text{waiting}, \\ \text{elapsed in-vehicle time}, & \text{if status} = \text{onboard}. \end{cases}

The elapsed time is calculated for each passenger and is either the elapsed wait time or the elapsed in-vehicle time, depending on the passenger’s status. If a passenger is waiting, the elapsed time measures the time since the passenger demand was generated. If the passenger is onboard a shuttle, the elapsed time measures their current in-vehicle travel time. For example, if a passenger was generated at step 0, picked up at step 7, and dropped off at step 20, their elapsed wait time would be 7 and their elapsed in-vehicle time would be 13. The agent therefore receives more reward the faster it picks up and drops off passengers, that is, the shorter the elapsed wait and in-vehicle times. At the completion of either a pickup or a drop-off, the speed bonus is added to the base reward, so the agent receives both for each completed pickup and drop-off. Lastly, since the objective of the agent is to transport all passengers generated in an episode as early as possible, an additional reward is given for ending the episode quickly. The early completion reward is calculated as follows:

\text{early completion bonus} = \text{remaining time} \times 0.01

(14)

where:

\text{remaining time} = 500 - \text{episode end time}

An episode terminates upon either reaching 500 steps or transporting all six passengers. The agent does not receive the early completion bonus if it has not transported all passengers in the episode. As Equation (14) shows, the bonus is 0.01 times the remaining time, where the remaining time is the difference between 500 and the episode end time. The agent therefore receives a larger bonus by finishing the episode earlier.
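The reward components in Equations (12) to (14) can be collected into a short sketch. The constants are taken from the text; the function names, the treatment of the 5-unit vicinity as inclusive, the reading of "current time" as the current simulation step, and the guard for step 0 are assumptions made for illustration.

```python
BASE_REWARD = {"pickup": 0.2, "dropoff": 0.15}
VICINITY_REWARD = {"pickup": 0.5, "dropoff": 0.5}
VICINITY_RADIUS = 5            # distance units around each target
MAX_STEPS = 500                # episode step limit
EARLY_COMPLETION_RATE = 0.01


def vicinity_reward(kind, distance):
    """Equation (12): larger as the shuttle gets closer to the target."""
    if distance > VICINITY_RADIUS:  # assumed: no reward outside the vicinity
        return 0.0
    return VICINITY_REWARD[kind] / (distance + 1)


def speed_bonus(kind, elapsed_time, current_time):
    """Equation (13): larger when the passenger has waited or ridden for less time."""
    if current_time <= 0:  # assumed guard: treat the ratio as 0 at step 0
        return BASE_REWARD[kind] * 2
    return BASE_REWARD[kind] * (2 - elapsed_time / current_time)


def event_reward(kind, elapsed_time, current_time):
    """Base reward plus speed bonus, granted when a pickup or drop-off completes."""
    return BASE_REWARD[kind] + speed_bonus(kind, elapsed_time, current_time)


def early_completion_bonus(episode_end_time, all_served):
    """Equation (14): paid only if every passenger was transported."""
    if not all_served:
        return 0.0
    return (MAX_STEPS - episode_end_time) * EARLY_COMPLETION_RATE
```

Under these assumptions, a passenger generated at step 0 and picked up at step 7 yields 0.2 + 0.2 × (2 − 7/7) = 0.4, while a passenger generated at step 10 and picked up at step 15 yields 0.2 + 0.2 × (2 − 5/15) ≈ 0.53, so the ratio rewards short waits relative to the time already elapsed in the episode.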

The reward function was designed to align with key operational objectives, including minimizing passenger waiting time, in-vehicle travel time, and overall service time. The relative magnitudes of reward components were iteratively tuned to balance these objectives and promote stable learning behavior.

The early completion bonus described above encourages global efficiency by rewarding the completion of all passenger services in fewer time steps. Because the bonus is granted only when all passengers have been served, it reinforces efficient routing rather than inducing short-sighted behavior.

While the reward parameters were not derived through formal optimization, they provide a practical and effective approximation of system-level objectives. Future work may explore more systematic approaches to reward design, including multi-objective optimization and data-driven calibration.

6. Results

Following the two-stage training procedure described above, 125 expert demonstration episodes were generated within the same simulation environment. These demonstrations were constructed using a custom breadth-first search (BFS)-based routing mechanism that ensures shortest-path navigation to pickup and drop-off locations on the grid network. Due to the grid structure, multiple shortest paths may exist; therefore, the BFS provides a set of equally optimal next-step actions rather than a single deterministic trajectory.

The expert policy resolves only the low-level navigation task—i.e., how to reach a selected target via shortest paths—while leaving higher-level decision-making (e.g., whether to prioritize pickups or drop-offs) unspecified. As a result, the demonstrations primarily encode efficient movement behavior rather than full task optimality.
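A minimal sketch of how a BFS can return a set of equally optimal next-step actions is given below. It searches over (x, y, heading) states using the three actions of Equation (11); the state encoding, the obstacle-free grid, and the exclusion of off-grid moves are assumptions for illustration and are not taken from the authors' implementation.

```python
from collections import deque

HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # north, east, south, west (clockwise)
TURN = {0: 0, 1: -1, 2: 1}  # action -> heading change: forward, left, right


def successors(state, size=10):
    """Yield (action, next_state) for every action that stays on the grid."""
    x, y, h = state
    for action, turn in TURN.items():
        nh = (h + turn) % 4
        nx, ny = x + HEADINGS[nh][0], y + HEADINGS[nh][1]
        if 0 <= nx < size and 0 <= ny < size:
            yield action, (nx, ny, nh)


def steps_to_target(start, target, size=10):
    """Breadth-first search; returns the fewest steps from start to the target cell."""
    if start[:2] == target:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        state, depth = queue.popleft()
        for _, nxt in successors(state, size):
            if nxt[:2] == target:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # target cannot be reached


def optimal_actions(state, target, size=10):
    """Return every first action that lies on some shortest path to the target."""
    costs = {}
    for action, nxt in successors(state, size):
        remaining = steps_to_target(nxt, target, size)
        if remaining is not None:
            costs[action] = remaining
    if not costs:
        return set()
    best = min(costs.values())
    return {action for action, cost in costs.items() if cost == best}
```

For a shuttle at (5, 5) facing north with a target at (7, 7), both moving forward and turning right start a four-step shortest path, so the function returns the set {0, 2}; which of the two to take is left to the policy.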

These demonstrations were used within the GAIL framework to initialize the policy, effectively addressing the early-stage exploration problem where the agent otherwise fails to discover feasible routes to passengers. The resulting policy was subsequently fine-tuned using PPO to learn higher-level routing strategies that optimize passenger service metrics. The hyperparameters used across both training stages are summarized in Table 1.

The entropy coefficient was annealed to gradually reduce exploration over time, encouraging the agent to exploit the learned policy and refine its routing based on the locations of passenger pickups and drop-offs. Different combinations of hyperparameters were tested, with the learning rate ranging from $10^{-3}$ to $10^{-5}$, n steps from 64 to 1024, and batch size from 256 to 8192; the values presented in Table 1 demonstrated the best performance. Given the stochastic environment, n steps and batch size were set to provide sufficiently long rollout horizons between updates and robust data collection. Furthermore, a low learning rate and constrained update values were employed to promote effective generalization by the agent.
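The staged schedule in Table 1 can be expressed as a small lookup, sketched below. The values are copied from Table 1; the dictionary key names and the helper function are illustrative assumptions and do not refer to any particular training library.

```python
# (algorithm, number of steps, entropy coefficient) for each training sequence in Table 1
STAGES = [
    ("GAIL", 200_000, 0.02),
    ("GAIL", 200_000, 0.01),
    ("GAIL", 400_000, 0.001),
    ("DRL", 200_000, 0.02),
    ("DRL", 200_000, 0.01),
    ("DRL", 400_000, 0.001),
]

# Settings held constant across all six sequences in Table 1 (key names are illustrative)
SHARED = {"learning_rate": 1.0e-4, "n_steps": 1024, "batch_size": 4096, "clip_range": 0.2}


def stage_at(global_step):
    """Return (algorithm, entropy coefficient) active at a cumulative training step."""
    boundary = 0
    for algorithm, num_steps, ent_coef in STAGES:
        boundary += num_steps
        if global_step < boundary:
            return algorithm, ent_coef
    raise ValueError("step is beyond the end of the schedule")
```

Within each algorithm the entropy coefficient steps down from 0.02 to 0.001, and the schedule restarts at 0.02 when training switches from the imitation stage to the reinforcement learning stage.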

To evaluate the effectiveness of the trained policy and the environment-level guidance mechanism in the absence of established benchmarks for the autonomous shuttle problem (ASP), a deterministic dial-a-ride problem (DARP) formulation was implemented using Google’s OR-Tools. In this benchmark, passenger demand was assumed to be fully known in advance, providing an upper-bound reference for routing performance. Because the proposed DRL-based approach optimizes routing decisions online without prior knowledge of future passenger requests, it is expected to yield inferior objective values relative to the deterministic DARP solution. Consequently, performance proximity to the DARP benchmark is interpreted as an indication of effective real-time decision-making.

To provide a more appropriate and fair comparison under online and stochastic conditions, two additional heuristic baselines were implemented: a nearest-neighbor heuristic and a wait-time-aware heuristic. Unlike the offline DARP formulation, these methods operate under the same information constraints as the DRL agent, making decisions sequentially based only on currently available passenger requests.

The nearest-neighbor heuristic selects the next target (either a pickup or drop-off location) based on minimum Manhattan distance from the shuttle’s current position, prioritizing immediate spatial efficiency as a simple greedy baseline. The wait-time-aware heuristic extends this approach by incorporating passenger waiting time into the decision process, balancing travel distance and accumulated waiting time to prioritize passengers who have been waiting longer. This provides a stronger baseline that explicitly accounts for passenger-centric performance metrics such as waiting and service time.
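The two online baselines can be sketched as target-selection rules. The passenger dictionary fields are hypothetical, and the linear score with a `weight` parameter in the wait-time-aware rule is an assumed form chosen only to illustrate the distance versus waiting-time trade-off; the source does not give the scoring formula.

```python
def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def candidate_targets(passengers):
    """Pickup points of waiting passengers and drop-off points of onboard ones."""
    for p in passengers:
        if p["status"] == "waiting":
            yield p, p["pickup"]
        elif p["status"] == "onboard":
            yield p, p["dropoff"]


def nearest_neighbor(shuttle, passengers):
    """Choose the (passenger, location) pair with the smallest Manhattan distance."""
    return min(candidate_targets(passengers),
               key=lambda t: manhattan(shuttle, t[1]), default=None)


def wait_time_aware(shuttle, passengers, now, weight=0.5):
    """Trade distance against accumulated waiting time (hypothetical linear score)."""
    def score(t):
        p, location = t
        waited = now - p["requested_at"] if p["status"] == "waiting" else 0
        return manhattan(shuttle, location) - weight * waited
    return min(candidate_targets(passengers), key=score, default=None)
```

With a shuttle at (0, 2) at step 20, a passenger 2 cells away who requested service at step 18 is chosen by the nearest-neighbor rule, whereas a passenger 6 cells away who has waited since step 0 is chosen by the wait-time-aware rule under the assumed weight.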

To enable consistent and comparable evaluation across all methods, identical realized demand scenarios were used. For each of the 1000 evaluation episodes, passenger generation information—including arrival times, pickup locations, and drop-off locations—was recorded and subsequently replayed across the DARP model and both online heuristic baselines. This eliminates variability due to stochastic passenger generation and allows direct, paired comparisons of performance metrics.

Rather than relying on explicit random seed control, reproducibility is achieved through deterministic replay of the recorded demand scenarios, ensuring that evaluation outcomes remain consistent and independent of stochastic variations in passenger generation.

Additional implementation details for the DARP benchmark—including routing constraints, cost structure, and search strategies (e.g., PARALLEL_CHEAPEST_INSERTION and GUIDED_LOCAL_SEARCH)—are provided. The codebase, along with evaluation scripts and representative output files, is publicly available, as described in the Data Availability Statement.

After training, key operational statistics—including passenger pickup and drop-off locations and times, waiting times, and in-vehicle travel times—were recorded at the end of each validation episode. The same passenger demand realizations were then applied to the DARP model implemented using the Google OR-Tools Python package, employing the pywrapcp and routing_enums_pb2 modules of ortools.constraint_solver, as well as to the two online heuristic models, to generate benchmark solutions for comparison.

In each episode, a total of six passengers were generated at random locations within the service network. Three passengers (P1–P3) were generated at the start of the episode, followed by the arrival of one additional passenger every 10 time steps until all six passengers had been introduced. To ensure consistency between the online DRL setting and the offline DARP benchmark, pickup time constraints were imposed in the DARP formulation such that passenger P4 could not be picked up before step 10, P5 before step 20, and P6 before step 30. No explicit constraints were placed on maximum waiting time or in-vehicle travel time.
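The demand schedule and its replay across methods can be sketched as follows. The `Request` record, the function names, and the rule that a passenger's origin and destination differ are assumptions for illustration; only the arrival steps, the six-passenger count, and the 10 × 10 grid come from the text.

```python
import random
from dataclasses import dataclass

GRID_SIZE = 10
ARRIVAL_STEPS = [0, 0, 0, 10, 20, 30]  # P1-P3 at the start, then one passenger every 10 steps


@dataclass(frozen=True)
class Request:
    name: str
    arrival: int    # step at which the request appears; also the earliest DARP pickup step
    pickup: tuple
    dropoff: tuple


def generate_scenario(rng):
    """Draw one six-passenger demand realization at random grid locations."""
    requests = []
    for i, arrival in enumerate(ARRIVAL_STEPS, start=1):
        pickup = (rng.randrange(GRID_SIZE), rng.randrange(GRID_SIZE))
        dropoff = pickup
        while dropoff == pickup:  # assumed: origin and destination differ
            dropoff = (rng.randrange(GRID_SIZE), rng.randrange(GRID_SIZE))
        requests.append(Request(f"P{i}", arrival, pickup, dropoff))
    return requests


def visible_requests(scenario, step):
    """Requests an online method is allowed to see at a given step during replay."""
    return [r for r in scenario if r.arrival <= step]
```

Once a scenario has been recorded, the same list of requests can be handed to every method: the online methods query `visible_requests` step by step, while the offline benchmark receives the full list and uses each `arrival` as a lower bound on the pickup time.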

The service network was modeled as a $10 \times 10$ grid, where movement between adjacent nodes incurred a unit cost of one time step. Travel costs and shortest paths were computed using the Manhattan distance metric, consistent with the grid-based environment used for DRL training.

For the DARP benchmark, OR-Tools was configured by specifying paired pickup and drop-off locations, pickup time constraints, vehicle capacity, and pickup–drop-off precedence constraints, with the objective of minimizing total travel cost across all passenger services. An initial routing solution was generated using the PARALLEL_CHEAPEST_INSERTION heuristic, followed by refinement using the GUIDED_LOCAL_SEARCH metaheuristic. These general vehicle routing problem (VRP) strategies iteratively improve solution quality through cost-based insertion and local neighborhood exploration [27].

Table 2 summarizes the average episode-level performance of all methods over 1000 evaluation episodes. The heuristic baselines achieved lower mean waiting, in-vehicle, and service times than the proposed GAIL + PPO agent. In particular, the wait-time-aware heuristic achieved the lowest average waiting time, whereas the nearest-neighbor baseline achieved the lowest average in-vehicle and service times. These findings suggest that the current learned policy remains less competitive than the selected benchmarks with respect to these time-based performance measures.

The confidence intervals in Table 3 indicate relatively low variability across evaluation episodes, suggesting stable performance across all methods.
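The relationship between Table 2 and Table 3 can be checked with a short sketch. It assumes that the ± values in Table 2 are sample standard deviations over the 1000 episodes and that a normal approximation was used; the source states neither explicitly.

```python
import math


def ci95(mean, sd, n, z=1.96):
    """Normal-approximation 95% confidence interval for a sample mean."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# GAIL + PPO in-vehicle time from Table 2: 18.47 with a spread of 8.46 over 1000 episodes
low, high = ci95(18.47, 8.46, 1000)
print(round(low, 2), round(high, 2))  # 17.95 18.99
```

Under these assumptions the result matches the corresponding Table 3 entry, and the other rows agree to within about 0.01, which is consistent with rounding of the inputs.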

Table 4 presents paired comparisons between the proposed GAIL + PPO agent and baseline methods. Across all metrics, the heuristic baselines achieve lower time-based performance values than the RL agent, and these differences are statistically significant (p < 0.01). The magnitude of the differences is particularly pronounced for service time when compared to the nearest-neighbor heuristic.
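Because every method is evaluated on the same recorded demand scenarios, the comparison can be carried out on per-episode differences. The sketch below computes the mean paired difference and a paired t statistic; the source does not name the statistical test that was used, so the choice of a paired t statistic is an assumption.

```python
import math
import statistics


def paired_summary(baseline, rl):
    """Mean paired difference (baseline - RL) and the corresponding t statistic."""
    diffs = [b - r for b, r in zip(baseline, rl)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err


# Made-up per-episode values for illustration only (not results from the study)
mean_diff, t_stat = paired_summary([10, 12, 9, 11], [15, 14, 16, 13])
```

A negative mean difference means the baseline achieved the lower (better) time on average, matching the sign convention of Table 4.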

While Table 2, Table 3 and Table 4 provide aggregate performance comparisons across 1000 evaluation episodes, these summary statistics do not fully capture the variability of outcomes under stochastic demand. To complement this macroscopic analysis, we examine episode-level performance differences using empirical cumulative distribution functions (ECDFs), which provide a distributional view of how the proposed method compares to baseline approaches across individual realizations. Figure 3 shows the ECDFs of the average in-vehicle travel time differences between the baseline methods and the RL policy.

In the ECDF plots, the difference is defined as (baseline − RL), such that negative values indicate better performance by the baseline and positive values indicate better performance by the RL policy. This representation enables a direct assessment of not only the average performance gap, but also the proportion of episodes in which the RL policy achieves comparable or superior performance relative to each baseline.
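The ECDF of the paired differences and the share of episodes in which the RL policy performs better can be computed as in the following sketch. The function names are illustrative, and the numbers in the usage lines are made up rather than taken from the study.

```python
def ecdf(values):
    """Return sorted values and the cumulative fraction of episodes at or below each."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def share_rl_better(baseline, rl):
    """Fraction of episodes with baseline - RL > 0, i.e., RL achieved the lower time."""
    diffs = [b - r for b, r in zip(baseline, rl)]
    return sum(d > 0 for d in diffs) / len(diffs)


# Made-up per-episode values for illustration only
baseline = [10, 12, 9, 11, 20]
rl = [15, 14, 16, 13, 18]
xs, ys = ecdf([b - r for b, r in zip(baseline, rl)])
```

Reading the ECDF at zero gives the fraction of episodes in which the baseline was at least as good as the RL policy; the remainder is the share returned by `share_rl_better`.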

For in-vehicle travel time, the ECDF results indicate that the baseline methods generally outperform the RL policy, as a substantial portion of the distribution lies on the negative side of zero (i.e., baseline minus RL < 0). This implies that the baselines achieve lower travel times in the majority of episodes.

For the nearest-neighbor heuristic, approximately 60% of episodes fall within a 10-unit difference from the RL policy, indicating that although the heuristic often performs better, the magnitude of improvement is relatively moderate in many cases.

When compared to the offline DARP benchmark, RL achieves lower in-vehicle travel time in approximately 40% of episodes (i.e., where the difference is positive). This demonstrates that, although the offline method has full knowledge of future demand, the RL policy is capable of achieving comparable or improved performance in a non-negligible subset of stochastic demand realizations.

Figure 4 shows the ECDFs of the average service time differences between the baseline methods and the RL policy. For total service time, the ECDF indicates that the RL policy is generally outperformed by all baseline methods, as the majority of the distribution remains on the negative side of zero. While RL achieves lower service times than the offline benchmark in approximately 25% of episodes, these cases are limited compared to the overall distribution.

In comparison to the online heuristics, including both nearest-neighbor and wait-time-aware strategies, the RL policy consistently underperforms, suggesting that it does not fully capture the trade-off between minimizing travel distance and reducing passenger waiting time.

While the proposed GAIL + PPO agent does not outperform the heuristic baselines or the offline DARP solution in terms of time-based performance metrics, this outcome is consistent with the relative information and structural advantages of the benchmark methods. The heuristic approaches are explicitly designed to optimize short-term objectives under simplified decision rules, while the offline DARP solution benefits from full prior knowledge of all passenger requests.

Importantly, the proposed learning-based approach is not intended to directly compete with these methods under the current simplified setting, but rather to establish a scalable and extensible framework for more complex routing problems. In particular, traditional optimization and heuristic approaches become increasingly difficult to apply or generalize in environments involving stochastic demand, sequential decision-making, and high-dimensional state representations.

The results demonstrate that the proposed method is capable of learning meaningful routing behaviors and achieving competitive performance in a subset of scenarios despite operating under significantly greater uncertainty. This provides a foundation for extending the approach to more complex settings, such as multi-agent coordination and large-scale real-time transportation systems, where learning-based methods are expected to offer greater flexibility and adaptability.

7. Discussion

In the initial experiments, several components of the proposed modeling framework were intentionally relaxed—including the number of shuttles—to enable stable policy learning and to validate the feasibility of the learning-based routing approach. While the presented results demonstrate that the proposed framework can learn coherent and adaptive routing behavior under stochastic passenger demand, the original objective of developing a scalable routing algorithm for multiple autonomous shuttles motivates several extensions to the current formulation.

First, the quality and scale of expert demonstrations used during the imitation learning stage can be substantially improved. In the present study, expert trajectories were generated by a human operator due to the lack of existing datasets capable of capturing optimal pickup and drop-off routing under stochastic demand. While this approach proved sufficient to initialize the learning process and yielded meaningful initial results, it introduces limitations related to both routing optimality and dataset size. To address these limitations, future work will focus on developing an automated expert data generation pipeline capable of producing optimal observation–action pairs based on passenger locations and request times. This can be achieved by integrating a discrete-event simulation framework, such as SimPy [28], with a dynamically updated DARP solver to generate step-by-step shuttle movements under evolving passenger demand.

Second, the training process can be enhanced through the incorporation of a self-play mechanism. Self-play has been shown to enable agents to discover strategies beyond those demonstrated by human experts in complex decision-making tasks, as in AlphaGo, OpenAI Five, AlphaStar, and AlphaGo Zero [29,30,31,32]. After acquiring baseline routing behavior through imitation learning, self-play can be employed to allow agents to compete against previous versions of themselves, encouraging the emergence of more efficient routing, pickup sequencing, and ridesharing strategies based on spatiotemporal information.

Third, the framework can be extended to a fully multi-agent setting by adopting a multi-agent reinforcement learning algorithm. Candidate approaches include QMIX (monotonic value function factorization) [33] and multi-agent proximal policy optimization (MAPPO) [34], both of which follow the centralized training and decentralized execution (CTDE) paradigm. Under CTDE, agents leverage shared global information during training while relying solely on local observations during execution. QMIX combines individual agent value functions through a mixing network to approximate a global action-value function, whereas MAPPO employs decentralized policies with a centralized value function. Both approaches are well suited to the multi-vehicle autonomous shuttle problem, and future empirical evaluation will determine their relative effectiveness in this domain.

8. Conclusions

Beyond methodological contributions, this work highlights the role of learning-based routing in enabling demand-responsive autonomous shuttle services within urban transportation systems. By eliminating the need for repeated centralized re-optimization and allowing policies to adapt online to stochastic passenger demand, the proposed framework offers a scalable foundation for future smart city mobility deployments. Despite operating without prior knowledge of future passenger requests, the proposed learning-based policy matched or outperformed a deterministic DARP benchmark in a subset of episodes, achieving lower in-vehicle travel time in approximately 40% of episodes and lower service time in approximately 25%, although its average time-based performance remained below that of all baselines. Such capabilities are particularly relevant for first–last mile connectivity, paratransit operations, and service provision in transportation-disadvantaged communities. While the present study focuses on a single-agent formulation, the framework is designed to extend naturally to multi-shuttle environments, providing a pathway toward coordinated autonomous transit systems that support sustainable, efficient, and inclusive urban mobility.

Author Contributions

Conceptualization, H.K.; methodology, H.K.; software, H.K.; validation, H.K. and B.D.; formal analysis, H.K. and B.D.; investigation, H.K.; resources, H.K.; writing—original draft preparation, H.K.; writing—review and editing, H.K. and B.D.; visualization, H.K.; supervision, B.D.; project administration, B.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and source code supporting the findings of this study are publicly available at https://github.com/kimhyun1018/gail_ppo_ridesharing (accessed on 27 March 2026).

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT (OpenAI, GPT-4 and later versions) for the purposes of checking grammar and refining sentences. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ASP	Autonomous Shuttle Problem
DRL	Deep Reinforcement Learning
GAIL	Generative Adversarial Imitation Learning
MDP	Markov Decision Process
MSDVRP	Multi-agent Stochastic and Dynamic Vehicle Routing Problem

References

1. Imhof, S.; Frölicher, J.; von Arx, W. Shared Autonomous Vehicles in Rural Public Transportation Systems. Res. Transp. Econ. 2020, 83, 100925.
2. Trubia, S.; Curto, S.; Severino, A.; Arena, F.; Zuccalà, Y. Autonomous Vehicles Effects on Public Transport Systems. AIP Conf. Proc. 2021, 2343, 110014.
3. Burger, A. What Might Autonomous Public Transit Look Like? Santa Clara Valley Transportation Authority (VTA) Blog. 2021. Available online: https://www.vta.org/blog/what-might-autonomous-public-transit-look (accessed on 7 August 2024).
4. New Jersey Transit. AVATAR Autonomous Vehicle Assessment, Testing and Research Pilot. Available online: https://www.njtransit.com/avatar (accessed on 1 November 2024).
5. Farazi, N.P.; Zou, B.; Ahamed, T.; Barua, L. Deep reinforcement learning in transportation research: A review. Transp. Res. Interdiscip. Perspect. 2021, 11, 100425.
6. Feng, S.; Duan, P.; Ke, J.; Yang, H. Coordinating ride-sourcing and public transport services with a reinforcement learning approach. Transp. Res. Part C Emerg. Technol. 2022, 138, 103611.
7. Cordeau, J.F. A Branch-and-Cut Algorithm for the Dial-a-Ride Problem. Oper. Res. 2006, 54, 573–586.
8. Hiermann, G.; Puchinger, J.; Ropke, S.; Hartl, R.F. The Electric Fleet Size and Mix Vehicle Routing Problem with Time Windows and Recharging Stations. Eur. J. Oper. Res. 2016, 252, 995–1018.
9. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018.
10. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
11. Ho, J.; Ermon, S. Generative Adversarial Imitation Learning. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016); Curran Associates, Inc.: Red Hook, NY, USA, 2016.
12. Wilson, N.H.M.; Colvin, N.J. Computer Control of the Rochester Dial-a-Ride System; Technical Report 77-22; Massachusetts Institute of Technology, Center for Transportation Studies: Cambridge, MA, USA, 1977.
13. Psaraftis, H.N. A Dynamic Programming Solution to the Single Vehicle Many-to-Many Immediate Request Dial-a-Ride Problem. Transp. Sci. 1980, 14, 130–154.
14. Jaw, J.J.; Odoni, A.R.; Psaraftis, H.N.; Wilson, N.H.M. A Heuristic Algorithm for the Multi-Vehicle Advance Request Dial-a-Ride Problem with Time Windows. Transp. Res. Part B Methodol. 1986, 20, 243–257.
15. Xiang, Z.; Chu, C.; Chen, H. The Study of a Dynamic Dial-a-Ride Problem under Time-Dependent and Stochastic Environments. Eur. J. Oper. Res. 2008, 185, 534–551.
16. Brownell, C.; Kornhauser, A. A Driverless Alternative: Fleet Size and Cost Requirements for a Statewide Autonomous Taxi Network in New Jersey. Transp. Res. Rec. 2014, 2416, 73–81.
17. Fagnant, D.J.; Kockelman, K.M. Dynamic Ride-Sharing and Fleet Sizing for a System of Shared Autonomous Vehicles in Austin, Texas. Transportation 2018, 45, 143–158.
18. Goralzik, A.; König, A.; Alčiauskaitė, L.; Hatzakis, T. Shared mobility services: An accessibility assessment from the perspective of people with disabilities. Eur. Transp. Res. Rev. 2022, 14, 34.
19. Hildebrandt, F.D.; Thomas, B.W.; Ulmer, M.W. Opportunities for Reinforcement Learning in Stochastic Dynamic Vehicle Routing. Comput. Oper. Res. 2023, 150, 106071.
20. Schneider, M.; Stenger, A.; Goeke, D. The electric vehicle-routing problem with time windows and recharging stations. Transp. Sci. 2014, 48, 500–520.
21. Ali, W.A.; del Cacho Estil-les, M.A.; Mangini, A.M.; Roccotelli, M.; Fanti, M.P. Electric Vehicles Routing Simulation and Optimization under Smart Charging Strategies. In Proceedings of the 35th European Modeling and Simulation Symposium (EMSS 2023), Athens, Greece, 18–20 September 2023.
22. Abdullah, Z.K.; Alsaadi, M.S. Electric Vehicles and 5G: Impacts and Synergies for Sustainable Transportation. J. Intell. Syst. Appl. Data Sci. 2024, 2, 1–6.
23. Ding, H.; Li, W.; Xu, N.; Zhang, J. An enhanced eco-driving strategy based on reinforcement learning for connected electric vehicles: Cooperative velocity and lane-changing control. J. Intell. Connect. Veh. 2022, 5, 316–332.
24. Alonso-Mora, J.; Samaranayake, S.; Wallar, A.; Frazzoli, E.; Rus, D. On-demand high-capacity ride-sharing via dynamic trip-vehicle assignment. Proc. Natl. Acad. Sci. USA 2017, 114, 462–467.
25. De Nijs, F.; Walraven, E.; De Weerdt, M.; Spaan, M. Constrained Multiagent Markov Decision Processes: A Taxonomy of Problems and Algorithms. J. Artif. Intell. Res. 2021, 70, 955–1001.
26. ECONorthwest; Parsons Brinckerhoff Quade & Douglas. Estimating the Benefits and Costs of Public Transit Projects: A Guidebook for Practitioners; Technical Report TCRP Report 78; Transportation Research Board: Washington, DC, USA, 2002; Available online: https://onlinepubs.trb.org/onlinepubs/tcrp/tcrp78/index.htm (accessed on 1 November 2024).
27. Google. Routing Options in OR-Tools. Available online: https://developers.google.com/optimization/routing/routing_options (accessed on 22 November 2024).
28. Team SimPy. SimPy: Discrete-Event Simulation for Python. 2024. Available online: https://simpy.readthedocs.io/ (accessed on 8 March 2025).
29. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 2016, 529, 484–489.
30. Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, J.; Hashme, S.; Hesse, C.; et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv 2019, arXiv:1912.06680.
31. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354.
32. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359.
33. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 2020, 21, 1–51.
34. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.

Figure 1. An illustration of a hypothetical street network with passenger pickup (circles) and drop-off (crosses) locations. Colors indicate matched origin–destination pairs for individual passengers, and grid labels (e.g., H3, C7) denote intersection coordinates.

Figure 2. An illustration of a hypothetical grid street network with passenger pickup (circles) and drop-off (crosses) locations, and the position of an autonomous shuttle (AS). The triangle represents the shuttle location.

Figure 3. ECDFs of average in-vehicle time across evaluation episodes. (a) Offline DARP vs. RL. (b) Nearest-neighbor vs. RL. (c) Wait-time-aware vs. RL.

Figure 4. ECDFs of average service time across evaluation episodes. (a) Offline DARP vs. RL. (b) Nearest-neighbor vs. RL. (c) Wait-time-aware vs. RL.

Table 1. Training configuration and hyperparameters across learning stages.

Training Sequence	Training Algorithm	Number of Steps	Learning Rate	n Steps	Batch Size	Clip Range	Ent. Coef.
1	GAIL	200,000	$1.0 \times 10^{- 4}$	1024	4096	0.2	0.02
2	GAIL	200,000	$1.0 \times 10^{- 4}$	1024	4096	0.2	0.01
3	GAIL	400,000	$1.0 \times 10^{- 4}$	1024	4096	0.2	0.001
4	DRL	200,000	$1.0 \times 10^{- 4}$	1024	4096	0.2	0.02
5	DRL	200,000	$1.0 \times 10^{- 4}$	1024	4096	0.2	0.01
6	DRL	400,000	$1.0 \times 10^{- 4}$	1024	4096	0.2	0.001
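
The staged schedule in Table 1 can be written down as a small configuration structure. The sketch below is one possible encoding, not the authors' implementation: the `Stage` class, the `SHARED` dictionary, and the helper functions are hypothetical names introduced here, and only the numeric values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    sequence: int
    algorithm: str  # "GAIL" or "DRL", as labeled in Table 1
    steps: int
    learning_rate: float
    n_steps: int
    batch_size: int
    clip_range: float
    ent_coef: float


# Hyperparameters that Table 1 holds constant across all six stages.
SHARED = dict(learning_rate=1.0e-4, n_steps=1024, batch_size=4096, clip_range=0.2)

# Only the step budget and the entropy coefficient change between stages.
SCHEDULE = [
    Stage(1, "GAIL", 200_000, ent_coef=0.02, **SHARED),
    Stage(2, "GAIL", 200_000, ent_coef=0.01, **SHARED),
    Stage(3, "GAIL", 400_000, ent_coef=0.001, **SHARED),
    Stage(4, "DRL", 200_000, ent_coef=0.02, **SHARED),
    Stage(5, "DRL", 200_000, ent_coef=0.01, **SHARED),
    Stage(6, "DRL", 400_000, ent_coef=0.001, **SHARED),
]


def total_steps(schedule):
    """Sum the step budgets listed in the table."""
    return sum(stage.steps for stage in schedule)


def steps_by_algorithm(schedule):
    """Group the step budgets by training algorithm."""
    totals = {}
    for stage in schedule:
        totals[stage.algorithm] = totals.get(stage.algorithm, 0) + stage.steps
    return totals
```

Read this way, each phase repeats the same pattern: the entropy coefficient is lowered from 0.02 to 0.001 over three stages, and the last stage of each phase receives the largest step budget.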

Table 2. Overall performance comparison across 1000 evaluation episodes. Lower values indicate better performance.

Method	Wait Time	In-Vehicle Time	Service Time
Offline DARP	$19.92 \pm 7.79$	$15.55 \pm 4.69$	$35.47 \pm 8.02$
Nearest Neighbor	$10.84 \pm 3.38$	$10.59 \pm 3.45$	$21.43 \pm 4.25$
Wait-Time Aware	$10.54 \pm 2.93$	$13.78 \pm 4.04$	$24.32 \pm 5.57$
GAIL + PPO (Proposed)	$22.34 \pm 6.46$	$18.47 \pm 8.46$	$40.81 \pm 9.34$

Table 3. 95% confidence intervals for episode-level performance metrics.

Method	Wait Time	In-Vehicle Time	Service Time
Offline DARP	$[19.43, 20.41]$	$[15.26, 15.84]$	$[34.97, 35.97]$
Nearest Neighbor	$[10.64, 11.04]$	$[10.38, 10.80]$	$[21.18, 21.68]$
Wait-Time Aware	$[10.36, 10.72]$	$[13.53, 14.03]$	$[23.96, 24.68]$
GAIL + PPO	$[21.93, 22.75]$	$[17.95, 18.99]$	$[40.24, 41.38]$
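
The intervals in Table 3 can be related to Table 2 with a short calculation. The sketch below assumes that the "±" values in Table 2 are sample standard deviations over the 1000 evaluation episodes and that a normal approximation (z = 1.96) is used; neither assumption is stated in the tables themselves, and the names `ci95` and `max_abs_gap` are hypothetical. Under these assumptions the computed bounds agree with the reported ones to within about 0.02, which is in line with rounding of the inputs.

```python
import math

N_EPISODES = 1000  # number of evaluation episodes reported for Table 2
Z_95 = 1.96        # assumed normal critical value for a 95% interval

# (mean, assumed standard deviation) from Table 2: wait, in-vehicle, service.
TABLE2 = {
    "Offline DARP": [(19.92, 7.79), (15.55, 4.69), (35.47, 8.02)],
    "Nearest Neighbor": [(10.84, 3.38), (10.59, 3.45), (21.43, 4.25)],
    "Wait-Time Aware": [(10.54, 2.93), (13.78, 4.04), (24.32, 5.57)],
    "GAIL + PPO": [(22.34, 6.46), (18.47, 8.46), (40.81, 9.34)],
}

# Reported 95% confidence intervals from Table 3, in the same metric order.
TABLE3 = {
    "Offline DARP": [(19.43, 20.41), (15.26, 15.84), (34.97, 35.97)],
    "Nearest Neighbor": [(10.64, 11.04), (10.38, 10.80), (21.18, 21.68)],
    "Wait-Time Aware": [(10.36, 10.72), (13.53, 14.03), (23.96, 24.68)],
    "GAIL + PPO": [(21.93, 22.75), (17.95, 18.99), (40.24, 41.38)],
}


def ci95(mean, std, n=N_EPISODES):
    """Normal-approximation interval: mean +/- z * std / sqrt(n)."""
    half_width = Z_95 * std / math.sqrt(n)
    return mean - half_width, mean + half_width


def max_abs_gap():
    """Largest absolute difference between computed and reported bounds."""
    gap = 0.0
    for method, stats in TABLE2.items():
        for (mean, std), (lo, hi) in zip(stats, TABLE3[method]):
            c_lo, c_hi = ci95(mean, std)
            gap = max(gap, abs(c_lo - lo), abs(c_hi - hi))
    return gap
```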

Table 4. Paired comparison between GAIL + PPO and baseline methods (mean difference = baseline − RL).

Comparison	Metric	Mean Difference	p-Value
Nearest vs. RL	Wait Time	$-11.50$	<0.001
Nearest vs. RL	In-Vehicle Time	$-7.88$	<0.001
Nearest vs. RL	Service Time	$-19.38$	<0.001
Wait-Aware vs. RL	Wait Time	$-11.80$	<0.001
Wait-Aware vs. RL	In-Vehicle Time	$-4.69$	<0.001
Wait-Aware vs. RL	Service Time	$-16.49$	<0.001
Offline vs. RL	Wait Time	$-2.42$	<0.01
Offline vs. RL	In-Vehicle Time	$-2.92$	<0.01
Offline vs. RL	Service Time	$-5.34$	<0.01
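
Because the mean of paired differences equals the difference of the means, the mean differences in Table 4 can be cross-checked against the means in Table 2. The sketch below performs that arithmetic check only; the dictionary layout and the `mean_diff` helper are hypothetical, and the p-values are not reproduced because they require the per-episode data.

```python
# Mean values from Table 2, in the order: wait, in-vehicle, service.
MEANS = {
    "Offline DARP": (19.92, 15.55, 35.47),
    "Nearest Neighbor": (10.84, 10.59, 21.43),
    "Wait-Time Aware": (10.54, 13.78, 24.32),
    "GAIL + PPO": (22.34, 18.47, 40.81),
}

# Mean differences reported in Table 4 (baseline minus RL), same metric order.
REPORTED = {
    "Nearest Neighbor": (-11.50, -7.88, -19.38),
    "Wait-Time Aware": (-11.80, -4.69, -16.49),
    "Offline DARP": (-2.42, -2.92, -5.34),
}


def mean_diff(baseline, rl="GAIL + PPO"):
    """Baseline mean minus RL mean for each metric, rounded to two decimals."""
    return tuple(round(b - r, 2) for b, r in zip(MEANS[baseline], MEANS[rl]))


# Compare recomputed differences with the reported ones, allowing for float rounding.
consistent = all(
    abs(computed - reported) < 0.005
    for baseline, values in REPORTED.items()
    for computed, reported in zip(mean_diff(baseline), values)
)
```

A negative difference means the baseline recorded a lower (better) time than the RL policy on that metric, which holds for every row of the table.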


Share and Cite

MDPI and ACS Style

Kim, H.; Dimitrijevic, B. Learning-Based Routing for Autonomous Shuttles Under Stochastic Demand Using Generative Adversarial Imitation Learning and Reinforcement Learning. Urban Sci. 2026, 10, 287. https://doi.org/10.3390/urbansci10050287

