1. Introduction
Autonomous vehicles (AVs), in particular shared autonomous shuttles, have received increasing attention from industry, government agencies, and academia as a potential next-generation public transportation solution. Shared autonomous mobility systems are expected to improve operational efficiency, expand service coverage, and enhance accessibility for transportation-disadvantaged populations by enabling flexible, demand-responsive transit operations [1,2,3]. As a result, autonomous shuttles have been actively piloted in campus environments, first–last mile services, and paratransit applications, including recent deployments in New Jersey and other regions worldwide [4].
From an urban systems perspective, autonomous shuttles represent a form of demand-responsive public transportation that must continuously adapt to uncertain passenger arrivals, heterogeneous trip patterns, and real-time network conditions. Unlike conventional fixed-route transit, such services operate as sequential decision-making systems embedded within complex urban environments, where routing policies directly influence service reliability, passenger equity, and overall network efficiency. Recent studies have demonstrated the applicability of reinforcement learning for urban mobility control problems such as transit operations and shared mobility coordination, highlighting the potential of learning-based methods to support adaptive transportation systems [5,6]. These control-oriented applications can be viewed as special cases of sequential decision-making under uncertainty, with routing emerging as a higher-dimensional extension that requires jointly optimizing pickup, drop-off, and movement decisions over time. However, most existing approaches focus on isolated operational tasks or simplified service settings, leaving a gap in end-to-end routing frameworks that explicitly model both pickup and drop-off decisions under stochastic urban demand.
Despite this growing interest, much of the existing research has focused on vehicle technology, safety, and system-level impacts, while comparatively less attention has been devoted to the development of operational routing and scheduling algorithms suitable for shared autonomous shuttles operating under stochastic and dynamically evolving passenger demand. Most demand-responsive transit and paratransit systems rely on variants of the dial-a-ride problem (DARP), which typically assume deterministic or fully known demand and often require centralized re-optimization as new requests arrive [7,8]. These assumptions limit their applicability to real-time autonomous shuttle operations, where passenger requests occur continuously and routing decisions must be updated online.
Recent advances in machine learning, particularly deep reinforcement learning (DRL), offer a promising alternative for addressing dynamic routing problems. By learning policies through interaction with the environment, DRL agents can adapt routing decisions in real time without repeatedly solving large-scale combinatorial optimization problems [9,10]. However, training stable and effective DRL policies for routing under stochastic demand remains challenging due to sparse rewards, high-dimensional state spaces, and the need to coordinate pickup and drop-off decisions across time.
To address these challenges, this study proposes a learning-based routing framework for autonomous shuttles operating under stochastic passenger demand. The proposed approach integrates generative adversarial imitation learning (GAIL) [11] with proximal policy optimization (PPO) [10] to accelerate policy convergence and improve training stability. The learned routing policy is evaluated in a dynamic simulation environment and benchmarked against a deterministic DARP solution implemented using Google’s OR-Tools. Performance is assessed using passenger waiting time, in-vehicle time, service completion time, and overall service efficiency.
Rather than targeting full-scale deployment, this work focuses on evaluating the feasibility of learning-based routing under stochastic demand in a controlled setting. The proposed framework enables sequential decision-making without centralized re-optimization and is evaluated against both deterministic and heuristic baselines. This formulation serves as a foundational step toward more complex and realistic autonomous shuttle routing problems, including larger networks, multiple vehicles, and additional operational constraints.
The main contributions of this paper are threefold: (1) formulation of a stochastic autonomous shuttle routing problem within a reinforcement learning framework; (2) development of an imitation-learning-assisted DRL training pipeline for dynamic pickup and drop-off routing; and (3) systematic benchmarking of the learned policy against a conventional DARP solver under controlled stochastic demand scenarios.
2. Literature Review
The routing and scheduling of shared autonomous shuttles is closely related to the classical dial-a-ride problem (DARP), which seeks to determine optimal pickup and drop-off sequences subject to time windows, vehicle capacity constraints, and passenger service quality objectives. Early formulations of DARP focused on deterministic settings with fully known demand, using exact methods such as dynamic programming and branch-and-bound, as well as heuristic approaches including insertion heuristics, local search, and interchange methods [12,13,14]. These methods have been successfully applied to paratransit services, school bus routing, and company fleet operations.
Subsequent research extended DARP formulations to incorporate stochastic elements such as travel time variability, request cancellations, and vehicle disruptions [15]. Although these extensions improved robustness, most approaches continue to rely on centralized optimization and assume that passenger requests are known in advance or arrive at discrete re-optimization intervals. As a result, they remain computationally expensive and poorly suited for the continuous, real-time routing decisions required by autonomous shuttle systems operating in highly dynamic environments.
In parallel, studies on shared autonomous vehicles (SAVs) and dynamic ride-sharing have explored simulation-based and optimization-based frameworks to assess system-level impacts and operational feasibility. Many of these studies impose simplifying assumptions, such as restricting ride-sharing to passengers with closely aligned spatiotemporal characteristics or aggregating pickup and drop-off locations into zones [16,17]. While these assumptions reduce computational complexity, they may compromise service equity and limit applicability to paratransit populations requiring door-to-door service [18].
More recently, reinforcement learning and Markov decision process (MDP) formulations have been investigated for stochastic vehicle routing problems. Learning-based approaches offer the advantage of producing real-time decisions without repeated global optimization [3,19]. However, many existing applications focus primarily on pickup decisions, neglect explicit modeling of passenger destinations, or are limited to single-vehicle or simplified service scenarios. Moreover, training instability and sparse reward signals remain significant barriers to practical deployment.
Recent advances in transportation systems have introduced new challenges that extend beyond classical vehicle routing formulations. In particular, electric vehicle routing problems (EVRP) incorporate battery limitations and charging strategies, significantly increasing operational complexity [20,21]. These constraints require routing approaches that jointly consider energy consumption, charging infrastructure, and service efficiency.
Furthermore, the emergence of connected vehicle technologies and communication infrastructures such as 5G enables real-time data exchange and coordination across transportation systems, supporting dynamic and adaptive decision-making [22]. These developments further emphasize the need for routing strategies that operate effectively under uncertainty and evolving system states.
Reinforcement learning (RL) has been increasingly explored as a framework for decision-making in transportation systems. Prior work has applied RL to vehicle-level control problems, including cooperative velocity planning and lane-changing in connected electric vehicles [23]. These studies demonstrate the effectiveness of RL for adaptive and cooperative decision-making but remain focused on vehicle-level control rather than system-level routing and dispatching under stochastic passenger demand.
At the system level, dynamic ride-sharing and mobility-on-demand problems have been studied to address real-time matching and routing of vehicles and passengers. For example, Alonso-Mora et al. [24] proposed a dynamic trip-vehicle assignment framework for high-capacity ride-sharing, highlighting the complexity of real-time fleet coordination under stochastic demand.
Despite these advances, existing approaches either rely on deterministic optimization with strong assumptions (e.g., full demand knowledge) or focus on localized control problems. There remains a need for learning-based routing frameworks that can operate under stochastic demand, limited information, and dynamic environments. This gap motivates the proposed integration of imitation learning and reinforcement learning for autonomous shuttle routing.
Despite growing interest in learning-based routing, a critical gap remains in the development of reinforcement learning frameworks that explicitly model both pickup and drop-off decisions, operate under stochastic passenger arrivals, and are evaluated against established optimization-based benchmarks. This study addresses this gap by integrating imitation learning with deep reinforcement learning to train an autonomous shuttle routing policy and by systematically comparing its performance to a deterministic DARP solver under controlled experimental conditions.
5. Materials and Methods
Existing DARP heuristics and optimization algorithms provide an optimal (or near-optimal) route and schedule (sequence) for picking up passengers, considering their desired pickup and arrival times. This is achieved by optimizing the DARP objective function subject to constraints, with all variables known in advance, such as the number of passengers, the time windows of their desired pickup and drop-off times, and their pickup and drop-off locations. To exemplify the limitation of the traditional DARP algorithm, consider a simple 10 by 10 grid representing a street network, with known passenger origin and destination information, as shown in Figure 1. In Figure 1, the small circles represent the origins of the passengers requesting rides, and the crosses represent their destinations. The matching origin and destination (i.e., for the same passenger trip) are shown in the same color. For example, P1 represents the origin of passenger 1, located at intersection H3 and represented by a red circle; the destination of passenger 1 is at intersection C7, represented by a red cross. The origin and destination pairs for different passengers are shown in different colors. As mentioned previously, depending on the time windows of the passengers’ desired pickup and arrival times, traditional DARP methods can be used to determine the optimal spatiotemporal departure point and the sequence of pickups and drop-offs of these passengers, as illustrated in Figure 1.
Now, let us consider a scenario in which the passenger trip requests and trip information are not available in advance. The three grid plots in Figure 2 represent the locations of passenger trip requests and the location of an autonomous shuttle (AS1) in the street network at three different times: the left plot is at time T = 0, the middle plot at T = 1, and the right plot at T = 2. At T = 0, only one passenger, P1, has requested service, and the shuttle AS1 is just two vertices away. At T = 1, AS1 is moving towards P1. However, at T = 1, another passenger, P2, requests service and is four vertices away from the origin of P1. Then, at T = 2, another passenger, P3, requests service. At T = 2, AS1 has to decide between different options for servicing the passengers. For example, one option (let us call it Option 1) would be to pick up P2, drop off P1, drop off P2, pick up P3, and then drop off P3. Another option (say, Option 2) would be to pick up P2, then drop off P2 on the way to P3, pick up P3, then drop off P1, and lastly drop off P3. As a combinatorial problem, the passenger service plan has multiple solutions, but traditional route-planning optimization methods cannot dynamically add and remove passengers while the shuttle is on the move, let alone for multiple shuttles at once.
5.1. ASP Problem Formulation as a Markov Decision Process (MDP)
Though not directly focused on routing optimization, the work of Nijs et al. [25] on the constrained multiagent Markov decision process (CMMDP) provides a useful foundation for designing and conceptualizing the autonomous shuttle problem (ASP) within an MDP framework. CMMDP aims to support decision-making for multiple agents sharing resources under uncertainty. The authors determine which variations of CMMDP serve best depending on the objective, type of constraint, and observability of the domain. An MDP can be formulated as a tuple representing the state space, action space, transition function, reward function, and discount factor, respectively, as shown in Equation (1):

$$\langle S, A, T, R, \gamma \rangle \tag{1}$$
In MDP, the agent must decide an appropriate action that will produce the highest reward depending on the current state. Also, depending on the weight of the discount factor, the agent can value immediate reward more than the future reward or vice versa.
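To make the role of the discount factor concrete, the following minimal sketch computes the discounted return of a short reward sequence for two values of the discount factor. The function name `discounted_return` and the reward values are illustrative assumptions and are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical episode in which the only reward arrives at the third step.
rewards = [0.0, 0.0, 1.0]

# A small discount factor shrinks the delayed reward strongly,
# whereas a factor close to 1 values it almost as much as an immediate reward.
print(discounted_return(rewards, 0.5))
print(discounted_return(rewards, 0.99))
```

With a small discount factor the agent effectively prefers immediate rewards; with a factor near 1 it weighs future pickups and drop-offs almost as heavily as current ones.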
In the context of ASP, the state represents the positions of all AS, whether the shuttles have passengers onboard, the locations of all passengers and their elapsed wait time, travel time, and their destinations, including both the passengers waiting for a pickup and those already onboard the shuttles. The action space is defined as movements of the shuttle that can maneuver the shuttle on a grid including moving forward or turning left and right.
In a conventional MDP, the transition probability models the likelihood of reaching each possible next state when a single action can lead to multiple outcomes. This probabilistic outcome introduces randomness that is appropriate for modeling stochastic action effects. In this research problem, however, an explicit transition probability is unnecessary because the outcomes of the shuttle’s actions are deterministic: if a shuttle picks up passengers, they are picked up, with no other possible outcome. Stochasticity enters the problem through passenger arrivals rather than through action outcomes.
The reward function is set to guide the agent to reach the overall objective. A careful and thorough reward shaping process is necessary to prevent erratic behavior while promoting actions that align with objectives such as finding the shortest path and optimally routing to maximize efficiency of transporting passengers. A more comprehensive setting of the state information, the action space, and reward function for the ASP are covered in the subsequent sections.
5.2. Proposed Method: Imitation Learning–Assisted Deep Reinforcement Learning
This study proposes a two-stage learning framework that integrates imitation learning and deep reinforcement learning (DRL) to train an autonomous shuttle agent for dynamic routing under stochastic passenger demand. The proposed approach combines the rapid policy initialization capabilities of imitation learning with the long-term optimality and adaptability of reinforcement learning, resulting in a more stable and efficient training process than using reinforcement learning alone.
5.2.1. Overview of the Learning Pipeline
Training begins with imitation learning using generative adversarial imitation learning (GAIL), where the agent learns an initial routing policy by imitating expert behavior. This stage provides the agent with a structured understanding of the task objective and reduces the exploration burden associated with sparse and delayed rewards. Once the policy converges under imitation learning, the learned network weights are transferred to a deep reinforcement learning agent. The agent then continues training using direct environmental feedback, allowing further policy refinement through exploration and exploitation. This two-stage pipeline leverages the strengths of both learning paradigms while mitigating their individual limitations.
5.2.2. Imitation Learning via Generative Adversarial Imitation Learning (GAIL)
GAIL formulates imitation learning as an adversarial optimization problem involving two components: a policy (agent) and a discriminator (adversary) [11]. The discriminator is trained to distinguish between state–action pairs generated by the agent and those obtained from expert demonstrations recorded by a human operator. Concurrently, the agent learns to generate trajectories that are indistinguishable from expert behavior.
Formally, GAIL solves the following minimax optimization problem:

$$\min_{\pi_\theta} \max_{D} \; \mathbb{E}_{\pi_E}\left[\log D(s,a)\right] + \mathbb{E}_{\pi_\theta}\left[\log\left(1 - D(s,a)\right)\right]$$

where $\pi_E$ denotes the expert policy, $\pi_\theta$ is the agent policy, and $D(s,a)$ is the discriminator output indicating whether a state–action pair is expert-generated.

During training, the discriminator provides a learned reward signal to the agent, defined as

$$r(s,a) = -\log\left(1 - D(s,a)\right)$$

which replaces the environment reward. By learning from this adversarial feedback, the agent rapidly acquires expert-like routing behavior without requiring a hand-crafted reward signal. While this mechanism provides an effective initialization, GAIL training is inherently limited by the quality of expert demonstrations and may lead to premature convergence if used alone.
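The adversarial feedback loop can be illustrated with a small numerical sketch. It assumes the common convention in which the discriminator output approaches 1 for expert-like state–action pairs and the surrogate reward is −log(1 − D); the function names, the `eps` stabilizer, and the sample discriminator outputs are hypothetical and do not reproduce the training code used in this study.

```python
import math


def discriminator_loss(d_expert, d_agent, eps=1e-8):
    """Binary cross-entropy with expert pairs labelled 1 and agent pairs labelled 0.

    d_expert and d_agent are lists of discriminator outputs in (0, 1).
    """
    expert_term = -sum(math.log(d + eps) for d in d_expert) / len(d_expert)
    agent_term = -sum(math.log(1.0 - d + eps) for d in d_agent) / len(d_agent)
    return expert_term + agent_term


def gail_reward(d_value, eps=1e-8):
    """Surrogate reward for the agent (assumed form: -log(1 - D))."""
    return -math.log(1.0 - d_value + eps)


# Hypothetical discriminator outputs for two agent-generated pairs.
print(gail_reward(0.1))  # pair judged unlike the expert: small reward
print(gail_reward(0.9))  # pair judged expert-like: larger reward
```

Under this convention the agent is rewarded more for actions the discriminator mistakes for expert behavior, which is what drives the policy toward the demonstrations.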
5.2.3. Policy Refinement via Proximal Policy Optimization (PPO)
To overcome the limitations of imitation learning and enable further performance improvements, the GAIL-trained policy is transferred to a deep reinforcement learning agent and refined using proximal policy optimization (PPO) [10]. PPO is selected due to its robustness, sample efficiency, and strong empirical performance in stochastic and dynamic environments.
PPO constrains policy updates by optimizing a clipped surrogate objective, which limits the deviation between successive policy iterations. The probability ratio used to compare the updated policy with the previous policy is defined as

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The PPO clipped surrogate objective is given by

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $\hat{A}_t$ is the advantage function measuring the relative quality of an action compared to the expected value of the current state, and $\epsilon$ is a hyperparameter that controls the allowable magnitude of policy updates. By clipping the probability ratio, PPO prevents excessively large updates that may destabilize learning while maintaining sufficient exploration.
In addition to policy optimization, PPO jointly updates a value function to improve estimates of future returns and incorporates entropy regularization to prevent premature convergence to deterministic policies. Through continued interaction with the environment, the agent refines the imitation-initialized policy and learns to adapt to situations beyond the expert demonstrations. This combined imitation learning and reinforcement learning framework enables efficient convergence and robust policy learning for autonomous shuttle routing in dynamic environments.
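The effect of clipping can be seen in a per-sample sketch of the surrogate objective. The function name `clipped_surrogate` is illustrative, and the default clipping parameter of 0.2 is an assumed value commonly used with PPO, not necessarily the one used in this study.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate term.

    ratio     : probability ratio between the updated and previous policy
    advantage : estimated advantage of the sampled action
    epsilon   : clipping parameter (0.2 is an assumed default)
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the action probability is capped.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage with a large ratio: the unclipped (more pessimistic) term is kept.
print(clipped_surrogate(1.5, -1.0))
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound, so the policy gains nothing from moving the ratio further outside the clipping interval in the favorable direction.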
5.3. ASP Environment Set Up
A custom environment for simulating the ASP in a discrete simulation space was set up using Python 3.8.10 with the Gym and Stable Baselines3 packages, on a computer with an AMD Ryzen 9 5900HS CPU, an NVIDIA GeForce RTX 3080 Laptop GPU, and 32 GB of RAM. Gym, developed by OpenAI, is an open-source standardized toolkit for developing and comparing DRL algorithms in customizable environments.
The custom ASP simulation environment was developed as a grid-based street network, with an agent controlling a shuttle initially placed at coordinate (5,5), facing north, with three randomly positioned passengers. After every 10 steps (where a step represents one unit of simulation time), an additional passenger is generated at a random location until a total of six passengers is reached. Each simulation, equivalent to an episode, terminates when either 500 steps are reached or all six passengers have been successfully transported to their destinations.
The simulation environment incorporates a predefined orientation constraint, such that the shuttle must maintain a heading (north, south, east, or west) and cannot move freely in all directions without adjusting its orientation, reflecting realistic vehicle movement behavior. The environment is modeled as a discrete grid, where each node represents a location and movement is restricted to adjacent nodes. Distance between two locations is measured using the Manhattan distance metric, and time is represented in discrete steps. At each time step, the vehicle moves exactly one grid unit, corresponding to a constant speed of one grid unit per time step.
As a result, travel time between any two locations is equal to their Manhattan distance. All reported performance metrics—including passenger waiting time, in-vehicle time, and service time—are therefore expressed in units of discrete time steps, which are directly equivalent to grid-based travel distance.
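The distance metric and the passenger generation schedule described above can be sketched as follows. The function names are hypothetical, and the exact step at which each additional passenger appears (here, steps 10, 20, and 30) is an assumption made for illustration.

```python
def manhattan(a, b):
    """Manhattan distance between two grid nodes given as (x, y) tuples."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def passengers_generated(step, initial=3, interval=10, total=6):
    """Number of passengers generated up to a given step.

    Three passengers exist at the start; one more appears every `interval`
    steps until `total` is reached (the exact timing is an assumption).
    """
    return min(total, initial + step // interval)


shuttle_start = (5, 5)
pickup = (7, 8)  # hypothetical passenger origin
print(manhattan(shuttle_start, pickup))  # distance in grid units
print([passengers_generated(s) for s in (0, 9, 10, 20, 30, 100)])
```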
Despite this structured environment design, early training experiments indicated that the agent struggled to learn coherent routing behaviors when exposed to the full state space. In particular, the agent frequently exhibited unstable behaviors such as circling, idling near passenger locations, or failing to follow shortest paths toward pickup and drop-off points, even after extensive training. This suggests that the agent was unable to reliably infer short-horizon navigational structure solely from sparse reward signals in a stochastic environment.
To address this challenge, a lightweight environment-level guidance mechanism based on the breadth-first search (BFS) algorithm was incorporated. BFS computes shortest-path distances between the shuttle’s current position and candidate pickup or drop-off locations, allowing the environment to dynamically identify relevant pickup and drop-off targets at each decision step.
Importantly, this mechanism does not prescribe a fixed trajectory. Instead, it constrains the action space to feasible shortest-path movements while still allowing multiple valid actions due to the existence of multiple shortest paths in the grid network. As a result, the agent is presented with a set of candidate actions rather than a single deterministic choice.
Within this structured action space, the DRL agent retains full decision-making responsibility, including selecting which target to prioritize (e.g., pickup versus drop-off) and which path to follow among available shortest-path options. This formulation therefore extends beyond simple task sequencing and enables routing decisions under constrained but non-trivial action choices.
Overall, this design reduces effective state complexity and stabilizes learning while preserving agent autonomy, striking a balance between exploration feasibility and decision-making flexibility without incurring the combinatorial complexity of fully unconstrained routing.
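A minimal sketch of this kind of guidance is given below: BFS distances are computed from the target, and every neighboring node that is one step closer is returned as a candidate move. The 10 by 10 grid size, the function names, and the omission of the shuttle heading are simplifying assumptions for illustration, not the implementation used in the experiments.

```python
from collections import deque

# Unit moves to the four adjacent grid nodes.
NEIGHBOR_OFFSETS = ((0, 1), (0, -1), (1, 0), (-1, 0))


def bfs_distances(start, size=10):
    """Shortest-path step counts from `start` to every node of a size x size grid."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        for dx, dy in NEIGHBOR_OFFSETS:
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in dist:
                dist[nxt] = dist[(x, y)] + 1
                queue.append(nxt)
    return dist


def shortest_path_moves(position, target, size=10):
    """Adjacent nodes of `position` that lie on some shortest path to `target`."""
    dist_to_target = bfs_distances(target, size)
    here = dist_to_target[position]
    moves = []
    for dx, dy in NEIGHBOR_OFFSETS:
        nxt = (position[0] + dx, position[1] + dy)
        if nxt in dist_to_target and dist_to_target[nxt] == here - 1:
            moves.append(nxt)
    return moves


# Two candidate moves exist because the target is offset in both axes.
print(shortest_path_moves((5, 5), (7, 8)))
```

When the target is offset along both axes, more than one neighbor is returned, which is the situation in which the agent still has to choose among equally short paths.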
5.4. DRL Observation Space
The observation space available to the agent includes: location, distance, and orientation towards the nearest passenger; location, distance, and orientation towards the nearest drop-off location; and the current shuttle position and orientation. Though the observation space may seem overly simplified for the ASP, this configuration empirically yielded the best performance. Until the problem was relaxed with a custom guidance system and a simplified observation space, the agent was not able to learn or produce coherent actions, contrary to theoretical expectations. The ASP DRL environment was first configured similarly to how a DARP would be solved, providing as much information as possible and then waiting for a solution to be developed. In the initial stage of model development, the observation space included all passengers’ elapsed waiting time, elapsed in-vehicle time, the upper limits of all passengers’ waiting time and arrival time, and the number of passengers onboard the shuttle, together with the current observations, for a total of 60 dimensions in the observation array. As the number of observations was incrementally reduced, the current observation space showed signs of improvement over time and was therefore adopted.
5.5. DRL Action Space
To prevent the agent from simply traversing up and down the grid environment, an orientation mechanism was implemented to introduce more realistic shuttle movement. The discrete action space for the agent can therefore be formulated as Equation (11):

$$A = \{\text{forward},\ \text{turn left then forward},\ \text{turn right then forward}\} \tag{11}$$

As described, the agent can go forward, turn left then go forward, or turn right then go forward. “Going forward” was added to the turning movements because situations were observed in the trials where the agent was only turning and not producing coherent maneuvers within the grid to pick up or drop off passengers. To design a more robust learning opportunity and an action space that is aligned with the reward function, making a turn without moving forward was not included as an action choice.
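The three actions can be expressed as a small transition function over the shuttle position and heading. The coordinate convention (north as +y), the heading encoding, and the names below are assumptions made for this sketch; grid boundary handling is deliberately omitted.

```python
# Headings in clockwise order: north, east, south, west (assumed: north = +y).
HEADING_VECTORS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

# Hypothetical integer codes for the three discrete actions.
FORWARD, LEFT_THEN_FORWARD, RIGHT_THEN_FORWARD = 0, 1, 2


def apply_action(position, heading, action):
    """Return the new (position, heading) after one action.

    A turn is always followed by a forward move, so every action
    advances the shuttle by exactly one grid unit.
    """
    if action == LEFT_THEN_FORWARD:
        heading = (heading - 1) % 4
    elif action == RIGHT_THEN_FORWARD:
        heading = (heading + 1) % 4
    elif action != FORWARD:
        raise ValueError(f"unknown action: {action}")
    dx, dy = HEADING_VECTORS[heading]
    return (position[0] + dx, position[1] + dy), heading


# Shuttle at (5, 5) facing north (heading index 0).
print(apply_action((5, 5), 0, FORWARD))
print(apply_action((5, 5), 0, LEFT_THEN_FORWARD))
print(apply_action((5, 5), 0, RIGHT_THEN_FORWARD))
```

Because no action turns the shuttle in place or reverses it, the agent cannot accumulate steps without changing position.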
5.6. DRL Reward Function
The reward function helps the agent achieve the objective by providing guidance in the form of reward or punishment. The reward function must match the observation space and the objective of the simulation environment for effective learning. Therefore, in the ASP DRL environment, the reward function is engineered to minimize the overall passenger waiting and traveling times by following (an optimal) passenger pickup and drop-off sequence and traveling along the shortest path en route to pick up or drop off passengers. Several types of rewards are considered in the ASP DRL model: base rewards, vicinity rewards, inverse travel time rewards, and early completion rewards. The base rewards for picking up and dropping off are set to 0.2 and 0.15, respectively, to encourage pickups more than drop-offs, in line with the transit user cost theory that people value wait time up to two to three times more than in-vehicle time [26]. In this initial research experiment, the pickup base reward was only slightly higher than the drop-off reward, but these values can be adjusted after examining the performance of the trained agent. Vicinity rewards are calculated by setting up vicinities of 5 distance units around passenger pickup and drop-off locations and giving incrementally greater reward as the agent gets closer to the locations. The vicinity reward calculation is shown below in Equation (12).
where:
As shown in Equation (12), the vicinity reward is inversely proportional to the distance between the agent and a pickup or drop-off location: the closer the agent gets to one of these locations, the more reward it receives. Since traveling along the shortest path is equally important for both pickups and drop-offs, their vicinity rewards are the same. In a transit system it is important to transport all passengers as quickly as possible. Thus, the agent is rewarded for completing passenger pickups and drop-offs quickly. This is accomplished with the inverse travel time reward, or speed bonus, which is calculated using Equation (13).
where:
The elapsed time is calculated for each passenger and includes elapsed wait time or elapsed in-vehicle time depending on the status of the passenger. If a passenger is in waiting status, the elapsed time measures the time since the passenger demand was generated. If the passenger is onboard a shuttle, the elapsed time measures their current in-vehicle travel time. For example, if a passenger was generated at step 0, picked up at step 7, and dropped off at step 20, then their elapsed wait time would be 7 and their elapsed in-vehicle time would be 13. Therefore, the agent receives more reward the faster it picks up and drops off passengers, that is, the shorter the elapsed wait time and elapsed in-vehicle time. Also, at the completion of either a pickup or a drop-off, the rewards are added so that the agent receives both a base reward and an inverse travel time reward. Lastly, since the objective of the agent is to transport all the passengers generated in an episode as early as possible, an additional reward is given for ending the episode as quickly as possible. The early completion reward is calculated as follows:

$$R_{\text{early}} = 0.01 \times \left(500 - t_{\text{end}}\right)$$

where $t_{\text{end}}$ is the step at which the episode ends and $500 - t_{\text{end}}$ is the remaining time.
An episode terminates upon either reaching 500 steps or transporting all six passengers. The agent does not receive the early completion bonus if it fails to transport all passengers in an episode. As the equation shows, the early completion bonus is calculated by multiplying 0.01 by the remaining time, which is the difference between 500 and the episode end time. The agent can therefore earn a higher early completion bonus by decreasing the episode end time and thereby increasing the remaining time.
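The numerical values stated above can be collected in a short sketch that reproduces the worked elapsed-time example and the early completion bonus. The constant and function names are hypothetical; the vicinity and inverse travel time rewards are omitted because their exact forms are not restated here.

```python
PICKUP_BASE_REWARD = 0.2       # base reward for a pickup
DROPOFF_BASE_REWARD = 0.15     # base reward for a drop-off
MAX_STEPS = 500                # episode step limit
EARLY_COMPLETION_RATE = 0.01   # multiplier applied to the remaining time


def elapsed_times(generated_step, pickup_step, dropoff_step):
    """Return (elapsed wait time, elapsed in-vehicle time) for one passenger."""
    return pickup_step - generated_step, dropoff_step - pickup_step


def early_completion_bonus(end_step, all_served):
    """Bonus of 0.01 per remaining step, granted only if every passenger was served."""
    if not all_served:
        return 0.0
    return EARLY_COMPLETION_RATE * (MAX_STEPS - end_step)


# Worked example from the text: generated at step 0, picked up at 7, dropped off at 20.
print(elapsed_times(0, 7, 20))
# Hypothetical episode that serves all six passengers by step 200.
print(early_completion_bonus(200, all_served=True))
# An episode that hits the step limit without serving everyone earns no bonus.
print(early_completion_bonus(500, all_served=False))
```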
The reward function was designed to align with key operational objectives, including minimizing passenger waiting time, in-vehicle travel time, and overall service time. The relative magnitudes of reward components were iteratively tuned to balance these objectives and promote stable learning behavior.
The early completion bonus encourages global efficiency by rewarding the completion of all passenger services in fewer time steps. Since this bonus is granted only when all passengers have been served, it reinforces efficient routing rather than inducing short-sighted behavior.
While the reward parameters were not derived through formal optimization, they provide a practical and effective approximation of system-level objectives. Future work may explore more systematic approaches to reward design, including multi-objective optimization and data-driven calibration.
6. Results
Following the two-stage training procedure described above, 125 expert demonstration episodes were generated within the same simulation environment. These demonstrations were constructed using a custom breadth-first search (BFS)-based routing mechanism that ensures shortest-path navigation to pickup and drop-off locations on the grid network. Due to the grid structure, multiple shortest paths may exist; therefore, the BFS provides a set of equally optimal next-step actions rather than a single deterministic trajectory.
The expert policy resolves only the low-level navigation task—i.e., how to reach a selected target via shortest paths—while leaving higher-level decision-making (e.g., whether to prioritize pickups or drop-offs) unspecified. As a result, the demonstrations primarily encode efficient movement behavior rather than full task optimality.
These demonstrations were used within the GAIL framework to initialize the policy, effectively addressing the early-stage exploration problem where the agent otherwise fails to discover feasible routes to passengers. The resulting policy was subsequently fine-tuned using PPO to learn higher-level routing strategies that optimize passenger service metrics. The hyperparameters used across both training stages are summarized in Table 1.
The entropy coefficient was annealed to slowly discourage exploration over time, while encouraging the agent to exploit the policy it learned to refine the optimal routing policy based on the locations of passenger pickups and drop-offs. Different combinations of the hyperparameters were tested, such as the learning rate ranging from
to
,
n_steps from 64 to 1024, and batch_size from 256 to 8192. Of these, the hyperparameter values presented in
Table 1 yielded the best performance. Given the stochastic environment, the rollout length (n_steps) and batch size were set large enough that each policy update draws on sufficient collected experience.
Furthermore, a low learning rate and constrained policy updates were employed to promote effective generalization by the agent.
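As an illustration of entropy-coefficient annealing, the following sketch uses a linear decay. The schedule shape, the function name `annealed_entropy_coef`, and the default `start` and `end` values are assumptions chosen for illustration only; the paper does not state them here, and the values actually used are those reported in Table 1.

```python
def annealed_entropy_coef(step, total_steps, start=0.01, end=0.001):
    """Linearly interpolate the entropy coefficient from `start` to `end`.

    `start` and `end` are placeholder values, not the paper's settings.
    """
    # Clamp training progress to [0, 1] so the coefficient never overshoots.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + progress * (end - start)
```

A larger coefficient early in training keeps the action distribution broad, while the smaller value later lets the policy commit to the routing behavior it has learned.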
To evaluate the effectiveness of the trained policy and the environment-level guidance mechanism in the absence of established benchmarks for the autonomous shuttle problem (ASP), a deterministic dial-a-ride problem (DARP) formulation was implemented using Google’s OR-Tools. In this benchmark, passenger demand was assumed to be fully known in advance, providing an upper-bound reference for routing performance. Because the proposed DRL-based approach optimizes routing decisions online without prior knowledge of future passenger requests, it is expected to yield inferior objective values relative to the deterministic DARP solution. Consequently, performance proximity to the DARP benchmark is interpreted as an indication of effective real-time decision-making.
To provide a more appropriate and fair comparison under online and stochastic conditions, two additional heuristic baselines were implemented: a nearest-neighbor heuristic and a wait-time-aware heuristic. Unlike the offline DARP formulation, these methods operate under the same information constraints as the DRL agent, making decisions sequentially based only on currently available passenger requests.
The nearest-neighbor heuristic selects the next target (either a pickup or drop-off location) based on minimum Manhattan distance from the shuttle’s current position, prioritizing immediate spatial efficiency as a simple greedy baseline. The wait-time-aware heuristic extends this approach by incorporating passenger waiting time into the decision process, balancing travel distance and accumulated waiting time to prioritize passengers who have been waiting longer. This provides a stronger baseline that explicitly accounts for passenger-centric performance metrics such as waiting and service time.
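The two online baselines can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the `Request` container, the additive scoring rule (distance minus a weighted waiting time), and the `wait_weight` parameter are assumptions, and vehicle capacity is ignored. Both functions expect a non-empty list of requests that have already arrived and are not yet completed.

```python
from dataclasses import dataclass


@dataclass
class Request:
    pickup: tuple       # (x, y) pickup cell
    dropoff: tuple      # (x, y) drop-off cell
    arrival_step: int   # time step at which the request appeared
    onboard: bool = False


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def candidate_targets(requests):
    """Yield (request, target cell): drop-off if onboard, otherwise pickup."""
    for r in requests:
        yield r, (r.dropoff if r.onboard else r.pickup)


def nearest_neighbor(position, requests):
    """Greedy baseline: choose the target with minimum Manhattan distance."""
    return min(candidate_targets(requests),
               key=lambda rt: manhattan(position, rt[1]))


def wait_time_aware(position, requests, step, wait_weight=0.5):
    """Trade distance against accumulated waiting time (assumed scoring rule)."""
    def score(rt):
        r, target = rt
        waited = 0 if r.onboard else step - r.arrival_step
        return manhattan(position, target) - wait_weight * waited
    return min(candidate_targets(requests), key=score)
```

With this scoring rule, a distant passenger who has waited long enough can outrank a nearby passenger who has only just arrived, which is the behavior the wait-time-aware baseline is meant to capture.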
To enable consistent and comparable evaluation across all methods, identical realized demand scenarios were used. For each of the 1000 evaluation episodes, passenger generation information—including arrival times, pickup locations, and drop-off locations—was recorded and subsequently replayed across the DARP model and both online heuristic baselines. This eliminates variability due to stochastic passenger generation and allows direct, paired comparisons of performance metrics.
Rather than relying on explicit random seed control, reproducibility is achieved through deterministic replay of the recorded demand scenarios, ensuring that evaluation outcomes remain consistent and independent of stochastic variations in passenger generation.
Additional implementation details for the DARP benchmark—including routing constraints, cost structure, and search strategies (e.g., PARALLEL_CHEAPEST_INSERTION and GUIDED_LOCAL_SEARCH)—are provided. The codebase, along with evaluation scripts and representative output files, is publicly available, as described in the Data Availability Statement.
After training, key operational statistics—including passenger pickup and drop-off locations and times, waiting times, and in-vehicle travel times—were recorded at the end of each validation episode. The same passenger demand realizations were then applied to the DARP model implemented using the Google OR-Tools Python package, employing the pywrapcp and routing_enums_pb2 modules from ortools.constraint_solver, as well as to the two online heuristic models, to generate benchmark solutions for comparison.
In each episode, a total of six passengers were generated at random locations within the service network. Three passengers (P1–P3) were generated at the start of the episode, followed by the arrival of one additional passenger every 10 time steps until all six passengers had been introduced. To ensure consistency between the online DRL setting and the offline DARP benchmark, pickup time constraints were imposed in the DARP formulation such that passenger P4 could not be picked up before step 10, P5 before step 20, and P6 before step 30. No explicit constraints were placed on maximum waiting time or in-vehicle travel time.
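The arrival pattern and the corresponding earliest-pickup times can be sketched as a recorded scenario that is later replayed for every method. The grid dimensions, the dictionary layout, the function name `generate_scenario`, and the requirement that a drop-off differ from its pickup are assumptions made for this illustration. A random generator is used only to create the record; evaluation then reuses the stored record rather than regenerating demand.

```python
import random


def generate_scenario(rng, width, height, n_initial=3, n_total=6, interval=10):
    """Create one recorded demand scenario as a list of passenger records."""
    scenario = []
    for i in range(n_total):
        # P1-P3 appear at step 0; each later passenger arrives `interval` steps apart.
        release = 0 if i < n_initial else (i - n_initial + 1) * interval
        pickup = (rng.randrange(width), rng.randrange(height))
        dropoff = pickup
        while dropoff == pickup:  # assumed: a trip must end at a different cell
            dropoff = (rng.randrange(width), rng.randrange(height))
        scenario.append({
            "id": f"P{i + 1}",
            "release": release,  # earliest step at which pickup is allowed
            "pickup": pickup,
            "dropoff": dropoff,
        })
    return scenario
```

The `release` field plays the role of the pickup time constraint in the offline benchmark, while the online methods simply do not see a passenger before that step.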
The service network was modeled as a grid, where movement between adjacent nodes incurred a unit cost of one time step. Travel costs and shortest paths were computed using the Manhattan distance metric, consistent with the grid-based environment used for DRL training.
For the DARP benchmark, OR-Tools was configured by specifying paired pickup and drop-off locations, pickup time constraints, vehicle capacity, and pickup–drop-off precedence constraints, with the objective of minimizing total travel cost across all passenger services. An initial routing solution was generated using the
PARALLEL_CHEAPEST_INSERTION heuristic, followed by refinement using the
GUIDED_LOCAL_SEARCH metaheuristic. These general vehicle routing problem (VRP) strategies iteratively improve solution quality through cost-based insertion and local neighborhood exploration [
27].
Table 2 summarizes the average episode-level performance of all methods over 1000 evaluation episodes. The heuristic baselines achieved lower mean waiting, in-vehicle, and service times than the proposed GAIL + PPO agent. In particular, the wait-time-aware heuristic achieved the lowest average waiting time, whereas the nearest-neighbor baseline achieved the lowest average in-vehicle and service times. These findings suggest that the current learned policy remains less competitive than the selected benchmarks with respect to these time-based performance measures.
The confidence intervals in
Table 3 indicate relatively low variability across evaluation episodes, suggesting stable performance across all methods.
Table 4 presents paired comparisons between the proposed GAIL + PPO agent and baseline methods. Across all metrics, the heuristic baselines achieve lower time-based performance values than the RL agent, and these differences are statistically significant (
p < 0.01). The magnitude of the differences is particularly pronounced for service time when compared to the nearest-neighbor heuristic.
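A paired comparison of this kind can be sketched as below. The specific statistical test used in the paper is not stated in this section, so the sketch assumes a paired t-type statistic on per-episode differences together with a normal-approximation 95% confidence interval; the function name `paired_summary` and the sign convention (baseline minus RL) are illustrative choices.

```python
import math
import statistics


def paired_summary(baseline, rl, z=1.96):
    """Summarize per-episode differences (baseline - RL) for one metric."""
    diffs = [b - r for b, r in zip(baseline, rl)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    # Standard error of the mean difference from the sample standard deviation.
    se = statistics.stdev(diffs) / math.sqrt(n)
    if se == 0:
        raise ValueError("differences are constant; the statistic is undefined")
    return {
        "mean_diff": mean,
        "ci": (mean - z * se, mean + z * se),  # normal approximation
        "t": mean / se,                        # paired t statistic, n - 1 d.o.f.
    }
```

Because the same recorded demand scenarios are replayed for every method, the differences are taken episode by episode, which removes scenario-to-scenario variation from the comparison.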
While
Table 2,
Table 3 and
Table 4 provide aggregate performance comparisons across 1000 evaluation episodes, these summary statistics do not fully capture the variability of outcomes under stochastic demand. To complement this macroscopic analysis, we examine episode-level performance differences using empirical cumulative distribution functions (ECDFs), which provide a distributional view of how the proposed method compares to baseline approaches across individual realizations.
Figure 3 shows the ECDFs of the average in-vehicle travel time differences between the baseline methods and the RL policy.
In the ECDF plots, the difference is defined as (baseline − RL), such that negative values indicate better performance by the baseline and positive values indicate better performance by the RL policy. This representation enables a direct assessment of not only the average performance gap, but also the proportion of episodes in which the RL policy achieves comparable or superior performance relative to each baseline.
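The ECDF construction and the proportions read from it can be expressed in a few lines. This is a minimal sketch with hypothetical function names (`ecdf`, `share_rl_better`, `share_within`); it follows the sign convention above, in which each difference is the baseline value minus the RL value for the same episode.

```python
def ecdf(values):
    """Return sorted (x, F(x)) pairs, where F(x) is the fraction of values <= x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def share_rl_better(baseline, rl):
    """Fraction of episodes with a positive difference (RL strictly better)."""
    diffs = [b - r for b, r in zip(baseline, rl)]
    return sum(d > 0 for d in diffs) / len(diffs)


def share_within(baseline, rl, tol):
    """Fraction of episodes whose absolute difference is at most `tol`."""
    diffs = [b - r for b, r in zip(baseline, rl)]
    return sum(abs(d) <= tol for d in diffs) / len(diffs)
```

The value of the ECDF at zero gives the share of episodes in which the baseline is at least as good as the RL policy, so one minus that value is the share in which the RL policy is strictly better.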
For in-vehicle travel time, the ECDF results indicate that the baseline methods generally outperform the RL policy, as a substantial portion of the distribution lies on the negative side of zero (i.e., baseline minus RL < 0). This implies that the baselines achieve lower travel times in the majority of episodes.
For the nearest-neighbor heuristic, approximately 60% of episodes fall within 10 time steps of the RL policy, indicating that although the heuristic often performs better, the magnitude of improvement is relatively moderate in many cases.
When compared to the offline DARP benchmark, RL achieves lower in-vehicle travel time in approximately 40% of episodes (i.e., where the difference is positive). This indicates that, despite the offline method having full knowledge of future demand, the RL policy is capable of achieving comparable or improved performance in a non-negligible subset of stochastic demand realizations.
Figure 4 shows the ECDFs of the average service time differences between the baseline methods and the RL policy. For total service time, the ECDF indicates that the RL policy is generally outperformed by all baseline methods, as the majority of the distribution remains on the negative side of zero. While RL achieves lower service times than the offline benchmark in approximately 25% of episodes, these cases are limited compared to the overall distribution.
In comparison to the online heuristics, including both nearest-neighbor and wait-time-aware strategies, the RL policy consistently underperforms, suggesting that it does not fully capture the trade-off between minimizing travel distance and reducing passenger waiting time.
While the proposed GAIL + PPO agent does not outperform the heuristic baselines or the offline DARP solution in terms of time-based performance metrics, this outcome is consistent with the relative information and structural advantages of the benchmark methods. The heuristic approaches are explicitly designed to optimize short-term objectives under simplified decision rules, while the offline DARP solution benefits from full prior knowledge of all passenger requests.
Importantly, the proposed learning-based approach is not intended to directly compete with these methods under the current simplified setting, but rather to establish a scalable and extensible framework for more complex routing problems. In particular, traditional optimization and heuristic approaches become increasingly difficult to apply or generalize in environments involving stochastic demand, sequential decision-making, and high-dimensional state representations.
The results demonstrate that the proposed method is capable of learning meaningful routing behaviors and achieving competitive performance in a subset of scenarios despite operating under significantly greater uncertainty. This provides a foundation for extending the approach to more complex settings, such as multi-agent coordination and large-scale real-time transportation systems, where learning-based methods are expected to offer greater flexibility and adaptability.