1. Introduction
In an increasingly urbanized world, improving mobility accessibility and reducing reliance on private automobiles have become priorities for cities. The traditional fixed-route transit model, as the only shared mobility mode, is not sufficient to provide broad accessibility across areas with varying population densities. The need for flexible and expanded service coverage in shared mobility has given rise to alternatives such as on-demand mobility [1] and flex-route transit [2,3].
Flex-route transit, a service type that operates a base fixed route with optional pickup and drop-off points, has emerged as a hybrid alternative, as it combines the directness and capacity of fixed-route transit with the expanded coverage of on-demand mobility. In practice, flex-route transit has been the subject of recent pilot programs [4], and research supports its potential to complement fixed-route transit and improve accessibility [5].
However, flex-route transit has proven difficult and costly to operate. Surveys indicate that the cost per trip can be up to 8 times higher than that of fixed-route service [6]. Operational challenges lead to low productivity, excessive vehicle idle times, and schedule violations. A pilot evaluation also showed highly variable user waiting times [4]. These challenges also make such services prone to failure: a comprehensive review showed that 40% of flexible transit implementations last less than 3 years before being discontinued [7].
The reported difficulty reflects the need for better planning and real-time control. Research has mostly focused on the various stages of planning: the study of spatial and temporal demand characteristics suited for flexible transit [8], scheduling methods [9,10,11], and pre-planned strategies such as offering users alternative pickup spots [12] and reallocating slack time to minimize idle time [13,14].
Although service plans can make the operation robust, operational control is necessary to respond to disruptions caused by stochastic demand levels and travel times. The primary lever for control is the real-time assignment of requests to vehicles, taking into consideration schedule adherence at checkpoints. Research distinguishes between static and dynamic flexible-route transit systems in terms of their request handling processes [15,16]. Static systems, which serve requests pre-booked up to a day in advance, are common in practice [6,17] and in research studies [14,18]. Dynamic systems, in turn, reserve flexibility to serve last-minute requests. This introduces the challenge of handling requests (demand) and assigning vehicles (supply) in real time. Heuristic approaches determine deviation decisions based on the immediate availability of slack time, with reactive rather than proactive decision-making. As a result, volatile demand and travel times can compromise their effectiveness. Moreover, a multi-objective function is required to capture the costs imposed on both the operation and onboard passengers, which is difficult to express with heuristic rules. Furthermore, previous methods neglect the negative impact of deviation decisions on riders at fixed stops [19].
To fill these gaps, this study proposes a reinforcement learning-based solution to determine when a vehicle should deviate from the base route to serve requested riders at a flex-pickup stop. The state design enables a deviation policy that adapts to dynamic conditions in demand patterns and travel times. Furthermore, the reward design addresses the trade-offs in deviation decisions: too few deviations result in early arrivals and fewer requests served, while too many deviations result in late arrivals that deteriorate the rider experience and lead to missed transfers. A weighted reward function is formulated to incentivize a deviation policy that balances schedule adherence and the number of served requests. Transportation authorities can define the reward weights in accordance with their service objectives, making this approach replicable in other cities. The proposed method is evaluated against heuristics derived from real-world practice [6]. To test adaptability to dynamic conditions, two scenarios with different demand compositions are proposed. In the peak period scenario, where commuters would typically use the service, the demand for rider requests is low. In the off-peak scenario, where more leisure travelers would use the service, the demand for rider requests is higher. As a result, the methodology enables a dynamic flex-route transit operation that can be attractive to different rider groups, such as time-sensitive commuters traveling between fixed stops, as well as leisure travelers prioritizing the convenience of pickups away from the fixed route. More broadly, the approach aligns with the goals of smart cities by leveraging large-scale vehicle and demand data to dynamically adapt to changing demand patterns and operational conditions.
In sum, the contributions are as follows:
- A flexible control algorithm for flex-route transit that considers the impacts on both requested and fixed-stop riders is proposed.
- Real-time decisions capture the conflicting objectives and trade-offs in the operation.
- Support is provided for flexibility to adapt to varying temporal conditions according to service objectives.
The remainder of this paper is structured as follows: Section 2 discusses related works, Section 3 describes the flex-route problem, Section 4 presents the solution approach, Section 5 describes the case study for the application, and Section 6 presents the performance evaluation. Finally, Section 7 concludes the paper.
2. Related Works
2.1. Flex-Route Operating Strategies
The literature and practice have focused primarily on the use of heuristics for deviation and request assignment decisions [6]. Ref. [14] proposed a strategy to extend the slack time at certain stops and during high-demand periods to accommodate additional requests, significantly reducing the request rejection rate. The method proposed in [12] enabled the pickup locations of accepted requests to serve as optional pickup stops for future rider requests. The strategy was effective in reducing waiting time at higher demand levels, as more alternative locations were available for pickup. These operating strategies are pre-planned and assume that requests are known in advance, and thus do not support real-time decision-making.
Real-time methods for deviation decisions in flex-route systems remain limited in the literature. Advanced optimization-based solutions for request assignments in high-capacity flexible services, as proposed in [20,21], are limited to services that are not based around a fixed route. Ref. [22] compared fixed- and flex-route service designs for a feeder network between stops used as transfer points to provide insights for service design. For the flex-route service type, a greedy-reactive method was used for proximity-based matching between riders and vehicles. The results showed that, for higher demand levels, the wait time was more evenly distributed across the service area under the flex-route service. Ref. [19] proposed a reinforcement learning algorithm for real-time request assignment decisions in flex-route services, with a reward function based on passenger and operating costs. The method showed operational efficiency improvements compared to randomly assigned requests. However, the impact on service quality indicators relevant to fixed-stop riders, such as reliability and schedule adherence, was neglected in the methodology design and evaluation.
2.2. Reinforcement Learning in Mobility Systems
Reinforcement learning (RL) has been applied extensively to mobility systems. Ref. [23] used Deep Q-Networks to address the vehicle rebalancing problem in the mobility-on-demand context. The method reduces passenger waiting time, with savings increasing as the fleet size grows. Ref. [24] extended this work to address vehicle rebalancing and request assignment simultaneously. The evaluation showed improvements in passenger waiting time but not in vehicle travel distances, highlighting the trade-offs in multi-objective problems. Regarding fixed-route transit problems, Chen et al. [25] were the first to address the bus holding problem with RL, using sparse cooperative Q-learning. Although this approach enables cooperative strategies across vehicles by including global fleet information within the state, it limits computational scalability. To address the complexity issue, Alesiani and Gkiotsalitis [26] proposed a fully decentralized learning scheme, reducing passenger journey times. However, decentralized control introduces the issue of non-stationarity, because agents are unable to observe the impact of neighboring agents on their individual reward. This issue was first recognized in an early paper on RL applied to the control of elevator systems [27]; the proposed solution was to include the impact of each agent on the global reward signal. Ref. [28] adopted RL in the context of bus holding and stop skipping and addressed the non-stationarity issue by including the impact on passenger waiting times for the trip following the controlled trip. The technique improved reliability performance compared to agents trained with less “awareness”. Ref. [29] addressed the bus holding problem with a credit assignment technique based on inductive graph learning to asynchronously approximate the impact of other agents’ actions on the reward. The methodology achieved a performance increase, even on new bus routes where the model was not trained. Finally, Ref. [30] formulated the multi-strategy control problem in fixed routes using curriculum learning and used domain randomization to increase robustness. The performance improved compared to that of simpler RL methods under various service design and demand level scenarios.
3. Problem Description
3.1. Flex-Route Transit Operation
The specific flex-route problem addressed in this paper is shown in Figure 1, which illustrates a route with M fixed stops. From the riders’ perspective, using the service involves placing a request and waiting to be assigned a vehicle. The vehicle may not be assigned immediately, as the control system must wait for the following vehicle to be close to the flex stop to make the deviation decision (described further in the next section). However, the system ensures that the time between acceptance and pickup is sufficient for the rider to reach the flex stop. A vehicle assignment is not guaranteed, and a maximum wait time tolerance is assumed before the rider cancels their request and exits the system, similar to the “walk away” time assumed in [23].
Figure 2 shows the processes involved in the implementation of flex-route transit systems. The planning process consists of various components. The route design consists of determining the location of fixed- and flex-route stops. The frequency setting process establishes the minimum level of service based on the available resources and expected demand. Finally, the scheduling process assigns vehicles and crews to trips and sets the available slack time for deviations. The slack time must be carefully set according to the observed travel times and expected demand. If the slack time is too low, the operation risks delays and rejected requests when demand exceeds the allowed deviations; conversely, if the slack time is too high, vehicle idle times grow, increasing operating costs. The slack time setting problem is explored further in [31].
Once the service plan is developed, operational control strategies are designed to ensure that the planned level of service and desired service quality are delivered. The operations control problem includes interdependent processes. First, the available slack time is the critical resource used to determine a vehicle’s availability to deviate for requests. In tandem, the received requests must be assigned to vehicles, and the assignments then determine the stops served by the vehicle. Additionally, serving the fixed stops at the planned service regularity must be actively managed to maintain service quality for riders who use the service directly at fixed stops.
In this paper, we assume that the planning phase is complete and that the frequency of service and slack time have been determined. The methodology focuses on the operations control problem given headway and slack time. The method is designed to enhance performance, as measured by metrics that capture customer experience and system efficiency.
3.2. Operational Control
The problem centers on the decision to deviate a vehicle to serve requests. To facilitate serving requests placed in real time, decisions must be made after the vehicle has started its trip and within a sufficiently short horizon. Therefore, the decision event is defined as the vehicle’s arrival at the last fixed stop before a potential deviation segment towards a flex stop. A fixed stop that triggers deviation decision events is referred to as a control stop.
Formally, when the vehicle serving trip i arrives at a control stop m, a decision is made on whether to deviate to the associated flex stop before continuing towards the next fixed stop. If the flex stop is served, riders who have placed a request up to the vehicle’s arrival time at the flex stop are served. If the flex stop is not served, the requested riders are not served by that vehicle. They are not explicitly rejected from the service and remain in the system up to a maximum wait time tolerance, after which the request is rescinded and considered lost ridership. Drop-offs at flex stops are not considered.
The service is operated by v vehicles departing every H minutes from the start stop. Operating schedules are critical to the efficiency and effectiveness of the operation. The schedule of trip i is described by the scheduled departure time from the start terminal and the scheduled arrival times at every stop. Scheduled arrivals are derived from the scheduled running time, which includes a slack time to allow for route deviations.
Deviations from schedules can result in missed connections if the trip is late and excessive idle time if the trip is early. The schedule deviation of trip i at stop m is defined as the difference between the actual and scheduled arrival times at the stop. The schedule thresholds used to determine the on-time status are predetermined, one for late trips and one for early trips. Trip i is considered late at stop m if its schedule deviation exceeds the late threshold and early if it falls below the early threshold. Given the role of schedules in this service design to ensure transfer connections and operational efficiency, dynamic schedule adjustments are not considered in this study.
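For concreteness, the above can be written with assumed notation (the original symbols are not reproduced here): let a_{i,m} and ã_{i,m} denote the actual and scheduled arrival times of trip i at stop m, and let δ_early < 0 < δ_late denote the on-time bounds. The schedule deviation is then d_{i,m} = a_{i,m} − ã_{i,m}, with trip i considered late at stop m if d_{i,m} > δ_late and early if d_{i,m} < δ_early.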
4. Solution Approach
In this section, we formulate the flex-route operations control problem as a reinforcement learning (RL) problem that optimizes real-time deviation decisions under dynamic and uncertain conditions. RL is a framework for sequential decision-making that learns the value of control actions based on their short- and long-term impacts on a desired objective.
The RL approach consists of learning optimal control actions from agents interacting with an environment represented as a Markov Decision Process (MDP). The MDP can be broken down into a sequence of decision steps, each described by a state s_t, an action a_t, a reward r_t, and the next state s_{t+1}, summarized by the tuple (s_t, a_t, r_t, s_{t+1}). Based on iterative actions and reward signals, a policy is learned to map states to actions that maximize the expected reward over time.
Multiagent reinforcement learning (MARL) extends the RL approach to multiple agents. Ref. [32] describes two classes of learning agents within the MARL framework: joint and independent learners. The joint learning framework assumes that all agents are controlled and evaluated simultaneously. This may not apply to many practical settings [33] and conflicts with the discrete and asynchronous nature of the flex-route problem, since deviation decisions only happen when a vehicle is near the flex stop. Thus, the flex-route problem can be better represented as an independent learners problem [34]. This decentralized formulation also enhances scalability, as policies can be learned and executed locally by individual vehicles, enabling applications to larger and more complex transit networks without incurring prohibitive computational or coordination costs. The experience tuple is thus adapted to represent the decision step of an individual agent i.
A limitation of the independent learners approach is that the state transitions and rewards of any individual agent can be affected by unobserved actions of other agents, an issue known as non-stationarity [35], which can compromise convergence to an optimal policy. Despite this, MARL algorithms have been successfully applied to various settings [34,36,37,38,39], including public transit operations [28,29,40]. The proposed method reduces non-stationarity by making neighboring agent information more visible in the state space, as recommended in [35], which can also facilitate the learning of cooperative strategies.
4.1. Environment Definition
The state for the vehicle serving trip i includes the variables that best capture the factors relevant to deviation decisions: the stop m, the number of requests waiting at the flex-route stop, the actual headway, and the schedule deviation. The actual headway is defined as the inter-arrival time between trips i−1 and i at stop m. The headway is an important indicator of the dynamics of the controlled agent and its leading vehicle. For frequent routes, the number of riders waiting at fixed stops is directly proportional to the headway (as riders do not need to follow the schedule). More riders boarding after long headways implies longer dwell times, making long headways even longer. Also, because demand determines dwell times at stops, the rate at which the headway grows or shrinks increases over time as a result of this disequilibrium.
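As an illustration, a minimal sketch of how such a state vector could be assembled at a decision event is given below; the variable names and the headway normalization are assumptions for illustration, not the authors’ implementation.

```python
import numpy as np

def build_state(stop_index, flex_requests_waiting, actual_headway_min,
                schedule_deviation_min, scheduled_headway_min=10.0):
    """Assemble the observation used at a deviation decision event.

    The state mirrors the variables described in the text: the stop index,
    the number of requests waiting at the upcoming flex stop, the actual
    (preceding) headway, and the schedule deviation of the trip.
    """
    return np.array([
        float(stop_index),
        float(flex_requests_waiting),
        actual_headway_min / scheduled_headway_min,  # headway relative to the scheduled headway
        schedule_deviation_min,                      # positive = late, negative = early
    ], dtype=np.float32)

# Example: vehicle at control stop 2, three requests waiting,
# an 11-minute preceding headway, and running 1.5 minutes late.
s = build_state(2, 3, 11.0, 1.5)
```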
The action is a binary variable, where one corresponds to a vehicle deviation towards the flex stop and zero corresponds to continuing towards the fixed stop on the base route. In some cases, the agent is programmed to ignore actions that are known to be undesirable, a technique called action masking that can help accelerate learning. Agents will not deviate when there are no requests at the flex stop. Deviations are also forbidden when the agent observes a delay exceeding the threshold for excessive delay.
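The two masking rules can be expressed compactly as in the sketch below, which applies them before the greedy action is taken; the function and variable names are illustrative.

```python
def masked_action(q_values, flex_requests_waiting, schedule_deviation_min,
                  excessive_delay_min):
    """Apply action masking before acting.

    Deviation (action 1) is disallowed when no requests are waiting or when
    the observed delay exceeds the excessive-delay threshold; otherwise the
    greedy action is taken from the Q-values.
    """
    if flex_requests_waiting == 0 or schedule_deviation_min > excessive_delay_min:
        return 0  # stay on the base route
    return int(q_values.argmax())
```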
The reward is designed to balance the need to maximize served requests with the need to maintain on-time performance. The reasoning behind the reward is based on how these metrics are affected by the number of deviations per trip. Too few deviations lead to more early trips and excessive idle time, while too many deviations increase the number of late trips, potentially leading to missed transfers for riders. In terms of requests, more deviations generally translate to a higher request acceptance rate. However, to serve the total request demand, vehicles must deviate even when few requests are present at flex stops, resulting in diminishing returns. Another limiting factor in achieving a 100% request acceptance rate comes from delays caused by excessive deviations, which result in riders canceling their requests once their wait times exceed the maximum wait time tolerance.
Figure 3 illustrates these inherent trade-offs.
To accommodate balanced objectives, the reward received by trip i from the action made at stop m is formulated as the negative of a weighted sum of two cost components, one for skipped requests and one for the schedule status, combined through a balancing weight. The schedule status penalty refers to early and late trips. The use of a weighted reward structure is drawn from studies on multi-objective reinforcement learning [28,41]. The negative sign is applied because the individual components represent costs (higher values indicate worse outcomes), transforming the expression into a reward that the agents maximize. The skipped-requests component is proportional to the number of riders requesting to be picked up by trip i at the flex stop and is activated only when the vehicle does not deviate. The schedule status component penalizes early and late trips with separate weights; defining different penalties for each case accounts for their different implications. For instance, late arrivals can result in missed transfers, which is of utmost importance to riders, whereas early arrivals can result in increased vehicle idle times, which is of interest to the operating authority. This reward is collected at the next control stop, before the next action is decided.
To understand the implications of this reward design, it is worth considering the scenarios with the most severe penalties. Consider the case in which a trip does not deviate despite the presence of requests and arrives early at the next control stop. Both cost components are then incurred. This dual penalty signals to the agent that not deviating results in two consequences: early arrivals and skipped requests. This corresponds to the left side of Figure 3, where deviations are low. Consider the opposite case with excessive deviations. If a vehicle deviates despite being delayed, it is penalized through the schedule status component. Furthermore, significant delays would likely not be recovered downstream, resulting in the agent being penalized again at the next stop. In this way, the reward function encourages a balanced deviation policy.
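A minimal sketch of a reward with this structure is shown below. The exact functional form and symbols of the paper are not reproduced here, so the placement of the balancing weight and the penalty values are assumptions chosen only to illustrate the structure.

```python
def reward(deviated, skipped_requests, schedule_deviation_min,
           early_bound_min=-1.0, late_bound_min=3.0,
           w=0.5, early_penalty=1.0, late_penalty=2.0):
    """Negative weighted sum of a skipped-request cost and a schedule-status cost.

    The skipped-request cost is active only when the vehicle does not deviate;
    the schedule cost applies separate penalties to early and late arrivals,
    evaluated at the next control stop. Bounds and penalties are placeholders.
    """
    c_req = 0.0 if deviated else float(skipped_requests)
    if schedule_deviation_min > late_bound_min:
        c_sched = late_penalty            # late arrival, risk of missed transfers
    elif schedule_deviation_min < early_bound_min:
        c_sched = early_penalty           # early arrival, vehicle idle time
    else:
        c_sched = 0.0                     # on time
    return -(w * c_req + (1.0 - w) * c_sched)
```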
4.2. Training Algorithm
The agents are trained with Deep Q-Networks (DQNs) [42]. This method is based on Q-learning [43], a classical RL algorithm that trains a predictor of the value of action a given state s, i.e., Q(s, a), where the value captures the expectation of reward. The DQN extends this method by using a neural network as the state–action value predictor, i.e., Q(s, a; θ), which is more appropriate for the large state space of the flex-route problem, and applies techniques that improve learning stability.
The DQN is an online learning algorithm, iteratively evaluating and updating Q(s, a; θ). At decision step t, the action is dictated by the ε-greedy rule, which ensures that the greedy, value-maximizing action is selected with a probability of 1 − ε and that a random action is selected with probability ε, to stimulate the continual exploration of new strategies. Random actions are selected with equal probability across the action set. The ε parameter is annealed linearly from its initial to its final value over the first portion of training steps and is then held fixed at the final value for the remaining training steps.
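The ε-greedy selection with linear annealing can be sketched as follows; the initial value, final value, and schedule length are placeholders rather than the calibrated settings.

```python
import random

def epsilon(step, eps_start, eps_end, anneal_steps):
    """Linearly anneal the exploration rate, then hold it at eps_end."""
    if step >= anneal_steps:
        return eps_end
    return eps_start + (step / anneal_steps) * (eps_end - eps_start)

def select_action(q_values, step, eps_start=1.0, eps_end=0.05, anneal_steps=50_000):
    """Greedy action with probability 1 - eps, uniform random action otherwise."""
    eps = epsilon(step, eps_start, eps_end, anneal_steps)
    if random.random() < eps:
        return random.randrange(len(q_values))  # uniform exploration
    return int(max(range(len(q_values)), key=lambda a: q_values[a]))
```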
For the policy update step, the parameters θ of the value predictor are updated as follows:
θ ← θ + α [ r_t + γ max_{a'} Q(s_{t+1}, a'; θ⁻) − Q(s_t, a_t; θ) ] ∇_θ Q(s_t, a_t; θ),
where α is the learning rate, and γ max_{a'} Q(s_{t+1}, a'; θ⁻) is the maximum expected value of the next state (considering all actions) with a discount factor γ, which captures long-term rewards. Q(s_t, a_t; θ) is the value of the current state–action pair. θ⁻ are the parameters used to estimate the value of the next state, and they are copied from θ at fixed intervals of training steps in order to stabilize training.
The network used is a fully connected neural network with hidden layers and ReLU activation. The last layer uses linear activation for the mapping to discrete action values. Each update of the network is made on a batch of experiences drawn randomly from the experience replay buffer, which stores the most recent experiences.
Given that vehicles in the flex-route problem share a common objective, the independent learners approach is extended by employing a single shared policy, trained on the pooled experiences of all agents and used to govern their actions. This technique also has the potential to accelerate convergence [34].
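A condensed sketch of this training setup is given below, using PyTorch as an assumed framework. It shows a single gradient step on experiences pooled from all vehicles into one shared replay buffer, with a periodically copied target network. Layer sizes, batch size, and update frequency are placeholders, not the calibrated values of Table 1.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))  # linear output layer for action values

    def forward(self, x):
        return self.net(x)

policy_net, target_net = QNetwork(), QNetwork()
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)  # pooled (state, action, reward, next_state) tuples from all vehicles
gamma = 0.99

def train_step(batch_size=32):
    """One DQN update on a random batch drawn from the shared replay buffer."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s_next = map(torch.tensor, zip(*batch))
    q = policy_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r.float() + gamma * target_net(s_next.float()).max(1).values
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically sync the target network with the online network, e.g.:
# if step % target_update_steps == 0:
#     target_net.load_state_dict(policy_net.state_dict())
```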
4.3. Training Environment
Given the complex dynamics of the flex-route problem, a simulation model is used as a training environment. The model is designed to assess the performance of control strategies in flex-route operations under demand and travel time uncertainty.
The model is mesoscopic and represents vehicle movements as discrete stop, arrival, and departure events along the route. Vehicle travel times on segments between consecutive stops are modeled according to a prespecified distribution. Dwell times at stops are estimated as a function of the number of boarding and alighting riders. Riders arrive at stops according to Poisson processes with average arrival rates at the origin–destination level. Individual rider journeys are tracked to evaluate waiting and riding time impacts.
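Rider arrivals of this kind can be generated by sampling exponential inter-arrival times for each origin–destination pair, as in the sketch below; the rates, stop labels, and horizon are illustrative.

```python
import random

def generate_arrivals(od_rates_per_hour, horizon_hours=1.0, seed=None):
    """Generate Poisson rider arrivals for each origin-destination pair.

    od_rates_per_hour maps (origin, destination) to an average arrival rate;
    exponential inter-arrival times yield a Poisson arrival process.
    """
    rng = random.Random(seed)
    arrivals = []  # (arrival_time_hours, origin, destination)
    for (o, d), rate in od_rates_per_hour.items():
        t = rng.expovariate(rate)
        while t < horizon_hours:
            arrivals.append((t, o, d))
            t += rng.expovariate(rate)
    return sorted(arrivals, key=lambda x: x[0])

# Example: 6 riders/h from stop 0 to stop 4, 3 riders/h from flex stop "F1" to stop 4.
demand = {(0, 4): 6.0, ("F1", 4): 3.0}
riders = generate_arrivals(demand, horizon_hours=2.0, seed=1)
```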
The same fleet of vehicles continuously performs trips in both directions, allowing the model to capture delay propagation over chained trips. If vehicles arrive earlier than scheduled at the end terminal, they will wait until the scheduled departure time in the opposite direction. If vehicles arrive late, they depart in the opposite direction immediately after alighting passengers at the terminal. For simplification, vehicles are not allowed to overtake each other, and vehicle capacity is not considered.
5. Application
5.1. Service Area
The case study is centered around a 2-mile route in Boston (Figure 4), which combines elements of an existing fixed bus route and a proposed shuttle route [44]. The route serves an area of high employment and commercial density and links it with a commuter rail terminal on the west end and a ferry terminal on the east end that serves as a commuter hub to other critical locations. This area has a rapidly growing business scene that draws commuters during the day, along with a mix of entertainment and social activities in the evenings. However, the area currently has limited public transit service and is thus heavily reliant on private vehicles and employee shuttles. As a result, the demand pattern during peak hours consists mostly of regular commuters who need a fast, reliable connection to a transit hub; they are more likely to use the service from a traditional fixed-route stop. In the evening off-peak hours, instead, the demand is composed of leisure travelers who prioritize convenience and are thus more likely to request the service at a flex-route stop.
5.2. Operation
The flex-route serves five fixed stops and two flex-route stops in each direction. The three levels of operational design considered are the service plan, travel times, and demand.
Vehicle travel times are randomly sampled from a log-normal distribution. The mean and standard deviation of travel times between consecutive fixed stops are set to 2 and 0.5 min, respectively. For deviations, the mean travel time for every link between a fixed stop and the deviated flex stop (or vice versa) is 1.5 min, and the standard deviation is 0.3 min. The scheduled running time between each pair of fixed stops is set to 2.5 min, resulting in a 10 min end-to-end scheduled travel time. There is an allocated slack time of 2 min for deviations for each trip. The assumed half-cycle time is 8 min without deviations and 10 min with both deviations. Trips are scheduled to depart every 10 min in each direction, which, combined with a scheduled round-trip time of 20 min, yields a fleet requirement of two vehicles.
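Sampling travel times with a given mean and standard deviation from a log-normal distribution requires converting those moments to the distribution’s underlying parameters. The sketch below shows this step; it is an assumption about how the sampling could be implemented rather than the simulator’s exact code.

```python
import math
import random

def lognormal_travel_time(mean_min, std_min, rng=random):
    """Sample a travel time (minutes) from a log-normal distribution
    parameterized by its mean and standard deviation."""
    variance = std_min ** 2
    sigma2 = math.log(1.0 + variance / mean_min ** 2)   # variance of the underlying normal
    mu = math.log(mean_min) - 0.5 * sigma2               # mean of the underlying normal
    return rng.lognormvariate(mu, math.sqrt(sigma2))

# Fixed-stop segment: mean 2 min, std 0.5 min; deviation link: mean 1.5 min, std 0.3 min.
segment_time = lognormal_travel_time(2.0, 0.5)
deviation_time = lognormal_travel_time(1.5, 0.3)
```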
The total dwell time is modeled as the sum of a constant component and a passenger-dependent component. Each stopping activity includes a fixed 8 s dwell to account for deceleration and acceleration, plus an additional 2 s for every boarding or alighting passenger.
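For example, a stop where four riders board and two alight incurs a total dwell time of 8 + 2 × (4 + 2) = 20 s.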
The demand is represented by Poisson arrival rates assigned for every origin–destination (OD) stop pair. The OD-level demand is determined based on assumptions on the type of stop and inter-stop travel distances. The terminals, given their status as transfer connections and commuter hubs, have the highest generation and attraction of trips, followed by intermediate fixed stops. Flex stops are assumed to have the lowest demand and are only for pickups. In terms of distance, longer trips are considered to have a higher demand than shorter trips.
Assuming ridership levels similar to those of existing systems in large urban areas [17], the total demand for the service is set to 10.8 riders per scheduled trip (or 65 riders per hour). The demand distribution between fixed and flex stops is determined for two demand scenarios, with the total demand held equal. The two scenarios are designed to simulate peak and off-peak demand periods. The demand in the peak scenario is mainly composed of commuter riders, and, thus, the level of service and on-time arrivals at terminals are of higher priority than rejected requests. This contrasts with the evening off-peak period, in which leisure travelers make up most of the demand, and so schedule constraints may be relaxed by giving a higher weight to request rejections. This enables a robust evaluation of the method’s performance under different demand patterns. The origin–destination matrix is detailed in Figure 5.
The maximum wait time tolerance, after which riders rescind their request, is set to 10 min.
6. Evaluation
This section presents the performance evaluation. First, it introduces a heuristic method, drawn from current practice, for comparison; second, it lists the parameter settings for the proposed MARL algorithm; then, it compares the performance between methods, including a sensitivity analysis.
The performance metrics considered are based on those in the diagram presented earlier in Figure 2 and reflect metrics relevant to the stakeholders of the operation. The first key metric is the request acceptance rate, which is defined as the number of requests accepted divided by the total number of requests received, including those canceled because of excessive wait times. The second key metric is the on-time performance, which measures the number of trips arriving at the last stop with a schedule deviation within the early and late on-time bounds, divided by the total number of trips. Additionally, we include an evaluation of the passenger waiting times, as observed in the simulated episodes, which is relevant for fixed-stop riders. Furthermore, we conduct a sensitivity analysis by evaluating scenarios with increased travel time variability. These stochastic variations ensure that our evaluation reflects not only short-term performance but also the policy’s robustness under long-term operational uncertainty.
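The two key metrics can be computed from simulated trip and request records as sketched below; the field names and bound values are illustrative placeholders.

```python
def acceptance_rate(requests):
    """Share of received requests that were accepted (cancellations count as received)."""
    accepted = sum(1 for r in requests if r["accepted"])
    return accepted / len(requests) if requests else 0.0

def on_time_rate(trips, early_bound_min=-1.0, late_bound_min=3.0):
    """Share of trips whose schedule deviation at the last stop falls within the on-time bounds."""
    on_time = sum(1 for t in trips
                  if early_bound_min <= t["deviation_at_last_stop_min"] <= late_bound_min)
    return on_time / len(trips) if trips else 0.0
```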
6.1. Heuristic Method
The performance of the MARL approach is compared against that of a heuristic that formalizes existing practice. Reports of existing services [6] indicate that deviation decisions are made only when there is sufficient slack time remaining in the trip’s schedule. This restriction is designed to ensure on-time arrivals at checkpoints. Alternative methods in the literature propose dynamically relaxing this restriction to serve more requests [14].
Drawing on these approaches, the proposed heuristic encourages the efficient use of slack time while maximizing the number of served requests. Upon the vehicle’s arrival at control stop m, the heuristic allows a deviation if the number of requests currently waiting at the flex stop meets or exceeds a dynamic threshold. The threshold is the minimum number of requests required to perform a deviation for trip i and is set dynamically according to the schedule deviation: it follows a linear relationship with the schedule deviation, defined by a slope coefficient and an intercept, and a max operator is applied to prevent deviations when there are no requests. The slope coefficient sets the change in the minimum number of requests for every unit of schedule deviation; it is non-negative so that a higher minimum number of requests is required at higher values of schedule deviation (delays). The intercept represents the minimum number of requests at a schedule deviation of zero, i.e., when all the slack time in the trip’s schedule has been consumed. Legacy services use similar decision rules tied to the available slack time [6]. In general, this parametric design is intuitive for service providers to implement in accordance with their service objectives.
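A sketch of such a rule is given below. Since the exact expressions are not reproduced here, the use of a ceiling and a lower bound of one request are assumptions that preserve the described behavior: a threshold that grows linearly with the schedule deviation and never permits a deviation without waiting requests.

```python
import math

def deviation_threshold(schedule_deviation_min, slope=0.5, intercept=1.0):
    """Minimum number of waiting requests required to deviate.

    Grows linearly with the trip's schedule deviation (delay) and is bounded
    below by one so that a vehicle never deviates when no requests are waiting.
    """
    return max(1, math.ceil(slope * schedule_deviation_min + intercept))

def heuristic_deviate(requests_waiting, schedule_deviation_min,
                      slope=0.5, intercept=1.0):
    """Deviate only if enough requests are waiting given the current delay."""
    return requests_waiting >= deviation_threshold(
        schedule_deviation_min, slope, intercept)
```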
The values of the slope coefficient and the intercept determine the deviation rate of the policy. The parameter calibration is based on the performance curves presented conceptually in Figure 3 and is oriented towards achieving optimal ranges of on-time performance and request acceptance rates.
6.2. MARL Parameter Settings
The MARL hyperparameters were calibrated for maximal performance. The optimized settings are listed in Table 1.
The penalties for early and late trips, which enter the MARL reward function, are set in advance. The on-time performance bounds, used in both the MARL reward function and the evaluation, are defined in minutes for early and late arrivals, and the threshold for excessive delay that prevents deviations in the MARL policy is also defined in minutes.
To ensure a fair comparison across methods, we set the reward weight parameter for MARL and the slope and intercept coefficients for the heuristic such that both approaches produced comparable deviation rates. Deviation rates were chosen to balance service flexibility and operational efficiency under different demand conditions. For the peak period, the reward weight was set to yield 0.8 deviations per trip. For the off-peak period, it was set to a value that resulted in 0.9 deviations per trip, reflecting the higher request demand.
To analyze the sensitivity of the control policy to the corresponding reward weights, all other parameters were held the same.
6.3. Overall Performance
Table 2 shows the performance in terms of the request acceptance rate and on-time performance. The results were averaged over 50 simulation replications, which were found to be sufficient for producing stable performance estimates with low variance. The random seed was held constant between methods to expose them to the same conditions. Overall, MARL delivers consistent and better performance across both peak and off-peak scenarios. MARL increases the on-time rate by 4% in the peak period and by 6% in the off-peak scenario compared to the heuristic. In terms of the acceptance rate, MARL shows an increase of 3.4% in the peak and a marginal decrease of 0.7% in the off-peak. The marginal decline can be explained by the heuristic’s more aggressive deviation behavior when the request demand is higher. Overall, this range of improvements is comparable to that of existing RL methods, which demonstrated 7% operational cost savings when compared to random vehicle assignment strategies [19].
6.4. Spatial Distribution of Served Requests
The overall performance improvements outlined in the previous section can be attributed to the MARL policy being less myopic than the heuristic method, since it is trained on future rewards. This is evidenced by MARL’s more balanced distribution of deviations along the route: 38% of MARL’s deviations are made at the second flex stop, compared to 29% of the heuristic’s deviations. Considering that demand levels at the first flex stop are higher, which could bias deviations, the observed behavior demonstrates that MARL has learned the environment’s dynamics and is capable of waiting for deviation opportunities downstream. These improvements can help ensure more equitable service along the route.
6.5. On-Time Performance
Figure 6 compares the distributions of trips by schedule status. Across both methods, the off-peak scenario exhibits more late trips than the peak scenario, which is reasonable given that more deviations are needed to serve the increased request demand. It can also be observed that a significant benefit from applying MARL is in reducing the number of late trips (by 5% in the off-peak and 2.6% in the peak). This indicates that MARL-based decisions use the allocated slack time more efficiently and strategically to avoid late trips and missed transfers. For transit operators, this means that MARL enables more effective use of the existing slack time, improving on-time performance without requiring schedule changes or additional resources.
6.6. Wait Times
Deviations can deteriorate service regularity at fixed stops, increasing rider wait times. Statistics on the wait time are extracted from individual passenger journeys in the simulated episodes.
Figure 7 shows the distribution of rider wait times in the peak and off-peak scenarios. It is observed that MARL tightened the wait time distribution of fixed-stop riders, reducing the 90th percentile wait times by 8% in the off-peak scenario and 5% in the peak scenario. These wait time savings are consistent with the findings of an RL application to fixed-route transit control [28]. This reflects the potential of the approach to improve the rider experience, which can help transit authorities better serve their transit-dependent riders, as well as attract new riders.
6.7. Sensitivity Analysis
To evaluate the robustness of the MARL approach to changing conditions, we conducted a sensitivity analysis with respect to increased travel time variability. Scenarios were generated by multiplying the standard deviation of travel times between each pair of fixed stops by a constant factor. This increased the vulnerability to delays and penalties for deviations. With all other parameters held constant, the MARL models were retrained with the updated environments.
Figure 8 shows the request acceptance rate and on-time rate for the tested scenarios. It can be observed that, overall, both measures are reduced as a result of the more variable travel times. However, the on-time trip rate is more strongly affected. This is reasonable given that the on-time rate is more sensitive to extreme delays, whereas the acceptance rate is largely determined by the value placed on requests through the reward weight and heuristic parameters, which are held constant in these experiments. The on-time performance of MARL is consistently better than that of the heuristic across the scenarios, demonstrating its ability to adapt the policy to the increased risk of lateness.
To evaluate the impact of travel time variability on on-time performance, Figure 9 shows the distribution of schedule deviations along the route for each scenario. It shows that, with increased travel time variability, MARL consistently improves schedule adherence compared to the heuristic. The improvement also grows downstream along the route, which highlights the forward-looking capabilities of the MARL policy.
7. Conclusions
This paper addresses the problem of real-time request assignment and route deviation decisions in the flex-route setting. The performance of an RL-based strategy is compared against that of a heuristic method derived from existing practice. Multiple demand scenarios, as well as scenarios of increased travel time variability, are examined.
With a similar number of deviations, MARL increases the number of on-time trips by 4–6% and acceptance rates by 3% compared to the heuristic. The performance improvement is consistent across the peak and off-peak scenarios, characterized by a different distribution of fixed- and flex-stop riders, which reflects the robustness of the system to demand conditions. The improved on-time performance is paired with waiting time improvements of 5–8%, which benefit fixed-route riders who simply turn up at stops. These improvements can be attributed to the forward-looking nature of the MARL approach. The results indicate that MARL can support cost-effective and adaptive flex-route operations. The novel reward design with weights determining the relative importance of on-time performance and request acceptance can be used by practitioners to achieve desired outcomes, depending on service design and demand.
This research is a first step in several potential directions. In terms of system design, extensions of this work can accommodate more complex service plans and rider behavior. For example, riders could request to be dropped off at a flex stop, or choose to walk towards a fixed stop instead of waiting at a flex stop, increasing boarding demand at fixed stops. Future work could also consider different forms of requests in addition to those made in real time. Riders who request the service further in advance, or who have more flexible plans and can wait longer before canceling their request, could be rewarded with higher guarantees of deviation.
On the methodological side, extension to the action space could allow agents to defer a request to the next trip instead of rejecting it. More broadly, using the RL framework with more explicit multiagent coordination techniques could improve performance. Another important next step to enhance the reward design is to incorporate dynamic weight adjustment, allowing the weight configurations to adapt to a range of conditions within the same RL policy. Finally, testing the methods on continuous-operation simulations would provide valuable insights into policy stability and adaptability over longer horizons.
Author Contributions
Conceptualization, J.R., H.N.K. and J.Z.; methodology, J.R. and H.N.K.; software, J.R.; supervision, H.N.K. and J.Z.; formal analysis, J.R.; writing—original draft preparation, J.R.; writing—review and editing, H.N.K. and J.Z.; project administration, J.Z.; funding acquisition, J.Z. and H.N.K. All authors have read and agreed to the published version of the manuscript.
Funding
This material is based upon work supported by the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) under the Vehicle Technology Program Award Number DE-EE0009211 and Award Number DE-EE0011186.
Data Availability Statement
Data sharing is not applicable to this article.
Acknowledgments
The views expressed herein do not necessarily represent the views of the U.S. Department of Energy or the United States Government. The author, Joseph Rodriguez, acknowledges the use of the Claude large language models (Anthropic) to improve the grammatical structure and language of individual sentences in the text.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Brown, A.; Manville, M.; Weber, A. Can mobility on demand bridge the first-last mile transit gap? Equity implications of Los Angeles’ pilot program. Transp. Res. Interdiscip. Perspect. 2021, 10, 100396.
- Coutinho, F.M.; van Oort, N.; Christoforou, Z.; Alonso-González, M.J.; Cats, O.; Hoogendoorn, S. Impacts of replacing a fixed public transport line by a demand responsive transport system: Case study of a rural area in Amsterdam. Res. Transp. Econ. 2020, 83, 100910.
- Jokinen, J.P.; Sihvola, T.; Mladenovic, M.N. Policy lessons from the flexible transport service pilot Kutsuplus in the Helsinki Capital Region. Transp. Policy 2019, 76, 123–133.
- Perera, S.; Ho, C.; Hensher, D. Resurgence of demand responsive transit services—Insights from BRIDJ trials in Inner West of Sydney, Australia. Res. Transp. Econ. 2020, 83, 100904.
- Alonso-González, M.J.; Liu, T.; Cats, O.; Van Oort, N.; Hoogendoorn, S. The Potential of Demand-Responsive Transport as a Complement to Public Transport: An Assessment Framework and an Empirical Evaluation. Transp. Res. Rec. 2018, 2672, 879–889.
- Transit Cooperative Research Program; Transportation Research Board; National Academies of Sciences, Engineering, and Medicine. A Guide for Planning and Operating Flexible Public Transportation Services; Transportation Research Board: Washington, DC, USA, 2010; p. 22943.
- Currie, G.; Fournier, N. Why most DRT/Micro-Transits fail—What the survivors tell us about progress. Res. Transp. Econ. 2020, 83, 100895.
- Li, X.; Quadrifoglio, L. Feeder transit services: Choosing between fixed and demand responsive policy. Transp. Res. Part C Emerg. Technol. 2010, 18, 770–780.
- Kim, M.E.; Schonfeld, P. Integration of conventional and flexible bus services with timed transfers. Transp. Res. Part B Methodol. 2014, 68, 76–97.
- Almasi, M.; Sadollah, A.; Oh, Y.; Kim, D.K.; Kang, S. Optimal Coordination Strategy for an Integrated Multimodal Transit Feeder Network Design Considering Multiple Objectives. Sustainability 2018, 10, 734.
- Gkiotsalitis, K. Coordinating feeder and collector public transit lines for efficient MaaS services. EURO J. Transp. Logist. 2022, 11, 100057.
- Qiu, F.; Li, W.; Zhang, J. A dynamic station strategy to improve the performance of flex-route transit services. Transp. Res. Part C Emerg. Technol. 2014, 48, 229–240.
- Qiu, F.; Shen, J.; Zhang, X.; An, C. Demi-flexible operating policies to promote the performance of public transit in low-demand areas. Transp. Res. Part A Policy Pract. 2015, 80, 215–230.
- Zheng, Y.; Li, W.; Qiu, F. A slack arrival strategy to promote flex-route transit services. Transp. Res. Part C Emerg. Technol. 2018, 92, 442–455.
- Vansteenwegen, P.; Melis, L.; Aktaş, D.; Montenegro, B.D.G.; Sartori Vieira, F.; Sörensen, K. A survey on demand-responsive public bus systems. Transp. Res. Part C Emerg. Technol. 2022, 137, 103573.
- Shahin, M.; Saeidi, S.; Shah, S.A.; Kaushik, M.; Sharma, R.; Peious, S.A.; Draheim, D. Cluster-Based Association Rule Mining for an Intersection Accident Dataset. In Proceedings of the 2021 International Conference on Computing, Electronic and Electrical Engineering (ICE Cube), Quetta, Pakistan, 26–27 October 2021; pp. 1–6.
- Transportation Research Board; National Academies of Sciences, Engineering, and Medicine. Operational Experiences with Flexible Transit Services; Transportation Research Board: Washington, DC, USA, 2004; p. 23364.
- Alshalalfah, B.; Shalaby, A. Feasibility of Flex-Route as a Feeder Transit Service to Rail Stations in the Suburbs: Case Study in Toronto. J. Urban Plan. Dev. 2012, 138, 90–100.
- Wu, M.; Yu, C.; Ma, W.; Wang, L.; Ma, X. Reinforcement Learning Based Demand-Responsive Public Transit Dispatching; American Society of Civil Engineers: Reston, VA, USA, 2021; pp. 387–398.
- Alonso-Mora, J.; Samaranayake, S.; Wallar, A.; Frazzoli, E.; Rus, D. On-demand high-capacity ride-sharing via dynamic trip-vehicle assignment. Proc. Natl. Acad. Sci. USA 2017, 114, 462–467.
- Fielbaum, A.; Bai, X.; Alonso-Mora, J. On-demand ridesharing with optimized pick-up and drop-off walking locations. Transp. Res. Part C Emerg. Technol. 2021, 126, 103061.
- Leffler, D.; Burghout, W.; Cats, O.; Jenelius, E. Distribution of passenger costs in fixed versus flexible station-based feeder services. Transp. Res. Procedia 2020, 47, 179–186.
- Wen, J.; Zhao, J.; Jaillet, P. Rebalancing shared mobility-on-demand systems: A reinforcement learning approach. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 220–225.
- Guériau, M.; Dusparic, I. Samod: Shared autonomous mobility-on-demand using decentralized reinforcement learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 1558–1563.
- Chen, C.; Chen, W.; Chen, Z. A Multi-Agent Reinforcement Learning approach for bus holding control strategies. Adv. Transp. Stud. 2015, 2, 41–54.
- Alesiani, F.; Gkiotsalitis, K. Reinforcement Learning-Based Bus Holding for High-Frequency Services. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3162–3168.
- Crites, R.H.; Barto, A.G. Elevator Group Control Using Multiple Reinforcement Learning Agents. Mach. Learn. 1998, 33, 235–262.
- Rodriguez, J.; Koutsopoulos, H.N.; Wang, S.; Zhao, J. Cooperative bus holding and stop-skipping: A deep reinforcement learning framework. Transp. Res. Part C Emerg. Technol. 2023, 155, 104308.
- Wang, J.; Sun, L. Reducing Bus Bunching with Asynchronous Multi-Agent Reinforcement Learning. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; Volume 1, pp. 426–433.
- Tang, Y.; Qu, A.; Jiang, X.; Mo, B.; Cao, S.; Rodriguez, J.; Koutsopoulos, H.N.; Wu, C.; Zhao, J. Robust Reinforcement Learning Strategies with Evolving Curriculum for Efficient Bus Operations in Smart Cities. Smart Cities 2024, 7, 3658–3677.
- Fu, L. Planning and Design of Flex-Route Transit Services. Transp. Res. Rec. 2002, 1791, 59–66.
- Claus, C.; Boutilier, C. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI 1998, 1998, 2.
- Melo, F.S.; Lopes, M.C. Convergence of Independent Adaptive Learners. In Proceedings of Progress in Artificial Intelligence, Guimarães, Portugal, 3–7 December 2007; Neves, J., Santos, M.F., Machado, J.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 555–567.
- Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA, USA, 27–29 June 1993; pp. 330–337.
- Laurent, G.J.; Matignon, L.; Le Fort-Piat, N. The world of independent learners is not markovian. Int. J. Knowl.-Based Intell. Eng. Syst. 2011, 15, 55–64.
- Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Perolat, J.; Silver, D.; Graepel, T. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. Neural Inf. Process. Syst. 2017, 30, 4190–4203.
- Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; Vicente, R. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 2017, 12, e0172395.
- Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv 2016, arXiv:1610.03295.
- Vinitsky, E.; Köster, R.; Agapiou, J.P.; Duéñez-Guzmán, E.A.; Vezhnevets, A.S.; Leibo, J.Z. A learning agent that acquires social norms from public sanctions in decentralized multi-agent settings. Collect. Intell. 2023, 2, 1–14.
- Menda, K.; Chen, Y.C.; Grana, J.; Bono, J.W.; Tracey, B.D.; Kochenderfer, M.J.; Wolpert, D. Deep Reinforcement Learning for Event-Driven Multi-Agent Decision Processes. IEEE Trans. Intell. Transp. Syst. 2019, 20, 1259–1268.
- Van Moffaert, K. Multi-Objective Reinforcement Learning for Sequential Decision Making Problems. Ph.D. Thesis, Vrije Universiteit Brussel, Brussel, Belgium, 2016.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Watkins, C. Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1989.
- Leung, S. A Free Bus to Get Around the Seaport? Beats a Gondola. The Boston Globe, 31 October 2019. Available online: https://www.bostonglobe.com/business/2019/10/31/free-bus-get-around-seaport-beats-gondola/4tZuWGmoRWXFBRAOmGbYAM/story.html (accessed on 21 July 2025).