1. Introduction
Environmental sustainability is now a central issue for cities of every size, and logistics as well as distribution activities have a direct influence on this agenda. Access restrictions introduced through Low Emission Zones (LEZ) accelerate the transition toward low-emission fleets in urban logistics, increasing the adoption of Electric Vehicles (EVs) in operational planning. These vehicles provide several practical benefits over those powered by internal combustion engines, including reduced emissions, lower noise levels, and greater suitability for emerging autonomous driving technologies [
1]. Despite these advantages, EV-based operations also introduce a number of constraints. Limited battery capacity, lengthy charging times, and the incomplete development of charging infrastructure often restrict operational flexibility. As a result, routing strategies originally designed for the classical Vehicle Routing Problem (VRP) cannot be transferred to EV applications without modification, and additional considerations must be incorporated to address these challenges [
2,
3,
4].
In many real-world logistics operations and routing tasks, dynamic characteristics are inherent and cannot be ignored. While some customer demands are known during the planning phase, the timing and location of others cannot be predicted in advance. A routing strategy that can respond to new demands as they emerge is therefore required. Consequently, the Dynamic Vehicle Routing Problem (D-VRP) has recently become a critical topic in routing and transportation research [
5,
6,
7,
8,
9]. Due to their advantages, electric vehicles have been incorporated into research to address this problem. Constraints such as limited battery capacity and charging requirements in EVs, combined with dynamically changing demand, have increased the importance of EV-focused routing models in the literature [
10,
11]. In this study, the Dynamic Electric Vehicle Routing Problem (D-EVRP) is considered, in which customer requests are revealed progressively during operation and routing decisions must be adapted online under electric vehicle-specific constraints. In contrast to static EVRP variants, the D-EVRP requires simultaneous consideration of dynamic request arrivals, limited battery capacity, charging station availability, and recharging decisions. These characteristics naturally define the problem as a sequential decision-making process under uncertainty, in which routing, charging, and service actions are tightly coupled.
Mathematical programming techniques and several meta-heuristic approaches, including Genetic Algorithms, Simulated Annealing, and Tabu Search, are known to provide strong performance when the underlying problem is fixed and free of dynamic variations. When customer demand becomes dynamic, however, each newly arriving request introduces additional variables, requiring the solution to be recomputed from scratch. This repeated re-optimization leads to a considerable computational burden, especially in large-scale settings. The situation becomes even more challenging when the battery-charging dynamics of electric vehicles are incorporated, since charging decisions and energy constraints further increase the time required to obtain a solution [
3,
12]. Earlier studies illustrate these difficulties: Erdoğan and Miller-Hooks [
2] introduced the Green-VRP framework and examined routing with charging-station planning. Schneider et al. [
3] extended the problem to the Electric Vehicle Routing Problem with Time Windows (EVRPTW). Keskin and Çatay [
4] extended the EVRPTW with a partial-recharge strategy. Montoya et al. [
12] incorporated the non-linear nature of charging curves. Although these contributions represent important milestones, they largely rely on static planning assumptions. Approaches developed for dynamic settings, such as the Multiple Scenario Approach (MSA) and anticipatory algorithms [
9], can support online decision making but tend to face scalability issues once EV-specific energy constraints are taken into account. The need for online and interruption-tolerant decision making in dynamic routing problems is closely related to the online over-time optimization paradigm. Duque et al. [
13] show that when problem instances arrive continuously and strict time limits are imposed, repeatedly solving full optimization models becomes ineffective. Their findings indicate that lightweight, learning-assisted decision mechanisms are more suitable for such settings. This observation directly motivates the use of deep reinforcement learning in the D-EVRP, where routing decisions must be produced immediately under uncertainty without access to future requests.
Recent studies have increasingly adopted Reinforcement Learning (RL), often in combination with deep learning techniques. The ability of RL to react to real-time changes and unexpected request patterns makes it well suited for routing problems in dynamic environments. Through the interaction between the agent, state, action, and reward, an adaptive policy is learned. This policy enables the system to generalize from past experience rather than relearn from scratch for each newly arriving request. Nazari et al. [
14] introduced the first end-to-end RL-based VRP solution using a Seq2Seq architecture, while Kool et al. [
15] employed an attention-based RL framework. Ulmer later investigated routing with dynamic demand using an Offline–Online Approximate Dynamic Programming approach [
9]. Foundational contributions by Van Hasselt et al. [
8] and Mnih et al. [
16] examined Double Deep Q-Networks (DDQN), and Wang et al. [
17] enhanced stability and sample efficiency in sequential decision problems through a dueling network architecture combined with prioritized experience replay. Similar value-based RL architectures have also demonstrated strong performance in real-time robotic control and navigation tasks, reinforcing the suitability of deep reinforcement learning for sequential decision-making problems under dynamic and uncertain conditions [
18]. Recent DRL-based routing studies can be broadly categorized into constructive attention-based models, policy-gradient approaches, and value-based learning frameworks. Constructive encoder–decoder architectures with attention mechanisms have demonstrated strong solution quality by learning complete routes end-to-end and improving exploration through diverse decoding strategies [
19,
20]. Similar attention-based designs have also been extended to dynamic electric vehicle routing settings, particularly for time-dependent or emergency scenarios [
21]. However, these approaches typically rely on full-graph inference or rollout mechanisms, which can limit their applicability in strict real-time environments.
An alternative research line focuses on policy-based reinforcement learning, often combined with graph neural networks and proximal policy optimization to address demand and traffic uncertainty [
22]. Comparative analyses show that policy-based methods can achieve strong generalization but often require complex feasibility handling and action masking in dynamically changing decision spaces [
23]. In contrast, value-based approaches evaluate discrete feasible actions directly, offering a computationally lighter decision mechanism for online routing under uncertainty. The choice of the DDQN architecture is motivated by the structural properties of the D-EVRP. The problem is characterized by a large and dynamically changing discrete action space, where standard Q-learning and classical DQN methods are prone to overestimation bias and unstable value updates. The DDQN alleviates this issue by decoupling action selection from value estimation, resulting in more reliable learning behavior. When combined with a dueling network structure and prioritized experience replay, the DDQN further improves training stability and sample efficiency, which is essential for real-time routing decisions under dynamic demand and energy constraints. Recent studies on dynamic electric vehicle routing further support the suitability of DDQN-based frameworks for real-time decision making. In particular, hybrid approaches combining the DDQN with local improvement heuristics have demonstrated that value-based learners can rapidly generate feasible initial routes under dynamic conditions, which can then be refined when additional computation time is available [
24]. These findings motivate the adoption of DDQN in this study as a practical compromise between solution quality, scalability, and online applicability.
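The decoupling of action selection from value estimation that motivates the DDQN choice can be sketched as follows. This is a minimal illustration only: tabular NumPy arrays stand in for the online and target networks, and all values are toy numbers, not parameters of the proposed framework.

```python
import numpy as np

def ddqn_target(q_online, q_target, reward, next_state, gamma=0.99):
    """Double DQN target: the online network selects the best next action,
    while the target network evaluates it, reducing overestimation bias."""
    best_action = int(np.argmax(q_online[next_state]))        # selection
    return reward + gamma * q_target[next_state, best_action]  # evaluation

# Toy example with 2 states and 3 actions.
q_online = np.array([[1.0, 5.0, 2.0],
                     [0.5, 0.1, 0.9]])
q_target = np.array([[0.8, 4.0, 3.0],
                     [0.4, 0.2, 1.0]])

# Online net picks action 1 in state 0; target net evaluates it as 4.0.
y = ddqn_target(q_online, q_target, reward=-1.0, next_state=0)
```

In a classical DQN the target network would both select and evaluate the next action (here yielding 4.0 from its own argmax), whereas splitting the two roles keeps the selected action tied to the online network's current policy.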
Related work has demonstrated that deep reinforcement learning can be effectively applied to routing problems under dynamic and energy-constrained settings. Basso et al. [
10] proposed a reinforcement learning framework for dynamic stochastic electric vehicle routing, showing that learned policies can successfully coordinate routing and recharging decisions. Similarly, adaptive routing and charging strategies for electric vehicles under uncertainty were investigated by Sweda et al. [
11]. In the broader dynamic routing literature, value-based deep reinforcement learning approaches have been shown to provide stable and competitive performance in online vehicle routing problems with evolving action spaces [
25,
26].
This study examines the D-EVRP from an application-oriented perspective. The problem involves determining routes for one or more electric vehicles so that, as new customer requests appear during operation, the vehicles can be redirected in an efficient manner without exhausting their available battery levels. A solution framework based on Deep Reinforcement Learning is adopted, and a Double Deep Q-Network is employed to train an agent capable of making real-time routing decisions. Reinforcement learning is well suited for sequential decision problems under uncertainty, since a policy can be learned through repeated interaction with a simulated environment in order to maximize long-term returns. Within this framework, the agent receives information describing the current state of the vehicle, including its position, remaining battery energy, load status and other operational variables, and chooses an action such as selecting the next customer to visit or initiating a charging operation. The aim of the learned policy is to minimize total travel cost while ensuring that all dynamic customer requests are served within the imposed constraints.
The problem is formulated as a Markov Decision Process, and a state representation is introduced that incorporates vehicle information together with forthcoming customer requests. Battery charge levels and the spatial distribution of charging stations are explicitly embedded in the state, allowing energy-related constraints of electric vehicles to be captured. A Deep Q-Learning approach is used to handle the routing decisions under dynamic demand and battery limits. The learned policy balances recharging needs and the timely service of new requests, making it suitable for continuously changing conditions. While several studies have addressed dynamic routing and energy-aware decision making, existing approaches typically focus on specific aspects of the problem. In this study, these components are jointly handled within a unified DDQN-based online routing framework, emphasizing practical integration rather than algorithmic novelty. To ensure scalability, K-means clustering is applied. Customer locations are divided into geographic clusters, and each cluster is assigned a separate vehicle and DQN agent. The proposed DRL framework is compared with a myopic greedy rule, random dispatch, and a Genetic Algorithm-based dynamic heuristic. Experiments on real campus data with 5–100 customers [
27,
28,
29,
30] show that although the Genetic Algorithm can sometimes yield slightly shorter routes, the Double DQN achieves near-optimal performance with much lower computational effort, making it suitable for real-time settings. Additionally, the learned strategy automatically adapts to randomly arriving requests without the need for frequent re-optimization. From a sustainability standpoint, reduced trip distance and better charging decisions promote efficient battery use and lower energy consumption. These benefits, together with the zero tailpipe emissions of electric vehicles, align the approach with broader environmental objectives. In conclusion, the aim of this research is to provide a scalable and online-capable DRL-based D-EVRP framework that successfully combines realistic energy limitations with dynamic demand.
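The cluster-then-route decomposition described above, where K-means partitions the customers and each cluster receives its own vehicle and agent, can be illustrated with a small self-contained sketch. The coordinates below are hypothetical; a plain NumPy K-means is used here for illustration rather than the library implementation employed in the study.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means: assign each customer to the nearest centroid,
    then move each centroid to the mean of its assigned customers."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid, then nearest assignment.
        d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical customer coordinates forming two spatial groups.
customers = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                      [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, _ = kmeans(customers, k=2)
# Each resulting cluster index would then be assigned one vehicle / DQN agent.
```

The design intent is that each agent only ever reasons over its own cluster's customers, which keeps the per-agent action space small as the total problem size grows.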
Building on recent advances in deep reinforcement learning for dynamic routing, this study adopts a value-based decision framework to address the Dynamic Electric Vehicle Routing Problem. Established components such as Double Deep Q-Networks, dueling architectures, and prioritized experience replay are integrated within a unified simulation environment that explicitly models dynamic request arrivals, battery limitations, and charging operations. The focus is placed on operational applicability and scalability under real-time constraints, with emphasis on realistic problem modeling and online feasibility rather than on developing new learning architectures.
This article aims to make the following main contributions:
The developed RoutingEnvironment accounts for battery consumption, charging times, and same-site customers, reflecting real-world operational constraints and providing an application-level modeling contribution.
In addition to the customers in the static plan, further customers are generated for dynamic routing. Customer locations are partitioned geographically with K-Means clustering, and a separate agent is trained for each cluster, ensuring scalability for large-scale problems.
Double DQN, Dueling-Network, and Prioritized Experience Replay (PER) components are jointly employed to improve learning stability in a large and variable action space.
The proposed method is compared with greedy search, random policy, and Genetic Algorithm methods, and performance differences at different degrees of dynamism (DoD) [
31] are reported.
All experiments are conducted using a real distance matrix, customer clusters, and charging stations defined on the Eskişehir Osmangazi University (ESOGÜ) campus, thereby providing a reproducible framework for future research.
2. Related Work
The EVRP extends the classical Vehicle Routing Problem (VRP) by incorporating energy consumption and battery-charging constraints. In the broader VRP literature, the evolution and quality of available information are commonly classified into four categories: Static–Deterministic (SD), Static–Stochastic (SS), Dynamic–Deterministic (DD), and Dynamic–Stochastic (DS) [
6]. Among these, only DD and DS represent dynamic settings, whereas SD and SS correspond to static variants. This classification highlights the need for anticipatory decision making in the presence of dynamic and stochastic requests. Recent review studies further emphasize that randomly occurring customer requests constitute the predominant source of uncertainty in dynamic VRP settings and that such uncertainty can be systematically characterized along three dimensions: the source of dynamism, the type of requests, and the planning horizon considered [
32].
The early foundation of the dynamic routing literature was built around anticipatory route selection and scenario-based planning approaches such as the Multiple Scenario Approach (MSA). Thomas and White [
33] formulated the decision process as a Markov Decision Process, showing that an optimal anticipatory policy outperforms reactive strategies when information is gradually revealed over time. Bent and Van Hentenryck [
34] enhanced online decision quality in the Dynamic VRP (D-VRP) by applying MSA and generating scenarios that represent possible future requests. In the same direction, Hvattum et al. [
35] introduced online heuristics such as stochastic hedging and branch-and-regret. These studies highlight the trade-off between re-optimization and anticipatory sampling and demonstrate that the rapid growth of the state–action space leads to the well-known curse of dimensionality. To mitigate this bottleneck, Ulmer et al. [
9,
36], advanced the field through Approximate Dynamic Programming and non-parametric value function approximations (nVFA), combining offline value learning with online rollout strategies to achieve an effective balance between solution quality and computational effort.
Dynamic research in the EV domain has emphasized energy-aware decision frameworks in which routing choices and recharging policies are jointly considered due to the presence of energy constraints. Sweda et al. [
11] integrated adaptive routing and recharging decisions under dynamic conditions and examined the online balance between charging and service operations for a single vehicle. In the following period, Basso et al. [
10] combined EV-focused Dynamic–Stochastic formulations with safe reinforcement learning tools, introduced the DS-EVRP problem class, and demonstrated that a learned policy can generate near-optimal solutions more quickly than classical heuristics. This research line highlights a methodological requirement in D-EVRP studies: the state representation must simultaneously incorporate the state of charge (SoC), charging station locations, and the temporal flow of incoming requests.
The application of DRL to direct policy learning in routing has gained significant momentum. Nazari et al. [
14] and Kool et al. [
15] demonstrated that high-quality policies can be learned without relying on the classical re-optimization loop by employing sequence-to-sequence (Seq2Seq) and attention-based encoder–decoder architectures. Within the family of Q-learning methods, the DDQN [
8] has been utilized to mitigate overestimation bias, whereas the Dueling Network architecture [
17] improves learning efficiency in large action spaces by separating state-value and advantage components. Prioritized Experience Replay (PER) further enhances sample efficiency. The integration of these components into the D-EVRP framework has enabled the development of online-applicable policies under high levels of dynamism. Within this research direction, value-based methods such as Double Deep Q-Networks have received particular attention due to their ability to handle large discrete action spaces while mitigating overestimation bias. Their effectiveness in dynamic routing contexts has been demonstrated in recent studies, where DDQN-based or closely related architectures achieved stable learning behavior and competitive solution quality under online decision-making requirements [
8,
25,
26]. In line with these studies, the present work adopts DDQN as a stable value-based learner and focuses on its deployment within a realistic and scalable D-EVRP simulation setting, rather than proposing a new learning architecture.
Pan and Liu [
25] introduced a partially observable MDP (POMDP) framework for the Dynamic and Uncertain VRP (DU-VRP), showing experimentally that the learned policy can serve as a competitive alternative to repeated re-optimization under uncertainty. The reported results indicate consistent performance in service rate and delay metrics when uncertainty levels are high. Konovalenko and Hvattum [
26] combined accept–reject decisions with DRL policies in a real-time last-mile delivery setting and demonstrated improvements in overall performance, while also providing a detailed analysis of how different state-space components affect policy quality.
Recent years have seen a marked increase in studies focusing on routing electric vehicles in dynamic and uncertain environments. These studies consistently emphasize that effective learning-based routing for electric vehicles requires state representations that jointly encode battery state, charging station availability, and the temporal dynamics of incoming requests [
10,
22,
37,
38]. Kadyrov et al. [
22] proposed a Graph Neural Network and Proximal Policy Optimization (GNN–PPO) approach for a VRP with stochastic demand and traffic uncertainty, showing superior results in both generalization and stability compared with heuristic and classical re-optimization techniques. The Edge-DIRECT model of Mozhdehi et al. [
37] is an edge-enhanced dual-attention DRL architecture for the EVRPTW that demonstrates an improved balance between energy consumption and service times, especially for heterogeneous fleet structures.
From the reviewed literature, it is evident that recent DRL-based approaches to dynamic routing follow different methodological directions, ranging from attention-driven constructive models that emphasize expressive representations and route diversity [
19,
20] to policy-based frameworks that learn adaptive routing behavior in dynamically evolving decision contexts [
22,
23]. In parallel, value-based and hybrid methods have been proposed to support fast online decision making by directly evaluating feasible actions and, when necessary, incorporating additional refinement mechanisms [
24]. The present study aligns with this latter line of research by adopting a DDQN-based online routing framework for the dynamic electric vehicle routing problem, focusing on feasibility-aware routing decisions under battery and charging constraints rather than explicitly modeling demand or traffic uncertainty. Accordingly, this study should be interpreted as an application-oriented integration of established DRL components within a realistic dynamic EV routing environment, rather than as a novel reinforcement learning methodology. The contribution lies in demonstrating how value-based DRL can be operationalized under practical energy, charging, and scalability constraints in an online setting.
3. Materials and Methods
3.1. Problem Environment and Operational Settings
Figure 1 illustrates the evolution of the vehicle’s route under the D-EVRP. At the initial state (t = 0), an initial route plan is generated using only the customers and information available at that time. As time progresses (t > 0), customer requests emerge dynamically and are incorporated into the active demand set. Throughout the process, all vehicle movements are evaluated subject to constraints such as energy limits and cluster boundaries. The figure shows the route plan at the starting point and the updated plan created after the dynamic demands arrive.
The experiments are conducted in a campus-scale electric vehicle routing environment generated from multiple dataset configurations. Each dataset includes ten charging stations, customer locations, and a depot. The k-means algorithm divides the customer set into clusters, with the number of clusters varying according to the size of the dataset. Customers located at the same physical coordinates are grouped as same-site locations, allowing the vehicle to visit a single representative point while still performing the required service operations for all customers at that site. The electric vehicle starts each episode at the depot with a full battery and an initial payload. Travel distances are obtained from a preprocessed shortest-path matrix of the campus road network, and travel times are computed deterministically using a constant cruising speed. Battery consumption increases linearly with distance traveled, and recharging is allowed only at designated charging stations unless depot charging is explicitly enabled. Charging duration is modeled as a linear function of the missing energy. An energy-feasibility rule ensures that the vehicle can visit a customer only if it can subsequently reach at least one charging station; otherwise, the vehicle is automatically directed to the nearest charger. Dynamic requests are incorporated using configurable dynamic ratios: a portion of the customers is hidden at the start of the episode and gradually revealed according to a probabilistic arrival mechanism. If the active visit list becomes empty while unreleased requests remain, new customers are injected to maintain continuity. Episodes conclude only when all static and dynamic customers have been served and the vehicle has returned to the depot. The DRL, greedy-search, random-policy, and Genetic Algorithm baselines all operate under the same environment constraints, which ensures that each method produces comparable results.
Human movement or pedestrian dynamics are not explicitly simulated in the environment. Instead, variability in the routing process is introduced through dynamically revealed customer requests controlled by the degree of dynamism parameter. The road network, travel distances, and vehicle dynamics remain deterministic, ensuring that observed variations in routing behavior arise from demand uncertainty and policy decisions rather than random motion.
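The energy-feasibility rule described above can be sketched as a simple check. This is a minimal Python illustration: the consumption rate, distance values, and the `dist` lookup are hypothetical placeholders, not values from the campus dataset.

```python
def is_energy_feasible(soc, here, customer, stations, dist, rate):
    """Allow a customer visit only if, after driving there, the vehicle can
    still reach at least one charging station. `rate` is the linear energy
    consumption per unit distance; `dist` maps node pairs to shortest-path
    distances (possibly asymmetric)."""
    soc_after = soc - rate * dist[here, customer]
    if soc_after < 0:
        return False  # cannot even reach the customer
    return any(soc_after >= rate * dist[customer, s] for s in stations)

# Toy asymmetric distances over nodes 0 (depot), 1 (customer), 2 (station).
dist = {(0, 1): 4.0, (1, 2): 3.0, (1, 0): 5.0}
```

When this check fails for every remaining customer, the environment's fallback applies: the vehicle is redirected to the nearest charger before the route continues.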
The following assumptions are made for all the datasets, dynamicity levels, and routing strategies:
General Assumptions
Each customer request has a known service time and a pickup/delivery (P/D) quantity.
Customers sharing the same coordinates are treated as a single service point, and all related services are completed sequentially when the vehicle arrives.
All travel times are deterministic and computed using the shortest-path distances from a preprocessed distance matrix.
Vehicle Operation Assumptions
Each episode begins with the vehicle starting at the depot with a fully charged battery and a predefined initial payload.
The vehicle travels at a constant speed, and its energy consumption is linearly proportional to the distance traveled.
The vehicle can charge only at designated charging stations, unless depot charging is explicitly enabled for a scenario.
Charging duration is modeled as a linear function of the missing energy, using a fixed charging-rate parameter.
At the end of every episode, the vehicle is required to return to the depot, regardless of the routing strategy used.
Energy and Feasibility Assumptions
A customer can only be served if the vehicle has enough battery to reach the node and then continue to at least one charging station.
If this feasibility condition is violated, the environment automatically redirects the vehicle to the nearest charging station before continuing the route.
Dynamic Demand Assumptions
A configurable portion of customer requests is initially hidden and revealed progressively through a probabilistic arrival mechanism during the episode.
At each decision step, a limited number of dynamic customers may be injected into the visit list.
If the visit list becomes empty while undisclosed customers remain, the simulation forces the release of new requests to prevent premature termination.
An episode terminates only after all customers in the pre-planned route, together with all dynamically revealed customers, have been served and the vehicle has returned to the depot.
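The dynamic demand assumptions above can be sketched as follows. The reveal probability and per-step injection cap are illustrative placeholders, not the parameter values used in the experiments.

```python
import random

def reveal_dynamic(hidden, visit_list, p_reveal=0.3, max_inject=2, rng=random):
    """At a decision step, move up to `max_inject` hidden customers into the
    visit list, each with probability `p_reveal`. If the visit list is empty
    but hidden customers remain, force one release so the episode does not
    terminate prematurely."""
    released = []
    for c in list(hidden):
        if len(released) >= max_inject:
            break
        if rng.random() < p_reveal:
            hidden.remove(c)
            released.append(c)
    # Forced release keeps the simulation alive when the visit list runs dry.
    if not visit_list and not released and hidden:
        released.append(hidden.pop(0))
    visit_list.extend(released)
    return released
```

Calling this at every decision step reproduces the behavior described in the assumptions: customers trickle in probabilistically, and the episode can only end once both `hidden` and `visit_list` are empty.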
3.2. Dataset
The Eskişehir Osmangazi University (ESOGÜ) Meşelik Campus road network provides a realistic test environment. All customer and charging station locations are taken from real coordinates [
38,
39]. Each node represents demand from campus buildings. In addition, the dataset includes real road distances rather than Euclidean approximations. An asymmetric distance matrix is available [
40], which enhances scenario realism by reflecting actual travel path. Road distances often vary due to traffic regulations and environmental restrictions. Outbound and return directions may not be the same due to traffic and road conditions. Using the distance matrix allows for accurate utilization of these effects. The matrix was generated using data from OpenStreetMap [
41,
42] and Simulation of Urban MObility (SUMO, version 1.22.0) [
43,
44].
ESOGU-EVRP-PDP-TW includes 118 customers. Additionally, 10 charging stations are distributed across the campus [
39]. The dataset is divided into instances with 5, 10, 20, 40, 60, 80, and 100 customers. Each test group follows a Random (R), Clustered (C), or Random-Clustered (RC) distribution [
38]. Pickup and delivery locations are allocated within the campus, and each node is assigned to exactly one cluster. There is a single delivery point. The dataset is shown in
Figure 2.
All requests and routes are plotted on the campus map. The ESOGU Meşelik Campus, together with important locations and sample pickup and delivery points, are depicted in
Figure 2. Realistic trip times are obtained by computing distances between nodes using the real road network. In addition to an energy-consumption matrix proportional to travel distance, representing EV battery usage, the dataset contains a real-world distance matrix, which is crucial for assessing routing performance. Overall, the dataset provides a practical testbed for research on dynamic routing.
Table 1 displays a sample of the dataset.
Table 1 lists the customer IDs and types. It also includes information about their locations, pickup (P) or delivery (D) status, service times (ST), and earliest and latest starting times (ESTTS and LSTTS). This example table uses the ESOGU-EVRP-PDP-TW-C5 example.
The technical characteristics of the electric vehicle employed in each trial scenario are summarized in
Table 2. These parameters define the vehicle’s energy consumption, charging behavior, load capacity, and routing constraints. Battery consumption and recharging processes are modeled using linear relationships, which are commonly adopted in electric vehicle routing studies to balance modeling simplicity and computational efficiency. For clarity, energy-related parameters are expressed in kilowatt-seconds (kWs), where 1 kWh corresponds to 3600 kWs. The same set of vehicle specifications is applied uniformly across all clusters and datasets to ensure methodological consistency and enable fair comparisons among the evaluated routing strategies.
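Given the linear charging model and the kWs convention above (1 kWh = 3600 kWs), charging duration follows directly from the missing energy. The battery capacity, state of charge, and charging rate below are illustrative numbers, not the vehicle specifications of Table 2.

```python
KWS_PER_KWH = 3600  # 1 kWh corresponds to 3600 kWs

def charge_time(soc_kws, capacity_kws, rate_kws_per_s):
    """Linear charging model: duration = missing energy / charging rate."""
    missing = capacity_kws - soc_kws
    return missing / rate_kws_per_s

# Illustrative: a 10 kWh battery at 40% charge, charging at 2 kWs per second.
capacity = 10 * KWS_PER_KWH                     # 36,000 kWs
t = charge_time(soc_kws=0.4 * capacity,
                capacity_kws=capacity,
                rate_kws_per_s=2.0)
# Missing energy is 21,600 kWs, so charging takes 10,800 s.
```

The same linear form applies regardless of the charger visited, which is what allows charging duration to enter the routing cost as a simple additive term.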
3.3. Proposed Method
This section examines the D-EVRP in detail. Details are provided for the Double DQN-based DRL algorithm for EVRP to increase scalability. The MDP model and associated dynamic decision processes for solving the problem are presented in
Figure 3. First, a dataset was created and transformed into a database containing the necessary information for problem solving. A starting and ending point was designated as a depot. Customers, charging stations, and pickup and delivery points are located on the campus. All data are then normalized for raw data processing. Since multi-agent operation is required, customers in the dataset are clustered using K-means. The D-EVRP process is defined as an MDP. At each decision step, the agent selects an action based on the current system state. A decision step is triggered whenever a meaningful state change occurs, such as the completion of a service, arrival at a node, initiation or completion of a charging operation, or the appearance of a new dynamic customer request. Therefore, routing decisions are updated immediately in response to dynamic events, allowing the policy to adapt online without relying on fixed re-optimization intervals. The state (s_t) includes the vehicle’s current location, state of charge (SoC), current load, unvisited customer list, and dynamically arriving customers. The action (a_t) determines the next point to visit: a customer, charging-station, or depot node is selected subject to the vehicle’s energy and load constraints. The reward (r_t) mechanism treats traveled distance as a negative reward, biasing the policy toward shorter routes. Charging-station visits also incur a small negative reward, discouraging unnecessary route extensions. To prevent the vehicle from completely depleting its energy, i.e., from being stranded, very high penalties are applied. Unlike myopic policies, which evaluate only the immediate next node, a predictive mechanism assesses both the next node and the subsequent node in advance; by looking ahead more than one step, potential dead-ends and infeasible future states are avoided. Positive rewards are granted when the vehicle reaches a node and performs pickup and delivery tasks.
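The reward structure described above can be sketched as a single shaping function. All weights and penalty magnitudes here are illustrative placeholders, not the tuned coefficients of the study.

```python
def step_reward(distance, is_charging_visit, stranded, served_pickup_delivery,
                w_dist=1.0, w_charge=0.5,
                stranded_penalty=1000.0, service_bonus=10.0):
    """Distance and charging detours are penalized, battery depletion
    (stranding) is heavily penalized, and completed services are rewarded."""
    r = -w_dist * distance
    if is_charging_visit:
        r -= w_charge          # discourage unnecessary charging detours
    if stranded:
        r -= stranded_penalty  # very high penalty for running out of energy
    if served_pickup_delivery:
        r += service_bonus     # positive reward for a completed service
    return r
```

Because the distance term is always negative, the agent is pushed toward short routes by default, while the large stranding penalty dominates whenever an action would leave the battery unrecoverable.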
The Initialize Episode step begins each training episode. The vehicle is placed at the depot, values such as SoC and load are reset, and the customer list returns to its initial state, ready for rerouting. The goal is for each episode to begin as an independent route-planning problem. The agent then receives state information from the environment; all components defined in the MDP are read here. The next action is selected under an exploration scheme: with probability $1 - \varepsilon$, the agent selects the most appropriate action based on the Q-values, and otherwise it explores. This mechanism both enables the discovery of new routes and triggers the application of the learned policy. The selected action is then executed in the environment. The agent travels to a customer location and performs pickup and delivery operations; if energy is low, it updates its route to the nearest charging station. If energy is not low, meaning the vehicle does not need charging, it re-evaluates the candidate customers. The route maintains a visit list that also includes dynamic customers: dynamic customers are added to the visit list at random times, so the route changes continuously. The agent receives a reward for the action and moves on to the next state. The transition tuple $(s, a, r, s')$ is added to the replay buffer. This mechanism breaks harmful sequential correlation and ensures stable DQN learning. The agent updates the Q-network ($\theta$) using mini-batch sampling, and the target network ($\theta^-$) is updated only periodically for stability. All customers must be serviced for the episode to complete. If this condition is not met, the loop returns to the Observe State step, and this pattern continues until the entire route is generated. In the real-time environment, incoming customer requests are dynamically added and assigned to the appropriate cluster, and the vehicle continuously selects the next most suitable customer.
Once the entire algorithm is complete, it produces outputs such as total distance, total time, energy consumption, and the order of customers visited.
Problem Formulation
Before defining the MDP components, it is important to clarify the operational scope of the decision-making process. Although the overall problem may involve multiple vehicles and customer clusters, the learning problem is formulated as a single-vehicle Markov Decision Process within each cluster. Clustering is used purely as a spatial decomposition technique to improve scalability and does not alter the underlying decision model. Specifically, each cluster is assigned exactly one electric vehicle and one DRL agent. The agent controls a single vehicle throughout an episode and makes sequential routing decisions only for that vehicle. There is no explicit vehicle–request assignment decision within the learning process, and interactions between vehicles are not modeled. As a result, the MDP formulation consistently represents a single-vehicle dynamic routing problem, replicated independently across clusters.
D-EVRP is defined as the MDP tuple $(S, A, T, R, \gamma)$. Here, $S$ represents the state space, $A$ represents the action space, $T$ represents the state transition dynamics, $R$ represents the reward function, and $\gamma$ represents the discount factor. The MDP provides a crucial flow for determining the decisions vehicles should make to respond to incoming requests and the resulting actions of the agent. The decision-making system provides control to the vehicles and forms a structure by communicating the necessary inputs and results for cluster-based operation. This means that each agent controls and makes decisions for its own vehicle on a subset of the problem.
State ($S$): The state describes the full status of the system at a decision moment and includes all information needed to determine the next routing or assignment action. It contains the current time within the planning horizon, the attributes of the active vehicles, and the set of unserved or ongoing requests. Each vehicle is characterized by its current position, its remaining battery energy, and its current load. The request set includes all jobs that have not yet been completed, together with their pickup and delivery locations, their release times, and their current progress. Requests that have already been picked up remain assigned to the same vehicle until they are delivered, reflecting the pairing between pickup and delivery tasks, while newly arrived requests wait to be assigned. Additional operational details such as delivery deadlines or available service windows are also part of the state. Since requests arrive dynamically and vehicles move over time, the state evolves in a dynamic manner and grows quickly in size. To make the representation manageable, it is encoded in a structured form that can be processed by the neural network, for example by describing vehicles through their location and battery level and summarizing or limiting the set of pending requests. The problem is treated as episodic, with each episode representing a full day of operations, beginning with all vehicles at the depot and ending once all service tasks for that day have been completed. The dimensionality of the state representation does not grow with the total number of customers in the instance. Scalability is preserved by restricting the action space and state inputs to the current cluster and the active visit list, which remain bounded in size. As problem complexity increases, it is handled through clustering and dynamic task activation rather than by expanding the state vector.
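A fixed-size encoding of this kind can be sketched as follows. The specific features, normalization constants, and the choice of keeping only the nearest pending requests are assumptions for illustration, not the exact encoding used in the study.

```python
# Sketch of a fixed-size state encoding (feature choices are assumptions).
def encode_state(pos, soc, load, pending, battery_cap=100.0, load_cap=50.0,
                 area=(1000.0, 1000.0), max_pending=5):
    """Return a flat feature vector: normalized vehicle status plus a
    bounded summary of the nearest pending requests."""
    x, y = pos
    feats = [x / area[0], y / area[1], soc / battery_cap, load / load_cap]
    # Keep only the closest max_pending requests so the vector size is fixed.
    nearest = sorted(pending,
                     key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)[:max_pending]
    for px, py in nearest:
        feats += [px / area[0], py / area[1]]
    # Zero-pad when fewer than max_pending requests are waiting.
    feats += [0.0, 0.0] * (max_pending - len(nearest))
    return feats
```

Because the vector length depends only on `max_pending`, the input dimension stays constant as the total number of customers grows, matching the scalability property described above.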
Action ($A$): An action represents the selection of the next feasible node to be visited by the controlled electric vehicle. The action space consists of customer nodes, charging stations, and the depot, subject to battery and load feasibility constraints. At each decision step, the agent selects exactly one next node for the vehicle to visit. Vehicle–request assignment decisions are not part of the action space, since each agent controls a single vehicle within a predefined cluster.
Transition ($T$): After an action is applied and any external event occurs, the system moves to a new state. Vehicles travel toward their pickup or delivery nodes, and travel times are taken from the distance matrix. Reaching a pickup node activates the job and increases the vehicle's load, while reaching a drop-off node completes the job and removes it from the active list. The clock then advances to the next event, such as a new request or a vehicle's arrival at its next node. During this period, additional requests may appear either from a dynamic arrival model or from real demand data and enter the system as pending tasks. As vehicles move, their battery levels decrease according to the distance they have traveled, and a vehicle directed to a charging point begins to replenish its energy; charging is assumed to continue to full capacity unless partial charging is explicitly modeled. This transition is affected by the uncertainty in demand arrivals and, if considered, the variability in travel times. After this update, the system reaches a new state reflecting the revised vehicle locations, energy levels, and the current pending or ongoing demand set. The MDP framework naturally captures the sequential and dynamic evolution of the system [
26]. After the transition, the new state $s'$ reflects updated vehicle positions, battery levels, and the updated set of requests.
Reward ($R$): The reward function is constructed to direct the learning process toward efficient routing and high service quality. At each decision instant, a composite reward is produced in which several operational objectives are reflected. A positive reward is given when a delivery is completed, encouraging timely service. The distance matrix value $d_{ij}$ is subtracted from the reward as a vehicle moves from node $i$ to node $j$, adding a cost commensurate with the distance covered. Battery depletion also affects the cost because low battery levels often require detours for recharging. The cumulative reward reflects the balance between completed services, their timeliness, and the total distance and energy used. For example, a delivered job yields a fixed positive reward, whereas each relocation results in a deduction proportional to the distance traveled. The numerical values used in the reward function are not intended to represent exact physical costs, but rather to encode the relative importance of key operational events. The positive reward magnitude assigned to successful service completion is deliberately larger than step-level penalties in order to ensure that feasibility and task fulfillment dominate the learning objective. Conversely, smaller negative rewards are applied to distance-based movements and inefficient actions to discourage unnecessary detours without overshadowing the primary objective of completing all requests. This form of asymmetric reward scaling is a common reward-shaping strategy in reinforcement learning-based routing and sequential decision problems, where relative weighting is used to stabilize training and guide policy convergence toward feasible solutions [
14,
25,
26]. In the context of electric vehicle pickup and delivery, the incentive structure directs the Markov Decision Process toward routing and assignment choices that reduce operating costs while preserving service quality by encoding these objectives.
This MDP formulation provides the basis for applying reinforcement learning to the routing decisions. The state includes all information needed for planning, such as vehicle status, pending requests, and the current time. The action space reflects the available routing and assignment choices, while the reward structure is designed to encode the operational objectives of the EVRP with pickups and deliveries. By expressing the dynamic EVRP as an MDP, the problem becomes suitable for reinforcement learning methods, allowing an effective dispatching policy to be learned as shown in previous studies [
45,
46]. The following section introduces the DRL model constructed to solve this MDP formulation.
The reward parameters were selected to reflect relative priorities rather than exact cost magnitudes. In particular, successful task completion is intentionally weighted more strongly than step-level movement penalties to ensure that feasibility and service completion dominate the learning objective. While different numerical values may lead to variations in learning dynamics, the qualitative behavior of the policy is primarily governed by the reward structure rather than precise parameter tuning. A systematic sensitivity analysis of reward weights is beyond the scope of this study and constitutes an interesting direction for future work.
3.4. Deep Reinforcement Learning Model
3.4.1. Network Architecture
For clarity and reproducibility, the full state representation and network input structure are described explicitly in this subsection. To solve the aforementioned MDP, a DRL approach based on the DQN family of algorithms is employed. Specifically, a DDQN with an $\varepsilon$-greedy exploration approach, a Dueling Network design, and PER is employed. These improvements are used to manage the complexity of the state-action space and to increase learning stability. Below is a summary of the general DRL design and how it fits into the EVRP framework. For the Q-network, a Dueling DQN design is used. After a shared feature-extraction layer, the network splits into two streams: one estimates the state-value function $V(s)$, and the other computes the advantage function $A(s, a)$. These outputs are then combined to obtain the Q-values for each action as:

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right) \qquad (1)$$
This approach improves learning efficiency in contexts with vast or complicated action spaces by allowing the model to independently learn the value of a state and the relative advantages of each action.
The DQN model receives a structured representation of the state as its input, where vehicle and request information are encoded into a feature vector. In a typical configuration, one portion of this input is formed by a matrix with dimensions equal to the number of vehicles multiplied by the selected vehicle attributes, such as position and battery level, while another portion summarizes the set of pending requests through either a fixed-size subset or aggregated statistics such as the number of waiting customers and the spatial distribution of their locations. This information is flattened or otherwise preprocessed before being passed through several fully connected layers, or through one-dimensional convolutional layers when the structure is treated as a sequence. These layers act as a shared feature extractor. After this shared module, the network is divided into two branches: one branch produces a single scalar value representing $V(s)$, and the other produces a vector representing the advantage values $A(s, a)$, whose dimension matches the number of available actions.
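The mean-subtracted combination of the two branches can be sketched in plain Python; this is a minimal numerical illustration of the standard dueling aggregation, not the network implementation itself.

```python
# Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean over a' of A(s,a').
def dueling_q(value, advantages):
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```

Subtracting the mean advantage makes the decomposition identifiable: the mean of the resulting Q-values equals $V(s)$, while the relative ordering of actions is preserved.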
Figure 4 presents the overall Dueling DDQN [
8,
17] architecture designed for the D-EVRP framework. The procedure starts with the environment state, which is converted via a shared feature extractor into a latent representation. This latent vector is then propagated into two separate computational branches: the value stream, which estimates $V(s)$, and the advantage stream, which estimates $A(s, a)$. Within the dueling structure, these two quantities are combined to compute the state–action value $Q(s, a)$, enabling a more stable and expressive evaluation of individual actions. The resulting Q-values are subsequently used to select the action with the highest estimated return, forming the basis of the decision-making mechanism in the dynamic electric vehicle routing environment.
3.4.2. Learning Algorithm
The value and advantage outputs are combined to obtain the final action-value estimates, which are then used during learning through the DDQN update mechanism (
Figure 4). Under the standard DQN formulation, the target value for a transition $(s, a, r, s')$ is expressed as

$$y^{\mathrm{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-),$$

where $Q(\cdot, \cdot; \theta^-)$ represents the target network parameterized by $\theta^-$. This scheme frequently leads to an overestimation bias because the same function approximator is used to both select and evaluate candidate actions. To address this issue, DDQN separates these roles. The corresponding target is defined as

$$y^{\mathrm{DDQN}} = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big).$$
Following the computation of the target value, the Q-network parameters are updated based on the temporal-difference (TD) error,

$$\delta = y^{\mathrm{DDQN}} - Q(s, a; \theta),$$

which quantifies the discrepancy between the current estimate and the refined target. The parameter update of the online network is then performed according to the standard Q-learning rule:

$$\theta \leftarrow \theta + \alpha \, \delta \, \nabla_\theta Q(s, a; \theta),$$

where $\alpha$ denotes the learning rate.
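The decoupled selection/evaluation in the DDQN target can be illustrated with a small numerical sketch over Q-value lists (illustrative, independent of any particular network library).

```python
# Double DQN target: select a' with the online net, evaluate with the target net.
def ddqn_target(r, gamma, q_online_next, q_target_next, done=False):
    if done:
        return r  # terminal transition: no bootstrapped value
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]
```

Note that the online network's favorite action may have a lower value under the target network, which is exactly how the overestimation bias of vanilla DQN is dampened.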
To balance exploration and exploitation during training, an $\varepsilon$-greedy policy is adopted. The action-selection mechanism is defined as

$$a = \begin{cases} \arg\max_{a'} Q(s, a'; \theta) & \text{with probability } 1 - \varepsilon, \\ \text{a uniformly random action} & \text{with probability } \varepsilon. \end{cases}$$
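A minimal Python sketch of this $\varepsilon$-greedy selection rule:

```python
import random

# ε-greedy action selection over a list of Q-values (sketch).
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit
```

In practice, $\varepsilon$ is typically annealed from a high initial value toward a small floor over the course of training.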
When prioritized experience replay is employed, transitions are sampled with probability

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}},$$

where $\alpha$ controls the degree of prioritization. To correct the sampling bias introduced by this mechanism, importance-sampling weights are computed as

$$w_i = \left( \frac{1}{N \cdot P(i)} \right)^{\beta},$$

where $N$ denotes the replay buffer size and $\beta$ gradually increases to counteract the non-uniform sampling distribution.
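These two PER quantities can be computed directly from a list of priorities, as in the following sketch (weights are normalized by their maximum, a common stabilization choice).

```python
# Prioritized-replay sampling probabilities P(i) = p_i^α / Σ_k p_k^α.
def per_probs(priorities, alpha):
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

# Importance-sampling weights w_i = (1 / (N * P(i)))^β, normalized by max.
def is_weights(probs, beta):
    n = len(probs)
    w = [(1.0 / (n * p)) ** beta for p in probs]
    w_max = max(w)
    return [x / w_max for x in w]
```

Rarely sampled transitions receive the largest corrective weight, which offsets the bias introduced by sampling high-priority transitions more often.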
Within the Dueling architecture, the state-value and advantage components are combined to produce stable action-value estimates.
The action–value function is computed using the dueling architecture defined in Equation (1), which ensures that the advantage values are identifiable while preserving the relative preferences among actions. The next action is selected using the online network, whereas its value is evaluated using the target network, a modification demonstrated by Van Hasselt et al. [
8] to substantially reduce overestimation and enhance stability.
Each training episode simulates the sequential arrival of customer requests together with the vehicle decisions made throughout the day. At every decision point, the agent observes the current state $s$ and selects an action $a$ according to the $\varepsilon$-greedy policy. The selected action is then executed, meaning that the corresponding assignment or routing choice is applied, and the simulator progresses to the next event, yielding a reward $r$ and the subsequent state $s'$. The resulting transition $(s, a, r, s')$ is stored in the replay buffer. At regular intervals, a minibatch of transitions is sampled from the buffer, and a gradient-descent update is performed on the DDQN loss function [17]:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \big( y^{\mathrm{DDQN}} - Q(s, a; \theta) \big)^2 \right],$$
where $y^{\mathrm{DDQN}}$ is the DDQN target defined earlier and $\mathcal{D}$ denotes the distribution of sampled experiences. The network parameters are updated using the Adam optimizer with an appropriately chosen learning rate. The target-network parameters $\theta^-$ are periodically reassigned to the current values of $\theta$ after a fixed number of training iterations to maintain training stability. Mini-batch updates are performed with batch sizes such as 32 or 64, and gradient clipping is applied to avoid excessively large updates that may arise from high-variance rewards.
3.5. Clustering for Scalability
To address scalability, the environment is divided via a spatial clustering technique into smaller areas that can be managed separately by parallel agents. The campus is divided into geographically cohesive zones, each of which is assigned a DDQN agent and a subset of vehicles. This decomposition reduces each agent's state and action space, which facilitates training and real-time decision-making. The clustering step does not alter the fundamental optimization problem; rather, it is only a scalability tool. To create
K clusters, K-means is used on past request locations or the coordinates of all pickup and delivery sites [
47]. The resulting multi-agent structure allows each zone to be solved independently, offering near-linear scalability as more clusters and agents are added. Because vehicles stay inside their designated zones, global optimality may decrease; this effect is regulated by selecting cluster boundaries that limit potential cross-region gains. This strategy is consistent with decomposition techniques that are frequently employed in large-scale VRPs, where dividing the problem into smaller components greatly shortens computation times. The clustering mechanism employed in this study is introduced strictly as a scalability tool rather than as a solution-quality enhancement technique. Its primary purpose is to reduce state and action space complexity in large-scale dynamic settings, and therefore the impact of clustering on solution quality is not evaluated as an independent performance factor.
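For illustration, Lloyd-style K-means over 2-D request coordinates can be sketched in a few lines of pure Python; the distance metric (squared Euclidean), iteration count, and seeding are assumptions and may differ from the study's implementation.

```python
import random

# Minimal K-means over 2-D points (illustrative sketch).
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to its cluster's centroid.
        new_centers = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centers.append((sum(p[0] for p in cl) / len(cl),
                                    sum(p[1] for p in cl) / len(cl)))
            else:
                new_centers.append(centers[i])   # keep an empty cluster's center
        centers = new_centers
    return centers, clusters
```

Each resulting cluster would then be handed to its own DDQN agent, as described above.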
The impact of clustering on solution quality has been discussed in related literature. Clustering-based decompositions are primarily introduced to reduce state-space complexity and stabilize learning or optimization processes, rather than to directly improve optimality. For example, Rajeh et al. [
48] show that spatial clustering in multi-agent DRL frameworks significantly improves tractability and learning stability, while potentially limiting global coordination across clusters. Similarly, García Sánchez et al. [
49] emphasize that clustering introduces an inherent trade-off between computational efficiency and global optimality in EV routing problems. Consistent with these findings, clustering in this study is used solely as a scalability mechanism. While it may restrict certain cross-cluster routing opportunities, this effect is controlled through geographically coherent cluster boundaries, resulting in substantial computational gains with only a limited impact on overall solution quality.
In this context, clustering facilitates real-time judgments by restricting each agent’s view to a small collection of vehicles and requests, and it speeds up DRL training by lowering environmental variability within each zone. Thus, scalability is achieved through a divide-and-conquer approach. Separate DDQN agents are trained for each cluster using the methods described, and the final solution is formed by combining their locally optimized policies. This strategy provides major computational gains with only minimal loss in overall performance. It should be noted that the use of clustering introduces an inherent trade-off between scalability and global solution optimality. By restricting vehicles and requests to predefined geographic clusters, the size of the state and action spaces is significantly reduced, enabling faster training and real-time decision making. However, this decomposition may limit cross-cluster routing opportunities that could otherwise lead to marginally shorter global routes. In this study, clustering is therefore employed strictly as a scalability mechanism rather than as a means to improve solution quality. This design choice prioritizes computational feasibility and online applicability, which are essential requirements in large-scale dynamic routing settings.
4. Experimental Setup
4.1. DRL Training Settings
The Deep Reinforcement Learning (DRL) framework in this study learns an adaptive routing policy that manages dynamic arrivals, battery limits, and spatial variation. For each customer cluster, an independent agent is trained through repeated interactions with the environment. At each decision step, the agent observes its battery level, load, and spatial features, selects a feasible customer or charging station, and receives a reward reflecting routing quality. Through these interactions, the agent updates its Q-function to reduce travel distance, avoid unnecessary charging, and serve both available and dynamically revealed requests. Training relies on experience replay, $\varepsilon$-greedy exploration, and periodic target-network updates for stable learning.
Algorithm 1 presents the full dueling DQN training workflow. It describes how the agent manages dynamic customer arrivals, gathers transitions, updates the Q-network, and engages with the environment. Initialization of the online and target networks is the first step in the training process, which then moves on to episode-based learning, in which actions are selected either at random or in accordance with the learnt policy. New dynamic requests may emerge following each action, giving the agent the opportunity to learn from actual online demand. Periodic synchronization enhances convergence, while the replay buffer and target network stabilize updates. Before gradient updates were initiated, the replay buffer was populated through an initial warm-up phase, during which the agent interacted with the environment using pure exploration. Training updates started only after a sufficient number of transitions were collected to form stable mini-batches. For any experimental configuration, this procedure offers a transparent and repeatable pipeline.
The reward function determines how the agent evaluates route decisions and shapes the learned policy. It is designed to reduce travel distance and to encourage on-time service. While positive rewards are given for serving customers, long trips and charging actions result in negative rewards. Visiting an unnecessary charging station results in a large penalty to prevent overcharging. The reward structure produces a policy that conserves energy, utilizes distance efficiently, and is responsive to dynamic customer arrivals. The complete set of numerical reward terms is provided in
Table 3 to ensure full transparency and reproducibility of the training settings.
| Algorithm 1: Dueling Double Deep Q-Learning with Dynamic Customer Arrivals for EV Routing |
The numerical values of the reward components are selected to balance routing efficiency, service completion, and energy feasibility. Distance-based penalties directly reflect the optimization objective, while charging penalties discourage unnecessary detours without preventing mandatory recharging. All reward parameters are kept fixed across all datasets and experiments to ensure reproducibility. Similar distance-driven and penalty-based reward structures are commonly adopted in dynamic routing and EV-focused reinforcement learning studies.
The dueling DQN agent is trained with a fixed set of hyperparameters that are kept identical across all clusters and scenarios. The discount factor, exploration schedule, replay memory size, and soft target-update coefficient are chosen to balance stability and adaptability in dynamic routing conditions. In addition, Double DQN and PER are enabled to reduce overestimation bias and to focus updates on informative transitions.
Table 4 summarizes the hyperparameters used in all DRL experiments. The neural network size and replay buffer capacity were selected based on empirical stability considerations and commonly adopted practices in deep reinforcement learning for routing and control problems. Two hidden layers with 256 neurons were found to provide sufficient representational capacity to capture the nonlinear interactions between spatial, battery, and demand-related features, while avoiding over-parameterization that may cause unstable learning. Similarly, the replay buffer size of 80,000 transitions was chosen to balance sample diversity and memory efficiency, ensuring exposure to both recent dynamic events and past routing experiences. Preliminary trials with smaller buffers led to faster forgetting of rare but critical states, while larger buffers did not yield noticeable performance improvements. Therefore, a fixed and consistent hyperparameter set was adopted across all experiments to ensure fair comparison and reproducibility.
The hyperparameters reported in
Table 4 were selected to ensure stable learning behavior in a dynamic routing environment. The learning rate controls the magnitude of gradient updates and was set to avoid unstable oscillations in Q-value estimates. The replay memory size determines the diversity of the stored experiences and helps mitigate correlation between consecutive transitions, which is critical under dynamic customer arrivals. The soft target update coefficient $\tau$ regulates the update speed of the target network, balancing stability and responsiveness. All hyperparameters were kept fixed across experiments to isolate the effect of demand dynamism and ensure fair comparison among routing strategies.
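The soft (Polyak) target update $\theta^- \leftarrow \tau \theta + (1 - \tau)\,\theta^-$ can be sketched as an element-wise blend of parameter vectors; this is a generic illustration of the mechanism, not the study's training code.

```python
# Polyak (soft) target-network update: θ⁻ ← τ·θ + (1 − τ)·θ⁻ (sketch).
def soft_update(online_params, target_params, tau):
    return [tau * o + (1.0 - tau) * t
            for o, t in zip(online_params, target_params)]
```

Small values of $\tau$ make the target network track the online network slowly, trading responsiveness for stability.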
During training, dynamic customer arrivals were generated stochastically according to predefined degrees of dynamism, while the exact arrival times and customer sequences were randomized across episodes. For evaluation, the trained DDQN policy was tested on previously unseen demand realizations generated with different random seeds but under the same structural constraints (road network, charging stations, and vehicle specifications). No training episodes were reused during the testing. This separation ensures that the policy is evaluated on novel dynamic scenarios rather than memorized request patterns, reducing the risk of implicit overfitting.
The Q-network adopts a dueling deep Q-learning architecture, where the state value and action advantages are modeled separately to enhance the stability of value estimation. The input consists of a continuous state vector that includes normalized battery level, payload, and distance-based spatial features. A shared representation is generated using two fully connected layers with 256 neurons and ReLU activation, followed by layer normalization; dropout is applied when additional regularization is required. After the shared layers, the network branches into a value stream producing $V(s)$ through a 128-neuron layer and an advantage stream of equal size generating action-specific advantages $A(s, a)$. The final Q-values are obtained by combining the two outputs according to the dueling formulation, improving identifiability and yielding a more stable learning process in dynamic electric-vehicle routing environments.
All experiments were conducted on a workstation equipped with an NVIDIA RTX 3080 GPU (NVIDIA Corporation, Santa Clara, CA, USA), an Intel 12700 CPU (Intel Corporation, Santa Clara, CA, USA), and 32 GB of 3200 MHz RAM.
4.2. Baseline Configurations
Exact Method: The proposed DDQN method is compared with the MILP model developed for EVRPTW, one of the most widely used deterministic solution approaches in the literature. This model is adapted based on the formulation presented by Schneider et al. [
3] and consists of linear constraints that ensure time window compatibility, battery constraints, and vehicle availability, along with binary routing variables that aim to minimize the total travel distance. Charging stations are considered as intermediate stops in the model, representing a full charging process. Since all variables, parameters, and the decision structure of the MILP are defined in detail in the related work, the reader is referred to the comprehensive mathematical formulation by Schneider et al. [
3]. In this study, the MILP model is used only for small instances or for subsequent comparison; the obtained optimal solutions serve as a lower bound (benchmark) for evaluating the performance of the DDQN-based method.
Greedy Policy (Myopic): The myopic heuristic system serves as a baseline that makes dispatching decisions based solely on immediate objectives. Each newly arriving request is assigned to the vehicle that can serve it at the earliest time or lowest instantaneous cost, typically corresponding to the closest available vehicle. Vehicles then follow shortest paths without considering future consequences, which makes the approach computationally efficient but prone to globally suboptimal solutions such as unfavorable vehicle positioning or unnecessarily long future routes. Algorithm 2 describes the greedy search baseline used in our experiments [
50,
51,
52]. At each step, all currently available customers, including dynamically revealed ones, are evaluated, and the vehicle selects among those feasible customers that can be safely served with the remaining battery, ensuring reachability of both the customer and a charging station if required. If no such customer exists, the vehicle travels to the nearest charging station or depot for recharging. Due to its myopic nature, this policy provides a lower-bound reference for evaluating the agent’s long-term planning capability.
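The feasibility-aware greedy selection described above can be sketched as follows. The energy model (a constant consumption rate per unit distance) and the single-charger check are simplifying assumptions for illustration.

```python
# Myopic selection: nearest customer that is reachable with enough reserve
# to subsequently reach a charger (energy model is illustrative).
def greedy_next(vehicle_pos, soc, customers, charger_pos, kwh_per_km=0.2):
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    # A customer is feasible if serving it still leaves enough energy
    # to reach the charging station afterwards.
    feasible = [c for c in customers
                if (dist(vehicle_pos, c) + dist(c, charger_pos)) * kwh_per_km <= soc]
    if not feasible:
        return None                       # fall back: go recharge
    return min(feasible, key=lambda c: dist(vehicle_pos, c))
```

Returning `None` corresponds to the branch in Algorithm 2 where the vehicle heads to the nearest charging station or depot.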
Random Dispatch (Random): A random assignment baseline is also considered, in which incoming requests are allocated to any feasible vehicle at random, selected from those with available capacity. When a vehicle carries multiple pending tasks, the next task to execute is chosen in random order as well. This approach produces a naive lower bound on performance, as it does not incorporate any criterion related to distance, time, battery limitations, or anticipated future tasks. In empirical tests, the random policy consistently yields higher total travel distances and a larger proportion of delayed services, since routing decisions are made without guidance from system conditions or expected outcomes. Its inclusion as a baseline allows the performance gains produced by structured decision-making through learning-based or heuristic approaches to be clearly quantified.
Algorithm 3 outlines the random dispatch baseline used in the experiments. Similar to the greedy policy, the method operates with an electric vehicle and maintains a set of pending customers that includes both available and newly revealed dynamic requests. At each decision step, the algorithm identifies all customers that can be safely served given the current battery level, ensuring that the vehicle can both reach the customer and subsequently access a charging station if needed. When at least one feasible customer exists, the vehicle selects one uniformly at random and travels to that location via the shortest path to complete service. If no feasible customer is available, the vehicle moves to the nearest reachable charging station (or depot) to recharge before continuing. This baseline does not incorporate any spatial or temporal logic and therefore represents an uninformed decision strategy. Its purpose is to provide a lower bound on performance and to highlight the extent to which the DRL agent’s policy improves over purely random decision making.
| Algorithm 2: Greedy (Myopic) Routing Heuristic for an EV |
![Applsci 16 00278 i002 Applsci 16 00278 i002]() |
The Genetic Algorithm (GA) baseline [53,54] is implemented as a high-quality heuristic for the static version of the daily EV routing problem. The GA operates offline assuming full knowledge of daily requests, and each individual encodes vehicle routes as ordered sequences of customer nodes. The initial population is generated using random permutations or constructive heuristics. Tournament selection, ordered crossover (OX), and swap-based mutation are applied at each generation, while repair operators ensure feasibility with respect to capacity, pickup–delivery precedence, and battery constraints; charging stops are inserted automatically when state-of-charge violations occur.
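The OX and swap operators named above can be sketched as follows; the customer IDs and fixed cut points are illustrative, and the repair and battery-feasibility logic is omitted:

```python
def ordered_crossover(parent1, parent2, cut1, cut2):
    """OX: copy parent1[cut1:cut2] into the child, then fill the remaining
    positions with parent2's customers in their original relative order."""
    size = len(parent1)
    child = [None] * size
    child[cut1:cut2] = parent1[cut1:cut2]
    kept = set(child[cut1:cut2])
    fill = [c for c in parent2 if c not in kept]  # preserves parent2 ordering
    idx = 0
    for i in range(size):
        if child[i] is None:
            child[i] = fill[idx]
            idx += 1
    return child

def swap_mutation(route, i, j):
    """Swap-based mutation: exchange the customers at two positions."""
    route = route[:]
    route[i], route[j] = route[j], route[i]
    return route
```

OX is the standard choice for permutation encodings because, unlike one-point crossover, it never duplicates or drops a customer.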
Algorithm 4 summarizes the full GA workflow used for generating static and dynamic routing solutions across customer clusters. The algorithm runs for 300 generations with a population size of 50, preserving the best two individuals through elitism. Fitness is computed as the total travel distance obtained by simulating the route with the distance matrix and battery model used in the DRL environment. For dynamic settings, the GA is re-executed in a rolling-horizon manner whenever new requests appear or at periodic intervals, treating the current system state (vehicle positions, battery levels, and pending tasks) as a static snapshot. Although computationally heavier than the DDQN approach, this periodic re-optimization provides a strong benchmark and an upper bound on achievable routing performance [55,56,57,58,59]. Dynamic customers are inserted deterministically in FIFO order during simulation.
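The overall loop, using the stated parameters (population 50, 300 generations, elitism of 2, tournament selection), might look roughly like this; for brevity the sketch uses swap mutation only and a simplified Euclidean fitness in place of the paper's distance matrix and battery model:

```python
import random

def route_distance(route, coords, depot=(0.0, 0.0)):
    """Total Euclidean tour length: depot -> customers in order -> depot."""
    path = [depot] + [coords[c] for c in route] + [depot]
    return sum(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
               for a, b in zip(path, path[1:]))

def tournament(pop, fits, k, rng):
    """Pick the fittest of k randomly drawn individuals (lower distance is better)."""
    idxs = [rng.randrange(len(pop)) for _ in range(k)]
    return pop[min(idxs, key=lambda i: fits[i])]

def ga_route(coords, generations=300, pop_size=50, elite=2, k=3, seed=0):
    rng = random.Random(seed)
    customers = list(coords)
    pop = [rng.sample(customers, len(customers)) for _ in range(pop_size)]
    for _ in range(generations):
        fits = [route_distance(ind, coords) for ind in pop]
        order = sorted(range(pop_size), key=lambda i: fits[i])
        nxt = [pop[i][:] for i in order[:elite]]      # elitism: keep best two
        while len(nxt) < pop_size:
            child = tournament(pop, fits, k, rng)[:]
            i, j = rng.sample(range(len(child)), 2)   # swap mutation
            child[i], child[j] = child[j], child[i]
            nxt.append(child)
        pop = nxt
    return min(pop, key=lambda ind: route_distance(ind, coords))
```

For the rolling-horizon variant, the same `ga_route` call would simply be re-issued on a snapshot of the remaining customers each time new requests arrive.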
| Algorithm 3: Random Dispatch Baseline for an EV |
![Applsci 16 00278 i003 Applsci 16 00278 i003]() |
The performance of the proposed DDQN approach is assessed against several standard baselines for dynamic routing, including MILP, genetic algorithm, myopic, and random dispatch methods. MILP provides an optimal reference for small instances, while the myopic and random strategies represent lower-bound heuristics. Experimental results show that the DDQN policy clearly outperforms the myopic and random baselines by reducing travel distance and waiting time, and remains competitive with GA and MILP, often staying within a small optimality gap. Unlike MILP solutions, the DDQN agent adapts to real-time request arrivals, enabling it to generate better routes under uncertainty. We demonstrate that the DRL-based method delivers strong dynamic routing performance while supporting real-time decision making in the D-EVRP environment.
It should be noted that the compared solution approaches operate under fundamentally different decision-making paradigms. The proposed DDQN-based method follows an online policy structure, where routing decisions are made sequentially as customer requests are dynamically revealed. In contrast, the Genetic Algorithm performs periodic re-optimization over the full route, and the MILP model is evaluated only for small-scale instances due to computational limitations in dynamic settings. Accordingly, the purpose of the experimental comparison is not to claim strict algorithmic equivalence among all methods, but to examine their practical performance in terms of solution quality and computational efficiency under identical operational constraints. All routing strategies are evaluated within the same environment and are subject to the same battery, charging, and feasibility rules. Any infeasible action that would leave a vehicle without sufficient energy to reach a charging station is prevented at the environment level and is therefore applied consistently across all methods.
| Algorithm 4: Genetic Algorithm (GA) for Cluster-Based EV Routing |
![Applsci 16 00278 i004 Applsci 16 00278 i004]() |
5. Experimental Results and Discussion
This section provides a detailed evaluation of the routing performance obtained under varying dataset sizes and degrees of demand dynamism. Optimal solutions generated by a MILP formulation are first presented for the small-scale instances, establishing deterministic benchmarks that serve as reference points for the subsequent analyses. The experimental results obtained from the heuristic and learning-based strategies (Random Dispatch, Greedy Search, Genetic Algorithm, and the DDQN-based policy) are then reported across all demand levels and cluster sizes. For each scenario, the total travel distance, computation time, and number of required routes are documented. Comparative tables and graphical summaries are finally provided to highlight the influence of dynamic request intensity and instance size on solution quality.
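Throughout this section, the degree of dynamism (DoD) is treated as the share of requests revealed only during execution; a minimal sketch under that standard definition:

```python
def degree_of_dynamism(n_dynamic, n_total):
    """DoD = dynamic requests / total requests
    (0 = fully static instance, 1 = fully dynamic instance)."""
    if n_total <= 0:
        raise ValueError("instance has no requests")
    return n_dynamic / n_total

# Example: a C20 instance in which 8 requests arrive online has DoD = 0.4.
```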
Optimal solutions obtained from the CPLEX solver are summarized in Table 5. In this context, the number of routes (NoR) quantifies the minimum set of vehicle tours necessary to fully serve all requests, whereas the CPLEX computation time denotes the MILP solver’s processing duration for deriving the corresponding optimal solution. For the smallest datasets (C5 and C10), the solver reaches optimality within one second, yielding compact single-route solutions with total distances of 3705.48 m and 3255.85 m, respectively. The larger instances exhibit the expected growth in routing complexity, reflected both in the number of required routes and in the rapid increase of total travel distance.
A review of Table 6 and Table 7 shows clear performance trends across all instance sizes and dynamic ratios. The Random policy performs the worst in every case, with travel distance rising sharply as the problem grows, mainly because it ignores spatial and energy constraints and causes unnecessary detours and charging stops. The Greedy Search strategy produces shorter routes than Random due to its nearest-customer rule, but the absence of long-term planning leads to frequent local minima, especially at higher dynamic levels. In medium and large datasets (C40–C100), the greedy method often chooses customers that seem close initially but result in poor vehicle positioning later, yielding considerably longer tours.
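The nearest-customer rule driving the greedy baseline can be sketched in a few lines; the coordinates and Euclidean metric are illustrative assumptions:

```python
def nearest_customer(pos, pending):
    """Myopic step: serve the closest pending customer, with no regard for
    how the move positions the vehicle for later requests."""
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return min(pending, key=lambda c: dist(pos, c))
```

The rule always picks the locally closest customer, even when a slightly farther one would leave the vehicle better positioned for the remaining tasks, which is exactly the local-minimum behavior observed in the medium and large datasets.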
The Genetic Algorithm performs well in small and medium settings. In datasets C5, C10, and C20, it often produces the shortest or near-shortest routes across most dynamic ratios, reflecting its ability to exploit structure in small clusters. However, its performance declines sharply as instance size grows. In C60, C80, and C100, distances increase significantly, especially above 40% dynamicity. This decline stems from the GA’s static chromosome structure: although dynamic requests are inserted during simulation, the fixed ordering becomes misaligned with the evolving state, reducing effectiveness in highly dynamic environments. In all configurations, the DDQN-based policy demonstrates stable performance, outperforming both the Random and Greedy approaches. On medium and large datasets, it generally outperforms GA at several levels of dynamicity, and its computational time is lower. On smaller sets (C5–C20), DDQN achieves results close to the best heuristic, even though it lacks full static information. For C40–C100, DDQN generally achieves the most stable results across all DoD values. Increasing DoD increases travel distance for all methods due to greater uncertainty. Random and Greedy show the poorest results and the largest fluctuations. GA performs moderately well but becomes unstable as both size and dynamicity increase. Unlike the other methods, DDQN exhibits a controlled increase in travel distance, avoiding the severe degradation seen in the heuristics. This highlights the value of learning long-term decision patterns rather than relying on static or purely local rules.
Figure 5 illustrates how DDQN route distances change under different dynamism levels; the trends vary with dataset size. In the smallest clusters (C5 and C10), distances remain within a narrow range and show only slight increases as DoD grows, suggesting that dynamic arrivals have a limited impact. In the medium clusters (C20 and C40), the curves rise more steadily, and as dynamism strengthens, differences in route preferences begin to appear. The largest datasets (C60, C80, C100) show a clearer upward trend, with distances increasing almost monotonically with DoD. This is a natural consequence of growing problem complexity: as more customers place requests at random times, the uncertainty introduced by dynamic arrivals has a greater effect on routing. Even so, the curves remain smooth, suggesting that the DDQN policy updates routes in a controlled manner. Overall, the figure demonstrates that DDQN remains stable across all sizes, with distance increasing gradually as dynamicity grows and the effect becoming more pronounced in larger, high-density scenarios.
Figure 6 compares the behavior of DDQN and the Genetic Algorithm (GA) across the same DoD spectrum and datasets. The contrast between the two approaches reveals clear performance differences that depend on problem scale and dynamism intensity. In small datasets (C5–C20), the DDQN curves consistently appear below the GA curves, indicating that DDQN achieves shorter overall routes under dynamic conditions. The GA trajectories in this region show larger oscillations, suggesting that population-based search becomes less stable when the number of tasks is small but the arrival times shift unpredictably. In medium-sized datasets (C40 and C60), the distance values of the two methods remain closer, though GA still produces higher costs at most DoD levels. The divergence becomes more marked around 40–80% DoD, where GA exhibits stronger upward swings. This pattern suggests that the increased number of dynamic insertions and relocations interacts negatively with GA’s iterative search mechanism, occasionally pushing solutions toward suboptimal local structures. In the largest dataset (C100), GA consistently produces the highest distance values across all DoD levels. DDQN, although affected by increasing dynamism, maintains a more controlled growth pattern. This behavior highlights the comparative advantage of learning-based policies in large-scale, high-volatility scenarios, where rapid adaptation is required. The GA, by contrast, shows sensitivity to both scale and dynamism, resulting in inflated routing costs as problem complexity intensifies. Overall, Figure 6 demonstrates that DDQN achieves more stable and efficient routing across varying dynamism levels, particularly in moderate-to-large task environments. GA remains competitive only in limited cases but tends to exhibit inconsistent behavior as DoD increases or dataset size becomes large.
Figure 7 shows how total travel distance varies with the degree of dynamism (DoD) across four methods: DDQN, Genetic Algorithm, Myopic, and Random. Clear patterns emerge. DDQN consistently achieves the lowest and most stable distances in all datasets, with only modest increases as DoD rises. This indicates that the learned value-based policy absorbs real-time requests without sharp growth in travel cost. The Genetic Algorithm produces slightly higher distances than DDQN but still follows a controlled trend. While GA adapts to dynamic insertions through periodic re-optimization, its population-based updates introduce small fluctuations.
The GA maintains a clear advantage over Myopic and Random strategies, especially when DoD exceeds 40%. The Myopic method is highly sensitive to dynamism, with distances increasing sharply in larger datasets because it focuses only on immediate gains and cannot anticipate future tasks. The Random strategy produces the highest and most irregular distances, confirming that ignoring spatial and temporal structure leads to inefficient routing. The gap between Random and DDQN becomes even larger for big datasets (C60, C80, C100), underscoring the value of informed decisions at scale. Overall, the figure shows a shift from uninformed to adaptive routing: DDQN achieves the most efficient outcomes, followed by GA, while Myopic and Random degrade rapidly as DoD rises.
Figure 8 compares the routing strategies across the available demand scenarios. While Greedy Search and the Random policy run faster than the exact solution, their distance values vary significantly across datasets, since both serve primarily as lower-bound references rather than optimized planners. GA was chosen as the main baseline because it is a robust algorithm capable of competing with DDQN. Even so, DDQN demonstrates strong suitability for real-time or near-real-time routing by providing consistent and competitive distances with extremely low computational times.
The experimental evaluation is intentionally conducted on a single campus-scale dataset to ensure realism, reproducibility, and consistent control over road topology, charging infrastructure, and demand patterns. Each experimental configuration was executed five times. For consistency with prior dynamic routing studies and to emphasize best-case operational performance, the best result obtained across these runs was reported in the tables. The primary objective of the study is not to claim universal generalization across heterogeneous geographic settings, but rather to demonstrate the feasibility and effectiveness of the proposed DDQN-based framework under realistic dynamic and energy-constrained conditions. Total travel distance is selected as the main performance metric, as it directly reflects routing efficiency and enables fair comparison across all baseline methods. Other operational indicators such as battery usage, charging behavior, and service completion are implicitly governed by the environment constraints and feasibility rules, and are therefore indirectly captured through distance minimization. The customer set is fixed and fully defined by the dataset, while dynamicity arises only from the timing and order of request arrivals. Therefore, rare situations correspond to extreme arrival sequences at high DoD levels rather than previously unseen customers. Such cases are handled through the online decision mechanism together with explicit energy and feasibility constraints. Extending the evaluation to multiple synthetic benchmarks and additional performance metrics constitutes a natural direction for future research. While the learned DDQN policy can be applied online without repeated re-optimization, significant shifts in the underlying request distribution may affect decision quality and would require policy retraining or adaptation.
All methods are evaluated under a fully normalized experimental protocol to ensure fair comparison. Specifically, Random, Myopic, Genetic Algorithm, and DDQN policies operate on the same datasets, share identical customer locations, charging infrastructure, vehicle specifications, and feasibility constraints, and are exposed to the same degree of dynamism scenarios. Dynamic request arrivals follow the same predefined schedules across all methods. No method is provided with additional foresight or privileged information beyond what is inherently assumed by its decision mechanism. Performance is assessed using the same evaluation metrics, primarily total travel distance and execution time, allowing direct and consistent comparison across all approaches.
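One simple way to realize such shared schedules is to pre-generate each dynamic scenario from a fixed seed and replay the identical list for every method; the function below is a hypothetical sketch, not the paper's actual experiment harness:

```python
import random

def make_arrival_schedule(customer_ids, n_dynamic, horizon, seed):
    """Pre-generate one arrival schedule as (reveal_time, customer) pairs from a
    fixed seed, so every method is exposed to the identical dynamic scenario."""
    rng = random.Random(seed)
    dynamic = rng.sample(customer_ids, n_dynamic)          # who arrives online
    return sorted((rng.uniform(0.0, horizon), c) for c in dynamic)
```

Because the schedule is materialized before any policy runs, no method can gain foresight by re-sampling a more favorable arrival sequence.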
The problem setting inherently exhibits partial observability, since future customer requests and their arrival times are unknown at decision time. As a result, routing decisions are made based only on the currently available information, which may occasionally lead to suboptimal short-term actions. However, the learning-based policy mitigates this limitation by capturing long-term patterns from historical interactions, allowing the agent to act robustly despite incomplete information.
Overall, the results emphasize three main points:
Heuristics lacking long-term planning (Random, Greedy) perform poorly as dynamism increases.
GA offers strong baselines mainly for static or small-scale cases.
The DDQN policy adapts reliably to both instance size and dynamic uncertainty, making it the most consistent method across all experiments.
6. Conclusions and Future Work
This study introduced a scalable Deep Reinforcement Learning framework for the D-EVRP. By formulating the task as a Markov Decision Process and using a DDQN with a dueling structure and Prioritized Experience Replay, the method captured the sequential and uncertain nature of dynamic, energy-constrained routing. K-means clustering was used to decompose large areas into smaller subproblems, improving scalability and enabling real-time operation. Experiments on a real campus dataset showed that the DDQN policy consistently outperformed the Random and Greedy strategies and often exceeded the performance of the Genetic Algorithm, especially in medium and large instances with high dynamism. Although the GA was competitive in static cases, its effectiveness decreased as dynamism increased. In contrast, the DDQN agent adapted naturally to new requests, balanced charging decisions, and reduced total travel distance without repeated re-optimization. Overall, the results demonstrate that established deep reinforcement learning techniques can be deployed effectively within a realistic, dynamic EV routing environment, supporting scalable, online-capable decision making under energy constraints. The proposed framework thereby supports energy-efficient logistics, improves operational reliability, and offers a promising direction for smart-campus and smart-city transportation systems.
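For reference, the Double DQN target underlying the trained policy selects the next action with the online network and evaluates it with the target network; a minimal NumPy sketch with stand-in value arrays (not the paper's actual model outputs):

```python
import numpy as np

def ddqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN target: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).
    Decoupling action selection (online net) from evaluation (target net)
    reduces the overestimation bias of vanilla Q-learning."""
    if done:
        return reward
    a_star = int(np.argmax(q_online_next))          # online net picks the action
    return reward + gamma * q_target_next[a_star]   # target net scores it
```

In the full agent, this target would be combined with the dueling value/advantage decomposition and with importance-sampling weights from Prioritized Experience Replay when computing the temporal-difference loss.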
Future extensions may further improve the performance and practicality of DRL-based EV routing. One direction is cooperative multi-agent reinforcement learning, where vehicles share information across clusters to improve global routing. More detailed battery and charging models such as nonlinear charging curves, variable rates, and degradation could also increase realism. Incorporating dynamic travel times, congestion, and real-time traffic data would strengthen robustness in urban settings. Advanced deep learning models such as graph neural networks may better capture spatial structure and enhance scalability. Adaptive, learning-based clustering could adjust region boundaries based on demand patterns. Hybrid methods combining reinforcement learning with classical optimization may yield higher-quality solutions. Real-world deployment and hardware-in-the-loop testing would allow evaluation under operational constraints. Finally, transfer learning or meta-learning may help trained agents generalize across different campuses or cities without full retraining.