Article

Dynamic Pricing for Wireless Charging Lane Management Based on Deep Reinforcement Learning

1 Shandong Hi-Speed Group Co., Ltd., Jinan 250098, China
2 Nottingham University Business School China, University of Nottingham Ningbo China, Ningbo 315100, China
3 State Key Laboratory of Intelligent Transportation System, Beijing 100191, China
4 Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China, Ningbo 315100, China
5 College of Business & Public Management, Wenzhou-Kean University, Wenzhou 325060, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(21), 9831; https://doi.org/10.3390/su17219831
Submission received: 17 June 2025 / Revised: 17 September 2025 / Accepted: 28 September 2025 / Published: 4 November 2025

Abstract

We consider a dynamic pricing problem in a double-lane system consisting of one general-purpose lane (GPL) and one wireless charging lane (WCL). The electricity price is dynamically adjusted to affect the lane-choice behaviors of incoming electric vehicles (EVs), thereby regulating the traffic assignment between the two lanes with both traffic operation efficiency and charging service efficiency considered in the control objective. We first establish an agent-based dynamic double-lane traffic system model, whereby each EV acts as an agent with distinct behavioral and operational characteristics. Then, a deep Q-learning algorithm is proposed to derive the optimal pricing decisions. A classification and regression tree (CART) algorithm is also designed for benchmarking. The simulation results reveal that the deep Q-learning algorithm demonstrates superior capability in optimizing dynamic pricing strategies compared to CART by more effectively leveraging system dynamics and future traffic demand information, and both outperform the static pricing strategy. This study serves as a pioneering work to explore dynamic pricing issues for WCLs.

1. Introduction

With increasing environmental awareness, EVs have become mainstream in transportation due to their lower emissions. However, their limited driving range remains a significant obstacle to their full potential. To overcome this challenge, alongside traditional plug-in charging, advanced EV charging methods have been developed, such as static wireless charging, battery swapping, and dynamic wireless charging (DWC). Among these, DWC stands out as the most promising method. It allows EVs to charge while in motion using facilities embedded under the road surface known as wireless charging lanes (WCLs). To date, DWC technology has been researched in many countries, including the United States, China, Germany, Sweden, and Korea. However, most of the research on DWC remains experimental and has not yet been widely implemented in existing traffic systems.
Recognizing the significant potential of DWC, transportation management issues within the DWC context have increasingly attracted academic attention [1]. To date, these issues have been categorized into four aspects: (1) Development and features of DWC technology; (2) Optimal allocation of WCLs [2,3,4,5,6,7,8,9,10,11]; (3) EV energy consumption analysis in the WCL context [12,13,14,15,16,17]; and (4) Billing and pricing for EVs on WCLs. Notably, real-time WCL traffic management issues have not been thoroughly studied. Liu et al. [18,19] explored a ramp metering control problem and a variable speed limit (VSL) control problem on WCLs, respectively, which revealed the inherent conflict between traffic operation efficiency and charging service efficiency (defined by the total net energy increase in the EVs). Then, Zhang et al. [20] further included EV routing decisions to ensure smooth traffic on the WCLs, which was shown to partially resolve the trade-off between these two efficiency measures. However, these studies [18,19,20] all assumed that the WCLs are fully deployed on the road system, and therefore their conclusions cannot be extended to more flexible and general lane settings. Some studies, such as [15,16], also highlighted this trade-off and suggested that WCLs should be deployed in a multi-lane system. However, to the best of our knowledge, optimizing real-time traffic management in a multi-lane system with WCLs has not been addressed.
In general, the core of real-time management in multi-lane traffic systems is to dynamically adjust the traffic assignment across different lanes by adapting to traffic demand. Effective traffic assignment helps optimize lane usage, reduce congestion, and improve overall operational efficiency. In a multi-lane system with WCLs, the goal of real-time management is to optimize both traffic and energy objectives. Therefore, this paper explores a dynamic pricing problem in a double-lane system consisting of one general-purpose lane (GPL) and one WCL. Our basic consideration is that the charging price can influence EVs’ lane choices, thereby affecting the traffic operation and charging service objectives of the system. Therefore, a dynamic pricing strategy is essential to enhance these efficiencies. Given the heterogeneity of EV attributes, the traffic dynamics should be modeled at a micro-level where each EV acts as an autonomous agent. To this end, we employ an Agent-Based Model (ABM) to establish the traffic dynamics. Due to the complexity of the ABM, a model-free method, a deep Q-learning algorithm, is utilized to derive a dynamic pricing strategy.
The primary contributions of this paper are two-fold: (1) we pioneer the exploration of dynamic pricing problems in a multi-lane system with WCLs, thereby filling a significant gap in this field; (2) we propose a novel framework that combines ABM tailored to a multi-lane system with WCLs and deep reinforcement learning (DRL) to derive dynamic pricing strategies. This framework can be applied or extended to other real-time traffic management problems that involve EV choice behavior under the option of DWC.
The remainder of this paper is structured as follows. Section 2 provides a comprehensive review of the related literature, situating our work within the broader context of the field. Section 3 formally defines the problem under investigation, outlining the key considerations and objectives. Section 4 details the research methodology employed, including the theoretical framework and the specific techniques utilized. Section 5 presents the numerical experiments conducted to validate the proposed method, describing the experimental setup, procedures, and evaluation metrics. Section 6 demonstrates and analyzes the results, highlighting the efficacy and implications of our approach. Section 7 summarizes the key findings, draws conclusions, and discusses potential avenues for future research.

2. Related Work

This section presents a brief review of the existing literature related to our study, specifically focusing on three aspects: (1) The pricing problem in traffic systems with WCLs, especially focusing on the scale of the traffic system and the aim of dynamic pricing; (2) Studies on the multi-lane systems with WCLs, especially focusing on the establishment of traffic models and energy consumption models employed within these studies. Additionally, the impact of the integration of WCLs into existing traffic infrastructures on traffic flow dynamics and charging efficiency is of concern; and (3) Studies that apply DRL algorithms in dynamic pricing problems on highways. Our review concentrates on the types of DRL algorithms applied, the route choice models, and the design of state and reward functions.
Several studies have addressed the pricing problems in traffic systems with WCLs. They all consider a static pricing problem on a traffic-network scale. The aim is to reduce costs and promote traffic efficiency. He et al. [21] explored a static pricing problem for WCLs from the perspective of a government agency. The goal is to optimize both transportation and power networks. This study aimed to validate the efficacy of two pricing models, the first and second best, in enhancing social welfare. The first-best model strives to minimize the combined costs of power generation and travel by implementing locational marginal pricing, whereas the second-best model focuses solely on the transportation network, aiming to reduce travel time and energy consumption while ensuring fiscal sustainability. Similarly, Wang et al. [22] also considered a static pricing problem in a network-level traffic system and introduced an intriguing charging pricing and vehicle scheduling algorithm based on a double-level game model. In the lower level, each EV behaves in its self-interest, striving to minimize detours and reduce electricity costs while securing adequate power for travel. The upper level encapsulates the interaction between WCLs and EVs, where EVs seek to lower the charging cost and WCLs aim to maximize profits from electricity sales. This study demonstrates that the proposed double-layer game model can achieve a balanced outcome benefiting both EVs and WCL operators. Similarly to ref. [21], Esfahani et al. [23] also addressed an optimal pricing problem in a DWC scenario, considering both transportation and power networks. However, their focus was on a more advanced scenario where the WCLs are bidirectional, formulating a bi-level optimization problem to determine the optimal buying price for electricity at each charging link, based on the assumption that the selling price is set by the Locational Marginal Price (LMP). The effectiveness of this bidirectional charging model in mitigating peak loads and reducing EV charging costs was substantiated through three numerical examples. It is evident that the extant research concentrates on network-level pricing problems, presuming the complete coverage of WCLs on a link, and all of it considers static pricing. A dynamic pricing problem for a multi-lane system with WCLs at the road level has not been addressed.
A few studies address the multi-lane system with deployed WCLs. He et al. [15] developed a car-following model to simulate the driving behaviors of EVs in double-lane systems where the WCL is partially deployed on one lane. This model particularly focuses on the car-following and lane-changing behaviors induced by the presence of WCLs, offering insights into the adjustments that drivers make to utilize charging facilities, which in turn affect overall traffic dynamics and safety. Notably, the authors make some basic assumptions about the traffic rules. For instance, the parameters of all EVs are assumed to be identical. EVs that need charging must choose the WCL, while those that do not require charging may also travel on the WCL. Building on the foundational models and basic assumptions of ref. [15], a subsequent study [16] extended the theoretical framework to quantitatively measure the impacts of WCLs on travel time and energy consumption. This paper not only refines the energy consumption models specific to EVs but also calibrates these models against empirical data, thereby validating the theoretical predictions. The results indicate a notable decrease in road capacity (8% to 17%) and an increase in energy consumption (3% to 14%) under varying traffic densities, underscoring the practical implications of integrating WCLs into existing infrastructures. In contrast, our study focuses on real-time traffic control on WCLs using dynamic pricing, for which a double-lane system is also constructed and simulated. This microscopic way of modeling a dual- or multi-lane traffic system is capable of explicitly capturing vehicle-level lane-choice decisions, thus enabling refined modeling and the evaluation of control measures; see [24] for a more comprehensive review of multi-lane traffic modeling and control. In traditional settings, traffic speed variation is the main factor considered by drivers for lane changing. However, the lane choice and changing decisions of each EV in our problem are mainly driven by its state-of-charge (SOC) due to our focus on WCL management. We develop a DRL-based dynamic pricing controller for managing the two-lane traffic system. Interestingly, learning-based strategies appear to be a popular approach in recent studies on traffic control in multi-lane systems, such as intersection control [25] or ramp-merging traffic control [26]. While the control problems in these studies are mostly from the perspective of individual EVs, ours optimizes the traffic operator’s decisions; hence, the underlying system dynamics and uncertainties considered are quite different.
More broadly, the RL-based approaches have been applied to various existing and emerging traffic operational problems with intelligent transportation systems (ITSs), such as traffic signal control [27], variable speed limit control [28], twin-vehicle auto-driving [29], cooperative on-ramp merging control [26] and connected and automated vehicle platooning [30], to name a few recent ones. In particular, several studies have applied DRL algorithms in dynamic pricing problems in highway traffic control, a relevant setting to ours. Pandey et al. [31] explored a dynamic pricing model designed to optimize the use of express lanes. The DRL algorithm utilized here is an Actor–Critic (A2C) algorithm, which is specialized in handling continuous state and action spaces, making it particularly suitable for the dynamic and complex environment of highway traffic. The state representation in this model includes variables such as current traffic density, time of day, and the historical usage patterns of the lanes, while the reward function is designed to maximize revenue and minimize travel time. Abdalrahman and Zhuang [32] extended the application of DRL to manage dynamic pricing in EV charging stations. They employed a multi-agent framework where each charging station operates as an independent agent utilizing a variant of the Q-learning algorithm. The state space for each agent includes the number of EVs waiting, the state of charge of each EV, and the current electricity price, whereas the reward function is constructed to maximize profit while ensuring customer satisfaction by reducing waiting times and charging costs. Cui et al. [33] applied DRL algorithms for dynamic pricing at fast charging stations for EVs, focusing on optimizing station profit and enhancing user satisfaction. The study effectively integrates traffic flow predictions and EV charging demands into a dynamic pricing model that adjusts in real-time to traffic and usage conditions. The study establishes the vehicle–road learning environment using the Markov decision process (MDP) and employs the Deep Deterministic Policy Gradient (DDPG) algorithm, which is a policy-based reinforcement learning method particularly suited for continuous action spaces such as pricing strategies. The state space in the DRL framework includes the current load at the charging stations, the availability of the chargers, the real-time traffic conditions around the stations, and the predicted demand for charging. These variables help the system to understand the current scenario at both the traffic and energy distribution levels. The reward function is designed to maximize the profitability of charging stations while balancing the electrical grid’s demand and user satisfaction.

3. Problem Statement

We consider a dynamic pricing problem in a double-lane system dedicated to providing a charging service for DWC EVs. The system consists of one GPL and one WCL. The speed limit on the WCL is set slightly lower than that on the GPL to extend the charging duration for EVs [15]. In this study, we assume v_{wcl} = 0.9 v_{gpl}. The entire system is established in NetLogo as an ABM in which each EV is treated as an agent that has attributes including location, current travel speed, SOC, minimum SOC level, etc. The traffic rules in this system are assumed as follows:
  • In the lane-changing zone (see Figure 1), each EV can choose one lane. Its lane choice mainly depends on three factors: the current SOC, the observed travel speeds on each lane, and the charging price.
  • Once an EV enters the GPL or WCL, lane changing behavior is restricted unless its SOC drops below the minimum or exceeds the maximum threshold. All EVs obey this rule, which is enforced by the advanced ITS.
  • EVs entering the WCL must charge their battery until their SOC reaches its maximum SOC level.
  • The charging price is modeled as a discrete variable and can be changed at regular intervals, e.g., every three minutes.
  • EVs receive charging-price information via the ITS with onboard communication support.
  • The WCL-equipped traffic system is publicly funded and operated, with the primary goal of maximizing social welfare, that is, minimizing total congestion and maximizing EV energy uptake, rather than profit. Hence, the revenue is not considered.
  • Traffic demand and downstream traffic conditions within the control horizon are assumed to be predictable by the system with the support of ITS.
We develop an ABM tailored to the double-lane system in the context of the DWC scenario in NetLogo, as depicted in Figure 1. Since the traffic and charging demands change over time, it is crucial for traffic operators to dynamically adjust the charging prices to influence EVs’ lane choices and optimize traffic distribution between the two lanes. The goal is to enhance the operational efficiency of the system. In this paper, we focus on both traffic operation efficiency and charging service efficiency from the perspective of the system operator. The former is mainly defined by the traffic throughput, as in traditional traffic management settings, while the latter is measured by the total amount of net energy recharged to the EVs, as in the recent literature on WCL traffic control [18,19]. Detailed expressions of these two efficiency measures will be provided in the following section.
Considering the complexity of the ABM, a model-based control method is deemed unsuitable for implementing a dynamic pricing strategy. Instead, a model-free approach based on a deep Q-learning algorithm is derived for the dynamic pricing strategy. Additionally, a straightforward method using the classification and regression tree (CART) algorithm is established as a benchmark approach. We conduct a series of numerical experiments and a case study to demonstrate the effectiveness of the proposed algorithm. The challenges lie in two aspects: (1) the establishment of the ABM tailored to the double-lane system, which involves integrating traffic dynamics and the lane-choice model; and (2) the design of the deep Q-learning algorithm, particularly the formulation of the system state and reward function.

4. Method

This section introduces the basic elements of the ABM and the Deep Q-learning algorithm we adopt. The conceptual design of the interaction between our ABM and the deep Q-learning algorithm is depicted in Figure 2. It can be seen that the ABM (i.e., the environment of the Deep Q-learning algorithm) is coded in NetLogo (introduced in Section 4.2) by extending the existing model “Traffic 2 lanes” in the NetLogo Library [34], which has been used in a number of traffic problems. For example, ref. [35] extends the model to explore the ways to overcome traffic congestion on toll roads. Ref. [36] develops a new traffic simulation environment in NetLogo to explore an intersection control task. In this paper, we incorporate the rules of the double-lane system, depicted in Figure 1, and the SOC dynamics of EVs into the model. Since the deep Q-learning algorithm is coded in Python 3.7, we adopt the pyNetLogo 0.5 library to build the bridge between the algorithm and the environment [37]. The pyNetLogo 0.5 library provides a seamless interface enabling Python to interact with NetLogo. By integrating this library, we are equipped with capabilities to dynamically load models, execute NetLogo-specific commands, and extract data from reporter variables. Such functionalities are instrumental for the training of our deep Q-learning algorithm in an online fashion.
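For concreteness, the snippet below sketches how such a Python–NetLogo bridge can be set up with pyNetLogo. It is illustrative only: the model file name, the global variable charging-price, and the reporters total-throughput and total-energy are hypothetical placeholders, not the actual identifiers used in our implementation.

import pyNetLogo

# Headless NetLogo instance; on some systems netlogo_home / netlogo_version
# may need to be passed explicitly to NetLogoLink.
netlogo = pyNetLogo.NetLogoLink(gui=False)
netlogo.load_model('wcl_double_lane.nlogo')          # hypothetical model file
netlogo.command('setup')                             # initialize lanes and EV agents

for step in range(10):                               # one pricing interval per step
    netlogo.command('set charging-price 1.0')        # hypothetical global variable
    netlogo.command('repeat 60 [ go ]')              # advance the simulation by 60 ticks
    throughput = netlogo.report('total-throughput')  # hypothetical reporter variables
    energy = netlogo.report('total-energy')

netlogo.kill_workspace()                             # release the NetLogo workspace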

4.1. Agent-Based Model (ABM)

The ABM employed for the double-lane road system comprises two types of agents. The first type encompasses the road infrastructure, specifically the GPL and the WCL. The second type is the EV, which is capable of making decisions autonomously and interacting with both the GPL/WCL (i.e., gaining energy) and other EVs. The attributes (global variables and EV attributes) of these agents are defined in Table 1.

4.1.1. Global Variables

In our ABM, the global variables include the speed limit on the GPL, the speed limit on the WCL, the charging power, the charging price, the total throughput, and the total energy. Their definitions and notations are as follows.
  • Speed limit on GPL ( v g p l ): This denotes the maximum speed at which an EV is permitted to travel on the GPL. In our model, it is defined as a constant. Its unit is km/h.
  • Speed limit on WCL ( v w c l ): This denotes the maximum speed at which an EV is permitted to travel on the WCL. Generally, v w c l is set slightly lower than v g p l to allow EVs more time to charge [15].
  • Charging power (e): This denotes the power available at the WCLs, assumed to be constant over time and uniform along the lane, measured in kilowatts (kW).
  • Charging price (p): This refers to the cost of charging per kilowatt-hour on the WCL, communicated in real time to all EVs to facilitate informed lane choices. We assume that p is a discrete variable, priced in USD/kWh.
  • Total throughput ( T h r o u g h p u t ): This denotes the cumulative number of vehicles that pass a specific point, such as the entrance of the road, within a given time interval, measured in vehicles per hour (veh/h).
  • Total energy ( E n e r g y ): This denotes the total energy delivered to vehicles via the WCL, calculated as the sum of the energy received by each vehicle, measured in kilowatt–hours (kWh).
Note that only T h r o u g h p u t and E n e r g y are statistical accumulators that can change over time, while the other four attributes are assumed to be time invariant.

4.1.2. EV Attributes

Each EV has a set of attributes as follows:
  • Maximum travel speed (v_i^{max}): This attribute specifies the upper limit of an EV’s speed, normalized to the range [0, 1]. In our model, its value is defined as a constant, which is generally greater than v_{gpl} and v_{wcl}.
  • Current travel speed (v_i): This attribute describes the instantaneous speed of the EV. Its value is constrained within a normalized range of 0 (stationary) to 1 (maximum travel speed).
  • Observed travel speed (v̄_i^{gpl}, v̄_i^{wcl}): This attribute captures the speed of EVs within a lane as observed by an individual EV. It is quantified as the average speed of EVs along a specified observable distance (e.g., 100 m) ahead of the observing EV. In this model, we assume it to be the average speed of vehicles across the entire lane, which is disseminated to all vehicles in real time through advanced vehicle communication systems. Its value is normalized to the range [0, 1].
  • Acceleration (a_i): This attribute describes the change in an EV’s speed within one time interval. In our model, the acceleration and deceleration magnitudes are defined as constants.
  • SOC (s_i): The SOC is a crucial attribute for operations and management on WCLs, indicating the current energy level of the EV’s battery. The value is constrained within a normalized range of 0 (completely depleted) to 1 (fully charged).
  • Minimum SOC level (s_i^{min}): This threshold represents the critical SOC below which an EV risks imminent power depletion, potentially leading to operational failure and reduced battery lifespan. In our model, an EV is allowed to change to the WCL whenever its SOC level drops below this point.
  • Maximum SOC level (s_i^{max}): This threshold signifies the optimal SOC at which an EV’s battery is considered fully charged without exceeding the manufacturer’s recommended limits to prevent overcharging. In our model, an EV is allowed to change to the GPL whenever its SOC level exceeds this point.
  • Location (l_i^x, l_i^y): The EV’s location in the NetLogo model is captured by a two-dimensional coordinate (l_i^x, l_i^y), where l_i^x represents the longitudinal position along the road and l_i^y denotes the lateral position across lanes.
  • Target location (l_i^y): This attribute denotes the y-axis value of the target location of an EV (corresponding to the target lane). In our double-lane system, the GPL and the WCL correspond to l_i^y = 1 and l_i^y = -1, respectively.
  • Lateral speed (v_i^{lat}): This attribute signifies the speed at which an EV executes a lane-changing behavior. In our model, this parameter is defined as a constant.
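To make the agent specification concrete, the following sketch mirrors the EV attributes listed above as a Python data structure. It is only an illustrative aid for post-processing data reported from NetLogo; all default values are assumptions rather than the calibrated parameters of our model.

from dataclasses import dataclass

@dataclass
class EV:
    v_max: float = 1.0       # maximum travel speed (normalized)
    v: float = 0.0           # current travel speed (normalized)
    v_obs_gpl: float = 1.0   # observed average speed on the GPL (normalized)
    v_obs_wcl: float = 0.9   # observed average speed on the WCL (normalized)
    a_acc: float = 0.005     # acceleration per tick (assumed constant)
    a_dec: float = 0.02      # deceleration magnitude per tick (assumed constant)
    soc: float = 0.41        # state of charge in [0, 1]
    soc_min: float = 0.2     # minimum SOC threshold (assumed)
    soc_max: float = 0.8     # maximum SOC threshold (assumed)
    l_x: float = 0.0         # longitudinal position (patches)
    l_y: int = 1             # lane indicator: 1 = GPL, -1 = WCL
    v_lat: float = 0.1       # lateral (lane-changing) speed (assumed constant)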

4.1.3. Exogenous Input

In our model, traffic demand serves as an exogenous input to the double-lane system. Recognizing the possibility that EVs might choose their lanes before reaching the designated lane-changing zone, we assume that EVs appear in their selected lane when they enter the zone if there is available capacity; otherwise, they are compelled to enter the other lane. This assumption reflects the dynamic interaction between traffic demand and lane availability, ensuring a realistic representation of traffic demand. We also assume that the number of EVs arriving in the system within a fixed time interval (equal to the minimum time interval between two charging price signals, denoted by τ) follows a Poisson distribution [38]. Consequently, the inter-arrival times between consecutive EVs follow an exponential distribution [39], whose probability density function is given by
f(t) = d e^{-d t},
where the constant rate d represents the number of vehicles that enter the system per minute and t represents the time interval between consecutive vehicle arrivals. Taking d = 6 as an example, (1) is plotted in Figure 3.
In NetLogo, random inter-arrival times are generated from the exponential distribution with the specified rate d. The arrival time for each vehicle is then calculated by the cumulative sum of these random intervals. This process effectively models the stochastic nature of traffic flow into the specified road segment over the given time frame. Note that we simulate a non-homogeneous Poisson process where the rate d changes over time (here, we assume it changes every 3 min) which is approximated by a piecewise constant function. Additionally, we assume that the initial SOC of incoming EVs s i n i within a given time frame follows a Gaussian distribution, with a mean ( μ ) and a standard deviation ( σ ). This assumption is consistent with one of the findings yielded in [40] that the SOC distribution of EVs at the beginning of charging events is similar to a Gaussian distribution, with an average initial SOC of around 41 % .
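The following sketch illustrates how such a piecewise-constant Poisson arrival process and the Gaussian initial SOCs can be generated; the rates, interval length, and standard deviation used here are illustrative assumptions, not the values used in our experiments.

import numpy as np

rng = np.random.default_rng(0)
rates = [6, 8, 4]          # vehicles per minute in consecutive 3-min pricing intervals (assumed)
interval_len = 3.0         # minutes per pricing interval

arrivals = []
for k, d in enumerate(rates):
    t = k * interval_len
    while True:
        t += rng.exponential(1.0 / d)        # exponential inter-arrival time with mean 1/d
        if t >= (k + 1) * interval_len:
            break
        arrivals.append(t)                   # arrival time (minutes) of one incoming EV

# Initial SOC of each incoming EV ~ Gaussian, truncated to [0, 1]
mu, sigma = 0.41, 0.1                        # mean follows [40]; sigma is an assumption
init_soc = np.clip(rng.normal(mu, sigma, size=len(arrivals)), 0.0, 1.0)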

4.1.4. Scale

In the NetLogo traffic model, an explicit scale is not defined. Instead, a normalized scale is adopted, wherein the roadway comprises numerous 1 × 1 patches, with each vehicle occupying one patch. By assuming that the length of an EV spans four meters, it is inferred that one patch equates to four meters in real-world dimensions. Consequently, this allows for the extrapolation of other parameters, such as speed, on a similar basis. For example, a normalized speed of 0.5 used in NetLogo corresponds to 80 km/h in reality.

4.1.5. Lane-Choice Model

The probability of lane choices of EVs can be estimated by a discrete choice model, which is a powerful tool used in econometrics and behavioral research to model the decision-making process of individuals faced with multiple alternatives. It was initially proposed by ref. [41] based on random utility theory. The theory assumes that individuals make decisions based on the total utility of available options, which consists of an observable deterministic component related to choice attributes and an unobservable stochastic component that captures individual preferences and random fluctuations. This theory explains why individuals may choose differently under similar conditions and vary their choices over time, highlighting the role of both deterministic influences and stochastic elements in decision making [42]. In this paper, we assume a logit model, which is widely used in predicting drivers’ lane choice among a set of alternatives [43,44,45]. Note that, although our logit model applies to two alternative settings as in our double-lane system, it can be extended to multi-lane systems using a multinomial logit model.
The multinomial logit model for the double-lane system is established in the following. First, we specify two separate utility functions, U^{gpl} and U^{wcl}, for the two lanes. Each utility function in our model is assumed to be a linear combination of explanatory variables that affect the EVs’ lane choice. In the context of pricing for electric vehicle charging, explanatory variables usually include at least the SOC, travel time, and charging cost. Since WCLs have not been commercialized on a large scale, charging choice in the WCL context cannot be well investigated; however, there are studies investigating charging choice under other scenarios. For instance, ref. [46] analyzed the factors that influence EVs’ charging choice among charging stations using data from an interactive stated choice experiment. The results show that the utility of a charging choice is negatively correlated with a set of variables including the SOC, charging time, and charging price. Similarly, in our model, we assume that the utility of choosing the GPL for the i-th EV is influenced only by the travel time, denoted by T_{travel,i}^{gpl} and calculated as the time to traverse the lane; the utility of choosing the WCL is assumed to be influenced by the charging time T_{charging,i}^{wcl} (calculated as the time to charge on the WCL, i.e., the time to traverse the WCL), the current SOC of the i-th EV, denoted by s_i, and the charging cost (calculated as the total cost for the EV to charge on the WCL, denoted by C_{charging,i}). Then, the utility functions can be expressed as
U_i^{gpl} = β_T × T_{travel,i}^{gpl} + β_0 + ϵ_i^{gpl},
U_i^{wcl} = β_s × s_i + β_T × T_{charging,i}^{wcl} + β_C × C_{charging,i} + ϵ_i^{wcl},
where the coefficients β_s, β_T, and β_C represent the marginal utilities of SOC, travel/charging time, and charging cost, respectively; ϵ_i^{gpl} and ϵ_i^{wcl} are random error terms, which independently and identically follow a Gumbel distribution; and β_0 is a constant term used to calibrate the utility function. s_i is the SOC of the i-th EV at the time it makes the lane choice; T_{travel,i}^{gpl}, T_{charging,i}^{wcl}, and C_{charging,i} are defined as
T_{travel,i}^{gpl} = L_{sys} / v̄_i^{gpl},
T_{charging,i}^{wcl} = L_{sys} / v̄_i^{wcl},
C_{charging,i} = (L_{sys} / v̄_i^{wcl}) × e^{+} × p_{charging},
where L_{sys} represents the length of the double-lane system. In the equations, we utilize the observed speeds, v̄_i^{gpl} and v̄_i^{wcl}, to calculate the expected travel times for the i-th EV traversing the GPL and the WCL, respectively. This calculation is considered reliable in most contexts, provided that the traffic flow speed remains relatively stable over short intervals. The marginal utilities β_s, β_T, β_C are all negative, indicating that an EV’s utility for a lane decreases as its SOC, travel/charging time, and charging cost increase.
Following the traffic rules mentioned in Section 3, we assume that β s is a piecewise function by dividing the SOC into three ranges, which is similar to the model used in [47]:
β_s =
  +∞, if 0 ≤ s_i < s_i^{min},
  β_s, if s_i^{min} ≤ s_i ≤ s_i^{max},
  -∞, if s_i^{max} < s_i ≤ 1,
where the piecewise β_s satisfies the traffic rule that an EV whose SOC is below its minimum SOC level will choose the WCL, while an EV whose SOC exceeds its maximum SOC level will choose the GPL. Then, the choice probability of the i-th EV for each lane, denoted as P_i^{gpl} and P_i^{wcl}, can be expressed as
P_i^{gpl} = exp(U_i^{gpl}) / [exp(U_i^{wcl}) + exp(U_i^{gpl})],
P_i^{wcl} = exp(U_i^{wcl}) / [exp(U_i^{wcl}) + exp(U_i^{gpl})].
The selection of marginal utilities β s , β T , β C is crucial for the utility functions. In our model, we aim to derive a set of appropriate values for these utilities in Section 5.1 based on the results from [46].
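As an illustration of Equations (2)–(9), the sketch below computes the two lane-choice probabilities for a single EV. The numeric coefficients follow the calibration later reported in Section 5.1, while the system length, SOC thresholds, units, and the constant β_0 are simplifying assumptions for this example.

import math

def lane_choice_probability(soc, v_obs_gpl, v_obs_wcl, price,
                            L_sys=10.0,                 # system length in km (assumed)
                            e_plus=15.0,                # WCL charging power in kW
                            beta_s=-4.58, beta_T=-13.24, beta_C=-0.825, beta_0=0.0,
                            soc_min=0.2, soc_max=0.8):
    """Return (P_gpl, P_wcl) given the EV's SOC, observed speeds (km/h), and price (USD/kWh)."""
    t_gpl = L_sys / v_obs_gpl                 # expected travel time on the GPL (h)
    t_wcl = L_sys / v_obs_wcl                 # expected charging time on the WCL (h)
    cost = t_wcl * e_plus * price             # total charging cost on the WCL (USD)

    if soc < soc_min:                         # traffic rule: must charge, so force the WCL
        return 0.0, 1.0
    if soc > soc_max:                         # traffic rule: fully charged, so force the GPL
        return 1.0, 0.0

    u_gpl = beta_T * t_gpl + beta_0
    u_wcl = beta_s * soc + beta_T * t_wcl + beta_C * cost
    denom = math.exp(u_gpl) + math.exp(u_wcl)
    return math.exp(u_gpl) / denom, math.exp(u_wcl) / denom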

4.1.6. EV Driving Behavior

The behavior of an EV is characterized by the dynamics of its attributes. Among the attributes mentioned in Section 4.1.2, maximum travel speed and acceleration/deceleration are constant, while others are variable. For the i-th EV, its acceleration at time t is calculated as
a_i(t) =
  a_{acc,i} (accelerate), if there are no blocking cars ahead and v_i(t) < v_{wcl}·1{l_i^y(t) = -1} + v_{gpl}·1{l_i^y(t) = 1},
  0 (maintain v_i(t)), if there are no blocking cars ahead and v_i(t) = v_{wcl}·1{l_i^y(t) = -1} + v_{gpl}·1{l_i^y(t) = 1},
  -a_{dec,i} (decelerate), if there are blocking cars ahead,
where a_{acc,i} and a_{dec,i} are constants representing the acceleration magnitude for EV i when there are no blocking cars ahead and the deceleration magnitude for EV i when there are blocking cars ahead, respectively. Their values can vary slightly across different EVs. 1{c} is the indicator function that equals 1 if condition c is satisfied and 0 otherwise.
Then, its speed v_i, longitudinal location along the road l_i^x, and lateral position l_i^y can be expressed as
v_i(t+1) = v_i(t) + a_i(t),
l_i^x(t+1) = l_i^x(t) + v_i(t),
l_i^y(t+1) = l_i^y(t) + v_i^{lat}(t),
where v_i^{lat} represents the lateral speed of EVs. In this model, we assume it to be a constant value for all EVs.
As mentioned in Section 4.1.2, the observed travel speeds of EV i on the two lanes, v̄_i^{gpl} and v̄_i^{wcl}, are defined as the average speed of vehicles on the entire lane. Let {v̄_j^{gpl}}, j ∈ {1, …, J^{gpl}(t)} and {v̄_j^{wcl}}, j ∈ {1, …, J^{wcl}(t)} denote the sets of speeds of the observed vehicles on the GPL and the WCL, respectively. Then, we have:
v̄_i^{gpl}(t) =
  (1 / J^{gpl}(t)) Σ_{j=1}^{J^{gpl}(t)} v̄_j^{gpl}, if there are observable EVs ahead,
  v_{gpl}, if there are no observable EVs ahead,

v̄_i^{wcl}(t) =
  (1 / J^{wcl}(t)) Σ_{j=1}^{J^{wcl}(t)} v̄_j^{wcl}, if there are observable EVs ahead,
  v_{wcl}, if there are no observable EVs ahead,
where J^{gpl}(t) and J^{wcl}(t) are the numbers of observed vehicles on the GPL and the WCL, respectively, at time t.
Its SOC, s_i, is updated as
s_i(t+1) =
  s_i(t) + e^{+} - e_i(t), if the EV is on the WCL, i.e., l_i^y(t) = -1,
  s_i(t) - e_i(t), otherwise,
where e^{+} is the charging power of the WCL, which is assumed to be a constant, and e_i(t) is the energy consumption of the i-th EV at time t. Based on the analysis of the laboratory tests from [48], it can be expressed as a nonlinear function of its speed v_i(t) and acceleration a_i(t):
e_i(t) = f_e(v_i(t), a_i(t)).
As discussed in Section 4.1.5, the model for determining the lane-choice probability of an EV as it first enters the system is represented by the random variable l_i^y, with the probabilities defined as
Pr(l_i^y = -1) = P_i^{wcl},
Pr(l_i^y = 1) = P_i^{gpl},
where Pr(·) represents the probability function which calculates the likelihood of an EV choosing a particular lane for its first entry; here, l_i^y equals -1 for the choice of the WCL and 1 for the GPL. P_i^{wcl} and P_i^{gpl} are the probabilities of choosing each lane, respectively.
As mentioned in Section 3, once an EV enters the GPL or the WCL, it is prohibited from changing lanes. Exceptionally, if its SOC falls outside the range [s_i^{min}, s_i^{max}], it can re-select its target lane using the probability model shown in (18).
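The sketch below assembles the driving-behavior and SOC updates of Equations (10)–(17) into a single per-tick routine, assuming the EV data structure sketched in Section 4.1.2. The energy-consumption function f_e and all numeric constants are placeholders, not the calibrated model from [48].

def update_ev(ev, blocked, v_gpl=0.5, v_wcl=0.45, e_plus=0.001):
    """Advance one EV by one tick. ev follows the EV dataclass sketched earlier;
    blocked is True when a slower vehicle is immediately ahead."""
    speed_limit = v_wcl if ev.l_y == -1 else v_gpl   # lane-dependent speed limit

    if blocked:
        a = -ev.a_dec                                # decelerate behind a blocking car
    elif ev.v < speed_limit:
        a = ev.a_acc                                 # accelerate up to the lane's limit
    else:
        a = 0.0                                      # maintain the current speed

    ev.v = min(max(ev.v + a, 0.0), speed_limit)      # Eq. (11), clipped to feasible speeds
    ev.l_x += ev.v                                   # Eq. (12): advance along the road

    e_used = f_e(ev.v, a)                            # energy consumed this tick
    if ev.l_y == -1:                                 # on the WCL: gain charging energy e_plus
        ev.soc = min(ev.soc + e_plus - e_used, 1.0)  # Eq. (16), first case
    else:
        ev.soc = max(ev.soc - e_used, 0.0)           # Eq. (16), second case

def f_e(v, a):
    # Placeholder consumption model: quadratic in speed plus an acceleration term,
    # standing in for the nonlinear function calibrated in [48].
    return 0.0005 * v ** 2 + 0.0002 * max(a, 0.0)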

4.2. NetLogo

NetLogo [49] is an agent-based modeling (ABM) platform that enables autonomous agents with distinct behaviors to interact within a shared environment, giving rise to complex system-level phenomena [50]. Recently, NetLogo has been widely adopted in traffic studies, including decentralized routing, distributed coordination, and combined driving–charging behavior [36,51,52,53]. Unlike traditional traffic simulators such as VISSIM and SUMO, which are not agent-based and focus on simulating traffic flow, NetLogo emphasizes the behavior and interaction of individual agents. This approach can provide deeper insights into the dynamics of complex systems, such as traffic networks, where individual EV behaviors significantly influence the overall traffic patterns or the heterogeneity among different agents cannot be ignored. Readers interested in more agent-based traffic simulators can refer to [54]. In the traffic scenario explored in this paper, an EV’s lane choice is influenced by multiple attributes, including its location, SOC, charging prices, and observed travel speeds. The heterogeneity of the agents’ attributes is central to the traffic scenario analyzed in this paper and should be taken into account in our approach. Consequently, we employ NetLogo 6.4.0 to simulate the traffic dynamics of a double-lane system.

4.3. Deep Q-Learning Algorithm

Reinforcement learning (RL) is a paradigm of machine learning in which an agent learns to make decisions by performing actions in an environment to achieve a goal. The agent receives feedback in the form of rewards or penalties, guiding it towards effective strategies. The process involves learning what actions to take in different states to maximize the cumulative reward [55]. RL is taxonomized by model usage (model-free vs. model-based) [56], policy representation (policy-based vs. value-based) [57,58], and sampling alignment (on-policy vs. off-policy) [59,60].
Despite RL’s outstanding performance in low-dimensional tasks [55], its scalability to high-dimensional state spaces remains limited. DRL addresses this bottleneck by integrating the RL decision-making framework with the expressive function approximation of deep neural networks, allowing end-to-end learning from raw sensory input while balancing exploration and exploitation [60]. This synergy has produced landmark successes such as AlphaGo [61] and advanced robotic control, yet current DRL algorithms still suffer from sample inefficiency and limited generalization, gaps that motivate ongoing research towards more data-efficient and interpretable models.
Given the continuous state representation (traffic density) and discrete action set (charging price), deep Q-learning (DQL), which is one of the most popular value-based RL methods, is adopted in this paper. DQL is a value-based approach that is well suited for problems with a discrete action set and has proven effective across various applications. The choice aligns with our goal to introduce and validate a practical ABM integrated with a DRL framework for real-time pricing in a multi-lane WCL context.
It is important to note that this paper focuses on demonstrating the practicality and effectiveness of this framework rather than engaging in an exhaustive comparison of different RL algorithms. While advanced RL methods such as PPO or A2C could offer additional insights, we emphasize that the choice of RL method should be tailored to the specific demands of the model and the problem it aims to address. Future research could certainly explore these more sophisticated methods to further enhance the dynamic pricing strategies in WCL settings, building upon the foundational work presented here.

4.3.1. Background

Notation
  • x—state; a—action; r—reward; π—policy; γ ∈ [0, 1]—discount factor.
  • Q(x, a)—action-value function; Q*(x, a)—optimal action-value function.
  • θ—network parameters; L(θ)—loss function with parameters θ.
Traditional Q-learning utilizes a Q-table to estimate the maximum expected rewards for an optimal action a in a given state x in a specific environment [62,63]. Let Q*(x, a) be the optimal action-value function, which denotes the maximum expected return achievable by any policy, given state x and action a. By the Bellman optimality equation, Q*(x, a) is defined as
Q*(x, a) = E[ r + γ max_{a'} Q*(x', a') | x, a ].
However, tabular Q-learning fails in large or continuous state spaces [64]. DQL replaces the Q-table with a neural network Q(x_t, a_t; θ) that maps states to Q-values, enabling generalization across high-dimensional inputs. Training minimizes the squared TD error:
L_t(θ_t) = E_{(x_t, a_t, r_{t+1}, x_{t+1})} [ ( r_{t+1} + γ max_{a'} Q(x_{t+1}, a'; θ_t) - Q(x_t, a_t; θ_t) )^2 ],
where the expectation is over transitions sampled from the behavior policy induced by Q. This gradient-based optimization yields an effective approximation of Q* in complex domains.

4.3.2. State

The state consists of the traffic states and the future traffic demand. For the representation of traffic states, we first divide the road system under consideration into N segments (each segment is assumed to be identical), as depicted in Figure 4. Let ρ^{gpl} and ρ^{wcl} be the vectors collecting the normalized traffic densities on the GPL and the WCL, respectively. (The state space is deliberately macroscopic: it encodes only the number of EVs per segment. This is sufficient because the pricing policy acts on aggregate lane-choice probabilities rather than on individual trajectories. Microscopic uncertainties average out at the segment level, and any residual error can be reduced by shortening the segment length. When RL is employed for micro-control tasks that involve vehicle-level actions, on the other hand, the state representation should explicitly account for the positions and speeds of individual EVs [26,29].) Then, we have
ρ^{gpl} = [ρ_1^{gpl}, …, ρ_N^{gpl}],
ρ^{wcl} = [ρ_1^{wcl}, …, ρ_N^{wcl}].
Here, the n-th normalized traffic densities, ρ_n^{gpl} and ρ_n^{wcl}, n = 1, …, N, are defined as
ρ_n^{gpl} = m_n^{gpl} / m_{j,n}^{gpl},
ρ_n^{wcl} = m_n^{wcl} / m_{j,n}^{wcl},
where m_n^{gpl} and m_n^{wcl} are the numbers of EVs located within the n-th segment on the GPL and the WCL, respectively; m_{j,n}^{gpl} and m_{j,n}^{wcl} represent the maximum numbers of vehicles that can be accommodated within the segment.
Similarly, we use a vector d = [d_1 / d_{max}, …, d_{N_d} / d_{max}] to represent the future traffic demand over the next N_d pricing periods. Here, d_i for i = 1, …, N_d is the average number of incoming vehicles per minute within the i-th future period, corresponding to the rate d adopted in Section 4.1.3, and d_{max} is the pre-defined maximum value of d_i. Then, the system state x can be expressed as
x = [ρ^{gpl}, ρ^{wcl}, d].
Note that all elements in the state x have been normalized to fall within the range [ 0 , 1 ] .
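A minimal sketch of the state construction in Equations (21)–(25) is given below; the segment capacity and d_max values are assumptions used only for illustration.

import numpy as np

def build_state(counts_gpl, counts_wcl, future_demand, seg_capacity=20, d_max=60.0):
    """counts_*: EVs per segment (length N); future_demand: veh/min for the next N_d periods."""
    rho_gpl = np.asarray(counts_gpl, dtype=float) / seg_capacity   # normalized GPL densities
    rho_wcl = np.asarray(counts_wcl, dtype=float) / seg_capacity   # normalized WCL densities
    d = np.asarray(future_demand, dtype=float) / d_max             # normalized future demand
    return np.concatenate([rho_gpl, rho_wcl, d])                   # all entries lie in [0, 1]

# Example: 5 segments per lane and 3 future demand periods give a 13-dimensional state
x = build_state([4, 6, 8, 3, 2], [10, 12, 9, 5, 4], [40, 50, 60])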

4.3.3. Action

The action within our DRL framework is the charging price, denoted by p_{charging} ∈ P = {0, 0.5, 1, 2, 3} (in USD/kWh). The base price is set to p_{charging} = 1. In our model, we adopt a fixed marginal utility for each price; however, in some cases, a piecewise marginal utility can be assumed, as demonstrated in ref. [65].

4.3.4. Reward

The reward is derived from two objectives: maximizing the traffic operation efficiency measured by total throughput [66] and the charging service efficiency measured by the total net energy delivered to EVs [18,19] (i.e., the total energy received minus the energy consumed), while penalizing segment densities that exceed their critical values to discourage congestion.
The reward at time t is given by
r_t = W_1 × Throughput_t + W_2 × Energy_t - W_3 × (C_t^{gpl} + C_t^{wcl}),
where Throughput_t is the number of vehicles passing the loop detector at the WCL entrance during interval t (see Figure 4); Energy_t is the total net energy recharged to EVs; C_t^{gpl} and C_t^{wcl} are congestion penalties over all N segments, modeled by the sigmoid variant proposed in [67]:
C^{gpl} = Σ_{n=1}^{N} 1 / (1 + exp(-(m_n^{gpl} - m_{1,n}) / k_{1,n})),
C^{wcl} = Σ_{n=1}^{N} 1 / (1 + exp(-(m_n^{wcl} - m_{2,n}) / k_{2,n})),
which yields a near-zero penalty when the density of a segment is below its critical density, and rapidly increases thereafter, mirroring the soft-constraint approach adopted in DRL algorithms. The weights W_i allow the flexible balancing of these objectives [68]. m_{1,n}, m_{2,n}, k_{1,n}, k_{2,n}, n = 1, …, N are the coefficients of the penalty functions; they determine the scaling and translation of the sigmoid and can be user-defined.
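The reward computation of Equations (26)–(28) can be sketched as follows; the weights, critical counts, and slope parameters shown are placeholders rather than the values used in the experiments.

import numpy as np

def congestion_penalty(counts, m_crit=15.0, k=2.0):
    """Sigmoid penalty per segment: near 0 below the critical count, approaching 1 above it."""
    counts = np.asarray(counts, dtype=float)
    return float(np.sum(1.0 / (1.0 + np.exp(-(counts - m_crit) / k))))

def reward(throughput, net_energy, counts_gpl, counts_wcl, w1=1.0, w2=1.0, w3=1.0):
    penalty = congestion_penalty(counts_gpl) + congestion_penalty(counts_wcl)
    return w1 * throughput + w2 * net_energy - w3 * penalty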

4.3.5. Q-Network

In the proposed algorithm, the state vector x is defined in (25). The Q-network approximates the Q-value function Q(x, a) and comprises a series of fully connected (FC) layers interspersed with nonlinear activation functions. Here, we adopt the Rectified Linear Unit (ReLU) as the activation function.
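A minimal Q-network consistent with this description is sketched below in PyTorch; the hidden-layer sizes are assumptions, not the architecture reported in this paper.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int = 5, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per price in {0, 0.5, 1, 2, 3}
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example usage: a 13-dimensional state (5 + 5 segment densities plus 3 demand periods)
# q_net = QNetwork(state_dim=13)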

4.3.6. Training

Episodic reinforcement learning: In episodic reinforcement learning, each training run is decomposed into episodes that terminate after a fixed number of steps. The initial state x_0 is sampled from a distribution over traffic densities, ensuring that the agent experiences diverse congestion patterns. Within every episode, the agent executes actions, collects transitions, and periodically updates the Q-network via experience replay, thereby maximizing the discounted return over the trajectory while mitigating long-horizon credit assignment and sparse-reward issues. Full details are given in Algorithm 1.
Exploration rate: DRL typically uses a decaying exploration rate (e.g., epsilon-greedy with decaying ϵ) to shift the agent from broad exploration early on to the later exploitation of learned knowledge, avoiding premature convergence while maximizing long-term reward. In the proposed algorithm, the decay of ϵ is modeled by the following formula, where ϵ_s, ϵ_f, and ϵ_d represent the initial exploration rate, the final exploration rate, and the decay factor, respectively:
ϵ(t) = ϵ_f + (ϵ_s - ϵ_f) exp(-t / ϵ_d).
Here, t represents the frame index or the number of iterations. This formula ensures that ϵ decreases exponentially from ϵ_s to ϵ_f, reducing as the agent gains more experience, thereby transitioning from exploration to exploitation.
Experience replay: To mitigate the non-stationarity and sample correlation inherent in sequential RL updates, we equip the agent with an experience replay buffer D that stores every transition (x_t, a_t, r_t, x_{t+1}). During the training process, the algorithm randomly samples a minibatch of experiences from the replay buffer.
Algorithm 1 Deep Q-learning with experience replay
Require: γ: discount factor, α: learning rate, ϵ: exploration rate
Require: C: memory capacity for experience replay, M: minibatch size
 1: Initialize replay memory D to capacity C
 2: Initialize Q-network with random weights θ
 3: for episode = 1, E do
 4:     Initialize state x = [ρ^{gpl}, ρ^{wcl}, d] by resetting the NetLogo environment
 5:     for t = 1, T do
 6:         Select a random action a_t with probability ϵ
 7:         Otherwise select a_t = arg max_a Q(x_t, a; θ)
 8:         Execute action a_t in the environment
 9:         Observe reward r_t and next state x_{t+1}
10:         Store transition (x_t, a_t, r_t, x_{t+1}) in D
11:         Sample a random minibatch of transitions (x_t, a_t, r_t, x_{t+1}) from D
12:         if x_{t+1} is terminal then
13:             Set y_t = r_t
14:         else
15:             Set y_t = r_t + γ max_{a_{t+1}} Q(x_{t+1}, a_{t+1}; θ)
16:         end if
17:         Perform an SGD update on (y_t - Q(x_t, a_t; θ))^2 with respect to the Q-network parameters θ
18:         Update state: x_t ← x_{t+1}
19:     end for
20: end for
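For reference, the minibatch update in Steps 11–17 of Algorithm 1 can be sketched in Python as follows, assuming the Q-network sketched in Section 4.3.5, a list-based replay buffer of (x, a, r, x', done) tuples, and an externally created optimizer (e.g., torch.optim.Adam over the Q-network parameters); the hyperparameter values are placeholders.

import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, replay_buffer, batch_size=32, gamma=0.95):
    if len(replay_buffer) < batch_size:
        return                                            # wait until enough transitions are stored
    batch = random.sample(replay_buffer, batch_size)      # transitions stored as plain lists/floats
    x, a, r, x_next, done = map(list, zip(*batch))

    x = torch.tensor(x, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    x_next = torch.tensor(x_next, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(x).gather(1, a).squeeze(1)               # Q(x_t, a_t; θ)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(x_next).max(dim=1).values   # y_t

    loss = F.mse_loss(q_sa, target)                        # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()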

4.4. The CART Algorithm

Decision trees are a fundamental type of machine learning algorithm used for both classification and regression tasks, modeling decisions and their consequences in a tree structure [69]. The nodes represent feature tests, while the branches correspond to the outcomes of these tests. Among the foundational algorithms, ID3 uses information gain to select the best feature for splitting data [70], whereas CART employs Gini impurity for classification and mean squared error (MSE) for regression [71]. These methods have been extended, such as C4.5, which handles both continuous and discrete attributes and employs sophisticated pruning techniques [72].
For regression, CART keeps splitting until the MSE of every subset is as small as possible. The MSE is defined as
MSE = (1/n) Σ_{i=1}^{n} (y_i - ŷ_i)^2,
where y i is the actual value, y ^ i is the predicted value, and n is the number of data points in the subset.
In this paper, the CART algorithm is employed to generate a preliminary dynamic pricing strategy by leveraging one-step performance data derived from sample scenarios (see Section 5). The CART algorithm is selected because of its inherent suitability for the data structure of the dynamic pricing problem, which aligns with the characteristics of the data set and the nature of the problem. In the following, the training procedure is introduced.

Training

In this study, we develop a dynamic pricing strategy using the CART algorithm, primarily to establish a benchmark for evaluating our DQL algorithm, rather than to fully explore the potential of traditional machine learning techniques. Considering that the price p is a discrete variable, the dynamic pricing challenge could initially be approached as a classification task. However, such an approach fails to leverage the continuous value data of each price p, which is crucial for a nuanced understanding of pricing dynamics. To address this limitation, we employ the CART algorithm for a regression task to better utilize the quantitative information associated with each potential price. Our methodology unfolds in three steps:
1. Data generation: Utilizing the Agent-Based Model (ABM), we generate a dataset (X, Y), where X = [x, a] comprises the feature vector. Here, x = [ρ^{gpl}, ρ^{wcl}, d_1] represents the state, including only the immediate future traffic demand, distinct from the states used in DQL. The action a is also included in X. The corresponding Y = [r(x, 0), …, r(x, 3)] consists of the rewards for each charging price p, illustrating the reward’s dependency on the price.
2. Decision tree training: We apply the CART algorithm to map the relationship between X and Y (as defined in (26)), thereby modeling how different charging prices influence the rewards. The purity of each node is measured using the MSE.
3. Optimal price implementation: The price yielding the highest reward is selected and implemented in the system, optimizing the charging strategy within the defined parameters.
Our aim is to test the efficacy of the DQL algorithm by comparing it against a straightforward, regression-based benchmark. The pseudocode of the algorithm is detailed in Algorithm 2.
Algorithm 2 Modified CART
 1: Input: Training data D = {(X_1, Y_1), …, (X_N, Y_N)}; hyper-parameters max-depth, min-samples-split, min-samples-leaf
 2: Output: Decision tree
 3: Initialize tree with a single root node
 4: Set current-depth ← 0
 5: Discretize continuous features in X (e.g., quantile bins)
 6: while current-depth < max-depth and node samples > min-samples-split do
 7:     if split possible then
 8:         for each candidate split do
 9:             Compute the combined MSE of the left and right subsets:
                MSE = (1/|L|) Σ_{i∈L} (y_i - ȳ_L)^2 + (1/|R|) Σ_{j∈R} (y_j - ȳ_R)^2
10:         end for
11:         Select the split with minimum MSE
12:         Create two child nodes
13:     else
14:         Mark node as leaf
15:     end if
16: end while
17: return Decision tree
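As a point of comparison, the sketch below realizes the same train-then-pick-the-best-price procedure with scikit-learn’s off-the-shelf DecisionTreeRegressor instead of the hand-rolled tree of Algorithm 2; the feature layout and hyperparameters follow Sections 4.4 and 5.4, while the training data shown are synthetic placeholders for the ABM-generated samples.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

PRICES = [0.0, 0.5, 1.0, 2.0, 3.0]

# X: [rho_gpl (5), rho_wcl (5), d_1, price] per sample; y: observed one-step reward.
X_train = np.random.rand(300, 12)          # placeholder for the 300 ABM-generated samples
y_train = np.random.rand(300)

tree = DecisionTreeRegressor(max_depth=4, min_samples_split=10, min_samples_leaf=5)
tree.fit(X_train, y_train)

def best_price(state_11d):
    """Evaluate every candidate price for the current 11-dimensional state and pick the best one."""
    candidates = np.array([np.append(state_11d, p) for p in PRICES])
    predicted_rewards = tree.predict(candidates)
    return PRICES[int(np.argmax(predicted_rewards))]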

5. Numerical Experiments

5.1. Parameter Settings for the Lane-Choice Model

As detailed in Section 4.1.5, the marginal utilities β_s, β_T, β_C in our logit model are informed by the results from ref. [46], where the corresponding values are -4.58, -0.242, and -0.01 for the variables s_i, T_{charging,i}, and C_{charging,i}, respectively. These variables represent the SOC in percentage, the charging time in hours, and the charging cost in dollars. It is crucial to note that, in the cited study, T_{charging,i} and C_{charging,i} denote the time and cost to fully charge an EV’s battery. However, in the WCL context, it is not feasible for an EV to reach a predetermined SOC since the charging duration is equal to the time taken to travel the entire WCL. Given this discrepancy in the definition of T_{charging,i} and C_{charging,i}, we adapt the marginal utilities in our model by considering the relative values of the different variables, rather than directly adopting the values from the cited reference. In our model, we assume that all EVs share the same battery capacity of 75 kWh; the actual power of the WCL, e^{+}, is 15 kW. Other parameters are collected in Table 2.
First, since T_{charging,i} and C_{charging,i} are calculated in the same way, it can be inferred that
β_T / β_C = 0.242 / 0.01 ≈ 24.
According to the coefficients of SOC and charging cost, the monetary value of a 1% decrease in SOC is calculated as 4.58 × 1% / 0.01 = USD 4.58, indicating that a 1% reduction in SOC equates to a decrease of USD 4.58 in the total charging cost when charging from the current SOC to full capacity. For example, if we assume that the current SOC of EVs is 41% [40], then the equivalent charging cost per kWh that corresponds to the same financial impact as a 1% SOC decrease is calculated as 4.58 / ((1 - 0.41) × 75) ≈ 0.1 USD/kWh. Then, we have 1% × β_s = (L_{sys} / v̄_i^{wcl}) × e^{+} × 0.1 × β_C. Hence, the ratio of β_C to β_s can be calculated as
β_C / β_s = 1 / 8.33.
Let β_s = -4.58. Based on Equations (30) and (31), we calculate that β_C = -0.55 and β_T = -13.24. To emphasize the impact of the charging price on lane choice, we increase the magnitude of β_C by 1.5 times, resulting in β_C = -0.55 × 1.5 = -0.825. The relationship between the probability of choosing the WCL and the charging price p_{charging}, for an EV with a SOC of 41%, is illustrated in Figure 5.
However, it is important to note that these marginal utilities are only for reference. Accurate values should be derived from experimental analysis in the context of WCLs. Nonetheless, this is difficult to implement before the large-scale commercialization of WCLs.
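For transparency, the arithmetic of Equations (30) and (31) can be reproduced with the short script below, under the same assumptions as the text (75 kWh battery, average initial SOC of 41%); it is a sanity check only.

beta_T_over_beta_C = 0.242 / 0.01            # Eq. (30): ratio of time to cost coefficients, about 24
value_of_1pct_soc = 4.58 * 0.01 / 0.01       # a 1% SOC decrease is worth about USD 4.58
price_equivalent = 4.58 / ((1 - 0.41) * 75)  # about 0.1 USD/kWh for an average SOC of 41%

beta_s = -4.58
beta_C = beta_s / 8.33                       # Eq. (31): about -0.55
beta_T = beta_T_over_beta_C * beta_C         # about -13.2
beta_C_scaled = 1.5 * beta_C                 # about -0.825, emphasizing price sensitivity
print(beta_C, beta_T, beta_C_scaled)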

5.2. Simulation for Sample Scenarios

In this section, we conduct a series of numerical experiments across twelve sample scenarios. In each scenario, we test the performance of every possible charging price. Hence, each scenario only lasts for one pricing interval. The objectives of these experiments are three-fold: (1) to demonstrate the effectiveness of our ABM developed in NetLogo, (2) to validate the design of the reward function as defined in Equation (26), and (3) to elucidate the performance disparities among various charging prices p_{charging}.
The initial conditions of the twelve scenarios are detailed in Table 3, where m_{tot}^{gpl} = Σ_{n=1}^{5} m_n^{gpl} and m_{tot}^{wcl} = Σ_{n=1}^{5} m_n^{wcl} represent the total numbers of EVs on the GPL and on the WCL, respectively. In scenarios #1 to #4, the system begins with free-flow traffic, whereas scenarios #5 to #8 and #9 to #12 start with medium and heavy congestion, respectively. To facilitate the consistent comparison of different charging prices (p_{charging}), a uniform highway traffic demand of d_1 = 60 is maintained across all scenarios. This setup is designed to encompass a broad range of traffic conditions, providing a comprehensive analysis of the impacts of pricing strategies. The performance of each price setting (p_{charging}) within a scenario is determined by averaging the outcomes of ten repeated experiments, enhancing the reliability of our results.

5.3. Parameter Settings for the Deep Q-Learning Algorithm

This section introduces the parameter settings for the DQL algorithm. The parameters of the ABM used in the numerical experiments and the hyperparameters of the DQL algorithm are listed in Table 2. During DQL training, each episode begins with a random initial state in which the totals $\sum_{n=1}^{N} \rho_n^{gpl}$ and $\sum_{n=1}^{N} \rho_n^{wcl}$ on the GPL and the WCL follow uniform distributions $U(\rho_{min}^{gpl}, \rho_{max}^{gpl})$ and $U(\rho_{min}^{wcl}, \rho_{max}^{wcl})$, respectively. Varying the initial state across episodes enhances the algorithm’s ability to address diverse traffic dynamics. Each episode lasts for 10 pricing intervals (30 min in total); hence, $T$ in Step 5 of Algorithm 1 is set to 10.
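To make the configuration concrete, the sketch below shows a Q-network and a single update step consistent with the hyperparameters in Table 2 (two hidden layers of 64 units, Adam with a learning rate of $1\times 10^{-4}$, discount factor 0.99, batch size 32, and a replay memory of 10,000 transitions). The state dimension and the number of price levels are placeholders, and this is only one way the training loop of Algorithm 1 could be realized in PyTorch, not the authoritative implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

# Hyperparameters from Table 2.
GAMMA, LR, BATCH_SIZE = 0.99, 1e-4, 32
EPS_START, EPS_END = 1.0, 0.01
MEMORY_CAPACITY = 10_000
STATE_DIM, N_PRICES = 10, 5          # placeholders for the state size and price levels

class QNetwork(nn.Module):
    """Two hidden layers of 64 units; one Q-value output per discrete charging price."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

q_net = QNetwork(STATE_DIM, N_PRICES)
target_net = QNetwork(STATE_DIM, N_PRICES)
target_net.load_state_dict(q_net.state_dict())
optimizer = optim.Adam(q_net.parameters(), lr=LR)   # Adam optimizer [73]
memory = deque(maxlen=MEMORY_CAPACITY)              # replay memory of (s, a, r, s', done) tensors

def select_price_index(state: torch.Tensor, eps: float) -> int:
    """Epsilon-greedy choice over the discrete charging-price levels."""
    if random.random() < eps:
        return random.randrange(N_PRICES)
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def update_step() -> None:
    """One gradient step on a minibatch sampled from the replay memory."""
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    s, a, r, s2, done = (torch.stack(x) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * target_net(s2).max(1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```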

5.4. Parameter Settings for the CART Algorithm

In this section, we introduce the configuration of the training data and the hyperparameter settings employed for the CART algorithm. The training dataset is constructed from the twelve illustrative scenarios discussed in Section 5.2 and outlined in Table 3: combining the 12 scenarios with 5 distinct pricing levels yields 60 base cases, and pairing each case with a traffic demand drawn from the set $d \in \{20, 30, 40, 50, 60\}$ generates 300 unique data points, which are intended to cover most traffic conditions arising in the simulation. We set the maximum tree depth to 4, require at least 10 samples for any split, and allow no fewer than 5 samples in each leaf. These values keep the tree shallow and the splits data-rich enough to curb over-fitting while still capturing the main pricing patterns; with limited depth, the model avoids high-variance leaf rules, and the sample thresholds ensure that each rule is backed by enough observations to generalize to unseen traffic states. The model is evaluated on a held-out testing subset: predictions for the test features are compared against the actual outcomes, and the mean squared error (MSE) measures the average squared difference between predicted and actual values. A lower MSE indicates higher model accuracy and a better fit to the underlying data patterns.
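As an illustration, the configuration described above corresponds to a standard regression tree in scikit-learn; the feature and label arrays (and file names) below are placeholders for the 300 simulated (state, price, demand) → reward records, not the actual data pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Placeholder arrays: X holds one feature row per (scenario, price, demand)
# combination, y the corresponding simulated reward from the ABM.
X = np.load("cart_features.npy")   # hypothetical file names
y = np.load("cart_rewards.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeRegressor(
    max_depth=4,            # keep the tree shallow
    min_samples_split=10,   # at least 10 samples for any split
    min_samples_leaf=5,     # no fewer than 5 samples in each leaf
)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print("test MSE:", mean_squared_error(y_test, y_pred))
```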

5.5. Simulation of Real Traffic Scenarios

In this subsection, we construct two real traffic scenarios that facilitate a comparative analysis between the DQL algorithm and the CART algorithm. Each scenario spans 30 min, corresponding to 10 pricing intervals. In Scenario #1, the system initiates under conditions of light traffic, replicating a typical morning traffic scenario with low demand persisting for 30 min. Conversely, Scenario #2 starts with heavy traffic, characterized by greater congestion on the WCL compared to the GPL, and maintains high traffic demand throughout the same duration.
We employ the real-world traffic data used in [18], $d_1$, as illustrated in Figure 6, to model the demand within our double-lane system. The traffic demand data are sampled every 3 min, aligning with the pricing interval. In the NetLogo environment, the arrival pattern of incoming EVs is modeled according to a Poisson distribution, as delineated in Section 4.1.3.
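For illustration, arrivals within one pricing interval can be generated from exponential inter-arrival times, which is equivalent to the Poisson arrival process described in Section 4.1.3. The helper below is a minimal sketch of this sampling step, not the NetLogo implementation itself.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_arrival_times(d_veh_per_min: float, interval_min: float = 3.0) -> list[float]:
    """Arrival times (in minutes) of a Poisson process with rate d over one pricing interval."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / d_veh_per_min)  # exponential inter-arrival times
        if t > interval_min:
            return times
        times.append(t)

# A demand of 60 veh/min over a 3-min pricing interval yields about 180 arrivals on average.
print(len(sample_arrival_times(60.0)))
```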
In each scenario, both the DQL and CART algorithms generate 10 pricing signals. To underscore the impact of charging prices on traffic flow and charging efficiencies, we also evaluate the performance of a constant pricing strategy (here, we adopt the base price, 1 USD/kWh). The efficacy of the three strategies is compared in terms of total throughput, total energy received by EVs, and the penalties for congestion, consistent with the reward function defined in (26).

6. Results and Discussions

6.1. Results for Sample Scenarios

Figure 7 displays the results of the first four numerical experiments, conducted under an initial state of free-flow traffic. In these experiments, the lowest charging price yields the best reward, for intuitive reasons. First, the total throughput is almost identical across the different charging prices in all four experiments: the light initial traffic allows incoming EVs to enter the double-lane system without experiencing congestion, regardless of how traffic demand is allocated between the two lanes. The penalty for traffic congestion also remains at a low level (<0.04) in all four scenarios. Hence, the total energy dominates the reward according to (26), and the lowest charging price (0 USD/kWh) results in more EVs choosing the WCL than any other price, thereby yielding the highest total energy.
Figure 8 and Figure 9 display the results of the numerical experiments conducted under initial states of medium and heavy congestion, respectively. In scenarios #7, #8, and #11, where the GPL is more congested than the WCL, a lower price leads more EVs to choose the WCL, which not only eases the congestion on the GPL (indicated by a lower penalty) but also increases the total throughput and the total energy received by EVs; consequently, the lowest price yields the best reward. In contrast, in scenarios #5, #6, and #10, where the WCL is more congested than the GPL, a higher price leads more EVs to choose the GPL, effectively easing the congestion on the WCL (lower penalty) and increasing total throughput. Although a lower price still yields a higher total energy received by EVs, under congested traffic the penalty and the total throughput dominate, so the best rewards are achieved at prices of 2, 3, and 2 USD/kWh in these scenarios, respectively. In scenarios #9 and #12, where the congestion on both lanes is about the same, a medium price (1 USD/kWh) performs best considering the trade-off between the three metrics.
The results provide useful managerial insights for operating WCLs. Traffic operators can effectively optimize traffic flow by dynamically adjusting the charging price. When the WCL is more congested, raising the price shifts EVs to the GPL, easing congestion at the cost of WCL utilization; interestingly, when the GPL is more congested, lowering the price attracts EVs to the WCL, reducing congestion while simultaneously improving WCL utilization—a win–win outcome.

6.2. Learning Performance of the Decision Tree Algorithm

The performance of the Decision Tree Regression model was evaluated using both the mean squared error (MSE) and the coefficient of determination ( R 2 ). These metrics collectively provide a comprehensive view of the model’s predictive accuracy and its ability to explain the variability in the target variable.
The MSE provides a measure of the average of the squares of the errors, indicating how closely the model’s predictions match the actual values:
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 .$
In addition, R 2 is calculated to assess the proportion of variance in the dependent variable that is predictable from the independent variables:
$R^2 = 1 - \frac{\text{Sum of Squares of Residuals}}{\text{Total Sum of Squares}} = 1 - \frac{\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2} ,$
where $\bar{Y}$ is the mean of the observed data $Y_i$.
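Both metrics follow directly from the model predictions; a minimal sketch of the two formulas in code:

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error between observations and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)              # sum of squares of residuals
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)     # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```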
The model achieved an MSE of 0.054, indicating a strong predictive accuracy with minor deviations from the actual values. Additionally, the R 2 value obtained was 0.85, suggesting that 85% of the variance in the dependent variable is explainable by the independent variables. This high R 2 value corroborates the model’s effectiveness in capturing and quantifying the underlying data patterns. The combination of a low MSE and a high R 2 demonstrates not only the model’s ability to produce accurate predictions but also its capacity to explain a significant proportion of the variance in the data.

6.3. Learning Performance of Deep Q-Learning

In this section, we demonstrate the learning performance of our DQL algorithm, as depicted in Algorithm 1. Figure 10 illustrates the learning curves for Algorithm 1. We conduct five repeated trainings under the same parameter settings. It is observed that the cumulative reward per episode progressively increases over time, albeit with some fluctuations. In the initial 50 episodes, where the exploration rate is high, the rewards garnered are modest, reflecting the agent’s preliminary adaptation and exploration of the environment. As training advances, a notable increase in rewards is seen between episodes 50 and 150, denoting the agent’s improved performance and strategy optimization. Beyond 200 episodes, the rewards reach and maintain a relatively high plateau, highlighting the agent’s successful derivation of an effective policy through sustained training.
Furthermore, Figure 10 compares the impacts of different learning rates on the convergence behavior. When the learning rate is high ($1 \times 10^{-3}$), the algorithm converges more rapidly in the early stages but exhibits larger variance, suggesting instability in value estimation and a tendency to overshoot the optimal policy. With a moderate learning rate ($1 \times 10^{-4}$), the learning curve demonstrates both stable growth and efficient convergence, ultimately achieving higher asymptotic performance. Conversely, a very small learning rate ($1 \times 10^{-5}$) results in much slower convergence, as the agent requires more episodes to sufficiently update its value function. Therefore, we adopt a learning rate of $1 \times 10^{-4}$ in the following numerical experiments.

6.4. Results Under Real Traffic Scenarios

This section compares the performance of three strategies within the two real traffic scenarios. Figure 11 and Figure 12 plot the price signal under the two scenarios. Figure 13 and Figure 14 compare the efficacy of these strategies across four critical metrics: accumulated reward, total throughput, total energy, and penalties for congestion, corresponding to scenario #1 and scenario #2, respectively.
In scenario #1, where the system begins with light traffic and experiences low traffic demand, both the CART and DQL algorithms consistently opt for the lowest charging price (0 USD/kWh) at each pricing interval throughout the simulation. This increases the number of EVs entering the WCL and therefore the energy transmitted to EVs, in line with the analysis in Section 6.1: under light traffic, a lower charging price enhances charging efficiency without sacrificing traffic efficiency. The static strategy of applying the base price (1 USD/kWh) throughout the simulation yields a total throughput of 1349 veh, a total energy of 17.6 kWh, and a penalty of 0.036. In contrast, the CART and DQL algorithms yield almost the same total throughput, more than five times the total energy (94.4 kWh), and only a slightly higher congestion penalty (0.064, still at a low level). Consequently, the accumulated reward of the dynamic pricing strategies is 6.7% higher than that of the static pricing strategy.
In scenario #2, the system starts under heavy traffic conditions with sustained high demand throughout the simulation. As illustrated in Figure 12, the pricing trends of the CART and DQL algorithms start at a high price (2 USD/kWh) and progressively decrease, culminating in the minimum price (0 USD/kWh). The DQL algorithm improves the final reward by 12.1% over CART and by 28.3% over the static pricing strategy. The reasons are as follows. As the initial congestion on the WCL is greater than that on the GPL, both algorithms apply a high price (2 USD/kWh) at the beginning, which leads more EVs to the GPL and alleviates congestion on the WCL. Subsequently, as congestion eases, the focus shifts towards maximizing energy transmission, leading to lower prices. In the final stage (24 to 30 min), the system reverts to light traffic conditions, similar to scenario #1, where a lower price maximizes energy delivery without compromising traffic efficiency. However, the pricing strategies of CART and DQL differ markedly. CART cannot capture the system dynamics or utilize future traffic demand information; it only selects the optimal price for the current interval. Although the traffic demand remains high throughout the simulation, CART still adopts an aggressive pricing approach in the early stages to maximize immediate rewards. In contrast, the DQL algorithm adopts a more nuanced strategy, maintaining a lower price of 1 USD/kWh between 3 and 9 min, which, although it temporarily slows energy growth, minimizes congestion penalties and enhances throughput. Consequently, from 9 to 15 min, the energy growth under both strategies aligns, yet the congestion penalty remains significantly lower under the DQL approach. Thus, CART’s strategy is myopic, whereas DQL’s farsighted approach better captures the complexities of the environment and effectively leverages future system inputs, highlighting its superior capability in managing complex dynamic traffic scenarios.
In summary, in light traffic scenarios, the dynamic pricing strategies tend to set the price as low as possible, which greatly improves charging service efficiency while only slightly compromising traffic operation efficiency. Since the traffic dynamics under light traffic are simple, CART and DQL exhibit similar performance. In heavy traffic, however, the DQL algorithm outperforms CART due to its ability to capture the complex system dynamics of the ABM and leverage future traffic demand information. Under both light and heavy traffic scenarios, the two dynamic pricing strategies, which adapt the charging price to the system state, perform better than the static pricing strategy.

6.5. Model Validation and Calibration

The proposed NetLogo-based traffic model is mainly based on synthetic parameters and traffic data, which may affect the model accuracy. At present, the only commercial DWC system is the bus-transit route in Korea [2]; passenger-car EV data for WCLs are therefore unavailable. Once DWC adoption expands, three aspects of the model will need validation and calibration: (1) the microscopic driving behavior parameters listed in Table 1; (2) lane-choice model parameters; and (3) the EV energy consumption profile, since real-world use on a WCL may differ from the baseline profile taken from conventional roads. Furthermore, the long-term demand profile using the WCL system with dynamic pricing and automatic billing may be affected by user adoption behavior as observed for other advanced transportation technologies [75]; thus, model calibration may be updated every few years to reflect changes in travel behavior patterns.

7. Conclusions and Future Work

This study examines a dynamic pricing problem in a double-lane traffic system consisting of a GPL and a WCL. The system is modeled using an agent-based framework, where each EV is represented as an autonomous agent. A DQL algorithm is developed to determine the optimal pricing strategy, dynamically adjusting charging prices to balance traffic and charging efficiencies. Numerical experiments demonstrate that the DRL strategy substantially outperforms both a CART-based strategy and a static pricing approach, owing to its ability to exploit system dynamics and anticipate future demand. The results underscore the potential of the value-based RL framework for the discrete pricing of WCLs, which is much more effective than the myopic and static policies. Our findings also provide useful managerial insights: when the GPL is congested and the WCL is underutilized, reducing the WCL price can simultaneously enhance traffic and charging efficiencies, creating a win–win outcome. Conversely, when the WCL is congested, dynamic pricing must carefully balance the trade-off between the two efficiencies.
Future research can extend this work in several directions. First, the ABM can be enriched with more realistic and heterogeneous EV characteristics, such as diverse battery capacities, charging rates, and driving behaviors. Second, the DRL framework can be enhanced through improved hyperparameter tuning, more expressive state representations, and refined reward functions to accelerate convergence and improve stability. Although we have discussed the impact of the learning rate, a global sensitivity analysis, as used in other transportation studies (e.g., [76,77]), could be performed to pinpoint the most critical parameters of the proposed learning algorithm more thoroughly. Third, broader algorithmic benchmarking can be explored to enrich the choice of DRL methods for the problem, as mentioned earlier. Fourth, new elements of the optimization model can be explored. For example, specific constraints can be introduced to avoid notably disproportional service for EVs with lower charging needs [19], thus improving the fairness of the dynamic pricing model; revenue objectives, which are not considered in this paper, also represent a natural avenue for further study. Finally, the current framework assumes the availability of mature vehicle communication systems and a high penetration of EVs using DWC [78]. Before such conditions are achieved, alternative traffic management strategies should be explored to address mixed traffic flows of DWC-enabled and conventional vehicles, ensuring smooth integration during the transition phase.

Author Contributions

Conceptualization, F.L. and Z.T.; methodology, F.L. and Z.T.; software, F.L.; validation, F.L. and Z.T.; formal analysis, F.L. and Z.T.; investigation, F.L.; resources, Z.T. and H.K.C.; writing—original draft preparation, F.L.; writing—review and editing, Z.T. and H.K.C.; visualization, F.L.; supervision, Z.T. and H.K.C.; project administration, Z.T.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is partially supported by Science and Technology Planning Project of Shandong Hi-speed Group Co., Ltd. (No. HS2025B018).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Fan Liu is an employee of Shandong Hi-speed Group Co., Ltd. This research was partly funded by Shandong Hi-speed Group Co., Ltd. All authors have disclosed no other commercial or financial relationships that could be construed as a potential conflict of interest beyond the employment and funding relationships noted above.

References

  1. Tan, Z.; Liu, F.; Chan, H.K.; Gao, H.O. Transportation systems management considering dynamic wireless charging electric vehicles: Review and prospects. Transp. Res. Part E Logist. Transp. Rev. 2022, 163, 102761. [Google Scholar] [CrossRef]
  2. Miller, J.M.; Jones, P.T.; Li, J.M.; Onar, O.C. ORNL experience and challenges facing dynamic wireless power charging of EV’s. IEEE Circuits Syst. Mag. 2015, 15, 40–53. [Google Scholar] [CrossRef]
  3. Lee, M.S.; Jang, Y.J. Charging infrastructure allocation for wireless charging transportation system. In Proceedings of the Eleventh International Conference on Management Science and Engineering Management 11, Changchun, China, 17–19 August 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1630–1644. [Google Scholar]
  4. Chen, Z.; He, F.; Yin, Y. Optimal deployment of charging lanes for electric vehicles in transportation networks. Transp. Res. Part B Methodol. 2016, 91, 344–365. [Google Scholar] [CrossRef]
  5. Chen, Z.; Liu, W.; Yin, Y. Deployment of stationary and dynamic charging infrastructure for electric vehicles along traffic corridors. Transp. Res. Part C Emerg. Technol. 2017, 77, 462–477. [Google Scholar] [CrossRef]
  6. Alwesabi, Y.; Wang, Y.; Avalos, R.; Liu, Z. Electric bus scheduling under single depot dynamic wireless charging infrastructure planning. Energy 2020, 213, 118855. [Google Scholar] [CrossRef]
  7. Ngo, H.; Kumar, A.; Mishra, S. Optimal positioning of dynamic wireless charging infrastructure in a road network for battery electric vehicles. Transp. Res. Part D Transp. Environ. 2020, 85, 102393. [Google Scholar] [CrossRef]
  8. Mubarak, M.; Üster, H.; Abdelghany, K. Strategic network design and analysis for in-motion wireless charging of electric vehicles. Transp. Res. Part E Logist. Transp. Rev. 2021, 145, 102159. [Google Scholar] [CrossRef]
  9. Liu, H.; Zou, Y.; Chen, Y.; Long, J. Optimal locations and electricity prices for dynamic wireless charging links of electric vehicles for sustainable transportation. Transp. Res. Part E Logist. Transp. Rev. 2021, 145, 102174. [Google Scholar] [CrossRef]
  10. Alwesabi, Y.; Liu, Z.; Kwon, S.; Wang, Y. A novel integration of scheduling and dynamic wireless charging planning models of battery electric buses. Energy 2021, 230, 120806. [Google Scholar] [CrossRef]
  11. Schwerdfeger, S.; Bock, S.; Boysen, N.; Briskorn, D. Optimizing the electrification of roads with charge-while-drive technology. Eur. J. Oper. Res. 2022, 299, 1111–1127. [Google Scholar] [CrossRef]
  12. Deflorio, F.P.; Castello, L.; Pinna, I.; Guglielmi, P. “Charge while driving” for electric vehicles: Road traffic modeling and energy assessment. J. Mod. Power Syst. Clean Energy 2015, 3, 277–288. [Google Scholar] [CrossRef]
  13. Deflorio, F.; Pinna, I.; Castello, L. Dynamic charging systems for electric vehicles: Simulation for the daily energy estimation on motorways. IET Intell. Transp. Syst. 2016, 10, 557–563. [Google Scholar] [CrossRef]
  14. García-Vázquez, C.A.; Llorens-Iborra, F.; Fernández-Ramírez, L.M.; Sánchez-Sainz, H.; Jurado, F. Comparative study of dynamic wireless charging of electric vehicles in motorway, highway and urban stretches. Energy 2017, 137, 42–57. [Google Scholar] [CrossRef]
  15. He, J.; Huang, H.J.; Yang, H.; Tang, T.Q. An electric vehicle driving behavior model in the traffic system with a wireless charging lane. Phys. A Stat. Mech. Its Appl. 2017, 481, 119–126. [Google Scholar] [CrossRef]
  16. He, J.; Yang, H.; Huang, H.J.; Tang, T.Q. Impacts of wireless charging lanes on travel time and energy consumption in a two-lane road system. Phys. A Stat. Mech. Appl. 2018, 500, 1–10. [Google Scholar] [CrossRef]
  17. Jansuwan, S.; Liu, Z.; Song, Z.; Chen, A. An evaluation framework of automated electric transportation system. Transp. Res. Part E Logist. Transp. Rev. 2021, 148, 102265. [Google Scholar] [CrossRef]
  18. Liu, F.; Tan, Z.; Chan, H.K.; Zheng, L. Ramp Metering Control on Wireless Charging Lanes Considering Optimal Traffic and Charging Efficiencies. IEEE Trans. Intell. Transp. Syst. 2024, 25, 11590–11601. [Google Scholar] [CrossRef]
  19. Liu, F.; Tan, Z.; Chan, H.K.; Zheng, L. Model-based variable speed limit control on wireless charging lanes: Formulation and algorithm. Transp. Res. Part E Logist. Transp. Rev. 2025. [Google Scholar] [CrossRef]
  20. Zhang, Y.; Hong, Y.; Tan, Z. Design of Coordinated EV Traffic Control Strategies for Expressway System with Wireless Charging Lanes. World Electr. Veh. J. 2025, 16, 496. [Google Scholar] [CrossRef]
  21. He, F.; Yin, Y.; Zhou, J. Integrated pricing of roads and electricity enabled by wireless power transfer. Transp. Res. Part C Emerg. Technol. 2013, 34, 1–15. [Google Scholar] [CrossRef]
  22. Wang, T.; Yang, B.; Chen, C. Double-layer game based wireless charging scheduling for electric vehicles. In Proceedings of the 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring), Antwerp, Belgium, 25–28 May 2020; pp. 1–5. [Google Scholar]
  23. Esfahani, H.N.; Liu, Z.; Song, Z. Optimal pricing for bidirectional wireless charging lanes in coupled transportation and power networks. Transp. Res. Part C Emerg. Technol. 2022, 135, 103419. [Google Scholar] [CrossRef]
  24. Lei, X.; Li, L.; Li, G.; Wang, G. Review of multilane traffic flow theory and application. J. Chang. Univ. (Nat. Sci. Ed.) 2020, 40, 78–90. [Google Scholar]
  25. Jiang, J.; Ren, Y.; Guan, Y.; Eben Li, S.; Yin, Y.; Yu, D.; Jin, X. Integrated decision and control at multi-lane intersections with mixed traffic flow. J. Phys. Conf. Ser. 2022, 2234, 012015. [Google Scholar] [CrossRef]
  26. Zhou, S.; Zhuang, W.; Yin, G.; Liu, H.; Qiu, C. Cooperative on-ramp merging control of connected and automated vehicles: Distributed multi-agent deep reinforcement learning approach. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 402–408. [Google Scholar]
  27. Bouktif, S.; Cheniki, A.; Ouni, A.; El-Sayed, H. Deep reinforcement learning for traffic signal control with consistent state and reward design approach. Knowl.-Based Syst. 2023, 267, 110440. [Google Scholar] [CrossRef]
  28. Li, D.; Lasenby, J. Imagination-Augmented Reinforcement Learning Framework for Variable Speed Limit Control. IEEE Trans. Intell. Transp. Syst. 2024, 25, 1384–1393. [Google Scholar] [CrossRef]
  29. Chen, S.; Wang, M.; Song, W.; Yang, Y.; Fu, M. Multi-agent reinforcement learning-based decision making for twin-vehicles cooperative driving in stochastic dynamic highway environments. IEEE Trans. Veh. Technol. 2023, 72, 12615–12627. [Google Scholar] [CrossRef]
  30. Jiang, K.; Lu, Y.; Su, R. Safe Reinforcement Learning for Connected and Automated Vehicle Platooning. In Proceedings of the 2024 IEEE 7th International Conference on Industrial Cyber-Physical Systems (ICPS), St. Louis, MO, USA, 12–15 May 2024; pp. 1–6. [Google Scholar]
  31. Pandey, V.; Wang, E.; Boyles, S.D. Deep reinforcement learning algorithm for dynamic pricing of express lanes with multiple access locations. Transp. Res. Part C Emerg. Technol. 2020, 119, 102715. [Google Scholar] [CrossRef]
  32. Abdalrahman, A.; Zhuang, W. Dynamic pricing for differentiated PEV charging services using deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2020, 23, 1415–1427. [Google Scholar] [CrossRef]
  33. Cui, L.; Wang, Q.; Qu, H.; Wang, M.; Wu, Y.; Ge, L. Dynamic pricing for fast charging stations with deep reinforcement learning. Appl. Energy 2023, 346, 121334. [Google Scholar] [CrossRef]
  34. Wilensky, U.; Payette, N. NetLogo Traffic 2 Lanes Model; Center for Connected Learning and Computer-Based Modeling, Northwestern University: Evanston, IL, USA, 1998; Available online: http://ccl.northwestern.edu/netlogo/models/Traffic2Lanes (accessed on 15 March 2025).
  35. Triastanto, A.N.D.; Utama, N.P. Model Study of Traffic Congestion Impacted by Incidents. In Proceedings of the 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), Yogyakarta, Indonesia, 20–21 September 2019; pp. 1–6. [Google Scholar]
  36. Mitrovic, N.; Dakic, I.; Stevanovic, A. Combined alternate-direction lane assignment and reservation-based intersection control. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1779–1789. [Google Scholar] [CrossRef]
  37. Jaxa-Rozen, M.; Kwakkel, J.H. Pynetlogo: Linking netlogo with python. J. Artif. Soc. Soc. Simul. 2018, 21, 4. [Google Scholar] [CrossRef]
  38. Zhu, J.; Hu, L.; Chen, Z.; Xie, H. A Queuing Model for Mixed Traffic Flows on Highways considering Fluctuations in Traffic Demand. J. Adv. Transp. 2022, 2022, 4625690. [Google Scholar] [CrossRef]
  39. Medhi, J. Stochastic Models in Queueing Theory; Elsevier: Amsterdam, The Netherlands, 2002. [Google Scholar]
  40. Hu, L.; Dong, J.; Lin, Z. Modeling charging behavior of battery electric vehicle drivers: A cumulative prospect theory based approach. Transp. Res. Part C Emerg. Technol. 2019, 102, 474–489. [Google Scholar] [CrossRef]
  41. McFadden, D. Conditional Logit Analysis of Qualitative Choice Behavior; Frontiers in Econometrics: New York, NY, USA, 1972. [Google Scholar]
  42. Train, K.E. Discrete Choice Methods with Simulation; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  43. Lou, Y.; Yin, Y.; Laval, J.A. Optimal dynamic pricing strategies for high-occupancy/toll lanes. Transp. Res. Part C Emerg. Technol. 2011, 19, 64–74. [Google Scholar] [CrossRef]
  44. Tan, Z.; Gao, H.O. Hybrid model predictive control based dynamic pricing of managed lanes with multiple accesses. Transp. Res. Part B Methodol. 2018, 112, 113–131. [Google Scholar] [CrossRef]
  45. Li, C.; Dong, X.; Cipcigan, L.M.; Haddad, M.A.; Sun, M.; Liang, J.; Ming, W. Economic viability of dynamic wireless charging technology for private EVs. IEEE Trans. Transp. Electrif. 2022, 9, 1845–1856. [Google Scholar] [CrossRef]
  46. Ge, Y.; MacKenzie, D. Charging behavior modeling of battery electric vehicle drivers on long-distance trips. Transp. Res. Part D Transp. Environ. 2022, 113, 103490. [Google Scholar] [CrossRef]
  47. Zhou, S.; Qiu, Y.; Zou, F.; He, D.; Yu, P.; Du, J.; Luo, X.; Wang, C.; Wu, Z.; Gu, W. Dynamic EV charging pricing methodology for facilitating renewable energy with consideration of highway traffic flow. IEEE Access 2019, 8, 13161–13178. [Google Scholar] [CrossRef]
  48. Galvin, R. Energy consumption effects of speed and acceleration in electric vehicles: Laboratory case studies and implications for drivers and policymakers. Transp. Res. Part D Transp. Environ. 2017, 53, 234–248. [Google Scholar] [CrossRef]
  49. Wilensky, U. NetLogo; Center for Connected Learning and Computer-Based Modeling, Northwestern University: Evanston, IL, USA, 1999. [Google Scholar]
  50. Railsback, S.F.; Grimm, V. Agent-Based and Individual-Based Modeling: A Practical Introduction; Princeton University Press: Princeton, NJ, USA, 2019. [Google Scholar]
  51. Mostafizi, A.; Koll, C.; Wang, H. A decentralized and coordinated routing algorithm for connected and autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2021, 23, 11505–11517. [Google Scholar] [CrossRef]
  52. Kponyo, J.; Nwizege, K.; Opare, K.; Ahmed, A.; Hamdoun, H.; Akazua, L.; Alshehri, S.; Frank, H. A distributed intelligent traffic system using ant colony optimization: A NetLogo modeling approach. In Proceedings of the 2016 International Conference on Systems Informatics, Modelling and Simulation (SIMS), Riga, Latvia, 1–3 June 2016; pp. 11–17. [Google Scholar]
  53. Wang, L.; Yang, M.; Li, Y.; Hou, Y. A model of lane-changing intention induced by deceleration frequency in an automatic driving environment. Phys. A Stat. Mech. Appl. 2022, 604, 127905. [Google Scholar] [CrossRef]
  54. Nguyen, J.; Powers, S.T.; Urquhart, N.; Farrenkopf, T.; Guckert, M. An overview of agent-based traffic simulators. Transp. Res. Interdiscip. Perspect. 2021, 12, 100486. [Google Scholar] [CrossRef]
  55. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  56. Ha, D.; Schmidhuber, J. Recurrent world models facilitate policy evolution. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  57. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
  58. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Department of Engineering, University of Cambridge: Cambridge, UK, 1994; Volume 37. [Google Scholar]
  59. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 1999, 12. [Google Scholar]
  60. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  61. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  62. Zhu, F.; Ukkusuri, S.V. Accounting for dynamic speed limit control in a stochastic traffic environment: A reinforcement learning approach. Transp. Res. Part C Emerg. Technol. 2014, 41, 30–47. [Google Scholar] [CrossRef]
  63. Li, Z.; Liu, P.; Xu, C.; Duan, H.; Wang, W. Reinforcement learning-based variable speed limit control strategy to reduce traffic congestion at freeway recurrent bottlenecks. IEEE Trans. Intell. Transp. Syst. 2017, 18, 3204–3217. [Google Scholar] [CrossRef]
  64. Müller, E.R.; Carlson, R.C.; Kraus, W.; Papageorgiou, M. Microsimulation analysis of practical aspects of traffic control with variable speed limits. IEEE Trans. Intell. Transp. Syst. 2015, 16, 512–523. [Google Scholar] [CrossRef]
  65. Wen, Y.; MacKenzie, D.; Keith, D.R. Modeling the charging choices of battery electric vehicle drivers by using stated preference data. Transp. Res. Rec. 2016, 2572, 47–55. [Google Scholar] [CrossRef]
  66. Tympakianaki, A.; Spiliopoulou, A.; Kouvelas, A.; Papamichail, I.; Papageorgiou, M.; Wang, Y. Real-time merging traffic control for throughput maximization at motorway work zones. Transp. Res. Part C Emerg. Technol. 2014, 44, 242–252. [Google Scholar] [CrossRef]
  67. Wang, C.; Xu, Y.; Zhang, J.; Ran, B. Integrated traffic control for freeway recurrent bottleneck based on deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15522–15535. [Google Scholar] [CrossRef]
  68. Aslani, M.; Mesgari, M.S.; Wiering, M. Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events. Transp. Res. Part C Emerg. Technol. 2017, 85, 732–752. [Google Scholar] [CrossRef]
  69. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Cole Statistics/Probability Series; Wadsworth & Brooks; Chapman and Hall/CRC: Boca Raton, FL, USA, 1984. [Google Scholar]
  70. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
  71. Breiman, L. Classification and Regression Trees; Routledge: Oxfordshire, UK, 2017. [Google Scholar]
  72. Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
  73. Bock, S.; Weis, M. A proof of local convergence for the adam optimizer. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
  74. Shi, Y.; Wang, Z.; LaClair, T.J.; Wang, C.; Shao, Y.; Yuan, J. A novel deep reinforcement learning approach to traffic signal control with connected vehicles. Appl. Sci. 2023, 13, 2750. [Google Scholar] [CrossRef]
  75. Liu, X.; Zhang, D.; Nguyen, C.T.; Yuen, K.F.; Wang, X. “Freedom–enslavement” paradox in consumers’ adoption of smart transportation: A comparative analysis of three technologies. Transp. Policy 2025, 164, 206–216. [Google Scholar] [CrossRef]
  76. Owais, M. Deep Learning for Integrated Origin–Destination Estimation and Traffic Sensor Location Problems. IEEE Trans. Intell. Transp. Syst. 2024, 25, 6501–6513. [Google Scholar] [CrossRef]
  77. Owais, M.; Moussa, G.S. Global sensitivity analysis for studying hot-mix asphalt dynamic modulus parameters. Constr. Build. Mater. 2024, 413, 134775. [Google Scholar] [CrossRef]
  78. Jin, J.; Zhu, X.; Wu, B.; Zhang, J.; Wang, Y. A dynamic and deadline-oriented road pricing mechanism for urban traffic management. Tsinghua Sci. Technol. 2021, 27, 91–102. [Google Scholar] [CrossRef]
Figure 1. A schematic diagram of a two-lane road system. The red bar represents the boundary between the two-lane road system and the lane-changing zone.
Figure 2. A schematic diagram of the conceptual design.
Figure 3. The time interval between consecutive vehicle arrivals ($d = 6$).
Figure 4. Road segments.
Figure 5. The relationship between the probability of choosing the WCL and the charging price, assuming an EV with an SOC of 41%.
Figure 6. Real traffic demand for numerical experiments.
Figure 7. Results for sample scenarios 1 to 4, characterized by an initial state of free-flow traffic. Note that, in each figure, the colored bars (blue, green, red, black) represent either the maximum or the minimum values, while the other bars are shown in grey.
Figure 8. Results for sample scenarios 5 to 8, characterized by an initial state of congested (medium) traffic. Note that, in each figure, the colored bars (blue, green, red, black) represent either the maximum or the minimum values, while the other bars are shown in grey.
Figure 9. Results for sample scenarios 9 to 12, characterized by an initial state of congested (heavy) traffic. Note that, in each figure, the colored bars (blue, green, red, black) represent either the maximum or the minimum values, while the other bars are shown in grey.
Figure 10. Accumulated reward vs. episodes for Algorithm 1. The solid line represents the average performance over ten repeated trainings. The shaded region represents half of the standard deviation from the average performance.
Figure 11. Price signal for real traffic scenario #1.
Figure 12. Price signal for real traffic scenario #2.
Figure 13. Results for real traffic scenario #1.
Figure 14. Results for real traffic scenario #2.
Table 1. List of variables.

| Notations | Definitions | Units | Type ¹ |
| --- | --- | --- | --- |
| Global variables/parameters | | | |
| $N$ | Number of road segments | / | New |
| $L_{sys}$ | Length of the multi-lane system | / | New |
| $e^+$ | Charging power on the WCL | kW | New |
| $p$ | Charging price on the WCL | USD/kWh | New |
| $v^{gpl}$ | Speed limit on the GPL | km/h | New |
| $v^{wcl}$ | Speed limit on the WCL | km/h | New |
| $Throughput$ | Total throughput | veh | New |
| $Energy$ | Total energy | kWh | New |
| EV attributes | | | |
| $e_i$ | Energy consumption of the $i$-th EV | kW | New |
| $v_i^{max}$ | Maximum travel speed of the $i$-th EV | km/h | Old |
| $v_i$ | Current travel speed of the $i$-th EV | km/h | Old |
| $\bar{v}_i^{gpl}$ | Observed travel speed on the GPL by the $i$-th EV | km/h | New |
| $\bar{v}_i^{wcl}$ | Observed travel speed on the WCL by the $i$-th EV | km/h | New |
| $a_i$ | Acceleration of the $i$-th EV | m/s² | Old |
| $s_i$ | SOC of the $i$-th EV | percent | New |
| $s_i^{min}$ | Minimum SOC level of the $i$-th EV | percent | New |
| $s_i^{max}$ | Maximum SOC level of the $i$-th EV | percent | New |
| $s^{ini}$ | Mean value of the initial SOC of incoming EVs | percent | New |
| $\Delta s^{ini}$ | Standard deviation of the initial SOC of incoming EVs | percent | New |
| $l_i^x, l_i^y$ | Location of the $i$-th EV | km | New |
| $y$ | Target lane of the $i$-th EV | / | New |
| $v_i^{lat}$ | Lateral speed of the $i$-th EV | km/h | Old |

¹ “Old” refers to an existing variable of the NetLogo 6.4.0 traffic model, while “New” refers to a new one.
Table 2. Parameter values of the double-lane system and hyperparameters of the DQL algorithm.

| Hyper-Parameters | Values |
| --- | --- |
| Learning rate | 0.0001 |
| Discount factor ¹ | 0.99 |
| Initial exploration rate ¹ $\epsilon$ | 1 |
| Final exploration rate ¹ $\epsilon$ | 0.01 |
| Batch size | 32 |
| Number of hidden layers | 2 |
| Size of a hidden layer | 64 |
| Gradient descent optimizer | Adam [73] |
| Memory capacity | 10,000 |

| Parameters | Values |
| --- | --- |
| $e^+$ | 15 |
| $p$ | {0.5, 1, 1.5, 3, 5} |
| $v^{gpl}$ ¹ | 22 |
| $v^{wcl}$ ¹ | 20 |
| $v_i^{max}$ | 100 |
| $a_{acc,i}$ ¹ | 3 |
| $a_{dec,i}$ ¹ | −4.5 |
| $s_i^{min}$ | 20 |
| $s_i^{max}$ | 80 |
| $v_i^{lat}$ | 1 |
| $d_{max}$ | 150 |
| $W_1$ | 0.01 |
| $W_2$ | 0.01 |
| $W_3$ | −0.2 |

¹ The parameter settings are similar to [74].
Table 3. Initial states for the numerical examples.

| No. of Scenario | Traffic State | $m_{tot}^{gpl}$ ¹ (veh) | $m_{tot}^{wcl}$ ² (veh) | $d_1$ ³ (veh/min) |
| --- | --- | --- | --- | --- |
| #1 | Free-flow | 100 | 100 | 60 |
| #2 | | 100 | 200 | 60 |
| #3 | | 200 | 100 | 60 |
| #4 | | 200 | 200 | 60 |
| #5 | Congested (medium) | 200 | 400 | 60 |
| #6 | | 200 | 600 | 60 |
| #7 | | 400 | 200 | 60 |
| #8 | | 600 | 200 | 60 |
| #9 | Congested (heavy) | 400 | 400 | 60 |
| #10 | | 400 | 600 | 60 |
| #11 | | 600 | 400 | 60 |
| #12 | | 600 | 600 | 60 |

¹ Total number of EVs on the GPL. ² Total number of EVs on the WCL. ³ Traffic demand in the next pricing interval.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
