To offer a structured and comparative overview of existing RL applications for EVCSs, the evaluation focuses on seven key attributes: RL types and methodologies, agent architectures, reward functions, baseline control, datasets, performance indexes, and EVCS types.
Such dimensions capture the critical factors in designing, deploying, and benchmarking RL-based controllers. Dissecting the literature along these axes helps readers understand how RL techniques align with various charging control frameworks, design constraints, and performance objectives—supporting informed choices for different network scenarios.
6.1. RL Types and Methodologies
The analysis of RL methodologies applied to EVCS control reveals that actor–critic approaches dominate in volume, reflecting their strong capability to manage continuous action spaces, multi-agent interactions, and constraint-aware decision-making (see
Figure 7—right). Techniques such as DDPG and SAC are frequently employed in grid-integrated and scalable scheduling frameworks [
126,
129,
137,
138], where they often outperform value-based methods in high-dimensional, dynamic environments (see
Figure 7—left). Value-based RL algorithms—such as Q-learning and DQN—remain widely used, particularly in simpler, single-agent, or discretized scenarios where cost optimization is the primary focus [
104,
106,
117]. In contrast, policy-based RL methods appear far less frequently in the literature, with only a few notable studies. Nevertheless, in those limited cases, policy-based strategies have demonstrated considerable promise, especially in embedding safety constraints directly into the EVCS decision-making pipeline [
120,
121]. Such methodological distribution reflects a broader evolution in the field—while value-based methods laid the groundwork for early EVCS control solutions, actor–critic architectures have since emerged as the dominant paradigm for managing complex, real-world, and multi-agent charging environments. More specifically:
Value-based: The development of value-based RL in EVCS applications illustrates a clear trajectory, evolving from early model-free Q-learning toward more advanced deep and hybrid formulations that address the increasing complexity of charging systems. Initial studies, such as those using fitted Q-iteration (FQI) [
104,
109], validated the feasibility of batch-mode RL, enabling training from historical data and reducing the risks associated with real-world exploration. These methods proved especially effective in residential and public charging environments by offering scalability without needing detailed grid models. However, their offline learning nature limited adaptability in real-time scenarios, encouraging a shift toward online learning and function approximation. This shift became evident in studies like [
108,
120], where SARSA with linear approximators and deep value networks improved the responsiveness and scalability of control systems for dynamic pricing and power flow management. Meanwhile, deep Q-learning frameworks—such as those presented in [
106,
111,
113]—incorporated temporal representation learning (e.g., LSTM networks) to model sequential patterns in charging behavior, improving forecasts and dynamic adaptability. While these innovations marked major advancements, they also introduced new challenges—deep value-based RL models required significant training data and were often sensitive to poorly shaped or sparse reward functions. Multi-agent extensions, as seen in [
110,
112], broadened the applicability of value-based approaches by decentralizing decision-making across multiple EVs or distributed loads. Such methods tackled issues of fairness and coordination in community-level EVCS operations, but also exposed vulnerabilities such as slow convergence and increased instability due to non-stationary learning environments. To address these, enhancements like prioritized experience replay (PER) and hierarchical coordination schemes were integrated [
117].
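To make the mechanics underlying these value-based methods concrete, the following minimal sketch illustrates the tabular Q-learning update applied to a hypothetical single-EV scheduling problem with a discretized state of (time slot, SoC level) and a binary charge/idle action; the price curve, penalties, and dimensions are illustrative assumptions rather than values from any cited study.

```python
import numpy as np

# Minimal tabular Q-learning sketch for a single-EV charging problem.
# All quantities (24 hourly slots, 11 SoC levels, synthetic TOU-like price curve,
# penalty weights) are hypothetical and chosen only to show the update rule.
N_SLOTS, N_SOC, N_ACTIONS = 24, 11, 2            # actions: 0 = idle, 1 = charge one SoC step
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
price = 0.15 + 0.10 * np.sin(np.linspace(0, 2 * np.pi, N_SLOTS))  # $/step, synthetic tariff

Q = np.zeros((N_SLOTS, N_SOC, N_ACTIONS))

def step(t, soc, a):
    """Toy environment: pay the tariff when charging, penalize an unmet SoC target at departure."""
    soc_next = min(soc + a, N_SOC - 1)
    reward = -price[t] * a
    if t == N_SLOTS - 1 and soc_next < N_SOC - 1:     # unmet SoC target at departure
        reward -= 1.0 * (N_SOC - 1 - soc_next)
    return (t + 1) % N_SLOTS, soc_next, reward

for episode in range(2000):
    t, soc = 0, 2
    for _ in range(N_SLOTS):
        a = np.random.randint(N_ACTIONS) if np.random.rand() < EPSILON else int(np.argmax(Q[t, soc]))
        t_next, soc_next, r = step(t, soc, a)
        target = r + GAMMA * np.max(Q[t_next, soc_next]) * (t_next != 0)   # no bootstrap past departure
        Q[t, soc, a] += ALPHA * (target - Q[t, soc, a])                    # temporal-difference update
        t, soc = t_next, soc_next
```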
Notably, hybrid architectures that blend value-based learning with actor–critic principles or incorporate advanced replay mechanisms (e.g., PER, HER) have shown promising results in continuous control environments [
114,
115,
116]. These demonstrate the flexibility of value-based RL to adapt when appropriately extended. Nonetheless, persistent limitations remain—particularly in handling continuous action spaces, which often necessitate integration with policy-gradient methods [
115]. Additionally, extensive reliance on simulation restricts real-world deployment potential, a challenge shared across most RL methodologies. Reward engineering remains a central obstacle. For example, while DQN-based methods such as [
118] demonstrated effective cost and grid optimization, they lacked explicit modeling of long-term battery degradation or user-centric objectives. These gaps are prompting a convergence in algorithmic design, incorporating hierarchical RL, model-based components, and transfer learning to reduce data requirements and improve generalizability.
Looking ahead, value-based RL is expected to evolve from its traditional single-agent, cost-minimization role toward more sophisticated multi-agent, multi-objective frameworks that jointly optimize user satisfaction, renewable integration, and grid flexibility. Emerging works—such as the cooperative DDQN-PER model of [
117] and the dynamic frameworks of [
115]—have illustrated this direction. Ultimately, the next generation of value-based RL will need to bridge discrete and continuous control strategies, integrate domain-aware reward structures, and deliver scalable, real-world-ready policies.
Policy-based: Although policy-based RL approaches remain limited in EVCS literature, they have addressed key limitations of value-based and actor–critic methods by focusing on direct policy optimization and the integration of safety guarantees. For example, Ref. [
121] introduced Constrained Policy Optimization (CPO) to enforce grid and operational constraints directly during training, thereby ensuring safe and constraint-compliant charging behavior without the need for post hoc corrections. In a similar direction, Ref. [
120] proposed a DNN-enhanced policy-gradient framework, augmented by dynamic programming techniques for real-time power flow control. This approach demonstrated rapid convergence and scalability in complex grid environments. Collectively, these studies highlight the potential of policy-gradient methods to effectively manage continuous control tasks and embed safety-critical features within the decision-making process. However, their broader adoption remains limited, potentially due to high computational demands and relatively low sample efficiency, which constrain their scalability and implementation in larger EVCS systems.
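As a simplified illustration of how constraints can be folded into direct policy optimization, the sketch below applies a Lagrangian-style penalty to a softmax policy over three hypothetical charging power levels. It is not the CPO algorithm of [121]; the rewards, grid-cost terms, and budget are assumed values used only to show the coupled policy and multiplier updates.

```python
import numpy as np

# Lagrangian-relaxation sketch of constraint-aware policy-gradient learning.
# Action set (three charging power levels), rewards, and the grid-cost budget are hypothetical.
rng = np.random.default_rng(0)
theta = np.zeros(3)                      # softmax preferences over {0 kW, 3.7 kW, 11 kW}
lam, lam_lr, lr = 0.0, 0.01, 0.05
reward = np.array([0.0, 0.4, 1.0])       # e.g., user satisfaction from faster charging
grid_cost = np.array([0.0, 0.2, 0.9])    # e.g., contribution to transformer loading
budget = 0.3                             # allowed expected grid cost

for it in range(5000):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    a = rng.choice(3, p=pi)
    grad_logpi = -pi.copy(); grad_logpi[a] += 1.0                 # d log pi(a) / d theta for softmax
    theta += lr * (reward[a] - lam * grid_cost[a]) * grad_logpi   # penalized policy gradient step
    lam = max(0.0, lam + lam_lr * (pi @ grid_cost - budget))      # dual ascent on the constraint
```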
Actor–Critic: Actor–critic RL methodologies have rapidly evolved from basic single-agent implementations to advanced multi-agent and hybrid frameworks capable of addressing large-scale coordination, grid integration, and safety-aware control. Foundational contributions—e.g., the goal representation adaptive dynamic programming (GrADP) approach by [
122]—have demonstrated early on the suitability of actor–critic methods for continuous control tasks, particularly in frequency regulation and ancillary services. Building on this foundation, deterministic policy gradient algorithms like DDPG became widely adopted (see
Figure 8—left) for their ability to operate in continuous action spaces without discretization overheads [
123,
128,
132]. The integration of recurrent structures, such as LSTM, further improved temporal decision-making in applications like dynamic pricing and SoC-constrained energy scheduling.
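The following compact sketch outlines a DDPG-style actor–critic update for a continuous charging-power action; the state vector, network sizes, and hyperparameters are assumptions, and practical implementations add exploration noise, replay buffers, and, as in several surveyed works, recurrent (e.g., LSTM) state encoders.

```python
import copy
import torch
import torch.nn as nn

# DDPG-style update sketch: continuous action in [0, 1] (fraction of rated charger power).
STATE_DIM, GAMMA, TAU = 6, 0.99, 0.005   # e.g., [SoC, price, time, load, PV, deadline] (assumed)

actor  = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
critic = nn.Sequential(nn.Linear(STATE_DIM + 1, 64), nn.ReLU(), nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)      # target networks
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next, done):
    """One DDPG update from a minibatch of transitions (tensors of shape [B, dim] / [B, 1])."""
    with torch.no_grad():
        q_target = r + GAMMA * (1 - done) * critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()      # maximize the critic's estimate
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    for net, net_t in ((actor, actor_t), (critic, critic_t)):         # Polyak averaging of targets
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```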
Scalability and decentralization have emerged as prominent research directions with the advent of multi-agent actor–critic frameworks. Studies such as [
129,
135] employed centralized training with decentralized execution (CTDE) architectures, leveraging mechanisms like counterfactual baselines (e.g., COMA) and game-theoretic coordination to mitigate credit assignment challenges while promoting cooperative autonomy. Similarly, in [
130], researchers introduced adaptive gradient re-weighting among critics to reduce policy conflict, whereas [
134] extended multi-agent DDPG for V2G-enabled frequency regulation, enabling collaborative EV decision-making in grid-supportive scenarios.
Recent advances have also prioritized constraint-aware learning. For instance, Ref. [
125] incorporated second-order cone programming (SOCP) into DDPG to enforce voltage stability constraints, while [
138] applied a constrained soft actor–critic (CSAC) method to embed grid and operational constraints directly into the policy-learning process. Similarly, Ref. [
133] proposed a bilevel DDPG framework that combines predictive LSTM modules with safety shields, enabling feasible and reliable scheduling decisions under uncertainty. Such innovations underscored the growing emphasis on safe and risk-aware RL—an especially important consideration for real-world EVCS operations within tightly constrained grid infrastructures.
Entropy-regularized actor–critic methods such as SAC are also gaining traction due to their improved exploration capabilities and higher sample efficiency in stochastic environments [
136,
139]. Meanwhile, Ref. [
141] combined actor–critic RL with probabilistic forecasting and metaheuristics to develop a risk-sensitive, multi-objective scheduling framework. In parallel, Ref. [
131] employed TD3 to address value overestimation and improve the training stability of battery-enabled charging infrastructure. Collectively, these innovations illustrate a trajectory toward integrated, hybrid actor–critic systems that balance learning efficiency with operational safety, grid stability, and market responsiveness.
Although actor–critic RL has shown promise, several challenges still limit its practical use in EVCS control. One major issue is its high sample complexity—these methods often need millions of state–action interactions to learn effectively, which makes real-world training difficult. Another common problem is training instability: if the critic’s estimates are inaccurate, they can misguide the actor, leading to unstable or even failed learning. To tackle such challenges, several strategies have been proposed. Some studies use offline pre-training with historical charging or mobility data to give the model a strong starting point, reducing the need for extensive online learning [
106,
115]. Others rely on expert demonstrations or simpler rule-based/MPC controllers to guide early learning, which helps improve stability [
109,
127]. In addition, actor–critic models are increasingly incorporating learned models of the environment or predictive demand approximations to reduce the number of real-world interactions needed. These efforts aim to bridge the gap between data-hungry simulations and the more demanding conditions of real-world EVCS deployment, where efficiency and stability are key.
Furthermore, multi-agent actor–critic models often suffer from non-stationarity and increased coordination overhead in large-scale networks—a limitation only partially addressed through hierarchical control decomposition [
124] or federated learning architectures [
126]. Going forward, future research should prioritize the integration of hierarchical actor–critic schemes, safety-focused RL, and hybrid decision-making frameworks informed by predictive modeling, in order to bridge the persistent gap between simulation-based learning and real-world deployment in EVCS systems.
Hybrids: Hybrid RL approaches have also emerged as a prominent direction in EVCS control research, representing a significant evolution of RL by integrating complementary methodologies to overcome the individual limitations of model-free RL, mathematical optimization, and forecasting. Early contributions, such as [
142], combined multi-step Q(λ) learning with multi-agent coordination to enhance convergence speed and scalability in mobile EVCS scheduling under dynamic grid conditions. Likewise, Ref. [
143] proposed a hybrid of model-based and model-free RL, using value iteration for rapid policy initialization followed by Q-learning refinement—thereby reducing exploration overhead while preserving adaptability.
According to the evaluation, hybrid RL schemes are predominantly characterized by combinations of RL with mathematical optimization techniques—such as MILP, ILP, BLP, SP, and game theory—for feasibility and constraint satisfaction [
149,
151] (see
Figure 9—left and right). This hybridization illustrates how the real-time adaptability of RL can be effectively fused with the rigor of optimization frameworks to yield high-quality, feasible charging strategies. For instance, Ref. [
149] integrated multi-agent DQN with MILP post-optimization to enable decentralized agents to learn local policies, while MILP ensured coordinated scheduling across battery-swapping and fast-charging stations. Similarly, Ref. [
151] used RL in conjunction with MILP to maintain grid-compliant station operations, and [
152] merged RL with LSTM-based forecasting and ILP for adaptive yet constraint-respecting V2G scheduling. Additional works have explored RL in tandem with game-theoretic models [
145] or surrogate optimization [
153], and hybridized RL with metaheuristics like GA, DE, WOA, and MOAVOA to support large-scale planning and global search tasks [
148,
156]. A smaller subset of hybrid studies also focused on algorithmic coordination and matching: for instance, Ref. [
143] demonstrated how local RL decisions can be augmented through tailored global coordination schemes.
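A minimal sketch of this "RL proposes, optimization repairs" pattern is given below: a linear program projects a hypothetical RL-suggested charging profile onto a feasible set defined by an energy requirement and a charger power limit. The proposal, horizon, and limits are assumptions, and the surveyed works typically employ richer MILP formulations.

```python
import numpy as np
from scipy.optimize import linprog

# Feasibility repair of an RL-proposed schedule via a small linear program.
T, P_MAX, DT, E_NEED = 8, 11.0, 1.0, 30.0            # slots, kW, h, required kWh (assumed)
p_rl = np.array([11, 11, 0, 0, 4, 4, 0, 0], float)   # hypothetical RL proposal (kW per slot)

# Variables x = [p_1..p_T, u_1..u_T]; minimize sum(u) subject to |p - p_rl| <= u.
c = np.concatenate([np.zeros(T), np.ones(T)])
A_ub = np.vstack([
    np.hstack([ np.eye(T), -np.eye(T)]),                    #  p - u <= p_rl
    np.hstack([-np.eye(T), -np.eye(T)]),                    # -p - u <= -p_rl
    np.hstack([-DT * np.ones((1, T)), np.zeros((1, T))]),   # -sum(p)*DT <= -E_NEED (energy target)
])
b_ub = np.concatenate([p_rl, -p_rl, [-E_NEED]])
bounds = [(0, P_MAX)] * T + [(0, None)] * T                 # charger power limit, slack >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
p_feasible = res.x[:T]    # repaired, constraint-satisfying schedule closest to the RL proposal
```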
Recent developments in hybrid RL have expanded into multi-agent and actor–critic combinations. For example, Ref. [
150] combined soft actor–critic (SAC) for virtual power plant (VPP) energy trading with TD3 for EVCS-level scheduling in a cooperative, multi-agent setting—demonstrating how hybrid RL can co-optimize grid-scale operations and local EV management. Similarly, Ref. [
154] introduced a two-level P-DQN-DDPG architecture that integrates discrete booking decisions with continuous pricing control, a particularly valuable advancement for real-time, market-driven EVCS operations. These multi-layered frameworks signal a broader trend in hybrid RL: toward holistic, multi-agent, multi-objective, and multi-level optimization across operational, economic, and grid-interactive domains. In addition, surrogate modeling and simulation-based planning have been incorporated into hybrid RL designs. For instance, Ref. [
153] fused Monte Carlo RL with surrogate optimization to solve large-scale EVCS siting problems, while [
146] combined DQN with binary linear programming for efficient fleet and charger coordination. These innovations illustrate the potential of hybrid RL to act as an orchestrator—merging data-driven learning with analytical models to enhance convergence speed, policy quality, and scalability.
Overall, hybrid RL has matured into a unifying framework in which RL is no longer treated as a standalone controller, but as the intelligent core of integrated decision-making systems. This paradigm shift effectively addresses challenges such as scalability, safety, and multi-objective optimization in EVCS management, while paving the way for real-world deployment. Future research is expected to build on these foundations by exploring risk-aware scheduling strategies [
141], scalable multi-agent architectures [
155], and hierarchical RL–optimization pipelines that support real-time, grid-interactive EVCS control.
6.2. Agent Architectures
MARL has emerged as a critical trend in EVCS control, addressing the scalability and coordination challenges posed by large-scale EV integration. As illustrated in
Figure 10 (right and center), MARL methodologies have seen widespread adoption across EVCS-related research. Early decentralized Q-learning studies, such as [
110,
112], demonstrated how independent agents can learn charging policies based solely on local observations, enabling scalable and modular control architectures without the need for centralized oversight. However, such fully decentralized frameworks often suffer from suboptimal global coordination, leading to the rise of centralized training with decentralized execution (CTDE) paradigms.
Recent works such as [
129,
130,
134,
150,
155] employed shared critics with decentralized policy networks, allowing agents to access global grid-level knowledge during training while preserving execution autonomy. Moreover, advanced designs such as the non-cooperative game-theoretic MARL framework by [
135] introduced spatially discounted rewards to balance local competitiveness with system-level coordination. Hybrid MARL frameworks have also emerged—for instance, the integration of MILP post-optimization in [
149] and the hierarchical Kuhn-Munkres matching algorithm in [
145]—showcasing how optimization techniques can enhance MARL to ensure grid compliance and infrastructure-wide efficiency. Overall, MARL research in EVCS control is clearly progressing toward hierarchical, hybrid, and cooperative architectures capable of addressing diverse objectives such as grid stability, fairness, and renewable energy integration.
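To ground the CTDE pattern referenced above, the following structural sketch combines per-agent policies over local observations with a single centralized critic that scores the joint observation–action pair during training; all dimensions and network sizes are illustrative and not drawn from any cited study.

```python
import torch
import torch.nn as nn

# Structural sketch of centralized training with decentralized execution (CTDE).
N_AGENTS, OBS_DIM, ACT_DIM = 3, 8, 1   # e.g., 3 chargers, local observations, continuous power action

actors = nn.ModuleList([
    nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, ACT_DIM), nn.Tanh())
    for _ in range(N_AGENTS)
])
central_critic = nn.Sequential(
    nn.Linear(N_AGENTS * (OBS_DIM + ACT_DIM), 64), nn.ReLU(), nn.Linear(64, 1)
)

obs = torch.randn(N_AGENTS, OBS_DIM)                                 # local observations
acts = torch.stack([actors[i](obs[i]) for i in range(N_AGENTS)])     # decentralized execution
joint = torch.cat([obs.flatten(), acts.flatten()])                   # global info used only in training
q_joint = central_critic(joint)                                      # shared value estimate for all actors
```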
An essential consideration in MARL control lies in the clear identification and implementation of three foundational elements:
Structure,
Training, and
Coordination, as introduced in
Section 3. These dimensions are crucial, as they define how knowledge is shared, how policies are learned, and how agent collaboration is orchestrated across the system. A well-defined strategy for each of these dimensions directly impacts scalability, learning stability, and adaptability to complex, dynamic environments like EVCS systems. A closer look reveals the following:
Structure: Analysis of structural choices in MARL for EVCSs shows a clear dominance of architectures using a shared centralized critic with separate policy networks (see
Figure 11—left). This configuration, adopted in studies such as [
129,
130,
134,
150], enables agents to benefit from a global value function during training while preserving decentralized policy execution for scalability. In contrast, architectures with fully independent critics and policies, as found in [
110,
112], prioritize agent autonomy and implementation simplicity at the cost of coordinated global optimization. Partially shared or hybrid structures are rare, appearing in only a few studies such as [
142], with most research favoring CTDE-compatible architectures due to their balance of coordination and independence. Some advanced variants, like [
130], further enhanced MARL scalability by employing multi-critic architectures with adaptive gradient re-weighting to mitigate inter-agent policy conflict.
Training: Centralized Training with Decentralized Execution has become the prevailing training paradigm in MARL-based EVCS control (
Figure 11—center), enabling agents to incorporate global system information during learning while executing actions locally [
126,
135,
155]. CTDE effectively mitigates non-stationarity in multi-agent settings and promotes stable convergence. Fully decentralized training appears mainly in simpler settings involving tabular or basic function-approximation Q-learning [
110,
112]. Some studies also explored mixed training schemes—such as [
145], which combines decentralized Q-learning with periodic centralized matching—demonstrating how hybrid paradigms can enable scalable yet coordinated scheduling.
Coordination: Implicit coordination has dominated the MARL landscape for EVCS control (see
Figure 11—right), relying on shared reward functions or centralized critics to align agent behaviors without direct communication [
129,
134,
150]. Such an approach may reduce communication overhead and simplify implementation, making it ideal for scalable and deployable systems. Emergent coordination—cases where agents collaborate through shared interactions with the environment—has been observed in decentralized frameworks like [
110,
112]. Explicit or hierarchical coordination mechanisms remain relatively rare; for example, Ref. [
145] used a Kuhn-Munkres algorithm to coordinate agents periodically for efficient V2V charging. The general absence of explicit communication-based coordination reflects a broader focus on lightweight, communication-efficient MARL solutions tailored to real-world EVCS deployment.
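For illustration, the following sketch shows a periodic centralized matching step of the kind used in [145], implemented here with SciPy's Hungarian (Kuhn-Munkres) solver over a synthetic travel-distance cost matrix; the matrix values and fleet sizes are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Kuhn-Munkres (Hungarian) assignment of requesting EVs to available providers,
# minimizing total travel distance. The cost matrix is synthetic.
rng = np.random.default_rng(1)
cost = rng.uniform(0.5, 8.0, size=(4, 6))    # km from each of 4 requesting EVs to 6 providers

rows, cols = linear_sum_assignment(cost)      # optimal one-to-one assignment
for ev, provider in zip(rows, cols):
    print(f"EV {ev} -> provider {provider} ({cost[ev, provider]:.1f} km)")
```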
Overall, MARL remains relatively underexplored in EVCS control, with only 14 of 52 studies (approximately 27%—see
Figure 10—right) employing decentralized or distributed RL methodologies despite its suitability for large-scale, decentralized charging coordination. Current applications, such as decentralized Q-learning [
110,
112] and CTDE-based actor–critic methods [
129,
130,
134], demonstrated MARL’s potential to manage grid-constrained, multi-agent environments while enabling scalable cooperation among EVs, aggregators, and grid entities. However, challenges including non-stationarity, high sample complexity, and coordination overheads appear to hinder broader adoption. Future research is anticipated to pursue MARL more intensively, focusing specifically on hierarchical coordination mechanisms [
145], federated and privacy-preserving training schemes [
126], and the integration with forecasting and optimization layers [
149,
155] to improve convergence, safety, and real-world deployability. Such advancements may establish MARL as a next-generation EVCS control paradigm, enabling self-organizing, grid-interactive, and fairness-aware charging ecosystems.
6.3. Reward Functions
Reward design is a foundational element of RL-based EVCS control, as it directly shapes agent behavior and learning convergence. A clear trend in recent literature reveals a shift toward multi-objective reward formulations, which have become significantly more prevalent than single-objective designs (see
Figure 12—right). Only a limited subset of studies—such as [
104,
108,
110,
113,
114,
123,
127,
131,
135,
136,
144,
145,
148,
151]—have relied on single-objective rewards, often focused on isolated goals such as economic optimization or specific grid performance metrics (see
Figure 12—right). In contrast, the majority of RL implementations adopted composite reward functions that combine multiple objectives spanning economic performance, grid impact, user experience, battery health, environmental sustainability, and fairness-related penalties, reflecting the complexity of modern EVCS ecosystems.
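A hedged sketch of such a composite formulation is shown below, where separate cost, grid, user, battery, and emission terms are scalarized through fixed weights; the term definitions and weights are placeholders rather than values from any specific study.

```python
# Scalarized multi-objective reward sketch: individual objective terms are computed
# separately and collapsed into one scalar via fixed weights (all values illustrative).
WEIGHTS = {"cost": 1.0, "grid": 0.5, "user": 0.8, "battery": 0.2, "co2": 0.1}

def reward(price, power, transformer_load, transformer_limit,
           soc, soc_target, cycle_depth, carbon_intensity, dt=0.25):
    terms = {
        "cost": -price * power * dt,                                    # energy expenditure
        "grid": -max(0.0, transformer_load - transformer_limit) ** 2,   # overload penalty
        "user": -max(0.0, soc_target - soc),                            # unmet SoC penalty
        "battery": -cycle_depth ** 2,                                   # crude degradation proxy
        "co2": -carbon_intensity * power * dt,                          # emissions proxy
    }
    return sum(WEIGHTS[k] * v for k, v in terms.items()), terms
```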
Among these objectives, economic-oriented terms are the most widely utilized. Such terms typically aim to minimize charging costs or maximize operational profit through dynamic pricing strategies and energy arbitrage mechanisms [
104,
105,
114,
136,
150] (see
Figure 12—left). However, recent work has increasingly paired these with grid-supportive components—such as transformer overload mitigation, load balancing, and voltage/frequency stabilization—to enhance alignment with distribution network performance and reliability [
109,
110,
117,
134,
137,
152].
In parallel, user-centric reward terms have gained traction. These include penalties for unmet SoC targets, excessive waiting times, or high levels of range anxiety [
106,
112,
118,
129,
130]. Additionally, some studies incorporate battery degradation costs to ensure the long-term health of EV batteries, an increasingly relevant concern for both fleet operators and private users [
117,
129,
131]. Though less common, environmental-oriented rewards—particularly those targeting CO₂ reduction and renewable energy utilization—are beginning to emerge, highlighting a growing emphasis on sustainability and carbon neutrality in EVCS design [
155,
156].
Despite this increasing sophistication in reward structures, several critical challenges remain. Sparse reward environments—such as those described in [
115]—may significantly hinder learning convergence, particularly in complex or sequential decision-making tasks. Additionally, linear weighting schemes used in multi-objective formulations—e.g., [
135,
152]—are often sensitive to the choice of weights, potentially biasing the prioritization of objectives. Penalty-based constraint handling also dominates the field, which can lead to transient constraint violations during training [
138].
Another notable trend revealed by the evaluation is that, despite the prevalence of multi-objective formulations, almost all existing EVCS RL studies ultimately relied on scalarized rewards formed through weighted sums or penalties [
106,
112,
115,
116,
121,
129,
135,
139,
155,
156]. Notably, almost no work was found to employ genuine multi-objective RL methods, such as Pareto front learning, multi-objective policy gradients, or separate critics per objective. This trend highlights a significant research opportunity to move beyond ad hoc weighting schemes toward principled frameworks that capture diverse trade-offs and yield more robust charging policies.
Looking ahead, the development of more nuanced reward mechanisms is essential. Promising directions include risk-aware and probabilistic reward shaping strategies, such as those demonstrated in [
141], where penalties are dynamically weighted based on uncertainty or event severity. Hierarchical or modular reward architectures may also offer greater clarity and flexibility, enabling agents to independently learn sub-policies for economic efficiency, grid compliance, and user satisfaction, while maintaining coherence at the system level. Finally, the integration of real-world feedback, particularly in federated or decentralized learning setups [
126], will be vital in ensuring that reward functions are practically grounded, robust, and capable of generalizing to large-scale EVCS deployment scenarios.
6.4. Baseline Control
The analysis of baselines employed across RL-based EVCS studies reveals clear patterns regarding how researchers benchmark their approaches, reflecting both methodological maturity and evolving expectations in the field. Rule-based control (RBC) strategies, typically based on Time-of-Use (TOU) pricing or immediate charging policies, remain among the most widely adopted baselines [
104,
109,
137] (see
Figure 13—left). These methods are computationally simple, forecasting-free, and reflect legacy charging practices, making them ideal for demonstrating how RL can dynamically adapt to real-time pricing signals and grid conditions. Particularly in residential and public charging scenarios, outperforming RBC allows RL to establish its relevance in moving from static heuristics to context-aware optimization [
118,
Another frequently used—yet simplistic—heuristic is the
greedy or
charge-when-plugged strategy, where EVs immediately draw maximum available power upon connection until fully charged [
105,
118,
152]. Despite its lack of intelligence, this baseline provides a clear lower bound for evaluating RL effectiveness, especially in minimizing peak demand, transformer stress, and overall charging costs in single-agent and residential settings [
106,
118,
137,
138]. Together with fixed control strategies, these heuristic approaches constitute the dominant form of baseline control in current literature (see
Figure 13—right).
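For reference, these two heuristic baselines can be expressed as simple rules, as sketched below; the charger rating, tariff windows, and battery parameters are assumed values used only to illustrate the control logic that RL methods are typically benchmarked against.

```python
# Minimal sketches of the greedy and TOU rule-based baselines (all parameters hypothetical).
P_MAX = 11.0                               # kW charger rating
OFF_PEAK = set(range(0, 7)) | {22, 23}     # assumed cheap TOU hours

def greedy_policy(soc, soc_target, plugged_in):
    """Charge-when-plugged: draw full power until the target SoC is reached."""
    return P_MAX if plugged_in and soc < soc_target else 0.0

def tou_policy(hour, soc, soc_target, hours_to_departure, capacity_kwh=60.0):
    """TOU rule: prefer off-peak hours, but charge anyway if the departure deadline is at risk."""
    if soc >= soc_target:
        return 0.0
    energy_needed = (soc_target - soc) * capacity_kwh
    must_charge_now = energy_needed >= hours_to_departure * P_MAX    # no slack left before departure
    return P_MAX if (hour in OFF_PEAK or must_charge_now) else 0.0
```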
Offline optimization methods—particularly mixed-integer linear programming (MILP)—have also been widely used to benchmark RL against theoretically optimal schedules with perfect foresight (see
Figure 13—left). Such approaches may offer a valuable upper bound for assessing cost minimization, load balancing, and constraint satisfaction [
106,
149,
151]. For instance, Ref. [
116] compared a cooperative DDQN framework with MILP optimization assuming full future knowledge to highlight RL’s capacity to approximate optimal solutions in real-time settings. Similarly, Refs. [
111,
118] employed offline MILP or deterministic optimization to quantify the performance gap between centralized planning and RL-based adaptive control. These comparisons reinforce RL’s advantage in uncertain or computationally constrained environments where offline methods may be impractical.
Evolutionary algorithms, including genetic algorithms (GAs), particle swarm optimization (PSO), and differential evolution (DE), also served as comparative benchmarks, particularly in large-scale or multi-objective EVCS problems [
112,
155,
156]. While these methods provide near-optimal offline solutions, their computational complexity and inability to adapt in real time make them unsuitable for operational use. Comparing RL to such baselines allows researchers to demonstrate that RL can deliver comparable (or superior) performance while adapting to dynamic conditions without repeated re-optimization. For example, Ref. [
112] used GA as a baseline for microgrid-level MARL scheduling, and [
155] contrasted MOAVOA-MADDPG against a standalone MOAVOA, showcasing RL’s advantage under uncertainty. Similarly, Ref. [
156] benchmarked DE against an RL-based mobile EVCS planner, emphasizing RL’s superior adaptability.
MPC baselines, although less common, were employed in advanced studies involving grid-integrated and community-level charging systems [
106,
115,
125]. MPC offers strong performance under forecastable environments by optimizing over a receding horizon; however, its reliance on accurate models and high computational cost renders it impractical for large-scale, real-time EVCS operation [
115,
128]. Consequently, many studies position RL as a scalable and model-free alternative, capable of achieving similar or better outcomes under stochastic conditions and incomplete information [
106,
115,
125,
128].
A particularly important trend involves the increasing use of learning-based baselines, which mark a methodological shift from feasibility demonstration to algorithmic refinement (see
Figure 13—right). Many studies benchmarked new RL schemes against existing state-of-the-art RL variants to showcase improvements in learning speed, convergence, and robustness. For instance, Ref. [
117] evaluated their cooperative DDQN model against standard Q-learning, DQN, and prioritized replay variants, highlighting gains in cost reduction and convergence rate. Similarly, Ref. [
150] demonstrated that their hybrid SAC–TD3 architecture outperformed both SAC and TD3 independently, showing the benefits of hybridization. In hierarchical RL research, Ref. [
126] benchmarked a federated SAC framework against standard SAC, A2C, and TOU-based RBC baselines, validating superior economic and user satisfaction performance. Moreover, hybrid learning baselines—e.g., integrating forecasting into RL—have gained traction. For example, Ref. [
141] evaluated their WOAGA-RL approach against DDPG and traditional ML forecasting methods (LSTM, DeepAR), establishing the value of incorporating predictive modeling into decision-making. This trend indicates that RL-based baselines are now central to performance benchmarking, highlighting the maturity of the field and its shift toward intra-RL comparisons across algorithm families.
Finally, some studies employed hybrid or metaheuristic baselines—such as MOAVOA, WOAGA, or stochastic MILP—particularly in mobile EVCS deployment and multi-objective optimization contexts [
141,
155,
156]. Such comparisons illustrate the growing emphasis on robustness, sustainability, and real-world applicability, placing RL within the broader paradigm of hybrid and integrative control frameworks. In summary, the evolution of baseline methodologies—from simplistic heuristics to sophisticated optimization and learning-based frameworks—signals a broader transition in EVCS research. Rather than merely establishing feasibility, RL methods are increasingly validated through comparisons with state-of-the-art optimization and learning systems. This shift underscores RL’s maturity and its growing potential to serve as a scalable, intelligent control strategy for next-generation EVCS infrastructure.
Across the surveyed studies, baseline hyperparameter-tuning practices were inconsistent and often under-documented. In many cases, authors tuned only their proposed method while leaving baselines at default settings. For example, Ref. [
106] provides extensive details on their DQN architecture but does not report whether the MPC forecast models or FQI baseline were re-tuned for fairness. Similarly, Refs. [
130,
150] benchmarked their approaches against multiple RL algorithms (DQN, SAC, PPO, and MADDPG) but did not disclose search ranges or tuning budgets for those comparators. A second common pattern was the reuse of hyperparameters from prior work or standard toolboxes without re-validation under the current problem setting. For instance, in [
126,
127], researchers compare against actor–critic and Q-learning baselines but adopt fixed parameters, noting that they follow configurations from earlier studies. Finally, several works give insufficient detail for reproducibility, simply listing baselines such as MPC [
125], offline MILP optimization [
151], or TOU-based RBC [
137,
138], but without specifying solver tolerances, forecast model retraining, or parameter sweeps. Such variability creates the risk that poorly tuned baselines may inflate the reported advantages of new methods, a concern already raised in reinforcement learning more broadly. To mitigate such issues, this review strongly recommends the adoption of a common hyperparameter optimization (HPO) protocol: allocate equal tuning budgets across all methods, define and report search spaces for key hyperparameters, apply a consistent optimization strategy to both proposed and baseline algorithms, and disclose solver settings for MPC or offline optimization baselines. Such practices would improve reproducibility and prevent inflated performance claims arising from poorly tuned comparators.
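A minimal sketch of such a protocol is given below, applying the same search space, trial budget, and random-search procedure to the proposed method and to every baseline; the train_and_evaluate callable, the listed hyperparameter ranges, and the budget are hypothetical.

```python
import random

# Common HPO protocol sketch: identical search space, budget, and procedure for all algorithms.
SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-3],
    "gamma": [0.95, 0.99],
    "batch_size": [64, 128, 256],
}
BUDGET = 30   # identical number of trials per algorithm

def tune(algorithm, train_and_evaluate, seed=0):
    """Random search over SEARCH_SPACE; returns the best (score, config) pair."""
    rng = random.Random(seed)
    trials = []
    for _ in range(BUDGET):
        config = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        trials.append((train_and_evaluate(algorithm, config), config))
    return max(trials, key=lambda t: t[0])   # report all trials alongside the best for transparency

# Example usage (hypothetical evaluation function):
# best_proposed = tune("ProposedRL", train_and_evaluate)
# best_baseline = tune("DQN", train_and_evaluate)
```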
6.5. Datasets
The evaluation of recent literature reveals a significant reliance on synthetic data generated via simulation environments or heuristic modeling, primarily due to their flexibility in supporting large-scale scenario testing and controlled policy evaluation [
104,
108,
114,
151,
154] (see
Figure 14—left and right). These synthetic datasets, while not dominant, are often constructed using pricing signals, grid topology models, and stochastic vehicle arrival distributions, allowing researchers to train and validate RL agents in highly customizable settings devoid of real-world noise and variability. However, the larger share of studies has shifted toward using real-world datasets—or combinations of real and synthetic data—to enhance policy generalizability and align training environments with practical operating conditions (see
Figure 14—right). For instance, Ref. [
106] employed real-world EV usage patterns and electricity market prices to train DQN agents, while [
109] leveraged ElaadNL’s real charging session data to evaluate multi-agent grid load management. Likewise, Ref. [
137] used residential microgrid datasets to optimize charging with DDPG, and [
118] combined real-world EV mobility profiles with simulation-based grid models to bridge user and infrastructure perspectives. Recent works such as [
132,
141,
152] even integrated hardware-in-the-loop and historical datasets, emphasizing a growing shift toward data-driven RL frameworks where real-world information underpins deployment-ready policy design.
Broadly, datasets used in RL-based EVCS research may be categorized into five groups: EV-centric, grid and market-related, renewable and storage-related, mobility and traffic, and contextual data. Each serves a distinct role—from capturing charging flexibility and grid dynamics to incorporating renewable integration and external environmental factors [
106,
112,
125,
141]. This diversity reflects the increasing system-level complexity of EVCS frameworks and underscores the need for RL agents capable of learning within interconnected energy and mobility domains.
EV-centric data form the foundation of nearly half the reviewed studies, as they capture vehicle-level behavior and constraints—especially relevant in decision-making around scheduling and flexibility (see
Figure 15—right). Among the most commonly used features are EV arrival/departure times and state-of-charge (SoC) levels, which define the temporal and operational constraints of RL agents [
104,
105,
106,
118] (see
Figure 15—left). Such features enable intelligent scheduling aligned with user objectives and grid availability. In multi-agent frameworks, additional parameters such as booking data and aggregated charging demand are often used to coordinate resources fairly and minimize congestion [
109,
117]. While the field remains focused on optimizing EV–infrastructure interaction, future directions may involve fleet-level coordination and multi-agent systems and thus demand more granular and scalable EV datasets [
129,
145].
Grid and market-related data are critical for embedding RL-based EVCS control into power system operations. Grid-related variables, including transformer capacity, voltage stability, and feeder constraints, were commonly used to ensure grid-compliant charging decisions [
108,
125,
137] (see
Figure 15—left). For instance, transformer loading profiles informed Q-learning agents managing residential EV clusters [
110], while voltage thresholds were incorporated into actor–critic schemes for community-scale DER coordination [
141]. Price signals (TOU, real-time, or dynamic pricing) were widely adopted as reward elements or state inputs, allowing RL agents to respond adaptively to market conditions [
106,
118,
152]. Although less common, frequency-related data have also been explored—especially in studies focused on V2G services for frequency regulation [
134,
140].
Renewable generation and energy storage data have gained prominence in recent years as EVCS research moves toward integrated energy systems. Photovoltaic (PV) profiles were widely used in residential and community-scale studies to align charging with solar availability, reducing peak demand and improving self-consumption [
112,
115,
137] (see
Figure 15—left). Similarly, state-of-charge (SoC) data for battery energy storage systems (BESSs) were essential for hybrid RL models, supporting co-optimization of EV charging and stationary storage [
131,
138]. Actor–critic algorithms—and particularly SAC and DDPG—were frequently deployed in this context due to their ability to handle continuous action spaces and real-time power modulation [
126,
141]. The growing use of such data indicates a shift toward RL-enabled control strategies that support dynamic coordination among EVs, distributed storage, and renewable assets.
Data related to
mobility and traffic were primarily utilized in emerging research on public and fleet-based EVCSs. Such data proved essential for bridging the gap between transportation and power networks, allowing RL agents to not only optimize charging costs but also reduce travel distances and alleviate congestion. To this end, traffic flow, road network topology, and queue length datasets were incorporated in hybrid RL frameworks to support location-aware charging station recommendations and navigation strategies [
111,
135,
147]. Such applications relied mostly on graph-based deep RL or MARL coordination, particularly in vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) contexts [
135,
145]. Although still a niche area, the inclusion of traffic data signals a future direction where EVCS scheduling will be co-optimized with intelligent transportation systems for city-scale electrification strategies.
Auxiliary data such as household load demand and emission factors appeared less frequently but are increasingly relevant in advanced multi-objective RL research (see
Figure 15—left and right). Household load demand was typically included in residential energy management studies, where EVCS operation needs to be coordinated with other appliances to minimize peak demand and user costs [
112,
124]. Emission factors, used in hybrid RL frameworks like WOAGA-RL [
141] and multi-objective MARL [
155], represent a growing effort to internalize environmental impacts into RL training, allowing EVCS policies to optimize not only cost and grid stability but also carbon footprint. Although such data types remained relatively underutilized, their presence marks an important step toward sustainability-aware EVCS scheduling that aligns with future energy transition goals.
6.6. Performance Indexes
Several categories of performance indexes were observed across the literature, spanning economic, grid-related, operational, user-centric, energy-oriented, environmental, and specialized domains. Each category served to evaluate distinct aspects of RL-based EVCS control, collectively contributing to a comprehensive assessment of policy effectiveness, scalability, and real-world applicability. More specifically:
Cost-related performance indexes dominated the field, providing direct evaluation of the economic feasibility of RL policies from both EVCS operator and end-user perspectives (see
Figure 16—left and right). Such metrics included total charging cost, dynamic pricing responsiveness, revenue maximization, and investment indicators such as net present value (NPV) and Levelized Cost of Storage [
104,
114,
131]. For example, cost minimization in day-ahead or real-time market scenarios was used to assess the efficiency of RL agents in exploiting temporal price variations for arbitrage. Revenue-related metrics, meanwhile, often captured aggregator profits or V2G revenues in grid-interactive contexts [
118,
152]. Hybrid frameworks that incorporated forecasting (e.g., WOAGA-RL) further introduced financial risk indicators, such as market imbalance penalties, to evaluate robustness under uncertainty [
141]. The most frequently used economic metric across studies was total charging cost [
104,
113,
128], while revenue metrics were more common in advanced or multi-agent frameworks involving grid services or market participation [
118,
131,
152]. These cost indicators serve not merely as measures of monetary expenditure but also as proxies for the adaptability and decision-making granularity of RL algorithms under volatile pricing conditions—effectively capturing the economic intelligence of RL-driven EVCS control.
Grid-related metrics were similarly prevalent, focusing on evaluating the impact of RL-based scheduling on power system stability and distribution network health (see
Figure 16—left and right). Among these,
peak load reduction was the most widely adopted index, serving as a key measure of load smoothing and transformer stress mitigation [
109,
125]. Other commonly used indicators included load variance, transformer loading frequency, and peak-to-valley ratios [
110,
152]. Recent studies extended such evaluations to include voltage stability margins and penalty-based constraint violations, particularly in actor–critic or hybrid methods dealing with real-world grid constraints [
137,
141]. For example, transformer overload frequency was used in decentralized MARL frameworks to assess coordination effectiveness across agents in shared grid environments [
110]. Additionally, frequency deviation metrics—such as the RMS of frequency variation—were applied in V2G scenarios to evaluate system-level support capabilities [
134]. Thus, grid-related metrics extend beyond purely technical evaluation: they quantify how effectively RL-based EVCS strategies function as flexibility enablers within broader smart grid ecosystems.
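For concreteness, the sketch below computes three of these grid-related indexes from aggregate load profiles before and after RL-based scheduling; the two profiles are synthetic placeholders.

```python
import numpy as np

# Worked example of common grid-related indexes computed from load profiles (kW, synthetic).
baseline_load = np.array([310, 295, 280, 300, 360, 420, 455, 430, 380, 340, 320, 315], float)
rl_load       = np.array([330, 325, 320, 335, 360, 385, 400, 390, 370, 355, 345, 335], float)

peak_reduction = (baseline_load.max() - rl_load.max()) / baseline_load.max()   # fractional peak shaving
load_variance  = rl_load.var()                                                  # flatness of the profile
peak_to_valley = rl_load.max() / rl_load.min()                                  # lower is flatter

print(f"Peak load reduction: {peak_reduction:.1%}, variance: {load_variance:.0f}, "
      f"peak-to-valley ratio: {peak_to_valley:.2f}")
```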
Energy-related performance indexes focused on the integration of EVCS scheduling with distributed energy resources (DERs) and local energy flows. These include PV self-consumption ratios [
112], BESS utilization [
131], energy arbitrage efficiency, and bidirectional power flow in V2G-enabled systems. Among these, PV self-consumption emerged as a key metric in energy-oriented research aiming to align EV charging with RES generation [
112,
131]. In RES-integrated microgrids, such metrics quantified how effectively RL agents synchronized charging with solar peaks to minimize grid imports [
112]. Other studies leveraged battery energy throughput and SoC stability metrics to assess the long-term sustainability of storage-integrated strategies [
138]. Overall, energy-related indices served as indicators of how well RL-based EVCS control can be embedded within broader energy management frameworks, signaling a shift from cost-centric scheduling to energy-autonomous paradigms.
Environmental performance metrics have remained relatively underexplored, despite the growing global emphasis on decarbonization (see
Figure 16—left and right). Such metrics typically include CO₂ emissions, emission reductions compared to baseline strategies [
155], and carbon footprint minimization for mobile renewable-integrated charging stations [
156]. Unlike cost or grid indexes, environmental metrics often depend on exogenous variables such as grid carbon intensity, requiring RL agents to adapt to context-aware, carbon-optimized operation. For instance, hybrid optimization approaches [
141] integrated forecasted grid emission profiles into RL scheduling to achieve environmentally conscious arbitrage. The inclusion of environmental metrics marks a paradigm shift where EVCS control is evaluated not just in terms of economic or technical performance, but also by its contribution to systemic sustainability. However, their limited adoption underscores a significant research gap, highlighting the need for life-cycle-aware emission modeling, carbon-sensitive reward functions, and environmentally coupled policy learning in future RL frameworks [
141,
155,
156].
Operational performance metrics were also widely used to assess both algorithmic performance and system-level viability of RL strategies (see
Figure 16—left and right). While early studies emphasized algorithm speed or convergence, recent literature expanded this scope to include real-time feasibility, robustness under uncertainty, and implementation overhead. These metrics included convergence speed and training stability [
127,
150], station utilization, queue length reduction, and runtime efficiency [
115,
153]. For example, policy convergence rates in DRL settings [
129] were used to validate the applicability of complex models under large state–action spaces. In MARL settings, additional operational indicators such as Pareto front hypervolume and spacing [
155] were employed to assess the effectiveness of multi-objective planning. Battery degradation cost was also occasionally tracked as a constraint or secondary metric [
133], closing the gap between algorithmic outcomes and hardware durability. Among operational metrics, convergence speed emerged as the most commonly used, consistently deployed to ensure both training stability and applicability in real-time EVCS environments [
127,
150].
User-related performance indexes were prevalent in the literature (see
Figure 16—left and right), as they directly connect RL-driven EVCS policies to service quality and user satisfaction. These include SoC at departure, charging completion rates, waiting time, and fairness across heterogeneous EV fleets [
109,
128]. Fairness metrics, for instance, were used to ensure equitable resource allocation, preventing bias toward early-arriving vehicles [
109], while SoC constraints ensured user satisfaction was not compromised for grid objectives [
138]. In mobile or reservation-based settings, additional metrics such as booking acceptance rate [
154] and average navigation energy [
135] extended the scope of user evaluation to mobility-aware decision-making. Collectively, these metrics reflect the increasing importance of human-centric design in RL frameworks, aiming to bridge algorithmic optimization with perceived service quality. Among them, charging delay was the most frequently reported, used as a proxy for temporal satisfaction and system responsiveness [
115,
128].
Beyond these dominant categories, RL-based EVCS studies also introduced
specialized or “other” performance metrics that captured nuances not fully addressed by standard cost, grid, or operational indexes. For example, forecasting accuracy indicators—such as prediction interval coverage probability and average interval score—were adopted in hybrid RL-forecasting models to evaluate decision robustness under uncertainty [
141,
149]. Optimization-specific metrics, including hypervolume, spacing, and inverted generational distance (IGD), were employed to benchmark Pareto front quality and diversity in multi-objective planning scenarios [
155]. Additional metrics, such as reward variance and policy stability, quantified robustness in stochastic environments [
150]. Mobility-oriented indicators like average road speed [
147] and congestion-aware queuing metrics extended the evaluation to power-transportation coupling. Lastly, emerging considerations such as pricing stability [
114] and real-time responsiveness [
140] further expanded the performance landscape. These “other” metrics function as higher-order tools for assessing RL models’ scalability, uncertainty handling, and system-wide integration capabilities—critical for advancing toward holistic, real-world EVCS control solutions.
Across the surveyed works, no study systematically applies a formal fairness index such as Jain’s, Gini, or Theil to quantify equity among EV users. The single mention of “fairness among EVs” in [
109] lacks a defined formula or reproducible metric, making it incomparable across studies. Instead, most papers rely on proxies that only partially reflect fairness, such as departure SoC as a minimum satisfaction guarantee [
128,
129,
138], waiting time or queue length to capture service accessibility [
115,
130,
147], or the number of users successfully served [
136]. While these proxies provide indirect evidence of equitable allocation, they do not reveal distributional disparities (e.g., whether some users consistently pay more, wait longer, or depart undercharged). This highlights a systematic gap in the EVCS RL literature: fairness remains underexplored and is seldom quantified using standardized, interpretable metrics.
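As an example of such a standardized metric, Jain's fairness index, J(x) = (Σxᵢ)² / (n·Σxᵢ²), can be computed directly from per-EV outcomes, as sketched below with hypothetical delivered-energy values.

```python
import numpy as np

# Jain's fairness index over per-EV outcomes (sample values hypothetical).
def jain_index(x):
    x = np.asarray(x, dtype=float)
    return x.sum() ** 2 / (len(x) * (x ** 2).sum())   # ranges from 1/n (worst) to 1.0 (perfectly fair)

delivered_kwh = [22.0, 18.5, 21.0, 6.0, 20.5]   # one under-served EV drags the index down
print(f"Jain fairness: {jain_index(delivered_kwh):.3f}")
```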
6.7. EVCS Types
The analysis of RL-based EVCS applications reveals distinct methodological patterns across different deployment types, shaped by the associated objectives, agent architectures, and system complexities. More specifically:
Residential EVCS deployments primarily focused on cost minimization, demand response integration, and renewable energy utilization. These scenarios predominantly employed single-agent deep RL methods—particularly DQN and DDPG—due to their capacity to operate efficiently within limited-scale environments [
106,
112,
128] (see
Figure 17—left and right). Residential-focused studies often utilized real-world datasets to align EV charging with PV generation and dynamic tariff structures, optimizing for metrics such as charging delay and SoC satisfaction. More recent research extended to multi-agent RL, enabling household-level coordination and transformer load balancing in distribution networks [
110].
In contrast,
public and workplace EVCSs represented the most commonly studied category (see
Figure 17—left and right), with a greater focus on grid-interactive coordination and congestion mitigation. Value-based RL techniques such as FQI and SARSA, along with hybrid RL–optimization frameworks, have been employed to manage peak load and flatten demand profiles [
104,
108,
154]. Multi-agent actor–critic models were particularly prominent in this space, addressing large-scale urban deployments where metrics such as station utilization, user waiting time, and booking acceptance rate are central [
135,
139]. Due to the inherent complexity of mobility-grid coupling, these applications often relied on simulation-based environments, including tools like SUMO, to facilitate mobility-aware EVCS scheduling [
147].
Community EVCS scenarios often incorporate distributed energy resources such as ESS and V2G capabilities. Such setups leveraged advanced actor–critic algorithms like SAC and TD3 [
137,
141], with MARL gaining traction for coordinating among aggregated chargers and distributed resources [
129,
150]. Research in this domain frequently utilizes hybrid RL–optimization models that simultaneously address economic, grid-supportive, and environmental objectives. The result is a transition toward multi-objective, carbon-aware planning that better aligns with emerging smart community infrastructure paradigms.
Mobile and fleet-based EVCSs constituted a more recent and rapidly evolving application domain, often characterized by dynamic topology and mobility constraints. Mobile and fleet-based studies predominantly employed hybrid RL strategies that combine Q-learning with evolutionary optimization or stochastic decision-making for charging location planning and route optimization [
142,
156]. Multi-agent RL architectures were commonly applied in V2V and vehicle-to-infrastructure (V2I) coordination schemes [
135,
145], where cooperative agent behavior was necessary to minimize travel distance, reduce waiting times, and alleviate traffic or grid congestion.
Lastly, emerging research on
highway- and mixed-type EVCSs has begun to integrate high-power charging infrastructure with renewable energy and storage systems. These contexts typically adopt hybrid frameworks—such as RL combined with MILP, AVOA, or other metaheuristic optimizers—to meet long-horizon planning and scalability requirements [
151,
155]. Such approaches were aimed at balancing cost efficiency with grid stability, often under uncertain mobility demand and renewable generation profiles.
Overall, while residential and public EVCS applications are relatively mature and frequently utilize single-agent value-based or actor–critic RL methods, recent trends clearly point toward hybrid and multi-agent frameworks for community, fleet, and mobile deployments. This transition reflects the growing complexity and interdependence of modern EVCS ecosystems, where coordination, multi-objective optimization, and grid-supportive behavior are vital for scalable and real-world deployments.