1. Introduction
Electric vehicles (EVs) are increasingly viewed as a promising solution to the critical challenge of meeting global emission reduction targets. The recent literature highlights EVs as a sustainable alternative for mitigating the environmental impacts caused by the transportation sector, supported by extensive studies demonstrating their lower carbon emissions [1]. However, while EV adoption offers significant environmental benefits, it also introduces major challenges to energy infrastructure. As EV popularity grows, the demand for charging stations places substantial strain on power grids, especially during peak hours. The flexible and unpredictable nature of EV charging makes simply expanding the power network impractical, as it leads to oversized infrastructure and unnecessary maintenance costs, with charging stations often operating below full capacity. Uncertainties related to EV usage patterns, such as charging requests, arrival and departure times, and fluctuating loads, further complicate the development of efficient charging solutions. To address these challenges, researchers are increasingly collaborating with industry stakeholders to design sustainable EV charging management systems that maintain grid stability while ensuring system reliability, security, and the fulfillment of EV charging demands.
The installation of EV charging stations also differs considerably across building and operational environments, each with different demand profiles, constraints, and optimization goals. Household charging usually happens overnight at a low power level, but it carries a high simultaneity risk during evening peaks [2]. Commercial and office buildings have charging demands during the day that correspond with working hours, which may be complemented by on-site solar generation [3].
University campuses present their own unique challenges when it comes to parking facilities: arrival and departure times are staggered, and some vehicles stay for long periods. On the other hand, the large rooftops of university parking facilities make them ideal sites for installing solar panels. Hotels and malls experience highly intermittent, high-turnover charging events, which necessitate an adaptive schedule that balances visitor convenience against grid overloading [4]. Public charging stations and fleets suffer from scale and variability issues, which require serious real-time coordination [5]. Despite such varying contexts, research to date has generally approached EV charging as a generic problem, with insufficient adaptation of the reinforcement learning (RL) model across sectors to cover the specific temporal and spatial profile of each one [6]. RL-based EV charging and energy management systems are further discussed in the recent literature. Model-free deep reinforcement learning methods are suggested in [7] for real-time charging scheduling in smart grids, which helps with adaptive decision-making in uncertain operating environments. Q-learning-based schemes have been further utilized in [8] for electricity pricing and vehicle-to-grid interaction to ensure cost-effective dynamic charging control. Additionally, deep RL methods are discussed in [9] to maintain voltage stability in distribution networks during high EV penetration. In [10], coordinated charging schemes integrated with renewable energy sources are suggested to enhance solar energy exploitation in microgrids.
Smart charging strategies aim to address the EV charging problem by pursuing multiple objectives, including maximizing charging revenues, minimizing energy costs, reducing power losses, and lowering the carbon footprint. In [11], the authors provide a solution that utilizes vehicle-to-grid (V2G) as a reactive power compensation method in a distributed coordination scheme and present the economic benefits of this method. In [12], the authors propose a cost-based optimization framework that jointly optimizes electric vehicle charging (EVC) schedules and photovoltaic (PV) reactive power management. Their approach aims to minimize the total system cost while maintaining grid stability and improving the integration of renewable energy resources. In [13], a two-stage control scheme is developed to enable rapid V2G discharging. The first stage ensures optimal energy extraction from EV batteries, while the second stage provides fast and reliable support to the grid, particularly during peak demand periods or sudden fluctuations. Meanwhile, in [14], the authors present a comprehensive review of strategies for coupling renewable energy sources with electric vehicle systems. This review covers key aspects such as integration architectures, energy management methods, and the operational challenges that arise when combining EVs with intermittent renewable generation, highlighting pathways to improve sustainability and grid resilience.
Instead of investigating smart EV coordination directly, the authors in [15,16] use data-driven approaches to optimize EV charging problems. In [17], the authors propose a multi-agent deep reinforcement learning framework for scheduling EV charging. In another work [18], the authors formulate the EV charging problem as a constrained Markov Decision Process and search within a safe deep reinforcement learning model that uses a constrained policy optimization technique [19]. Furthermore, the authors in [7] propose a model-free approach integrated with a Markov Decision Process (MDP), enabling individual EVs to autonomously determine optimal charging and discharging schedules. In [6], a comprehensive review is presented on EV charging management systems based on reinforcement learning techniques, highlighting key methodologies, challenges, and advancements in the field.
Intelligent optimization strategies leveraging artificial intelligence (AI) and machine learning (ML) techniques, including reinforcement learning (RL), have demonstrated significant potential in optimizing EV charging processes by adapting to dynamic environmental conditions and user preferences [6,7,20]. Among these techniques, RL has emerged as a particularly promising approach for addressing the challenges of EV charging optimization. As a branch of machine learning, RL involves an agent that continuously interacts with an environment, learning optimal decision-making strategies through feedback in the form of rewards and penalties. In the context of EV charging systems, the charging controller functions as the “agent,” while the “environment” encompasses factors such as grid power availability, EV battery states, renewable energy generation levels, and real-time electricity pricing. By learning from these dynamic conditions, RL-based controllers can effectively optimize charging operations, balancing user needs with grid and energy constraints.
As presented in [6], previous research has discussed the potential of RL for optimizing EV charging strategies. A major advantage of reinforcement learning techniques in EV charging is their adaptability to dynamic conditions. This helps address the variability in renewable energy availability, such as solar power, because reinforcement learning algorithms can reschedule charging in real time to maximize the utilization of green energy and thus minimize dependence on the grid.
The problem of EV coordination arises within the energy management system, which controls the charging parameters while attaining goals such as financial and operational objectives. The literature review in [6] highlights that the integration of RL with EV charging systems is still in its early stages.
The RL studies mentioned above are applied to small-scale test beds; their viability is not validated on practical systems, such as the IEEE-33 bus system, to evaluate the impact of EV penetration. The RL models used in these studies are deterministic, with simplified renewable power generation and smooth demand patterns; uncertainty in solar power generation, unpredictable arrival and departure times of EVs, and variability in user behavior are not sufficiently modeled. These approaches compare RL against heuristic and rule-based methods, but comparison with advanced optimization techniques such as Model Predictive Control (MPC) is not sufficiently discussed.
Although RL has shown potential for optimizing charging processes, the field faces significant deployment challenges. Most studies are limited to theoretical models and simulations, with few addressing real-world implementation issues such as scalability, adaptability to dynamic grid conditions, and integration with existing infrastructure. RL-related studies often lack validation under scalability and realistic power system conditions, such as voltage drops and line losses. This research addresses these literature gaps by applying a reinforcement learning (RL) solution for EV charging using the adaptive charging network (ACN) framework within an actual energy system, as discussed in [21]. Unlike previous studies, this approach uses real-time observation of the LV network, which improves practical system integration. This study further demonstrates the scalability of the proposed system by integrating it into the IEEE-33 bus power network to evaluate its effectiveness in managing charging events while improving grid stability and renewable energy utilization.
This paper proposes a reinforcement learning framework that accounts for renewable energy resources such as solar energy, optimizes the management of grid load, and addresses dynamic conditions in real time. The approach provides a more scalable, efficient, and sustainable solution for EV charging than traditional optimization techniques by adapting to dynamic energy availability, charging requirements, and grid conditions. Several reinforcement learning agents are used to address the coordination problem of EV charging from different methodological perspectives: deep Q-network (DQN), proximal policy optimization (PPO), and advantage actor–critic (A2C) agents. Each agent is based on an algorithm with its own advantages concerning learning efficiency, stability, and adaptability to complex environments. The objective of this paper is to compare the performance of these RL-based algorithms in optimizing EV charging policies under dynamic campus conditions, including variable solar energy availability, peak power constraints, and users’ energy demands. The key contributions of this research are summarized as follows:
Development of a dynamic EV charging controller: An RL-based controller is proposed that adaptively adjusts EV charging rates depending on solar energy availability, individual EV charging requirements, and overall grid load conditions.
Use of real-world datasets: Contrary to most simulation studies, the proposed work uses real-world datasets to train and test the RL agents under uncontrolled EV demand: a stochastic sequence of historical data from Qatar University (QU) and real-world online charging data collected from Caltech. This makes the model more robust and relevant to real-world campus environments.
Benchmarking of multiple RL algorithms for EV charging: The performance of three RL techniques, namely DQN, A2C, and PPO, is compared within the same environment.
Real-world scalability and impact analysis: The proposed RL controller is integrated into a standard IEEE-33 bus power distribution model, validating its performance under large-scale EV penetration. This analysis of scalability bridges a common gap between theoretical RL studies and practical power system applications.
Integrated management of peak load and renewable intermittency: The proposed solution effectively manages peak grid loads while accounting for the intermittency of renewable energy sources, with performance benchmarked across multiple case studies.
The proposed controller framework serves as a scalable foundation for future EV charging infrastructure research, particularly for large-scale deployments integrated with renewable energy resources. The structure of the rest of this paper is as follows:
Section 2 provides a review of the literature discussing EV charging and RL application;
Section 3 presents and formulates the charging problem in detail;
Section 4 discusses the RL controller and its design and implementation;
Section 5 presents the experimental results of the study and a discussion of those results, including the proposed approach’s strengths and limitations; and
Section 6 concludes the paper and recommends directions for future research.
3. Proposed Method
3.1. Problem Formulation
This study focuses on the optimization of EV charging operations integrated with solar (photovoltaic, PV) energy generation within a university campus setting. The system under consideration includes the main university building, rooftop-installed solar panels, and EV chargers located in the campus parking lot.
The Main Charger Controller (MCC) works as a centralized decision-making agent, receiving real-time information from three major subsystems, which are illustrated in Figure 3. These three subsystems are the EV chargers, the solar PV system, and the distribution system, which supplies the aggregated active power information ($P_{agg}$).
The solar PV system comprises rooftop-mounted photovoltaic panels. It provides continuous, real-time power generation information ($P_{PV}$), enabling priority operation of solar-smart charging during periods of intense solar radiation.
The distribution system module monitors the total electrical load demand of the campus building, covering both active demand ($P_{load}$) and reactive demand ($Q_{load}$). The difference between the active power supplied by the grid, $P_t$, and the active power demanded by the load, $P_{load}$, is represented as $P_{agg}$ and is supplied to the MCC. This information enables the MCC to prevent the local transformer from being overloaded and, in turn, helps to plan EV discharge times when the building load demand is low.
EV chargers are deployed in the parking lot of the campus. These are Level 2 charging stations, which provide real-time charging session details to the MCC. This helps the MCC to determine charging rates ($I_{ch}$) for each EVSE and adjust charge timing as needed.
The main objective is to coordinate charging while prioritizing solar energy use, satisfying users’ energy requirements, and maintaining operational constraints. The charging method proposed in this study maximizes the delivered energy and reduces charging peaks by exploiting the staggered arrival and departure times. Another key objective is to utilize the charging infrastructure more efficiently than competing algorithms.
The proposed RL model solves the EV charging management problem. The workflow in Figure 4 illustrates the proposed RL framework, which starts with energy requests at the charging station and ends with the RL model returning the charging rates to the EV chargers.
3.2. RL Workflow
The workflow presented in Figure 4 outlines the development of the RL model, where the agent is trained while continuously updating the states in the environment using the simulator. It represents the end-to-end workflow of the RL-based charging coordination system, outlining the interaction among physical infrastructure, simulation, training, and deployment. EV charging sessions are modeled through an event table containing arrival time, departure time, requested energy, and station ID; the sources are random generation, QU historical stochastic data, and the Caltech online dataset. The open-source simulator emulates the charging network: it takes in charging rates from the RL agent, applies system constraints (transformer limits, voltage, and plug-in events), and returns updated states in terms of remaining EV energy, solar generation, and building load.
The first step in this process is to formalize the MDP for the RL framework. The MDP is designed to optimize the cumulative reward and encompasses three main components: states, actions, and rewards. The MDP follows a four-tuple structure (S, A, R, T), where S represents a finite set of states, A denotes a finite set of actions, R is the reward function, and T defines the state transition function. The environment is set up by defining the states and observation functions, with the reward function structured according to the specific objectives of the EV charging optimization.
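Before detailing each element, the following is a minimal Gym-style skeleton of this MDP (using gymnasium, the maintained fork of OpenAI Gym); the class name, field layout, and horizon are illustrative assumptions, not the paper’s implementation:

```python
# Minimal sketch of the (S, A, R, T) structure as a Gym environment;
# all names and dimensions are illustrative assumptions.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class EVChargingEnv(gym.Env):
    """Toy EV charging MDP illustrating states, actions, reward, transition."""

    def __init__(self, n_chargers=4, horizon=540):  # 540 one-min steps
        self.n_chargers, self.horizon = n_chargers, horizon
        # S: [P_PV, P_load] plus each charger's remaining demand (kWh)
        self.observation_space = spaces.Box(
            0.0, np.inf, shape=(2 + n_chargers,), dtype=np.float32)
        # A: a charging current for every charger, 0-32 A (continuous case)
        self.action_space = spaces.Box(
            0.0, 32.0, shape=(n_chargers,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.state = np.zeros(2 + self.n_chargers, dtype=np.float32)
        return self.state, {}

    def step(self, action):
        # T: a real implementation delegates to the simulator here,
        # updating solar power, building load, and remaining EV demands.
        self.t += 1
        reward = 0.0          # R: solar reward minus constraint penalties
        terminated = self.t >= self.horizon
        return self.state, reward, terminated, False, {}
```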
In the next phase, data collection plays a key role: an event table is created covering one operating day from 6:00 a.m. to 3:00 p.m., depending on the demand data (random, historical, or online), with 1 min time steps, representing the arrival times, departure times, and energy requests for all chargers in the network.
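For illustration, a few rows of such an event table might look as follows; the column names and values are assumed, not taken from the actual datasets:

```python
# Illustrative event table for one operating day (column names assumed).
import pandas as pd

events = pd.DataFrame({
    "station_id":    ["EVSE-01", "EVSE-02", "EVSE-03"],
    "arrival":       ["06:15", "07:40", "09:05"],   # 1-min resolution
    "departure":     ["11:30", "14:00", "15:00"],
    "requested_kwh": [12.5, 8.0, 20.0],
})
print(events)
```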
During the training phase, the ACN-Sim open-source simulator calls the scheduling algorithm whenever there is a charging request [24]. The RL agent determines the charging rates and sends them back to ACN-Sim, which updates the system states, such as the remaining energy in the EVs, until all scheduling events are completed. This process is iterated across multiple episodes based on the training requirements and testing criteria.
In the testing phase, the trained RL agent interacts with the ACN-Sim platform to set charging rates in the event tables and retrieves system states for the reward function. The simulator outputs various states, such as the remaining energy in EVs, delivered energy, and active chargers, which are then used to evaluate the agent’s performance. The ACN-Sim environment is Python-based and integrates with OpenAI Gym for RL training and testing [25]; the simulation environment is implemented using Python 3.13.
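The train/test interaction then follows the standard Gym loop; in this sketch, the illustrative EVChargingEnv defined earlier stands in for the ACN-Sim Gym wrapper, and the random action is a placeholder for the agent’s policy:

```python
# Sketch of the episode loop; EVChargingEnv is the illustrative
# environment above, standing in for the ACN-Sim Gym wrapper.
env = EVChargingEnv()
for episode in range(10):                    # iterate across episodes
    obs, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()   # placeholder for the policy
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
```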
The training of the RL agent is carried out across the different scenarios listed in Table 1, with performance evaluated through training, testing, and comparison against EV charging coordination benchmark algorithms such as Model Predictive Control (MPC) and the uncontrolled and Least-Laxity-First (LLF) algorithms [26].
Once the RL model is trained, it is deployed via a cloud-based interface, such as Streamlit, providing a user interface that facilitates interaction between EV chargers, users, and the trained RL agent [22,27].
3.3. MDP Formulation
The MDP framework comprises a four-tuple (S, A, R, T), where S denotes a finite set of states, A denotes the finite set of actions, R denotes the reward function, and T denotes the state transition function [28].
The problem of EV charging coordination is modeled as an MDP (see Figure 3), which is defined using the elements presented below.
3.3.1. State Space (S)
At a certain decision time step, the state comprises the available solar power, the building electrical load, and individual active EVSE parameters, such as the requested energy and the times of arrival and departure.
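As a small illustration, the state vector at one decision step could be assembled as follows; the field names and values are assumptions made for exposition:

```python
# Illustrative state-vector construction at decision step t.
import numpy as np

def build_state(p_pv, p_load, evse_sessions):
    """Concatenate global signals with per-EVSE session features:
    remaining requested energy (kWh) and time until departure (min)."""
    per_evse = [(s["remaining_kwh"], s["minutes_to_departure"])
                for s in evse_sessions]
    flat = [x for pair in per_evse for x in pair]
    return np.array([p_pv, p_load] + flat, dtype=np.float32)

state = build_state(45.0, 120.0, [
    {"remaining_kwh": 6.2, "minutes_to_departure": 150},
    {"remaining_kwh": 0.0, "minutes_to_departure": 0},   # idle EVSE
])
```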
3.3.2. Action Space (A)
The action space is the set of charging rates at a given time t, selected by the controller for each charger. The charging rate is the charging current applied at a certain voltage level. For the discrete action space, the agent selects a vector of charging rates.
This study explores both discrete and continuous action spaces to benchmark their effectiveness as follows (see the sketch after this list):
Discrete action space: Allowable rates [0, 8, 16, 24, 32] A.
Continuous action space: The charging current is a continuous value ranging from 0 to 32 A.
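Both variants map naturally onto Gymnasium action spaces; the sketch below is illustrative, with the number of chargers N assumed:

```python
# Both action-space variants expressed with Gymnasium spaces (a sketch).
import numpy as np
from gymnasium import spaces

N = 4                                   # number of chargers (example)
RATES = [0, 8, 16, 24, 32]              # allowable discrete currents (A)

# Discrete: each charger picks one index into RATES.
discrete_space = spaces.MultiDiscrete([len(RATES)] * N)

# Continuous: each charger picks any current in [0, 32] A.
continuous_space = spaces.Box(low=0.0, high=32.0, shape=(N,),
                              dtype=np.float32)

idx = discrete_space.sample()           # e.g., array([2, 0, 4, 1])
currents = np.array(RATES)[idx]         # map indices to amperes
```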
3.3.3. Reward Function (R)
The reward function is expressed in terms of linear equations for a given system. It includes a positive reward, $r_{solar,t}$, and penalties, $d_{current,t}$, $d_{charger,t}$, and $d_{unmet,t}$, combined as a weighted sum of sub-rewards and penalties that directs the RL agent to fulfill the various objectives.
The following objectives and constraints define the rewards and penalties:
Solar utilization reward: A positive reward is granted when the system charges during periods of high solar availability, which in this case corresponds to midday hours.
Current constraint violation penalty: The sum of the charging rates across the active chargers is constrained by the maximum allowed limit, $I_{max}$. Denoting the number of active chargers by $N_{act}$, this constraint is defined in Equation (1) as follows:
$$\sum_{k=1}^{N_{act}} I_{ch,k,t} \leq I_{max} \quad (1)$$
Charger violation penalty: A negative reward equal to the violation magnitude is subtracted from the total system reward whenever the last schedule violates an individual EVSE constraint.
Unmet demand penalty: This penalizes the failure to meet the energy demanded by the user before their departure; it is defined as the unmet fraction of the requested energy, i.e., the ratio of the remaining demand to the energy requested by the user.
Combining the solar utilization reward, $r_{solar,t}$, with the penalties for current violations, $d_{current,t}$, charger violations, $d_{charger,t}$, and unmet demand, $d_{unmet,t}$, the reward function, $R_t$, is mathematically expressed as:
$$R_t = \lambda_1 r_{solar,t} - \lambda_2 \left( d_{current,t} + d_{charger,t} \right) - \lambda_3 d_{unmet,t} \quad (2)$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients used to optimize the trade-off between grid safety, user satisfaction, and solar utilization. The positive reward, $r_{solar,t}$, for solar utilization is defined as:
$$r_{solar,t} = \min\left( P_{ch,t},\ P_{PV,t} \right) \quad (3)$$
The penalty, $d_{current,t}$, for violating the total current limit is:
$$d_{current,t} = \max\left( 0,\ \sum_{k=1}^{N_{act}} I_{ch,k,t} - I_{max} \right) \quad (4)$$
The penalty, $d_{charger,t}$, for violating an individual charger limit is:
$$d_{charger,t} = \sum_{k=1}^{N_{act}} \max\left( 0,\ I_{ch,k,t} - I_{ch,k}^{max} \right) \quad (5)$$
The penalty, $d_{unmet,t}$, for unmet energy demand at departure is:
$$d_{unmet,t} = \sum_{i \in \mathcal{D}_t} \left( 1 - \frac{E_{del,i}}{E_{req,i}} \right) \quad (6)$$
where $\mathcal{D}_t$ is the set of EVs departing at time step $t$, $E_{del,i}$ is the energy delivered to EV $i$, and $E_{req,i}$ is the energy it requested.
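To make the composition of Equations (2)–(6) concrete, the following is a minimal Python sketch of the reward computation; variable names and default weights are illustrative assumptions:

```python
# Sketch of the reward R_t from Equations (2)-(6); names and default
# weights are illustrative assumptions.
import numpy as np

def reward(i_ch, i_max_total, i_max_per, p_ch, p_pv, departing,
           lam1=1.0, lam2=1.0, lam3=1.0):
    """i_ch: per-charger currents (A); departing: list of
    (delivered_kwh, requested_kwh) pairs for EVs leaving this step."""
    i_ch = np.asarray(i_ch, dtype=float)
    r_solar = min(p_ch, p_pv)                             # Equation (3)
    d_current = max(0.0, i_ch.sum() - i_max_total)        # Equation (4)
    d_charger = np.maximum(0.0, i_ch - i_max_per).sum()   # Equation (5)
    d_unmet = sum(1.0 - e_del / e_req                     # Equation (6)
                  for e_del, e_req in departing if e_req > 0)
    return lam1 * r_solar - lam2 * (d_current + d_charger) - lam3 * d_unmet
```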
3.3.4. Transition Function (T)
The transition function is stochastic and only partially observable, consistent with the requirements of model-free RL algorithms, and it is based on the following dynamics:
The arrivals and departures of vehicles.
The requested energy, the energy delivered by the charger, and the remaining energy demand.
The fluctuations in available solar energy.
Variations in the usage of building energy.
3.4. Models, Objectives, and Constraints
3.4.1. The Distribution Network Model
Figure 3 represents the distribution network model, which combines the mathematical models representing several parts of the distributed system. According to the constraints in Equations (7) and (8), power can either be drawn from the grid during charging or supplied back into the grid, depending upon the power supplied by the PV source at node $i$ and at all buses, $n$. These two processes are mutually exclusive, as represented by the binary variable $u$ in the constraint in Equation (9):
$$0 \leq P_{grid,i,t} \leq u_{i,t}\, P_{grid,i}^{max}, \quad i = 1, \dots, n \quad (7)$$
$$0 \leq P_{inj,i,t} \leq \left( 1 - u_{i,t} \right) P_{inj,i}^{max}, \quad i = 1, \dots, n \quad (8)$$
$$u_{i,t} \in \{0, 1\} \quad (9)$$
3.4.2. Solar Power Model
From Equation (10), the instantaneous power generated by the solar PV array, $P_{PV,t}$, depends on the capacity factor, $CF_t$, and the solar PV rating, $P_{PV}^{rated}$:
$$P_{PV,t} = CF_t \cdot P_{PV}^{rated} \quad (10)$$
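As a brief illustration, the snippet below evaluates Equation (10) over the operating window; the capacity-factor profile is an invented example, not measured data:

```python
# Evaluating Equation (10) with an assumed capacity-factor profile.
import numpy as np

p_pv_rated = 100.0                       # kW, array rating (example)
hours = np.arange(6, 16)                 # 06:00-15:00 operating window
cf = 0.8 * np.clip(np.sin((hours - 6) / 9 * np.pi), 0, None)  # assumed CF
p_pv = cf * p_pv_rated                   # instantaneous PV power, kW
```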
3.4.3. Plug-In Electric Vehicles
The maximum rates of the EV charger, given in Equation (11), limit the charging and discharging of the vehicle:
$$-I_{ch}^{max} \leq I_{ch,k,t} \leq I_{ch}^{max} \quad (11)$$
3.4.4. Charging Piles
Equation (12) defines the aggregated current, $I_{agg,t}$, as the summation of the simultaneous charging currents across the $N$ chargers:
$$I_{agg,t} = \sum_{k=1}^{N} I_{ch,k,t} \quad (12)$$
where $k$ indexes the active chargers that are requesting charging. In Equation (13), the transformer rating (TR) constrains the aggregate current:
$$I_{agg,t} \leq TR \quad (13)$$
Equation (14) defines the power drawn by an individual charger:
$$P_{ch,k,t} = V_{ch}\, I_{ch,k,t} \quad (14)$$
The charger voltage, $V_{ch}$, is defined as the bus voltage: the voltage between node $i$ and node $j$ is equal to the voltage across the chargers. The expression for $V_{j,t}$ is taken from [29,30,31] and is given by:
$$V_{j,t} = V_{i,t} - \frac{R_{ij} P_{ij,t} + X_{ij} Q_{ij,t}}{V_0} \quad (15)$$
3.5. RL Algorithms (Benchmarking)
The problem of EV charging management is mathematically modeled as a multi-objective optimization problem in which the agent learns optimal charging policies through interactions with the environment. As mentioned before, the three algorithms discussed are DQN, PPO, and A2C. Each algorithm offers a distinct learning mechanism, as presented in Table 1.
Table 1. Comparison of different RL algorithms [24].
| Method | Type | Strengths | Challenges | Reference |
|---|---|---|---|---|
| DQN | Value-based | Stable learning with experience replay | Sensitive to hyperparameters | [32] |
| PPO | Policy gradient | Stable and efficient updates | May require tuning clipping range | [33] |
| A2C | Actor–Critic | Low-variance gradient estimates | Synchronous updates can be slow | [34] |
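For concreteness, the snippet below shows one possible way to instantiate the three benchmarked agents with the stable-baselines3 library; the paper does not name its implementation, so the library choice, the flattened discrete action space for DQN, and all hyperparameters are assumptions:

```python
import numpy as np
from gymnasium import spaces
from stable_baselines3 import DQN, PPO, A2C

# Assumption: EVChargingEnv is the illustrative environment sketched in
# Section 3.2. DQN requires a discrete action space, so a flattened
# variant (one shared rate index) is used here purely for illustration.
class DiscreteEVChargingEnv(EVChargingEnv):
    RATES = [0, 8, 16, 24, 32]          # allowable currents (A)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.action_space = spaces.Discrete(len(self.RATES))

    def step(self, action):
        currents = np.full(self.n_chargers, self.RATES[action],
                           dtype=np.float32)
        return super().step(currents)

dqn = DQN("MlpPolicy", DiscreteEVChargingEnv(), buffer_size=50_000)
ppo = PPO("MlpPolicy", EVChargingEnv(), clip_range=0.2)
a2c = A2C("MlpPolicy", EVChargingEnv())
for agent in (dqn, ppo, a2c):
    agent.learn(total_timesteps=10_000)  # identical training budget
```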
3.5.1. Deep Q-Network (DQN)
The DQN technique is a value-based method that uses a deep neural network to approximate the Q-function, $Q(s, a; \theta)$. The Q-function estimates the cumulative reward for taking action $a$ in state $s$, using the online network parameters, $\theta$, to evaluate the value of state–action pairs. The technique stabilizes training through experience replay as well as a target network. The Q-value update rule is:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q'\left(s_{t+1}, a'; \theta^{-}\right) - Q\left(s_t, a_t; \theta\right) \right]$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor [32]. The two main techniques used in DQN to stabilize training are as follows:
Experience replay: Breaks the correlation between consecutive samples by storing past transitions and sampling them randomly.
Target network: A separate network, denoted $Q'(s, a; \theta^{-})$, is used to calculate the target value. The target network is periodically updated to improve training stability.
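The target computation can be sketched in a few lines of PyTorch; the network sizes and batch data below are illustrative placeholders:

```python
# Sketch of the DQN target computation with a target network (PyTorch).
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 5, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # periodic sync

s = torch.randn(32, obs_dim)                     # sampled batch (replay)
a = torch.randint(0, n_actions, (32,))
r, s_next = torch.randn(32), torch.randn(32, obs_dim)

with torch.no_grad():                            # target uses frozen weights
    td_target = r + gamma * target_net(s_next).max(dim=1).values
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_sa, td_target)   # minimize TD error
```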
3.5.2. Advantage Actor–Critic (A2C)
A2C is a synchronous actor–critic method that utilizes multiple agents collecting experience in parallel. It is a policy-gradient method in which the actor represents the policy function and the critic represents the value function, reducing variance and improving stability. The critic estimates the state values and indicates the direction in which the actor should update the policy parameters. The advantage function reduces the variance of the gradient estimate and is given by:
$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$
In practice, it is estimated as:
$$A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$$
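This one-step advantage estimate translates directly into code; the value-network architecture below is an illustrative placeholder:

```python
# One-step advantage estimate used by A2C (a sketch).
import torch
import torch.nn as nn

obs_dim, gamma = 8, 0.99
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, 1))

s, s_next = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
r = torch.randn(32)
with torch.no_grad():
    v_s = value_net(s).squeeze(-1)
    v_next = value_net(s_next).squeeze(-1)
advantage = r + gamma * v_next - v_s     # A(s,a) = r + gamma*V(s') - V(s)
```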
3.5.3. Proximal Policy Optimization (PPO)
The policy, $\pi_\theta(a|s)$, is the core RL function: it defines the probability of taking action $a$ in a given state $s$. The goal of RL is to find the policy that is optimal with respect to the reward function. This study uses PPO, a stochastic policy-gradient optimization method whose application in deep reinforcement learning is robust and comparatively insensitive to hyperparameter tuning; it has been used in several applications discussed in the literature [34,35]. PPO optimizes a clipped surrogate objective based on the probability ratio:
$$r_t(\theta) = \frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{old}}\left(a_t \mid s_t\right)}$$
The clipped objective function improves training stability and maintains a balance between policy improvement and exploration:
$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t,\ \operatorname{clip}\left( r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon \right) A_t \right) \right]$$
where $\epsilon$ is a small hyperparameter. The RL agent seeks to learn an optimal policy, $\pi^*$, that maximizes the expected cumulative reward:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \gamma^t R_t \right]$$
where $\gamma$ is the discount factor, $T$ is the episode length, and $R_t$ is the reward at time $t$.
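The clipped surrogate loss is compact in code; the tensors below are illustrative stand-ins for a rollout batch:

```python
# Sketch of the PPO clipped surrogate loss (PyTorch).
import torch

eps = 0.2                                   # clipping range epsilon
log_prob_new = torch.randn(32)              # log pi_theta(a|s), current
log_prob_old = torch.randn(32)              # log pi_theta_old(a|s), fixed
advantage = torch.randn(32)                 # advantage estimates A_t

ratio = torch.exp(log_prob_new - log_prob_old)           # r_t(theta)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
loss = -torch.min(unclipped, clipped).mean()             # maximize L^CLIP
```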
3.6. Performance Evaluation
The testing phase includes benchmarking of the trained RL model against various optimization and scheduling methods used in the charging management of electric vehicles. The comparison metrics evaluate performance across several objectives, including infrastructure savings, solar energy utilization, total energy delivered, and maximum aggregated current. Additionally, the final stage includes integration of the RL model into the IEEE-33 benchmark power network for impact analysis on the distribution network. Load flow simulations were conducted to evaluate line losses and voltage profiles within the power system.
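The paper does not name its load-flow tool; as one possibility, the IEEE-33 bus study could be reproduced with the open-source pandapower package, which ships the 33-bus (Baran–Wu) test feeder. The bus index and load values below are illustrative:

```python
# Possible IEEE-33 bus load-flow study using pandapower (an assumption;
# the paper does not name its load-flow tool).
import pandapower as pp
import pandapower.networks as pn

net = pn.case33bw()                    # IEEE 33-bus (Baran-Wu) feeder
pp.create_load(net, bus=17, p_mw=0.05, q_mvar=0.01,
               name="EV charging station")  # aggregated EV demand (example)
pp.runpp(net)                          # AC power flow
print(net.res_bus.vm_pu.min())         # worst-case voltage (p.u.)
print(net.res_line.pl_mw.sum())        # total line losses (MW)
```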
6. Conclusions
This paper develops a smart EV charging management system deploying artificial intelligence, namely reinforcement learning (RL). The developed RL agent achieved the target objectives, including energy delivery to EVs, solar energy utilization, and reduction of peak charging currents. The focus of the study is to train the RL agent for peak reduction while satisfying the other objectives, thereby limiting the infrastructure expansion required for EV charging.
The proposed model is trained using real data from the QU campus, and the results are benchmarked against uncontrolled and Model Predictive Control (MPC) strategies. The results show more efficient use of the charging infrastructure while delivering the energy requested by users. The findings demonstrate a reduction in the maximum aggregated current, which places less stress on the grid. A key contribution of this study is the comparison of three RL algorithms under real-world demands, identifying the appropriate agent and reward function to achieve the required objectives. The model targets campus charging, which has specific demand behavior, and the study uses real-world data that includes actual arrival and departure times.
The results show the importance of selecting an appropriate RL agent and realistic demand patterns when optimizing the reward for the EV charging management problem. The PPO-RL agent learned faster than A2C and DQN and achieved a higher reward. The trained agent was tested against existing algorithms and achieved the required objectives, delivering 42% more solar utilization than MPC charging.
This study bridges a common research gap by addressing scalability: the PPO-RL agent is integrated into the IEEE-33 bus power system for evaluation. For the case study, the RL-based EV charging management model improved the voltage drop by 0.05 p.u. and reduced line losses by 17% compared with the MPC benchmark. When the RL agent trained for solar energy utilization was tested, voltage levels were higher during peak hours, and line losses and currents were further reduced.
The obtained results are based on training datasets from two universities; it is recommended that future studies test the model in other regions and examine the effect of geographical location on the proposed model. The model can be extended to larger-scale, real-world networks, incorporating battery degradation, V2G technology, DC microgrids, and dynamic pricing for further enhancement. Continued research can explore advanced techniques such as two-stage multi-agent DRL [42], distributed observer-based control with RL [43], and Seq2Seq-based deep learning frameworks [44], which can enhance real-world applicability and improve the performance of the RL-based EV charging management system.
In summary, the study highlights that the RL-based EV charging management system increases grid reliability by maximizing the use of renewable energy and reducing the burden on the grid, and thus the power infrastructure required to accommodate future EV demand. However, the RL-based charging management is trained using QU stochastic data and Caltech online data, so it may not perform well in other environments, such as residential areas, business centers, and public fast-charging stations, without further training. The reward function in the present study is static, whereas in practice grid conditions, energy prices, and user priorities vary over time; a fixed reward structure is unlikely to adapt well to long-term changes in demand or policy, so a future system may require meta-RL or online reward adaptation. Furthermore, user preferences, such as priority levels (e.g., emergency charging), are not included in the model, although they are significant for user acceptance. The proposed RL model also does not consider the long-term effects of the charging rate on EV battery health: frequent partial charging, or charging at a high rate during periods of low grid load, may affect battery lifespan. This will be considered in future work when formulating the reward function. Future research could also integrate the proposed RL framework with energy system design and economic analysis tools, such as HOMER Grid, enabling a holistic techno-economic analysis of campus EV charging infrastructure under different electricity tariffs, renewable energy incentives, and long-term investment horizons. This would allow accurate calculation of the net present value (NPV), payback period, and cost implications of implementing the RL-based EV charging system compared with traditional uncontrolled or rule-based approaches.
Furthermore, the proposed method could be validated through distribution utility analysis software (for instance, OpenDSS 9.8.0.1, CYMDIST 8.0, or other load-flow software), which would allow a more detailed evaluation of how well the proposed method performs in real-world scenarios with specific constraints, such as fault levels, protection schemes, phase imbalance, and transformer aging effects, that could otherwise be blind spots in a generalizable RL model. Such field tests would significantly improve the translatability of this work from a theoretical study to a practical campus- or community-scale microgrid application.