Article

Peak Shaving and Solar Utilization for Sustainable Campus EV Charging Using Reinforcement Learning Approach

Electrical Engineering Department, Qatar University, Doha 2713, Qatar
*
Author to whom correspondence should be addressed.
Sustainability 2026, 18(6), 2737; https://doi.org/10.3390/su18062737
Submission received: 7 December 2025 / Revised: 1 March 2026 / Accepted: 2 March 2026 / Published: 11 March 2026

Abstract

To reduce the carbon footprint, electric vehicles (EVs) are considered an alternative transportation choice. However, increased use of EVs could overload the existing power network when accounting for all installed chargers. With the increasing deployment of EV chargers, university campuses are potential sites for this network oversizing issue. This paper applies reinforcement learning (RL) to optimize EV charging infrastructure at the university scale using real-world data, directly contributing to sustainable energy management by reducing grid burden and increasing renewable energy utilization. The RL-based charger aims to reduce the burden on the grid while increasing renewable energy utilization. This study investigated practical relevance in real-world systems, considering three demand scenarios: random demand, stochastic historical demand from Qatar University, and actual online data from Caltech University. Three RL algorithms are applied: Deep Q-Network (DQN), Advantage Actor–Critic (A2C), and Proximal Policy Optimization (PPO). During training, the historical stochastic data requires more tuning of the RL framework than the random demand, emphasizing the importance of realistic demand profiles. The performance of the RL approach depends on the type of demand. The results show that the proposed RL approach can efficiently mitigate peak charging currents. For the Qatar University historical demand scenario, the PPO algorithm minimized the peak charging currents by 50% relative to uncontrolled charging (160 A to 80 A), and Model Predictive Control maintained the energy transfer capability at 99.710%. For the random demand type, the peak charging currents are minimized by 38.3% compared to uncontrolled charging (128 A to 79 A), with a nominal reduction in energy transfer capability to 95.89%. Scalability is tested by integrating the model into the IEEE-33 bus network.
Without solar integration, the proposed RL-based EV charging management model reduces the voltage drop by 0.05 p.u., lowering line losses by 17% compared to the MPC benchmark method and by 32% compared to the uncontrolled charging scheme. Further, the proposed RL approach achieves a 9% reduction in line current during peak hours in the IEEE-33 bus system. With solar integration into the IEEE-33 bus system, the proposed RL framework improved the sustainability of the charging infrastructure by enhancing solar energy utilization by 42.5%. These findings validate the applicability of the proposed model for optimizing sustainable EV charging infrastructure while managing the charging coordination problem.

1. Introduction

Electric vehicles (EVs) are increasingly viewed as a promising solution to the critical challenge of meeting global emission reduction targets. The recent literature highlights EVs as a sustainable alternative for mitigating the environmental impacts caused by the transportation sector, supported by extensive studies demonstrating their lower carbon emissions [1]. However, while EV adoption offers significant environmental benefits, it also introduces major challenges to energy infrastructure. As EV popularity grows, the demand for charging stations places substantial strain on power grids, especially during peak hours. The flexible and unpredictable nature of EV charging makes simply expanding the power network impractical, leading to oversized infrastructure and unnecessary maintenance costs, as charging stations often operate below full capacity. Uncertainties related to EV usage patterns, such as charging requests, arrival and departure times, and fluctuating loads, further complicate the development of efficient charging solutions. To address these challenges, researchers are increasingly collaborating with industry stakeholders to design sustainable EV charging management systems that maintain grid stability while ensuring system reliability, security, and the successful fulfillment of EV charging demands.
The installation of EV charging stations also differs considerably across building and operational environments, each with different demand profiles, constraints, and optimization goals. Household charging usually happens overnight at a low power level, but it carries a high simultaneity risk during evening peaks [2]. Commercial and office buildings have daytime charging demand corresponding to working hours, which may be complemented by on-site solar generation [3].
University campuses present their own challenges for parking facilities: vehicle arrivals and departures are staggered, and some vehicles remain parked for long periods. On the other hand, the large rooftops of university parking facilities make them ideal sites for solar panel installation. Hotels and malls experience highly intermittent and high-turnover charging events, which necessitate an adaptive schedule to balance visitors’ convenience against grid overloading [4]. Public charging stations and fleets suffer from scale and variability issues, which require serious real-time coordination [5]. Despite such varying contexts, research to date has generally approached EV charging as a generic problem, with insufficient adaptation of the reinforcement learning (RL) model across sectors to cover the specific temporal and spatial profile of each one [6]. RL-based EV charging and energy management systems are further discussed in the recent literature. Model-free deep reinforcement learning methods are suggested in [7] for real-time charging scheduling in smart grids, which helps with adaptive decision-making in uncertain operating environments. Q-learning-based schemes have been further utilized in [8] for electricity pricing and vehicle-to-grid interaction to ensure cost-effective dynamic charging control. Additionally, deep RL methods are discussed in [9] to maintain voltage stability in distribution networks during high EV penetration. In [10], coordinated charging schemes integrated with renewable energy sources are suggested to enhance solar energy exploitation in microgrids.
Smart charging strategies aim to address the EV charging problem by pursuing multiple objectives, including maximizing charging revenues, minimizing energy costs, reducing power losses, and lowering the carbon footprint. In [11], the authors provide a solution that utilizes vehicle-to-grid (V2G) as a reactive power compensation method in a distributed coordination and present the economic benefits of this method. In [12], the authors propose a cost-based optimization framework that jointly optimizes electric vehicle charging (EVC) schedules and photovoltaic (PV) reactive power management. Their approach aims to minimize the total system cost while maintaining grid stability and improving the integration of renewable energy resources. In [13], a two-stage control scheme is developed to enable rapid V2G discharging. The first stage ensures optimal energy extraction from EV batteries, while the second stage provides fast and reliable support to the grid, particularly during peak demand periods or sudden fluctuations. Meanwhile, in [14], the authors present a comprehensive review of strategies for coupling renewable energy sources with electric vehicle systems. This review covers key aspects such as integration architectures, energy management methods, and the operational challenges that arise when combining EVs with intermittent renewable generation, highlighting pathways to improve sustainability and grid resilience.
Rather than investigating smart EV coordination directly, the authors in [15,16] use data-driven approaches to optimize EV charging problems. In [17], the authors propose a multi-agent deep reinforcement learning framework for scheduling EV charging. In another work [18], the authors formulated the EV charging problem as a constrained Markov Decision Process, solved with a safe deep reinforcement learning model based on a constrained policy optimization technique [19]. Furthermore, the authors in [7] proposed a model-free approach integrated with a Markov Decision Process (MDP), enabling individual EVs to autonomously determine optimal charging and discharging schedules. In [6], a comprehensive review is presented on EV charging management systems based on reinforcement learning techniques, highlighting key methodologies, challenges, and advancements in the field.
Intelligent optimization strategies leveraging artificial intelligence (AI) and machine learning (ML) techniques, including reinforcement learning (RL), have demonstrated significant potential in optimizing EV charging processes by adapting to dynamic environmental conditions and user preferences [6,7,20]. Among these techniques, RL has emerged as a particularly promising approach for addressing the challenges of EV charging optimization. As a branch of machine learning, RL involves an agent that continuously interacts with an environment, learning optimal decision-making strategies through feedback in the form of rewards and penalties. In the context of EV charging systems, the charging controller functions as the “agent,” while the “environment” encompasses factors such as grid power availability, EV battery states, renewable energy generation levels, and real-time electricity pricing. By learning from these dynamic conditions, RL-based controllers can effectively optimize charging operations, balancing user needs with grid and energy constraints.
As presented in [6], previous research discusses the potential of RL for optimizing EV charging strategies. A major advantage of reinforcement learning techniques in EV charging is their adaptability to dynamic conditions. This helps address the variability in renewable energy availability, such as solar power, as reinforcement learning algorithms can reschedule charging in real time to maximize the utilization of green energy and, thus, minimize dependence on the grid.
The problem of EV coordination lies within the energy management system, which controls the charging parameters while attaining goals such as financial satisfaction and operational objectives. The literature review in [6] highlights that the integration of RL with EV charging systems is still in its early stages.
The RL studies mentioned above are applied to small-scale test beds, and their viability is not validated on practical systems, such as the IEEE-33 bus system, to evaluate the impact of EV penetration. The RL models used in these studies are deterministic, with simplified renewable power generation and smooth demand patterns; uncertainty in solar power generation, unpredictable EV arrival and departure times, and variability in user behavior are not sufficiently modeled. These approaches compare RL against heuristic and rule-based methods, but comparison of RL with advanced optimization techniques such as MPC is not sufficiently discussed.
Although RL has shown potential for optimizing charging processes, the field faces significant deployment challenges. Most studies are limited to theoretical models and simulations, with few addressing real-world implementation issues such as scalability, adaptability to dynamic grid conditions, and integration with existing infrastructure. RL-related studies often lack validation under scalability and realistic power system conditions, such as voltage drop and line losses. This research addresses the literature gaps by applying a reinforcement learning (RL) solution for EV charging using the adaptive charging network (ACN) framework within an actual energy system, as discussed in [21]. Unlike previous studies, this approach uses real-time observations of the LV network, which improves practical system integration. This study further demonstrates the scalability of the proposed system by integrating it into the IEEE-33 bus power network to evaluate its effectiveness in managing charging events while improving grid stability and renewable energy utilization.
This paper proposes a reinforcement learning framework that accounts for renewable energy resources such as solar energy, provides a solution that optimizes the management of grid load, and addresses dynamic conditions in real-time. The approach provides a more scalable, efficient, and sustainable solution for EV charging compared to traditional optimization techniques by adapting to dynamic energy availability, charging requirements, and grid conditions. Several reinforcement learning agents are used to address the coordination problem of EV charging from different methodological perspectives: DQN, PPO, and A2C. Each agent is based on an algorithm that has its own advantages concerning the efficiency of learning, stability, and adaptability to complex environments. The objective of this paper is to compare the performance of these RL-based algorithms in optimizing EV charging policies under dynamic campus conditions, including variable solar energy availability, peak power constraints, and users’ energy demands. The key contributions of this research are summarized as follows:
  • Development of a dynamic EV charging controller: An RL-based controller is proposed that adaptively adjusts the charging rates of EVs depending upon solar energy availability, individual EV charging requirements, and overall grid load conditions.
  • Contrary to most simulation studies, the proposed work uses real-world datasets to train and test the RL agents across three demand types: random EV demand, a stochastic sequence of historical data from Qatar University (QU), and real-world online charging data collected from Caltech University. This makes the model more robust and relevant to the real-world setting of the campus environment.
  • Benchmarking of multiple RL algorithms for EV charging: The experiment compared the performance of three sophisticated RL techniques, namely DQN, A2C, and PPO, under the same environment.
  • Real-world scalability and impact analysis: The proposed RL controller is integrated into a standard IEEE-33 bus power distribution model, validating its performance under large-scale EV penetration. This analysis of scalability bridges a common gap between theoretical RL studies and practical power system applications.
  • Integrated management of peak load and renewable intermittency: The proposed solution effectively manages peak grid loads while accounting for the intermittency of renewable energy sources, with performance benchmarked across multiple case studies.
The proposed controller framework serves as a scalable foundation for future EV charging infrastructure research, particularly for large-scale deployments integrated with renewable energy resources. The structure of the rest of this paper is as follows: Section 2 provides a review of the literature discussing EV charging and RL applications; Section 3 presents and formulates the charging problem in detail; Section 4 discusses the design and implementation of the RL controller; Section 5 provides the experimental results of the study, presents a discussion of those results, and examines the proposed approach’s strengths and limitations; and Section 6 concludes the paper and recommends directions for future research.

2. Background

2.1. Reinforcement Learning (RL)

RL is a type of machine learning where an agent learns how to make decisions by interacting with an environment to maximize a cumulative reward. The RL process consists of an agent, a state, $s_t$, a reward signal, $r_{t+1}$, and a policy, $\pi$, as depicted in Figure 1.
The policy dictates the behavior of the agent. The policy reflects the relation between the new state, or the perception, and the action of the agent in the environment. The agent learns decision-making based on state inputs and optimal control actions. The objective is to maximize the cumulative reward. To achieve this objective, the charging management controller (the agent) must learn the actions that lead to the maximum reward [22]. This happens in the search process through trial and error. The actions of the agent impact the current reward as well as the delayed reward. Thus, the major stages in the reinforcement learning applications are sensing the environment for states, taking actions, and achieving the goals through maximizing the rewards [23].
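This sense–act–reward loop can be sketched in a few lines of Python (a toy, hypothetical environment and random policy for illustration only; it is not the charging environment used later):

```python
import random

class ToyEnvironment:
    """Hypothetical one-dimensional environment for illustration only."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # The action moves the state; the reward is highest near the goal state 5.
        self.state = max(0, min(10, self.state + action))
        reward = -abs(self.state - 5)  # maximal (0) at the goal state
        return self.state, reward

def random_policy(state):
    # Placeholder policy: pick -1, 0, or +1 uniformly at random.
    return random.choice([-1, 0, 1])

env = ToyEnvironment()
cumulative_reward = 0
state = env.state
for t in range(100):  # trial-and-error interaction over 100 steps
    action = random_policy(state)
    state, reward = env.step(action)
    cumulative_reward += reward
```

A learning agent would replace `random_policy` with a policy updated from the observed rewards; the loop structure itself is unchanged.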

2.2. Smart Charging Strategies Including Application of Reinforcement Learning

EV coordination is the major problem of the energy management system and is achieved by controlling the charging parameters while attaining goals such as financial satisfaction and operational objectives. RL techniques are used to solve complex EV charging scheduling problems, motivating their application here. The literature is surveyed, and studies related to RL-based EV charging are reviewed in [6]. From this review, it is concluded that the integration of RL in EV charging system applications is still emerging and faces deployment challenges, as the relevant RL studies are at their initial stages. Figure 2 shows the block diagram of the reinforcement learning algorithms reviewed in the literature, including the parameters used in EV coordination, such as states, inputs, constraints, and objectives [6].
From Figure 2, it is observed that in the literature, a set of parameters is used for modeling of EV charging coordination in a Markov Decision Process (MDP) framework. Some of the popularly used parameters are current pricing levels of electricity, levels of transformer loading, battery State-of-Charge (SoC), availability of renewable energy sources, and user-driven parameters such as arrival/departure times of users and their preferred/required consumption. The popularly used rewards are aimed at optimizing charging expenses, maximizing the use of renewable energy resources, and reducing the waiting period, as well as meeting consumer demands. Typical constraints reported in the literature are the capacity of electric charges, battery capacity, network capacity, and parking duration.
Using this established framework, the MDP defined in this paper includes several of the key parameters, such as solar availability, building demand, and EV-specific variables, within the state space. However, the proposed method goes further by incorporating campus-specific demand profiles from the Qatar University campus setting, accounting for the intermittency of the solar source, and using a centralized Main Charger Controller (MCC) to coordinate EV charging across the distributed network. These extensions ground the proposed framework in existing research on RL-based EV charging coordination.

3. Proposed Method

3.1. Problem Formulation

This study focuses on the optimization of EV charging operations integrated with solar (photovoltaic, PV) energy generation within a university campus setting. The system under consideration includes the main university building, rooftop-installed solar panels, and EV chargers located in the campus parking lot.
The Main Charger Controller (MCC) works as a centralized agent for decision-making while receiving information in real-time from three major subsystems, which are illustrated in Figure 3. These three major systems are the EV chargers, the solar PV system, and the distribution system, which supplies the information of aggregated active power ($P_{agg}$).
The solar PV system comprises photovoltaic panels fixed on rooftops. This system provides constant, real-time power generation information ($P_{PV}$), facilitating solar-prioritized charging during periods of high irradiance.
The distribution system module is responsible for monitoring the total electrical load demand of the campus building. This involves both its active demand ($P_{load}$) and its reactive demand ($Q_{load}$). The difference between the active power supplied by the grid, $P_t$, and the active power demanded by the load, $P_{load}$, is represented as $P_{agg}$, and it is supplied to the MCC. This information enables the MCC to prevent the local transformer from being overloaded. In turn, this helps to plan the discharge time of the EVs when the building load has low demand.
EV chargers are deployed in the parking lot of the campus. These are Level 2 charging stations, which provide real-time information regarding charging session details to the MCC. This helps the MCC determine the charging rates ($I_{ch}$) for each EVSE and adjust the charge timing as needed.
The main objective is to coordinate charging while prioritizing the use of solar energy, satisfying the energy requirements of the user, and maintaining the operational constraints. The charging method proposed in this study maximizes the delivered energy and reduces charging peaks by exploiting the fluctuating arrival and departure times. Another key objective of this study is to utilize the charging infrastructure more efficiently than benchmark algorithms.
The proposed RL model solves the EV charging management problem. The workflow in Figure 4 illustrates the proposed RL framework, which starts with energy requests at the charging station and ends with the RL model returning the charging rates to the EV chargers.

3.2. RL Workflow

The workflow presented in Figure 4 outlines the development of the RL model, where the agent is trained while continuously updating the states in the environment using the simulator. It represents the end-to-end workflow of the RL-based charging coordination system, outlining the interaction among physical infrastructure, simulation, training, and deployment. EV charging sessions are modeled through an event table containing arrival time, departure time, requested energy, and station ID. Random generation, QU historical stochastic data, and Caltech online datasets are the sources. The open-source simulator is used to emulate the charging network. It takes in charging rates from the RL agent, applies system constraints (transformer limits, voltage, and plug-in events), and returns updated states in terms of remaining EV energy, solar generation, and building load.
The first step in this process is to formalize the MDP for the RL framework. The MDP is designed to optimize the cumulative reward and follows a four-tuple structure (S, A, R, T), where S represents a finite set of states, A denotes a finite set of actions, R is the reward function, and T defines the state transition function. The environment is set up by defining the states and observation functions, with the reward function structured according to the specific objectives of the EV charging optimization:
  • Data collection plays a key role in the next phase, where an event table is created covering the daily operating window from 6:00 a.m. to 3:00 p.m., depending on the demand data (random, historical, and online), with 1 min time steps representing the arrival times, departure times, and energy requests for all chargers in the network.
  • During the training phase, the ACN-Sim open-source simulator calls the scheduling algorithm whenever there is a charging request [24]. The RL agent determines the charging rates and sends them back to ACN-Sim to update the datasets, such as the remaining energy in the EVs, until all scheduling events are completed. This process is iterated across multiple episodes based on the training requirements and testing criteria.
  • In the testing phase, the trained RL agent interacts with the ACN-Sim platform to set charging rates into the event tables and retrieves system states for the reward function. The simulator outputs various states, such as the remaining energy in EVs, delivered energy, and active chargers, which are then used to evaluate the agent’s performance. The ACN-Sim environment, which is Python-based and integrates with OpenAI Gym for RL training and testing [25], is used for this purpose. The ACN-Sim simulation environment is implemented using Python 3.13.
  • The training of the RL agent is carried out across different scenarios as listed in Table 1, with performance evaluated through training, testing, and comparison against EV charging coordination benchmarking algorithms such as Model Predictive Control (MPC) and the uncontrolled and Least-Laxity-First (LLF) algorithms [26].
  • Once the RL model is trained, it is deployed via a cloud-based interface, such as Streamlit, providing a user interface that facilitates interaction between EV chargers, users, and the trained RL agent [22,27].
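The workflow above can be sketched as a minimal simulation loop (a highly simplified stand-in for ACN-Sim; the event-table rows, voltage, and current limit below are illustrative assumptions, not values from the QU or Caltech datasets):

```python
# Each event row: (arrival_min, departure_min, requested_kwh, station_id).
EVENT_TABLE = [
    (0, 300, 10.0, "EVSE-1"),
    (60, 420, 15.0, "EVSE-2"),
]

VOLTAGE = 240.0     # assumed Level 2 charger voltage (V)
AGG_LIMIT_A = 64.0  # assumed aggregate current limit (A)

def step(t, remaining_kwh, rates_a):
    """Apply one 1-minute step of charging at the agent-selected rates."""
    # Enforce the aggregate current constraint by proportional scaling.
    total = sum(rates_a.values())
    scale = min(1.0, AGG_LIMIT_A / total) if total > 0 else 0.0
    for arrival, departure, _, sid in EVENT_TABLE:
        if arrival <= t < departure and remaining_kwh[sid] > 0:
            kwh = rates_a[sid] * scale * VOLTAGE / 1000.0 / 60.0  # kWh in one minute
            remaining_kwh[sid] = max(0.0, remaining_kwh[sid] - kwh)
    return remaining_kwh

remaining = {sid: req for _, _, req, sid in EVENT_TABLE}
for t in range(420):  # iterate the event window in 1-minute steps
    # A trained RL agent would choose these rates; a fixed 32 A placeholder here.
    remaining = step(t, remaining, {"EVSE-1": 32.0, "EVSE-2": 32.0})
```

In the actual framework, the RL agent replaces the fixed-rate placeholder and the simulator returns the updated states (remaining energy, solar generation, building load) for the next decision.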

3.3. MDP Formulation

The MDP framework comprises a four-tuple (S, A, R, T), where S denotes a finite set of states, A denotes the finite set of actions, R denotes the reward function, and T denotes the state transition function [28].
The problem of EV charging coordination is modeled as an MDP (see Figure 3), which is defined using the elements presented below.

3.3.1. State Space (S)

At a certain decision time step, the state comprises the available solar power, the building electrical load, and individual active EVSE parameters, such as the requested energy and the times of arrival and departure.
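As a concrete illustration, the state at a decision step could be packaged as follows (the field names and values are assumptions for this sketch, not the paper's exact encoding):

```python
from dataclasses import dataclass, field

@dataclass
class EVSEState:
    requested_kwh: float   # energy requested by the connected EV
    arrival_min: int       # arrival time, minutes from window start
    departure_min: int     # scheduled departure time

@dataclass
class ChargingState:
    solar_power_kw: float   # available solar PV power
    building_load_kw: float # campus building electrical load
    evses: list = field(default_factory=list)  # active EVSE parameters

# A hypothetical snapshot with one active EVSE:
state = ChargingState(
    solar_power_kw=120.0,
    building_load_kw=450.0,
    evses=[EVSEState(requested_kwh=12.0, arrival_min=30, departure_min=360)],
)
```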

3.3.2. Action Space (A)

The action space is the set of charging rates at a given time $t$, selected by the agent for each charger. The charging rate, $I_{ch}$, is the charging current calculated at a certain voltage level. For the discrete action space, the agent selects a vector of charging rates.
This study explores both discrete and continuous action spaces to benchmark their effectiveness as follows:
  • Discrete action space: Allowable charging rates of [0, 8, 16, 24, 32] A.
  • Continuous action space: The charging current is a continuous value ranging from 0 to 32 A.
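The two action spaces can be sketched in plain Python (the study itself uses the ACN-Sim/Gym interface; this is only an illustration):

```python
import random

DISCRETE_RATES_A = [0, 8, 16, 24, 32]  # allowable discrete charging rates (A)
CONTINUOUS_MAX_A = 32.0                # upper bound of the continuous range (A)

def sample_discrete_action(num_chargers):
    # One rate per charger, drawn from the allowable discrete set.
    return [random.choice(DISCRETE_RATES_A) for _ in range(num_chargers)]

def sample_continuous_action(num_chargers):
    # One rate per charger, anywhere in the continuous range [0, 32] A.
    return [random.uniform(0.0, CONTINUOUS_MAX_A) for _ in range(num_chargers)]

discrete = sample_discrete_action(4)
continuous = sample_continuous_action(4)
```

DQN requires the discrete variant, while PPO and A2C can operate on either; this is one reason both spaces are benchmarked.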

3.3.3. Reward Function (R)

The reward function, R, is expressed as a weighted combination of positive sub-rewards, $r_j$, and penalties, $d_j$, to direct the RL agent toward fulfilling various objectives.
The following objectives and constraints represent the rewards and penalties:
  • Solar utilization reward: A positive reward is granted if the system charges during periods of high solar availability, which in this case are the midday hours.
  • Current constraint violation penalty: A penalty applies when the sum of the charging rates across the $N$ active chargers, $\sum_k I_{EV_k}$, exceeds the maximum allowed limit, $I_a^{max}$, as defined in Equation (1):
$$\sum_{k=1}^{N} I_{EV_k} < I_a^{max} \tag{1}$$
where $k$ indexes the active chargers.
  • Charger violation penalty: A negative reward equal to the violation magnitude is subtracted from the total system reward whenever the last schedule violates an individual EVSE constraint.
  • Unmet demand penalty: This penalizes failure to meet the energy demanded by the user before departure, defined as the ratio of the remaining demand to the energy requested by the user.
Depending on the solar utilization reward, $r_{solar,t}$, and the violation penalties for current, $d_{current,t}$, charger, $d_{charger,t}$, and unmet demand, $d_{unmet,t}$, the reward function, $R_t$, is mathematically expressed as follows:
$$R_t = r_{solar,t} - \lambda_1 d_{current,t} - \lambda_2 d_{charger,t} - \lambda_3 d_{unmet,t} \tag{2}$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients used to optimize the trade-off between grid safety, user satisfaction, and solar utilization. The positive reward, $r_{solar,t}$, for solar utilization is defined as:
$$r_{solar,t} = \begin{cases} +1 & \text{if charging takes place during high solar availability} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
The expression for the penalty, $d_{current,t}$, for violating the total current limit is:
$$d_{current,t} = \max\left(0,\; \sum_{k=1}^{N} I_{EV,k}(t) - I_a^{max}\right) \tag{4}$$
The expression for the penalty, $d_{charger,t}$, for violating an individual charger limit is:
$$d_{charger,t} = \sum_{k=1}^{N} \max\left(0,\; I_{EV,k}(t) - I_{ch}^{max}\right) \tag{5}$$
where $I_{ch}^{max}$ denotes the maximum rate of an individual charger.
The expression for the penalty, $d_{unmet,t}$, for unmet energy demand at departure is:
$$d_{unmet,t} = \frac{E_{remaining}(t)}{E_{requested}} \tag{6}$$
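The reward terms above can be combined into a single computation, as in the following sketch (the weights and current limits are illustrative placeholders, not the tuned values used in the study):

```python
def reward(charging_currents_a, during_high_solar, remaining_kwh, requested_kwh,
           agg_limit_a=80.0, per_charger_limit_a=32.0,
           lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted reward: R_t = r_solar - lam1*d_current - lam2*d_charger - lam3*d_unmet."""
    # +1 if any charging takes place during high solar availability.
    r_solar = 1.0 if during_high_solar and sum(charging_currents_a) > 0 else 0.0
    # Penalty for violating the aggregate current limit.
    d_current = max(0.0, sum(charging_currents_a) - agg_limit_a)
    # Penalty for individual chargers exceeding their own limit.
    d_charger = sum(max(0.0, i - per_charger_limit_a) for i in charging_currents_a)
    # Penalty for energy demand left unmet at departure.
    d_unmet = remaining_kwh / requested_kwh if requested_kwh > 0 else 0.0
    return r_solar - lam1 * d_current - lam2 * d_charger - lam3 * d_unmet

# No violations, charging at midday, demand fully met:
r_ok = reward([16.0, 16.0], True, 0.0, 10.0)
# Aggregate limit exceeded by 20 A, half the demand unmet:
r_bad = reward([50.0, 50.0], False, 5.0, 10.0)
```

The weighting coefficients `lam1`–`lam3` play the role of $\lambda_1$–$\lambda_3$; tuning them shifts the agent's priorities between grid safety, user satisfaction, and solar utilization.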

3.3.4. Transition Function (T)

The transition function is nondeterministic and partially observed, consistent with the RL algorithm requirements, and it is based on the following dynamics:
  • The arrivals and departures of vehicles.
  • The requested energy, the energy delivered by the charger, and the remaining energy demand.
  • The fluctuations in available solar energy.
  • Variations in the usage of building energy.

3.4. Models, Objectives, and Constraints

3.4.1. The Distribution Network Model

Figure 3 represents the distribution network model, which combines the mathematical models representing several parts of the distributed system. According to the constraints in Equations (7) and (8), power can either be drawn from the grid during charging, or it is supplied back into the grid, depending upon the power supplied by the PV source at node $i$ across all $n$ buses. These two processes are mutually exclusive, as represented by the binary variable $u_i$ in the constraint in Equation (9):
$$P_{EV_i} \leq u_i P_{EV}^{max,i}, \quad i = 1, \dots, n \tag{7}$$
$$P_{PV_i} \leq (1 - u_i) P_{PV}^{max,i}, \quad i = 1, \dots, n \tag{8}$$
$$u_i = \begin{cases} 1 & \text{power drawn} \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

3.4.2. Solar Power Model

From Equation (10), the instantaneous power generated by the solar PV array, $P_{PV}$, depends on the capacity factor, $CF$, and the solar PV rating, $P_o$ (kW):
$$P_{PV}\,(\text{kW}) = CF \times P_o\,(\text{kW}) \tag{10}$$
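As a numeric illustration of Equation (10), assuming a hypothetical 200 kW rooftop array and a midday capacity factor of 0.65:

```python
def pv_power_kw(capacity_factor, rating_kw):
    """Instantaneous PV output per Equation (10): P_PV = CF x P_o."""
    return capacity_factor * rating_kw

# Hypothetical 200 kW array at an assumed midday capacity factor of 0.65:
midday_output = pv_power_kw(0.65, 200.0)
```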

3.4.3. Plug-In Electric Vehicles

The maximum rates of the EV charger, given in Equation (11), limit the charging and discharging of the vehicle:
$$-P_{EVCS}^{dis,max} \leq P_{EV}(t) \leq P_{EVCS}^{ch,max} \tag{11}$$

3.4.4. Charging Piles

Equation (12) defines the aggregated current, $I_{Agg}$, as the summation of the simultaneous charging currents across the $N$ chargers:
$$I_{Agg} = \sum_{k=1}^{N} I_{ch_k} \tag{12}$$
where $k$ denotes the active chargers that are requesting charging. In Equation (13), the transformer rating, $I_T^{max}$, scaled by the load factor, $LF$, constrains the aggregate current:
$$I_{Agg} < LF \cdot I_T^{max} \tag{13}$$
Equation (14) defines the power of an individual charger:
$$P_{EV_k}\,(\text{kW}) = \frac{I_{EV_k} \times V_{EVCS}}{1000} \tag{14}$$
The charger voltage, $V_{EVCS}$, is defined as the bus voltage; the voltage between node $i+1$ and node $i$ equals the voltage across the chargers. The expression for the bus voltage is taken from [29,30,31] and is given by:
$$V_{i+1}^2 = V_i^2 - 2\left(P_i r_j + Q_i x_j\right) + \frac{\left(r_j^2 + x_j^2\right)\left(P_i^2 + Q_i^2\right)}{V_i^2} \tag{15}$$
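The charging-pile relations in Equations (12)–(14) and the bus-voltage expression can be checked numerically, as in this sketch (the line parameters and load factor are illustrative assumptions):

```python
import math

def aggregate_current(charging_currents_a):
    """Equation (12): I_Agg as the sum of simultaneous charging currents."""
    return sum(charging_currents_a)

def within_transformer_limit(i_agg, load_factor, i_t_max):
    """Equation (13): I_Agg < LF * I_T_max."""
    return i_agg < load_factor * i_t_max

def charger_power_kw(i_ev_a, v_evcs):
    """Equation (14): per-charger power in kW."""
    return i_ev_a * v_evcs / 1000.0

def next_bus_voltage(v_i, p_i, q_i, r_j, x_j):
    """Bus voltage at node i+1 from the DistFlow-style expression (p.u. quantities)."""
    v_sq = v_i**2 - 2.0 * (p_i * r_j + q_i * x_j) \
           + (r_j**2 + x_j**2) * (p_i**2 + q_i**2) / v_i**2
    return math.sqrt(v_sq)

i_agg = aggregate_current([32.0, 24.0, 16.0])      # three active chargers
ok = within_transformer_limit(i_agg, 0.8, 100.0)   # assumed LF and rating
p_kw = charger_power_kw(32.0, 240.0)               # assumed 240 V charger bus
v_next = next_bus_voltage(1.0, 0.05, 0.02, 0.1, 0.05)  # assumed p.u. line data
```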

3.5. RL Algorithms (Benchmarking)

The problem of EV charging management is mathematically modeled as a multi-objective optimization problem in which the agent learns optimal charging policies through interactions with the environment. As mentioned before, the three algorithms discussed are DQN, PPO, and A2C. Each algorithm offers its own learning mechanism, as presented in Table 1.
Table 1. Comparison of different RL algorithms [24].
Method | Type | Strengths | Challenges | Reference
DQN | Value-based | Stable learning with experience replay | Sensitive to hyperparameters | [32]
PPO | Policy gradient | Stable and efficient updates | May require tuning clipping range | [33]
A2C | Actor–Critic | Low-variance gradient estimates | Synchronous updates can be slow | [34]

3.5.1. Deep Q-Network (DQN)

The DQN technique is a value-based method in which a deep neural network approximates the Q-function, $Q(s, a; \theta)$. The Q-function estimates the cumulative reward r for taking action a in state s, using the online network parameters, θ, to evaluate state–action pairs. Training is stabilized through experience replay and a target network. The Q-value update rule is:
$Q(s, a; \theta) \leftarrow Q(s, a; \theta) + \alpha \left[ r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right]$
where α is the learning rate and γ is the discount factor [32]. The two main techniques used in DQN to stabilize training are as follows:
  • Experience replay: Breaks the correlation between consecutive samples by storing past transitions and sampling them randomly.
  • Target network: A separate network, $Q(s, a; \theta^{-})$, used to compute the target value; it is periodically updated to improve training stability.
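A toy illustration of the update rule and the target network, using Q-tables in place of the paper's neural networks (the state/action sizes and the sampled transition are hypothetical):

```python
import numpy as np

# Toy Q-tables standing in for the online network Q(s, a; theta) and the
# target network Q(s, a; theta^-); an actual DQN would use two MLPs.
n_states, n_actions = 4, 2
q_online = np.zeros((n_states, n_actions))
q_target = q_online.copy()

alpha, gamma = 0.1, 0.99  # learning rate and discount factor

def dqn_update(s, a, r, s_next):
    """One step of the DQN update rule: the TD target is computed with
    the frozen target network, then the online estimate moves toward it."""
    td_target = r + gamma * q_target[s_next].max()
    q_online[s, a] += alpha * (td_target - q_online[s, a])

dqn_update(s=0, a=1, r=1.0, s_next=1)   # Q(0,1) moves from 0 toward 1.0
q_target[:] = q_online                  # periodic target-network sync
```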

3.5.2. Advantage Actor–Critic

A2C is a synchronous actor–critic method that uses multiple agents collecting experience in parallel. It is a policy-gradient method that jointly estimates the actor (policy function) and the critic (value function) to decrease variance and improve stability. The critic estimates state values and indicates the direction in which the actor updates the policy parameters. The advantage function reduces the variance of the gradient estimate and is given by:
$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$
In practice, it is estimated as:
$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$
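The one-step advantage estimate can be written directly from the formula (a minimal sketch; the argument values are illustrative):

```python
def one_step_advantage(r_t, v_t, v_next, gamma=0.99, terminal=False):
    """TD(0) advantage estimate A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t).

    The critic's value estimates stand in for V; at a terminal step the
    bootstrap term is dropped.
    """
    bootstrap = 0.0 if terminal else gamma * v_next
    return r_t + bootstrap - v_t

adv = one_step_advantage(r_t=1.0, v_t=0.5, v_next=0.8)  # 1.0 + 0.99*0.8 - 0.5
```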

3.5.3. Proximal Policy Optimization (PPO)

The policy, $\pi(s, a)$, is the central RL function: it defines the probability of taking action a in state s. The goal of RL is to find the policy that is optimal with respect to the reward function. This study uses PPO, a stochastic policy-optimization method whose application in deep reinforcement learning is robust and comparatively insensitive to hyperparameter tuning; it has been used in several applications in the literature [34,35]. The policy gradient is:
$\nabla_{\theta} J(\theta) = \mathbb{E}\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, A(s_t, a_t) \right]$
The clipped objective function improves the training stability. It ensures a balance exists between policy improvement and exploration:
$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$
$L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) A_t, \; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$
where ϵ is a small hyperparameter. The RL agent seeks to learn an optimal policy, π ( S , a ) , that maximizes the expected cumulative reward:
$\max_{\pi} \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} R(s_t, a_t) \right]$
where:
  • $\gamma \in [0, 1]$ is the discount factor.
  • $T$ is the episode length.
  • $R(s_t, a_t)$ is the reward at time $t$.
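The probability ratio and clipped surrogate above can be sketched in NumPy (our own minimal implementation, not the paper's training code):

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped PPO surrogate L^CLIP (to be maximized; negate for a loss).

    r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) is the probability
    ratio; clipping it to [1 - eps, 1 + eps] caps how far one update can
    move the policy.
    """
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# A large ratio (here e^1 ~ 2.72) is held back by the 1 + eps = 1.2 cap
obj = ppo_clip_loss(np.array([1.0]), np.array([0.0]), np.array([1.0]))
```

Note the element-wise minimum: for a negative advantage the unclipped (more pessimistic) term is kept, which is exactly what makes the objective conservative.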

3.6. Performance Evaluation

The testing phase benchmarks the trained RL model against various optimization and scheduling methods used in EV charging management. The comparison metrics evaluate performance across several objectives, including infrastructure savings, solar energy utilization, maximum aggregated current, and energy delivery. Additionally, the final stage integrates the RL model into the IEEE-33 benchmark power network for impact analysis on the distribution network. Load flow simulations were conducted to evaluate line losses and voltage profiles within the power system.

4. Mathematical Implementation

4.1. Simulation Setup

The RL source code is developed using the following tools: TensorFlow version 1.14.0 [25], Keras, Python version 3.7, NumPy [23], the Stable Baselines library [27], and OpenAI's Gym [26]. A workstation with an Intel Core i7-9700 CPU and 16 GB of RAM is used to run the simulations. The first step sets up the simulation, which includes the data population distribution and the parameter tuning of the solver used for the proposed deep RL network.
The simulation starts with training, then testing, and finally benchmarking, where each step includes a few assumptions and inputs. This section defines the settings and inputs of each step, including the training and testing processes, which are explained below.

4.2. Training and Testing Processes

The states act as inputs to the training process. Over several epochs, the neural network's prediction of the discounted sum of rewards is refined; because the NN estimate is not exact, it carries non-zero variance. A training epoch typically spans one full day, i.e., 1440 min of charging coordination in the base-load case, with an RL time step of 1 min.
After training, the trained agent, which is a machine learning model, is saved and ready to serve EV charger coordination. To achieve this, the Python testing code loads the model and calls it for EV charging coordination. Testing is performed over a set of different scenarios and benchmarked as defined in the following sections.
The deep neural network (DNN) architecture is a multi-layer perceptron (MLP) consisting of four layers, including two hidden layers [36]; it qualifies as a DNN because it has more than one hidden layer. The MLP is a feedforward network with activation units, $a_i^{(l)}$, at each layer, $l$. The hyperparameters of the network, including the number of layers and neurons, are set through trial and error until the training criteria are met. Weight correction during training is carried out using backpropagation, and the multi-layer network computes the policy function in deep reinforcement learning. The layered topology of the DNN enables efficient gradient computation: backpropagation is simply a computational implementation of the chain rule, an algorithm that finds the pertinent partial derivatives through a backward pass during training.
The PPO-RL hyperparameters are a learning rate of 2.5 × 10−4 and a discount factor of 0.99. The agent performs an update every 16 steps, with four training minibatches per update. The actor and critic networks are feedforward neural networks, each with two hidden layers of 64 neurons and the ReLU activation function. The clipping parameter, ϵ, is set to 0.2. The value-function loss coefficient, c1, is set to 0.5, and the entropy coefficient, c2, to 0.01 to support exploration. The maximum episode length is fixed by the simulation time.
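The stated actor/critic architecture, two hidden ReLU layers of 64 neurons each with a linear output, can be sketched in NumPy; the state and action dimensions below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def mlp_init(sizes):
    """Weights for a feedforward MLP; sizes like [n_obs, 64, 64, n_out]."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Two hidden ReLU layers of 64 units each, with a linear output."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)  # ReLU activation
    w, b = params[-1]
    return x @ w + b

# Hypothetical sizes: 10 state features, 5 charger actions (illustrative only)
actor = mlp_init([10, 64, 64, 5])
critic = mlp_init([10, 64, 64, 1])
logits = mlp_forward(actor, np.ones(10))
value = mlp_forward(critic, np.ones(10))
```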

4.3. Coordination Algorithms Used for Benchmarking EV Charging

The benchmarking of the proposed charging coordination algorithm is carried out against algorithms, including the Least-Laxity-First (LLF) algorithm, uncontrolled charging, and the Model Predictive Control (MPC) algorithm. These algorithms are available in ACN-Sim [26].

4.4. Case Studies and Model Inputs

In this paper, the parameters defined by the authors in [26] are used as a model for the EV charging demand. The building load and solar energy datasets used in the current study are characterized statistically as shown in Table 2. The modeling of arrival and departure times is carried out in three different ways: random data, online data from the Caltech Building, and stochastic data from Qatar University, with case study configurations detailed in Table 3.
In all cases, the charging process is applied in the adaptive charging network (ACN) from Caltech, as discussed in detail in [37]. The IEEE-33 bus system is used to validate the grid model [38].
The plug-in operation of charging gives rise to an event; unless otherwise stated, a single EVSE is considered. The charging current magnitude and supply voltage are taken as 32 A and 208 V.
The EV demand during an event is modeled for different numbers of events per charging station and depends on the arrival and departure of the EV user. The proposed charging management system is simulated and tested for different case scenarios with different charger events and by changing the EV demand model.

4.4.1. EV Demand

The EV charging demand is the charging request and is modeled based on the arrival and departure times. For all cases, the Caltech ACN is considered for the charging process provided by the simulation detailed in [37].
The maximum charging current drawn by the chargers is 32 A at 208 Vac three-phase and connected to a 150 kVA transformer [39]. The chargers are Level 2 and 6.6 kW, three-phase connected at the main switch panel. The charger’s location is the California Institute of Technology (Caltech parking), and the roof area (solar) is 3000 square meters with a peak output power of 600 kWp.
For validating the model under study, the IEEE 33-distribution system is chosen, as detailed in [38]. The load flow results of the power system, represented as line currents, node voltages, and system losses, are generated for different EV demands.

4.4.2. Random Dataset for EV Demand

The EV demand is modeled using a random dataset generator, which is named the random plug-in function. This function returns several random plug-in events that could occur at any moment between 0 and the duration defined in Table 3. Each plug-in has a random entrance and departure within the allotted period and a satisfiable requested energy if there are no other cars plugged into that specific charger. Unless otherwise specified, each EV has an initial laxity equal to half the stay duration.
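A hedged sketch of such a random plug-in generator follows; the names and the uniform sampling choices are our assumptions, and only the satisfiability requirement comes from the text:

```python
import random

random.seed(1)

def random_plugins(n_events, horizon_min, max_rate_kw=6.6):
    """Generate random plug-in events: each event gets a random arrival
    and departure inside the horizon, and a requested energy that is
    satisfiable within the stay at the charger's rating."""
    events = []
    for _ in range(n_events):
        arrival = random.uniform(0, horizon_min - 1)
        departure = random.uniform(arrival + 1, horizon_min)
        stay_h = (departure - arrival) / 60.0
        energy = random.uniform(0, max_rate_kw * stay_h)  # kWh, satisfiable
        events.append({"arrival": arrival, "departure": departure,
                       "energy_kwh": energy})
    return events

events = random_plugins(n_events=10, horizon_min=1440)
```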

4.4.3. Qatar University Historical Data (Stochastic)

The EV demand is modeled on human behavior, emulated from parking behavior at the Qatar University (QU) campus. It is derived from historical data captured from 32 parking areas on campus, collected by the Qatar Transportation and Traffic Safety Center; the data includes parking duration, arrival times, and departure times. Parking durations on campus are distributed between 1.5 h and 8.5 h, with an average of 3 h. The total number of vehicles parked at noon on campus is 3698, which is 60% of the available parking spaces, and peak parking hours are from 11:00 a.m. to 12:00 p.m. (see Figure 5). The requested energy in the QU dataset depends on the parking duration, which in turn depends on the arrival time. The parking duration dataset is therefore transformed into a probability density function for each hour of the day and used to generate the stochastic data fed into the RL training.
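The hour-conditioned sampling described above can be sketched as follows; the per-hour probability values are placeholders, not the actual QU distributions:

```python
import random

random.seed(7)

# Hypothetical per-hour parking-duration distributions (hours -> probability),
# standing in for the PDFs estimated from the QU parking dataset.
duration_pdf_by_hour = {
    h: {1.5: 0.2, 3.0: 0.5, 5.0: 0.2, 8.5: 0.1} for h in range(24)
}

def sample_stay(arrival_hour):
    """Draw a parking duration from the PDF attached to the arrival hour."""
    pdf = duration_pdf_by_hour[arrival_hour]
    durations = list(pdf)
    weights = [pdf[d] for d in durations]
    return random.choices(durations, weights=weights, k=1)[0]

# Generate one stochastic charging event around the 11:00 a.m. parking peak
arrival = 11
stay = sample_stay(arrival)
departure = arrival + stay
```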

4.4.4. Caltech University Database for EV Demand (Online Data)

In another approach, online data generated from an existing charging infrastructure is used to build the event table corresponding to EV demand. Caltech University maintains an online database of historical charging events: the ACN-Data set includes fields relevant to the EV charging problem, such as station ID, site ID, user ID, session ID, arrival time, departure time, requested energy, and delivered energy [40]. The case study considers an actual charging network installed at the California Institute of Technology (Caltech) [41]; the arrival times, departure times, and requested energies are extracted from ACN-Data [40].

4.4.5. Building Load and Solar Energy

Other datasets corresponding to building active loading power and reactive power are obtained from the QU history data from the year 2018. The statistical parameters are calculated for these datasets and are listed in Table 2. The National Solar Radiation Center, located in Pasadena, CA, is used to collect solar data for the year 2018.

5. Discussion and Results

5.1. Benchmarking Training Performance

The RL agent is trained using three different algorithms in Table 1 for the following three different EV demands:
  • Case A: Random demand.
  • Case B: QU data-Stochastic demand.
  • Case C: Caltech online.
The results were obtained after running over 100,000 iterations for each of the five chargers, with ten events per iteration, for the base case scenarios (Case-1A to Case-3A and Case-1B to Case-3B) described in Table 3. For each case, the cumulative reward was recorded, and a moving average over 10 episodes was plotted to assess performance over time. Figure 6, Figure 7 and Figure 8 demonstrate the training behavior of the RL agents under the different EV demand models.
The results indicate that the DQN agent exhibits faster convergence and stable reward accumulation with fewer steps in Case-1A, as shown in Figure 6.
The maximum cumulative reward for Case-1A is 0.932, and for Case-1B it is 0.968, with standard deviations of 0.1711 and 0.1689, respectively. Although the standard deviation indicates that the QU load scenario is slightly more stable than the random load scenario, this difference is not visually significant in the plots. The DQN agent, while providing a solid performance, exhibits greater fluctuations in reward compared to the A2C and PPO agents.
In comparison, the A2C agent, as shown in Figure 7, demonstrated less fluctuation and achieved faster convergence than the DQN agent. The maximum cumulative rewards were 0.909 for Case-2A and 0.965 for Case-2B, with standard deviations of 0.1218 and 0.0756, respectively. A2C exhibited a more stable learning performance, particularly when trained with real-world EV demand in Case-2B, as opposed to random demand in Case-1A. This improved stability and higher accumulated reward are consistent with previous findings in reinforcement learning for EV charging optimization [23].
Lastly, the PPO agent, as seen in Figure 8, achieves a maximum cumulative reward of 0.660 in Case-3A, with a low standard deviation of 0.0370. In Case-3B, the reward increased to 0.738, but with a slightly higher standard deviation of 0.0767. Although PPO performed better with real-world demand, its reward values were slightly lower than those of DQN and A2C. The lower standard deviation observed in PPO, particularly under real-world demand, suggests a smoother training process, though with a trade-off in peak values.

5.2. Testing Performance and Benchmarking

The RL framework is tested for the different case scenarios included in Table 3. The proposed charging method accounts for variations in arrival and departure times in order to maximize energy delivery and reduce peak charging currents. The main objective is to reduce the required charging infrastructure compared to the other algorithms.

5.2.1. Shaving of Charging Peak

The results for RL output are generated for each RL algorithm (DQN, A2C, and PPO). The maximum aggregated currents are plotted for Case-1A, Case-1B, and Case-1C, as shown in Figure 9, Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14. To investigate the impact of different EV demand models on the performance of reinforcement learning (RL)-trained agents, three distinct RL algorithms were compared under two scenarios. In scenario one, Case A is considered, which uses random EV demand. In scenario two, Case B is considered, which utilizes historical stochastic demand data from QU.
For Case-1A to Case-3A, as outlined in Table 4, the results show that the uncontrolled system experienced a maximum aggregated current of 128 A during charging events. The DQN agent (Case-1A) reduced the aggregated current by 37.5%, bringing it down to 80 A while maintaining 100% energy delivery. Comparing the methods, RL and MPC_QC handle the situation more smoothly than LLF and uncontrolled charging.
For the A2C agent considered in Case-2A, the maximum aggregated current was reduced by 25%, lowering the charging current to 96 A while also delivering 100% of the requested energy. Conversely, the PPO agent considered in Case-3A achieved a reduction of 38.28% to 79 A, but with a slight 4.11% decrease in energy delivery.
For the QU historical data (stochastic demand) in Table 4, the uncontrolled system required a higher aggregated current of 160 A; here, the downward arrow, ↓, denotes a reduction in maximum current. The DQN agent (Case-1B) reduced the current by 25% to 120 A while ensuring full energy delivery. Similarly, A2C (Case-2B) reduced the current by 25%, with only a minor 0.039% decrease in energy delivery. The PPO agent (Case-3B) demonstrated the most substantial reduction, cutting the current by 50% to 80 A, though with a slight 0.29% reduction in energy delivery.
The results in Table 5 were compared between the two demand models (random vs. historical) and the uncontrolled system. The findings indicate that charging demand was higher in Case-1B to Case-3B, reflecting the increased charging requirements based on historical data from QU. However, the RL agents consistently reduced the maximum charging current across both demand cases. The RL agent, particularly PPO, showed greater reductions in current in Case-3B, likely due to the more complex and variable nature of the historical demand. This, however, was accompanied by a trade-off in energy delivery, with PPO exhibiting the most notable reduction in delivered energy across both scenarios.
The experimental results show that all RL-based algorithms can effectively shave the peak charging current, but the PPO algorithm has the largest peak-shaving effect at the cost of slightly lower energy delivery. This follows from the nature of the peak-shaving problem and of the PPO algorithm itself. PPO's clipped objective function prevents overly large policy updates, which benefits stable learning but can lead to conservative decision-making: PPO learns to avoid high instantaneous charging currents much more strongly than the DQN and A2C agents, so some EVs are charged slightly late. DQN and A2C do not clip policy updates and may therefore tolerate high instantaneous loads in order to preserve energy delivery.
Moreover, the peak-shaving task conflicts with unrestricted energy satisfaction when the transformer or transmission line current approaches its maximum rating. The PPO reward function strongly discourages charging during periods of high EV arrival concurrency or low solar generation, training the agent to prioritize overall grid safety. This may cause a small amount of unfulfilled demand, depending on departure times.
In conclusion, the RL agents effectively reduced the peak charging current in both demand scenarios. While the historical demand imposed higher overall charging requirements, the RL algorithms, especially PPO, achieved significant reductions in current, albeit at the cost of a slight decrease in energy delivery.

5.2.2. Utilization of Solar Energy

During charging, the RL agent is first tested to verify that charging rates are high during peak hours of solar energy generation. At the time instant t = 898 min, the maximum charging powers under the MPC_QC, uncontrolled (Unctrl), and RL algorithms are 21.16 kW, 39.94 kW, and 30.16 kW, respectively, as shown in Figure 15.

5.2.3. Real Data Used for Applying RL (Caltech University)

For a real-world application, we applied the proposed reinforcement learning (RL) model to a university campus with actual electric vehicle (EV) chargers and charging behavior. The model was trained using a real dataset (Case-1C), specifically, an event table from Caltech University’s publicly available database [26]. The training was conducted over 10,000 iterations (500 epochs), achieving a reward of 0.96 by the end of the training. The results in Figure 16 show that the model achieved a low standard deviation of 0.024 and an average cumulative reward of 0.939. The event table is in Table 6.
When simulating the RL model with this real-world data, the results in Figure 17 show that the RL model outperformed the MPC-QC during peak solar hours by utilizing more solar energy and thus reducing infrastructure requirements by 49.1%.
However, the energy delivered by the system, including the RL model, is 67.49%, which is slightly lower than the MPC algorithm, which is 71.05%, as listed in Table 7.

5.2.4. Grid Impact Analysis for Large-Scale Adoption

The output of the RL charging management model is integrated into the IEEE 33 bus system in MATLAB-2024b to run the load flow simulation. The load flow simulation makes use of the accumulated charging demand curve produced by the PPO-RL agent for the base Case-3A, which is the case corresponding to random demand. It runs over a 1440 min (24 h) duration as presented in Figure 15. The analysis compares three charging strategies, which are uncontrolled, MPC, and RL. These charging strategies are compared under two operational modes: (i) without and (ii) with solar energy integration:
(i)
Without solar integration: The results obtained in Figure 15 are applied to the IEEE-33 bus system [36], and load flow analysis is performed to obtain the voltage level at each bus. The impact analysis considers EV chargers on buses without solar energy or V2G (Figure 18). The largest voltage drop occurs at bus-18. The RL method reduces this voltage drop by 0.05 p.u. and 0.045 p.u. compared to the uncontrolled and MPC methods, respectively: it raises the minimum voltage from 0.907 p.u. (uncontrolled) and 0.912 p.u. (MPC) up to 0.957 p.u., thereby improving voltage stability in the feeder. With respect to line losses, Figure 19 shows that the peak line losses occur between bus 2 and bus 3. The proposed RL approach reduces the peak line loss by 17% compared to the MPC approach and by 32% compared to the uncontrolled charging scheme. Figure 20 shows that the highest-magnitude line current (on the same line section) is decreased by 9% under the RL method compared to the MPC method, which reduces thermal violations in the distribution network.
(ii)
With solar integration: A noticeable improvement in the voltage stability margin occurs when each EV charging station is accompanied by an embedded PV scheme. Figure 21 shows that, during the peak sun period (around noon), the RL algorithm increases the minimum bus voltage at bus 18 from 0.937 p.u. (without solar) to 0.971 p.u., an improvement of 3.6%, outperforming both MPC and the uncontrolled approach. Line losses and the peak line current are further reduced with solar power injection. As seen in Figure 22, the peak line loss is decreased by a further 1.95% for the RL + solar system, with the largest loss reduction occurring on the section from bus 2 to bus 3. Figure 23 shows that the peak line currents are further reduced by 1% with solar integration.
The significant improvement of the PPO agent in reducing the aggregate peak current by 50% compared to the Model Predictive Control (MPC) is a very important contribution. This leads to a reduction in the stress on the transformer and the distribution lines, which may be caused by the delayed upgrades and the reduced demand charges faced by the campus authority. The scalability test performed on the IEEE-33 bus system is a further confirmation of the same and has resulted in a 17% reduction in the line losses.
However, such an aggressive peak-shaving policy comes at a cost: a small loss in delivered energy. PPO, A2C, and DQN all incurred some loss in energy delivery, which remained very close to 100% for A2C and DQN; for PPO, it dropped slightly, to 99.71% for the QU dataset and 67.49% for the Caltech online dataset under the settings above. This is neither a weakness of these algorithms nor a fundamental limitation; it results from the policy-learning process they adopt. The PPO reward function, which strongly penalizes peak-current violations, delays or modulates charging during periods of low solar availability and high current demand so that the grid constraints are not violated.
The experimental results therefore leave the system operator with two design choices: adopt the PPO policy to ensure grid stability, or prioritize user guarantees with an A2C/DQN-based policy. The PPO policy works best when the supply capacity is constrained or when EV integration is at an early stage; the relatively small energy deficit (less than 0.3% in an orderly setting such as QU) may be tolerated, as most EVs still leave campus with a sufficient charge level. The A2C/DQN-based policy guarantees user satisfaction in all cases but requires a grid flexible enough to handle additional peaks; it suits robust infrastructure networks where a user-centric environment is preferred, or production settings where user experience is essential.
The reduced energy delivery (67.49%) in the Caltech online case study deserves specific attention. This case reflects actual consumption patterns, including possibly short intervals between arrival and departure, and the conservative behavior of the PPO agent resulted in more unmet demand. A fixed reward function is not necessarily the best strategy for every degree of unmet demand, which indicates a genuine limitation of the PPO agent here. A more sophisticated strategy could combine a dynamic or adaptive reward function, adjusting the weights on unmet-demand metrics according to perceived urgency, potentially using parking time and requested energy as factors.

6. Conclusions

This paper develops a smart EV charging management system deploying artificial intelligence, namely reinforcement learning (RL). The developed RL agent achieved the target objectives, including energy delivery to EVs, solar energy utilization, and peak reduction in charging currents. The focus of the study is to train the RL agent for peak reduction while satisfying the other objectives to optimize the infrastructure expansion required for EV charging.
The proposed model is trained using real QU campus data, and the results are benchmarked against uncontrolled and Model Predictive Control (MPC) strategies. The results show an optimized charging infrastructure while delivering the energy requested by the users, and demonstrate a reduction in the maximum aggregated current, which places less stress on the grid. One of the key contributions of this study is the comparison of three RL algorithms under real-world demand, identifying the appropriate agent and reward function to achieve the required objectives. The model targets campus charging, which exhibits specific behavior, and the study uses real-world data that includes actual arrival and departure times.
The results show the importance of selecting an appropriate RL agent along with the realistic demand patterns while optimizing the reward for the EV charging management problem. The PPO-RL agent learned faster compared to A2C and DQN and achieved a higher reward. The trained agent was tested against existing algorithms and achieved the required objective, leading to 42% more solar utilization compared with MPC charging.
This study bridges a common research gap by addressing scalability, where the PPO-RL agent is integrated into the IEEE-33 bus power system for evaluation. For the case study, the RL-based EV charging management model improved the voltage drop by 0.05 p.u. and reduced the line losses by 17% compared to the MPC benchmark method. The RL agent trained for solar energy utilization was tested; the voltage levels were higher during peak hours, and further reduction was reflected in the line losses and currents.
The obtained results are based on the training datasets from two universities, and it is recommended that future studies should test the model considering other regions and the effectiveness of the geographical location on the proposed model. The model can be extended to larger-scale real-world condition networks, incorporating battery degradation, V2G technology, DC microgrids, and dynamic pricing for further enhancement. Continued research can explore advanced techniques such as two-stage multi-agent DRL [42], distributed observer-based control with RL [43], and Seq2Seq-based deep learning frameworks [44], where these approaches can enhance real-world applications and enhance the performance of the RL-based EV charging management system.
In summary, the study highlights that the RL-based EV charging management system increases grid reliability by maximizing the use of renewable energy and reducing the burden on the grid, and thus the power infrastructure required to accommodate future EV demand. However, the charging management is trained on the QU stochastic data and the Caltech online data, and it may not transfer directly to other environments, such as residential areas, business centers, or public fast-charging stations, without further training. The reward function in the present study is static, whereas in practice grid conditions, energy prices, and user priorities vary over time; a fixed reward structure will probably not adapt well to long-term changes in demand or policy, so a future system may require meta-RL or online reward adaptation. Further, user preferences, such as priority levels (e.g., emergency charging), are not included in the model, although they are significant for user acceptance. The proposed RL model also does not consider the long-term effects of the charging rate on EV battery health: frequently charging the battery partially, or at a high rate during periods of low grid load, may affect battery lifespan, and this will be considered in future work when formulating the reward function. Future research could also integrate the proposed RL framework with other energy system design and economic analysis tools, such as HOMER Grid. This would enable a holistic techno-economic analysis of campus EV charging infrastructure under different electricity tariffs, renewable energy incentives, and long-term investment horizons, and an accurate calculation of the net present value (NPV), payback period, and cost implications of the RL-based EV charging system compared to traditional uncontrolled or rule-based approaches.
Furthermore, the proposed method could be validated through distribution utility analysis software (for instance, OpenDSS 9.8.0.1, CYMDIST 8.0, or other load flow software), which would allow a more detailed evaluation of how well the proposed method performs in a real-world scenario with specific constraints such as fault levels, protection schemes, phase imbalance, and transformer aging effects that could be considered as blind spots in a generalizable RL model. Such field tests would significantly contribute to the overall translatability of this work from a theoretical to a practical campus/community-scale microgrid application.

Author Contributions

Methodology, H.M.A. and A.G.; software, H.M.A.; validation, H.M.A. and S.I.; formal analysis, H.M.A., A.G., L.B.-B. and S.I.; investigation, H.M.A. and S.I.; resources, H.M.A.; data curation, H.M.A.; writing—original draft preparation, H.M.A.; writing—review and editing, A.G. and S.I.; supervision, A.G.; project administration, A.G. and L.B.-B.; funding acquisition, L.B.-B. All authors have read and agreed to the published version of the manuscript.

Funding

The research reported in this paper was supported by the Qatar Research Development and Innovation Council [NPRP14S-0305-210019].

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The content is solely the responsibility of the authors and does not necessarily represent the official views of the Qatar Research Development and Innovation Council.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
EV: Electric vehicle
RL: Reinforcement learning
AI: Artificial intelligence
ML: Machine learning
DQN: Deep Q-Network
DNN: Deep neural network
PPO: Proximal policy optimization
A2C: Advantage actor–critic
ACN: Adaptive charging network
QU: Qatar University
PV: Photovoltaic (solar energy)
MDP: Markov Decision Process
S: State space in MDP
A: Action space in MDP
R: Reward function in MDP
T: State transition function in MDP
V(s_t): Value function at time step t
A(s_t, a_t): Advantage function at time step t
r_t: Reward at time step t
I_EV: Charging current of EV
I_Agg: Aggregate charging current across all chargers
P_EV: Power drawn by EV charger
P_PV: Solar photovoltaic power
V_EVCS: Voltage at EV charging station
I_Tmax: Maximum transformer current rating
Q(s, a; θ): Q-value function approximated by a deep neural network
γ: Discount factor in reinforcement learning
α: Learning rate in reinforcement learning
ϵ: Clipping hyperparameter in PPO
V2G: Vehicle-to-grid

References

  1. Tedesco, R.; Schweitzer, J.-P.; Sommer, P.; Saar, D. What is the Environmental Impact of Electric Cars? In Factsheet; European Environmental Bureau (EEB): Brussels, Belgium; Environmental Coalition on Standards (ECOS): Brussels, Belgium; Deutsche Umwelthilfe (DUH): Brussels, Belgium, 2023; Available online: https://eeb.org/library/policy-brief-what-is-the-environmental-impact-of-electric-cars/ (accessed on 1 February 2026).
  2. Shao, S.; Zhang, T.; Pipattanasomporn, M.; Rahman, S. Impact of TOU rates on distribution load shapes in a smart grid with PHEV penetration. In Proceedings of the IEEE PES Transmission and Distribution Conference and Exposition, New Orleans, LA, USA, 19–22 April 2010; pp. 1–6. [Google Scholar] [CrossRef]
  3. García-Villalobos, J.; Zamora, I.; Martín, J.I.S.; Asensio, F.J.; Aperribay, V. Plug-in electric vehicles in electric distribution networks: A review of smart charging approaches. Renew. Sustain. Energy Rev. 2014, 38, 717–731. [Google Scholar] [CrossRef]
  4. Fernandez, G.S.; Krishnasamy, V.; Kuppusamy, S.; Ali, J.S.; Ali, Z.M.; El-Shahat, A.; Abdel Aleem, S.H. Optimal Dynamic Scheduling of Electric Vehicles in a Parking Lot Using Particle Swarm Optimization and Shuffled Frog Leaping Algorithm. Energies 2020, 13, 6384. [Google Scholar] [CrossRef]
  5. Yilmaz, M.; Krein, P.T. Review of Battery Charger Topologies, Charging Power Levels, and Infrastructure for Plug-In Electric and Hybrid Vehicles. IEEE Trans. Power Electron. 2013, 28, 2151–2169. [Google Scholar] [CrossRef]
  6. Abdullah, H.M.; Gastli, A.; Ben-Brahim, L. Reinforcement Learning Based EV Charging Management Systems-A Review. IEEE Access 2021, 9, 41506–41531. [Google Scholar] [CrossRef]
  7. Wan, Z.; He, H.L.H.; Prokhorov, D. Model-Free Real-Time EV Charging Scheduling Based on Deep Reinforcement Learning. IEEE Trans. Smart Grid 2019, 10, 5246–5257. [Google Scholar] [CrossRef]
  8. Dang, Q.; Wu, D.; Boulet, B. A Q-Learning Based Charging Scheduling Scheme for Electric Vehicles. In Proceedings of the 2019 IEEE Transportation Electrification Conference and Expo (ITEC), IEEE, Detroit, MI, USA, 19–21 June 2019; pp. 1–5. [Google Scholar] [CrossRef]
  9. Liu, D.; Zeng, P.; Cui, S.; Song, C. Deep Reinforcement Learning for Charging Scheduling of Electric Vehicles Considering Distribution Network Voltage Stability. Sensors 2023, 23, 1618. [Google Scholar] [CrossRef]
  10. Li, H.; Dai, X.; Goldrick, S.; Kotter, R.; Aslam, N.; Ali, S. Reinforcement Learning for EV Fleet Smart Charging with On-Site Renewable Energy Sources. Energies 2024, 17, 5442. [Google Scholar] [CrossRef]
  11. Zhang, W.; Quan, H.; Gandhi, O.; Rodríguez-Gallegos, C.D.; Srinivasan, D.; Weng, Y. Dynamic and fast electric vehicle charging coordinating scheme, considering V2G based var compensation. In Proceedings of the 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2), Beijing, China, 26–28 November 2017; pp. 1–6. [Google Scholar] [CrossRef]
  12. Seddig, K.; Jochem, P.; Fichtner, W. Integrating renewable energy sources by electric vehicle fleets under uncertainty. Energy 2017, 141, 2145–2153. [Google Scholar] [CrossRef]
  13. Erdogan, N.; Erden, F.; Kisacikoglu, M. A fast and efficient coordinated vehicle-to-grid discharging control scheme for peak shaving in power distribution system. J. Mod. Power Syst. Clean Energy 2018, 6, 555–566. [Google Scholar] [CrossRef]
  14. Liu, L.; Kong, F.; Liu, X.; Peng, Y.; Wang, Q. A review on electric vehicles interacting with renewable energy in smart grid. Renew. Sustain. Energy Rev. 2015, 51, 648–661. [Google Scholar] [CrossRef]
  15. Kene, R.O.; Olwal, T.O. Data-Driven Modeling of Electric Vehicle Charging Sessions Based on Machine Learning Techniques. World Electr. Veh. J. 2025, 16, 107. [Google Scholar] [CrossRef]
  16. Singh, A.R.; Kumar, R.S.; Madhavi, K.R.; Alsaif, F.; Bajaj, M.; Zaitsev, I. Optimizing demand response and load balancing in smart EV charging networks using AI integrated blockchain framework. Sci. Rep. 2024, 14, 31768. [Google Scholar] [CrossRef] [PubMed]
  17. Zhang, C.; Kuppannagari, S.R.; Xiong, C.; Kannan, R.; Prasanna, V.K. A cooperative multi-agent deep reinforcement learning framework for real-time residential load scheduling. In IoTDI 2019, Proceedings of the 2019 International Conference on Internet of Things Design and Implementation, Montreal, QC, Canada, 15–18 April 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 59–69. [Google Scholar] [CrossRef]
  18. Li, H.; Wan, Z.; He, H. Constrained EV Charging Scheduling Based on Safe Deep Reinforcement Learning. IEEE Trans. Smart Grid 2019, 11, 2427–2439. [Google Scholar] [CrossRef]
  19. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 1, pp. 30–47. [Google Scholar]
  20. Li, H.; Li, G.; Lie, T.T.; Li, X.; Wang, K.; Han, B.; Xu, J. Constrained large-scale real-time EV scheduling based on recurrent deep reinforcement learning. Int. J. Electr. Power Energy Syst. 2023, 144, 108603. [Google Scholar] [CrossRef]
  21. Lee, Z.J.; Johansson, D.; Low, S.H. ACN-Sim: An open-source simulator for data-driven electric vehicle charging research. In Proceedings of the 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids, SmartGridComm, Phoenix, AZ, USA, 25–28 June 2019. [Google Scholar] [CrossRef]
  22. Richards, T.; Treuille, A. Streamlit for Data Science: Create Interactive Data Apps in Python; Packt Publishing: Birmingham, UK, 2023. [Google Scholar]
  23. Johnson, J.D.; Li, J.; Chen, Z. Reinforcement Learning: An Introduction: R.S. Sutton, A.G. Barto, MIT Press, Cambridge, MA 1998, 322 pp. ISBN 0-262-19398-1. Neurocomputing 2000, 35, 205–206. [Google Scholar] [CrossRef]
  24. Lee, Z.J.; Sharma, S.; Johansson, D.; Low, S.H. ACN-Sim: An Open-Source Simulator for Data-Driven Electric Vehicle Charging Research. IEEE Trans. Smart Grid 2021, 12, 5113–5123. [Google Scholar] [CrossRef]
  25. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym Beta. arXiv 2016, arXiv:1606.01540. Available online: https://arxiv.org/abs/1606.01540 (accessed on 1 February 2026).
  26. Lee, Z.J.; Li, T.; Low, S.H.; Lee, Z.J. ACN-Data: Analysis and Applications of an Open EV Charging Dataset. In e-Energy’19: Proceedings of the Tenth ACM International Conference on Future Energy Systems, Phoenix, AZ, USA, 25–28 June 2019; Association for Computing Machinery: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
  27. Streamlit Documentation. Streamlit. “Streamlit Documentation.” Version 1.28.0. Streamlit Inc., San Francisco, CA, USA, 2023. Available online: https://docs.streamlit.io/ (accessed on 1 February 2026).
  28. Vázquez-Canteli, J.R.; Nagy, Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl. Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
  29. Abdullah, H.M.; Kamel, R.M.; Tahir, A.; Sleit, A.; Gastli, A. The Simultaneous Impact of EV Charging and PV Inverter Reactive Power on the Hosting Distribution System’s Performance: A Case Study in Kuwait. Energies 2020, 13, 4409. [Google Scholar] [CrossRef]
  30. Baran, M.E.; Wu, F.F. Network reconfiguration in distribution systems for loss reduction and load balancing. IEEE Trans. Power Deliv. 1989, 4, 1401–1407. [Google Scholar] [CrossRef]
  31. Das, D.; Kothari, D.P.; Kalam, A. Simple and efficient method for load flow solution of radial distribution networks. Int. J. Electr. Power Energy Syst. 1995, 17, 335–346. [Google Scholar] [CrossRef]
  32. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  33. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. arXiv 2016, arXiv:1602.01783. [Google Scholar] [CrossRef]
  34. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  35. Andrychowicz, M.; Raichuk, A.; Stańczyk, P.; Orsini, M.; Girgin, S.; Marinier, R.; Hussenot, L.; Geist, M.; Pietquin, O.; Michalski, M.; et al. What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. arXiv 2020, arXiv:2006.05990. [Google Scholar] [CrossRef]
  36. PPO2—Stable Baselines 2.10.3a0 Documentation. Available online: https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html (accessed on 1 February 2026).
  37. Charging Network—ACN Research Portal 0.1 Documentation. Available online: https://acnportal.readthedocs.io/en/latest/acnsim/models.html (accessed on 1 February 2026).
  38. Isong, U.; Nseobong, O.; Oritsetimeyin, P. The IEEE 33 Bus Distribution System Load Flow Analysis Using Newton Raphson Method. J. Multidiscip. Eng. Sci. Technol. 2023, 10, 2458–9403. [Google Scholar]
  39. Abdullah, H.M.; Gastli, A.; Ben-Brahim, L. Smart Management of Electric Vehicle Chargers Through Reinforcement Learning. In Proceedings of the 2024 4th International Conference on Smart Grid and Renewable Energy (SGRE), Doha, Qatar, 8–10 January 2024; pp. 1–8. [Google Scholar] [CrossRef]
  40. ACN-Data—A Public EV Charging Dataset. Available online: https://ev.caltech.edu/dataset (accessed on 1 February 2026).
  41. Smart Charging Network for EVs Installed at Caltech. Available online: https://www.caltech.edu/about/news/smart-charging-network-evs-installed-caltech-50567 (accessed on 1 February 2026).
  42. Gao, H.; Jiang, S.; Li, Z.; Wang, R.; Liu, Y.; Liu, J. A Two-stage Multi-agent Deep Reinforcement Learning Method for Urban Distribution Network Reconfiguration Considering Switch Contribution. IEEE Trans. Power Syst. 2024, 39, 7064–7076. [Google Scholar] [CrossRef]
  43. Li, X.; Hu, C.; Luo, S.; Lu, H.; Piao, Z.; Jing, L. Distributed Hybrid-Triggered Observer-Based Secondary Control of Multi-Bus DC Microgrids over Directed Networks. IEEE Trans. Circuits Syst. I Regul. Pap. 2025, 72, 2467–2480. [Google Scholar] [CrossRef]
  44. Yang, N.; Hao, J.; Li, Z.; Ye, D.; Xing, C.; Zhang, Z.; Wang, C.; Huang, Y.; Zhang, L. Data-Driven Decision-Making for SCUC: An Improved Deep Learning Approach Based on Sample Coding and Seq2Seq Technique. Prot. Control Mod. Power Syst. 2025, 10, 13–24. [Google Scholar] [CrossRef]
Figure 1. Reinforcement learning cycle.
Figure 2. Pictorial representation of parameters included in the problems related to EV charging coordination using various RL techniques.
Figure 3. Block diagram of various energy systems included in this study [23].
Figure 4. Proposed RL framework.
Figure 5. QU campus vehicle distribution per parking duration hours (1.5 to 8.5 h).
Figure 6. Different cumulative average rewards generated by DQN-RL agents of different EV demand loads (random and Qatar University).
Figure 7. Different cumulative average rewards generated by the A2C-RL agents of different EV demand loads (random and Qatar University).
Figure 8. Different cumulative average rewards generated by the PPO-RL agents of different EV demand loads (random and Qatar University).
Figure 9. Case-1A DQN-Random demand for 5 EVCS and 10 events.
Figure 10. Case-1B DQN-QU demand for 5 EVCS and 10 events.
Figure 11. Case-2A A2C-Random demand for 5 EVCS and 10 events.
Figure 12. Case-2B A2C-QU demand for 5 EVCS and 10 events.
Figure 13. Case-3A PPO-Random demand for 5 EVCS and 10 events.
Figure 14. Case-3B PPO-QU demand for 5 EVCS and 10 events.
Figure 15. Case-3A PPO with solar utilization-aggregated power during peak hours of solar energy generation.
Figure 16. Cumulative reward generated by the trained RL agent for (Case-1C).
Figure 17. Comparing the combined current of RL and benchmark algorithms using PPO (Case-1C).
Figure 18. Voltage levels at each bus of the IEEE-33 network after applying the RL charging method (without solar/V2G).
Figure 19. Line losses of the IEEE 33 bus system for different charging methods.
Figure 20. Line currents of the IEEE 33 bus system for different charging methods.
Figure 21. Voltage levels at each bus of the IEEE-33 network for different charging methods with solar integration.
Figure 22. Line losses of each bus of the IEEE-33 network for different charging methods with solar integration.
Figure 23. Line currents feeding each bus of the IEEE-33 network for different charging methods with solar integration.
Table 2. Dataset statistics.

| Dataset Name | Duration | Step Period | Dataset Size (Elements) | Standard Deviation | Mean/Max |
| --- | --- | --- | --- | --- | --- |
| Solar power | 1 year (2018) | 5 min | 105,120 | 157.60 | 113.745/531.72 kW |
| Active power | 1 year (2018) | 1 h | 8760 | 1.80 | 5.869/18.60 MW |
| Reactive power | 1 year (2018) | 1 h | 8760 | 1.12 | 2.779/11.33 MVar |
Table 3. RL testing and validation used in the case study.

| EV Demand Source | Case | Algorithm | Period | Chargers | Events | Transformer | Arrival/Departure Pattern |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Random | Case-1A | DQN | 1440 | 5 | 10 | 150 MVA | Random |
| Random | Case-2A | A2C | 100 | 10 | 3 | 150 MVA | Random |
| Random | Case-3A | PPO | 100 | 20 | 3 | 150 MVA | Random |
| QU | Case-1B | DQN | 1440 | 5 | 10 | 150 MVA | Stochastic |
| QU | Case-2B | A2C | 1440 | 10 | 3 | 150 MVA | Stochastic |
| QU | Case-3B | PPO | 1440 | 20 | 3 | 150 MVA | Stochastic |
| Caltech | Case-1C | PPO | 1440 | Online data | Online data | 150 MVA | Online |
Table 4. Performance comparison of RL algorithms vs. uncontrolled charging for random EV demand.

| Case | Algorithm | Max Charging Current (A) | Energy Delivered (%) | % Reduction in Max Current | % Change in Energy Delivered |
| --- | --- | --- | --- | --- | --- |
| - | Unctrl | 128 | 100 | - | - |
| 1A | DQN | 80 | 100 | 37.5% ↓ | 0% |
| 2A | A2C | 96 | 100 | 25.0% ↓ | 0% |
| 3A | PPO | 79 | 95.89 | 38.28% ↓ | −4.11% ↓ |

The symbol ↓ indicates a percentage reduction compared to the uncontrolled charging baseline in each respective demand scenario.
Table 5. Performance comparison of RL algorithms vs. uncontrolled charging for QU historical EV demand.

| Case | Algorithm | Max Charging Current (A) | Energy Delivered (%) | % Reduction in Max Current | % Change in Energy Delivered |
| --- | --- | --- | --- | --- | --- |
| - | Unctrl | 160 | 100 | - | - |
| 1B | DQN | 120 | 100 | 25.0% ↓ | 0% |
| 2B | A2C | 120 | 99.961 | 25.0% ↓ | −0.039% ↓ |
| 3B | PPO | 80 | 99.710 | 50% ↓ | −0.29% ↓ |

The symbol ↓ indicates a percentage reduction compared to the uncontrolled charging baseline in each respective demand scenario.
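The headline percentages in Tables 4 and 5 follow directly from the reported peak currents and delivered-energy figures, which can be checked in a few lines:

```python
# Reproduce the % reduction and % change figures in Tables 4 and 5
# from the reported peak charging currents and energy-delivery values.

def pct_reduction(base, new):
    """Percentage reduction of `new` relative to `base`, rounded to 2 dp."""
    return round(100 * (base - new) / base, 2)

# Random demand (Table 4): uncontrolled 128 A vs PPO 79 A.
# QU historical demand (Table 5): uncontrolled 160 A vs PPO 80 A.
print(pct_reduction(128, 79), pct_reduction(160, 80))   # 38.28 50.0

# Energy-delivery penalty of PPO in each scenario (100% baseline).
print(round(100 - 95.89, 2), round(100 - 99.710, 2))    # 4.11 0.29
```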
Table 6. The event table using dataset from Caltech University (Case-1C).

| # | Departure | Arrival | Station ID | Requested Energy (kWh) |
| --- | --- | --- | --- | --- |
| 0 | 299 | 15 | EVSE-001 | 8 |
| 1 | 267 | 134 | EVSE-002 | 50 |
| 2 | 276 | 237 | EVSE-003 | 48 |
| 3 | 1188 | 559 | EVSE-004 | 60 |
| 4 | 1385 | 829 | EVSE-005 | 16 |
| 5 | 1311 | 844 | EVSE-006 | 21.84 |
| 6 | 1121 | 912 | EVSE-001 | 14 |
| 7 | 1233 | 929 | EVSE-007 | 20 |
| 8 | 1431 | 944 | EVSE-008 | 28 |
| 9 | 1188 | 948 | EVSE-009 | 6.81 |
Table 7. RL and benchmark charging schemes generating aggregated currents.

| Charging Method | Energy Delivered (kWh) | Max Charging Rate | % Delivery |
| --- | --- | --- | --- |
| RL | 184.007 | 116 | 67.49% |
| MPC_Mosek | 193.7308 | 82.775 | 71.05% |
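As a consistency check, the "% Delivery" column can be reproduced by dividing each method's delivered energy by the total energy requested across the ten sessions in Table 6:

```python
# Cross-check Table 7's "% Delivery" against the per-session energy
# requests listed in Table 6.

requested = [8, 50, 48, 60, 16, 21.84, 14, 20, 28, 6.81]   # kWh, Table 6
total_requested = sum(requested)                            # 272.65 kWh

delivery_pct = {
    method: round(100 * delivered / total_requested, 2)
    for method, delivered in [("RL", 184.007), ("MPC_Mosek", 193.7308)]
}
print(total_requested, delivery_pct)  # {'RL': 67.49, 'MPC_Mosek': 71.05}
```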
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Abdullah, H.M.; Gastli, A.; Ben-Brahim, L.; Islam, S. Peak Shaving and Solar Utilization for Sustainable Campus EV Charging Using Reinforcement Learning Approach. Sustainability 2026, 18, 2737. https://doi.org/10.3390/su18062737

