# Optimal Management for EV Charging Stations: A Win–Win Strategy for Different Stakeholders Using Constrained Deep Q-Learning


## Abstract


## 1. Introduction

- In contrast to prior strategies [19,20,21,22], the proposed strategy is a win–win for both stakeholders, i.e., the EV owners and the EV charging station operators. Fulfilling charging demands under agreed conditions is prioritized, and profit maximization from the charging station operator’s perspective follows.
- Although direct benchmarking against previously published work is difficult because of the different operating conditions and data used, the financial benefit achieved for the charging station herein is considerable and comparable to the profit reported in the literature [22].
- A new training scheme is proposed for the Q-learning algorithm. The constraints imposed guarantee customer satisfaction, which is removed from the optimization objective to allow the RL agent to maximize EV charging station profit.
- The proposed strategy is easy to adjust, and a different balance/prioritization between the stakeholders' needs can be selected (see Equation (13)).

## 2. System Model

#### 2.1. EV Charging Station Environment

- EVs are price-sensitive; i.e., they adjust their charging demands based on the value of ${r}_{t}$ provided by the station. Thus, ${d}_{i}={D}_{i}\left({r}_{t}\right)$, where ${D}_{i}(\cdot): \$/\mathrm{kWh}\to \mathrm{kWh}$ is the demand–response function of EV $i$. Obviously, if EV $i$ decides not to accept the presented rate, then ${d}_{i}=0$. Additionally, note that the demand–response function is EV-specific in the general case.
- The price rate ${r}_{t}$ presented to ${\mathcal{I}}_{t}$ will be constant for each EV in ${\mathcal{I}}_{t}$ during its parking time.
- There is a fixed and finite number of individual chargers at the station, $N$. Thus, for all time slots $t$, $|{\mathcal{K}}_{t}|\le N$, which means that at any given time, at most $N$ EVs are parked at the station. Suppose the number of EVs, $|{\mathcal{I}}_{t}|$, that arrive at the station overflows the available chargers. In that case, a subset of ${\mathcal{I}}_{t}$ is selected, in a first-come-first-served manner, to meet the parking capacity of the station.
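The demand–response behavior described above can be sketched in Python. The linear-plus-noise form below is an assumption for illustration, suggested by the units of the demand–response parameters (${\beta}_{1}$ in kWh/\$, ${\beta}_{2}$ in kWh, $\sigma$ a standard deviation); it is not necessarily the authors' exact function:

```python
import random

def demand_response(rate, beta1, beta2, sigma, rng=None):
    """Hypothetical linear demand-response function D_i(r_t) -> kWh.

    rate  -- announced price rate r_t in $/kWh
    beta1 -- price sensitivity in kWh/$ (negative: demand falls with price)
    beta2 -- baseline demand in kWh
    sigma -- std of Gaussian noise modelling EV-to-EV variation
    """
    rng = rng or random.Random(0)
    demand = beta1 * rate + beta2 + rng.gauss(0.0, sigma)
    return max(0.0, demand)  # d_i = 0 means the EV rejects the offered rate

# A "Normal"-type EV (beta1 = -4 kWh/$, beta2 = 15 kWh) requests less
# energy as the announced price rate grows:
cheap = demand_response(0.10, beta1=-4, beta2=15, sigma=0.0)
pricey = demand_response(2.00, beta1=-4, beta2=15, sigma=0.0)
```

With the noise disabled, the sketch makes the price sensitivity visible: raising the rate from 0.10 to 2.00 \$/kWh cuts the requested demand roughly in half, and a sufficiently high rate drives it to zero (the EV declines to charge).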

#### 2.2. Problem Formulation Using the MDP Framework

- State/Observation Space

- The EVs that are parked at the station ${\mathcal{J}}_{t}$, along with the residual charging demand ${\tilde{d}}_{j}^{t}$ and parking time ${\tilde{p}}_{j}^{t}$ for each EV $j\in {\mathcal{J}}_{t}$
- The newly arrived EVs, ${\mathcal{I}}_{t}$
- The last 24 h of values of the electricity price time series. Under the assumption that the electricity price changes every $\Delta t$ slots, the 24 h historical values can be represented by:$${c}_{t},{c}_{t-\Delta t},{c}_{t-2\Delta t},\dots ,{c}_{t-M\Delta t}$$where$$M=\frac{24\cdot 60}{{t}_{\mathrm{len}}\,\Delta t}$$
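The window length $M$ follows directly from the slot length; a quick check with hypothetical values ($t_{\mathrm{len}}=5$ min and hourly price updates, i.e., $\Delta t = 12$ slots):

```python
def history_window_length(t_len_min, delta_t_slots):
    """Number of historical price samples spanning the last 24 h.

    t_len_min     -- slot length t_len in minutes
    delta_t_slots -- slots between electricity-price changes (Delta t)
    """
    return (24 * 60) // (t_len_min * delta_t_slots)

# 5 min slots, hourly price updates: the 24 h window is covered by
# M = 1440 / (5 * 12) = 24 samples c_t, c_{t-12}, ..., c_{t-M*12}.
M = history_window_length(5, 12)
```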

- Action Space

- Reward Modeling

## 3. Proposed Solution

#### 3.1. Constrained Least Laxity First

**Algorithm 1:** Constrained least laxity first.

Require: total charging rate ${e}_{t}$; total number of chargers $N$; residual demand ${\tilde{d}}_{i}^{t}$ and residual parking time ${\tilde{p}}_{i}^{t}$, $i\in {\mathcal{K}}_{t}$

- Initialize remaining total charging rate ${\tilde{e}}_{t}\leftarrow {e}_{t}$
- **for** $i=1,\dots,N$ **do**
  - Initialize ${x}_{i,t}\leftarrow 0$
  - Calculate laxity ${l}_{i,t}\leftarrow {\tilde{p}}_{i}^{t}-\frac{{\tilde{d}}_{i}^{t}\cdot 60}{{x}_{max}}$
  - Initialize ${l}_{i,t+1}\leftarrow {l}_{i,t}$
- **end for**
- **while** ${\tilde{e}}_{t}>0$ **do**
  - Find the EV $\widehat{i}$ with the least laxity that has ${x}_{\widehat{i},t}=0$
  - Update the charging rate of EV $\widehat{i}$: ${x}_{\widehat{i},t}\leftarrow \min\left({\tilde{e}}_{t},\,{x}_{max},\,{\tilde{d}}_{\widehat{i}}^{t}/\alpha\right)$
  - Calculate the laxity of EV $\widehat{i}$ for the next time slot $t+1$: ${l}_{\widehat{i},t+1}\leftarrow {l}_{\widehat{i},t}+\frac{{x}_{\widehat{i},t}\cdot {t}_{\mathrm{len}}}{{x}_{max}}-{t}_{\mathrm{len}}$
  - Update the remaining total charging rate ${\tilde{e}}_{t}\leftarrow {\tilde{e}}_{t}-{x}_{\widehat{i},t}$
- **end while**
- **for** $i=1,\dots,N$ **do**
  - **if** ${l}_{i,t+1}<0$ **then** constrain the charging rate of EV $i$: ${x}_{i,t}\leftarrow \min\left({x}_{max},\,{\tilde{d}}_{i}^{t}/\alpha\right)$
- **end for**
- Calculate the constrained total charging rate ${e}_{t}^{\prime}\leftarrow {\sum}_{i=1}^{N}{x}_{i,t}$
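Algorithm 1 can be transcribed almost line-for-line into Python; the sketch below follows the paper's definitions (variable names are illustrative, and a guard is added so the while loop terminates when every EV has been considered):

```python
def constrained_llf(e_t, x_max, t_len, alpha, res_demand, res_parking):
    """Constrained least-laxity-first scheduling for one time slot.

    e_t         -- total charging rate chosen by the agent (kW)
    x_max       -- maximum rate per charger (kW)
    t_len       -- slot length in minutes
    alpha       -- charging-rate-to-energy conversion coefficient
    res_demand  -- residual demands d~_i (kWh), one per charger
    res_parking -- residual parking times p~_i (min), one per charger
    Returns (per-charger rates x_i, constrained total rate e_t').
    """
    N = len(res_demand)
    x = [0.0] * N
    # Laxity: slack time before the demand becomes infeasible even at x_max.
    lax = [res_parking[i] - res_demand[i] * 60.0 / x_max for i in range(N)]
    lax_next = list(lax)

    remaining = e_t
    unserved = set(range(N))
    while remaining > 0 and unserved:
        i = min(unserved, key=lambda k: lax[k])  # least-laxity EV with x_i = 0
        unserved.discard(i)
        x[i] = min(remaining, x_max, res_demand[i] / alpha)
        lax_next[i] = lax[i] + x[i] * t_len / x_max - t_len
        remaining -= x[i]

    # Constraint step: any EV whose laxity would turn negative is forced to
    # charge at the highest feasible rate, regardless of the agent's e_t.
    for i in range(N):
        if lax_next[i] < 0:
            x[i] = min(x_max, res_demand[i] / alpha)

    return x, sum(x)
```

The final loop is what makes the scheme "constrained": even if the RL agent requests too little total power, EVs that would otherwise miss their deadline are charged anyway, which is how customer satisfaction is guaranteed outside the optimization objective.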

#### 3.2. Agent Architecture

- N input nodes, each of which is the laxity of an EV at charger i, ${l}_{i,t}$.
- One node corresponding to the number of EV arrivals observed at the admission zone of the station.
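The observation components listed above (per-charger laxities, plus the arrival count, plus the 24 h price history described in Section 2.2) can be assembled into a flat input vector for the network; a minimal sketch with illustrative names:

```python
def build_observation(laxities, n_arrivals, price_history):
    """Flatten the agent's observation: one laxity l_{i,t} per charger,
    the number of newly arrived EVs, and the electricity-price history."""
    return list(laxities) + [float(n_arrivals)] + list(price_history)

# Two chargers, three arrivals, a two-sample price history:
obs = build_observation([5.0, -2.0], 3, [0.11, 0.12])
```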

- Let ${w}_{r}=\{{w}_{r,1},{w}_{r,2},\cdots ,{w}_{r,L}\}$ be the $L$ discrete price rate levels.
- Let ${w}_{e}=\{{w}_{e,1},{w}_{e,2},\cdots ,{w}_{e,K}\}$ be the $K$ discrete charging rate levels.
- Then, the action space is:$${\mathcal{A}}_{t}={w}_{r}\times {w}_{e}=\left\{\left({w}_{r,1},{w}_{e,1}\right),\left({w}_{r,1},{w}_{e,2}\right),\dots ,\left({w}_{r,L},{w}_{e,K}\right)\right\}$$
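The discretized action space is a plain Cartesian product; a sketch (the level values here are placeholders, not the paper's actual discretization):

```python
from itertools import product

# Hypothetical discretization: L price levels ($/kWh), K total charging levels (kW).
w_r = [0.1, 0.2, 0.3, 0.4]               # L = 4 price rate levels
w_e = [0.0, 150.0, 300.0, 450.0, 600.0]  # K = 5 charging rate levels

# A_t = w_r x w_e: every (price rate, total charging rate) pair.
actions = list(product(w_r, w_e))
assert len(actions) == len(w_r) * len(w_e)  # |A_t| = L * K
```

Each Q-network output node then corresponds to one $({w}_{r,l},{w}_{e,k})$ pair, which is what allows a standard discrete-action DQN to set price and charging rate jointly.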

#### 3.3. Training Approach

**Algorithm 2:** Constrained deep Q-learning.

Require: episode length schema function $h$; exploration rate schema $l$

- Initialize replay memory $D$ to capacity $N$
- Initialize the action-value function $Q\equiv Q(s,a;\theta )$ with random weights $\theta$
- Initialize the target action-value function $\widehat{Q}$ with weights ${\theta}^{-}$
- **for** episode $=1,\dots,E$ **do**
  - Initialize state ${s}_{1}$
  - Get the current episode duration $T=h(\mathrm{episode})$
  - **for** $t=1,\dots,T$ **do**
    - Get the exploration rate $\epsilon =l(\mathrm{episode},t)$
    - With probability $\epsilon$ select a random action ${a}_{t}$; otherwise, select ${a}_{t}=\arg {\max}_{a}Q\left(s,a;\theta \right)$
    - Constrain ${a}_{t}$ using the constrained LLF algorithm (Algorithm 1)
    - Execute ${a}_{t}$ and observe reward ${R}_{t}$ and next state ${s}_{t+1}$
    - Store the transition $\left({s}_{t},{a}_{t},{R}_{t},{s}_{t+1}\right)$ in $D$
    - Sample a random minibatch of transitions $\left({s}_{j},{a}_{j},{R}_{j},{s}_{j+1}\right)$ from $D$
    - Set the target $${y}_{j}=\left\{\begin{array}{ll}{R}_{j}, & \mathrm{if}\ {s}_{j+1}\ \mathrm{is\ a\ final\ state}\\ {R}_{j}+\gamma \underset{{a}^{\prime}}{\max}\,\widehat{Q}({s}_{j+1},{a}^{\prime};{\theta}^{-}), & \mathrm{otherwise}\end{array}\right.$$
    - Perform a gradient descent step on ${\left({y}_{j}-Q({s}_{j},{a}_{j};\theta )\right)}^{2}$ with respect to $\theta$
    - Every $C$ steps, copy the policy network weights to the target network: ${\theta}^{-}=\theta$
  - **end for**
- **end for**
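The target ${y}_{j}$ in Algorithm 2 reduces to a few lines; a dependency-free sketch in which the list of next-state values stands in for the target network $\widehat{Q}({s}_{j+1},\cdot\,;{\theta}^{-})$:

```python
def td_target(reward, next_state_q_values, is_final, gamma=0.99):
    """Bootstrapped regression target y_j for the Q-learning update.

    reward              -- R_j observed after taking a_j in s_j
    next_state_q_values -- Q_hat(s_{j+1}, a'; theta^-) for every action a'
    is_final            -- True if s_{j+1} is a terminal state
    """
    if is_final:
        return reward  # no future value beyond a terminal state
    return reward + gamma * max(next_state_q_values)

# Non-terminal transition: bootstrap from the best next-state action value.
y = td_target(1.0, [0.5, 2.0, -1.0], False, gamma=0.9)  # 1.0 + 0.9 * 2.0
```

The squared error $(y_j - Q(s_j, a_j;\theta))^2$ is then minimized by gradient descent on $\theta$ only; the target weights ${\theta}^{-}$ are held fixed between the periodic copies, which is what stabilizes training.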

## 4. Evaluation Methodology

#### 4.1. Datasets

- They were upsampled to 60 min intervals.
- They were scaled by a factor of $\frac{1}{100}$ and rounded to the closest integer.
- They were undersampled to 1 min intervals, by randomly distributing the 1 h samples to intermediate minutes using a uniform distribution.
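The three preprocessing steps above can be sketched together. One plausible reading (assumed here, since the source text is terse): the hourly counts are scaled by $\frac{1}{100}$ and rounded, and each hour's arrivals are then spread uniformly at random over that hour's 60 one-minute slots:

```python
import random

def spread_hourly_counts(hourly_counts, seed=0):
    """Scale hourly arrival counts by 1/100, round to the closest integer,
    and distribute each hour's arrivals uniformly over its 60 minutes."""
    rng = random.Random(seed)
    per_minute = [0] * (60 * len(hourly_counts))
    for hour, count in enumerate(hourly_counts):
        n = round(count / 100)  # scale by 1/100, round to the closest integer
        for _ in range(n):
            minute = hour * 60 + rng.randrange(60)  # uniform minute in the hour
            per_minute[minute] += 1
    return per_minute

# Two hours of raw counts -> 120 one-minute slots holding 3 + 1 arrivals.
per_min = spread_hourly_counts([300, 120])
```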

#### 4.2. Experimental Setup

- Chargers of the station: $N=20$.
- Maximum charging rate per charger: ${x}_{max}=30$ kW. According to the U.S. Department of Energy (https://afdc.energy.gov/fuels/electricity_infrastructure.html, accessed on 15 February 2022), most EVs on the road today are not capable of charging at rates higher than 50 kW; thus, the more conservative value of 30 kW was selected. Note that 22 kW is the closest standard charging rate (i.e., Level 2 EV charging), but the purpose of this work is to present a more general approach.
- Maximum total charging rate: ${e}_{max}=N\xb7{x}_{max}=600$ kW.
- Time slot length: ${t}_{\mathrm{len}}=5$ min.
- Episode duration: 1 day, i.e., 1440 min or 288 time slots.
- Discrete price rate levels: ${w}_{r}$ ($L$ levels, in \$/kWh).
- Discrete charging rate levels: ${w}_{e}$ ($K$ levels, in kW).
- Cardinality of the action space: $|{\mathcal{A}}_{t}| = L\cdot K$.

- Demand–Response Function

- $\u03f5$-Greedy Policy

- Episode Duration
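The $\epsilon$-greedy policy above is parametrized by ${\epsilon}_{\mathrm{start}},{\epsilon}_{\mathrm{end}},{\epsilon}_{\mathrm{decay}}$ (see Nomenclature). A common exponential-decay form is sketched below as an assumption; the paper's exact schema function $l(\mathrm{episode},t)$ is not reproduced here:

```python
import math

def epsilon(step, eps_start=1.0, eps_end=0.05, eps_decay=2000.0):
    """Hypothetical exploration schedule: decays exponentially from
    eps_start toward eps_end as training steps accumulate."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)

# Early training explores almost always; late training mostly exploits.
early, late = epsilon(0), epsilon(20000)
```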

## 5. Results

#### 5.1. Training Results

#### 5.2. Policy Analysis

#### 5.3. Case Study: Increasing Episode Time Horizon

#### 5.4. Case Study: Removing Constraints

## 6. Conclusions

- As a first step, the technique of constraining the estimated charging rate could be incorporated into different DRL training algorithms that would operate on continuous action spaces, thereby lifting the need for discretizing scheduling and pricing actions.
- This work could serve as the basis for different formulations that consider more stakeholders, e.g., the grid operators and the corresponding constraints.
- Furthermore, the assumption was made that the total charging rate requested by the charging station is constrained only by the number of individual chargers. Consequently, potentially all parked EVs can be scheduled to charge during each slot; respecting additional constraints placed by the grid operator is an aspect that naturally arises as a potential future extension.
- In addition, more financial tools can be considered in modeling the relationship between different stakeholders.
- Finally, a more automated version of such a system can also be tailor-made for real-time EV detection, on a non-intrusive load monitoring (NILM) basis [31].

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Nomenclature

| Symbol | Description |
|---|---|
| $t$ | The time slot index |
| ${t}_{\mathrm{len}}$ | The length (duration) of each time slot |
| ${\mathcal{I}}_{t}$ | The set of EVs that have arrived at the station at the beginning of time slot $t$ |
| ${\mathcal{J}}_{t}$ | The set of EVs that are already parked in the station before time slot $t$ |
| ${\mathcal{K}}_{t}$ | The set of EVs that require charging at time slot $t$ |
| ${r}_{t}$ | The price rate announced to the customers at time slot $t$ |
| ${t}_{i}^{a}$ | The arrival time of EV $i$ |
| ${d}_{i}$ | The charging demand of EV $i$ |
| ${p}_{i}$ | The maximum desired parking time of EV $i$ |
| ${D}_{i}(\cdot)$ | The demand–response function of EV $i$ |
| ${\beta}_{1},{\beta}_{2},\sigma$ | The parameters of the demand–response function |
| $N$ | The total number of chargers in the station |
| ${x}_{i,t}$ | The charging rate at which EV $i$ will be charged during time slot $t$ |
| ${x}_{max}$ | The maximum individual charging rate for every charger |
| ${e}_{t}$ | The total charging rate at time slot $t$ |
| ${e}_{t}^{\prime}$ | The constrained total charging rate at time slot $t$ |
| ${e}_{max}$ | The maximum total charging rate for the charging station |
| $\alpha$ | The charging rate to energy conversion coefficient |
| ${c}_{t}$ | The electricity price that the charging station pays to the utility company |
| $(\mathcal{S},\mathcal{A},P,R)$ | The 4-tuple of elements of the Markov decision process |
| $\gamma$ | The discount rate |
| ${\tilde{d}}_{i}^{t}$ | The residual charging demand for EV $i$ at time slot $t$ |
| ${\tilde{p}}_{i}^{t}$ | The residual parking time for EV $i$ at time slot $t$ |
| ${l}_{i,t}$ | The laxity of EV $i$ at time slot $t$ |
| $\xi$ | The relaxation coefficient |
| ${\mathcal{A}}_{t}$ | The set of all available actions |
| ${w}_{r}$ | The set of discrete price rate levels |
| $L$ | The number of discrete price rate levels |
| ${w}_{e}$ | The set of discrete charging rate levels |
| $K$ | The number of discrete charging rate levels |
| $\epsilon$ | The probability of a random action of the $\epsilon$-greedy policy |
| ${\epsilon}_{\mathrm{start}},{\epsilon}_{\mathrm{end}},{\epsilon}_{\mathrm{decay}}$ | The parameters of the $\epsilon$-greedy policy |

## References

1. Azam, A.; Rafiq, M.; Shafique, M.; Yuan, J. Towards Achieving Environmental Sustainability: The Role of Nuclear Energy, Renewable Energy, and ICT in the Top-Five Carbon Emitting Countries. Front. Energy Res. **2021**, 9, 804706.
2. Shafique, M.; Azam, A.; Rafiq, M.; Luo, X. Evaluating the Relationship between Freight Transport, Economic Prosperity, Urbanization, and CO2 Emissions: Evidence from Hong Kong, Singapore, and South Korea. Sustainability **2020**, 12, 664.
3. Shafique, M.; Azam, A.; Rafiq, M.; Luo, X. Investigating the nexus among transport, economic growth and environmental degradation: Evidence from panel ARDL approach. Transp. Policy **2021**, 109, 61–71.
4. Shafique, M.; Luo, X. Environmental life cycle assessment of battery electric vehicles from the current and future energy mix perspective. J. Environ. Manag. **2022**, 303, 114050.
5. Yilmaz, M.; Krein, P.T. Review of the Impact of Vehicle-to-Grid Technologies on Distribution Systems and Utility Interfaces. IEEE Trans. Power Electron. **2013**, 28, 5673–5689.
6. Shafique, M.; Azam, A.; Rafiq, M.; Luo, X. Life cycle assessment of electric vehicles and internal combustion engine vehicles: A case study of Hong Kong. Res. Transp. Econ. **2021**, 101112.
7. International Energy Agency. Global EV Outlook. In Scaling-Up the Transition to Electric Mobility; IEA: London, UK, 2019.
8. Statharas, S.; Moysoglou, Y.; Siskos, P.; Capros, P. Simulating the Evolution of Business Models for Electricity Recharging Infrastructure Development by 2030: A Case Study for Greece. Energies **2021**, 14, 2345.
9. Almaghrebi, A.; Aljuheshi, F.; Rafaie, M.; James, K.; Alahmad, M. Data-Driven Charging Demand Prediction at Public Charging Stations Using Supervised Machine Learning Regression Methods. Energies **2020**, 13, 4231.
10. Moghaddam, V.; Yazdani, A.; Wang, H.; Parlevliet, D.; Shahnia, F. An Online Reinforcement Learning Approach for Dynamic Pricing of Electric Vehicle Charging Stations. IEEE Access **2020**, 8, 130305–130313.
11. Ghotge, R.; Snow, Y.; Farahani, S.; Lukszo, Z.; van Wijk, A. Optimized Scheduling of EV Charging in Solar Parking Lots for Local Peak Reduction under EV Demand Uncertainty. Energies **2020**, 13, 1275.
12. He, Y.; Venkatesh, B.; Guan, L. Optimal Scheduling for Charging and Discharging of Electric Vehicles. IEEE Trans. Smart Grid **2012**, 3, 1095–1105.
13. Tang, W.; Zhang, Y.J. A Model Predictive Control Approach for Low-Complexity Electric Vehicle Charging Scheduling: Optimality and Scalability. IEEE Trans. Power Syst. **2017**, 32, 1050–1063.
14. Zhang, L.; Li, Y. Optimal Management for Parking-Lot Electric Vehicle Charging by Two-Stage Approximate Dynamic Programming. IEEE Trans. Smart Grid **2017**, 8, 1722–1730.
15. Bellman, R. Dynamic Programming. Science **1966**, 153, 34–37.
16. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018.
17. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M.A. Playing Atari with Deep Reinforcement Learning. arXiv **2013**, arXiv:1312.5602.
18. Abdullah, H.M.; Gastli, A.; Ben-Brahim, L. Reinforcement Learning Based EV Charging Management Systems—A Review. IEEE Access **2021**, 9, 41506–41531.
19. Lee, J.; Lee, E.; Kim, J. Electric Vehicle Charging and Discharging Algorithm Based on Reinforcement Learning with Data-Driven Approach in Dynamic Pricing Scheme. Energies **2020**, 13, 1950.
20. Zhang, F.; Yang, Q.; An, D. CDDPG: A Deep-Reinforcement-Learning-Based Approach for Electric Vehicle Charging Control. IEEE Internet Things J. **2021**, 8, 3075–3087.
21. Wan, Z.; Li, H.; He, H.; Prokhorov, D. Model-Free Real-Time EV Charging Scheduling Based on Deep Reinforcement Learning. IEEE Trans. Smart Grid **2019**, 10, 5246–5257.
22. Wang, S.; Bi, S.; Zhang, Y.A. Reinforcement Learning for Real-Time Pricing and Scheduling Control in EV Charging Stations. IEEE Trans. Ind. Inform. **2021**, 17, 849–859.
23. Chis, A.; Lunden, J.; Koivunen, V. Reinforcement Learning-Based Plug-in Electric Vehicle Charging with Forecasted Price. IEEE Trans. Veh. Technol. **2016**, 66, 3674–3684.
24. Lucas, A.; Barranco, R.; Refa, N. EV Idle Time Estimation on Charging Infrastructure, Comparing Supervised Machine Learning Regressions. Energies **2019**, 12, 269.
25. Deng, R.; Yang, Z.; Chow, M.Y.; Chen, J. A Survey on Demand Response in Smart Grids: Mathematical Models and Approaches. IEEE Trans. Ind. Inform. **2015**, 11, 570–582.
26. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, Cambridge, UK, 1989.
27. Pazis, J.; Lagoudakis, M.G. Reinforcement learning in multidimensional continuous action spaces. In Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Paris, France, 11–15 April 2011; pp. 97–104.
28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533.
29. Korea Power Exchange. System Marginal Price. Data Retrieved from the Electric Power Statistics Information System. 2022. Available online: http://epsis.kpx.or.kr/epsisnew/selectEkmaSmpShdGrid.do?menuId=040202&locale=eng (accessed on 8 February 2022).
30. Al-Saadi, M.; Olmos, J.; Saez-de Ibarra, A.; Van Mierlo, J.; Berecibar, M. Fast Charging Impact on the Lithium-Ion Batteries’ Lifetime and Cost-Effective Battery Sizing in Heavy-Duty Electric Vehicles Applications. Energies **2022**, 15, 1278.
31. Athanasiadis, C.L.; Papadopoulos, T.A.; Doukas, D.I. Real-time non-intrusive load monitoring: A light-weight and scalable approach. Energy Build. **2021**, 253, 111523.

**Figure 10.**Training curves of the best run of the proposed model, averaged over a moving window of 50 episodes.

**Figure 16.**Plot of random probability for $\u03f5$-greedy policy during training (three-day episode duration).

**Figure 18.**Training curves of the proposed model, averaged over a moving window of 50 episodes (three-day episode duration).

**Figure 19.**Price announced to customers vs. electricity price paid to utility company (three-day episode duration).

**Figure 22.**Residual demand for a single charger for the first 350 time slots (three-day episode duration).

**Figure 24.**Training curves of the unconstrained model, averaged over a moving window of 50 episodes.

| EV Type | Standard Deviation $\sigma$ | ${\beta}_{1}$ [kWh/\$] | ${\beta}_{2}$ [kWh] | Parking Time [min] |
|---|---|---|---|---|
| Emergent | 4.47 | −1 | 6 | 30 |
| Normal | 3.96 | −4 | 15 | 120 |
| Residential | 2.63 | −25 | 100 | 720 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Paraskevas, A.; Aletras, D.; Chrysopoulos, A.; Marinopoulos, A.; Doukas, D.I.
Optimal Management for EV Charging Stations: A Win–Win Strategy for Different Stakeholders Using Constrained Deep Q-Learning. *Energies* **2022**, *15*, 2323.
https://doi.org/10.3390/en15072323
