Article

Multi-Agent Deep Reinforcement Learning for Scheduling of Energy Storage System in Microgrids

1 Department of Computer Science and Engineering, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34143, Republic of Korea
2 ICT Convergence Standards Research Division, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(12), 1999; https://doi.org/10.3390/math13121999
Submission received: 29 April 2025 / Revised: 2 June 2025 / Accepted: 16 June 2025 / Published: 17 June 2025
(This article belongs to the Special Issue Artificial Intelligence and Optimization in Engineering Applications)

Abstract

Efficient scheduling of Energy Storage Systems (ESS) within microgrids has emerged as a critical issue for energy cost reduction, peak shaving, and battery health management. Both single-agent and multi-agent deep reinforcement learning (DRL) approaches have been explored for ESS scheduling. However, the former struggles to scale to multiple objectives, while the latter lacks comprehensive consideration of diverse user objectives. To address these issues, in this paper we propose a new DRL-based scheduling algorithm using a multi-agent proximal policy optimization (MAPPO) framework combined with Pareto optimization. The proposed model employs two independent agents: one minimizes electricity costs and the other minimizes charge/discharge switching frequency to account for battery degradation. The candidate actions generated by the agents are evaluated through Pareto dominance, and the final action is selected via scalarization reflecting operator-defined preferences. Simulation experiments were conducted using real industrial building load and photovoltaic (PV) generation data under realistic South Korean electricity tariff structures. Comparative evaluations against baseline DRL algorithms (TD3, SAC, PPO) demonstrate that the proposed MAPPO method significantly reduces electricity costs while minimizing battery-switching events. Furthermore, the results highlight that the proposed method achieves a balanced improvement in both economic efficiency and battery longevity, making it highly applicable to real-world dynamic microgrid environments. Specifically, the proposed MAPPO-based scheduling achieved a total electricity cost reduction of 14.68% compared to the No-ESS case and 3.56% greater cost savings than other baseline reinforcement learning algorithms.

1. Introduction

As the global energy transition accelerates, the role of renewable energy sources (RES) is becoming increasingly important. Renewable sources such as photovoltaic (PV) and wind power offer carbon-neutral energy, but their inherent intermittency and variability present major challenges to power system management [1,2,3]. To overcome these challenges, microgrids have emerged as a promising solution.
A microgrid is a small-scale distributed energy system composed of consumers (loads), renewable generators (e.g., PV or wind), and energy storage systems (ESS) such as lithium-ion batteries. It can operate independently or in connection with the main utility grid. Microgrids facilitate the integration of renewable energy, enhance energy self-sufficiency, and improve the reliability and flexibility of regional power supply. They also help reduce peak loads on the main grid and strengthen resilience against unexpected events such as natural disasters or blackouts [4]. Among the core components of a microgrid, ESS plays a critical role in balancing mismatches between electricity supply and demand, managing peak loads, and maximizing renewable energy utilization [5,6].
The Energy Management System (EMS) is a key component responsible for the efficient operation of ESS and other resources within a microgrid. EMS collects real-time data to determine optimal energy dispatch strategies, ensuring supply stability, minimizing operating costs, and extending battery lifespan. It also plays a crucial role in managing RES variability and achieving effective integration of energy resources [7,8].
Although numerous studies have aimed to enhance the stability and efficiency of power systems, performing optimal scheduling in real-time remains a major challenge due to the volatility of RES and demand uncertainty, especially in small-scale microgrids. Various strategies have been proposed to minimize operating costs while maintaining operational reliability [9]. Initially, rule-based methods were widely used, relying on predefined rules and heuristics to determine charge/discharge schedules. These approaches use fixed strategies based on electricity pricing and historical data [10] but lack the flexibility to adapt to dynamic environments and cannot adequately reflect system complexity.
Later, optimization techniques such as linear programming (LP) and mixed-integer linear programming (MILP) were applied to ESS scheduling. These methods define objective functions while incorporating ESS constraints to derive optimal schedules [10,11]. Although LP and MILP are capable of modeling both continuous and discrete control variables, they are primarily linear and difficult to apply to complex nonlinear or large-scale systems. To overcome this limitation, dynamic programming (DP) was introduced to solve the scheduling problem sequentially over time, as indicated in [12]. This method calculates optimal decisions at each time step, constructing a globally optimal schedule. However, DP suffers from the “curse of dimensionality,” where the computational burden grows exponentially with the size of the state space. To deal with this deficiency, metaheuristic algorithms such as genetic algorithms (GA) have also been studied in ESS scheduling. GA mimics biological evolution through selection, crossover, and mutation to explore a wide search space; it is effective for solving nonlinear problems and can simultaneously handle multiple objectives [13]. However, even though these traditional methods—from rule-based heuristics to LP, MILP, DP, and GA—have significantly contributed to the development of ESS scheduling, they face limitations in fully modeling system dynamics and adapting to real-time variability.
To improve adaptability, recently, machine learning and deep learning methods have been increasingly applied to ESS scheduling. Specifically, reinforcement learning (RL), a type of machine learning, is mostly studied because it enables agents to learn optimal policies by interacting with the environment. This feature makes RL effective in a situation in which both the volatility of the power supply and peak demand are rapidly increasing. Generally, RL can be categorized into model-based and model-free approaches. While the former requires accurate modeling of the environment and incurs high computational cost, the latter—adopted in this study—learns policies directly from experience [14]. To be specific, model-free RL methods are further classified into value-based and policy-based approaches. Value-based RL learns a value function to select actions, whereas policy-based RL directly learns a policy for action selection. Actor-critic algorithms combine both, learning the value function and policy simultaneously. This hybrid structure leverages the strengths of both paradigms. Figure 1 illustrates the categorization of RL algorithms.
Several studies have employed RL for ESS scheduling. The first effort focused on Q-learning within a Markov Decision Process (MDP) framework [15]. However, Q-learning struggles with high-dimensional state–action spaces and lacks the capacity to model continuous variables, leading to the “curse of dimensionality.” To overcome these limitations, Deep Q-Networks (DQN) were introduced by combining RL with deep neural networks (DNN), allowing for the approximation of complex nonlinear functions [16]. Despite its advantages, DQN is primarily designed for discrete action spaces and is less suitable for continuous control in ESS. The deep deterministic policy gradient (DDPG) addresses this by integrating DQN with actor–critic methods, thereby supporting continuous action spaces [17]. Policy gradient methods optimize the policy function directly, enabling fine-grained control over charging/discharging decisions [18]. However, they can suffer from instability due to abrupt policy updates and slow convergence.
Trust Region Policy Optimization (TRPO) mitigates policy instability by restricting policy updates within a trust region [19], but this introduces high computational cost and implementation complexity. Furthermore, Proximal Policy Optimization (PPO), which is employed in this study, simplifies TRPO by using a clipping mechanism to stabilize policy updates and prevent overshooting. PPO has recently been widely used in ESS scheduling studies due to its balance between stability and computational efficiency [20,21,22,23]. Soft actor–critic (SAC), another popular algorithm, improves sample efficiency and learning stability by introducing entropy regularization and dual Q-value estimation [24].
A review of the existing literature on RL for ESS scheduling makes it clear that integrating machine learning techniques into ESS scheduling significantly enhances flexibility and operational performance. However, since each algorithm offers unique strengths and limitations, it is essential to select the appropriate method according to system characteristics and requirements so that electricity costs are reduced, battery life is extended, and grid stability is improved.
However, the challenge lies in formulating a scheduling strategy that addresses multiple, and often conflicting, objectives. For instance, minimizing electricity costs for economic benefits must be balanced with minimizing charge/discharge state transitions to preserve battery life and reducing peak load to improve energy efficiency [25,26]. Among the various techniques for handling multi-objective decision problems, fuzzy optimization methods have shown notable effectiveness in modeling trade-offs between conflicting objectives under uncertainty. For example, Zhang et al. proposed a fuzzy multi-objective model for coordinating aircraft flight scheduling, demonstrating the applicability of fuzzy logic in balancing operational constraints [27].
To address these issues, in this study, we propose an ESS scheduling algorithm based on deep reinforcement learning (DRL) that incorporates a multi-agent system and Pareto front optimization. DRL enables agents to learn optimal charge/discharge strategies through interaction with a complex, uncertain environment. In particular, the model-free DRL approach is suitable for nonlinear and dynamic settings where explicit environmental modeling is difficult. Pareto front optimization is employed to coordinate the agents’ actions by balancing their conflicting goals using scalarization based on operator-defined weights. This provides a structured and adaptable decision-making mechanism for real-time ESS scheduling in microgrids. The proposed multi-agent deep reinforcement learning (DRL) model assigns each agent to a specific objective, such as minimizing electricity cost, reducing charge/discharge switching frequency, or lowering peak demand. Each agent learns its respective policy independently within a shared environment, and the final action is selected through Pareto-based action integration. At each decision point, a set of candidate actions is generated by the agents, and the final ESS control action is selected by evaluating Pareto dominance and applying weighted scalarization across objectives.
In summary, the main technical contributions of this paper are as follows:
  • First, we propose a novel ESS scheduling framework that explicitly separates the inherently conflicting objectives of minimizing electricity costs and reducing charge/discharge switching frequency into two independent agents and integrates their decisions through Pareto-based optimization. This structure effectively balances economic efficiency and battery health while improving policy interpretability.
  • Second, within the economic optimization agent, electricity usage cost and peak demand-based basic cost are jointly optimized, reflecting the natural correlation between these cost components under realistic industrial tariff systems.
  • Third, to explicitly account for long-term battery degradation, this study models the charge/discharge switching frequency as a dedicated reward function, assigned to a separate agent, thereby incorporating battery aging effects into the ESS scheduling policy—a critical factor often overlooked in previous DRL-based scheduling studies.
  • Fourth, by constructing a Pareto front at each decision step and applying scalarization techniques reflecting operator-defined preferences, the proposed method achieves flexible and adaptive scheduling that dynamically prioritizes operational goals according to real-world requirements.
The remainder of this paper is organized as follows. Section 2 formulates the system model, objective functions, operational constraints, and the DRL framework structure. Section 3 introduces the proposed multi-agent scheduling algorithm and the Pareto front-based integration strategy. Section 4 describes the experimental setup, simulation procedures, and result analysis. Finally, Section 5 concludes the study and outlines future research directions.

2. System Modeling and Problem Formulation

This section describes the configuration of the microgrid considered in this study, the mathematical formulation of the energy storage system (ESS), the objective functions for scheduling, and the structure of the reinforcement learning (RL) framework used in this research.

2.1. System Modeling

The microgrid considered in this study consists of four key components: electric load, photovoltaic (PV) generation, energy storage system (ESS), and the main utility grid. The PV and ESS are installed behind the meter, and the generated energy is used solely for self-consumption, without selling surplus energy to the grid. Therefore, the microgrid is modeled as a grid-connected system with no power export, as illustrated in Figure 2, where the * symbol denotes variables derived from multiple sensor measurements.
The energy flow within the microgrid is governed by the principle of energy balance at each time step. The total electricity demand at time t, denoted by $P_t^{Load}$, is met through the combination of grid supply $P_t^{Grid}$, PV generation $P_t^{PV}$, and ESS operation $P_t^{ESS}$. This relationship is expressed as

$$P_t^{Grid} = P_t^{Load} - P_t^{PV} + P_t^{ESS}, \quad (1)$$
$$P_t^{Grid} \le P_{capacity}^{Grid}, \quad (2)$$
$$P_t^{PV} \le P_{capacity}^{PV}, \quad (3)$$
$$P_t^{Load} \le P_{capacity}^{Load}. \quad (4)$$
The sign convention assumes that discharging the ESS contributes positively to the load, while charging represents power consumption. If the right-hand side of Equation (1) is negative (i.e., generation exceeds demand), excess power is first used to charge the ESS. Since power export is not permitted in this model, any surplus that cannot be stored is curtailed.
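To make the balance and curtailment rule concrete, the following minimal Python sketch computes the grid draw for one time step under the non-exporting assumption; the function name, the sign convention (positive ESS power = charging), and the headroom argument are illustrative and not taken from the paper's implementation.

```python
def grid_power_step(load_kw, pv_kw, ess_kw, ess_headroom_kw):
    """Energy-balance sketch for one step of a non-exporting microgrid.

    ess_kw          : scheduled ESS power (positive = charging in this sketch)
    ess_headroom_kw : additional charging power the battery could still accept
    Returns (grid_kw, curtailed_kw).
    """
    net = load_kw - pv_kw + ess_kw            # Equation (1)
    if net >= 0:
        return net, 0.0                       # grid covers the remaining demand
    surplus = -net                            # PV exceeds demand plus scheduled charging
    absorbed = min(surplus, ess_headroom_kw)  # store what the battery can still take
    return 0.0, surplus - absorbed            # no export allowed: the rest is curtailed
```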
The ESS operation must also respect physical constraints such as its state of charge (SoC), charging/discharging power limits, and battery capacity. These constraints are discussed in detail in Section 2.2. This power flow model ensures that the microgrid can operate autonomously in a grid-connected but non-exporting mode, with the ESS serving as the key control variable for balancing supply and demand while minimizing grid reliance.
In a microgrid, load modeling refers to the characterization of the power consumption pattern of industrial buildings connected to the power system. The power demand pattern represents the amount of electricity consumed at each time stamp t, and it is typically expressed as a time-series profile. The load profile for a future day (day-ahead) is forecasted based on historical consumption data collected within the microgrid. Accurate load prediction is essential for effective energy scheduling and management within the system.
The photovoltaic (PV) system is a key component of the microgrid, providing renewable energy to meet power demand. Accurate PV modeling is critical for predicting the amount of solar energy generated and for optimizing energy management strategies accordingly. Among the various factors influencing PV output prediction, solar irradiance is the most significant. In this study, the day-ahead PV generation pattern is forecasted by combining historical solar irradiance data collected from the testbed and short-term weather forecasts [28].
Specifically, the forecasting model leverages a hybrid dataset composed of both historical and future weather variables, including time, PV output, temperature, humidity, and solar irradiance. While temperature and humidity are not directly used in classical PV output formulas, they significantly affect irradiance and system efficiency. Therefore, the prediction model incorporates future hourly weather information from the Korea Meteorological Administration (KMA) API to generate solar power forecasts 24 h ahead. This approach is designed to reflect realistic operational constraints, as directly predicting weather-based inputs in a general-purpose system would require considerable computational resources and external data access.
Similar approaches have also been applied in wind energy forecasting. For instance, [29,30] proposed a short-term wind power forecasting method that integrates numerical weather prediction correction with spatiotemporal graph convolutional networks, which aligns with our methodology of using multivariate meteorological inputs for accurate renewable generation forecasting.

2.2. ESS Modeling

The energy storage system (ESS) plays a critical role in managing the variability of renewable energy sources and providing stable power supply within a microgrid. By adjusting its charging and discharging operations, the ESS can reduce peak power demand and improve the overall operational efficiency of the microgrid.
The power charged or discharged by the ESS at time t, denoted as $P_t^{ESS}$, is defined as Equation (5):

$$P_t^{ESS} = P_t^{charge} - P_t^{discharge}. \quad (5)$$

In this expression, if charging is dominant, the value $P_t^{ESS}$ becomes negative, and if discharging is dominant, it becomes positive. This sign convention clearly distinguishes between the two operational modes.
The capacity utilization of the ESS is generally expressed as a percentage known as the state of charge (SoC). However, for scheduling based on forecasted values, the percentage-based SoC is converted into an absolute energy value in kilowatt-hours (kWh). This allows the system to operate using precise estimations of energy stored. The conversion is expressed as Equation (6):

$$E_t^{SoC} = ESS_{capacity} \cdot SoC_t. \quad (6)$$

Here, $E_t^{SoC}$ represents the estimated energy stored in the ESS at time t, $SoC_t$ is the state of charge at that time expressed as a decimal (e.g., 0.6 for 60%), and $ESS_{capacity}$ is the total capacity of the battery in kWh [31].
The operational state of the ESS, indicating whether the system is charging, discharging, or idle at time t, is defined as Equation (7):

$$S_t^{ch/dis} = \begin{cases} +1 & \text{if } P_t^{charge} > 0 \\ -1 & \text{if } P_t^{discharge} > 0 \\ 0 & \text{otherwise.} \end{cases} \quad (7)$$
This state variable is introduced to account for battery degradation. Frequent transitions between charging and discharging states are known to accelerate battery wear due to chemical stress, thereby reducing the battery’s lifespan [32]. Therefore, minimizing unnecessary switching operations is an essential consideration in the ESS scheduling strategy.
The above equations and constraints collectively define the operational behavior of the ESS, enabling it to serve as a flexible and responsive asset for energy management within the microgrid.
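As an illustration of Equations (5)-(7), the short sketch below converts a fractional SoC into stored energy and derives the operating-state flag from the charge and discharge powers; the function names are hypothetical.

```python
def soc_to_energy(soc, ess_capacity_kwh):
    """Equation (6): stored energy in kWh from a fractional SoC (e.g., 0.6 for 60%)."""
    return ess_capacity_kwh * soc

def operating_state(p_charge_kw, p_discharge_kw):
    """Equation (7): +1 while charging, -1 while discharging, 0 when idle."""
    if p_charge_kw > 0:
        return +1
    if p_discharge_kw > 0:
        return -1
    return 0
```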

2.3. Objective Function and Constraints

In order to satisfy both economic and operational stability requirements in microgrid operation, this study employs three objective functions for optimization.

2.3.1. Energy Efficiency and Economical Cost

This study uses a testbed designed to reflect the structure of the Korean electricity market. In South Korea, KEPCO (Korea Electric Power Corporation) is responsible for centralized power generation and distribution. The market operates under a single-buyer model in which KEPCO purchases all electricity and resells it to consumers. Electricity prices are regulated by the government in cooperation with KEPCO, and generation and supply are monopolized [33]. This structure ensures pricing stability and reduces generation costs through bulk purchasing.
  • Objective function 1: Minimize Electric Cost
Minimizing electricity cost is a primary goal of ESS scheduling, aiming to reduce power consumption costs by optimally controlling charging and discharging. The Korean electricity tariff system is described by Equation (8):
$$R_{Cost} = \left( P_{peak} \times C_{demand} + P_{grid} \times C_{usage} \right) \times 1.137. \quad (8)$$

Electricity costs consist of two parts: a basic cost and a usage cost, to which a 10% value-added tax (VAT) and a 3.7% electricity industry development fund surcharge are added (the 1.137 factor). The basic cost is calculated based on either the contracted power capacity or peak demand, while the usage cost is based on the total energy drawn from the grid over one month. The term $P_{peak}$ represents the maximum peak load, and $C_{demand}$ denotes the unit price for the basic cost; these are used to calculate the basic portion of the electricity bill. The total grid energy consumption over the month is denoted by $P_{grid}$, which is used to calculate the usage cost. The full cost-minimization objective function is defined in Equation (9):

$$Minimize(R_{Cost}). \quad (9)$$
In cases where a demand meter is not installed, the basic cost is determined based on contracted capacity. However, if a demand meter is present, as assumed in this study, the basic cost is calculated based on the actual peak demand. Under this model, the basic cost determined by the peak demand is fixed for 12 months, unless the peak is updated. If the peak demand remains below 30% of the contracted capacity, the basic cost is reduced to 30% of the standard rate. The method for calculating peak load is provided in Equation (10):
$$P_{Peak} = \begin{cases} \max_{t \in T}\left(P_t^{Grid}\right), & \text{if } \max_{t \in T}\left(P_t^{Grid}\right) > P_{peak} \\ P_{contract} \times 0.3, & \text{if } \max_{t \in T}\left(P_t^{Grid}\right) \le P_{contract} \times 0.3. \end{cases} \quad (10)$$
Minimizing peak load helps ensure the stability of the power system and reduces energy consumption during high-demand periods. Peak load refers to the highest power usage observed during a given period, and reducing it is important for cost and grid management. The corresponding objective functions are defined in Equations (11) and (12):
$$R_{Peak} = \max_{t \in T}\left(P_t^{Grid}\right), \quad (11)$$
$$Minimize(R_{Peak}). \quad (12)$$
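A minimal sketch of the tariff logic in Equations (8)-(10) is given below; the unit prices and the simplified peak-floor rule are placeholders rather than actual KEPCO tariff parameters.

```python
def monthly_bill_krw(peak_kw, grid_kwh, c_demand, c_usage, contract_kw):
    """Sketch of the industrial tariff in Equations (8)-(10).

    c_demand / c_usage are placeholder unit prices (KRW/kW and KRW/kWh), not
    actual KEPCO rates. The 1.137 factor bundles the 10% VAT and the 3.7%
    electricity industry development fund surcharge.
    """
    billed_peak = max(peak_kw, 0.3 * contract_kw)   # Eq. (10): 30% floor of contracted capacity
    basic = billed_peak * c_demand                   # peak-based basic charge
    usage = grid_kwh * c_usage                       # energy-based usage charge
    return (basic + usage) * 1.137                   # Eq. (8)
```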

2.3.2. Battery Health Management

  • Objective function 2: Minimize ESS charge/discharge state change
While the cost-related objectives above aim to minimize electricity costs and improve economic efficiency, pursuing them alone can result in frequent ESS operation. To address the physical stability of the battery, this study incorporates battery life considerations into the objective function.
Lithium-ion battery degradation is strongly influenced by factors such as the number of charge/discharge cycles, overcharging, and fast charging. These conditions can lead to uneven growth of the solid electrolyte interphase (SEI) layer and formation of dead cells. Additionally, lithium plating on the anode surface can lead to dendrite formation, which further accelerates degradation [34,35]. Such degradation increases internal pressure and resistance, raises operating temperature, and causes thermal accumulation. According to [36], frequent switching between charge and discharge modes results in rapid SEI layer growth and loss of active lithium ions, thereby shortening battery life.
Minimizing ESS switching frequency is thus a conflicting objective with respect to economic cost minimization, and it is formulated to preserve battery longevity. Equation (13) counts the number of state transitions over the scheduling horizon by comparing the operating state at each time step with that of the previous step. A higher count indicates more frequent switching. The full switching-minimization objective function is given in Equation (14):

$$R_{Switch} = \sum_{t=2}^{T} \left| S_t^{ch/dis} - S_{t-1}^{ch/dis} \right|, \quad (13)$$
$$Minimize(R_{Switch}). \quad (14)$$
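Equation (13) can be evaluated directly from the sequence of operating-state flags, as in this small sketch:

```python
def switching_count(states):
    """Equation (13): sum of |S_t - S_{t-1}| over the horizon.

    states: sequence of operating-state flags in {-1, 0, +1}.
    Note that a direct charge-to-discharge flip contributes 2 to the count.
    """
    return sum(abs(curr - prev) for prev, curr in zip(states, states[1:]))
```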

2.3.3. Constraints

To ensure system stability, efficiency, and economic viability, several operational constraints are imposed on the microgrid, as derived from the system model and objectives. First, the microgrid must maintain power balance between supply and demand at all times. This requirement is applied to all components of the microgrid as shown in Equation (15). Second, the power drawn from the external grid must not exceed the contracted power capacity. This constraint is defined in Equation (16):
$$P_t^{Grid} + P_t^{ESS} + P_t^{PV} - P_t^{Load} = 0, \quad (15)$$
$$P_t^{Grid} \le P_{Contract}, \quad (16)$$
$$P_t^{charge} \cdot P_t^{discharge} = 0. \quad (17)$$
Next, Equations (17) and (18) ensure that at each time step, the ESS can perform only one action—charging, discharging, or remaining idle—and that the power does not exceed the rated converter output.
Finally, Equation (19) ensures that the ESS SoC during charge/discharge operations remains within the specified upper and lower SoC limits, i.e., between S o C m i n and S o C m a x :
$$\left| P_t^{charge} + P_t^{discharge} \right| \le P_{capacity}^{ESS}, \quad (18)$$
$$SoC_{min} \le SoC_t \le SoC_{max}. \quad (19)$$
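For clarity, a candidate dispatch can be screened against Equations (15)-(19) as in the sketch below; the tolerance and argument names are illustrative.

```python
def dispatch_is_feasible(p_grid, p_ess, p_pv, p_load, p_charge, p_discharge,
                         soc, p_contract, p_ess_cap, soc_min, soc_max, tol=1e-6):
    """Check a candidate dispatch against Equations (15)-(19)."""
    balance_ok   = abs(p_grid + p_ess + p_pv - p_load) <= tol  # Eq. (15): power balance
    contract_ok  = p_grid <= p_contract                        # Eq. (16): contracted capacity
    exclusive_ok = p_charge * p_discharge == 0                 # Eq. (17): one mode at a time
    converter_ok = abs(p_charge + p_discharge) <= p_ess_cap    # Eq. (18): converter rating
    soc_ok       = soc_min <= soc <= soc_max                   # Eq. (19): SoC bounds
    return all([balance_ok, contract_ok, exclusive_ok, converter_ok, soc_ok])
```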

2.4. DRL Model Structure

This study employs a deep reinforcement learning (DRL) framework to optimize the efficient operation of an energy storage system (ESS) in a microgrid. The DRL model is built upon the Markov Decision Process (MDP) framework, which consists of a state space, an action space, and a reward function. The overall architecture of the DRL-based ESS control system is illustrated in Figure 3.

2.4.1. State Space

The state space represents the set of all environmental variables accessible to the agent. In this study, the state space includes key parameters collected from the microgrid through the energy management system (EMS) at each time step t. These parameters capture the dynamic operating conditions of the system and serve as the input to the learning agent. The state vector includes the following features: state of charge (SoC) of the battery, charge/discharge power, load demand, PV generation, electricity price, weather information, and total grid energy consumption. These elements together constitute the comprehensive representation of the system’s operational status.
$$S_t = \left\{ P_t^{Load},\ P_t^{PV},\ P_t^{charge},\ P_t^{discharge},\ P_t^{Grid},\ SoC_t,\ C_t^{usage},\ P_t^{Peak},\ rad_t,\ temp_t,\ hum_t \right\}$$

2.4.2. Action Space

The action space defines the set of all possible control commands that the agent can issue to the ESS. In this study, the ESS is controlled using a continuous power command, which determines whether the system charges, discharges, or remains idle. The agent selects an action from this space at each time step, and the environment transitions accordingly. The control variable is expressed in units of power (kW), and the action directly affects the SoC and overall scheduling outcome.
$$A_t = \left\{ P_t^{ESS} \right\}$$
$$P_t^{ESS}: \begin{cases} \text{charge} & \text{if } P_t^{ESS} > 0 \\ \text{discharge} & \text{if } P_t^{ESS} < 0 \\ \text{idle} & \text{if } P_t^{ESS} = 0 \end{cases}$$
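The continuous action can be clipped and mapped to an operating mode as sketched below; the ±250 kW bound quoted in Section 3.2 is used here only as an illustrative clipping range.

```python
def interpret_action(p_ess_kw, p_max_kw=250.0):
    """Clip the agent's continuous output and label the resulting ESS mode."""
    p = max(-p_max_kw, min(p_max_kw, p_ess_kw))  # respect the converter rating
    if p > 0:
        return p, "charge"
    if p < 0:
        return p, "discharge"
    return p, "idle"
```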

2.4.3. Reward

The reward function provides feedback to the agent after each action, indicating the impact of its decision on the system. It is a critical component for policy learning, as it drives the optimization of the agent’s behavior. In this study, the reward function is designed to reflect multiple objectives related to ESS operation, including electricity cost minimization, peak load reduction, and minimizing switching frequency.
$$R = \omega_1 \cdot R_{cost} + \omega_2 \cdot R_{switch}$$
Since each of these objectives is measured in different units and has distinct data characteristics, normalization is applied prior to reward aggregation. Specifically, linear data such as energy consumption and cost are normalized using min–max normalization, while nonlinear or highly variable data are normalized using z-score normalization. This ensures that no single objective dominates the reward structure due to scale differences.
$$R_{cost\_normalized} = \frac{R_{cost} - \mu_{cost}}{\sigma_{cost}} \quad \text{(z-score normalization)}$$
$$R_{switch\_normalized} = \frac{R_{switch} - R_{switch\_min}}{R_{switch\_max} - R_{switch\_min}} \quad \text{(min–max normalization)}$$
$$R = \omega_1 \cdot R_{cost\_normalized} + \omega_2 \cdot R_{switch\_normalized}$$
The reward function is composed as a weighted sum of the normalized objective components. Each weight represents the relative importance of an operational goal, and these weights can be adjusted to reflect the desired ESS control strategy. Let $\omega_1$ and $\omega_2$ denote the weights for electricity cost and switching frequency, respectively. These weights can be tuned based on operator preference or optimization goals, as discussed in more detail in Section 3.
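A minimal sketch of the normalization and weighted aggregation described above is shown below; the normalization statistics would be estimated from training data, and in an actual training loop the cost and switching terms would typically be negated so that lower values yield higher reward.

```python
def combined_reward(r_cost, r_switch, mu_cost, sigma_cost,
                    switch_min, switch_max, w1=0.7, w2=0.3):
    """Normalize the two objective terms and aggregate them into one scalar.

    The weights w1, w2 mirror the operator preferences discussed in Section 3.
    """
    cost_norm = (r_cost - mu_cost) / sigma_cost                         # z-score
    switch_norm = (r_switch - switch_min) / (switch_max - switch_min)   # min-max
    return w1 * cost_norm + w2 * switch_norm
```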

3. Proposed ESS Scheduling Algorithm

The proposed ESS scheduling system consists of three main phases, as illustrated in Figure 4. Each phase is structured as follows:
  • Phase 1: Day-ahead forecasting predicts the next day’s load and PV generation profiles. Industrial load data and weather forecast data are processed using GRU or LSTM-based time-series models to forecast load and PV generation over 96 steps (one day). The forecasted results are used as input to the subsequent DRL training process.
  • Phase 2: DRL (MAPPO) utilizes the forecasted data to train multiple agents, each optimizing a distinct objective such as minimizing electricity cost, reducing peak load, or minimizing charge/discharge switching events. A shared actor network and independent critic networks are employed, and policy training is conducted using the Proximal Policy Optimization (PPO) algorithm, including trajectory collection, advantage estimation, and policy updates.
  • Phase 3: Pareto Optimization evaluates candidate actions proposed by each agent across multiple objectives. The actions are assessed based on electricity cost reduction, peak load reduction, and switching frequency minimization. Pareto dominance is determined, and the optimal ESS control action is selected through scalarization based on operator-defined preferences.
These three phases work in an integrated manner to enable real-time scheduling that dynamically balances the conflicting objectives of reducing operational costs, mitigating peak demand, and preserving battery life.

3.1. PPO

The ESS scheduling problem addressed in this paper has the following characteristics. First, ESS charge/discharge control is defined in a continuous action space, as the control variable represents power flow in real numbers. Second, ESS scheduling is formulated as a multi-objective optimization problem that simultaneously considers economic efficiency (electricity cost), technical performance (peak load), and operational stability (switching frequency), each of which may conflict with the others. Third, due to the uncertainty of the microgrid environment—including load and PV forecasting errors—and physical constraints (e.g., SoC limits, contracted power), it is essential to ensure both policy stability and convergence speed during training.
Taking these factors into account, PPO was deemed the most appropriate algorithm for the proposed system for the following reasons. First, PPO supports stochastic policy-based continuous control, allowing for fine-grained adjustment of ESS power output. Second, it utilizes a clipped objective function to suppress abrupt policy changes during learning, ensuring both stable convergence and improved training performance. Third, as an on-policy method, PPO maintains strong consistency between the policy and environment, which is particularly advantageous in relatively stationary microgrid conditions. Lastly, PPO is computationally efficient, requiring only two neural networks—a simple actor–critic structure—compared to more complex algorithms such as TD3 or SAC, which require additional critics or entropy regularization. This simplicity makes PPO particularly suitable for real-time operation in microgrids where computational resources may be limited. The main features of the DRL algorithms applied to ESS scheduling are compared in Table 1. In summary, PPO satisfies the essential requirements of the proposed multi-agent ESS scheduling system, namely policy stability, computational efficiency, and compatibility with continuous control. The experimental results in this paper further demonstrate its superiority over other baseline DRL models.
PPO is widely recognized for its training stability and simplicity. It improves upon vanilla policy gradient methods by introducing a clipped surrogate objective that limits destructive updates and enhances convergence robustness. This allows PPO to achieve competitive performance with relatively simple implementation and fewer hyperparameter sensitivities compared to Trust Region Policy Optimization (TRPO) [37]. Unlike TD3 or SAC, which require complex actor–critic architectures or entropy tuning, PPO achieves reliable training with a simpler structure. Value-based methods such as DQN are effective for discrete action spaces but struggle in high-dimensional or continuous environments. PPO, on the other hand, directly learns a stochastic policy distribution, making it more suitable for generating flexible control signals. Although TD3 introduces twin critics and delayed updates, and SAC enhances exploration via entropy regularization, both demand additional networks and tuning, increasing implementation complexity.
PPO employs a single actor–critic structure and achieves fast, stable convergence through the clipped objective. With only two networks, PPO is more computationally lightweight than TD3 (3 networks) or SAC (4 networks) but maintains a balanced trade-off between performance and simplicity.
Thus, PPO is considered the most suitable choice for stable and efficient scheduling of ESS charge/discharge in microgrid environments that require continuous control.
PPO is based on an actor–critic architecture in which the policy function π θ ( a t | s t ) and value function V ϕ ( s t ) are learned simultaneously. A conventional policy gradient is used to update the policy parameters as follows:
$$\nabla_\theta J(\theta) = \mathbb{E}_t\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right].$$

Here, $\hat{A}_t$ is the advantage function, which quantifies how much better an action is compared to the baseline at a given state.
To improve learning stability and prevent abrupt policy updates, PPO introduces a clipped objective function as follows:
$$L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right].$$

Here, $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, and $\epsilon$ is a small constant that bounds the update size. The clipping mechanism restricts the extent of policy changes, contributing to stable learning.
The critic network is trained using temporal difference (TD) learning, and the value loss is defined as
$$L^{VF}(\phi) = \mathbb{E}_t\left[ \left( V_\phi(s_t) - \hat{V}_t \right)^2 \right].$$

The target value $\hat{V}_t$ is typically estimated as

$$\hat{V}_t = r_t + \gamma V_\phi(s_{t+1}).$$
The overall PPO loss function combines the clipped policy loss, the value loss, and an entropy bonus term as follows:
$$L^{PPO} = L^{CLIP} - c_1 L^{VF} + c_2 \cdot \mathbb{E}_t\left[ H\left(\pi_\theta(a_t \mid s_t)\right) \right].$$

Here, $c_1$ and $c_2$ are coefficients that determine the relative importance of the value loss and entropy bonus, and $H$ is the entropy term that encourages exploration.
PPO uses on-policy data collected under the current policy. Learning is performed over several epochs using mini-batch stochastic gradient descent. In this study, a Gaussian policy is adopted to model continuous control signals for ESS charge/discharge, and the policy network is trained to estimate both the mean and standard deviation.
Through this structure, PPO is capable of learning robust and efficient scheduling strategies in the nonlinear and uncertain environment of microgrid ESS control.
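The clipped surrogate, value loss, and entropy bonus can be combined as in the following PyTorch sketch; this is a simplified single-batch illustration under assumed tensor inputs, not the authors' implementation.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, value_targets,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate + value loss + entropy bonus, as a single loss to minimize."""
    ratio = torch.exp(new_logp - old_logp)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()         # maximize L^CLIP
    value_loss = (values - value_targets).pow(2).mean()         # L^VF
    return policy_loss + c1 * value_loss - c2 * entropy.mean()  # entropy enters as a bonus
```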

3.2. Multi-Agent PPO (MAPPO)

To address the problem of multi-objective optimization in energy storage system (ESS) charge and discharge scheduling, this study proposes a multi-agent proximal policy optimization (MAPPO) structure. The proposed method aims to resolve trade-offs between electricity cost minimization and charge/discharge switching frequency reduction, both of which are essential but often conflicting objectives in microgrid operation.
In the proposed framework, two agents are employed. The first agent is dedicated to minimizing electricity costs, including both energy usage cost and peak demand-based basic cost, which are typically incurred in time-of-use pricing systems. The second agent is focused on minimizing the frequency of charge/discharge state transitions in order to preserve battery longevity and operational stability. Both agents operate within a shared environment, observing the same state space and generating actions in the same action space. Despite this common ground, each agent is trained using an independently defined reward function, allowing them to specialize in their respective objectives.
The state space shared between agents includes key operational information such as the state of charge (SoC), load demand, photovoltaic (PV) generation, electricity pricing, and time-related data. The action space for both agents consists of continuous control signals that determine the ESS power level, typically ranging from –250 kW to +250 kW.
Although the agents share the same actor network to promote common feature learning, each maintains its own critic network that evaluates actions based on a distinct reward signal. This parameter-sharing architecture improves training efficiency and stabilizes learning, while still allowing each agent to optimize its own objective. Training is conducted under a centralized learning paradigm, in which the centralized critic has access to all agents’ state and action information. However, execution is performed in a decentralized manner, where each agent uses only its own policy to make decisions based on the observed state.
During inference, each agent generates a control action for the ESS based on its learned policy. These action candidates are then evaluated using their respective reward functions to construct a reward vector at time step t, denoted as $\mathbf{R}_t = \left( R_t^{cost},\ R_t^{switch} \right)$. Using these reward vectors, the system constructs a Pareto front consisting of non-dominated actions. The final control action is then selected from the Pareto front using a scalarization technique, such as a weighted sum, which reflects the operator’s preferences or operational strategy. To select the final ESS control action, a scalarization function is applied to the Pareto front. This function is defined as

$$a_t = \underset{a \in A_{Pareto}}{\arg\max}\ \left[ \omega_1 \cdot R_t^{cost}(a) + \omega_2 \cdot R_t^{switch}(a) \right],$$

where $\omega_1$ and $\omega_2$ are operator-defined preference weights such that $\omega_1 + \omega_2 = 1$. These weights allow operators to dynamically prioritize between economic efficiency and battery protection without retraining.
By decoupling objectives across agents and integrating their outputs through Pareto-based scalarization, the MAPPO structure enables interpretable, flexible, and real-time multi-objective scheduling for energy storage systems.
It is important to note that the scalarization process is not involved during the training phase. Each agent in the MAPPO structure is trained independently to optimize its own reward function without the use of any preference weights. The scalarization is applied only during the inference phase, where a set of candidate actions generated by the trained policies is evaluated using the operator-defined weights. This separation enables the operator to dynamically adjust preferences between conflicting objectives (cost minimization and switching reduction) without retraining the policies. By maintaining fixed policies and adjusting only the scalarization weights, the proposed MAPPO framework ensures flexible and efficient decision-making under varying operational strategies.
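A sketch of the inference-time candidate generation described above is shown below; the `act(state)` interface and the per-objective reward callables are assumptions made for illustration, and the resulting pairs feed the Pareto selection sketched in Section 3.3.

```python
def candidate_reward_vectors(state, agents, cost_reward, switch_reward):
    """Collect each agent's proposed ESS action with its two-objective scores.

    agents        : trained policies, each exposing an act(state) method (assumed interface)
    cost_reward   : callable scoring an action on the electricity-cost objective (higher = better)
    switch_reward : callable scoring an action on the switching objective (higher = better)
    Returns a list of (action, (R_cost, R_switch)) pairs.
    """
    proposals = [agent.act(state) for agent in agents]        # decentralized execution
    return [(a, (cost_reward(state, a), switch_reward(state, a))) for a in proposals]
```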
By separating the learning processes for each objective while enabling coordinated decision-making through Pareto-based integration, the proposed MAPPO architecture effectively balances economic performance and battery health. This modular and interpretable structure allows for flexible policy control in practical microgrid environments and provides operators with the ability to adjust control priorities without retraining the model. The proposed multi-agent PPO-based ESS scheduling architecture is depicted in Figure 5.
To further clarify the technical advantages of the proposed MAPPO-based scheduling method, Table 2 presents a comparative overview of recent energy storage system (ESS) scheduling approaches. The table summarizes key aspects such as microgrid configuration, scheduling algorithms, targeted objectives, optimization strategies, action space characteristics, and each method’s key contribution.
As shown in Table 2, most prior studies employed either single-agent DRL or traditional optimization methods such as MILP and heuristic search. These methods typically handle single or aggregated objectives, often lacking the flexibility to manage conflicting goals explicitly. Moreover, only a few studies consider battery degradation effects or enable preference-aware policy adaptation. In contrast, the proposed MAPPO framework introduces a multi-agent structure with separate agents dedicated to different objectives, namely electricity cost reduction and battery health preservation, and integrates their decisions using Pareto optimization. This allows for scalable, interpretable, and operator-customizable ESS scheduling policies suitable for real-world applications.

3.3. Pareto Optimization

The scheduling of energy storage systems (ESS) must address multiple conflicting objectives, including minimizing electricity cost, reducing peak load, and decreasing the frequency of charge/discharge state transitions. These objectives form a multi-objective optimization problem that is difficult to resolve through a single scalar reward function. Using a simple weighted sum to combine objectives often results in the over-optimization of one aspect at the expense of others. To overcome this issue, this study employs Pareto optimization based on the outputs of individual agents trained separately for each objective. This approach effectively balances the trade-offs and produces a well-rounded ESS scheduling policy.
In Pareto optimization, one solution is said to dominate another if it performs equally well or better across all objectives, and strictly better in at least one. A solution that is not dominated by any other is called a Pareto optimal solution. The set of all such non-dominated solutions forms the Pareto front, which represents the optimal boundary where multiple objectives can be satisfied simultaneously without compromising one another.
In this study, each agent generates candidate actions based on its learned policy, which was trained to optimize a specific objective function. These actions are associated with reward vectors that capture performance on electricity cost, peak load, and switching frequency reduction. For a given system state, all proposed actions are evaluated, and their corresponding reward vectors are used to determine Pareto dominance. The non-dominated actions are then collected to construct the Pareto front.
From this front, the final control action is selected by applying a scalarization method that reflects the operator’s policy preferences. This scalarization is typically computed as a weighted sum of the individual objective rewards. The weights are adjustable and can be tuned depending on operational priorities or external conditions. The scalarized reward for action a can be expressed as
$$R_{total}(a_i) = \omega_1 \cdot R_i^{cost} + \omega_2 \cdot R_i^{switch},$$

where $\omega_1$ and $\omega_2$ are the relative importance weights assigned to electricity cost and switching minimization, respectively. The action with the highest scalarized reward is selected as the final ESS charge/discharge control decision.
Unlike traditional single-objective reinforcement learning methods, Pareto-based action selection allows for solutions that maintain a balanced performance across multiple goals. It also provides flexibility to adjust strategies depending on system conditions and operator intent. Furthermore, because each candidate action is evaluated across all objectives, this method enhances interpretability and supports transparent decision-making. By applying Pareto optimization to ESS scheduling, the proposed system achieves a well-balanced trade-off between efficiency, cost-effectiveness, and battery longevity, making it well-suited to manage complex and dynamic microgrid environments.
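The dominance test, Pareto-front construction, and weighted-sum selection described above can be sketched as follows, operating on the (action, reward-vector) pairs produced in Section 3.2; function names and default weights are illustrative.

```python
def dominates(u, v):
    """u Pareto-dominates v: no worse on every objective and strictly better on at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(scored):
    """Keep the non-dominated (action, reward_vector) pairs."""
    return [(a, r) for a, r in scored
            if not any(dominates(r_other, r) for _, r_other in scored if r_other is not r)]

def scalarized_choice(scored, w1=0.5, w2=0.5):
    """Pick the final action from the Pareto front by weighted-sum scalarization."""
    front = pareto_front(scored)
    best_action, _ = max(front, key=lambda item: w1 * item[1][0] + w2 * item[1][1])
    return best_action
```

Feeding the candidate list from the two trained agents into `scalarized_choice` yields the final charge/discharge command for the current step.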
Multi-objective optimization is essential for microgrid operation, as many objectives are inherently conflicting. Pareto optimization offers a practical approach to balancing these conflicts by identifying compromise solutions where no objective can be improved without negatively affecting at least one other. The resulting Pareto front defines the trade-off surface from which optimal balanced solutions can be selected.

4. Experimental Setup and Evaluation

This section presents the experimental setup used to empirically evaluate the performance of the proposed multi-objective deep reinforcement learning (DRL)-based ESS scheduling algorithm. It includes the testbed environment, dataset composition, algorithm comparison scenarios, and a quantitative and qualitative analysis of the results. By conducting a comprehensive evaluation for each objective—electricity cost reduction, peak load minimization, and charge/discharge switching reduction—the practical effectiveness of the proposed algorithm in a real microgrid environment is demonstrated from multiple perspectives.

4.1. Testbed Environment

The experimental environment used for training and evaluating the proposed MAPPO-based ESS scheduling model consists of a high-performance computing workstation. The hardware includes an Intel Core i7-12700K CPU (12 cores, 20 threads), an NVIDIA RTX 4090 GPU with 24 GB VRAM, and 64 GB DDR4 RAM. All simulations were conducted on Ubuntu 22.04, using Python 3.9 and PyTorch 2.5.1. CUDA acceleration was enabled to fully utilize the computational power of the GPU during model training.
The testbed used in this study simulates a microgrid composed of an industrial building load, photovoltaic (PV) generation, and an energy storage system (ESS). The specifications of the testbed environment are summarized in Table 3.
The experiment was conducted using a microgrid testbed built upon real industrial consumer load data. The microgrid includes a 1000 kWh ESS, a maximum charge/discharge capacity of 250 kW, and a 250 kW PV system. The ESS is modeled as a lithium-ion battery with a charge/discharge efficiency of 95%. The PV system consists of fixed panels, and its hourly output is simulated based on irradiance data. The grid interconnection is configured as a unidirectional structure, which prevents the ESS from exporting energy back to the grid. The ESS is scheduled at 15 min intervals, resulting in 96 steps per day. The electricity pricing model reflects the actual time-of-use pricing system used by KEPCO (Korea Electric Power Corporation), in which the basic cost is determined based on peak demand and the usage cost is calculated from the cumulative power drawn from the grid. This setup is designed to replicate realistic industrial electricity pricing and ensure the validity of the experiments. The time-of-day pricing and load peak periods are shown in Table 4 and Table 5, respectively.

4.2. Test Data

The test dataset is based on real-world measurements. One year of load data was divided into daily segments for training and testing. The load profile reflects the usage characteristics of industrial consumers, with higher consumption during the day and lower at night, and includes seasonal and weekday variations. PV generation was simulated using a forecasting model that combines historical irradiance data and day-ahead weather forecasts. Time-of-use electricity pricing was applied, with higher prices during peak demand periods and lower prices during off-peak hours. All experiments were conducted under the same initial SoC conditions and load-generation scenarios to ensure fair comparisons between algorithms.

4.3. Scenario

In this study, four reinforcement learning models—PPO, TD3, SAC, and the proposed MAPPO—are compared for ESS scheduling performance. All models were trained using the same dataset, consisting of industrial microgrid load and PV generation records from September 2022 to October 2023. After training, inference was conducted over a five-day test period that includes both weekdays and weekends. Table 6 summarizes the main hyperparameter settings used for the reinforcement learning algorithms in this study. All models were trained under identical environment settings, and the hyperparameters were tuned to ensure stable training and optimal performance for each algorithm.
During each inference scenario, ESS charge/discharge scheduling was performed at 15 min intervals (96 steps per day), and the performance was evaluated based on electricity cost reduction and switching frequency minimization.
Figure 6 illustrates the load, PV generation, grid electricity usage patterns, and SOC variations when the ESS is not operated. In the absence of ESS operation, the grid supplies the net load directly without any charge/discharge control, resulting in higher dependency on the grid. Under this scenario, the total annual electricity cost was calculated as 2,687,600 KRW, with 1,115,720 KRW attributed to usage cost. The effectiveness of the proposed reinforcement learning-based ESS scheduling algorithm is evaluated by comparing it against this No-ESS operation scenario.
In Figure 6, the blue background represents off-peak time periods, the green background indicates mid-peak time periods, and the red background corresponds to on-peak time periods. During weekdays, the load typically rises to around 60 kWh, necessitating significant grid power usage in the absence of ESS scheduling. In contrast, during weekends, the load decreases and PV generation becomes relatively higher, leading to instances where the generation exceeds the load, resulting in reverse power flow back to the grid.

4.4. Results and Discussion

This section presents an experimental analysis of the proposed reinforcement learning-based ESS scheduling algorithms. The models compared include TD3, SAC, PPO and the proposed MAPPO. All models were trained and validated under identical datasets and test conditions, and the evaluation metrics include annual electricity cost savings, average daily switching frequency, ESS operational stability, and load response characteristics. The experimental results are analyzed visually by examining the charge/discharge scheduling patterns, grid power usage, and SOC trajectories of each model, followed by a quantitative performance comparison focusing on economic efficiency and battery preservation.
Figure 7 presents the ESS scheduling results for different deep reinforcement learning models, including TD3, SAC, PPO and MAPPO. The top plots illustrate the variations in load, PV generation, grid power usage, and state of charge (SoC), while the bottom plots depict the selected charge/discharge actions and SoC changes.
Overall, all reinforcement learning models learned a scheduling pattern of charging the ESS during off-peak periods and discharging during peak load periods. However, the SAC, PPO, and MAPPO models were more effective in reducing the reverse power flow occurring during weekends with excessive PV generation. In particular, the MAPPO and PPO models minimized the usage cost most effectively, resulting in the greatest overall reduction in total electricity cost. Among them, MAPPO achieved fewer charge/discharge switching events compared to PPO, demonstrating superior performance not only in economic efficiency but also in preserving battery longevity through optimized ESS operation.
Table 7 compares the annual usage cost, basic cost, total electricity cost, charge/discharge switching frequency, and total cost reduction rate between the No-ESS case and ESS scheduling models based on TD3, SAC, PPO, and MAPPO. Without ESS operation, the total annual electricity cost was 2,687,600 KRW, and the usage cost alone accounted for 1,115,720 KRW. When ESS scheduling was applied, all models exhibited significant reductions in both usage and total electricity costs. In particular, the PPO model achieved the lowest usage cost (750,200 KRW) and total cost (2,271,960 KRW), demonstrating the highest economic benefit among the compared models. The MAPPO model also achieved comparable performance, with a usage cost of 768,680 KRW and a total cost of 2,292,970 KRW.
In terms of total electricity cost reduction relative to the No-ESS case, PPO achieved the highest reduction rate of 15.47%, followed by MAPPO at 14.68%, TD3 at 12.5%, and SAC at 11.12%. These results indicate that reinforcement learning-based ESS scheduling substantially contributes to improving economic efficiency. Regarding battery operation, the MAPPO model exhibited the lowest average daily switching frequency (49 events/day), suggesting superior battery health preservation. In contrast, PPO recorded the highest switching frequency (121 events/day), indicating a potential risk of accelerated battery degradation despite its strong cost savings. Overall, the proposed MAPPO-based ESS scheduling model effectively balances economic efficiency and battery preservation, significantly enhancing practical operational performance compared to conventional methods.
The basic electricity charge remained unchanged regardless of the application of ESS scheduling. This is because, even without ESS operation, the maximum load peak did not exceed the minimum threshold specified by the contracted power capacity, and the ESS scheduling applied in this study did not alter the load profile sufficiently to surpass the threshold. As a result, while the ESS contributed to reducing the usage charge, it had no impact on the basic charge. Under such conditions, selecting a tariff structure with a relatively higher basic charge and a lower usage charge could be advantageous. This strategy would maximize the economic benefits of ESS operation by emphasizing usage cost reduction while maintaining a fixed basic charge.
Table 8 presents the impact of different reward weight settings on the total electricity cost and the frequency of battery charge/discharge switching. As the weight $\omega_1$ assigned to cost minimization increases and the weight $\omega_2$ assigned to switching frequency minimization decreases, the total cost gradually decreases while the battery switching events increase. When $\omega_1 = 0.9$ and $\omega_2 = 0.1$, the lowest total cost of 2,254,100 KRW is achieved, but at the expense of the highest switching frequency (132 events/day), indicating potential battery degradation risks.
Conversely, when $\omega_2$ is dominant (e.g., $\omega_1 = 0.1$, $\omega_2 = 0.9$), the switching events are minimized to only six times per day, but the total cost increases significantly. These results highlight the trade-off relationship between economic efficiency and battery protection and demonstrate the importance of selecting appropriate weight settings based on operational priorities.

5. Conclusions

In this paper, we proposed a novel scheduling framework for energy storage systems (ESS) in microgrids based on multi-agent deep reinforcement learning (DRL) combined with Pareto optimization. The proposed approach effectively addresses the inherently conflicting objectives of minimizing total electricity costs and minimizing charge/discharge switching frequency, which are critical for both economic efficiency and battery longevity. To achieve this, a multi-agent proximal policy optimization (MAPPO) framework was designed, where two specialized agents independently optimize electricity costs and switching frequency, respectively. Their decisions are integrated using Pareto front evaluation and operator-defined scalarization, ensuring a flexible and balanced scheduling policy. This structure not only enhances interpretability but also allows dynamic adjustment of operational priorities without retraining.
Extensive simulations were conducted using real-world industrial load and photovoltaic (PV) generation data under realistic South Korean electricity tariffs. The proposed MAPPO-based scheduling strategy demonstrated significantly improved performance compared to baseline DRL algorithms such as TD3, SAC, and PPO. The MAPPO model achieved a 14.68% reduction in total electricity cost compared to the No-ESS case and 3.56% more cost savings than single-agent DRL models using scalarized rewards. Furthermore, the frequency of charge/discharge switching was reduced by 35.2% compared to PPO. In terms of billing components, the usage charge was reduced by up to 18.9%, while the basic electricity charge remained unchanged, since contracted demand thresholds were not exceeded. These findings suggest that under the current tariff structures, which emphasize basic charges, the benefits of ESS may be constrained. A policy shift towards increasing the proportion of usage-based tariffs may further enhance the economic value of ESS deployment.
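For transparency, the cost-reduction percentages quoted above follow directly from the totals in Table 7, as the quick check below confirms; the note on the 3.56% figure reflects our reading that it corresponds to the gap between MAPPO (14.68%) and the SAC baseline (11.12%).

```python
# Quick check of the total-cost reduction rates reported in Table 7 (values in KRW).
no_ess = 2_687_600
totals = {"TD3": 2_351_600, "SAC": 2_388_740, "PPO": 2_271_960, "MAPPO": 2_292_970}
for name, cost in totals.items():
    print(f"{name}: {100 * (no_ess - cost) / no_ess:.2f}% reduction")
# TD3: 12.50%, SAC: 11.12%, PPO: 15.47%, MAPPO: 14.68%
# 14.68% - 11.12% = 3.56 percentage points (MAPPO vs. the SAC baseline)
```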
As a future direction, this research will be extended to jointly optimize the capacity sizing of ESS and PV systems, taking into account both capital investment and replacement costs. By integrating long-term economic evaluation with short-term operational control, we aim to develop a comprehensive co-optimization framework for microgrid planning and ESS–PV deployment that maximizes cost-effectiveness and operational reliability.

Author Contributions

Conceptualization, S.-W.J. and Y.-Y.A.; methodology, B.K.S., Y.B.P. and J.K.; validation, S.-W.J. and Y.-Y.A.; writing—original draft preparation, S.-W.J.; writing—review and editing, K.-I.K.; supervision, K.-I.K.; project administration, K.-I.K.; funding acquisition, K.-I.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government through MSIT under Grant RS-2022-00165225, and in part by Institute for Information and communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2022-II221200, Convergence security core talent training business (Chungnam National University)).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Impram, S.; Secil, V.N.; Bülent, O. Challenges of renewable energy penetration on power system flexibility: A survey. Energy Strategy Rev. 2020, 31, 100539.
2. Eltigani, D.; Syafrudin, M. Challenges of integrating renewable energy sources to smart grids: A review. Renew. Sustain. Energy Rev. 2015, 52, 770–780.
3. Sinsel, S.R.; Rhea, L.R.; Volker, H.H. Challenges and solution technologies for the integration of variable renewable energy sources—A review. Renew. Energy 2020, 145, 2271–2285.
4. Liu, X. Microgrids for enhancing the power grid resilience in extreme conditions. IEEE Trans. Smart Grid 2016, 8, 589–597.
5. Oliveira, D.Q. A critical review of energy storage technologies for microgrids. Energy Syst. 2021, 1–30.
6. Shahzad, S. Possibilities, challenges, and future opportunities of microgrids: A review. Sustainability 2023, 15, 6366.
7. Allwyn, R.G.; Amer, A.H.; Vijaya, M. A comprehensive review on energy management strategy of microgrids. Energy Rep. 2023, 9, 5565–5591.
8. Chaudhary, G. Review of energy storage and energy management system control strategies in microgrids. Energies 2021, 14, 4929.
9. Khan, M.W. Optimal energy management and control aspects of distributed microgrid using multi-agent systems. Sustain. Cities Soc. 2019, 44, 855–870.
10. Minchala, A.; Luis, I. A review of optimal control techniques applied to the energy management and control of microgrids. Procedia Comput. Sci. 2015, 52, 780–787.
11. Jang, M.J. Optimization of ESS scheduling for cost reduction in commercial and industry customers in Korea. Sustainability 2022, 14, 3605.
12. Kim, W.W. Operation scheduling for an energy storage system considering reliability and aging. Energy 2017, 141, 389–397.
13. Raghavan, A.; Paarth, M.; Ajitha, S. Optimization of day-ahead energy storage system scheduling in microgrid using genetic algorithm and particle swarm optimization. IEEE Access 2020, 8, 173068–173078.
14. Shakya, A.K.; Gopinatha, P.; Sohom, C. Reinforcement learning algorithms: A brief survey. Expert Syst. Appl. 2023, 231, 120495.
15. Ojand, K.; Hanane, D. Q-learning-based model predictive control for energy management in residential aggregator. IEEE Trans. Autom. Sci. Eng. 2021, 19, 70–81.
16. Ji, Y. Real-time energy management of a microgrid using deep reinforcement learning. Energies 2019, 12, 2291.
17. Gorostiza, F.S.; Francisco, M.G. Deep reinforcement learning-based controller for SOC management of multi-electrical energy storage system. IEEE Trans. Smart Grid 2020, 11, 5039–5050.
18. Fan, L. Optimal scheduling of microgrid based on deep deterministic policy gradient and transfer learning. Energies 2021, 14, 584.
19. Li, H.; Zhiqiang, W.; Haibo, H. Real-time residential demand response. IEEE Trans. Smart Grid 2020, 11, 4144–4154.
20. Kang, H. Reinforcement learning-based optimal scheduling model of battery energy storage system at the building level. Renew. Sustain. Energy Rev. 2024, 190, 114054.
21. Xu, G. An optimal solutions-guided deep reinforcement learning approach for online energy storage control. Appl. Energy 2024, 361, 122915.
22. Pinciroli, L. Optimal operation and maintenance of energy storage systems in grid-connected microgrids by deep reinforcement learning. Appl. Energy 2023, 352, 121947.
23. Härtel, F.; Thilo, B. Minimizing energy cost in PV battery storage systems using reinforcement learning. IEEE Access 2023, 11, 39855–39865.
24. Zheng, Y. Optimal scheduling strategy of electricity and thermal energy storage based on soft actor-critic reinforcement learning approach. J. Energy Storage 2024, 92, 112084.
25. Farzin, H.; Mahmud, F.; Moein, M.A. A stochastic multi-objective framework for optimal scheduling of energy storage systems in microgrids. IEEE Trans. Smart Grid 2016, 8, 117–127.
26. Geng, S. Multi-objective optimization of a microgrid considering the uncertainty of supply and demand. Sustainability 2021, 13, 1320.
27. Wei, M.; Yang, S.; Wu, W.; Sun, B. A multi-objective fuzzy optimization model for multi-type aircraft flight scheduling problem. Transport 2024, 39, 313–322.
28. Zhou, K.; Kaile, Z.; Shanlin, Y. Reinforcement learning-based scheduling strategy for energy storage in microgrid. J. Energy Storage 2022, 51, 104379.
29. Mao, Y.; Chao, H.; Wei, Z.; Guozhong, F.; Yunpeng, J. A short-term power prediction method based on numerical weather prediction correction and the fusion of adaptive spatiotemporal graph feature information for wind farm cluster. Expert Syst. Appl. 2025, 274, 126979.
30. Zhang, T.; Zhao, W.; He, Q.; Xu, J. Optimization of Microgrid Dispatching by Integrating Photovoltaic Power Generation Forecast. Sustainability 2025, 17, 648.
31. Hassan, M. A comprehensive review of battery state of charge estimation techniques. Sustain. Energy Technol. Assess. 2022, 54, 102801.
32. Li, R. Accelerated aging of lithium-ion batteries: Bridging battery aging analysis and operational lifetime prediction. Sci. Bull. 2023, 68, 3055–3079.
33. Park, W.Y. Korean Power System Challenges and Opportunities. In Priorities for Swift and Successful Clean Energy Deployment at Scale; Lawrence Berkeley National Laboratory: Berkeley, CA, USA, 2023.
34. Celik, B. Prediction of battery cycle life using early-cycle data, machine learning and data management. Batteries 2022, 8, 266.
35. Severson, K.A. Data-driven prediction of battery cycle life before capacity degradation. Nat. Energy 2019, 4, 383–391.
36. Fanoro, M.; Mladen, B.; Saurabh, S. A review of the impact of battery degradation on energy management systems with a special emphasis on electric vehicles. Energies 2022, 15, 5889.
37. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015.
Figure 1. RL Algorithms in ESS Scheduling.
Figure 2. Microgrid Components.
Figure 3. DRL architecture.
Figure 4. Proposed ESS Scheduling Framework.
Figure 5. Multi-agent PPO Architecture.
Figure 6. No-ESS operation scenario.
Figure 7. ESS Charge/Discharge Scheduling Results for Different DRL Models ((a) SAC, (b) TD3, (c) PPO, (d) MAPPO).
Table 1. Comparison of DRL algorithms.

| | DQN | PPO | TD3 | SAC |
|---|---|---|---|---|
| State space | Discrete | Continuous | Continuous | Continuous |
| Action space | Discrete | Continuous | Continuous | Continuous |
| Policy type | Value-based | Policy-based (stochastic) | Actor–Critic (deterministic) | Actor–Critic (stochastic) |
| Policy | Q-table estimate | Clipped policy update | Twin Q + delayed update | Entropy-regularized stochastic |
| On-policy/Off-policy | Off-policy | On-policy | Off-policy | Off-policy |
| Exploration | ε-greedy | Stochastic | Noise | Temperature |
| Actor–Critic network number | 2 networks | 2 networks | 3 networks | 4 networks |
| Q-network number | 1 (single Q) | 0 (value function) | 2 (twin delayed Q) | 2 (double Q) |
Table 2. Comparative analysis of recent ESS scheduling studies.

| Paper | MG Configuration | Scheduling Algorithm | Objectives | Optimization Strategy | Action Space | Key Contribution |
|---|---|---|---|---|---|---|
| [11] | PV, ESS | MILP | Cost | MILP | Discrete | Realistic Korean tariff + degradation modeling |
| [13] | PV, ESS | GA + PSO | Cost, Peak | Heuristic | Continuous | GA vs. PSO comparison for scheduling |
| [15] | PV, ESS | Q-Learning + MPC | Cost | Hybrid (MPC + RL) | Discrete | Hybrid model-based + model-free control |
| [17] | ESS × N | DDPG | SoC health | Reward shaping | Continuous | Multi-ESS SoC coordination |
| [18] | PV, ESS | TD3 | O&M cost | Weight sum | Continuous | Maintenance-aware ESS operation |
| [22] | ESS | PPO | Elec. + thermal cost | Scalarization | Continuous | Maintenance-aware ESS operation |
| [24] | PV, ESS, Boiler | SAC | Cost, reliability, emission | Scenario-based stochastic optimization | Continuous | Multi-energy DRL coordination |
| [25] | PV, ESS | MILP (stochastic) | Cost | Weight sum | Discrete | Probabilistic multi-objective formulation |
| This paper | PV, ESS | MAPPO | Cost, battery | Pareto | Continuous | Multi-agent, dynamic preference control |
Table 3. Testbed environment configuration.

| Config | Specification |
|---|---|
| ESS Capacity | 1000 kWh |
| ESS Charge/Discharge Power Limit | ±250 kW |
| PV Capacity | 250 kW |
| Time Resolution | 15 min (96 steps/day) |
| Contract Power | 500 kW |
| Converter Capacity | 250 kW |
| Upper Limit of SoC | 90% |
| Lower Limit of SoC | 10% |
| Converter Efficiency | 95% |
Table 4. Load peak time periods according to seasons.

| Season | Off-Peak | Mid-Peak | On-Peak |
|---|---|---|---|
| Spring/Summer/Fall (Mar–Oct) | 22:00–08:00 | 08:00–11:00, 12:00–13:00, 18:00–22:00 | 11:00–12:00, 13:00–18:00 |
| Winter (Nov–Feb) | 22:00–08:00 | 08:00–09:00, 12:00–16:00, 19:00–22:00 | 09:00–12:00, 16:00–19:00 |
Table 5. Electricity Tariff by Time Period (KRW/kWh).

| Basic Cost | Peak Time | Summer (Jun–Aug) | Spring/Fall (Mar–May, Sep–Oct) | Winter (Nov–Feb) |
|---|---|---|---|---|
| 8320 | Off-peak | 94.0 | 94.0 | 101.0 |
| | Mid-peak | 146.9 | 116.5 | 147.1 |
| | On-peak | 229.0 | 147.2 | 204.6 |
Table 6. Hyperparameter settings (PPO, TD3, SAC).

| Hyperparameter | PPO | TD3 | SAC |
|---|---|---|---|
| Learning Rate | 3 × 10⁻⁴ | 3 × 10⁻⁴ | 3 × 10⁻⁴ |
| Batch Size | 256 | 256 | 256 |
| Discount Factor | 0.995 | 0.995 | 0.995 |
| Replay Buffer Size | – | 1 × 10⁶ | 1 × 10⁶ |
| Optimizer | Adam | Adam | Adam |
| Target Network Update Rate | – | 0.005 | 0.005 |
| Target Update Frequency | – | 2 | Continuous |
| Clipping Range | 0.4 | – | – |
| Policy Noise | – | 0.1 | – |
| Noise Clip | – | 0.3 | – |
| Entropy Coefficient | – | – | 0.2 |
| Exploration Strategy | Gaussian noise | Policy smoothing (0.2) | Entropy regularization |
Table 7. Comparison of electricity cost and battery charge/discharge state switch.

| | No-ESS | TD3 | SAC | PPO | MAPPO |
|---|---|---|---|---|---|
| Usage Cost (KRW) | 1,115,720 | 820,300 | 813,480 | 750,200 | 768,680 |
| Basic Cost (KRW) | 1,248,000 | 1,248,000 | 1,248,000 | 1,248,000 | 1,248,000 |
| Total Cost (KRW) | 2,687,600 | 2,351,600 | 2,388,740 | 2,271,960 | 2,292,970 |
| Battery Charge/Discharge Switch (events/day) | 0 | 113 | 81 | 121 | 49 |
| Total Cost Reduction (%) | 0 | 12.5 | 11.12 | 15.47 | 14.68 |
| Training Time (Hour) | 0 | 9.3 | 5.8 | 2.4 | 2.5 |
Table 8. Impact of reward weight settings on total cost and battery charge/discharge state switch.

| Weight | Total Cost (KRW) | Battery Charge/Discharge Switch (events/day) |
|---|---|---|
| w1 = 0.1, w2 = 0.9 | 2,564,350 | 6 |
| w1 = 0.3, w2 = 0.7 | 2,448,750 | 34 |
| w1 = 0.5, w2 = 0.5 | 2,363,450 | 41 |
| w1 = 0.7, w2 = 0.3 | 2,292,970 | 49 |
| w1 = 0.9, w2 = 0.1 | 2,254,100 | 132 |