Article

Multi-Agent DDPG-Based Multi-Device Charging Scheduling for IIoT Smart Grids

1 Guangxi Key Laboratory of Brain-inspired Computing and Intelligent Chips, School of Electronic and Information Engineering/School of Integrated Circuits, Guangxi Normal University, Guilin 541001, China
2 School of Electronic and Information Engineering, Harbin Institute of Technology, Shenzhen 150001, China
3 Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230026, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2025, 25(17), 5226; https://doi.org/10.3390/s25175226
Submission received: 21 July 2025 / Revised: 19 August 2025 / Accepted: 20 August 2025 / Published: 22 August 2025
(This article belongs to the Special Issue Smart Sensors, Smart Grid and Energy Management)

Abstract

As electric vehicles (EVs) gain widespread adoption in industrial environments supported by Industrial Internet of Things (IIoT) smart grid technology, coordinated charging of multiple EVs has become vital for maintaining grid stability. In response to the scalability challenges faced by traditional algorithms in multi-device environments and the limitations of discrete action spaces in continuous control scenarios, this paper proposes a dynamic charging scheduling algorithm for EVs based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG). The algorithm combines real-time electricity prices, battery status monitoring, and distributed sensor data to dynamically optimize charging and discharging strategies of multiple EVs in continuous action spaces. The goal is to reduce charging costs and balance grid load through coordinated multi-agent learning. Experimental results show that, compared with the Greedy baseline, the proposed MADDPG algorithm achieves a 41.12% cost reduction over a 30-day evaluation period. Additionally, it effectively adapts to price fluctuations and user demand changes through Vehicle-to-Grid technology, optimizing charging time allocation and enhancing grid stability.

1. Introduction

With increasing global environmental awareness, electric vehicles (EVs) are experiencing significant growth worldwide. EVs encompass not only traditional transportation vehicles such as electric cars and trucks, but also industrial equipment, including electric forklifts and self-driven vehicles. These battery-powered devices provide solutions for reducing carbon emissions and decreasing reliance on fossil fuels. As technology advances, EVs have become more common in industrial sectors, creating opportunities for coordinated energy management [1,2,3,4]. However, the widespread adoption of EVs introduces challenges for power grid management, particularly concerning dynamic charging demands and load variations. The charging requirements must account for real-time grid load conditions and involve intelligent adjustments to prevent excessive strain during peak demand periods [5]. Consequently, the development of efficient EV charging scheduling methods has become important for ensuring grid stability [6].
The advancement of smart grid technologies has transformed energy management systems through intelligent sensing and communication capabilities. Modern charging infrastructure leverages smart sensors to monitor critical parameters, including battery state-of-charge, grid conditions, and dynamic electricity pricing in real-time [7,8,9]. The integration of Industrial Internet of Things (IIoT) [10,11,12] further enhances fleet management through predictive frameworks and real-time data exchange [13], while simultaneously raising cybersecurity concerns regarding the protection of sensitive charging data. Within this interconnected environment, the main challenge becomes coordinated decision-making in continuous action spaces, where each EV’s charging actions directly affect overall system performance. This IIoT-enabled infrastructure facilitates such coordination through continuous data exchange, enabling accurate system monitoring and improved energy management.
Traditional optimization methods have demonstrated value in early studies but reveal significant limitations when confronted with multi-agent coordination challenges. Swarm intelligence algorithms such as genetic algorithms and particle swarm optimization [14,15] perform well in static scenarios but struggle with real-time adaptability and inter-agent coordination. Model predictive control [16] and Lyapunov optimization approaches [17] have shown promise in dynamic scenarios but face computational scalability issues when managing multiple autonomous agents with continuous action spaces. Additionally, these methods fail to address the fundamental challenge of non-stationary environments that arise when multiple learning agents operate simultaneously.
To address these challenges, Deep Reinforcement Learning (DRL) [18] has emerged as a breakthrough solution due to its ability to handle high-dimensional state spaces, continuous action spaces, and dynamic environments. However, most existing DRL applications in EV charging focus on single-agent scenarios, limiting their applicability to real-world multi-device environments. Single-agent Deep Q-Network (DQN) approaches [19], while effective for individual EV optimization, suffer from the curse of dimensionality when extended to multi-agent settings and cannot handle continuous charging power control effectively. To overcome the discrete action limitations of DQN, the Deep Deterministic Policy Gradient (DDPG) algorithm has been adopted for continuous EV charging control [20], demonstrating superior performance in individual EV optimization scenarios. However, these single-agent approaches fail to consider the collective performance and coordination requirements of multiple EVs operating simultaneously.
As mentioned above, the limitations of single-agent approaches become evident when managing multiple charging piles at charging stations. If all charging piles make decisions independently, the total power demand may exceed the transformer’s rated capacity, potentially affecting power system stability. Recent studies have addressed this by imposing constraints on total charging rates [21] and developing recurrent DDPG algorithms [22]. However, these approaches still rely on single-agent frameworks, which face significant scalability challenges in complex multi-device scenarios. Recognizing these limitations, multi-agent reinforcement learning (MARL) has gained attention for coordinating multiple decision-making units. MARL algorithms are primarily categorized into Q-learning-based methods, such as QMIX [23], and policy gradient-based methods, including Multi-Agent Deep Deterministic Policy Gradient (MADDPG). Recent advances in policy-gradient MARL have demonstrated superior capabilities in continuous control domains, with Actor-Critic methods enabling more precise control over EV charging and discharging behaviors [24]. These methods have shown particular effectiveness in industrial environments, where multi-agent systems can flexibly adapt to dynamic resource allocation challenges in terminal–edge–cloud IIoT architectures [25]. MADDPG has been specifically applied to charging decision problems involving multiple charging piles [26,27,28]. While MARL shows promise for EV charging coordination, some limitations remain in existing approaches [29,30]. Most existing methods employ limited inter-agent interaction or operate as independent single-agent systems, resulting in uncoordinated charging behaviors that may cause grid load imbalances or overload conditions when multiple EVs simultaneously access limited grid resources during peak demand periods.
Additionally, current charging coordination frameworks lack design considerations for Industrial IoT scenarios and fail to exploit the real-time data capabilities inherent in IIoT environments. This limitation prevents effective utilization of real-time grid conditions, dynamic pricing information, and EV status updates for optimized charging scheduling in industrial park applications. To address these limitations and research gaps, this paper proposes an MADDPG-based coordinated charging scheduling algorithm that integrates continuous action control with multi-agent coordination in IIoT-enabled environments.
The primary contributions of this paper are summarized as follows:
  • We propose an MADDPG-based coordination framework for smart sensor-integrated Industrial IoT environments that combines real-time multi-device sensor data with multi-agent reinforcement learning for EV charging scheduling. Our framework exploits smart sensor infrastructure characteristics, real-time data collection capabilities, and distributed communication protocols to enhance coordination performance and reduce charging costs in industrial park settings.
  • We propose an MADDPG-based multi-agent algorithm that facilitates coordinated policy learning among multiple EVs, ensuring continuous control over charging power. This approach allows each EV agent to autonomously decide based on local observations, effectively accommodating the diverse battery capacities and charging requirements of heterogeneous EVs. By utilizing continuous control, our algorithm overcomes the discrete action limitations found in traditional approaches like QMIX.
  • Comprehensive experimental evaluation demonstrates the effectiveness of the proposed sensor-integrated approach, achieving a 41.12% cost reduction compared with the Greedy baseline over a 30-day evaluation period while maintaining grid stability and satisfying each EV’s charging requirements under realistic industrial park scenarios with dynamic pricing, real-time sensor monitoring, and varying EV fleet sizes.
The rest of this paper is organized as follows. Section 2 presents the system model and problem formulation for multi-agent EV charging coordination. The design and implementation of the proposed MADDPG-based multi-device scheduling algorithm are discussed in Section 3. Section 4 provides the experimental setup and results, including performance evaluation against existing approaches. Finally, conclusions and future research directions are discussed in Section 5.

2. System Model and Problem Formulation

2.1. System Model

We consider an industrial park equipped with N heterogeneous EVs, each requiring coordinated charging scheduling to optimize grid load profiles and minimize operational costs. Figure 1 illustrates the overall system architecture, where multiple EVs are connected to a centralized charging management system through IIoT infrastructure, enabling real-time data exchange and coordinated decision-making.
Each EV is connected to a bidirectional charging infrastructure that supports Vehicle-to-Grid (V2G) functionality, enabling both energy consumption from the grid and energy feedback during peak load periods. The IIoT infrastructure consists of three integrated components: embedded sensors on each EV that continuously monitor battery State of Charge (SoC), remaining charging duration before departure, and real-time charging status; smart charging stations equipped with bidirectional power converters and communication modules that interface with grid management systems to acquire electricity price signals and grid load conditions; and a centralized data aggregation unit that processes information for coordinated decision-making.
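To make this data flow concrete, the following minimal sketch shows one way the aggregation unit could package the sensor readings described above into a per-EV observation record; the field and function names are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class EVObservation:
    """Local observation assembled by the IIoT aggregation unit for one EV."""
    soc: float              # battery State of Charge reported by the on-board sensor
    remaining_hours: float  # remaining time before the scheduled departure
    price: float            # current electricity price signal (GBP/kWh)
    grid_load: float        # grid load reported by the charging station

def build_observation(sensor_msg: dict) -> EVObservation:
    """Map a raw sensor/charging-station message onto the observation fields."""
    return EVObservation(
        soc=sensor_msg["soc"],
        remaining_hours=sensor_msg["t_rem"],
        price=sensor_msg["price"],
        grid_load=sensor_msg["load"],
    )
```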

2.2. Battery State Model

The battery status of each EV is characterized by its SoC, representing the ratio of remaining energy to nominal battery capacity. The SoC evolves over discrete time intervals t based on applied charging or discharging power, with continuous monitoring through embedded sensor systems ensuring accurate state estimation.
For charging operations, the SoC update is
$$SoC_{i,t+1} = SoC_{i,t} + \frac{\eta_c \, P^{\mathrm{ch}}_{i,t} \, \Delta t}{C_i}, \tag{1}$$
where $SoC_{i,t}$ is the SoC of EV $i$ in time slot $t$, $P^{\mathrm{ch}}_{i,t} \ge 0$ is the charging power, $C_i$ is the battery capacity, $\eta_c \in (0,1]$ is the charging efficiency, and $\Delta t$ is the time slot duration.
For discharging operations, the SoC evolves as
$$SoC_{i,t+1} = SoC_{i,t} - \frac{P^{\mathrm{dis}}_{i,t} \, \Delta t}{\eta_d \, C_i}, \tag{2}$$
where $P^{\mathrm{dis}}_{i,t} \ge 0$ is the discharging power, and $\eta_d \in (0,1]$ is the discharging efficiency.
The SoC is constrained within feasible bounds:
$$SoC_{\min} \le SoC_{i,t} \le SoC_{\max}, \quad \forall i, t. \tag{3}$$
Individual power limits are enforced:
$$0 \le P^{\mathrm{ch}}_{i,t} \le P^{\mathrm{ch}}_{i,\max}, \qquad 0 \le P^{\mathrm{dis}}_{i,t} \le P^{\mathrm{dis}}_{i,\max}, \quad \forall i, t. \tag{4}$$
To prevent grid overload, the aggregate charging power is constrained:
$$\sum_{i=1}^{N} P^{\mathrm{ch}}_{i,t} \le P_{\max}, \quad \forall t, \tag{5}$$
where $P_{\max}$ is the maximum permissible total charging power.
These constraints ensure safe battery operation and grid stability while enabling effective coordination among multiple charging devices through real-time monitoring and control.
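The following minimal sketch illustrates the battery dynamics and bounds of (1)–(3) for a single EV; the function name, default efficiencies, and example values are assumptions for illustration only.

```python
import numpy as np

def step_soc(soc, power, dt, capacity, eta_ch=0.95, eta_dis=0.95,
             soc_min=0.1, soc_max=1.0):
    """Advance one EV's SoC by one time slot.

    power > 0 is charging power (kW), power < 0 is discharging power (kW),
    following the charging update (1) and discharging update (2); the result
    is clipped to the feasible bounds of (3).
    """
    if power >= 0:
        soc_next = soc + eta_ch * power * dt / capacity        # Eq. (1)
    else:
        soc_next = soc + power * dt / (eta_dis * capacity)     # Eq. (2), power = -P_dis
    return float(np.clip(soc_next, soc_min, soc_max))          # Eq. (3)

# Example: a 30 kWh battery charged at 6 kW for one hour rises from 0.30 to about 0.49.
print(step_soc(soc=0.30, power=6.0, dt=1.0, capacity=30.0))
```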

2.3. Problem Formulation

The objective of the multi-device charging scheduling problem is to minimize the total electricity cost for all EVs over a finite time horizon T while satisfying individual charging requirements and grid constraints. The total cost for all EVs is defined as
$$C = \sum_{t=1}^{T} \sum_{i=1}^{N} \left( P_t \, P^{\mathrm{ch}}_{i,t} - P_t \, P^{\mathrm{dis}}_{i,t} \right), \tag{6}$$
where $P_t$ is the time-varying electricity price at time slot $t$. Here, $P_t P^{\mathrm{ch}}_{i,t}$ represents the cost incurred when purchasing electricity from the grid, while $P_t P^{\mathrm{dis}}_{i,t}$ represents the revenue generated by selling energy back to the grid.
The optimization problem is formulated as
$$\min_{P^{\mathrm{ch}}_{i,t},\, P^{\mathrm{dis}}_{i,t}} \; C \tag{7}$$
subject to the battery dynamics in (1) and (2), the constraints in (3)–(5), and the charging completion requirement
$$SoC_{i,t} \ge SoC_{\mathrm{target}}, \quad \forall i, \tag{8}$$
which must hold by each EV's departure time.
This problem is challenging due to the coupled constraints across multiple agents, the continuous decision variables, and the stochastic nature of electricity prices, EV arrival/departure times, and initial SoC levels. Traditional optimization methods struggle with scalability and real-time adaptation in such dynamic multi-device environments. To address these issues, we model the problem as a multi-agent Markov Decision Process (MDP) and solve it using an MADDPG-based approach, as detailed in the next section.
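As a concrete reading of the objective (6), the sketch below evaluates the total cost of a given charging/discharging schedule; the array names and the one-hour-slot assumption are illustrative.

```python
import numpy as np

def total_cost(prices, p_charge, p_discharge):
    """Total electricity cost of Eq. (6) over T slots and N EVs.

    prices:      shape (T,),   time-varying price P_t (GBP/kWh)
    p_charge:    shape (T, N), charging power of each EV (kW, >= 0)
    p_discharge: shape (T, N), discharging power of each EV (kW, >= 0)
    With one-hour slots, kW values equal kWh per slot.
    """
    purchase = (prices[:, None] * p_charge).sum()      # grid purchases
    revenue = (prices[:, None] * p_discharge).sum()    # V2G feedback revenue
    return float(purchase - revenue)

# Example: two slots, two EVs; cheap charging in slot 0, V2G discharge in slot 1.
prices = np.array([0.10, 0.30])
p_ch = np.array([[4.0, 6.0], [0.0, 0.0]])
p_dis = np.array([[0.0, 0.0], [2.0, 0.0]])
print(total_cost(prices, p_ch, p_dis))  # 0.10*10 - 0.30*2 = 0.40 GBP
```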

3. Charging Scheduling Algorithm Based on MADDPG

The multi-agent EV charging coordination problem involves multiple autonomous agents operating in continuous action spaces with complex inter-agent dependencies. Traditional discrete algorithms such as DQN and Multi-Agent DQN face scalability challenges due to exponential growth in joint action spaces and discrete action limitations. The integration of real-time sensor data and dynamic system conditions further requires algorithms capable of processing high-dimensional observations and adapting to rapidly changing environments.
To address these challenges, we adopt the MADDPG algorithm, which combines centralized training with decentralized execution. MADDPG enables each agent to maintain its own actor network for continuous action selection based on local sensor observations, while employing centralized critics for value estimation using global state information. This design overcomes non-stationarity and high dimensionality challenges while preserving decentralized execution capabilities essential for real-time industrial operation.

3.1. Multi-Agent MDP Formulation

To implement the MADDPG framework for sensor-integrated EV charging coordination, we formalize the problem as a multi-agent Markov Decision Process. The multi-agent EV charging scheduling problem is modeled as an MDP defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \{r_i\}_{i=1}^{N}, \gamma)$, where:
(1) State $S_t$: At time slot $t$, the joint state $S_t \in \mathcal{S}$ captures essential information of the environment and all EVs:
$$S_t = \left\{ P_t,\, SoC_{1,t}, SoC_{2,t}, \ldots, SoC_{N,t},\, L_t,\, T^{\mathrm{rem}}_{1,t}, \ldots, T^{\mathrm{rem}}_{N,t} \right\}, \tag{9}$$
where $P_t$ is the electricity price, $SoC_{i,t}$ is the State of Charge of the $i$-th EV battery, $L_t$ denotes the current grid load, and $T^{\mathrm{rem}}_{i,t}$ represents the remaining charging time for EV $i$.
(2) Action $a_t = (a_{1,t}, a_{2,t}, \ldots, a_{N,t})$: Each agent $i$ chooses a continuous action $a_{i,t} \in [P_i^{\min}, P_i^{\max}]$ representing its charging (positive) or discharging (negative) power at time $t$. The joint action $a_t$ must satisfy the aggregate charging power constraint:
$$\sum_{i=1}^{N} \max(a_{i,t}, 0) \le P_{\max}. \tag{10}$$
(3) State Transition Probability $\mathcal{P}$: The environment dynamics are governed by unknown transition probabilities $\mathcal{P}(S_{t+1} \mid S_t, a_t)$, capturing the evolution of EV battery states and grid status following joint actions. Due to system complexity and stochasticity, these dynamics are learned implicitly via interaction without explicit modeling.
(4) Reward $r_{i,t}$: Each agent receives a reward designed to minimize electricity costs, ensure battery safety, encourage timely charging completion, and maintain grid stability. The reward function is formulated as
$$r_{i,t} = -P_t \cdot a_{i,t} + R^{\mathrm{lim}}_{i,t} + R^{\mathrm{target}}_{i,t}, \tag{11}$$
where the constraint penalty $R^{\mathrm{lim}}_{i,t}$ and target completion reward $R^{\mathrm{target}}_{i,t}$ are defined as
$$R^{\mathrm{lim}}_{i,t} = \begin{cases} -\rho_1, & \text{if } SoC_{i,t} < SoC_{\min} \text{ or } SoC_{i,t} > SoC_{\max} \\ -\rho_2, & \text{if } \sum_{j=1}^{N} \max(a_{j,t}, 0) > P_{\max} \\ 0, & \text{otherwise} \end{cases} \tag{12}$$
$$R^{\mathrm{target}}_{i,t} = \begin{cases} -\rho_3, & \text{if } T^{\mathrm{rem}}_{i,t} = 0 \text{ and } SoC_{i,t} < SoC_{\mathrm{target}} \\ 0, & \text{otherwise} \end{cases} \tag{13}$$
(5) Discount factor $\gamma \in (0,1]$: Balances the importance of immediate and future rewards to encourage policies optimizing long-term performance.
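A minimal sketch of the per-agent reward (11)–(13) follows; the penalty magnitudes ρ1–ρ3 and the function signature are illustrative assumptions, since the paper does not list their numerical values.

```python
def agent_reward(price, a_i, soc_i, t_rem_i, total_charge_power,
                 soc_min=0.1, soc_max=1.0, soc_target=0.9, p_total_max=25.0,
                 rho1=1.0, rho2=1.0, rho3=2.0):
    """Reward of Eq. (11): negative energy cost plus the penalties of (12) and (13).

    a_i > 0 is charging power (a cost), a_i < 0 is discharging power (V2G revenue).
    total_charge_power is the aggregate charging term sum_j max(a_j, 0).
    """
    r = -price * a_i                                   # cost / revenue term
    if soc_i < soc_min or soc_i > soc_max:             # SoC safety violation, Eq. (12)
        r -= rho1
    if total_charge_power > p_total_max:               # aggregate power violation, Eq. (12)
        r -= rho2
    if t_rem_i == 0 and soc_i < soc_target:            # missed charging target, Eq. (13)
        r -= rho3
    return r
```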

3.2. MADDPG Algorithm Implementation

To address the continuous action space and multi-agent coordination challenges in EV charging scheduling, we implement the MADDPG algorithm, which extends the deterministic policy gradient framework to multi-agent contexts through centralized training with decentralized execution.
Each EV agent $i$ employs an actor network $\mu_{\theta_i}$ that maps its local observation $o_i$ to a continuous action $a_i = \mu_{\theta_i}(o_i)$. During training, each agent utilizes a centralized critic network $Q_{\phi_i}(s, a)$ that processes the global state $s$ and joint action $a = (a_1, a_2, \ldots, a_N)$ to address multi-agent non-stationarity, while maintaining decentralized actor execution based solely on local observations.
The critic network parameters $\phi_i$ are optimized by minimizing the temporal difference loss:
$$L(\phi_i) = \mathbb{E}_{(s, a, r_i, s')} \left[ \left( Q_{\phi_i}(s, a) - y_i \right)^2 \right], \tag{14}$$
where the target value $y_i$ is computed using target networks:
$$y_i = r_i + \gamma \, Q_{\phi_i'}\left( s', a_1', \ldots, a_N' \right), \tag{15}$$
with $a_j' = \mu_{\theta_j'}(o_j')$ representing the next action from the target actor network.
The actor network parameters $\theta_i$ are updated using the deterministic policy gradient:
$$\nabla_{\theta_i} J \approx \mathbb{E}_{s} \left[ \nabla_{\theta_i} \mu_{\theta_i}(o_i) \, \nabla_{a_i} Q_{\phi_i}(s, a_1, \ldots, a_N) \Big|_{a_i = \mu_{\theta_i}(o_i)} \right]. \tag{16}$$
Training stability is enhanced through experience replay and soft target network updates:
$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau) \theta_i', \qquad \phi_i' \leftarrow \tau \phi_i + (1 - \tau) \phi_i', \tag{17}$$
where $\tau \ll 1$ controls the update rate.
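To make the centralized-critic updates (14)–(17) concrete, the sketch below shows one possible PyTorch implementation of a single training step for agent $i$; the network and optimizer containers, batch layout, and hyperparameter defaults are assumptions for illustration rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.98, tau=0.005):
    """One MADDPG training step for agent i, following Eqs. (14)-(17).

    batch: dict with global state `s`, joint action `a`, rewards `r` (B, N),
    next state `s_next`, and per-agent observations `obs`/`obs_next`
    (lists of (B, d_o) tensors). Critics take (state, joint action) pairs.
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    n_agents = len(actors)

    # Critic update: minimize the TD loss (14) against the target value (15).
    with torch.no_grad():
        a_next = torch.cat([target_actors[j](batch["obs_next"][j])
                            for j in range(n_agents)], dim=-1)
        y = r[:, i:i + 1] + gamma * target_critics[i](s_next, a_next)
    critic_loss = F.mse_loss(critics[i](s, a), y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # Actor update: deterministic policy gradient (16); gradients flow only
    # through agent i's own action, the other agents' actions are detached.
    a_pred = [actors[j](batch["obs"][j]).detach() for j in range(n_agents)]
    a_pred[i] = actors[i](batch["obs"][i])
    actor_loss = -critics[i](s, torch.cat(a_pred, dim=-1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()

    # Soft target update (17).
    for net, target in ((actors[i], target_actors[i]), (critics[i], target_critics[i])):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```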
Algorithm 1 summarizes the training process for coordinated EV charging scheduling with real-time sensor integration. The algorithm initializes all network parameters, then iteratively collects experience through environment interaction with continuous sensor monitoring and updates network parameters using sampled minibatches. The MADDPG framework effectively handles multi-agent coordination and continuous charging power control through its combination of decentralized actors and centralized critics, as illustrated in Figure 2.
Algorithm 1 MADDPG-based Multi-Agent EV Charging Scheduling
1: Initialize actor networks $\mu_{\theta_i}$, critic networks $Q_{\phi_i}$, and corresponding target networks $\mu_{\theta_i'}$, $Q_{\phi_i'}$ for all agents $i = 1, \ldots, N$.
2: Initialize experience replay buffer $\mathcal{D}$.
3: for episode = 1 to M do
4:     Reset environment, obtain initial state $s = (o_1, \ldots, o_N)$.
5:     for time step t = 1 to T do
6:         for each agent i do
7:             Select action $a_i = \mu_{\theta_i}(o_i) + \mathcal{N}_t$, where $\mathcal{N}_t$ is exploration noise.
8:         end for
9:         Execute joint action $a = (a_1, \ldots, a_N)$, observe rewards $r = (r_1, \ldots, r_N)$ and next state $s' = (o_1', \ldots, o_N')$.
10:        Store transition $(s, a, r, s')$ into $\mathcal{D}$.
11:        if the replay buffer size is sufficient then
12:            Sample a minibatch from $\mathcal{D}$.
13:            for each agent i do
14:                Update critic network $Q_{\phi_i}$ by minimizing the loss $L(\phi_i)$.
15:                Update actor network $\mu_{\theta_i}$ using the deterministic policy gradient.
16:                Soft-update target networks: $\theta_i' \leftarrow \tau \theta_i + (1-\tau)\theta_i'$, $\phi_i' \leftarrow \tau \phi_i + (1-\tau)\phi_i'$.
17:            end for
18:        end if
19:        Update $s \leftarrow s'$.
20:    end for
21: end for
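The experience replay buffer referenced in steps 2 and 10–12 of Algorithm 1 can be as simple as the sketch below; the capacity and batch-size defaults mirror Table 2, and the class itself is an illustrative assumption.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer for joint transitions (s, a, r, s')."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, actions, rewards, next_state):
        self.buffer.append((state, actions, rewards, next_state))

    def sample(self, batch_size=128):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```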

3.3. Computational Complexity Analysis

The computational complexity of the proposed MADDPG-based charging scheduling depends on the neural network architectures and the number of agents. Consider actor networks with $L_a$ layers and $n_a$ neurons per hidden layer, and critic networks with $L_c$ layers and $n_c$ neurons per hidden layer. The input dimensions are $d_o$ for actors (local observation) and $d_s + N d_a$ for critics (global state plus joint actions), where $N$ is the number of agents and $d_a$ is the action dimension.
The per-step complexity for each actor network is $O(d_o n_a + (L_a - 2) n_a^2 + n_a d_a)$, while for each critic network the complexity is $O((d_s + N d_a) n_c + (L_c - 2) n_c^2 + n_c)$. Because each of the $N$ centralized critics processes an input whose dimension grows with $N$, the total training cost scales quadratically with the number of agents, reflecting the centralized training overhead. The experience replay buffer requires memory complexity of $O(2 d_s + N (d_o + d_a + 1))$ per sample.
During execution, only the decentralized actor networks are active, with inference complexity $O(N (d_o n_a + (L_a - 2) n_a^2 + n_a d_a))$ that scales linearly with the number of agents, enabling real-time operation. The MADDPG approach introduces computational overhead compared with single-agent methods due to the centralized critics, but remains computationally feasible for moderate numbers of EV agents while effectively balancing complexity and coordination performance.
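The sketch below evaluates the per-step operation counts stated above for illustrative network sizes; the dimensions chosen are assumptions, not the paper's actual architecture.

```python
def actor_ops(d_o, n_a, L_a, d_a):
    """Per-step cost of one actor forward pass: O(d_o*n_a + (L_a-2)*n_a^2 + n_a*d_a)."""
    return d_o * n_a + max(L_a - 2, 0) * n_a ** 2 + n_a * d_a

def critic_ops(d_s, N, d_a, n_c, L_c):
    """Per-step cost of one centralized critic with input dimension d_s + N*d_a."""
    return (d_s + N * d_a) * n_c + max(L_c - 2, 0) * n_c ** 2 + n_c

# Example: 5 agents, two hidden layers of 128 neurons, scalar actions.
print(actor_ops(d_o=4, n_a=128, L_a=4, d_a=1))             # decentralized execution cost per agent
print(5 * critic_ops(d_s=12, N=5, d_a=1, n_c=128, L_c=4))  # total centralized-training critic cost
```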

4. Experimental Results and Discussion

To validate the effectiveness of the proposed MADDPG-based charging scheduling algorithm, we construct a simulation environment modeling the charging behavior of electric vehicles and their interactions with the power grid in an industrial park. The simulation integrates factors including real-time electricity price data, battery state dynamics, and charging time management.

4.1. Setup and Training

The simulation environment utilizes hourly electricity price data sourced from the UK NordPool database [31]. The training utilizes six months of historical pricing data, while the evaluation employs completely separate monthly periods to ensure unbiased performance assessment and generalization capability. These six months of real-time electricity price data are stored in a table for dynamic retrieval during the training process to ensure experimental authenticity. The electricity price unit remains consistent with the source data, using British pounds (GBP) as the unit of measurement.
In the simulation environment, the charging and discharging behaviors of EVs are modeled with the following assumptions to simplify the problem:
  • EVs utilize lithium-ion batteries with constant charging and discharging power rates, with varying battery capacities and maximum charging power limits across different vehicles.
  • All EVs participate in charging and discharging processes within the industrial park using conventional slow-charging methods.
  • Charging and discharging decisions are influenced by dynamic electricity prices obtained through real-time pricing signals, without considering external disturbances or physical queuing effects at charging stations (This assumption is suitable for IIoT environments with sufficient charging infrastructure and centralized management, where external disturbances and queuing effects can be reasonably neglected [32,33]).
  • EVs commence charging immediately upon arrival at charging stations, with charging periods aligned to hourly intervals.
  • Battery safety is maintained by constraining SoC between 0.1 and 1.0, with continuous monitoring through simulated sensor feedback systems.
The simulation environment models five heterogeneous EVs to reflect realistic industrial park scenarios. Due to varying battery capacities across vehicles, SoC is adopted as a unified standard rather than assuming fixed capacities. The scheduling operates with one-hour time intervals, where each training episode corresponds to one complete operational day of 24 h divided into 24 time slots. EV behavioral parameters [34] are modeled using truncated normal distributions to enhance the algorithm’s generalization capability, and the algorithm does not rely on prior knowledge of stochastic variable distributions. Table 1 details the specific parameter settings for EV behavior modeling.
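A minimal sketch of how the behavioral parameters of Table 1 can be drawn from truncated normal distributions is given below; the truncation bounds are illustrative assumptions, since Table 1 specifies only the means and standard deviations.

```python
from scipy.stats import truncnorm

def sample_truncated_normal(mean, std, low, high, size=None, seed=None):
    """Draw samples from N(mean, std^2) truncated to [low, high]."""
    a, b = (low - mean) / std, (high - mean) / std     # bounds in standard-score units
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size, random_state=seed)

# Illustrative truncation bounds around the Table 1 distributions.
arrival_hour = sample_truncated_normal(17, 1, 14, 20)       # N(17, 1^2)
departure_hour = sample_truncated_normal(6, 1, 3, 9)        # N(6, 1^2)
initial_soc = sample_truncated_normal(0.3, 0.1, 0.1, 0.6)   # N(0.3, 0.1^2)
target_soc = sample_truncated_normal(0.9, 0.1, 0.6, 1.0)    # N(0.9, 0.1^2)
```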
The core hyperparameters of the MADDPG training process are carefully configured to balance learning stability and convergence speed. These include the learning rate, discount factor, exploration noise parameters, replay buffer capacity, batch size, and the target networks’ soft update coefficient. The complete set of algorithm parameters used in training is summarized in Table 2. Additionally, the system configuration parameters are in Table 3.
Figure 3 illustrates the training convergence of the proposed MADDPG algorithm across 20 independent training runs with different random seeds. The solid line represents the mean reward progression, while the shaded area indicates the 95% confidence interval. The initial training phase exhibits significant reward fluctuations due to random exploration as agents learn coordination strategies while simultaneously adapting to real-time data streams from multiple sensors. As training progresses, the reward fluctuations gradually diminish and converge toward higher values, indicating consistent learning of coordinated charging policies that effectively utilize real-time sensor feedback for decision-making across different initializations.
Figure 4 demonstrates the parameter sensitivity analysis for the proposed MADDPG algorithm. The optimal discount factor γ = 0.98 achieves the best performance with rapid convergence, while lower values result in suboptimal coordination due to insufficient long-term planning. The learning rate α = α_a = α_c = 0.001 provides the best balance between convergence speed and final performance, whereas higher rates cause training oscillations and lower rates converge prematurely and fail to reach optimal reward values.

4.2. Performance Evaluation

Figure 5 and Figure 6 demonstrate the effectiveness of the proposed MADDPG-based coordinated charging scheduling algorithm in managing multi-agent EV charging operations. Figure 5 presents the charging and discharging behavior of five EVs over a three-day period under dynamic electricity pricing. The results show that EVs successfully identify electricity price variations and adjust their charging or discharging decisions accordingly. The algorithm effectively schedules charging operations during low-price periods and discharging operations during high-price periods through V2G functionality. The charging behaviors respect individual power constraints without exceeding the maximum power limits, while the aggregate power consumption remains within the 25 kW system constraint.
Figure 6 illustrates the battery state evolution throughout the charging process. Despite having different battery capacities, all five vehicles successfully reach their target SoC levels before departure while maintaining safe operation within the constraint range. These results confirm the algorithm’s ability to achieve collective optimization while satisfying individual EV requirements.
To evaluate the scalability of the proposed algorithm, we conducted experiments with different EV fleet sizes and power constraints. Figure 7 demonstrates the algorithm’s performance across different fleet sizes. The results show that the algorithm successfully converges for both 3 EV and 10 EV scenarios. The 3 EV case achieves faster convergence due to lower system complexity, while the 10 EV scenario requires more training episodes but eventually reaches stable performance. This validates that the MADDPG framework scales effectively to accommodate varying fleet sizes while preserving coordination quality.
Figure 8 illustrates the adaptability of the proposed MADDPG algorithm under different power constraints. The training convergence curves demonstrate that the algorithm successfully converges for both 20 kW and 30 kW total power limits, with similar convergence patterns indicating robust performance across varying constraint scenarios. The charging behavior patterns show that agents effectively adapt their coordination strategies to the imposed power limits. Under the tighter 20 kW constraint, EVs exhibit more conservative charging behaviors with enhanced coordination to avoid exceeding the power limit, while the 30 kW scenario allows for more aggressive charging patterns.
To comprehensively evaluate seasonal robustness, we conduct testing across four disjoint one-month evaluation windows representing different seasons, all selected from periods outside the training dataset. This approach ensures unbiased performance assessment across varying seasonal electricity pricing patterns.
Figure 9 demonstrates the cumulative cost performance across four seasonal evaluation periods. The MADDPG algorithm maintains consistent cost optimization performance across all seasonal conditions, demonstrating similar cost accumulation patterns despite varying electricity pricing environments. This validates the robustness of our multi-agent coordination framework to diverse seasonal electricity pricing environments.
To assess the performance of the proposed MADDPG-based coordinated charging scheduling algorithm, we conducted a comprehensive comparative analysis with three baseline methods. The Greedy-based charging scheduling strategy selects actions that maximize immediate reward without considering future consequences or inter-agent coordination. The MAQL-based charging scheduling method employs multi-agent Q-learning with discretized state representation for coordinated vehicle charging decisions. The MADQN-based charging scheduling algorithm implements a multi-agent Deep Q-Network where each agent’s continuous action is discretized to a symmetric set around zero, scaled by its maximum charge/discharge power. All methods operate under identical environmental settings and constraints. The key distinction lies in action space representation: discrete methods are limited to predetermined power levels, while MADDPG utilizes continuous action spaces. All algorithms were evaluated through multiple independent runs to ensure statistical reliability. The performance evaluation was conducted over a 30-day period under dynamic pricing conditions.
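For clarity, the sketch below shows one way the MADQN baseline's symmetric discrete action set could be constructed from an EV's power limits; the number of levels per side is an illustrative assumption.

```python
import numpy as np

def discrete_action_set(p_max_charge, p_max_discharge, levels_per_side=3):
    """Symmetric discrete action set around zero, scaled by the power limits.

    Example output for 6 kW limits and 3 levels per side: [-6, -4, -2, 0, 2, 4, 6] kW.
    """
    charge_levels = np.linspace(0.0, p_max_charge, levels_per_side + 1)[1:]
    discharge_levels = -np.linspace(0.0, p_max_discharge, levels_per_side + 1)[1:][::-1]
    return np.concatenate([discharge_levels, [0.0], charge_levels])

print(discrete_action_set(6.0, 6.0))
```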
Figure 10 and Table 4 present the cumulative charging costs for four scheduling algorithms over a 30-day evaluation period using models trained on six months of historical data. The MADDPG algorithm achieves the lowest cumulative cost with a 41.12% reduction compared with the Greedy baseline, outperforming MADQN by 14.34% and MAQL by 30.91%. While both MADDPG and MADQN show periods of profit generation through V2G operations, MADDPG demonstrates better overall performance than MADQN. These results indicate that the proposed MADDPG method outperforms existing reinforcement learning approaches in this dynamic pricing scenario. For transparency and reproducibility, the detailed per-seed cumulative cost results for all algorithms over 20 independent runs are reported in Appendix A, Table A1.
To provide deeper insights into daily cost variations and algorithm reliability, Figure 11 and Table 5 present the daily cost evolution and statistical analysis throughout the evaluation period. Figure 11 illustrates the day-to-day performance characteristics, demonstrating that MADDPG achieves lower daily costs than the baseline methods in the majority of scenarios. Paired Wilcoxon signed-rank tests reveal that MADDPG achieves p-values of $3.7 \times 10^{-5}$ and $2.0 \times 10^{-6}$, together with Cohen’s d values of 1.07 and 1.31 when compared with MADQN and MAQL, respectively. These results demonstrate that the daily cost differences are statistically significant rather than due to random variation, and that the large effect sizes reflect practically meaningful improvements. The corresponding Cliff’s δ values of 0.14 and 0.27 further indicate that MADDPG’s daily costs are generally lower than those of both baseline algorithms. Table 5 provides a comprehensive statistical analysis of daily cost performance, including the mean, median, interquartile range (IQR), worst and best cases, and standard deviation for all scheduling algorithms. The results confirm that MADDPG achieves the lowest mean daily cost compared with baseline methods.
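The sketch below outlines how the reported paired statistics can be computed from two aligned series of daily costs; the function and variable names are illustrative, and the sign convention simply compares a baseline against MADDPG.

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_comparison(daily_cost_baseline, daily_cost_maddpg):
    """Wilcoxon signed-rank p-value, Cohen's d, and Cliff's delta for daily costs.

    Positive effect sizes indicate that the baseline's daily costs exceed MADDPG's.
    """
    base = np.asarray(daily_cost_baseline)
    ours = np.asarray(daily_cost_maddpg)
    diff = base - ours
    p_value = wilcoxon(base, ours).pvalue                 # paired, non-parametric test
    cohens_d = diff.mean() / diff.std(ddof=1)             # paired-sample effect size
    greater = np.sum(base[:, None] > ours[None, :])
    less = np.sum(base[:, None] < ours[None, :])
    cliffs_delta = (greater - less) / (len(base) * len(ours))
    return p_value, cohens_d, cliffs_delta
```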
It is worth noting that our algorithm incorporates battery-protective mechanisms, including SoC constraints and gradual power adjustments through continuous action spaces, which minimize potential lifecycle degradation compared with aggressive charging strategies. Given our evaluation timeframe and these protective measures, battery degradation effects are negligible within our study scope.

5. Conclusions

This paper proposed an MADDPG-based coordinated charging scheduling algorithm for multi-agent EV operations in dynamic pricing environments, leveraging smart sensor-enabled IIoT infrastructure within smart grids. The framework employed centralized training with decentralized execution to enable continuous power control, accommodating heterogeneous EV fleets and optimizing charging and discharging decisions in real time. Experimental results show a 41.12% cost reduction over the Greedy baseline, alongside improved grid stability and strong adaptability to electricity price fluctuations and varying demand patterns. Scalability analysis further demonstrated its potential applicability in real-world industrial scenarios under IIoT-enabled smart grids. Future work will extend the framework to consider stochastic external disturbances, physical queuing constraints, and hardware-in-the-loop testing with representative sensors (e.g., smart meters, vehicle charging interface sensors). Evaluation metrics will include the stability of cost savings under varying grid conditions, consistent scheduling performance across scenarios, and the ability to maintain grid voltage and load balance.

Author Contributions

Conceptualization and methodology, H.Z. (Haiyong Zeng), Y.H. and H.Z. (Hongyan Zhu); validation, Y.H. and K.Z.; resources, H.Z. (Haiyong Zeng) and Z.Y.; Data curation, Y.H.; writing—original draft, H.Z. (Haiyong Zeng) and Y.H.; writing—review and editing, H.Z. (Haiyong Zeng), Z.Y., H.Z. (Hongyan Zhu) and F.L.; visualization, Y.H. and K.Z.; supervision, F.L.; project administration, H.Z. (Haiyong Zeng); funding acquisition, H.Z. (Haiyong Zeng). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of China under grant 62301172, the Guangxi Natural Science Foundation under grant 2024GXNSFBA010246, the Guangxi Science and Technology Base and Special Talent Program under grant GuikeAD23026197, the Guangxi Young Talent Inclusive Support Program 2024, and the Guangxi Key Laboratory of Brain-inspired Computing and Intelligent Chips under grant BCIC-24-Z6.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to copyright.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To enhance transparency and reproducibility, this appendix lists the cumulative charging costs for each of the 20 independent runs across the four scheduling algorithms evaluated in this study. These per-seed results complement the aggregated statistics presented in Section 4.2 and further demonstrate the robustness of the observed performance improvements. The Greedy strategy also exhibits variation across runs due to stochastic initialization of the simulation environment.
Table A1. Cumulative charging costs (GBP) for 20 independent runs across all algorithms.
Seed       MADDPG    MADQN     MAQL      Greedy
0          27.8945   32.1456   39.8234   47.2341
5          28.9567   33.4789   41.2567   48.5678
12         30.2341   35.8901   44.7891   51.2345
23         29.1234   33.7856   42.1456   49.1789
25         28.4567   32.8945   40.5678   48.0123
42         31.0789   37.2341   46.5789   52.8901
47         29.7891   34.9567   43.4567   50.3456
52         28.1345   31.8904   39.2341   47.9891
55         30.8901   36.7891   45.9234   53.4567
66         29.4567   34.2345   42.7891   49.8234
78         28.7234   33.1567   40.8901   48.9567
85         30.5678   36.1234   44.3456   51.7891
99         29.2341   33.9567   41.6789   49.5678
101        27.6789   31.5678   38.7891   46.3456
123        31.3456   38.1234   47.8901   54.2345
157        29.8234   35.4567   43.8234   50.7891
178        28.5678   32.7891   40.1234   48.8901
186        30.1234   35.6789   44.5678   51.4567
195        29.6789   34.5678   42.3456   49.9234
225        29.3456   33.0123   41.4567   49.6789
Mean       29.4552   34.3866   42.6238   50.0182
Std. Dev.  1.0496    1.8505    2.5111    2.0544

References

  1. Jiang, F.; Yuan, X.; Hu, L.; Xie, G.; Zhang, Z.; Li, X.; Hu, J.; Wang, C.; Wang, H. A Comprehensive Review of Energy Storage Technology Development and Application for Pure Electric Vehicles. J. Energy Storage 2024, 86, 111159. [Google Scholar] [CrossRef]
  2. Husain, I.; Ozpineci, B.; Islam, M.S.; Gurpinar, E.; Su, G.J.; Yu, W.; Chowdhury, S.; Xue, L.; Rahman, D.; Sahu, R. Electric Drive Technology Trends, Challenges, and Opportunities for Future Electric Vehicles. Proc. IEEE 2021, 109, 1039–1059. [Google Scholar] [CrossRef]
  3. Das, H.S.; Rahman, M.M.; Li, S.; Tan, C.W. Electric Vehicles Standards, Charging Infrastructure, and Impact on Grid Integration: A Technological Review. Renew. Sustain. Energy Rev. 2020, 120, 109618. [Google Scholar] [CrossRef]
  4. Zhang, X.; Chan, K.W.; Li, H.; Wang, H.; Qiu, J.; Wang, G. Deep-Learning-Based Probabilistic Forecasting of Electric Vehicle Charging Load with a Novel Queuing Model. IEEE Trans. Cybern. 2020, 51, 3157–3170. [Google Scholar] [CrossRef]
  5. Liu, C.; Chai, K.K.; Zhang, X.; Lau, E.T.; Chen, Y. Adaptive Blockchain-Based Electric Vehicle Participation Scheme in Smart Grid Platform. IEEE Access 2018, 6, 25657–25665. [Google Scholar] [CrossRef]
  6. Kataray, T.; Nitesh, B.; Yarram, B.; Sinha, S.; Cuce, E.; Shaik, S.; Vigneshwaran, P.; Roy, A. Integration of Smart Grid with Renewable Energy Sources: Opportunities and Challenges–A Comprehensive Review. Sustain. Energy Technol. Assess. 2023, 58, 103363. [Google Scholar] [CrossRef]
  7. Mounir, M.; Sayed, S.G.; El-Dakroury, M.M.E. Securing the Future: Real-Time Intrusion Detection in IIoT Smart Grids Through Innovative AI Solutions. J. Cybersecur. Inf. Manag. 2025, 15, 208–244. [Google Scholar] [CrossRef]
  8. Meydani, A.; Shahinzadeh, H.; Ramezani, A.; Moazzami, M.; Nafisi, H.; Askarian-Abyaneh, H. State-of-the-Art Analysis of Blockchain-Based Industrial IoT (IIoT) for Smart Grids. In Proceedings of the 2024 9th International Conference on Technology and Energy Management (ICTEM), Tehran, Iran, 14–15 February 2024; pp. 1–12. [Google Scholar]
  9. Zeng, H.; Wang, J.; Wei, Z.; Zhu, X.; Jiang, Y.; Wang, Y.; Masouros, C. Multicluster-Coordination Industrial Internet of Things: The Era of Nonorthogonal Transmission. IEEE Veh. Technol. Mag. 2022, 17, 84–93. [Google Scholar] [CrossRef]
  10. Hou, W.; Zhu, X.; Cao, J.; Zeng, H.; Jiang, Y. Composite Robot Aided Coexistence of eMBB, URLLC and mMTC in Smart Factory. In Proceedings of the IEEE 96th Vehicular Technology Conference (VTC2022-Fall), London, UK, 26–29 September 2022; pp. 1–6. [Google Scholar]
  11. Li, Z.; Zhu, X.; Cao, J. Localization Accuracy Under Age of Information Influence for Industrial IoT. In Proceedings of the IEEE International Workshop on Radio Frequency and Antenna Technologies (iWRF&AT), Shenzhen, China, 31 May–3 June 2024; pp. 404–409. [Google Scholar]
  12. Hou, W.; Wei, Z.; Zhu, X.; Cao, J.; Jiang, Y. Toward Proximity Surveillance and Data Collection in Industrial IoT: A Multi-Stage Statistical Optimization Design. IEEE Wirel. Commun. Lett. 2024, 13, 1536–1540. [Google Scholar] [CrossRef]
  13. Aldossary, M. Enhancing Urban Electric Vehicle (EV) Fleet Management Efficiency in Smart Cities: A Predictive Hybrid Deep Learning Framework. Smart Cities 2024, 7, 3678–3704. [Google Scholar] [CrossRef]
  14. Ren, J.; Wang, H.; Yang, W.; Liu, Y.; Tsang, K.F.; Lai, L.L.; Chung, L.C. A Novel Genetic Algorithm-Based Emergent Electric Vehicle Charging Scheduling Scheme. In Proceedings of the IECON 2019–45th Annual Conference of the IEEE Industrial Electronics Society, Lisbon, Portugal, 14–17 October 2019; Volume 1, pp. 4289–4292. [Google Scholar]
  15. Tan, K.M.; Ramachandaramurthy, V.K.; Yong, J.Y.; Padmanaban, S.; Mihet-Popa, L.; Blaabjerg, F. Minimization of Load Variance in Power Grids—Investigation on Optimal Vehicle-to-Grid Scheduling. Energies 2017, 10, 1880. [Google Scholar] [CrossRef]
  16. Shi, Y.; Tuan, H.D.; Savkin, A.V.; Duong, T.Q.; Poor, H.V. Model Predictive Control for Smart Grids with Multiple Electric-Vehicle Charging Stations. IEEE Trans. Smart Grid 2018, 10, 2127–2136. [Google Scholar] [CrossRef]
  17. Costa, J.S.; Lunardi, A.; Lourenço, L.F.N.; Oliani, I.; Sguarezi Filho, A.J. Lyapunov-Based Finite Control Set Applied to an EV Charger Grid Converter Under Distorted Voltage. IEEE Trans. Transp. Electr. 2024, 11, 3549–3557. [Google Scholar] [CrossRef]
  18. Du, Y.; Li, F. Intelligent Multi-Microgrid Energy Management Based on Deep Neural Network and Model-Free Reinforcement Learning. IEEE Trans. Smart Grid 2020, 11, 1066–1076. [Google Scholar] [CrossRef]
  19. Wan, Z.; Li, H.; He, H.; Prokhorov, D. Model-Free Real-Time EV Charging Scheduling Based on Deep Reinforcement Learning. IEEE Trans. Smart Grid 2019, 10, 5246–5257. [Google Scholar] [CrossRef]
  20. Zhang, F.; Yang, Q.; An, D. CDDPG: A Deep-Reinforcement-Learning-Based Approach for Electric Vehicle Charging Control. IEEE Internet Things J. 2021, 8, 3075–3087. [Google Scholar] [CrossRef]
  21. Wang, S.; Bi, S.; Zhang, Y.A. Reinforcement Learning for Real-Time Pricing and Scheduling Control in EV Charging Stations. IEEE Trans. Ind. Inform. 2021, 17, 849–859. [Google Scholar] [CrossRef]
  22. Li, H.; Li, G.; Lie, T.T.; Li, X.; Wang, K.; Han, B.; Xu, J. Constrained Large-Scale Real-Time EV Scheduling Based on Recurrent Deep Reinforcement Learning. Int. J. Electr. Power Energy Syst. 2023, 144, 108603. [Google Scholar] [CrossRef]
  23. Wang, L.; Liu, S.; Wang, P.; Xu, L.; Hou, L.; Fei, A. QMIX-Based Multi-Agent Reinforcement Learning for Electric Vehicle-Facilitated Peak Shaving. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Kuala Lumpur, Malaysia, 4–8 December 2023; pp. 1693–1698. [Google Scholar]
  24. Zhao, Z.; Lee, C.K.M.; Yan, X.; Wang, H. Reinforcement Learning for Electric Vehicle Charging Scheduling: A Systematic Review. Transp. Res. Part E Logist. Transp. Rev. 2024, 190, 103698. [Google Scholar] [CrossRef]
  25. Zhang, Q.; Wang, Y. Correlated Information Scheduling in Industrial Internet of Things Based on Multi-Heterogeneous-Agent-Reinforcement-Learning. IEEE Trans. Netw. Sci. Eng. 2023, 11, 1065–1076. [Google Scholar] [CrossRef]
  26. Yan, L.; Chen, X.; Chen, Y.; Wen, J. A Cooperative Charging Control Strategy for Electric Vehicles Based on Multiagent Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2022, 18, 8765–8775. [Google Scholar] [CrossRef]
  27. Park, K.; Moon, I. Multi-Agent Deep Reinforcement Learning Approach for EV Charging Scheduling in a Smart Grid. Appl. Energy 2022, 328, 120111. [Google Scholar] [CrossRef]
  28. Li, H.; Han, B.; Li, G.; Wang, K.; Xu, J.; Khan, M.W. Decentralized Collaborative Optimal Scheduling for EV Charging Stations Based on Multi-Agent Reinforcement Learning. IET Gener. Transm. Distrib. 2024, 18, 1172–1183. [Google Scholar] [CrossRef]
  29. Gao, Y.; Wang, W.; Yu, N. Consensus Multi-Agent Reinforcement Learning for Volt-Var Control in Power Distribution Networks. IEEE Trans. Smart Grid 2021, 12, 3594–3604. [Google Scholar] [CrossRef]
  30. Liang, Y.; Ding, Z.; Zhao, T.; Lee, W.J. Real-Time Operation Management for Battery Swapping-Charging System via Multi-Agent Deep Reinforcement Learning. IEEE Trans. Smart Grid 2022, 14, 559–571. [Google Scholar] [CrossRef]
  31. Nordpool Group. Historical Market Data. 2024. Available online: https://www.nordpoolgroup.com/ (accessed on 22 September 2024).
  32. Viegas, M.A.A.; da Costa, C.T., Jr. Fuzzy Logic Controllers for Charging/Discharging Management of Battery Electric Vehicles in a Smart Grid. J. Control Autom. Electr. Syst. 2021, 32, 1214–1227. [Google Scholar] [CrossRef]
  33. Ren, L.; Yuan, M.; Jiao, X. Electric Vehicle Charging and Discharging Scheduling Strategy Based on Dynamic Electricity Price. Eng. Appl. Artif. Intell. 2023, 123, 106320. [Google Scholar] [CrossRef]
  34. Mhaisen, N.; Fetais, N.; Massoud, A. Real-Time Scheduling for Electric Vehicles Charging/Discharging Using Reinforcement Learning. In Proceedings of the IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), Doha, Qatar, 2–5 February 2020; pp. 1–6. [Google Scholar]
Figure 1. Multi-device charging scheduling system for IIoT smart grids.
Figure 2. MADDPG-based charging scheduling framework: decentralized actors generate continuous charging/discharging actions from local observations, while centralized critics evaluate joint state-action pairs for coordinated training.
Figure 3. Reward variation graph during MADDPG training.
Figure 4. Parameter sensitivity analysis graph for MADDPG algorithm: (a) Effect of discount factor on MADDPG performance. (b) Effect of learning rate on MADDPG performance.
Figure 5. EVs’ charging and discharging behavior diagram.
Figure 6. SoC variation graph of EVs.
Figure 7. Scalability analysis of MADDPG algorithm under different EV fleet sizes: (a,b) Training convergence curves showing reward evolution during learning process. (c,d) Charging behavior patterns demonstrating coordinated scheduling strategies.
Figure 8. Performance comparison under different power constraints: (a,b) Training convergence curves showing reward evolution with 20 kW and 30 kW total power limits. (c,d) Corresponding charging behavior patterns demonstrating how power constraints affect coordination strategies.
Figure 9. Seasonal cumulative cost performance across four evaluation periods.
Figure 10. Cumulative charging cost comparison graph for EVs.
Figure 11. Daily cumulative cost comparison across 30-day evaluation period.
Table 1. Parameter Settings for EVs’ Behavior.
Parameter                Distribution
Arrival Time (hour)      N(17, 1^2)
Departure Time (hour)    N(6, 1^2)
Initial SoC              N(0.3, 0.1^2)
Target SoC               N(0.9, 0.1^2)
Table 2. MADDPG Algorithm Parameter Settings.
Parameter                                Value
Actor Learning Rate α_a                  1 × 10^-3
Critic Learning Rate α_c                 1 × 10^-3
Discount Factor γ                        0.98
Target Network Soft Update Rate τ        0.005
Replay Buffer Size |B|                   100,000
Minimum Replay Buffer Size |B|_min       2000
Batch Size B                             128
Total Training Episodes E                2000
Table 3. System Configuration Parameters.
Parameter                      Value
Number of EVs                  5
EV Battery Capacity Range      25–40 kWh
Max Charging Power per EV      4.0–8.0 kW
Total Power Constraint         25.0 kW
Grid Energy Capacity           30 kWh
Time Slot Duration             1 h
SoC Constraints                [0.1, 1.0]
Table 4. Accumulated Electricity Purchase Cost of EVs Under Different Scheduling Algorithms.
Scheduling Algorithm    Cumulative Cost (GBP)
                        Day 5     Day 10    Day 15    Day 20    Day 25    Day 30
MADDPG                  3.6043    9.0806    13.9893   13.3423   23.0675   29.4552
MADQN                   3.9888    10.4051   15.9883   16.4228   26.8537   34.3866
MAQL                    6.3302    13.6059   20.0723   23.8202   35.4404   42.6238
Greedy                  7.0304    14.9308   22.6946   28.1631   40.8343   50.0182
Table 5. Statistics for Daily Costs Under Different Scheduling Algorithms.
Scheduling Algorithm    Daily Cost (GBP)
                        Mean     Median   IQR (Q1–Q3)    Worst Case   Best Case   Std. Dev.
MADDPG                  0.986    1.121    0.704–1.521    2.250        −1.318      0.903
MADQN                   1.149    1.272    0.950–1.558    2.560        −0.788      0.886
MAQL                    1.442    1.354    1.110–1.746    2.669        0.203       0.670
Greedy                  1.670    1.655    1.360–1.974    2.735        0.507       0.626
