1. Introduction
The rapid evolution of electrical grids toward smart grid architectures has fundamentally transformed the landscape of power systems operation and control. Smart grids integrate advanced sensing, communication, and computing technologies with traditional power infrastructure to enhance efficiency, reliability, and sustainability [1]. This modernization enables real-time monitoring, bidirectional communication, and dynamic adaptation to changing conditions, thereby facilitating the integration of renewable energy sources, demand response mechanisms, and distributed generation [2]. However, these advancements also introduce significant complexity into the optimal management of power generation resources, making the Economic Dispatch Problem (EDP) increasingly challenging yet critical for efficient grid operation, particularly under high penetration of renewable energy, where traditional optimization methods often fall short [3].
The Economic Dispatch Problem represents a fundamental optimization challenge in power systems engineering, focusing on determining the optimal power output allocation among available generating units to meet the system demand while minimizing total operating costs [4]. In the context of smart grids, the EDP must additionally account for the intermittency of renewable resources, demand-side flexibility, energy storage systems, and various operational constraints such as prohibited operating zones, valve-point effects, and ramp rate limits [5,6]. The efficient solution of the EDP is paramount, as even marginal improvements in cost efficiency can translate to substantial economic savings given the scale of modern power systems.
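For reference, the canonical single-objective EDP can be stated compactly as follows (a standard textbook form; the exact notation and the full constraint set used in this work are given in Section 2), where P_i is the output of unit i, a_i, b_i, c_i are fuel-cost coefficients, P_D is the system demand, and P_L denotes network losses:

\[
\min_{P_1,\ldots,P_N} \; \sum_{i=1}^{N} \left( a_i P_i^{2} + b_i P_i + c_i \right)
\quad \text{subject to} \quad
\sum_{i=1}^{N} P_i = P_D + P_L, \qquad
P_i^{\min} \le P_i \le P_i^{\max}.
\]

With valve-point effects, each cost term additionally carries a rectified sinusoidal ripple \( \left| e_i \sin\!\big(f_i (P_i^{\min} - P_i)\big) \right| \), which is the main source of the non-convexity discussed below.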
Traditional approaches to solving the EDP have relied on classical mathematical optimization techniques, including linear programming, quadratic programming, lambda iteration, and gradient methods [4]. While these methods provide exact solutions for convex and well-behaved problems, they face significant limitations when confronted with real-world EDP instances characterized by non-convexity, discontinuities, and multiple local optima. The incorporation of realistic constraints such as prohibited operating zones and valve-point effects renders the problem highly nonlinear and non-convex, causing classical methods to struggle with convergence or become trapped in suboptimal solutions [7].
To overcome these limitations, metaheuristic optimization algorithms have emerged as viable alternatives for tackling complex EDP instances. Techniques such as Genetic Algorithms (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), the Grey Wolf Optimizer (GWO), and Artificial Bee Colony (ABC) have demonstrated considerable success in navigating the complex search spaces of modern EDP formulations [8,9]. These population-based approaches offer the advantages of global exploration capability, constraint-handling flexibility, and reduced sensitivity to initial conditions. However, metaheuristics also exhibit notable limitations, including parameter sensitivity, convergence inconsistency, and the absence of theoretical guarantees regarding solution quality. Moreover, their performance often deteriorates when scaling to high-dimensional problems or when addressing dynamic environments where system conditions evolve rapidly [10].
The limitations of metaheuristics have motivated the exploration of reinforcement learning (RL) approaches for economic dispatch optimization. RL methods offer several distinctive advantages, including the ability to learn optimal policies through interaction with the environment, adaptation to changing conditions, and incorporation of sequential decision-making frameworks that align well with power system operations [11]. Nevertheless, classical RL methods such as Q-learning and Deep Q-Networks (DQN) face challenges related to sample efficiency, exploration–exploitation balance, and the high-dimensional continuous action spaces that are characteristic of the EDP. Recent advancements in reinforcement learning have also introduced powerful transformer-based architectures capable of modeling long-range dependencies in temporal and spatial energy systems [12]. Multi-agent reinforcement learning (MARL) is gaining traction in distributed energy dispatch, enabling coordination across decentralized units [13]. Moreover, safe RL and explainable RL frameworks are being developed to meet the reliability and transparency requirements of critical infrastructure like smart grids [14]. These emerging directions complement GRPO by addressing issues of system-wide coordination, interpretability, and risk-aware policy deployment.
Group Relative Policy Optimization (GRPO) addresses these challenges by extending the traditional Proximal Policy Optimization (PPO) framework to incorporate collective learning dynamics and relative policy improvements. Unlike conventional RL methods that update policies based solely on absolute performance metrics, GRPO leverages information sharing and relative performance assessments within groups of agents to enhance convergence stability and exploration efficiency [15]. By establishing trust regions based on the relative performance of multiple solution candidates, GRPO mitigates the risks of premature convergence and policy degradation during training. Additionally, the group-based approach enables more effective constraint handling through collaborative learning, where feasible solutions inform the development of effective constraint satisfaction strategies across the entire population of agents. Beyond improving computational efficiency, GRPO has important societal and environmental implications. By enabling scalable and adaptive scheduling, it can facilitate higher penetration of renewable energy sources, support demand–response mechanisms, and strengthen grid resilience—key levers in the transition to sustainable energy systems. The literature indicates that residential load management strategies can reduce emissions by between 1% and 20%, depending on the energy mix and regulatory environment [16]. Similarly, demand response—even modest peak-shaving—has been estimated to yield system-level cost savings in the range of billions of dollars; for instance, a mere 5% reduction in peak demand could translate into USD 35 billion in avoided costs over 20 years in the U.S.
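To make the group-relative idea concrete, the following is a minimal sketch of relative scoring within a group of candidate dispatch solutions; the function name, normalization, and numbers are illustrative assumptions rather than the exact update rule developed in Section 4.

```python
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    # Illustrative: score each candidate dispatch relative to its group,
    # so policy updates are driven by relative rather than absolute performance.
    rewards = np.asarray(group_rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards (negative operating costs in USD/h) for five candidates:
# candidates cheaper than the group average receive positive advantages.
print(group_relative_advantages([-32500.0, -32440.0, -32480.0, -32425.0, -32510.0]))
```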
In this work, we present a novel application of GRPO for solving complex Economic Dispatch Problems in smart grid environments. The main contributions of this research are threefold: (1) the development of a specialized GRPO framework tailored to the unique characteristics of the EDP, incorporating domain-specific knowledge into the policy and value network architectures; (2) an enhanced constraint-handling mechanism that preserves solution feasibility through adaptive penalty formulations and guided exploration of the feasible region; and (3) comprehensive empirical evaluation demonstrating superior performance compared to both traditional metaheuristics and conventional RL approaches across a diverse set of benchmark problems and real-world case studies.
The remainder of this paper is organized as follows: Section 2 presents a formal mathematical formulation of the Economic Dispatch Problem, including the objective function and the various operational constraints considered in this study. Section 3 provides a comprehensive review of related work spanning classical methods, metaheuristics, and reinforcement learning approaches for economic dispatch. Section 4 introduces the proposed GRPO methodology, detailing the algorithm design, network architectures, and constraint-handling mechanisms. Section 5 describes the experimental setup, including benchmark problems, comparison methods, and evaluation metrics. Section 6 presents and discusses the experimental results, highlighting the performance advantages of GRPO across different problem instances. Finally, Section 7 concludes the paper with a summary of findings, limitations, and directions for future research.
2. Related Work
Economic dispatch (ED) is a fundamental optimization problem in power systems, where the goal is to allocate generation among available units to meet a given load at minimum cost. In multi-objective formulations of ED, additional objectives such as emission minimization or loss reduction are considered alongside cost, making the problem a multi-criteria optimization challenge. The ED problem is subject to numerous constraints, including power balance, generator output limits, ramp rate limits, and sometimes more complex operational constraints such as prohibited operating zones or multi-area power transfer limits [17]. Solving ED optimally is crucial for both economic efficiency and environmental compliance in modern grids. Over the past decade, there has been significant progress in solution methods for ED, especially for non-convex and multi-objective versions. These methods span classical mathematical optimization, metaheuristic algorithms, and machine learning, including reinforcement learning approaches, each with its own merits and limitations. In this section, we review and discuss each category in turn, with an emphasis on state-of-the-art metaheuristic and learning-based techniques for multi-objective ED, and highlight the challenges associated with them to motivate our approach.
2.1. Classical Mathematical Approaches
Classical methods for ED rely on mathematical programming and optimization theory. For the basic ED with a single objective (e.g., fuel cost minimization) and convex cost curves, the problem can be formulated as a convex optimization (e.g., a quadratic program) and solved to global optimality using efficient algorithms [18]. Lambda-iteration (a gradient method) and Lagrangian relaxation have been long-standing techniques: the lambda-iteration method equalizes the incremental cost (λ) across units to satisfy the optimality and balance conditions, and it works well when generator cost functions are smooth and convex. In fact, early implementations of ED in the 1960s applied linear programming techniques to economic generation allocation [17]. Later methods like dynamic programming were used for unit commitment and dispatch problems in the 1960s–70s. Quadratic programming methods (e.g., using the Kuhn–Tucker conditions) and network flow approaches, which treat ED as a transportation problem, were also explored in the literature [18].

However, classical optimization methods encounter limitations as the ED problem becomes more realistic. One major issue is non-convexity. Practical ED often involves valve-point effect ripples in cost functions, multiple fuel options for generators, or prohibited operating zones, all of which introduce non-smooth, non-convex characteristics [18,19,20]. Traditional solvers such as LP or QP struggle with these; they may converge to a local optimum or require piecewise-linear approximations to fit into a convex model [18]. For instance, including emission cost as a second objective or constraint was handled in the 1990s by techniques like Lagrangian relaxation [21], but those methods needed convexified formulations. Classical methods also have difficulty with discrete decision variables, such as unit on/off states or integer transmission flow limits, without resorting to combinatorial search techniques that exponentially increase complexity. While methods like dynamic programming and branch-and-bound can handle discrete unit commitment, they suffer from the curse of dimensionality for large-scale systems [21].

Another limitation is in handling multiple objectives. Classical approaches typically scalarize multiple objectives (e.g., combine cost and emission into a single weighted-sum objective). This requires choosing weighting factors a priori, which can be arbitrary and may not capture the true trade-offs. There is no straightforward way for classical deterministic solvers to produce a Pareto-optimal set of solutions in one run—they would need to be run multiple times with different weights or targets. This lack of flexibility is a drawback in multi-objective ED scenarios [21]. Finally, computational scalability is a concern: although methods like lambda-iteration are extremely fast for convex ED and were historically used in real-time dispatch computers, they cannot guarantee optimality for non-convex ED [18]. Techniques like quadratic programming or nonlinear programming can solve moderately sized problems, but their performance degrades as the number of generators and constraints grows. In summary, while classical mathematical methods form the foundation and can solve simplified ED efficiently, they struggle with the complex, non-convex, and multi-objective nature of modern ED problems. These limitations motivated the development of metaheuristic methods that can more easily handle such complexities.
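To illustrate the equal-incremental-cost principle behind lambda-iteration, the sketch below solves the convex, loss-free case by bisection on λ; the coefficients, function name, and tolerances are illustrative assumptions and are not taken from the test systems studied later.

```python
import numpy as np

def lambda_iteration(a, b, p_min, p_max, demand, tol=1e-6, max_iter=200):
    """Equal-incremental-cost dispatch for convex quadratic costs
    C_i(P) = a_i P^2 + b_i P + c_i, ignoring losses (bisection on lambda)."""
    a, b, p_min, p_max = (np.asarray(x, dtype=float) for x in (a, b, p_min, p_max))
    lam_lo = (2 * a * p_min + b).min()   # lambda putting every unit at its lower bound
    lam_hi = (2 * a * p_max + b).max()   # lambda putting every unit at its upper bound
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        p = np.clip((lam - b) / (2 * a), p_min, p_max)   # dC_i/dP_i = lambda
        mismatch = p.sum() - demand
        if abs(mismatch) < tol:
            break
        if mismatch > 0:
            lam_hi = lam
        else:
            lam_lo = lam
    return p, lam

# Toy 3-unit instance (illustrative data only)
p_opt, lam_opt = lambda_iteration(a=[0.008, 0.009, 0.007], b=[7.0, 6.3, 6.8],
                                  p_min=[10, 10, 10], p_max=[85, 80, 70], demand=150.0)
```

Because total generation is monotone in λ for convex costs, the bisection converges quickly; exactly this assumption breaks down once valve-point ripples or prohibited zones are introduced.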
2.2. Metaheuristic Approaches
Metaheuristic algorithms have emerged as key methods for economic dispatch (ED), especially in non-convex and multi-objective scenarios, due to their ability to handle complex, nonlinear constraints without strict assumptions on objective functions [19,20]. Although these algorithms do not guarantee global optima, they typically yield high-quality solutions efficiently. Genetic Algorithms (GA), among the earliest used for ED, utilize evolutionary processes including crossover and mutation. GAs effectively manage valve-point effects and multi-fuel scenarios [22], with multi-objective variants like NSGA-II frequently generating Pareto fronts for economic–emission trade-offs. However, GAs may suffer from slow convergence and require careful parameter tuning. Constraint handling commonly involves penalty functions or repair operators to ensure feasibility [17]. Particle Swarm Optimization (PSO), which is inspired by flocking behavior, offers simpler implementation and faster convergence than GAs, as evidenced in the seminal work of [23]. PSO effectively addresses high-dimensional search spaces due to rapid information sharing and minimal parameter tuning. Multi-objective PSO (MOPSO) manages environmental/economic dispatch by maintaining solution diversity. PSO's simplicity enables near-real-time application, though occasional stagnation and parameter sensitivity remain concerns [24].
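As an illustration of the penalty-based constraint handling described above, the sketch below applies a plain global-best PSO with a power-balance penalty to a convex ED instance; the inertia and acceleration coefficients, penalty weight, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def penalized_fitness(p, a, b, c, demand, w=1e4):
    # Quadratic fuel cost plus a penalty proportional to the power-balance violation
    fuel = np.sum(a * p**2 + b * p + c, axis=-1)
    return fuel + w * np.abs(p.sum(axis=-1) - demand)

def pso_dispatch(a, b, c, p_min, p_max, demand, n_particles=30, iters=200):
    a, b, c, p_min, p_max = (np.asarray(x, dtype=float) for x in (a, b, c, p_min, p_max))
    dim = len(a)
    x = rng.uniform(p_min, p_max, size=(n_particles, dim))        # particle positions
    v = np.zeros_like(x)                                          # particle velocities
    pbest, pbest_f = x.copy(), penalized_fitness(x, a, b, c, demand)
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
        x = np.clip(x + v, p_min, p_max)                          # enforce unit limits
        f = penalized_fitness(x, a, b, c, demand)
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, penalized_fitness(gbest[None, :], a, b, c, demand)[0]

# Toy 3-unit instance (illustrative data only)
best_p, best_f = pso_dispatch(a=[0.008, 0.009, 0.007], b=[7.0, 6.3, 6.8], c=[200, 180, 140],
                              p_min=[10, 10, 10], p_max=[85, 80, 70], demand=150.0)
```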
Differential Evolution (DE) uses vector differences and recombination to achieve efficient continuous optimization [25]. DE often outperforms GA and PSO on complex ED problems, particularly when hybridized or adapted through specialized mechanisms, such as single-unit adjustment repairs for constraint violations [17]. Adaptive strategies enhance DE's robustness, especially in multi-objective contexts, though parameter-choice and discrete-handling limitations persist. Ant Colony Optimization (ACO), although traditionally used for combinatorial optimization, has been applied to discrete or hybrid ED problems. It generally performs less efficiently than GA, PSO, or DE for purely continuous ED but can excel in hybrid scenarios involving unit commitment or discrete decisions. Multi-objective ACO adaptations exist but require substantial computational resources and careful tuning [8]. Recent metaheuristics like the Grey Wolf Optimizer (GWO) have gained attention due to their simplicity and balanced exploration–exploitation capabilities, showing competitive results on complex ED scenarios [8]. Numerous other novel algorithms, including Whale Optimization, Bat, and Firefly, have demonstrated effectiveness in ED, especially when hybridized or made parameter-adaptive. These methods typically handle constraints via penalties, repairs, or dependent-variable encoding, achieving near-real-time performance with optimization strategies like parallelization and warm starts.
Overall, metaheuristics significantly extend the solvability of ED problems beyond classical methods, addressing multi-modal and multi-objective challenges effectively. Despite their inherent lack of optimality guarantees and sensitivity to parameterization, these methods remain central to ED research and practice, increasingly augmented by emerging reinforcement learning-based approaches.
2.3. Machine Learning and Reinforcement Learning Approaches
Machine Learning (ML) techniques have recently emerged as promising alternatives to conventional optimization methods for solving Economic Dispatch Problems, encompassing supervised learning, hybrid methods, and particularly reinforcement learning (RL). Supervised ML approaches utilize historical real data or data generated by offline optimizers (e.g., linear programming or metaheuristics) to train predictive models like Decision Trees or Neural Networks, providing fast dispatch decisions in real-time scenarios. For instance, Goni et al. [26] employed a Decision Tree trained on data from the Lagrange multiplier method, significantly reducing computation time compared to classical solvers. However, such methods depend heavily on comprehensive, stationary training datasets and require retraining upon system alterations. Additionally, hybrid strategies involving ML-guided metaheuristics have been explored; notably, Visutarrom et al. [27] used RL to adaptively tune Differential Evolution parameters, enhancing robustness and reducing manual tuning efforts.
Building upon these approaches, reinforcement learning methods have become particularly prominent, offering advantages for dynamic ED scenarios involving uncertainty and sequential decision-making. RL methods learn dispatch policies mapping system states (loads, generation statuses) directly to optimal generator actions, utilizing algorithms such as Q-learning, Deep Q-Networks (DQN), Deep Deterministic Policy Gradient (DDPG), and Proximal Policy Optimization (PPO). While classical Q-learning faces limitations due to discrete state–action spaces, DQN employs neural network approximations to effectively handle larger continuous state spaces. Sage et al. [28] highlighted DQN's superior performance in battery dispatch scenarios, demonstrating significant cost savings through optimal charge/discharge policies.
Addressing the continuous action spaces essential for ED, policy gradient methods like DDPG and PPO have shown significant promise. Chen et al. [29] successfully applied a hybrid DDPG for microgrid dispatch, achieving superior performance compared to discretized DQN. PPO has been favored for its stability during training, with studies demonstrating its efficacy in learning robust dispatch policies in renewable-integrated dynamic ED contexts.
Expanding on these methods, advanced RL techniques such as Soft Actor–Critic (SAC) and emerging Group Relative Policy Optimization (GRPO) have shown potential for overcoming inherent RL limitations like training complexity and critic network biases. While SAC’s stochastic nature has proved less effective for deterministic dispatch scenarios, GRPO’s critic-free design promises enhanced training efficiency, although its practical application in ED remains exploratory.
To successfully implement RL methods, careful consideration must be given to designing states, actions, and rewards. Incorporating temporal indicators significantly enhances policy adaptability [28]. Continuous dispatch adjustments typically favor actor–critic methods, whereas discrete action spaces are suitable for DQN. Furthermore, multi-agent RL presents a potential avenue for managing large-scale dispatch tasks, although ensuring cooperative behavior among multiple agents adds complexity. Reward functions typically integrate economic objectives with penalties for constraint violations (a minimal illustration is sketched at the end of this subsection). However, excessive emphasis on penalties can hinder cost optimization. Thus, hard-coding domain-specific constraints often supplements reward shaping, guiding RL agents primarily towards minimizing economic costs.

Despite these advancements, trained RL policies face several challenges. Although execution is rapid and suitable for real-time deployment, RL training itself is computationally intensive and time-consuming. Moreover, RL-derived policies do not guarantee global optimality and exhibit limited generalization beyond training scenarios, necessitating continuous retraining. Policy explainability and safe exploration also remain significant issues, particularly for operational grid deployments. Nevertheless, RL's adaptability and real-time responsiveness provide substantial benefits in evolving smart grid applications. Hybrid frameworks combining RL-derived dispatch policies with deterministic refinement processes could ensure feasibility and operational reliability, effectively bridging learning-based methods and traditional optimization techniques.
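Following up on the reward-shaping point above, the sketch below combines negative fuel cost with penalty terms for power-balance and generator-limit violations; the penalty weight and function signature are illustrative assumptions rather than the reward used by GRPO in this paper.

```python
import numpy as np

def dispatch_reward(p, a, b, c, demand, p_min, p_max, penalty=100.0):
    """Illustrative reward: negative fuel cost minus penalties for power-balance
    and generator-limit violations (weights are assumptions, not this paper's)."""
    p, a, b, c = (np.asarray(x, dtype=float) for x in (p, a, b, c))
    p_min, p_max = np.asarray(p_min, dtype=float), np.asarray(p_max, dtype=float)
    fuel_cost = np.sum(a * p**2 + b * p + c)
    balance_error = abs(p.sum() - demand)                                # MW mismatch
    limit_error = np.sum(np.maximum(p - p_max, 0.0) + np.maximum(p_min - p, 0.0))
    return -(fuel_cost + penalty * balance_error + penalty * limit_error)
```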
6. Experimental Results and Discussion
The proposed GRPO method is implemented and evaluated on the Economic Dispatch Problem (EDP) with various constraints. To demonstrate the effectiveness of our approach, we coded the algorithm in Python 3.13.2 and executed it on a modern computing platform. All experiments were conducted independently to ensure the reliability and consistency of the results. This section presents the experimental setup, parameters, convergence characteristics, and comparative analysis with state-of-the-art methods.
6.1. Experimental Setup
The GRPO algorithm was implemented in Python and executed on a standard computing environment. For comparative assessment, we tested our approach on well-established benchmark systems from the literature, focusing primarily on the 15-unit test system with prohibited operating zones. The implementation parameters were carefully tuned to balance the exploration and exploitation capabilities of the algorithm, and their values are summarized in Table 2.
6.2. Test Cases
This section presents the empirical evaluation of the proposed GRPO algorithm across four Economic Dispatch Problem (EDP) scenarios: 15, 30, 60, and 90 generating units. All systems include prohibited operating zones, power balance constraints, and spinning reserve requirements. The results assess convergence behavior, feasibility, solution quality, and scalability of the algorithm.
6.3. Comparative Analysis
To evaluate the effectiveness of our proposed GRPO approach, we compared its performance against several state-of-the-art methods from the literature for the 15-unit test system. The comparison includes both traditional methods and advanced metaheuristic techniques.
As shown in Table 4, the proposed GRPO algorithm outperforms most existing methods in terms of solution quality. It achieves a best cost of 32,421.67 USD/h, which is lower than that of all the compared methods except EHNN. However, it should be noted that the EHNN solution is not fully feasible, as it fails to satisfy the power balance constraint, with 0.8 MW left unallocated. Although GRPO is not faster than the MVMO and MVMOs methods, its runtime remains comparable and is faster than classical GA and PSO. This demonstrates the computational efficiency of the GRPO method in delivering high-quality solutions while remaining well-suited for real-time or near-real-time power system operations.
6.4. GRPO Performance on Larger-Scale Systems: 30, 60, 90 Units
To evaluate the scalability and robustness of the GRPO algorithm under increasing problem dimensionality, we extended our experiments to include systems with 30, 60, and 90 generating units. These configurations include a larger number of prohibited operating zones and more intricate cost landscapes, challenging both the convergence stability and constraint satisfaction capabilities of the learning agent. For each scenario, GRPO was executed for 60 iterations with a fixed set of hyperparameters, and a full suite of learning diagnostics was recorded and visualized.
6.4.1. Performance on 30-Unit System
Figure 4 presents the convergence behavior for the 30-unit case. The fitness history in Figure 4 shows a rapid decline in both best and mean fitness values during the initial 10 iterations, with the best solution stabilizing at around USD 64,600/h. The quick convergence indicates that GRPO efficiently explores the feasible search space and converges toward an optimal operating region. The population diversity shown in Figure 4 starts around 0.36 and decreases sharply during the first 10 iterations, then stabilizes near 0.05, suggesting convergence without premature collapse. This maintained diversity helps the algorithm avoid local minima while still focusing the search.
The rewards history depicted in Figure 4 shows a consistent upward trajectory over the course of training. This trend reflects the reinforcement learning agent's increasing ability to identify cost-effective and feasible solutions. The learning performance curves in the same figure show a smooth and synchronized exponential decay in both critic loss and KL divergence, validating the stability of policy updates within the PPO-like optimization backbone of GRPO.
Exploration vs. exploitation (Figure 4) reveals a smooth and well-timed transition. The two curves intersect near iteration 15, after which exploitation gradually dominates. This behavior confirms that GRPO dynamically balances exploration in early stages with policy refinement in later iterations.
Finally, constraint violations decrease consistently, as shown in Figure 4, with both the violation count and the power balance error approaching zero. This indicates that the algorithm reliably learns to respect all problem constraints as training progresses.
6.4.2. Performance on 60-Unit System
As shown in Figure 5, GRPO demonstrates similarly stable behavior in the 60-unit configuration. The fitness curves converge to a best solution around USD 193,950/h, with the mean fitness closely following, highlighting effective policy convergence. The diversity metric again declines quickly to around zero, while reward progression reflects consistent learning of cost-reducing strategies, climbing steadily from negative values to approximately 18. Critic loss and KL divergence continue their expected exponential decay, which validates policy stability and sample efficiency. Notably, the convergence dynamics remain smooth even in this more complex setting. The exploration–exploitation transition occurs earlier, around iterations 10–15, and remains balanced throughout, supporting efficient solution discovery and refinement. Constraint satisfaction is also excellent: the violation count falls to zero by iteration 40, and the power balance error stays consistently below 0.5 MW.
6.4.3. Performance on 90-Unit System
In the most complex setting, the 90-unit system, GRPO continues to exhibit reliable and scalable behavior, as illustrated in Figure 6. The fitness history shows convergence to a best cost near USD 193,800/h, while the mean fitness narrows toward the best value over the iterations, indicating consistent population-level learning. The diversity measure starts high and follows a decaying trend similar to that of the smaller systems, settling at a low, non-zero value. This again ensures the algorithm avoids premature convergence while focusing the search. Reward values show steady improvement, while critic loss and KL divergence decay synchronously and smoothly, signifying stable actor–critic updates even in this high-dimensional problem. The exploration–exploitation graph demonstrates a strong shift toward exploitation after iteration 20, maintaining a learning balance that supports long-term policy improvement. Importantly, constraint violations are driven to zero across training. Both the violation count and the power balance error drop steadily with no regressions, demonstrating that GRPO continues to enforce feasibility even in large and complex EDP configurations.
6.5. Statistical Analysis of GRPO Performance
To address the stochastic nature of GRPO and provide a rigorous statistical evaluation, we conducted 30 independent runs for each test system. Figure 7 presents the distribution of solution quality across these runs, demonstrating the algorithm's consistency and robustness.
As shown in Figure 7, GRPO exhibits remarkable consistency across all test systems. The standard deviation remains below 0.3% of the mean cost in all cases: 0.15% of the mean for the 15-unit system, 0.15% for the 30-unit system, 0.20% for the 60-unit system, and 0.25% for the 90-unit system. The worst-case solutions remain within 1% of the best-case solutions across all runs, confirming the algorithm's reliability for practical deployment. Notably, all 30 runs achieved 100% constraint satisfaction, with no violations of power balance, prohibited operating zones, or spinning reserve requirements.
6.6. Comparative Analysis of Solution Quality
Table 5 presents a comparative analysis of the proposed GRPO algorithm against several methods, namely MVMO, MVMOs, CGA, and IGAMUM, across the three considered system scales: 30, 60, and 90 generating units. The results include best and average dispatch costs, as well as the corresponding CPU times.
Across all unit sizes, GRPO consistently achieves the lowest minimum and average costs, outperforming all other methods in terms of solution quality. For instance, in the 90-unit system, GRPO attains a best cost of USD 193,936.09/h, approximately USD 322/h lower than the next-best solution provided by MVMOs. The cost improvement becomes more prominent as the system size increases, demonstrating GRPO's superior scalability.
While GRPO’s CPU time is higher than that of MVMO variants, it remains significantly more efficient than CGA and IGAMUM. Notably, GRPO’s runtime grows moderately with problem size—from 37.16 s (30 units) to 138.14 s (90 units)—which is acceptable given the substantial improvement in cost performance. These results highlight the effectiveness of GRPO’s hybrid learning strategy in balancing solution quality with computational efficiency, especially in large-scale economic dispatch scenarios.
6.7. Hyperparameter Sensitivity Analysis
The performance of GRPO depends on several hyperparameters that control the exploration–exploitation balance and the population dynamics. We conducted a comprehensive sensitivity analysis of three critical parameters, namely population size (K), elite percentage (E), and initial noise level, to identify optimal configurations and assess the algorithm's robustness to parameter variations.
Figure 8a reveals that population size significantly impacts solution quality, with costs increasing by 1.96% when the population is reduced to 10 agents. The optimal population of 50 agents effectively balances computational efficiency with exploration capability, while larger populations offer marginal improvements at increased computational cost. The elite percentage (Figure 8b) demonstrates exceptional robustness, maintaining cost variations below 0.42% across the range [0.2, 0.4]. This stability stems from GRPO's adaptive mechanisms that dynamically adjust elite influence based on population diversity.

The initial noise level analysis (Figure 8c) illustrates a critical trade-off: insufficient noise leads to premature convergence with a 1.42% cost increase, while excessive noise delays convergence without improving solution quality. The optimal value of 0.05 achieves convergence in five iterations while maintaining solution quality. The robustness summary (Figure 8d) confirms that the elite percentage is the most stable parameter with a 91.6% robustness score, followed by initial noise (71.6%) and population size (60.8%). These findings validate GRPO's practical applicability without extensive parameter tuning.
6.8. Robustness Under Unseen Operating Conditions
Power systems frequently encounter operating conditions that deviate significantly from nominal scenarios due to demand fluctuations, equipment failures, and changing operational requirements. To evaluate GRPO’s generalization capability and operational robustness, we conducted comprehensive stress tests under three categories of unseen conditions: load variations, modified reserve requirements, and generator failures.
As shown in Figure 9a, GRPO exhibits remarkable resilience to load variations, maintaining cost increases below 5.2% even at extreme load conditions (130% of nominal), compared to 8.1% for PPO and 11.8% for DDPG. This superior performance is attributed to GRPO's population-based learning, which implicitly captures diverse operating scenarios during training. The algorithm's ability to maintain near-optimal performance across a wide load range (70–130%) demonstrates its practical applicability in dynamic grid environments.

Reserve requirement modifications (Figure 9b) reveal GRPO's excellent constraint adaptability. When reserves increase to 2.5× nominal, simulating stringent reliability requirements, GRPO experiences only an 8.6% cost increase while maintaining full feasibility. In contrast, PPO and DDPG show 13.5% and 19.8% increases, respectively, often with constraint violations. This adaptability stems from GRPO's elite-guided updates, which preserve feasible solution patterns while exploring new operating regions.

Generator failure scenarios (Figure 9c) represent the most challenging stress tests. GRPO demonstrates progressive but controlled degradation: 2.8% for a single small-unit failure, 4.6% for a large-unit failure, and 10.2% for a simultaneous three-unit failure. These values are consistently 40–50% lower than PPO and 60–70% lower than DDPG, highlighting GRPO's superior crisis management capabilities. The overall robustness summary (Figure 9d) quantifies performance across all categories, with GRPO achieving scores of 92.1% for load robustness, 89.4% for reserve adaptability, and 85.3% for failure resilience, yielding a composite robustness score of 88.9%—significantly exceeding PPO (72.8%) and DDPG (56.2%).
6.9. Comparison with Reinforcement Learning Methods
To comprehensively evaluate the effectiveness of GRPO, we extended our comparative analysis to include state-of-the-art reinforcement learning algorithms, specifically Proximal Policy Optimization (PPO) [36] and Deep Deterministic Policy Gradient (DDPG) [37]. These methods represent the current benchmarks in policy gradient and actor–critic approaches for continuous control problems. Figure 10 illustrates the comparative performance of GRPO, PPO, and DDPG on the 15-unit test system. The results demonstrate GRPO's superior convergence characteristics in both learning efficiency and solution quality. As shown in Figure 10a, GRPO exhibits the fastest reward improvement, reaching near-optimal values within five iterations, while PPO requires approximately eight iterations and DDPG shows slower, more volatile convergence with characteristic instability around iterations 8–23. The best-cost convergence curves in Figure 10b reveal that GRPO achieves the lowest operating cost of USD 32,421.67/h, outperforming PPO (USD 32,480/h) and DDPG (USD 32,650/h) by 0.18% and 0.70%, respectively.
The scalability assessment across larger systems (30, 60, and 90 units) further validates GRPO's advantages, as presented in Figure 4, Figure 5 and Figure 6 and summarized in Table 6. GRPO consistently maintains faster convergence and achieves lower operating costs across all system scales. For the 30-unit system, GRPO converges to USD 64,558.09/h, compared to USD 64,575/h for PPO and USD 64,605/h for DDPG. The performance gap widens in larger systems: in the 90-unit configuration, GRPO achieves USD 193,936.09/h, while PPO and DDPG reach USD 193,960/h and USD 194,100/h, respectively.
The superior performance of GRPO can be attributed to its group-based learning mechanism and relative performance assessment, which provide more robust exploration and exploitation balance compared to single-policy methods. While PPO offers stable learning through its clipped objective function, it lacks the population diversity that enables GRPO to escape local optima effectively. DDPG, despite its efficiency in continuous action spaces, exhibits characteristic instability due to overestimation bias in the critic network and sensitivity to hyperparameters. GRPO’s elite-guided updates and adaptive noise scheduling further contribute to its consistent outperformance across varying problem scales.
7. Discussion
The empirical results across four increasingly complex Economic Dispatch Problem (EDP) scenarios—15, 30, 60, and 90 units—demonstrate the versatility, scalability, and robustness of the proposed Group Relative Policy Optimization (GRPO) algorithm. This section synthesizes those findings and positions GRPO in the broader context of state-of-the-art metaheuristics and reinforcement learning methods for power system optimization.
7.1. Convergence Behavior and Learning Stability
One of the most remarkable strengths of GRPO is its rapid convergence across all tested systems. In the 15-unit system, the best fitness stabilized within the first five iterations, and similar trends were observed for the 30- and 60-unit systems, where high-quality solutions emerged in fewer than 20 iterations. Even in the 90-unit system, convergence remained stable, with the best fitness flattening after 40 iterations. While GRPO exhibits slightly higher CPU time than some single-solution metaheuristics like MVMO, this cost reflects its population-based architecture and strong emphasis on feasibility and robustness. Unlike many metaheuristics that require extensive tuning or risk constraint violations, GRPO consistently produces high-quality, fully feasible solutions. Moreover, its evaluation step is inherently parallelizable, which makes the method well-suited for fast deployment on modern multi-core or GPU-enabled systems.
The consistent exponential decay of both critic loss and KL divergence across all system sizes affirms the stability of GRPO’s policy updates. Unlike many RL-based methods that suffer from instability, GRPO maintains well-regulated policy improvement through trust region control and elite-based population updates. These mechanisms prevent policy collapse and ensure learning continuity across episodes.
The adaptive exploration-to-exploitation trade-off is a key architectural innovation in GRPO. Initial iterations are dominated by exploration, facilitating wide sampling of the search space, while later iterations shift toward exploitation of high-reward regions. The dynamic cost adjustment strategy ensures that this transition occurs organically, typically between iterations 10 and 20, depending on system size. This stands in stark contrast to traditional metaheuristics like Genetic Algorithms or Particle Swarm Optimization, where the exploration–exploitation balance is static or hand-tuned.
By tightly integrating this dynamic mechanism, GRPO avoids early convergence while also accelerating the refinement of feasible solutions, which is critical in non-convex and multi-modal landscapes such as EDPs with prohibited operating zones and ramp constraints.
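As an illustration only, one simple way to realize such a schedule is an exponentially decaying exploration noise, seeded here with the 0.05 initial level reported in Section 6.7; the decay rate and floor are assumptions, not the exact mechanism used by GRPO.

```python
def exploration_noise(iteration, sigma0=0.05, decay=0.9, sigma_min=1e-3):
    """Exponentially decaying exploration noise: wide sampling early in training,
    policy refinement later (illustrative schedule, not the one used by GRPO)."""
    return max(sigma0 * decay**iteration, sigma_min)

# Early iterations explore broadly, later iterations exploit:
# exploration_noise(0) -> 0.05, exploration_noise(20) -> about 0.006
```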
GRPO demonstrates clear advantages over both classical and modern approaches to economic dispatch. Compared to traditional optimization techniques such as lambda-iteration and dynamic programming, and widely adopted metaheuristics like Genetic Algorithms, Evolutionary Programming, and Particle Swarm Optimization, GRPO delivers superior performance in terms of solution quality, constraint satisfaction, and computational efficiency. As evidenced in Table 3, GRPO attained the lowest dispatch cost on the 15-unit system while exhibiting a significantly reduced runtime—highlighting its practical suitability for real-time deployment.
Crucially, GRPO eliminates reliance on problem-specific operators or heuristic parameter tuning, which often constrain the generalizability of metaheuristic methods. Furthermore, unlike many deep reinforcement learning models that suffer from sparse rewards, sample inefficiency, or poor constraint handling, GRPO integrates population-based search with policy gradient learning to achieve both global exploration and stable convergence. This hybrid design positions GRPO as a next-generation optimization framework for complex power system applications.
Our comparative analysis with contemporary reinforcement learning methods reveals that GRPO’s group-based approach provides tangible advantages over single-policy algorithms. The 0.18–0.70% cost improvement over PPO and DDPG may appear modest, but translates to significant economic savings when scaled to real-world power systems operating continuously. Moreover, GRPO’s superior convergence stability and constraint satisfaction rates make it more suitable for safety-critical applications where reliability is paramount. The consistent performance across different system scales without hyperparameter retuning further underscores GRPO’s practical applicability in dynamic grid environments.
7.2. Constraint Handling and Feasibility
Constraint satisfaction is often a major challenge in solving practical EDPs. Many algorithms either apply penalty-based correction mechanisms or heuristic repair procedures, which can compromise solution quality. GRPO, on the other hand, enforces constraints implicitly during learning by integrating feasibility checks into the reward structure and maintaining elite memory pools of valid solutions. As a result, the number of violations—such as prohibited zone breaches or power imbalance—drops to near zero in all test cases, including the 90-unit system.
Moreover, the power balance error across all experiments was maintained below 0.5 MW, and was often far smaller in the smaller systems. This level of precision is vital in real-time power system operation, where marginal imbalances can cascade into significant operational risks.
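For illustration, a feasibility check of the kind folded into the reward and the elite memory could look like the following sketch; the zone data, tolerance, and function name are hypothetical.

```python
import numpy as np

def feasibility_report(p, zones, demand, tol=0.5):
    """Count prohibited-operating-zone breaches and report the power-balance
    error in MW. Zone bounds and the 0.5 MW tolerance are illustrative."""
    p = np.asarray(p, dtype=float)
    zone_violations = sum(
        1 for i, unit_zones in enumerate(zones)
        for lo, hi in unit_zones if lo < p[i] < hi
    )
    balance_error = abs(p.sum() - demand)
    return zone_violations, balance_error, (zone_violations == 0 and balance_error <= tol)

# Example: unit 0 has a hypothetical prohibited zone between 185 and 225 MW
violations, error, feasible = feasibility_report([210.0, 150.0], [[(185, 225)], []], 361.0)
```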
7.3. Scalability and Generalizability
Perhaps the most compelling outcome of this study is GRPO’s strong scalability. Even without any hyperparameter retuning, GRPO maintained consistent performance from 15 to 90 units. Unlike many machine learning approaches that are highly sensitive to scale or require problem-specific adjustments, GRPO demonstrated robustness to increasing problem dimensionality, decision space complexity, and constraint density. The modularity of GRPO’s architecture also means it can be extended to incorporate additional operational constraints such as ramp rates, emission limitations, spinning reserve margins, or multi-area dispatch scenarios. Its compatibility with parallel computation—thanks to its population-based design—also positions it well for deployment in high-performance or cloud environments.
Although we do not use real-world operational datasets in this study, we evaluated GRPO under a range of realistic and challenging conditions designed to approximate real-world complexities. Specifically, we introduced dynamic demand profiles, sudden load surges, generator outages, and forecast uncertainty to simulate the variability and unpredictability found in modern power systems. These pressure tests show that GRPO maintains full constraint satisfaction and exhibits relatively lower performance degradation when compared to PPO and DDPG. This indicates the algorithm’s strong potential for real-world deployment and generalizability under conditions of incomplete or noisy information.
While GRPO offers enhanced convergence stability and robust exploration, these advantages come with certain trade-offs. Maintaining a population of policies requires more memory compared to single-policy RL methods and increases the computational overhead per iteration. However, since candidate evaluations are independent, GRPO is well-suited for parallel execution on modern multi-core or distributed systems. As the number of generators increases (i.e., the action space expands), we observed that the per-iteration compute time scales linearly, but overall convergence remains efficient due to early stage elite guidance and adaptive exploration decay. Additionally, increasing the population size beyond a certain threshold (e.g., above 60) showed marginal improvements in solution quality, suggesting diminishing returns and supporting the use of moderate population sizes (e.g., 40–50) for practical deployments.
7.4. Practical Implications and Future Work
The results of this study have significant implications for smart grid operation and the automation of energy dispatch systems. GRPO offers a viable path toward integrating learning-based optimization in power control centers, capable of responding to dynamic grid conditions, operational constraints, and market-based objectives. Its convergence speed and constraint compliance make it suitable for deployment in real-time or near-real-time scenarios, particularly in systems with high renewable penetration, uncertainty, or reconfiguration needs. Furthermore, GRPO could serve as the optimization engine in hybrid digital twins of power networks—continuously learning from data streams and adapting dispatch policies to evolving operational contexts. In addition to technical performance, the practical relevance of GRPO lies in its potential deployment in microgrids and developing country contexts, where efficient and flexible scheduling is essential. At scale, widespread adoption could yield significant financial savings and measurable emission reductions, further underscoring the method’s contribution to sustainable energy management.
Future work may explore GRPO’s extension to multi-objective dispatch, integration with uncertainty models (e.g., wind and solar forecasts), and deployment within cyber-physical control systems for autonomous grid operation.
8. Conclusions
This paper proposed a novel optimization framework, Group Relative Policy Optimization (GRPO), to solve the Economic Dispatch Problem (EDP) under realistic constraints, including prohibited operating zones, ramp rate limits, and spinning reserve requirements. GRPO integrates reinforcement learning with population-based search, elite memory, and adaptive exploration mechanisms, offering a stable and scalable alternative to classical metaheuristics and RL approaches. Through extensive experiments on 15-, 30-, 60-, and 90-unit systems, the GRPO algorithm consistently produced high-quality solutions with full constraint satisfaction and competitive runtime. The proposed method demonstrated strong scalability, generalizability, and robustness, even without hyperparameter tuning across test cases. Its convergence speed and feasibility rates make it a promising candidate for near-real-time scheduling in smart grids.
Future work will focus on extending GRPO’s applicability to real-world and dynamic environments, particularly by incorporating renewable energy variability and forecast uncertainty. Enhancements such as hybridizing GRPO with faster local search methods or metaheuristics will be explored to improve runtime for real-time applications. The framework can also be adapted to handle multi-objective formulations involving emissions or transmission losses. Moreover, investigating its deployment in distributed or multi-agent grid settings, and integrating safety guarantees or explainability features, will support its adoption in practical smart grid systems. Overall, GRPO provides a compelling optimization backbone for next-generation energy management systems that require adaptability, scalability, and reliability.