Article
Peer-Review Record

A Reinforcement Learning Approach Based on Group Relative Policy Optimization for Economic Dispatch in Smart Grids

Electricity 2025, 6(3), 49; https://doi.org/10.3390/electricity6030049
by Adil Rizki 1, Achraf Touil 1, Abdelwahed Echchatbi 1 and Rachid Oucheikh 2,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 17 July 2025 / Revised: 24 August 2025 / Accepted: 28 August 2025 / Published: 1 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper describes a novel algorithm based on the Group Relative Policy Optimization (GRPO) approach for solving the Economic Dispatch Problem (EDP) in smart grids. The paper is well written and provides an effective approach to the ED optimization problem. However, the following points could further improve the research methodology and results.

1) The experiments to verify the proposed approach are conducted on benchmark systems (15, 30, 60, and 90 units) but lack validation on real-world power grid data or dynamic environments (e.g., renewable energy fluctuations, demand variability). It would be impactful to test GRPO on real-world datasets or dynamic simulations (e.g., incorporating wind/solar forecast errors) to assess its robustness in practical scenarios.

2) The authors claim that the GRPO algorithm is computationally efficient but, on the contrary, report higher CPU times (e.g., 138.14 s for 90 units) compared to MVMO variants. For real-time applications, even sub-second delays can be critical. Therefore, it is necessary to optimize the implementation (e.g., parallelization, GPU acceleration) or hybridize GRPO with faster metaheuristics for time-sensitive deployments.

3) It is not clear how GRPO performs under unseen conditions (e.g., sudden load changes, generator failures, or new constraint types). RL methods often struggle with out-of-distribution scenarios. It is advised to include stress tests with adversarial scenarios or transfer learning experiments to evaluate generalization.

4) As a solution to a constrained optimization problem, GRPO enforces constraints (e.g., prohibited zones, power balance), and relies on penalty functions and repair mechanisms, which may not guarantee feasibility in all cases. Thus, the authors should integrate hard-constraint satisfaction techniques (e.g., Lagrangian multipliers or feasible-space projection) to ensure 100% feasibility.

5) The paper compares GRPO to older methods (e.g., GA, PSO) while missing other RL approaches (e.g., SAC, TD3) or hybrid ML-optimization methods. Benchmarking against recent RL/ML techniques (e.g., Transformer-based optimizers or physics-informed RL) to validate the superiority of the proposed GRPO algorithm would boost the impact.

6) While looking at the results, it seems that the performance of the GRPO algorithm depends on hyperparameters (e.g., population size, elite percentage), but no sensitivity analysis is provided. This shortcoming can be addressed by conducting ablation studies to identify critical hyperparameters and their impact on performance.

7) The largest test case (90 units) is still modest compared to real-world grids with thousands of units. Is it possible to test GRPO on larger systems (e.g., 500+ units) or distributed implementations to verify scalability?

8) Several typos in the paper require correction:

* Line 513 SimplifiedPPOOptimizer
* Line 39 'to solving' should be 'to solve' ... and so on.

9) All Tables and Figures must be referenced in the text. Add a brief conclusion as Section 8 summarizing your work and projecting future extensions. This can be done by removing the last section of Section 7 and recasting it as Section 8.

These suggestions could further improve the scientific quality of the manuscript. 

Author Response

Review 1

This paper describes a novel algorithm based on the Group Relative Policy Optimization (GRPO) approach for solving the Economic Dispatch Problem (EDP) in smart grids. The paper is well written and provides an effective approach to the ED optimization problem. However, the following points could further improve the research methodology and results.

Response:

We sincerely thank the reviewer for dedicating their time and effort to provide valuable feedback, from which we have greatly benefited to improve the quality of our paper.

Remark:

1) The experiments to verify the proposed approach are conducted on benchmark systems (15, 30, 60, and 90 units) but lack validation on real-world power grid data or dynamic environments (e.g., renewable energy fluctuations, demand variability). It would be impactful to test GRPO on real-world datasets or dynamic simulations (e.g., incorporating wind/solar forecast errors) to assess its robustness in practical scenarios.

Response:

We thank the reviewer for the suggestion. We used standard benchmark systems (15, 30, 60, and 90 units) to ensure clear comparison with existing work. To address the concern, we have now added stress tests and dynamic scenario experiments that reflect real-world conditions in Section 6.6, including time-varying loads, failures, and reserve requirements (which may simulate renewable integration issues). We also compared GRPO to PPO and DDPG under these realistic conditions, where GRPO consistently maintained feasibility, delivered competitive cost performance, and demonstrated greater robustness than both algorithms.

Remark:

2) The authors claim that the GRPO algorithm is computationally efficient but, on the contrary, report higher CPU times (e.g., 138.14 s for 90 units) compared to MVMO variants. For real-time applications, even sub-second delays can be critical. Therefore, it is necessary to optimize the implementation (e.g., parallelization, GPU acceleration) or hybridize GRPO with faster metaheuristics for time-sensitive deployments.

Response:

We appreciate this important observation. While GRPO exhibits higher raw CPU time than some MVMO variants in the 90-unit case, it consistently achieves fully feasible solutions with lower cost even without hyperparameter tuning, which supports its efficiency from a solution-quality and convergence perspective. We agree that computational speed is critical under strict real-time constraints, but this is not an issue thanks to GRPO's population-based structure, which is highly parallelizable. Our aim is to show that the CPU time is reasonable, better than that of many metaheuristics, and easily reducible through parallelization (in line with the No Free Lunch Theorem). We have added this point in Section 7.1 as part of the discussion of convergence behavior.
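To make the parallelization argument concrete, the sketch below shows one way the per-candidate cost evaluations could be distributed across worker processes; the dispatch_cost function and its coefficients are illustrative placeholders, not the implementation used in the paper.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def dispatch_cost(p):
    # Hypothetical fuel-cost evaluation for one candidate dispatch vector p:
    # quadratic cost plus a valve-point term; all coefficients are illustrative.
    a, b, c, e, f = 0.002, 10.0, 100.0, 150.0, 0.063
    p_min = 10.0
    return float(np.sum(a * p**2 + b * p + c + np.abs(e * np.sin(f * (p_min - p)))))


def evaluate_population(population, workers=8):
    # Candidates are independent, so evaluation parallelizes trivially and
    # wall-clock time scales roughly with population_size / workers.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dispatch_cost, population))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    population = [rng.uniform(10, 100, size=90) for _ in range(64)]  # 90-unit candidates
    print(min(evaluate_population(population)))
```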

Remark:

3) It is not clear how GRPO performs under unseen conditions (e.g., sudden load changes, generator failures, or new constraint types). RL methods often struggle with out-of-distribution scenarios. It is advised to include stress tests with adversarial scenarios or transfer learning experiments to evaluate generalization.

Response:

We thank the reviewer for raising this point. In the revised version, we have included a new set of stress tests under out-of-distribution (OOD) scenarios, such as sudden load surges, reserve requirements, and generator outages (see Section 6.7, highlighted in red). GRPO consistently recovered feasible solutions in these settings and outperformed PPO and DDPG in both robustness and constraint handling. We believe that GRPO has acceptable generalization ability and is effective in handling practical, dynamic conditions beyond the training distribution. This remark is somewhat related to the first remark, where the reviewer asked for real-world cases.

Remark:

4) As a solution to a constrained optimization problem, GRPO enforces constraints (e.g., prohibited zones, power balance), and relies on penalty functions and repair mechanisms, which may not guarantee feasibility in all cases. Thus, the authors should integrate hard-constraint satisfaction techniques (e.g., Lagrangian multipliers or feasible-space projection) to ensure 100% feasibility.

Response:

We appreciate the reviewer’s comment. While it is true that penalty-based methods alone may not guarantee feasibility, GRPO combines them with multi-stage repair mechanisms specifically designed for the Economic Dispatch Problem. As reported in Section 5.3 and Section 6.2.1, GRPO consistently achieves 100% feasibility across all test cases after a few iterations, without the need for additional hard-constraint methods such as Lagrangian multipliers or projection techniques. We have also added a quantitative analysis of pre-repair violation rates, showing that constraint violations diminish rapidly during training and that all final solutions are feasible after repair.
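For illustration only, a minimal single-pass repair of the kind described (snapping out of prohibited zones, then rescaling toward the demand) could look like the sketch below; the actual multi-stage mechanism in the paper is more elaborate, and all limits, demand values, and zones here are hypothetical.

```python
import numpy as np


def repair(p, p_min, p_max, demand, zones, tol=1e-3, max_iter=50):
    # Snap units out of prohibited zones, then iteratively rescale toward the
    # demand while respecting generator limits. 'zones' maps unit index ->
    # list of (lo, hi) prohibited intervals. A full repair would also re-check
    # the zones after each rescaling pass.
    p = np.clip(p, p_min, p_max)
    for i, intervals in zones.items():
        for lo, hi in intervals:
            if lo < p[i] < hi:                                   # inside a prohibited zone
                p[i] = lo if (p[i] - lo) <= (hi - p[i]) else hi  # snap to nearest edge
    for _ in range(max_iter):
        imbalance = demand - p.sum()
        if abs(imbalance) <= tol:                                # power balance satisfied
            break
        room = (p_max - p) if imbalance > 0 else (p - p_min)
        if room.sum() <= 0:                                      # no headroom/footroom left
            break
        p = np.clip(p + imbalance * room / room.sum(), p_min, p_max)
    return p


# Toy 3-unit example with hypothetical limits, demand, and one prohibited zone.
p = repair(np.array([40.0, 60.0, 80.0]),
           p_min=np.array([10.0, 20.0, 30.0]),
           p_max=np.array([100.0, 100.0, 120.0]),
           demand=250.0,
           zones={1: [(55.0, 65.0)]})
print(p, p.sum())
```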

Remark:

5) The paper compares GRPO to older methods (e.g., GA, PSO) while missing other RL approaches (e.g., SAC, TD3) or hybrid ML-optimization methods. Benchmarking against recent RL/ML techniques (e.g., Transformer-based optimizers or physics-informed RL) to validate the superiority of the proposed GRPO algorithm would boost the impact.

Response:

We appreciate the reviewer’s suggestion to include more recent reinforcement learning baselines. In response, we have expanded our comparative study to include Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG), two widely used and recent RL methods. We chose these two algorithms based on the comments of all reviewers, and both were implemented with the same constraint-handling framework used in GRPO to ensure a fair comparison. The results, presented in Section 6.8, show that GRPO outperforms PPO by a small but consistent margin and outperforms DDPG by a large margin, as DDPG exhibited higher variability, slower convergence, and occasional constraint violations.

While PPO remains competitive in terms of computational efficiency and single-policy simplicity, GRPO's population-based learning offers enhanced robustness and stability, especially in constrained, non-convex scenarios like the Economic Dispatch Problem. We acknowledge that GRPO incurs slightly higher computational overhead per iteration due to maintaining a population of candidate solutions. However, this cost is mitigated by the algorithm’s fast convergence and high feasibility rate, and it can be further reduced through parallel or distributed evaluation, which GRPO naturally supports.

Remark:

6) While looking at the results, it seems that the performance of the GRPO algorithm depends on hyperparameters (e.g., population size, elite percentage), but no sensitivity analysis is provided. This shortcoming can be addressed by conducting ablation studies to identify critical hyperparameters and their impact on performance.

Response:

Yes, indeed: all metaheuristics and machine learning techniques are sensitive to hyperparameter values. We have added a dedicated sensitivity analysis in Section 6.6 to examine the effect of key hyperparameters (namely population size, elite percentage, and initial exploration noise) on solution quality and convergence. The results confirm that while GRPO is sensitive to extreme parameter values, it maintains stable and high performance within a reasonable range. These findings help guide parameter selection and show that the algorithm’s performance does not critically depend on precise tuning.

Remark:

7) The largest test case (90 units) is still modest compared to real-world grids with thousands of units. Is it possible to test GRPO on larger systems (e.g., 500+ units) or distributed implementations to verify scalability?

Response:

We appreciate the reviewer’s point about testing GRPO on larger-scale systems. In the current study, we followed established benchmark configurations (15, 30, 60, and 90 units) to ensure comparability with prior work and to validate the method on standard test cases. To analyze scalability, we structured the experiments incrementally and observed that GRPO's computational time grows in a near-linear manner with respect to the number of units, while it consistently produces feasible solutions of good quality. This confirms the algorithm’s potential scalability to larger or distributed systems, which can be explored in separate future work.

Remark:

8) Several typos in the paper require correction:

* Line 513 SimplifiedPPOOptimizer

* Line 39 'to solving' should be 'to solve' ... and so on.

Response:

We have proofread the paper and tried to address all typographical errors.

Remark:

9) All Tables and Figures must be referenced in the text. Add a brief conclusion as Section 8 summarizing your work and projecting future extensions. This can be done by removing the last section of Section 7 and recasting it as Section 8.

Response:

We double-checked the referencing of Tables and Figures, and they all appear to be mentioned in the text. Thank you for your recommendation about adding a Conclusion section. We have added a conclusion summarizing the main contributions and findings of our work and projecting potential future extensions.

Final Remark:

These suggestions could further improve the scientific quality of the manuscript. 

Answer:

We sincerely thank you for your time, effort, and valuable suggestions, which have helped us further improve the scientific quality of our manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes an interesting Group Relative Policy Optimization (GRPO) for solving Economic Dispatch Problems in smart grids. My comments are as follows.

1. A reinforcement learning method is used in this paper. I want to ask the authors: how do you justify the RL categorization when your implementation exhibits no characteristic RL components such as policy gradients, value function approximation, or temporal difference learning? You can explain this in your paper.

2. In this paper, the authors claim "full constraint satisfaction," but the repair-based approaches for prohibited operating zones and power balance constraints need systematic evaluation. How frequently do violations occur before repair?

3. The authors can provide some comparisons with recent Deep Q-Networks, Actor-Critic methods, or other modern RL approaches.

4. How does performance degrade under noisy or incomplete system information?

5. The discussion of practical limitations and real-world applicability can be improved and enriched in this paper.

6. An interesting Ref. with DOI 10.1016/j.rser.2025.115776 is recommended to be integrated into Section 1, where the authors motivate the need for advanced optimization methods in smart grids with renewable energy integration. By citing this Ref., the proposed game-theoretic evolution framework may directly support the authors' argument that traditional optimization methods inadequately handle the complexity of modern power systems with high renewable penetration.

Author Response

 

Review 2

This paper proposes an interesting Group Relative Policy Optimization (GRPO) for solving Economic Dispatch Problems in smart grids. My comments are as follows.

Response:

We sincerely thank the reviewer for dedicating their time and effort to provide valuable feedback, from which we have greatly benefited to improve the quality of our paper.

Remark:

  1. A reinforcement learning method is used in this paper. I want to ask the authors: how do you justify the RL categorization when your implementation exhibits no characteristic RL components such as policy gradients, value function approximation, or temporal difference learning? You can explain this in your paper.

Response:

We thank the reviewer for pointing out the need to clarify the RL classification of our proposed approach. While GRPO does not use traditional RL components such as value function approximation or temporal difference learning, it is built upon and extends Proximal Policy Optimization (PPO), a well-established policy gradient method. In our revised manuscript, we have added the formulation and components of GRPO and explained that GRPO retains the essential characteristics of reinforcement learning: it learns policies through interaction with the environment, uses cumulative rewards to guide optimization, and applies trust-region constraints for stability. Instead of a critic network, GRPO leverages relative performance within a population, which is consistent with recent developments in critic-free evolutionary reinforcement learning methods.
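As a rough illustration of the critic-free, group-relative idea described above, the sketch below standardizes each candidate's reward against its group's statistics and applies a PPO-style clipped surrogate term; this is a toy example consistent with the general GRPO formulation, not the paper's exact update rule, and the reward values are hypothetical.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    # Critic-free advantage estimate: score each candidate against the mean
    # and spread of its own group instead of a learned value function.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def clipped_surrogate(ratio, advantage, clip=0.2):
    # PPO-style trust-region term for one sample: clipping the policy ratio
    # keeps the update close to the previous policy.
    return min(ratio * advantage, float(np.clip(ratio, 1 - clip, 1 + clip)) * advantage)


rewards = [-5120.0, -5090.0, -5150.0, -5075.0]   # e.g., negative dispatch costs
adv = group_relative_advantages(rewards)
print(adv)                               # the cheapest dispatch gets the largest advantage
print(clipped_surrogate(1.35, adv[3]))   # a large ratio is clipped to 1.2 * advantage
```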

Remark:

  2. In this paper, the authors claim "full constraint satisfaction," but the repair-based approaches for prohibited operating zones and power balance constraints need systematic evaluation. How frequently do violations occur before repair?

Response:

We appreciate the reviewer’s attention to the constraint handling strategy. To address this concern, we have added a paragraph in Section 6.2.1 analyzing the pre-repair violation statistics on the 15-unit system. Specifically, in the first iteration, around 62% of generated candidates violate at least one constraint. After repair, all solutions become feasible. As training progresses, the violation rate decreases sharply to about 35% by iteration 3 and under 10% by iteration 5, which is also the convergence point. This demonstrates the effectiveness of our repair mechanisms not only in enforcing feasibility but also in steering the population toward the feasible space very early in the learning process.

Remark:

  3. The authors can provide some comparisons with recent Deep Q-Networks, Actor-Critic methods, or other modern RL approaches.

Response:

We appreciate the reviewer’s suggestion to include more recent reinforcement learning baselines. In response, we have expanded our comparative study to include Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG), two widely used and recent RL methods. We chose these two algorithms based on the comments of all reviewers, and both were implemented with the same constraint-handling framework used in GRPO to ensure a fair comparison. The results, presented in Section 6.8, show that GRPO outperforms PPO by a small but consistent margin and outperforms DDPG by a large margin, as DDPG exhibited higher variability, slower convergence, and occasional constraint violations.

While PPO remains competitive in terms of computational efficiency and single-policy simplicity, GRPO's population-based learning offers enhanced robustness and stability, especially in constrained, non-convex scenarios like the Economic Dispatch Problem. We acknowledge that GRPO incurs slightly higher computational overhead per iteration due to maintaining a population of candidate solutions. However, this cost is mitigated by the algorithm’s fast convergence and high feasibility rate, and it can be further reduced through parallel or distributed evaluation, which GRPO naturally supports.

 

Remark:

  4. How does performance degrade under noisy or incomplete system information?

Response:

We thank the reviewer for raising this important question. To evaluate GRPO’s robustness under noisy or uncertain conditions, we introduced forecast errors and time-varying demand profiles in the newly added stress-testing experiments (Section 6.7). These scenarios simulate incomplete or noisy system information by adding stochastic fluctuations to the load inputs. GRPO maintained full feasibility and exhibited only minor cost degradation while outperforming PPO and DDPG under the same conditions. This demonstrates the algorithm’s ability to generalize and remain stable when facing uncertain or imperfect input data.
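For a concrete picture of this kind of stress test, the sketch below shows one plausible way to inject forecast error and missing readings into a demand profile; the noise level, dropout rate, and profile values are illustrative assumptions, not the exact settings used in Section 6.7.

```python
import numpy as np


def perturb_demand(base_profile, sigma=0.03, dropout_prob=0.05, seed=None):
    # Multiplicative Gaussian forecast error plus occasional missing readings
    # replaced by the last known value (a crude model of incomplete data).
    rng = np.random.default_rng(seed)
    noisy = np.asarray(base_profile, dtype=float) * (1 + rng.normal(0.0, sigma, len(base_profile)))
    for t in range(1, len(noisy)):
        if rng.random() < dropout_prob:
            noisy[t] = noisy[t - 1]
    return noisy


print(perturb_demand([2600, 2650, 2700, 2750, 2800], sigma=0.03, seed=1))
```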

Remark:

  5. The discussion of practical limitations and real-world applicability can be improved and enriched in this paper.

Response:

We thank the reviewer for this helpful suggestion. In the revised manuscript, we have expanded the discussion in Section 7.4 to address GRPO’s practical applicability and limitations. We highlight its suitability for renewable-integrated smart grids, its strong performance under dynamic and uncertain conditions, and its inherent parallelizability, which supports scalable deployment. We also acknowledge that GRPO incurs moderate computational overhead compared to some single-solution heuristics, but emphasize that this is offset by its robustness, feasibility guarantees, and potential for acceleration via parallel or GPU-based implementations.

Remark:

  6. An interesting Ref. with DOI 10.1016/j.rser.2025.115776 is recommended to be integrated into Section 1, where the authors motivate the need for advanced optimization methods in smart grids with renewable energy integration. By citing this Ref., the proposed game-theoretic evolution framework may directly support the authors' argument that traditional optimization methods inadequately handle the complexity of modern power systems with high renewable penetration.

Response:

We have added this recent reference. It is interesting to see how game theory and evolutionary game theory can be used to optimize decision-making and manage distributed energy resources in complex, decentralized power systems.

We sincerely thank the reviewer for their valuable time and effort in providing constructive feedback on our paper. We tried to handle most of the remarks suggested by the reviewer.

Reviewer 3 Report

Comments and Suggestions for Authors

I am pleased to review the article titled “A Reinforcement Learning Approach Based on Group Relative Policy Optimisation for Economic Dispatch in Smart Grids”. The manuscript presents a compelling application of reinforcement learning, specifically the Group Relative Policy Optimisation (GRPO) method, to the Economic Dispatch Problem (EDP) within smart grids, a domain where traditional and metaheuristic methods often face limitations. The research addresses a clear and timely question: how can reinforcement learning approaches be effectively adapted to solve highly constrained, non-convex EDPs in a scalable and stable manner?

The topic is both original and highly relevant. It contributes to a pressing need in the energy sector for efficient, scalable, and adaptive algorithms that support real-time grid operations amid increasing complexity, such as renewable integration and evolving constraints. GRPO represents an innovative extension of PPO and introduces a population-based, relative performance-driven optimisation process, which is novel in the context of economic dispatch. Compared to existing literature, the paper adds meaningful value through its robust constraint-handling mechanism, elite-guided updates, and adaptive exploration features, offering practical advantages over classical RL and metaheuristic solutions.

The methodology is well-articulated and grounded in reinforcement learning theory. The problem formulation is comprehensive, with appropriate modelling of non-convexities like valve-point effects, prohibited operating zones, spinning reserve, and ramp constraints. The introduction of GRPO is mathematically rigorous and clearly implemented. However, the authors could enhance the reproducibility of the study by providing more details on the hyperparameter tuning process, justification for selected benchmark instances, and potential sensitivity analyses. Additionally, the role of stochastic elements in GRPO (e.g., variance from population sampling or environment interaction) could be explored further through ablation studies or robustness checks.

The conclusions are consistent with the results and logically follow from the analysis. The authors present a compelling case for GRPO’s superior performance based on comparative experiments, particularly on the IEEE 15-unit system, where GRPO achieves cost minimisation with strong feasibility and rapid convergence. That said, the discussion could be strengthened by elaborating more on limitations—such as computational cost for scaling to real-time operation in ultra-large systems—and potential trade-offs in exploration versus exploitation.

The references are extensive and appropriate, drawing from key works in both metaheuristics and RL. However, some newer studies applying transformer-based RL or hybrid methods for EDP could have been cited to further situate this work within the cutting edge of the literature.

Figures and tables are generally well-presented. Table 2 and Figure 3 effectively illustrate the optimal dispatch solution, while the convergence plots in Figure 2 provide insight into the algorithm's dynamics. However, adding confidence intervals or boxplots for cost and feasibility metrics across multiple runs would improve interpretability and rigour. Furthermore, the visual presentation of the GRPO architecture could be enhanced via a flow diagram to clarify the training process.

The paper is clearly written and accessible to the journal’s international audience, though there is occasional redundancy in the methodology sections that could be tightened. Terminology is used appropriately, though the acronym list could be expanded to support readers from interdisciplinary backgrounds. The social and economic implications, such as cost savings, sustainability impacts, or relevance to energy transition goals, are hinted at but not explicitly elaborated, which limits the broader impact narrative.

I believe this is a high-quality, technically sound manuscript that introduces an original and well-justified solution to a complex real-world problem. The work is significant in its potential to influence future smart grid dispatch systems, especially where real-time adaptation is critical. To further improve the study, I recommend minor revisions focusing on the inclusion of ablation studies, sensitivity analysis, and a more comprehensive discussion of broader implications and future extensions.

Suggestions:

  • While the GRPO algorithm’s parameters (e.g., population size, noise level, elite percentage, etc.) are listed in Table 1, there is no discussion of why these values were selected. Including a brief explanation, perhaps via a parameter sensitivity analysis or citation of prior works, would improve transparency and reproducibility. Alternatively, authors could provide a grid or random search result summary in a supplementary appendix.
  • Given that GRPO includes stochastic elements (e.g., Gaussian perturbation, adaptive noise, population sampling), it’s essential to discuss the variance in results. The paper currently presents only one run per test case. To increase scientific rigour, it would be beneficial to:
  • Present average and standard deviation of cost and feasibility over multiple (e.g., 20 or 30) independent runs.
  • Include boxplots or error bars in convergence plots (e.g., Figure 2).
  • Report worst-case vs best-case results to assess robustness.
  • While Algorithm 1 is quite detailed, readers would benefit from a simplified flowchart or diagram of the GRPO pipeline. This would clarify how the elite-guided updates, constraint-handling repairs, and population evolution loop work together.
  • Although the paper mentions the use of 15, 30, 60, and 90-unit systems, only the 15-unit case is reported in detail; providing summary results (best cost, feasibility, convergence time) for the other systems, even in a compressed table or appendix, would better demonstrate the scalability claim.
  • The comparative analysis is primarily against metaheuristic methods. While it references PPO and SAC in the text, direct comparison to PPO (its closest RL baseline) is missing from Table 3. Including a PPO baseline with equivalent constraints would:
  • Validate GRPO’s added value.
  • Anchor the contribution more clearly within the RL domain.
  • The manuscript strongly emphasises GRPO’s advantages (e.g., stability, convergence, robustness), but a more nuanced discussion of trade-offs is needed. For example:
  • Does maintaining a population of policies increase memory and compute time significantly compared to single-policy PPO?
  • How does GRPO scale in terms of computational time with larger action spaces?
  • Are there diminishing returns in performance when increasing population size?

 

  • The introduction and conclusion briefly mention real-time scheduling and cost efficiency. However, the broader societal relevance, such as supporting renewable energy integration, improving grid resilience, or enabling demand-response strategies, is underdeveloped. Consider adding:
  • A paragraph on how GRPO might support the transition to sustainable energy systems.
  • Discussion on potential for deployment in developing countries or microgrid scenarios.
  • Estimates of financial savings or CO₂ reduction if GRPO were adopted at scale.
  • Some recent developments in reinforcement learning for smart grid optimisation are not included. Suggest citing:
  • Transformer-based RL approaches (for long-term dependencies).
  • Multi-agent RL systems for distributed grids.
  • Safe RL or explainable RL in critical infrastructure contexts.

Adding these will position GRPO more clearly within the emerging frontier of learning-based energy optimisation.

  • Theoretical performance bounds are briefly discussed in Section 4.6.4. The authors could expand this with:
  • Clarification of assumptions under which the bound holds.
  • Practical implications: what does the bound tell us about solution quality or convergence rate in constrained, real-world scenarios?
  • In Table 3, specify whether the comparison methods are fully feasible (as is noted for EHNN).
  • Include annotations in Figure 2 (e.g., iteration markers or lines for key transitions like exploration-exploitation shift).
  • Improve conciseness in Sections 4 and 5. There is some redundancy in descriptions of GRPO components (e.g., elite sets and trust region mechanisms are explained in multiple places).
  • Consider re-organising the constraint-handling techniques into a summary table (type of constraint, method used, penalty formula or repair logic).

Author Response

 

Reviewer 3:


I am pleased to review the article titled “A Reinforcement Learning Approach Based on Group Relative Policy Optimisation for Economic Dispatch in Smart Grids”. The manuscript presents a compelling application of reinforcement learning, specifically the Group Relative Policy Optimisation (GRPO) method, to the Economic Dispatch Problem (EDP) within smart grids, a domain where traditional and metaheuristic methods often face limitations. The research addresses a clear and timely question: how can reinforcement learning approaches be effectively adapted to solve highly constrained, non-convex EDPs in a scalable and stable manner?

The topic is both original and highly relevant. It contributes to a pressing need in the energy sector for efficient, scalable, and adaptive algorithms that support real-time grid operations amid increasing complexity, such as renewable integration and evolving constraints. GRPO represents an innovative extension of PPO and introduces a population-based, relative performance-driven optimisation process, which is novel in the context of economic dispatch. Compared to existing literature, the paper adds meaningful value through its robust constraint-handling mechanism, elite-guided updates, and adaptive exploration features, offering practical advantages over classical RL and metaheuristic solutions.

The methodology is well-articulated and grounded in reinforcement learning theory. The problem formulation is comprehensive, with appropriate modelling of non-convexities like valve-point effects, prohibited operating zones, spinning reserve, and ramp constraints. The introduction of GRPO is mathematically rigorous and clearly implemented. However, the authors could enhance the reproducibility of the study by providing more details on the hyperparameter tuning process, justification for selected benchmark instances, and potential sensitivity analyses. Additionally, the role of stochastic elements in GRPO (e.g., variance from population sampling or environment interaction) could be explored further through ablation studies or robustness checks.

The conclusions are consistent with the results and logically follow from the analysis. The authors present a compelling case for GRPO’s superior performance based on comparative experiments, particularly on the IEEE 15-unit system, where GRPO achieves cost minimisation with strong feasibility and rapid convergence. That said, the discussion could be strengthened by elaborating more on limitations—such as computational cost for scaling to real-time operation in ultra-large systems—and potential trade-offs in exploration versus exploitation.

The references are extensive and appropriate, drawing from key works in both metaheuristics and RL. However, some newer studies applying transformer-based RL or hybrid methods for EDP could have been cited to further situate this work within the cutting edge of the literature.

Figures and tables are generally well-presented. Table 2 and Figure 3 effectively illustrate the optimal dispatch solution, while the convergence plots in Figure 2 provide insight into the algorithm's dynamics. However, adding confidence intervals or boxplots for cost and feasibility metrics across multiple runs would improve interpretability and rigour. Furthermore, the visual presentation of the GRPO architecture could be enhanced via a flow diagram to clarify the training process.

The paper is clearly written and accessible to the journal’s international audience, though there is occasional redundancy in the methodology sections that could be tightened. Terminology is used appropriately, though the acronym list could be expanded to support readers from interdisciplinary backgrounds. The social and economic implications, such as cost savings, sustainability impacts, or relevance to energy transition goals, are hinted at but not explicitly elaborated, which limits the broader impact narrative.

I believe this is a high-quality, technically sound manuscript that introduces an original and well-justified solution to a complex real-world problem. The work is significant in its potential to influence future smart grid dispatch systems, especially where real-time adaptation is critical. To further improve the study, I recommend minor revisions focusing on the inclusion of ablation studies, sensitivity analysis, and a more comprehensive discussion of broader implications and future extensions.

Response:

We sincerely thank the reviewer for their valuable time and effort in providing constructive feedback on our paper. We tried to handle most of the remarks suggested by the reviewer.

Remark:

  • While the GRPO algorithm’s parameters (e.g., population size, noise level, elite percentage, etc.) are listed in Table 1, there is no discussion of why these values were selected. Including a brief explanation, perhaps via a parameter sensitivity analysis or citation of prior works, would improve transparency and reproducibility. Alternatively, authors could provide a grid or random search result summary in a supplementary appendix.

Response:

We thank the reviewer for highlighting the need to justify our hyperparameter choices. In the first version, we used only manual selection by trying different combinations of hyperparameters. In this revised version, we performed automatic hyperparameter tuning using a grid search. We have added a dedicated sensitivity analysis in Section 6.7 to examine the effect of key hyperparameters (namely population size, elite percentage, and initial exploration noise) on solution quality and convergence. The results confirm that while GRPO is sensitive to extreme parameter values, it maintains stable and high performance within a reasonable range. These findings help guide parameter selection and show that the algorithm’s performance does not critically depend on precise tuning.
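As an illustration of the grid-search procedure mentioned above, a minimal skeleton might look like the following; run_grpo is a hypothetical entry point standing in for the actual solver, and the grid values are examples rather than the tuned ranges reported in the paper.

```python
from itertools import product
import random


def run_grpo(pop_size, elite_frac, noise0, seed=0):
    # Placeholder for one GRPO run (hypothetical interface): returns
    # (best_cost, feasible). Replace the body with the actual solver call.
    random.seed(hash((seed, pop_size, elite_frac, noise0)))
    return 32000 + random.uniform(0, 500), True


grid = {
    "pop_size":   [32, 64, 128],
    "elite_frac": [0.10, 0.20, 0.30],
    "noise0":     [0.05, 0.10, 0.20],
}

results = []
for pop_size, elite_frac, noise0 in product(*grid.values()):
    cost, feasible = run_grpo(pop_size, elite_frac, noise0, seed=42)
    results.append(((pop_size, elite_frac, noise0), cost, feasible))

best = min((r for r in results if r[2]), key=lambda r: r[1])
print("best feasible configuration:", best[0], "cost:", round(best[1], 2))
```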

Remark:

  • Given that GRPO includes stochastic elements (e.g., Gaussian perturbation, adaptive noise, population sampling), it’s essential to discuss the variance in results. The paper currently presents only one run per test case. To increase scientific rigour, it would be beneficial to:
  • Present average and standard deviation of cost and feasibility over multiple (e.g., 20 or 30) independent runs.
  • Include boxplots or error bars in convergence plots (e.g., Figure 2).
  • Report worst-case vs best-case results to assess robustness.

Response:

We agree that single-run results are insufficient for stochastic algorithms, which is why the earlier version already reported the average cost over 30 runs. To improve on this according to your remark, we performed 30 independent runs per test case (15–90 units), added boxplots (Fig. 7) and error bars to the convergence plots (Figs. 2, 4–6), and reported detailed statistics in Section 6.5. GRPO showed <0.3% standard deviation, 100% feasibility, and worst-case performance within 1% of the best case across all runs. We hope these statistics give a clear picture of the method's performance.

Remark:

  • While Algorithm 1 is quite detailed, readers would benefit from a simplified flowchart or diagram of the GRPO pipeline. This would clarify how the elite-guided updates, constraint-handling repairs, and population evolution loop work together.

Response:

We have added a simplified flowchart of the GRPO pipeline.

Remark:

  • Although the paper mentions the use of 15, 30, 60, and 90-unit systems, only the 15-unit case is reported in detail; providing summary results (best cost, feasibility, convergence time) for the other systems, even in a compressed table or appendix, would better demonstrate the scalability claim.

Response:

Our intention was not to discard the use of the 30-, 60-, and 90-unit systems for scalability tests; rather, we focused initially on the 15-unit system because much of the existing literature emphasizes this case, which enables a clearer comparison and validation of our approach. In the revised manuscript, we have added multiple experiments and included detailed results for the larger systems, which we believe provides a sufficiently multi-faceted evaluation and comparison.

Remark:

  • The comparative analysis is primarily against metaheuristic methods. While it references PPO and SAC in the text, direct comparison to PPO (its closest RL baseline) is missing from Table 3. Including a PPO baseline with equivalent constraints would:
  • Validate GRPO’s added value.
  • Anchor the contribution more clearly within the RL domain.

Response:

We appreciate the reviewer’s suggestion to include more recent reinforcement learning baselines. In response, we have expanded our comparative study to include Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG), two widely used and recent RL methods. We chose these two algorithms based on the comments of all reviewers, and both were implemented with the same constraint-handling framework used in GRPO to ensure a fair comparison. The results, presented in Section 6.8, show that GRPO outperforms PPO by a small but consistent margin and outperforms DDPG by a large margin, as DDPG exhibited higher variability, slower convergence, and occasional constraint violations.

While PPO remains competitive in terms of computational efficiency and single-policy simplicity, GRPO's population-based learning offers enhanced robustness and stability, especially in constrained, non-convex scenarios like the Economic Dispatch Problem. We acknowledge that GRPO incurs slightly higher computational overhead per iteration due to maintaining a population of candidate solutions. However, this cost is mitigated by the algorithm’s fast convergence and high feasibility rate, and it can be further reduced through parallel or distributed evaluation, which GRPO naturally supports.

Remark:

  • The manuscript strongly emphasises GRPO’s advantages (e.g., stability, convergence, robustness), but a more nuanced discussion of trade-offs is needed. For example:
  • Does maintaining a population of policies increase memory and compute time significantly compared to single-policy PPO?
  • How does GRPO scale in terms of computational time with larger action spaces?
  • Are there diminishing returns in performance when increasing population size?

 Response:

We thank the reviewer for recommending a more critical discussion of GRPO’s design trade-offs. In the revised manuscript (Section 7.3), we have expanded the analysis to discuss the computational and memory implications of maintaining a population of policies. We clarify that while GRPO has higher per-iteration cost than single-policy RL methods, it is highly parallelizable and scales linearly with problem size. We also report that increasing population size beyond a certain point yields limited gains, confirming the existence of diminishing returns and justifying our parameter choices.

Remark:

  • The introduction and conclusion briefly mention real-time scheduling and cost efficiency. However, the broader societal relevance, such as supporting renewable energy integration, improving grid resilience, or enabling demand-response strategies, is underdeveloped. Consider adding:
  • A paragraph on how GRPO might support the transition to sustainable energy systems.
  • Discussion on potential for deployment in developing countries or microgrid scenarios.
  • Estimates of financial savings or CO₂ reduction if GRPO were adopted at scale.

Response:

We thank the reviewer for this valuable suggestion. In the revised manuscript, we have expanded the introduction and discussion to highlight the broader societal relevance of GRPO, including its role in renewable energy integration, grid resilience, and demand response. We also added a discussion of applications in developing countries and microgrid contexts, along with illustrative estimates of potential financial savings and CO₂ reduction at scale.

Remark:

  • Some recent developments in reinforcement learning for smart grid optimisation are not included. Suggest citing:
  • Transformer-based RL approaches (for long-term dependencies).
  • Multi-agent RL systems for distributed grids.
  • Safe RL or explainable RL in critical infrastructure contexts.

Adding these will position GRPO more clearly within the emerging frontier of learning-based energy optimisation.

Response:

We thank the reviewer for this excellent suggestion. In response, we have extended the background by citing recent literature on transformer-based reinforcement learning, multi-agent RL for distributed grid control, and safe/explainable RL. These references help contextualize GRPO within the broader trajectory of modern RL research in power systems and highlight complementary avenues such as coordination, interpretability, and operational safety.

Remark:

  • Theoretical performance bounds are briefly discussed in Section 4.6.4. The authors could expand this with:
  • Clarification of assumptions under which the bound holds.
  • Practical implications: what does the bound tell us about solution quality or convergence rate in constrained, real-world scenarios?

Response:

In the revised Section 4.6.4, we have clarified the assumptions under which the bound applies, namely, bounded rewards, finite time horizon, and controlled update magnitudes. We also explain the practical implications: the bound guarantees that GRPO will not significantly degrade policy quality compared to the baseline, as long as policy updates remain within the defined trust region. This helps support the empirical convergence stability observed in our experiments.
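For orientation, bounds of this kind typically follow the classical trust-region policy improvement guarantee shown below, stated for bounded advantages and a discounted objective; this is the generic form from the TRPO literature, and the exact constants and assumptions of the bound in Section 4.6.4 may differ.

```latex
% Generic trust-region improvement guarantee (Schulman et al., TRPO), shown
% only to illustrate the form such bounds take; the paper's own bound in
% Section 4.6.4 may use different constants and assumptions.
J(\pi_{\text{new}}) \;\ge\; L_{\pi_{\text{old}}}(\pi_{\text{new}})
  \;-\; \frac{4\,\epsilon\,\gamma}{(1-\gamma)^{2}}\,
        \max_{s} D_{\mathrm{KL}}\!\left(\pi_{\text{old}}(\cdot \mid s)\,\middle\|\,\pi_{\text{new}}(\cdot \mid s)\right),
\qquad \epsilon = \max_{s,a}\left|A_{\pi_{\text{old}}}(s,a)\right|.
```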

Remark:

  • In Table 3, specify whether the comparison methods are fully feasible (as is noted for EHNN).
  • Include annotations in Figure 2 (e.g., iteration markers or lines for key transitions like exploration-exploitation shift).

Response:

We have improved the table and the figure accordingly. Thank you!

Remark:

  • Improve conciseness in Sections 4 and 5. There is some redundancy in descriptions of GRPO components (e.g., elite sets and trust region mechanisms are explained in multiple places).

Response:

We have streamlined Sections 4 and 5 by removing redundant explanations related to elite memory, trust region updates, and adaptive mechanisms. The implementation section (Section 5) now references the theoretical formulations in Section 4 to improve conciseness and avoid repetition, while preserving technical clarity.

Remark:

  • Consider re-organising the constraint-handling techniques into a summary table (type of constraint, method used, penalty formula or repair logic).

Response:

We thank the reviewer for the helpful suggestion. In response, we have added a summary table (Table 1) at the end of Section 3.4 that clearly outlines each constraint type and the corresponding handling method used in GRPO. This table helps distinguish between repair-based and penalty-based mechanisms and provides references to the relevant equations for each case.

 


Finally, we sincerely thank the reviewer for their valuable time and effort in providing constructive feedback on our paper. We tried to handle most of the remarks suggested by the reviewer.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revised paper is significantly improved. However, Table 1 is not readable. Also ensure all figures and Tables are referenced in the manuscript.

Author Response

We have corrected the formatting issue with the table as suggested. We also carefully double-checked the manuscript to ensure that all tables and figures are referenced in the text. We sincerely thank the reviewer for their time, effort, and valuable feedback.

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has undergone thorough revisions and improvements, reaching a level that is now suitable for publication. I recommend that the paper be accepted for publication. 

Author Response

We are grateful to the reviewer for their thoughtful comments and the time and effort they invested in helping us improve our work.
