by
  • Olimpiu Nicolae Moga,
  • Adrian Florea* and
  • Claudiu Solea
  • et al.

Reviewer 1: Tokar Adriana; Reviewer 2: Anonymous; Reviewer 3: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The subject addressed by the authors is topical but requires the following revisions:

- To draw a distinction between economic viability and operational capacity. How is the balance achieved?

- It is not clear who manages this centralized system for collecting data from the multitude of consumers, or how the optimization models are implemented (chapter 2.1).

- The measures presented as proposed measures for efficient peak load management are measures already known and implemented by energy managers for industry and cities.

- Explain how you reduce voltage deviations by efficiently placing photovoltaic panels.

- Given that energy is traded on the stock exchange based on the general principle of economy and is transported and distributed to consumers through the electrical networks of the Electroenergetic System, explain the applicability of peer-to-peer (P2P) trading.

- It is not clear where the data presented in Fig. 1 is taken from. An explanation is required.

- What is the role of the storage system installed at Building 3?

- Explain how you determined that PPO can balance the load curve without presenting the equivalent electrical diagram of the analyzed networks. It is necessary to present the equivalent electrical diagram.

- How were the battery storage parameters (capacity / nominal power) established?

- There is no correspondence between the installed power of the photovoltaic generation systems and the storage system. Clarification is needed.

- I believe that the load curve presented in Fig.6 for a shopping center is unrealistic in the interval 0:00-06:00.

- How was the total number of environment interactions established?

- It is not clear where the data presented graphically in Fig. 9-11 and in tables 6-8 are taken from. Clarification is needed.

Author Response

We would like to thank the reviewers for their relevant, constructive, and useful observations. We have taken them into account in the revision in order to improve the article. We also corrected and updated the paper in accordance with the Sustainability guidelines.

Once again, we are grateful for the time and effort you took to read and evaluate our work and provide specific and constructive observations. We have carefully considered your feedback and incorporated the corresponding changes into our revised manuscript, highlighting them for your reference in the version with track changes attached to the submission.

We colored in red the reviewer’s observations and in blue our answers.

Q1 To draw a distinction between economic viability and operational capacity. How is the balance achieved?

Thank you for your remark. We have added the following sentence in Section 2.1 referring to the broader research directions in microgrid and energy community management. 

“These directions represent the broader research landscape; however, the present study focuses exclusively on economic viability, without modelling operational reliability constraints such as outages, inverter limits, or degradation.”

 

To prevent misunderstanding, we have also clarified in the revised text that the present study focuses solely on the economic side (cost and emissions) and does not model operational reliability aspects such as inverter limitations, outages, or component degradation. This is, however, a good direction for future work. A brief explanatory sentence has been added to Section 2.1 to make this distinction explicit, and these limitations are listed in the new Section 3.8 on limitations and assumptions.

Q2 It is not clear who manages this centralized system for collecting data from the multitude of consumers, or how the optimization models are implemented (chapter 2.1).

Thank you for the observation. The paragraph “The IoT sensors transmit the collected data in real time to a centralized monitoring system. This data includes information on energy consumption patterns, usage trends, and peak demand periods” in Section 2.1 describes centralized monitoring approaches found in the literature and is not meant to represent the architecture of our proposed system. To avoid confusion, we have added a clarification in Section 2.1 explicitly stating that this description summarizes general practices in existing research and is not part of the methodology used in this study. To support this, we have added reference [14] in Section 2.1.

As stated in the new section 3.5, our “PPO controller is implemented as a decentralized multi-agent system. Each building in the community is represented by an independent PPO agent that receives only its own local state observation and outputs one action corresponding to the charging or discharging power of its electrical storage system. While observations are not shared between agents, cooperation emerges through the use of a shared global reward, which encourages agents to not act selfishly, since individual actions that reduce overall grid imports improve the rewards for all agents.”
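As an illustration only, the decentralized scheme described above might be sketched as follows. The policy callables, the sign convention (positive action charges the battery), and the exact form of the shared reward are assumptions made for this sketch; they are not taken from the manuscript.

```python
from typing import Callable, List, Sequence

# Hypothetical policy type: maps one building's local observation to a
# single action in [-1, 1] (assumed: > 0 charges, < 0 discharges).
Policy = Callable[[Sequence[float]], float]


def step_agents(policies: Sequence[Policy],
                local_observations: Sequence[Sequence[float]]) -> List[float]:
    """Each agent sees only its own observation and returns one action."""
    return [policy(obs) for policy, obs in zip(policies, local_observations)]


def shared_reward(net_consumption_kwh: Sequence[float]) -> float:
    """Assumed global reward: penalise the community's total grid import.

    The same value is handed to every agent, so an action that lowers
    community imports improves the reward of all agents.
    """
    community_import = max(sum(net_consumption_kwh), 0.0)
    return -community_import


# Toy usage with two stand-in policies (not trained PPO networks).
policies = [lambda obs: 0.0, lambda obs: -0.5]
actions = step_agents(policies, [[0.2, 0.1], [0.9, 0.4]])
reward = shared_reward([1.5, -0.5, 2.0])
rewards_per_agent = [reward] * len(policies)
```

Observations stay local while the reward is computed once at community level, which is the mechanism the quoted passage credits for cooperative behaviour.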

 

Q3 The measures presented as proposed measures for efficient peak load management are measures already known and implemented by energy managers for industry and cities.

The measures mentioned in the literature review section (e.g., load shifting, peak shaving, coordinated storage) were not intended to be presented as new contributions of our work. They are established techniques widely used in industry, and we included them to provide the broader context and describe the research landscape. 

In section 2.2 we have added the following paragraph and accompanying references [8], [20], [21]:
“Grid resilience research is also being conducted on the physical side of energy community and microgrid implementations. Bi-directional power flow, voltage deviations, and network losses are being addressed through energy transaction algorithms and efficient PV placement [8], [20], [21]. In our study, these strategies are part of the existing body of peak-load management methods and are discussed solely to contextualize the motivation for using reinforcement learning.”

Q4 - Explain how you reduce voltage deviations by efficiently placing photovoltaic panels.

Indeed, one of the questions that needs to be addressed is how the voltage is kept within acceptable limits while power is being injected into the grid by prosumers [22]. The statements regarding voltage deviation reduction and optimal PV placement refer to findings reported in the cited literature (new references [8,20,21], formerly [5,7]) and do not describe methods implemented in our study. Our work focuses exclusively on economic performance and carbon emissions and does not model power-flow constraints, voltage deviations, or electrical network characteristics. There is, however, one hypothesis regarding the voltage values considered in the study:

A previous analysis of line voltage in Romania has revealed that the values are within acceptable limits, although above the nominal voltage. Energy communities have an intrinsic ability to limit potentially hazardous voltage levels, because a large proportion of the energy generated is used for own consumption [23].

 

Q5 - Given that energy is traded on the stock exchange based on the general principle of economy and is transported and distributed to consumers through the electrical networks of the Electroenergetic System, explain the applicability of peer-to-peer (P2P) trading.

We would like to clarify that the discussion of peer-to-peer (P2P) energy trading appears in Section 2.3 as part of the literature review, summarizing emerging approaches explored in recent research. Our study does not implement or rely on P2P mechanisms.

Regarding applicability, P2P energy trading is already being piloted in several regions (e.g., the Netherlands, Belgium, Australia, and the United States), where regulatory sandboxes allow consumers to exchange locally generated renewable electricity within a community. These examples demonstrate that P2P approaches are feasible and actively studied, even though large-scale deployment depends on local regulations. P2P energy sharing and other transactive energy frameworks are mentioned in the manuscript as future developments.

See references [27], [28].

 

Q6 - It is not clear where the data presented in Fig.1 is taken from. An explanation is required

Thank you for pointing this out. We have clarified the origin of the data used in Figure 1. The figure now explicitly states that the load and PV curves are taken from a CityLearn simulation (Building 1, Schema 1, July 1). CityLearn derives its load, PV, and weather profiles from the U.S. DOE End-Use Load Profiles, which ensures that the values are realistic and reflect real consumption and irradiance patterns. The revised caption now clearly indicates this source. We also added this information in the new 3.3 section “Community configurations and input data”.

 

Q7- What is the role of the storage system installed at Building 3?

Thank you for this observation. We have clarified the role of the storage system in Building 3, Schema 1. This building is intentionally modelled as a storage-only participant to represent realistic cases where storage is available but no local PV exists. Such configurations appear in communities where a building installs PV later than a battery or participates in “storage-as-a-service” arrangements. Optionally, the battery also enables energy arbitrage, charging at lower night-time prices and supplying the stored energy when electricity is more expensive [38], [39].

The earlier paragraph has been extended with: “Schema 1, seen in Figure 2 and described in Table 1, models a community of 5 residential buildings. Building 1 is a regular prosumer, with a battery and PV, buildings 2 and 5 represent a typical consumer, with neither solar nor storage capabilities, while building 3 only has PV, and building 4 only storage capabilities. Building 4 includes only an energy storage system, without local PV generation. This was introduced to test whether a storage-only participant could provide value through energy arbitrage or load balancing within the community, as well as for simulating future models for actors which participate in demand-shifting or “storage as a service” arrangements [38], [39].”
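The arbitrage argument for a storage-only participant can be made concrete with a small worked example. This is a minimal sketch under stated assumptions: the load, the two-level tariff, the lossless battery, and the absence of export revenue are all hypothetical and are not values from the manuscript or from CityLearn.

```python
def daily_cost(load_kwh, price_per_kwh, battery_flow_kwh):
    """Cost of grid imports over a sequence of periods.

    battery_flow_kwh > 0 charges the battery (adds to grid demand),
    battery_flow_kwh < 0 discharges it (offsets the load).
    Exports earn nothing in this sketch.
    """
    cost = 0.0
    for load, price, flow in zip(load_kwh, price_per_kwh, battery_flow_kwh):
        grid_import = max(load + flow, 0.0)
        cost += grid_import * price
    return cost


# Hypothetical four-period day: two cheap night periods, two expensive day periods.
load = [1.0, 1.0, 2.0, 2.0]             # kWh per period
price = [0.10, 0.10, 0.30, 0.30]        # currency units per kWh
no_battery = [0.0, 0.0, 0.0, 0.0]
arbitrage = [1.0, 1.0, -1.0, -1.0]      # charge at night, discharge by day

baseline = daily_cost(load, price, no_battery)
shifted = daily_cost(load, price, arbitrage)
saving = baseline - shifted
```

With these invented numbers the battery moves 2 kWh from the expensive periods to the cheap ones, so the saving equals the shifted energy times the price difference. Round-trip losses or degradation costs, which this sketch ignores, would reduce that saving.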

 

Q8 - Explain how you determined that PPO can balance the load curve without presenting the equivalent electrical diagram of the analyzed networks. It is necessary to present the equivalent electrical diagram.

CityLearn is a high-level simulation framework that does not include an electrical network model (feeder topology, impedances, voltage constraints, or power-flow equations). As a result, there is no equivalent electrical single-line diagram to present. The PPO agent therefore does not perform load balancing in the classical power-flow sense; instead, it smooths the net energy import profile at the community level by scheduling battery charge/discharge actions. We have noted this in Section 3.8, Limitations and Assumptions.

Q9 - How were the battery storage parameters (capacity / nominal power) established?

 We thank the reviewer for raising this point. In the revised manuscript we clarify the origin and rationale of the battery storage parameters. The values used in this study were selected to remain within the realistic range of capacities and power ratings defined in the CityLearn Challenge 2021 dataset, which is based on the U.S. DOE End-Use Load Profiles. Although the exact numbers were not taken verbatim from any single CityLearn configuration, they were chosen to reflect typical magnitudes found in aggregated residential and commercial building clusters, which the CityLearn dataset models. Section 3 has been restructured to include the following:
3.1 Overview of Methodology

3.2 Simulation Environment

3.3 Community Configurations and Input Data, which addresses the point at issue.

 

Q10 - There is no correspondence between the installed power of the photovoltaic generation systems and the storage system. Clarification is needed.

The mismatch between PV capacity and battery size across buildings was intentional, as such heterogeneous adoption patterns can be observed in real communities, where storage may be added at a later stage, sized based on budget constraints, or deployed unevenly across participants. Please also see the “storage as a service” arrangements discussed in our answer to Q7.

Q11 - I believe that the load curve presented in Fig.6 for a shopping center is unrealistic in the interval 0:00-06:00.

We thank the reviewer for the observation. We have clarified in the revised manuscript that the profile corresponds to the commercial building archetype provided by the dataset, which is based on the U.S. DOE End-Use Load Profiles. To avoid confusion, we have replaced earlier references to specific commercial subtypes with the more accurate term “commercial building, with lower overnight power consumption” (see also reference [34] K. Nweye, B. Liu, P. Stone, and Z. Nagy, “Real-world challenges for multi-agent reinforcement learning in grid-interactive buildings,” Energy AI, vol. 10, p. 100202, Nov. 2022, doi: 10.1016/j.egyai.2022.100202).

Q12 - How was the total number of environment interactions established?

The total number of environment interactions corresponds to the number of training steps used for the PPO agent and is determined by its convergence behaviour.

The following paragraphs were added in Section 3.7:

“The PPO agent was trained for 400,000 environment steps, where each step corresponds to one simulated hour. This choice reflects a balance between learning stability and overfitting: the agent continued to improve up to approximately 500,000-700,000 steps, its performance degraded beyond this interval, and 400,000 steps provided the most stable and generalisable behaviour. Training was performed continuously over the full-year horizon in a rolling-window fashion, without shorter episodes or resets.

Multiple preliminary runs were executed while testing different hyperparameters, but the final results reported in this paper correspond to a run using the configuration shown in Table 5. Evaluation was performed only after the completion of the full training sequence. All computations were carried out on a standard CPU machine without hardware acceleration.” 
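To put the stated training budget in perspective, the step count can be converted into passes over the simulated year. This is a back-of-the-envelope sketch that assumes a non-leap year of 8,760 hourly steps; the quoted text does not state the horizon length in steps.

```python
HOURS_PER_YEAR = 8760        # assumption: non-leap year, one step per hour
TRAINING_STEPS = 400_000     # value stated in the quoted Section 3.7 text

# Fractional number of times the agent traverses the yearly horizon.
passes_over_year = TRAINING_STEPS / HOURS_PER_YEAR

# Whole passes plus the remaining hours of the final, partial pass.
full_passes, leftover_hours = divmod(TRAINING_STEPS, HOURS_PER_YEAR)
```

Under that assumption the agent sees the same year of data several dozen times, which is consistent with the overfitting concern raised in the quoted paragraph.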

 

Q13 - It is not clear where the data presented graphically in Fig. 9-11 and in tables 6-8 are taken from. Clarification is needed.

Figures 9-11 and Tables 6-8 present simulation results obtained from the experiments described in Section 3. As clarified in the revised manuscript, Section 3.7 now explains in detail how these numerical values were obtained from the CityLearn environment. We added the following source note to Tables 6-8:

“(Source: authors' own experiments using the CityLearn dataset from the 2021 challenge, Schema X configuration).”

Reviewer 2 Report

Comments and Suggestions for Authors

A brief summary

This paper presents the results of a study on the performance of a reinforcement learning agent based on the proximal policy optimization (PPO) algorithm for energy supply management in three different energy community configurations.

 

General concept comments

The manuscript fully complies with the journal's scope and theme. The work is original and makes a significant contribution to the advancement of current knowledge. It is written at a high scientific level, adhering to generally accepted standards for presenting research results.

The methodological basis should be presented in more detail and in a separate section.

The findings have both theoretical and practical significance.

The literature review covers relevant aspects of the research area, is comprehensive, and does not reveal any significant knowledge gaps. The bibliographic references primarily include publications from the last five years, emphasizing the relevance of the work. However, section "1. Introduction" is inexplicably lacking references.

The scientific validity of the study is confirmed by the appropriate selection of methods for hypothesis testing and the comprehensiveness of the statistical analysis. The illustrative material (figures, tables) is appropriately selected and facilitates a better understanding of the results. More explanation is needed, and a separate section should be devoted to the explanation of all the obtained values presented in the tables, including how they were obtained, what equipment was used, and what literature sources were consulted. How current is this information?

The conclusions formulated follow logically from the presented data and arguments and will undoubtedly be of interest to the journal's readers, as they contribute to the advancement of modern scientific knowledge.

The article requires revision.

 

Specific comments

Comment 1: Section "1. Introduction" should include a bibliographic analysis confirming the relevance of the selected research topic. This section should be revised using additional sources and preferably concluded with a summary of the study's purpose, relevance, and scientific novelty. Why are there no references in this section? Perhaps Sections 1 and 2 should be combined.

Comment 2: How was the system described in Section "3. Reinforcement Learning for Energy Communities" validated to confirm the correctness and reliability of the system's operation?

Comment 3: The obtained results should be compared with those of other authors.

Comment 4: How were the values presented in Tables 1–3 obtained?

Comment 5: It would be desirable to include a "Materials and Methods" section describing the methods used in the study and describing the object and subject of the study. How were the values presented in the tables obtained? What was the geographic area studied? What are the average summer and winter ambient temperatures?

Author Response

We would like to thank the reviewers for their relevant, constructive, and useful observations. We have taken them into account in the revision in order to improve the article. We also corrected and updated the paper in accordance with the Sustainability guidelines.

Once again, we are grateful for the time and effort you took to read and evaluate our work and provide specific and constructive observations. We have carefully considered your feedback and incorporated the corresponding changes into our revised manuscript, highlighting them for your reference in the version with track changes attached to the submission.

We colored in red the reviewer’s observations and in blue our answers.

Comment 1: Section "1. Introduction" should include a bibliographic analysis confirming the relevance of the selected research topic. This section should be revised using additional sources and preferably concluded with a summary of the study's purpose, relevance, and scientific novelty. Why are there no references in this section? Perhaps Sections 1 and 2 should be combined.

Thank you for this helpful suggestion. We have revised the Introduction to include a clearer bibliographic foundation supporting the relevance of energy communities, the challenges of managing decentralized renewable resources, and the emerging use of reinforcement learning in microgrid optimization. New references have been added to contextualize the transition toward decentralized energy systems, the operational complexity of energy communities, and prior applications of RL-based control.

[1]      V. Vetter, P. Wohlgenannt, P. Kepplinger, and E. Eder, “Deep Reinforcement Learning Approaches the MILP Optimum of a Multi-Energy Optimization in Energy Communities,” Energies, vol. 18, no. 17, p. 4489, Jan. 2025, doi: 10.3390/en18174489. 

[2]      G. Palma, L. Guiducci, M. Stentati, A. Rizzo, and S. Paoletti, “Reinforcement Learning for Energy Community Management: A European-Scale Study,” Energies, vol. 17, no. 5, p. 1249, Jan. 2024, doi: 10.3390/en17051249. 

[3]      M. Uddin, H. Mo, and D. Dong, “Real-Time Energy Management Strategies for Community Microgrids,” June 28, 2025, arXiv: arXiv:2506.22931. doi: 10.48550/arXiv.2506.22931. 

[4]      N. Rego, R. Castro, and J. Lagarto, “Sustainable energy trading and fair benefit allocation in renewable energy communities: A simulation model for Portugal,” Util. Policy, vol. 96, p. 101986, Oct. 2025, doi: 10.1016/j.jup.2025.101986. 

[5]      X. Fang, P. Hong, S. He, Y. Zhang, and D. Tan, “Multi-Layer Energy Management and Strategy Learning for Microgrids: A Proximal Policy Optimization Approach,” Energies, vol. 17, no. 16, p. 3990, Jan. 2024, doi: 10.3390/en17163990. 

[6]      G. Jones, X. Li, and Y. Sun, “Robust Energy Management Policies for Solar Microgrids via Reinforcement Learning,” Energies, vol. 17, no. 12, p. 2821, Jan. 2024, doi: 10.3390/en17122821. 

 

In addition, further references have been added: the manuscript had 23 references before the revision and now has 44.

To summarize the study's purpose, relevance, and scientific novelty, we added the following paragraph: “The contribution of this work is the comparative evaluation of reinforcement learning and rule-based control across three heterogeneous community configurations, each with different combinations of photovoltaic generation and storage availability. This allows us to examine how the level of controllable resources influences the effectiveness of PPO, and to identify conditions under which RL provides improvements over simpler strategies.”

Comment 2: How was the system described in Section "3. Reinforcement Learning for Energy Communities" validated to confirm the correctness and reliability of the system's operation?

Thank you for the question. The system used in Section 3 is validated through the use of the CityLearn 2021 environment, which is a standardized, peer-reviewed simulation framework widely adopted for benchmarking building-level energy management and multi-agent reinforcement learning. CityLearn provides fully validated energy models, verified load and weather profiles, and deterministic simulation dynamics. Results and experiments similar to ours are reported in the following references (the first two were added after the review).

 

[1]    V. Vetter, P. Wohlgenannt, P. Kepplinger, and E. Eder, “Deep Reinforcement Learning Approaches the MILP Optimum of a Multi-Energy Optimization in Energy Communities,” Energies, vol. 18, no. 17, p. 4489, Jan. 2025, doi: 10.3390/en18174489. 

[16]    K. Nweye, B. Liu, P. Stone, and Z. Nagy, “Real-world challenges for multi-agent reinforcement learning in grid-interactive buildings,” Energy AI, vol. 10, p. 100202, Nov. 2022, doi: 10.1016/j.egyai.2022.100202. 

[17]    “The CityLearn Challenge 2021,” doi: 10.1145/3486611.3492226.

 

Comment 3: The obtained results should be compared with those of other authors.

Thank you for this comment. We added a dedicated comparison paragraph in the Discussion section, detailing recent similar work:

“Our results are consistent with those reported in recent literature on reinforcement learning for energy community and microgrid control [1,2,3]. Several studies using PPO or other RL algorithms on CityLearn or similar datasets report cost or emission reductions in the range of 8.78% to 20% compared to rule-based or baseline controllers. For example, [3] reports 18% lower operation costs. Studies exploring alternative RL methods (like a Deep Q Network controller) also show comparable behaviour, achieving an 8.78% reduction in operating costs relative to its baseline [1]. Direct quantitative comparison remains difficult due to differences in datasets, pricing structures, community configurations, baseline controllers, and other implementation differences; however, the improvement observed in this work fits well within the expected performance range.”

Comment 4: How were the values presented in Tables 1–3 obtained?

We thank the reviewer for raising this point. In the revised manuscript we clarify the origin and rationale of the battery storage and PV parameters. The values used in this study were selected to remain within the realistic range of capacities and power ratings defined in the CityLearn Challenge 2021 dataset, which is based on the U.S. DOE End-Use Load Profiles. Although the exact numbers were not taken verbatim from any single CityLearn configuration, they were chosen to reflect typical magnitudes found in aggregated residential and commercial building clusters, which the CityLearn dataset models. We also added this information in the new Section 3.3, “Community configurations and input data”.

Comment 5: It would be desirable to include a "Materials and Methods" section describing the methods used in the study and describing the object and subject of the study. How were the values presented in the tables obtained? What was the geographic area studied? What are the average summer and winter ambient temperatures?

Thank you for this valuable suggestion. In response, we have restructured and expanded Chapter 3 into a clear Materials and Methods section. This revision now explicitly describes the study object, the simulation workflow, the datasets used, and how all numerical values (PV sizes, storage capacities, load profiles, and weather data) were obtained. Section 3 has been restructured as follows:

3.1 Overview of Methodology

3.2 Simulation Environment

3.3 Community Configurations and Input Data 

3.4 Load, PV, and Pricing Data Sources

3.5 State, Action and Reward Spaces

3.6 Algorithm and Training Parameters

3.7 Training Procedure and Evaluation Setup

3.8 Limitations and Assumptions

 

Reviewer 3 Report

Comments and Suggestions for Authors

The authors should address the following concerns.

  1. The Abstract claims the PPO agent is “up to 9.2% more effective” in reducing annual costs and carbon emissions than RBC. Please briefly state the reference (which schema) and whether this number is an average or the best-case single result. Add whether results are statistically significant or single-run numbers.
  2. The Introduction motivates RL vs RBC well, but it is unclear what the novel scientific contribution is beyond applying PPO to three CityLearn schemas. Is the main novelty (a) a comparative study across community types, (b) specific reward engineering choices, or (c) uncovering conditions where RL helps vs when it doesn’t? Explicitly state the gap in literature and how this manuscript uniquely fills it.
  3. Several RL-for-energy papers are cited, but the manuscript should compare to closely related studies that used CityLearn (and the CityLearn Challenge) or PPO in microgrid control, both qualitatively and (if possible) quantitatively. For example — how do your percent improvements compare to prior RL works on similar datasets? If there aren’t comparable metrics, state that clearly and explain why.
  4. Is the PPO agent centralized (one agent sees all buildings and issues actions for each) or decentralized (one agent per building)? The text sometimes reads like centralized and sometimes like per-building control. Please state clearly: observation space size, action vector dimension, which features are global vs local, and how actions map to each battery.
  5. Equation (1) appears as min(-e3, 0) in the PDF (or similar). This is unclear: is the reward min(-e^3, 0)? Why cubic? Why clamp to 0? This needs a correct, precise formula, units, and rationale.
  6. Table entries (e.g., Building 1 PV 120 kW, battery 140 kWh for small residential) seem large for residential units. Please justify these sizing choices (source dataset? scaled units?) and ensure units/power ratings are consistent.
  7. The action range is said to be [-1, 1], mapped to charge/discharge. How are these mapped to actual power (kW)? How is SOC clamping handled?
  8. The manuscript reports percent reductions but no measures of statistical significance. Provide averages and standard deviations (or confidence intervals) across runs, and include statistical tests where appropriate. Also define metrics precisely. For “Annual Cost”: what price structure is used (flat, time-varying)? Are costs computed only from imports or net of exports?
  9. The energy arbitrage strategy may cause extra cycling; excluding battery degradation can bias results toward heavier battery use. Please discuss whether battery degradation costs (or cycle-life impacts) were included. If not, add as limitation and consider a variant in future work that penalizes cycles.
  10. The Discussion mentions CityLearn limitations, but expand to real-world constraints: communication delays, partial observability, privacy concerns, cyber-security, measurement noise, inverter/grid interaction dynamics, and regulatory constraints on time-of-use pricing and feed-in tariffs. Discuss how these factors could affect transfer from simulation to deployment.
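Concern 7 above asks how a normalised action becomes a physical battery power and how the state of charge (SOC) is kept within bounds. One common convention is sketched below for orientation only. It is an assumption, not the mapping used in the manuscript or by CityLearn: the action scales the nominal power, positive values charge, and losses are ignored. The function name and parameters are hypothetical.

```python
def apply_battery_action(action, soc_kwh, capacity_kwh, nominal_power_kw, dt_h=1.0):
    """Map an action in [-1, 1] to a feasible battery power and update the SOC.

    Assumed convention: action * nominal_power_kw is the requested power,
    positive for charging and negative for discharging.
    Returns (power_kw, new_soc_kwh).
    """
    # Keep the action inside its nominal range.
    action = max(-1.0, min(1.0, action))
    requested_kwh = action * nominal_power_kw * dt_h

    # SOC clamping: never charge above capacity or discharge below empty.
    headroom_kwh = capacity_kwh - soc_kwh
    feasible_kwh = max(-soc_kwh, min(headroom_kwh, requested_kwh))

    return feasible_kwh / dt_h, soc_kwh + feasible_kwh


# A nearly full battery accepts only the remaining headroom.
power_full, soc_full = apply_battery_action(1.0, 9.0, 10.0, 5.0)

# A nearly empty battery delivers only what it still holds.
power_empty, soc_empty = apply_battery_action(-1.0, 2.0, 10.0, 5.0)
```

Under this convention the power actually applied can differ from the requested power whenever the SOC limit binds, which is the kind of detail the reviewer asks the authors to state.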

Author Response

We would like to thank the reviewers for their relevant, constructive, and useful observations. We have taken them into account in the revision in order to improve the article. We also corrected and updated the paper in accordance with the Sustainability guidelines.

Once again, we are grateful for the time and effort you took to read and evaluate our work and provide specific and constructive observations. We have carefully considered your feedback and incorporated the corresponding changes into our revised manuscript, highlighting them for your reference in the version with track changes attached to the submission.

We colored in red the reviewer’s observations and in blue our answers.

Q1 -The Abstract claims the PPO agent is “up to 9.2% more effective” in reducing annual costs and carbon emissions than RBC. Please briefly state the reference (which schema) and whether this number is an average or the best-case single result. Add whether results are statistically significant or single-run numbers.

Thank you for pointing out the need for additional clarity regarding the reported 9.2% improvement in the Abstract. We have now revised the Abstract to specify that this value corresponds to the best-case result obtained in Schema 3, when comparing PPO against the rule-based controller, and that the results reflect single-run performance. At this stage statistical significance is not reported, as CityLearn training runs are deterministic given fixed seeds and the study did not include multi-run variance analysis. This clarification has been added both in the Abstract and in Section 3.7.

We have added the following paragraph: “Across the three evaluated community configurations, the PPO agent achieved its greatest improvement over a single run in the scenario where all participants were prosumers (Schema 3), with a reduction of up to 9.2% in annual costs and carbon emissions” in the abstract.
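The distinction between a best-case single-run figure and a multi-run average can be illustrated with a short calculation. All cost values below are invented placeholders chosen for the example, not results from the manuscript; the only assumption about the metric is that a reduction is measured relative to the rule-based baseline.

```python
from statistics import mean, stdev


def percent_reduction(baseline_cost, controlled_cost):
    """Relative reduction of a controller's cost versus the baseline, in percent."""
    return 100.0 * (baseline_cost - controlled_cost) / baseline_cost


# Hypothetical annual costs: one rule-based baseline and three PPO runs
# trained with different random seeds.
rbc_cost = 1000.0
ppo_costs_by_seed = [920.0, 908.0, 935.0]

reductions = [percent_reduction(rbc_cost, cost) for cost in ppo_costs_by_seed]
best_case = max(reductions)       # what an "up to X%" claim reports
average = mean(reductions)        # what a multi-run study would report
spread = stdev(reductions)        # sample standard deviation across seeds
```

An "up to" figure corresponds to `best_case`, while the average and spread across seeds are the quantities a variance analysis would add.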

 

Q2 -The Introduction motivates RL vs RBC well, but it is unclear what the novel scientific contribution is beyond applying PPO to three CityLearn schemas. Is the main novelty (a) a comparative study across community types, (b) specific reward engineering choices, or (c) uncovering conditions where RL helps vs when it doesn’t? Explicitly state the gap in literature and how this manuscript uniquely fills it.

 

We agree that the novelty of the study should be stated more explicitly. We have revised the Introduction to clarify that the primary scientific contribution is the comparative analysis of PPO-based control across three structurally different energy community configurations, something that is not typically addressed in existing CityLearn studies. The manuscript now highlights that the study also identifies the conditions under which reinforcement learning provides benefits, and those where a more traditional approach would be worth considering. Reward function engineering is not the focus of the present article, but we mention it as a direction for future work. We added the following text in the Introduction:
“The contribution of this work is the comparative evaluation of reinforcement learning and rule-based control across three heterogeneous community configurations, each with different combinations of photovoltaic generation and storage availability. This allows us to examine how the level of controllable resources influences the effectiveness of PPO, and to identify conditions under which RL provides improvements over simpler strategies. ”

 

Q3-Several RL-for-energy papers are cited, but the manuscript should compare to closely related studies that used CityLearn (and the CityLearn Challenge) or PPO in microgrid control, both qualitatively and (if possible) quantitatively. For example — how do your percent improvements compare to prior RL works on similar datasets? If there aren’t comparable metrics, state that clearly and explain why.

Thank you for this comment. We added a dedicated comparison paragraph in the Discussion section. Recent RL-based studies using PPO, DQN, and similar approaches for EC and microgrid control report improvements typically between 8.78% and 20% relative to baseline controllers. Our PPO agent’s best improvement of 9.2% falls within this range. Because different studies use different datasets, building compositions, tariff structures, and baselines, a strict one-to-one comparison is not always possible.

[1]    V. Vetter, P. Wohlgenannt, P. Kepplinger, and E. Eder, “Deep Reinforcement Learning Approaches the MILP Optimum of a Multi-Energy Optimization in Energy Communities,” Energies, vol. 18, no. 17, p. 4489, Jan. 2025, doi: 10.3390/en18174489. 

[3] Uddin, H. Mo, and D. Dong, “Real-Time Energy Management Strategies for Community Microgrids,” June 28, 2025, arXiv: arXiv:2506.22931. doi: 10.48550/arXiv.2506.22931. 

We added this paragraph to the Discussion section:
“Our results are consistent with those reported in recent literature on reinforcement learning for energy community and microgrid control [1,2,3]. Several studies using PPO or other RL algorithms on CityLearn or similar datasets report cost or emission reductions in the range of 8.78% to 20% compared to rule-based or baseline controllers. For example, [3] reports 18% lower operation costs. Studies exploring alternative RL methods (like a Deep Q Network controller) also show comparable behaviour, achieving an 8.78% reduction in operating costs relative to the baseline [1]. Direct quantitative comparison remains difficult due to differences in datasets, pricing structures, community configurations, baseline controllers, and other implementation differences; however, the improvement observed in this work fits well within the expected performance range.”

Q4-Is the PPO agent centralized (one agent sees all buildings and issues actions for each) or decentralized (one agent per building)? The text sometimes reads like centralized and sometimes like per-building control. Please state clearly: observation space size, action vector dimension, which features are global vs local, and how actions map to each battery.

Thank you for highlighting the need to clarify the control architecture. In the revised manuscript, in the new Section 3.5, we now specify that the PPO controller is implemented in a decentralized multi-agent configuration consistent with the native CityLearn framework. Each building is represented by an independent agent that observes only its own local state variables. We also state the exact observation and action space sizes for each agent: 47 (observation space) and 1 (action space).

As stated in the new section 3.5, our “PPO controller is implemented as a decentralized multi-agent system. Each building in the community is represented by an independent PPO agent that receives only its own local state observation and outputs one action corresponding to the charging or discharging power of its electrical storage system. While observations are not shared between agents, cooperation emerges through the use of a shared global reward, which encourages agents to not act selfishly, since individual actions that reduce overall grid imports improve the rewards for all agents.”

The observation vector size for each agent is 47.

Each agent can take 1 action per simulation step, which only controls the agent’s battery (if present), so each agent’s action vector size is 1.
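The interface described above can be illustrated with a minimal sketch. It is not the PPO implementation used in the study: the LocalAgent class is a hypothetical placeholder with a random policy, and the names step_community and broadcast_reward are assumptions introduced only for illustration. The sketch shows that each agent receives a 47-element local observation, returns a single action in [-1, 1], and that one global reward value is handed to every agent.

```python
import random

OBS_DIM = 47  # per-agent observation size stated in the response
ACT_DIM = 1   # one battery action per agent


class LocalAgent:
    """Hypothetical stand-in for one per-building agent (random policy, not PPO)."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def act(self, observation):
        # Each agent only ever receives its own local observation.
        if len(observation) != OBS_DIM:
            raise ValueError("expected a 47-element local observation")
        # One normalized charge/discharge action in [-1, 1].
        return [self._rng.uniform(-1.0, 1.0)]


def step_community(agents, observations):
    """Query every agent with its own observation; no observations are shared."""
    return [agent.act(obs) for agent, obs in zip(agents, observations)]


def broadcast_reward(global_reward, n_agents):
    """Give the same global reward to every agent (shared-reward assumption)."""
    return [global_reward] * n_agents
```

With five buildings, step_community returns five one-element action vectors, and broadcast_reward returns five identical reward values, one per agent.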

 

Q5-Equation (1) appears as min(-e3, 0) in the PDF (or similar). This is unclear: is the reward min(-e^3, 0)? Why cubic? Why clamp to 0? This needs a correct, precise formula, units, and rationale.

Thank you for pointing this out. Yes, the intended equation was min(-e^3, 0). The reward is a dimensionless penalty signal used solely for training and is provided by the CityLearn environment; we did not modify this built-in reward function. We have updated the text around Equation (2) to explicitly state the formula and the rationale for the cubic form and clamping.

In the text, the changes read: “r = min(-e^3, 0),    (2)

where e is net energy consumption. The cubic term increases the penalty for higher grid imports, while the reward is capped at 0 so that grid export does not artificially inflate the reward. This reward function is provided by the simulator and was not modified in this study.”
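As a small illustration of Equation (2), the reward can be written as a one-line function. This is a sketch of the formula as quoted above, not the simulator's source code; the function name cubic_import_penalty is an assumption.

```python
def cubic_import_penalty(net_energy: float) -> float:
    """Reward r = min(-e^3, 0) for a single time step.

    net_energy > 0 means net grid import, net_energy < 0 means net export.
    """
    # The cube makes large imports disproportionately costly; the min() caps
    # the reward at 0 so that exports (negative e) earn no positive reward.
    return min(-(net_energy ** 3), 0.0)
```

For example, doubling the net import from 1 to 2 makes the penalty eight times larger (-1 versus -8), while any net export yields a reward of 0.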

 

Q6 -Table entries (e.g., Building 1 PV 120 kW, battery 140 kWh for small residential) seem large for residential units. Please justify these sizing choices (source dataset? scaled units?) and ensure units/power ratings are consistent.  

The values used in this study were selected to remain within the realistic range of capacities and power ratings defined in the CityLearn Challenge 2021 dataset, which is based on the U.S. DOE End-Use Load Profiles. Although the exact numbers were not taken verbatim from any single CityLearn configuration, they were chosen to reflect typical magnitudes found in aggregated residential and commercial building clusters, which the CityLearn dataset models.

 

“All PV capacities, battery sizes, nominal power ratings, and load profiles used in the simulations originate from the CityLearn Challenge 2021 dataset or remain within its characteristic ranges. These values are based on the U.S. Department of Energy End-Use Load Profiles and represent aggregated building clusters rather than individual homes, which explains the relatively high installed capacities for buildings classified as ‘residential’ [16], [18], [36].”

“The apparent mismatch between PV capacity and battery size in several buildings reflects heterogeneous adoption patterns commonly seen in real communities [41,42]. Storage systems may be installed later than PV, sized differently due to economic constraints, or deployed as shared assets. Including such asymmetries allows the PPO agent to be tested under mixed-resource conditions [36,43].”

 

Q7- The action range is said to be [-1, 1], mapped to charge/discharge. How are these mapped to actual power (kW)? How is SOC clamping handled?  

In the revised manuscript, we have clarified the mapping between the normalized action space and physical power values, as well as the state-of-charge constraints. The normalized action is scaled by the battery’s maximum charge/discharge power. The CityLearn simulator then updates the battery state of charge while enforcing physical limits: SOC is constrained to the interval [0,1], and the effective charge/discharge rate cannot exceed the device’s power rating. These details have been added to Section 3.5 to make the action-to-power mapping and SOC clamping explicit.

“The simulator maps the normalized action to the building’s battery power.

P_battery = a × P_max                                                     (1)

where a is the normalized agent action (continuous value in the range [-1,1]), and P_max represents the maximum charge/discharge power. Each agent can take 1 action per simulation step, which only controls the agent’s battery (if present), so each agent’s action vector size is 1.

The simulator’s internal energy models calculate the actual change in the SOC based on the agent’s action, while respecting physical constraints. The SOC cannot exceed 1 (full) or drop below 0 (empty), and the rate of charge/discharge is limited by the power rating of the system.”
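The mapping in Equation (1) and the SOC limits can be sketched as follows. This is a simplified illustration and not the simulator's internal energy model: it assumes a one-hour time step, ignores charging and discharging efficiency, and treats a positive action as charging. The function name apply_battery_action and its parameters are hypothetical.

```python
def apply_battery_action(action, soc, capacity_kwh, p_max_kw, dt_h=1.0):
    """Map a normalized action to battery power and update the SOC.

    Assumptions (not taken from the simulator): positive action = charge,
    no efficiency losses, SOC expressed as a fraction of capacity.
    Returns (effective_power_kw, new_soc).
    """
    # Keep the action inside the normalized range [-1, 1].
    a = max(-1.0, min(1.0, action))
    # Equation (1): requested power is the action scaled by the power rating.
    requested_kwh = a * p_max_kw * dt_h
    # Energy that can still be stored, and energy that can still be released.
    headroom_kwh = (1.0 - soc) * capacity_kwh
    available_kwh = soc * capacity_kwh
    # Clamp the energy exchange so that SOC stays within [0, 1].
    energy_kwh = max(-available_kwh, min(headroom_kwh, requested_kwh))
    new_soc = max(0.0, min(1.0, soc + energy_kwh / capacity_kwh))
    return energy_kwh / dt_h, new_soc
```

For instance, a full-charge request (action 1.0) on a 100 kWh battery at 90% SOC with a 50 kW rating is reduced to about 10 kW, because only about 10 kWh of headroom remains.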

 

Q8-The manuscript reports percent reductions but no measures of statistical significance. Provide averages and standard deviations (or confidence intervals) across runs, and include statistical tests where appropriate. Also define metrics precisely: “Annual Cost” , what price structure (flat, time-varying)? Are costs computed only from imports or net of exports?

In the revised manuscript we clarified that simulations are deterministic under fixed seeds, and that the study adopts a single-run evaluation, noting that full statistical analysis is planned for future work. We have also introduced Section 3.7, which clearly states how all evaluation metrics are defined and computed. We use the dynamic tariff structure in CityLearn, and export compensation is not included.

“Multiple preliminary runs were executed while testing different hyperparameters, but the final results reported in this paper correspond to a run using the configuration shown in Table 5. Evaluation was performed only after the completion of the full training sequence. All computations were carried out on a standard CPU machine without hardware acceleration.

All scenarios (Grid Only, Grid + Solar with no control strategy, RBC and PPO controlled communities) were evaluated using the same year-long dataset to ensure comparability. For the Grid Only scenario, hourly energy consumption was computed as the sum of all building loads, and the annual cost was obtained by multiplying the hourly load by the corresponding tariff. Carbon emissions were computed similarly, using the hourly carbon intensity signal, and peak demand was defined as the maximum hourly grid import over the entire year.

For the Grid + Solar scenario, net demand was calculated by subtracting the PV generation from the load at each timestep (load - PV). Costs, emissions and peak demand were calculated from this net consumption.

For the RBC and PPO controlled communities, net demand was calculated as load - PV - battery discharge, where the simulator applies all physical limits on battery power and state of charge. Costs and emissions follow the same formulation as in the other scenarios. The CityLearn environment does not include export compensations, so any net export does not reduce the annual cost metric.

A fixed random seed was not enforced; however, preliminary runs produced similar results across training attempts, and convergence behaviour remained consistent.

All reported performance values correspond to single-run results. CityLearn simulations are deterministic under fixed seeds, the variability introduced by PPO is limited to neural-network initialization, and the scope of this study did not include multi-run statistical variance analysis.”
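The metric definitions quoted above can be sketched in a few lines. This is an illustration under stated assumptions, not the evaluation code of the study: the function name annual_metrics is hypothetical, battery_discharge is taken as positive when discharging and negative when charging, and emissions are computed on grid imports only, mirroring the cost formulation.

```python
def annual_metrics(load, pv, battery_discharge, price, carbon_intensity):
    """Return (cost, emissions, peak) from equal-length hourly series.

    net demand = load - PV - battery discharge. Net exports are clipped to
    zero because no export compensation is applied (an assumption for the
    emissions term, which the response says follows the cost formulation).
    """
    cost = 0.0
    emissions = 0.0
    peak = 0.0
    for l, g, b, p, ci in zip(load, pv, battery_discharge, price, carbon_intensity):
        net = l - g - b
        grid_import = max(net, 0.0)        # exports do not reduce the metrics
        cost += grid_import * p            # tariff applied per hour
        emissions += grid_import * ci      # carbon intensity applied per hour
        peak = max(peak, grid_import)      # maximum hourly grid import
    return cost, emissions, peak


def percent_reduction(baseline, value):
    """Relative improvement of `value` over `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline
```

The Grid Only case corresponds to passing zeros for pv and battery_discharge, and the Grid + Solar case to passing zeros for battery_discharge only.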

 

Q9-The energy arbitrage strategy may cause extra cycling; excluding battery degradation can bias results toward heavier battery use. Please discuss whether battery degradation costs (or cycle-life impacts) were included. If not, add as limitation and consider a variant in future work that penalizes cycles.

We agree with the reviewer that omitting battery degradation may bias the PPO agent toward more aggressive cycling behaviour. As noted in the revised manuscript, the current CityLearn environment does not simulate degradation. We have now expanded both the new Section 3.8 (Limitations and Assumptions) and the existing Section 5 (Discussion) to acknowledge this fact, and to indicate that future research will integrate degradation-aware reward terms.

 

Q10-The Discussion mentions CityLearn limitations, but expand to real-world constraints. Discuss how these factors could affect transfer from simulation to deployment.

We agree that real-world deployments introduce several practical constraints that are not captured by the CityLearn simulator. To address this, we have outlined across the article issues including the dataset source (3.3 Community Configurations and Input Data), inverter/grid interaction dynamics (3.5 State, Action and Reward Spaces, 3.6 Algorithm and Training Parameters, and 3.8 Limitations and Assumptions), regulatory constraints on tariffs (3.4 Load, PV, and Pricing Data Sources), and export policies (3.7 Training Procedure and Evaluation Setup).

 

Furthermore, we make the following comments regarding the reviewer’s suggestions:

  • partial observability

This is an ongoing project, and at this stage the power outage functionality of the simulator has not yet been used. We included it in the future work section.

  • privacy concerns, Cyber-security

We added “Privacy and security concerns are also not covered in our work; technologies such as blockchain and federated learning might be required to maintain confidentiality [24,25].” in the Discussion section.

 

  • measurement noise and  communication delays,

The values used in this study were selected to remain within the realistic range of capacities and power ratings defined in the CityLearn Challenge 2021 dataset, which is based on the U.S. DOE End-Use Load Profiles. Although the exact numbers were not taken verbatim from any single CityLearn configuration, they were chosen to reflect typical magnitudes found in aggregated residential and commercial building clusters, which the CityLearn dataset models [37]. How this data was collected was not within the scope of this paper.

 

  • inverter/grid interaction dynamics,

One of the questions that needs to be addressed is how the voltage is kept within acceptable limits while power is being injected into the grid by prosumers [22]. The statements regarding voltage deviation reduction and optimal PV placement refer to findings reported in the cited literature (new references [8,20,21], formerly [5,7]) and do not describe methods implemented in our study. Our work focuses exclusively on economic performance and carbon emissions and does not model power-flow constraints, voltage deviations, or electrical network characteristics. There is, however, one hypothesis regarding the voltage values considered in the study: a previous analysis of line voltage in Romania revealed that the values are within acceptable limits, although above the nominal voltage. Energy communities have an intrinsic ability to control potentially hazardous voltage levels, since a large proportion of the energy generated is used for own consumption [23].

 

Systems like Peer-To-Peer energy sharing and third-party energy aggregators could allow the Reinforcement Learning agent to explore more complex decisions and potentially achieve better results. Concerns like battery degradation, grid stability and peak energy consumption analysis also represent future focus points. Given the framework’s built-in power outage scenarios, such issues could be mitigated by experimenting with new reward functions.

 

 

 

  • and regulatory constraints on time-of-use pricing and feed-in tariffs.

We added “Electricity pricing follows the dynamic tariff included in the CityLearn dataset. This tariff does not provide export compensations, so grid exports do not directly reduce annual cost measurements. Hourly carbon-intensity signals are also taken from the dataset and applied consistently across all scenarios.” in Section 3.4.

 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I appreciate the authors' effort to implement the suggestions, however clarifications are still needed:

-For Q1: The economic viability of a product or service cannot be accepted by excluding the disadvantages of the adopted system. However, how do you justify that the product is economically viable from an operational point of view. Is a photovoltaic system located in an area with low solar potential economically viable?

-For Q4: Reducing carbon emissions in the operation phase of energy systems can also be achieved by disconnecting consumers from the grid and by reducing consumption as a result of the increase in energy prices, as they always pursue economic performance. I recommend that the authors develop algorithms that solve the technical problems addressed in the sources: 8, 20, 21.

-For Q7: I believe that it is not realistic to propose an electrical energy storage system without installing a renewable energy source. I recommend that the authors reconsider the system for building 3.

-For Q8: The adoption of mathematical models that are not based on an equivalent scheme, without scientific foundation, cannot constitute input data for a load curve optimization algorithm. What is the proposal to transform the real energy community into an ideal one so that PPO is applicable?

-For Q11: The loading curve shown in Fig. 6 is not found in reference [34].

Author Response

We would like to thank the reviewers for their relevant, constructive, and useful observations. We have taken them into account in the revision in order to improve the article. We also corrected and updated the paper in accordance with the Sustainability guidelines.

Once again, we are grateful for the time and effort you took to read and evaluate our work and provide specific and constructive observations. We have carefully considered your feedback and incorporated the corresponding changes into our revised manuscript, highlighting them for your reference in the version with track changes attached to the submission.

We colored in red the reviewer’s observations, in blue our answers, and highlighted in green the changes added to the manuscript.

Q1- The economic viability of a product or service cannot be accepted by excluding the disadvantages of the adopted system. However, how do you justify that the product is economically viable from an operational point of view. Is a photovoltaic system located in an area with low solar potential economically viable?

 

Thank you for your comment. We would like to clarify that this work evaluates operational economic performance (i.e., reducing annual electricity costs using different decision-making strategies) and not the full techno-economic viability of PV and storage systems (i.e., including the initial costs of setting up such a system).

Therefore:

  • We do not claim that photovoltaic systems are universally economically viable. They might be influenced by location, layout, and weather.
  • We simply evaluate which control strategy is best at reducing operational costs, given that the PV + battery systems are already in place and that their capacities are defined in the CityLearn dataset. All 3 approaches observe the same measurements.
  • The dataset used (CityLearn Challenge 2021) represents a location with moderate solar potential, and the same profiles are applied to all strategies to ensure comparability.

To make this clearer, in the revised manuscript we added: “The study evaluates operational cost only, not investment cost or full economic viability.” in Section 1 of the manuscript.

Also, we stated in Section 3.8, “capital expenditures (PV installation cost, battery cost, degradation, maintenance, etc.) were intentionally excluded because the objective of this study is to compare control strategies (PPO vs. RBC vs. no control) under identical system configurations.”

In addition, we agree that a photovoltaic system located in an area with low solar potential might not be economically viable. The results were interpreted solely in the context of this work.

The interpretation of the results (section 4) was also changed to “As expected in all our configurations (Tables 1-3)”, to make it clearer that the previous statement “in all situations” strictly refers to our configurations. The same measurements were used for evaluating all three control strategies.

Q4- Reducing carbon emissions in the operation phase of energy systems can also be achieved by disconnecting consumers from the grid and by reducing consumption as a result of the increase in energy prices, as they always pursue economic performance. I recommend that the authors develop algorithms that solve the technical problems addressed in the sources: 8, 20, 21.

 

Thank you for your comment. We agree that carbon-emission reductions can be achieved through consumer demand reduction or temporary disconnection from the grid in response to high prices. However, these strategies depend heavily on climatic, geographic, and occupant-comfort constraints and are not universally applicable. CityLearn includes a function that evaluates occupant comfort; although it is not used in our study, disconnecting buildings or consumers from the grid would affect this metric.

For example, in many climates (particularly temperate-continental regions with cold winters and hot summers) full or partial load-shedding is not feasible for long durations. Essential loads such as heating, cooling, refrigeration, or medical equipment cannot be interrupted without compromising safety, health, or well-being. For these reasons, demand-reduction strategies are highly context-dependent and cannot be applied uniformly across all buildings or communities.

With all respect for the reviewer, the goal of our work is different:
We evaluate how control strategies (PPO vs. RBC) manage available distributed energy resources (PV + storage) under identical operational conditions, without assuming that consumers can disconnect or substantially reduce essential energy use. This maintains comparability across scenarios and avoids introducing behavioural assumptions that are not modeled in the CityLearn framework.

Regarding the reviewer’s recommendation to incorporate the technical methods discussed in references [8, 20, 21], we appreciate the suggestion. These works address grid-level technical challenges such as voltage regulation, optimal PV-storage placement, and hosting-capacity enhancement, topics that are highly relevant for future expansion of our research. However, they fall outside the scope of the present study, which focuses solely on operational energy management given fixed infrastructure.

 

To make it clearer, we extended the future work section to include:

“Concerns like battery degradation, grid stability, voltage regulation, optimal PV-storage placement and peak energy consumption analysis also represent future focus points. Given the framework’s built-in power outage scenarios, such issues could be mitigated by experimenting with new reward functions.”

 

 

Q7- I believe that it is not realistic to propose an electrical energy storage system without installing a renewable energy source. I recommend that the authors reconsider the system for building 3.

 

Thank you for your comment. Building 3 in Schema 1 was intentionally designed as a storage-only participant to reflect emerging real-world models such as Energy Storage as a Service (ESaaS) and community battery programs, which are increasingly deployed even in the absence of on-site renewable generation.

 

We also noticed an error in our description of Schema 1, where buildings 4 and 3 were inverted; this has now been corrected in the latest version of the manuscript:

“Building 1 is a regular prosumer, with a battery and PV, buildings 2 and 5 represent typical consumers, with neither solar nor storage capabilities, while building 4 only has PV, and building 3 only storage capabilities. Building 3 includes only an energy storage system, without local PV generation.”

Our manuscript already cites concrete examples of such systems:

  • [38] J. Arteaga, H. Zareipour, and N. Amjady, “Energy Storage as a Service: Optimal sizing for Transmission Congestion Relief,” Appl. Energy, vol. 298, p. 117095, Sept. 2021, doi: 10.1016/j.apenergy.2021.117095.
  • [39] A. Ramos, M. Tuovinen, and M. Ala-Juusela, “Battery Energy Storage System (BESS) as a service in Finland: Business model and regulatory challenges,” J. Energy Storage, vol. 40, p. 102720, Aug. 2021, doi: 10.1016/j.est.2021.102720.

As discussed in the reviewed scientific literature (Ref. 38, Ref. 39):

  • Community and shared storage systems are becoming common in Europe, allowing consumers without PV to benefit from load shifting and grid-flexibility services.
  • Behind-the-meter commercial storage can operate independently of local PV for peak-shaving, arbitrage, and resilience support.
  • Battery Energy Storage Systems as a Service (BESSaaS) business models allow customers to access storage capacity without owning PV or storage assets, through aggregator-managed schemes.

These examples demonstrate that storage-only assets are both realistic and already implemented in practice.

To give a clearer picture of why BESSaaS can be useful, we extended Section 3.3 to include: “For example, in [38] the idea of energy storage as a service is used to provide grid-level flexibility, and in [39] such a system is used in Finland to provide battery energy storage as a service. This reflects emerging real-world models such as Battery Storage Systems as a Service (BESSaaS) and community battery programs, which are increasingly deployed even in the absence of on-site renewable generation.”

Q8-The adoption of mathematical models that are not based on an equivalent scheme, without scientific foundation, cannot constitute input data for a load curve optimization algorithm. What is the proposal to transform the real energy community into an ideal one so that PPO is applicable?

Thank you for your comment. With all respect for the reviewer, we believe there may be a misunderstanding regarding the source and scientific basis of the data used in this study. The load curves, PV generation profiles, weather signals, pricing data, and carbon-intensity signals are not synthetic or arbitrary mathematical models created by the authors. All input data is provided directly by the CityLearn Challenge 2021 dataset, which is built from the U.S. Department of Energy’s End-Use Load Profiles for the U.S. Building Stock (EULP). These profiles are scientifically validated, widely used in peer-reviewed research (please see [16, 18, 35, 36, 37, 40]), and represent realistic building energy consumption and renewable generation patterns.

 

The simplifications described in Section 3.8 refer only to the electrical network physics (e.g., the omission of voltage constraints, reactive power, or explicit power-flow modeling), not to the underlying input data. The simulation framework preserves realistic temporal and meteorological dynamics while abstracting away low-level grid constraints in order to focus on the evaluation of control strategies. As far as we know, this abstraction is a common practice in reinforcement-learning research for building energy management.

 

Regarding the question of “transforming a real community into an ideal one,” our study does not attempt to idealize or modify a real energy community. Instead, PPO is applied directly to the CityLearn environment as provided. The purpose is to test how an RL controller performs under standardized, realistic, and reproducible operating conditions, where all strategies (Grid-only, Grid + Solar, RBC, PPO) interact with identical building profiles and exogenous signals.

To make this clearer, we added: “This abstraction applies only to the electrical network physics and does not affect the validity of the input data. All load, photovoltaic generation, weather, pricing, and carbon-intensity profiles used in this study originate from the CityLearn Challenge 2021 dataset.” in Section 3.8.

 

Q11- The loading curve shown in Fig. 6 is not found in reference [34].

Thank you for the observation. Figure 6 is not intended to reproduce a figure from Reference [34]; instead, it illustrates a sample daily load pattern for a commercial building taken directly from the CityLearn Challenge 2021 dataset used in this work. To avoid ambiguity, we specified the exact source of the data in the revised manuscript. We updated the figure’s caption and included a footnote with a direct link to the GitHub source which contains the measurements used.

“Sample load profile over 24 hours.” (footnote 8)

(Footnote 8) “Taken from building 3, schema 3. Measurements and values provided by the CityLearn simulator. The measurements used can be seen at https://github.com/OogaBooga21/EC_RL/blob/main/scen3/Building_3.csv”

 

 

Reviewer 2 Report

Comments and Suggestions for Authors

The article has been substantially revised taking into account all comments and suggestions, so I recommend this article for publication.

Author Response

We would like to thank you for appreciating our work, and thank you for the constructive comments.

Reviewer 3 Report

Comments and Suggestions for Authors

Authors have made significant changes. I have no further comments.

Author Response

We would like to thank you for appreciating our work, and thank you for the constructive comments.