Reinforcement Learning-Based Energy Management in Community Microgrids: A Comparative Study

Moga, Olimpiu Nicolae; Florea, Adrian; Solea, Claudiu; Vintan, Maria

doi:10.3390/su172310696

Open AccessArticle

Reinforcement Learning-Based Energy Management in Community Microgrids: A Comparative Study

Computer Science and Electrical Engineering Department, Lucian Blaga University of Sibiu, 550024 Sibiu, Romania

^*

Author to whom correspondence should be addressed.

Sustainability 2025, 17(23), 10696; https://doi.org/10.3390/su172310696

Submission received: 1 November 2025 / Revised: 25 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025

(This article belongs to the Special Issue Energy Transition and the Collaborative Governance for Reduction of Pollution and Carbon Emissions)

Download

Browse Figures

Review Reports Versions Notes

Abstract

Energy communities represent an important step towards clean energy; however, their management is a complex task due to various factors such as fluctuating demand and energy prices, variable renewable generation, and external factors such as power outages. This paper investigates the effectiveness of a Reinforcement Learning agent, based on the Proximal Policy Optimisation (PPO) algorithm, for energy management across three different energy community configurations. The performance of the PPO agent is compared against a Rule-Based Controller (RBC) and a baseline scenario using solar generation but no active management. Simulations were run in the CityLearn framework to simulate real world data. Across the three evaluated community configurations, the PPO agent achieved its greatest improvement over a single run in the scenario where all participants were prosumers (Schema 3), with a reduction of 9.2% in annual costs and carbon emissions. The main contribution of this work is demonstrating the viability of Reinforcement Learning agents in energy optimization problems, providing an alternative to traditional RBCs for energy communities.

Keywords:

energy community; microgrid; PPO algorithm; energy efficiency; renewable energy

1. Introduction

The energy sector is moving away from centralized, fossil-based generation, towards decentralized renewable-driven systems. This transition is made possible by the growing availability of small-scale generation technologies like rooftop solar panels, which allow consumers to become active participants in the energy system. In this context, energy communities have emerged as local collectives that can produce, consume, store, and manage energy together [1,2]. However, setting up and managing such a community is a complex task. Members may have conflicting priorities, energy consumption patterns, renewable energy production is not constant, and the regulatory and pricing environment is still evolving. Traditional approaches using RBCs can be too rigid in such dynamic environments [3,4].

This paper explores the potential of Reinforcement Learning as an adaptive solution for energy management optimization in energy communities. The study evaluates operational cost only, not investment cost or full economic viability. The advantage of reinforcement learning agents is their ability to learn optimal strategies through direct interaction with the environment, without requiring a complete mathematical model of the system, allowing them to learn an intelligent control strategy which can maximize the economic and environmental benefits of integrating local renewable energy resources [5,6].

To evaluate the potential of reinforcement learning under realistic conditions, this study examines three diverse community schemas. Each schema is meant to represent a different scenario in which an energy community might find itself in. The first schema represents a heterogeneous community, with both prosumers and consumers, meant to simulate more diverse and non-ideal configurations. The second schema represents a small community with a large producer, meant to simulate small communities which include a commercial space, with lower overnight power consumption, different consumption patterns and production capabilities. And the third configuration is similar to the second; however, it represents an ideal energy community in which all members are prosumers, with the ability to store energy.

The contribution of this work is the comparative evaluation of reinforcement learning and rule-based control across three heterogeneous community configurations, each with different combinations of photovoltaic generation and storage availability. This allows us to examine how the level of controllable resources influences the effectiveness of PPO, and to identify conditions under which RL provides improvements over simpler strategies.

The rest of this paper is structured as follows. Section 2 reviews the state of the art and related work in microgrid optimization and energy management. Section 3 describes the methodology used, presenting the three energy communities scenarios used, the CityLearn simulation environment, and the implementation of the PPO agent. Section 4 presents and compares results obtained by the PPO, the RBC, and baseline scenario. Section 5 discusses these results, the limitations of this study, highlights the conclusions, and presents future research directions.

2. Key Research Directions in Microgrid and Energy Community Management

2.1. Microgrid and Energy Storage Optimization

Modern research focuses on optimizing microgrid configurations to balance economic viability and operational ability. Often targeting commercial or remote applications [4,7]. These models usually focus on minimizing net present cost and carbon emissions while ensuring demand is met, either through backup generators or smart battery state management [7,8,9,10]. These directions represent the broader research landscape; however, the present study focuses exclusively on the economic viability, without modelling operational reliability constraints such as outages, inverter limits or degradation. Real-time energy monitoring is a key component of Industry 5.0’s energy efficiency efforts [11]. Smart Grids leverage IoT sensors to capture and analyze energy consumption data in real time, enabling a deeper understanding of energy usage patterns and identifying areas of potential energy wastage. Smart Grids are equipped with IoT sensors that are installed at various points within the grid, including households, businesses, and industrial facilities [12,13]. These sensors continuously collect data on energy consumption, such as electricity usage, voltage levels, and power quality parameters.

The IoT sensors transmit the collected data in real time to a centralized monitoring system. This data includes information on energy consumption patterns, usage trends, and peak demand periods [14]. It provides a comprehensive view of the energy flow throughout the grid and identifies areas where energy efficiency improvements can be made. The real-time energy consumption data is analyzed using data analytics techniques, including machine learning and AI algorithms. These algorithms identify patterns, anomalies, and correlations within the data, enabling insights into energy usage behaviours and potential areas of energy wastage. Real-time energy monitoring helps in identifying peak demand periods accurately. With this information, energy providers can implement demand-response strategies to manage peak loads efficiently. By incentivizing consumers to shift their energy usage to off-peak hours or adjust their consumption during peak periods, the grid’s overall energy efficiency can be improved [15].

The insights derived from real-time energy monitoring can be used to provide personalized recommendations and feedback to consumers. By analyzing individual energy consumption patterns, consumers can receive suggestions on how to reduce energy waste and optimize their usage. This can include tips on adjusting thermostat settings, optimizing lighting usage, or upgrading to energy-efficient appliances.

Overall, real-time energy monitoring enables utilities, businesses, and consumers to make data-driven decisions to improve energy efficiency. By identifying energy usage patterns, peak demand periods, and causes of power losses, real-time monitoring allows targeted interventions and optimization strategies that contribute to a more sustainable and efficient energy ecosystem [11,13,15].

In recent years, there has been an increasing trend of production capacities from renewable sources, i.e., wind and solar power plants. Recently, various research has been performed on different types of floating offshore wind turbines and floating photovoltaic energy yield and performance models, designing optimized structural platforms for climate-resilient systems. Thus, in the technological mix of the electricity production system, the share of energy produced from renewable sources increases. The problem that arises in the case of energy produced from renewable sources is that it has an intermittent character, that is, at night or when the wind is not blowing, the energy production is minimal, and if the sun shines brightly or the wind blows, a large amount of energy is produced, sometimes even an amount that cannot be consumed at that moment [16]. For this reason, capacities are needed to ensure the flexibility of the power system and to be able to respond quickly to the intermittency of wind and solar radiation, i.e., when too much energy is produced, to store the excess energy, and when the production is minimal, to release the stored energy. A solution would be the use of technologies for energy storage, such as high-capacity batteries, the most common and practically deployable solution to support renewable integration and local flexibility [8,17,18].

2.2. Grid Robustness and Resilience

Grid robustness and resilience have also been notable areas of research in recent studies regarding modern power systems. In the context of energy communities, robustness generally refers to the energy system’s ability to maintain stable functionality under expected conditions without performance degradation. Resilience generally refers to an energy system’s ability to recover from or resist more rare or extreme conditions, such as natural disasters or large-scale equipment failure [10].

As small-scale renewable energy solutions become more popular, traditional centralized control strategies may struggle to fulfil these requirements, making local or community level control strategies necessary [19]. These community level control strategies can improve voltage stability, reduce stress on the distribution network and maintain functionality during faults by taking advantage of techniques, such as intelligent scheduling, load shifting and distributed storage coordination.

Grid resilience research is also being conducted on the physical side of energy communities and microgrids implementations. Bi-directional power flow, voltage deviations, and network losses are being addressed through energy transaction algorithms and efficient PV placement [8,20,21]. Our work focuses exclusively on economic performance and carbon emissions and does not model power-flow constraints, voltage deviations, or electrical network characteristics. Indeed, one of the questions that needs to be addressed is how the voltage is kept within acceptable limits while power is being injected into the grid by prosumers [22]. There is one hypothesis regarding the voltage values considered in the study. A previous analysis of line voltage in Romania has revealed that the values are within acceptable limits, although above the nominal voltage. The energy communities have an intrinsic characteristic to control potentially hazardous voltage levels, a large proportion of the energy generated being for own consumption [23]. The other strategies are part of the existing body of peak-load management methods and are discussed solely to contextualize the motivation for using reinforcement learning.

Physical resilience measures, such as grid segmentation, islanding, backup generators, additional power lines, and communication redundancy, remain popular; however, recent research hints towards a shift from traditional hardware reinforcement to resilience through control strategies. Adaptive and A.I. controllers can detect abnormalities and recover easier from failures by dynamically configuring microgrids and managing distributed storage. Together, these studies suggest that resilience and robustness cannot rely only on physical redundancy, but must also incorporate intelligent coordination.

2.3. Transactive Energy and Market-Based Coordination

Transactive Energy Frameworks (TEFs) are a modern approach to coordinate distributed energy resources and improve local autonomy by facilitating structured interactions between prosumers, aggregators and the main grid [13,21]. By dynamically pricing electricity based on supply and demand conditions, these frameworks encourage participants to adjust consumption, generation, and storage schedules in a way that benefits both the local community and the grid operator [24,25].

In such systems, coordination mechanisms are varied. Peep-to-peer (P2P) trading is one such coordination mechanism, which allows for direct energy trades between members of a community, either for free or at a discounted rate, to encourage consumption of self-generated energy and discourage dependency on the grid [7,20]. Another example of a TEF are aggregator-based structures, which uses a central agent to manage energy transactions within a community, and can offer better scalability than P2P, requiring less infrastructure to be implemented. The use of blockchain technology and smart-contract platforms have also been proposed to record and validate transactions within the community, ensuring transparency and trust without the need of a central intermediary [10,19].

However, implementing TEFs comes at the cost of new operational and computational challenges. Managing a large number of transactions, forecasting demand and generation, and determining fair and stable prices require adaptive control. Reinforcement learning agents have shown strong potential in bidding automation, pricing, and scheduling decisions by learning optimal policies directly from market feedback [26,27]. Multi-agent reinforcement learning (MARL) techniques are proposed to enable decentralized negotiation between members of a community, leading to faster convergence without the need of a central coordinator. Despite these advances, several challenges remain. Designing fair compensation and incentive mechanisms and interoperability of heterogeneous members are still active areas of research, especially when integrating services such as demand response and storage dispatch in community-level markets [28].

Research suggests that combining market-based coordination with adaptable strategies, such as the PPO agent explored in this study, could provide a balanced framework which merges economic efficiency with operational adaptability [29,30].

2.4. Challenges

Energy communities are often made up of many actors, each with their own infrastructure, usage patterns, demands, and priorities. Within a single community, some buildings might have solar panels, and batteries, while others are supplied only from the grid. Some members might have fixed consumption patterns while for others these vary. Weather variability is another challenge, most renewable energy in these communities come from solar panels, which only produce energy during the day and are very sensitive to weather fluctuations. Coordinating the operation of multiple distributed storage devices across a long time horizon results in a very large search space. The scheduling problem becomes NP-hard, meaning that the number of possible control actions grows exponentially with the number of buildings, devices, and time steps [31]. This motivates the use of reinforcement learning or genetic algorithms, which can approximate near-optimal solutions without a complete state space search.

Even if technical solutions are in place, the problem of coordination remains. The members of the community have different production capabilities, load demands, and consumption patterns. Prosumers, aggregators, and collaborative networks of prosumers play an important role in ensuring resilience and in exploiting renewable energy sources in such a way as to achieve sovereignty over critical resources for the future [12,27].

Due to the complexity and variability of managing energy communities, traditional rule-based approaches might be hard to build or have poor performance. Reinforcement Learning learns from interacting with the environment, without requiring a full model of the system, learning directly from data, making it a fitting solution for systems with unpredictable dynamics.

While existing studies have explored microgrid optimization, storage strategies, and even reinforcement learning for demand response, several limitations remain. Many studies rely on rule-based controllers, which cannot adapt dynamically to uncertain conditions. Other studies apply reinforcement learning but focus narrowly on a single objective, such as cost minimization, without considering the impact on emissions, grid stability, or fairness. Comparative studies across heterogeneous communities, which include transactional systems like peer-to-peer energy sharing, and the use of aggregators are also scarce. In addition to rule-based approaches, many studies have applied multi-objective optimization algorithms, such as genetic algorithms or Pareto-based methods for P2P transactions in energy communities (e.g., NSGA-II). These methods are effective in exploring trade-offs between costs, emissions, and stability, but they require extensive offline computation and rely on predefined objective functions, making them more challenging to implement. Our work aims to fill this gap by testing a PPO-based reinforcement learning agent for three representative community schemas, and comparing the agent’s performance against traditional rule-based controllers [32,33].

3. Materials and Methods

3.1. Overview of Methodology

The workflow of this study consists of defining the community schemas, setting up the simulation environment, implementing the rule-based and reinforcement learning controller, training the PPO agent over a full-year dataset, and comparing the resulting system performance in terms of annual costs, energy consumption, carbon emissions, and peak power demand. All scenarios use identical load, PV, and pricing inputs to ensure comparability.

The main contribution of the methodology is the comparative analysis across heterogeneous community layouts, allowing us to examine how the effectiveness of PPO depends on the amount of controllable resources available in each schema. This approach highlights when a RL-based controller provides measurable improvements over a rule-based strategy, and when a more traditional approach is worth considering.

The primary research objective of this work is to explore the viability and effectiveness of Reinforcement Learning as an energy management system for energy communities with access to renewable energy.

PPO was selected for its stability in continuous action spaces and its suitability for long-horizon optimization. The agent interacts with the environment (referring to the simulation environment, not the real power grid) at each timestep, receives a state observation, and updates its policy through gradient-based optimization [5]. RL learns control policies directly from interaction with the environment, without requiring a full mathematical model of the system.

3.2. Simulation Environment

The environment in which our RL agent trains is CityLearn (The CityLearn repository is publicly available at https://github.com/intelligent-environments-lab/CityLearn, accessed on 1 November 2025) [34,35]. CityLearn is an open-source environment built on the OpenAI Gym interface, developed for testing energy management strategies in building clusters and urban energy communities. It includes components that simulate building electricity demand, photovoltaic generation, energy storage systems, and dynamic grid pricing. This framework provides realistic datasets based on real measurements and supports both centralized and multi-agent control schemes.

The framework utilizes datasets derived from the U.S. Department of Energy’s End-User Load Profiles for the U.S. Building Stock, which incorporates both measured and statistically generated consumption and weather patterns [3,36]. These profiles capture realistic seasonal and daily variations in usage patterns of space heating, space cooling, domestic hot water (DHW), and general appliance and lighting loads (see Figure 1 for a sample profile). Similarly, electricity pricing and carbon intensity signals are predefined within the CityLearn Challenge 2021 dataset and remain consistent across all simulation runs to ensure a fair comparison.

The environment provides each building’s state observation at each timestep, including load, PV generation, storage levels, outdoor temperature, and predicted weather variables. Actions are applied by the controller and the environment advances, following a standard reinforcement learning loop.

In our implementation, the three community schemas defined in Section 3.1 were configured through the framework’s JSON environment files by specifying the number of buildings, their generation and storage capacities, and load profiles. The chosen schemas were inspired by the values given in the CityLearn Challenge 2021 datasets [37]. Since CityLearn is open-source, it can be extended by modifying or adding environment components, such as custom reward functions, new state variables, or peer-to-peer trading mechanisms, allowing for future work involving transactive energy systems.

Figure 2 illustrates the general flow of reinforcement learning training. The process begins when the environment provides the agent with a state observation (Information describing the current state of the environment, this can be current demand, solar generation, electricity prices, etc.), based on this, the RL agent’s neural network generates an action (A value in the range [−1, 1], which determines whether to charge, discharge or remain idle). This action is then applied to the simulated environment, which then updates its state, returning a reward signal (A numeric value given by the reward function which helps guide the RL agent’s learning process) and a new state observation. This cycle continues for every step in the simulation, until the episode is done, allowing the agents to learn optimal strategies through experience rather than explicit programming.

By default, each building operates independently, drawing electricity from the grid whenever local PV generation or battery storage cannot meet its demand. This approach ensures the demand is met but results in high grid dependency, energy costs and emissions. To lower energy costs and carbon emissions, control strategies can be implemented to manage how and when each building uses or stores energy. In this study, two control strategies are evaluated and compared against a non-controlled baseline, where solar energy is used directly. The first strategy is a Rule-Based Controller (RBC), which follows a fixed schedule for charging and discharging. The second is a reinforcement learning proximal policy approximation (PPO) agent, which learns optimal control strategies through interaction with the environment. An example of how a control strategy could lower costs is energy arbitrage, taking advantage of lower energy prices during the night to charge the battery, and discharge it when energy is more costly during the day. Unfortunately, this scenario is not available in every country, differentiated prices at different times of the day not yet being regulated [27].

CityLearn has been used extensively in the CityLearn Challenge and in recent peer-reviewed studies as a benchmark environment for testing RL-based demand and response and energy flexibility strategies [37]. Its standardized structure ensures that the performance of different control strategies can be compared consistently.

3.3. Community Configurations and Input Data

The study evaluates three distinct community configurations, each defined by the availability of photovoltaic (PV) generation and battery storage systems. The first configuration represents a residential block with mixed PV and storage availability, the second combines residential and commercial buildings, with a big producer as a new actor, and the third assumes a best-case scenario in which all members of the community are prosumers.

Schema 1, seen in Figure 3 and described in Table 1, models a community of 5 residential buildings. Building 1 is a regular prosumer, with a battery and PV, buildings 2 and 5 represent a typical consumer, with neither solar or storage capabilities, while building 4 only has PV, and building 3 only storage capabilities. Building 3 includes only an energy storage system, without local PV generation. This was introduced to test whether a storage-only participant could provide value through energy arbitrage or load balancing within the community, as well as for simulating future models for actors which participate in demand-shifting or “energy storage as a service” arrangements. For example in [38] where the idea of BESSaaS is implemented, and used to provide grid-level flexibility, and [39] such a system is used in Finland to provide energy storage as a service. This reflects emerging real-world models such as Battery Storage Systems as a Service (BESSaaS) and community battery programmes, which are increasingly deployed even in the absence of on-site renewable generation.

Schema 2 simulates a smaller community of 3 entities, 2 regular prosumers and a third entity with a larger power consumption and generation capacity, but without a battery (see Figure 4 and Table 2).

Schema 3 is similar to schema 2, but the third building has a battery. It simulates an ideal case, a small community in which all participants are prosumers (see Figure 5 and Table 3).

All PV capacities, battery sizes, nominal power ratings, and load profiles used in the simulations originate from the CityLearn Challenge 2021 dataset or remain within its characteristic ranges. These values are based on the U.S. Department of Energy End-Use Load Profiles and represent aggregated building clusters rather than individual homes, which explains the relatively high installed capacities for buildings classified as “residential” [16,18,40].

The apparent mismatch between PV capacity and battery size in several buildings reflects heterogeneous adoption patterns commonly seen in real communities [41,42]. Storage systems may be installed later than PV, sized differently due to economic constraints, or deployed as shared assets, including such asymmetries allows the PPO agent to be tested under mixed-resource conditions [36,43].

3.4. Load, PV, and Pricing Data Sources

Load and weather profiles are provided from the CityLearn simulator, and represent aggregated residential and commercial building clusters. This data includes typical seasonal and daily variations in heating, cooling, domestic hot water, and appliance usage. PV generation profiles are derived from the same dataset using historical irradiance and temperature conditions.

Electricity pricing follows the dynamic tariff included in the CityLearn dataset [37]. This tariff does not provide export compensations, so grid exports do not directly reduce annual cost measurements. Hourly carbon-intensity signals are also taken from the dataset and applied consistently across all scenarios.

No additional preprocessing was performed apart from the standard normalization applied internally by CityLearn for observation scaling.

3.5. State, Action and Reward Spaces

The PPO controller is implemented as a decentralized multi-agent system. Each building in the community is represented by an independent PPO agent that receives only its own local state observation and outputs one action corresponding to the charging or discharging power of its electrical storage system. While observations are not shared between agents, cooperation emerges through the use of a shared global reward, which encourages agents to not act selfishly, since individual actions that reduce overall grid imports improve the rewards for all agents.

In our case, a state is represented by a series of values which describe the environment at each step, i.e., current and predicted electricity prices, carbon footprint metrics, outdoor temperature and energy loads (see Table 4). The observation vector size for each agent is 47.

A possible action in this environment represents a decision to charge or discharge an energy storage device and how fast. The simulator maps the normalized action to the building’s battery power.

P_{b a t t e r y} = a \times P_{m a x}

(1)

where a is the normalized agent action (continuous value in the range [−1, 1]), and P_max represents the maximum charge/discharge power. Each agent can take 1 action per simulation step, which only controls the agent’s battery (if present), so each agent’s action vector size is 1.

The simulator’s internal energy models calculate the actual change in the SOC based on the agent’s action, while respecting physical constraints. The SOC cannot exceed 1 (full) or drop below (0), and the rate or charge/discharge is limited by the power rating of the system.

The reward function in training RL is designed to guide the agent towards a desired behaviour. Initial experiments focused on minimizing wasted energy generated by PV; however, this caused the agent to only charge the battery, never actually using it in fear of wasting some solar energy. When the reward function was changed to focus on decreasing the overall energy consumption the agent learned a more balanced strategy. The exact formula used to calculate the reward was:

r = m i n (- e^{3}, 0)

(2)

where e is net energy consumption. The cubic term increases the penalty for higher grid imports, while the reward is capped at 0 so that grid export does not artificially inflate the reward. This reward function is provided by the simulator and was not modified in this study.

3.6. Algorithm and Training Parameters

Figure 1 and Figure 6 give a more detailed energy profile for a generic actor in a community. Generally, energy usage starts at ~07:00 and continues well into the evening. And the solar generation starts a bit earlier at ~06:00, with a drastic decrease at ~14:00. Provided with this information, the RBC was set to charge as fast as possible when the energy is the cheapest, during the night (22:00–06:00), to rely on solar generation at peak hours (07:00–14:00), and to discharge during the evening (14:00–22:00) when the solar generation cannot cover the consumption anymore.

Reinforcement Learning (RL) is a machine learning paradigm that fits problems which involve sequential decision making in uncertain environments. RL agents learn from feedback given from the environment to the agent in the form of rewards or punishments. In energy communities, the environment could represent the electrical grid, buildings, storage systems, and solar panels, while rewards could be represented by cost savings, carbon footprint reduction or meeting demand. What makes RL a good fit for this task is its ability to adapt to dynamic or incomplete systems, weather patterns, consumption habits, and system constraints. For example, an agent could learn when to charge or discharge a battery depending on predicted solar generation and user demand.

We chose PPO as the RL algorithm due to its ability to work with environments with continuous action spaces and robustness during training. PPO’s implementation prevents the sudden performance drops which appear due to large updates present with other RL implementations. We used the standard PPO implementation from the Stable Baselines3 library, 2 fully connected layers of 64 neurons each (see Figure 7). Given the tabular nature of the data, no convolutional layers were used, because the improvement in the feature extraction does not compensate for the additional computational cost. Similar architectures have been applied to energy management and demand-response problems, showing PPO’s stability and sample efficiency in continuous microgrid environments [9].

The simulation and decision-making process follow the steps shown in Figure 8. It gets an observation from the environment, this is passed through the agent’s neural network, and the dispatcher selects an action using the neural network’s output. After selecting an action, the environment is updated using the selected action, and the reward is calculated using the reward function, which helps the agent’s network to make better decisions.

3.7. Training Procedure and Evaluation Setup

The PPO agent was trained for 400,000 environment steps, where each step corresponds to one simulated hour. This choice reflects a balance between learning stability and overfitting, the agent continued to improve up to approximately 500,000–700,000 steps, its performance degraded beyond this interval, and 400,000 steps provided the most stable and generalisable behaviour, this can be observed in similar work [1]. Training was performed continuously over the full-year horizon in a rolling-window fashion, without shorter episodes or resets.

Multiple preliminary runs were executed while testing different hyperparameters, but the final results reported in this paper correspond to a run using the configuration shown in Table 5. Evaluation was performed only after the completion of the full training sequence. All computations were carried out on a standard CPU machine without hardware acceleration.

All scenarios (Grid Only, Grid + Solar with no control strategy, RBC and PPO controlled communities) were evaluated using the same year-long dataset to ensure comparability. For the “Grid Only” hourly energy consumption was computed as the sum of all building loads, and the annual cost was obtained by multiplying the hourly load by the corresponding tariff. Carbon emissions were computed similarly, using the hourly carbon intensity signal, and peak demand was defined as the maximum hourly grid import over the entire year.

For the “Grid + Solar” scenario, net demand was calculated by subtracting the PC generation from the load at each timestep (load-PV). Costs, emissions and peak demand were calculated from this net consumption.

For the RBC and PPO controlled communities, net demand was calculated as load-PV-battery discharge, where the simulator applies all physical limits on battery power and state of charge. Costs and emissions follow the same formulation as in the other scenarios. The CityLearn environment does not include export compensations, so any net export does not reduce the annual cost metric.

A fixed random seed was not enforced; however, preliminary runs produced similar results across training attempts, and convergence behaviour remained consistent.

All reported performance values correspond to single-run results. CityLearn simulations are deterministic under fixed seeds, the variability introduced by PPO is limited to neural-network initialization, and the scope of this study did not include multi-run statistical variance analysis.

3.8. Limitations and Assumptions

The simulation framework used in this study is subject to several assumptions and modelling constraints. CityLearn abstracts away from electrical network details such as voltage levels, line constraints, power-flow limitations, and inverter behaviour. As a result, the results reflect an idealized energy community without grid congestion, reactive power effects, or conversion losses beyond the simplified storage efficiency model implemented internally by the environment. This abstraction applies only to the electrical network physics and does not affect the validity of the input data. All load, photovoltaic generation, weather, pricing, and carbon-intensity profiles used in this study originate from the CityLearn Challenge 2021 dataset.

Battery degradation is not modelled, and the agent therefore operates under the assumption of a constant usable capacity and no cycle-life penalties. Real-world storage owners may limit cycling frequency for economic reasons, which could influence the practicality of some dispatch patterns learned by PPO.

All buildings are assumed to have perfect state observability, and forecasts for load, PV generation and weather conditions are not modelled, similarly sensor errors or communication delays are not considered. Occupant behaviour and appliance usage follow the fixed profiles embedded in the dataset and do not adapt to control actions or tariff changes.

The economic model does not simulate export compensation, and dynamic prices and carbon intensities follow predefined dataset signals. These assumptions allow for a consistent comparison across control strategies but may differ from tariff structures encountered in real energy communities.

Capital expenditures (PV installation cost, battery cost, degradation, maintenance, etc.) were intentionally excluded because the objective of this study is to compare control strategies (PPO vs. RBC vs. no control) under identical system configurations.

Finally, PPO was evaluated using a single run, without enforcing a fixed random seed. Although preliminary runs produced similar behaviour, small variations in learned policies could occur between training attempts.

4. Results

For each of our experimental setups four scenarios were considered in order to minimize the impact of each change. A grid-only scenario was used to show the raw energy needs of the community. A second scenario was a community which has access to solar panels but no control strategies, representing a passive system where buildings consume energy from the local PV and the remaining demand is supplied by the grid, which we used as a baseline to test control strategies against each other. The remaining two scenarios represent the same community with access to PV, storage capabilities, and two control strategies, a RBC and a PPO agent, respectively.

The performance of the control strategies is measured based on economic, stability, and environmental factors. We focus on the total amount of energy used, its cost, the amount of carbon emissions generated, and the peak energy consumption at the end of a one-year simulation period. All economic indicators in this study refer strictly to electricity costs from energy consumption, not accounting for installation of hardware investment costs. All percentage improvements reported below are computed relative to the corresponding baseline within the same schema, using the formula:

Percent reduction = \frac{M_{A} - M_{B}}{M_{A}} \times 100,

(3)

where

M_{A}

is the baseline value, and

M_{B}

is the value we compare with.

As expected in all situations considering our configurations (Table 1, Table 2 and Table 3), simply adding solar energy to the community greatly diminishes economic, and environmental impacts. When compared to the “Grid-only” baseline, the “Grid + Solar” configuration reduced annual costs by 72.4%, 87.2% and 84.7%, and reduced the amount of carbon emissions by 62.4%, 82.9%, and 79.5% in each schema, respectively.

The improvements depend on how many resources the RL agent is able to control. For Schema 1, the agent can control 3 out of 5 buildings, since only 3 buildings in the community had batteries. When compared to the “Grid + Solar” scenario, the PPO-controlled community achieved a 2.5% reduction in annual costs and carbon emissions (see Table 6, Figure 9), while the RBC-controlled community performed 1.9% worse. This demonstrates that an improper control strategy can degrade performance in certain conditions. Overall, the PPO agent was able to lower energy consumption and carbon footprint by 4.4% more than the RBC (calculated relative to the common “Grid + Solar” reference).

For schema 2, even though the agent had control over 2 out of 3 buildings, most of the energy available for the community is produced by the largest entity, which the agent cannot control. We see that in this case, none of the controllers could provide better performance than the “Grid + Solar” scenario. The PPO-controlled community achieved a 1.28% higher power consumption, and the RBC-controlled community achieved a 3.96% higher energy consumption. In such communities investing in any form of control system only brings the local distribution system operator (DSO) the advantage of a lower peak power consumption (see Table 7, Figure 10). While PPO remains superior to RBC in cost and carbon footprint reduction, the best solution in this case would be to go with a simpler system, of just using the power generated by the PV without storing anything.

In Schema 3, where all actors in the community had energy storage options, and the PPO agent was able to manage resources, we see a similar result to Schema 1. When comparing a scenario in which the PPO agent controls resources against a “Grid + Solar” baseline, we see a 4.3% reduction in costs and carbon emissions, and a 9.2% performance difference between PPO control and RBC control (see Table 8, Figure 11) (when comparing both against the “Grid + Solar” baseline).

One disadvantage of the PPO-controlled community is that given the current configuration and implementation it consistently registers higher peak power consumption due to the nature of the reward function, which does not punish the agent for reaching high power peaks.

5. Discussion

Results show that while simply integrating solar energy is highly effective, the choice of a control strategy can further improve economic and environmental outcomes. The choice of control strategy should depend on the layout of the energy community and its complexity. Reinforcement Learning is best suited for variable environments with controllable resources and rule-based approaches are more efficient in simpler or less flexible systems. The PPO agent consistently showed better performance for minimizing costs and carbon emissions compared to the RBC in the scenarios, provided it had enough control over the communities’ resources.

Our results are consistent with those reported in the recent literature on reinforcement learning for energy community and microgrid control [1,2,3]. Several studies using PPO or other RL algorithms on CityLearn or similar datasets report cost or emission reductions in the range of 8.78% to 20% compared to rule-based or baseline controllers. For example [3] reports 18% lower operation costs. Studies exploring alternative RL methods (like a Deep Q Network controller) also show comparable behaviour, achieving a 8.78% reduction in operating costs relative to its baseline [1]. Direct quantitative comparison remains difficult due to differences in datasets, pricing structures, community configurations, baseline controllers, and other implementation differences; however, the improvement observed in this work fits well within the expected performance range.

While the PPO agent is superior when it comes to resource optimization, one notable advantage of the RBC over the PPO agent is peak energy consumption. The RBC shows a clear advantage when it comes to reducing the peak power consumption, a metric of the grid stability. Schema 3, slightly different from schema 2, shows the advantage of providing control to a RL agent. Only by adding a battery to the third entity helps the RL agent make more impactful decisions, thus improving the overall performance against the baseline.

This research was conducted in the CityLearn framework, which has its limitations. While the framework provides a robust environment for simulating energy communities, it does not support more advanced concepts, like peer-to-peer energy trading or the role of energy aggregators. It does not provide specific electrical characteristics such as nominal voltage or current for the battery models either. This is a limitation of the simulation and was not a parameter considered in this study. The nominal power and capacity values for the PV panels and batteries were not arbitrarily chosen, they were specifically taken from one of the CityLearn Challenge datasets, which are designed to model realistic community configurations. These datasets leverage real-world data from sources like the End-Use Load Profiles for the U.S. Building Stock to create their scenarios [35].

Physical deployment may also find more constraints which have not been considered in this work, such as communication delays, sensor noise, and incomplete observability. These factors can reduce the stability and responsiveness of an RL agent trained under ideal assumptions and may require robust control mechanisms [44]. Privacy and security concerns are also not covered in our work, technologies such as blockchain and federated learning might be required to maintain confidentiality [24,25].

Regulatory frameworks impose practical limits on the applicability of advanced control. Physical deployment must comply with regional rules regarding dynamic pricing, feed-in tariffs, bidirectional metering, etc. [17].

6. Conclusions

This study analyzed the effectiveness of using a PPO reinforcement learning agent for energy management in three energy communities’ configurations, comparing it to a more traditional Rule-Based approach. Results show that the effectiveness of the PPO agent was directly linked to the degree of control it had over the available resources.

Across three community schemas, adding photovoltaic generation alone reduced annual operational costs by 72–87% and carbon emissions by 62–83% relative to grid-only operation. When storage and control were introduced, the PPO agent achieved up to 4.3% additional reductions in costs and carbon emissions when compared to a “Grid + Solar” baseline, and a 4–9% improvement over the rule-based controller. These results indicate that RL agent control strategies are advantageous in communities where the participants are prosumers, or a mixed community, where multiple buildings are equipped with photovoltaic generation and battery storage. In these configurations, the PPO agent can coordinate charging and discharging decisions, adapting to variables in the environment to minimize overall costs, whereas rule-based controllers may still be preferable in communities with a predictable energy dynamic, like commercial or industrial sites.

Future development will focus on methods for expanding the PPO agent’s control, by introducing new systems to the CityLearn framework. Implementation of an optimization algorithm based on Pareto non-dominance (e.g., NSGA-II, NSGA-III, CNSGA-II) to generate feasible offline or semi-online control and energy transfer strategies followed by implementation and testing of the resulting reward structures will represent a further step to follow in our research. Systems like Peer-To-Peer energy sharing and third-party energy aggregators could allow the Reinforcement Learning agent to explore more complex decisions and potentially achieve better results. Concerns like battery degradation, grid stability, voltage regulation, optimal PV-storage placement and peak energy consumption analysis also represent future focus points. Given the framework’s built-in power outage scenarios, such issues could be mitigated by experimenting with new reward functions.

Author Contributions

Conceptualization, O.N.M., A.F., C.S. and M.V.; Methodology, A.F., C.S. and M.V.; Software, O.N.M. and C.S.; Investigation, O.N.M.; Writing—original draft, O.N.M.; Writing—review & editing, O.N.M., A.F., C.S. and M.V.; Supervision, A.F. and M.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially developed in the project CoDEMO 5.0 (Co-Creative Decision-Makers for 5.0 Organizations), Project Number 101104819, ERASMUS-EDU-2022-PI-ALL-INNO-EDU-ENTERP (Alliances for Education and Enterprises), co-financed by the European Union Erasmus+ Programme.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Vetter, V.; Wohlgenannt, P.; Kepplinger, P.; Eder, E. Deep Reinforcement Learning Approaches the MILP Optimum of a Multi-Energy Optimization in Energy Communities. Energies 2025, 18, 4489. [Google Scholar] [CrossRef]
Palma, G.; Guiducci, L.; Stentati, M.; Rizzo, A.; Paoletti, S. Reinforcement Learning for Energy Community Management: A European-Scale Study. Energies 2024, 17, 1249. [Google Scholar] [CrossRef]
Uddin, M.; Mo, H.; Dong, D. Real-Time Energy Management Strategies for Community Microgrids. arXiv 2025, arXiv:2506.22931. [Google Scholar] [CrossRef]
Rego, N.; Castro, R.; Lagarto, J. Sustainable energy trading and fair benefit allocation in renewable energy communities: A simulation model for Portugal. Util. Policy 2025, 96, 101986. [Google Scholar] [CrossRef]
Fang, X.; Hong, P.; He, S.; Zhang, Y.; Tan, D. Multi-Layer Energy Management and Strategy Learning for Microgrids: A Proximal Policy Optimization Approach. Energies 2024, 17, 3990. [Google Scholar] [CrossRef]
Jones, G.; Li, X.; Sun, Y. Robust Energy Management Policies for Solar Microgrids via Reinforcement Learning. Energies 2024, 17, 2821. [Google Scholar] [CrossRef]
Chung, I.-H. Exploring the economic benefits and stability of renewable energy microgrids with controllable power sources under carbon fee and random outage scenarios. Energy Rep. 2025, 13, 6017–6041. [Google Scholar] [CrossRef]
Liu, X.; Zhao, P.; Qu, H.; Liu, N.; Zhao, K.; Xiao, C. Optimal Placement and Sizing of Distributed PV-Storage in Distribution Networks Using Cluster-Based Partitioning. Processes 2025, 13, 1765. [Google Scholar] [CrossRef]
Rizki, A.; Touil, A.; Echchatbi, A.; Oucheikh, R.; Ahlaqqach, M. A Reinforcement Learning-Based Proximal Policy Optimization Approach to Solve the Economic Dispatch Problem. Eng. Proc. 2025, 97, 24. [Google Scholar] [CrossRef]
Sarker, S.K.; Shafei, H.; Li, L.; Aguilera, R.P.; Hossain, M.J.; Muyeen, S.M. Advancing microgrid cyber resilience: Fundamentals, trends and case study on data-driven practices. Appl. Energy 2025, 401, 126753. [Google Scholar] [CrossRef]
Tan, Y.S.; Ng, Y.T.; Low, J.S.C. Internet-of-Things Enabled Real-time Monitoring of Energy Efficiency on Manufacturing Shop Floors. Procedia CIRP 2017, 61, 376–381. [Google Scholar] [CrossRef]
Gellert, A.; Fiore, U.; Florea, A.; Chis, R.; Palmieri, F. Forecasting Electricity Consumption and Production in Smart Homes through Statistical Methods. Sustain. Cities Soc. 2022, 76, 103426. [Google Scholar] [CrossRef]
Hussain, S.; Azim, M.I.; Lai, C.; Eicker, U. Smart home integration and distribution network optimization through transactive energy framework—A review. Appl. Energy 2025, 395, 126193. [Google Scholar] [CrossRef]
Salman, O.; Elhajj, I.; Kayssi, A.; Chehab, A. An architecture for the Internet of Things with decentralized data and centralized control. In Proceedings of the 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), Marrakech, Morocco, 17–20 November 2015; pp. 1–8. [Google Scholar] [CrossRef]
Wang, Y.; Saad, W.; Mandayam, N.B.; Poor, H.V. Load Shifting in the Smart Grid: To Participate or Not? IEEE Trans. Smart Grid 2016, 7, 2604–2614. [Google Scholar] [CrossRef]
Qiu, Y.L.; Xing, B.; Patwardhan, A.; Hultman, N.; Zhang, H. Heterogeneous changes in electricity consumption patterns of residential distributed solar consumers due to battery storage adoption. iScience 2022, 25, 104352. [Google Scholar] [CrossRef] [PubMed]
An, J.; Hong, T. Multi-objective optimization for optimal placement of shared battery energy storage systems in urban energy communities. Sustain. Cities Soc. 2025, 120, 106178. [Google Scholar] [CrossRef]
Gomes, I.S.F.; Perez, Y.; Suomalainen, E. Coupling small batteries and PV generation: A review. Renew. Sustain. Energy Rev. 2020, 126, 109835. [Google Scholar] [CrossRef]
Hamidieh, M.; Ghassemi, M. Microgrids and Resilience: A Review. IEEE Access 2022, 10, 106059–106080. [Google Scholar] [CrossRef]
Wu, Y.; Chen, Y.; Li, Z.; Golshannavaz, S. Robust Co-planning of distributed photovoltaics and energy storage for enhancing the hosting capacity of active distribution networks. Renew. Energy 2025, 253, 123645. [Google Scholar] [CrossRef]
Cui, S.; Xu, S.; Fang, J.; Ai, X.; Wen, J. A novel stable grand coalition for transactive multi-energy management in an integrated energy system. Appl. Energy 2025, 394, 126155. [Google Scholar] [CrossRef]
Florea, A.; Berntzen, L.; Vintan, M.; Stanescu, D.; Morariu, D.; Solea, C.; Fiore, U. Prosumer networks—A key enabler of control over renewable energy resources. Renew. Energy Focus 2024, 51, 100648. [Google Scholar] [CrossRef]
Gulraiz, A.; Zaidi, S.S.H.; Ashraf, M.; Ali, M.; Lashab, A.; Guerrero, J.M.; Khan, B. Impact of photovoltaic ingress on the performance and stability of low voltage Grid-Connected Microgrids. Results Eng. 2025, 26, 105030. [Google Scholar] [CrossRef]
Nepal, J.P.; Yuangyai, N.; Gyawali, S.; Yuangyai, C. Blockchain-Based Smart Renewable Energy: Review of Operational and Transactional Challenges. Energies 2022, 15, 4911. [Google Scholar] [CrossRef]
Singh, A.R.; Kumar, R.S.; Bajaj, M.; Kumar, B.H.; Blazek, V.; Prokop, L. A blockchain-enabled multi-agent deep reinforcement learning framework for real-time demand response in renewable energy grids. Energy Strategy Rev. 2025, 62, 101905. [Google Scholar] [CrossRef]
Ye, Y.; Tang, Y.; Wang, H.; Zhang, X.-P.; Strbac, G. A Scalable Privacy-Preserving Multi-Agent Deep Reinforcement Learning Approach for Large-Scale Peer-to-Peer Transactive Energy Trading. IEEE Trans. Smart Grid 2021, 12, 5185–5200. [Google Scholar] [CrossRef]
Zhang, M.; Eliassen, F.; Taherkordi, A.; Jacobsen, H.-A.; Chung, H.-M.; Zhang, Y. Energy Trading with Demand Response in a Community-based P2P Energy Market. In Proceedings of the 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), Beijing, China, 21–23 October 2019; pp. 1–6. [Google Scholar] [CrossRef]
Zhou, Y.; Wu, J.; Long, C.; Ming, W. State-of-the-Art Analysis and Perspectives for Peer-to-Peer Energy Trading. Engineering 2020, 6, 739–753. [Google Scholar] [CrossRef]
François-Lavet, V.; Taralla, D.; Ernst, D.; Fonteneau, R. Deep Reinforcement Learning Solutions for Energy Microgrids Management. In Proceedings of the European Workshop on Reinforcement Learning (EWRL 2016), Barcelona, Spain, 3–4 December 2016. [Google Scholar]
Vázquez-Canteli, J.R.; Nagy, Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl. Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
Pisacane, O.; Severini, M.; Fagiani, M.; Squartini, S. Collaborative energy management in a micro-grid by multi-objective mathematical programming. Energy Build. 2019, 203, 109432. [Google Scholar] [CrossRef]
Shirinshahrakfard, P.; Suratgar, A.A.; Menhaj, M.B.; Gharehpetian, G.B. Multi-Objective Optimization of Peer-to-Peer Transactions in Arizona State University’s Microgrid by NSGA II. In Proceedings of the 2024 32nd International Conference on Electrical Engineering (ICEE), Tehran, Iran, 14–16 May 2024. [Google Scholar] [CrossRef]
Kharatovi, L.; Gantassi, R.; Masood, Z.; Choi, Y. A Multi-Objective Optimization Framework for Peer-to-Peer Energy Trading in South Korea’s Tiered Pricing System. Appl. Sci. 2024, 14, 11071. [Google Scholar] [CrossRef]
Vazquez-Canteli, J.R.; Dey, S.; Henze, G.; Nagy, Z. CityLearn: Standardizing Research in Multi-Agent Reinforcement Learning for Demand Response and Urban Energy Management. arXiv 2020, arXiv:2012.10504. [Google Scholar] [CrossRef]
Nweye, K.; Kaspar, K.; Buscemi, G.; Fonseca, T.; Pinto, G.; Ghose, D.; Nagy, Z. CityLearn v2: Energy-flexible, resilient, occupant-centric, and carbon-aware management of grid-interactive communities. J. Build. Perform. Simul. 2025, 18, 17–38. [Google Scholar] [CrossRef]
Nweye, K.; Liu, B.; Stone, P.; Nagy, Z. Real-world challenges for multi-agent reinforcement learning in grid-interactive buildings. Energy AI 2022, 10, 100202. [Google Scholar] [CrossRef]
Nagy, Z.; Vázquez-Canteli, J.R.; Dey, S.; Henze, G. The Citylearn Challenge 2021; ResearchGate: Berlin, Germany, 2021. [Google Scholar] [CrossRef]
Ramos, A.; Tuovinen, M.; Ala-Juusela, M. Battery Energy Storage System (BESS) as a service in Finland: Business model and regulatory challenges. J. Energy Storage 2021, 40, 102720. [Google Scholar] [CrossRef]
Arteaga, J.; Zareipour, H.; Amjady, N. Energy Storage as a Service: Optimal sizing for Transmission Congestion Relief. Appl. Energy 2021, 298, 117095. [Google Scholar] [CrossRef]
Wang, Z.; Luther, M.; Horan, P.; Matthews, J.; Liu, C. Technical and economic analyses of PV battery systems considering two different tariff policies. Sol. Energy 2024, 267, 112189. [Google Scholar] [CrossRef]
De Groote, O.; Pepermans, G.; Verboven, F. Heterogeneity in the adoption of photovoltaic systems in Flanders. Energy Econ. 2016, 59, 45–57. [Google Scholar] [CrossRef]
Nurwidiana, N.; Sopha, B.M.; Widyaparaga, A. Modelling Photovoltaic System Adoption for Households: A Systematic Literature Review. Evergreen 2021, 8, 69–81. [Google Scholar] [CrossRef]
Niazi, K.A.K.; Kerekes, T.; Dolara, A.; Yang, Y.; Leva, S. Performance Assessment of Mismatch Mitigation Methodologies Using Field Data in Solar Photovoltaic Systems. Electronics 2022, 11, 1938. [Google Scholar] [CrossRef]
La Fata, A.; Brignone, M.; Procopio, R.; Bracco, S.; Delfino, F.; Barbero, G.; Barilli, R. An energy management system to schedule the optimal participation to electricity markets and a statistical analysis of the bidding strategies over long time horizons. Renew. Energy 2024, 228, 120617. [Google Scholar] [CrossRef]

Figure 1. Sample of energy loads and PV production over 24 h (Taken from building 1, schema 1, 1 July. Measurements and values provided by the CityLearn simulator).

Figure 2. Reinforcement Learning Loop.

Figure 3. General Architecture of Schema 1.

Figure 4. General Architecture of Schema 2.

Figure 5. General Architecture of Schema 3.

Figure 6. Sample load profile over 24 h (Taken from building 3, schema 3. Measurements and values provided by the CityLearn simulator. The measurements used can be seen at https://github.com/OogaBooga21/EC_RL/blob/main/scen3/Building_3.csv (accessed on 1 November 2025)).

Figure 7. Architecture of the PPO agent.

Figure 8. Application Flow.

Figure 9. Comparison of control strategies for Schema 1 using data from Table 6.

Figure 10. Comparison of control strategies for Schema 2 using data from Table 7.

Figure 11. Comparison of control strategies for Schema 3 using data from Table 8.

Table 1. Properties of the first energy community.

Schema 1	Solar PV	Battery Storage (Capacity/Nominal Power)	Notes
Building 1	Yes (120 kW)	Yes (140 kWh, 100 kW)	Full setup
Building 2	No	No	Grid-only
Building 3	No	Yes (50 kWh, 20 kW)	Battery only
Building 4	Yes (40 kW)	No	PV only
Building 5	Yes (25 kW)	Yes (50 kWh, 25 kW)	Full setup

Table 2. Properties of the second energy community.

Schema 2	Solar PV	Battery Storage (Capacity/Nominal Power)	Notes
Building 1	Yes (70 kW)	Yes (140 kWh, 100 kW)	Full setup
Building 2	Yes (70 kW)	Yes (100 kWh, 100 kW)	Full setup
Building 3	Yes (300 kW)	No	PV only

Table 3. Properties of the third energy community.

Schema 3	Solar PV	Battery Storage (Capacity/Nominal Power)	Notes
Building 1	Yes (70 kW)	Yes (140 kWh, 100 kW)	Full setup
Building 2	Yes (70 kW)	Yes (100 kWh, 100 kW)	Full setup
Building 3	Yes (250 kW)	Yes (150 kWh, 100 kW)	Full setup

Table 4. State observation structure (https://www.citylearn.net/overview/observations.html (accessed on 1 November 2025)).

Parameter Group	Parameters (with Units)	Category Description
Time Indicators	Month, Day type, Hour, Daylight-savings status	Encodes the temporal context for each observation, including calendar position and whether daylight-saving is active.
Outdoor Weather, Current Conditions	Outdoor dry-bulb temperature (°C), Outdoor relative humidity (%), Diffuse solar irradiance (W/m²), Direct solar irradiance (W/m²)	Real-time meteorological signals describing ambient thermal and solar conditions.
Outdoor Weather, Forecasts	Temperature forecasts: 6 h/12 h/24 h (°C); Humidity forecasts: 6 h/12 h/24 h (%); Diffuse solar forecasts: 6 h/12 h/24 h (W/m²); Direct solar forecasts: 6 h/12 h/24 h (W/m²)	Short-term weather predictions used to anticipate external loads and renewable availability.
Indoor Environmental State	Indoor dry-bulb temperature (°C), Indoor relative humidity (%), Unmet cooling setpoint difference (°C), Indoor temperature setpoint (°C), Temperature deviation from setpoint (°C)	Characterizes thermal comfort conditions and deviations from operational targets within the building.
Energy Storage State-of-Charge	Cooling storage SOC (0–1), Heating storage SOC (0–1), DHW storage SOC (0–1), Electrical storage SOC (0–1)	Fractional state-of-charge values for all storage assets, representing available flexibility.
Building Loads & Electricity Flows	Non-shiftable load (kWh), Solar generation (kWh), Net electricity consumption (kWh)	Core energy-flow metrics including fixed loads, on-site generation, and net grid imports.
Thermal Demands & Device Performance	Cooling demand (kWh), Heating demand (kWh), DHW demand (kWh), Cooling electricity use (kWh), Heating electricity use (kWh), DHW electricity use (kWh), Device efficiencies	Captures thermal service requirements, corresponding electricity use, and performance indices (COP/efficiency) for each thermal device.
Grid & Market Signals	Carbon intensity (kgCO₂/kWh), Electricity price ($/kWh), Electricity price forecasts: 6 h/12 h/24 h ($/kWh), Power-outage indicator (0/1)	Describes external grid conditions, including pricing, decarbonization signals, and supply interruptions.

Table 5. Hyperparameters used for training the RL agent.

Hyperparameter	Value	Description
Training steps	400,000	Total number of environment interactions used for training. This determines the overall training duration.
Learning rate	0.0001	Step size for gradient updates. Small values slow down convergence but also ensure gradual stable learning.
Gamma	0.99	Discount factor which controls how much future rewards influence current decisions.
Batch size	64	Number of samples used per gradient update.
Steps per update	2048	Number of environment steps collected before each policy update.
Epochs	10	Number of passes over each batch during optimization
GAE Lambda	0.95	Generalized Advantage Estimation. Trades off bias and variance in advantage computation for smoother learning.

Table 6. Performance across the board for strategies in the first schema.

SCHEMA 1	Grid Only	Grid + Solar	RBC	PPO
Annual Cost ($)	587,176.69	162,000.67	165,080.21	157,863.93
Energy Used (kWh)	2,044,709.71	736,365.63	750,364.67	717,566.10
Carbon emissions (kg)	1,094,800.08	413,185.86	421,040.27	402,635.01
Peak Energy consumption (kWh)	351.3	398.61	395.97	429.79

(Source: authors own experiments using CityLearn dataset from challenge 2021, Schema 1 configuration).

Table 7. Performance across the board for strategies in the second schema.

SCHEMA 2	Grid Only	Grid + Solar	RBC	PPO
Annual Cost ($)	581,955.27	74,270.4	77,209.41	75,218.32
Energy Used (kWh)	2,051,007.59	337,594.4	350,952.56	341,902.95
Carbon emissions (kg)	1,109,990.13	189,428.1	196,924.1	191,845.78
Peak Energy consumption (kWh)	816.33	543.99	513.64	549.56

(Source: authors own experiments using CityLearn dataset from challenge 2021, Schema 2 configuration).

Table 8. Performance across the board for strategies in the third schema.

SCHEMA 3	Grid Only	Grid + Solar	RBC	PPO
Annual Cost ($)	581,955.27	89,183.1	93,898.79	85,316.74
Energy Used (kWh)	2,051,007.59	405,378.7	426,811.46	387,801.47
Carbon emissions (kg)	1,109,990.13	227,463.22	239,490.68	217,602.01
Peak Energy consumption (kWh)	816.33	554.21	505.53	554.21

(Source: authors own experiments using CityLearn dataset from challenge 2021, Schema 3 configuration).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Moga, O.N.; Florea, A.; Solea, C.; Vintan, M. Reinforcement Learning-Based Energy Management in Community Microgrids: A Comparative Study. Sustainability 2025, 17, 10696. https://doi.org/10.3390/su172310696

AMA Style

Moga ON, Florea A, Solea C, Vintan M. Reinforcement Learning-Based Energy Management in Community Microgrids: A Comparative Study. Sustainability. 2025; 17(23):10696. https://doi.org/10.3390/su172310696

Chicago/Turabian Style

Moga, Olimpiu Nicolae, Adrian Florea, Claudiu Solea, and Maria Vintan. 2025. "Reinforcement Learning-Based Energy Management in Community Microgrids: A Comparative Study" Sustainability 17, no. 23: 10696. https://doi.org/10.3390/su172310696

APA Style

Moga, O. N., Florea, A., Solea, C., & Vintan, M. (2025). Reinforcement Learning-Based Energy Management in Community Microgrids: A Comparative Study. Sustainability, 17(23), 10696. https://doi.org/10.3390/su172310696

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Reinforcement Learning-Based Energy Management in Community Microgrids: A Comparative Study

Abstract

1. Introduction

2. Key Research Directions in Microgrid and Energy Community Management

2.1. Microgrid and Energy Storage Optimization

2.2. Grid Robustness and Resilience

2.3. Transactive Energy and Market-Based Coordination

2.4. Challenges

3. Materials and Methods

3.1. Overview of Methodology

3.2. Simulation Environment

3.3. Community Configurations and Input Data

3.4. Load, PV, and Pricing Data Sources

3.5. State, Action and Reward Spaces

3.6. Algorithm and Training Parameters

3.7. Training Procedure and Evaluation Setup

3.8. Limitations and Assumptions

4. Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI