Article

Deep Reinforcement Learning Approaches the MILP Optimum of a Multi-Energy Optimization in Energy Communities

1 Illwerke vkw Endowed Professorship for Energy Efficiency, Energy Research Centre, Vorarlberg University of Applied Sciences, Hochschulstrasse 1, 6850 Dornbirn, Austria
2 Faculty of Engineering and Science, University of Agder, Jon Lilletuns vei 9, 4879 Grimstad, Norway
* Author to whom correspondence should be addressed.
Energies 2025, 18(17), 4489; https://doi.org/10.3390/en18174489
Submission received: 31 July 2025 / Revised: 18 August 2025 / Accepted: 20 August 2025 / Published: 23 August 2025
(This article belongs to the Special Issue Smart Energy Management and Sustainable Urban Communities)

Abstract

As energy systems transition toward high shares of variable renewable generation, local energy communities (ECs) are increasingly relevant for enabling demand-side flexibility and self-sufficiency. This shift is particularly evident in the residential sector, where the deployment of photovoltaic (PV) systems is rapidly growing. While mixed-integer linear programming (MILP) remains the standard for operational optimization and demand response in such systems, its computational burden limits scalability and responsiveness under real-time or uncertain conditions. Reinforcement learning (RL), by contrast, offers a model-free, adaptive alternative. However, its application to real-world energy system operation remains limited. This study explores the application of a Deep Q-Network (DQN) to a real residential EC, which has received limited attention in prior work. The system comprises three single-family homes sharing a centralized heating system with a thermal energy storage (TES), a PV installation, and a grid connection. We compare the performance of MILP and RL controllers across economic and environmental metrics. Relative to a reference scenario without TES, MILP and RL reduce energy costs by 10.06% and 8.78%, respectively, and both approaches yield lower total energy consumption and CO2-equivalent emissions. Notably, the trained RL agent achieves a near-optimal outcome while requiring only 22% of the MILP’s computation time. These results demonstrate that DQNs can offer a computationally efficient and practically viable alternative to MILP for real-time control in residential energy systems.

1. Introduction

As modern energy systems become increasingly dominated by renewable, variable power generation, local energy communities (ECs) emerge as a promising way to stabilize the grid and utilize electricity where it is generated [1]. These advantages foster the transition from individual buildings toward Net Zero Energy Buildings [2] or even a Net Zero Energy Neighborhood [3,4]. For the optimization and ideal operation of such systems, Mixed-Integer Linear Programming (MILP) has been widely regarded as the gold standard, particularly throughout the first two decades of the 2000s [5]. MILP is a powerful method for finding global optima, making it widely used in model-based optimization across economics, logistics, and energy-related fields [6,7,8,9]. However, as the complexity of the investigated systems increases, the required linearizations become more intricate, and computation times grow exponentially due to the introduction of additional binary variables. This prompts the need for alternative techniques [10]. Recent advances in machine learning and computational power have renewed interest in alternative methods for optimization and control. Reinforcement learning (RL) is model-free and data-driven, allowing it to achieve near-optimal solutions with significantly reduced computation time once trained [11,12]. However, effective training requires a sufficient amount of high-quality data, so training time becomes a critical factor in real-world applications, especially as retraining may be necessary during deployment. While model-based approaches do not depend on historical data, trained RL agents execute much faster and with greater computational efficiency, which makes them attractive for time-critical optimization. In ECs, where distributed flexibilities can be aggregated and coordinated across multiple entities, RL-based optimization approaches therefore present a promising solution.
Due to their decentralized and dynamic nature, such systems benefit from RL agents’ ability to support real-time decision-making under uncertainty. Ideally, this leads to reduced costs for all participants [13]. Optimizing single households often requires transparency and a willingness to compromise on comfort, as optimal control may lead to small temperature variations or limited availability during unforeseen events. Neighborhood-level solutions can help mitigate these issues, as aggregation typically leads to a smoother load curve in larger systems [14,15,16]. However, centralized control can raise security and cybersecurity concerns, since aggregated management may involve sharing sensitive household data [17,18,19]. In the following subsection, relevant publications on the application of MILP and RL optimization in different energy systems are critically discussed.

1.1. Related Works

In several studies on energy systems, MILP optimization has been investigated and compared with a rule-based conventional operation. Baumann et al. [20] used a co-simulation approach of IDA ICE and Gurobi to optimize the energy system’s control. The buildings were then optimized on a daily basis using MILP and simulated using a physical resistor–capacitor (RC) model. The investigation demonstrated significant benefits including reduced electricity costs, improved self-consumption, and self-sufficiency. In the work of Aguilera et al. [21], MILP was used to control large-scale heat pumps in a simulation utilizing thermal demand forecasts as well as heat pump performance maps. This optimization achieved cost savings in the operation using the digital twin model. Kepplinger et al. [22] also applied the MILP approach in a simulation, incorporating forecasting and state estimation methods to optimally control a heating element in a domestic hot water heater. In their later work, they validated their results using an experimental real-world setup [23]. Cosic et al. [24] extended the application of MILP optimization from operational scheduling to investment planning. In a comprehensive framework, they optimized the sizing and placement of PV and storage systems in a real Austrian municipality, evaluating multiple tariff and market scenarios to demonstrate the model’s adaptability and precision. A 15% reduction in total energy costs and a 34% reduction in CO2 emissions were achieved.
RL has also been applied as an optimization method for energy systems, although primarily with reward functions not tied to economics. Bachseitz et al. [25] investigated various RL algorithms in comparison to a rule-based control strategy for managing the heat pump of a multi-family building. Their findings indicate that while the RL agent was able to maintain the required storage temperatures, it fell short of matching the rule-based approach in maximizing PV self-consumption. Lissa et al. [26] implemented an RL agent guided by a reward structure based on comfort levels and energy consumption to manage the energy system of a simulated single-family home. They achieved energy savings of 8% as well as increased use of renewable energy compared with a rule-based approach. Rohrer et al. [27] applied RL in a lab experiment to evaluate its feasibility for demand response. Using six months of real-world data for training, their approach achieved considerable energy savings, once more emphasizing the practical potential of RL in real demand response scenarios. Franzoso et al. [28] highlighted the use of reinforcement learning to optimize multi-energy systems integrating renewable technologies. Their study demonstrated improved energy management by reducing emissions as well as operational costs, showing the versatility of RL in energy applications. Guo et al. [29] and Cui et al. [30] also demonstrated how RL can be used to optimize the operation of microgrids and other small multi-energy systems under uncertainty, an important aspect for practical application.
Langer and Völling investigated a system comprising a single-family home with a heat pump, PV system, and battery electric storage system, optimized using an RL approach [31]. Their RL solution achieved a performance close to the results of a MILP model developed in their earlier work [32]. The evaluation focused on user comfort, grid feed-in, and overall energy usage but did not account for economic incentives. While this demonstrates the feasibility of RL for residential energy management, the absence of cost considerations limits its applicability in scenarios where financial optimization is critical.
While MILP and RL have been widely studied for building-level energy optimization, few works compare both methods in a comprehensive manner, and using real-world data from an existing neighborhood energy system. Furthermore, most RL studies focus on comfort or energy savings, often ignoring economic incentives driven by real-time electricity prices. Finally, the leveraging of price signals and the flexibility of storage, while accounting for PV production, remains insufficiently explored at the community scale. These gaps are addressed in the present study by directly comparing MILP and RL for economic optimization of an EC using historical real-time pricing and measured on-site data. The detailed contributions of this work are described in the following subsection.

1.2. Contribution

This study comprehensively compares RL and MILP side by side under closely matched conditions using both measured and synthetic data for an existing EC, including demand, real-time electricity prices, and PV generation profiles. The optimization focuses on exploiting the flexibility of an existing thermal energy storage (TES) by shifting the heat pump operation toward periods of low electricity prices while avoiding high-price intervals, thereby reducing overall energy costs. To enable a thorough comparison of the optimization methods, a reference scenario without flexibility was used, where heat demand is met directly. The individual contributions of this study are as follows:
  • Side-by-side comparison of MILP and RL for the economic optimization of an existing EC using real-world input data and real-time electricity prices.
  • Development of an RL-based control strategy for economic demand response, leveraging price signals to shift grid usage toward low-price periods by optimizing the operation of thermal storage and PV flexibility, aiming to approach the MILP-derived optimum.
  • Comprehensive benchmarking of RL and MILP in a direct comparison against a no-flexibility reference scenario, assessing cost savings and operational strategies under realistic conditions, including PV variability and demand patterns.
Therefore, this study provides practical insights into the performance and trade-offs of RL and MILP in energy system optimization, paving the way towards future control strategies for cost-efficient operation of ECs.

2. Methods

In this section, the system is presented and the methods used are described, from the general model equations to the RL implementation and the MILP formulation.

2.1. System

A real-world energy system was investigated in this study. The small EC, as illustrated in Figure 1, comprises three single-family homes that share both electrical and thermal energy infrastructure. The electrical side includes a grid connection point and rooftop photovoltaic panels. The thermal energy system is centered around a sensible TES, providing flexibility to the two heat-supply components: a 12 kW geothermal heat pump (HP) serving as the primary heat source and a 6 kW auxiliary resistance heating element (HE) to cover peak loads. The storage tank has a capacity of 1120 L of water and operates between 35 °C and 55 °C ( T stor ), allowing for approximately 26 kWh of storage capacity ( E stor ).
The direction of the energy flows is indicated by arrows in Figure 1. All flows labeled with the letter P represent electrical power exchanged between system components. These include solar power generation ( P PV ), the general electrical load of the homes ( P load ), and the electrical power consumption of the heating components, the heat pump ( P HP ), and the heating element ( P HE ). All of these flows are unidirectional, with the exception of the grid connection ( P grid ), which can either source electricity from the grid ( P grid , pos ) or feed in surplus electricity generated by the PV system ( P grid , neg ).
Thermal energy flows, denoted by Q ˙ , indicate the direction of heat transfer between components. The heat generated by the heat pump ( Q ˙ HP ) and the heating element ( Q ˙ HE ) is stored in the TES. While the majority of the stored energy is ultimately used to meet the EC’s heating demand ( Q ˙ out ), heat losses to the environment ( Q ˙ loss ) must also be taken into account. It is noteworthy that in the investigated real energy system, a bypass option to directly cover the heating demand Q ˙ out using the heat pump and/or heating element is not available. While the heat load Q ˙ out from the TES supplies the required energy for space heating, domestic hot water consumption is accounted for within the synthetically generated electric load profile P load .

2.2. Data

The data used for the simulation and optimization of the EC is categorized as real-world (measured) on-site data, historical data, or synthetic data, as summarized in Table 1.
Measured data were logged for a period of one year from 19 October 2022. The sensors recorded data at a one-minute resolution, which were averaged and stored every 15 min, resulting in 96 samples per day and a total of 35,040 samples for the entire year. In this case, privacy and security concerns do not apply, since the entire energy community is managed and overseen by a single private entity, in contrast to larger communities with multiple owners, where such issues are more significant. The synthetic data were generated with the same temporal resolution to ensure consistency across all datasets. Historical data was resampled to this resolution when necessary.
The measured temperature data exhibited considerable noise due to interference from the heat pump operation, especially when it was used for cooling the buildings in the summer period, resulting in large fluctuations of the measured ground probe temperature profile. To ensure smooth simulation and optimization, as well as realistic COP and power estimates for both optimization methods, the data was preprocessed prior to any algorithmic use. Specifically, a rolling-average filter with a window of 96 values (i.e., one day) was applied to smooth the signal, yielding a smoothed synthetic signal derived from the measurements.
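A simple version of such a filter is sketched below; the exact edge handling used in the paper is not specified, so the centered moving average here is an assumption:

```python
import numpy as np

def rolling_average(signal, window=96):
    """Smooth a signal with a centered moving average over `window`
    samples (96 samples = one day at 15-min resolution)."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")
```

Note that `mode="same"` zero-pads at the boundaries, so the first and last half-window of values are attenuated; other boundary treatments are equally plausible.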
For the RL optimization, the seasonal classification was considered by assigning binary values to each time step, where a value of 1 indicates the active season. Seasons were defined as shown in Table 2.
For validation and to ensure robustness of the algorithms, a train-test split was applied to the data, as seen in Figure 2. This split was designed to include all seasons and their combinations within both sets. Specifically, the data for each month were divided such that, in total, 260 days were used for training and the remaining 105 for testing. The exceptions were the initial month, October 2022, which was allocated exclusively to training, and the final month, October 2023, which was allocated exclusively to testing. This approach ensures comprehensive seasonal coverage during both training and testing phases.

2.3. Physical Model

The EC was modeled through a series of equations to calculate the used energy and energy flows at each timestep. For this model, the following assumptions were made:
  • The specific heat capacity of water is equal to c water = 4.18 kJ / ( kg · K ) , irrespective of the TES temperature.
  • Spatial temperature variations in the TES are neglected, making it a single-node model.
  • The mass balance is always fulfilled for the TES. Due to the narrow operating temperature range of only 20 °C, the volume of the storage is assumed to remain constant, as thermal expansion effects are negligible.
  • The electric power of the heating element is equivalent to its heating power ( η = 1 ).
  • The COP of the heat pump is equal to 0.5 · COP Carnot . This is in accordance with Walden and Pedulla [39].
The heat capacity C stor of the sensible water storage was calculated using the volume of the tank V stor , the specific density of water ρ water , and the specific heat capacity of water  c water :
$$C_{\mathrm{stor}} = V_{\mathrm{stor}} \cdot \rho_{\mathrm{water}} \cdot c_{\mathrm{water}}$$
Using the specific heat capacity and the temporal change of temperature d T stor ( t ) / d t of the storage tank at each timestep, the change in stored energy can be calculated. This was done by accounting for the heat flows to and from the tank, namely the heat supplied by the heating element Q ˙ HE , the heat pump Q ˙ HP , the heating demand Q ˙ out , and the heat losses Q ˙ loss :
$$C_{\mathrm{stor}} \cdot \frac{\mathrm{d}T_{\mathrm{stor}}(t)}{\mathrm{d}t} = \dot{Q}_{\mathrm{HE}}(t) + \dot{Q}_{\mathrm{HP}}(t) - \dot{Q}_{\mathrm{out}}(t) - \dot{Q}_{\mathrm{loss}}(t)$$
To calculate the heating power Q ˙ HE of the heating element, the electrical power input P HE and the efficiency η of the system are required:
$$\dot{Q}_{\mathrm{HE}}(t) = \eta \cdot P_{\mathrm{HE}}(t)$$
The heating power of the heat pump, denoted as Q ˙ HP , was computed based on its electrical input power P HP and an estimated coefficient of performance (COP) at each timestep. To this end, the ideal Carnot COP was used to model the temperature dependency of the heat pump COP, which depends on the heat sink temperature T ub and the ground probe temperature T GP ( t ) .
$$\mathrm{COP}_{\mathrm{Carnot}}(t) = \frac{T_{\mathrm{ub}}(t)}{T_{\mathrm{ub}}(t) - T_{\mathrm{GP}}(t)}$$
$$\dot{Q}_{\mathrm{HP}}(t) = 0.5 \cdot \mathrm{COP}_{\mathrm{Carnot}}(t) \cdot P_{\mathrm{HP}}(t)$$
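As a minimal sketch (not the authors' code), the heat pump's thermal output under the 0.5 · Carnot assumption can be computed as follows; the function name and the example temperatures are illustrative:

```python
def heat_pump_output(p_hp_kw: float, t_sink_c: float, t_ground_c: float) -> float:
    """Thermal output of the heat pump, assuming COP = 0.5 * COP_Carnot.

    Temperatures are given in degrees Celsius and converted to kelvin,
    since the Carnot COP is defined on absolute temperatures.
    """
    t_sink_k = t_sink_c + 273.15
    t_ground_k = t_ground_c + 273.15
    cop_carnot = t_sink_k / (t_sink_k - t_ground_k)
    return 0.5 * cop_carnot * p_hp_kw

# Example: 1 kW electrical input, 55 degC sink, 10 degC ground probe
q_hp = heat_pump_output(1.0, 55.0, 10.0)
```

With a 55 °C sink and a 10 °C ground probe this yields an effective COP of roughly 3.6, i.e., about 3.6 kW of heat per kW of electricity.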
For simplification, the heating demand Q ˙ out ( t ) was defined as the total heat consumption of the three separate houses in the energy system, Q ˙ H 1 , Q ˙ H 2 , and Q ˙ H 3 . These individual demands were aggregated into a single total demand:
$$\dot{Q}_{\mathrm{out}}(t) = \dot{Q}_{\mathrm{H1}}(t) + \dot{Q}_{\mathrm{H2}}(t) + \dot{Q}_{\mathrm{H3}}(t)$$
The heat losses of the storage tank Q ˙ loss ( t ) were calculated using the heat transfer coefficient h, the total surface area of the tank A, and the temperature difference between the storage temperature T stor ( t ) and the ambient temperature T :
$$\dot{Q}_{\mathrm{loss}}(t) = h \cdot A \cdot \left( T_{\mathrm{stor}}(t) - T_{\infty} \right)$$
Using the previously defined heat capacity C stor (see Equation (1)), the temporal rate of change of the storage tank temperature can also be expressed by:
$$\frac{\mathrm{d}T_{\mathrm{stor}}(t)}{\mathrm{d}t} = \frac{\dot{Q}_{\mathrm{HE}}(t) + \dot{Q}_{\mathrm{HP}}(t) - \dot{Q}_{\mathrm{out}}(t) - \dot{Q}_{\mathrm{loss}}(t)}{C_{\mathrm{stor}}}$$
Alternatively, assuming constant heat fluxes and parameters over time, the storage temperature evolution can be analytically described by the solution of the first-order linear differential equation:
$$T_{\mathrm{stor}}(t) = \frac{b}{a} + \left( T_{0} - \frac{b}{a} \right) \cdot e^{-a \cdot t}$$
$$a = \frac{hA}{C_{\mathrm{stor}}}$$
$$b = \frac{\dot{Q}_{\mathrm{HE}} + \dot{Q}_{\mathrm{HP}} - \dot{Q}_{\mathrm{out}} + hA \cdot T_{\infty}}{C_{\mathrm{stor}}}$$
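The analytic solution above can be implemented directly. The sketch below is illustrative, not the authors' implementation, and assumes SI units throughout (heat fluxes in W, heat capacity in J/K, time in s):

```python
import math

def tes_temperature(T0, q_he, q_hp, q_out, h, A, t_amb, c_stor, dt):
    """Storage temperature after dt seconds, assuming the heat fluxes
    and parameters stay constant over the interval (analytic solution
    of the first-order linear ODE for the single-node TES)."""
    a = h * A / c_stor                                  # decay rate, 1/s
    b = (q_he + q_hp - q_out + h * A * t_amb) / c_stor  # forcing term, K/s
    return b / a + (T0 - b / a) * math.exp(-a * dt)
```

A quick sanity check: with all heat fluxes set to zero, b/a reduces to the ambient temperature, so the tank exponentially relaxes toward its surroundings, as expected from the loss term.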
The heating power Q ˙ heat was controlled using a simple discrete-time proportional (P) controller with saturation, as shown in Algorithm A1. At each time step i, with a duration of Δ t PI = 1 / 60 h , the control error e i was computed as the difference between the desired storage temperature T set , i and the actual temperature T actual , i . The heating power was then obtained by multiplying this error by the proportional gain B 0 . To ensure that the power remains within operational limits, Q ˙ heat , i was constrained to the range defined by Q ˙ HP , min and Q ˙ HP , max . The heating element activates only if the temperature of the TES drops to 1 °C below T lb , acting strictly as an auxiliary heating unit. Both the RL and the MILP optimization are able to freely adjust the set temperature in order to utilize the flexibility of the thermal storage system. This flexibility is determined by the thermal capacity C stor and the temperature bounds T lb and T ub , set to 35 °C and 55 °C, respectively.
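A minimal sketch of such a saturated P controller (not Algorithm A1 verbatim; the gain and power limits below are placeholders, not values from the paper):

```python
def p_controller(t_set, t_actual, b0=10.0, q_min=0.0, q_max=12.0):
    """Proportional controller with saturation: heating power is the
    control error times the gain b0, clipped to the operating range
    [q_min, q_max] (kW)."""
    error = t_set - t_actual
    q_heat = b0 * error
    return max(q_min, min(q_heat, q_max))
```

Because the output is clipped at q_min = 0, the controller never commands negative (cooling) power when the storage is above its setpoint.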
A state of charge (SOC) was defined for the TES, and it is computed based on the storage temperature at time t according to:
$$\mathrm{SOC}(t) = \frac{T_{\mathrm{ub}} - T_{\mathrm{stor}}(t)}{T_{\mathrm{ub}} - T_{\mathrm{lb}}}$$
where T stor ( t ) is the current storage temperature, and T ub and T lb denote the upper and lower temperature bounds of the storage system.
The remaining parameters were derived from system documentation and planning materials. The complete energy system model served as the environment for the RL agent and was implemented using Python v3.12.6, the Gymnasium framework v1.0.0 [40], and PyTorch v2.6.0 [41]. The MILP model was formulated and solved using the GurobiPy interface to the Gurobi optimizer v11.0.0 [42]. All computations were performed on a MacBook Pro (M2 Pro, 16 GB RAM, macOS Sequoia 15.5).
Both models rely on the parameters listed in Table 3.

2.4. Reinforcement Learning

This subsection outlines the deep reinforcement learning (RL) algorithm employed for the control of the heat pump and the heating element to reduce the energy costs and improve the self-consumption. A Deep Q-Network (DQN) was implemented, following the approach presented in our earlier work [43]. The architecture of the DQN is illustrated in Figure 3.
The method builds upon the deep Q-learning algorithm originally introduced by Mnih et al. [44] and incorporates key improvements proposed by Van Hasselt et al. [45], which involve using two separate neural networks: a policy network for selecting actions and a target network for estimating Q-values. To enhance training stability, the algorithm integrates soft target updates as described by Lillicrap et al. [46]. Additionally, it uses experience replay with mini-batch sampling, following the stabilization technique as applied in the original DQN study [44].
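The soft target update can be sketched framework-agnostically; in PyTorch the same loop would run over the two networks' `state_dict()` tensors, and the value of τ below is a typical placeholder, not taken from the paper:

```python
def soft_update(target_params, policy_params, tau=0.005):
    """Polyak averaging: target <- tau * policy + (1 - tau) * target.

    Parameters are represented here as plain dicts of floats as a
    stand-in for the networks' weight tensors.
    """
    return {name: tau * policy_params[name] + (1.0 - tau) * target_params[name]
            for name in target_params}
```

With a small τ the target network trails the policy network slowly, which is what stabilizes the bootstrapped Q-value targets.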
In the present EC, the heating system’s controller acts as the agent, controlling the heat generators HP and HE, while the rest of the EC acts as the environment based on external conditions and actions. The agent’s control action consists of adjusting the SOC of the TES. The action set is discretized on a scale from 0 to 100, where 0 represents the minimum allowable TES charge and 100 the maximum, as regulated by the P-controller.
The optimization horizon covers one day, partitioned into 96 discrete time steps of 1/4 h each. Consequently, one episode spans 96 steps. The system state at each step i is composed of the current SOC i (scaled between 0 and 100), the number of remaining steps (96 − i) in the day, the forecasted electricity prices π j for the upcoming 96 intervals, the feed-in tariff f, the predicted thermal output Q ˙ out , j , the net electrical loads P net , j , and the seasonal indicator variables B 1 , B 2 , B 3 , B 4 for those intervals. The forecasts for electricity prices and thermal load are assumed to be available one day in advance, reflecting realistic operational conditions where day-ahead market prices are published prior to execution. For the purposes of this application, operation is assumed to occur under perfect predictions without uncertainty, thereby simplifying the optimization problem. After each step, the forecast window shifts forward accordingly; unknown future values beyond the forecast horizon are set to zero. To enhance training stability, most state variables are scaled.
Electricity prices, used in calculating energy costs and rewards, were scaled according to a min–max normalization over the course of one day to a range from 1 to 10 via:
$$\pi_{i}^{*} = 9 \cdot \frac{\pi_{i} - \min(\pi)}{\max(\pi) - \min(\pi)} + 1$$
Similarly, the feed-in tariff f, which remains constant throughout each episode, was scaled using the same minimum and maximum values as the electricity prices to ensure consistent normalization:
$$f^{*} = 9 \cdot \frac{f - \min(\pi)}{\max(\pi) - \min(\pi)} + 1$$
Scaling electricity prices serves three main purposes: first, it ensures that all state variables have comparable magnitudes; second, it normalizes daily price values to account for seasonal fluctuations; and third, it enables the agent to generalize to price scenarios that differ from those seen during training. Importantly, scaled prices were constrained to be greater than or equal to one, which avoids zero-cost intervals that might otherwise encourage unrealistic energy usage.
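The two scaling formulas can be sketched as follows (the implementation details, such as the use of numpy, are illustrative; only the formulas come from the text):

```python
import numpy as np

def scale_day(prices, feed_in):
    """Min-max scale one day's electricity prices to [1, 10]; the
    feed-in tariff is scaled with the same day's min/max so that its
    value relative to the prices is preserved."""
    prices = np.asarray(prices, dtype=float)
    lo, hi = prices.min(), prices.max()
    prices_scaled = 9.0 * (prices - lo) / (hi - lo) + 1.0
    f_scaled = 9.0 * (feed_in - lo) / (hi - lo) + 1.0
    return prices_scaled, f_scaled
```

Note that only the consumption prices are guaranteed to land in [1, 10]; a feed-in tariff below the day's minimum price maps below 1, which preserves its economic disadvantage relative to self-consumption.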
Thermal loads are scaled to represent the percentage of the TES capacity consumed per time step:
$$\dot{Q}_{\mathrm{out},i}^{*} = \frac{\dot{Q}_{\mathrm{out},i} \cdot \Delta t}{C_{\mathrm{stor}} \cdot (T_{\mathrm{ub}} - T_{\mathrm{lb}})},$$
where C stor is the thermal capacity of the TES.
To keep the state vector as small as possible, the electrical load P load and the PV generation P PV were combined into a single net load P net :
$$P_{\mathrm{net}} = P_{\mathrm{PV}} - P_{\mathrm{load}}$$
Collectively, the state vector at time step i is given by:
$$S_{i} = \left( \mathrm{SOC}_{i},\; 96 - i,\; \pi_{i}^{*}, \ldots, \pi_{i+95}^{*},\; f^{*},\; \dot{Q}_{\mathrm{out},i}^{*}, \ldots, \dot{Q}_{\mathrm{out},i+95}^{*},\; P_{\mathrm{net},i}, \ldots, P_{\mathrm{net},i+95},\; B_{1}, B_{2}, B_{3}, B_{4} \right)$$
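Assembled in code, this state vector has 2 + 96 + 1 + 96 + 96 + 4 = 295 entries. The sketch below (function and argument names are hypothetical) zero-pads values beyond the day's forecast horizon, as described above:

```python
import numpy as np

HORIZON = 96

def build_state(i, soc, prices_s, f_s, q_out_s, p_net, season_flags):
    """State S_i: SOC, remaining steps, scaled prices, scaled feed-in
    tariff, scaled thermal loads, net loads, and four season flags.
    Windows starting at step i are zero-padded past the day's end."""
    def window(series):
        w = np.zeros(HORIZON)
        chunk = np.asarray(series)[i:i + HORIZON]
        w[:len(chunk)] = chunk
        return w
    return np.concatenate((
        [soc, HORIZON - i],
        window(prices_s),
        [f_s],
        window(q_out_s),
        window(p_net),
        season_flags,  # e.g. [1, 0, 0, 0] for winter
    ))
```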
The discrete action space from 0 to 100 corresponds to the target SOC ( SOC set ) for the P controller, from which the temperature setpoint T set is computed as
$$T_{\mathrm{set}} = T_{\mathrm{lb}} + \frac{100 - \mathrm{SOC}_{\mathrm{set}}}{100} \cdot (T_{\mathrm{ub}} - T_{\mathrm{lb}})$$
The environment dynamics were modeled using Equation (9) alongside a P-controller. Although each episode consists of 96 discrete time steps representing a full day, the P controller operates at a higher frequency within each simulation step, executing 60 control updates per time interval. The scaled price signal was then incorporated into the reward function, which distinguishes between grid consumption and feed-in:
$$R_{i} = \begin{cases} -P_{\mathrm{grid,pos},i} \cdot \pi_{i}^{*}, & \text{if } P_{\mathrm{grid,pos},i} > 0 \quad (\text{consumption}) \\ P_{\mathrm{grid,neg},i} \cdot f, & \text{if } P_{\mathrm{grid,neg},i} > 0 \quad (\text{feed-in}) \end{cases}$$
where P grid , pos and P grid , neg denote the electrical power consumed from or fed into the grid, respectively, π i * denotes the scaled electricity price at step i, and f is the constant feed-in tariff.
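A sketch of this step reward follows; the sign convention (penalizing grid imports, rewarding feed-in) is inferred from the cost-minimizing objective and is an assumption, not the paper's code:

```python
def step_reward(p_grid_pos, p_grid_neg, price_scaled, feed_in_tariff):
    """Step reward: grid import is penalized at the scaled price,
    feed-in is rewarded at the (constant) feed-in tariff."""
    if p_grid_pos > 0:
        return -p_grid_pos * price_scaled
    if p_grid_neg > 0:
        return p_grid_neg * feed_in_tariff
    return 0.0
```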
The hyperparameters of the RL algorithm are summarized in Table A1. The following steps were taken for training the RL agent:
  • At the start of each episode, a day index was randomly sampled from the training dataset.
  • For the selected day, the corresponding normalized profiles were loaded, including electricity prices π * , thermal load demand Q ˙ out * , electrical load P net , and feed-in tariffs  f * .
  • Seasonal indicators such as winter ( B 1 ), spring ( B 2 ), summer ( B 3 ), and autumn ( B 4 ) flags were extracted for the selected day. This approach ensures diversity across episodes and captures a wide range of operational conditions.
  • The agent was trained over a total of 5000 episodes.
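The episode setup in the first two bullets can be sketched as follows; the dict-of-arrays layout and the names are assumptions (96 samples per day at 15-min resolution):

```python
import random

SAMPLES_PER_DAY = 96

def sample_training_day(profiles, train_day_indices, rng=random):
    """Pick a random training day and return its 96-step slices from
    the full-year profiles (a dict of equal-length sequences)."""
    day = rng.choice(train_day_indices)
    start = day * SAMPLES_PER_DAY
    return {name: series[start:start + SAMPLES_PER_DAY]
            for name, series in profiles.items()}
```

Sampling a fresh random day at the start of every episode is what exposes the agent to the full range of seasonal operating conditions over the 5000 training episodes.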

2.5. Mixed Integer Linear Programming

In this section, the MILP model is introduced and described. The system was modeled and optimized to determine the global optimum of the EC’s operation. In order to ensure a fair comparison with the RL approach, the MILP model was provided with a perfect prediction for each day of operation. This includes complete knowledge of the overall heat demand Q ˙ out , electricity prices π , feed-in tariffs f, ground probe temperatures T GP , and photovoltaic generation P PV . The optimization was carried out over a horizon of N = 97 discrete time steps, resulting in 96 time periods p ∈ P = { 0 , … , N − 1 } between them, covering one full day.
The parameters used for the MILP formulation correspond to Table 3. The objective function minimizes the net electricity cost by considering the grid import power ( P grid , pos ), the electricity price signal π , the feed-in power ( P grid , neg ), and the feed-in tariff f, as well as the number of time steps N. The objective function is formulated as
$$\min \sum_{p=0}^{N-1} \Delta t \cdot \left( P_{\mathrm{grid,pos},p} \cdot \pi_{p} - P_{\mathrm{grid,neg},p} \cdot f_{p} \right)$$
This objective was calculated with a fixed horizon of 24 h, more precisely defined as 96 quarter-hour intervals. Since each day was optimized independently and the total cost was calculated at the end, the following temperature bounds for T stor were considered for consistency:
$$T_{\mathrm{stor}}(0) = \begin{cases} 40\,^{\circ}\mathrm{C} & \text{on the first day} \\ T_{\mathrm{stor}}^{\mathrm{prev}}(N) & \text{otherwise} \end{cases}$$
$$T_{\mathrm{stor}}(N) \begin{cases} = 40\,^{\circ}\mathrm{C} & \text{on the final day} \\ \in [T_{\mathrm{lb}}, T_{\mathrm{ub}}] & \text{otherwise} \end{cases}$$
As seen in the equations, the first day was initialized with a storage temperature T stor of 40 °C. On all following days, except the last one, the optimization began with the final temperature of the previous day, i.e., T stor , N prev , to ensure that no energy is artificially lost or gained. While the optimizer was allowed to choose the end-of-day temperature of each day freely, on the last day, the final storage temperature was fixed to T stor , N = 40 ° C , to maintain consistency with the RL.
All other constraints of the model are listed below and must be satisfied at every timestep on each day. These include operational limits, system dynamics, and technical constraints, all of which ensure the physical and economic feasibility of the optimization results across the full time horizon. The binary variables B pos and B neg are used with the Big-M method to ensure that grid import P grid , pos and grid feed-in P grid , neg cannot be active simultaneously.
$$\forall p \in P: \quad a_{p} = \frac{hA}{C_{\mathrm{stor}}}$$
$$b_{p} = \frac{\dot{Q}_{\mathrm{HE},p} + \dot{Q}_{\mathrm{HP},p} - \dot{Q}_{\mathrm{out},p} + hA \cdot T_{\infty}}{C_{\mathrm{stor}}}$$
$$T_{\mathrm{stor},p+1} = \frac{b_{p}}{a_{p}} + \left( T_{\mathrm{stor},p} - \frac{b_{p}}{a_{p}} \right) \cdot e^{-a_{p} \cdot \Delta t}$$
$$\dot{Q}_{\mathrm{HP},p} = 0.5 \cdot \mathrm{COP}_{\mathrm{Carnot}} \cdot P_{\mathrm{HP},p}$$
$$\dot{Q}_{\mathrm{HE},p} = P_{\mathrm{HE},p}$$
$$P_{\mathrm{grid},p} = P_{\mathrm{grid,pos},p} - P_{\mathrm{grid,neg},p}$$
$$B_{\mathrm{pos},p} + B_{\mathrm{neg},p} \leq 1$$
$$P_{\mathrm{grid,pos},p} \leq B_{\mathrm{pos},p} \cdot M$$
$$P_{\mathrm{grid,neg},p} \leq B_{\mathrm{neg},p} \cdot M$$
$$P_{\mathrm{grid},p} + P_{\mathrm{PV},p} = P_{\mathrm{HP},p} + P_{\mathrm{load},p} + P_{\mathrm{HE},p}$$
     All variables are bounded as follows:
$$\forall p \in P: \quad T_{\mathrm{lb}} \leq T_{\mathrm{stor},p+1} \leq T_{\mathrm{ub}}$$
$$0 \leq P_{\mathrm{grid,pos},p}$$
$$0 \leq P_{\mathrm{grid,neg},p}$$
$$0 \leq P_{\mathrm{HP},p} \leq P_{\mathrm{HP,max}}$$
$$0 \leq \dot{Q}_{\mathrm{HP},p} \leq \dot{Q}_{\mathrm{HP,max}}$$
$$0 \leq P_{\mathrm{HE},p} \leq P_{\mathrm{HE,max}}$$
$$0 \leq \dot{Q}_{\mathrm{HE},p} \leq \dot{Q}_{\mathrm{HE,max}}$$
$$B_{\mathrm{pos},p} \in \{0, 1\}$$
$$B_{\mathrm{neg},p} \in \{0, 1\}$$
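The paper solves the full model with Gurobi. As a portable illustration of the Big-M import/export exclusivity, the sketch below solves a single-period toy instance with SciPy's HiGHS-based `milp` instead; all numbers are made up:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Decision variables: x = [P_pos, P_neg, B_pos, B_neg]
M = 1_000.0                   # Big-M constant
price, tariff = 0.30, 0.08    # EUR/kWh (illustrative)
p_pv, p_load = 3.0, 1.0       # kW: 2 kW of PV surplus to feed in

# Minimize import cost minus feed-in revenue
c = np.array([price, -tariff, 0.0, 0.0])

A = np.array([
    [1.0,  0.0, -M,  0.0],   # P_pos <= M * B_pos
    [0.0,  1.0,  0.0, -M],   # P_neg <= M * B_neg
    [0.0,  0.0,  1.0, 1.0],  # B_pos + B_neg <= 1
    [1.0, -1.0,  0.0, 0.0],  # power balance: P_pos - P_neg = P_load - P_PV
])
balance = p_load - p_pv
constraints = LinearConstraint(
    A,
    lb=[-np.inf, -np.inf, -np.inf, balance],
    ub=[0.0, 0.0, 1.0, balance],
)
res = milp(
    c=c,
    constraints=constraints,
    integrality=np.array([0, 0, 1, 1]),  # last two variables are binary
    bounds=Bounds([0, 0, 0, 0], [np.inf, np.inf, 1, 1]),
)
p_pos, p_neg = res.x[0], res.x[1]
```

With 2 kW of surplus PV, the optimizer exports 2 kW (P_neg = 2, B_neg = 1) and imports nothing, and the binary exclusivity constraint rules out simultaneous import and export.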
As already mentioned in the description of the reinforcement learning approach, the 96 values forecasted for the electricity price π , feed-in tariffs f, overall heat demand Q ˙ out , electrical load P load , and solar generation P PV were assumed to be perfectly accurate under the assumption of perfect prediction. This assumption simplifies the optimization by neglecting forecasting errors, allowing the model to focus solely on the system’s operational optimization. However, in practical applications, forecasting uncertainties do impact the solution quality and must be considered for a more robust approach.

3. Results and Discussion

3.1. Preliminary Results

In this section, the performance of the RL approach is investigated regarding the number of episodes trained. To determine a suitable balance between training time and cost savings, an analysis was conducted, the results of which are shown in Figure 4. For these results, training was performed using 5, 500, 1000, 2000, 3000, 4000, and 5000 episodes to investigate differences in training duration and resulting cost savings.
Cost savings increase significantly with the number of training episodes, gradually reaching a plateau from 3000 episodes onward. The most pronounced cost savings were achieved by the agent trained on 5000 episodes. Furthermore, 3000 episodes yielded cost savings nearly comparable to 5000 episodes while requiring approximately half the training time. Nevertheless, for the remaining analyses, all RL agents were trained with 5000 episodes, as cost-effectiveness was considered more important than shorter training times at the investigated system scale.

3.2. Comparison

In this section, the results of both optimizers are presented and discussed. Figure 5 shows various electric energy flows P within the system, with Panel a displaying five days of electrical power flows under RL-optimized operation and Panel c showing the behavior of the MILP-based system. Power flows into the hub, i.e., photovoltaic generation ( P PV ) and grid import ( P grid , pos ), are represented as positive values. In contrast, power flows leaving the hub, such as the electrical load ( P load ), the heat pump ( P HP ), the heating element ( P HE ), and electricity exported to the grid as feed-in ( P grid , neg ), are represented as negative values. The corresponding real-time electricity price signal and the constant feed-in tariff are shown in Panel b.
Although the general operation patterns appear similar at first glance, several key differences emerge between the two control strategies, most notably in the operation of the auxiliary heating element. The RL agent tends to favor longer periods of moderate operation or avoids activating the heating element altogether. In contrast, the MILP model opts for short, high-intensity bursts, operating the auxiliary heater at full capacity when required. This indicates a more aggressive but shorter-duration heating strategy under MILP control to achieve the global optimum of operation. Both optimization approaches clearly attempt to maximize the self-consumption of PV-generated electricity before feeding into the grid; the constant feed-in tariff of 0.08 €/kWh appears insufficient to incentivize export when local usage is possible. Furthermore, the five-day excerpt illustrates a period of high thermal demand. Both systems frequently operate the heat pump and auxiliary heating element at or near their maximum power levels, indicating a substantial heating demand driven by cold weather conditions (below −4 °C) at the EC site. These conditions correspond to the second percentile of all annually occurring ambient temperatures.
Figure 6 shows the transient energy balance of the TES as a time series for the same period as in Figure 5, with the RL-controlled and MILP-controlled operations in Panels a and c, respectively. Positive flows represent heat entering the sensible storage tank during operation of the heating element or the heat pump; negative flows represent heat drawn from the TES, i.e., the heating load Q̇_out for space heating. The sum of the heat flows, Q̇_charge, reflects the charging power (or discharging power, if negative) of the TES within a period. Panel b shows the storage temperatures of both operation modes in degrees Celsius as well as the corresponding SOC in %.
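The storage dynamics behind Figure 6 can be sketched as a single-node energy balance using the parameters from Table 3. The update rule and helper names below are an illustrative reconstruction, not the paper's exact model code.

```python
# Single-node TES energy balance (parameters from Table 3; model form assumed).
C_STOR = 1.298       # storage capacity, kWh/K
H, A = 0.287, 6.0    # heat transfer coefficient (W/m^2K) and surface area (m^2)
T_AMB = 20.0         # ambient temperature, degC
T_LB, T_UB = 35.0, 55.0
DT = 0.25            # period length, h

def step(t_stor, q_hp, q_he, q_out):
    """Advance the storage temperature by one 15-min period (heat flows in kW)."""
    q_loss = H * A * (t_stor - T_AMB) / 1000.0   # convert W to kW
    q_charge = q_hp + q_he - q_out - q_loss      # net charging power, kW
    return t_stor + q_charge * DT / C_STOR

def soc(t_stor):
    """State of charge relative to the temperature band [T_LB, T_UB]."""
    return (t_stor - T_LB) / (T_UB - T_LB)
```

With the heat pump at full thermal power and a 4 kW load, the storage at 45 °C gains roughly 1.5 K per period, consistent with the charging ramps visible in Panels a and c.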
Starting with the temperature curves in Panel b, the general shapes are quite similar, indicating a similar strategy for both RL and MILP. The main difference is that RL can dip slightly below the lower temperature bound, which triggers the heating element. This small, environment-driven deviation occurs only a few times and does not compromise comfort, as the temperature remains sufficient for effective heating; the MILP, bound by hard temperature constraints, cannot allow this. The general operation in Panels a and c appears similar, as both systems face the same heat load. Over these five days, however, the RL exhibits more pronounced on/off behavior, not using the inverter function of the heat pump but switching it on and off, whereas the MILP utilizes the heat pump's modulation capability, resulting in a more continuous and moderate operation of the storage tank; the RL-optimized TES switches more frequently and abruptly between charging and discharging.
To compare the performance of the reference scenario with the RL and MILP optimization approaches, five key performance indicators (KPIs) were selected, as illustrated in Figure 7. To also evaluate the statistical robustness of the RL approach, ten agents were trained for 5000 episodes each. The mean values and standard deviations of the KPIs are also depicted in Figure 7. Panel a presents the relative deviation of both optimization methods from the reference case in terms of cumulative costs, CO2eq emissions, grid power consumption, self-consumption ratio (SCR), and self-sufficiency ratio (SSR). In addition to these relative deviations, Table 4 provides the corresponding absolute values. Panel b displays the cumulative cost evolution over the 105-day testing period, highlighting the comparative behavior of the MILP and RL approaches.
As seen in Panel a, MILP and RL achieve quite similar results. Both optimizers reduce costs compared with the reference scenario: while the MILP finds the global optimum at a 10.06% reduction, the RL optimization approaches this with a mean of 8.78%. Although emissions were not an optimization target, CO2eq emissions are also reduced, by 4.63% in the RL case and 4.01% in the MILP case. Grid power usage is nearly identical regardless of the optimization strategy. One reason for the emission reductions is the increase in the SCR and SSR: since solar power was considered CO2-neutral, higher self-consumption directly lowers CO2eq emissions. This also explains the better emission result of the RL optimization, which achieved the highest self-consumption rate.
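The SCR and SSR reported in Panel a and Table 4 follow the usual definitions: self-consumed PV energy divided by total PV generation, and by total consumption, respectively. The sketch below computes both from power time series under that assumption; it is not the paper's evaluation code.

```python
def scr_ssr(p_pv, p_cons, dt=0.25):
    """Self-consumption ratio and self-sufficiency ratio from per-period
    PV generation and total consumption (kW); dt is the period length in h.
    Assumes the standard ratio definitions."""
    self_consumed = sum(min(pv, load) for pv, load in zip(p_pv, p_cons)) * dt
    e_pv = sum(p_pv) * dt       # total PV energy, kWh
    e_cons = sum(p_cons) * dt   # total consumed energy, kWh
    return self_consumed / e_pv, self_consumed / e_cons
```

Any increase in the overlap between PV production and consumption raises both ratios at once, which is why SCR and SSR move together in Figure 7.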
The cumulative cost curves in Panel b reveal that the overall shape remains similar across both optimization approaches and the reference scenario. Starting in spring, however, the curves diverge as the optimized strategies accumulate increasing cost savings. This divergence is driven by two opposing trends: higher PV production yields more energy for direct use, while milder temperatures reduce the heating load and thus the overall energy demand. Over the entire test period, MILP and RL achieve total savings of 67.58 € and 58.98 €, respectively, with MILP delivering the best overall result and RL performing comparably close.

4. Conclusions

This study shows that learning-based optimization algorithms like RL can approach the global optimum as determined via MILP to a degree where the difference is negligible in practice. Based on the measured KPIs, RL even outperforms MILP in some areas, including CO2eq emissions, SCR, and SSR, suggesting that it could be a promising approach for real-world applications. These findings are consistent with Langer et al. [31,32], who also compared RL and MILP in similar energy systems and reported comparable performance under realistic synthetic conditions. This study further distinguishes itself by using high-resolution, real-world data, including measured demand, PV generation, and real-time electricity prices, under near-identical conditions for both methods.
Even though the results are promising and the optimization itself is faster with RL than with MILP, some aspects must be considered. RL has two main downsides. First, a large amount of data is needed to train and validate the agent, which requires additional preprocessing such as seasonality features or normalized data. Second, training for 5000 episodes took 2.5 h, so from a cold start to a usable optimizer, MILP was much faster than RL; the reason is the multidimensional state of each training episode combined with the relatively fine temporal resolution of 15 min steps. Another point that cannot be overlooked is the implementation of the auxiliary heating element in the RL. The chosen implementation allows the RL-optimized system to drop up to 1 °C below the hard lower bound of the MILP before the heating element is applied. This gives RL slightly more flexibility overall, but the resulting temperature deviations are mild and affect neither user comfort nor the financial decisions of the optimization: only 8% of cases fall below 34.5 °C and none below 34.0 °C.
Despite these limitations, RL demonstrates strong performance under realistic conditions. With some targeted preprocessing, it is able to achieve near-optimal results and offers the key advantage of real-time control once training is complete. In deployment, the agent can act autonomously and adapt over time through retraining or transfer learning, indicating that RL may be a promising approach for dynamic environments and could potentially be adapted for edge applications, subject to real-world testing. Both approaches should be able to improve the performance of a simple system, so the question might not always be which method can produce the better result, but rather which method is more feasible for real-world applications, as small and fast computers become increasingly available.
Future research could focus on validating these findings in real-world settings, for example, through lab setups using synthetic loads similar to Rohrer et al. [27] or Kepplinger et al. [23]. Expanding the system to include additional components like solar thermal, biomass combined heat and power, or decentralized storage would test both scalability and robustness. Another valuable direction is the integration of real load forecasting and resilience to unexpected events such as sudden demand shifts, system faults, or other uncertainties, as showcased in the works by Guo et al. [29] and Cui et al. [30]. Beyond the immediate results, this work contributes to a growing body of evidence that learning-based control strategies can approach traditional optimization methods under certain real-world constraints. The insights gained from this study can be applied to similar energy systems and extended to other flexibility assets, supporting the transition from academic methods to real-world deployment.
Ultimately, this work delivers one of the first direct, high-resolution comparisons of RL and MILP for optimizing an existing EC using real-world data, PV generation, and market signals. It demonstrates that deep RL is not only a viable alternative to MILP but can, in some cases, even match a carefully tuned MILP model, while offering practical advantages in speed, adaptability, and scalability for potential real-time control of complex energy systems in practice.

Author Contributions

Conceptualization, V.V., P.W. and P.K.; Methodology, V.V., P.W., P.K. and E.E.; Software, V.V. and P.W.; Validation, V.V. and E.E.; Formal analysis, V.V., P.W., P.K. and E.E.; Investigation, V.V. and P.W.; Data curation, V.V.; Writing – original draft, V.V. and P.W.; Writing – review & editing, P.K. and E.E.; Visualization, V.V. and E.E.; Supervision, P.K. and E.E.; Project administration, P.K.; Funding acquisition, P.K. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge the financial support from the Austrian Research Promotion Agency FFG for the Hub4FlECs project (COIN FFG 898053).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors also gratefully acknowledge the support of Ludwig Netzer from Netzer Group for providing real-world data. The authors would like to mention that LLMs have been employed for improving the spelling, grammar, and punctuation of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Additional Configurations

Algorithm A1 P-controller with saturation and auxiliary heater logic
1:  e_i = T_set,i − T_actual,i
2:  Q̇_heat,i = B_0 · e_i
3:  if Q̇_heat,i > Q̇_HP,max then
4:      Q̇_heat,i = Q̇_HP,max
5:  else if Q̇_heat,i < Q̇_HP,min then
6:      Q̇_heat,i = Q̇_HP,min
7:  end if
8:  if T_actual,i < T_lb − 1 then
9:      Q̇_HE,i = Q̇_HE,max
10: else
11:     Q̇_HE,i = 0
12: end if
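Algorithm A1 translates directly into code. The sketch below assumes the parameter values from Table 3 (B_0 = 12, Q̇_HP in [0, 12] kW, Q̇_HE,max = 6 kW, T_lb = 35 °C); the function name is illustrative.

```python
# Reference P-controller with saturation and auxiliary heater logic (Algorithm A1).
B0 = 12.0                        # proportional gain (Table 3)
Q_HP_MIN, Q_HP_MAX = 0.0, 12.0   # heat pump thermal limits, kW
Q_HE_MAX = 6.0                   # auxiliary heater thermal power, kW
T_LB = 35.0                      # storage lower temperature bound, degC

def p_controller(t_set, t_actual):
    """Return (heat pump heating power, auxiliary heater power) for one period."""
    e = t_set - t_actual                             # control error, K
    q_heat = max(Q_HP_MIN, min(B0 * e, Q_HP_MAX))    # saturated P-action
    # The auxiliary heater engages only 1 degC below the lower bound.
    q_he = Q_HE_MAX if t_actual < T_LB - 1.0 else 0.0
    return q_heat, q_he
```

Note the 1 °C margin in the heater condition; this is the mechanism that lets the RL-controlled system dip slightly below the MILP's hard lower bound, as discussed in the Conclusions.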
Table A1. RL Parameters.
Parameter                          Value
Training episodes                  5000
Batch size                         1250
Memory buffer size                 10,000
Update rate τ                      0.005
Adam learning rate                 0.0001
Initial exploration rate ϵ_start   0.9
End exploration rate ϵ_end         0.05
Exploration decay rate d_ϵ         5000
Discount factor γ                  0.999
Neural network layers              3
Layer 1                            (input, 512), ReLU activation
Layer 2                            (512, 512), ReLU activation
Layer 3                            (512, 101), linear activation
Loss function                      Huber loss
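With ϵ_start = 0.9, ϵ_end = 0.05, and d_ϵ = 5000 from Table A1, the exploration rate can be scheduled with the common exponential decay used, e.g., in PyTorch's DQN tutorial. The exact schedule used in this work is not stated, so the formula below is an assumption consistent with the listed parameters.

```python
import math

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 5000

def epsilon(step):
    """Exploration rate after `step` agent decisions (exponential decay assumed)."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)
```

The agent then takes a random action with probability ϵ(step) and the greedy action otherwise, starting almost fully exploratory and settling near 5% random actions late in training.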

References

  1. Seiler, V.; Moosbrugger, L.; Huber, G.; Kepplinger, P. Assessing Model Predictive Control for Energy Communities’ Flexibilities. In Intelligente Energie- und Klimastrategien: Energie–Gebäude–Umwelt; Science Research Pannonia; Holzhausen: Wien, Austria, 2024; pp. 1–22. ISBN 978-3-903207-89-9. [Google Scholar] [CrossRef]
  2. Jaysawal, R.K.; Chakraborty, S.; Elangovan, D.; Padmanaban, S. Concept of net zero energy buildings (NZEB)—A literature review. Clean. Eng. Technol. 2022, 11, 100582. [Google Scholar] [CrossRef]
  3. Riechel, R. Zwischen Gebäude und Gesamtstadt: Das Quartier als Handlungsraum in der lokalen Wärmewende. Vierteljahrsh. Wirtsch. 2016, 85, 89–101. [Google Scholar] [CrossRef]
  4. Kannengießer, T. Bewertung Zukünftiger Urbaner Energieversorgungskonzepte für Quartiere. Ph.D. Thesis, Rheinisch-Westfälische Technische Hochschule Aachen, Aachen, Germany, 2023. [Google Scholar]
  5. Zhang, Q.; Grossmann, I.E. Enterprise-wide optimization for industrial demand side management: Fundamentals, advances, and perspectives. Chem. Eng. Res. Des. 2016, 116, 114–131. [Google Scholar] [CrossRef]
  6. Ren, H.; Gao, W. A MILP model for integrated plan and evaluation of distributed energy systems. Appl. Energy 2010, 87, 1001–1014. [Google Scholar] [CrossRef]
  7. Lindholm, O.; Weiss, R.; Hasan, A.; Pettersson, F.; Shemeikka, J. A MILP Optimization Method for Building Seasonal Energy Storage: A Case Study for a Reversible Solid Oxide Cell and Hydrogen Storage System. Buildings 2020, 10, 123. [Google Scholar] [CrossRef]
  8. Wohlgenannt, P.; Huber, G.; Rheinberger, K.; Kolhe, M.; Kepplinger, P. Comparison of demand response strategies using active and passive thermal energy storage in a food processing plant. Energy Rep. 2024, 12, 226–236. [Google Scholar] [CrossRef]
  9. Costa, T.; Nogueira, T.; Bomtempo, G.; de Souza, E.; Pimentel, B.; Alves, F.; Alves, J.; Ravetti, M. A Hybrid Mixed-Integer Linear Programming and Reinforcement Learning Framework for Integrated Mineral Supply Chain Optimization. SSRN Electron. J. 2025; preprint. ISSN 1876–6102. [Google Scholar] [CrossRef]
  10. Urbanucci, L. Limits and potentials of Mixed Integer Linear Programming methods for optimization of polygeneration energy systems. Energy Procedia 2018, 148, 1199–1205. [Google Scholar] [CrossRef]
  11. Vázquez-Canteli, J.R.; Nagy, Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl. Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
  12. Wang, Z.; Hong, T. Reinforcement learning for building controls: The opportunities and challenges. Appl. Energy 2020, 269, 115036. [Google Scholar] [CrossRef]
  13. Charbonnier, F.; Peng, B.; Vienne, J.; Stai, E.; Morstyn, T.; McCulloch, M. Centralised rehearsal of decentralised cooperation: Multi-agent reinforcement learning for the scalable coordination of residential energy flexibility. Appl. Energy 2025, 377, 124406. [Google Scholar] [CrossRef]
  14. Palma, G.; Guiducci, L.; Stentati, M.; Rizzo, A.; Paoletti, S. Reinforcement Learning for Energy Community Management: A European-Scale Study. Energies 2024, 17, 1249. [Google Scholar] [CrossRef]
  15. Guiducci, L.; Palma, G.; Stentati, M.; Rizzo, A.; Paoletti, S. A Reinforcement Learning Approach to the Management of Renewable Energy Communities. In Proceedings of the 2023 12th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 6–10 June 2023; pp. 1–8. [Google Scholar] [CrossRef]
  16. Pereira, H.; Gomes, L.; Vale, Z. Peer-to-peer energy trading optimization in energy communities using multi-agent deep reinforcement learning. Energy Inform. 2022, 5, 44. [Google Scholar] [CrossRef]
  17. Zhang, S.; Zhang, X.; Zhang, R.; Gu, W.; Cao, G. N-1 Evaluation of Integrated Electricity and Gas System Considering Cyber-Physical Interdependence. In IEEE Transactions on Smart Grid; IEEE: Piscataway, NJ, USA, 2025; p. 1. [Google Scholar] [CrossRef]
  18. Gaggero, G.B.; Piserà, D.; Girdinio, P.; Silvestro, F.; Marchese, M. Novel Cybersecurity Issues in Smart Energy Communities. In Proceedings of the 2023 1st International Conference on Advanced Innovations in Smart Cities (ICAISC), Jeddah, Saudi Arabia, 23–25 January 2023; pp. 1–6. [Google Scholar] [CrossRef]
  19. Gaggero, G.B.; Armellin, A.; Girdinio, P.; Marchese, M. An IEC 62443-Based Framework for Secure-by-Design Energy Communities. IEEE Access 2024, 12, 166320–166332. [Google Scholar] [CrossRef]
  20. Baumann, C.; Wohlgenannt, P.; Streicher, W.; Kepplinger, P. Optimizing heat pump control in an NZEB via model predictive control and building simulation. Energies 2025, 18, 100. [Google Scholar] [CrossRef]
  21. Aguilera, J.J.; Padullés, R.; Meesenburg, W.; Markussen, W.B.; Zühlsdorf, B.; Elmegaard, B. Operation optimization in large-scale heat pump systems: A scheduling framework integrating digital twin modelling, demand forecasting, and MILP. Appl. Energy 2024, 376, 124259. [Google Scholar] [CrossRef]
  22. Kepplinger, P.; Huber, G.; Petrasch, J. Autonomous optimal control for demand side management with resistive domestic hot water heaters using linear optimization. Energy Build. 2015, 100, 50–55. [Google Scholar] [CrossRef]
  23. Kepplinger, P.; Huber, G.; Petrasch, J. Field testing of demand side management via autonomous optimal control of a domestic hot water heater. Energy Build. 2016, 127, 730–735. [Google Scholar] [CrossRef]
  24. Cosic, A.; Stadler, M.; Mansoor, M.; Zellinger, M. Mixed-integer linear programming based optimization strategies for renewable energy communities. Energy 2021, 237, 121559. [Google Scholar] [CrossRef]
  25. Bachseitz, M.; Sheryar, M.; Schmitt, D.; Summ, T.; Trinkl, C.; Zörner, W. PV-Optimized Heat Pump Control in Multi-Family Buildings Using a Reinforcement Learning Approach. Energies 2024, 17, 1908. [Google Scholar] [CrossRef]
  26. Lissa, P.; Deane, C.; Schukat, M.; Seri, F.; Keane, M.; Barrett, E. Deep reinforcement learning for home energy management system control. Energy AI 2021, 3, 100043. [Google Scholar] [CrossRef]
  27. Rohrer, T.; Frison, L.; Kaupenjohann, L.; Scharf, K.; Hergenröther, E. Deep Reinforcement Learning for Heat Pump Control. arXiv 2022, arXiv:2212.12716. [Google Scholar] [CrossRef]
  28. Franzoso, A.; Fambri, G.; Badami, M. Deep reinforcement learning as a tool for the analysis and optimization of energy flows in multi-energy systems. Energy Convers. Manag. 2025, 341, 120095. [Google Scholar] [CrossRef]
  29. Guo, C.; Wang, X.; Zheng, Y.; Zhang, F. Real-Time Optimal Energy Management of Microgrid with Uncertainties Based on Deep Reinforcement Learning. Energy 2022, 238, 121873. [Google Scholar] [CrossRef]
  30. Cui, Y.; Xu, Y.; Li, Y.; Wang, Y.; Zou, X. Deep Reinforcement Learning Based Optimal Energy Management of Multi-Energy Microgrids with Uncertainties. arXiv 2023, arXiv:2311.18327. [Google Scholar] [CrossRef]
  31. Langer, L.; Volling, T. A reinforcement learning approach to home energy management for modulating heat pumps and photovoltaic systems. Appl. Energy 2022, 327, 120020. [Google Scholar] [CrossRef]
  32. Langer, L.; Volling, T. An optimal home energy management system for modulating heat pumps and photovoltaic systems. Appl. Energy 2020, 278, 115661. [Google Scholar] [CrossRef]
  33. EXAA Energy Exchange Austria. Spot Market Prices for Austria: 19.10.2022–19.10.2023. 2023. Hourly Spot Electricity Prices from EXAA for the Austrian Market Covering the Period 19 October 2022 to 19 October 2023. Available online: https://markt.apg.at/transparenz/uebertragung/day-ahead-preise/ (accessed on 22 March 2025).
  34. Illwerke vkw AG. PV-Einspeisetarife Vorarlberg 2025; Illwerke vkw AG: Bregenz, Austria, 2024; Available online: https://www.vkw.at/media/Infoblatt_Photovoltaikanlagen_Einspeisung.pdf (accessed on 22 July 2025).
  35. Ökostrom-Einspeisetarifverordnung 2018 (ÖSET-VO 2018). Bundesgesetzblatt für die Republik Österreich. Version: 2018.–BGBl. II Nr. 408/2017, § 6. Available online: https://www.ris.bka.gv.at/eli/bgbl/II/2017/408 (accessed on 22 July 2025).
  36. Electricity Maps. Austria 19.10.2022–19.10.2023 Carbon Intensity Data (Version 27 January 2025). 2025. Available online: https://www.electricitymaps.com (accessed on 22 July 2025).
  37. GGV Stadtwerke Groß-Gerau Versorgungs GmbH. Standard Load Profiles (SLP)—File: GGV_SLP_1000_MWh_2021_01.xlsx. Standard Load Profile Data Provided by GGV Stadtwerke Groß-Gerau. File version: 2020-09-24. 2021. Available online: https://www.ggv-energie.de/cms/netz/allgemeine-daten/netzbilanzierung-download-aller-profile.php (accessed on 22 October 2024).
  38. GeoSphere Austria. Messstationen Stundendaten v2—ID 1115 Feldkirch Global Radiation Data (10-Minute Resolution), Version: 2024. 2024. Available online: https://data.hub.geosphere.at/dataset/klima-v2-1h (accessed on 22 March 2025). [CrossRef]
  39. Walden, J.V.; Padullés, R. An analytical solution to optimal heat pump integration. Energy Convers. Manag. 2024, 320, 118983. [Google Scholar] [CrossRef]
  40. Towers, M.; Kwiatkowski, A.; Terry, J.; Balis, J.U.; Cola, G.D.; Deleu, T.; Goulão, M.; Kallinteris, A.; Krimmel, M.; KG, A.; et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv 2024, arXiv:2407.17032. [Google Scholar] [CrossRef]
  41. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  42. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. Version: 2025. Available online: https://www.gurobi.com (accessed on 30 April 2025).
  43. Wohlgenannt, P.; Hegenbart, S.; Eder, E.; Kolhe, M.; Kepplinger, P. Energy Demand Response in a Food-Processing Plant: A Deep Reinforcement Learning Approach. Energies 2024, 17, 6430. [Google Scholar] [CrossRef]
  44. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  45. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. arXiv 2015, arXiv:1509.06461. [Google Scholar] [CrossRef]
  46. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2019, arXiv:1509.02971. [Google Scholar] [PubMed]
Figure 1. Schematic of the energy system, including electrical components (grid and PV), as well as thermal components such as the heat pump, the heating element, and the thermal storage tank. Arrows indicate the direction of energy flows. Flows labeled P denote electrical power: solar generation ( P PV ), household load ( P load ), heat pump ( P HP ), heating element ( P HE ), and grid exchange ( P grid ). Flows labeled Q ˙ denote heating power: heat pump ( Q ˙ HP ), heating element ( Q ˙ HE ), heating load ( Q ˙ out ), heat losses ( Q ˙ loss ).
Figure 2. Heating load (kW) of the EC throughout the investigated year, depicted with the applied train-test split.
Figure 3. Architecture of the DQN framework illustrating the interaction between the agent and the environment through states, actions, and rewards. The design incorporates a policy network and a target network for Q-value estimation, experience replay for sample efficiency, and soft target updates to enhance training stability.
Figure 4. Cost savings relative to the reference case and training time of RL with respect to the number of episodes trained. MILP cost savings are depicted for comparison.
Figure 5. Electrical energy flows inside the hub with the price signal. Panel a shows the RL model, and Panel c shows the MILP model. Both panels depict the flows of electrical power from the photovoltaic system P PV and the grid P grid , pos into the node, and the consumption by the load P load , heat pump P HP , and heating element P HE , as well as the feed-in to the grid P grid , neg . Panel b shows the varying electricity price signal and the constant feed-in tariff.
Figure 6. Heat flows in and out of the TES, along with the internal temperatures. Panel a shows the RL model, and Panel c shows the MILP model. Both panels depict heat input to the TES from the heat pump Q ˙ HP and the heating element Q ˙ HE , as well as heat output to the domestic heating load Q ˙ out . Panel b shows the internal TES temperatures, including the upper and lower boundary temperatures ( T ub and T lb ) in °C, along with the corresponding state of charge (SOC) in %.
Figure 7. KPIs in Panel a, depicted as relative deviation from the reference scenario, and cumulative cost curves in Panel b for the reference scenario (black), RL (in blue), and MILP (in orange). The KPIs include total costs in €, CO2eq emissions in kg, grid energy usage in kWh, self-consumption ratio, and self-sufficiency ratio.
Table 1. Overview of the data types used for simulation and optimization.
Variable                                  Data Type    Source
Geothermal probe temperatures (T_GP)      on-site      Measured locally
Heat requirements (Q̇_out)                 on-site      Measured locally
Electricity prices (π)                    historical   EXAA market spot prices [33]
Feed-in tariffs (f)                       historical   Local feed-in tariffs [34,35]
Carbon intensity                          historical   Electricity Maps [36]
Electrical load (non-heating) (P_load)    synthetic    Standard load profiles [37]
Photovoltaic power output (P_PV)          synthetic    GeoSphere Austria [38]
Seasonal classification (B_x)             synthetic    One-hot encoding
Table 2. Seasonal encoding based on calendar months. Each season is represented by a binary feature.
Binary Indicator    Season    Active Months
B_1                 Spring    March, April, May
B_2                 Summer    June, July, August
B_3                 Autumn    September, October, November
B_4                 Winter    December, January, February
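The seasonal one-hot encoding of Table 2 can be sketched as a simple month-to-feature mapping; the function name is illustrative.

```python
def season_one_hot(month):
    """Map a calendar month (1-12) to the binary features (B1, B2, B3, B4) of Table 2."""
    spring = month in (3, 4, 5)
    summer = month in (6, 7, 8)
    autumn = month in (9, 10, 11)
    winter = month in (12, 1, 2)
    return tuple(int(b) for b in (spring, summer, autumn, winter))
```

Exactly one indicator is active per month, giving the agent coarse seasonal context without a continuous date feature.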
Table 3. Model parameters.
Description                            Parameter       Value    Unit
Resolution                             Δt              0.25     h
Storage capacity                       C_stor          1.298    kWh/K
Storage lower temperature bound        T_lb            35       °C
Storage upper temperature bound        T_ub            55       °C
Ambient temperature                    T               20       °C
Min. heating power heat pump           Q̇_HP,min        0        kW
Max. heating power heat pump           Q̇_HP,max        12       kW
Max. heating power heating element     Q̇_HE,max        6        kW
Proportional gain                      B_0             12       –
Equivalent CO2 emissions               CO2eq (Solar)   0        kg CO2eq/kWh
Heat transfer coefficient              h               0.287    W/m²K
TES surface area                       A               6        m²
Table 4. Comparison of absolute KPIs, mean values, and standard deviations shown for 10 trained RL agents.
KPI                  REF        RL                MILP
Costs (€)            671.71     612.73 ± 0.79     604.13
CO2eq (kg)           929.88     886.82 ± 1.15     892.58
Grid Energy (kWh)    1491.63    1481.46 ± 2.35    1488.23
SCR (%)              28.76      33.31 ± 0.10      33.01
SSR (%)              19.58      23.38 ± 0.09      23.14

