Deep Reinforcement Learning for Hybrid Energy Storage Systems: Balancing Lead and Hydrogen Storage

Louis Desportes; Inbar Fijalkow; Pierre Andry

doi:10.3390/en14154706

,

and

Equipes Traitement de l’Information et Systèmes, UMR 8051, National Center for Scientific Research, ENSEA, CY Cergy Paris University, 95000 Cergy-Pontoise, France

^*

Author to whom correspondence should be addressed.

^†

Current address: 6 avenue du Ponceau, 95000 Cergy-Pontoise, France.

Energies2021, 14(15), 4706;https://doi.org/10.3390/en14154706

This article belongs to the Special Issue Machine Learning and Deep Learning for Energy Systems

Version Notes

Order Reprints

Abstract

We address the control of a hybrid energy storage system composed of a lead battery and hydrogen storage. Powered by photovoltaic panels, it feeds a partially islanded building. We aim to minimize building carbon emissions over a long-term period while ensuring that 35% of the building consumption is powered using energy produced on site. To achieve this long-term goal, we propose to learn a control policy as a function of the building and of the storage state using a Deep Reinforcement Learning approach. We reformulate the problem to reduce the action space dimension to one. This highly improves the proposed approach performance. Given the reformulation, we propose a new algorithm, DDPG

_{α_{rep}}

, using a Deep Deterministic Policy Gradient (DDPG) to learn the policy. Once learned, the storage control is performed using this policy. Simulations show that the higher the hydrogen storage efficiency, the more effective the learning.

Keywords:

deep reinforcement learning; hybrid energy storage system; smart building

1. Introduction

Energy storage is a crucial question for the usage of photovoltaic (PV) energy because of its time-varying behavior. In the ÉcoBioH2 project [1], we consider a building with solar panels providing different usages. The building includes a datacenter that is constrained to be powered by solar energy. It has a low carbon footprint building with lead and hydrogen storage capabilities. Our goal is to monitor this hybrid energy storage system with a goal of low carbon impact.

The building [1] is partially islanded with a datacenter that can only be powered by the energy produced by the building’s solar panels. The proportion of energy produced by the PV in the energy consumed by the building, including the datacenter, defines the self-consumption. The EcoBioH2 project requests the self-consumption to be at least 35%. Demand flexibility, where the load is adjusted to meet production, is not an option in this building so that energy storage will be needed to power the datacenter. Daily variations of the energy production can be mitigated using lead or lithium batteries. However, due to their low capacity density such technologies cannot be used for interseasonal storage. Hydrogen energy storage, on the other hand, is a promising solution to this problem, enabling yearly low-volume, high-capacity, low-carbo-emission energy storage. Unfortunately, it is plagued by its low storage efficiency. Combining hydrogen storage with lead batteries in a hybrid energy storage system enables us to leverage the advantages of both energy storages [2]. Hybrid storage has been shown to perform well in islanded emergency situations [3]. Lead batteries can deliver a big load but not for long. On the other hand, hydrogen storage only supports a small load but has a higher capacity than lead or lithium batteries allowing a longer discharge. The question becomes how to monitor the charge and discharge of each storage and to balance between the short-term battery and the long-term hydrogen storage?

We encounter therefore several short and long-term goals and constraints in opposition summarized in Table 1. Minimizing the carbon impact discourages from using batteries, as batteries emit carbon during their lifecycle. It also encourages using

H_{2}

storage when needed, as less carbon is emitted per kW·h than battery storage. The less energy is stored, the less energy is lost in storage efficiency. This results in more energy available to the building. Thus, in the short-term, self-consumption increases. However, the datacenter is not guaranteed to have enough energy available for the long-term. Keeping the datacenter powered by solar energy requires storing as much energy as possible. Nevertheless, some energy is lost during charge and discharge leading to a lower self-consumption. This energy should be stored in the battery first since less energy is lost in efficiency, resulting in higher emissions. Keeping the datacenter powered is a long-term objective as previous decisions impact the current state that constraints our capacity to power the datacenter in the future. Nonetheless, because of their capacities our energy storage systems perform in opposition. Battery storage has a limited capacity. It allows the withstanding of short-term production variations. Hydrogen storage has an enormous capacity. It helps with long-term, interseasonal variations.

Table 1. Contradictory consequences of carbon impact minimization and datacenter powering.

Managing a long-term storage system means that the control system needs to choose actions (charge or discharge and storage type) depending on their long-term consequences. We consider a duration of several months. We want to minimize the carbon impact while having enough energy for a complete year at least, under the constraints of the datacenter being powered by solar energy. Using convex optimization to solve this problem requires precise forecasting of the energy production and consumption for the whole year. One cannot have months of such forecasts in advance [4,5]. In [6], the authors try to minimize the cost and limit their study to 3 days only. Methods based on genetic algorithms, as [7], require a detailed model of the building usages and energy production which is not realistic in our case since all parts are not known in advance. We also want to allow flexible usages. Therefore, we propose to adopt a solution that can cope with light domain expertise. If the input and output data of the problem are accessible, supervised learning and deep learning can be considered [8]. Having contradicting goals with different horizons, reinforcement learning is an interesting approach [9]. The solution we are looking for should provide a suitable control policy for our hybrid storage system. Most reinforcement learning methods quantize the action space to avoid having interdependent action space bounds [10]. However, such a solution comes with a loss in precision in the action’s selection. It requires more data for learning.

Taking into account theses aspects, we address in the sequel our problem formulation allowing the deployment of non-quantized Deep Reinforcement Learning (DRL) [11] to learn the storage decision policy. DRL learns a long-term evaluation of actions and uses it to train an actor that for each state of the building gives the best action. In our case, the action is the charge or discharge of the lead and hydrogen storages. Learning the policy could even improve controlling the efficiency in the short-term [12]. Existing works focus on non-islanded settings [13] where no state causes a failure. Since our building is partially islanded, this approach would yield to a failure where the islanded portion is not powered anymore. Existing DRL for hybrid energy storage systems focuses on minimizing the energy cost [14]. It does not consider the minimization of carbon emission in a partially islanded building.

In this paper, we formulate the carbon impact minimization of the partially islanded building to learn a hybrid storage policy using DRL. We will reformulate this problem to reduce the action space dimension and therefore improve the DRL performance.

The contributions of this paper are as follows:

We redefine the action space so that the action bounds are not interdependent.
We use this reformulation to reduce the action space to a single dimension.
From this analysis, we deduce a fixed up to a projection (but not learned) repartition policy between the lead and hydrogen.
We propose an actor–critic approach to control the partially islanded hybrid energy storage of the building, to be named DDPG $α_{rep}$ .

Simulations will show the importance of the hydrogen efficiency and carbon impact normalization in the reward, for the learned policy to be effective.

2. Problem Statement

In this section, we describe the model used to simulate our building. This model is sketched in Figure 1 and explained next. Action variables are noted in red.

Figure 1. View of our system. lines in green shows the solar-only part and purple lines shows the grid-only part. actions are displayed in red.

2.1. Storages

We use a simplified model of the energy storage elements as they are sufficient to validate the learning approach for our hybrid storage problem. However, the proposed learning approach can use any batteries model or data since the proposed reformulations and learning do not depend on the batteries model. As long as the action is limited to how much we should charge or discharge, any storage model can be used instead. Since we propose a learning approach, the learned policy could be further improved using real data. Both energy storages (lead battery and

H_{2}

) use the same equations:

E_{H_{2}} (t) = E_{H_{2}} (t - 1) + η_{H_{2}} E_{H_{2} i n} (t) - E_{H_{2} o u t} (t)

(1)

with

E_{H_{2}} (t)

the state of health of the

H_{2}

storage at instant t,

η_{H_{2}}

the global (charging electrolyser and discharging proton-exchange membrane fuel cells) efficiency of

H_{2}

storage.

E_{H_{2} i n} (t)

is the charge energy and

E_{H_{2} o u t} (t)

is the energy discharged at instant t. Equation (1) must satisfy the following constraints:

\begin{matrix} 0 \leq E_{H_{2}} (t) \leq E_{H_{2} max} \end{matrix}

(2)

\begin{matrix} 0 \leq E_{H_{2} i n} (t) \leq E_{H_{2} i n max} \end{matrix}

(3)

\begin{matrix} 0 \leq E_{H_{2} o u t} (t) \leq E_{H_{2} o u t max} \end{matrix}

(4)

with

E_{H_{2} max}

,

E_{H_{2} i n max}

and

E_{H_{2} o u t max}

the respective upper bounds for

E_{H_{2}} (t)

,

E_{H_{2} i n} (t)

and

E_{H_{2} o u t} (t)

. To obtain the lead battery equations replace

H_{2}

by

b a t t

in Equations (1)–(4). The lead battery efficiency

η_{b a t t}

covers the whole battery efficiency: charge and discharge

2.2. Solar Circuit

The solar circuit connects elements that manage the solar energy only. The production is provided by solar panels

E_{s o l a r} (t)

. Part of this energy will be stored in short-term (lead battery) or long-term (hydrogen) storage. Part of this energy will be consumed directly by a small datacenter,

E_{D C} (t)

. The solar circuit is not allowed to handle grid electricity. We define

E_{s u r p l u s} (t)

as:

E_{s u r p l u s} (t) = E_{s o l a r} (t) - E_{D C} (t) + E_{b a t t o u t} (t) - E_{b a t t i n} (t) + E_{H_{2} o u t} (t) - E_{H_{2} i n} (t)

(5)

Please note that this equation does not prevent from charging one energy storage by the other. The solar circuit can only give energy to the general circuit, so that:

E_{s u r p l u s} (t) \geq 0

(6)

This constraint (6) ensures that the datacenter can only be provided in solar energy, as is required by our project [1].

E_{s o l a r} (t)

values are computed using irradiance values from [15] and physical properties of our solar panels.

2.3. General Circuit

The building consumption

E_{b u i l d i n g} (t)

values come from EcoBio

H_{2}

technical office study [16]. They take into account the power consumption of the housing, the restaurant, ...and other usages that are hosted by the building. We define

δ E_{r e g u l} (t)

as the difference between

E_{b u i l d i n g} (t)

and

E_{s u r p l u s} (t)

:

δ E_{r e g u l} (t) = E_{b u i l d i n g} (t) - E_{s u r p l u s} (t)

(7)

When

δ E_{r e g u l} (t) > 0

, we define it as the consumption from the electric grid:

E_{g r i d} (t) = max (0, δ E_{r e g u l} (t))

(8)

When

δ E_{r e g u l} (t) < 0

, we define it as the energy discarded since this building is not allowed to give energy back to the grid:

E_{w a s t e} (t) = max (0, - δ E_{r e g u l} (t))

(9)

In reality, the energy discarded will not be produced. This will be done by temporarily disconnecting the solar panels.

We define

E_{g r i d} (t)

and

E_{w a s t e} (t)

in Equations (8) and (9) as they are used in the simulation metrics in Section 5.2. Variables defined previously and in the remaining of this paper are displayed in Table 2, parameters are in Table 3.

Table 2. Nomenclature of variables used.

Table 3. Parameters values used during simulations.

2.4. Long-Term Carbon Impact Minimization Problem

We gather the building consumption, the solar panels production at instant t and the previous stored energy state at

t - 1

variables in a so-called state defined as:

s_{t} = [E_{b u i l d i n g} (t), E_{s o l a r} (t), E_{b a t t} (t - 1), E_{H_{2}} (t - 1)]

(10)

We define the action variables in

a_{t} = [E_{b a t t i n} (t), E_{b a t t o u t} (t), E_{H_{2} i n} (t), E_{H_{2} o u t} (t)]

(11)

to control the energy storage at the current hour t. We define in Equation (12) the instantaneous carbon impact at state

s_{t}

when performing action

a_{t}

as

f (s_{t}, a_{t})

:

\begin{matrix} f (s_{t}, a_{t}) & = C_{s o l a r} E_{s o l a r} (t) + C_{b a t t o u t} E_{b a t t o u t} (t) + C_{b a t t i n} E_{b a t t i n} (t) + C_{H_{2} o u t} E_{H_{2} o u t} (t) + C_{H_{2} i n} E_{H_{2} i n} (t) \\ + C_{g r i d} (t) max (0, E_{b u i l d i n g} (t) + E_{D C} (t) - E_{s o l a r} (t) - E_{b a t t o u t} (t) + E_{b a t t i n} (t) - E_{H_{2} o u t} (t) + E_{H_{2} i n} (t)) \end{matrix}

(12)

with

C_{s o l a r}

the carbon intensity per kW·h from the complete lifecycle of PV usage.

C_{b a t t i n}

,

C_{b a t t o u t}

,

C_{H_{2} i n}

,

C_{H_{2} o u t}

are the carbon intensity from the complete lifecycle per kW·h of respectively lead battery charge, discharge, hydrogen storage charge and discharge.

C_{g r i d} (t)

quantifies the carbon emissions per kW·h associated with energy from the grid. Their values for simulations are provided in Table 3. Our goal is to minimize the long-term carbon impact taking into account the carbon emissions at the current and future states

s_{t}, \dots, s_{t + H}

as induced by the current and future actions

a_{t}, \dots, a_{t + H}

:

a_{t} = arg min_{a_{t}} \sum_{h = 0}^{H} f (s_{t + h}, a_{t + h})

(13)

under the constraints (2), (3), (4) and (6). We call this initial formulation

TwoBatts

.

The challenge comes from our ignorance of the actions that will be taken in the future

a_{t + 1}, \dots, a_{t + H}

. Yet, we need to account for their impact. DRL approaches are meant for such kind of challenges.

3. Problem Reformulations

In this section, we reformulate our problem (13) to simplify its resolution. We consider in particular the reduction of the action space to reduce the complexity and improve the convergence of learning.

3.1. Battery Charge or Discharge

The current formulation of our problem, TwoBatts, allows the policy to charge and discharge a battery simultaneously. We note that the cost function to be minimized (12) is increasing with the different components of

a_{t}

. This leads to multiple actions that, in the same state

s_{t}

, yield to the same

s_{t + 1}

while having different costs. To avoid having to deal with such cases, we impose that the energy storage systems can only be charged or discharged at a given instant t:

E_{b a t t i n} (t) \times E_{b a t t o u t} (t) = 0

(14)

Therefore, we express the charge and discharge of each battery in a single dimension:

\begin{matrix} δ E_{b a t t} (t) : = E_{b a t t o u t} (t) - E_{b a t t i n} (t) \end{matrix}

(15)

\begin{matrix} δ E_{H_{2}} (t) : = E_{H_{2} o u t} (t) - E_{H_{2} i n} (t) \end{matrix}

(16)

We propose to use these new variables as the action space:

a_{t} = [δ E_{b a t t} (t), δ E_{H_{2}} (t)]

(17)

To obtain the new model equations, we replace the following variables in Equations (1)–(12):

\begin{matrix} E_{b a t t o u t} (t) & : = max (δ E_{b a t t} (t), 0) \end{matrix}

(18)

\begin{matrix} E_{b a t t i n} (t) & : = max (- δ E_{b a t t} (t), 0) \end{matrix}

(19)

\begin{matrix} E_{H_{2} o u t} (t) & : = max (δ E_{H_{2}} (t), 0) \end{matrix}

(20)

\begin{matrix} E_{H_{2} i n} (t) & : = max (- δ E_{H_{2}} (t), 0) \end{matrix}

(21)

Thus, we obtain the formulation 2Dbatt of (13) with

\begin{matrix} f (s_{t}, a_{t}) & = C_{s o l a r} E_{s o l a r} (t) \\ + C_{b a t t o u t} max (δ E_{b a t t} (t), 0) + C_{b a t t i n} max (- δ E_{b a t t} (t), 0) \\ + C_{H_{2} o u t} max (δ E_{H_{2}} (t), 0) + C_{H_{2} i n} max (- δ E_{H_{2}} (t), 0) \\ + C_{g r i d} (t) max (E_{b u i l d i n g} (t) + E_{D C} (t) - E_{s o l a r} (t) - δ E_{b a t t} (t) - δ E_{H_{2}} (t), 0) \end{matrix}

(22)

Next, we revisit the constraints with this new action space. When we only charge (

δ E_{H_{2}} (t) = - E_{H_{2} i n} (t)

), straightforward calculations result in (2) being equivalent to

\begin{matrix} - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2}}} & \leq δ E_{H_{2}} (t) \end{matrix}

(23)

and (3) turns into:

\begin{matrix} - E_{H_{2} i n max} \leq δ E_{H_{2}} (t) \end{matrix}

(24)

When we only discharge (

δ E_{H_{2}} (t) = E_{H_{2} o u t} (t)

) (2) becomes:

\begin{matrix} δ E_{H_{2}} (t) & \leq E_{H_{2}} (t - 1) \end{matrix}

(25)

Accordingly, (4) is equivalent to:

\begin{matrix} δ E_{H_{2}} (t) \leq E_{H_{2} o u t max} \end{matrix}

(26)

The battery is constrained by variations of (23)–(26). Both storages are constrained by Equation (6) that turns into:

0 \leq E_{s o l a r} (t) - E_{D C} (t) + δ E_{b a t t} (t) + δ E_{H_{2}} (t)

(27)

The 2Dbatt formulation is the minimization of (13) over (17) constrained by Equations (23)–(26), their battery variant and (27).

3.2. Batteries Storage Repartition

In the 2Dbatt formulation, one storage can discharge when the other is charging which results in a loss of energy. Moreover, the actions bound (27) depends not only on the state but also on the action itself. The bounds are therefore interdependent. If we select an action outside the action bounds, there is a need to project it inside the bounds which is non-trivial because of this interdependence.

To alleviate this problem, we propose to rotate the action space frame. We merge the two action dimensions into the energy storage systems contribution and the contribution repartition defined as:

\begin{matrix} δ E_{s t o r a g e} (t) : = δ E_{b a t t} (t) + δ E_{H_{2}} (t) \end{matrix}

(28)

\begin{matrix} α_{r e p} (t) : = \frac{δ E_{H_{2}} (t)}{δ E_{s t o r a g e} (t)} \end{matrix}

(29)

so that the action becomes

a_{t} = [δ E_{s t o r a g e} (t), α_{r e p} (t)]

.

α_{r e p} (t)

is the proportion of hydrogen in the storing. It is equal to 0 when the sole battery storage is used and to 1 when only the hydrogen storage is used.

α_{r e p} (t)

is bounded between 0 and 1 by definition, so that one energy storage cannot charge the other. Furthermore, we only convert from Repartition to 2Dbatts and not the other way around. This is illustrated in Figure 2. To insert the new variables in the 2Dbatt formulation, we use the following equations:

\begin{matrix} δ E_{b a t t} (t) = (1 - α_{r e p} (t)) \times δ E_{s t o r a g e} (t) \end{matrix}

(30)

\begin{matrix} δ E_{H_{2}} (t) = α_{r e p} (t) \times δ E_{s t o r a g e} (t) \end{matrix}

(31)

Figure 2. Repartition formulation (green),

δ E_{s t o r a g e} (t)

and

α_{r e p} (t)

, in the 2Dbatt (blue) action space. Actions where one storage is charged and the other discharged are highlighted in red.

We transform Equations (23)–(26) using (31):

\begin{matrix} - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2}}} \leq α_{r e p} (t) \times δ E_{s t o r a g e} (t) \end{matrix}

(32)

\begin{matrix} - E_{H_{2} i n max} \leq α_{r e p} (t) \times δ E_{s t o r a g e} (t) \end{matrix}

(33)

\begin{matrix} α_{r e p} (t) \times δ E_{s t o r a g e} (t) \leq E_{H_{2}} (t - 1) \end{matrix}

(34)

\begin{matrix} α_{r e p} (t) \times δ E_{s t o r a g e} (t) \leq E_{H_{2} o u t max} \end{matrix}

(35)

We obtain the battery variant of those equations using (31):

\begin{matrix} - \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t}} \leq (1 - α_{r e p} (t)) \times δ E_{s t o r a g e} (t) \end{matrix}

(36)

\begin{matrix} - E_{b a t t i n max} \leq (1 - α_{r e p} (t)) \times δ E_{s t o r a g e} (t) \end{matrix}

(37)

\begin{matrix} (1 - α_{r e p} (t)) \times δ E_{s t o r a g e} (t) \leq E_{b a t t} (t - 1) \end{matrix}

(38)

\begin{matrix} (1 - α_{r e p} (t)) \times δ E_{s t o r a g e} (t) \leq E_{b a t t o u t max} \end{matrix}

(39)

Moreover, using (28), (27) becomes:

E_{D C} (t) - E_{s o l a r} (t) \leq δ E_{s t o r a g e} (t)

(40)

Equation (40) depends only on one variable,

δ E_{s t o r a g e} (t)

. Using this variable change, we have removed the interdependency of the constraint in (27).

Next, we propose bounds on

δ E_{s t o r a g e} (t)

and

α_{r e p} (t)

that will be critical in the sequel.

Proposition 1.

δ E_{s t o r a g e} (t)

is constrained by

δ E_{s t o r a g e min} (t) \leq δ E_{s t o r a g e} (t) \leq δ E_{s t o r a g e max} (t)

and their values are defined by:

\begin{matrix} δ E_{s t o r a g e min} (t) = max ( & E_{D C} (t) - E_{s o l a r} (t), \\ - E_{b a t t i n max} - E_{H_{2} i n max}, \\ - \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t i n}} - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2} i n}}, \\ - E_{b a t t i n max} - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2} i n}}, \\ - \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t i n}} - E_{H_{2} i n max}) \end{matrix}

(41)

\begin{matrix} δ E_{s t o r a g e max} (t) = min ( & E_{b a t t} (t - 1) + E_{H_{2}} (t - 1), \\ E_{b a t t o u t max} + E_{H_{2}} (t - 1), \\ E_{b a t t} (t - 1) + E_{H_{2} o u t max}, \\ E_{b a t t o u t max} + E_{H_{2} o u t max}) \end{matrix}

(42)

Proof of Proposition 1 is in Appendix A.

Proposition 2.

α_{r e p} (t)

is constrained by

α_{r e p min} (t) \leq α_{r e p} (t) \leq α_{r e p max} (t)

with values are defined by:

α_{r e p min} (t) = \{\begin{matrix} \begin{matrix} max (1 & + \frac{E_{b a t t i n max}}{δ E_{s t o r a g e} (t)}, \\ 1 & + \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t} \times δ E_{s t o r a g e} (t)}) \end{matrix} if δ E_{s t o r a g e} (t) < 0 \\ \begin{matrix} max (1 & - \frac{E_{b a t t} (t) - 1}{δ E_{s t o r a g e} (t)}, \\ 1 & - \frac{E_{b a t t o u t max}}{δ E_{s t o r a g e} (t)}) \end{matrix} if δ E_{s t o r a g e} (t) > 0 \end{matrix}

(43)

α_{r e p max} (t) = \{\begin{matrix} \begin{matrix} min ( & - \frac{E_{H_{2} i n max}}{δ E_{s t o r a g e} (t)}, \\ - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2}} \times δ E_{s t o r a g e} (t)}) \end{matrix} if δ E_{s t o r a g e} (t) < 0 \\ \begin{matrix} min ( & \frac{E_{H_{2}} t - 1}{δ E_{s t o r a g e} (t)}, \\ \frac{E_{H_{2} o u t max}}{δ E_{s t o r a g e} (t)}) \end{matrix} if δ E_{s t o r a g e} (t) > 0 \end{matrix}

(44)

Proof of Proposition 2 is in Appendix B. Please note that when

δ E_{s t o r a g e} (t) = 0

,

α_{r e p} (t)

does not matter. We will set it to

α_{r e p} (t) = 0.5

as a convention.

The interest of bounds (43) and (44) is that they depend on

δ E_{s t o r a g e} (t)

only, whereas the bounds on

δ E_{s t o r a g e} (t)

do not depend on

α_{r e p} (t)

. Thus, given

δ E_{s t o r a g e} (t)

, we only need to decide the contribution of the distribution

α_{r e p} (t)

. The interdependence has been completely removed.

Moreover, we use (30) and (31) to obtain the expression of the modified carbon impact function (12):

\begin{matrix} f (s_{t}, a_{t}) & = C_{s o l a r} E_{s o l a r} (t) \\ + C_{g r i d} (t) max (0, E_{b u i l d i n g} (t) + E_{D C} (t) - E_{s o l a r} (t) - δ E_{s t o r a g e} (t)) \\ + \{\begin{matrix} \begin{matrix} C_{b a t t o u t} \times (1 - α_{r e p} (t)) \times δ E_{s t o r a g e} (t) \\ + & C_{H_{2} o u t} \times α_{r e p} (t) \times δ E_{s t o r a g e} (t) \end{matrix} \\ if δ E_{s t o r a g e} (t) > 0 \\ \begin{matrix} - & C_{b a t t i n} \times (1 - α_{r e p} (t)) \times δ E_{s t o r a g e} (t) \\ - & C_{H_{2} i n} \times α_{r e p} (t) \times δ E_{s t o r a g e} (t) \end{matrix} \\ if δ E_{s t o r a g e} (t) < 0 \end{matrix} \end{matrix}

(45)

The problem (13) with action

a_{t} = [δ E_{s t o r a g e} (t), α_{r e p} (t)]

, the carbon impact (45) and under the constraints (41)–(44) is called the Repartition formulation.

3.3. Repartition Parameter Only

We have noticed that

δ E_{s t o r a g e} (t)

can be seen as a single global storage. To provide energy for as long a duration as possible, i.e., to respect (40), we want to charge as much as possible and discharge only when needed. We call this the frugal policy. It corresponds to

δ E_{s t o r a g e} (t)

being equal to its lower bound:

δ E_{s t o r a g e} (t) = δ E_{s t o r a g e min} (t)

(46)

To reduce even more the action space dimensionality, we propose to use the frugal policy and to focus on learning only

α_{r e p} (t)

the repartition between the lead and hydrogen energy storage systems contribution.

Using this remark, we propose the $α_{rep}$ reformulation with the goal (13), in order to find the single action

a_{t} = α_{r e p} (t)

, given the state

s_{t}

, using the carbon impact (45) and under constraints (43) and (44) with

δ E_{s t o r a g e} (t)

derived in (46). Unless specified, this is the formulation we use in the sequel of this paper.

3.4. Fixed Repartition Policy

In Section 4, we will propose a learning algorithm for the different formulations. To show the interest of learning, we want to compare the learned policies to a frugal policy (46) where

α_{r e p} (t)

is preselected and fixed to a value v. At each instant, we will only verify that

v \in [α_{min} (t), α_{max} (t)]

and project it into this interval otherwise. We call $α_{rep} = v$ the policy where

α_{r e p}

is preset to value v. So that:

α_{r e p} (t) = {projection}_{[α_{r e p m i n} (t), α_{r e p m a x} (t)]} (v)

(47)

In Section 1, we explained that the battery is intended for short-term storage and that the

H_{2}

storage is intended for long-term. Our intuition therefore suggests charging or discharging the lead battery first. This corresponds to a preset value of

α_{rep} = 0

, so that

α_{r e p} (t) = α_{r e p m i n} (t)

.

One should wonder, what is the best preselected

α_{r e p}

? To find it, we simulated 100 different values of

α_{r e p}

between 0 and 1. For each value v, we run a simulation for each day starting at midnight on a looping year 2006. A detailed description of this data is available in Section 5.1. We use the parameters in Table 3 and PV production computed using irradiance data [17,18] in (48):

E_{s o l a r} (t) = P_{s o l a r} (t) η_{s o l a r} η_{s o l a r o p a c i t y} S_{p a n e l s}

(48)

If the simulation does not last the whole year, we reject it (hatched area in Figure 3). Otherwise, we compute the hourly carbon impact:

\sum_{t = 0}^{T} \frac{f (s_{t}, a_{t})}{T}

(49)

with T the number of hours in 2006. This hourly impact is averaged over 365 different runs, each starting at midnight, one for each day of 2006. Figure 3 shows the carbon impact versus

α_{r e p}

. The

α_{r e p}

value that minimizes the average hourly impact while lasting the whole year is therefore

α_{rep} = 0.2

. It will be used for comparison.

Figure 3. Mean impact versus

α_{r e p}

preset. Hatched area corresponds to rejected

α_{r e p}

values where the policy does not last the whole year.

4. Learning the Policy with DDPG

In the reformulation $α_{rep}$ reformulation, we want to select

a_{t}

given the state

s_{t}

. The function that provides

a_{t}

given

s_{t}

is referred to as the policy. We want to learn the policy using DRL with an actor–critic policy-based approach: the Deep Deterministic Policy Gradient (DDPG) [19]. Experts may want to skip Section 4.2 and Section 4.3.

4.1. Actor–Critic Approach

We call

e n v

for the environment, the set of equations: (1) and its battery variant that allows the obtaining of

s_{t + 1}

from

a_{t}

and

s_{t}

,

s_{t + 1} = e n v . s t e p (s_{t}, a_{t})

. Its corresponding reward, the short-term evaluation function, is defined as a function of

s_{t}

and

a_{t}

,

r_{t} = R (s_{t}, a_{t})

. We use [19], an actor–critic approach, where the estimated best policy for a given environment

s_{t + 1} = e n v . s t e p (s_{t}, a_{t})

is learned through a critic as in Figure 4. The critic transforms this short-term evaluation into a long-term evaluation, the Q-values

Q (s_{t}, a_{t})

, through learning. It will be detailed in Section 4.2. The actor

π_{θ} : s_{t} \to a_{t}

is the function that selects the best possible action

a_{t}

possible. It uses the critic as to know what is the best action in a given state (as detailed in Section 4.3).

Figure 4. Overview of the actor–critic approach. Curved arrows indicate learning. Time passing with

t = t + 1

is displayed

Z^{- 1}

.

In Section 2.4, we set our objective to minimize the long-term carbon impact (13). However, in reinforcement learning we try to maximize a score, defined as the sum of all rewards:

\sum_{t_{0}}^{T} r_{t}

(50)

To remove this difference, we maximize the negative carbon impact

\sum - f (s_{t}, a_{t})

. However, the more negative terms you add, the lower the sum is. This leads to a policy trying to stop the simulation as fast as possible, in contradiction to our goal to always provide the datacenter in energy. To counter this, we propose, inspired by [20], to add a living incentive of 1 at each instant. Therefore, we propose to define the reward as:

r_{t} = R (s_{t}, a_{t}) = 1 - \frac{f (s_{t}, a_{t})}{{max}_{a} f (s_{t}, a)}

(51)

The reward accounting for the carbon impact is now normalized between 0 and 1 so that the reward is always positive. Still in this reward the normalization depends on the state

s_{t}

. When the normalization depends on the state, two identical actions can have different rewards associated with them. Therefore, the reward is not proportional to the carbon impact (45) making the reward harder to interpret. To alleviate this problem, we propose to use the global maximum instead of the worst case for the current state:

r_{t} = R (s_{t}, a_{t}) = 1 - \frac{f (s_{t}, a_{t})}{{max}_{s, a} f (s, a)}

(52)

By convention

r_{t}

is set to zero after the simulation ends.

The actor and critic are parameterized using artificial neural networks, respectively denoted

θ

and

ϕ

. They will be learned alternatively and iteratively. Two stabilization networks are also used for the critic supervision with weights

θ_{o l d}

and

ϕ_{o l d}

.

4.2. Critic Learning

Now that we have defined a reward, we can use the critic to transform it into a long-term metric. As time goes, we have less and less trust in the future. Therefore, we discount the future rewards using a discount factor

0 < γ < 1

. We define the critic

Q : s_{t}, a_{t} \to \sum_{k = 0}^{+ \infty} γ^{k} r_{t + k}

. It estimates the weighted long-term returns of taking an action

a_{t}

in a given state

s_{t}

. This weighted version of (50) also allows the binding of the infinite sum to learn it. Q can be expressed recursively:

\begin{matrix} Q (s_{t}, a_{t}) & = \sum_{k = 0}^{+ \infty} γ^{k} r_{t + k} \\ = r_{t} + γ \sum_{k = 0}^{+ \infty} γ^{k} r_{t + 1 + k} \\ Q (s_{t}, a_{t}) & = r_{t} + γ Q (s_{t + 1}, a_{t + 1}) \end{matrix}

(53)

We learn the Q-function using an artificial neural network of weights

ϕ

. At the

i th

iteration of our learning algorithm and for a given value of

ϕ_{o l d i}

and

θ_{o l d i}

, we define a reference value

y_{t}

from the recursive expression (53). Since we do not know

a_{t + 1}

, we need to select the best action possible at

t + 1

. The best estimator of this action is provided by the policy

π_{θ_{o l d i}}

, so that we define the reference as:

y_{t} = r_{t} + γ Q_{ϕ_{o l d i}} (s_{t + 1}, π_{θ_{o l d i}} (s_{t + 1}))

(54)

where

a_{t + 1}

has been estimated by

π_{θ_{o l d i}} (s_{t + 1})

.

The squared difference between the estimated value

Q_{ϕ} (s_{t}, a_{t})

, and the reference value

y_{t}

[21] is defined as:

J (ϕ_{i}) = \sum_{(s_{t}, a_{t}, r_{t}, s_{t + 1}) \in D} {(Q_{ϕ_{i}} (s_{t}, a_{t}) - y_{t})}^{2}

(55)

To update

ϕ_{i}

, we minimize

J (ϕ_{i})

in (55) using a simple gradient descent:

ϕ_{i + 1} = ϕ_{i} - μ \nabla J (ϕ_{i})

(56)

where

\nabla J (ϕ_{i})

is the gradient of

J (ϕ)

in (55) with respect to

ϕ

taken at the value

ϕ_{i}

.

μ

is a small positive step-size. To stabilize the learning [19] suggests updating the reference network

ϕ_{o l d}

slower, so that:

ϕ_{o l d i + 1} = τ ϕ_{i} + (1 - τ) ϕ_{o l d i} with 0 < τ ≪ 1

(57)

ϕ_{o l d 0} = ϕ_{0}

at weight initialization.

4.3. Actor Learning

Since we alternate the updates of the critic and of the actor, we address next the learning of the actor. To learn what is the best action to select, we need a loss function that grades different actions

a_{t}

. Using the reward function (52), as a loss function, the policy would select the best short-term, instantaneous, action. Since the critic

Q (s_{t}, a_{t})

depends on the action

a_{t}

, we replace

a_{t}

by

π_{θ} (s_{t})

. At iteration i, to update the actor network

θ_{i}

, we use the gradient ascent of the average

Q_{ϕ_{i}} (s t, π_{θ} (s t))

taken at

θ = θ_{i}

. This can be expressed as:

θ_{i + 1} = θ_{i} + λ {\nabla \sum}_{(s_{t}, a_{t}, r_{t}, s_{t + 1}) \in D} Q_{ϕ_{i}} (s_{t}, π_{θ_{i}} (s_{t}))

(58)

where

λ

is a small positive step-size.

To learn the critic a stabilized actor is used. Like the stabilized critic,

π_{θ_{o l d}}

is updated by:

θ_{o l d i + 1} = τ θ_{i} + (1 - τ) θ_{o l d i} with 0 < τ ≪ 1

(59)

with

θ_{o l d 0} = θ_{0}

at the beginning.

During learning, an Ornstein–Uhlenbeck noise [22], n, is added to the policy decision to make sure we explore the action space:

a_{t} = π_{θ i} (s_{t}) + n

(60)

4.4. Proposition: DDPG $α_{rep}$ Algorithm to Learn the Policy

From the previous section, we propose the DDPG

α_{rep}

Algorithm 1. This algorithm alternates the learning of the networks of the actor and of the critic. We select randomly the initial instant t to avoid learning time patterns. We start each run with full energy storage.

Once learned, we use the last weights

θ_{i}

of the neural network parameterizing the actor to select the action using

π_{θ_{i}} : s_{t} \to a_{t}

directly.

To learn well an artificial neural network needs the different samples of learning data to be uncorrelated. In reinforcement learning two consecutive states tends to be close, i.e., correlated. To overcome this problem, we store all experiences

(s_{t}, a_{t}, r_{t}, s_{t + 1})

in a memory and use a tiny random subset as the learning batch [23]. The random selection of a batch from the memory is called

s a m p l e

.

Algorithm 1: DDPG

α_{rep}

5. Simulation

We have just proposed DDPG $α_{rep}$ to learn how to choose

α_{r e p} (t)

with respect to the environment. In this section, we display the simulations settings and results.

5.1. Simulation Settings

Production data are computed using (48) from real irradiance data [17,18] measured at the building location in Avignon, France. The building has

S_{p a n e l s} = 1000 m^{2}

of solar panels with

η_{s o l a r o p a c i t y} = 60 %

opacity and an efficiency of

η_{s o l a r} = 21 %

. Those solar panels can produce a maximum of

E_{s o l a r M a x} = 185

kW·h per hour.

Consumption data comes from projections of the engineering office [16]. It consists of powering housing units with an electricity demand fluctuating daily between 30 kW·h (1 a.m. to 6 a.m.) and 90 kW·h. The weekly variations of the consumption varies with a factor between 1 and 1.4 during awake hours between workdays and the weekend. There is little interseasonal variation, standard deviation of 0.6 kW·h (0.01% of yearly mean) between seasons, as heating uses wood pellets. In those simulations, the datacenter is consuming a fixed amount of

E_{D C m a x} = 10

kW·h. The datacenter consumption adds up to 87.6 MW·h per year, around 17% of the 496 MW·h that the entire building consumes in a year. To power this datacenter, our building’s solar panels produce an average of 53.8 kW·h/h during the

12.7

sunny hours on average day counts, for a yearly total of 249 MW·h/year. This covers a maximum of 2.8 times the consumption of our datacenter, but lowers to 99% if all energy goes through the hydrogen storage. The same solar production covers at most 50% of the building yearly consumption. When accounting for hydrogen efficiency, the solar production covers at most 17% of the building consumption.

We only use half of the lead battery capacity to preserve the battery health longer

E_{b a t t M a x} = 650 / 2 = 325

kW·h. The lead battery carbon intensity is split between the charge and discharge

C_{b a t t O u t} = 172 / 2 = 86 {gCO}_{2} eq / kW

·h. Since the charge quantity comes before the efficiency, its carbon intensity must account for efficiency:

C_{b a t t I n} = C_{b a t t O u t} η_{b a t t} = 86 \times 0.81 = 68.66 {gCO}_{2} eq / kW

·h. The carbon intensity of the electrolysers, accounting for the efficiency, is used for

C_{H_{2} i n} = 5 \times η_{H_{2}} = 1.75 {gCO}_{2} eq / kW

·h. The carbon intensity of the fuel cells corresponds to

C_{H_{2} o u t} = 5 {gCO}_{2} eq / kW

·h.

η_{H_{2}}

account for both the electrolysers and fuel cells efficiency.

C_{g r i d} = 53 {gCO}_{2} eq / kW

·h uses the average French grid carbon intensity. All those values are reported in Table 3.

The simulations use an hourly simulation step t.

We train on the production data from year 2005, validate and select hyperparameters, using best score (50) values, on the year 2006 and test finally on year 2007. Each year lasts 8760 h.

To improve learning, we normalize between

- 1

and 1 all state and action inputs and outputs. For a given value d bounded between

d_{m i n}

and

d_{m a x}

:

\begin{matrix} d_{n o r m} = 2 \times \frac{d - d_{m i n}}{d_{m a x} - d_{m i n}} - 1 \end{matrix}

(61)

d_{n o r m}

is then used as an input for the networks.

To accelerate the learning, all gradient descents are performed using Adam [24]. During training, we use the following step sizes

μ = 10^{- 3}

to learn the critic and

λ = 10^{- 4}

for the actor. For the stabilization networks,

τ = 0.001

. To learn, we sample batches of 64 experiences from a memory of

10^{6}

experiences. The actor and critic both have 2 hidden layers with a ReLU activation function. Hidden layers have respectively 400 and 300 units in them. The output layer uses a tanh activation to bound its output. The discount factor,

g a m m a

in (54) is optimized as a hyperparameter between 0.995 and 0.9999. We found the best value for the discount factor to be 0.9979.

5.2. Simulation Metrics

We name duration and note N the average length of the simulations. When all simulations last the whole year, the hourly carbon impact is evaluated as in (49). To select the best policy, the average score is computed using (50). Self-consumption, defined as the energy provided by the solar panels, directly or indirectly using one of the storages, over the consumption, is computed using:

s = \frac{\sum_{t = 0}^{T} E_{s u r p l u s} (t) + E_{D C} (t) - E_{w a s t e} (t)}{\sum_{t = 0}^{T} E_{b u i l d i n g} (t) + E_{D C} (t)}

(62)

Per the ÉcoBioH2 project, the goal is to reach 35% of self-consumption:

s \geq 0.35

.

5.3. Simulation Results

The following learning algorithms are simulated on data from Avignon from 2007 and our building:

DDPGTwoBatts: DDPG with actions $a t = [E_{b a t t i n} (t), E_{b a t t o u t} (t), E_{H_{2} i n} (t), E_{H_{2} o u t} (t)]$
DDPGRepartition: DDPG with actions $a t = [δ E_{s t o r a g e} (t), α_{r e p} (t)]$
proposed DDPG $α_{rep}$ with action $a t = [α_{r e p} (t)]$

where DDPGTwoBatts and DDPGRepartition are algorithms similar to DDPG

α_{rep}

with action spaces of the corresponding formulations respectively (11) and (17). The starting time is randomly selected from any hour of the year.

To test the learned policies, the duration, hourly impact (49), score (50) and self-consumption (62) metrics are computed on the 2007 irradiance data and averaged over all runs. We compute those metrics over 365 different runs, starting each 2007 day at midnight. For the sake of comparison, we also compute those metrics when applicable for the preselected values

α_{rep} = 0

and

α_{rep} = 0.2

using (47) on the same data. Recall that the fixed

α_{r e p}

values are bounded to (43) and (44) to ensure the long-term duration.

The metrics over the different runs are displayed in Table 4.

Table 4. Results computed on the year 2007. n.a.: not applicable.

We can see in Table 4 that DDPGTwoBatts and DDPGRepartition do not last the whole year. This shows the importance of our reformulations to reduce the action space dimensions. We observe that all policies using the $α_{rep}$ reformulation last the whole year (

N = 8760

). This validates our proposed reformulations and dimension reduction.

α_{rep} = 0.2

achieves the lowest carbon impact; however, it cannot ensure the target of self-consumption. On the other hand,

α_{rep} = 0

achieves the target self-consumption at the price of a higher carbon impact. The proposed DDPG

α_{rep}

provides a good trade-off between the two by adapting

α_{r e p} t

to the state

s t

. It reaches the target self-consumption minus 0.1% and lowers the carbon impact with respect to

α_{rep} = 0

. The carbon emission gain over the intuitive policy

α_{rep} = 0

, using hydrogen only as a last resort, is of 43.8

\times 10^{3}

gCO

_{2}

eq/year. This shows the interest of learning the policy once the problem is well formulated.

5.4. Reward Normalization Effect

In Section 4.1, we presented two ways to normalize the carbon impact in the reward. In this section, we show that the proposed global normalization (52) yields better results than the local state-specific normalization (51).

In Table 5, we display the duration for both normalizations. We see that policies that use the locally normalized reward have a lower duration than the ones using a globally normalized reward. This confirms that the local normalization is harder to learn as two identical actions have different rewards in different states.

Table 5. Learned policies duration depending on the reward normalization: local or global. Using simulations on 2007 test dataset.

Therefore, the higher dynamic of the local normalization is not worth the variability induced by this normalization. This validates our choice of the global normalization (52) for the proposed DDPG

α_{rep}

algorithm.

5.5. Hydrogen Storage Efficiency Impact

In our simulations, we have seen the sensibility of our carbon impact results to the parameters in Table 3. Indeed, the efficiency of the storage has a great impact on the system behavior. Hydrogen storage yields lower carbon emissions when its efficiency

η_{H_{2}}

is higher than some threshold. The greater is

η_{H_{2}}

, the greater

α_{r e p} t

could be and so the range for adapting

α_{r e p} t

via learning is more important. To find the threshold in

η_{H_{2}}

, we first compute the total carbon intensity of storing one kW·h in a given storage, including the carbon intensity of energy production. For

H_{2}

, we obtain:

C_{H_{2} t o t} = C_{H_{2} O u t} + \frac{C_{H_{2} I n} + C_{s o l a r}}{η_{H_{2}}} {gCO}_{2} eq / kW \cdot h

(63)

We display the value of (63) of both storages in Figure 5 with respect to

η_{H_{2}}

, the other parameters are taken from Table 3. When

C_{H_{2} t o t} < C_{b a t t t o t}

learning is useful since the policy must balance the lower carbon impact (using the hydrogen storage) with the low efficiency (using the battery storage). When

C_{H_{2} t o t} > C_{b a t t t o t}

the learned policy converges to

α_{rep} = 0

, as both objectives (minimizing the carbon impact and continuous powering of the datacenter) align.

Figure 5. The total hydrogen storage impact depending on the efficiency of storage.

We calculate from (63) and its battery variant, the threshold point where

C_{H_{2} t o t} = C_{b a t t t o t}

to be at efficiency:

η_{H_{2}}^{*} = \frac{C_{H_{2} I n} + C_{s o l a r}}{C_{b a t t t o t} - C_{H_{2} O u t}}

(64)

Using values in Table 3 on (64), hydrogen improves the carbon impact only when

η_{H_{2}} > η_{H_{2}}^{*} = 0.24

. The current value is

η_{H_{2}} = 0.35 > 0.24

, learning is also useful as shown in the simulations Table 4. We can also suggest that when the hydrogen storage efficiency will improve in the future, the impact of learning will be even more important.

6. Conclusions

We have addressed the problem of monitoring the hybrid energy storage of a partially islanded building with a goal of carbon impact minimization and self-consumption. We have reformulated the problem to reduce the number of components of the action to one,

α_{r e p} (t)

, the proportion of hydrogen storage given the building state

s_{t}

. To learn the policy,

π_{θ} : s_{t} \to α_{r e p} (t)

, we propose a new DRL algorithm using a reward tailored to our problem, DDPG

α_{rep}

. The simulation results show that when the hydrogen storage efficiency is large enough, learning of

α_{r e p} (t)

allows a decrease to the carbon impact while lasting at least one year and maintaining

35 %

of self-consumption. As hydrogen storage technologies improve, the proposed algorithm should have even more impact.

Learning the policy using the proposed DDPG

α_{rep}

can also be done when the storage model includes non-linearities. Learning can also adapt to climate changes in time using more recent data for learning. To measure such benefits, we will use in the future the ÉcoBioH2 real data to be measured in the sequel of the project. Learning from real data will reduce the gap between the model and the real system. Reducing this gap should improve performance. The proposed approach could also be used to optimize other environmental metrics with a multi-objective cost in

f (s_{t}, a_{t})

.

With our current formulation, policies cannot assess what day and hour it is as they only have two state variables to compute the hour:

E_{s o l a r} (t)

and

E_{b u i l d i n g} (t)

. They cannot differentiate between 1 a.m. and 4 a.m. at night as those two times have the same consumption and no PV production. They also cannot differentiate between a cloudy summer and a clear winter as production and consumption are close in those two cases. In the future, we will consider taking into account the knowledge of the current time to enable the learned policy to adapt its behavior to the time of the day and month of the year.

Author Contributions

Conceptualization, L.D., I.F. and P.A.; methodology, L.D., I.F. and P.A.; software, L.D.; validation, L.D. and I.F.; formal analysis, L.D.; investigation, L.D.; resources, L.D.; data curation, L.D.; writing—original draft preparation, L.D.; writing—review and editing, I.F. and P.A.; visualization, L.D.; supervision, I.F and P.A.; project administration, I.F.; funding acquisition, I.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by French PIA3 ADEME (French Agency For the Environment and Energy Management) for the ÉcoBioH2 project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available irradiance datasets were analyzed in this study. This data can be found here: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-pay (accessed on: 1 October 2020) based on [18]. Restrictions apply to the availability of consumption data. Data were obtained from ÉcoBio via ZenT and are available at https://zent-eco.com/ with the permission of ZenT.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

DDPG	Deep Deterministic Policy Gradient
DRL	Deep Reinforcement Learning
PV	PhotoVoltaic

Appendix A. Proof of Proposition 1

Considering (23)–(26) for

H_{2}

and batt for all cases we find more upper and lower bounds on

δ E_{s t o r a g e} (t)

.

Appendix A.1. When δE_storage(t) < 0

Using (28), (24) and its battery variant:

\begin{matrix} δ E_{s t o r a g e} (t) & \geq - E_{b a t t i n max} - E_{H_{2} i n max} \end{matrix}

(A1)

Using (28), (23) and its battery variant:

δ E_{s t o r a g e} (t) \geq - \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t}} - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2}}}

(A2)

Using (28), (23) and the battery variant of (24):

\begin{matrix} δ E_{s t o r a g e} (t) & \geq - E_{b a t t i n max} - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2}}} \end{matrix}

(A3)

Using (28), (24) and the battery variant of (23):

\begin{matrix} δ E_{s t o r a g e} (t) & \geq - \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t}} - E_{H_{2} i n max} \end{matrix}

(A4)

We obtain the global lower bound (41) by obtaining the min of (40), (A1)–(A4).

Appendix A.2. When δE_storage(t) > 0

Using (28), (25) and its battery variant:

\begin{matrix} δ E_{s t o r a g e} (t) & \leq E_{b a t t} (t - 1) + E_{H_{2}} (t - 1) \end{matrix}

(A5)

Using (28), (26) and its battery variant:

\begin{matrix} δ E_{s t o r a g e} (t) & \leq E_{b a t t o u t max} + E_{H_{2} o u t max} \end{matrix}

(A6)

Using (28), (25) and the battery variant of (26):

\begin{matrix} δ E_{s t o r a g e} (t) & \leq E_{b a t t o u t max} + E_{H_{2}} (t - 1) \end{matrix}

(A7)

Using (28), (26) and the battery variant of (25):

\begin{matrix} δ E_{s t o r a g e} (t) & \leq E_{b a t t} (t - 1) + E_{H_{2} o u t max} \end{matrix}

(A8)

We obtain the global upper bound (42) by obtaining the max of (A5)–(A8).

Appendix B. Proof of Proposition 2

Appendix B.1. When δE_storage(t) > 0

Given (29) and (26)

\begin{matrix} α_{r e p} (t) & \leq \frac{E_{H_{2} o u t max}}{δ E_{s t o r a g e} (t)} \end{matrix}

(A9)

Given (29) and (25)

\begin{matrix} α_{r e p} (t) & \leq \frac{E_{H_{2}} (t - 1)}{δ E_{s t o r a g e} (t)} \end{matrix}

(A10)

From (30) and the battery variant of (26)

\begin{matrix} (1 - α_{r e p} (t)) δ E_{s t o r a g e} (t) & \leq E_{b a t t o u t max} \\ 1 - α_{r e p} (t) & \leq \frac{E_{b a t t o u t max}}{δ E_{s t o r a g e} (t)} \\ 1 - \frac{E_{b a t t o u t max}}{δ E_{s t o r a g e} (t)} & \leq α_{r e p} (t) \end{matrix}

(A11)

From (30) and the battery variant of (25)

\begin{matrix} (1 - α_{r e p} (t)) δ E_{s t o r a g e} (t) & \leq E_{b a t t} (t - 1) \\ 1 - α_{r e p} (t) & \leq \frac{E_{b a t t} (t - 1)}{δ E_{s t o r a g e} (t)} \\ 1 - \frac{E_{b a t t} (t - 1)}{δ E_{s t o r a g e} (t)} & \leq α_{r e p} (t) \end{matrix}

(A12)

Appendix B.2. When δE_storage(t) < 0

Given (31) and (24)

\begin{matrix} α_{r e p} (t) δ E_{s t o r a g e} (t) & \geq - E_{H_{2} i n max} \\ α_{r e p} (t) & \leq - \frac{E_{H_{2} i n max}}{δ E_{s t o r a g e} (t)} \end{matrix}

(A13)

Given (31) and (23)

\begin{matrix} α_{r e p} (t) δ E_{s t o r a g e} (t) & \geq - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2}}} \\ α_{r e p} (t) & \leq - \frac{E_{H_{2} max} - E_{H_{2}} (t - 1)}{η_{H_{2}} δ E_{s t o r a g e} (t)} \end{matrix}

(A14)

From (30) and (24) battery variant

\begin{matrix} (1 - α_{r e p} (t)) δ E_{s t o r a g e} (t) & \geq - E_{b a t t i n max} \\ 1 - α_{r e p} (t) & \leq \frac{- E_{b a t t i n max}}{δ E_{s t o r a g e} (t)} \\ 1 + \frac{E_{b a t t i n max}}{δ E_{s t o r a g e} (t)} & \leq α_{r e p} (t) \end{matrix}

(A15)

From (30) and (23) battery variant

\begin{matrix} (1 - α_{r e p} (t)) δ E_{s t o r a g e} (t) & \geq - \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t}} \\ 1 - α_{r e p} (t) & \leq - \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t} δ E_{s t o r a g e} (t)} \\ 1 + \frac{E_{b a t t max} - E_{b a t t} (t - 1)}{η_{b a t t} δ E_{s t o r a g e} (t)} & \leq α_{r e p} (t) \end{matrix}

(A16)

We obtain the global upper bound (44) by obtaining the min of (A9) and (A10) when

δ E_{s t o r a g e} (t) > 0

and the max of (A13), (A14) when

δ E_{s t o r a g e} (t) < 0

. We obtain the global lower bound (43) by obtaining the min of (A11) and (A12) when

δ E_{s t o r a g e} (t) > 0

and the max of (A15), (A16) when

δ E_{s t o r a g e} (t) < 0

.

References

PIA3 ADEME (French Agency for the Environment and Energy Management). Project ÉcoBioH₂. 2019. Available online: https://ecobioh2.ensea.fr (accessed on 2 June 2021).
Bocklisch, T. Hybrid energy storage systems for renewable energy applications. Energy Procedia 2015, 73, 103–111. [Google Scholar] [CrossRef] [Green Version]
Pu, Y.; Li, Q.; Chen, W.; Liu, H. Hierarchical energy management control for islanding DC microgrid with electric-hydrogen hybrid storage system. Int. J. Hydrogen Energy 2018, 44, 5153–5161. [Google Scholar] [CrossRef]
Diagne, M.; David, M.; Lauret, P.; Boland, J.; Schmutz, N. Review of solar irradiance forecasting methods and a proposition for small-scale insular grids. Renew. Sustain. Energy Rev. 2013, 27, 65–76. [Google Scholar] [CrossRef] [Green Version]
Desportes, L.; Andry, P.; Fijalkow, I.; David, J. Short-term temperature forecasting on a several hours horizon. In Proceedings of the ICANN, Munich, Germany, 17–19 September 2019. [Google Scholar] [CrossRef] [Green Version]
Zhang, Z.; Nagasaki, Y.; Miyagi, D.; Tsuda, M.; Komagome, T.; Tsukada, K.; Hamajima, T.; Ayakawa, H.; Ishii, Y.; Yonekura, D. Stored energy control for long-term continuous operation of an electric and hydrogen hybrid energy storage system for emergency power supply and solar power fluctuation compensation. Int. J. Hydrogen Energyy 2019, 44, 8403–8414. [Google Scholar] [CrossRef]
Carapellucci, R.; Giordano, L. Modeling and optimization of an energy generation island based on renewable technologies and hydrogen storage systems. Int. J. Hydrogen Energy 2012, 37, 2081–2093. [Google Scholar] [CrossRef]
Bishop, C.M. Pattern Recognition and Machine Learning, 1st ed.; Springer: New York, NY, USA, 2006; pp. 1–2. [Google Scholar]
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Vosen, S.; Keller, J. Hybrid energy storage systems for stand-alone electricpower systems: Optimization of system performance andcost through control strategies. Int. J. Hydrogen Energy 1999, 24, 1139–1156. [Google Scholar] [CrossRef]
Kozlov, A.N.; Tomin, N.V.; Sidorov, D.N.; Lora, E.E.S.; Kurbatsky, V.G. Optimal Operation Control of PV-Biomass Gasifier-Diesel-Hybrid Systems Using Reinforcement Learning Techniques. Energies 2020, 13, 2632. [Google Scholar] [CrossRef]
François-Lavet, V.; Taralla, D.; Ernst, D.; Fonteneau, R. Deep Reinforcement Learning Solutions for Energy Microgrids Management. In Proceedings of the European Workshop on Reinforcement Learning EWRL Pompeu Fabra University, Barcelona, Spain, 3–4 December 2016. [Google Scholar]
Tommy, A.; Marie-Joseph, I.; Primerose, A.; Seyler, F.; Wald, L.; Linguet, L. Optimizing the Heliosat-II method for surface solar irradiation estimation with GOES images. Can. J. Remote Sens. 2015, 41, 86–100. [Google Scholar] [CrossRef]
David, J. L 2.1 EcoBioH₂, Internal Project Report. 9 July 2019. Available online: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-pay (accessed on 1 October 2020).
Soda-Pro. HelioClim-3 Archives for Free. 2019. Available online: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-free (accessed on 11 March 2019).
Rigollier, C.; Lefèvre, M.; Wald, L. The method Heliosat-2 for deriving shortwave solar radiation from satellite images. Solar Energy 2004, 77, 159–169. [Google Scholar] [CrossRef] [Green Version]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 834–846. [Google Scholar] [CrossRef]
Ernst, D.; Geurts, P.; Wehenkel, L. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. 2005, 6, 503–556. [Google Scholar]
Uhlenbeck, G.E.; Ornstein, L.S. On the theory of the Brownian motion. Phys. Rev. 1930, 36, 823. [Google Scholar] [CrossRef]
Lin, L.J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]

Figure 1. View of our system. lines in green shows the solar-only part and purple lines shows the grid-only part. actions are displayed in red.

Figure 2. Repartition formulation (green),

δ E_{s t o r a g e} (t)

and

α_{r e p} (t)

, in the 2Dbatt (blue) action space. Actions where one storage is charged and the other discharged are highlighted in red.

Figure 3. Mean impact versus

α_{r e p}

preset. Hatched area corresponds to rejected

α_{r e p}

values where the policy does not last the whole year.

Figure 4. Overview of the actor–critic approach. Curved arrows indicate learning. Time passing with

t = t + 1

is displayed

Z^{- 1}

.

Figure 5. The total hydrogen storage impact depending on the efficiency of storage.

Table 1. Contradictory consequences of carbon impact minimization and datacenter powering.

Minimizing Carbon Impact	Keeping the Datacenter Powered
short duration	long duration
high self-consumption	low self-consumption
use only $H_{2}$	charge batteries first
do not need any capacity	need large hydrogen storage capacity

Table 2. Nomenclature of variables used.

Symbol	Meaning
$E_{H_{2}} (t)$	hydrogen storage state of charge at instant t
$E_{H_{2} in} (t)$	hydrogen storage charge at instant t
$E_{H_{2} out} (t)$	hydrogen storage discharge at instant t
$E_{b a t t} (t)$	lead storage state of charge at instant t
$E_{b a t t in} (t)$	lead storage charge at instant t
$E_{b a t t out} (t)$	lead storage discharge at instant t
$E_{solar} (t)$	Solar production for the hour
$E_{D C} (t)$	Datacenter consumption for the hour
$E_{s u r p l u s} (t)$	Energy going from the solar circuit to the general one
$E_{building} (t)$	Energy consumed by the building, excluding the datacenter
$E_{g r i d} (t)$	Energy coming from the grid
$E_{w a s t e} (t)$	Energy overproduced for the building
t	time step
a_t	action vector at instant t
s_t	state vector at instant t
$f (s, a)$	carbon impact in state $s$ doing action $a$
$R (s, a)$	reward in state $s$ doing action $a$
$r_{t}$	reward in state s_t doing action a_t
${δ E}_{b a t t} (t)$	lead battery contribution
${δ E}_{H_{2}} (t)$	Hydrogen storage contribution
${δ E}_{storage} (t)$	Global energy storages contribution
$α_{rep} (t)$	Energy storages contribution repartition
$Q (s_{t}, a_{t})$	discounted sum of future reward doing action a_t in state s_t
$y_{t}$	estimation of $Q (s_{t}, a_{t})$ used in the critic loss
$γ$	discount factor of future rewards
$π (s)$	policy returning an action $a$ in state $s$
$ϕ_{i}$	critic parameters at time step i
$θ_{i}$	policy parameters at time step i
$J (ϕ_{i})$	critic loss
$ϕ_{o l d i}$	critic parameters at time step i
$θ_{o l d i}$	policy parameters at time step i
$μ$	step-size for critic learning
$λ$	step-size for actor learning
$τ$	stabilization networks update proportion
N	duration: average length of a policy
s	self-consumption ratio (62)

Table 3. Parameters values used during simulations.

Quantity	Value	Unit
$η_{s o l a r o p a c i t y}$	0.6
$η_{s o l a r}$	0.21
$S_{p a n e l s}$	1000	m $^{2}$
$C_{s o l a r}$	55	gCO $_{2}$ eq/kW·h
$E_{s o l a r M a x}$	185	kW·h
$η_{b a t t}$	0.81
$C_{b a t t I n}$	68.66	gCO $_{2}$ eq/kW·h
$C_{b a t t O u t}$	86	gCO $_{2}$ eq/kW·h
$E_{b a t t M a x}$	650 / 2	kW·h
$E_{b a t t I n M a x}$	$E_{b a t t M a x}$	kW·h
$E_{b a t t O u t M a x}$	$E_{b a t t M a x}$	kW·h
$η_{H_{2}}$	0.35
$C_{H_{2} I n}$	1.75	gCO $_{2}$ 2eq/kW·h
$C_{H_{2} O u t}$	5	gCO $_{2}$ 2eq/kW·h
$E_{H_{2} M a x}$	1000	kW·h
$E_{H_{2} I n M a x}$	$2 \times 10$	kW·h
$E_{H_{2} O u t M a x}$	$2 \times 5$	kW·h
$C_{g r i d}$	53	gCO $_{2}$ eq/kW·h
$E_{D C m a x}$	10	kW·h
$E_{b u i l d i n g M a x}$	100	kW·h

Table 4. Results computed on the year 2007. n.a.: not applicable.

Policy	Duration	Hourly Impact	Score	Self-Consumption
	(h)	(gCO $_{2}$ eq/h)		(%)
DDPGTwoBatts	442	n.a.	413	n.a.
DDPGRepartition	8567	n.a.	7850	35%
$α_{rep} = 0$	8760	4591	n.a.	35%
$α_{rep} = 0.2$	8760	4510	n.a.	33.6%
DDPG $α_{rep}$	8760	4586	8020	34.9%

Table 5. Learned policies duration depending on the reward normalization: local or global. Using simulations on 2007 test dataset.

Policy	Local n.	Global n.
DDPGTwoBatts	248	442
DDPGRepartition	4312	8567
DDPG $α_{rep}$	8760	8760

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Deep Reinforcement Learning for Hybrid Energy Storage Systems: Balancing Lead and Hydrogen Storage

Abstract

1. Introduction

2. Problem Statement

2.1. Storages

2.2. Solar Circuit

2.3. General Circuit

2.4. Long-Term Carbon Impact Minimization Problem

3. Problem Reformulations

3.1. Battery Charge or Discharge

3.2. Batteries Storage Repartition

3.3. Repartition Parameter Only

3.4. Fixed Repartition Policy

4. Learning the Policy with DDPG

4.1. Actor–Critic Approach

4.2. Critic Learning

4.3. Actor Learning

4.4. Proposition: DDPG α rep Algorithm to Learn the Policy

5. Simulation

5.1. Simulation Settings

5.2. Simulation Metrics

5.3. Simulation Results

5.4. Reward Normalization Effect

5.5. Hydrogen Storage Efficiency Impact

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A. Proof of Proposition 1

Appendix A.1. When δEstorage(t) < 0

Appendix A.2. When δEstorage(t) > 0

Appendix B. Proof of Proposition 2

Appendix B.1. When δEstorage(t) > 0

Appendix B.2. When δEstorage(t) < 0

References

Article Metrics

Citations

Article Access Statistics

4.4. Proposition: DDPG $α_{rep}$ Algorithm to Learn the Policy

Appendix A.1. When δE_storage(t) < 0

Appendix A.2. When δE_storage(t) > 0

Appendix B.1. When δE_storage(t) > 0

Appendix B.2. When δE_storage(t) < 0