Deep Reinforcement Learning for Hybrid Energy Storage Systems: Balancing Lead and Hydrogen Storage

Abstract: We address the control of a hybrid energy storage system composed of a lead battery and hydrogen storage. Powered by photovoltaic panels, it feeds a partially islanded building. We aim to minimize the building's carbon emissions over a long-term period while ensuring that 35% of the building consumption is powered by energy produced on site. To achieve this long-term goal, we propose to learn a control policy as a function of the building and storage states using a Deep Reinforcement Learning approach. We reformulate the problem to reduce the action space to a single dimension, which greatly improves the performance of the proposed approach. Given this reformulation, we propose a new algorithm, DDPGα rep, which uses a Deep Deterministic Policy Gradient (DDPG) to learn the policy. Once learned, this policy controls the storage. Simulations show that the higher the hydrogen storage efficiency, the more effective the learning.


Introduction
Energy storage is a crucial question for the usage of photovoltaic (PV) energy because of its time-varying behavior. In the ÉcoBioH2 project [1], we consider a building with solar panels hosting different usages. The building includes a datacenter that is constrained to be powered by solar energy. This low-carbon-footprint building has lead and hydrogen storage capabilities. Our goal is to control this hybrid energy storage system for a low carbon impact.
The building [1] is partially islanded, with a datacenter that can only be powered by the energy produced by the building's solar panels. The proportion of energy produced by the PV in the energy consumed by the building, including the datacenter, defines the self-consumption. The EcoBioH2 project requires the self-consumption to be at least 35%. Demand flexibility, where the load is adjusted to meet production, is not an option in this building, so energy storage is needed to power the datacenter. Daily variations of the energy production can be mitigated using lead or lithium batteries. However, due to their low capacity density, such technologies cannot be used for interseasonal storage. Hydrogen energy storage, on the other hand, is a promising solution to this problem, enabling yearly low-volume, high-capacity, low-carbon-emission energy storage. Unfortunately, it is plagued by its low storage efficiency. Combining hydrogen storage with lead batteries in a hybrid energy storage system enables us to leverage the advantages of both energy storages [2]. Hybrid storage has been shown to perform well in islanded emergency situations [3]. Lead batteries can deliver a large load but not for long. Hydrogen storage, on the other hand, only supports a small load but has a higher capacity than lead or lithium batteries, allowing a longer discharge. The question becomes: how do we control the charge and discharge of each storage and balance between the short-term battery and the long-term hydrogen storage?
We therefore encounter several opposing short- and long-term goals and constraints, summarized in Table 1. Minimizing the carbon impact discourages using batteries, as batteries emit carbon during their lifecycle. It also encourages using H 2 storage when needed, as less carbon is emitted per kW·h than with battery storage. The less energy is stored, the less energy is lost to storage efficiency. This leaves more energy available to the building; thus, in the short-term, self-consumption increases. However, the datacenter is then not guaranteed to have enough energy available in the long-term. Keeping the datacenter powered by solar energy requires storing as much energy as possible. Nevertheless, some energy is lost during charge and discharge, leading to a lower self-consumption. This energy should be stored in the battery first, since less energy is lost to efficiency, resulting in higher emissions. Keeping the datacenter powered is a long-term objective, as previous decisions impact the current state, which constrains our capacity to power the datacenter in the future. Nonetheless, because of their capacities, our energy storage systems behave in opposition: battery storage has a limited capacity and withstands short-term production variations; hydrogen storage has an enormous capacity and helps with long-term, interseasonal variations.
Table 1. Contradictory consequences of carbon impact minimization and datacenter powering.

Minimizing Carbon Impact | Keeping the Datacenter Powered
short duration | long duration
high self-consumption | low self-consumption
use only H 2 | charge batteries first
do not need any capacity | need large hydrogen storage capacity

Managing a long-term storage system means that the control system needs to choose actions (charge or discharge, and storage type) depending on their long-term consequences. We consider a duration of several months. We want to minimize the carbon impact while having enough energy for at least a complete year, under the constraint that the datacenter is powered by solar energy. Using convex optimization to solve this problem requires precise forecasting of the energy production and consumption for the whole year; one cannot have months of such forecasts in advance [4,5]. In [6], the authors try to minimize the cost and limit their study to 3 days only. Methods based on genetic algorithms, such as [7], require a detailed model of the building usages and energy production, which is not realistic in our case since all parts are not known in advance. We also want to allow flexible usages. Therefore, we propose to adopt a solution that can cope with light domain expertise. If the input and output data of the problem are accessible, supervised learning and deep learning can be considered [8]. With contradicting goals at different horizons, reinforcement learning is an interesting approach [9]. The solution we are looking for should provide a suitable control policy for our hybrid storage system. Most reinforcement learning methods quantize the action space to avoid having interdependent action space bounds [10]. However, such a solution comes with a loss of precision in the action selection and requires more data for learning.
Taking these aspects into account, we address in the sequel our problem formulation, allowing the deployment of non-quantized Deep Reinforcement Learning (DRL) [11] to learn the storage decision policy. DRL learns a long-term evaluation of actions and uses it to train an actor that, for each state of the building, gives the best action. In our case, the action is the charge or discharge of the lead and hydrogen storages. Learning the policy could even improve the control efficiency in the short-term [12]. Existing works focus on non-islanded settings [13] where no state causes a failure. Since our building is partially islanded, this approach would lead to a failure where the islanded portion is no longer powered. Existing DRL for hybrid energy storage systems focuses on minimizing the energy cost [14]. It does not consider the minimization of carbon emissions in a partially islanded building.
In this paper, we formulate the carbon impact minimization of the partially islanded building to learn a hybrid storage policy using DRL. We will reformulate this problem to reduce the action space dimension and therefore improve the DRL performance.
The contributions of this paper are as follows:
• We redefine the action space so that the action bounds are not interdependent.
• We use this reformulation to reduce the action space to a single dimension.
• From this analysis, we deduce a fixed (up to a projection, but not learned) repartition policy between the lead and hydrogen storages.
• We propose an actor-critic approach to control the partially islanded hybrid energy storage of the building, named DDPGα rep.
Simulations will show the importance of the hydrogen efficiency and carbon impact normalization in the reward, for the learned policy to be effective.

Problem Statement
In this section, we describe the model used to simulate our building. This model is sketched in Figure 1 and explained next. Action variables are noted in red.

Storages
We use a simplified model of the energy storage elements, as it is sufficient to validate the learning approach for our hybrid storage problem. However, the proposed learning approach can use any battery model or data, since the proposed reformulations and learning do not depend on the battery model. As long as the action is limited to how much we should charge or discharge, any storage model can be used instead. Since we propose a learning approach, the learned policy could be further improved using real data. Both energy storages (lead battery and H 2 ) use the same equations: with E H 2 (t) the state of charge of the H 2 storage at instant t, η H 2 the global (charging electrolyser and discharging proton-exchange membrane fuel cells) efficiency of H 2 storage. E H 2 in (t) is the charge energy and E H 2 out (t) is the energy discharged at instant t. Equation (1) must satisfy the following constraints: with E H 2 max , E H 2 in max and E H 2 out max the respective upper bounds for E H 2 (t), E H 2 in (t) and E H 2 out (t). To obtain the lead battery equations, replace H 2 by batt in Equations (1)-(4). The lead battery efficiency η batt covers the whole battery efficiency: charge and discharge.
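As an illustrative sketch (not the paper's implementation), one hourly update of a single storage could look as follows, assuming Equation (1) applies the efficiency to the charged energy; all names are illustrative:

```python
def storage_step(E_prev, E_in, E_out, eta, E_max, E_in_max, E_out_max):
    """One hourly update of a single storage, sketching Eq. (1) and the bound
    constraints (2)-(4). Replace the H2 quantities by batt ones for the battery."""
    # Constraints (3)-(4): hourly charge and discharge are bounded.
    assert 0.0 <= E_in <= E_in_max and 0.0 <= E_out <= E_out_max
    # Eq. (1) (assumed form): the efficiency eta is paid when charging.
    E_new = E_prev + eta * E_in - E_out
    # Constraint (2): the state of charge stays within the capacity.
    assert 0.0 <= E_new <= E_max
    return E_new
```

The same function serves both storages; only the efficiency and bound parameters differ.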

Solar Circuit
The solar circuit connects the elements that manage solar energy only. The production is provided by the solar panels, E solar (t). Part of this energy is stored in the short-term (lead battery) or long-term (hydrogen) storage. Part of it is consumed directly by a small datacenter, E DC (t). The solar circuit is not allowed to handle grid electricity. We define E surplus (t) as: Please note that this equation does not prevent charging one energy storage from the other. The solar circuit can only give energy to the general circuit, so that: Constraint (6) ensures that the datacenter can only be supplied with solar energy, as required by our project [1]. E solar (t) values are computed using irradiance values from [15] and the physical properties of our solar panels.

General Circuit
The building consumption values E building (t) come from the EcoBioH 2 technical office study [16]. They take into account the power consumption of the housing, the restaurant, ...and other usages that are hosted by the building. We define δE regul (t) as the difference between E building (t) and E surplus (t): When δE regul (t) > 0, we define it as the consumption from the electric grid: When δE regul (t) < 0, we define it as the energy discarded, since this building is not allowed to give energy back to the grid: In reality, the discarded energy will not be produced; this is done by temporarily disconnecting the solar panels.
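The sign-based split of the regulation term can be sketched as follows (illustrative names, assuming δE regul is simply the difference of Equation (7)):

```python
def regulation(E_building, E_surplus):
    """Split the regulation term of Eq. (7) into grid import (8) and wasted
    overproduction (9). Sketch; one of the two outputs is always zero."""
    dE_regul = E_building - E_surplus   # Eq. (7)
    E_grid = max(dE_regul, 0.0)         # Eq. (8): consume from the grid
    E_waste = max(-dE_regul, 0.0)       # Eq. (9): discard (disconnect panels)
    return E_grid, E_waste
```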
We define E grid (t) and E waste (t) in Equations (8) and (9) as they are used in the simulation metrics in Section 5.2. The variables defined previously and in the remainder of this paper are displayed in Table 2; the parameters are in Table 3.

Symbol | Meaning
E H 2 (t) | hydrogen storage state of charge at instant t
E H 2 in (t) | hydrogen storage charge at instant t
E H 2 out (t) | hydrogen storage discharge at instant t
E batt (t) | lead storage state of charge at instant t
E batt in (t) | lead storage charge at instant t
E batt out (t) | lead storage discharge at instant t
E solar (t) | solar production for the hour
E DC (t) | datacenter consumption for the hour
E surplus (t) | energy going from the solar circuit to the general one
E building (t) | energy consumed by the building, excluding the datacenter
E grid (t) | energy coming from the grid
E waste (t) | energy overproduced for the building
t | time step
a t | action vector at instant t
s t | state vector at instant t
f (s, a) | carbon impact in state s doing action a
R(s, a) | reward in state s doing action a
r t | reward in state s t doing action a t
δE batt (t) | lead battery contribution
δE H 2 (t) | hydrogen storage contribution
δE storage (t) | global energy storages contribution
α rep (t) | energy storages contribution repartition
Q(s t , a t ) | discounted sum of future rewards doing action a t in state s t
y t | estimate of Q(s t , a t ) used in the critic loss
γ | discount factor of future rewards
π(s) | policy returning an action a in state s
φ i | critic parameters at time step i
θ i | policy parameters at time step i
J(φ i ) | critic loss
φ old i | stabilization critic parameters at time step i
θ old i | stabilization policy parameters at time step i
µ | step-size for critic learning
λ | step-size for actor learning
τ | stabilization networks update proportion
N | duration: average length of the simulations
s | self-consumption ratio (62)

Long-Term Carbon Impact Minimization Problem
We gather the building consumption, the solar panels production at instant t and the previous stored energy state at t − 1 in a so-called state defined as: We define in (11) the action variables that control the energy storage at the current hour t. We define in Equation (12) the instantaneous carbon impact at state s t when performing action a t as f (s t , a t ): with C solar the carbon intensity per kW·h from the complete lifecycle of PV usage. C batt in , C batt out , C H 2 in , C H 2 out are the complete-lifecycle carbon intensities per kW·h of, respectively, the lead battery charge and discharge, and the hydrogen storage charge and discharge. C grid (t) quantifies the carbon emissions per kW·h associated with energy from the grid. Their values for the simulations are provided in Table 3. Our goal is to minimize the long-term carbon impact, taking into account the carbon emissions at the current and future states s t , . . . , s t+H as induced by the current and future actions a t , . . . , a t+H : under the constraints (2), (3), (4) and (6). We call this initial formulation TwoBatts.
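Assuming, as the text suggests, that f (s t , a t ) sums each energy flow weighted by its lifecycle carbon intensity (the exact weighting of Equation (12) is not reproduced here), a sketch could be:

```python
def carbon_impact(flows, intensities):
    """Instantaneous carbon impact f(s_t, a_t), sketched as the sum of each
    energy flow (kW.h) times its carbon intensity (gCO2eq per kW.h).
    Both dictionaries are keyed by flow name; names are illustrative."""
    return sum(flows[k] * intensities[k] for k in flows)
```

For instance, `carbon_impact({"grid": 10.0}, {"grid": 53.0})` charges the grid import at the grid's carbon intensity.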
The challenge comes from our ignorance of the actions a t+1 , . . . , a t+H that will be taken in the future. Yet, we need to account for their impact. DRL approaches are meant for this kind of challenge.

Problem Reformulations
In this section, we reformulate our problem (13) to simplify its resolution. We consider in particular the reduction of the action space to reduce the complexity and improve the convergence of learning.

Battery Charge or Discharge
The current formulation of our problem, TwoBatts, allows the policy to charge and discharge a battery simultaneously. We note that the cost function to be minimized (12) is increasing with the different components of a t . This leads to multiple actions that, from the same state s t , lead to the same s t+1 while having different costs. To avoid having to deal with such cases, we impose that an energy storage system can only be charged or discharged at a given instant t: Therefore, we express the charge and discharge of each battery in a single dimension: We propose to use these new variables as the action space: To obtain the new model equations, we replace the following variables in Equations (1)-(12): Thus, we obtain the formulation 2Dbatt of (13) with Next, we revisit the constraints with this new action space. When we only charge, (3) turns into: When we only discharge (δE H 2 (t) = E H 2 out (t)), (2) becomes: Accordingly, (4) is equivalent to: The battery is constrained by the variants of (23)-(26). Both storages are constrained by Equation (6), which turns into: The 2Dbatt formulation is the minimization of (13) over (17) constrained by Equations (23)-(27).
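Following the text's convention that δE equals E out when discharging (so charging makes the action negative, an assumption of this sketch), the two original variables can be recovered from the signed one-dimensional action:

```python
def split_action(dE):
    """Recover (E_in, E_out) from the one-dimensional signed action.
    At most one of the two is nonzero, which enforces the
    charge-xor-discharge constraint (14). Sign convention assumed."""
    E_in = max(-dE, 0.0)    # charge when the action is negative
    E_out = max(dE, 0.0)    # discharge when the action is positive
    return E_in, E_out
```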

Batteries Storage Repartition
In the 2Dbatt formulation, one storage can discharge while the other is charging, which results in a loss of energy. Moreover, the action bound (27) depends not only on the state but also on the action itself. The bounds are therefore interdependent. If we select an action outside the action bounds, we need to project it back inside them, which is non-trivial because of this interdependence.
To alleviate this problem, we propose to rotate the action space frame. We merge the two action dimensions into the energy storage systems contribution and the contribution repartition, defined as: α rep (t) is the proportion of hydrogen in the storing. It is equal to 0 when only the battery storage is used and to 1 when only the hydrogen storage is used. α rep (t) is bounded between 0 and 1 by definition, so that one energy storage cannot charge the other. Furthermore, we only convert from Repartition to 2Dbatt and not the other way around. This is illustrated in Figure 2. To insert the new variables in the 2Dbatt formulation, we use the following equations: We transform Equations (23)-(26) using (31): We obtain the battery variant of those equations using (31): Moreover, using (28), (27) becomes: Equation (40) depends only on one variable, δE storage (t). With this variable change, we have removed the interdependency of the constraint (27).
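The Repartition → 2Dbatt conversion can be sketched as follows, under the assumption that α rep simply shares the global contribution between the two storages:

```python
def to_2dbatt(dE_storage, alpha_rep):
    """Convert (delta_E_storage, alpha_rep) back to the per-storage
    contributions: alpha_rep is the hydrogen share of the global
    contribution (assumed form of the conversion equations)."""
    assert 0.0 <= alpha_rep <= 1.0   # one storage cannot charge the other
    dE_h2 = alpha_rep * dE_storage
    dE_batt = (1.0 - alpha_rep) * dE_storage
    return dE_batt, dE_h2
```

Only this direction is needed, as noted above; the 2Dbatt variables are never converted back to (δE storage , α rep ).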
Next, we propose bounds on δE storage (t) and α rep (t) that will be critical in the sequel.
and their values are defined by: Proof of Proposition 1 is in Appendix A.
Proof of Proposition 2 is in Appendix B. Please note that when δE storage (t) = 0, α rep (t) does not matter; we set α rep (t) = 0.5 by convention.
The interest of bounds (43) and (44) is that they depend on δE storage (t) only, whereas the bounds on δE storage (t) do not depend on α rep (t). Thus, given δE storage (t), we only need to decide the contribution repartition α rep (t). The interdependence has been completely removed.

Repartition Parameter Only
We notice that δE storage (t) can be seen as acting on a single global storage. To provide energy for as long as possible, i.e., to respect (40), we want to charge as much as possible and discharge only when needed. We call this the frugal policy. It corresponds to δE storage (t) being equal to its lower bound: To reduce the action space dimensionality even further, we propose to use the frugal policy and to focus on learning only α rep (t), the repartition between the lead and hydrogen energy storage systems contributions.
Using this remark, we propose the α rep reformulation with the goal (13): find the single action a t = α rep (t), given the state s t , using the carbon impact (45), under constraints (43) and (44), with δE storage (t) derived in (46). Unless specified otherwise, this is the formulation we use in the sequel of this paper.

Fixed Repartition Policy
In Section 4, we will propose a learning algorithm for the different formulations. To show the interest of learning, we want to compare the learned policies to a frugal policy (46) where α rep (t) is preselected and fixed to a value v. At each instant, we only verify that v ∈ [α min (t), α max (t)] and project it into this interval otherwise. We call α rep = v the policy where α rep is preset to the value v, so that: In Section 1, we explained that the battery is intended for short-term storage and the H 2 storage for long-term storage. Our intuition therefore suggests charging or discharging the lead battery first. This corresponds to the preset value α rep = 0, so that α rep (t) = α rep min (t).
One may wonder: what is the best preselected α rep ? To find it, we simulated 100 different values of α rep between 0 and 1. For each value v, we run a simulation starting at midnight of each day of the year 2006, looping over the year. A detailed description of these data is available in Section 5.1. We use the parameters in Table 3 and the PV production computed from irradiance data [17,18] in (48): E solar (t) = P solar (t) η solar η solar opacity S panels . If a simulation does not last the whole year, we reject it (hatched area in Figure 3). Otherwise, we compute the hourly carbon impact: with T the number of hours in 2006. This hourly impact is averaged over the 365 different runs, one starting at midnight of each day of 2006. Figure 3 shows the carbon impact versus α rep . The α rep value that minimizes the average hourly impact while lasting the whole year is α rep = 0.2. It will be used for comparison.
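The projection step of the fixed policy (47) is a simple clamp onto the feasible interval; a sketch with illustrative names:

```python
def fixed_alpha_policy(v, alpha_min, alpha_max):
    """Fixed-repartition policy alpha_rep = v: keep the preset value when it
    is feasible, otherwise project it onto the bounds [alpha_min(t),
    alpha_max(t)] of (43)-(44)."""
    return min(max(v, alpha_min), alpha_max)
```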

Learning the Policy with DDPG
In the α rep reformulation, we want to select a t given the state s t . The function that provides a t given s t is referred to as the policy. We want to learn the policy using DRL with an actor-critic, policy-based approach: the Deep Deterministic Policy Gradient (DDPG) [19]. Experts may want to skip Sections 4.2 and 4.3.

Actor-Critic Approach
We call env, for environment, the set of equations (1) and its battery variant, which allows obtaining s t+1 from a t and s t : s t+1 = env.step(s t , a t ). The corresponding reward, the short-term evaluation function, is defined as a function of s t and a t : r t = R(s t , a t ). We use [19], an actor-critic approach, where the estimated best policy for a given environment s t+1 = env.step(s t , a t ) is learned through a critic, as in Figure 4. The critic transforms this short-term evaluation into a long-term evaluation, the Q-values Q(s t , a t ), through learning. It will be detailed in Section 4.2. The actor π θ : s t → a t is the function that selects the best possible action a t . It uses the critic to know what the best action is in a given state (as detailed in Section 4.3). In Section 2.4, we set our objective to minimize the long-term carbon impact (13). However, in reinforcement learning we try to maximize a score, defined as the sum of all rewards: To remove this difference, we maximize the negative carbon impact ∑ − f (s t , a t ). However, the more negative terms we add, the lower the sum is. This leads to a policy trying to stop the simulation as fast as possible, in contradiction with our goal to always supply the datacenter with energy. To counter this, we propose, inspired by [20], to add a living incentive of 1 at each instant. Therefore, we propose to define the reward as: The reward accounting for the carbon impact is now normalized between 0 and 1, so that the reward is always positive. Still, in this reward the normalization depends on the state s t . When the normalization depends on the state, two identical actions can have different rewards associated with them. Therefore, the reward is not proportional to the carbon impact (45), making it harder to interpret.
To alleviate this problem, we propose to use the global maximum instead of the worst case of the current state: By convention, r t is set to zero after the simulation ends. The actor and the critic are parameterized by artificial neural networks, with weights denoted θ and φ respectively. They are learned alternately and iteratively. Two stabilization networks, with weights θ old and φ old , are also used for the critic supervision.
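Assuming both rewards take the form "living incentive minus normalized carbon impact" (the exact expressions (51) and (52) are not reproduced here), the two normalizations can be sketched as:

```python
def reward_local(f, f_max_state):
    """Sketch of the locally normalized reward (51): the normalizer is the
    worst-case impact of the *current* state, so identical actions can get
    different rewards in different states."""
    return 1.0 - f / f_max_state

def reward_global(f, f_max_global):
    """Sketch of the globally normalized reward (52): one constant normalizer
    for all states, keeping the reward proportional to the carbon impact."""
    return 1.0 - f / f_max_global
```

With the global variant, the ranking of actions by reward matches their ranking by carbon impact in every state.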

Critic Learning
Now that we have defined a reward, we can use the critic to transform it into a long-term metric. As time goes on, we trust the future less and less. Therefore, we discount future rewards using a discount factor 0 < γ < 1. We define the critic Q : s t , a t → ∑ +∞ k=0 γ k r t+k . It estimates the discounted long-term return of taking action a t in a given state s t . This weighted version of (50) also bounds the infinite sum, which makes it learnable. Q can be expressed recursively: We learn the Q-function using an artificial neural network with weights φ. At the ith iteration of our learning algorithm, and for given values of φ old i and θ old i , we define a reference value y t from the recursive expression (53). Since we do not know a t+1 , we need to select the best possible action at t + 1. The best estimator of this action is provided by the policy π θ old i , so that we define the reference as: where a t+1 has been estimated by π θ old i (s t+1 ).
The critic loss is the squared difference between the estimated value Q φ (s t , a t ) and the reference value y t [21]: To update φ i , we minimize J(φ i ) in (55) using a simple gradient descent: where ∇J(φ i ) is the gradient of J(φ) in (55) with respect to φ, taken at the value φ i , and µ is a small positive step-size. To stabilize the learning, [19] suggests updating the reference network φ old more slowly, so that:
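The reference value (54) and the stabilization update (57) can be sketched as follows, with plain-Python weight lists standing in for the networks:

```python
def critic_target(r, q_next, gamma, done=False):
    """Reference y_t of Eq. (54): reward plus the discounted estimate of the
    stabilized critic at the stabilized policy's next action; by convention,
    the future term is zero once the simulation has ended."""
    return r if done else r + gamma * q_next

def soft_update(w_old, w, tau):
    """Stabilization update: move each reference weight a small fraction tau
    toward the learned weight. Sketch over flat lists of weights."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(w_old, w)]
```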

Actor Learning
Since we alternate the updates of the critic and of the actor, we address next the learning of the actor. To learn the best action to select, we need a loss function that grades the different actions a t . Using the reward function (52) as a loss function, the policy would select the best short-term, instantaneous action. Since the critic Q(s t , a t ) depends on the action a t , we replace a t by π θ (s t ). At iteration i, to update the actor network θ i , we use the gradient ascent of the average Q φ i (s t , π θ (s t )) taken at θ = θ i . This can be expressed as: where λ is a small positive step-size.
To learn the critic, a stabilized actor is used. Like the stabilized critic, π θ old is updated by: with θ old 0 = θ 0 at the beginning. During learning, an Ornstein-Uhlenbeck noise [22], n, is added to the policy decision to make sure we explore the action space:
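A minimal sketch of the Ornstein-Uhlenbeck exploration noise [22]; the θ, σ and µ defaults below are common illustrative choices, not values taken from this paper:

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise: temporally correlated noise
    added to the actor's decision during learning."""
    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, seed=None):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.x = mu
        self.rng = random.Random(seed)

    def sample(self):
        # Mean-reverting drift toward mu plus a Gaussian increment.
        self.x += self.theta * (self.mu - self.x) \
                  + self.sigma * self.rng.gauss(0.0, 1.0)
        return self.x
```

Unlike independent Gaussian noise, consecutive samples are correlated, which explores the action space more smoothly.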

Proposition: DDPG α rep Algorithm to Learn the Policy
From the previous sections, we propose the DDPGα rep Algorithm 1. This algorithm alternates the learning of the actor and critic networks. We select the initial instant t randomly to avoid learning time patterns. We start each run with full energy storages.
Once learned, we use the last weights θ i of the neural network parameterizing the actor to select the action using π θ i : s t → a t directly.
To learn well, an artificial neural network needs the different learning samples to be uncorrelated. In reinforcement learning, two consecutive states tend to be close, i.e., correlated. To overcome this problem, we store all experiences (s t , a t , r t , s t+1 ) in a memory and use a small random subset as the learning batch [23]. The random selection of a batch from the memory is called sample.
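This experience replay can be sketched as follows (the capacity and batch size mirror the orders of magnitude quoted in the settings, but are assumptions of this sketch):

```python
import collections
import random

class ReplayMemory:
    """Experience replay: store (s_t, a_t, r_t, s_t+1) tuples and sample
    small random batches so that learning data are decorrelated."""
    def __init__(self, capacity=10**6, seed=None):
        # deque with maxlen drops the oldest experience when full.
        self.buffer = collections.deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement over the stored experiences.
        return self.rng.sample(list(self.buffer), batch_size)
```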

Simulation
We have just proposed DDPGα rep to learn how to choose α rep (t) with respect to the environment. In this section, we present the simulation settings and results.

Simulation Settings
Production data are computed using (48) from real irradiance data [17,18] measured at the building location in Avignon, France. The building has S panels = 1000 m 2 of solar panels with η solar opacity = 60% opacity and an efficiency of η solar = 21%. Those solar panels can produce a maximum of E solar Max = 185 kW·h per hour.
Consumption data come from projections of the engineering office [16]. They consist of powering housing units with an electricity demand fluctuating daily between 30 kW·h (1 a.m. to 6 a.m.) and 90 kW·h. The weekly variation of the consumption is a factor between 1 and 1.4 during awake hours between workdays and the weekend. There is little interseasonal variation (standard deviation of 0.6 kW·h between seasons, 0.01% of the yearly mean), as heating uses wood pellets. In those simulations, the datacenter consumes a fixed E DC max = 10 kW·h. The datacenter consumption adds up to 87.6 MW·h per year, around 17% of the 496 MW·h that the entire building consumes in a year. To power this datacenter, our building's solar panels produce an average of 53.8 kW·h per hour during the 12.7 sunny hours of an average day, for a yearly total of 249 MW·h. This covers up to 2.8 times the consumption of our datacenter, but only 99% of it if all the energy goes through the hydrogen storage. The same solar production covers at most 50% of the building's yearly consumption; when accounting for the hydrogen efficiency, it covers at most 17%.
We only use half of the lead battery capacity to preserve the battery health longer: E batt Max = 650/2 = 325 kW·h. The lead battery carbon intensity is split between the charge and the discharge: C batt Out = 172/2 = 86 gCO 2 eq/kW·h. Since the charge quantity is counted before the efficiency, its carbon intensity must account for the efficiency: C batt In = C batt Out η batt = 86 × 0.81 = 69.66 gCO 2 eq/kW·h. The carbon intensity of the electrolysers, accounting for the efficiency, is used for C H 2 in = 5 × η H 2 = 1.75 gCO 2 eq/kW·h. The carbon intensity of the fuel cells corresponds to C H 2 out = 5 gCO 2 eq/kW·h. η H 2 accounts for both the electrolyser and fuel cell efficiencies. C grid = 53 gCO 2 eq/kW·h is the average French grid carbon intensity. All those values are reported in Table 3.
The simulations use an hourly simulation step t.
We train on the production data from year 2005, and validate and select hyperparameters using the best score (50). To improve learning, we normalize all state and action inputs and outputs between −1 and 1. For a given value d bounded between d min and d max , the normalized value d norm is then used as an input for the networks.
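The usual affine map to [−1, 1] (assumed here, as the exact formula is not shown in the text) is:

```python
def normalize(d, d_min, d_max):
    """Map a value bounded in [d_min, d_max] to [-1, 1] before feeding it to
    the actor and critic networks. Standard affine form, assumed."""
    return 2.0 * (d - d_min) / (d_max - d_min) - 1.0
```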
To accelerate the learning, all gradient descents are performed using Adam [24]. During training, we use the step sizes µ = 10 −3 to learn the critic and λ = 10 −4 for the actor. For the stabilization networks, τ = 0.001. To learn, we sample batches of 64 experiences from a memory of 10 6 experiences. The actor and the critic both have 2 hidden layers with a ReLU activation function, with respectively 400 and 300 units. The output layer uses a tanh activation to bound its output. The discount factor γ in (54) is optimized as a hyperparameter between 0.995 and 0.9999; we found the best value to be 0.9979.

Simulation Metrics
We call duration, noted N, the average length of the simulations. When all simulations last the whole year, the hourly carbon impact is evaluated as in (49). To select the best policy, the average score is computed using (50). Self-consumption, defined as the energy provided by the solar panels, directly or indirectly through one of the storages, over the consumption, is computed using: Per the ÉcoBioH2 project, the goal is to reach 35% self-consumption: s ≥ 0.35.
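Under the stated definition, and assuming for this sketch that everything not imported from the grid was provided by the panels (directly or through a storage), the ratio (62) can be computed as:

```python
def self_consumption(consumption, grid_import):
    """Self-consumption ratio s of Eq. (62): share of the total consumption
    covered by on-site solar energy. Inputs are hourly series of consumed
    and grid-imported energy (kW.h); the exact formula in (62) is assumed."""
    total = sum(consumption)
    return (total - sum(grid_import)) / total
```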

Simulation Results
The following learning algorithms are simulated on the data from Avignon from 2007 and our building: the proposed DDPGα rep with action a t = [α rep (t)], and DDPGTwoBatts and DDPGRepartition, algorithms similar to DDPGα rep with the action spaces of the corresponding formulations, respectively (11) and (17). The starting time is randomly selected among any hour of the year.
To test the learned policies, the duration, hourly impact (49), score (50) and self-consumption (62) metrics are computed on the 2007 irradiance data and averaged over all runs. We compute those metrics over 365 different runs, each starting at midnight of a 2007 day. For the sake of comparison, we also compute those metrics, when applicable, for the preselected values α rep = 0 and α rep = 0.2 using (47) on the same data. Recall that the fixed α rep values are bounded by (43) and (44) to ensure the long-term duration.
The metrics over the different runs are displayed in Table 4. We can see in Table 4 that DDPGTwoBatts and DDPGRepartition do not last the whole year. This shows the importance of our reformulations to reduce the action space dimensions. We observe that all policies using the α rep reformulation last the whole year (N = 8760). This validates our proposed reformulations and dimension reduction. α rep = 0.2 achieves the lowest carbon impact; however, it cannot ensure the self-consumption target. On the other hand, α rep = 0 achieves the target self-consumption at the price of a higher carbon impact. The proposed DDPGα rep provides a good trade-off between the two by adapting α rep (t) to the state s t . It reaches the target self-consumption minus 0.1% and lowers the carbon impact with respect to α rep = 0. The carbon emission gain over the intuitive policy α rep = 0, which uses hydrogen only as a last resort, is 43.8 × 10 3 gCO 2 eq/year. This shows the interest of learning the policy once the problem is well formulated.

Reward Normalization Effect
In Section 4.1, we presented two ways to normalize the carbon impact in the reward. In this section, we show that the proposed global normalization (52) yields better results than the local state-specific normalization (51).
In Table 5, we display the duration for both normalizations. Policies that use the locally normalized reward have a lower duration than those using the globally normalized reward. This confirms that the local normalization is harder to learn, as two identical actions can receive different rewards in different states. The larger dynamic range of the local normalization is therefore not worth the variability it induces. This validates our choice of the global normalization (52) for the proposed DDPGα rep algorithm.
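The difference between the two normalizations can be made concrete with a small sketch. The exact forms of (51) and (52) are not reproduced here; the functions below only illustrate the structural difference under an assumed min-max scheme: the local variant takes state-dependent bounds, the global variant one fixed pair of bounds for all states.

```python
def reward_local(impact, impact_min_s, impact_max_s):
    """Local normalization in the spirit of (51): the bounds depend on the
    current state s, so the same impact value can map to different rewards
    in different states (assumed min-max form, for illustration only)."""
    return -(impact - impact_min_s) / (impact_max_s - impact_min_s)

def reward_global(impact, impact_min, impact_max):
    """Global normalization in the spirit of (52): one fixed pair of bounds
    for all states, keeping the reward scale consistent over a trajectory."""
    return -(impact - impact_min) / (impact_max - impact_min)
```

With state-dependent bounds, identical impacts yield different rewards depending on the visited state, which is exactly the variability that makes the locally normalized reward harder to learn.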

Hydrogen Storage Efficiency Impact
In our simulations, we have seen the sensitivity of the carbon impact results to the parameters in Table 3. Indeed, the efficiency of the storage has a great impact on the system behavior: hydrogen storage yields lower carbon emissions only when its efficiency η H 2 exceeds some threshold. The greater η H 2 is, the greater α rep (t) can be, and so the wider the range for adapting α rep (t) via learning. To find the threshold on η H 2 , we first compute the total carbon intensity of storing one kW·h in a given storage, including the carbon intensity of the energy production; for H 2 , this yields (63). We display the value of (63) for both storages in Figure 5 as a function of η H 2 , the other parameters being taken from Table 3. When C H 2 tot < C batt tot , learning is useful, since the policy must balance the lower carbon intensity of the hydrogen storage against its low efficiency, which favors the battery storage. When C H 2 tot > C batt tot , the learned policy converges to α rep = 0, as both objectives (minimizing the carbon impact and continuously powering the datacenter) align.
From (63) and its battery variant, we calculate the threshold efficiency at which C H 2 tot = C batt tot ; the result is given in (64). Using the values of Table 3 in (64), hydrogen improves the carbon impact only when η H 2 > η * H 2 = 0.24. Since the current value η H 2 = 0.35 exceeds this threshold, learning is useful, as shown by the simulations of Table 4. We also expect that as hydrogen storage efficiency improves in the future, the impact of learning will become even more important.
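The threshold computation can be illustrated numerically. Since (63) and (64) are not reproduced in this excerpt, the sketch below assumes a plausible form: delivering one kW·h through a storage with round-trip efficiency η requires 1/η kW·h of production, so the total intensity is (C_prod + C_storage)/η. All numerical values are hypothetical placeholders, not the Table 3 parameters.

```python
def c_tot(c_prod, c_storage, eta):
    """Total carbon intensity of delivering one kWh through a storage of
    round-trip efficiency eta (assumed form, analogous to (63)):
    production intensity plus storage intensity, divided by eta."""
    return (c_prod + c_storage) / eta

def eta_h2_threshold(c_prod, c_h2, c_batt, eta_batt):
    """Efficiency at which C_H2_tot equals C_batt_tot, obtained by solving
    (c_prod + c_h2) / eta_h2 = (c_prod + c_batt) / eta_batt for eta_h2
    (analogue of (64) under the assumed form above)."""
    return eta_batt * (c_prod + c_h2) / (c_prod + c_batt)

# Hypothetical intensities in gCO2eq/kWh, not the paper's Table 3 values:
eta_star = eta_h2_threshold(c_prod=50.0, c_h2=10.0, c_batt=30.0, eta_batt=0.9)
```

Above `eta_star`, the hydrogen path has the lower total intensity and learning has a trade-off to exploit; below it, the learned policy collapses to α rep = 0.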

Conclusions
We have addressed the problem of monitoring the hybrid energy storage of a partially islanded building with the goals of carbon impact minimization and self-consumption. We have reformulated the problem to reduce the action to a single component, α rep (t), the proportion of hydrogen storage given the building state s t . To learn the policy π θ : s t → α rep (t), we have proposed a new DRL algorithm, DDPGα rep , with a reward tailored to our problem. The simulation results show that when the hydrogen storage efficiency is large enough, learning α rep (t) decreases the carbon impact while lasting at least one year and maintaining 35% self-consumption. As hydrogen storage technologies improve, the proposed algorithm should have even more impact.
Learning the policy with the proposed DDPGα rep can also be done when the storage model includes non-linearities. Learning can also adapt to climate change over time by training on more recent data. To measure such benefits, we will use the real ÉcoBioH2 data to be collected in the continuation of the project. Learning from real data will reduce the gap between the model and the real system, which should improve performance. The proposed approach could also be used to optimize other environmental metrics with a multi-objective cost in f (s t , a t ).
With our current formulation, policies cannot assess the day and hour, as they only have two state variables from which to infer the time: E solar (t) and E building (t). They cannot differentiate between 1 a.m. and 4 a.m., since those two times have the same consumption and no PV production. Nor can they differentiate between a cloudy summer day and a clear winter day, as production and consumption are close in those two cases. In the future, we will consider taking the current time into account to enable the learned policy to adapt its behavior to the time of day and month of the year.
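One standard way to provide this time information, sketched below under the assumption that the state vector is simply extended with extra components, is a cyclical sine/cosine encoding of the hour of day and day of year; this keeps midnight adjacent to 11 p.m. and December adjacent to January, which a raw integer encoding would not. The function name and state layout are illustrative, not part of the current formulation.

```python
import math

def time_features(hour_of_year):
    """Cyclical encoding of the hour of day and day of year (hypothetical
    state extension). Returns four values in [-1, 1] that let a policy
    distinguish e.g. 1 a.m. from 4 a.m. even when consumption and PV
    production are identical."""
    hour = hour_of_year % 24
    day = (hour_of_year // 24) % 365
    return [
        math.sin(2 * math.pi * hour / 24), math.cos(2 * math.pi * hour / 24),
        math.sin(2 * math.pi * day / 365), math.cos(2 * math.pi * day / 365),
    ]
```

These four components would be appended to the existing state (E solar (t), E building (t), storage levels) before being fed to the actor and critic networks.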