Abstract
We address the control of a hybrid energy storage system composed of a lead battery and hydrogen storage. Powered by photovoltaic panels, it feeds a partially islanded building. We aim to minimize the building carbon emissions over a long-term period while ensuring that 35% of the building consumption is powered by energy produced on site. To achieve this long-term goal, we propose to learn a control policy, as a function of the building and storage states, using a Deep Reinforcement Learning approach. We reformulate the problem to reduce the action space dimension to one, which greatly improves the performance of the proposed approach. Given this reformulation, we propose a new algorithm based on the Deep Deterministic Policy Gradient (DDPG) to learn the policy. Once learned, this policy is used to control the storage. Simulations show that the higher the hydrogen storage efficiency, the more effective the learning.
1. Introduction
Energy storage is a crucial question for the usage of photovoltaic (PV) energy because of its time-varying behavior. In the ÉcoBioH2 project [1], we consider a building whose solar panels supply several usages. The building includes a datacenter that is constrained to be powered by solar energy. It is a low-carbon-footprint building with lead and hydrogen storage capabilities. Our aim is to control this hybrid energy storage system so as to achieve a low carbon impact.
The building [1] is partially islanded, with a datacenter that can only be powered by the energy produced by the building’s solar panels. The proportion of the energy consumed by the building, including the datacenter, that is produced by the PV panels defines the self-consumption. The ÉcoBioH2 project requires the self-consumption to be at least 35%. Demand flexibility, where the load is adjusted to meet production, is not an option in this building, so energy storage is needed to power the datacenter. Daily variations of the energy production can be mitigated using lead or lithium batteries. However, due to their low capacity density, such technologies cannot be used for interseasonal storage. Hydrogen energy storage, on the other hand, is a promising solution to this problem, enabling yearly low-volume, high-capacity, low-carbon-emission energy storage. Unfortunately, it is plagued by its low storage efficiency. Combining hydrogen storage with lead batteries in a hybrid energy storage system lets us leverage the advantages of both storages [2]. Hybrid storage has been shown to perform well in islanded emergency situations [3]. Lead batteries can deliver a large load, but not for long. Hydrogen storage, on the other hand, only supports a small load but has a higher capacity than lead or lithium batteries, allowing a longer discharge. The question is then how to control the charge and discharge of each storage and how to balance between the short-term battery and the long-term hydrogen storage.
We therefore face several opposing short- and long-term goals and constraints, summarized in Table 1. Minimizing the carbon impact discourages the use of the batteries, as batteries emit carbon over their lifecycle. It also encourages drawing from the grid when needed, as less carbon is emitted per kW·h than with battery storage. The less energy is stored, the less energy is lost to storage efficiency, leaving more energy available to the building. Thus, in the short term, self-consumption increases. However, the datacenter is then not guaranteed to have enough energy available in the long term. Keeping the datacenter powered by solar energy requires storing as much energy as possible. Nevertheless, some energy is lost during charge and discharge, leading to a lower self-consumption. This energy should be stored in the battery first, since less energy is lost to efficiency, but this results in higher emissions. Keeping the datacenter powered is a long-term objective, as previous decisions impact the current state, which constrains our capacity to power the datacenter in the future. Moreover, because of their capacities, our two energy storage systems operate on different time scales. Battery storage has a limited capacity; it allows the withstanding of short-term production variations. Hydrogen storage has an enormous capacity; it helps with long-term, interseasonal variations.
Table 1.
Contradictory consequences of carbon impact minimization and datacenter powering.
Managing a long-term storage system means that the control system needs to choose actions (charge or discharge, and storage type) according to their long-term consequences. We consider a duration of several months. We want to minimize the carbon impact while having enough energy for at least a complete year, under the constraint that the datacenter is powered by solar energy. Using convex optimization to solve this problem requires precise forecasting of the energy production and consumption for the whole year, and one cannot have months of such forecasts in advance [4,5]. In [6], the authors try to minimize the cost and limit their study to 3 days only. Methods based on genetic algorithms, such as [7], require a detailed model of the building usages and energy production, which is not realistic in our case since all parts are not known in advance. We also want to allow flexible usages. Therefore, we propose to adopt a solution that can cope with light domain expertise. If the input and output data of the problem are accessible, supervised learning and deep learning can be considered [8]. With contradicting goals over different horizons, reinforcement learning is an interesting approach [9]. The solution we are looking for should provide a suitable control policy for our hybrid storage system. Most reinforcement learning methods quantize the action space to avoid having interdependent action space bounds [10]. However, such a solution comes with a loss of precision in the action selection and requires more data for learning.
Taking these aspects into account, we address in the sequel a problem formulation allowing the deployment of non-quantized Deep Reinforcement Learning (DRL) [11] to learn the storage decision policy. DRL learns a long-term evaluation of actions and uses it to train an actor that gives the best action for each state of the building. In our case, the action is the charge or discharge of the lead and hydrogen storages. Learning the policy could even improve short-term control efficiency [12]. Existing works focus on non-islanded settings [13] where no state causes a failure. Since our building is partially islanded, such an approach could lead to a failure where the islanded portion is no longer powered. Existing DRL for hybrid energy storage systems focuses on minimizing the energy cost [14]; it does not consider the minimization of carbon emissions in a partially islanded building.
In this paper, we formulate the carbon impact minimization of the partially islanded building to learn a hybrid storage policy using DRL. We will reformulate this problem to reduce the action space dimension and therefore improve the DRL performance.
The contributions of this paper are as follows:
- We redefine the action space so that the action bounds are not interdependent.
- We use this reformulation to reduce the action space to a single dimension.
- From this analysis, we deduce a repartition policy between the lead and hydrogen storages that is fixed up to a projection, rather than learned.
- We propose an actor–critic approach to control the partially islanded hybrid energy storage of the building, to be named DDPG.
Simulations will show the importance of the hydrogen efficiency and carbon impact normalization in the reward, for the learned policy to be effective.
2. Problem Statement
In this section, we describe the model used to simulate our building. This model is sketched in Figure 1 and explained next. Action variables are noted in red.
Figure 1.
View of our system. Green lines show the solar-only part and purple lines show the grid-only part. Actions are displayed in red.
2.1. Storages
We use a simplified model of the energy storage elements, as it is sufficient to validate the learning approach for our hybrid storage problem. However, the proposed learning approach can use any battery model or data, since the proposed reformulations and learning do not depend on the battery model. As long as the action is limited to how much we should charge or discharge, any storage model can be used instead. Since we propose a learning approach, the learned policy could be further improved using real data. Both energy storages (lead battery and hydrogen storage) use the same equations:
with the state of health of the storage at instant t and the global efficiency of the storage, covering the charging electrolyser and the discharging proton-exchange membrane fuel cells. is the charge energy and is the energy discharged at instant t. Equation (1) must satisfy the following constraints:
with , and the respective upper bounds for , and . To obtain the lead battery equations, replace by in Equations (1)–(4). The lead battery efficiency covers the whole battery efficiency: charge and discharge.
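For illustration, here is a minimal sketch of such a bounded storage update, under our own naming and the simplifying assumption that the efficiency is applied to the charge; it is an illustration of the kind of model the learning approach can accommodate, not the paper's exact Equations (1)–(4).

```python
from dataclasses import dataclass

@dataclass
class Storage:
    """Simplified energy storage (lead battery or hydrogen chain).

    Assumed model: stored(t+1) = stored(t) + efficiency * charge(t) - discharge(t),
    with charge, discharge and stored energy bounded above."""
    capacity: float       # upper bound on stored energy (kW·h)
    efficiency: float     # global charge/discharge efficiency in (0, 1]
    max_charge: float     # upper bound on hourly charge (kW·h)
    max_discharge: float  # upper bound on hourly discharge (kW·h)
    stored: float = 0.0   # current stored energy (kW·h)

    def step(self, charge: float, discharge: float) -> None:
        # Enforce the bound constraints on the action.
        charge = min(max(charge, 0.0), self.max_charge)
        discharge = min(max(discharge, 0.0), min(self.max_discharge, self.stored))
        # Apply the (simplified) storage dynamics.
        self.stored = min(self.capacity, self.stored + self.efficiency * charge - discharge)

# Example: one hour of charging a hydrogen-like storage with a low efficiency.
h2 = Storage(capacity=1000.0, efficiency=0.3, max_charge=50.0, max_discharge=50.0)
h2.step(charge=40.0, discharge=0.0)
print(h2.stored)  # 12.0 kW·h stored out of the 40 kW·h charged
```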
2.2. Solar Circuit
The solar circuit connects the elements that manage solar energy only. The production is provided by the solar panels. Part of this energy will be stored in the short-term (lead battery) or long-term (hydrogen) storage. Part of it will be consumed directly by a small datacenter. The solar circuit is not allowed to handle grid electricity. We define as:
Please note that this equation does not prevent charging one energy storage from the other. The solar circuit can only give energy to the general circuit, so that:
This constraint (6) ensures that the datacenter can only be supplied with solar energy, as required by our project [1]. The production values are computed using irradiance data from [15] and the physical properties of our solar panels.
2.3. General Circuit
The building consumption values come from the ÉcoBio technical office study [16]. They take into account the power consumption of the housing, the restaurant, and the other usages hosted by the building. We define as the difference between and :
When , we define it as the consumption from the electric grid:
When , we define it as the energy discarded since this building is not allowed to give energy back to the grid:
In practice, the discarded energy will simply not be produced: the solar panels will be temporarily disconnected.
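To illustrate Equations (7)–(9), here is a minimal sketch of the general circuit balance: the sign of the residual consumption decides whether energy is imported from the grid or discarded. The function and variable names are ours, not the paper's notation.

```python
def general_circuit(consumption: float, solar_to_building: float) -> tuple[float, float]:
    """Split the residual building consumption into grid import and discarded energy.

    residual > 0: the building draws the missing energy from the grid.
    residual < 0: the surplus is discarded (the panels are temporarily disconnected)."""
    residual = consumption - solar_to_building
    grid_import = max(residual, 0.0)
    discarded = max(-residual, 0.0)
    return grid_import, discarded

print(general_circuit(60.0, 45.0))  # (15.0, 0.0): 15 kW·h imported from the grid
print(general_circuit(40.0, 55.0))  # (0.0, 15.0): 15 kW·h of production discarded
```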
We define and in Equations (8) and (9) as they are used in the simulation metrics of Section 5.2. The variables defined previously and in the remainder of this paper are listed in Table 2; the parameters are in Table 3.
Table 2.
Nomenclature of variables used.
Table 3.
Parameters values used during simulations.
2.4. Long-Term Carbon Impact Minimization Problem
We gather the building consumption, the solar panel production, and the stored energy states at instant t into a so-called state, defined as:
We define the action variables in
to control the energy storage at the current hour t. We define in Equation (12) the instantaneous carbon impact at state when performing action as :
with the carbon intensity per kW·h from the complete lifecycle of PV usage. , , , are the complete-lifecycle carbon intensities per kW·h of, respectively, lead battery charge, lead battery discharge, hydrogen storage charge and hydrogen storage discharge. quantifies the carbon emissions per kW·h associated with energy from the grid. Their values for the simulations are provided in Table 3. Our goal is to minimize the long-term carbon impact, taking into account the carbon emissions at the current and future states as induced by the current and future actions :
under the constraints (2), (3), (4) and (6). We call this initial formulation .
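To illustrate the structure of the instantaneous carbon impact (12), the sketch below sums the contributions of each energy flow weighted by its lifecycle carbon intensity. The exact terms of (12) are not reproduced; the flow names and the numerical intensities are purely illustrative, not the values of Table 3.

```python
def carbon_impact(flows: dict[str, float], intensities: dict[str, float]) -> float:
    """Instantaneous carbon impact: sum over the energy flows (kW·h) of
    flow * lifecycle carbon intensity (gCO2eq per kW·h).

    Assumed keys: 'pv', 'lead_charge', 'lead_discharge',
    'h2_charge', 'h2_discharge', 'grid'."""
    return sum(flows[k] * intensities[k] for k in flows)

# Illustrative values only (not the ones of Table 3).
intensities = {"pv": 55.0, "lead_charge": 90.0, "lead_discharge": 90.0,
               "h2_charge": 30.0, "h2_discharge": 30.0, "grid": 60.0}
flows = {"pv": 50.0, "lead_charge": 10.0, "lead_discharge": 0.0,
         "h2_charge": 5.0, "h2_discharge": 0.0, "grid": 12.0}
print(carbon_impact(flows, intensities))  # 50*55 + 10*90 + 5*30 + 12*60 = 4520.0 gCO2eq
```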
The challenge comes from our ignorance of the actions that will be taken in the future. Yet, we need to account for their impact. DRL approaches are designed for this kind of challenge.
3. Problem Reformulations
In this section, we reformulate our problem (13) to simplify its resolution. We consider in particular the reduction of the action space to reduce the complexity and improve the convergence of learning.
3.1. Battery Charge or Discharge
The current formulation of our problem, TwoBatts, allows the policy to charge and discharge a battery simultaneously. We note that the cost function to be minimized (12) is increasing with the different components of . This leads to multiple actions that, in the same state , yield the same while having different costs. To avoid having to deal with such cases, we impose that each energy storage system can only be charged or discharged at a given instant t:
Therefore, we express the charge and discharge of each battery in a single dimension:
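This single-dimension expression can be sketched as follows, using our own convention that a positive value means charging (the paper's sign convention may differ): the decomposition recovers the charge and discharge components and guarantees by construction that a storage is never charged and discharged at the same instant.

```python
def split_signed_action(delta_e: float) -> tuple[float, float]:
    """Decompose a signed per-storage action into (charge, discharge).

    delta_e > 0: charge only; delta_e < 0: discharge only (assumed convention)."""
    charge = max(delta_e, 0.0)
    discharge = max(-delta_e, 0.0)
    return charge, discharge

print(split_signed_action(8.0))   # (8.0, 0.0)  -> charging 8 kW·h
print(split_signed_action(-3.0))  # (0.0, 3.0)  -> discharging 3 kW·h
```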
We propose to use these new variables as the action space:
Thus, we obtain the formulation 2Dbatt of (13) with
Next, we revisit the constraints with this new action space. When we only charge (), straightforward calculations result in (2) being equivalent to
and (3) turns into:
When we only discharge () (2) becomes:
Accordingly, (4) is equivalent to:
3.2. Batteries Storage Repartition
In the 2Dbatt formulation, one storage can discharge while the other is charging, which results in a loss of energy. Moreover, the action bounds (27) depend not only on the state but also on the action itself. The bounds are therefore interdependent. If we select an action outside the action bounds, we need to project it back inside the bounds, which is non-trivial because of this interdependence.
To alleviate this problem, we propose to rotate the action space frame. We merge the two action dimensions into the energy storage systems contribution and the contribution repartition defined as:
so that the action becomes . is the proportion of hydrogen in the storage. It is equal to 0 when only the battery storage is used and to 1 when only the hydrogen storage is used. is bounded between 0 and 1 by definition, so that one energy storage cannot charge the other. Furthermore, we only convert from the Repartition formulation to the 2Dbatt formulation, and not the other way around. This is illustrated in Figure 2. To insert the new variables into the 2Dbatt formulation, we use the following equations:
Figure 2.
Repartition formulation (green), and , in the 2Dbatt (blue) action space. Actions where one storage is charged and the other discharged are highlighted in red.
We obtain the battery variant of those equations using (31):
Equation (40) depends only on one variable, . Using this variable change, we have removed the interdependency of the constraint in (27).
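A minimal sketch of the Repartition-to-2Dbatt conversion described above: the total contribution is split between hydrogen and lead according to the repartition, then each signed part is decomposed into charge and discharge. Names and the positive-means-charging convention are ours; the exact Equations (31)–(40) are not reproduced.

```python
def repartition_to_2dbatt(delta_e_storage: float, alpha: float) -> dict[str, float]:
    """Convert the (contribution, repartition) action into per-storage actions.

    alpha in [0, 1] is the proportion of the contribution handled by hydrogen;
    alpha = 0 uses only the lead battery, alpha = 1 only the hydrogen storage."""
    alpha = min(max(alpha, 0.0), 1.0)             # projection into [0, 1]
    delta_h2 = alpha * delta_e_storage            # hydrogen part (signed)
    delta_lead = (1.0 - alpha) * delta_e_storage  # lead part (signed)
    return {
        "h2_charge": max(delta_h2, 0.0), "h2_discharge": max(-delta_h2, 0.0),
        "lead_charge": max(delta_lead, 0.0), "lead_discharge": max(-delta_lead, 0.0),
    }

print(repartition_to_2dbatt(-8.0, 0.25))
# {'h2_charge': 0.0, 'h2_discharge': 2.0, 'lead_charge': 0.0, 'lead_discharge': 6.0}
```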
Next, we propose bounds on and that will be critical in the sequel.
Proposition 1.
is constrained by and their values are defined by:
Proof of Proposition 1 is in Appendix A.
Proposition 2.
is constrained by with values are defined by:
Proof of Proposition 2 is in Appendix B. Please note that when , does not matter. We will set it to as a convention.
The interest of bounds (43) and (44) is that they depend only on , whereas the bounds on do not depend on . Thus, given the contribution , we only need to decide how it is distributed. The interdependence has been completely removed.
3.3. Repartition Parameter Only
We have noticed that can be seen as a single global storage. To provide energy for as long a duration as possible, i.e., to respect (40), we want to charge as much as possible and discharge only when needed. We call this the frugal policy. It corresponds to being equal to its lower bound:
To reduce even more the action space dimensionality, we propose to use the frugal policy and to focus on learning only the repartition between the lead and hydrogen energy storage systems contribution.
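A behavioral sketch of the frugal policy, using the same positive-means-charging convention as the sketches above: store whatever the solar circuit can spare and discharge only what is needed to keep the datacenter supplied. The bound computation below is a simplified stand-in, not the exact lower bound of Proposition 1.

```python
def frugal_contribution(pv_production: float, datacenter_load: float,
                        max_charge: float, max_discharge: float) -> float:
    """Frugal policy: charge as much as possible, discharge only when needed.

    Simplified reading: if PV covers the datacenter, store the surplus (up to the
    charge limit); otherwise discharge just enough to cover the deficit."""
    surplus = pv_production - datacenter_load
    if surplus >= 0.0:
        return min(surplus, max_charge)       # charge as much as possible
    return -min(-surplus, max_discharge)      # discharge only what is needed

print(frugal_contribution(50.0, 10.0, max_charge=30.0, max_discharge=30.0))  # 30.0
print(frugal_contribution(0.0, 10.0, max_charge=30.0, max_discharge=30.0))   # -10.0
```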
3.4. Fixed Repartition Policy
In Section 4, we will propose a learning algorithm for the different formulations. To show the interest of learning, we want to compare the learned policies to a frugal policy (46) where is preselected and fixed to a value v. At each instant, we only verify that and project it into this interval otherwise. We call the policy where is preset to the value v, so that:
In Section 1, we explained that the lead battery is intended for short-term storage and that the hydrogen storage is intended for long-term storage. Our intuition therefore suggests charging or discharging the lead battery first. This corresponds to a preset value of , so that .
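A sketch of this fixed-repartition baseline: the repartition is preset to a value v and only projected into its state-dependent bounds when necessary. The bound arguments stand in for the bounds of Proposition 2, and the preset value 0.5 below is purely illustrative.

```python
def fixed_repartition_policy(v: float, alpha_min: float, alpha_max: float) -> float:
    """Preset repartition policy: return v projected into [alpha_min, alpha_max].

    alpha_min and alpha_max stand in for the state-dependent bounds of Proposition 2;
    v = 0 would use the lead battery first, v = 1 the hydrogen storage first."""
    return min(max(v, alpha_min), alpha_max)

# The preset value is only overridden when the bounds require it.
print(fixed_repartition_policy(0.5, alpha_min=0.0, alpha_max=1.0))  # 0.5
print(fixed_repartition_policy(0.5, alpha_min=0.7, alpha_max=1.0))  # 0.7
```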
One may wonder: what is the best preselected ? To find it, we simulated 100 different values of between 0 and 1. For each value v, we run one simulation per day of 2006, each starting at midnight and looping over the year. A detailed description of these data is available in Section 5.1. We use the parameters in Table 3 and the PV production computed from irradiance data [17,18] using (48):
If the simulation does not last the whole year, we reject it (hatched area in Figure 3). Otherwise, we compute the hourly carbon impact:
with T the number of hours in 2006. This hourly impact is averaged over 365 different runs, each starting at midnight, one for each day of 2006. Figure 3 shows the carbon impact versus . The value that minimizes the average hourly impact while lasting the whole year is therefore . It will be used for comparison.
Figure 3.
Mean impact versus the preset value. The hatched area corresponds to rejected values where the policy does not last the whole year.
4. Learning the Policy with DDPG
In the repartition-only reformulation, we want to select given the state . The function that provides given is referred to as the policy. We want to learn the policy using DRL with an actor–critic, policy-based approach: the Deep Deterministic Policy Gradient (DDPG) [19]. Experts may want to skip Section 4.2 and Section 4.3.
4.1. Actor–Critic Approach
We call environment the set of equations (1) and its battery variant, which allows obtaining the next state from the current state and action . Its corresponding reward, the short-term evaluation function, is defined as a function of and . We use [19], an actor–critic approach, where the estimated best policy for a given environment is learned through a critic, as in Figure 4. The critic transforms this short-term evaluation into a long-term evaluation, the Q-values , through learning. It will be detailed in Section 4.2. The actor is the function that selects the best possible action. It uses the critic to know what is the best action in a given state (as detailed in Section 4.3).
Figure 4.
Overview of the actor–critic approach. Curved arrows indicate learning. The passing of time is also displayed.
In Section 2.4, we set our objective to minimize the long-term carbon impact (13). However, in reinforcement learning we try to maximize a score, defined as the sum of all rewards:
To remove this difference, we maximize the negative carbon impact . However, the more negative terms are added, the lower the sum gets. This leads to a policy that tries to stop the simulation as fast as possible, in contradiction with our goal of always supplying the datacenter with energy. To counter this, we propose, inspired by [20], to add a living incentive of 1 at each instant. Therefore, we propose to define the reward as:
The reward accounting for the carbon impact is now normalized between 0 and 1, so the reward is always positive. Still, in this reward the normalization depends on the state . When the normalization depends on the state, two identical actions can have different rewards associated with them. Therefore, the reward is not proportional to the carbon impact (45), making it harder to interpret. To alleviate this problem, we propose to use the global maximum instead of the worst case for the current state:
By convention is set to zero after the simulation ends.
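A plausible reading of the globally normalized reward (52), under our own naming: one unit of living incentive per surviving step minus the carbon impact scaled by its global maximum, and zero once the simulation has ended.

```python
def reward(carbon_impact: float, impact_global_max: float, done: bool) -> float:
    """Reward = living incentive (1 per step) minus the globally normalized carbon impact.

    After the simulation ends (the islanded datacenter is no longer powered), the reward is 0."""
    if done:
        return 0.0
    return 1.0 - carbon_impact / impact_global_max

print(reward(carbon_impact=2500.0, impact_global_max=10000.0, done=False))  # 0.75
print(reward(carbon_impact=2500.0, impact_global_max=10000.0, done=True))   # 0.0
```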
The actor and critic are parameterized using artificial neural networks, respectively denoted and . They will be learned alternatively and iteratively. Two stabilization networks are also used for the critic supervision with weights and .
4.2. Critic Learning
Now that we have defined a reward, we can use the critic to transform it into a long-term metric. As time goes on, we have less and less trust in the future. Therefore, we discount the future rewards using a discount factor . We define the critic . It estimates the weighted long-term return of taking an action in a given state . This weighting of (50) also keeps the otherwise infinite sum bounded so that it can be learned. Q can be expressed recursively:
We learn the Q-function using an artificial neural network of weights . At the iteration of our learning algorithm and for a given value of and , we define a reference value from the recursive expression (53). Since we do not know , we need to select the best action possible at . The best estimator of this action is provided by the policy , so that we define the reference as:
where has been estimated by .
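For reference, a standard DDPG writing of this recursion and of the reference value, which we assume Equations (53) and (54) follow (Q′ and μ′ denote the stabilization networks, in our notation):

```latex
\begin{align}
  Q(s_t, a_t) &= r(s_t, a_t) + \gamma \, Q\big(s_{t+1}, \mu(s_{t+1})\big)
    && \text{(recursion, cf. (53))} \\
  y_i &= r(s_t, a_t) + \gamma \, Q'\big(s_{t+1}, \mu'(s_{t+1}); \theta^{Q'}\big)
    && \text{(reference value, cf. (54))}
\end{align}
```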
The squared difference between the estimated value , and the reference value [21] is defined as:
To update , we minimize in (55) using a simple gradient descent:
where is the gradient of in (55) with respect to , taken at the value . is a small positive step-size. To stabilize the learning, [19] suggests updating the reference network more slowly, so that:
at weight initialization.
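The slow (soft) update of a stabilization network mentioned above can be sketched as follows; tau is the small mixing coefficient and the flat parameter lists stand in for the network weights.

```python
def soft_update(target_weights: list[float], online_weights: list[float], tau: float) -> list[float]:
    """Slowly track the online network: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_weights, target_weights)]

# With a small tau, the stabilization (reference) network lags behind the online one.
print(soft_update([0.0, 1.0], [1.0, 0.0], tau=0.01))  # [0.01, 0.99]
```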
4.3. Actor Learning
Since we alternate the updates of the critic and of the actor, we now address the learning of the actor. To learn what is the best action to select, we need a loss function that grades the different actions . Using the reward function (52) as a loss function, the policy would select the best short-term, instantaneous action. Since the critic depends on the action , we replace by . At iteration i, to update the actor network , we use the gradient ascent of the average taken at . This can be expressed as:
where is a small positive step-size.
To learn the critic, a stabilized actor is used. Like the stabilized critic, is updated by:
with at the beginning.
During learning, an Ornstein–Uhlenbeck noise [22], n, is added to the policy decision to make sure we explore the action space:
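A minimal discrete-time Ornstein–Uhlenbeck process of the kind used for exploration; theta, sigma and dt are illustrative hyperparameters, not values from the paper.

```python
import random

class OUNoise:
    """Discrete-time Ornstein–Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, theta: float = 0.15, sigma: float = 0.2, dt: float = 1.0, mu: float = 0.0):
        self.theta, self.sigma, self.dt, self.mu = theta, sigma, dt, mu
        self.x = mu

    def sample(self) -> float:
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * (self.dt ** 0.5) * random.gauss(0.0, 1.0)
        self.x += dx
        return self.x

noise = OUNoise()
exploratory_alpha = 0.5 + noise.sample()  # perturb the policy decision during learning
```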
4.4. Proposition: DDPG Algorithm to Learn the Policy
From the previous section, we propose the DDPG Algorithm 1. This algorithm alternates the learning of the actor and critic networks. We randomly select the initial instant t to avoid learning time patterns. We start each run with full energy storage.
Once learned, we use the last weights of the neural network parameterizing the actor to select the action using directly.
To learn well, an artificial neural network needs the different learning samples to be uncorrelated. In reinforcement learning, two consecutive states tend to be close, i.e., correlated. To overcome this problem, we store all experiences in a memory and use a tiny random subset as the learning batch [23]. The random selection of a batch from the memory is called .
| Algorithm 1: DDPG |
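The pseudocode of Algorithm 1 is not reproduced here; the following is only a structural sketch of the training loop it describes, with stand-in components (a random environment and dummy networks) in place of the real ones. All names, dimensions and hyperparameters are ours.

```python
import random
from collections import deque

def run_ddpg_sketch(episodes: int = 2, steps: int = 24) -> None:
    """Sketch of the loop: explore with noise, store experiences, sample decorrelated
    batches, and (in a real implementation) update the critic, the actor and their
    stabilization networks."""
    memory: deque = deque(maxlen=10_000)      # experience replay memory [23]
    updates = 0                               # stands in for the gradient steps

    def env_reset() -> list[float]:           # stand-in: random state, full storages
        return [random.random() for _ in range(4)]

    def env_step(state: list[float], alpha: float) -> tuple[list[float], float, bool]:
        # Stand-in environment: returns next state, reward and a done flag.
        return [random.random() for _ in range(4)], 1.0 - random.random(), random.random() < 0.01

    def actor(state: list[float]) -> float:   # stand-in policy: repartition in [0, 1]
        return sum(state) / len(state)

    for _ in range(episodes):
        state = env_reset()                   # random initial instant, full storages
        for _ in range(steps):
            alpha = actor(state) + random.gauss(0.0, 0.1)  # exploration noise (OU in the paper)
            alpha = min(max(alpha, 0.0), 1.0)              # project into the admissible interval
            next_state, reward, done = env_step(state, alpha)
            memory.append((state, alpha, reward, next_state, done))
            if len(memory) >= 64:
                batch = random.sample(list(memory), 64)    # decorrelated learning batch
                updates += 1  # here the batch would feed: the critic step towards target (54),
                              # the actor ascent (58), and the soft updates (57), (59)
            state = next_state
            if done:
                break
    print("gradient steps simulated:", updates)

run_ddpg_sketch()
```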
5. Simulation
We have just proposed DDPG to learn how to choose with respect to the environment. In this section, we present the simulation settings and results.
5.1. Simulation Settings
Production data are computed using (48) from real irradiance data [17,18] measured at the building location in Avignon, France. The building has of solar panels with opacity and an efficiency of . Those solar panels can produce a maximum of kW·h per hour.
Consumption data come from projections of the engineering office [16]. They consist of powering housing units with an electricity demand fluctuating daily between 30 kW·h (1 a.m. to 6 a.m.) and 90 kW·h. During waking hours, the consumption varies between workdays and the weekend by a factor between 1 and 1.4. There is little interseasonal variation (a standard deviation of 0.6 kW·h between seasons, 0.01% of the yearly mean), as heating uses wood pellets. In those simulations, the datacenter consumes a fixed amount of kW·h. The datacenter consumption adds up to 87.6 MW·h per year, around 17% of the 496 MW·h that the entire building consumes in a year. To power this datacenter, our building’s solar panels produce an average of 53.8 kW·h/h during the sunny hours of an average day, for a yearly total of 249 MW·h. This covers at most 2.8 times the consumption of our datacenter, but drops to 99% if all the energy goes through the hydrogen storage. The same solar production covers at most 50% of the building’s yearly consumption. When accounting for the hydrogen efficiency, the solar production covers at most 17% of the building consumption.
We only use half of the lead battery capacity to preserve the battery health longer kW·h. The lead battery carbon intensity is split between the charge and discharge ·h. Since the charge quantity comes before the efficiency, its carbon intensity must account for efficiency: ·h. The carbon intensity of the electrolysers, accounting for the efficiency, is used for ·h. The carbon intensity of the fuel cells corresponds to ·h. accounts for both the electrolyser and fuel cell efficiencies. ·h uses the average French grid carbon intensity. All those values are reported in Table 3.
The simulations use an hourly simulation step t.
We train on the production data from year 2005, validate and select hyperparameters using the best score (50) values on the year 2006, and finally test on year 2007. Each year lasts 8760 h.
To improve learning, we normalize all state and action inputs and outputs between and 1. For a given value d bounded between and :
is then used as an input for the networks.
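A sketch of this min–max normalization, under the assumption that the target range is [−1, 1] (consistent with the tanh output layer mentioned below); the exact bounds used in the paper are not reproduced.

```python
def normalize(d: float, d_min: float, d_max: float) -> float:
    """Min-max scaling of a bounded value d into [-1, 1] (assumed target range)."""
    return 2.0 * (d - d_min) / (d_max - d_min) - 1.0

print(normalize(30.0, 0.0, 120.0))  # -0.5
```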
To accelerate the learning, all gradient descents are performed using Adam [24]. During training, we use the following step sizes to learn the critic and for the actor. For the stabilization networks, . To learn, we sample batches of 64 experiences from a memory of experiences. The actor and critic both have 2 hidden layers with a ReLU activation function, with respectively 400 and 300 units. The output layer uses a tanh activation to bound its output. The discount factor, in (54), is optimized as a hyperparameter between 0.995 and 0.9999. We found the best value for the discount factor to be 0.9979.
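As an illustration of the architecture described above (two hidden layers of 400 and 300 ReLU units, a tanh output to bound the actor), here is one possible PyTorch definition; the state and action dimensions, the critic's unbounded output head, and the concatenation of the action at the critic input are our assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: state -> bounded output in [-1, 1], rescaled to the action bounds."""
    def __init__(self, state_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.Tanh(),        # tanh bounds the output
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Q-network: (state, action) -> estimated long-term return."""
    def __init__(self, state_dim: int = 4, action_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),                   # unbounded Q-value estimate
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

q = Critic()(torch.zeros(1, 4), torch.zeros(1, 1))   # example forward pass
```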
5.2. Simulation Metrics
We name duration, and note N, the average length of the simulations. When all simulations last the whole year, the hourly carbon impact is evaluated as in (49). To select the best policy, the average score is computed using (50). The self-consumption, defined as the energy provided by the solar panels, directly or indirectly through one of the storages, divided by the consumption, is computed using:
Per the ÉcoBioH2 project, the goal is to reach 35% of self-consumption: .
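A sketch of the self-consumption metric (62): energy supplied by the PV, directly or through one of the storages, divided by the total consumption. The variable names and the numerical inputs are illustrative, except the 496 MW·h yearly consumption quoted above.

```python
def self_consumption(solar_direct: float, solar_via_storage: float, total_consumption: float) -> float:
    """Fraction of the building consumption covered by on-site PV energy,
    delivered either directly or through one of the storages."""
    return (solar_direct + solar_via_storage) / total_consumption

sc = self_consumption(solar_direct=120_000.0, solar_via_storage=55_000.0, total_consumption=496_000.0)
print(f"{sc:.1%}")  # 35.3% -> meets the 35% ÉcoBioH2 target
```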
5.3. Simulation Results
The following learning algorithms are simulated on data from Avignon from 2007 and our building:
- DDPGTwoBatts: DDPG with actions
- DDPGRepartition: DDPG with actions
- proposed DDPG with action
where DDPGTwoBatts and DDPGRepartition are algorithms similar to DDPG, with the action spaces of the corresponding formulations, respectively (11) and (17). The starting time is randomly selected from any hour of the year.
To test the learned policies, the duration, hourly impact (49), score (50) and self-consumption (62) metrics are computed on the 2007 irradiance data and averaged over all runs. We compute those metrics over 365 different runs, each starting at midnight of a day of 2007. For the sake of comparison, we also compute those metrics, when applicable, for the preselected values and , using (47) on the same data. Recall that the fixed values are bounded by (43) and (44) to ensure the long-term duration.
The metrics over the different runs are displayed in Table 4.
Table 4.
Results computed on the year 2007. n.a.: not applicable.
We can see in Table 4 that DDPGTwoBatts and DDPGRepartition do not last the whole year. This shows the importance of our reformulations to reduce the action space dimensions. We observe that all policies using the reformulation last the whole year (). This validates our proposed reformulations and dimension reduction.
achieves the lowest carbon impact; however, it cannot ensure the self-consumption target. On the other hand, achieves the target self-consumption at the price of a higher carbon impact. The proposed DDPG provides a good trade-off between the two by adapting to the state . It reaches the target self-consumption minus 0.1% and lowers the carbon impact with respect to . The carbon emission gain over the intuitive policy , which uses hydrogen only as a last resort, is 43.8 gCO₂eq/year. This shows the interest of learning the policy once the problem is well formulated.
5.4. Reward Normalization Effect
In Section 4.1, we presented two ways to normalize the carbon impact in the reward. In this section, we show that the proposed global normalization (52) yields better results than the local state-specific normalization (51).
In Table 5, we display the duration for both normalizations. We see that the policies using the locally normalized reward have a lower duration than those using the globally normalized reward. This confirms that the local normalization is harder to learn, as two identical actions have different rewards in different states.
Table 5.
Learned policies duration depending on the reward normalization: local or global. Using simulations on 2007 test dataset.
Therefore, the higher dynamic range of the local normalization is not worth the variability it induces. This validates our choice of the global normalization (52) for the proposed DDPG algorithm.
5.5. Hydrogen Storage Efficiency Impact
In our simulations, we have seen the sensitivity of our carbon impact results to the parameters in Table 3. Indeed, the efficiency of the storage has a great impact on the system behavior. Hydrogen storage yields lower carbon emissions when its efficiency is above some threshold. The greater is, the greater can be, and so the wider the range for adapting via learning. To find the threshold on , we first compute the total carbon intensity of storing one kW·h in a given storage, including the carbon intensity of the energy production. For , we obtain:
We display the value of (63) for both storages in Figure 5 with respect to ; the other parameters are taken from Table 3. When , learning is useful since the policy must balance the lower carbon impact (using the hydrogen storage) with the low efficiency (using the battery storage). When , the learned policy converges to , as both objectives (minimizing the carbon impact and the continuous powering of the datacenter) align.
Figure 5.
The total hydrogen storage impact depending on the efficiency of storage.
We calculate, from (63) and its battery variant, the threshold efficiency at which both storages yield the same total carbon intensity:
Using the values of Table 3 in (64), hydrogen improves the carbon impact only when . The current value is , so learning is also useful, as shown in the simulations of Table 4. We can also expect that as the hydrogen storage efficiency improves in the future, the impact of learning will become even more important.
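Equations (63) and (64) are not reproduced above. Under our own simplifying assumption that delivering one kW·h from a storage of efficiency η requires 1/η kW·h of PV energy charged with the corresponding charge intensity, a plausible form of the comparison is the following; it is an illustration of the reasoning, not the paper's exact expression.

```latex
% Assumed form, for illustration only (not the paper's Eq. (63)):
\begin{align}
  I^{\mathrm{tot}}_{X} &= \frac{I_{\mathrm{PV}} + I_{\mathrm{charge},X}}{\eta_X}
      + I_{\mathrm{discharge},X}, \qquad X \in \{\mathrm{lead}, \mathrm{H_2}\}, \\
  \text{threshold: } & \quad I^{\mathrm{tot}}_{\mathrm{H_2}}\big(\eta_{\mathrm{H_2}}\big)
      = I^{\mathrm{tot}}_{\mathrm{lead}}\big(\eta_{\mathrm{lead}}\big).
\end{align}
```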
6. Conclusions
We have addressed the problem of controlling the hybrid energy storage of a partially islanded building with a goal of carbon impact minimization and self-consumption. We have reformulated the problem to reduce the number of components of the action to one, , the proportion of hydrogen storage given the building state . To learn the policy , we propose a new DRL algorithm using a reward tailored to our problem, DDPG. The simulation results show that, when the hydrogen storage efficiency is large enough, learning allows a decrease in the carbon impact while lasting at least one year and maintaining the 35% self-consumption. As hydrogen storage technologies improve, the proposed algorithm should have even more impact.
Learning the policy using the proposed DDPG can also be done when the storage model includes non-linearities. Learning can also adapt to climate change over time by using more recent data. To measure such benefits, we will use real ÉcoBioH2 data, to be collected during the follow-up of the project. Learning from real data will reduce the gap between the model and the real system, which should improve performance. The proposed approach could also be used to optimize other environmental metrics with a multi-objective cost in .
With our current formulation, policies cannot assess what day and hour it is, as they only have two state variables from which to infer the hour: and . They cannot differentiate between 1 a.m. and 4 a.m., as those two times have the same consumption and no PV production. They also cannot differentiate between a cloudy summer and a clear winter, as production and consumption are close in those two cases. In the future, we will consider taking the current time into account to enable the learned policy to adapt its behavior to the time of the day and the month of the year.
Author Contributions
Conceptualization, L.D., I.F. and P.A.; methodology, L.D., I.F. and P.A.; software, L.D.; validation, L.D. and I.F.; formal analysis, L.D.; investigation, L.D.; resources, L.D.; data curation, L.D.; writing—original draft preparation, L.D.; writing—review and editing, I.F. and P.A.; visualization, L.D.; supervision, I.F and P.A.; project administration, I.F.; funding acquisition, I.F. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by French PIA3 ADEME (French Agency For the Environment and Energy Management) for the ÉcoBioH2 project.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Publicly available irradiance datasets were analyzed in this study. This data can be found here: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-pay (accessed on: 1 October 2020) based on [18]. Restrictions apply to the availability of consumption data. Data were obtained from ÉcoBio via ZenT and are available at https://zent-eco.com/ with the permission of ZenT.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
| DDPG | Deep Deterministic Policy Gradient |
| DRL | Deep Reinforcement Learning |
| PV | PhotoVoltaic |
Appendix A. Proof of Proposition 1
Appendix A.1. When δEstorage(t) < 0
Appendix A.2. When δEstorage(t) > 0
References
- PIA3 ADEME (French Agency for the Environment and Energy Management). Project ÉcoBioH2. 2019. Available online: https://ecobioh2.ensea.fr (accessed on 2 June 2021).
- Bocklisch, T. Hybrid energy storage systems for renewable energy applications. Energy Procedia 2015, 73, 103–111. [Google Scholar] [CrossRef] [Green Version]
- Pu, Y.; Li, Q.; Chen, W.; Liu, H. Hierarchical energy management control for islanding DC microgrid with electric-hydrogen hybrid storage system. Int. J. Hydrogen Energy 2018, 44, 5153–5161. [Google Scholar] [CrossRef]
- Diagne, M.; David, M.; Lauret, P.; Boland, J.; Schmutz, N. Review of solar irradiance forecasting methods and a proposition for small-scale insular grids. Renew. Sustain. Energy Rev. 2013, 27, 65–76. [Google Scholar] [CrossRef] [Green Version]
- Desportes, L.; Andry, P.; Fijalkow, I.; David, J. Short-term temperature forecasting on a several hours horizon. In Proceedings of the ICANN, Munich, Germany, 17–19 September 2019. [Google Scholar] [CrossRef] [Green Version]
- Zhang, Z.; Nagasaki, Y.; Miyagi, D.; Tsuda, M.; Komagome, T.; Tsukada, K.; Hamajima, T.; Ayakawa, H.; Ishii, Y.; Yonekura, D. Stored energy control for long-term continuous operation of an electric and hydrogen hybrid energy storage system for emergency power supply and solar power fluctuation compensation. Int. J. Hydrogen Energy 2019, 44, 8403–8414. [Google Scholar] [CrossRef]
- Carapellucci, R.; Giordano, L. Modeling and optimization of an energy generation island based on renewable technologies and hydrogen storage systems. Int. J. Hydrogen Energy 2012, 37, 2081–2093. [Google Scholar] [CrossRef]
- Bishop, C.M. Pattern Recognition and Machine Learning, 1st ed.; Springer: New York, NY, USA, 2006; pp. 1–2. [Google Scholar]
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
- Vosen, S.; Keller, J. Hybrid energy storage systems for stand-alone electric power systems: Optimization of system performance and cost through control strategies. Int. J. Hydrogen Energy 1999, 24, 1139–1156. [Google Scholar] [CrossRef]
- Kozlov, A.N.; Tomin, N.V.; Sidorov, D.N.; Lora, E.E.S.; Kurbatsky, V.G. Optimal Operation Control of PV-Biomass Gasifier-Diesel-Hybrid Systems Using Reinforcement Learning Techniques. Energies 2020, 13, 2632. [Google Scholar] [CrossRef]
- François-Lavet, V.; Taralla, D.; Ernst, D.; Fonteneau, R. Deep Reinforcement Learning Solutions for Energy Microgrids Management. In Proceedings of the European Workshop on Reinforcement Learning (EWRL), Pompeu Fabra University, Barcelona, Spain, 3–4 December 2016. [Google Scholar]
- Tommy, A.; Marie-Joseph, I.; Primerose, A.; Seyler, F.; Wald, L.; Linguet, L. Optimizing the Heliosat-II method for surface solar irradiation estimation with GOES images. Can. J. Remote Sens. 2015, 41, 86–100. [Google Scholar] [CrossRef]
- David, J. L 2.1 EcoBioH2, Internal Project Report. 9 July 2019. Available online: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-pay (accessed on 1 October 2020).
- Soda-Pro. HelioClim-3 Archives for Free. 2019. Available online: http://www.soda-pro.com/web-services/radiation/helioclim-3-archives-for-free (accessed on 11 March 2019).
- Rigollier, C.; Lefèvre, M.; Wald, L. The method Heliosat-2 for deriving shortwave solar radiation from satellite images. Solar Energy 2004, 77, 159–169. [Google Scholar] [CrossRef] [Green Version]
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
- Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 834–846. [Google Scholar] [CrossRef]
- Ernst, D.; Geurts, P.; Wehenkel, L. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. 2005, 6, 503–556. [Google Scholar]
- Uhlenbeck, G.E.; Ornstein, L.S. On the theory of the Brownian motion. Phys. Rev. 1930, 36, 823. [Google Scholar] [CrossRef]
- Lin, L.J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
