Optimal Operation Control of PV-Biomass Gasifier-Diesel-Hybrid Systems Using Reinforcement Learning Techniques

Abstract: The importance of efficient utilization of biomass as a renewable energy source in terms of global warming and resource shortages is well known and documented. Biomass gasification is a promising power technology, especially for decentralized energy systems, and decisive progress has been made in the development of gasification technologies during the last decade. This paper deals with the control and optimization problems of an isolated microgrid combining renewable energy sources (solar energy and biomass gasification) with a diesel power plant. The control problem of the isolated microgrid is formulated as a Markov decision process, and we study how reinforcement learning can be employed to address this problem and minimize the total system cost. The most economic microgrid configuration was found; it uses biomass gasification units with an internal combustion engine operating both in single-fuel mode (producer gas) and in dual-fuel mode (diesel fuel and producer gas).


Introduction
The development of hybrid energy systems based on renewable energy sources (RES) requires solving many practical problems, including the selection of the optimal power system structure (the ratio of capacities of the energy sources and storage systems) and its control. These characteristics depend both on the technical and economic indicators of the energy sources and on the availability and energy potential of renewable resources in a given area, including the distribution of this potential (wind speed and solar radiation intensity) over time. These problems attract many specialists [1][2][3], including experts in the development of data-driven unit commitment problem solvers. Various software packages have been developed (Homer, Calliope, RETScreen, DER-CAM, Compose, iHOGA, and others) to calculate the potential of renewable energy and to support the best choice of the hybrid system's components [4]. The optimization of the power and components of a hybrid system with renewable energy sources is in most cases carried out to minimize the cost of generated energy, taking into account all costs, while providing 100% reliability of energy supply. The following optimization criteria have also been employed: energy efficiency, maximum energy production from a specific renewable source, maximum use of installed renewable generation capacity, exergy efficiency, minimization of the payback period, minimization of capital costs, and environmental impact, from the selection of suitable raw materials to the control of processes in the reactor and the disposal of emissions [32][33][34].
Biomass is characterized by a high moisture content and a variable size distribution of the source material; high reactivity compared to fossil coal [35]; variability of the mechanical properties of particles (a tendency to agglomerate [34,36] or, conversely, to break down [37,38]); the formation of significant amounts of tarry products during heating and oxidation [33]; and a low ash content. The ashes, however, often have increased corrosion properties and a tendency to form fly ash [34,37,38]. Many biomass processing routes have been proposed [28,33], but their efficiency is very sensitive to the conditions of their implementation. There are also more specific conversion processes, such as plasma processing [39,40] or the use of supercritical water [33,41], but they are technologically more complicated and require higher energy costs.
Pyrolysis and gasification are potentially applicable in small- and medium-capacity generation [42,43], usually working with an internal combustion engine [44,45], a microturbine [46], or a gas burner [47]. In addition, the combustion and gasification of biomass can be applied at large thermal stations to partially replace coal and reduce emissions [48][49][50]. The processes of co-combustion of coal and biomass were also considered in [36,[51][52][53][54].
A promising solution for the optimal control of hybrid microgrids with various flexible and inflexible power sources is the modeling and control of the operating modes of such systems as a Markov decision process (MDP). Such a formulation allows one to obtain a rather realistic model of a hybrid microgrid with various states, control actions, and probabilistic transitions between them. Among the most advanced methods for solving MDP problems is reinforcement learning (RL). Trained RL agents, knowing most of the optimal solutions, can be employed to control the energy management of the power system or microgrid in real time. Such an approach significantly reduces online computational costs, because the stochastic optimization problem is solved offline to find the optimal policy for all possible scenarios. In recent years, several successful studies have been published on the use of advanced RL methods for the optimal control of microgrids, based on deep Q-networks (DQN) [55,56], Monte-Carlo tree search (MCTS) [57], deep policy gradient [58], batch RL [59], multi-agent RL [60], etc. Part of the research is devoted to comparing the effectiveness of RL methods (capable of giving quick but approximate solutions) with traditional optimization methods, for example, mixed-integer linear programming (MILP) [61,62].
The aim of this work is to calculate and to optimize the operation of a hybrid microgrid based on renewable energy sources (solar energy and biomass gasification) and a traditional diesel power station. In order to achieve the formulated objectives, the following tasks were solved:

1. The control problem of an isolated microgrid is formulated as an MDP. A modified open-source RL framework is employed for the modeling of an off-grid microgrid to investigate how state-of-the-art RL techniques can utilize the simulated data in order to learn an operation policy that minimizes the total system cost.

2. The biomass gasification unit is employed to obtain producer gas. At the same time, the operation of the internal combustion engine (generator) is considered only in producer-gas mode and in dual-fuel mode (producer gas and diesel fuel). These units operate as steerable generators in different configurations of a microgrid.

3. An optimization model based on MILP, which gives a good approximation of the lower bound of the control problem, is used as a reference for comparing the effectiveness of the RL models.
This paper is organized as follows. Section 2 describes the MDP-based simulation environment used by the RL methods presented in Section 3. Section 4 describes the case study and the results. The concluding remarks are given in Section 5.

Microgrid MDP-Based Environment Simulator
A distinctive feature of microgrids is the use of stochastic components: RES on the generation side and flexible active loads on the consumption side. In comparison with large power systems, microgrids are capable of independently generating and delivering electricity to consumers, but only at a local level. To ensure reliable and optimal operation of the microgrid, such grids use an energy management system which, in accordance with the developed policy (management strategy), is able to automatically switch between energy sources, exchange energy with an external network, and even shed load if necessary. At the same time, the possible activity of consumers and the presence of RES introduce a stochastic nature into the optimization problem, and the desire for off-grid operation makes it necessary to apply the principles of online optimization.
Online optimization is an application of stochastic optimization that studies sequential decision making. One of the standard modeling approaches in this case is the MDP, which is a specification of the sequential decision-making problem for a fully observable environment with a Markov transition model and additive rewards. MDPs are useful for studying optimization problems solved by dynamic programming and reinforcement learning. In recent years, the MDP has appeared to be a promising mathematical formulation of the microgrid operation optimization problem [63,64]. A number of studies clearly demonstrate the effectiveness of energy microgrid management using MDP-based methods: dynamic programming [65,66], deep RL [55,56,58,67], and Monte Carlo models [57,68].
This paper proposes an MDP-based environment that aims at simulating the techno-economic performance of a hybrid AC/DC microgrid, and in particular at quantifying the performance of an agent responsible for controlling the devices of the microgrid, as a function of the random processes governing all the variables that impact the microgrid operation, e.g., consumption, renewable generation, and market prices. Components of the microgrid include non-steerable (i.e., renewable PV or wind) and steerable (i.e., diesel, gasified biomass, or co-fired) generators, as well as battery energy storage systems and different types of loads. When the energy from the storages and from the non-flexible production is not sufficient to ensure that the loads are served, the steerable generators supply the remaining energy.

Dynamics
The simulated system is composed of several consumption, storage, and generation devices. In this paper, intermittent generation and non-flexible consumption are represented by real data gathered from an off-grid microgrid.

Storage
Let us employ a linear model for the simulation of the battery, since it is assumed that the simulation time-step size ∆t is large enough (1 h). The dynamics of a battery are modeled as

SOC(t + 1) = SOC(t) + ∆t (η^charge P_t^charge − P_t^discharge / η^discharge), (1)

where SOC(t) denotes the state of charge at time step t; P_t^charge and P_t^discharge correspond to the charging and discharging power, respectively; and η^charge, η^discharge represent the charging and discharging efficiencies of the storage system, respectively. The charging (P_t^charge) and discharging (P_t^discharge) power of the battery are assumed to be limited by a maximum charging and discharging rate, respectively. For more sophisticated models of storage systems, readers may refer to [69] and the references therein.
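The linear storage model above can be sketched in a few lines of code. This is a minimal illustration, not the authors' simulator; the function name and the default parameter values (efficiencies, capacity, maximum rate) are assumptions.

```python
# Hypothetical sketch of the linear battery model described above.
# Parameter values are illustrative, not taken from the paper.

def soc_step(soc, p_charge, p_discharge, dt=1.0,
             eta_charge=0.9, eta_discharge=0.9,
             capacity_kwh=12.0, p_max=5.0):
    """One simulation step of the linear storage model.

    soc          -- stored energy at time t (kWh)
    p_charge     -- charging power applied during the step (kW)
    p_discharge  -- discharging power applied during the step (kW)
    """
    # Charging/discharging power is limited by the maximum rate.
    p_charge = min(p_charge, p_max)
    p_discharge = min(p_discharge, p_max)
    # Linear dynamics: losses are applied on the way in and on the way out.
    soc_next = soc + dt * (eta_charge * p_charge - p_discharge / eta_discharge)
    # The state of charge stays within the physical capacity.
    return max(0.0, min(soc_next, capacity_kwh))

print(soc_step(6.0, 2.0, 0.0))  # charging: 6.0 + 0.9 * 2.0
```

Because the time step is long (1 h), the power applied during the step maps directly to energy, which is what makes the purely linear update defensible here.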

Steerable Generator Model
Steerable generation covers any type of diesel- or biomass-based generation that can be dispatched at any time-step t. The fuel curve can be used to determine the fuel amount that the steerable generator consumes to produce electricity. It is assumed that the fuel curve is a straight line, so the generator's fuel consumption in units/h is given by

F = F_1 Y_gen + F_0 P_gen, (2)

where Y_gen is the rated capacity (kW) and P_gen is the electrical output (kW) of the generator. The generator fuel intercept coefficient F_1 gives the no-load fuel consumption of the generator divided by its rated capacity. The marginal fuel consumption of the generator is determined by the generator fuel curve slope, F_0, and can be expressed in units of fuel per hour per kW of output, or equivalently, units of fuel per kWh.
The generator's electrical efficiency can be defined as the ratio of the electrical energy coming out to the chemical energy of the fuel going in:

η_gen = 3.6 P_gen / (ṁ_fuel LHV_fuel),

where ṁ_fuel = ρ_fuel (F/1000) is the fuel mass flow rate (kg/h), ρ_fuel is the fuel density, and LHV_fuel is the lower heating value of the fuel (MJ/kg). A generator can also operate in dual-fuel mode (diesel fuel and producer gas). In each time step, the MDP-based environment simulator calculates the required output of the generator and the corresponding mass flow rates of diesel fuel and producer gas. The system in dual-fuel mode always attempts to maximize the use of producer gas and minimize the use of diesel fuel.
The fuel curve of a generator defines the fuel consumption of the generator in pure diesel mode; therefore, the fuel consumption in pure diesel mode is given directly by Equation (2). If the actual value of the producer gas flow rate ṁ_gas is known, the diesel fuel flow rate at any time step can be calculated as

ṁ_diesel = m_0 ρ_diesel (F/1000), (5)

where m_0 is the diesel fraction, i.e., the ratio of the diesel fuel used by the generator in dual-fuel mode to that required to produce the same output power in pure diesel mode.
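The bookkeeping above can be sketched as follows. This is an illustrative sketch only: the coefficient values (F_1, F_0, diesel fraction, diesel density) and the function names are assumptions, not the paper's parameters.

```python
# Illustrative sketch of the fuel curve and the dual-fuel diesel flow.
# All coefficient values are assumed for demonstration purposes.

def fuel_curve(p_out, capacity=10.0, f1=0.08, f0=0.25):
    """Fuel consumption in pure diesel mode (L/h) for output p_out (kW).

    f1 -- intercept coefficient (no-load consumption per kW of rated capacity)
    f0 -- fuel curve slope (L per kWh of output)
    """
    return f1 * capacity + f0 * p_out

def dual_fuel_diesel_flow(p_out, m0=0.2, rho_diesel=832.0):
    """Diesel mass flow (kg/h) in dual-fuel mode.

    m0 is the diesel fraction: the share of the pure-diesel fuel use that is
    still burned as pilot fuel; the rest of the energy comes from producer gas.
    """
    f_diesel_only = fuel_curve(p_out)                    # L/h, pure diesel mode
    m_diesel_only = rho_diesel * f_diesel_only / 1000.0  # kg/h, pure diesel mode
    return m0 * m_diesel_only
```

With these assumed numbers, a 10 kW output consumes 3.3 L/h in pure diesel mode, of which only the m0 fraction remains as diesel in dual-fuel operation.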

Stochastic Optimization Formulation
Due to the stochastic nature of hybrid distributed generation, the dynamic dispatch of the microgrid is essentially a stochastic optimization problem; usually, the goal is to minimize the operational cost. The optimization-based controller (agent) serves as a baseline for comparison with our proposed methods. This controller receives as input all the available parameters and solves an optimization problem in receding horizon. The objective function to minimize aggregates the curtailment, shedding, and fuel costs (the π parameters denote unit costs) and is taken from [65]:

min Σ_{t∈T} [ Σ_{g∈G} (π_g^fuel F_g,t + π_g^curt P_g,t^curt) + Σ_{d∈D} π_d^shed P_d,t^shed ], (7)

where P_g,t^curt and π_g^curt are the generation curtailment and the curtailment price, respectively; P_d,t^shed and π_d^shed are the load shedding and the shedding price, respectively; and π_g^fuel is the fuel price. Among the constraints of the stochastic optimization model, an energy balance equation of the following form is suggested:

Σ_{g∈G} (P_g,t − P_g,t^curt) + Σ_{b∈B} (P_b,t^discharge − P_b,t^charge) = Σ_{d∈D} (C_d,t − P_d,t^shed),

where P_d,t^shed and C_d,t are the shedding power and the non-flexible demand, respectively.
In addition, binary variables k_g,t are added to the optimization model to specify the minimum operating point of the steerable generators:

k_g,t P_g^min ≤ P_g,t ≤ k_g,t P_g^max, k_g,t ∈ {0, 1}, ∀ t ∈ T, ∀ g ∈ G.

The law of transition of the state of charge s of each battery b is modeled as presented in [57]. Thus, this mathematical problem is in general a MILP.
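A toy, deterministic version of this dispatch problem can be solved by brute force over the binary commitments k_g,t, which makes the role of the minimum operating point and of the shedding/curtailment penalties concrete. This stands in for a MILP solver on a problem this small; all numbers and names are illustrative assumptions.

```python
# Toy deterministic dispatch over binary commitments k_{g,t}, one generator.
# Brute force stands in for a MILP solver; numbers are illustrative.
from itertools import product

P_MIN, P_MAX = 2.0, 10.0          # generator operating range (kW)
PI_FUEL, PI_SHED, PI_CURT = 1.0, 100.0, 10.5  # unit costs (euro per kWh)
net_load = [1.0, 6.0, 12.0]       # demand minus renewables for each hour

def dispatch_cost(commitments):
    cost = 0.0
    for k, load in zip(commitments, net_load):
        # Enforce k * P_MIN <= p <= k * P_MAX (the minimum stable point).
        p = min(max(load, k * P_MIN), k * P_MAX)
        shed = max(0.0, load - p)   # unserved demand is shed
        curt = max(0.0, p - load)   # forced excess generation is curtailed
        cost += PI_FUEL * p + PI_SHED * shed + PI_CURT * curt
    return cost

best = min(product([0, 1], repeat=len(net_load)), key=dispatch_cost)
print(best)
```

Even at the 1 kW hour it is cheaper to run the generator at its 2 kW minimum and curtail the surplus than to shed load, which is exactly the trade-off the binary variables encode.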

Problem Statement
RL solves the problem of sequential optimal decision making [69]. The mathematical model of this problem is the MDP. RL is a promising branch of machine learning in which an agent learns by interacting with an environment, for example, a microgrid. In simple words, RL tries to find a set of actions (a policy) that is the most beneficial for the agent.
Centralized microgrid control strategies can be separated into the following four tasks: estimation of the parameters of microgrid devices, forecasting of consumption and generation from renewable energy sources, operational planning to predict the impact of weather and human activities, and real-time control to adapt the planned solutions to the current moment. RL methods use microgrid simulation data (or simulated data before the microgrid is actually deployed) to learn management strategies; therefore, they effectively combine the four steps described above. Theoretically, they can adapt to certain types of changes without the need for manual tuning.
This paper proposes a simulation framework where the RL agent only has access to the current non-steerable generation and non-flexible consumption in the microgrid. It also has access to the state of charge of the different storages, and it must decide how to use the storage systems. The steerable generation compensates to establish the equilibrium. In case there is an excess of non-steerable generation and no more room for storage, the non-steerable generation is "curtailed", i.e., lost. At each time-step t, the state variable s_t contains all the relevant information for the optimization of the system: the non-steerable generation, the non-flexible consumption, and the state of charge of each storage device b ∈ B. The control a_t applied at each time-step t contains the charging/discharging decisions P_b,t^charge, P_b,t^discharge for the storage systems, ∀ b ∈ B, and the generation level P_g,t of the steerable generators, ∀ g ∈ G. At each time-step t, the system performs transitions based on the dynamics described above according to s_t+1 = f(s_t, a_t, w_t), where w_t is the realization of the random processes. Each transition generates a cost according to the cost function c(s_t, a_t) = (c_fuel + c_curt + c_shed) ∈ R. Figure 1 shows the main RL-based approach for energy microgrids optimal management.
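One transition s_t+1 = f(s_t, a_t, w_t) with the cost c = c_fuel + c_curt + c_shed can be sketched as a minimal step function. The balance rule, names, and parameter values below are illustrative assumptions and deliberately simpler than the authors' simulator (one battery, one generator).

```python
# Minimal sketch of one environment transition and its cost.
# Names and parameter values are assumptions for illustration.

def env_step(soc, pv, load, a_storage, eta=0.9, cap=24.0, gen_max=10.0,
             pi_fuel=1.0, pi_curt=10.5, pi_shed=100.0):
    """a_storage > 0 charges the battery, a_storage < 0 discharges it."""
    # Storage acts first: its flows change the residual the generator covers.
    if a_storage >= 0:
        charge = min(a_storage, (cap - soc) / eta)
        soc_next, flow = soc + eta * charge, -charge
    else:
        discharge = min(-a_storage, soc * eta)
        soc_next, flow = soc - discharge / eta, discharge
    residual = load - pv - flow          # what steerable generation must supply
    gen = min(max(residual, 0.0), gen_max)
    shed = max(0.0, residual - gen)      # unmet demand
    curt = max(0.0, -residual)           # surplus that cannot be absorbed
    cost = pi_fuel * gen + pi_curt * curt + pi_shed * shed
    return soc_next, cost
```

With an idle battery, a 2 kW residual is simply served by the generator at fuel cost; with a full PV surplus and no charging headroom, the same function prices the curtailment instead.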
The total discounted cost for the microgrid associated to a policy π ∈ Π is given by

J^π = E[ Σ_{t=0}^{T} γ^t c(s_t, π(s_t)) ].

An optimal policy π* is a policy that, for any initial state s_0, yields the actions that minimize the total discounted cost:

π* ∈ argmin_{π∈Π} J^π.

Most RL algorithms include the evaluation of a quality function that says how "useful" or "valuable" the current state (V-function) or state-action pair (Q-function) is. Both functions return the mathematical expectation of the γ-discounted sum of rewards until the end of the simulation under a specific policy π. Additionally, the state-action value function Q(s_t, a_t) associated to an optimal policy π*, used to characterize the quality of taking action a_t at state s_t and then acting optimally, is defined as

Q(s_t, a_t) = r(s_t, a_t) + γ E[ max_{a'} Q(s_{t+1}, a') ],

where r(s_t, a_t) ∈ R is the reward function, which defines each transition and generates an operational revenue r_t for each individual scenario of the network configuration. The optimal action at each time-step t can then be obtained from the optimal Q-values as

a_t* ∈ argmax_{a∈A} Q(s_t, a).

Reinforcement Learning Agents
The key idea of this article was to study advanced RL models for the optimal control of an off-grid PV-diesel-biomass microgrid. It was decided to consider RL algorithms that in recent years have shown so-called superhuman efficiency (i.e., they solved complex problems better than an expert in the subject field): DQN agents, the leader in Atari games; proximal policy optimization (PPO) agents, which defeated the best players in Dota 2; and Monte Carlo tree search (MCTS), which became the basis of the AlphaGo system. The results of optimizing the microgrid regime are compared with the results of the reference, classical MILP algorithm. The information available to the RL agent at each time-step is composed of the consumption, the state of charge, the number of cycles and the capacity of each storage device, the renewable production, and its capacity. It is assumed that the RL agent has control of the storage devices. However, the original action space is continuous and of high dimensionality; high-level actions are therefore used in the decision-making process and are then mapped into the original action space. The instantaneous reward is defined as the negative total cost of operation of the microgrid according to Equation (7) and is composed of:

1. the fuel costs for the generation,
2. the curtailment cost for the excess of generation that had to be curtailed, and
3. the load shedding cost for the excess of load that had to be shed in order to maintain balance in the microgrid.

MILP-Based Optimizer
This optimizer solves a linear program that minimizes the cost of its actions. The output actions are continuous, giving the exact charge/discharge level of each storage and the exact generation of the steerable generators. In the presented study, the authors used an optimization model based on MILP as a reference for comparing the effectiveness of the RL models. MILP-based optimization formulations, however, suffer from important drawbacks. Most importantly, they are restricted in terms of the number of integer or binary variables that can be practically included and are difficult to parallelize efficiently. This limits the possibilities for optimizing the planning and control of large-scale microgrids (e.g., larger than 30-100 buildings [62]) and power systems. Compared with MILP, RL generates near-optimal solutions on par with conventional operations research approaches, but does so significantly faster online (because the RL agent has already found the optimal policy offline). The statement of the MILP problem for optimizing microgrid management is described in detail in Section 2.

Deep Q-Network Agent
The main idea is to employ deep neural networks to represent the so-called DQN and to train this network to predict the total reward [70,71]. The approach is based on the Q-learning algorithm, which implements an iterative approximation of the Q-function through training on temporal differences, where the mean square error between the predictor and the target is minimized at each step, see Equation (11). When the number of states is large, storing a lookup table with all possible values of state-action pairs is impractical. In [72], a general solution to this problem was proposed using a parameterized approximation function with parameters Θ, so that Q(s, a) ≈ Q(s, a; Θ); a deep neural network is used as the approximator. The parameters Θ_t can be updated using stochastic gradient descent by sampling batches of transitions, i.e., quadruples (s_t, a_t, r_t, s_t+1), according to

Θ_t+1 = Θ_t + α (r_t + γ max_{a'} Q(s_t+1, a'; Θ_t) − Q(s_t, a_t; Θ_t)) ∇_Θ Q(s_t, a_t; Θ_t),

where α is a scalar step size called the learning rate.
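The temporal-difference update that DQN approximates with a neural network is easiest to see in its tabular form. The sketch below is tabular Q-learning, not a deep network; states, actions, and hyperparameter values are illustrative assumptions.

```python
# Tabular form of the Q-learning update that DQN approximates with a network.
# States, actions, and hyperparameters are illustrative.
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> value, default 0.0
GAMMA, ALPHA = 0.99, 0.1
ACTIONS = ("charge", "discharge", "idle")

def q_update(s, a, r, s_next):
    """One temporal-difference step on the transition (s, a, r, s')."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# A costly transition (reward -1, the negative operating cost) pulls the
# value of that state-action pair toward the target by a step of size ALPHA.
q_update("low_soc", "charge", -1.0, "mid_soc")
print(round(Q[("low_soc", "charge")], 3))
```

DQN replaces the table `Q` with a parameterized function Q(s, a; Θ) and turns the same target into the gradient update shown above.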

Monte-Carlo Tree Search Agent
MCTS is a policy-optimization algorithm for finite-horizon, finite-size MDP, based on random episode sampling structured by a decision tree, where each node in the tree represents a complete state of the domain and each link represents one possible valid action, leading to a child node representing the resulting state after taking an action. The statement of the problem in MCTS is based on game theory. It had a strong influence on programs for playing Go, although it finds its application in other games. Monte Carlo methods work by approximating future rewards that can be achieved through random samplings [73].
MCTS proceeds in four phases: selection, expansion, rollout, and back-propagation. The standard MCTS algorithm proceeds by repeatedly adding one node at a time to the current tree. Given that leaf nodes are likely to be far from terminal states, it uses random actions (rollouts) to estimate state-action values. After the rollout phase, the total reward collected during the episode is back-propagated through the tree branch, updating the empirical state-action values and visit counts of its nodes. Choosing which child node to expand (i.e., choosing an action) becomes an exploration/exploitation problem given the empirical estimates. Upper confidence bounds (UCB) is an optimization algorithm that is used for such settings with provable guarantees [74]. Each parent node chooses the child with the largest UCB(s_t, a_t) value according to the following formula:

UCB(s_t, a_t) = Q(s_t, a_t) + c sqrt(ln N_p / N_i),

where N_i is the visit count of the ith child and N_p is the visit count of the parent node. The parameter c ≥ 0 controls the tradeoff between choosing lucrative nodes (low c) and exploring nodes with low visit counts (high c); it is often set empirically. The high efficiency of MCTS stems from the fact that the decision tree grows asymmetrically: more "interesting" nodes are visited more often, less "interesting" nodes less often, and it becomes possible to evaluate a single node without expanding the entire tree. If the task of managing a microgrid is formulated as a partially observable MDP, a simulator of its operation (environment) can be developed in which all possible states are formed as a tree structure and traversed by the MCTS agent.
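The UCB selection rule above can be sketched directly. This is a minimal illustration of the selection phase only; the data layout and the value of c are assumptions.

```python
# UCB selection rule from the formula above; a minimal sketch.
import math

def ucb(q_value, n_parent, n_child, c=1.4):
    """Upper confidence bound for one child of an MCTS node."""
    if n_child == 0:
        return float("inf")      # unvisited children are always tried first
    return q_value + c * math.sqrt(math.log(n_parent) / n_child)

def select_child(children):
    """children: list of (action, q_value, visit_count) for one parent node."""
    n_parent = sum(n for _, _, n in children)
    return max(children, key=lambda ch: ucb(ch[1], n_parent, ch[2]))[0]

# A rarely visited child wins even with a lower empirical value,
# which is the exploration behavior the c term buys.
print(select_child([("charge", 0.5, 10), ("discharge", 0.4, 1), ("idle", 0.3, 9)]))
```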

Proximal Policy Optimization Agent
The PPO agent tries to compute an update at each step that minimizes the cost function while ensuring that the deviation from the previous policy is relatively small. PPO belongs to the family of policy gradient methods, which use several epochs of stochastic gradient ascent to perform each policy update [75]. In this method [76], a parametrized stochastic policy function π(a_t|s_t; θ) with parameters θ is directly optimized towards the objective defined in Equation (10). After the collection of N full trajectories τ = (s_0,i, a_0,i, c_0,i, s_1,i, . . . , s_T,i), a gradient step is performed to update the parameters θ using the clipped objective J^clip proposed in [76]:

J^clip = Ê_t[ min(r_t(θ) Â_t^π, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^π) ], (16)

where Ê_t denotes the empirical expectation over time steps, Â_t^π is the estimated advantage at time t, r_t(θ) is the probability ratio between the new and the old policies, and ε is a hyperparameter, usually 0.1 or 0.2. The optimal policy is derived by performing multiple steps of stochastic gradient descent on this objective. While standard policy gradient methods perform one gradient update per data sample, the PPO algorithm enables multiple epochs of minibatch updates, resulting in better sample efficiency.
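The clipping in Equation (16) is a scalar operation per sample, which makes it easy to illustrate in isolation. The sketch below evaluates the per-sample surrogate only; real implementations average it over minibatches of trajectories, and the names are assumptions.

```python
# Scalar illustration of PPO's clipped surrogate for one sample.
# Real implementations average this over minibatches; names are assumed.

def clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# The clip removes the incentive to push the probability ratio far past
# 1 + eps (good actions) or below 1 - eps (bad actions):
print(clipped_objective(1.5, 1.0))   # positive advantage, capped at 1.2
print(clipped_objective(0.5, -1.0))  # negative advantage, capped at -0.8
```

Because the surrogate is flat outside the [1 − ε, 1 + ε] band, gradient steps cannot move the new policy arbitrarily far from the old one, which is the "small deviation" property described above.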

Results
The evaluation of the proposed methodology was performed using empirical data measured on an off-grid microgrid system composed of 10 kW of PV panels, two battery storages totaling 24 kWh, and a 10 kW generator. The microgrid configuration contained three loads (each one being 10 kW), a PV module, a steerable generator (a biomass gasifier with an internal combustion engine operating in producer-gas-only and dual-fuel mode), as well as storage devices (Figure 2). Additionally, the costs for curtailment and load shedding were defined. Time series from the two-year historical parameter dataset (frequency of 1 h) are used to simulate the three loads and the PV module. The storage devices have slightly different characteristics, namely different charging/discharging efficiencies. The parameters used for this specific microgrid configuration are given in Table 1. In the case of the co-fired generator, the capacity is selected as 10 kW; for the case without PV, 20 kW.

The optimization agent system is intended to be multi-objective: it has to minimize the operation cost while ensuring reliability by maximizing the service level (served demand). The case of an off-grid system is considered under the assumption that imports are equivalent to load shedding (π_d^shed = 100 euro/kW) and exports are equivalent to production curtailment (π_g^curt = 10.5 euro/kW).
The technical limits of the generator, i.e., the maximum (capacity) and the minimum stable (percentage of the capacity) operating points, are also specified. The operating points of the steerable generators from experimental studies are used to obtain their fuel curves. The two fuel curve inputs are the intercept coefficient and the slope according to Equation (2). For example, according to the practical studies [77], biomass consumption increased with an increase in load, whereas specific biomass consumption decreased with an increase in load. The following operating points were selected: biomass consumption of 13.2 and 15 kg/h at 3.0 and 10.0 kW load, respectively.
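The two fuel curve inputs can be derived from the two operating points quoted above (13.2 kg/h at 3.0 kW and 15 kg/h at 10.0 kW, with a 10 kW rated capacity). The function name is an assumption; the arithmetic follows the fuel curve definition of Equation (2).

```python
# Deriving the fuel curve slope and intercept coefficient from the two
# operating points given in the text (10 kW rated capacity assumed).

def fuel_curve_inputs(p1, f1, p2, f2, capacity):
    """Return (intercept_coefficient, slope) of the linear fuel curve."""
    slope = (f2 - f1) / (p2 - p1)     # marginal consumption, kg/kWh
    no_load = f1 - slope * p1         # consumption extrapolated to zero output
    return no_load / capacity, slope  # intercept = no-load per kW of capacity

intercept, slope = fuel_curve_inputs(3.0, 13.2, 10.0, 15.0, 10.0)
print(round(slope, 3), round(intercept, 3))  # about 0.257 kg/kWh and 1.243
```

This also shows the effect mentioned in [77]: the slope (0.257 kg/kWh) is far below the specific consumption at low load (13.2/3.0 = 4.4 kg/kWh), so specific consumption falls as load rises.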

Microgrid Simulator Description
To carry out the calculations, an open-source simulator of microgrid operation developed in Python [78] was used and modified by the authors. This simulator was implemented as a training environment for the optimization of the RL agents (DQN, MCTS, and PPO), which were implemented using the TensorFlow and OpenAI Gym libraries [79]. The MILP model was implemented using the Gurobi Optimizer.
The optimization agent has control of the storage devices. The actions available at each decision step are the charging (C), discharging (D), and idling (I) of each storage device in the microgrid. These actions are then converted into implementable actions automatically following a rule-based strategy:

1. If the total possible production (i.e., PV production, active steerable generator capacity, and the storages' maximum discharge rate) is lower than the total consumption, a steerable generator is activated at its minimum stable generation. This instruction is repeated until the total load can be served or until all steerable generators are active; in short, the generators are activated one by one at their minimum stable generation until the total load can be served. Given the lower flexibility of the gasifier biomass generator compared to the diesel generator, it is assumed that the biomass generator does not turn off completely but continues to operate in idle mode. For the co-fired generator, the possibility of autonomous start-up on diesel fuel remains, to ensure ignition of the gasifier biomass generator [80][81][82].

2. Once all active steerable generators are known, the net generation can be calculated from their minimum stable generation, the PV production, and the total consumption.

3. If the net generation is positive, the storages with a charge instruction absorb the excess energy until the net generation becomes zero. The storages with discharge or idle instructions do nothing. The remaining excess of energy is curtailed.

4. If the net generation is negative, the storages with a discharge instruction cover the deficit until the net generation becomes zero. The storages with charge or idle instructions do nothing. The remaining deficit is then compensated by the active steerable generators, which can be adjusted to a higher production level than their minimum stable power. If, in addition, the steerable generators cannot handle the remaining deficit, it is considered as lost load.
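The rule-based mapping above can be sketched compactly. This is a deliberately simplified single-generator version of steps 1-4; the function name, data structures, and all numeric limits are illustrative assumptions, and battery capacity limits are omitted for brevity.

```python
# Compact sketch of the rule-based mapping from high-level storage
# instructions ("C", "D", "I") to implementable actions (steps 1-4 above).
# Single generator, no battery capacity limit; all values are illustrative.

def apply_rules(pv, load, instructions, socs,
                gen_min=2.0, gen_max=10.0, batt_rate=5.0):
    # Step 1: activate the generator at minimum stable output if needed.
    possible = pv + gen_max + sum(min(s, batt_rate) for s in socs)
    gen = gen_min if possible >= load and load > pv else 0.0
    # Step 2: net generation given the minimum stable output.
    net = pv + gen - load
    # Steps 3-4: storages follow their instructions to absorb or cover `net`.
    for i, instr in enumerate(instructions):
        if net > 0 and instr == "C":
            dv = min(net, batt_rate); socs[i] += dv; net -= dv
        elif net < 0 and instr == "D":
            dv = min(-net, batt_rate, socs[i]); socs[i] -= dv; net += dv
    # A remaining deficit is covered by ramping the generator up; what is
    # left after that is shed, and a remaining surplus is curtailed.
    if net < 0:
        extra = min(-net, gen_max - gen); gen += extra; net += extra
    return gen, max(net, 0.0), max(-net, 0.0)   # (generation, curtailed, shed)
```

For example, with 3 kW of PV, an 8 kW load, and one battery instructed to discharge, the generator runs at its 2 kW minimum and the battery covers the remaining 3 kW, so nothing is curtailed or shed.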
The following protocol was carried out for the training and evaluation of the proposed RL-based algorithms and MILP. The policies were trained on the first three months (December-February) and tested on one week of the fourth month (March). The performance of the algorithms was compared against the MILP benchmark described in Section 2: a MILP optimization controller with perfect knowledge, 12 periods of look-ahead, and additional noise around the exact values of the stochastic variables. This gives a good approximation of the lower bound of the control problem.

Analysis of Different Microgrid Configuration Efficiency
In addition to evaluating the effectiveness of state-of-the-art optimization models for microgrid management, the main goal of this paper was a comparative study of the use of various types of steerable generators running on diesel fuel and wood biomass from the point of view of minimizing the operational costs of the microgrid, according to Equation (7). The following microgrid configurations were examined:
Case 4 considers a realistic situation for some regions of Siberia (Russia), where the installation of PV generation is not profitable in remote villages and the use of diesel-fuel generators incurs increased costs (Figure 3). Therefore, this case includes only a co-fired generator as the main energy source for the microgrid, operating in conjunction with two storage devices, which make it possible to accumulate electricity for possible interruptions in the operation of the main generation (a temporary lack of biofuel, a possible generator breakdown, etc.). For Case 4, the power of the co-fired generator is assumed to be 20 kW. In all cases, the gasifier biomass generator and the co-fired generator used pellets as biofuel.
The results of the described protocol are presented in Table 2, which shows the total cost of each strategy for each testing period so that a comparison can be drawn. As can be seen from the table, the policies of the MCTS algorithm are the closest to the reference MILP solution for all considered cases of microgrid configuration.

Table 2. Total cost of obtained optimal policies, π*, for the compared optimization agents.

It is clearly seen that the use of a gasifier biomass generator (Case 1) and a co-fired generator (Cases 3, 4) can reduce operational costs compared to using a diesel generator as the steerable generator in the microgrid. This is clearly shown in Figures 4 and 5, which plot the total costs (including accumulated ones) as well as the dynamics of the generation and consumption components of the microgrid over the one-week testing period. The best option was obtained for the microgrid configuration containing a solar station and a gasifier biomass generator (Case 2).
Figure 4. Total costs (left) and generation/load mix (right) of different microgrids' configurations for optimal policies, π*, obtained using Monte-Carlo tree search (MCTS) for the one-week testing period. (The load mix shown here is not the entire total load of the microgrid, but an illustration of which components of electricity consumption (load, battery, or curtailment) the generated power was used for to ensure balance.)

Figure 5.
Total costs (left) and generation/load mix (right) of different microgrids with co-fired generators for optimal policies, π*, obtained using MCTS for the one-week testing period.
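The observation that stored energy may go unrealized (Figure 5b) follows directly from the storage dynamics: energy charged near the end of the horizon, or charged with round-trip losses, cannot always be delivered back to the load. A minimal battery model sketches this; the capacity and efficiency values are illustrative assumptions, not the paper's parameters.

```python
# Minimal battery storage model: charging and discharging both incur an
# efficiency loss, so part of the stored energy is never realized as
# delivered load. Numbers are toy assumptions for illustration.

class Battery:
    def __init__(self, capacity_kwh, eff=0.9):
        self.capacity = capacity_kwh
        self.eff = eff      # per-direction efficiency (charge and discharge)
        self.soc = 0.0      # state of charge, kWh

    def charge(self, energy_kwh):
        """Store surplus energy; return the input energy actually consumed."""
        absorbed = min(energy_kwh * self.eff, self.capacity - self.soc)
        self.soc += absorbed
        return absorbed / self.eff

    def discharge(self, demand_kwh):
        """Cover demand from storage; return the energy actually delivered."""
        delivered = min(demand_kwh, self.soc * self.eff)
        self.soc -= delivered / self.eff
        return delivered

b = Battery(10.0)
b.charge(5.0)                        # 5 kWh surplus goes in
print(round(b.discharge(10.0), 2))   # → 4.05  (losses in both directions)
```

With a 90% efficiency in each direction, only about 81% of charged surplus returns as served load, which is why an energy management system without sufficient discharge opportunities leaves value stranded in the storage devices.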

Comparative Study of RL-Based Models

It is observed that in all cases the MCTS policy performed very close to the MILP-based optimization controller (Table 2). This is likely because the MCTS algorithm manages to anticipate periods of high energy curtailment or load shedding and to utilize the storage device accordingly. In addition, a fairly good policy, along with MCTS, is provided by the PPO algorithm (Figure 6). The MCTS policy also gives good results for Case 4, where the optimization of energy storage is not always obvious due to the lack of RES. It is clearly seen that the PPO and DQN algorithms fail to find adequate policies for this case; the high costs are, in fact, associated with large volumes of curtailed and lost energy in the storage devices (Figure 7). It is important to note that the search for the optimal policy, π*, during training is much faster for the PPO and DQN algorithms than for the MCTS algorithm.

Figure 7. Dynamics of the charge and discharge of batteries for Case 4 for the optimal policy, π*, obtained using the PPO algorithm for the one-week testing period.
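The anticipation effect attributed to MCTS above can be illustrated with a heavily simplified rollout-based planner: like MCTS, it evaluates each candidate dispatch action by simulating stochastic futures and prefers actions that avoid expensive load shedding later. This is a flat Monte-Carlo sketch, not the full UCT/MCTS algorithm used in the paper; the set-points, cost coefficients, and demand model are toy assumptions.

```python
# Flat Monte-Carlo lookahead for a toy dispatch MDP: estimate each action's
# expected cumulative cost by random rollouts, then pick the cheapest.
# This captures the "anticipation" idea behind MCTS in simplified form.
import random

ACTIONS = [0.0, 5.0, 10.0]   # candidate generator set-points, kW (assumed)

def simulate(action, horizon, rng):
    """Roll out one random future and return its cumulative cost."""
    cost = 0.0
    gen = action
    for _ in range(horizon):
        load = rng.uniform(4.0, 12.0)     # stochastic demand (toy model)
        shed = max(load - gen, 0.0)       # unserved load
        curtail = max(gen - load, 0.0)    # wasted generation
        cost += 0.1 * gen + 1.0 * shed + 0.05 * curtail
        gen = rng.choice(ACTIONS)         # random rollout policy
    return cost

def plan(horizon=24, n_rollouts=200, seed=0):
    """Pick the first action with the lowest mean rollout cost."""
    rng = random.Random(seed)
    means = {a: sum(simulate(a, horizon, rng) for _ in range(n_rollouts))
                / n_rollouts
             for a in ACTIONS}
    return min(means, key=means.get)

print(plan() in ACTIONS)  # → True
```

A full MCTS additionally builds a search tree and balances exploration and exploitation (e.g., via UCB), which is what makes it slower per decision than evaluating a trained PPO or DQN network, consistent with the training-time observation above.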

Discussion and Conclusions
This paper deals with the control and optimization problems for an isolated microgrid combining RES (solar energy and biomass gasification) with a diesel power plant. To address this problem, contemporary methods of stochastic online optimization based on reinforcement learning and linear programming were employed, with microgrid control formulated as a Markov decision process (MDP). The main advanced reinforcement learning methods DQN, PPO, and MCTS were examined, and the results were compared with the reference solution of the MILP model. The closest results to the reference strategy were demonstrated by the MCTS algorithm for all cases of microgrid configuration.
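For intuition about the MILP reference used in the comparison: it solves the dispatch problem with perfect foresight of the whole testing period. On a toy discrete horizon the same optimum can be found by exhaustive search, sketched below; a real implementation would use a MILP solver over the full horizon. All set-points, net loads, and cost coefficients here are illustrative assumptions.

```python
# Stand-in for the MILP reference on a toy problem: with a small discrete
# action set and short horizon, the perfect-foresight optimum can be found
# by brute force. Numbers are illustrative assumptions only.
from itertools import product

LEVELS = [0.0, 5.0, 10.0]       # allowed generator set-points, kW
net_load = [8.0, 3.0, 11.0]     # load minus PV over a 3-hour horizon, kW

def cost(plan):
    """Fuel + shedding + curtailment cost of a full dispatch plan."""
    total = 0.0
    for gen, nl in zip(plan, net_load):
        total += 0.1 * gen                    # fuel cost
        total += 1.0 * max(nl - gen, 0.0)     # load shedding penalty
        total += 0.05 * max(gen - nl, 0.0)    # curtailment penalty
    return total

best = min(product(LEVELS, repeat=len(net_load)), key=cost)
print(best, round(cost(best), 3))  # → (10.0, 5.0, 10.0) 3.7
```

Because the reference sees the entire period in advance, its cost is a lower bound that online controllers (DQN, PPO, MCTS), which observe the uncertainty step by step, can only approach.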
The optimization problem of minimizing the total cost of operating the microgrid, including the cost of fuel for controlled generators, power curtailment, and load shedding, was addressed. As a result, the most economic microgrid configuration was found; it uses biomass gasification with a gasifier/internal combustion engine system operating both in single-fuel mode (producer gas) and in dual-fuel mode (diesel fuel and producer gas). Their use in the microgrid is cheaper compared with diesel generators, which is evidently due to the lower cost of biomass (pine pellets in our case). It is to be noted that fuel delivery was not considered in our study. It should also be noted that a conventional biomass gasifier, burning only producer gas in an internal combustion engine, was somewhat more economical than the dual-fuel engine operation mode. However, the latter is more maneuverable owing to the possibility of starting and flexibly controlling the engine by varying the share of diesel fuel, which allows it to be used more efficiently (along with a conventional diesel generator) when the corresponding microgrid energy management system is operating.