Power Generation Optimization of the Combined Cycle Power-Plant System Comprising Turbo Expander Generator and Trigen in Conjunction with the Reinforcement Learning Technique

Abstract: In this paper, a method that utilizes the reinforcement learning (RL) technique is proposed to establish an optimal operation plan that obtains the maximum power output from a trigen generator. Trigen is a type of combined heat and power (CHP) system that provides chilling, heating, and power generation, and the turbo expander generator (TEG) is a generator that uses the decompression energy of gas to generate electricity. If the two are combined to form a power source, a power generation system with higher efficiency can be created. However, it is very difficult to control the heat and power generation of the TEG and trigen according to the flow rate of natural gas, which changes every moment. Accordingly, a method is proposed that utilizes the RL technique to determine the operation process that attains an even higher efficiency. When the TEG and trigen are configured using the RL technique, the power output can be maximized, and the power output variability can be reduced to obtain high-quality power. When the RL technique was used, it was confirmed that the overall efficiency improved by an additional 3%.


Introduction
Countries around the world are continuing their efforts to prevent environmental changes caused by global warming. South Korea is also implementing various policies to prevent global warming in line with this global trend. As part of this effort, research is being conducted on the improvement of energy efficiency [1]. Increasing energy efficiency reduces the amount of fossil fuel necessary to obtain an equivalent amount of energy, thereby reducing CO2 emissions. This is considered an effective method of slowing down global warming by limiting CO2, the main cause of global warming.
As part of this ongoing research, a trigen system has been developed that can provide chilling, heating, and power generation with the use of a gas engine [2]. This system operates a gas engine to provide chilling when it is hot, heating when it is cold, while simultaneously providing the electricity required by the user. In addition, this system improves the energy efficiency by recovering the heat generated from each process. The system consists of a built-in heat pump for chilling and heating, and a generator for generating power. Depending on the user's demands, thermal energy is obtained by connecting the gas engine to the heat pump, and electrical energy is obtained by connecting the gas engine to the generator.
In addition to improving efficiency, developing renewable resources constitutes another method of reducing CO2 emissions. The turbo expander generator (TEG) can be classified as a new energy source technology, as it generates power using the energy discarded during the gas decompression process in natural gas supply facilities [3][4][5]. The pressure difference created during the decompression process turns a turbine, and electricity is generated; the system converts pressure energy into electricity without consuming raw materials. Although this is not yet classified as renewable energy in Korea, it is an outstanding system for generating power without CO2 emissions, as it creates electrical energy from energy that would otherwise be discarded.
As the temperature of natural gas drops significantly during the decompression process, the temperature of the compressed gas needs to be somewhat increased before decompression, both to provide natural gas directly to households and to turn the turbine. In existing methods, the natural gas temperature is increased with a gas boiler. The TEG can reduce the energy loss by converting decompression energy into electricity, but there is no method to recover the heat energy needed to compensate for the temperature drop during decompression.
However, linking the aforementioned trigen system to this part will not only provide heat energy through the gas boiler, but will also improve the overall energy efficiency by recovering the heat energy used for chilling, heating, or generating additional power.
Research on the utilization of trigen is under way in a variety of directions. A number of studies have examined how to combine several combined heat and power (CHP) systems to supply heat and electricity to an entire system and distribute them [6][7][8][9][10][11][12][13][14]. Other studies have addressed methods of decentralizing energy, treating such systems as distributed power sources within power distribution systems to reduce the energy consumption of society as a whole [15][16][17][18][19][20][21][22]. Additionally, studies have been conducted on linking energy storage systems to these distribution systems to make the overall power generation process robust, and on utilizing them to configure microgrids [23][24][25][26][27][28][29]. Furthermore, studies on improving the overall efficiency of the power generation system by grouping these power sources and scheduling the power generation time have also been conducted [30][31][32][33][34][35].
Although various CHP-related studies have been conducted as described above, none has yet considered CHP in connection with the TEG. Moreover, no research has addressed a control plan to optimize heat energy and power generation when the TEG and trigen operate in conjunction.
In this study, methods are proposed to optimize electrical and thermal energy and to maximize the electrical power output by linking the TEG and trigen. Although numerous studies have been conducted to increase energy efficiency using trigen, studies on achieving high energy efficiency by linking it with the TEG are lacking. To obtain high energy efficiency with the TEG + trigen power generation system, a method is proposed to achieve the desired energy through the best selection process for each situation. The reinforcement learning (RL) technique is applied in this study to provide an optimal solution for maximizing efficiency [36]. In brief, with the RL technique, when an action is executed in each state, the environment provides a reward, whereby the action with more reward is preferred. Therefore, with the TEG + trigen power generation system, RL can be configured to achieve optimal energy efficiency by rewarding the actions that maximize energy production [37][38][39][40][41][42][43][44][45][46][47][48].
The remainder of this study is organized as follows. Section 2 describes the power generation system using trigen and TEG, and Section 3 describes the RL technique as an optimization algorithm for maximizing the power output of the proposed power generation system. Section 4 validates the practicality of this technique through a case study. Section 5 summarizes the study and explains future plans and limitations.

Trigeneration
As described earlier, trigeneration refers to the generation of chilling, heating, and electrical energy from one energy source. In this case, a gas engine is used to produce chilling, heating, and electricity through the trigen system. A schematic diagram of a trigen system that uses a gas engine is shown in Figure 1. The trigen system comprises three parts: a gas engine, a generator, and a heat pump. Herein, the heat pump is composed of a compressor, a four-way valve, a heat exchanger, an oil separator, a gas-liquid separator, and various valves and switches. The engine drive shaft, generator, and compressor are connected to each other. Thus, when the gas engine is fueled by natural gas during operation, the generator and compressor operate simultaneously, and a power cutoff device protects the system from overloading.
As shown in Table 1, the trigen is a heat pump chilling/heating device that generates electricity and chilling or heating simultaneously by operating a compressor and a generator through a gas engine fueled by liquefied petroleum gas or town gas. It has a 30 kW power output, a 56 kW chilling capacity, and a 67 kW heating capacity.

TEG
The conversion of the flow of high-pressure gas through an expander into rotational movement is utilized extensively in industrial areas. The most typical example is a cryogenic process used to acquire cold energy. As the temperature drops drastically during the isentropic process, wherein high pressure is converted to work, the cryogenic process is conventionally used for air liquefaction and separation (ASU), where air is liquefied to separate nitrogen and oxygen. It is also used in the naphtha cracking center (NCC) process (ethylene fabrication, wherein methane is liquefied) and the liquefied natural gas (LNG) process (which liquefies natural gas). Conversely, a TEG system does not consume this cold energy; instead, it recovers it in the form of electricity by connecting the rotational movement of the expanding high-pressure gas to the generator.
While the high-pressure gas radially moves in and passes a turbine, the turbine rotates and performs work. In this process, the high-pressure gas is decompressed and exhausted in the axial direction. The variable geometry nozzle (VGN) at the turbine inlet controls the inflow of high-pressure gas by adjusting its angle, thereby controlling the pressure at the outlet of the TEG. In other words, the TEG controls the pressure and produces electricity at the same time.
In the conventional decompression process, each natural gas station forces natural gas through a pressure control valve (PCV) from P1 (high pressure) to P2 (low pressure). This is an isenthalpic throttling process, that is, a horizontal movement in the h-s diagram. Conversely, the flow passing through a TEG undergoes a vertical, isentropic process. Consequently, a fluid decompressed through the PCV from P1 to P2 reaches a different temperature from one decompressed through the TEG. In the isenthalpic process, the temperature at the outlet of the PCV decreases owing to the Joule-Thomson effect; in the isentropic process of the TEG, the temperature decrease is larger than that caused by the Joule-Thomson effect. When natural gas is decompressed by 10 bar, depending on the composition and state of the gas, the isenthalpic throttling of the PCV produces a temperature drop of 4.5 to 6 °C, while the isentropic decompression of the TEG produces a drop of 15 to 20 °C.
The TEG replaces one of the pressure regulators that was already installed; thus, the TEG is connected in parallel with the remaining pressure regulators. The system is configured so that natural gas flows first into the TEG and then into the pressure regulators once the base load of the TEG is exceeded.
At the base-load flow, the TEG produces electrical energy and conducts voltage control, while the remaining natural gas that cannot be processed by the TEG flows to the PCV connected in parallel with the TEG, which controls the gas pressure. In addition, if the TEG fails, its shutoff valve is closed, and the parallel pressure regulator continues to operate in a stable manner.
Similar to pressure regulators, a TEG must deliver natural gas to customers at a temperature of 0 °C. For this reason, a heater was installed at the inlet of the TEG, and the quantity of heat used for the preheating operation was calculated from the gas flow rate and the required temperature rise. In this study, it was assumed that a 75 kW TEG was installed in accordance with the heating capacity of the trigen [3].
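The exact preheating formula appears to have been lost in extraction. As an assumption, a conventional heat-duty relation for the inlet heater, which is consistent with the statement that the required heat follows the gas flow rate and temperature rise, takes the following form:

```latex
% Hypothetical reconstruction of the preheater duty (not the paper's exact Equation (1)):
% \dot{m}: natural-gas mass flow rate, c_p: specific heat at constant pressure,
% T_{out} - T_{in}: temperature rise required before decompression.
Q_{\mathrm{heat}} = \dot{m} \, c_p \left( T_{\mathrm{out}} - T_{\mathrm{in}} \right)
```

Under this relation, the heat demand scales linearly with the flow rate entering the TEG, which matches the coupling between flow rate and required heat described in the case study.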

Energy Optimization Method
This section describes the method used to optimize the energy generated by the trigen and TEG introduced in Section 2. The goal is to maximize the combined electrical and heat energy obtained from the two systems. For this purpose, the RL technique and the deep Q-network technique are introduced.

Reinforcement Learning
RL is an area of machine learning concerned with how software agents ought to take actions in an environment to maximize the notion of cumulative reward. RL differs from supervised learning in that it does not require the presentation of labeled input/output pairs nor that suboptimal actions are explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) [36]. The environment is typically stated in the form of a Markov decision process (MDP) because many RL algorithms utilize dynamic programming techniques when used in this context [37]. The main difference between the classical dynamic programming methods and RL algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP, and they target large MDPs where exact methods become infeasible.
In a typical RL scenario, an agent executes an action in an environment, and the environment returns a reward and a representation of the resulting state, which are fed back to the agent. This is illustrated in Figure 2, wherein the agent is the learner and decision maker. The environment is the entity with which the agent interacts, comprising everything apart from (and outside of) the agent. The action involves all the possible moves that the agent can make. The state is the current situation returned by the environment. The reward is an immediate return value from the environment used to evaluate the agent's last action. At every time step t, the agent receives state s_t, executes action a_t, and receives scalar reward r_t; the environment receives action a_t and returns state s_{t+1} and reward r_{t+1}. Each action influences the agent's future state. Success is measured by a scalar reward signal, and RL selects actions to maximize future rewards. The agent employs a strategy, or policy (π), to determine the next action based on the current state. A policy, expressed as π(s, a), describes a probable action path; specifically, it is a function that takes a state and an action and returns the probability of taking that action in that state:
π : A × S → [0, 1], π(a, s) = P(a_t = a | s_t = s), where P is the probability of taking action a in state s. The value function V^π(s) is defined as the expected return starting from state s, that is, s_0 = s, and successively following policy π. Hence, the value function estimates how good it is to be in a given state:
V^π(s) = E[R | s_0 = s] = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ], where E is the expected (future) cumulative reward, R is the sum of future discounted rewards, r_t is the immediate reward, and γ is the discount factor (smaller than unity). As a particular state recedes into the past, its effect on later states becomes progressively smaller and is thus discounted. Although the range of the immediate reward is not specified, a positive value means that the action is recommended, a negative value means that it is not, and the absolute size indicates how strongly the action is recommended or discouraged. The Q-value function Q^π(s, a) at state s executing action a is the expected cumulative reward from taking action a in state s; it estimates how good the state-action pair is.
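The discounted sum behind these definitions can be sketched in a few lines. The function name and the example trajectory below are illustrative, not from the paper; the symbols follow the text (r_t is the immediate reward, γ < 1 the discount factor).

```python
# Sketch of the sum of future discounted rewards R = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# The value function V(s) is the expectation of this quantity over trajectories from s.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r  # rewards further in the future count less
    return total

# With gamma = 0.5, three unit rewards contribute 1 + 0.5 + 0.25:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Because γ < 1, the sum converges even for infinite horizons, which is what makes the expected return well defined.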
The optimal Q-value function Q*(s, a) is the maximum expected cumulative reward achievable from a given state-action pair. The concept of the optimal Q-value function is illustrated in Figure 3. To obtain the optimal Q-value, RL breaks the decision problem into smaller subproblems. Bellman's principle of optimality describes how this is achieved [37]. It is stated as follows: an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
The Bellman equation is classified as a functional equation because solving it requires finding the unknown function V, the value function. Recall that the value function describes the best possible value of the objective as a function of the state s. By calculating the value function, the function that describes the optimal action as a function of the state, called the policy function, is also found. Equations (6)-(9) describe this process:

V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V^π(s'), (6)

V^{π*}(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V^{π*}(s') ], (7)

V^π(s) = Σ_{a∈A} π(a | s) q^π(s, a), (8)

q^π(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s'). (9)

By expressing the above equations as recursive functions and computing them repeatedly, the optimal value can be obtained. During the repeated calculations, the action that brings the maximum value in each state is selected, and the computation moves in the direction of the highest value; equivalently, the present value is searched so that the future value is largest. This backward propagation of values is referred to as back propagation.
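The repeated application of the Bellman optimality backup of Equation (7) is value iteration. The sketch below applies it to a toy two-state MDP; the states, transition table, and rewards are invented for illustration and are not the TEG/trigen model.

```python
# Minimal value-iteration sketch of the Bellman backup V*(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')].
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the immediate reward.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:          # stop when the backup no longer changes V
            return V

# Toy MDP: staying in "good" pays 1 per step; "bad" must switch back first.
states = ["good", "bad"]
actions = ["stay", "switch"]
P = {"good": {"stay": [(1.0, "good")], "switch": [(1.0, "bad")]},
     "bad":  {"stay": [(1.0, "bad")],  "switch": [(1.0, "good")]}}
R = {"good": {"stay": 1.0, "switch": 0.0},
     "bad":  {"stay": 0.0, "switch": 0.0}}
V = value_iteration(states, actions, P, R)
# V["good"] converges to 1/(1 - 0.9) = 10; V["bad"] to 0.9 * 10 = 9.
```

The backup sweeps repeat until the values stop changing, which is exactly the recursive calculation described above.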

Deep Q-Network Algorithm
Before explaining the deep Q-network, Q-learning is described. Each agent repeats the following steps to maximize the Q-value:
• Choose an action a in the current state s that maximizes Q(s, a);
• Execute the action and observe the reward r and the next state s';
• Update Q(s, a) toward the target r + γ max_{a'} Q(s', a');
• Set s ← s' and repeat.
Selecting the action that maximizes the Q-value, as in the first step, is known as greedy action selection. However, if only the action that maximizes the Q-value is considered, the opportunity to learn about various environmental changes is lost, and the optimization may fail. Therefore, using the ε-greedy method, nongreedy actions are explored with a certain probability ε to learn about various situations. It is very important to maintain a balance between greedy and nongreedy actions to obtain optimal results. In general, selecting a greedy action is called exploitation, and selecting a nongreedy action is called exploration. Exploration is very important when searching for the best results: pursuing only the immediately best action does not always bring about good results.
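The steps above can be sketched as a tabular Q-learning loop with ε-greedy selection. The two-state environment, the hyperparameters, and all function names below are illustrative assumptions, not the paper's system.

```python
import random

random.seed(0)  # for reproducibility of this illustrative run

def epsilon_greedy(Q, state, actions, eps):
    if random.random() < eps:                          # explore: nongreedy action
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploit: greedy action

def q_learning(step, states, actions, episodes=2000, alpha=0.5, gamma=0.9, eps=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(20):
            a = epsilon_greedy(Q, s, actions, eps)
            r, s2 = step(s, a)
            # Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

# Toy environment: "go" from state "A" earns reward 1 and moves to "B";
# everything else earns 0 and returns to "A".
def step(s, a):
    if s == "A" and a == "go":
        return 1.0, "B"
    return 0.0, "A"

Q = q_learning(step, ["A", "B"], ["go", "stay"])
```

After training, Q[("A", "go")] exceeds Q[("A", "stay")], so the greedy policy in state "A" is the rewarded action, as intended.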
A deep Q-network can be thought of as the combination of a neural network with Q-learning. There are two types of deep Q-network architectures. One receives the state and an action as inputs and outputs the Q function value of that state-action pair. The other receives only the state as input and outputs the Q function values for all actions at once through a feedforward pass. Here, we use the type that receives both state and action and outputs the Q function value. To formulate this into an algorithm, the Q-value function is first approximated by the Q-network.
The objective function is developed in the direction that reduces the difference between the current Q-value and the target Q-value under the mean-squared error (MSE) criterion:

L(θ) = E_{(s, a, r, s') ∼ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) )² ],

where U(D) denotes uniform sampling from the replay memory D, θ is the neural network parameter, and θ⁻ is the old parameter.
Using the above equation, the gradient descent method is applied to determine the optimum value. The replay memory D is referred to as experience replay and stores a dataset of the agent's transitions as independent samples. If the reward r_t is received after selecting action a_t at state s_t, and the new state observed is s_{t+1}, the transition (s_t, a_t, r_t, s_{t+1}) is stored in the replay memory D. The transitions stored in the memory are used to minimize the MSE, and in some cases a minibatch is sampled to improve both the speed and the optimization result.
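A replay buffer of this kind can be sketched with a bounded deque and uniform minibatch sampling. The class name, capacity, and the synthetic transitions below are illustrative assumptions.

```python
import random
from collections import deque

random.seed(0)  # reproducible sampling for this illustration

class ReplayMemory:
    """Stores transitions (s_t, a_t, r_t, s_{t+1}) up to a fixed capacity."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling U(D) breaks the temporal correlation between
        # consecutive transitions before the gradient-descent step.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

D = ReplayMemory(capacity=100)
for t in range(150):                 # 150 stores into a 100-slot buffer:
    D.store(t, "a", 0.0, t + 1)      # only the newest 100 transitions remain
batch = D.sample(32)
```

Sampling uniformly rather than replaying the latest trajectory is what decorrelates the minibatch, which is the stated purpose of experience replay.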
Algorithm 1 describes the deep Q-network procedure using the replay memory D. It initializes the memory D, the Q-function, and the target Q-function, then selects actions (initially at random) and stores the results in D. From the stored transitions, minibatches are sampled to learn the action that attains the optimal value. The action, reward, and policy used here are described in the next section.
Algorithm 1: Deep Q-Network Algorithm
1. Initialize replay memory D to capacity N
2. Initialize action-value function Q with θ
3. Initialize target action-value function Q with θ⁻ = θ
4. For episode = 1 to num_episodes do
5.   For t = 1 to T do
6.     With probability ε select a random action a_t; otherwise select a_t = argmax_a Q(s_t, a; θ)
7.     Execute action a_t in the emulator and observe reward r_t and next state s_{t+1}
8.     Store transition (s_t, a_t, r_t, s_{t+1}) in D
9.     Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
10.    Perform a gradient descent step on L_j(θ) with respect to the network parameters θ
11.  End For
12. End For

Action, Reward, and Policy
The prerequisite for the operation of this system is to provide an adequate amount of heat. This is because the temperature of the compressed gas must be raised above a certain temperature for the TEG to generate power. In addition, TEG serves as a decompression facility by default even if it does not generate power; hence, the temperature of the compressed gas must be raised to provide gas to households. Accordingly, it is essential to supply an adequate amount of heat.
With a proper amount of heat supplied, the TEG can generate maximum power, and the trigen can generate additional power using its remaining capacity. Since the system in this study is not required to form an independent microgrid, the power output supplied to the grid is not limited; thus, the greater the power generation, the better. This is possible under the precondition that the system is linked to a commercial grid that can absorb all the generated power.
Therefore, provided that heat energy is adequately supplied, the policy for maximum output, together with the actions and rewards, can be expressed as shown in Figure 4. There are three states: appropriate heat, lack of chills, and lack of heat. The target is the appropriate heat state, and power generation can only be attempted in this state. Therefore, every attempt at power generation is rewarded, and a reward of 3.0 is given when the system returns to appropriate heat, the most recommended action and state. Even if the state lacks chills or heat after power generation, a reward of 1.0 is provided, as power generation itself is a recommended action.
In the lack-of-chills state, the system can return to the appropriate heat state through chilling; however, if the state remains at a lack of chills even after chilling is applied, a reward of −3.0 is given to avoid this situation.
Similarly, in the lack-of-heat state, the system can return to the appropriate heat state through heating; if the state remains unchanged even after heating is applied, the policy gives a reward of −3.0 to discourage and avoid this action.
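The reward scheme described above can be written out as a lookup table. The state and action names and the values 3.0, 1.0, and −3.0 follow the text; the encoding as a dict keyed by (state, action, next_state), and the 0.0 reward for successful chilling/heating transitions, are illustrative assumptions.

```python
# Reward table for the three-state heat-management MDP described in the text.
REWARDS = {
    # Power generation is always rewarded; returning to the target state
    # "appropriate_heat" earns the largest reward.
    ("appropriate_heat", "generate", "appropriate_heat"):  3.0,
    ("appropriate_heat", "generate", "lack_of_chills"):    1.0,
    ("appropriate_heat", "generate", "lack_of_heat"):      1.0,
    # Chilling should resolve a lack of chills; staying stuck is penalized.
    ("lack_of_chills", "chilling", "appropriate_heat"):    0.0,  # assumed neutral
    ("lack_of_chills", "chilling", "lack_of_chills"):     -3.0,
    # Heating should resolve a lack of heat; staying stuck is penalized.
    ("lack_of_heat", "heating", "appropriate_heat"):       0.0,  # assumed neutral
    ("lack_of_heat", "heating", "lack_of_heat"):          -3.0,
}

def reward(state, action, next_state):
    """Immediate reward r_t for a transition; unlisted transitions assumed 0."""
    return REWARDS.get((state, action, next_state), 0.0)
```

Feeding this reward function to a Q-learning agent steers the policy toward generating power from the appropriate heat state and away from actions that leave the system stuck.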

Case Study
A simulation was conducted to confirm the benefits of optimized operation, obtained by combining the trigen with the TEG and using RL, compared with the existing TEG-only system. In this process, the aforementioned MDP [37] was applied mutatis mutandis, and the MSE [36] was used as the mathematical model. The tool used for the analysis was the TensorFlow engine [49]. Assuming that the thermal energy needed for this system is not required elsewhere and is used only for the TEG, the required thermal energy is determined by the flow rate of the gas entering the TEG. Figure 5 shows the flow rate of gas at a gas station where the TEG is scheduled to be installed. Figure 6 shows the power output that can be generated by the TEG under this flow pattern. When the flow rate exceeds a certain amount, power generation is always available except in the case of failure; hence, an operation rate of 100% is assumed. The annual power output that can be generated is 624 MWh, and the required annual quantity of heat is 595 MWh [3]. Both the power output and the required quantity of heat depend on the average monthly gas flow rate and therefore vary with the situation. In winter, the flow rate is high because gas usage is high, but the required amount of heat is also high because the outside temperature is low.
Assuming that the required amount of heat is provided by the trigen and the maximum power output is determined by RL, the average monthly power simulation results are shown in Figure 7. Compared with generation by the TEG alone, the results show that the power output increases and the variation of the average monthly power output decreases. The required amount of heat can be calculated using Equation (1), and the calculated heat is supplied by the trigen. According to the results in Figure 7, when the flow rate is high, the power output of the TEG is large; however, the required amount of heat also increases, which in turn reduces the power output of the trigen. When the power output of the TEG decreases, the required amount of heat decreases accordingly, freeing trigen capacity to generate electricity. As a result, the overall power output remains at a similar level.
In addition, the amount of heat supply and power output of trigen can be optimized to maximize the power output obtained by using the RL technique. The power output cannot be optimized with the existing simple combined cycle power plant TEG + trigen; thus, there is no significant increase in the power output. However, by applying RL to TEG + trigen as a means of optimizing operation, it is confirmed that the overall power output is maximized.
The overall power generation efficiency obtained is summarized in Table 2. The power generation efficiency increased significantly when the TEG and trigen were used together compared with the TEG alone, and it increased further when RL was applied to determine the optimal operation method. The energy generation efficiency was calculated using Equation (13) from the generated power output and the gas consumption:

η_out = (3.6 × P_e) / (H_f × F_f) × 100, (13)

where η_out represents the total power generation efficiency [%], P_e represents the generated power output [kW], H_f represents the heat output of fuel [MJ/Nm³], and F_f represents the gas consumption [Nm³/h]; the factor of 3.6 converts kW to MJ/h so that the ratio is dimensionless.
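Given the symbol list and units above, the efficiency calculation can be sketched as follows. The exact form of Equation (13) is not reproduced in the text, so this reconstruction, and the example numbers, are assumptions: it takes the efficiency as electrical output divided by fuel heat input, with 1 kW = 3.6 MJ/h for the unit conversion.

```python
# Hedged sketch of the generation-efficiency calculation implied by the
# symbol definitions: eta_out [%], P_e [kW], H_f [MJ/Nm^3], F_f [Nm^3/h].
def generation_efficiency(P_e_kw, H_f_mj_per_nm3, F_f_nm3_per_h):
    fuel_input_kw = H_f_mj_per_nm3 * F_f_nm3_per_h / 3.6  # MJ/h -> kW
    return 100.0 * P_e_kw / fuel_input_kw                 # efficiency in %

# Invented example: 75 kW output, 40 MJ/Nm^3 fuel heat value, 20 Nm^3/h consumption.
print(round(generation_efficiency(75.0, 40.0, 20.0), 2))  # 33.75
```

The same function applied to the measured monthly outputs and gas consumptions would reproduce the efficiency comparison of Table 2, assuming this is indeed the form of Equation (13).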

Conclusions
In this study, the TEG and trigen were employed to improve the total power generation efficiency, and a method of applying the RL technique was presented to maximize the power output. The energy efficiency was enhanced when the trigen provided the heat required for power generation in the TEG, and additional power was generated using the remaining capacity of the trigen; thus, the total energy efficiency was maximized by improving the efficiency of the overall power generation system and maximizing the power output. In particular, it was confirmed that the total energy efficiency improved by 3% when the operation was optimized using the RL technique rather than the simple TEG + trigen combination.
Currently, the application of the RL technique to optimize operation simply maximizes the power output. However, a more precise operation method is expected to be derived by reflecting the characteristics of the gas flow rate, temperature, and gas usage pattern of the location at which the TEG is installed. In the future, it is anticipated that power generation systems using the TEG and trigen will be expanded and installed at gas gate stations. Methods for configuring microgrids (MGs) for each system, and the construction of a comprehensive power generation system based on the integration of the entire operation, could then be studied.