Reinforcement Learning Based Peer-to-Peer Energy Trade Management Using Community Energy Storage in Local Energy Market

Abstract: Many studies have proposed a peer-to-peer energy market where the prosumers’ actions, including energy consumption, charge and discharge schedule of energy storage systems, and transactions in local energy markets, are controlled by a central operator. In this paper, prosumers’ actions are not controlled by an operator, and the prosumers freely participate in the local energy market to trade energy with other prosumers. We designed and modeled a local energy market with a management algorithm that uses community energy storage for prosumers who competitively participate in trade in the real-time energy market. We propose an energy-trade management algorithm that manages the trades of prosumers in two phases based on bids and offers submitted by prosumers. The first phase is to manage the trade of prosumers who have submitted fair prices to trade with other prosumers in the real-time energy market. The second phase is managing the trade of prosumers that could not trade in the first phase. Community energy storage is employed in the second phase and controlled by a reinforcement learning-based trading algorithm to decide whether to buy, sell, or do nothing with the prosumers. The action of buying and selling means charging and discharging the community energy storage, respectively. Numerical results show that the proposed trading algorithm gains a near-maximum profit. Besides, we verified that community energy storage yields more profit than the battery wear-out cost.


Introduction
Greenhouse gases, which are the most significant driver of global warming, mainly originate from the use of fossil fuels. In 2018, 75% of greenhouse gas emissions came from fossil fuel combustion, most of it for everyday purposes such as transportation, energy generation, and industrial use [1,2]. To reduce greenhouse gas emissions, renewable energy sources (RES), including wind, solar, geothermal, and biomass energies, are gaining attention as alternatives to fossil fuels. In 2018, worldwide investments in renewable energy amounted to $272.9 billion, and such large investments in RES have affected consumers [3]. Many consumers have installed private RES-generation facilities with government support to lower their electricity bills through self-generation and self-consumption and to make profits by selling the surplus energy. These consumers have become prosumers who can produce energy independently. However, it is difficult for most prosumers to earn a profit by trading with the utility company due to the low feed-in tariff. This led prosumers to create an independent local energy market (LEM) where prosumers can sell their energy to their neighbors. For example, a Dutch company called Vandebron was established in 2014 to provide the first online peer-to-peer (P2P) energy trading, and in the UK, Open Utility launched a pilot program for P2P trading services in 2015 [4]. In the USA, the first solar energy sale between neighbors was recorded in Brooklyn, New York, in 2016 [5].
The P2P energy trade gives prosumers an option to trade not only with the utility company but also with other prosumers. The challenge that arises in the P2P energy trade is finding a proper bidding strategy to gain profit. Prosumers have to decide the proper bidding price at the proper time when they need to trade energy according to their demand. If prosumers fail to trade energy, they have to trade with the utility at a low price. There is also an option of adopting an energy storage system (ESS) for time-flexible trading with prosumers, but it is difficult to make profits considering the high battery wear-out cost (BWC). Several studies have demonstrated the economic effects of adopting ESS, and in most cases, it was difficult to make profits without government aid [6].
Herein, we adopt the concept of sharing ESS with nearby local prosumers, known as community energy storage (CES), in LEM to reduce the maintenance cost of ESS. CES can also be used for demand-side management [7], but herein, it is used for prosumers. CES is managed by an energy-trade supporter (ETS) who oversees the trade of prosumers. In our proposed LEM, prosumers submit their bids and offers to ETS, and the ETS decides whether to use CES for prosumers. ETS manages the trade of prosumers in two phases. The first phase is real-time arbitrage trading, and the second phase is reinforcement learning (RL)-based arbitrage trading. In the real-time arbitrage trading phase, the ETS trades with prosumers according to their bids and offers. Prosumer's energy is transferred from the seller to the buyer according to ETS, so CES is not used in this phase. In the second phase, the ETS decides whether to trade with a prosumer who failed to trade in the first phase. The ETS decides whether to buy, sell, or do nothing. If the ETS decides to buy energy from a prosumer, the energy is stored in CES, and if the ETS decides to sell energy to a prosumer, the energy stored in CES is discharged to the buyer. Notably, the stored energy in CES is only sold to prosumers in the second phase. The ETS chooses its action through an RL-based algorithm that focuses on maximizing its arbitrage trading profit. The key contributions of this work are as follows:
• We designed an LEM where prosumers are not controlled by a central operator. Prosumers freely participate in the LEM, and the trade between prosumers is managed based on the bids and offers submitted by prosumers.
• We proposed a new role called ETS that manages trades in the LEM. ETS not only acts as a middleman between prosumers but also as a supporter for prosumers who failed to trade in real-time LEM.
• CES is applied to the LEM and controlled by ETS. CES is used only for prosumers who failed to trade in real-time LEM. Through numerical simulations, we showed that the limited use of CES has more economic benefits than using the CES for all prosumers.
• We adopted an RL-based energy trade management technique for CES, which targets maximizing the trading profit considering BWC. We compared the BWC and economic benefits of CES from the RL-based energy trade management algorithm.
The rest of this paper is organized as follows: Section 2 provides an overview of the recent research on energy trade management systems; Section 3 presents the LEM market design; Section 4 presents the proposed energy trade management algorithm; Section 5 provides the numerical results of the proposed algorithm; Section 6 summarizes the key results of this study.

Related Work
We categorize previous studies on energy trade management in LEM into three types: transactive energy management system (EMS), P2P energy trade system, and energy trade system with CES.

Transactive EMS
Demand response (DR) is a demand-side management technique that attempts to change the energy usage pattern of consumers by changing the price or giving incentives [8]. Transactive energy is regarded as a generalized form of DR that focuses on balancing supply and demand by incentivizing prosumers [9]. In transactive energy, the distributed energy resources of prosumers are controlled by their owners, but the transaction mechanism is designed in a system-friendly manner.
In [10–12], a transactive EMS that considers selling prosumers' energy to the grid for demand-side management was introduced. Daneshvar et al. [10] designed four mathematical models of microgrids using a transactive EMS. These models have different optimization goals with different constraints. The optimization goal could be individual or collective, and each model has different energy cost-saving constraints. The main contribution of [10] is an effective model that considers both the collective and individual interests. Nizami et al. [11] designed a transactive EMS managed by a management framework that starts in the day-ahead stage. The house-level EMS minimizes the operational cost of homes by scheduling optimized energy use in the day-ahead stage. In the real-time stage, the house-level EMS participates in LEM when the real-time demand exceeds the scheduled demand. The transaction information from house-level EMSs is gathered by the local transaction agent (LTA), who decides the market clearing rate. The LTA receives requests not only from EMSs but also from the grid through a grid agent. The main contribution of [11] is a transactive EMS that mitigates grid overloading in a two-stage management model. Koltsaklis et al. [12] designed a transactive EMS similar to [11] that starts in the day-ahead stage. The main contribution of [12] is the implementation of three short-term forecasting processes for the optimal day-ahead energy scheduling.
In [13,14], a grid-independent system that does not consider the grid but only coordinates the shared energy between prosumers was introduced. Celik et al. [13] proposed two coordination algorithms that operate in a day-ahead scheme. Houses are assumed to have an ESS, and this storage is controlled by the energy coordination algorithm. The main contribution of [13] is a day-ahead decentralized coordination method that minimizes the electricity bills of prosumers. Akter et al. [14,15] proposed a centralized coordination scheme for different types of houses, dividing the houses into three types. The first type includes houses with a rooftop solar photovoltaic (PV) system and ESS, the second includes houses with only a PV system, and the third includes traditional houses with no additional systems. The main contribution of [14,15] is a transactive EMS that solves a centralized energy management problem for different types of houses.
The common point of the transactive systems in [10–15] is that the management system of the LEM is designed in a system-friendly manner. The energy of prosumers is shared and managed by a centralized agent or aggregator. Herein, we focused on designing a user-friendly system that supports the active trading of a prosumer with other prosumers to earn more profit. The proposed system can be categorized as a P2P energy trade system and is discussed in the following subsection.

P2P Energy Trade System
P2P energy trading refers to the direct trading of energy between prosumers and consumers [16]. The P2P energy trade system is designed in a user-friendly manner in which prosumers act competitively compared to the transactive EMS. A retailer or broker can facilitate trading among prosumers, but prosumers' energy is not shared, and prosumers optimize their profit using a bidding strategy. Most research issues in the P2P energy trade system are discussed from the prosumers' perspective.
Chen et al. [17,18] designed several behavior models of the participants. In [17], prosumers were modeled to maximize their trading profit by utilizing a wind turbine and ESS. Prosumers choose to charge or discharge the ESS, and at the same time, they choose to buy or sell energy in LEM. An RL algorithm that decides the action at a given time to maximize profit was employed. In [18], a middleman, who maximizes their profit by managing the trade among prosumers, was designed. The middleman is called a retail energy broker and determines the market clearing time and clearing rate of prosumers using an RL algorithm. Long et al. [19] employed a game-theoretic approach to the P2P trading mechanism of prosumers. Jing et al. [20] divided the prosumers into commercial and residential prosumers and proposed a fair pricing strategy for P2P trading between the two types of prosumers. The proposed trading mechanism in [20] sought to maximize the trading profit of prosumers while also considering the fair trading profit of prosumers. Bose et al. [21] proposed an RL-based trading model that gives prosumers the option to choose between different trading strategies with different gains and penalties. Kim et al. [22] proposed an RL-based trading model with a new trading evaluation criterion that considers the main factors in LEM.
Since research on P2P energy trading is well advanced, some studies have focused on actual implementation issues. Guerrero et al. [23] proposed an energy trading scheme that guarantees network constraints in energy trading. Thakur et al. [24] implemented a secure trading platform using blockchain for P2P energy trade. Umer et al. [25] proposed a P2P energy trading scheme that guarantees the privacy of prosumers. Herein, the implementation issues, including grid network constraints or trading platforms, are not discussed. We focused on designing a trade management algorithm for prosumers in the P2P energy trade system with assumptions similar to those in previous studies [17–22], but we employed CES, which was not considered in [17–22].

Energy Trade System with CES
A key advantage of ESS is its support for time-flexible energy use under uncertain demand and generation. The disadvantages of ESS include high capital cost and the degradation of the battery; thus, attempts have been made to reduce capital cost by identifying the optimal battery size [26] and to account for degradation by modeling it as a cost function [27]. Currently, the cost of batteries for storage is decreasing rapidly [28], but it is not assured that ESS can yield many benefits without government aid.
The concept of CES can be seen as an effort to minimize cost and maximize utilization by installing a shared ESS. Researchers have sought to determine the economic value of CESs, demonstrating that CESs not only have economic benefits but also offer wider societal benefits [6,29]. The early use of CES concentrated on demand-side management with a CES manager that controls the actions of prosumers [7,30], but recent studies have investigated the economic benefits of CES by integrating it into LEM for P2P energy trading [31]. Herein, the CES is managed by an RL-based trading algorithm that supports the trade of prosumers. Prosumers in the proposed LEM are not controlled by a central operator, but their trades are supported by using the CES. When using either ESS or CES, it is important to show that the profit is higher than the BWC. Herein, we employed an existing BWC model [27] and demonstrated that the proposed energy trade management algorithm using CES earns more profit than the BWC.

System Model
This section describes our proposed energy market and explains how prosumers submit their bids and offers to the market.

Proposed Energy Market
The proposed energy market is shown in Figure 1. The proposed market follows the double auction market where multiple buyers and sellers submit bids and offers simultaneously to trade [32]. We divide each day into $n$ time slots, and $T$ is defined as the interval between time steps, which is equal to $24\,\text{h}/n$. At each time step $t$, prosumers submit their bids and offers, and the ETS manages prosumers' trade based on the submitted bids and offers.

Prosumers
Prosumers are represented as a set of sellers $I$ and a set of buyers $J$ according to their energy demands. A single seller is denoted as $i$ and a single buyer as $j$. The times when $i$ and $j$ submit their offers and bids are defined as $t_i^{\text{sell}}$ and $t_j^{\text{buy}}$, respectively. The offers and bids submitted by prosumers include the price and amount of energy they want to trade. The price offered by seller $i$ is denoted as $p_i^{\text{sell}}$, and the price bid by buyer $j$ is denoted as $p_j^{\text{buy}}$. Similarly, the offer energy that seller $i$ requests to sell is $e_i^{\text{sell}}$, and the bid energy that buyer $j$ requests to buy is $e_j^{\text{buy}}$. Prosumers freely submit offers and bids to the market according to their demand, but the trading price is regulated by the ETS to guarantee a fair profit for sellers and buyers. Seller $i$ should submit $p_i^{\text{sell}}$ at a higher rate than the utility's feed-in tariff to make a profit [33]. At the same time, $p_i^{\text{sell}}$ should not exceed the utility's service rate; otherwise, buyers will choose to trade with the utility instead of seller $i$. Thus,
$$p^{\text{feed-in}} + \eta \le p_i^{\text{sell}} \le p^{\text{rate}} - \eta, \quad \forall i \in I_t,$$
where $p^{\text{feed-in}}$ is the feed-in tariff per kWh offered by the utility, and $p^{\text{rate}}$ is the service rate per kWh charged by the utility. The variable $I_t$ denotes the subset of sellers who await trade in the market at time $t$. The value of $\eta$ is the price regulation of ETS. Similarly, for the bid price of buyer $j$, it should be guaranteed that
$$p^{\text{feed-in}} + \eta \le p_j^{\text{buy}} \le p^{\text{rate}} - \eta, \quad \forall j \in J_t,$$
where $p_j^{\text{buy}}$ is the bid price of buyer $j$. The variable $J_t$ denotes the subset of buyers waiting to trade in the market at time $t$.
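The admissibility of a submitted price can be checked mechanically. The following Python sketch is illustrative only: it assumes that the regulation value η narrows the admissible band symmetrically from both ends, and the function name and the tariff values in the usage lines are hypothetical rather than taken from the simulation.

```python
def price_is_admissible(price, p_feed_in, p_rate, eta=0.0):
    """Return True if a submitted price lies inside the regulated band.

    Assumption (not confirmed by the text): eta shrinks the band from
    both ends, so eta = 0 leaves only the utility prices as bounds.
    """
    lower = p_feed_in + eta
    upper = p_rate - eta
    return lower <= price <= upper


# Hypothetical tariffs in $/kWh, chosen only for illustration.
P_FEED_IN, P_RATE = 0.10, 0.30
offers = {"i1": 0.12, "i2": 0.35, "i3": 0.08}
admissible = {
    seller: price_is_admissible(price, P_FEED_IN, P_RATE)
    for seller, price in offers.items()
}
```

In this example only the offer of i1 falls between the feed-in tariff and the service rate; the other two would be better served by trading with the utility.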

Energy Trade Management Algorithm
Algorithm 1 is the proposed energy trade management algorithm. In each time step $t$, the trade between prosumers is managed by ETS in two phases. The first phase is the real-time arbitrage trading phase, and the second phase is the RL-based arbitrage trading phase. In the real-time arbitrage trading phase, ETS acts as a middleman who tries to maximize its profit by trading in real-time with prosumers. The ETS buys energy from sellers and sells it directly to buyers. The ETS continues to trade with prosumers as long as it can make a profit. In the RL-based arbitrage trading phase, the ETS trades with prosumers who failed to trade in the real-time arbitrage trading phase. The ETS decides whether to buy energy, sell energy, or do nothing. If the ETS decides to buy energy, the ETS selects a seller to buy energy from, and that seller's energy is stored in CES. If the ETS decides to sell energy, the ETS selects a buyer to sell energy to, and the stored energy in CES is discharged to that buyer. Notably, the CES is used only in the RL-based arbitrage trading phase. The RL-based trading algorithm is applied in this second phase and aims at maximizing the ETS trading profit. The following subsections explain the two phases using the example shown in Figures 2 and 3.

Real-Time Arbitrage Trading Phase
In the real-time arbitrage trading phase, the ETS selects prosumers to trade energy. The selected sellers directly send energy to the selected buyers; thus, CES is not used in the real-time arbitrage trading phase. Figure 2 shows an example with four buyers and four sellers in the market at a certain time $t$. The ETS selects to trade with sellers $i_3$ and $i_4$, who have submitted lower offer prices than the other sellers. The energy bought from $i_3$ and $i_4$ is sold to buyers $j_1$ and $j_2$, who have submitted higher bid prices than the other buyers. The real-time arbitrage trading phase can be formulated as an optimization problem as follows:
$$\max_{X^{\text{sell}},\, X^{\text{buy}}} \; \sum_{j \in J_t} p_j^{\text{buy}} x_j^{\text{buy}} - \sum_{i \in I_t} p_i^{\text{sell}} x_i^{\text{sell}}$$
$$\text{subject to} \quad \sum_{i \in I_t} x_i^{\text{sell}} = \sum_{j \in J_t} x_j^{\text{buy}}, \qquad 0 \le x_i^{\text{sell}} \le e_i^{\text{sell}}, \qquad 0 \le x_j^{\text{buy}} \le e_j^{\text{buy}},$$
where $x_i^{\text{sell}} \in X^{\text{sell}}$ and $x_j^{\text{buy}} \in X^{\text{buy}}$ are the total amounts of energy that would be traded for the $i$th seller and $j$th buyer. The first constraint shows that all energy bought from the sellers is sold to the buyers. The last two constraints show that $x_i^{\text{sell}}$ and $x_j^{\text{buy}}$ must not exceed the offer energy of seller $i$ and the bid energy of buyer $j$, respectively.
After the real-time arbitrage trading phase, the prosumers who were not selected by the ETS remain in the market. The remaining prosumers in the market can be seen as prosumers who have failed to submit a proper price to trade with ETS. A conventional middleman would stop trading at this point because the energy cannot be sold with profit in real-time. There is an option of storing the energy in a battery to sell at a higher price, but the BWC should be considered while using a battery. Herein, ETS tries to trade with the remaining prosumers using CES. The details of how ETS trades with these remaining prosumers are described in the following subsection.
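The real-time arbitrage trading phase described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the simulations: it clears the market greedily (cheapest offers against highest bids) and stops when no profitable pair is left, which for this problem structure should coincide with the profit-maximizing solution. The function name and the prices and energies in the example are hypothetical.

```python
def real_time_arbitrage(offers, bids):
    """Greedy clearing for the real-time phase.

    offers: list of (seller_id, price_per_kwh, energy_kwh)
    bids:   list of (buyer_id, price_per_kwh, energy_kwh)
    Returns (trades, profit); each trade is (seller_id, buyer_id, energy_kwh).
    """
    sellers = sorted(([s, p, e] for s, p, e in offers), key=lambda x: x[1])
    buyers = sorted(([b, p, e] for b, p, e in bids), key=lambda x: -x[1])
    trades, profit = [], 0.0
    si = bi = 0
    while si < len(sellers) and bi < len(buyers):
        seller, buyer = sellers[si], buyers[bi]
        margin = buyer[1] - seller[1]
        if margin <= 0:  # no profitable pair is left
            break
        qty = min(seller[2], buyer[2])
        trades.append((seller[0], buyer[0], qty))
        profit += margin * qty
        seller[2] -= qty
        buyer[2] -= qty
        if seller[2] == 0:
            si += 1
        if buyer[2] == 0:
            bi += 1
    return trades, profit


# Hypothetical market with four sellers and four buyers.
offers = [("i1", 0.25, 5.0), ("i2", 0.22, 4.0), ("i3", 0.12, 3.0), ("i4", 0.14, 2.0)]
bids = [("j1", 0.28, 2.0), ("j2", 0.26, 3.0), ("j3", 0.18, 4.0), ("j4", 0.15, 3.0)]
trades, profit = real_time_arbitrage(offers, bids)
```

With these illustrative numbers, sellers i3 and i4 trade with buyers j1 and j2, while i1, i2, j3, and j4 remain in the market for the second phase.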

RL-Based Arbitrage Trading with CES Phase
In the RL-based arbitrage trading with CES phase, the ETS decides whether to trade energy with a prosumer who failed to trade in the real-time arbitrage trading phase. The objective of this phase is to maximize the ETS profit using CES, thereby maximizing the amount of energy traded between prosumers. If ETS decides to buy energy from a seller, the seller's energy is stored in CES. If ETS decides to sell energy to a buyer, the CES energy is sent to that buyer.
After the real-time arbitrage trading phase (Figure 2) is completed, the market would be as shown in Figure 3, which presents the prosumers who failed to trade with the ETS. The ETS decides whether to buy energy from a seller, sell energy to a buyer, or do nothing. If the ETS decides to buy energy, the ETS buys energy from $i_2$, who submitted the lowest offer price, and if the ETS decides to sell energy, the ETS sells energy to $j_3$, who submitted the highest bid price. The action of ETS at time $t$ is defined as $a_t$, and the action space $A$ is defined as follows:
$$a_t \in A = \{a^{\text{charge}}, a^{\text{discharge}}, a^{\text{idle}}\},$$
where $a^{\text{charge}}$ is the action of buying energy from a seller and storing it in CES, $a^{\text{discharge}}$ is the action of selling the energy stored in CES to a buyer, and $a^{\text{idle}}$ represents the action of doing nothing at time $t$.
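The three actions and the choice of counterparty described above can be written down compactly. The Python sketch below is only an illustration; the enum, the function name, and the example prices are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CHARGE = "charge"        # buy from a seller and store the energy in CES
    DISCHARGE = "discharge"  # sell energy stored in CES to a buyer
    IDLE = "idle"            # do nothing in this time step


def select_counterparty(action, offers, bids):
    """Pick the prosumer the ETS trades with for a given action.

    offers/bids: dict mapping prosumer id -> price per kWh.
    Returns the chosen id, or None if the action needs no counterparty
    or no suitable prosumer is waiting.
    """
    if action is Action.CHARGE and offers:
        return min(offers, key=offers.get)  # lowest offer price
    if action is Action.DISCHARGE and bids:
        return max(bids, key=bids.get)      # highest bid price
    return None


# Hypothetical prosumers left over after the real-time phase.
remaining_offers = {"i1": 0.25, "i2": 0.22}
remaining_bids = {"j3": 0.18, "j4": 0.15}
buy_from = select_counterparty(Action.CHARGE, remaining_offers, remaining_bids)
sell_to = select_counterparty(Action.DISCHARGE, remaining_offers, remaining_bids)
```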
The ETS refers to three state values to decide which action maximizes the profit. These values are $o_t$, $b_t$, and $c_t$, which are related to the seller's offer price, the buyer's bid price, and the CES state of charge (SOC), respectively. These variables form the state $s_t$, which is defined as
$$s_t = (o_t, b_t, c_t) \in S,$$
where $S$ denotes the state space. The first state variable $o_t$ is the converted value of $p_t^{\text{charge}}$, which represents the relative price of the seller. The conversion maps $p_t^{\text{charge}}$ to a discrete level indexed by $n$, where $n$ is a natural number in the range $1 \le n \le 8$.
The second state variable $b_t$ is calculated through the same process as the first state variable $o_t$: it is the converted value of $p_t^{\text{discharge}}$, the relative price of the buyer. The third state variable $c_t$ is the converted value of the CES SOC $d_t^{\text{CES}}$, which is defined as
$$d_t^{\text{CES}} = \frac{e_t^{\text{CES}}}{E^{\text{CES}}},$$
where $e_t^{\text{CES}}$ is the stored energy in the CES at time $t$, and $E^{\text{CES}}$ is the capacity of CES. The value of $d_t^{\text{CES}}$ is then converted to the discrete value $c_t$. The values of $s_t$ and $a_t$ determine the reward used to train the RL agent. The reward function $r_t(s_t, a_t)$ returns $r^{\text{buy}}$ for the charge action and $r^{\text{sell}} - p$ for the discharge action, where the variable $p$ is the penalty for trying to sell energy when there are no buyers. In the definitions of $r^{\text{buy}}$ and $r^{\text{sell}}$, $g^{\text{buy}}$ and $g^{\text{sell}}$ are the coefficients that control the effect of the price, and $h^{\text{buy}}$ and $h^{\text{sell}}$ are the coefficients that control the effect of the amount of energy that remains in CES.
The value $h^{\text{buy}} d_t^{\text{CES}}$ is subtracted to obtain $r^{\text{buy}}$, which makes the RL agent avoid buying energy if there is much energy in CES. In the case of $r^{\text{sell}}$, the value $h^{\text{sell}} d_t^{\text{CES}}$ is added to the reward, which makes the RL agent sell energy if there is a lot of energy in CES.
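The role of the SOC in the state and in the reward can be illustrated with a short Python sketch. Several parts are assumptions made only for illustration: the uniform binning of the SOC, the default coefficient values, and the argument `price_score`, a hypothetical normalized price term in [0, 1] that stands in for the price-dependent part of the reward. Only the sign of the SOC terms follows the description above.

```python
def soc_ratio(e_ces, capacity):
    """SOC of the CES: stored energy divided by capacity."""
    return e_ces / capacity


def discretize(value, n_bins):
    """Map a value in [0, 1] to a bin index 0..n_bins-1 (uniform bins assumed)."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * n_bins), n_bins - 1)


def reward_buy(price_score, d_ces, g_buy=1.0, h_buy=1.0):
    """Charging reward: the SOC term is subtracted, discouraging buying when full."""
    return g_buy * price_score - h_buy * d_ces


def reward_sell(price_score, d_ces, g_sell=1.0, h_sell=1.0, penalty=0.0):
    """Discharging reward: the SOC term is added; penalty applies if no buyer exists."""
    return g_sell * price_score + h_sell * d_ces - penalty


# A 400 kWh CES holding 200 kWh, discretized into 10 levels.
c_t = discretize(soc_ratio(200.0, 400.0), 10)
```

For the same price term, a fuller CES lowers the charging reward and raises the discharging reward, which is the behavior the two SOC coefficients are meant to produce.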
The Q-learning algorithm is employed to maximize the amount of energy to be traded between prosumers by finding the optimal policy that maximizes the profit of the arbitrage trade of the ETS [34]. The Q-table is updated as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$
where $\alpha$ is the learning rate, and $\gamma$ is the discount factor.
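A tabular Q-learning agent of this kind can be sketched as follows. The sketch uses the standard Q-learning update with ε-greedy exploration; the class and method names are hypothetical, states are assumed to be tuples of discrete values, and the default parameters of 0.1 follow the values reported in the simulation setting of this paper.

```python
import random
from collections import defaultdict

ACTIONS = ("charge", "discharge", "idle")


class QAgent:
    def __init__(self, alpha=0.1, gamma=0.1, epsilon=0.1, seed=0):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(float)  # (state, action) -> value, 0 by default
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy action selection."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One standard Q-learning update of the table."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])


agent = QAgent()
s, s_next = (3, 4, 5), (3, 4, 6)
agent.update(s, "charge", 1.0, s_next)
first = agent.q[(s, "charge")]
agent.update(s, "charge", 1.0, s_next)
second = agent.q[(s, "charge")]
```

Starting from an all-zero table, two identical updates with reward 1 move the entry to 0.1 and then to 0.19, showing how the learning rate gradually pulls the estimate toward the target.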

Numerical Simulation
Section 5.1 describes the transaction data of prosumers and the learning parameters used for the simulation. Section 5.2 shows the simulation results of CES's economic benefits and the average profit of the proposed algorithm. In the last section, Section 5.3, we compare the profit of the proposed algorithm in the RL-based arbitrage trading phase to other kinds of trading strategies.

Simulation Setting
The setting used for the simulation is shown in Table 1. We divided each day into 72 time steps, allowing 20-min intervals for an event to occur. We assumed that 50 buyers and 50 sellers enter the market once per day. The transaction data of prosumers are generated for each day with the distribution shown in Table 1. Many sellers enter the market around noon when the amount of PV generation is expected to be high, and many buyers enter the market in the afternoon when the amount of energy use is expected to be high. The prosumers' waiting time is fixed to $w$, which is a constant value ranging from 0 to 3. The average feed-in tariff and service rate in Germany are used as the boundary values of $p_i^{\text{sell}}$ and $p_j^{\text{buy}}$ [35,36]. In this study, the ETS does not regulate the prosumers' price, so $\eta$ is set to 0. We assume that CES uses a lithium-ion battery pack, which costs $137 per kWh [37]. The battery efficiency is 0.95, which yields a round-trip efficiency of about 90%. The capacity of the CES is 400 kWh, except for the simulation in Section 5.3 that shows the ETS profit for the CES capacity ranging from 100 to 800 kWh. The Q-learning parameters, namely the learning rate $\alpha$, the discount factor $\gamma$, and the ε-greedy exploration parameter, are all set to 0.1.
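For reference, the scalar settings stated above can be collected in a small configuration object. The Python sketch below is illustrative; the class and field names are hypothetical, and the price distributions of Table 1 are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationSetting:
    steps_per_day: int = 72
    buyers_per_day: int = 50
    sellers_per_day: int = 50
    ces_capacity_kwh: float = 400.0
    battery_price_per_kwh: float = 137.0  # $/kWh
    battery_efficiency: float = 0.95      # one-way efficiency
    eta: float = 0.0                      # no price regulation by the ETS
    alpha: float = 0.1                    # learning rate
    gamma: float = 0.1                    # discount factor
    epsilon: float = 0.1                  # exploration parameter

    @property
    def step_minutes(self):
        """Length of one time step in minutes."""
        return 24 * 60 / self.steps_per_day

    @property
    def round_trip_efficiency(self):
        """Charging followed by discharging applies the efficiency twice."""
        return self.battery_efficiency ** 2


setting = SimulationSetting()
```

With 72 steps per day each step lasts 20 min, and a one-way efficiency of 0.95 gives a round-trip efficiency of 0.9025, i.e., about 90%.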

Simulation Results
In order to validate the economic benefits of using CES, we compared the total profit resulting from the RL-based arbitrage trading with CES phase against the BWC. The total profit includes the ETS profit and the prosumers' profit from trading with the ETS instead of the utility. In the total profit obtained by the proposed trading algorithm, $p_i^{\text{sell}} - p^{\text{feed-in}}$ and $p^{\text{rate}} - p_j^{\text{buy}}$ indicate the price advantages that seller $i$ and buyer $j$ obtain by trading with the ETS instead of the utility.
The BWC incurred when the stored energy in the battery changes from $e_t^{\text{CES}}$ to $e_{t+1}^{\text{CES}}$ can be defined as follows:
$$\text{BWC}_t = \text{AWC}(D)\,\left| e_{t+1}^{\text{CES}} - e_t^{\text{CES}} \right|,$$
where AWC($D$) is the average wear cost per unit of energy transfer, and the inner variable $D$ indicates the battery depth of discharge (DOD). The AWC per kWh is formulated using an existing BWC model in [27] as follows:
$$\text{AWC}(D) = \frac{\text{Battery Price}}{\text{Total Transferable Energy During the Life Cycle}} = \frac{\text{Battery Price}}{\text{ACC}(D) \times 2 \times D \times \mu^2},$$
where $\mu$ is the battery efficiency, and $\mu^2$ indicates the round-trip efficiency of the battery. ACC($D$) is the average cycle count of a typical lithium-ion battery, which is defined as follows:
$$\text{ACC}(D) = a \cdot D^{-b},$$
where $a$ and $b$ are battery-dependent parameters that can be acquired experimentally. However, batteries that are made with the same material do not have exactly the same wear characteristics and thus need to be checked by experiments. The most typical specifications are given by the battery manufacturers with respect to the DOD. Herein, we applied the values of a typical lithium-ion battery, which are $a = 694$ and $b = 0.795$ [27]. A lithium-ion battery pack cost $137/kWh in 2020 [37]. We set the remaining variables to $D = 1$ and $\mu = 0.95$; thus, AWC(1) = $0.1093/kWh.
Figure 4 presents the total profit and BWC resulting from the RL-based arbitrage trading phase. The total profit is about $105.34 per day, and BWC is about $77.89 per day. This result reveals that the proposed algorithm using CES yields a net profit of $27.45 per day in the proposed LEM. If we consider social benefits, such as the decrease in grid consumption, we expect far more benefits than this result indicates.
Figure 5 shows the daily profit of ETS. The ETS Profit with CES graph represents the ETS profit using the proposed energy trade management algorithm. The ETS Profit without CES graph represents the ETS profit using the proposed energy trade management algorithm with only the real-time arbitrage trading phase and without the RL-based arbitrage trading with CES phase, since it assumes that the ETS has no CES. The daily averages of the ETS profit are $69.58 and $36.22, respectively. The ETS can earn about $33.36 more per day using the CES.
The CES first algorithm uses CES to trade with a prosumer who submitted the highest bid price or lowest offer price that can be traded with a high probability in the RL-based arbitrage trading phase. This results in a lower profit for the CES first algorithm compared to that of the proposed algorithm. There is a slight increase in the ETS profit if prosumers wait for a certain time $w$ to trade in the LEM, but there is no large difference in the total profit. In Figure 6, the proposed algorithm shows an ETS profit of $69.58 and a total profit of $105.34, and the CES first algorithm shows an ETS profit of $56.08 and a total profit of $89.37. In Figure 7, the proposed algorithm shows an ETS profit of $71.99 and a total profit of $105.29, and the CES first algorithm shows an ETS profit of $67.10 and a total profit of $91.39.
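The wear-cost figures reported in this subsection can be reproduced with a few lines of Python. The sketch assumes that the average cycle count follows a power law in the DOD and that each cycle transfers twice the DOD per kWh of capacity, scaled by the round-trip efficiency; with the stated parameters this reproduces the quoted AWC(1) of about $0.109/kWh. The function names are hypothetical.

```python
def average_cycle_count(dod, a=694.0, b=0.795):
    """Average cycle count at a given depth of discharge (power law assumed)."""
    return a * dod ** (-b)


def average_wear_cost(dod, battery_price=137.0, efficiency=0.95):
    """Wear cost in $ per kWh of energy transferred, per kWh of capacity."""
    transferable = average_cycle_count(dod) * 2.0 * dod * efficiency ** 2
    return battery_price / transferable


def battery_wear_cost(e_before, e_after, dod=1.0):
    """Wear cost of changing the stored energy from e_before to e_after (kWh)."""
    return average_wear_cost(dod) * abs(e_after - e_before)


awc = average_wear_cost(1.0)   # roughly 0.109 $/kWh
net_profit = 105.34 - 77.89    # reported total profit minus reported BWC
```

The difference between the reported daily total profit and the reported daily BWC gives the net figure of $27.45 per day.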

Simulation Analysis
In order to verify the performance of RL-based arbitrage trading in the proposed algorithm, we compared the trading profit with that of three other trading strategies. The first trading strategy is the previous action maintain (PAM), which repeats the action chosen at the previous time step. PAM starts with buying energy until the CES reaches its capacity, and after reaching the capacity, PAM then repeats selling energy. The second strategy is called random action as it chooses actions in a random order. The last trading strategy is named optimal, which gives the maximum profit with the perfect forecast of the whole day's transaction.
Figures 8 and 9 show the CES discharge rate and ETS profit with respect to the CES capacity. The 100-day trading result is used to calculate the daily average of CES discharge rate and ETS profit. The RL-based arbitrage trading in the proposed algorithm reveals a high discharge rate, 0.97 on average, and the profit increases slightly as the CES capacity increases. PAM also shows a high discharge rate of 0.98 when the CES capacity is 400 kWh. However, there is a large difference in the ETS profit. PAM shows a profit of $9.02 per day when the CES capacity is 400 kWh, whereas RL-based arbitrage trading in the proposed algorithm shows a profit of $33.36 per day for the same CES capacity.
Figure 10 shows the daily profit of the ETS for each trading strategy. The optimal strategy shows a profit of $55.5 per day, which results from selling energy at a high price with no surplus energy left in the CES. In the case of the other trading strategies, we assume that the remaining energy in the CES after a day is sold to the utility company. This prevents the CES from being overloaded by energy that has a low possibility of being sold to another prosumer.
Figures 11 and 12 show the pattern of the daily use of CES. Since CES is used for prosumers who failed to trade in the real-time arbitrage trading phase, the daily use graph of the CES is irregular. PAM shows a typical CES use pattern compared to that of other algorithms as it does not consider the price of a prosumer but only the SOC of CES. RL-based arbitrage trading in the proposed algorithm shows an irregular CES use pattern, but the charge and discharge graphs are similar, indicating the high selling rate of the proposed algorithm.
Figure 13 shows the daily discharge rate for each algorithm. The daily discharge rate shows the selling rate of arbitrage trading. PAM also shows a high discharge rate, comparable to that of the RL-based arbitrage trading of the proposed algorithm. However, as we have seen above, PAM was not able to sell its energy at times when the prosumer price is high.
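The two simple baselines compared in this subsection can be sketched as follows. This is an illustration under stated assumptions, not the simulation code: PAM is assumed to switch back to charging once the CES is empty, and the discharge rate is assumed to be the ratio of discharged to charged energy over a day. All function names are hypothetical.

```python
import random

ACTIONS = ("charge", "discharge", "idle")


def pam_action(previous_action, soc_kwh, capacity_kwh):
    """Previous action maintain: keep buying until full, then keep selling.

    Assumption: once the CES is empty, PAM returns to buying.
    """
    if soc_kwh >= capacity_kwh:
        return "discharge"
    if soc_kwh <= 0:
        return "charge"
    return previous_action


def random_action(rng):
    """Random baseline: pick one of the three actions uniformly."""
    return rng.choice(ACTIONS)


def discharge_rate(charged_kwh, discharged_kwh):
    """Share of the charged energy that was later sold (assumed definition)."""
    return discharged_kwh / charged_kwh if charged_kwh > 0 else 0.0


rng = random.Random(0)
sampled = random_action(rng)
```

Because PAM looks only at the SOC and never at prices, it can reach a high discharge rate while still selling at unfavorable times, which is consistent with the gap in profit discussed above.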
In addition, we analyzed the performance of the RL-based trading algorithm with different sizes of the state $s_t$. Figure 14 shows the daily profit of the ETS in the RL-based arbitrage trading phase when the CES capacity is 400 kWh. Increasing the size of $s_t$ from 5 × 5 × 5 to 10 × 10 × 10 increases the profit by about $3.01, but increasing the size from 10 × 10 × 10 to 15 × 15 × 15 decreases the profit slightly, by about $1.18. According to the simulation result, decreasing the size of $s_t$ results in a smaller profit, but increasing the size of $s_t$ did not result in a bigger profit.
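To give a sense of scale for the three state sizes, the number of Q-table entries can be counted directly. The short Python sketch below assumes three state variables with the same number of levels each and the three actions of the ETS; the function name is hypothetical.

```python
N_ACTIONS = 3  # charge, discharge, idle


def q_table_entries(levels_per_variable, n_state_variables=3, n_actions=N_ACTIONS):
    """Number of (state, action) pairs in a tabular Q-function."""
    return levels_per_variable ** n_state_variables * n_actions


sizes = {levels: q_table_entries(levels) for levels in (5, 10, 15)}
```

The table grows from 375 entries for 5 × 5 × 5 to 3000 for 10 × 10 × 10 and 10,125 for 15 × 15 × 15, while each simulated day still provides only 72 decision steps.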

Conclusions and Discussion
In this study, an energy-trade management algorithm is proposed for managing the trade of prosumers using CES. We designed and modeled an LEM that manages the trade of prosumers who are not controlled by a central operator and act competitively with other prosumers. The CES is managed by an RL-based algorithm and used for prosumers who fail to trade in the real-time trading phase. Numerical simulations show that there are more economic benefits if the use of CES is limited to prosumers who fail to trade in the real-time trading phase. The main parameters considered in the numerical simulations are the ETS profit from trading with prosumers, the prosumers' profit from trading with the ETS instead of the utility, and BWC. Numerical simulations show that the total trading profit obtained by using the CES is higher than the BWC. The primary results of the numerical simulations are as follows:
• The total profit of the proposed energy management algorithm is $105.34 per day. If CES is used in the real-time trading phase, the total profit decreases to $89.37 per day.
• The BWC of the proposed energy management algorithm is about $77.89 per day. This result reveals that the proposed algorithm using CES yields a profit of $27.45 per day in the proposed LEM.
There are two limitations to this study, which we shall address in our future studies. First, instead of generating the simulation data with specified means and deviations, we hope to analyze and apply real-world energy transaction data of prosumers. Second, since the Q-learning algorithm has difficulty handling continuous values for the state and action, we plan to adopt a more recent RL algorithm to trade more efficiently in the proposed LEM.