Reinforcement Learning and Stochastic Optimization with Deep Learning-Based Forecasting on Power Grid Scheduling

Abstract: The emission of greenhouse gases is a major contributor to global warming. Carbon emissions from the electricity industry account for over 40% of the total carbon emissions. Researchers in the field of electric power are making efforts to mitigate this situation. Operating and maintaining the power grid in an economic, low-carbon, and stable manner is challenging. To address this issue, we propose a grid dispatching technique that combines deep learning-based forecasting, reinforcement learning, and optimization. Deep learning-based forecasting predicts future power demand and solar power generation, while reinforcement learning and optimization make charging and discharging decisions for energy storage devices based on current and future grid conditions. In the optimization method, we simplify the complex electricity environment to speed up the solution. The proposed deep learning-based forecasting is combined with stochastic optimization and online data augmentation to address the uncertainty of the dispatch system. A multi-agent reinforcement learning method is proposed to utilize a team reward among energy storage devices. Finally, we achieve the best results by combining the reinforcement learning and optimization strategies. Comprehensive experiments demonstrate the effectiveness of our proposed framework.


Introduction
Nowadays, with the rapid development of artificial intelligence (AI), intelligent household appliances and equipment are becoming increasingly common. More and more families are installing home solar power generation equipment and small-scale energy storage equipment, not only to meet their own electricity needs but also to sell excess power through the sharing network. If we can make home electricity use more efficient, then the community power grid will be more economical and low-carbon. Furthermore, the efficiency and stability of the community power grid can help guarantee the stability of the national power grid.
Electricity research generally covers Large-scale Transmission Grids (LTG for short) and Small-scale Micro-Grids (SMG for short). LTG focuses on high-voltage and long-distance power transmission, while SMG focuses on electricity consumption in small areas such as schools, factories, or residential areas. We focus on smart scheduling techniques in SMG. For example, Figure 1 shows a case of SMG. Households can generate electricity from solar energy, store the excess power, and share it with neighbors on the grid network (green arrows). When neither self-generated power nor the shared network can provide enough electricity, power is supplied by the national grid (orange lines). The national grid generates electricity through wind, hydroelectric, and thermal power. The cost of electricity and the associated carbon emissions vary over time. In this paper, we use an AI-based approach to enable efficient scheduling of household storage. The AI-based scheduling method leads to economical and decarbonized electricity use.
In the power generation process, increasing the proportion of new energy sources is one of the most important ways to reduce carbon emissions. The use of new energy sources, such as wind power and solar power, reduces carbon emissions for the grid network but adds more uncertainty to the entire power network. For example, solar power generation is affected by the weather, and if future weather changes cannot be accurately predicted, this will affect the scheduling of other power generation methods in the power network. Uncertainty in new energy generation poses a great challenge to traditional dispatch systems. We categorize the uncertainty as data drift: the relation between the input data and the target variables changes over time [1]. For example, the sequential transitions in a time series of renewable energy generation (e.g., wind power and solar power) can fluctuate.
The field of AI-based forecasting is continuously evolving. AI-based forecasting methods have been applied to predict the spread of contagious diseases such as COVID-19 [2], demonstrating their potential in public health applications. Deep learning techniques, including recurrent neural networks (RNN) and long short-term memory (LSTM) networks [3], have been extensively studied for time series forecasting, showing promising results. Neural network architectures, such as feed-forward neural networks and convolutional neural networks (CNN) [4], have also been explored for time series forecasting, contributing to the advancement of AI-based forecasting models [5]. These studies provide insights into advanced AI-based forecasting techniques and their applications in different domains, especially time series forecasting. Therefore, in the electric power domain, we adopt a deep learning-based method to predict future user demand and renewable generation (the task can be regarded as a sub-domain of time series forecasting).
To handle uncertainty, classical model predictive control (MPC)-based methods use rolling control to correct the parameters through rolling feedback [6,7]. However, their effect falls short of expectations in practical applications. Taking industrial applications as an example, the sequential MPC framework can usually be decomposed into point prediction of target variables (e.g., solar power generation), followed by deterministic optimization, which is unable to capture the uncertainty of the probabilistic data distribution [8,9]. To solve these problems, stochastic methods have been proposed, and they are able to eliminate the effects caused by some uncertainties.
By taking the uncertainty in forecasting into account, it is possible to improve energy efficiency by 13% to 30% [10,11]. Stochastic methods mainly fall into two types: one requires prior knowledge of the system uncertainty [12,13], and the other is based on scenarios, generating values for multiple random variables [14,15]. Additionally, adaptive methods are also applied in the presence of uncertainty [16-18]. In this paper, enhanced generalization capability is achieved by combining stochastic optimization with online adaptive rolling updates.
Despite some recent progress, it is difficult for existing systems to meet the demand of real-time scheduling due to the huge number of SMGs and the high model complexity. Under the requirement of real-time scheduling, attempts to apply reinforcement learning to power grids are receiving increasing attention.
Reinforcement learning has been shown to deliver real-time decisions in several domains and has the potential to be effectively applied in power grid scenarios. In Large-scale Transmission Grids (LTG), reinforcement learning has not yet been successfully applied due to security concerns. In Small-scale Micro-Grids (SMG), where economy is more important (security can be guaranteed by the upper-level grid network), reinforcement learning is gradually starting to be tried. In reinforcement learning, the model learns by trial and error through constant interaction with the environment [19] and ultimately obtains the best cumulative reward. Training for reinforcement learning usually relies on a simulation environment, which is assumed to be provided in this paper. Unlike the existing single-agent approaches, in this paper, we propose a multi-agent reinforcement learning method for the grid scheduling task. Reinforcement learning in electric power scheduling offers the potential to enhance the efficiency, reliability, and sustainability of power systems, leading to cost savings, reduced environmental impact, and improved overall performance.
The main contributions of this paper are:

• To adapt to uncertainty, we propose two modules to achieve robust scheduling. One module combines deep learning-based prediction techniques with stochastic optimization methods, while the other module is an online data augmentation strategy, including stages of model pre-training and fine-tuning.
• To share rewards among buildings, we propose a multi-agent PPO in which each building is modeled as an agent. Additionally, we provide an ensemble of the reinforcement learning and optimization methods.
• We conducted extensive experiments on a real-world scenario, and the results demonstrate the effectiveness of our proposed framework.

Problem Statement
Generally, an SMG contains various types of equipment, including solar generation machines (denoted as $G$), storage devices (denoted as $S$), and other user devices (denoted as $U$). $M$ denotes the markets, such as carbon and electricity. The total number of decision steps is $T$. We define the load demand of the user as $L_{u,t}$, where step $t \in \mathcal{T} = \{1, \ldots, T\}$ and $u \in U$. $p_t$ is the market price per unit at time $t$, or the average price among $M$.
The variables in SMG include the electricity needed from the national grid (denoted as $P_{grid,t}$), the power generation of device $g \in G$ (denoted as $P_{g,t}$), the charging or discharging power of storage (denoted as $P^{+}_{s,t}$ or $P^{-}_{s,t}$), and the state of charge of device $s \in S$ (denoted as $E_{s,t}$). We define the decision variables as $X = \{P_{grid,t}, P_{g,t}, P^{+}_{s,t}, P^{-}_{s,t}, E_{s,t}\}$, where $t \in \mathcal{T}$, $s \in S$, and $g \in G$. The objective (1) is to minimize the total cost over all markets [20], subject to constraints (2)-(6). To facilitate the understanding of these constraints, we explain each of them in detail:
(2) Bounds on the electricity needed from the national grid: bounded below by zero and without an upper bound.
(3) $P^{min}_{g,t}$ denotes the lower bound of each electricity generation device, such as solar generation, while $P^{max}_{g,t}$ denotes the upper bound.
(4) $P^{+,max}_{s,t}$ represents the upper limit for battery/storage charging at timestamp $t$, while $P^{-,max}_{s,t}$ represents the upper limit for discharging.
(5) $E^{min}_{s,t}$ represents the lower bound of the state of charge (SoC), while $E^{max}_{s,t}$ denotes the upper bound; the second equation describes the update of the SoC.
(6) This equation ensures that the power grid is balanced (the sum of power generation is equal to the sum of power consumption).
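For orientation, one plausible way to write the problem described by (1)-(6) is sketched below. It is reconstructed only from the verbal descriptions above and is not necessarily the exact formulation of [20]; in particular, the single price $p_t$, the unit time step, and the absence of charging and discharging efficiencies in the state-of-charge update are simplifying assumptions.

```latex
\begin{align}
\min_{X} \quad & \sum_{t \in \mathcal{T}} p_t \, P_{grid,t} \tag{1} \\
\text{s.t.} \quad & P_{grid,t} \ge 0, \tag{2} \\
& P^{min}_{g,t} \le P_{g,t} \le P^{max}_{g,t}, \tag{3} \\
& 0 \le P^{+}_{s,t} \le P^{+,max}_{s,t}, \quad 0 \le P^{-}_{s,t} \le P^{-,max}_{s,t}, \tag{4} \\
& E^{min}_{s,t} \le E_{s,t} \le E^{max}_{s,t}, \quad E_{s,t} = E_{s,t-1} + P^{+}_{s,t} - P^{-}_{s,t}, \tag{5} \\
& P_{grid,t} + \sum_{g \in G} P_{g,t} + \sum_{s \in S} P^{-}_{s,t} = \sum_{u \in U} L_{u,t} + \sum_{s \in S} P^{+}_{s,t}. \tag{6}
\end{align}
```

Under these assumptions every constraint is linear in $X$, so the deterministic problem is a linear program.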
In practical application scenarios, it is not possible to obtain exact data on market prices, new energy generation, and user loads in advance when conducting power scheduling. Therefore, it is necessary to predict these values before making decisions. In the following, we provide a detailed introduction to our solution.

Feature Engineering
Feature engineering provides input for the subsequent modules, including the forecasting module, the reinforcement learning module, and the optimization module. We extract features for each building (the detailed building information is introduced in the dataset section). Because the features have different scales, we normalize all features $X$ as $x_{new} = (x - \min(X)) / (\max(X) - \min(X) + \epsilon)$, where $x_{new}$ is the normalized output, $\max(X)$ denotes the maximum value of each domain, $\min(X)$ represents the minimum, and $\epsilon$ is a small value that prevents the denominator from being zero. Moreover, to reduce the influence of outliers, we also denoise the data: we truncate the outliers that exceed a certain percentage of the average value, where $\alpha$ is a pre-set adjustable parameter and $avg(X)$ represents the average value of the feature.
The key features of the subsequent modules are listed below. For the forecasting module:
• The user loads of past months;
• The electricity generation of past months;
• The direct and diffuse solar irradiance;
• Detailed time information, including the hour of the day, the day of the week, and the day of the month;
• The weather forecast, including the values of humidity, temperature, and so on.
For the reinforcement learning module and the optimization module:
• The key features listed above;
• The predictions of user load and electricity generation;
• The number of solar generation units in each building;
• The efficiency and capacity of the storage in each building;
• Market prices, including the values for electricity and carbon.
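The normalization and denoising steps described above can be illustrated with a short sketch. The min-max form follows the definitions in the text; the truncation rule (capping values at `alpha * avg(X)`) is only an assumption about how the threshold is applied, and the function names and sample values are hypothetical.

```python
def truncate_outliers(values, alpha=3.0):
    """Cap values above alpha * avg(X). The exact rule is an assumption."""
    cap = alpha * (sum(values) / len(values))
    return [min(v, cap) for v in values]


def min_max_normalize(values, eps=1e-8):
    """Scale values to roughly [0, 1]; eps keeps the denominator non-zero."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


loads = [1.0, 2.0, 3.0, 2.0, 40.0]   # 40.0 plays the role of an outlier
denoised = truncate_outliers(loads, alpha=2.0)
normalized = min_max_normalize(denoised)
```

Truncating before normalizing keeps a single extreme reading from compressing all other values toward zero.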

Deep Learning-Based Forecasting Model
The deep learning-based forecasting module generates the input data for the subsequent modules, i.e., the optimization module and the reinforcement learning module. The target variables include user load (denoted as $L_{u,t}$), market prices (denoted as $p_t$), and the capacity of solar generation (denoted as $P^{max}_{g,t}$). The input features of the forecasting models are listed in the Feature Engineering section above.
In sequence prediction tasks, deep neural network methods have gradually become state-of-the-art (SOTA). The Gated Recurrent Unit (GRU for short) is one of the most commonly applied types of recurrent neural network with a gating mechanism [21]. We employ a recurrent neural network (RNN) with a GRU in our approach. Additionally, our framework can easily adapt to other neural networks, including CNNs and transformers. Compared to other variants of recurrent networks, an RNN with a gating mechanism performs well on small datasets [22]. Given the input sequence $x = (x_1, \ldots, x_T)$, the RNN we use is described as [23]: $h_t = \varphi_1(h_{t-1}, x_t)$, $y_t = \varphi_2(h_t)$, where $h_t$ denotes the hidden state of the RNN at time $t$, $y_t$ denotes the corresponding output, and $\varphi_1$ and $\varphi_2$ represent non-linear functions (an activation function or its combination with an affine transformation). By fitting the maximum likelihood on the training data, the model is able to predict $f_{L_u}$, $f_p$, and $f_{P_g}$, corresponding to user load, market prices, and capacity of solar generation, respectively. Moreover, since each of our modules is decoupled, it is easy to incorporate the predictions of any other forecasting method into the framework.
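To make the recurrence concrete, the following is a minimal sketch of a single GRU cell with a scalar input and a scalar hidden state, written in plain Python. It is an illustration of the gating mechanism rather than the network used in the paper; the parameter names and values are hypothetical and untrained, and the gate convention (new state weighted by the update gate) is one of the two common variants.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand                  # new hidden state


def forecast(xs, p, h0=0.0):
    """Run the cell over a sequence and emit y_t = wy * h_t + by at each step."""
    h, ys = h0, []
    for x in xs:
        h = gru_step(x, h, p)
        ys.append(p["wy"] * h + p["by"])
    return ys


# Hypothetical, untrained parameters chosen only to make the sketch runnable.
params = {"wz": 0.5, "uz": 0.1, "bz": 0.0, "wr": 0.3, "ur": 0.2, "br": 0.0,
          "wh": 0.8, "uh": 0.4, "bh": 0.0, "wy": 1.0, "by": 0.0}
predictions = forecast([0.2, 0.4, 0.6], params)
```

Here `gru_step` plays the role of the hidden-state update and the affine read-out in `forecast` plays the role of the output function.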

Reinforcement Learning
In most scenarios, reinforcement learning can provide real-time decision-making, but the safety of these decisions cannot be guaranteed. Therefore, reinforcement learning has not been practically applied in LTG. However, SMG serves as a good testing ground for reinforcement learning. Because SMG does not require the calculation of power flow in the network, the interaction between the agent and the simulation environment during training can be conducted within a limited time. Since its proposal, Proximal Policy Optimization (PPO) [19] has been validated to achieve good results in various fields. Therefore, we model and adapt the power grid environment based on the PPO method.
The reinforcement learning framework we use for SMG, as shown in Figure 2, includes several parts: a simulation environment module, an external data input module, a data preprocessor module, a model module, and a result postprocessor module.
The simulation environment models the microgrid, mainly using real data from past years for simulation. External input data include real-time climate information obtained from websites. The data preprocessor filters and normalizes the observed data. The model module consists of multi-agent PPO (MAPPO), which includes multiple neural network modules and the loss function design. The final result postprocessor module enforces the bounds on the model's output, such as checking whether the output of a generator exceeds its physical limits.
Most existing applications of reinforcement learning focus on single-agent methods, including centralized PPO (CPPO) and individual PPO (IPPO) [24]. As shown in Figure 3, CPPO learns the model by consolidating all inputs and interacting with the SMG. IPPO, on the other hand, takes independent inputs for multiple learning instances. In the case of an SMG, each input represents a generation or consumption unit, such as a building. In practical scenarios, there are various types of SMG, including factories, residential communities, schools, hospitals, etc. Therefore, the framework should be able to adapt to different types of SMG. The CPPO method concatenates all inputs into one input each time, so it cannot be applied to SMGs with different inputs. For example, a model trained on a school SMG with 10 teaching buildings cannot be quickly adapted and applied to one with 20 teaching buildings. To address this issue, the IPPO method is introduced, which allows all teaching buildings to be fed into the same agent in batches. However, in an actual SMG, information sharing among teaching buildings is crucial. For example, the optimal power scheduling plan needs to be achieved by sharing solar energy between teaching buildings in the east and west. Since IPPO only has one agent, it cannot model this information sharing. Based on this, we propose a multi-agent PPO (MAPPO) model to address the information sharing problem in SMG.
As shown in Figure 4, in the MAPPO framework, taking a school microgrid as an example, each agent represents a building, and each building has its own independent input. Additionally, the main model parameters are shared among all the buildings. If $\pi_i(a_i|\tau_i)$ is the model of agent $i$, the joint model is $\pi(a|s) := \prod_{i=1}^{n} \pi_i(a_i|\tau_i)$, where $n$ denotes the number of teaching buildings. The expected discounted accumulated reward is defined as [24]: $J(\pi) = \mathbb{E}_{a_t, s_t}\left[\sum_t \gamma^t R(s_t, a_t)\right]$, where $\gamma$ represents the discount ratio, $R$ is the reward, and $s_t = [o^1_t, \ldots, o^n_t, a_t, r_t]$ is the current state of the whole system.
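The two quantities in this paragraph, the factorized joint policy and the discounted accumulated reward, can be illustrated numerically. The sketch below assumes a single recorded episode with a shared team reward; the function names and all numbers are hypothetical.

```python
import math


def joint_probability(per_agent_probs):
    """pi(a|s) as the product of the per-agent probabilities pi_i(a_i|tau_i)."""
    return math.prod(per_agent_probs)


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode of team rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Three buildings (agents) and a three-step episode; all numbers are made up.
p_joint = joint_probability([0.5, 0.8, 0.25])
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

In training, the expectation would be estimated by averaging such returns over many sampled episodes.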

Optimization

Stochastic Optimization
In the deep learning forecasting module, we have trained models that can predict user load ($\hat{L}_{u,t}$), market prices ($\hat{p}_t$), and the capacity of solar generation ($\hat{P}^{max}_{g,t}$). On the validation dataset, we obtain the deviations of the models for these predictions, and their variances are denoted as $\Sigma_{L_u}$, $\Sigma_p$, and $\Sigma_{P_g}$, respectively. These values represent the level of uncertainty. To mitigate the impact of uncertainty, we propose a stochastic optimization method, as shown in Figure 5b. We use the predicted values as means and the uncertainties as variances, i.e., $(\hat{P}^{max}_{g,t}, \Sigma_{P_g})$, $(\hat{L}_{u,t}, \Sigma_{L_u})$, and $(\hat{p}_t, \Sigma_p)$, to perform Gaussian sampling. Through Gaussian sampling, we obtain multiple scenarios, which together form a multi-scenario optimization problem. Assuming we have $N$ scenarios, the $n$-th scenario ($n \in S_N$) consists of one sampled realization of these three quantities [25]. The objective function of the proposed stochastic optimization is then redefined over the $N$ scenarios, and constraints (3) and (6) are refined accordingly for each scenario. By solving the stochastic optimization problem (10), we obtain the scheduling plan $\dot{X} = \{\dot{P}_{grid,t}, \dot{P}_{g,t}, \dot{P}^{+}_{s,t}, \dot{P}^{-}_{s,t}, \dot{E}_{s,t}\}$.
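The scenario-generation step can be sketched as follows. The code draws Gaussian samples around a point forecast using the validation-error variance, as described above, and then averages a cost over the sampled scenarios. The averaged cost is only an assumed form of the multi-scenario objective; the function names and numbers are hypothetical.

```python
import random


def sample_scenarios(pred_mean, variance, n_scenarios, seed=0):
    """Draw n_scenarios Gaussian trajectories around a point forecast.

    pred_mean: forecast per time step; variance: validation-error variance.
    """
    rng = random.Random(seed)
    std = variance ** 0.5
    return [[rng.gauss(mu, std) for mu in pred_mean] for _ in range(n_scenarios)]


def expected_cost(price_scenarios, grid_power):
    """Average of sum_t p_t * P_grid,t over price scenarios (assumed objective)."""
    costs = [sum(p * g for p, g in zip(prices, grid_power))
             for prices in price_scenarios]
    return sum(costs) / len(costs)


price_forecast = [0.20, 0.25, 0.30]           # hypothetical forecast of p_t
scenarios = sample_scenarios(price_forecast, variance=0.0004, n_scenarios=100)
cost = expected_cost(scenarios, grid_power=[1.0, 2.0, 0.5])
```

A fixed seed keeps the sampled scenarios reproducible, which is convenient when comparing different numbers of scenarios $N$.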

Online Data Augmentation
To address the data drift problem, we propose an online data augmentation method, as shown in Figure 5c. The module contains two parts: a pre-training/fine-tuning scheme and rolling-horizon feedback correction.

Pre-Training and Fine-Tuning
In practice, the real-time energy dispatch process is a periodic task (e.g., daily dispatch). Because the prediction models are trained on historical data, and future data may not follow the same distribution as the past, we perform online data augmentation. Online data augmentation consists of two parts: pre-training and fine-tuning. First, we pre-train the neural network model using historical data to obtain a model capable of predicting $f_{L_u}$, $f_p$, and $f_{P_g}$. Second, we fine-tune the neural network using the accumulated online data. Specifically, in the fine-tuning process, we employ partial parameter fine-tuning to obtain the refined networks for $f_{L_u}$, $f_p$, and $f_{P_g}$.
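Partial parameter fine-tuning can be illustrated on a toy model. The sketch below assumes a two-layer scalar network whose first layer is kept frozen while only the last layer is updated by stochastic gradient descent on newly accumulated data. The model, the learning rate, and the synthetic "drifted" data are hypothetical and much simpler than the forecasting networks in the paper.

```python
import math


def predict(x, w_frozen, w_last, b_last):
    """Two-layer toy model: a frozen tanh feature followed by a linear head."""
    return w_last * math.tanh(w_frozen * x) + b_last


def fine_tune_head(data, w_frozen, w_last, b_last, lr=0.05, epochs=200):
    """Update only the last layer (w_last, b_last) with SGD on squared error."""
    for _ in range(epochs):
        for x, y in data:
            feat = math.tanh(w_frozen * x)
            err = (w_last * feat + b_last) - y
            w_last -= lr * 2.0 * err * feat    # gradient w.r.t. w_last
            b_last -= lr * 2.0 * err           # gradient w.r.t. b_last
    return w_last, b_last


def mse(data, w_frozen, w_last, b_last):
    return sum((predict(x, w_frozen, w_last, b_last) - y) ** 2
               for x, y in data) / len(data)


# Hypothetical "pre-trained" weights and drifted online data.
w1, w2, b2 = 1.0, 1.0, 0.0
online = [(x / 10.0, 1.5 * math.tanh(x / 10.0) + 0.3) for x in range(-10, 11)]
before = mse(online, w1, w2, b2)
w2_new, b2_new = fine_tune_head(online, w1, w2, b2)
after = mse(online, w1, w2_new, b2_new)
```

Because the frozen layer is never touched, each online update is cheap, which matters when the models are refreshed repeatedly during scheduling.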

Rolling-Horizon Feedback Correction
In addition to updating the prediction models online, we also employ the rolling-horizon control technique. In the optimization process, we re-solve the optimization problem every horizon $H$ (to incorporate the latest prediction models while trading off computational time). This operation is repeated throughout the scheduling period.

Dataset
We conducted experiments on building energy management using a real-world dataset from Fontana, California. The dataset includes one year of electricity scheduling for 17 buildings, including their electricity demand, solar power generation, and weather conditions. This dataset was also used for the NeurIPS 2022 Challenge. With our proposed framework, we achieved the global championship in the competition [20].

Metric
We follow the evaluation setup of the competition. The 17 buildings are divided into visible data (5 buildings) and invisible data (12 buildings). The visible data are used as the training set, while the invisible data include the validation set and the testing set. The visible data contain all labels, including user load demand and solar generation over a year. The labels of the invisible data can only be evaluated through limited interactions with the competition organizers' open API. The final leaderboard ranking is based on the overall performance of the model on all datasets. The evaluation metrics include carbon emissions, electricity cost, and grid stability. Specifically, the electricity consumption of each building $i$ is calculated as $E_{i,t} = L_{i,t} - P_{i,t} + X_{i,t}$, where $L_{i,t}$ represents the load demand at timestamp $t$, $P_{i,t}$ represents the solar power generation of the building, and $X_{i,t}$ represents the electricity dispatch value provided by the model. The electricity consumption of the entire district is the sum of $E_{i,t}$ over all buildings. Using the above notation, three metrics are defined: $C_{Emission}$, $C_{Price}$, and $C_{Grid}$.
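The building-level and district-level consumption can be computed as in the sketch below. The per-building formula follows the text; the cost function is only a hedged illustration of a metric normalized against the no-storage baseline, and billing only positive grid imports is an assumption of this sketch rather than the competition's exact definition. All numbers are made up.

```python
def net_consumption(load, solar, dispatch):
    """E_{i,t} = L_{i,t} - P_{i,t} + X_{i,t} for one building."""
    return [l - p + x for l, p, x in zip(load, solar, dispatch)]


def district_consumption(buildings):
    """Sum the per-building series step by step."""
    return [sum(step) for step in zip(*buildings)]


def normalized_cost(district, baseline, prices):
    """Cost with storage divided by cost without storage.

    Billing only positive grid imports is an assumption of this sketch.
    """
    cost = sum(max(e, 0.0) * p for e, p in zip(district, prices))
    base = sum(max(e, 0.0) * p for e, p in zip(baseline, prices))
    return cost / base


loads = [[2.0, 2.0], [1.0, 3.0]]          # two buildings, two time steps
solar = [[1.0, 0.0], [1.0, 0.0]]
dispatch = [[1.0, -1.0], [0.0, -1.0]]     # charge early, discharge late
no_storage = [[0.0, 0.0], [0.0, 0.0]]

with_storage = district_consumption(
    [net_consumption(l, p, x) for l, p, x in zip(loads, solar, dispatch)])
without_storage = district_consumption(
    [net_consumption(l, p, x) for l, p, x in zip(loads, solar, no_storage)])
ratio = normalized_cost(with_storage, without_storage, prices=[1.0, 2.0])
```

A ratio below 1 means the dispatch lowered the cost relative to not using the storage.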

Baseline
To evaluate the proposed MAPPO, Optimization, and their Ensemble method, we compare them with the following baseline methods:
• RBC: Rule-Based Control method. We tested several strategies and selected the best one: charging the battery by 10% of its capacity between 10 a.m. and 2 p.m., followed by discharging it by the same amount between 4 p.m. and 8 p.m.
• MPC [26]: A classical Model-Predictive-Control method. A GBDT-based model [27] is used to predict future features, and a deterministic optimization is used for daily scheduling.
Moreover, after the competition, we also compared against the proposals of several top-ranked contestants:
• AMPC [26]: An adaptive Model-Predictive-Control method.
• SAC [28]: A Soft Actor-Critic method that uses all agents with decentralization.
• ES [29]: Evolution-Strategy method with adaptive covariance matrix.
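The RBC baseline rule described above can be sketched in a few lines. The text does not state whether the 10% applies per hour or over the whole window, nor whether the window ends are inclusive; this sketch assumes a per-hour rate and exclusive window ends, and the function name is hypothetical.

```python
def rbc_action(hour, rate=0.10):
    """Return the battery action as a fraction of capacity for a given hour.

    Positive values charge, negative values discharge. Treating the window
    ends as exclusive and applying the rate per hour are assumptions.
    """
    if 10 <= hour < 14:      # 10 a.m. to 2 p.m.: charge
        return rate
    if 16 <= hour < 20:      # 4 p.m. to 8 p.m.: discharge
        return -rate
    return 0.0


daily_plan = [rbc_action(h) for h in range(24)]
```

The plan charges and discharges the same total amount, so the battery returns to its starting level at the end of each day (ignoring losses).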

Implementations
The environment simulator used for reinforcement learning and the evaluation process are provided by the competition organizers [30]. The deep learning networks are implemented using PyTorch. The optimization problems are solved with our self-developed solver MindOpt [31]. All experiments are conducted on a server with eight Nvidia Tesla V100 GPUs.

Results
If only one metric is considered, a method can perform very well on that metric alone. Therefore, the final effect needs to be judged by the average value of the three metrics. In Table 1, 'Emission', 'Price', and 'Grid' denote the metrics $C_{Emission}$, $C_{Price}$, and $C_{Grid}$, respectively. Since the performance is compared against not using the storage, a lower value indicates better performance. Our proposed MAPPO method and Optimization method both achieve better results than the other competitors.
As shown in Table 1, each individual model has limited performance. By combining reinforcement learning and optimization, we achieve the best results. By observing the validation dataset, we found that reinforcement learning and optimization take turns performing better in different months. To leverage their respective advantages, we fuse their results by month to create a yearly schedule (named Ensemble), ultimately obtaining the best outcome. In addition, all model calculations are completed within 30 min to generate the schedule for the next year.
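The month-wise fusion can be sketched as a simple selection rule: for each month, keep the schedule of whichever method scored better on the validation data. The scores, the tie-breaking rule (ties go to optimization), and the placeholder schedule labels below are hypothetical.

```python
def select_by_month(val_scores_rl, val_scores_opt):
    """Pick, for each month, the method with the lower validation score."""
    return ["rl" if rl < opt else "opt"
            for rl, opt in zip(val_scores_rl, val_scores_opt)]


def build_yearly_schedule(choice, schedule_rl, schedule_opt):
    """Stitch monthly schedules together according to the per-month choice."""
    return [schedule_rl[m] if c == "rl" else schedule_opt[m]
            for m, c in enumerate(choice)]


val_rl = [0.80, 0.90, 0.85]     # hypothetical monthly scores, lower is better
val_opt = [0.85, 0.88, 0.85]
choice = select_by_month(val_rl, val_opt)
yearly = build_yearly_schedule(choice,
                               ["rl_jan", "rl_feb", "rl_mar"],
                               ["opt_jan", "opt_feb", "opt_mar"])
```

Selecting on validation scores rather than test scores avoids tuning the ensemble on the data used for the final comparison.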

Ablation Studies
We conducted ablation studies on some modules to understand their contributions to the overall performance.

Analysis of Online Data Augmentation
We compare the performances of different online updating methods, as shown in Figure 6:
• No-Ft: no fine-tuning on online data;
• Self-Adapt: adaptive linear correction by minimizing the mean squared error between the historical value and the predicted value;
• Scratch: re-learning from scratch;
• Small-LR: continuous learning with a smaller learning rate;
• Freeze: continuous learning with online data while freezing the weights of the first few layers and only updating the last layer.
To compare the efficiency of the models, we evaluate the average execution time of real-time scheduling within 24 h. Results show that fine-tuning with a smaller learning rate has advantages in terms of efficiency and effectiveness.
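As an illustration of the Self-Adapt variant, the sketch below fits a linear correction `a * predicted + b` by ordinary least squares, which minimizes the mean squared error between the corrected predictions and the observed values. This is one straightforward reading of "adaptive linear correction"; the function names and data are hypothetical.

```python
def fit_linear_correction(predicted, actual):
    """Least-squares a, b such that actual is approximately a * predicted + b."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((p - mean_p) * (y - mean_a) for p, y in zip(predicted, actual))
    a = cov / var_p if var_p > 0 else 1.0   # fall back to a pure offset
    b = mean_a - a * mean_p
    return a, b


def apply_correction(predicted, a, b):
    return [a * p + b for p in predicted]


history_pred = [1.0, 2.0, 3.0, 4.0]       # past forecasts
history_true = [3.0, 5.0, 7.0, 9.0]       # what was actually observed
a, b = fit_linear_correction(history_pred, history_true)
corrected = apply_correction([5.0], a, b)
```

The correction needs no gradient steps, so it is cheap to refresh as new observations arrive.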

Analysis of Forecasting Models
As shown in Table 2, we evaluated different forecasting models. The evaluation metrics include overall scheduling performance, execution time, and forecasting performance measured by the weighted mean absolute percentage error (WMAPE). The experimental results indicate that the RNN model with online fine-tuning achieves the best performance.
In stochastic optimization, the number of scenarios is a very important parameter. As shown in Figure 7, as the number of scenarios increases, the effectiveness of the model also gradually increases. This is expected, as a model that covers more scenarios tends to perform better.
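For reference, the WMAPE used above to score the forecasts is commonly computed as the total absolute error divided by the total absolute actual value. The sketch below follows that standard definition; the sample numbers are made up.

```python
def wmape(actual, predicted):
    """Weighted MAPE: total absolute error divided by total absolute actuals."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    total_actual = sum(abs(a) for a in actual)
    return total_error / total_actual


score = wmape([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

Unlike the plain MAPE, this form stays well defined when individual actual values are close to zero, as long as their sum is not.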

Conclusions
The challenge of power grid scheduling lies in the complexity of long-term decision-making. Through our research, we have learned that achieving end-to-end learning with a single strategy is difficult for such complex problems. We have identified that future load and solar energy generation are key information for decision-making. Our results show that using pre-trained auxiliary tasks to learn representation and prediction ahead of optimization and reinforcement learning outperforms directly feeding all the data into the decision model. By employing optimization and multi-agent reinforcement learning algorithms for decision-making, we have found that the optimization algorithm achieves better generalization on an unknown dataset through target approximation, data augmentation, and rolling-horizon correction. On the other hand, multi-agent reinforcement learning better models the problem and finds better solutions on a known dataset. The issue of data augmentation to improve generalization in energy management tasks warrants further research. We have also observed that the policies learned by the optimization algorithm and reinforcement learning perform differently in different months, which has motivated us to explore ensemble learning approaches. We leave the ensemble of forecasting models as future work.

Figure 1. The micro-grid network framework. Green arrows denote solar power sharing among the micro-grid buildings, and orange lines indicate how the micro-grid obtains power from the national grid.

Figure 5. The whole optimization method framework. Subplot (a) represents the flowchart of prediction based on deep learning. The output of the prediction module serves as the input for the stochastic optimization module, as shown in (b). During the scheduling process, real-time data accumulate over time, and we update the predictions based on the real data, as demonstrated in (c), named the online data augmentation module. This framework enhances the robustness of scheduling under uncertain conditions.

Figure 6. Analysis of online data augmentation: performance and execution time under various settings.

Figure 7. Effect of different numbers of scenarios N. The curve denotes the expected value, while the shaded area is the standard deviation of the stochastic samples.

Table 1. Comparison of the performances of all methods across all buildings. All values are normalized against the simple baseline without strategy, i.e., not using the storage. Therefore, a lower value indicates a better performance.

Table 2. Analysis of different forecasting models, including scheduling performance, forecasting performance, execution time, and updating methods.