Asymmetric Loss Functions for Contract Capacity Optimization

For high-voltage and extra-high-voltage consumers, the electricity cost depends not only on the power consumed but also on the contract capacity. For the same amount of power consumed, the smaller the difference between the contract capacity and the power consumed, the smaller the electricity cost. Thus, predicting the future power demand for setting the contract capacity is of great economic interest. In the literature, most works predict the future power demand based on a symmetric loss function, such as mean squared error. However, the electricity pricing structure is asymmetric with respect to under- and overestimation of the actual power demand. In this work, we propose several loss functions derived from the asymmetric electricity pricing structure. We experimented with the Long Short-Term Memory neural network with these loss functions using a real dataset from a large manufacturing company in the electronics industry in Taiwan. The results show that the proposed asymmetric loss functions outperform the commonly used symmetric loss function, with a saving on the electricity cost ranging from 0.88% to 2.42%.


Introduction
Predicting power demand is vital for power companies to plan the production of electricity. The overproduction of electricity not only increases the production cost and the electricity attrition rate but also accelerates the damage to the power equipment and the pollution to the environment. In contrast, the underproduction of electricity could result in energy rationing or even interruption of the power supply. Thus, the prediction of power demand has received much attention from both academics and practitioners in the electric power industry [1][2][3][4][5][6].
Various technologies have been proposed in the literature for forecasting power demand and optimizing energy consumption. For example, Building Energy Simulation, hybrid approaches and statistical-based tools were applied to building energy simulation models [7,8]. Furthermore, a smart grid co-simulation software platform between the energy management systems and a building energy model was proposed to decouple the control algorithms and the building [9]. In [10], a linear programming model and an energy system cost-optimization model were used to forecast the long-term power demand and CO2 emissions. In [11], time-series forecasting models were adopted to forecast the long-term power demand in a scenario of energy transition from fossil fuels to carbon-free sources. In [12], an energy consumption scheduling device was used on the customer site to achieve autonomous demand response with the goal of controlling the customer's flexible demand and optimizing the cost. For a recent review of state-of-the-art load forecasting techniques, see [13].
The design of contract capacity in the electric power market is an effective means to control the power demand of high-voltage and extra-high-voltage consumers through demand-side management. A high-voltage or extra-high-voltage customer signs a contract with a power company to purchase a certain amount of electricity (referred to as "contract capacity") for the next month. If the customer uses more electricity than the contract capacity during the next month, he/she needs to purchase the excess amount of electricity at a higher rate. However, if the customer uses less electricity than the contract capacity during the next month, he/she still has to pay for the remaining unused part of the contract capacity. Thus, it is advantageous for the customer to set the contract capacity near his/her actual power demand to avoid paying electricity at a higher rate. Additionally, because these high-voltage and extra-high-voltage consumers account for a large proportion of total electricity consumption, the power company can use their contract capacities to improve the prediction of the total power demand.
High-voltage and extra-high-voltage consumers need to predict their future power demand to decide their contract capacity. Previous work on predicting the future power demand is usually based on a symmetric loss function, such as mean squared error (MSE), to build the prediction models [5,6]. However, for high-voltage and extra-high-voltage consumers, their electricity pricing structure is asymmetric to the contract capacity. Specifically, underestimation (i.e., contract capacity < actual power demand) usually results in a higher per unit electricity cost than overestimation (i.e., contract capacity > actual power demand) does, for the same amount of difference between the actual power demand and the contract capacity. The motivation of this study is to adopt an asymmetric loss function to reflect the asymmetric pricing structure in the prediction model to minimize the electricity cost.
In this paper, we study the problem of determining the contract capacity to minimize the electricity cost for high-voltage and extra-high-voltage consumers. Since electricity cost usually accounts for a significant expense to high-voltage and extra-high-voltage consumers, even a small percentage of reduction in electricity expense could significantly reduce their overall operation cost, making them more competitive in today's market. In the literature, Long Short-Term Memory (LSTM) has been shown to avoid the vanishing gradient problem in recurrent neural networks and to yield excellent performance for time series prediction [6,14,15]. Thus, we use an LSTM neural network with an asymmetric loss function to build the prediction model for contract capacity, where the model is trained using the monthly electricity consumption data. The asymmetric loss function is derived from the pricing structure for high-voltage and extra-high-voltage consumers of Taiwan Power Company. Our performance study using a real dataset shows that the model based on the asymmetric loss function outperforms the model based on the commonly used symmetric loss function, MSE.
The rest of this paper is organized as follows. Section 2 reviews previous work on electricity load forecasting and optimization of contract capacity. Section 3 formally defines the problem of determining the contract capacity to minimize the electricity cost. Section 4 proposes six asymmetric loss functions derived from the electricity pricing structure. Section 5 presents our performance study, and Section 6 concludes this paper and gives directions for future research.

Related Work
The problem of determining the contract capacity to minimize the electricity cost is closely tied to the problem of predicting the future power demand. If we can predict the future power demand accurately, we can set the contract capacity accordingly to avoid paying electricity at a higher rate. Based on the scale of the forecast horizon, predicting the future power demand can be divided into four categories: very short-term, short-term, medium-term, and long-term. Very short-term and short-term load forecasting focuses on the prediction of hourly or daily load for one hour to four weeks ahead [2,4,14,16]. Medium-term forecasting aims to predict monthly load for 1 to 12 months ahead, and long-term load forecasting aims to predict yearly load for 1 to 20 years ahead [16]. Since contract capacity can be set on a month-to-month basis, medium-term load forecasting fits the requirement of determining the contract capacity.
In the literature, many forecasting methods have been proposed for electricity load forecasting [4,5]. For example, Seasonal Autoregressive Integrated Moving Average model (SARIMA) and feed-forward back-propagation neural networks were used to predict the power demand in the Turkish electricity market [17]. Among these forecasting methods, artificial neural networks, especially recurrent neural networks (RNNs), are the most widely used. RNN considers the ordering among data, making it very suitable for time series prediction. In [18], empirical mode decomposition on a sliding window was used to select features (including power demand features and weather-related features). Then, an Elman network (a variant of RNNs) was trained to predict the future power demand, where the weightings in the network were optimized by a population-based heuristic search algorithm.
However, vanilla RNN suffers from the vanishing gradient and the exploding gradient problems, making it unable to retain the useful memory about the data exhibited earlier. To mitigate this problem, LSTM adds more gates and links to an RNN node such that useful (or useless) memory about past data can be remembered (or forgotten). LSTM has been applied to many problems where time ordering of data is crucial, e.g., sequence translation [19], human activity recognition [20], hyperspectral image classification [21], and electricity load forecasting [6,14]. In [14], a prediction model for the electricity load of a day is constructed using the data from similar days in the past. First, the data from similar days are decomposed into several intrinsic mode functions (IMFs) using empirical mode decomposition. Then, an LSTM prediction model is built for each IMF, and finally, the load prediction is formed by combining the predictions from all these LSTM models. In [6], LSTM and the genetic algorithm are integrated for short-term load forecasting.
Because the electricity cost is minimized when the actual power demand is equal to the contract capacity, we can first predict the power demand for next month and then use the predicted value as the contract capacity for next month. However, most of the previous works on load forecasting aim to minimize the error between the predicted and the actual power demands without considering the pricing structure of electricity. Thus, the costs of underestimation and overestimation are symmetric. Consequently, they use a symmetric loss function, such as MSE, both to adjust the prediction model and to evaluate the performance of the predictions. However, as described in Section 1, the electricity pricing structure for high-voltage and extra-high-voltage consumers is asymmetric to the contract capacity. For the goal of minimizing the electricity cost, using an asymmetric loss function is more appropriate than using MSE.
There are two scenarios where an asymmetric loss function is often adopted. The first scenario is when the available dataset is imbalanced, e.g., in medical imaging applications [22]. The second scenario is when the loss of the underlying problem is asymmetric. One example is remaining useful life (RUL) estimation in Prognostics and Health Management (PHM), where underestimating the RUL of a component only results in the waste of replacing the component too early, while overestimating the RUL may cause detrimental effects to the machinery [23]. Other examples include the asymmetric loss on wind speed and power predictions [24,25] and oil price prediction [26]. In this paper, we focus on the problem of setting the contract capacity to minimize the cost. Most of the existing methods for load forecasting did not consider the electricity pricing structure, which is asymmetric with respect to under- and overestimation. As a result, an unbiased function such as MSE was often adopted to train the model and to evaluate the prediction performance. Consequently, the same amount of underestimation and overestimation results in the same cost using MSE, which contradicts the electricity pricing structure.
The problem of contract capacity optimization was studied in [27][28][29][30]. Given the monthly power demand for the past twelve months and the electricity pricing structure, the problem is to determine the contract capacity for each of these past twelve months such that the total electricity cost for these twelve months is minimized. This problem considers more facets in the electricity pricing structure than just the per unit electricity cost. For example, the electricity pricing structure also includes the expanding construction fee for those months of increasing the contract capacity. As a result, simply setting the contract capacity to the actual power demand does not necessarily minimize the total cost. This problem can be formulated as an optimization problem, and linear programming [27] and metaheuristic algorithms [28,29] have been used to derive or search the optimal contract capacity for each month. Notably, this problem assumes that the real power demand is known, making it different from the problem studied in the current paper, where the real power demand is unknown, and the goal is to determine the contract capacity to minimize the electricity cost for future months.

Problem Formulation
This paper considers the problem of determining the next month's contract capacity for a high-voltage or extra-high-voltage consumer such that his/her electricity cost can be minimized. We adopt the electricity pricing structure of Taiwan Power Company and focus on the determination of peak contract capacity [27]. Let $R$ be the basic per unit electricity cost, and let $\hat{x}_i$ and $x_i$ denote the contract capacity and the actual power demand for month $i$ of a consumer, respectively.
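If the actual demand $x_i$ is less than the contract capacity $\hat{x}_i$, then the customer has to pay a fixed capacity charge $R\hat{x}_i$. If the actual demand is greater than the contract capacity, then the excess demand within 10% of the contract capacity is charged at twice the basic rate $R$, and the excess demand over 10% of the contract capacity is charged at three times the basic rate. Thus, the customer's electricity cost for month $i$ (denoted by $C_i$) can be calculated as follows.
$$
C_i =
\begin{cases}
R\hat{x}_i, & \text{if } x_i \le \hat{x}_i \\
R\hat{x}_i + 2R\,(x_i - \hat{x}_i), & \text{if } \hat{x}_i < x_i \le 1.1\,\hat{x}_i \\
R\hat{x}_i + 2R\,(0.1\,\hat{x}_i) + 3R\,(x_i - 1.1\,\hat{x}_i), & \text{if } x_i > 1.1\,\hat{x}_i
\end{cases}
\tag{1}
$$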
Ideally, if $\hat{x}_i = x_i$, then the consumer's electricity cost for month $i$ equals $Rx_i$, which is the optimal electricity cost for the power demand $x_i$. If the contract capacity $\hat{x}_i$ is an over- or underestimation of the actual demand $x_i$, then a penalty is imposed on the electricity cost. The penalty can be calculated as the electricity cost $C_i$ minus the optimal electricity cost $Rx_i$. Let $P_{C_i}$ denote the penalty on the customer's electricity cost for month $i$ due to under- or overestimation of the actual demand. Because the basic per unit electricity cost $R$ is a constant, we further divide $C_i - Rx_i$ by $R$ to make the value of $P_{C_i}$ independent of the value of $R$, as shown below.
$$
P_{C_i} = \frac{C_i - Rx_i}{R} =
\begin{cases}
\hat{x}_i - x_i, & \text{if } x_i \le \hat{x}_i \\
x_i - \hat{x}_i, & \text{if } \hat{x}_i < x_i \le 1.1\,\hat{x}_i \\
2x_i - 2.1\,\hat{x}_i, & \text{if } x_i > 1.1\,\hat{x}_i
\end{cases}
\tag{2}
$$
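To make the pricing rule concrete, the following is a minimal Python sketch of the monthly cost and the cost penalty described above. It is an illustration written from the prose description of the tariff, not the authors' code; the function names `electricity_cost` and `cost_penalty` and the default `rate=1.0` are assumptions introduced here.

```python
def electricity_cost(demand, capacity, rate=1.0):
    """Monthly cost C_i for actual demand x_i and contract capacity x_hat_i.

    Hypothetical helper: follows the tariff as described in the text
    (capacity charge, 2x rate within 10% excess, 3x rate beyond it).
    """
    if demand <= capacity:
        # Unused capacity is still paid for: fixed capacity charge only.
        return rate * capacity
    excess = demand - capacity
    band = 0.1 * capacity                # excess allowed at the 2x rate
    within = min(excess, band)           # part of the excess charged at 2R
    beyond = max(excess - band, 0.0)     # part of the excess charged at 3R
    return rate * capacity + 2 * rate * within + 3 * rate * beyond


def cost_penalty(demand, capacity, rate=1.0):
    """Penalty P_C_i = (C_i - R * x_i) / R, independent of the rate R."""
    return (electricity_cost(demand, capacity, rate) - rate * demand) / rate


# Same demand, different contract capacities (over- and underestimation).
for capacity in (1.8, 1.5, 1.4, 1.2):
    print(capacity, round(cost_penalty(1.5, capacity), 4))
```

The penalty is zero only when the contract capacity equals the demand, and it grows faster on the underestimation side once the 10% band is exceeded.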
Based on Equation (2), we define the problem under study as follows. Problem Definition. Given the monthly power demand up to, but not including, month $i-1$, predict the contract capacity $\hat{x}_i$ for month $i$ such that the penalty $P_{C_i}$ for month $i$ is minimized. Notably, when we need to decide the contract capacity for month $i$, the actual power demands for both months $i$ and $i-1$ are still unknown. Thus, $x_i$ and $x_{i-1}$ cannot be used to predict $\hat{x}_i$. In other words, this problem has a forecast horizon of 2.
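To show the asymmetric pricing structure of electricity, we divide the electricity cost $C_i$ by the actual demand $x_i$ to yield the consumer's per unit electricity cost for month $i$ (denoted as $R_i$), as follows.
$$
R_i = \frac{C_i}{x_i} =
\begin{cases}
R\,\dfrac{\hat{x}_i}{x_i}, & \text{if } x_i \le \hat{x}_i \\
R\left(2 - \dfrac{\hat{x}_i}{x_i}\right), & \text{if } \hat{x}_i < x_i \le 1.1\,\hat{x}_i \\
R\left(3 - 2.1\,\dfrac{\hat{x}_i}{x_i}\right), & \text{if } x_i > 1.1\,\hat{x}_i
\end{cases}
\tag{3}
$$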
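Then, the penalty (denoted as $P_{R_i}$) on the customer's per unit electricity cost for month $i$ can be calculated as the per unit electricity cost $R_i$ minus the basic per unit electricity cost $R$. Similar to Equation (2), we further divide $R_i - R$ by $R$ to make the value of $P_{R_i}$ independent of the value of $R$ (see Equation (4)).
$$
P_{R_i} = \frac{R_i - R}{R} =
\begin{cases}
\dfrac{\hat{x}_i}{x_i} - 1, & \text{if } \dfrac{\hat{x}_i}{x_i} \ge 1 \\
1 - \dfrac{\hat{x}_i}{x_i}, & \text{if } \dfrac{1}{1.1} \le \dfrac{\hat{x}_i}{x_i} < 1 \\
2 - 2.1\,\dfrac{\hat{x}_i}{x_i}, & \text{if } \dfrac{\hat{x}_i}{x_i} < \dfrac{1}{1.1}
\end{cases}
\tag{4}
$$
Notably, the if-conditions of Equation (4) are written in terms of the ratio $\hat{x}_i/x_i$. The solid blue line in Figure 1 shows the values of $P_{R_i}$ as a function of $\hat{x}_i/x_i$. Once the underestimation exceeds the 10% band (i.e., $\hat{x}_i/x_i < 1/1.1$), the magnitude of the slope of $P_{R_i}$ rises from 1 to 2.1, i.e., the speed of increasing penalty is more than double the original speed.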
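The asymmetry of the per unit penalty can be checked numerically. The sketch below expresses the penalty as a function of the ratio of contract capacity to actual demand, obtained by dividing the cost penalty of Section 3 by the demand; the function name `per_unit_penalty` and its single `ratio` argument are illustrative assumptions, not notation from the paper.

```python
def per_unit_penalty(ratio):
    """Per unit penalty P_R_i as a function of ratio = x_hat_i / x_i.

    Hypothetical helper derived from the tariff described in the text.
    """
    if ratio >= 1.0:
        # Overestimation: unused capacity is paid at the basic rate.
        return ratio - 1.0
    if ratio >= 1.0 / 1.1:
        # Underestimation within the 10% band: excess charged at 2R.
        return 1.0 - ratio
    # Underestimation beyond the 10% band: excess charged at 3R.
    return 2.0 - 2.1 * ratio


# Equal-sized deviations on both sides of ratio = 1.
for deviation in (0.05, 0.2):
    under = per_unit_penalty(1.0 - deviation)
    over = per_unit_penalty(1.0 + deviation)
    print(deviation, round(under, 4), round(over, 4))
```

For a small deviation the two sides give the same penalty, while for a large deviation the underestimation side is penalized more heavily.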

Proposed Loss Functions for Contract Capacity Prediction
In machine learning, a loss function can be used to adjust a prediction model to fit the training data. For example, an artificial neural network uses a loss function to calculate the difference between the predicted and actual values of the training data. It then back-propagates the difference to fine-tune the weights of the links in the network. The most commonly used loss function is MSE, which is symmetric with respect to under- and overestimation. In this study, we use the monthly contract capacity as the predicted values for the actual power demand, so MSE can be calculated as follows, where $n$ denotes the number of instances in the training data.
$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{x}_i\right)^2
\tag{5}
$$
In the rest of this section, we propose several asymmetric loss functions that are derived from the electricity pricing structure described in Section 3. Later, in Section 5, we experiment with these loss functions and compare their performance against that of MSE.
As discussed in Section 3, the electricity pricing structure is asymmetric with respect to under- and overestimation of the actual power demand. Thus, a loss function that takes into account this asymmetric pricing structure is more appropriate for the machine learning algorithm to optimize the electricity cost. In Section 3, we derived the penalty of electricity cost $P_{C_i}$ in Equation (2) and the penalty of per unit electricity cost $P_{R_i}$ in Equation (4). In Equations (6) and (7), we define two loss functions, $L_C$ and $L_R$, by calculating the average of $P_{C_i}$ and $P_{R_i}$ over the training data, respectively.
$$
L_C = \frac{1}{n}\sum_{i=1}^{n} P_{C_i}
\tag{6}
$$
$$
L_R = \frac{1}{n}\sum_{i=1}^{n} P_{R_i}
\tag{7}
$$
Both $L_C$ and $L_R$ are zero when the predicted values equal the actual values. Furthermore, underestimation incurs more penalty to both $L_C$ and $L_R$ than overestimation does.
The MSE loss function increases slowly when the deviation between the predicted value and the actual value is small, but quickly when the deviation gets large. We can modify Equations (6) and (7) as follows to achieve a similar effect, which yields the squared versions $L_{C2}$ and $L_{R2}$.
$$
L_{C2} = \frac{1}{n}\sum_{i=1}^{n} P_{C_i}^{\,2}
\tag{8}
$$
$$
L_{R2} = \frac{1}{n}\sum_{i=1}^{n} P_{R_i}^{\,2}
\tag{9}
$$
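The loss functions built from the cost penalty and the per unit penalty can be sketched in plain Python as follows. This is an illustrative reference implementation for checking values by hand, not the Keras loss used in the experiments; the function names and the `power` argument (1 for the averaged penalties, 2 for the squared versions, assuming the squared versions average the squared penalties) are assumptions introduced here.

```python
def cost_penalty(actual, predicted):
    """Cost penalty for one month, with the basic rate R set to 1."""
    if actual <= predicted:
        return predicted - actual               # overestimation
    if actual <= 1.1 * predicted:
        return actual - predicted               # underestimation within 10%
    return 2.0 * actual - 2.1 * predicted       # underestimation beyond 10%


def loss_c(actuals, predictions, power=1):
    """Average cost penalty; power=2 gives the assumed squared version."""
    penalties = [cost_penalty(x, p) ** power
                 for x, p in zip(actuals, predictions)]
    return sum(penalties) / len(penalties)


def loss_r(actuals, predictions, power=1):
    """Average per unit penalty; power=2 gives the assumed squared version."""
    penalties = [(cost_penalty(x, p) / x) ** power
                 for x, p in zip(actuals, predictions)]
    return sum(penalties) / len(penalties)


# A deviation of 0.3 below the demand costs more than 0.3 above it.
print(loss_c([1.5], [1.2]), loss_c([1.5], [1.8]))
```

Training a neural network with such a loss would additionally require expressing the piecewise rule with differentiable tensor operations of the chosen framework, which this sketch does not attempt.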
As shown in Figure 1, the electricity pricing structure is symmetric about the line $\hat{x}_i/x_i = 1$ when $\hat{x}_i/x_i$ is close to 1. We can modify the penalty of electricity cost $P_{C_i}$ in Equation (2) and the penalty of per unit electricity cost $P_{R_i}$ in Equation (4) to remove this symmetric portion and aggravate the penalty of underestimation. Let $M_{C_i}$ denote the modified penalty of electricity cost for month $i$ (Equation (10)), and let $M_{R_i}$ denote the modified penalty of per unit electricity cost for month $i$ (Equation (11)).
Notably, if $\hat{x}_i < x_i$ (i.e., $\hat{x}_i$ is an underestimation of $x_i$), then $M_{R_i} > P_{R_i}$, indicating a larger penalty using $M_{R_i}$ than using $P_{R_i}$, as shown by the red dashed line in Figure 1. Based on $M_{C_i}$ and $M_{R_i}$, another two loss functions, $L_{MC}$ and $L_{MR}$, are defined in Equations (12) and (13).
Equations (10) and (11) consistently incur more penalties to underestimation than to overestimation. In contrast, Equations (6) and (7) put more penalties on underestimation than on overestimation only when the prediction is far from the actual demand, i.e., for underestimation in the range of $\hat{x}_i/x_i < 1 - \frac{0.1}{1.1}$ and overestimation in the range of $\hat{x}_i/x_i > 1 + \frac{0.1}{1.1}$, to be exact.
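The two thresholds quoted above follow directly from the 10% band in the pricing rule. The short derivation below is a sketch based on the tariff of Section 3; the symbol $r_i = \hat{x}_i/x_i$ is shorthand introduced here for illustration only.

```latex
\begin{align*}
x_i > 1.1\,\hat{x}_i
  \;&\Longleftrightarrow\;
  r_i = \frac{\hat{x}_i}{x_i} < \frac{1}{1.1}
      = 1 - \frac{0.1}{1.1} \approx 0.909, \\
\text{mirror image about } r_i = 1:\quad
  r_i \;&>\; 2 - \frac{1}{1.1}
      = 1 + \frac{0.1}{1.1} \approx 1.091 .
\end{align*}
```

Between these two bounds the per unit penalty equals $|1 - r_i|$ on both sides of $r_i = 1$, so under- and overestimation of the same size are penalized equally there.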


Experiment Design
To evaluate the effectiveness of the proposed loss functions, we conducted a performance study using a real dataset from a large manufacturing company in the electronics industry in Taiwan. The dataset contains six time series, corresponding to the company's six power lines. Each time series in the dataset contains the monthly power demands of a power line for 50 consecutive months. Two preprocessing steps were adopted to protect data privacy. First, the monthly power demands in each time series were normalized to the range between 1 and 2. Second, the 50 consecutive months are numbered from 1 to 50, instead of indicating the exact months and years. The dataset after preprocessing is shown in Figure 2. Each of the six power lines has its own contract capacity. Thus, each time series was handled separately to build its own prediction model.
A time-series cross-validation approach was adopted in the experiment for performance evaluation [31]. For each time series $(x_1, x_2, \ldots, x_{50})$, a moving window of size 38 was used to yield 13 segments, where the $i$-th segment contains $(x_i, x_{i+1}, \ldots, x_{i+37})$. As illustrated in Figure 3, the window first covers the segment $(x_1, x_2, \ldots, x_{38})$, then the segment $(x_2, x_3, \ldots, x_{39})$, and so on, and finally the segment $(x_{13}, x_{14}, \ldots, x_{50})$. Then, for each segment $(x_i, x_{i+1}, \ldots, x_{i+37})$, $i = 1$ to 13, the first 36 elements $(x_i, x_{i+1}, \ldots, x_{i+35})$ were used as the training data (shown as blue circles in Figure 3) to build a prediction model, and the last element $x_{i+37}$ was used as the test data (shown as red circles in Figure 3) for performance evaluation. Because the problem under study (see Section 3) requires a forecast horizon of 2, a model using $x_{i+37}$ as the test data should not be trained with a dataset containing $x_{i+36}$. Thus, $x_{i+36}$ in the segment $(x_i, x_{i+1}, \ldots, x_{i+37})$ was used for neither training nor testing (shown as green circles in Figure 3). Notably, the seasonality of the electronics industry motivates us to train each model on a multiple of 12 months of data spanning more than two years. Because each of the time series only contains 50 months of data, we chose to use 36 months of data for training so that we can still retain sufficient data (i.e., $50 - 36 - 1 = 13$ months) for testing. Thus, the size of the moving window is set to 38 (i.e., 36 months for training, one month for the gap, and one month for testing).
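The moving-window split described above can be sketched as follows. This is a minimal illustration of the segmenting scheme only; the function name `make_segments` and its parameters are assumptions introduced here, and the series is replaced by the month numbers 1 to 50 so that the indices are easy to read.

```python
def make_segments(series, train_size=36, gap=1):
    """Split a series into (training data, test value) pairs.

    Each window holds train_size training months, `gap` skipped months
    (needed for the forecast horizon of 2), and one test month.
    """
    window = train_size + gap + 1
    segments = []
    for start in range(len(series) - window + 1):
        train = series[start:start + train_size]
        test = series[start + window - 1]   # last element of the window
        segments.append((train, test))
    return segments


months = list(range(1, 51))                 # stand-in for x_1 ... x_50
segments = make_segments(months)
print(len(segments))                        # number of segments
print(segments[0][0][0], segments[0][0][-1], segments[0][1])
print(segments[-1][0][0], segments[-1][0][-1], segments[-1][1])
```

With 50 months and a window of 38, the first segment trains on months 1 to 36 and tests on month 38, and the last segment trains on months 13 to 48 and tests on month 50.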
As shown in Figure 3, a time series in our dataset has 13 segments, and the last element of each segment is used as the test data, so the test data include the last 13 items (i.e., $x_{38}, x_{39}, \ldots, x_{50}$) of the time series. Assume that a machine learning scheme predicts the values of $x_{38}, x_{39}, \ldots, x_{50}$ as $\hat{x}_{38}, \hat{x}_{39}, \ldots, \hat{x}_{50}$. Then, by using the predicted values (i.e., $\hat{x}_{38}, \hat{x}_{39}, \ldots, \hat{x}_{50}$) as the contract capacities for their respective months, we can calculate the electricity costs for months 38 to 50 using Equation (1). Then, we evaluate the performance of the machine learning scheme using two performance measures, $F_{macro}$ and $F_{micro}$, both of which are derived from the difference between the electricity cost and the optimal electricity cost.
Notably, $Rx_i$ is the optimal electricity cost for month $i$, which occurs when $x_i = \hat{x}_i$. Measure $F_{macro}$ provides a macro view of the performance by using the total electricity cost (i.e., $\sum_{i=38}^{50} C_i$) and the optimal electricity cost (i.e., $\sum_{i=38}^{50} Rx_i$) over the entire period of the test data. In contrast, measure $F_{micro}$ provides a micro view of the performance by first calculating the penalty for each month in the period of the test data, and then taking their average. The smaller the values of $F_{macro}$ and $F_{micro}$, the better the performance. Without loss of generality, we set the basic per unit electricity cost $R$ to 1.
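The macro and micro views can be illustrated with a short sketch. The exact normalization of the two measures is not recoverable from the text above, so the code assumes, as the caption of Figure 4 suggests, that differences are expressed relative to the optimal cost; this normalization, the function names, and the toy numbers are all assumptions for illustration and are not results from the study.

```python
def f_macro(costs, optimal_costs):
    """Macro view: total excess cost relative to the total optimal cost.

    Assumed normalization (relative to the optimal cost).
    """
    total_optimal = sum(optimal_costs)
    return (sum(costs) - total_optimal) / total_optimal


def f_micro(costs, optimal_costs):
    """Micro view: average of the monthly relative excess costs.

    Assumed normalization (relative to each month's optimal cost).
    """
    monthly = [(c - o) / o for c, o in zip(costs, optimal_costs)]
    return sum(monthly) / len(monthly)


# Hypothetical two-month example, not data from the paper.
costs = [110.0, 200.0]
optimal = [100.0, 200.0]
print(f_macro(costs, optimal), f_micro(costs, optimal))
```

The two measures differ whenever the months have different demand levels, because the macro view weights each month by its cost while the micro view weights all months equally.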
In this experiment, we compared the performance of using LSTM with seven different loss functions: MSE, $L_C$ from Equation (6), $L_R$ from Equation (7), $L_{C2}$ from Equation (8), $L_{R2}$ from Equation (9), $L_{MC}$ from Equation (12), and $L_{MR}$ from Equation (13). For each loss function, $F_{macro}$ and $F_{micro}$ were calculated and compared for each time series in the dataset. Then, paired t-tests were conducted to check whether using the asymmetric loss functions $L_C$, $L_R$, $L_{C2}$, $L_{R2}$, $L_{MC}$ and $L_{MR}$ is significantly better than using the symmetric loss function MSE.
The learning algorithm was implemented using Keras (https://keras.io/) in Python. An LSTM neural network model was built with one hidden LSTM layer and one Dense output layer, where the loss function was set to one of the seven loss functions described earlier. A grid search process was used to determine the hyper-parameters of the model, where three settings ("adam", "rmsprop" and "nadam") for the optimizer, three settings (100, 200 and 300) for the number of training epochs, and four settings (1, 2, 5 and 10) for the batch size were explored. The final month of the training data was kept aside as the validation data, and the rest of the training data were used to train the model. The validation data were used to evaluate which combination of the hyper-parameters yields the best performance. Once the best combination of the hyper-parameters was determined, it was adopted to build the LSTM model with all training data. Then the resulting model was used to predict the test data.
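The grid search procedure can be outlined as below. This is a framework-agnostic sketch rather than the authors' Keras code: the callback `evaluate`, which would train a model on the fitting data with the given hyper-parameters and return its loss on the held-out validation month, is a hypothetical placeholder, as are the function and constant names.

```python
from itertools import product

OPTIMIZERS = ("adam", "rmsprop", "nadam")
EPOCHS = (100, 200, 300)
BATCH_SIZES = (1, 2, 5, 10)


def grid_search(train, evaluate):
    """Return the hyper-parameters with the lowest validation score.

    `evaluate(fit_data, validation, params)` is a hypothetical callback
    that trains a model and returns its loss on the validation month.
    """
    # Hold out the final training month for validation.
    fit_data, validation = train[:-1], train[-1]
    best_score, best_params = None, None
    for optimizer, epochs, batch_size in product(OPTIMIZERS, EPOCHS, BATCH_SIZES):
        params = {"optimizer": optimizer, "epochs": epochs,
                  "batch_size": batch_size}
        score = evaluate(fit_data, validation, params)
        if best_score is None or score < best_score:
            best_score, best_params = score, params
    return best_params
```

The grid contains 3 x 3 x 4 = 36 combinations; after the best one is found, the final model would be refitted on all training months, including the validation month.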

Experimental Results
Tables 1 and 2 show the values of $F_{macro}$ and $F_{micro}$ for using LSTM with different loss functions, respectively. In most cases, using an asymmetric loss function (i.e., $L_C$, $L_R$, $L_{C2}$, $L_{R2}$, $L_{MC}$ or $L_{MR}$) yields better results (i.e., smaller $F_{macro}$ and $F_{micro}$) than using the symmetric loss function MSE. There are only six cases where MSE performs better than an asymmetric loss function in terms of $F_{macro}$ (shown in italic in Table 1). Similarly, there are only six cases where MSE performs better than an asymmetric loss function in terms of $F_{micro}$, as shown in italic in Table 2. Both $F_{macro}$ and $F_{micro}$ show consistent results. Using MSE also yields the worst mean and the second worst standard deviation for both $F_{macro}$ and $F_{micro}$ among the seven loss functions tested. Overall, using an asymmetric loss function results in about a 1% to 2% reduction in $F_{macro}$ and $F_{micro}$ in comparison to using MSE. The results of each asymmetric loss function were compared against the results of MSE using a one-tailed paired t-test at significance level 0.05 to check whether the mean of $F_{macro}$ (or $F_{micro}$) is significantly smaller using an asymmetric loss function than using MSE. The results are shown in Table 3, and only the $F_{micro}$ using $L_{MC}$ or $L_{R2}$ is not significantly smaller than that using MSE. To examine the prediction performance for each month $i$ in the test set, Figure 4 shows the penalty $P_{C_i}$ on the customer's electricity cost for month $i$ due to under- or overestimation of the actual demand (see Equation (2) for details). The results of using $L_{MC}$ or $L_{R2}$ are excluded from Figure 4 for clarity and because their performance is not significantly different from that of MSE according to the t-test in Table 3. Notably, no loss function performs the best for every month in any series.
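The one-tailed paired t-test used for the comparison can be sketched with the standard library as follows. The numbers are hypothetical toy values, not the entries of Tables 1 to 3, and the helper names are assumptions; the critical value 2.015 is the one-tailed Student's t value for 5 degrees of freedom at level 0.05, which matches six paired series. In practice, SciPy's `scipy.stats.ttest_rel` with `alternative="less"` provides the same test with a p-value.

```python
import math
from statistics import mean, stdev

# One-tailed critical value of Student's t for df = 5 at level 0.05.
T_CRIT_DF5 = 2.015


def paired_t_statistic(candidate, baseline):
    """t statistic of the paired differences candidate - baseline."""
    diffs = [c - b for c, b in zip(candidate, baseline)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


def significantly_smaller(candidate, baseline, t_crit=T_CRIT_DF5):
    """True if the candidate mean is significantly below the baseline mean.

    The default critical value is only valid for six pairs (df = 5).
    """
    return paired_t_statistic(candidate, baseline) < -t_crit


# Hypothetical measure values for six series (illustration only).
baseline = [0.060, 0.052, 0.071, 0.048, 0.066, 0.058]
candidate = [0.045, 0.040, 0.050, 0.041, 0.049, 0.047]
print(significantly_smaller(candidate, baseline))
```

The test is paired because each loss function is evaluated on the same six time series, so the series-to-series variation cancels out in the differences.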
Energies 2019, 12
Figure 4. The x-axis is the month number of the 13 months in the test data, and the y-axis is the penalty $P_{C_i}$, which is calculated as the difference between the electricity cost and the optimal cost of month $i$ divided by the optimal cost of month $i$.


Conclusions
In this study, we built LSTM models with several asymmetric loss functions derived from the electricity pricing structure. The results show that, in most cases, using the proposed asymmetric loss functions yields better performance than using the symmetric loss function MSE. Specifically, as shown in Table 3, both F macro and F micro are significantly lower with L C , L R , L C2 or L MR than with MSE. However, using L R2 or L MC significantly improves only F macro and not F micro , and thus L R2 and L MC are not good choices for the problem under study.
Using L C performs better than using MSE for all six time series and achieves lower mean F macro and F micro values than any of the other loss functions (see Tables 1 and 2). Recall that L C is based on the electricity cost, while L R is based on the per unit electricity cost. Their simplified versions (i.e., L MC and L MR ) and squared versions (i.e., L C2 and L R2 ) do not show much improvement over the original versions (i.e., L C and L R ). Thus, a loss function that directly reflects the electricity cost is a good choice to start with.
The main contribution of this study is two-fold. First, we showed that using the LSTM model with a loss function consistent with the electricity pricing structure can reduce the electricity cost. Second, although we only experimented with the LSTM model in this study, the same idea can be easily adapted to other machine learning algorithms that use a loss function to adjust the prediction model during the learning process.
The electricity pricing structure may differ across countries and regions, and it remains unknown whether our results can be applied directly to other electricity pricing structures. However, most pricing structures are asymmetric with respect to under- and overestimation of the power demand. Although some pricing structures may impose a larger (or smaller) penalty on underestimating the power demand, adapting the loss function to the electricity pricing structure is a direction with great potential for reducing the electricity cost.
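To illustrate how a loss function can be adapted to a different degree of asymmetry, the sketch below uses a generic piecewise-linear penalty with separate weights for under- and overestimation. It is not one of the loss functions proposed in this paper: the function names, the weights and the demand values are hypothetical and chosen only to show that a heavier penalty on underestimation shifts the best contract capacity upward.

```python
def asymmetric_penalty(capacity, demand, under_weight=2.0, over_weight=1.0):
    """Relative penalty for one month under hypothetical weights.

    under_weight applies when the capacity is below the actual demand,
    over_weight applies otherwise.
    """
    if demand <= 0:
        raise ValueError("demand must be positive")
    gap = capacity - demand
    weight = over_weight if gap >= 0 else under_weight
    return weight * abs(gap) / demand


def asymmetric_loss(capacities, demands, under_weight=2.0, over_weight=1.0):
    """Mean asymmetric penalty over all months."""
    penalties = [
        asymmetric_penalty(c, d, under_weight, over_weight)
        for c, d in zip(capacities, demands)
    ]
    return sum(penalties) / len(penalties)


def best_constant_capacity(demands, candidates, under_weight, over_weight):
    """Pick the single capacity (from candidates) with the lowest mean loss."""
    return min(
        candidates,
        key=lambda c: asymmetric_loss(
            [c] * len(demands), demands, under_weight, over_weight
        ),
    )


# Hypothetical monthly demands and candidate contract capacities.
demands = [90.0, 100.0, 110.0]
candidates = range(80, 121)

symmetric_choice = best_constant_capacity(demands, candidates, 1.0, 1.0)
asymmetric_choice = best_constant_capacity(demands, candidates, 3.0, 1.0)
```

With equal weights the chosen capacity sits at the middle demand, while tripling the weight on underestimation moves it to the highest demand, which is the qualitative effect an asymmetric pricing structure is expected to have on the learned predictions.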
Several directions are worth pursuing in future research. First, we only experimented with the LSTM to build the prediction models; future research can explore other machine learning techniques to further improve the prediction performance. Second, a more sophisticated loss function could be developed that takes into account both the problem goal (i.e., lower electricity cost) and the convergence performance of the machine learning technique.

Conflicts of Interest:
The authors declare no conflict of interest.

R: Basic per unit electricity cost
x i , x i: Contract capacity and actual power demand for month i, respectively
C i , R i: Electricity cost and per unit electricity cost for month i, respectively
P C i: Penalty on the electricity cost for month i
P R i: Penalty on the per unit electricity cost for month i
M C i: Modified penalty of electricity cost for month i
M R i: Modified penalty of per unit electricity cost for month i
L C , L R , L C2 , L R2 , L MC , L MR: Loss functions based on P C i , P R i , (P C i ) 2 , (P R i ) 2 , M C i and M R i , respectively
F macro: Percentage of deviation from the optimal total electricity cost
F micro: Average of the percentage of deviation from the optimal monthly electricity cost