Hessian with Mini-Batches for Electrical Demand Prediction

The steepest descent method is frequently used for neural network tuning. Mini-batches are commonly used to get better tuning of the steepest descent in the neural network. Nevertheless, steepest descent with mini-batches could be delayed in reaching a minimum. The Hessian could be quicker than the steepest descent in reaching a minimum, and this goal is easier to achieve by using the Hessian with mini-batches. In this article, the Hessian is combined with mini-batches for neural network tuning. The discussed algorithm is applied for electrical demand prediction.


Introduction
Neural networks have many applications, like detection [1,2], recognition [3,4], classification [5,6], and prediction [7,8]. Steepest descent is a supervised algorithm frequently used for neural network tuning, wherein the values of the scale parameters are adjusted according to the cost map. Steepest descent evaluates the first-order partial derivatives of the cost map with respect to the scale parameters of the neural network.
Mini-batches are commonly used to get better tuning of the steepest descent in a neural network; the training data are divided into mini-batches, and the steepest descent tuning is applied to all the mini-batches, with one tuning of the scale parameters for each mini-batch. One tuning of all the mini-batches is one epoch.
There are several applications for mini-batches. In [9-12], mini-batches were employed for tuning. In [13,14], mini-batches were used for clustering. In [15,16], mini-batches were utilized for optimization. Since mini-batches have been used in several applications, they could be a good alternative to get better tuning with steepest descent.
Steepest descent with mini-batches performs one search for each mini-batch during tuning. Nevertheless, steepest descent with mini-batches could be delayed in reaching a minimum. The Hessian has been used as an alternative for neural network tuning, wherein the Hessian evaluates the second-order partial derivatives of the cost map with respect to the scale parameters.
The Hessian has the same form as the steepest descent. However, steepest descent takes into account constant values in its tuning rate and momentum, while the Hessian takes into account the second-order partial derivatives of the cost map with respect to the scale parameters in its tuning rate and momentum. This is the main reason why the steepest descent method may be delayed in reaching a minimum.

Design of the Hessian
We express the cost map as

E = \frac{1}{2} \sum_{l=1}^{L_T} (q_l - t_l)^2, (2)

where q_l is the output of the neural network, t_l is the target, and L_T is the total output number.

We express the forward propagation as

z_j = \theta_{ji} b_i, \quad o_j = g(z_j), \quad x_l = \phi_{lj} o_j, \quad q_l = f(x_l), (3)

where b_i is the input, q_l is the output of the neural network, \theta_{ji} are hidden layer scale parameters, and \phi_{lj} are output layer scale parameters.

We take into account the activation map in the hidden layer as the sigmoid form

g(z_j) = \frac{1}{1 + e^{-z_j}}. (4)

The first and second derivatives of the sigmoid map (4) are

g'(z_j) = \frac{\partial o_j}{\partial z_j} = g(z_j)\left(1 - g(z_j)\right), \quad g''(z_j) = g(z_j)\left(1 - g(z_j)\right)\left(1 - 2g(z_j)\right). (5)

We take into account the activation map of the output layer as the linear form

f(x_l) = x_l. (6)

The first and second derivatives of the linear map (6) are

f'(x_l) = 1, \quad f''(x_l) = 0. (7)

The first and second derivatives of the cost map (2) are

\frac{\partial E}{\partial q_l} = q_l - t_l, \quad \frac{\partial^2 E}{\partial q_l^2} = 1. (8)

Using the cost map (2) and f(x_l) = x_l (6), we express the propagation of the output layer as

\frac{\partial E}{\partial \phi_{lj}} = (q_l - t_l) o_j. (9)

Using the cost map (2) and g'(z_j) = \partial o_j / \partial z_j = g(z_j)(1 - g(z_j)) (5), we express the propagation of the hidden layer as

\frac{\partial E}{\partial \theta_{ji}} = \sum_{l=1}^{L_T} (q_l - t_l) \phi_{lj} g'(z_j) b_i. (10)

We express the second derivative of E as the Hessian H:

H = \left[ \frac{\partial^2 E}{\partial w_m \partial w_n} \right], (11)

where w_m, w_n run over the scale parameters, and the Hessian is symmetrical:

\frac{\partial^2 E}{\partial w_m \partial w_n} = \frac{\partial^2 E}{\partial w_n \partial w_m}. (12)

The Hessian term of the output layer is

\frac{\partial^2 E}{\partial \phi_{lj}^2} = o_j^2, (13)

and, substituting the derivatives (5) and (8) and the gradient (10) into (11), the Hessian term of the hidden layer is

\frac{\partial^2 E}{\partial \theta_{ji}^2} = \sum_{l=1}^{L_T} \left[ \phi_{lj}^2 g'(z_j)^2 + (q_l - t_l) \phi_{lj} g''(z_j) \right] b_i^2, (14)

where b_i are the inputs, q_l are the outputs, g(z_j) = 1/(1 + e^{-z_j}) are the activation maps, t_l are the targets, z_j = \theta_{ji} b_i are the hidden layer outputs, and \phi_{lj} are the scale parameters of the output layer.
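To make the derivation concrete, the following is a minimal NumPy sketch of the forward propagation (3), the cost map (2), the gradients (9) and (10), and the diagonal Hessian terms (13) and (14) for one training sample. The function and variable names are ours; this is an illustration of the equations above, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    # Hidden layer activation map g(z) = 1 / (1 + e^(-z)), cf. (4)
    return 1.0 / (1.0 + np.exp(-z))

def forward(b, theta, phi):
    """Forward propagation, cf. (3).
    b: inputs (u,); theta: hidden scale parameters (l, u);
    phi: output scale parameters (L_T, l)."""
    z = theta @ b    # hidden layer outputs z_j = theta_ji * b_i
    o = sigmoid(z)   # hidden activations o_j = g(z_j)
    q = phi @ o      # linear outputs q_l = f(x_l) = x_l, cf. (6)
    return z, o, q

def cost(q, t):
    # Cost map E, cf. (2)
    return 0.5 * np.sum((q - t) ** 2)

def terms(b, t, theta, phi):
    """Gradients (9), (10) and diagonal Hessian terms (13), (14)
    for one sample (b, t)."""
    _, o, q = forward(b, theta, phi)
    e = q - t                    # dE/dq_l = q_l - t_l, cf. (8)
    g1 = o * (1.0 - o)           # g'(z_j), cf. (5)
    g2 = g1 * (1.0 - 2.0 * o)    # g''(z_j), cf. (5)
    dE_dphi = np.outer(e, o)                   # cf. (9)
    dE_dtheta = np.outer((phi.T @ e) * g1, b)  # cf. (10)
    H_phi = np.tile(o ** 2, (len(e), 1))       # cf. (13)
    H_theta = np.outer(g1 ** 2 * np.sum(phi ** 2, axis=0)
                       + g2 * (phi.T @ e), b ** 2)  # cf. (14)
    return dE_dtheta, dE_dphi, H_theta, H_phi
```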
In the next step, we evaluate the Hessian using the Newton method.

Design of the Newton Method
It is necessary to express a method to tune the scale parameters using the Hessian. The Newton method is one alternative. We express the basic tuning of the Newton method as follows:

w_{k+1} = w_k - \alpha H_k^{-1} \frac{\partial E_k}{\partial w_k}, (15)

where the entries of H_k are as in (13), (14) for each k; \partial E_k / \partial w_k are as in (9), (10); w_k collects the scale parameters \theta_{ji,k}, \phi_{lj,k}; and \alpha is the tuning factor. The Newton method can quickly reach a minimum, but it requires the existence of the inverse of the Hessian (H_k^{-1}). Now, we express the Newton method of (15) by terms. First, from (15), we obtain the inverse of H_k as

H_k^{-1} = \frac{\operatorname{adj}(H_k)}{\det(H_k)}. (16)

We substitute H_k^{-1} of (16) and \theta_k, \partial E_k / \partial \theta_k into w_{k+1} of (15) as follows:

\theta_{k+1} = \theta_k - \alpha \frac{\operatorname{adj}(H_k)}{\det(H_k)} \frac{\partial E_k}{\partial \theta_k}. (17)

We express (17) by terms as

\theta_{ji,k+1} = \theta_{ji,k} - \alpha \beta_{Hji,k} \frac{\partial E_k}{\partial \theta_{ji,k}} - \alpha \gamma_{H,k} \frac{\partial E_k}{\partial \phi_{lj,k}}, \quad \phi_{lj,k+1} = \phi_{lj,k} - \alpha \beta_{Hij,k} \frac{\partial E_k}{\partial \phi_{lj,k}} - \alpha \gamma_{H,k} \frac{\partial E_k}{\partial \theta_{ji,k}}, (18)

where the tuning rates \beta_{Hij,k}, \beta_{Hji,k} are the diagonal terms of H_k^{-1} and the momentum \gamma_{H,k} collects its off-diagonal terms, built from (13), (14) for each k; \partial E_k / \partial \theta_{ji,k}, \partial E_k / \partial \phi_{lj,k} are as in (9), (10) for each k; \theta_{ji,k}, \phi_{lj,k} are the scale parameters for each k; and \alpha is the tuning factor. Thus, (18) is the Newton method by terms.
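As an illustration, a sketch of one Newton tuning step per sample follows, using the diagonal Hessian terms from the sketch above instead of the full inverse in (16); the damping constant eps is our addition, not the paper's, so that the element-wise inverse exists when a Hessian term is near zero.

```python
def newton_step(b, t, theta, phi, alpha=0.0004, eps=1e-8):
    """One Newton tuning step, cf. (15) and (18):
    w_{k+1} = w_k - alpha * H_k^{-1} * dE_k/dw_k."""
    dE_dtheta, dE_dphi, H_theta, H_phi = terms(b, t, theta, phi)
    theta = theta - alpha * dE_dtheta / (H_theta + eps)  # eps: our damping
    phi = phi - alpha * dE_dphi / (H_phi + eps)
    return theta, phi
```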
For comparison, we express the steepest descent method as

\theta_{ji,k+1} = \theta_{ji,k} - \alpha \beta_{Gji,k} \frac{\partial E_k}{\partial \theta_{ji,k}} - \alpha \gamma_{G,k} \frac{\partial E_k}{\partial \phi_{lj,k}}, \quad \phi_{lj,k+1} = \phi_{lj,k} - \alpha \beta_{Gij,k} \frac{\partial E_k}{\partial \phi_{lj,k}} - \alpha \gamma_{G,k} \frac{\partial E_k}{\partial \theta_{ji,k}}, (19)

where \partial E_k / \partial \theta_{ji,k}, \partial E_k / \partial \phi_{lj,k} are as in (9), (10) for each k; \theta_{ji,k}, \phi_{lj,k} are the scale parameters for each k; and \alpha is the tuning factor. Thus, (19) is the steepest descent method. It can be seen that the Newton method by terms (Hessian) (18) has the same form as the steepest descent (19). However, the steepest descent (19) takes into account constant values in its tuning rate \beta_{Gij,k}, \beta_{Gji,k} and momentum \gamma_{G,k}, while the Hessian (18) takes into account the second-order partial derivatives of the cost map with respect to the scale parameters in its tuning rate \beta_{Hij,k}, \beta_{Hji,k} and momentum \gamma_{H,k}.
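For contrast, the steepest descent step (19) is the same update with constant tuning rates in place of the inverse Hessian terms; a sketch reusing terms from above:

```python
def sd_step(b, t, theta, phi, alpha=0.0004):
    """One steepest descent tuning step, cf. (19):
    w_{k+1} = w_k - alpha * dE_k/dw_k, with constant rates."""
    dE_dtheta, dE_dphi, _, _ = terms(b, t, theta, phi)
    theta = theta - alpha * dE_dtheta
    phi = phi - alpha * dE_dphi
    return theta, phi
```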
We express the mini-batches in the next section to get better tuning of the Hessian.

Mini-Batches to Get Better Tuning of the Hessian
In neural network tuning, each neuron sends information to the next neuron and receives information from the previous neuron. We need training for successful neural network tuning. The training is developed from one epoch to the next until the scale parameters reach constant values and the cost map reaches a minimum. In addition, we need the training data to be presented in random order with the goal to quickly reach a minimum.
In the training stage, the neural network computes its outputs, and we compare the outputs with the targets; in this way, the cost map of the neural network decreases. The scale parameters take random initial values, and these scale parameters are tuned through time.
We use separate testing data as the basic method to evaluate the neural network efficacy: we take 80% of the data for training and 20% of the data for testing. The first stage is the training, and the second stage is the testing. Both stages are important in the tuning.
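A minimal sketch of the 80%/20% split described above; the random permutation implements the random presentation of the data, and the names are ours:

```python
def split_train_test(data, targets, train_frac=0.8, seed=0):
    """Split the data into 80% training and 20% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))    # random presentation of the data
    cut = int(train_frac * len(data))
    return (data[idx[:cut]], targets[idx[:cut]],
            data[idx[cut:]], targets[idx[cut:]])
```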

Design of the Mini-Batches
We take into account training data with v samples and u characteristics. In the mini-batches, we divide the training data v into w mini-batches of size y, with the goal to quickly reach a minimum:

w = \frac{v}{y}. (22)

The cost map of each mini-batch m is the cost map (2) evaluated only on the y samples p of that mini-batch:

E_m = \frac{1}{2} \sum_{p \in m} \sum_{l=1}^{L_T} (q_{l,p} - t_{l,p})^2. (23)

We express the Hessian with mini-batches as follows: (1) for each epoch, present the training data in random order and divide it into the w mini-batches of (22); (2) for each mini-batch, obtain the forward propagation (3), the cost map (23), the back propagation (9), (10), and the Hessian (13), (14); (3) apply one tuning (18) of the scale parameters for each mini-batch.
We express the properties of the mini-batches below:
• Most of the time, we do not need to utilize all the data to reach an acceptable descent direction; a small number of mini-batches could be sufficient to estimate the target.
• Obtaining the Hessian using all the training data could have a high computational cost.
We tune the neural network with a tuning factor of α and l neurons in the hidden layer, and we use e epochs. In this kind of tuning, we divide the training data into mini-batches as in (22,23).
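The following sketch puts (22) and (23) together: each epoch shuffles the v training samples, divides them into w = v // y mini-batches, averages the per-sample gradients and Hessian terms from terms() over each mini-batch, and applies one tuning (18) per mini-batch. The averaging over the mini-batch is our reading of (23), and eps is the damping constant assumed earlier, not part of the paper.

```python
def hessian_mini_batches(data, targets, theta, phi, y=32, epochs=40,
                         alpha=0.0004, eps=1e-8, seed=0):
    """Hessian with mini-batches, cf. (22) and (23)."""
    v = len(data)
    w = v // y                        # number of mini-batches, cf. (22)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(v)      # random order each epoch
        for m in range(w):            # one tuning per mini-batch
            sel = idx[m * y:(m + 1) * y]
            # Average gradients and Hessian terms over the mini-batch,
            # our reading of the mini-batch cost map (23).
            acc = [np.zeros_like(theta), np.zeros_like(phi),
                   np.zeros_like(theta), np.zeros_like(phi)]
            for b, t in zip(data[sel], targets[sel]):
                for a, g in zip(acc, terms(b, t, theta, phi)):
                    a += g
            dth, dph, Hth, Hph = (a / y for a in acc)
            theta = theta - alpha * dth / (Hth + eps)  # cf. (18)
            phi = phi - alpha * dph / (Hph + eps)
    return theta, phi
```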

Comparisons
In this section, we compare steepest descent (SD), steepest descent with mini-batches (SDMB) from [9-12], the Hessian (H) from [17-20], and the Hessian with mini-batches (HMB) from this investigation for electrical demand prediction. The goal of these algorithms is that the neural network output q_l must reach the target t_l as soon as possible.
Efficient electrical demand prediction is critical for reliable operation and planning, and for achieving profits. The load forecast influences a series of decisions, including which generators to use in a given period, and it influences the wholesale and market prices in the electrical sector.
The training data used were a table with the history of the electrical demand for each hour, together with temperature observations, provided by the independent system operator (ISO) of Great Britain. The meteorological information includes the dry bulb temperature and the dew point. We took into account the data of the hourly electrical demand.
For the electrical demand prediction, we took into account eight characteristics to tune the neural network, including:
• The load of the same hour in the past day;
• The load of the same hour on the same day of the past week.
Further, we utilized the load of the same day as the target.

1. Using the training data (34,800 × 8), we trained the neural network for electrical demand prediction. After the training stage of the neural network, we used 8770 datapoints for the testing for each characteristic, yielding a matrix with dimensions (8770 × 8).
2. The neural network had three layers: one input layer, one hidden layer, and one output layer. The input layer had eight neurons, the hidden layer had six neurons, and the output layer had one neuron (sketched in code below).
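In code, this 8-6-1 architecture corresponds to the following shapes, with the scale parameters initialized at random between 0 and 1 as in step 1 of the list below (names are ours):

```python
rng = np.random.default_rng(0)
u, n_hidden, L_T = 8, 6, 1           # input, hidden, and output neurons
theta = rng.random((n_hidden, u))    # hidden layer scale parameters in [0, 1)
phi = rng.random((L_T, n_hidden))    # output layer scale parameters in [0, 1)
```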
We obtained the neural network tuning by using the Hessian with the following steps (a sketch in code follows the list):
1. We initialized the scale parameters with random values between 0 and 1;
2. We obtained the forward propagation;
3. We obtained the cost map;
4. We obtained the back propagation;
5. We utilized the Hessian tuning.
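A minimal sketch tying steps 1-5 together for the full-batch Hessian tuning, reusing the functions sketched above; this is our illustration, not the authors' code:

```python
def tune_with_hessian(data, targets, epochs=40, alpha=0.0004,
                      eps=1e-8, seed=0):
    """Steps 1-5: random initialization, then forward propagation,
    cost map, back propagation, and Hessian tuning in each epoch."""
    rng = np.random.default_rng(seed)
    theta = rng.random((6, data.shape[1]))  # step 1: random init in [0, 1)
    phi = rng.random((1, 6))
    for _ in range(epochs):
        for b, t in zip(data, targets):
            # Steps 2-4 happen inside terms(); step 5 is the update.
            theta, phi = newton_step(b, t, theta, phi, alpha, eps)
    return theta, phi
```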
To evaluate the tuning of the neural network, we employed the determination coefficient (R^2), the mean absolute error (MAE), and the mean absolute percent error (MAPE), determined as follows:

R^2 = 1 - \frac{\sum_{l=1}^{L_T} (q_l - t_l)^2}{\sum_{l=1}^{L_T} (t_l - \bar{t}_l)^2}, \quad \mathrm{MAE} = \frac{1}{L_T} \sum_{l=1}^{L_T} |q_l - t_l|, \quad \mathrm{MAPE} = \frac{100}{L_T} \sum_{l=1}^{L_T} \left| \frac{q_l - t_l}{t_l} \right|,

where q_l is the neural network output, t_l is the target, \bar{t}_l is the mean of the target, and L_T is the total output number. R^2 generates values from 0 to 1; a method that provides good tuning has R^2 values near 1, while a method that provides bad tuning has R^2 values near 0. A method that provides good tuning also has MAE values near 0 MWh and MAPE values near 0%. We also used the cost map E (2) to evaluate the tuning of the neural network; a method that provides good tuning has E values near 0.
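A sketch of the three evaluation measures as defined above (q and t are the output and target vectors):

```python
def metrics(q, t):
    """Determination coefficient R^2, MAE (MWh), and MAPE (%)."""
    r2 = 1.0 - np.sum((q - t) ** 2) / np.sum((t - np.mean(t)) ** 2)
    mae = np.mean(np.abs(q - t))
    mape = 100.0 * np.mean(np.abs((q - t) / t))
    return r2, mae, mape
```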

Results of the Comparison
It should be noted that the neural network trained by steepest descent (SD) (19) had l = 6 neurons in its hidden layer, a tuning factor of α = 0.0004, and a number of epochs of e = 40.
It should be noted that the neural network trained by the Hessian with mini-batches (HMB) in this investigation and using (15), (18), (14), (9), (10), (22), (23) had l = 6 neurons in its hidden layer, mini-batches with a size of y = 32, a tuning factor of α = 0.0004, and a number of epochs of e = 40. Figure 2 shows the cost maps during the training of the neural network with steepest descent (SD), steepest descent with mini-batches (SDMB), the Hessian (H), and the Hessian with mini-batches (HMB). As we can see, the Hessian with mini-batches provides better tuning when training the neural network and tends to converge more directly (with the help of the information provided by the second derivative) than steepest descent. The issue with plain steepest descent is that a minimum often cannot be found quickly. The use of mini-batches helps to quickly reach a minimum.
The tuning of the neural network with SD, SDMB, H, and HMB during training is shown in Figure 4. During 40 epochs, the neural network trained with steepest descent failed to tune, and its tuning was very slow when compared to the neural network trained with the Hessian and mini-batches, which provided better tuning. The Hessian provided better tuning than steepest descent.

The neural network tuning using SD, SDMB, H, and HMB during testing is shown in Figure 5. Similar to the training results, the neural network trained using steepest descent did not have the ability to predict. The neural network prediction using the Hessian with mini-batches was better than that using the other methods, as can be seen in Figure 6, which is a zoom of the neural network prediction with SD, SDMB, H, and HMB during testing.

Table 1 compares the results of SD, SDMB, H, and HMB during training and testing with 40 epochs in terms of the determination coefficient (R^2) and cost (E). It should be noted that the neural network had 6 neurons in its hidden layer and each mini-batch had a size of y = 32. R^2 has values between 0 and 1, where values close to 1 correspond to algorithms with better tuning. Since HMB obtained the biggest value of R^2 during training and testing and obtained the smallest value of E during training, HMB provides the best tuning in comparison with H, SDMB, and SD. Table 2 compares the results of SD, SDMB, H, and HMB during training and testing for 40 epochs in terms of the mean absolute error (MAE) and the mean absolute percent error (MAPE). Smaller values of MAE and MAPE correspond to algorithms with better tuning. Since HMB obtained the smallest values of the MAE and MAPE during testing, HMB provides the best tuning in comparison with H, SDMB, and SD.
As we decrease the mini-batch size, we speed up the training of the algorithm, but we also increase the computation cost. This means a trade-off between computation cost and training speed.

Conclusions
Our goal in this article was to design the Hessian with mini-batches to get better tuning than steepest descent for a neural network. The Hessian with mini-batches was compared with steepest descent, steepest descent with mini-batches, and the Hessian for electrical demand prediction; since the proposed algorithm reached the nearest approximation between the neural network output and the target and reached the smallest value of the cost map, it provided the best tuning. In future work, we will study the convergence of the Hessian with mini-batches, we will compare our results with algorithms other than the Hessian, and we will apply our algorithm to the prediction of other processes.