Optimizing Predictor Variables in Artificial Neural Networks When Forecasting Raw Material Prices for Energy Production

This paper applies a heuristic approach to optimize the predictor variables in artificial neural networks when forecasting raw material prices for energy production (coking coal, natural gas, crude oil and coal) to achieve a better forecast. Its two goals are (1) to determine the optimum number of time-delayed terms or past values forming the lagged variables and (2) to improve the forecast accuracy by adding intrinsic signals to the lagged variables. The conclusions clearly contradict the current scientific literature: when addressing the lagged variable size, the results do not confirm the claimed relationships among size, representativeness and estimation accuracy. It is also possible to verify an important effect of the lagged variable size on the results. Finally, adding the order in the time series of the lagged variables to form the predictor variables improves the forecast accuracy in most cases.

This paper analyzes the forecast of raw material prices for energy production by means of ANN, focusing on the selection of optimum parameters in order to configure the ANN. Design of experiments (DOE) is normally used to select these parameters [11].
DOE can be focused on estimating the number of neurons in hidden layers [12], on forming training and test datasets [13], on eliminating redundant dimensions in the predictor variables trying to achieve compression [14,15], on determining the optimum size of the lagged variables [16], on adding signals to the lagged variables [17], etc.
A heuristic approach will be used to optimize the predictor variables in ANN by means of (1) modifying the lagged variable size and (2) adding intrinsic signals to the lagged variables.
Addressing the lagged variable size, Liu and Su [18] indicated that a larger size allows for increasing the forecast accuracy, while it decreases the representativeness of the subsample heterogeneity. Moreover, although smaller sizes may improve representativeness, they will reduce the estimation accuracy. In their empirical research, they used lagged variable sizes of 12, 24 and 36 months to test different alternatives. Nevertheless, the lagged variable sizes had very little effect on the results.
Tang and Abosedra [19] established that the lagged variable regression results are very sensitive regarding their size, but as there are no proper methods to select an optimum size, arbitrary selections have to be made. Other authors argue that a larger lagged variable size would lead to short-run predictability information being missed, and thus, a shorter size is preferred [20,21].
WEKA from the Machine Learning Group at the University of Waikato (Waikato, New Zealand) [22,23], a well-known open-source machine learning software widely used for teaching, research and industrial applications, has a specific time-series analysis environment for forecasting models. WEKA's time-series framework uses a machine learning/data mining approach to model time series: it transforms the data by removing the temporal ordering of individual input examples and encoding the time dependency via additional input fields or lagged variables. When using WEKA, it is possible to manipulate and control how lagged variables are created. They are the main mechanism for capturing the relationship between current and past values, creating a window over a certain time period. Essentially, the number of lagged variables created determines the size of the window.
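WEKA itself is Java-based; as a minimal illustration (not WEKA's actual implementation), the windowing transformation it describes can be sketched in Python:

```python
import numpy as np

def make_lagged(series, n_lags):
    """Encode the time dependency of a series as lagged input fields.

    Each row holds the past values P(t-n_lags) ... P(t-1) and the
    target holds P(t), mirroring the window WEKA creates when it
    removes the temporal ordering of individual input examples.
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])  # the window of past values
        y.append(series[t])             # the value to forecast
    return np.array(X), np.array(y)

prices = np.array([10.0, 11.0, 12.5, 12.0, 13.0, 14.5, 14.0])
X, y = make_lagged(prices, n_lags=3)
# one row per forecastable point; the row width equals the window size
print(X.shape)  # (4, 3)
print(y)
```

The number of lagged variables (`n_lags`) is exactly the window size discussed throughout this paper.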
Regarding the addition of signals to the lagged variables, Tavakoli et al. [24] proposed an input management system based on flexible data, providing a variable definition layer immediately on top of the acquisition layer to feed a data mining module that builds modeling functions. Recently, and within the neural networks field, Uykan and Koivo [25] presented and analyzed a new design for the predictor variables of a radial basis function neural network. In this design, the predictor variables were augmented with a desired output vector, allowing for better or comparable performance compared with the standard neural network.
Raw material selection, namely, coking coal, natural gas, crude oil and coal, was based on the representativeness and price availability of such materials. In the case of coking coal, prices were obtained from the Colombian Mining Information System as they were publicly disclosed, while for the rest of the raw materials, their prices were obtained from the World Bank Commodity Price Data (The Pink Sheet) under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
The program used to simulate ANNs was NeuralTools 7.5 from Palisade Corporation (Ithaca, NY, USA).

Method
This paper will attempt to improve the result of the time-series forecasting of raw material prices for energy production developed with ANN by means of a twofold optimization of the predictor variables (modifying the lagged variable size and adding intrinsic signals to the lagged variables), taking into consideration the previous work of Matyjaszek et al. [26], in which coking coal prices were forecasted by means of autoregressive integrated moving average models (ARIMA) [27,28] and ANN, as well as the transgenic time-series theory.

Artificial Neural Networks
Two different types of ANNs proposed by Specht [29] will be tested using the best net search function that is available in NeuralTools: generalized regression neural networks (GRNNs), which were used in the past to forecast European thermal coal spot prices among very different applications [30,31], and multilayer feedforward networks (MLFNs), with one or two layers as described in García Nieto et al. [32].
GRNNs are based on nonlinear regression theory and are very closely related to probabilistic neural nets. In GRNNs, a case prediction with a dependent value that is unknown is obtained by means of interpolation from the training cases, with neighboring cases given more weight [33]. The optimal parameters for the interpolation are found during training. The main advantage is not requiring any configuration at all. Figure 1 presents a GRNN with two independent variables in the input layer and only four training cases, with the pattern layer having four nodes corresponding to these training cases. Each of the nodes will compute its Euclidean distance regarding the presented case. Then, these values pass to the summation layer. The summation layer has two parts: one is the numerator, and the other is the denominator. The numerator contains the sum of multiplying the training output data and the activation function. The denominator is the sum of all the activation functions.

Finally, the output layer has only one neuron that calculates the output by dividing the numerator and the denominator of the summation layer.
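The GRNN computation described above (pattern, summation and output layers) can be sketched as a kernel-weighted interpolation; this is a minimal illustration, with a Gaussian activation and a smoothing parameter `sigma` standing in for the optimal interpolation parameters that NeuralTools finds during training:

```python
import numpy as np

def grnn_predict(X_train, y_train, x_new, sigma=1.0):
    """Kernel-weighted interpolation as in a GRNN.

    Pattern layer: one node per training case computes the (squared)
    Euclidean distance to the presented case. Summation layer: the
    numerator sums y_i times the activation, the denominator sums the
    activations alone. The single output neuron divides the two, so
    neighboring training cases get more weight.
    """
    d2 = np.sum((X_train - x_new) ** 2, axis=1)    # squared distances
    activation = np.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian kernel
    numerator = np.sum(y_train * activation)
    denominator = np.sum(activation)
    return numerator / denominator

# Four training cases with two independent variables, as in Figure 1
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
print(grnn_predict(X_train, y_train, np.array([0.0, 0.0]), sigma=0.3))
```

With a small `sigma`, the prediction at a training point collapses to that point's own output, which illustrates why GRNNs need essentially no configuration beyond the smoothing parameter.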
On the other hand, MLFNs consist of an input layer, one or two hidden layers and the output layer. An MLFN is configured by specifying the number of nodes in the hidden layers. The net behavior will depend on the number of nodes selected for each hidden layer, the connection weights, the bias terms assigned to each node, and the activation/transfer function selected to convert the inputs of each node into output. MLFNs are able to approximate complex relationships between the variables.

Lagged Variable Size
One of the issues analyzed in this paper is the current discussion about whether a larger lagged variable size allows for increasing the forecast accuracy [18] or whether a shorter size is preferred to avoid missing short-run predictability information [20,21]. Another issue is whether lagged variable regression results are very sensitive to their size [19] or not [18].
Lagged variables are generated by a number of linear time-delayed input terms or past values, normally in ascending order, such as P(t−n), …, P(t−2), P(t−1), to estimate the output value P(t) [34]. They are also referred to as rolling windows [35].
To make a first estimation of the number of time-delayed terms that should form the lagged variables (n), several alternatives can be selected, including the one developed by Ren et al. [36], which uses the seasonal characteristic that appears in the autocorrelation function (ACF) plot, although this value is not always available.
Another common approach is to approximate the value as the square root of the amount of data available for the analysis [26]:

n = √(total no. of data)

Another alternative, also used by Matyjaszek et al. [26], calculates the adequate number of time-delayed terms, n, that should form each input layer from:

total no. of data ≤ n² + 2n + 1
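Both first estimates can be computed directly; a minimal sketch of the two rules:

```python
import math

def estimate_window_sizes(n_data):
    """Two first estimates for the number of time-delayed terms.

    (1) the square root of the amount of available data [26];
    (2) the smallest n satisfying n_data <= n**2 + 2*n + 1,
        i.e. n_data <= (n + 1)**2, so n = ceil(sqrt(n_data) - 1).
    """
    sqrt_rule = math.sqrt(n_data)
    quadratic_rule = math.ceil(math.sqrt(n_data) - 1)
    return sqrt_rule, quadratic_rule

sqrt_rule, quad_rule = estimate_window_sizes(300)  # the coking coal dataset
print(round(sqrt_rule, 2), quad_rule)  # 17.32 17
```

Both rules land in the 12-24 range that the paper sweeps by trial and error for the 300- and 344-point datasets.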

Adding Intrinsic Signals to the Lagged Variables
To date, research on improving the forecast accuracy by adding signals to the lagged variables has focused on extrinsic signals [24,25]. This paper analyzes whether it is feasible to optimize the predictor variables by adding intrinsic signals, so that the neural network has more information available and a better forecast can be made. For this purpose, the order in the time series of each lagged variable is used, so that the ANN can exploit this feature.
This line of thinking is congruent with the work of Barabási [37], who states that there is a huge disconnect between network science and deep learning; although ANNs are abstractions of natural processes, some of the key neural networks could not be more ignorant about real networks. The main deep learning algorithms treat network features, like degree, as simple variables. Thus, they cannot truly exploit the network effects that are the essence of these systems, as in networked systems the key information is in the relationships between the connected components (i.e., in the links or edges, which are the direct interactions between nodes), not in the node attributes.

Table 1 presents an example of the first through tenth lagged variables used in a model with five time-delayed input terms, P(t−k), …, P(t−2), P(t−1), with k = 5, as well as the output to be estimated, P(t). Table 2 presents the same first through tenth lagged variables with five time-delayed input terms plus the order in the time series of each lagged variable, as well as the output to be estimated, P(t).

Table 2. First through tenth predictor variables with 5 time-delayed input terms plus the order in the time series of each lagged variable, and the output to be estimated (t).

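The augmented predictors of Table 2 can be sketched as follows; this is a minimal illustration (not the NeuralTools implementation), using the window position t as the order signal:

```python
import numpy as np

def make_lagged_with_order(series, n_lags):
    """Lagged variables augmented with their order in the time series.

    Mirrors Table 2: each predictor row is the window of past values
    P(t-n_lags) ... P(t-1) plus the position t of the window, giving
    the network the temporal link that plain windows discard.
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        row = list(series[t - n_lags:t]) + [t]  # append the order signal
        X.append(row)
        y.append(series[t])
    return np.array(X), np.array(y)

prices = [42.0, 44.0, 43.5, 45.0, 46.5, 47.0, 48.0]
X, y = make_lagged_with_order(prices, n_lags=5)
print(X)  # each row: five past values plus the order index
```

The only difference from a plain rolling window is the extra column, yet it is this intrinsic signal that the experiments below exploit.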

Figures of Merit
The experimental results will be evaluated using the two most common figures of merit [38], namely, the root mean squared error (RMSE) and the mean absolute error (MAE).
The RMSE is an excellent general-purpose error measure for numerical predictions. It amplifies and penalizes large errors and can be expressed as follows:

RMSE = √[(1/n) Σ (A_t − F_t)²]

where A_t is the actual value, F_t is the forecasted value, and n is the number of forecasted values. The MAE measures how close the predictions are to the outcomes and can be expressed as follows:

MAE = (1/n) Σ |A_t − F_t|

Chai and Draxler [39] proposed the use of a combination of metrics including but not limited to the RMSE and the MAE. Conversely, Carta et al. [40], when addressing wind resource prediction, proposed using the MAE, the MAPE and the index of agreement (IoA).
In this paper, the standard deviation of absolute error (STD of AE) was selected to complement these measures as in Lazaridis [38], characterizing the dispersion of the absolute errors.
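The three figures of merit can be computed directly; a minimal sketch, assuming the STD of AE in [38] is the population standard deviation of the absolute errors:

```python
import math

def figures_of_merit(actual, forecast):
    """RMSE, MAE and STD of AE as used to evaluate the experiments."""
    errors = [a - f for a, f in zip(actual, forecast)]
    abs_errors = [abs(e) for e in errors]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # penalizes large errors
    mae = sum(abs_errors) / n
    # dispersion of the absolute errors (population standard deviation)
    std_ae = math.sqrt(sum((ae - mae) ** 2 for ae in abs_errors) / n)
    return rmse, mae, std_ae

rmse, mae, std_ae = figures_of_merit([10.0, 12.0, 11.0], [9.0, 14.0, 11.0])
print(rmse, mae, std_ae)
```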

Coking Coal
The dataset used was the Colombia hard coking coal monthly prices free on board (FOB) for the period from January 1991 to December 2015, as publicly disclosed by the Colombian Mining Information System [41], totaling 300 data points. The dataset is presented in Figure 2.


First, GRNNs and MLFNs with 2-6 nodes in the first hidden layer (as the second layer is seldom needed for better prediction accuracy) were tested using the best net search function that is available in NeuralTools. Table 3 presents the results of this best net search, where the configuration of the lagged variables was made using 19 time-delayed terms, which was the lagged variable size given by the seasonal characteristic that appears in the autocorrelation function (ACF) plot of the transformed time series when representing a consistent genome [26,42]. The results showed that the GRNN improved on all the MLFN models, as stated for most cases by Modaresi et al. [43], so this is the ANN used in this paper. To make a first estimation of the number of time-delayed terms that should form the lagged variables, the square root of the amount of data available was calculated:

√(total no. of data) = √300 = 17.32

Using the other alternative previously mentioned, 300 ≤ n² + 2n + 1 gives n = 17. Nevertheless, the GRNN will be trained starting with 12 time-delayed input terms and up to 24 time-delayed input terms, a range that includes the previous estimates as well as one and two complete year periods [18].
This allows considering almost any periodical aspect that may be hidden within the time-series values but without drastically reducing the sample size requirements according to Turmon and Fine [44]. The results from the GRNN training are presented in Table 4. Based on these measures, the best result was obtained with 19 time-delayed input terms, which was the lagged variable size given by the seasonal characteristic that appears in the autocorrelation function (ACF) plot of the transformed time series when representing a consistent genome [26].
The figures of merit were root mean squared error (RMSE) of 4.550, mean absolute error (MAE) of 2.784 and standard deviation of absolute error of 3.599. With 30% tolerance, the percentage of bad predictions was 8.5409%.
Then, it was checked whether it was feasible to optimize the predictor variables by adding intrinsic signals to the lagged variables so that the ANN would have more information available, and thus, a better forecast could be achieved. For this purpose, the order of the lagged variables in the time series was considered. Using these predictor variables, results from the training of the GRNN are presented in Table 5. The best result was obtained with 24 time-delayed input terms, with a RMSE of 3.376, a MAE of 2.286 and a standard deviation of absolute error of 2.485. With 30% tolerance, the percentage of bad predictions was 2.5362%. Thus, the order in the time series of the lagged variables clearly improved the model's forecasting performance, as it significantly reduced the RMSE, the MAE, the standard deviation of absolute error and the percentage of bad predictions.

Natural Gas
The second raw material for energy production analyzed was natural gas. The dataset used was natural gas prices in Europe for the period from January 1991 to August 2019, totaling 344 values.
Prices were obtained from the World Bank [45] and are presented in Figure 3 in MMBtu, also known as million British thermal units, with 1 MMBtu = 28.263682 m³ of natural gas at 1 °F.

Figure 3. Natural gas prices in Europe for the period from January 1991 to August 2019 [45].
As no seasonal characteristic appears in the autocorrelation function (ACF), in order to estimate the number of time-delayed input terms, the following calculations were made: √344 = 18.55, and 344 ≤ n² + 2n + 1 gives n = 18. Nevertheless, the GRNN was trained with lagged variables starting with 12 time-delayed input terms and up to 24 time-delayed input terms, using the same interval as in the coking coal case and for the same reasons.
The results from the training of the GRNN are presented in Table 6. The best result was obtained with 12 time-delayed input terms, with a RMSE of 0.06273, a MAE of 0.03456 and a standard deviation of absolute error of 0.05235. With 5% tolerance, the percentage of bad predictions was 4.8193%. A 5% tolerance was used this time, as with a 30% tolerance the percentages of bad predictions were always zero.
Then, the GRNN was trained using the same number of time-delayed input terms but considering the order in the time series of each lagged variable. The results from the training of the GRNN are presented in Table 7. In this case, the best result was obtained with 14 time-delayed input terms, with a RMSE of 0.01970, a MAE of 0.01032 and a standard deviation of absolute error of 0.01678. With 5% tolerance, the percentage of bad predictions decreased to zero. Thus, again, the order in the time series of the lagged variables clearly improved the model's forecasting performance, as it significantly reduced the RMSE, the MAE, the standard deviation of absolute error and the percentage of bad predictions.

Crude Oil
The third raw material for energy production analyzed was crude oil. The dataset used was that of Brent crude oil prices for the same period as for the natural gas: January 1991 to August 2019, totaling 344 values. Again, prices were obtained from the World Bank [45] and are presented in Figure 4 in $/bbl, that is, dollars per barrel, with 1 barrel being approximately 159 liters. The number of estimated time-delayed input terms was the same as in the case of natural gas, as the number of data points was the same (344).
The GRNN was trained again with lagged variables starting with 12 time-delayed input terms and up to 24 time-delayed input terms. The results from the training of the GRNN are presented in Table 8. The best result was obtained with 23 time-delayed input terms, with a RMSE of 1.177, a MAE of 0.6645 and a standard deviation of absolute error of 0.971. With 5% tolerance, the percentage of bad predictions was 23.0530%. Then, the GRNN was trained using the same number of time-delayed input terms but considering the order in the time series of each lagged variable. The results from the training of the GRNN are presented in Table 9.

Table 9. Training results of the GRNN model for the Brent crude oil time series, from 12 to 24 time-delayed input terms including the order in the time series of the lagged variables (total number of data points is 344).

The best result was obtained with 14 time-delayed input terms, with a RMSE of 0.0281, a MAE of 0.0178 and a standard deviation of absolute error of 0.0218. With 5% tolerance, the percentage of bad predictions decreased again to zero.

With 12 time-delayed input terms, it is clear that the ANN was able to learn the exact configuration of the time series, but only in this case.
Thus, the order in the time series of the lagged variables clearly improved the model's forecasting performance, as it significantly reduced the RMSE, the MAE, the standard deviation of absolute error and the percentage of bad predictions.

Coal
The fourth and last raw material for energy production analyzed was coal. The dataset used was that of Australian coal prices for the same period as crude oil and natural gas: January 1991 to August 2019. Prices were also obtained from the World Bank [45] and are presented in Figure 5 in $/t.

Figure 5. Australian coal prices for the period from January 1991 to August 2019 [45].

The number of estimated time-delayed input terms was the same as in the case of natural gas and crude oil, as the number of data points was the same in all cases (344). The GRNN was trained again with lagged variables starting with 12 time-delayed input terms and up to 24 time-delayed input terms, as in the case of European natural gas and Brent crude oil. The results from the training of the GRNN are presented in Table 10. The best result was obtained with 19 time-delayed input terms, with a RMSE of 1.3280, a MAE of 0.7397 and a standard deviation of absolute error of 1.1029. With 5% tolerance, the percentage of bad predictions was 15.6923%.

Then, the GRNN was trained using the same number of time-delayed input terms but considering the order in the time series of each lagged variable. The results from the training of the GRNN are presented in Table 11.

Table 11. Training results for the GRNN models of the Australian coal time series, from 12 to 24 time-delayed input terms and including the order in the time series of the lagged variables (total number of data points is 344).

In this case, adding the order in the time series of each lagged variable only improved the standard deviation of absolute error.

Nevertheless, if the RMSE and MAE were compared with the training results obtained without adding the order in the time series of each lagged variable, although slightly higher, they were very similar to the best results, and better than the rest of the training results.
Thus, the differences between both options were almost negligible.

Discussion and Conclusions
This paper applied a heuristic approach to optimize the predictor variables in artificial neural networks when forecasting raw material prices for energy production to achieve a better forecast.
Its two goals were (1) to determine the optimum number of time-delayed terms or past values forming the lagged variables and (2) to optimize the predictor variables by adding intrinsic signals to the lagged variables.
The experimental results were evaluated using the two most common figures of merit, the root mean squared error (RMSE) and the mean absolute error (MAE), as well as the standard deviation of absolute error, as the scientific literature proposes the use of a combination of metrics including RMSE and MAE, but not being limited to them.
The results demonstrated, first, that in opposition to the scientific literature on lagged variable size, a larger size did not allow for increasing the forecast accuracy, and smaller sizes did not reduce the estimation accuracy. Moreover, the lagged variable regression results were very sensitive to their size.
In the three raw materials with the same number of cases (natural gas, crude oil and coal), the best results were obtained with rolling window sizes of 12, 23 and 19, respectively. Furthermore, it was possible to verify an important effect of the lagged variable size on the results, with differences that in some cases were larger than 20%.
Thus, and again in opposition to the scientific literature indicating that there are no proper methods to select an optimum size and that arbitrary selections have to be made, it is recommendable to address this question by a trial-and-error method, although the approximate size can be estimated in order to select the complete-year period range to which this value belongs, e.g., 12-24 months or 24-36 months. This will allow considering any periodical aspect that may be hidden within the time-series values, but without drastically reducing or increasing the sample size requirements for neural networks.
Second, in three of the four raw materials analyzed (coking coal, natural gas and crude oil), it was possible to improve the forecast accuracy by adding the order in the time series of the lagged variables to form the predictor variables. The best results were achieved with rolling window sizes of 24, 14 and 14, respectively.
In the case of the Australian coal, this process only improved the standard deviation of absolute error. Nevertheless, if the RMSE and MAE were compared with the training results obtained without adding the order, although slightly higher, they were very similar to the best results and better than the rest of the training results without adding the order.
As the differences between both options were almost negligible, it is possible to recommend adding the order in the time series of each lagged variable to the predictor variable in all cases.
Third, only for the Brent crude oil with 12 time-delayed input terms and considering the order in the time series of the lagged variables was the ANN able to learn or deduce the exact configuration of the time series. This is completely congruent with the fact that there is a huge disconnect between network science and deep learning, as the key information is in the relationships between the connected components, not in the node attributes.
In conclusion, the findings presented in this paper have an immediate practical application in the forecasting of time series by means of ANNs that consider lagged variables, and they are not restricted to the studied case of raw material prices for energy production.
Any forecast may be optimized just by adding an intrinsic signal to the predictor variable consisting of the order in the time series of each lagged variable. In this way, the ANN will be able to exploit this feature, something that will not happen otherwise. In most cases, the figures of merit may improve (be reduced) by up to 20%, with the consequent benefit for decision-makers regarding savings, efficiency/benefit gains and/or lower risk.
Regarding the size of the lagged variable, a selection should be made about the period that will be analyzed in order to undergo a trial and error process. This selection should follow the procedure shown in this paper or other ones that may be found in the scientific literature.
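The trial-and-error process just described can be sketched as a simple grid search over candidate window sizes; `fit_and_score` is a hypothetical callback that would train and score a network for a given size, replaced here by a naive persistence forecast so the sketch runs standalone:

```python
import math

def best_window_size(series, candidates, fit_and_score):
    """Trial-and-error selection of the lagged variable size.

    `fit_and_score(series, n)` is a hypothetical callback returning an
    error measure (e.g. RMSE) for an n-term window; the size with the
    lowest error wins.
    """
    scores = {n: fit_and_score(series, n) for n in candidates}
    return min(scores, key=scores.get), scores

def persistence_rmse(series, n):
    # stand-in model: predict P(t) with P(t-n) and report the RMSE
    errs = [series[t] - series[t - n] for t in range(n, len(series))]
    return math.sqrt(sum(e * e for e in errs) / len(errs))

# a seasonal toy series; sweep the 12-24 range used in the paper
series = [math.sin(0.5 * t) for t in range(60)]
best_n, scores = best_window_size(series, range(12, 25), persistence_rmse)
print(best_n)
```

Sweeping a complete-year period range (here 12-24) keeps any hidden periodical aspect inside the window, as recommended above.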
Further research should address different issues, such as the use of more intrinsic signals. Regarding this issue, the authors have made interesting preliminary approaches by considering the transgenic time series theory, which allows eliminating anomalous phenomena from the time series. By augmenting the lagged variables within the anomalous period with a '1' and the rest with a '0', or vice versa, it was possible to improve a priori the figures of merit.
Another area of interesting future research will be to develop a procedure to accurately determine the number of time-delayed input terms that should be used when considering the order in the time series of the lagged variables. While the seasonal characteristic that appears in the autocorrelation function (ACF) plot is valid before augmenting the lagged variables, afterwards this figure is no longer valid, so a new approach is needed. Moreover, nothing has yet been developed for time series whose ACF plot does not allow one to extract a seasonal characteristic. Again, the transgenic time series theory could be of help in these cases.
Finally, future research should address why, in the case of the Australian coal or in similar cases, it was not possible to improve the figures of merit by adding to the predictor variable the order in the time series of each lagged variable.