Multivariate Statistical Analysis for Training Process Optimization in Neural Networks-Based Forecasting Models

Abstract: Data forecasting is very important for electrical analysis development, transport dimensionality, marketing strategies, etc. Hence, low error levels are required. However, in some cases data have dissimilar behaviors that can vary depending on exogenous variables such as the type of day, weather conditions, and geographical area, among others. Commonly, computational intelligence techniques (e.g., artificial neural networks) are used due to their generalization capabilities. Nevertheless, there is no unique way for them to reach optimal performance. For this reason, it is necessary to analyze the data's behavior and their statistical features in order to identify the significant factors in the training process and guarantee better performance. This paper proposes an experimental method for identifying the significant factors in a forecasting model for time series data and for measuring their effects on the Akaike information criterion (AIC) and the Mean Absolute Percentage Error (MAPE). Additionally, we seek to establish optimal parameters for the proper selection of the artificial neural network model.


Introduction
Neural network design requires the proper determination of the input variables, i.e., the appropriate selection of the factors that affect the behavior of the variable to be modeled. However, this is not a trivial issue because there is no formal theory to ensure that the selected network is the best for a particular application problem. For unsupervised neural networks, there are no established performance metrics to ensure that the configuration developed has reached optimal performance [1]. In supervised neural networks, the Mean Absolute Percentage Error (MAPE) is commonly used to evaluate the generalization capability of the model. Likewise, the state of the art reports the Akaike information criterion (AIC) as a measure of comparison to identify a suitable configuration [2,3]. For these reasons, it is necessary to know which factors influence the behavior of these two performance metrics, so that it can be established whether it is possible to determine an optimal operating point for AIC and an appropriate dispersion index level for MAPE.
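As a point of reference for the two metrics, the following Python sketch computes MAPE and a least-squares form of AIC. The expression n ln(SSE/n) + 2k assumes Gaussian errors and is one common convention; the paper does not state which form was used. The function names, the parameter-count helper for a single-hidden-layer network, and the sample values are illustrative assumptions.

```python
import math


def mape(actual, forecast):
    """Mean Absolute Percentage Error in percent (actual values must be nonzero)."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def aic_least_squares(actual, forecast, n_params):
    """AIC under an assumed Gaussian error model: n * ln(SSE / n) + 2 * k.

    Requires a nonzero sum of squared errors (SSE).
    """
    n = len(actual)
    sse = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return n * math.log(sse / n) + 2 * n_params


def mlp_param_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of a fully connected network with one hidden layer."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


# Hypothetical values, for illustration only.
demand = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(mape(demand, predicted))
print(aic_least_squares(demand, predicted, mlp_param_count(2, 1)))
```

Because the penalty term grows with the number of weights, adding hidden neurons raises this form of AIC unless the fit improves enough to compensate.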
Next, this paper will show some examples of the neural network training process design, applying different strategies.
The training process models based on Radial Basis Function (RBF) networks [4] are the most efficient in the neural network models for achieving an effective, adaptive, and versatile architecture with precise computational time results. A multi-level neural architecture composed of RBF Serially Operating Multipliers (SOM) algorithms executed in parallel in a programming mode known as Compute Unified Device Architecture (CUDA) is proposed by [5] to improve the accuracy and timing of the RBF model with the same amount of data. The validation for such a structure consists of estimating ecological variables with information on the environment through the CUDA-RBF-SOM structure, which shows an improvement in time and precision for the training process of 0.1154% compared with the RBF network for the same estimated variable. In conclusion, they present a new training process (CUDA-RBF-SOM) that reduces the execution time by 99.8846% compared to an RBF-SOM model.
In some cases, the quantity of data for the training process can lead to a problem, so the architecture must not be limited by data. The neural network model proposed by [6] is focused on reducing the quantity of input data needed in the training process. An RBF network is applied to regression problems; the new structure consists of a multi-stage training process that matches the orthogonal least squares (OLS) with an optimization gradient. The model validation tests are performed on the data for the prediction of stress and force in the knee required to prevent injuries in different scenarios (tasks), in which the results and estimated times are compared with other models, such as OLS and feed-forward back-propagation network (FFN), with their real values. The proposed structure named Opt-RBN shows, in some scenarios, favorable results in the training data. Although the training process time is a little longer, the authors consider it a good time because it is less than one minute.
The Stochastic Gradient Descent (SGD) model during the training process updates a Convolutional Neural Network (CNN) with a noisy gradient calculated from a random batch. Each batch uniformly updates the network every once in a while, which leads to loss problems in the batches. A model to mitigate this problem is proposed in the form of a stochastic process that automatically selects the batch with the largest loss to accelerate its training. This is called Inconsistent Stochastic Gradient Descent (ISGD) by [7]. The key concept is that the inconsistent training dynamically adjusts the training effort without loss. ISGD gradually reduces the average lost batch and uses a dynamic upper control limit to identify a large lost batch as it goes along. The ISGD remains in the identified batch to speed up the training with additional gradient updates. The tests for validating the ISGD are based on data such as those derived from ImageNet (A large-scale hierarchical image database), MNIST (Modified National Institute of Standards and Technology database), and CIFAR-10 (Canadian Institute For Advanced Research database). Those tests showed a convergence of the ISGD that was 14.94% faster than the SGD using the ImageNet database. ISGD showed a convergence that was 23.57% faster than the SGD in the CIFAR-10 test. It also showed a convergence that was 28.57% faster than the SGD using the MNIST database.
The Back Propagation Neural Network (BPNN) model is used in the training process to determine the number of neurons needed in the two hidden layers of a neural network to forecast the magnitude, on the Richter scale, of earthquakes in a region of the Philippine Sea [8]. With the use of a data register for the earthquakes in the area from 1990 to 2014, several series of BPNN models were built to forecast the magnitude of the earthquake and compare them to the actual data. From the obtained data, it was concluded that a number of 10 neurons per hidden layer is the ideal number for the forecasting model, since the forecasting errors of the BPNN model with 10 neurons in each hidden layer were very similar to those of the models that use more than 10 neurons per layer, which involves a longer time of computation for similar results. The results were compared with the actual values of the magnitudes of earthquakes that have already occurred, and the authors concluded that the BPNN model forecasts a reliable Richter scale earthquake magnitude result.
Neural network models are used to solve the text categorization problem. One of the models is the Improved Back-Propagation Neural Network (IBPNN), proposed by [9], which uses a parallel computational process to speed up the neural network training for text categorization. The BPNN algorithm runs on a Sun Cluster with 34 nodes (processors). The parallel IBPNN is integrated with the Singular Value Decomposition (SVD) technique, wherein the neural network input is represented as a low-dimensional feature vector. The validation is performed using different databases wherein the number of processors is varied from 1 to 32, which produces an improvement in the execution time without diminishing the text categorization accuracy. The results show that the parallel IBPNN and the SVD technique achieve a faster, more adaptive, and more reliable training process in the text categorization problem.
Table 1 shows a summary of techniques commonly used in the training of time series models. According to Table 1, the data modeling process is not unique: several different configurations must be evaluated to select the one that best fits the variables considered and the data themselves. Depending on the configuration, it will be necessary to define which process is more suitable for adjusting each parameter that defines it. Since there is no formal theory for selecting the best model, in the case of neural networks, it is common to find authors who base their selection on iterative processes in which the training parameters change. Here, the training process is assumed to be automated, and an experimental design is proposed in which the most significant parameters within the modeling structure are identified, so that these parameters change during the iterative process for selection and comparison.

Experimental Planning
A design of experiments considering AIC and MAPE as performance metrics is proposed to identify which factors are significant in the neural networks' training process for forecasting purposes. The historical energy demand data of an energy commercialization market in Colombia, in the city of Barranquilla, were used to carry out this procedure [16].
In the training process, the validation and selection of the neural network can influence the adaptability and generalization of the neural network. Factors related to the network configuration are highlighted in the first instance, within which are (1) network type, (2) number of layers, (3) the number of neurons in the hidden layer, and (4) activation function type. Additionally, the factors defined in the training and validation stages are (5) initial learning coefficient, (6) number of data to consider, (7) percentage of data for training, validation, and testing, (8) training algorithm, (9) training epochs, (10) corresponding time, and (11) presentation data order for training [17].
Factors 1, 4, 9, 10, and 11 are held constant. The activation functions typically used in pattern recognition and classification are the identity function for the input neurons and the sigmoid function for the other layers (hidden layer and output layer). For the network type, a Multilayer Perceptron (MLP) is used due to its better generalization capability, whereas RBF networks are more commonly applied to classification and pattern recognition. The number of epochs is set at 100. Factors 2, 3, and 5-8 are the design factors manipulated to verify their significance for the AIC index and the MAPE metric. Since the main goal is to identify which factors are significant in designing a neural network for time series modeling, we decided to use two levels per factor (low and high).
According to the state of the art, ranges are defined for each design factor considered during the training and validation processes; for example, the validation percentage ranges from 10 to 30. Table 2 shows the levels and ranges for the design factors. Since the objective of this study is to identify the significant factors in the performance of a neural network for forecasting purposes, the AIC and the MAPE were selected as response variables.

Experimental Design
The $2^k$ factorial design is selected to evaluate each factor's effect on the neural network's training and validation processes. This experiment is adequate when the goal is to analyze the significance of the factors with a minimum number of runs [18]. In this case, only two levels (low and high) are considered. This can be a weakness when the factors have significant interactions and the response shows curvature in the experimental zone. When curvature is detected, it is necessary to add central and axial points to the experimental design. Another aspect to consider is that an unreplicated design leaves zero degrees of freedom for the error. Thus, it is recommended to apply the following two strategies: (1) To identify the significant effects through a normal probability plot, i.e., to remove from the analysis those effects that fall along the line of the normal probability plot, because they behave like residuals. In this case, the degree of freedom of each excluded effect is added to the error. (2) To add central points to the $2^k$ experiment; these points are placed in the center of the experimental zone, at $x_i = 0$ ($i = 1, 2, 3, \ldots, k$). This strategy adds degrees of freedom to the error and allows the curvature of the response variable in the experimental zone to be measured.
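Strategy (1) relies on an estimate of each effect, which in an unreplicated two-level design is obtained from contrasts of the coded columns. The following Python sketch computes main effects and two-factor interactions. The function name, the index-tuple keys, and the synthetic response are assumptions made for illustration; they are not the paper's data.

```python
from itertools import combinations, product
from math import prod


def estimate_effects(runs, responses, max_order=2):
    """Estimate effects in a two-level full factorial with coded levels -1/+1.

    Each effect is the mean response where the contrast column is +1 minus
    the mean response where it is -1.
    """
    k = len(runs[0])
    half = len(runs) / 2
    effects = {}
    for order in range(1, max_order + 1):
        for cols in combinations(range(k), order):
            contrast = sum(prod(run[c] for c in cols) * y
                           for run, y in zip(runs, responses))
            effects[cols] = contrast / half
    return effects


# Synthetic example: three factors, response driven only by the
# interaction between factor 0 and factor 2.
runs = list(product((-1, 1), repeat=3))
responses = [10.0 + 3.0 * r[0] * r[2] for r in runs]
effects = estimate_effects(runs, responses)

# Effects ranked by magnitude; those far from zero stand out on a
# normal probability plot, while the rest behave like noise.
for cols, value in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(cols, value)
```

Effects that stay close to zero can then be pooled into the error term, which is how the excluded effects contribute degrees of freedom.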
The experimental design carried out in this paper is a $2^5$ design with six central points. Three of these are run with the qualitative factor at the low level (Levenberg-Marquardt algorithm) and the other three at the high level (Resilient Backpropagation algorithm). If the linear model does not fit the data, it will be necessary to add axial points to the design.
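A run list for such a design can be generated programmatically. The sketch below, in Python, builds the 32 factorial runs in coded units and appends six center runs split between the two levels of the qualitative factor. The factor labels and the function name are hypothetical, chosen only to make the example readable.

```python
from itertools import product


def two_level_design_with_centers(quantitative, qualitative, centers_per_level=3):
    """Coded runs for a 2^k full factorial plus center points.

    Quantitative factors take -1, 0 (center), or +1. The qualitative factor
    has no midpoint, so the center runs are split between its two levels.
    """
    names = list(quantitative) + [qualitative]
    runs = [dict(zip(names, levels))
            for levels in product((-1, 1), repeat=len(names))]
    for level in (-1, 1):
        for _ in range(centers_per_level):
            run = {name: 0 for name in quantitative}
            run[qualitative] = level
            runs.append(run)
    return runs


# Hypothetical factor labels; -1 / +1 on "algorithm" stand for
# Levenberg-Marquardt and Resilient Backpropagation, respectively.
design = two_level_design_with_centers(
    ["layers", "neurons", "n_data", "val_pct"], "algorithm")
print(len(design))  # 32 factorial runs + 6 center runs = 38
```

In practice the run order would be randomized before training the networks, and each coded level would be mapped to the actual value listed for that factor.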

Experimental Results
A multifactor analysis of variance (ANOVA) is used to identify which factors are relevant. To this end, a normal probability plot is constructed, as shown in Figure 1 for the AIC index. Figure 1 and Table 3, which reports the ANOVA results for AIC, show the CE effect as the most relevant. The AIC response is affected only by the CE effect, whose p-value is less than 0.05. To validate the result reached through the ANOVA, it is necessary to verify that the model fulfills the normality, homoscedasticity, and independence conditions. A Chi-square test is used to validate the normality condition, as shown in Table 4.
Because the Chi-square test has a p-value greater than 0.05, it is not possible to reject the null hypothesis; hence, the residuals are consistent with a normal distribution.
To verify the homoscedasticity condition, we used Levene's test, as shown in Table 5.
According to the results in Table 5, four factors fulfill the homoscedasticity condition because their p-values are greater than 0.05.
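For readers who want to reproduce the homoscedasticity check, the following Python sketch computes Levene's W statistic from the absolute deviations around each group mean. It is a minimal illustration with made-up observations; the p-value, which comes from comparing W with an F distribution with (k − 1, N − k) degrees of freedom, is not computed here, and the function name is an assumption.

```python
def levene_w(groups):
    """Levene's W statistic using absolute deviations from each group mean.

    groups: list of lists, one list of observations per factor level.
    Raises ZeroDivisionError if every deviation within each group is identical.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group mean.
    z = []
    for g in groups:
        mean_g = sum(g) / len(g)
        z.append([abs(y - mean_g) for y in g])
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical responses at the low and high level of one factor.
low_level = [1.0, 2.0, 3.0]
high_level = [2.0, 4.0, 6.0]
print(levene_w([low_level, high_level]))
```

A small W (and therefore a large p-value) is consistent with equal variances across the levels of the factor.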
The third and last condition is independence, which is checked through the lag 1 autocorrelation. The results are as follows: Durbin-Watson = 2.3533 (p-value = 0.5515), lag 1 autocorrelation = −0.057753. The Durbin-Watson statistic is approximately equal to 2; hence, the residuals are independent.

Table 4. Normality test for AIC.
Test          p-Value
Chi-square    0.1319

For the MAPE analysis, the same procedure is carried out. As seen in Figure 2 and Table 6, the CE effect is the most relevant. The MAPE response is affected only by the CE effect, whose p-value is less than 0.05. To validate the result reached through the ANOVA for MAPE, it is necessary to verify that the model fulfills the normality, homoscedasticity, and independence conditions. A Chi-square test is used to validate the normality condition. The result is shown in Table 7. No further data are reported for the normality test at this stage. Since the factors have effects on MAPE similar to those on the AIC response, it is considered acceptable to continue the optimization stage with only the AIC response.
According to the results of Table 8, four factors fulfill the homoscedasticity condition because their p-values are greater than 0.05. Because the Durbin-Watson statistic has a p-value greater than 0.05, the null hypothesis of no autocorrelation cannot be rejected; hence, the residuals have no linear correlation.
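The independence check applied to both responses rests on two simple quantities, which can be sketched in Python as follows. The residual values are invented for illustration, and the function names are assumptions.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def lag1_autocorrelation(residuals):
    """Sample autocorrelation of the residuals at lag 1."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / sum(d * d for d in dev)


# Invented residuals that alternate in sign (strong negative autocorrelation).
residuals = [1.0, -1.0, 1.0, -1.0]
print(durbin_watson(residuals))         # 3.0
print(lag1_autocorrelation(residuals))  # -0.75
```

Values of the Durbin-Watson statistic near 2 indicate little lag 1 autocorrelation, values near 0 indicate positive autocorrelation, and values near 4 indicate negative autocorrelation.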

Regression Model
According to the ANOVA, the interaction of two factors (C: number of neurons and E: validation percentage) is significantly related to the AIC behavior. It is now necessary to set an ideal configuration through an optimization process. To carry out this stage, the experimental design used has two factors with one replicate.
The results of the experiment are shown in Table 9. Equation (1) shows the fitted equation of the regression model for the AIC response. The coefficient test is conducted according to the result shown in Table 10.

Table 9. Levels of model analysis. Columns: Neu_Num (C), Val_Per (E), AIC.

Table 11 shows the grouped regression statistics of the adjusted model. $R^2$ is the coefficient of determination achieved. Table 12 shows that the p-value ($4.7736 \times 10^{-7}$) is less than the α used in the experiment; hence, the null hypothesis is rejected, indicating that the model fits the actual data. Figure 3 shows only the residuals for the relevant factor (number of neurons) identified in the curve fitting for the AIC response. The residuals show no clear structure.
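To illustrate the kind of fit involved, the Python sketch below estimates a straight-line model of a response against a single factor by ordinary least squares and reports the coefficient of determination. The observations are hypothetical and are not the values of Table 9; the function name is an assumption.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return b0, b1, 1.0 - ss_res / ss_tot


# Hypothetical AIC observations at three neuron levels (illustration only).
neurons = [1.0, 5.5, 10.0, 1.0, 5.5, 10.0]
aic = [2.1, 6.4, 11.2, 1.8, 6.9, 10.7]
intercept, slope, r_squared = fit_line(neurons, aic)
print(intercept, slope, r_squared)
```

Plotting the residuals of such a fit against the factor is the check described for Figure 3: a visible pattern would suggest that a linear term is not enough.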

Optimization Results
According to the regression model analysis, only the C factor is required for fitting the data to the AIC response. We propose comparing the level means to determine whether there is any statistical difference between levels. In this case, we propose minimizing the AIC response with a minimum number of neurons. Tables 13 and 14 show the results of the LSD (Least Significant Difference) analysis. The results show that there is no statistical difference between the levels of 1 and 5.5 neurons. However, the change when the number of neurons jumps to 10 is notable.
The optimal point, at which the system reaches a minimum AIC with a low number of neurons, is found with fewer than five neurons in the hidden layer, down to a single neuron.
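The pairwise comparison behind an LSD analysis can be sketched in Python as follows. The observations are hypothetical (they only mimic the pattern of two similar levels and one distinct level), the function name is an assumption, and the critical t value is supplied by the caller from a table or a statistics library.

```python
from itertools import combinations
from math import sqrt


def lsd_comparisons(groups, t_crit):
    """Fisher's LSD for all pairs of levels.

    groups: dict mapping level -> list of observations.
    t_crit: two-sided critical t value for the error degrees of freedom
            (total observations minus number of levels).
    Returns {(level_i, level_j): (abs_mean_difference, lsd, is_significant)}.
    """
    means = {lvl: sum(obs) / len(obs) for lvl, obs in groups.items()}
    n_total = sum(len(obs) for obs in groups.values())
    ss_within = sum((y - means[lvl]) ** 2
                    for lvl, obs in groups.items() for y in obs)
    mse = ss_within / (n_total - len(groups))
    results = {}
    for a, b in combinations(groups, 2):
        lsd = t_crit * sqrt(mse * (1 / len(groups[a]) + 1 / len(groups[b])))
        diff = abs(means[a] - means[b])
        results[(a, b)] = (diff, lsd, diff > lsd)
    return results


# Hypothetical AIC observations at the three neuron levels (not the paper's data).
aic_by_neurons = {1: [1.0, 2.0, 3.0], 5.5: [2.0, 3.0, 4.0], 10: [10.0, 11.0, 12.0]}
# Critical t for a two-sided 5% test with 6 degrees of freedom is about 2.447.
comparisons = lsd_comparisons(aic_by_neurons, 2.447)
for pair, (diff, lsd, significant) in comparisons.items():
    print(pair, round(diff, 3), round(lsd, 3), significant)
```

With these invented values, the two lowest levels are not distinguishable, while both differ from the highest level.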

Forecasting Model Testing
The historical energy demand data of an electricity market in Colombia (State of Atlántico) for the period from 1 January 2016 to 30 September 2019 (see Figure 4) were used to evaluate the performance of this proposal. Here, 70% of the data are taken for training, 15% for validation, and 15% for testing the time series. The models' performance is validated using a rolling window with a step k that depends on the number of models for each subset. For the validation process, the data are separated into small subsets to avoid data overlap during the training process. This ensures that the models do not see the validation data during training. Once the model training process is done, the test data are used to evaluate the model performance. The training data are selected randomly, following a uniform distribution. Weather data were acquired through the website www.accuweather.com (accessed on 30 September 2019).
The time series data of Figure 4 are used as a reference to evaluate the results from Section 7. Figures 5 and 6 show that an increase in the C factor only increases the model complexity, without guaranteeing better performance, mainly in terms of MAPE. Additionally, we show how it is possible to obtain better performance by varying the number of neurons, which lowers the computational cost because the training process requires less time. Table 15 shows the relationship between the number of neurons and the training time for data from the 24 periods of energy demand according to the considerations addressed in [19]. According to Table 15, a higher number of neurons in the training process implies an exponentially higher computational cost. Therefore, the proposed approach is key when defining an experimentation zone that reduces the time required for the training process while searching for an optimal performance value of the forecasting model.
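The 70/15/15 partition with uniformly random selection described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the seed, and the sample size are hypothetical, and the rolling-window validation step is not reproduced.

```python
import random


def split_indices(n, train=0.70, val=0.15, seed=0):
    """Randomly partition range(n) into disjoint train, validation, and test index lists.

    Indices are shuffled uniformly; the test set receives whatever remains
    after the training and validation shares are taken.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(n * train))
    n_val = int(round(n * val))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Hypothetical series length, for illustration only.
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Keeping the three index sets disjoint is what prevents the models from being trained on observations later used for validation or testing.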
The results of implementing an energy demand prediction model are shown in Figure 7, using the same methodology for the training process of the models as in [16]. The forecasting models obtained through the proposed methodology show better performance, since the area of this method's radar chart is smaller than that of the Atlántico electricity market operator.

Conclusions
The results of this experimental design allowed us to identify that factors such as the number of hidden layers, the quantity of training data, and the validation percentage are not relevant to the network performance in terms of AIC and MAPE. Through the ANOVA and the normal probability plot, it was possible to show initially which interaction had the most relevance. In both cases, the response variable depended only on the interaction between the number of neurons and the validation percentage.
After selecting the relevant effects, a $3^k$ design was used to detect curvature and determine the best model. In this case, only a linear model is necessary to describe the relationship between AIC and the number of neurons. To optimize the regression model, a maximum of five (5) neurons and an AIC of less than 10 are defined as constraints. The Excel Solver makes it possible to find this optimal point.
Finally, multiple comparison tests showed a large difference between levels 1 and 3. A significant difference in terms of better performance was demonstrated only for a low number of neurons.