Data Driven Natural Gas Spot Price Prediction Models Using Machine Learning Methods

Abstract: Natural gas has been proposed as a solution to increase the security of energy supply and reduce environmental pollution around the world. Being able to forecast the natural gas price benefits various stakeholders and has become a very valuable tool for all market participants in competitive natural gas markets. Machine learning algorithms have gradually become popular tools for natural gas price forecasting. In this paper, we investigate data-driven predictive models for natural gas price forecasting based on common machine learning tools, i.e., artificial neural networks (ANN), support vector machines (SVM), gradient boosting machines (GBM), and Gaussian process regression (GPR).


Introduction
The world total primary energy supply (TPES) by fuel in 2016 was as follows: oil (31.9%), coal (27.1%), natural gas (22.1%), biofuels and waste (9.8%), nuclear (4.9%), hydro (1.8%), and other (0.1%) [1]. Natural gas thus has the third-largest share of the TPES. Furthermore, natural gas production continues to grow at a high pace, most notably with a 3.6% increase in 2017 over 2016, the largest increase since 2010. In today's world, concerns about air quality and climate change are growing, but renewable energy is expanding at a limited rate and low-carbon energy sources are hard to find in some areas. Natural gas offers many potential benefits as a solution to environmental problems. Natural gas generates heat, power, and mobility with fewer emissions, including both carbon dioxide (CO2) emissions and air pollutants, than the other fossil fuels, helping to address widespread concerns over air quality. Because natural gas pollutes the environment less than other energy resources, it has received much more recognition recently. Natural gas exploitation has significantly helped many countries to reduce CO2 emissions nationally and globally since 2014 [2]. Natural gas, one of the most important energy resources, is going to play an expanded role in the future of global energy due to its significant environmental benefits.
Forecasting natural gas prices is a powerful and essential tool that has become increasingly important for the different stakeholders in the natural gas market, allowing them to make better decisions for managing potential risk and reducing the gap between demand and supply. The remainder of this paper presents the model validation techniques, model parameter selection, and empirical study results obtained with the prepared models. Finally, Section 5 presents the conclusions.

ANN
The neural network is a computational model created in 1943 by McCulloch and Pitts based on mathematics and threshold logic algorithms [18]. The neural network is not a single algorithm but rather a framework within which various machine learning algorithms can work cooperatively. ANN has risen in importance since the 1980s and has been a research hotspot in the field of artificial intelligence. ANN has widespread applications in data processing, classification, regression, function approximation, and numerical control.
ANNs process information like the neural network of the human brain, i.e., by connecting many simple units/nodes, called artificial neurons, to form complex networks [19]. Each node applies an activation function to produce an output value based on one or multiple inputs. The output signal of a node can be passed to another node over a weighted connection. Thus, the output of an ANN depends on the connection structure, the weight values, and the activation function [20]. A simple single-layer ANN is illustrated in Figure 1. Given one unit (an artificial neuron) j, let its input signals connected from other units be x_i (i = 1, 2, ..., n) with corresponding weights w_ij (i = 1, 2, ..., n). There are two basic operations in the processing unit, namely summation and activation of the input signals [18]. The unit's output y_j is defined as

y_j = f\left( \sum_{i=1}^{n} w_{ij} x_i + t_j \right),

where t_j is a bias for the unit j and f is the activation function, commonly taken to be the sigmoid function [21]:

f(x) = \frac{1}{1 + e^{-x}}.

The output y_j is then passed as an input signal to the connected units in the next layer. Given a designed architecture, all units in an ANN are interconnected in different layers. A simple example of a three-layer ANN is shown in Figure 2: the information flows through the input, hidden, and output layers. The input layer (the first layer) contains the same number of units as the input vector size, followed by the hidden layer with an arbitrary number of units. The output layer aggregates the weighted outputs from the hidden layer units.
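To make the forward pass concrete, the following is a minimal Python sketch of the summation-and-activation scheme above: a single hidden layer of sigmoid units feeding a linear output unit. The layer sizes and random weights are arbitrary placeholders, not the network configuration used in this paper.

```python
import numpy as np

def sigmoid(x):
    """Activation function f(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(x, w, t):
    """Single unit: weighted summation of inputs plus bias, then activation."""
    return sigmoid(np.dot(w, x) + t)

def forward(x, W1, t1, W2, t2):
    """Three-layer network: input -> hidden -> output."""
    h = sigmoid(W1 @ x + t1)   # hidden layer activations
    return W2 @ h + t2         # output layer aggregates weighted hidden outputs

rng = np.random.default_rng(0)
x = rng.normal(size=9)         # e.g., nine input variables
W1, t1 = rng.normal(size=(10, 9)), rng.normal(size=10)
W2, t2 = rng.normal(size=(1, 10)), rng.normal(size=1)
print(forward(x, W1, t1, W2, t2))
```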

SVM
SVM is based on statistical learning theory. It is a machine learning tool that solves a convex quadratic problem with linear constraints, in which no local minima exist. Linear support vector machines can be extended into nonlinear spaces by introducing the kernel method, which barely increases the computational cost for high-dimensional data. Vapnik and Chervonenkis proposed the VC theory, which motivated the development of support vector machines [22]. As the theoretical research developed and matured, Boser et al. proposed the SVM by harnessing statistical learning theory and realized the nonlinear support vector machine using kernel function theory [23]. With respect to the risk minimization principle, ANN employs empirical risk minimization while SVM relies on structural risk minimization. SVM has been widely used in various classification tasks, such as pattern recognition and image and text classification. In addition, SVM has been used for regression tasks with suitable loss functions [24].
SVM can be adapted to regression, in a variant called support vector regression (SVR). SVR solves regression problems on the basis of support vector machines [25,26]. SVR has powerful learning and generalization abilities, is suitable for small sample sets, and has achieved great success in forecasting work with promising results [27,28]. This is because it can adaptively process various tasks through generalization while still reaching satisfactory solutions. In particular, a nonlinear relation can be transformed into a linear one with SVR, so the task can be solved more easily. A drawback is that SVR becomes time-consuming as the task scale grows [29,30].
The decision function can be represented by

f(x) = \langle w, x \rangle + b,

where w, x, and b represent the weight vector, the input vector, and the bias, respectively. For the training input vectors and target outputs, let x_m and y_m be the m-th (m = 1, 2, ..., M) input and output, respectively. Let C be a positive parameter that penalizes the samples that do not satisfy the inequality constraints. The error function can be defined as

E = \frac{1}{2}\|w\|^2 + C \sum_{m=1}^{M} L_\varepsilon(y_m, f(x_m)).

The error function has two terms, which penalize model complexity and represent the ε-insensitive loss function, respectively. The loss function is mathematically given as

L_\varepsilon(y, f(x)) = \max(0, |y - f(x)| - \varepsilon).

This function does not penalize errors smaller than ε, which reduces model complexity. The solution that minimizes the error function can be written as

f(x) = \sum_{m=1}^{M} (\alpha_m - \alpha_m^*) \langle x_m, x \rangle + b,

in which the parameters α_m and α*_m are Lagrange multipliers. Support vectors among the training vectors have non-zero Lagrange multipliers and contribute directly to the solution, while non-support vectors do not. The number of support vectors can therefore be used as a measure of model complexity [31,32].
The key in SVR applications is the use of kernel functions, which map nonlinear data into spaces in which they are essentially linear for the optimization process. One can utilize a kernel function

K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle,

where φ is the feature mapping. The model then becomes

f(x) = \sum_{m=1}^{M} (\alpha_m - \alpha_m^*) K(x_m, x) + b.

Gaussian, polynomial, and hyperbolic tangent kernels can be used. The Gaussian kernel is one of the most popular kernel functions in empirical analysis, mathematically described as

K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right).
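As an illustration of how SVR with a Gaussian kernel can be applied in practice, the following is a small sketch using scikit-learn's SVR on synthetic data; the values of C, ε, and γ are illustrative, not the parameters tuned in this paper.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)   # noisy synthetic target

# kernel="rbf" implements the Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2)
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale")
model.fit(X, y)

# The number of support vectors is one measure of model complexity.
print("support vectors:", len(model.support_))
print("prediction at x=0:", model.predict([[0.0]])[0])
```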

GBM
Friedman proposed the concept of gradient boosting and introduced the gradient boosting machine (GBM) to extend boosting to regression [33-35]. GBM has many applications in regression and classification, generating a forecasting model as an ensemble of weak forecasting models (usually decision trees). GBM is a boosting method. The basic idea of boosting is that the base learners of each round pay more attention to the training samples misclassified in the previous round; GBM instead builds each model along the gradient descent direction of the previous model's loss function. The negative gradient serves as the measurable indicator of the previous round's errors, and the base learner of the next round is fitted to it to correct those errors. The difference from traditional boosting is that the purpose of each iteration is to reduce the residual of the previous one. To eliminate the residual, the new model is built along the gradient direction in which the residual decreases. Thus, in gradient boosting, each new model reduces the residual of the previous model along the gradient direction, which greatly differs from traditional boosting's reweighting of correct and wrong samples.
GBM has shown significant success in various practical applications across different fields. It can be considered a methodological framework, i.e., it is very flexible and customizable and can achieve high accuracy in many data mining tasks. However, in practice, the main issue of GBM is its memory consumption. Given an output y, an input x, a training set {(x_i, y_i)}_{i=1}^{n}, and a number of iterations M, we seek an approximation \hat{F}(x) to a function F(x) that minimizes the expected value of a specified loss function L(y, F(x)):

\hat{F} = \arg\min_{F} \mathbb{E}_{x,y}[L(y, F(x))].

The gradient boosting method assumes a real-valued y and seeks the approximation \hat{F}(x) as a weighted sum of base learners h_m(x), built up in a greedy way. At iteration m, the negative gradient is calculated for each sample:

r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}.

The base learner h_m(x) is then fitted to the r_{im} by minimizing the squared error. One can employ a line search for the multiplier ρ_m so as to minimize the loss function:

\rho_m = \arg\min_{\rho} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \rho\, h_m(x_i)\big).

Finally, the model is updated:

F_m(x) = F_{m-1}(x) + \rho_m h_m(x).
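The loop below is a minimal Python sketch of this procedure for the squared-error loss, where the negative gradient is simply the residual y − F(x); shallow regression trees serve as base learners, and a fixed learning rate stands in for the line search (a common simplification).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, lr=0.1):
    """Greedy stage-wise fitting: each tree fits the current residuals."""
    f0 = float(np.mean(y))                  # F_0: best constant under squared loss
    F = np.full(len(y), f0)
    learners = []
    for _ in range(n_rounds):
        residual = y - F                    # negative gradient of (1/2)(y - F)^2
        h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        F = F + lr * h.predict(X)           # update: F_m = F_{m-1} + rho * h_m
        learners.append(h)
    return f0, learners

def gb_predict(X, f0, learners, lr=0.1):
    """Sum the constant model and all scaled base-learner predictions."""
    return f0 + lr * sum(h.predict(X) for h in learners)
```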

GPR
GPR builds upon the Gaussian process (GP). A GP is a random process that generalizes the multivariate normal distribution to infinite-dimensional spaces and is therefore defined as a distribution over function values. It is a powerful nonlinear multivariate interpolation tool. Using the prior information of a GP for the inference of continuous values is referred to as GPR. GPR is an interpolation method in which the interpolation is controlled by prior covariances; under appropriate prior assumptions it gives the best linear unbiased prediction, and it can also be described in terms of Bayesian inference [36]. GPR was first proposed by Rasmussen et al.; it is an efficient machine learning algorithm based on kernel functions and provides a principled, practicable, and probabilistic method for kernel machine learning [37,38]. Compared with other machine learning methods, the advantage of GPR lies in seamlessly integrating multiple machine learning tasks such as parameter estimation; the regression is therefore considerably simplified, subjective influence on the result is slight, and the result is easier to explain. GPR performs well with relatively small training datasets and offers confidence intervals for its predictions by adopting flexible kernel functions. However, a known bottleneck is that the computational complexity of prediction is cubic in the number of training points, so it is infeasible for large datasets.
Let X be the matrix of input vectors and x_i be the i-th training input vector. Let K(X, X) be the covariance matrix whose entries are given pairwise by the kernel:

[K(X, X)]_{ij} = k(x_i, x_j).

The variable f follows the distribution

f \sim \mathcal{N}(\mu, K(X, X)),

which is a multivariate Gaussian density, where \mathcal{N}(\mu, \Sigma) denotes the normal density with mean \mu and covariance \Sigma. Additionally, noise can be added for the estimation of the target output y:

y = f(x) + \varepsilon,

where the noise ε follows an independent normal distribution with mean zero and standard deviation σ_n. Then, given an input vector x_* from the test data, the predicted output \bar{f}_* becomes

\bar{f}_* = K(x_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\, y.
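The prediction equations above can be sketched directly in NumPy. The squared exponential kernel, length scale, and noise level below are illustrative assumptions; a production implementation would typically use a Cholesky factorization and tuned hyperparameters.

```python
import numpy as np

def sq_exp_kernel(A, B, length=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * length^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def gpr_predict(X, y, X_star, sigma_n=0.1):
    """Predictive mean and variance at test inputs X_star."""
    K = sq_exp_kernel(X, X)
    K_star = sq_exp_kernel(X_star, X)
    A = K + sigma_n**2 * np.eye(len(X))
    # Predictive mean: K(x*, X) [K(X, X) + sigma_n^2 I]^{-1} y
    mean = K_star @ np.linalg.solve(A, y)
    # Predictive variance: k(x*, x*) - K(x*, X) [K + sigma_n^2 I]^{-1} K(X, x*)
    v = np.linalg.solve(A, K_star.T)
    var = sq_exp_kernel(X_star, X_star).diagonal() - (K_star * v.T).sum(1)
    return mean, var
```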

Data Preparation and Description
A great number of factors affect the natural gas price, e.g., crude oil, heating oil, drilling activity, temperature, natural gas supply, demand, storage, and imports. To better forecast the gas price, we should consider as many relevant factors as possible. To begin with, the relationship between the crude oil price and the natural gas price is a continuing research focus [39]. The two often exhibit a strong connection, and crude oil price fluctuations usually affect the natural gas price. Secondly, natural gas can be seen as an alternative energy resource to heating oil in US industry and power plants; thus, demand can shift between them based on market prices [40]. Thirdly, the prosperity of the natural gas market is reflected in gas drilling activity [41]. Fourthly, the price of natural gas is sensitive to temperature, and weather changes are also a key factor [42]; the demand for natural gas varies with the season. In addition, natural gas supply, demand, storage, and imports are obviously important indicators related to the natural gas price. Thus, crude oil price (WTI), heating oil price (HO), natural gas rotary rigs (NGRR), heating degree-days (HDD), cooling degree-days (CDD), natural gas marketed production (NGMP), natural gas total consumption (NGTC), natural gas underground storage volume (NGUSV), and natural gas imports (NGI) are used as inputs in this study. Natural gas price is the output variable; the Henry Hub Natural Gas Spot Price is adopted, obtained from the US EIA [43] and quoted in dollars per million Btu. Henry Hub is the earliest representative American natural gas spot trading centre and remains the largest; it has greatly benefited the development of the natural gas market and energy economic prosperity since its establishment. For the abovementioned variables, our dataset comprises monthly data from January 2001 to October 2018, with complete data for every month in this period, which guarantees the availability of the data.
Historical trends of natural gas and the relevant variables are shown in Figure 3, in which Figure 3a depicts the Henry Hub natural gas spot price from January 2001 to October 2018, a total of 214 records. During this period, the average natural gas price was $4.71, the median was $4.05, and the standard deviation was $2.22. As shown in Figure 3a, one can easily find a few sharp rises in the natural gas price before 2009. The first sharp rise appeared in the winter of 2003 and was mainly caused by extremely cold weather, consistent with storage reserves being substantially reduced, as shown in Figure 3i. The second upsurge appeared from October to December 2005, when hurricanes Katrina and Rita struck in succession. Hurricanes make the natural gas supply drop markedly: the lowest marketed production was in September 2005, only 1,400,941 million cubic feet, while in October it increased by 74,708 million cubic feet, as can be observed in Figure 3g. The third rapid rise occurred before the global financial crisis in 2008 and was, to a great extent, caused by an increase in crude oil prices that led to strong demand for natural gas, as shown in Figure 3b-d. Since 2009, the natural gas price in the US has remained at a low level, mainly for two reasons: one is that demand for natural gas has not been urgent; the other is oversupply due to the rapid development of shale gas. The fluctuation range of HDD is roughly double that of CDD, so the uncertainty of temperature changes in winter is greater than that in autumn in the US, affecting the expectation of natural gas demand in winter. It can be seen from Figure 3g,i that the supply of natural gas basically went up and the storage volume increased continually during the observation period, which results from the shale gas revolution in the US. Total consumption of natural gas shows a rising trend, while imports show the opposite. Table 2 lists the descriptive statistics for the natural gas spot price (NGSP) and the relevant variables, including the max, min, median, mean, standard deviation (SD), and relative standard deviation (RSD). From Table 2, the standard deviation of the natural gas price is far smaller than that of the other variables, except for the heating oil price, while its relative standard deviation is at the mid-level among all the variables. Table 3 shows the correlations among the variables.
Table 3 shows that, with respect to the correlation between the natural gas price and the other variables [45], the strongest correlation is with the number of rotary rigs, followed by marketed production and imports. The weakest correlations are with heating degree-days and cooling degree-days. Among the relevant variables, the crude oil price and the heating oil price have the strongest correlation. Strong correlations also occur between the number of rotary rigs and marketed production, the number of rotary rigs and imports, heating degree-days and cooling degree-days, heating degree-days and total consumption, and marketed production and imports. Weak correlations occur between the crude oil price and total consumption, the number of rotary rigs and heating degree-days, and the number of rotary rigs and cooling degree-days; the weakest correlation is between the number of rotary rigs and the heating oil price.
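For readers reproducing summaries of this kind, the following pandas sketch computes Table 2-style descriptive statistics and a Table 3-style correlation matrix, assuming the monthly series have been gathered into a CSV file; the file name and column layout here are hypothetical.

```python
import pandas as pd

# Hypothetical file: one row per month, one column per variable.
df = pd.read_csv("henry_hub_monthly.csv", parse_dates=["date"], index_col="date")
cols = ["NGSP", "WTI", "HO", "NGRR", "HDD", "CDD", "NGMP", "NGTC", "NGUSV", "NGI"]

# Descriptive statistics: max, min, median, mean, SD, and RSD per variable.
stats = df[cols].agg(["max", "min", "median", "mean", "std"]).T
stats["RSD"] = stats["std"] / stats["mean"]
print(stats)

# Pairwise Pearson correlations among all variables.
print(df[cols].corr())
```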

Forecasting Performance Evaluation Criteria
There exist many model evaluation criteria; we choose several classic statistical criteria for testing model forecasting performance. R-square (R^2) is often used for measuring the goodness-of-fit [46]:

R^2 = 1 - \frac{\sum_{t=1}^{N} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{N} (y_t - \bar{y})^2},

where y_t and \hat{y}_t represent the actual value and the predicted value at time t, respectively, \bar{y} is the mean of the actual values, and N is the total number of samples in the dataset. R^2 reflects the quality of the fit through the explained variation in the data. The value of R^2 ranges from zero to one, and the regression fit becomes better as R^2 approaches 1 [47,48]. In general, a value greater than 0.8 indicates that the model has a sufficiently high goodness-of-fit. In addition, four criteria are used for measuring the forecasting errors.
Mean absolute error (MAE) is a common forecasting performance measure obtained by averaging the absolute errors [49,50]:

MAE = \frac{1}{N} \sum_{t=1}^{N} |y_t - \hat{y}_t|.

Mean square error (MSE) is the average of the squared deviations of the forecast values from the actual values and is good at measuring the degree of change [51,52]. The prediction model achieves better accuracy as the MSE value decreases. In contrast to MAE, MSE enlarges large prediction deviations, which helps compare the stability of different prediction models. It is represented as

MSE = \frac{1}{N} \sum_{t=1}^{N} (y_t - \hat{y}_t)^2.

Root-mean-square error (RMSE) is obtained directly from MSE by taking the square root [53,54]. It is very sensitive to very large or very small errors and thus reflects precision well:

RMSE = \sqrt{\frac{1}{N} \sum_{t=1}^{N} (y_t - \hat{y}_t)^2}.

Mean absolute percentage error (MAPE) is often used as a loss function, since it intuitively expresses relative errors [55,56]. It considers not only the deviation between the predicted and actual values but also the ratio between them:

MAPE = \frac{100\%}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right|.

For all of the above error indices (MAE, MSE, RMSE, and MAPE), the smaller the value, the better the forecasting accuracy of the model.
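The five criteria can be computed in a few lines of NumPy, as in this sketch (assuming y_true and y_pred are one-dimensional arrays of actual and predicted prices):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return the five criteria defined above as a dictionary."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = 100.0 * np.mean(np.abs(err / y_true))   # in percent
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}
```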

Model Validation Technique
Cross-validation is an important model selection method for obtaining feasible and stable models in machine learning. Among cross-validation schemes, K-fold validation is a common method for preventing over-fitting when the model is too complex. It has been demonstrated that, compared with the holdout method and leave-one-out cross-validation, K-fold validation has advantages in accuracy estimation and model selection [57]. In K-fold validation, we partition the training dataset into K equal folds, of which K − 1 folds are employed for model training while the remaining fold is used for validation. This training and validation procedure is repeated by rotating through the K folds; after carrying it out K times, we aggregate all the errors to compute the final error. In practice, K is often taken as 10 [57].
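A minimal sketch of this procedure with scikit-learn, assuming features X and target y as NumPy arrays and any scikit-learn style estimator:

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_mse(model, X, y, k=10):
    """Average validation MSE over K rotations of the folds."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])    # train on K - 1 folds
        pred = model.predict(X[val_idx])         # validate on the held-out fold
        errors.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(errors))                # final error over all K folds
```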

Model Parameter Selection
Machine learning methods involve many parameters, some of which are key parameters and must be carefully selected. The complexity of a model is often related to these key parameters, which are also called model selection parameters. For the ANN, we choose a nonlinear autoregressive model with exogenous input. The number of hidden neurons and the number of delays are set to 10 and 2, respectively. For network training, the chosen algorithm is Levenberg-Marquardt, which typically needs more memory but less time. Based on the MSE of the validation samples, training stops automatically once generalization stops improving.
The kernel function must be carefully selected in the SVM model. To pick out the appropriate kernel function, our strategy is to test different functions and keep the one with the minimum error. Accordingly, six common kernel functions, namely the linear, quadratic, cubic, fine Gaussian, medium Gaussian, and coarse Gaussian types, are investigated, as shown in Table 4. Over one hundred experiments point to the cubic type, which yields better results than the others overall and thereby serves as the kernel function of the SVM in this study. GBM requires a reasonable loss function. The gradient boosting strategy supports several popular loss criteria: least squares, least absolute deviation, Huber, logistic binomial log-likelihood, etc. We choose least squares as the loss function, since it is the most suitable for our regression task, while criteria such as the binomial log-likelihood are usually applied in classification. In addition, a covariance (kernel) function is needed in GPR, such as the Gaussian kernel or the squared exponential kernel used in many machine learning algorithms. In this paper, the squared exponential kernel is used, since it yields better results.
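The kernel-selection strategy can be expressed as a simple search loop. The sketch below approximates the six kernel variants with scikit-learn options; the γ values used to mimic the fine/medium/coarse Gaussian kernels are illustrative guesses, not the paper's settings, and kfold_mse is the helper from the cross-validation sketch above.

```python
from sklearn.svm import SVR

# Candidate kernels approximating the six variants in Table 4.
candidates = {
    "linear": SVR(kernel="linear"),
    "quadratic": SVR(kernel="poly", degree=2),
    "cubic": SVR(kernel="poly", degree=3),
    "fine Gaussian": SVR(kernel="rbf", gamma=4.0),
    "medium Gaussian": SVR(kernel="rbf", gamma=1.0),
    "coarse Gaussian": SVR(kernel="rbf", gamma=0.25),
}

# Keep the kernel with the lowest cross-validated error (X, y: prepared data).
# best = min(candidates, key=lambda name: kfold_mse(candidates[name], X, y))
```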

Empirical Study Results
Based on the aforementioned machine learning methods, prepared data, forecasting performance evaluation criteria, model validation technique, and selected model parameters, the empirical study is carried out. Table 5 lists the forecasting performance of the four machine learning methods, where each value is the average of one hundred tests for more reliable results. Observing the four criteria, one can easily find that the forecasting performance of ANN and SVM is better than that of GBM and GPR. In particular, ANN is obviously superior to the other methods, while GBM behaves the worst. Overall, the performance ranking from strong to weak is ANN, SVM, GPR, and GBM. Figure 5 compares the distributions of the predicted and actual values. It can be seen that the actual normal price is centred around 2 to 8, and the four methods give good predictions for normal values; the fluctuation of ANN is smaller than that of the others. However, the methods are unsatisfactory when the price rises above 8, e.g., from September to December 2005 and from April to July 2008. The prediction of abnormal values remains a difficult task. From the above results, we can see clear differences among the four machine learning models, giving a general ranking reference. Overall, ANN and SVM have better prediction performance than GBM and GPR, with ANN the best and GBM the worst. ANN has good self-learning, self-adapting, and self-organizing abilities [58]: it can analyze the patterns and rules of the observed data, form complex nonlinear functions through training, and adapt to large-scale, multi-factor, incomplete, and inaccurate data processing. The number of delays in the ANN is 10 in this study, which is greater than in traditional ANNs (at most 3 delays); that may be one reason that ANN is superior to SVM. SVM has an abundant theoretical foundation and strong approximation and generalization abilities based on structural risk minimization; it targets small samples, with the optimal solution based on a limited sample. GPR has a simplified regression process and easily explained results; it performs favourably in accuracy and stability, but it is suitable for relatively small training datasets. GBM has relatively weaker performance than the above methods in forecasting natural gas prices; however, GBM allows us to obtain the degree of importance of the explanatory variables.
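For reference, a comparison protocol of this kind can be sketched as follows, averaging each model's test RMSE over repeated random splits. The scikit-learn estimators below are stand-ins for the paper's implementations (e.g., MLPRegressor approximates the NARX-style ANN), and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

models = {
    "ANN": MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000),
    "SVM": SVR(kernel="poly", degree=3),                     # cubic kernel
    "GBM": GradientBoostingRegressor(loss="squared_error"),  # least-squares loss
    "GPR": GaussianProcessRegressor(),
}

def average_rmse(model, X, y, runs=100):
    """Average test RMSE over repeated random train/test splits."""
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model.fit(X_tr, y_tr)
        err = y_te - model.predict(X_te)
        scores.append(float(np.sqrt(np.mean(err ** 2))))
    return float(np.mean(scores))
```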

Conclusions
The aim of this study was to investigate natural gas price forecasting with four machine learning methods (ANN, SVM, GBM, and GPR). Monthly Henry Hub natural gas spot price data from January 2001 to October 2018 (214 observations) were used in the four prediction methods. Nine variables were investigated as inputs, namely WTI, HO, NGRR, HDD, CDD, NGMP, NGTC, NGUSV, and NGI, with NGSP as the output. K-fold cross-validation was used in model training. Five forecasting performance evaluation criteria, namely R^2, MAE, MSE, RMSE, and MAPE, were employed to assess the prediction methods. The empirical results demonstrate that the four prediction methods all achieve decent performance in forecasting the natural gas price. Overall, ANN and SVM have better forecasting performance than GBM and GPR; in particular, ANN obviously outperforms the other methods, while GBM is the worst. This study could be extended by more thorough research, e.g., by comparing more aspects of prediction performance, such as computational efficiency, and by using more diverse machine learning methods, such as random forests [59,60]. For future work, we will evaluate the effect of emerging machine learning algorithms, such as deep learning and reinforcement learning, on energy price and correlation prediction.