An Enhancement Method Based on Long Short-Term Memory Neural Network for Short-Term Natural Gas Consumption Forecasting

Abstract: Artificial intelligence models have been widely applied for natural gas consumption forecasting over the past decades, especially for short-term consumption forecasting. This paper proposes a three-layer neural network forecasting model that can extract key information from input factors and improve the weight optimization mechanism of long short-term memory (LSTM) neural network to effectively forecast short-term consumption. In the proposed model, a convolutional neural network (CNN) layer is adopted to extract the features among various factors affecting natural gas consumption and improve computing efficiency. The LSTM layer is able to learn and save the long-distance state through the gating mechanism and overcomes the defects of gradient disappearance and explosion in the recurrent neural network. To solve the problem of encoding input sequences as fixed-length vectors, the layer of attention (ATT) is used to optimize the assignment of weights and highlight the key sequences. Apart from the comparisons with other popular forecasting models, the performance and robustness of the proposed model are validated on datasets with different fluctuations and complexities. Compared with traditional two-layer models (CNN-LSTM and LSTM-ATT), the mean absolute range normalized errors (MARNE) of the proposed model in Athens and Spata are improved by more than 16% and 11%, respectively. In comparison with single LSTM, back propagation neural network, support vector regression, and multiple linear regression methods, the improvement in MARNE exceeds 42% in Athens. The coefficient of determination is improved by more than 25%, even in the high-complexity dataset, Spata.


Introduction
As an energy source with high energy density and low carbon emissions, natural gas plays a vital role in achieving the "Carbon Peaking and Carbon Neutrality" targets [1]. Global natural gas consumption has risen continuously, increasing by an average of 2.9% per year over the past 10 years; in 2021, natural gas accounted for 24.7% of primary energy consumption [2]. Natural gas consumption forecasting, especially daily short-term forecasting, is crucial for pipeline optimization, gas distribution, and the economic feasibility analysis of natural gas pipeline systems [3]. As contracts between dispatchers and users are generally settled on a daily basis, the dispatch management, planning, and operation optimization of a gas pipeline system rely on accurate daily consumption forecasting [4]. An accurate and automatic daily forecasting system is also an important part of the intelligent pipeline network systems being implemented in many cities. Therefore, the accuracy of daily gas consumption forecasting methods is of great significance for improving the management and economics of gas supply systems and ensuring a safe and uninterrupted gas supply [5].
Since Verhulst et al. [6] developed the first demand forecasting model for French gas demand in 1950, various short-term consumption forecasting models have been proposed. These models can generally be divided into conventional statistical methods and artificial intelligence (AI)-based methods. The conventional methods include multiple linear regression (MLR) [7], the gray model [8], the autoregressive moving average model [9], etc. Artificial intelligence methods include support vector machines [10] and artificial neural networks (ANNs). With the improvement in computer processing capabilities, AI forecasting models based on machine learning overcome the difficulty conventional statistical methods have in learning complex patterns from nonlinear time series data [11], giving rise to multiple improved neural network and support vector regression (SVR) models. Compared with traditional neural networks, recurrent neural networks (RNNs) enhance the ability to save historical information [12]. As a special branch of RNN, the long short-term memory (LSTM) neural network solves the problem of gradient disappearance and explosion during long-distance transmission in RNNs [13] and is considered among the most popular time series forecasting methods in various fields.
Muzaffar and Afshari [14] utilized LSTM to forecast daily consumption for the following seven and thirty days. The results indicated that the LSTM model outperformed conventional statistical models and achieved the lowest mean absolute percentage error (5.97% and 9.75%). Kong et al. [15] proved that the performance of LSTM was better than that of conventional neural network models for forecasting the short-term consumption of a single residential home. Peng et al. [16] proposed a novel combined forecasting model integrating local mean decomposition, wavelet threshold denoising, and LSTM approaches to forecast daily consumption in London. Compared with single methods, the proposed combined model presented an excellent performance in short-term natural gas consumption forecasting.
However, recent studies have found that the LSTM model has two inherent shortcomings: (1) it cannot extract key features from input factors [17], and (2) it also cannot overcome the defect of encoding input sequences as a fixed-length hidden vector [18].
In order to draw major components from the input factors, feature selection and feature extraction methods are used to eliminate the redundant components of time series features and improve the computing efficiency of LSTM [19]. Wei et al. [20] suggested an enhanced principal component analysis method and integrated it with LSTM to forecast daily consumption in Xi'an, China, and Athens, Greece. The proposed method extracted key information from the input factors, eliminated redundant components, and minimized the data dimension. Wu et al. [21] constructed a novel feature extraction method that improved the accuracy of LSTM by 49% and was at least 17% superior to other forecasting methods. In terms of feature selection methods, the elastic-net regularized generalized linear model, the spike-slab lasso method, and the Bayesian model average method were compared by Lu et al. [22] to select appropriate features for the input of the LSTM neural network. The results implied that the accuracy of the short-term consumption forecasting model was clearly enhanced by combination with a feature selection method. Although feature selection methods reduce the feature dimension by examining the relationships between features, the filter strategy may eliminate critical information that has a great impact on time series forecasting results [17]. Additionally, feature extraction approaches evaluate only the spatial characteristics between features; the temporal characteristics between samples, which are crucial for time series forecasting, are not considered [23].
Furthermore, to address the second shortcoming of the LSTM model, multiple variants, such as bidirectional LSTM and stacked LSTM, were developed to optimize the model structure. Shahid et al. [18] utilized bidirectional LSTM, which consists of forward and backward LSTM neural networks, to forecast confirmed cases of COVID-19. The outputs of the forward and backward hidden vectors at each moment were concatenated to represent a fuller hidden layer output; the experimental results indicated that bidirectional LSTM enhanced the structure of the hidden vectors and exhibited better performance. Sebt et al. [24] applied stacked LSTM, comprising multiple LSTM layers, to forecast the number of customer transactions and concluded that their stacked LSTM model was superior to recurrent neural network, Prophet, and autoregressive integrated moving average (ARIMA) models. However, neither bidirectional LSTM nor stacked LSTM overcomes the defect that the typical encoding-decoding LSTM model encodes the input sequences into fixed-length hidden vectors while learning and saving the long-distance state [25]. As the length of the input sequence increases, each hidden vector is still assigned the same weight, and the LSTM model cannot distinguish the importance among hidden vectors [26]. Some crucial spatial and temporal information is therefore ignored during training, resulting in worse performance. Thus, a mechanism is required to assist the LSTM model in evaluating the significance of the hidden vectors at different moments in the input sequence and highlighting the key factors in the hidden vectors.
Motivated by the above analysis, the aim of this paper is to provide an effective feature extraction method that can extract temporal and spatial characteristics for the input of LSTM and solve the problem that the hidden vectors of LSTM share the same weight. The proposed method suggests a novel LSTM optimization framework, which comprises the convolutional neural network, LSTM, and attention mechanism. The convolutional neural network is used to extract the major components of the input sequence from the perspectives of spatial and temporal characteristics and reconstruct a new feature pattern for LSTM input. Then, the LSTM neural network is applied to forecast time series data. The attention mechanism is set behind LSTM to evaluate the significance of hidden vectors at different moments in the input sequence. It can adaptively draw hidden vectors from the LSTM layer and assign varying attention to hidden vectors at different moments so as to highlight the major components. Additionally, to evaluate the robustness and accuracy of the proposed model, we design two real-life scenarios with different fluctuation characteristics and analyze the effect on various types of datasets.

Methodology
This section describes the algorithm and framework of the combination model used in this paper. The proposed three-layer neural network forecasting model consists of three approaches: convolutional neural network (CNN) algorithm for improving the input of LSTM, LSTM benchmark algorithm, and attention mechanism (ATT) for optimizing the weight distribution of hidden vectors. The strategy and framework of the proposed CNN-LSTM-ATT model are presented in Section 2.4.

Convolutional Neural Network
The convolutional neural network (CNN) is a deep learning-based neural network that is typically used to analyze data with a known grid topology in fields such as time series analysis, computer vision, and natural language processing. The structure of the CNN consists of an input layer, a convolutional layer (kernel and convolutional output), a pooling layer, and an output layer [27]. A schematic of the convolutional neural network is depicted in Figure 1.
It can be seen from Figure 1 that the convolutional layer is the major component of CNN; it extracts spatial and temporal characteristics from the input features based on the predefined convolutional kernel and uses an activation function to perform a nonlinear transformation on each convolutional result to map the initially linearly indistinguishable multidimensional features to another space. The pooling layer is set behind the convolutional layer. The feature map calculated from the convolutional layer is scanned in a step-by-step manner. Then, the maximum value within the filter is captured in turn to reduce the number of connections between neurons in the convolutional layer and perform secondary feature extraction on the input features.
CNN can automatically learn spatial and temporal features from the input data and has the advantages of local connection, weight sharing, pooling operation, and a multi-layer structure, which simplify the complexity of the LSTM input. The gradient descent optimization approach applied in CNN reduces overfitting and provides better generalizability. For certain sequences, the effect of one-dimensional convolution is comparable to that of recurrent neural networks at a lower computing cost [28]. The equations for the convolutional layer and pooling layer are as follows:

Convolutional layer:

$$x_i^l = f\left(\sum_{j \in M_j} x_j^{l-1} * k_{ij}^l + b_i^l\right)$$

Pooling layer:

$$P_i^{l+1}(j) = \max_{(j-1)w < t \le jw} q_i^l(t)$$

where $x_i^{l-1}$ and $x_i^l$ represent the output of layer $(l-1)$ and layer $l$, respectively; $k_{ij}^l$ is the convolution kernel of layer $l$; $b$, $M_j$, and $w$ represent the bias, the input feature vector set, and the width of the pooling area, respectively; $q_i^l(t)$ is the value of the $t$th neuron in the $i$th feature vector of layer $l$; and $P_i^{l+1}(j)$ is the value of the corresponding neuron of layer $(l+1)$.
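As a minimal illustration of the two layer equations, the sketch below applies a valid one-dimensional convolution (implemented as cross-correlation, as deep-learning frameworks do) with a ReLU activation, followed by non-overlapping max pooling. The kernel and input values are arbitrary toy numbers, not values from this paper.

```python
import numpy as np

def conv1d(x, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation) of a feature sequence x,
    followed by a ReLU activation, i.e. the nonlinear map f in the
    convolutional-layer equation."""
    k = len(kernel)
    out = np.array([np.dot(x[t:t + k], kernel) + bias
                    for t in range(len(x) - k + 1)])
    return np.maximum(out, 0.0)

def max_pool1d(q, w):
    """Max pooling with window width w and stride w: each output neuron
    keeps the maximum of one non-overlapping window of the feature map."""
    return np.array([q[j * w:(j + 1) * w].max() for j in range(len(q) // w)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # toy input sequence
feat = conv1d(x, kernel=np.array([-0.5, 0.5]))  # half the first difference
pooled = max_pool1d(feat, w=2)
```

The pooling step halves the length of the feature map, which is the "secondary feature extraction" referred to in the text.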

Long Short-Term Memory
The long short-term memory (LSTM) algorithm was proposed by Hochreiter and Schmidhuber [29] in 1997. Its principle is to continuously update the weights through a gating mechanism so as to learn and save long-term memory of time series data. It avoids the gradient disappearance and explosion problems of RNNs and is among the most popular models for time series forecasting. The cell structure of LSTM is shown in Figure 2. The equations of LSTM can be described as follows:

(1) Forget gate: the forget gate reflects the ability to learn which historical information to retain or discard.

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

(2) Input gate: the input gate performs the selectivity of the memory module by utilizing a nonlinear function to determine which portion of the input information will be stored.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Update cell status:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

(3) Output gate: the function of the output gate is to update the hidden state, including selective learning and preservation of historical data. The new cell state and hidden vector state are transmitted to the subsequent time step.

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$

where $\tilde{C}_t$, $C_t$, and $C_{t-1}$ represent the candidate, current, and previous cell states, respectively; $x_t$ and $h_t$ are the input and hidden vectors at time $t$; $\sigma$ is the sigmoid function; and $W$ and $b$ are the weights and biases, respectively.
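The gate equations can be traced in a few lines of NumPy. The sketch below runs a single LSTM cell for a handful of time steps with small random, untrained weights, purely to show the data flow through the gates; the dimensions and weight values are illustrative assumptions, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above; W holds the
    four gate weight matrices acting on the concatenated [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W[0] @ z + b[0])        # forget gate
    i_t = sigmoid(W[1] @ z + b[1])        # input gate
    c_tilde = np.tanh(W[2] @ z + b[2])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # updated cell state
    o_t = sigmoid(W[3] @ z + b[3])        # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden vector
    return h_t, c_t

rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
W = 0.1 * rng.normal(size=(4, n_hidden, n_hidden + n_in))  # untrained weights
b = np.zeros((4, n_hidden))
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for _ in range(5):                        # propagate five random inputs
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```

Because the hidden vector is the output gate times a tanh of the cell state, every component of h stays strictly inside (-1, 1), while the cell state c can accumulate longer-range information across steps.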

Attention Mechanism
The attention mechanism is derived from the principle of human visual focus, simulating how humans unconsciously concentrate on the key positions in a picture [30]. Given this, the attention mechanism is designed to concentrate limited computing ability on the key information in the collected data so that it can save computing costs and process information efficiently. Its essence is to learn an appropriate weight distribution over the input features so that the model focuses on high-weight features and pays less attention to low-weight features [31]. Figure 3 presents the structure of the attention mechanism. Its process can be separated into three stages. The first stage calculates the similarity $S_i$ between the query $Y_i$ and each $H_i$. In the second stage, softmax normalization is performed on the similarity scores to obtain the weight coefficients. In the third stage, the attention value is calculated as the weighted sum of the $H_i$:

$$S_i = \mathrm{sim}(Y_i, H_i)$$

$$\alpha_i = \frac{\exp(S_i)}{\sum_{j=1}^{n} \exp(S_j)}$$

$$C_i = \sum_{j=1}^{n} \alpha_j H_j$$

where $H_i$ is the output of the LSTM hidden layer; $S_i$ represents the similarity score; $W$ and $\alpha_i$ are the weight matrix used inside the similarity function and the weight coefficient, respectively; and $C_i$ represents the attention value.

Strategy of the Proposed Model
Figure 4 describes the strategy of the proposed enhancement method for LSTM. The first step is to add a CNN layer to optimize the input of LSTM: the convolution operation in CNN adaptively extracts spatial and temporal features from the sample data and transmits them to the LSTM model as the optimized input. The LSTM layer then learns and saves the long-distance state of the input through the gating mechanism. To solve the problem that the LSTM model assigns the same weight to every hidden vector h, the third layer is designed as the attention layer. During the decoding process, the attention mechanism evaluates the importance of the different hidden vectors h and assigns appropriate weights to them. In other words, the model learns the weight distribution of the hidden vectors in LSTM and pays more attention to high-weight features, which improves forecasting performance and efficiency. Furthermore, the model limitations are discussed to complete our research.
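The attention layer's weight reassignment can be sketched in a few lines. The bilinear similarity score used below (a query vector times a weight matrix times each hidden vector) is one common choice, assumed here for illustration; the toy hidden vectors are random, not LSTM outputs from the paper's model.

```python
import numpy as np

def attention(H, W, y):
    """Score each hidden vector H[i] against a query y with a bilinear
    similarity S_i = y^T W H_i, softmax-normalize the scores into weights
    alpha, and return the attention-weighted context vector."""
    S = np.array([y @ W @ h for h in H])          # similarity scores
    S = S - S.max()                               # numerical stability
    alpha = np.exp(S) / np.exp(S).sum()           # softmax weights
    context = (alpha[:, None] * H).sum(axis=0)    # weighted sum of H
    return context, alpha

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 4))                       # 6 time steps, 4-dim hidden
H = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-norm for the demo
W = np.eye(4)                                     # identity similarity matrix
y = H[2]                                          # query resembling step 2
context, alpha = attention(H, W, y)               # step 2 gets the top weight
```

The softmax guarantees that the weights are positive and sum to one, so the context vector is a convex combination of the hidden vectors, with the steps most similar to the query contributing most.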

Results and Discussion
The purpose of the experiments is to reveal the enhancement effect of the convolutional neural network and attention mechanism on LSTM and to validate the model's performance and robustness in terms of accuracy. Accordingly, our experiments are divided into two parts. The first part validates the performance of the proposed model against LSTM combined with only one enhancement method (attention or CNN). The second part evaluates the performance of the proposed model against four popular forecasting models, namely single LSTM, back-propagation neural network (BPNN), SVR, and MLR, and validates the model's robustness on two datasets with different complexities.

Evaluation Methods
Evaluation indicators used in regression forecasting include MARNE (mean absolute range normalized error), MAPE (mean absolute percentage error), R² (coefficient of determination), MSE (mean squared error), MAE (mean absolute error), and RMSE (root-mean-square error). To validate the model's robustness on the two designed datasets, MSE, MAE, and RMSE, which depend on the order of magnitude of the data, are not suitable for evaluating the forecasting results. Thus, MARNE, MAPE, and R² are utilized as the major evaluation indicators in our research:

$$\mathrm{MARNE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{\max(y)} \times 100\%$$

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100\%$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\hat{y}_i$, $y_i$, and $\bar{y}$ represent the forecast data, the actual data, and the average of the actual data, respectively; and $n$ is the number of samples.
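The three indicators are straightforward to implement. Note that MARNE conventions vary across papers; normalizing by the peak of the actual series, as below, is an assumption here, and the toy arrays are illustrative only.

```python
import numpy as np

def marne(y_true, y_pred):
    """Mean absolute range normalized error (%), normalized here by the
    peak of the actual series (an assumed convention)."""
    return 100.0 * np.mean(np.abs(y_pred - y_true)) / np.max(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error (%)."""
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0   # every forecast off by one unit
```

Because MARNE divides by a single scale of the series rather than by each point, it stays stable near low-consumption days, where MAPE can blow up.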

Data Description
To evaluate the robustness and performance of the proposed CNN-LSTM-ATT model, it is validated on two datasets with different complexities and fluctuations. The designed datasets are collected from two nodes in the Greek gas pipeline network and range from 1 January 2018 to 31 December 2021. The consumption curves are shown in Figure 5. The complexity of consumption is measured by sample entropy, an index developed specifically to describe the complexity of time series data [32]. High sample entropy means high complexity, indicating that the fluctuation of the time series is more complex; it is a widely used complexity measure [33]. Table 1 shows the sample entropy and the training and testing data sizes of the designed datasets. It can be seen from Figure 5 that Athens has an obvious seasonal periodic trend. Its fluctuation is mainly concentrated from December to March of the following year and is most complicated in 2021, the forecasting target of this paper. The fluctuation of Spata is clearly more complicated than that of Athens: it presents only a weak seasonal periodic trend, and the fluctuations between data points are more concentrated. The sample entropies in Table 1 also show that Spata (2.10) has nearly twice the complexity of Athens. Thus, Athens and Spata are time series datasets with different complexities from the perspectives of both consumption curve and sample entropy, making them well suited to validating the robustness of the proposed model.
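Sample entropy itself can be sketched as follows (SampEn with embedding dimension m = 2 and tolerance r = 0.2 times the standard deviation, the usual defaults). The two toy series are illustrative, not the Athens and Spata data: a smooth periodic series should score well below white noise.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """SampEn(m, r): negative log of the conditional probability that
    sequences matching for m points also match for m + 1 points, with
    tolerance r = r_factor * std(x)."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)

    def match_count(mm):
        t = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        total = 0
        for i in range(len(t)):
            d = np.max(np.abs(t - t[i]), axis=1)   # Chebyshev distances
            total += int(np.sum(d <= r)) - 1       # exclude self-match
        return total

    B, A = match_count(m), match_count(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

steps = np.arange(400)
regular = np.sin(2 * np.pi * steps / 50)                # smooth periodic series
irregular = np.random.default_rng(2).normal(size=400)   # white noise
```

Since nearly every length-2 match in the sine wave extends to a length-3 match, its A/B ratio is close to one and its entropy is low; for noise the ratio drops sharply, so the entropy is much higher.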
Influencing factors affecting the fluctuation of natural gas consumption can be selected as input factors for model training. Wei et al. [34] collected 19 weather profiles from the weather company of IBM for daily consumption forecasting. Sabo et al. [35] analyzed the implicit, explicit, functional, and linear dependence of natural gas consumption on temperature and demonstrated that natural gas consumption and temperature are explicitly related. According to the natural gas consumption forecasting reviews conducted by Soldo [36], Tamba et al. [37], and Liu et al. [38], short-term gas consumption is significantly affected by weather profiles and temperature. Thus, factors related to temperature and weather profiles are used as input data for model training to improve the nonlinear fitting ability. This work focuses on daily natural gas consumption; therefore, all data mentioned are daily data unless otherwise specified.

Experimental Setup
Considering that the length of the time series is 1096, the LSTM benchmark forecasting model used in this paper is set to a single LSTM layer. Maximum pooling, as a method of secondary extraction after the convolutional layer, is not appropriate for regression forecasting with a limited data size: two successive feature extractions may distort the original fluctuations, causing the model to learn incorrect patterns and reducing forecasting accuracy. Thus, the feature extraction layer is set to a single CNN layer. The attention layer is set behind the LSTM layer to reassign different weights to the hidden vectors of LSTM. The numbers of neurons in the CNN and LSTM layers, as well as the batch size and number of epochs, are determined by scikit-learn's GridSearchCV function [39] with five-fold cross-validation (cv = 5). Other parameters used in the models are determined by trial and error.
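The exhaustive-search-with-k-fold-cross-validation idea behind GridSearchCV can be illustrated without scikit-learn or Keras. The sketch below tunes a single hyperparameter (the window length of a naive moving-average forecaster) by five-fold blocked cross-validation on a toy weekly-cycle series; the forecaster, series, and grid are hypothetical stand-ins for the network hyperparameters tuned in the paper.

```python
import numpy as np

def cv_score(series, window, k=5):
    """Mean squared one-step-ahead error of a moving-average forecaster,
    averaged over k contiguous validation folds (blocked cross-validation)."""
    targets = np.arange(window, len(series))
    fold_mses = []
    for fold in np.array_split(targets, k):
        errs = [(series[t] - series[t - window:t].mean()) ** 2 for t in fold]
        fold_mses.append(np.mean(errs))
    return float(np.mean(fold_mses))

rng = np.random.default_rng(3)
t = np.arange(300)
# Toy series with a weekly cycle: a 7-day window averages out the cycle.
series = 10.0 + np.sin(2 * np.pi * t / 7) + 0.1 * rng.normal(size=300)

grid = [2, 3, 7]                            # candidate window lengths
scores = {w: cv_score(series, w) for w in grid}
best_window = min(scores, key=scores.get)   # exhaustive grid search
```

Blocked (contiguous) folds are used because shuffled folds would leak neighboring time steps between training and validation, which matters for time series data.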
Furthermore, all methods mentioned in this paper are coded in Python 3.8. The neural network functions and structures are constructed based on the package Keras 2.4.3, which was developed by Google.

Factor Selection
The factors of the two profiles (temperature and weather) related to natural gas consumption include the maximum, average, and minimum temperature; maximum, average, and minimum dew point; maximum, average, and minimum humidity; and maximum, average, and minimum pressure. These 12 factors are used as the candidate input data of the model for daily forecasting. The Pearson correlation coefficient is used to select the highly correlated factors among them. Table 2 shows the Pearson correlation coefficients between consumption and the temperature and weather profiles. It can be seen from Table 2 that features from the temperature and dew point profiles show a stronger, negative correlation with natural gas consumption.
Among the 12 factors, the maximum temperature, average temperature, and maximum dew point have the highest absolute correlation coefficients and are selected as the input factors of the forecasting models in this paper.
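This screening step can be reproduced with a plain Pearson coefficient. The synthetic factors and the 0.5 selection cutoff below are illustrative assumptions, not the paper's data or threshold; they simply show how a strongly (negatively) correlated factor is kept and a weakly correlated one dropped.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two series."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

rng = np.random.default_rng(4)
temp = rng.normal(10.0, 5.0, size=365)         # synthetic daily temperature
humidity = rng.normal(60.0, 10.0, size=365)    # unrelated synthetic factor
# Consumption falls as temperature rises (heating demand), plus noise.
consumption = 500.0 - 20.0 * temp + rng.normal(0.0, 10.0, size=365)

factors = {"max_temperature": temp, "avg_humidity": humidity}
corrs = {name: pearson_r(x, consumption) for name, x in factors.items()}
# Keep factors whose |r| exceeds a chosen cutoff (0.5 here, an assumption).
selected = [name for name, r in corrs.items() if abs(r) > 0.5]
```

Selecting on the absolute value of r is what lets strongly negative correlations, such as temperature versus heating demand, survive the filter.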

Performance Discussion
To evaluate the improvement in accuracy obtained from the convolutional neural network and attention mechanism on LSTM, we first compare the improvements from the single and combined enhancement methods to LSTM. The proposed model is then compared with other popular forecasting models to prove its robustness.

Comparison between Two Enhanced Methods for LSTM
As two enhancement methods of LSTM, CNN-LSTM and LSTM-ATT are utilized for comparison with the proposed model to find the best forecasting method for datasets with different complexities. Figure 6 and Table 3 show the forecasting curves, errors, and computational cost (seconds per series) [40] of the three enhancement methods for LSTM. Figure 6 shows that the forecasting curve of our proposed model closely fits the peaks and troughs of the real consumption curve (Figure 6(a1)). Although the forecasting results for Spata, with its high sample entropy, are not as accurate as those for Athens, the proposed algorithm is closest to the actual curve among the enhancement methods (Figure 6(b1)). It can also be seen from Figure 6(a2,b2) that CNN-LSTM-ATT has the highest R² and the lowest MARNE and MAPE in both Athens and Spata. The results in Table 3 indicate that CNN-LSTM-ATT outperforms CNN-LSTM and LSTM-ATT on the designed datasets. Compared with CNN-LSTM, the evaluation indicators of the proposed model are improved by more than 20% in Athens and by more than 5% in Spata. Against LSTM-ATT, the MARNE, MAPE, and R² are improved by 16.03%, 9.80%, and 5.93%, respectively, in Athens; in Spata, the indicators are improved by 11.38%, 2.21%, and 17.54%, respectively. Additionally, the computational cost of the proposed model is lower than that of CNN-LSTM and LSTM-ATT. These improvements imply that our proposed LSTM enhancement method is superior to the other combined methods on datasets with different complexities. Although the attention mechanism improves LSTM more than CNN does, only the combination of the three methods takes full advantage of each. The proposed model should still be compared with other popular forecasting models to validate its performance.

Comparison with Other Forecasting Models
The above analysis proves that the proposed model is superior to other LSTM enhancement methods. However, it is still necessary to compare it with other popular forecasting models to evaluate its performance and robustness on datasets with different complexities. Figure 7 and Table 4 present the forecasting curves, errors, and computational cost (seconds per series) [41] of CNN-LSTM-ATT and the other forecasting models. It can be seen from Figure 7 that the forecasting results of LSTM and BPNN are closer to the real consumption curve than those of SVR and MLR, which deviate far from the original curve, but CNN-LSTM-ATT clearly tracks the real consumption curve best. Figure 7(b1) shows that, for the higher-complexity dataset Spata, the forecasting results of the other four methods deviate from the original consumption curve. Although the proposed method cannot fit the original curve well either, it obtains better results. The indicators in Figure 7(a2,b2) also prove that CNN-LSTM-ATT presents the best forecasting performance. Table 4 indicates that the MARNE, MAPE, and R² of the proposed model are improved by more than 42%, 27%, and 5%, respectively, in Athens compared with LSTM, SVR, and MLR. In Spata, the proposed model improves R² by more than 25%; relative to the same three models, the improvement in MARNE is 16.23%, 11.61%, and 16.61%, respectively, and the MAPE is improved by 3.47%, 4.83%, and 4.93%, respectively. All indicators imply that the proposed model shows the best robustness and performance. Additionally, the computational cost of the classic statistical methods MLR and SVR is lower than that of the ANN methods BPNN, LSTM, and CNN-LSTM-ATT, because ANN methods take extra time to fit complex nonlinear relationships through model training. Even so, our proposed method requires less training time than BPNN and LSTM.
From the above analysis, it can be concluded that the superiority of the CNN-LSTM-ATT model has been validated on datasets with different complexities. The proposed method demonstrates better performance and robustness than other popular forecasting models and can be applied to datasets with different complexities in various real-life scenarios.

Conclusions
This paper proposes a novel three-layer neural network forecasting model, namely CNN-LSTM-ATT, which consists of a CNN layer, an LSTM layer, and an attention mechanism. The CNN layer extracts the major components, including spatial and temporal information, from the input factors. The LSTM layer learns and saves the long-distance state of the time series data. The attention mechanism adaptively redistributes the weights of the LSTM hidden vectors to overcome the defect of encoding input sequences into fixed-length vectors. Two datasets with different sample entropies and fluctuation characteristics are designed to evaluate the model's robustness. The most important findings are as follows:

(I) Compared with the two-layer enhancement methods for LSTM, the combined approach of CNN, LSTM, and attention mechanism takes full advantage of each algorithm and achieves better performance. Compared with CNN-LSTM, the evaluation indicators of the proposed model are improved by more than 20% in Athens and by more than 5% in Spata. Against LSTM-ATT, the MARNE of the proposed model is improved by more than 11% on both designed datasets.

(II) The proposed enhancement method for LSTM significantly improves forecasting accuracy. Compared with single LSTM, SVR, and MLR, CNN-LSTM-ATT closely fits the peaks and troughs of the consumption curves and exhibits the best performance and robustness. The results indicate that the improvements of the proposed model in MARNE, MAPE, and R² exceed 42%, 27%, and 5% in Athens, respectively. The R² is improved by more than 25%, even on the high-complexity dataset, Spata.
In this paper, the proposed three-layer neural network forecasting model proves its superiority in terms of performance and robustness and can be regarded as a reliable forecasting method for providing consumption distribution plans to natural gas pipeline companies. However, the model's limitation lies in the structure and parameters of the neural network: the optimal structure and parameters used in this paper were found via trial and error, which consumed considerable time throughout the experiments. Additionally, model performance is affected by the correlation between the input factors and natural gas consumption; input factors with low correlation result in weak performance. To further improve performance and efficiency, feature enhancement methods will be considered to improve the correlation, and mathematical methods will be considered to find the optimal network structure and parameters in future work.