1. Introduction and Background
A time series is a sequence of observations that is captured through time, and forecasting is the process of estimating future trends or values based on present and past values. Time series forecasting has applications in various fields, such as electricity consumption and price forecasts [1,2], wind forecasting [3], temperature forecasting [4], and several other real-life applications [5].
There are two main forecasting methods: deterministic forecasting and probabilistic forecasting [6]. Deterministic forecasting, also known as point forecasting, is the process of predicting a single deterministic value in the future, which is then compared against the real target value. However, deterministic forecasting has shown limitations in the field because no information is available about the dispersion of the actual values around the estimated values, and it is hard to tell by how much the actual value will deviate from the predicted one, which can be especially disadvantageous for complex data. Therefore, probabilistic forecasting is being explored as a method that offers a potentially substantial improvement over deterministic forecasting by providing more reliable models [7]. In probabilistic forecasting, a range within which the target value should lie is predicted. This range is referred to as a prediction interval (PI).
The goal of time series analysis is to create a model that describes the behavior of the series and predicts its future values. To facilitate the inference of information about a time series, the series should be transformed into a stationary one [8]. Moreover, most statistical approaches to analyzing time series data require the series to be stationary [9]. A stationary series is loosely defined as a series whose statistical properties, such as mean and variance, do not vary over time. The strict definition of stationarity is too restrictive; thus, a weaker version is usually used instead [10]. To make a series stationary, we need to remove the trend and the seasonality. Trend represents a varying mean, which can be observed in the series as values that keep increasing or decreasing over time. Seasonality, on the other hand, is a pattern that repeats itself over time, which can indicate a varying variance.
Both statistical and deep learning models for time series forecasting have been discussed in the literature. Statistical models such as ARIMA are used for more precise prediction, but require experts with deep domain knowledge and rigorous analysis. Deep learning models such as long short-term memory (LSTM), on the other hand, require less domain knowledge and less investigation time, as there is no need to discover optimal features and parameters for the model [11]. Time series models have also adopted the attention mechanism. Attention was first introduced to solve a machine translation task [12]. Its goal was to overcome the shortcomings of recurrent neural networks (RNNs), which struggle to remember long sequences. This is achieved by retaining the hidden states at each step during decoding. Attention gives more importance to some features over others by assigning them weights; a weighted sum is then computed using softmax-normalized weights to obtain the sequence context for each feature.
The application of interest in this study is oil production. Oil is a traditional fossil fuel studied by many researchers. Even with the emergence of renewable energy sources such as wind, oil remains an important factor that affects the economy and plays an important role in energy investment, given the high risk, long cost-payback periods, and other factors that accompany investment in renewable energy [13].
State-of-the-art reservoir engineering forecasting techniques rely on Arps’ Decline Curve Analysis (DCA) equations [14]. DCA has been one of the prominent techniques for estimating oil and gas reserves for current and future wells. Arps divides the well production into two main partitions:
- (1) A hyperbolic curve representing the segment after an initial ramp-up period until the curve reaches a peak.
- (2) An exponential curve representing the decline behavior after the peak.
The curve function is summarized as follows:

$$q(t) = \begin{cases} \dfrac{q_i}{\left(1 + b\,D_i\,t\right)^{1/b}}, & t < t_{\mathrm{lim}} \\[4pt] q_{\mathrm{lim}}\, e^{-D_{\mathrm{lim}}\,(t - t_{\mathrm{lim}})}, & t \ge t_{\mathrm{lim}} \end{cases} \qquad (1)$$

where $q(t)$ is the oil production rate in barrels/day, $q_i$ is the initial production, $D_i$ is the initial decline in the hyperbolic part of the equation, $t$ is time, and $b$ is the hyperbolic factor controlling the rate of change of the decline. After reaching a certain decline rate $D_{\mathrm{lim}}$, the curve is represented by an exponential one using $q_{\mathrm{lim}}$, the production reached by time $t_{\mathrm{lim}}$.
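A short sketch of Equation (1) may help make the hyperbolic-to-exponential switch concrete. This is an illustrative implementation, not the authors' code: the function and parameter names are hypothetical, and it assumes $D_{\mathrm{lim}} \le D_i$ so that the switch time exists.

```python
import numpy as np

def arps_rate(t, qi, di, b, d_lim):
    """Arps hyperbolic-to-exponential rate (barrels/day) at times t.

    Hyperbolic until the instantaneous decline D(t) = di / (1 + b*di*t)
    falls to d_lim, exponential afterwards (Equation (1))."""
    t = np.asarray(t, dtype=float)
    t_lim = (di / d_lim - 1.0) / (b * di)              # time where D(t) == d_lim
    q_lim = qi / (1.0 + b * di * t_lim) ** (1.0 / b)   # rate at the switch point
    q_hyp = qi / (1.0 + b * di * t) ** (1.0 / b)       # hyperbolic segment
    q_exp = q_lim * np.exp(-d_lim * (t - t_lim))       # exponential tail
    return np.where(t < t_lim, q_hyp, q_exp)
```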
Decline curve modeling has been used to predict production data, where the curve is fitted to the data to estimate future points. In [15], different decline models are evaluated, namely the Stretched Exponential Decline Model (SEDM) and the Logistic Growth Model (LGM), followed by the Extended Exponential Decline Model (EEDM), the Power Law Exponential Model (PLE), Duong’s Model, and the Arps Hyperbolic Decline Model.
In [16], unlike traditional trend-stationarity techniques, a new method was adopted that utilizes the Arps decline curve: the trend found in the oil datasets is removed using the fitted Arps curve in an attempt to make the series stationary.
In this study, we propose a machine learning model that estimates a prediction interval for a large dataset composed of monthly oil production data from unconventional oil-producing wells. Accurate estimation of prediction intervals can play a critical role in quantifying uncertainty and supporting investment and divestment decisions. The remainder of this paper is organized as follows: Section 2 discusses the experiment details and the setups used; Section 3 describes the data; Section 4 describes the evaluation metrics used to assess the prediction intervals; Section 5 presents the results, their visualization, and some insights; and Section 6 concludes.
2. Model
The machine learning model utilized in this study is a sequence-to-sequence (seq2seq) model. seq2seq is an encoder–decoder-based deep learning model. Two LSTMs are used, separated by a repeat vector that repeats the encoder output three times, once per forecast step. The model performs multi-step-ahead forecasting, since it forecasts several steps (three) ahead into the future (the future sequence). The decoder is followed by two densely connected layers of 100 units and 1 unit, applied using TimeDistributed layers.
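A minimal sketch of this architecture in Keras may be useful. The three-step RepeatVector and the 100-unit and 1-unit TimeDistributed dense layers follow the text; the LSTM width (64 units) and the ReLU activation are assumptions, and the six-month input window anticipates the data setup in Section 3.

```python
from tensorflow.keras import layers, models

n_in, n_out, n_features = 6, 3, 1   # six input months, three forecast months

model = models.Sequential([
    layers.Input(shape=(n_in, n_features)),
    layers.LSTM(64),                          # encoder LSTM (unit count assumed)
    layers.RepeatVector(n_out),               # repeat the context once per output step
    layers.LSTM(64, return_sequences=True),   # decoder LSTM (unit count assumed)
    layers.TimeDistributed(layers.Dense(100, activation="relu")),
    layers.TimeDistributed(layers.Dense(1)),  # one value per forecast month
])
```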
Using this model, we test different setups, which include trend removal and an attention mechanism. Quantile (pinball) loss is utilized to create the upper and lower bounds of the PIs, as well as the 0.5 quantile (p50); a sketch of this loss follows below.
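The pinball loss can be sketched as follows; `quantile_loss` is an illustrative name, and the optimizer in the usage comment is an assumption, as the paper does not specify one.

```python
import tensorflow as tf

def quantile_loss(q):
    """Pinball loss for quantile q: under-prediction is penalized by q,
    over-prediction by (1 - q)."""
    def loss(y_true, y_pred):
        e = y_true - y_pred
        return tf.reduce_mean(tf.maximum(q * e, (q - 1.0) * e))
    return loss

# e.g. compile one model per bound:
# model.compile(optimizer="adam", loss=quantile_loss(0.95))
```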
Trend removal is used to make the series stationary by exploiting the characteristic trend of oil production series. First, the sequence is fitted using a hyperbolic-to-exponential Arps decline curve. Then, trend removal is achieved simply by taking the difference between the original series and the Arps fitted curve, as shown in Figure 1. Regarding the attention mechanism, we implemented a simple attention layer using Keras [17], following the attention mechanism introduced in [12].
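A sketch of the Arps differencing step, assuming per-well fitting with SciPy's `curve_fit`. For brevity, only the hyperbolic segment of Equation (1) is fitted here (the full hyperbolic-to-exponential form could be substituted), and the initial guess and parameter bounds are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def arps_hyperbolic(t, qi, di, b):
    """Arps hyperbolic rate: q(t) = qi / (1 + b*di*t)**(1/b)."""
    return qi / np.power(1.0 + b * di * t, 1.0 / b)

def detrend_with_arps(q):
    """Fit an Arps curve to one well's post-peak series q (barrels/day)
    and return the residual (detrended) series."""
    t = np.arange(len(q), dtype=float)
    popt, _ = curve_fit(arps_hyperbolic, t, q,
                        p0=[q[0], 0.1, 1.0],                       # illustrative guess
                        bounds=([0.0, 1e-6, 1e-3], [np.inf, 10.0, 2.0]))
    return q - arps_hyperbolic(t, *popt)
```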
3. Data
Experiments are conducted on a dataset consisting of sequences of production data obtained over successive months for producing wells from all US oil and gas basins. These sequences represent the number of oil barrels per day from horizontal oil wells. Only the data after peak production are used, bearing in mind that, in the reservoir engineering domain, the data prior to peak production are typically studied independently of the rest of the data. The total number of wells in our experiment is 60,000, of which 50,000 were used to train the model and 10,000 were withheld for the testing phase. The sliding window technique is leveraged to cover all months and make the input sequences consistent in size: sequences of nine consecutive months are taken, of which six months are used as features and three as targets. Accordingly, the training set consists of 1,596,240 sequences and the test set consists of 46,386 sequences. We aim to estimate an interval by employing the quantile loss with the 0.05 quantile for the lower bound and the 0.95 quantile for the upper one, to achieve a 90% prediction interval.
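The sliding-window construction described above can be sketched as follows; `make_windows` is a hypothetical helper operating on a single well's post-peak series.

```python
import numpy as np

def make_windows(series, n_in=6, n_out=3):
    """Slide a (n_in + n_out)-month window over one well's series and
    split each window into n_in feature months and n_out target months."""
    X, y = [], []
    for s in range(len(series) - (n_in + n_out) + 1):
        X.append(series[s:s + n_in])
        y.append(series[s + n_in:s + n_in + n_out])
    return np.array(X), np.array(y)

# Example: a 12-month series yields 12 - 9 + 1 = 4 windows.
X, y = make_windows(np.arange(12.0))
```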
4. Evaluation Metrics
Different metrics are utilized to evaluate the predicted prediction intervals. The most commonly used metrics are the Prediction Interval Coverage Probability (PICP) and the Prediction Interval Normalized Average Width (PINAW) [2,18]. PICP measures the probability of a specific target falling within the predicted interval. It is defined as follows:
$$\mathrm{PICP} = \frac{1}{N} \sum_{i=1}^{N} c_i \qquad (2)$$

where $c_i$ is defined as

$$c_i = \begin{cases} 1, & L_i \le y_i \le U_i \\ 0, & \text{otherwise} \end{cases}$$

$L_i$ and $U_i$ represent the lower and upper bounds of the prediction interval, respectively; $N$ is the number of samples, and $y_i$ is the target.
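Equation (2) translates directly into a few lines; `picp` is an illustrative helper operating on NumPy arrays of targets and bounds.

```python
import numpy as np

def picp(y, lower, upper):
    """Fraction of targets covered by their prediction intervals (Equation (2))."""
    c = (y >= lower) & (y <= upper)   # the indicator c_i
    return np.mean(c)
```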
Increasing the PICP value significantly improves the results. However, the width of the PI also has an impact on the prediction: widening the PI by moving the lower and upper bounds further apart to cover more targets negatively affects the prediction's usefulness, as decision-makers will have little information to base their decisions upon. PINAW, also known in the literature as the Normalized Mean Prediction Interval Width (NMPIW), was introduced to overcome this flaw; it measures the width of the interval and is commonly used in the literature when investigating probabilistic forecasting. PINAW is the average width of the predicted PIs normalized by the range of the target, and it is defined as follows:

$$\mathrm{PINAW} = \frac{1}{NR} \sum_{i=1}^{N} \left(U_i - L_i\right) \qquad (3)$$

where $R$ is the range of the target; in other words, the maximum minus the minimum target.
Hence, the smaller the value of the PINAW, the better the results are.
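Equation (3) is similarly direct; `pinaw` is an illustrative helper.

```python
import numpy as np

def pinaw(y, lower, upper):
    """Average interval width normalized by the target range R (Equation (3))."""
    r = np.max(y) - np.min(y)         # R: max minus min target
    return np.mean(upper - lower) / r
```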
It is desirable to have a narrow PI, which can be obtained by targeting a narrower interval when choosing the quantiles. However, this conflicts with covering a large number of target points, which is achieved by making the PI wider. Therefore, the Coverage Width-based Criterion (CWC) was introduced in the literature [19]. CWC is defined as

$$\mathrm{CWC} = \mathrm{PINAW}\left(1 + \gamma\, e^{-\eta\,(\mathrm{PICP} - \mu)}\right) \qquad (4)$$

where $\gamma$ is defined as follows:

$$\gamma = \begin{cases} 0, & \mathrm{PICP} \ge \mu \\ 1, & \mathrm{PICP} < \mu \end{cases}$$

The hyper-parameter $\mu$ is the target PICP value, and $\eta$ is the penalty for having a PICP value less than the target.
The value of $\eta$ should be large to impose a high penalty on insufficiently informative PIs. Once the target PICP is reached, the CWC takes the same value as the PINAW, and in that case it is safe to assume an informative PI. On the other hand, a smaller PICP leads to high CWC values caused by the penalty $\eta$ in the exponential term of Equation (4). Hence, a small CWC is targeted.
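Equation (4) can be sketched as follows; the helper name is illustrative, and the defaults mirror the target PICP of 90% and penalty of 50 used in Section 5. The usage comments reproduce two rows reported there.

```python
import numpy as np

def cwc(picp_val, pinaw_val, mu=0.90, eta=50.0):
    """Coverage Width-based Criterion (Equation (4)): equals PINAW when the
    coverage target mu is met, and grows rapidly as PICP drops below it."""
    gamma = 1.0 if picp_val < mu else 0.0
    return pinaw_val * (1.0 + gamma * np.exp(-eta * (picp_val - mu)))

# cwc(0.905, 0.094) -> 0.094  (target met, so CWC == PINAW)
# cwc(0.859, 0.069) -> ~0.605 (penalized), matching the trend-removal-only
#                              result reported in Section 5
```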
5. Results and Discussion
The results are shown in Table 1. Using attention only and a 90% PI obtained by choosing the 0.05 and 0.95 quantiles, we achieve 90.5% PICP, 9.4% PINAW, and 0.094 CWC, when the expected PICP is set to 90% and the penalty parameter is set to 50 [20]. On the other hand, applying trend removal only yields 85.9% PICP, 6.9% PINAW, and 0.605 CWC. When using both trend removal and attention, we get 85.4% PICP, 6.7% PINAW, and 0.729 CWC. The CWC value increases with an unsatisfactory PICP: when PICP is less than the expected value, the CWC is significantly greater than the PINAW, whereas CWC equals PINAW when the PICP value is greater than or equal to the expected value. Thus, the smaller the CWC value, the better the prediction. The very slight improvement from using attention can be attributed to the sequences under investigation, which are not long enough to emphasize the enhancement.
Our results regarding Arps differencing confirm the results in [16], which can be found in Table 2. From our results in Table 1, it is clear that using Arps differencing yields a narrower width than keeping the trend. We can also see that, with regard to PICP, which represents the coverage probability of the PI, using Arps differencing is better than simply choosing a narrower PI, even though both yield narrower widths, as shown in Figure 2.