Application of Gated Recurrent Unit (GRU) Neural Network for Smart Batch Production Prediction

Abstract: Production prediction plays an important role in decision making, development planning, and economic evaluation during the exploration and development period. However, applying traditional methods for production forecasting of newly developed wells in conglomerate reservoirs is restricted by limited historical data, complex fracture propagation, and frequent operational changes. This study proposes a Gated Recurrent Unit (GRU) neural network-based model to achieve batch production forecasting in the M conglomerate reservoir of China, which tackles the limitations of traditional decline curve analysis and conventional time-series prediction methods. The model is trained on four features, production rate, tubing pressure (TP), choke size (CS), and shut-in period (SI), from 70 multistage hydraulically fractured horizontal wells. Firstly, comprehensive data preprocessing is implemented, including excluding unfit wells, data screening, feature selection, data set partitioning, z-score normalization, and format conversion. Then, the four-feature model is compared with a model considering production only, and it is found that, under frequent oilfield operational changes, the four-feature model accurately captures the complex variance pattern of the production rate. Further, Random Forest (RF) is employed to optimize the prediction results of the GRU model. For a fair evaluation, the performance of the proposed model is compared with that of a simple Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) neural network. The results show that the proposed approach outperforms the others in prediction accuracy and generalization ability. It is worth mentioning that, under the guidance of continuous learning, the GRU model can be updated as soon as more wells become available.


Introduction
Production prediction plays an important role in decision making, development planning, and economic evaluation during the exploration and development period. Especially in unconventional reservoirs, accurate production estimation helps evaluate oil and gas potential and, at the same time, guides fracturing operation reasonably [1][2][3]. There are several commonly used methods for future production prediction.
Reservoir numerical simulation is a common method for production rate forecasting because of its reliable physical background [4,5]. However, it is time-consuming and demands many reservoir features, such as rock mechanical parameters and fluid property parameters [6]. Inaccurate or missing parameters will lead to unreliable production predictions. Especially in conglomerate reservoirs, it is difficult to build a robust numerical model due to the effect of gravel as well as the complicated fracture propagation mechanism.

In the remainder of this paper, the GRU mechanism, the architecture, and the three prediction approaches of the proposed model are explained. After model training, model validation, application, and correction with Random Forest (RF) are employed successively for future production rate forecasting. Additionally, continuous learning is introduced into prediction as available data increase. Finally, the main conclusions and prospects are presented.

Data Set
In this study, we focus on the oil wells of the X well block in the M oilfield, China. The M oilfield belongs to a conglomerate reservoir with a dozen well blocks. Among them, X is one of the earliest regions of scale development, with more than 70 multistage hydraulically fractured horizontal wells since 2016. The size of gravel in the X well block generally ranges from 2 mm to 8 mm with a maximum value of 16 mm, and the contents of gravel and sand are around 51% and 44%, respectively. According to the conventional logging test and core analysis experiments, the permeability varies from 0.05 to 94.8 mD with an average value of 2.3 mD, and the porosity ranges from 4.3% to 15.3% with an average of 9.23%. Figure 1 shows the production time of all the available wells. From 2016 to 2019, the number of fractured wells increased year by year, with 2 wells in 2016, 14 wells in 2017, 26 wells in 2018, and 28 wells in 2019 (from January to September). More wells are awaiting drilling and fracturing in the near future. As of the end of September 2019, only 13 wells had produced for more than 2 years, and 32.4% of the wells had produced for less than 200 days. Short production times increase the difficulty and uncertainty of applying conventional DCA. Raw data for the above wells are collected from the oilfield database, including production date, SI, CS, TP, oil production rate (q_o), production method (PM), strokes, and so on (see Supplementary Materials). The production records end on 2 September 2019.


Workflow
The quality of the data set is of utmost importance because data are the core of data mining. Figure 2 depicts the workflow of data preparation (the blue box) and the modeling process (the green box). The preprocessing consists of six steps.

Firstly, unfit wells, whose production time is less than the history window (HW), are excluded from the data set. HW refers to the number of days used to predict the production rate on the next day. If HW is too big, the number of samples will decrease, resulting in underfitting problems. On the contrary, if HW is too small, some key information hidden in the long time series might be ignored, which leads to poor prediction performance. As there are numerous newly developed wells, 14 days is selected as HW in this research, and 5 wells are therefore excluded from the data set.
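The exclusion rule in this first step can be sketched as follows. The well names and day counts below are made up for illustration; only HW = 14 comes from the text, and the rule here requires one extra day beyond the window to serve as the prediction target:

```python
# Sketch of step 1: drop wells whose production history cannot supply at
# least one full HW window plus a next-day target.
HW = 14

def usable_samples(n_days, hw=HW, horizon=1):
    """Number of (input window, next-day target) pairs a well can supply."""
    return max(0, n_days - hw - horizon + 1)

def exclude_unfit_wells(well_days, hw=HW):
    """Keep only wells that contribute at least one training sample."""
    return {w: d for w, d in well_days.items() if usable_samples(d, hw) > 0}

# Hypothetical wells with their days on production:
wells = {"W1": 600, "W2": 10, "W3": 14, "W4": 200}
fit = exclude_unfit_wells(wells)  # W2 and W3 are too short and are dropped
```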
Secondly, data screening is implemented on each well by deleting periods before oil production appears. These zero values do not contribute to models but will increase computing time. Remarkably, as an external feature, zero production periods caused by shut-in operation should not be excluded.
The third step of data preprocessing is to select appropriate input features. The available features obtained from the oilfield database include SI, CS, TP, PM, strokes, and q_o. By analyzing the historical data, PM and strokes are removed from the available features because all the wells are gusher wells with unchanging values. Figure 3 displays a typical production history with frequent operations. It is obvious that changing the choke size and shut-in operations relate to sharp fluctuations in production rate and tubing pressure, so it is essential to consider SI and CS as input features. This analysis is consistent with the Pearson correlation coefficient heatmap shown in Figure 4. The coefficients between every two variables range from −1 to 1: values of 1 and −1 represent a completely positive and a completely negative linear relation, respectively, and 0 means there is no linear correlation between the two variables. It can be found that q_o has a positive relation with CS and TP, whereas it has a negative relation with SI. The largest absolute Pearson correlation coefficient is 0.47, between q_o and TP, suggesting that TP has the largest impact on q_o.
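The correlation matrix behind such a heatmap can be sketched with pandas. The synthetic data below only mimic the sign pattern reported in the text (q_o rising with CS and TP, falling with SI); the coefficients are not the field values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
cs = rng.uniform(2, 8, n)        # choke size, mm (synthetic)
si = rng.uniform(0, 24, n)       # shut-in period, h (synthetic)
tp = rng.uniform(5, 25, n)       # tubing pressure, MPa (synthetic)
# Toy oil rate: increases with CS and TP, decreases with SI, plus noise.
qo = 3.0 * cs + 0.8 * tp - 0.3 * si + rng.normal(0, 2, n)

df = pd.DataFrame({"SI": si, "CS": cs, "TP": tp, "qo": qo})
corr = df.corr(method="pearson")  # matrix visualized as the Figure 4 heatmap
```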
After the above steps, a total of 65 wells are selected for data set partitioning. For different purposes, the data set is generally divided into the training set, validation set, and test set. The training data allow the model to learn the relationship between the input sequence and the output for good prediction. The validation set is utilized for neural network architecture optimization and hyperparameter optimization (e.g., learning rate, dropout ratio, regularization parameter of the L2 norm). The test set is used for corroborating the prediction ability and robustness of the trained models. In this research, 52 wells are randomly selected as the training set, which accounts for 80% of the data set. The validation set and the test set contain 7 wells and 6 wells, respectively.
In the fifth step, data standardization is implemented based on z-score normalization to improve the efficiency of the trained model, as defined in Equation (1):

Fea_nor = (Fea − Mean_tra) / Std_tra, (1)

where Mean_tra and Std_tra are the mean and standard deviation of the training set, Fea denotes an input feature, and Fea_nor represents the input feature after z-score normalization. One key point to note is that the validation set and test set should be normalized based on the mean and standard deviation of the training set only. The statistical properties of the input features are summarized in Table 1.
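The train-statistics-only rule in Equation (1) can be sketched as follows; the small arrays are illustrative:

```python
import numpy as np

def zscore_fit(train):
    """Compute normalization statistics from the training set only."""
    return train.mean(axis=0), train.std(axis=0)

def zscore_apply(data, mean_tra, std_tra):
    """Equation (1): Fea_nor = (Fea - Mean_tra) / Std_tra."""
    return (data - mean_tra) / std_tra

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[2.0, 20.0]])

mean_tra, std_tra = zscore_fit(train)
train_nor = zscore_apply(train, mean_tra, std_tra)
# The test set reuses the training statistics rather than its own.
test_nor = zscore_apply(test, mean_tra, std_tra)
```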
As shown in Figure 5, in the sixth step the normalized features are converted to a three-dimensional tensor to match the input format of the layers. For the input, the first dimension is the number of samples (n_samples), the second dimension is the length of HW, and the third dimension is the number of input features (n_input). HW is an important parameter, determining how many days are used to predict the output at the next timestep. For example, if HW is 3 days, the model uses the input features of the 3 preceding days to predict the oil production rate of the fourth day. Similarly, the output is converted to a two-dimensional tensor with n_samples as the first dimension and the length of predictive days as the second dimension. Here, we predict only 1 day after the HW, so the second dimension is 1.
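The window-to-tensor conversion in this sixth step can be sketched as below, assuming 4 input features and HW = 14 as in the text; the random array stands in for a well's normalized daily record:

```python
import numpy as np

def make_tensors(features, target, hw=14):
    """
    Convert one well's daily record into model tensors.

    features: (n_days, n_input) array, e.g., SI, CS, TP, production rate.
    target:   (n_days,) array of the oil production rate q_o.
    Returns X of shape (n_samples, hw, n_input) and y of shape
    (n_samples, 1): X[i] holds days i..i+hw-1 and y[i] the rate on
    day i+hw, i.e., one day after the history window.
    """
    n_days, n_input = features.shape
    n_samples = n_days - hw
    X = np.stack([features[i:i + hw] for i in range(n_samples)])
    y = target[hw:].reshape(-1, 1)
    return X, y

feats = np.random.rand(100, 4)   # 100 days, 4 normalized input features
qo = feats[:, 3]                 # last column is the production rate
X, y = make_tensors(feats, qo, hw=14)
```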


RNN, LSTM, and GRU
Based on RNN, LSTM is designed to handle long-term dependency problems [27]. Later, Cho et al. proposed a simpler variant, the GRU [28]. Figure 6 shows the cell architectures of RNN, LSTM, and GRU. Compared with the simple architecture of RNN, LSTM is more complex. The core ideas behind LSTM are the cell state (i.e., C_{T−1}, C_T) and three gates, which distinguish LSTM from RNN [32]. W and b represent the weight matrix and bias matrix of the gates, with the subscripts f, i, and o denoting the forget gate, input gate, and output gate, respectively.
where the symbol σ represents a sigmoid layer, outputting numbers between 0 and 1, given by σ(x) = (1 + e^(−x))^(−1), as shown in Figure 7a. A value of 0 means "let nothing through", while 1 means "let everything through". It plays an important role in the three gates. The cell state (C_T) is a conveyor belt running through the top of the LSTM cell, which controls the flow of information across multiple LSTM cells. When the output of timestep T − 1 (h_{T−1}) and the input of timestep T (Fea_T) enter the cell, the forget gate (f_T) decides what information is unimportant and should be forgotten.
There is an input gate (i_T) combined with the input layer (C̃_T), determining what new information should be stored in the cell state.
where tanh denotes the tanh activation function, as shown in Figure 7b, and W_C and b_C are the weight matrix and bias matrix of the input layer.
Next, the old cell state (C_{T−1}) will be updated to the new cell state (C_T) based on Equation (5).
The symbol ⊙ represents the Hadamard product (element-wise multiplication).
Ultimately, the output gate (o_T) decides what information should be output as follows: where h_T is the hidden state at timestep T, which is equal to the value of Out_T.

On the basis of LSTM, GRU combines the forget gate and input gate into the update gate (z_T) and merges the hidden state with the cell state, which makes the architecture simpler and the computation cheaper. There are only two gates in the cell architecture of GRU, the reset gate (r_T) and the update gate (z_T), which can be described as Equations (8) and (9), respectively: where W_r and b_r are the weight matrix and bias matrix of the reset gate, and W_z and b_z are the weight matrix and bias matrix of the update gate. Subsequently, the new memory cell state will be obtained by Equation (10): where W_h and b_h are the weight matrix and bias matrix of the new memory cell state.
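Equations (8)-(10) can be sketched as a single GRU timestep in NumPy. The update convention below follows the text (z close to 1 keeps more of the previous hidden state); note that some references swap the roles of z and 1 − z, and the weight shapes here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(fea_t, h_prev, params):
    """One GRU timestep following Equations (8)-(10) in the text."""
    Wr, Ur, br, Wz, Uz, bz, Wh, Uh, bh = params
    r = sigmoid(Wr @ fea_t + Ur @ h_prev + br)             # reset gate, Eq. (8)
    z = sigmoid(Wz @ fea_t + Uz @ h_prev + bz)             # update gate, Eq. (9)
    h_new = np.tanh(Wh @ fea_t + Uh @ (r * h_prev) + bh)   # new memory, Eq. (10)
    # z near 1 -> memorize more of the previous hidden state.
    return z * h_prev + (1.0 - z) * h_new

rng = np.random.default_rng(1)
n_input, n_units = 4, 8
# Random small weights in the order Wr, Ur, br, Wz, Uz, bz, Wh, Uh, bh.
params = [rng.normal(0, 0.1, s) for s in
          [(n_units, n_input), (n_units, n_units), (n_units,)] * 3]
h = np.zeros(n_units)
for t in range(14):                 # roll the cell over one history window
    h = gru_step(rng.normal(0, 1, n_input), h, params)
```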
The gating signal (z_T) ranges from 0 to 1. The closer the gating signal is to 1, the more information is memorized, whereas the closer it is to 0, the more is forgotten. Therefore, one single expression can control both forgetting and inputting, generating the output (h_T).

"SingleStep", "Iterative", and "PartIterative" Prediction Schemes

As the architecture of GRU is flexible and changeable, we explored three prediction schemes: "SingleStep" (see Figure 8a), "Iterative" (see Figure 8b), and "PartIterative" (see Figure 8c). The three schemes were proposed for different purposes. The SingleStep approach was used for validating and evaluating model performance, eliminating the influence of accumulative errors. The Iterative approach was carried out for long-term cumulative production prediction in the case of limited data. With oilfield data dynamically changing, the PartIterative approach was utilized for real-time production prediction.

In Figure 8, the superscripts 1 − N of Fea represent the available input (exogenous) variables, the subscripts of Fea represent the sampling time, and q^pred_{T+1} represents the predicted value of the oil production rate at time T + 1. To eliminate the influence of HW and prediction length, the HW was set to 14 days and the prediction length to 1 day.

To remove the effect of the number of iterations, the SingleStep approach was put forward to evaluate the model performance on the single-step error fairly. The production rate used as input was updated daily, which means that each new production value was forecasted based on true values. In other words, given external features Fea from T − n + 1 to T + 1 and actual production values q from T − n + 1 to T, we could predict the (T + 1)th production rate q^pred_{T+1}. When shifting to the next timestep, given external features Fea from T − n + 2 to T + 2 and actual production values q from T − n + 2 to T + 1, we could predict the (T + 2)th production rate.
Without replacing q_{T+1} with q^pred_{T+1}, the influence of the number of iterations was eliminated, and it was more straightforward to evaluate the model performance on a single timestep. In this way, the RMSE of a well throughout the whole production lifecycle could be calculated and used to demonstrate the validity of the model.
In the Iterative scheme, the daily production was not updated, and the predictions were implemented on the input features of the first HW and previously predicted production. Unlike the SingleStep scheme, when shifting to the next timestep, given external features Fea from T − n + 2 to T + 2 and actual production values q from T − n + 2 to T + 1, we replaced q_{T+1} with q^pred_{T+1} to forecast the (T + 2)th production rate q^pred_{T+2}. Then we moved the timestep to T + 3, replaced q_{T+2} with q^pred_{T+2}, predicted the (T + 3)th production rate q^pred_{T+3}, and so on. As shown in Figure 8b, the first prediction generated an error ε_{T+1}, and the next timestep generated an error ε_{T+2} + G(ε_{T+1}). After L iterations, the error accumulated to ε_{T+L} plus the terms propagated from all previous errors.

To make full use of the dynamic production data, we combined the above two schemes and proposed the PartIterative approach, as depicted in Figure 8c. The update window represents the number of q values updated at one time. This scheme starts with an iterative process. Given external features Fea from T − n + 1 to T + 1 and actual production values q from T − n + 1 to T, we could predict the (T + 1)th production rate q^pred_{T+1}. Then, moving to the next timestep, given Fea from T − n + 2 to T + 2, actual production values q from T − n + 1 to T, and q^pred_{T+1}, we predicted the (T + 2)th production rate q^pred_{T+2}. After repeating the above iterative process m times (the size of the update window), we updated the predicted values q^pred_{T+1}, ..., q^pred_{T+m} with the actual values q_{T+1}, ..., q_{T+m} and started the iterative process again, and so on. Between iterations and updates, there is an adaptive system that compares the predicted values with the actual ones to judge whether the model should be retrained.
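The Iterative scheme can be sketched as a loop that feeds each prediction back in place of the unknown true rate. The `predict_next` stub below stands in for the trained GRU and is purely illustrative (a toy decline curve), as is the data layout with the rate in the last column:

```python
import numpy as np

def predict_next(window):
    """Stand-in for the trained GRU: maps an (hw, n_input) window to the
    next-day rate. Here just a toy 2% decline on the last observed rate."""
    return 0.98 * window[-1, -1]

def iterative_forecast(first_window, future_exog):
    """
    first_window: (hw, n_input) array; the last column is the production rate.
    future_exog:  (L, n_input - 1) array of known external features
                  (SI, CS, TP) for the L days to forecast.
    Each predicted rate replaces the true one in the rolling window.
    """
    window = first_window.copy()
    preds = []
    for exog in future_exog:
        q_pred = predict_next(window)
        preds.append(q_pred)
        new_row = np.append(exog, q_pred)          # q_pred fed back as input
        window = np.vstack([window[1:], new_row])  # slide the window forward
    return np.array(preds)

hw, n_input, L = 14, 4, 30
first = np.random.rand(hw, n_input)
preds = iterative_forecast(first, np.random.rand(L, n_input - 1))
```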


The Architecture of the GRU Model
In this paper, we utilized the GRU algorithm based on the Python deep learning library Keras [33] configured with the TensorFlow backend. The model consisted of the input, GRU, and dense layers. Hyperparameter optimization was not the focus of this research, so the neural network architecture was simply tuned according to the performance on the validation set. The number of neurons in the GRU layer was 32, big enough to learn the implicit information relating input and output. All the weight matrices were initialized by Xavier initialization [34], and the bias matrices were initialized to zeros. The activation function of the GRU layer is ReLU, as shown in Figure 7c.

Adaptive moment estimation (Adam) was used as the optimizer to minimize the loss function. Combining the advantages of AdaGrad and RMSProp, it updates the weights and biases based on the first and second moment estimates [35]. L2-norm (least squares) regularization [36] and the dropout technique [37] were also used to avoid overfitting. Table 2 shows the parameters of the proposed GRU models. The models were coded in Python 3.7 and executed on an Intel® Core™ i7-4790 3.60 GHz CPU.
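A minimal Keras sketch of the architecture described above, assuming the sizes given in the text (32 GRU units, HW = 14 days, 4 input features, 1 output); the dropout ratio and L2 coefficient below are illustrative placeholders, not the paper's tuned values from Table 2:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

HW, N_INPUT = 14, 4  # history window length and number of input features

model = keras.Sequential([
    layers.Input(shape=(HW, N_INPUT)),               # (n_samples, HW, n_input)
    layers.GRU(32, activation="relu",                # ReLU per Figure 7c
               kernel_initializer="glorot_uniform",  # Xavier initialization
               kernel_regularizer=regularizers.l2(1e-4),  # placeholder value
               dropout=0.2),                               # placeholder value
    layers.Dense(1),                                 # next-day production rate
])
model.compile(optimizer="adam", loss="mse")
```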

Random Forest
RF is an ensemble algorithm consisting of a collection of decision trees that calculates ensemble predicted values by averaging the values of terminal nodes for regression problems [38]. Numerous studies have shown that RF has high accuracy, good tolerance to noise and outliers, and the ability to avoid overfitting [39][40][41]. Further, it provides a novel method of determining feature importance, which could help us further understand the mechanism of fracturing stimulation. There are two straightforward methods for feature importance ranking: mean decrease impurity and mean decrease accuracy. The first method is implemented in the RF of Scikit-learn [42] and was adopted in this research. For every decision tree in the forest, the nodes are split at the most informative feature to maximize the information gain, as defined in Equation (12). The importance of a feature is then the sum of the information gains over all splits on that feature, weighted by the number of samples each split involves. Detailed implementations of RF can be found in [38,43].
IG(D_p, Fea) = I(D_p) − (N_left / N_p) I(D_left) − (N_right / N_p) I(D_right), (12)

where Fea is the feature used to perform the split, N_p is the number of samples in the parent node, N_left and N_right are the numbers of samples in the child nodes, I is the impurity function (the variance for regression problems), D_p is the sample set at the parent node, and D_left and D_right are the sample sets at the child nodes.
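The mean-decrease-impurity ranking as implemented in Scikit-learn can be sketched as below; the data are synthetic, with the target driven mostly by the second feature so that its importance dominates:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))                    # three stand-in features
y = 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)   # mostly driven by feature 1

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importance = rf.feature_importances_           # mean decrease impurity
```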

Evaluation Indices
In this research, we adopted the mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE) as evaluation indices. MSE evaluates the differences between actual values (y_true) and predicted values (y_pred), as in Equation (13); it measures the overall performance of the trained model and is minimized during training:

MSE = (1/n) Σ_{i=1..n} (y_true,i − y_pred,i)^2. (13)

Compared with MSE, RMSE represents the average error of a single sample, which helps to understand the physical significance, as defined in Equation (14):

RMSE = sqrt((1/n) Σ_{i=1..n} (y_true,i − y_pred,i)^2). (14)

MAE gives a direct measure of the difference between predicted outcomes and true values, as defined in Equation (15):

MAE = (1/n) Σ_{i=1..n} |y_true,i − y_pred,i|. (15)
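Equations (13)-(15) translate directly into code:

```python
import numpy as np

def mse(y_true, y_pred):
    """Equation (13): mean of squared errors over all samples."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    """Equation (14): square root of MSE, in the units of the target."""
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Equation (15): mean of absolute errors."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```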


Verification of the GRU Model with the SingleStep Approach
In this research, GRU is conducted for production rate prediction under multiple factors in a conglomerate reservoir for the first time. For a new well, given inputs of the first HW, the long-term production rate can be obtained from the proposed well-block GRU model. Once the proposed model is trained with the training wells, we can apply it to other wells without extra computational cost. For ease of understanding, the forecasting results have been denormalized (multiplying the normalized RMSE by the standard deviation of the training set). Figure 9 displays the RMSE of the proposed model on the training, validation, and test wells. Gray dashed lines mark the area between the 25% and 75% quantiles. As can be seen from Figure 9, the RMSE for most wells is in the range of 4 to 6. The mean RMSE of the training, validation, and test wells is 4.89, 4.98, and 4.91, respectively. The RMSE distribution of the test wells coincides with that of the training wells, suggesting good generalization ability of the trained model.

To evaluate the generalization ability of the proposed model, we focus on the performance on the test set. Figure 10 shows the prediction results of the GRU model on six test wells with the SingleStep prediction approach. The SingleStep approach is used to eliminate the influence of accumulative errors caused by iterations. In other words, every red dot is predicted based on the actual inputs of the 14 days before the current prediction point. RMSE is utilized as an index to evaluate the single-step error. As we can see from Figure 10, most predicted points are very close to the actual production values, especially when the oil rate changes steadily, such as the oil rate of Well 2 from the 200th to 1100th day and Well 7 from the 200th to 700th day.
However, the relative errors soar when enormous fluctuations or sudden variances occur, such as the oil rate of Well 2 at the 107th day, Well 7 around the 200th day, and the frequent variances of Well 58, Well 22, and Well 31. A sudden drop or rise leads to larger uncertainties and errors in the prediction, which is rarely considered in previous studies. The prediction results show that the trained model can capture the complex and flexible variance of production, whereas its ability to capture sharp and sudden variances is limited. In the future, we could smooth the areas with brief, sudden changes or extend the forecast unit from day to month to improve model accuracy.
Energies 2020, 13, x FOR PEER REVIEW 13 of 24 day, Well 7 around the 200th day, and frequent variances of Well 58, Well 22, and Well 31. The sudden drop or rise could lead to bigger uncertainties and errors for the prediction, which is rarely considered in previous studies. The prediction results show that the trained model can capture the complex and flexible variance of production, whereas the ability to capture sharp and sudden variances is limited.
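The denormalization step mentioned above can be sketched as follows. This is a minimal illustration of z-score normalization and of converting a normalized RMSE back to production-rate units; the function names and toy rates are ours, not the paper's.

```python
import numpy as np

def zscore_fit(train):
    """Return the mean and standard deviation of the training set."""
    return train.mean(), train.std()

def denormalize_rmse(rmse_norm, train_std):
    """Convert an RMSE computed on z-score-normalized data back to the
    original production-rate units: multiply by the training-set std."""
    return rmse_norm * train_std

# Toy daily oil rates; values are illustrative, not field data.
train_rates = np.array([52.0, 48.0, 45.0, 44.0, 41.0, 40.0])
mu, sigma = zscore_fit(train_rates)
normalized = (train_rates - mu) / sigma
```

An RMSE of 1.0 in normalized space therefore corresponds to one training-set standard deviation in the original units.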
The proposed GRU model takes choke size and the shut-in operation into account, providing a method to predict the oil rate under variable external features; the predicted production changes with CS and SI. In total, nine training wells were selected to test the sensitivity of the production rate to changes in the external features. As shown in Figure 11, when the CS is changed from 2 mm to 5 mm to 8 mm, the larger the choke size, the higher the predicted production rate. As the SI is modified from 0 h to 12 h to 24 h, the predicted production increases as the shut-in period is prolonged (see Figure 12). This is reasonable because pressure builds up during the shut-in period. The predicted results are consistent with field experience and existing understanding.
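A sensitivity test of this kind can be sketched as below: the same input window is re-fed to the model with only the choke-size column changed. The column layout and the stand-in predictor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sensitivity_to_choke(predict_one, window, choke_values, cs_col=2):
    """Re-run the model on copies of the same HW-length input window with
    only the choke-size column modified, leaving all other features fixed."""
    rates = []
    for cs in choke_values:
        modified = window.copy()
        modified[:, cs_col] = cs
        rates.append(predict_one(modified))
    return rates

# Stand-in predictor: rate grows with mean choke size (illustrative only).
toy = lambda w: float(10.0 + 2.0 * w[:, 2].mean())

window = np.zeros((14, 4))  # columns: q_o, TP, CS, SI (assumed order)
rates = sensitivity_to_choke(toy, window, [2.0, 5.0, 8.0])
```

The same pattern applies to the shut-in period by pointing `cs_col` at the SI column instead.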

Application of the GRU Model with the Iterative Approach
In application, we often need to predict long-term production given only a limited production history. In this research, we also explored the Iterative approach, in which the iterations are repeated using the predicted production. More specifically, given data of the first HW (14 days), we predict the production rate of the 15th day. Based on the actual rate from the 2nd to the 14th day and the predicted rate of the 15th day, we can predict the rate of the 16th day. Then, based on the actual rate from the 3rd to the 14th day and the predicted rates of the 15th and 16th days, we can predict the rate of the 17th day. Repeating these iterations, we can obtain the predicted production rate at any point in time.
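The rolling iteration described above can be sketched as a generic loop. The helper below is our own illustration: `predict_one` stands in for the trained GRU, and the persistence model used in the example is a deliberately naive placeholder.

```python
import numpy as np

HW = 14  # history-window length in days

def iterative_forecast(predict_one, history, horizon):
    """The Iterative approach: slide the HW-day window forward, feeding
    each new prediction back in as input for the next step.

    predict_one: callable mapping a length-HW window to the next day's rate
    history:     the first HW days of actual production
    horizon:     number of future days to forecast
    """
    window = list(history[-HW:])
    forecasts = []
    for _ in range(horizon):
        next_rate = predict_one(np.asarray(window))
        forecasts.append(next_rate)
        window = window[1:] + [next_rate]  # drop oldest day, append prediction
    return forecasts

# Stand-in for the trained GRU: a naive "persistence" predictor.
toy_model = lambda w: float(w[-1])

first_two_weeks = np.linspace(60.0, 47.0, HW)  # illustrative rates
preds = iterative_forecast(toy_model, first_two_weeks, horizon=5)
```

Because each step consumes the previous step's output, any single-step error propagates through the whole horizon, which is exactly the accumulation problem discussed below.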

The one-feature (q_o) model and the four-feature (q_o + TP + SI + CS) model were first applied to two training wells. As shown in Figure 13, Well 11 and Well 36 were selected to show the influence of iterations under different features. The first 14 days serve as the HW, and the rest is treated as the prediction section.
First, we focus on the comparison of the two models. It can be found that the one-feature model can only capture the declining trend and is not sensitive to oilfield operations, which can lead to large errors. For Well 36, with a stable declining production trend and fewer operations, both models predict a reasonable production, and the four-feature model shows better performance. For Well 11, with an undulating production, the predicted points of the four-feature model coincide with the changing choke size and shut-in operations, indicating that the four-feature model can capture the complex variation pattern under multiple factors.
Figure 14 shows the half boxplot of the mismatch between the two GRU models on six test wells. The difference between the predicted and true production rates of the test wells is quantified by MAE and RMSE. The mean RMSE of the four-feature model, 13.84, is much less than that of the one-feature model, 25.91. Similarly, the mean MAE of the four-feature model, 11.21, is less than that of the one-feature model, 22.79. Many scholars have corroborated the feasibility of RNN and LSTM algorithms in production prediction, but few have focused on the influence of external variables on prediction performance. In this research, we compared the two models and found that the four-feature model outperforms the one-feature model both in the ability to capture variances and in prediction accuracy.
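The two error metrics used in this comparison can be computed as follows; the data here are illustrative, not the paper's wells.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted rates."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error between actual and predicted rates."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean(np.abs(actual - predicted)))

# Illustrative daily rates for one hypothetical test well.
actual = [50.0, 48.0, 47.0, 45.0]
predicted = [49.0, 50.0, 46.0, 45.0]
well_rmse = rmse(actual, predicted)
well_mae = mae(actual, predicted)
```

RMSE penalizes large single-day misses more heavily than MAE, which is why the two metrics are reported together.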
Then, we focus on the performance of the four-feature model itself. It can be found from Figure 13 that the production data are not monotonically decreasing but display a complex variation pattern because of frequent operations. The proposed model captures the complicated changes and fluctuates with the changing choke size and shut-in operations. However, the limited available production history, the long prediction section, and frequent operations increase the difficulty of iterative prediction. Although the errors of the SingleStep scheme are small, they become larger when the Iterative approach is used. Figure 15 shows some typical prediction results of training wells with large relative errors. The first cause is partial inconsistency, which results from oilfield operations.
The ability of the trained model to capture sharp and sudden variances is limited, so the predicted values are usually lower than the actual values when the CS is changed. This phenomenon is obvious in Well 32: the shut-in operation around the 380th day results in a stable gap between the predicted and actual values after the operation. The second cause is an overall trend lower than the actual points, as in Well 35, which might result from the diversity of formation and fracturing conditions. The changing trend of the predicted points is consistent with the recorded production rate, but a stable difference persists throughout the prediction. Although all the wells belong to the same well block, the geological and fracturing features change with the distribution of fault blocks and the advance of stimulation technologies. Another possible explanation is that the production of the first 14 days is still rising, which provides an inaccurate baseline for the later iterative prediction. These errors also occur in the test wells when the Iterative approach is used. Although the accuracy of single-point prediction is high, the performance degrades as the number of iterations increases. The mean RMSE of the test wells with the Iterative approach is 13.91, 2.8 times that of the SingleStep approach (4.91). The increased error is the combined effect of numerous iterations, sharp variances, and the diversity of formation and fracturing.

GRU_RF Model for Improving "Iterative" Accuracy
To improve the prediction accuracy, we set up an RF model to fit the errors between the predicted and true production rate. The allocation scheme of the training set, validation set, and test set is the same as that of the GRU model. Considering two kinds of errors, the inputs of RF include dynamic and static features relating to production, operations, formation, and fracturing, as listed in Table 3. The number of estimators is set to 50, and the maximum depth is 3.

Table 3. Input features of the RF model (recoverable rows): Production — the predicted production by GRU (Predicted q_T) and the predicted production of the previous 7 days by GRU.
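The RF correction model can be set up with scikit-learn. The hyperparameters below follow the text (50 estimators, maximum depth 3), while the feature matrix and target are illustrative stand-ins for the Table 3 inputs and the GRU residuals.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical feature matrix: columns stand in for the dynamic and static
# inputs of Table 3 (GRU-predicted rate, CS, SI, formation and fracturing
# parameters, ...). Real features would come from the field database.
X_train = rng.normal(size=(200, 8))
# Target: residual between GRU-predicted and true production rate
# (here synthesized so that the first column drives the error).
y_train = X_train[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Hyperparameters as stated in the text.
rf = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
rf.fit(X_train, y_train)
corrections = rf.predict(X_train[:5])
```

The shallow depth keeps each tree a coarse corrector, which limits overfitting to the residuals of individual training wells.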

The overall workflow of the GRU_RF model is illustrated in Figure 16. The proposed model, in the blue dashed box, consists of two sub-models: the GRU model and the RF model. Each step in the flowchart is described in detail below:
Step 1. Training the GRU model. After data preprocessing (as described in Figure 2), the data set is split into three parts, and the GRU model is fed with the training wells and optimized on the validation wells.
Step 2. Preliminary forecasting with the GRU model and Iterative prediction approach. Based on historical production data of HW, we can get the time-series predicted production by continuous iterations.
Step 3. Training the RF model. To correct the GRU model's prediction errors caused by iterations and by the diversity of formation and fracturing, the RF model is trained with the errors between the predicted production of the GRU model and the true production as its output.
Notably, the RF model is fed with the same training wells as the GRU model.
Step 4. Second forecasting with the RF model. Given inputs of production, operations, formation, and fracturing parameters, the correction errors can be predicted by the RF model.
Step 5. Forecasting with the GRU_RF model for test wells. The purpose is to use the trained model applicable to the whole well block to forecast the long-term production of new wells. Hence, given inputs of HW, we can directly get the predicted production (Step 2) and the predicted errors (Step 4) without training the model repeatedly, as shown in the deep red flow line in Figure 16. The predictions of the GRU_RF model are obtained by subtracting the predicted errors of the RF model from the predicted production of the GRU model.
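The final combination in Step 5 reduces to a subtraction of the RF-predicted errors from the GRU's iterative forecast. A minimal sketch, with a stand-in error model in place of the fitted RF:

```python
import numpy as np

def gru_rf_forecast(gru_predictions, rf_error_model, error_features):
    """Step 5 of the workflow: subtract the RF-predicted errors from the
    GRU's iterative predictions to obtain the corrected forecast.

    gru_predictions: rates from the GRU Iterative approach
    rf_error_model:  fitted regressor mapping features -> predicted error
    error_features:  one feature row per forecast day
    """
    predicted_errors = rf_error_model.predict(error_features)
    return np.asarray(gru_predictions) - predicted_errors

# Stand-in error model that always predicts a constant bias of +2 units.
class ConstantBias:
    def predict(self, X):
        return np.full(len(X), 2.0)

corrected = gru_rf_forecast([50.0, 48.0, 46.0], ConstantBias(), np.zeros((3, 4)))
```

In the actual workflow the stand-in would be the RF model fitted in Step 3, and `error_features` the Table 3 inputs for each forecast day.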

For a fair evaluation, the accuracy of the proposed GRU_RF model is compared with that of the fully connected RNN, LSTM, and the baseline GRU. To be consistent with GRU_RF, the data of the first 14 days are treated as training data, and the rest are test data. The model architecture, prediction approach, and external features of the RNN and LSTM are the same as for the GRU. The parameter values used to train the RNN and LSTM are also the same as for the GRU, as listed in Table 2; for the RNN/LSTM layer, the number of neurons is 32. All the models are trained and applied with the Iterative prediction approach and four-feature inputs. Figure 17 shows the prediction results of RNN, LSTM, GRU, and GRU_RF. All of the results are consistent with the production trend of the test wells; however, the results of GRU_RF have smaller errors, suggesting good model generalizability. Figure 18 compares the prediction performance of RNN, LSTM, GRU, and GRU_RF on the test wells. It shows that GRU_RF outperforms the other three methods. The mean RMSE of GRU_RF is 10.67, smaller than the mean RMSE of RNN, LSTM, and GRU, which is 16.82, 13.49, and 13.91, respectively. The mean MAE of GRU_RF is 7.55, less than that of RNN, LSTM, and GRU, which is 13.51, 10.60, and 11.29, respectively. In general, GRU_RF performs better than RNN, LSTM, and GRU. It is worth noting that only the inputs of the first HW are available during the whole prediction. Although the predicted points are not in perfect agreement with the recorded production curve, the trained model provides a feasible method to predict long-term production for newly developed wells with limited production data, whereas traditional DCA and conventional prediction methods cannot easily be applied to new wells with little data.
The proposed model has advantages over traditional DCA and conventional time-series prediction methods. First, it can be applied to new wells with a production time of over 2 weeks. For most of these new wells, the initial production rate shows an increasing trend because of the wellbore storage effect. Traditional DCA cannot be applied to wells without a declining trend [9]. The conventional time-series prediction methods (i.e., ARMA, ARIMA, ARIMAX) are linear models based on statistics [8,44]; learning from the initial production yields a rising prediction trend contrary to the long prediction section, which leads to huge errors. Therefore, neither traditional DCA nor conventional time-series prediction methods can achieve production prediction for wells with a short production time. Second, the proposed model can achieve batch forecasting of multiple wells. It is also much faster because the GRU_RF model does not need to be retrained each time, whereas the other two methods require tuning multiple model parameters (such as the initial decline rate, decline index, autoregression coefficient, and moving average coefficient) for each well, which is time-consuming and cumbersome. Third, it takes CS and SI into account, offering a way to analyze the effect of external features on the production rate.
As can be seen from Figures 11 and 12, the predicted rate changes as CS and SI change. However, traditional DCA, ARMA, and ARIMA cannot consider the effect of CS and SI on the production rate. Although ARIMAX can consider external features, it has many restrictions in application due to its strict statistical background [45], such as requiring strong correlations between the dependent feature and the external features, which constrains its application in production prediction. Finally, the proposed model prevents the subjectivity of analysts from affecting the predicted results: the same inputs guarantee the same output. In contrast, the predicted values of DCA vary with the selection of the starting point, which results in great uncertainty for production prediction [10].
Figure 19 shows the feature importance ranking determined by RF. It can be seen from Figure 19a that the most important features for model correction are the predicted rate 7 days ago, CS, and oil saturation, which together account for more than 60% of the feature importance. Figure 19b displays the importance scores of the different feature types. Production, formation conditions, and oilfield operations play an important role in correcting the production rate. This is consistent with the previous analysis of the causes of errors, in which the rising trend of initial production, various oilfield operations, and differences in formation and stimulation affect single-well prediction. It suggests that it is essential to take geological and completion features into account when building a model for a whole well block.
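A feature importance ranking of this kind comes directly from the fitted forest's `feature_importances_` attribute. The feature names and synthetic data below are illustrative; only the first column truly drives the target, mimicking a dominant feature such as the rate predicted 7 days ago.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical feature names; the real list follows Table 3.
names = ["pred_rate_7d_ago", "choke_size", "oil_saturation", "porosity"]
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)  # only column 0 matters

rf = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=1).fit(X, y)
# Rank features by mean decrease in impurity, highest first.
ranking = sorted(zip(names, rf.feature_importances_), key=lambda p: -p[1])
```

The importances sum to one, so grouped scores per feature type (as in Figure 19b) can be obtained by summing over the columns belonging to each type.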

Continuous Learning
Production is a dynamic and coupled process related to fracture propagation, proppant failure, formation property variation, and oilfield operations. With production data updated dynamically, we plan to introduce continuous learning into production prediction. Continuous learning is the concept of learning continuously and adaptively about the external world, enabling the autonomous incremental development of ever more complex skills and knowledge [46,47]. Here, continuous learning refers to updating the history window and the trained models for future production predictions during the whole production period. For one new well, continuous learning refers to the PartIterative scheme depicted in Figure 8c, in which the HW is updated as time goes on to make full use of the production data. For multiple wells, continuous learning means that, as data stream into the database, the model is updated as soon as enough new data are available, as shown in Figure 20. First, the currently available data are divided into a training set and a test set, and the No. 1 model is obtained from the current data. With new data added to the database, the old data serve as the training set and the new data become the test set, yielding the No. 2 model, and so on. In this way, the well-block model can be updated continuously, similar to the cognitive process from simple to deep. In the future, we will further explore continuous prediction to make full use of the available data.
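The expanding-window retraining schedule described above can be sketched as a generator of (train, test) index ranges; the window sizes are illustrative parameters, not values from the paper.

```python
def continuous_splits(n_days, initial_train, step):
    """Yield expanding-window (train, test) index ranges: each time `step`
    new days arrive, the previous test days join the training set and the
    model is retrained (models No. 1, No. 2, ... in Figure 20)."""
    end = initial_train
    while end + step <= n_days:
        yield range(0, end), range(end, end + step)
        end += step

splits = list(continuous_splits(n_days=120, initial_train=60, step=30))
```

Each yielded pair defines one retraining round: fit on the growing training range, evaluate on the newest block, then fold that block into the training set for the next round.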


Conclusions
In this research, a novel hybrid model, GRU_RF, considering multiple external factors, was proposed for the production prediction of the X well block. A thorough data preprocessing and modeling procedure was explained, which can be migrated to other reservoirs for batch production prediction. The prediction unit and length are adjustable thanks to the flexible model architecture.
The major conclusions are as follows:
(1) The proposed GRU_RF model provides a promising method for predicting the long-term production of newly developed wells from their initial production, as evidenced by six test wells. Its generalization ability was satisfactory when applied to wells not included in the training set.
(2) The model considering SI and CS provides a method to explore the effects of variable external features. The four-feature model outperformed the conventional one-feature model and accurately captured the complex variance pattern under multiple factors.
(3) In addition to production and oilfield operations, formation and fracturing parameters significantly affect the production prediction of multiple wells from one well block, as illustrated by the feature importance ranking of RF. It is essential to combine static features with dynamic features in time-series prediction.
(4) The proposed model outperformed traditional DCA, conventional time-series methods, simple RNN, LSTM, and GRU. Given limited production data, the model can achieve batch processing and provides a continuous learning method for real-time prediction.