A Production Prediction Method for Shale Gas Wells Based on Multiple Regression

The estimated ultimate recovery (EUR) of a single shale gas well is one of the important indicators for evaluating the large-scale, profitable development of shale gas. It is affected by many geological and engineering factors, so it is difficult to predict accurately. To predict EUR accurately, this study took 172 shale gas wells in the Weiyuan block as samples and selected 19 geological and engineering factors that affect the EUR of shale gas wells. Eight key controlling factors were then selected using a comprehensive evaluation method that combines the Pearson correlation coefficient and the maximum mutual information coefficient. The data were divided into training and testing samples, and seven schemes with different numbers of training samples were designed. Based on the key controlling factors, an EUR prediction model for shale gas wells in this block was established through multiple regression, and its effectiveness was verified on the testing samples. The results show that the error of the predicted EUR gradually decreases as the number of training samples increases. For single-well prediction, the average absolute error of EUR is less than 20% if more than 80 training wells are used. When analyzing the development potential of similar undrilled blocks, the error of the summed EUR is less than 10% if the number of training wells reaches 60.


Introduction
China has abundant, widely distributed shale gas resources of different types [1][2][3][4]. Given the enormous size of these resources, the estimated ultimate recovery (EUR) of a single well is important for accurately estimating shale gas potential and for achieving large-scale, efficient development [5]. The nano-scale pore characteristics, the multi-shift mechanism, and the "non-fixed pressure and non-fixed production" working system of shale gas reservoirs lead to complex flow characteristics [6][7][8][9]. These complexities bring great uncertainty to the EUR evaluation of shale gas wells. Therefore, identifying the key controlling factors of shale gas wells and obtaining an EUR prediction model is instructive for the efficient development of shale gas.
The EUR of shale gas wells is affected by many factors, and the factors influencing the EUR of shale gas wells in different blocks are different. Therefore, an increasing number of studies have been performed on the EUR in shale gas wells. Lei et al. [10] and Jia et al. [11] analyzed the key factors affecting shale gas well production and further proposed the technical direction of improving single well EUR. Xiao [12] and Geng et al. [13] screened the main controlling factors of shale gas well production and established the prediction model by using the grey correlation analysis method. Ma et al. [14] used the Pearson correlation coefficient (Pearson) and maximum mutual information coefficient (MIC) analysis method to analyze the key factors controlling the productivity of shale gas wells in the early stage.
Determining the key factors that control the EUR of shale gas wells plays an important role in increasing well output and guiding shale gas development.
The multiple regression method is a mathematical statistics method based on the fundamental regression principle. It can screen out the significant factors that influence the variation of a dependent variable, so it can be applied to forecasting the production rate of oil and gas fields [15]. Many Chinese scholars have studied the application of the multiple regression method in oil and gas fields. Tang et al. [16] predicted the production capacity of oil fields by means of the multiple regression method and tested the accuracy of the prediction. They considered the prediction highly instructive and suggested that the multiple regression method be used for predicting oil and gas production capacity. Hu et al. [17] built a productivity prediction model for the Changqing oil field using the multiple regression method; the prediction accuracy for the daily oil production per well reaches 85%. Wu et al. [18] built an initial post-fracturing productivity prediction model for the main area of Jiaoshiba by means of the multiple regression method, and the prediction achieved a certain degree of accuracy. Li et al. [19] built a productivity analogy prediction model for the oil well areas of Huangjinba using the multiple regression method.
So far, however, the application of multiple regression methods to shale gas wells has received little research attention. To this end, this paper studies the application of multiple linear regression to the productivity prediction of shale gas wells, taking 172 shale gas wells in the Weiyuan block as samples.

Methods
EUR is influenced by many factors, some of which have a linear relationship with EUR while others have a nonlinear relationship. To determine the key controlling factors of EUR comprehensively, the combined Pearson correlation coefficient and maximum mutual information coefficient (Pearson-MIC) method was adopted in this study to measure the linear and nonlinear relationships, respectively, between EUR and the various factors.

Pearson Correlation Analysis
The Pearson correlation coefficient, one of the most widely used measures of association, measures the linear relationship between two random variables [20,21]. It is calculated as follows:
$$ r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} \tag{1} $$
where Cov(x,y) is the covariance of x and y, Var(x) is the variance of x, and Var(y) is the variance of y. The greater the absolute value of the correlation coefficient, the stronger the correlation: the closer the coefficient is to 1 or −1, the stronger the correlation, and the closer it is to 0, the weaker the correlation. Generally, the correlation strength of variables is judged from the absolute value according to the following ranges: 0.8-1.0, extremely strong correlation; 0.6-0.8, strong correlation; 0.4-0.6, moderate correlation; 0.2-0.4, weak correlation; and 0.0-0.2, extremely weak correlation or no correlation.
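As an illustration of the definition above, the following minimal Python sketch computes the Pearson correlation coefficient from the covariance and variances and maps its absolute value to the qualitative ranges quoted in the text. The function names and the sample values are hypothetical and are not taken from the study; assigning a boundary value such as 0.6 to the stronger class is also an assumption, since the quoted ranges overlap at their endpoints.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: Cov(x, y) / sqrt(Var(x) * Var(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1/n factors of covariance and variance cancel, so sums suffice.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def correlation_strength(r):
    """Map |r| to the qualitative ranges quoted in the text.

    Assumption: a boundary value (e.g. 0.6) is assigned to the stronger class.
    """
    a = abs(r)
    if a >= 0.8:
        return "extremely strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "extremely weak or none"

# Hypothetical illustrative values, not data from the 172 wells.
segment_length = [1200.0, 1500.0, 1400.0, 1800.0, 1600.0]
eur = [0.8, 1.1, 0.9, 1.4, 1.2]
r = pearson_r(segment_length, eur)
print(round(r, 3), correlation_strength(r))
```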

Maximum Mutual Information Coefficient
The maximum mutual information coefficient (MIC) is a non-parametric, information-based exploratory statistic used to measure the strength of the linear or nonlinear association between two variables [22,23]. It can capture linear functional relationships as well as nonlinear ones (e.g., exponential and periodic), and it can reveal non-functional relationships in addition to functional ones, so it is broadly applicable [23]. MIC builds on the concept of mutual information, which is defined as follows [24]:
$$ I(x; y) = \iint p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \tag{2} $$
MIC is then calculated as:
$$ \mathrm{MIC}(x; y) = \max_{a \cdot b < B} \frac{I(x; y)}{\log_2 \min(a, b)} $$
where p(x) is the probability of variable x, p(y) is the probability of variable y, p(x,y) is the joint probability of variables x and y, a and b are the numbers of grid cells into which the x and y directions are divided (i.e., the grid partition), and B is a variable that bounds the grid size. MIC takes values between 0 and 1: 0 indicates no correlation, and 1 indicates complete correlation. Two variables are generally deemed to have a strong correlation when MIC is greater than 0.5 [25].
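To make the grid-based definition concrete, the sketch below estimates mutual information on an a-by-b grid and takes the maximum of the normalized value over all grids with a·b < B. It is a simplified approximation and not the full MIC algorithm: it only tries equal-frequency partitions on each axis instead of optimizing the grid boundaries, so it generally gives a lower bound on MIC. The choice B = n^0.6 is a commonly used default and is an assumption here, since the text only states that B is a variable; the function names are hypothetical.

```python
import numpy as np

def equal_frequency_bins(v, k):
    """Label each value with one of k bins holding roughly equal counts."""
    ranks = np.argsort(np.argsort(v, kind="stable"), kind="stable")
    return ranks * k // len(v)

def mutual_information(xb, yb, a, b):
    """Discrete form of Formula (2), in bits, from integer bin labels."""
    joint = np.zeros((a, b))
    np.add.at(joint, (xb, yb), 1.0)         # count points per grid cell
    joint /= joint.sum()                    # joint probability p(x, y)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (a, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, b)
    mask = joint > 0                        # empty cells contribute nothing
    return float((joint[mask] * np.log2(joint[mask] / (px * py)[mask])).sum())

def approx_mic(x, y, exponent=0.6):
    """Approximate MIC using equal-frequency grids only (a simplification).

    Assumption: B = n ** exponent, floored at 5 so a 2 x 2 grid is always tried.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    B = max(len(x) ** exponent, 5.0)
    best = 0.0
    a = 2
    while a * 2 < B:
        xb = equal_frequency_bins(x, a)
        b = 2
        while a * b < B:
            yb = equal_frequency_bins(y, b)
            score = mutual_information(xb, yb, a, b) / np.log2(min(a, b))
            best = max(best, score)
            b += 1
        a += 1
    return float(best)

# Hypothetical example: a symmetric parabola has almost no linear correlation,
# yet the grid-based score detects the dependence.
x_demo = np.linspace(-1.0, 1.0, 200)
print(np.corrcoef(x_demo, x_demo ** 2)[0, 1], approx_mic(x_demo, x_demo ** 2))
```

For production use, a dedicated MIC implementation that optimizes the grid boundaries would be preferable to this equal-frequency shortcut.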

Pearson-MIC Comprehensive Evaluation Method
The Pearson correlation coefficient is sensitive to linear relationships [25][26][27][28]. By comparison, MIC is more robust, less susceptible to outliers, and able to detect potential nonlinear relationships between variables [14]. To combine the advantages of these two correlation analyses, the Pearson-MIC comprehensive evaluation method proposed by Ma et al. [14] is used to screen the key controlling factors affecting the EUR of shale gas wells.
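One possible way to combine the two measures is sketched below: a factor is retained if either its absolute Pearson coefficient or its MIC exceeds a threshold. This union rule is an assumption about how the two screenings are merged, although it appears consistent with the eight factors listed in the conclusions. The default thresholds of 0.35 and 0.55 are the cut-offs used later in this study, while the function name and the scores in the example are hypothetical.

```python
def screen_factors(pearson_scores, mic_scores, r_min=0.35, mic_min=0.55):
    """Keep a factor if |Pearson r| > r_min or MIC > mic_min (assumed union rule).

    Returns a list of (name, passed_linear, passed_nonlinear) tuples.
    """
    selected = []
    for name, r in pearson_scores.items():
        linear = abs(r) > r_min
        nonlinear = mic_scores.get(name, 0.0) > mic_min
        if linear or nonlinear:
            selected.append((name, linear, nonlinear))
    return selected

# Hypothetical scores for illustration only; they are not values from Tables 3 and 4.
pearson_demo = {"factor_A": 0.52, "factor_B": 0.10, "factor_C": -0.41, "factor_D": 0.05}
mic_demo = {"factor_A": 0.60, "factor_B": 0.58, "factor_C": 0.30, "factor_D": 0.20}

for name, linear, nonlinear in screen_factors(pearson_demo, mic_demo):
    print(name, "linear" if linear else "", "nonlinear" if nonlinear else "")
```

A factor that passes only the MIC threshold (factor_B in this example) is a candidate for a nonlinear transformation before it enters a linear model.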

Multiple Linear Regression Method
Linear regression is one of the most important mathematical models, and it is often used as the basis of many other models [29]. Multiple linear regression is a very important method for multivariate statistical analysis, and it can be used to evaluate the relative importance of each independent variable to the dependent variable [30]. The multiple linear regression model can be expressed as
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n $$
where y is the dependent variable, x_1, ..., x_n are the independent variables, and β_0, ..., β_n are the unknown parameters. Because of its importance, the multiple linear regression model has been widely used in various industries, such as the economy, petroleum, and meteorological industries.
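The unknown parameters of such a model are typically estimated by ordinary least squares. The following minimal sketch, which assumes NumPy and uses hypothetical function names and synthetic data, fits the parameters, predicts new values, and computes the goodness-of-fit R².

```python
import numpy as np

def fit_mlr(X, y):
    """Estimate beta_0..beta_n by ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])   # leading column of ones for beta_0
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_mlr(beta, X):
    """Evaluate y = beta_0 + beta_1 * x_1 + ... + beta_n * x_n."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]

def r_squared(y, y_hat):
    """Goodness-of-fit: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

# Synthetic illustration (not well data): y = 1 + 2*x1 - 3*x2 exactly.
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 7.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]
beta_demo = fit_mlr(X_demo, y_demo)
print(beta_demo, r_squared(y_demo, predict_mlr(beta_demo, X_demo)))
```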

Factor Selection
The Weiyuan shale gas field is located in the northern part of southern Sichuan and covers an area of about 4024 km². The terrain is mountainous in the north and hilly in the central and southern regions, and it tilts from northwest to southeast [31]. The high-quality shale section of the Wufeng-Longmaxi Formation of the Lower Silurian in the Weiyuan block is buried at depths of 1500-3700 m, and the burial depth increases from southwest to southeast. The pressure coefficient is 1.2-2.0, indicating that the reservoirs are mostly overpressured gas reservoirs. The target horizon of the Weiyuan shale gas field in the study area is the L1₁ sub-member of the Longmaxi Formation, which has reservoir characteristics of high total organic carbon (TOC), high porosity, and high gas content [32]. The Longmaxi Formation is vertically divided into the first and second members, referred to hereinafter as L1 and L2, respectively. The L1 member is further subdivided into the first (L1₁) and second (L1₂) sub-members, and the L1₁ sub-member is further divided into the L1₁¹, L1₁², L1₁³, and L1₁⁴ sublayers.
The production of shale gas wells is affected by many factors, including geological factors, engineering factors, and economic factors. Geological factors include reservoir thickness, TOC, gas content, maturity, porosity and permeability characteristics of the matrix, pressure coefficient, fracture development, burial depth, contents of brittle minerals, water saturation, etc. Engineering factors include drilling length of a high-quality reservoir, number of fracturing segments, number of perforation clusters, segment distance, cluster distance, horizontal sections, sand contents, fracturing fluid volume, the amount of proppant, flowback rate, etc. [11,14,[33][34][35][36][37][38][39][40]. Geological factors are uncontrollable, whereas engineering factors and economic factors are controllable. Engineering factors are affected by geological factors and economic factors. Geological factors and engineering factors can directly affect the productivity of shale gas wells [14].
In this study, geological and engineering parameters were collected from 172 shale gas wells in the Weiyuan block. As shown in Table 1, 19 geological and engineering factors were selected, according to the availability and effectiveness of the statistical data, to analyze the key controlling factors of EUR in the Weiyuan block and to build the EUR prediction model. Table 2 shows the linear fitting between EUR and the 19 factors, and the scatter plots with a goodness-of-fit greater than 0.1 are shown in Figure 1. Table 3 shows the Pearson correlation coefficients between EUR and the 19 factors, and the influencing factors with a correlation coefficient greater than 0.35 are selected. The geological factors are the thickness of the L1₁¹ sublayer and gas saturation. The engineering factors are fracturing segment length, fracturing section, drilling catching length into the class I reservoir, and drilling catching length of the L1₁¹ sublayer. Table 4 shows the MIC between EUR and the 19 factors, and the influencing factors with a coefficient greater than 0.55 are selected. The geological factor is the thickness of the L1₁¹ sublayer, and the engineering factors include fracturing segment length, fracturing fluid intensity, drilling catching length into the class I reservoir, drilling catching length of L1₁

EUR Prediction of Shale Gas Wells Based on Multiple Regression
Taking 172 shale gas wells in the Weiyuan block as samples and EUR as the evaluation target, this paper designed seven schemes to examine the effect of training sample size on the results of the regression model; the details of the schemes are shown in Table 5. The same 40 gas wells serve as the testing samples in every scheme. In multi-factor analysis, the indicators cannot be combined directly because each has its own nature and unit of measurement, so a nondimensional treatment is required [41,42]. Therefore, the extremum method is applied to each factor to eliminate the influence of different dimensions (Formula 4). The EUR prediction model can then be built by multiple regression on the processed data.
$$ x_i' = \frac{x_i - \min x_i}{\max x_i - \min x_i} \tag{4} $$
where max x_i is the maximum value of the sample data and min x_i is the minimum value of the sample data. The Pearson-MIC comprehensive evaluation indicates that EUR has nonlinear relationships with fracturing fluid intensity and the 360-day flowback rate. Curve fitting shows that a quadratic relationship describes fracturing fluid intensity best: y = 0.109x − 0.002x² − 0.467, and that a cubic relationship describes the 360-day flowback rate best: y = −3.418x + 5.295x² − 2.391x³ + 1.5. Multiple linear regression was then performed after fracturing fluid intensity and the 360-day flowback rate had been linearized through these fitted relationships.
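The preprocessing described above can be sketched as follows. The scaling function implements the extremum (min-max) method, and the two transformation functions evaluate the fitted quadratic and cubic relationships so that each nonlinear factor can enter the regression as a single linear term. Treating the fitted polynomial value as the transformed variable is one reading of the linearization step and is an assumption, as is the question of whether the polynomials take raw or scaled inputs, which the text does not state. All function names are hypothetical.

```python
import numpy as np

def min_max_scale(x):
    """Extremum method: map values to [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0.0:
        return np.zeros_like(x)   # a constant factor carries no information
    return (x - x.min()) / span

def fluid_intensity_term(x):
    """Quadratic fit quoted in the text: y = 0.109x - 0.002x^2 - 0.467."""
    x = np.asarray(x, dtype=float)
    return 0.109 * x - 0.002 * x ** 2 - 0.467

def flowback_term(x):
    """Cubic fit quoted in the text: y = -3.418x + 5.295x^2 - 2.391x^3 + 1.5."""
    x = np.asarray(x, dtype=float)
    return -3.418 * x + 5.295 * x ** 2 - 2.391 * x ** 3 + 1.5

# Hypothetical factor values, for illustration only.
thickness = [3.2, 4.8, 5.5, 6.1]
print(min_max_scale(thickness))
print(flowback_term([0.1, 0.3, 0.5]))
```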
Because the relationships among the various factors are relatively complicated and multicollinearity may exist, this paper adopts the stepwise regression method. The basic idea of this method is to introduce new variables one at a time. If the partial regression sum of squares of a new variable is significant after testing, the variable can be introduced; it is then regarded as an independent explanatory variable that cannot be (approximately) linearly represented by the other explanatory variables. Otherwise, the new variable is not independent and should not be introduced [43].
The multiple linear regression models and the goodness-of-fit (R²) of the seven schemes are shown in Table 6. For each scheme, the average relative error and the average absolute error between the true values and the model predictions of the testing samples are calculated. The evaluation results are shown in Figure 2 and Table 7.
Table 6. Multiple linear regression model results.
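The idea of introducing variables step by step can be illustrated with the forward-selection sketch below. At each step, the candidate with the largest partial F statistic is added if the statistic exceeds an entry threshold. This is a simplified variant under stated assumptions: it only adds variables and never removes them, and the threshold f_enter = 4.0 is an illustrative choice rather than a value from the study. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def residual_sum_of_squares(columns, y):
    """Least-squares fit of y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, names, f_enter=4.0):
    """Forward selection using the partial F statistic of each candidate.

    Assumptions: forward-only (no removal step); f_enter is illustrative.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    chosen, remaining = [], list(range(X.shape[1]))
    current = residual_sum_of_squares([], y)
    while remaining:
        best = None
        for j in remaining:
            cols = [X[:, k] for k in chosen + [j]]
            dof = n - len(cols) - 1          # residual degrees of freedom
            if dof <= 0:
                continue
            new = residual_sum_of_squares(cols, y)
            # Partial F: reduction in residual sum of squares per unit of
            # residual variance after adding candidate j.
            f_stat = (current - new) / max(new / dof, 1e-12)
            if best is None or f_stat > best[0]:
                best = (f_stat, j, new)
        if best is None or best[0] < f_enter:
            break                            # no remaining candidate is significant
        chosen.append(best[1])
        remaining.remove(best[1])
        current = best[2]
    return [names[j] for j in chosen]

# Synthetic illustration: the third column duplicates the first, so it adds
# no independent information once the first has been introduced.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=80), rng.normal(size=80)
target = 3.0 * f1 + rng.normal(scale=0.1, size=80)
print(forward_stepwise(np.column_stack([f1, f2, f1]), target,
                       ["factor_1", "factor_2", "factor_1_copy"]))
```

A collinear candidate yields almost no further reduction in the residual sum of squares, so its partial F statistic stays below the threshold and it is not introduced.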

Discussion
The models obtained by multiple linear regression show that, when the training sample sizes were 10 and 20 wells, the relationship between EUR and each key controlling factor could not be fully explored because of the small amount of data; therefore, only two factors, gas saturation and fracturing segment length, were included in the regression model. When the number of training samples increased to 40 gas wells, the regression model included three key controlling factors, namely, fracturing segment length, the thickness of the L1₁¹ sublayer, and drilling catching length into the class I reservoir. When the training sample size increased to 120 gas wells, the drilling catching length of the L1₁¹ sublayer became a more significant influence on EUR than the drilling catching length into the class I reservoir.
The prediction results of the testing samples show that as the number of training samples increases, the error of the prediction results can be significantly reduced, but the increase in the sample size makes the goodness-of-fit of the regression model worse. When the training sample size is less than 40 gas wells, the average absolute error and average relative error of the test samples are both large, and the ideal prediction effect cannot be achieved. As the training sample size increases to more than 80 gas wells, the average absolute error of the test samples can be reduced to less than 20%.
The error of the sum of EUR of 40 gas wells in the seven schemes of testing samples also gradually decreases with the increase of the training sample size. When the training sample size is 60 gas wells, the error is less than 10%, and when the sample size is 120 gas wells, the error is only 0.83%.
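The three error measures discussed here can be sketched as follows. The exact definitions are not given in the text, so the formulas below are assumptions: the average relative error is taken as the mean signed error relative to the true value, the average absolute error as the mean of the absolute relative errors, and the error of the sum as the relative difference between the predicted and true total EUR of the testing wells. The function names and sample values are hypothetical.

```python
import numpy as np

def average_relative_error(true, pred):
    """Mean of (pred - true) / true; over- and under-predictions can cancel."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.mean((pred - true) / true))

def average_absolute_error(true, pred):
    """Mean of |pred - true| / true, as a fraction of the true value."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(pred - true) / true))

def sum_error(true, pred):
    """Relative error of the total predicted EUR over all testing wells."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(abs(pred.sum() - true.sum()) / true.sum())

# Hypothetical EUR values for three wells, not data from the study.
eur_true = [1.0, 2.0, 4.0]
eur_pred = [1.1, 1.8, 4.4]
print(average_relative_error(eur_true, eur_pred),
      average_absolute_error(eur_true, eur_pred),
      sum_error(eur_true, eur_pred))
```

Under these definitions, positive and negative single-well errors partly cancel in the total, which may help explain why the error of the summed EUR falls below 10% with fewer training wells than are needed for the single-well error to fall below 20%.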
In summary, an EUR prediction model for shale gas wells in this area is established based on the multiple linear regression method, which verifies the effectiveness of multiple regression in shale gas well EUR prediction. The method is simple to operate and is suitable for on-site EUR prediction of shale gas wells. However, applying and extending the model established by this method to other blocks may still be subject to certain restrictions. This paper takes only the Weiyuan block as an example to analyze the effectiveness of the multiple regression method, and other blocks require their own specific analysis. The results show that reducing the error of the EUR prediction model requires data from more producing gas wells. How to obtain a more accurate EUR prediction from data on only a few gas wells remains a direction for continued improvement.

Conclusions
(1) The geological and engineering parameters of 172 shale gas wells in the Weiyuan block are comprehensively analyzed. The key controlling factors affecting EUR in this block are determined by means of the Pearson-MIC comprehensive correlation evaluation method. The results show that the main geological factors affecting EUR in this block are the thickness of the L1₁¹ sublayer and gas saturation, and the engineering factors are the fracturing segment length, fracturing section, fracturing fluid intensity, drilling catching length into the class I reservoir, drilling catching length of the L1₁¹ sublayer, and 360-day flowback rate.
(2) The data of 172 actual production wells in the Weiyuan block are selected as samples, and seven different training sample schemes are designed. Multiple linear regression models are established based on the selected key controlling factors. The results show that the number of training samples used to establish the model has a great influence on the accuracy of the prediction: the more training samples there are, the smaller the error of the predicted EUR. When the training sample size is greater than 80 wells, the EUR prediction error is less than 20%, and the prediction of single-well EUR has good accuracy.
(3) With the multiple linear regression method, the error of the sum of the EUR of the testing samples (40 gas wells) is less than 10% when the training sample size reaches 60 gas wells, and the error of the sum gradually decreases as the training sample size increases. This result shows that the method can be used as a good criterion for block standard well evaluation and applied to the development potential analysis of similar undrilled blocks.
In summary, this analysis method based on data mining provides a new idea for the EUR prediction of shale gas wells and improves the efficiency of EUR prediction for shale gas wells. Therefore, it is recommended to apply the multiple regression method to the EUR prediction of shale gas wells.