Investigating the Predictive Performance of Gaussian Process Regression in Evaluating Reservoir Porosity and Permeability

In this paper, a new predictive model based on Gaussian process regression (GPR) that does not require iterative tuning of user-defined model parameters is proposed to determine reservoir porosity and permeability. For this purpose, the capability of GPR was appraised statistically for predicting porosity and permeability of the southern basin of the South Yellow Sea using petrophysical well log data. Generally, the performance of GPR depends strongly on the type of covariance function utilized. Therefore, to obtain the optimal GPR model, five different kernel functions were tested. The resulting optimal GPR model used the exponential covariance function, which produced the highest correlation coefficient (R) of 0.85 and the lowest root mean square error (RMSE) of 0.037 and 6.47 for porosity and permeability, respectively. A comparison was further made with benchmark methods involving a back propagation neural network (BPNN), generalized regression neural network (GRNN), and radial basis function neural network (RBFNN). The statistical findings revealed that the proposed GPR is a powerful technique and can be used as a supplement to the widely used artificial neural network methods. In terms of computational speed, the GPR technique was faster than the BPNN, GRNN, and RBFNN methods in estimating reservoir porosity and permeability.


Introduction
The focus of oil and gas exploration on complex and challenging fields has increased sharply in recent years [1]. These complex hydrocarbon fields exhibit a high degree of heterogeneity and a non-uniform spatial distribution of reservoir properties [2,3]. Reservoir properties such as porosity and permeability cannot be accurately characterized by empirical models developed from well log parameters, since such models are inadequate to fully account for the heterogeneous nature of complex well conditions. An ideal and more precise approach is to measure these parameters directly from core and well testing analysis. However, this is known to be very expensive and time-consuming when conducted on all wells. This is the major reason why the petroleum industry has embraced computational intelligence techniques, particularly in reservoir characterization, as an inexpensive way of obtaining reliable values [4]. Many studies [5][6][7][8][9][10][11][12][13][14] have reported the capability of various forms and combinations of computational intelligence methods, including artificial neural network (ANN), in generating porosity and permeability values. However, most of these computational intelligence methods (including ANN) are slow in generating results, mainly due to the iterative tuning of the models' user-defined parameters and the steepest-gradient training algorithms utilized. Also, when performing prediction, supervised techniques like back propagation neural network (BPNN) and polynomial regression methods mainly engage in error reduction with less regard for the generalization or predictive ability of the model. Generally, this causes overfitting, where the computational intelligence model performs well during training but cannot replicate the same results when tested on completely unseen data [15].
To overcome these issues, a non-parametric approach to regression known as Gaussian process regression (GPR) avoids overfitting by placing a prior distribution directly over the space of possible functions [15][16][17][18]. GPR is also known to generalize well due to its preference for a smooth function that accurately explains the training data without the manual parameter tuning required by ANN [19]. This has led to its application in oil and gas exploration studies. Rawlinson and Vasudeven [20] used GPR to successfully model a deep induction resistivity well log. They observed that GPR could predict better than linear interpolation fitting and inverse distance weighting when evaluating and quantifying uncertainties at locations with no data. A Gaussian process as a machine learning tool was also adopted by Yu et al. [15] to predict total organic carbon (TOC) data of the Yangchang formation in the south-eastern Ordos basin. Yu et al. [15] found that the GPR model selected well logs that had improved correlation with the TOC data and generated predictions with the least error compared to conventional methods like the Schmoker and Parsey approach.
Based on the computational benefits realized in the aforementioned discussion, the main aim of this study is to extend the application of Gaussian process regression (GPR) and assess its viability as a predictive intelligent model for the evaluation of reservoir porosity and permeability from petrophysical well logs. Furthermore, comparative analysis with benchmark ANN methods of back propagation, generalized regression, and radial basis function was conducted. This was to ascertain the predictive performance of GPR in acquiring porosity and permeability data of the southern basin of the South Yellow Sea.
The South Yellow Sea basin, the southern major sedimentary basin of the Yellow Sea, is located between the Chinese continental landmass and the Korean peninsula (Figure 1a). A central uplift divides the South Yellow Sea basin into a northern and a southern basin. This research focuses on the southern basin because of the recent oil discovery there. This article is structured as follows: a detailed description of the data used and of GPR is presented in Section 2. Results and discussion of the predictions of the GPR and ANN models are given in Section 3. The final conclusion is made in Section 4.

Data Description
The South Yellow Sea basin is a south-west-west oriented rift depression basin located between the Subei basin and the Korean peninsula, as shown in Figure 1b. It was formed as a result of multi-cycle tectonic action, erosion, and uplift events from the Neoproterozoic through to the Cenozoic era [21][22][23]. It is worth noting that significant oil has been discovered in the Funing, Dainan, and Sanduo Formations of the southern basin. These formations are composed mainly of interbedded sandstone and mudstone, as indicated in Figure 1c. Muddy limestone can be found in the middle and upper part of the Funing Formation. Figure 1c also shows a layer of coal in the lower portion of the Dainan Formation and muddy limestone in its upper portion.
Two wells (i.e., Well A and Well B) that target the oil-bearing formations in the southern basin of the South Yellow Sea were considered in the present study (Figure 1b). Well logs, core porosity, and permeability data of Well A were used to train the computational intelligence models, while the models' predictive capabilities were examined on the data of Well B. The input well log parameters were gamma ray (GR), sonic travel time (DT), resistivity (RT), and spontaneous potential (SP). The output parameters were core porosity and permeability. A total of 727 samples from Well A were used for training and 311 samples from Well B were used for testing. The well logs adopted in the study are shown in Figure 2. The statistical descriptions of the well log parameters, porosity, and permeability considered in this study are listed in Tables 1 and 2.
Energies 2018, 11, x FOR PEER REVIEW 3 of 13

Gaussian Process Regression
A Gaussian process (GP) is a collection of random variables, any finite subset of which has a consistent joint Gaussian distribution [24][25][26]. A GP is fully specified by a mean function and a covariance function. Since the GP is a linear combination of random variables having a normal distribution, the mean function is usually assumed to be zero for simplicity. Assume a training set with input vector $x \in \mathbb{R}^n$, where n is the number of input parameters, and output variable $y \in \mathbb{R}$, which here is porosity or permeability. The Gaussian process is then represented in Equation (1) as:
$$f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right) \tag{1}$$
where $\mathcal{GP}$ is the Gaussian process, $m(x)$ is the mean function, and $k(x, x')$ is the covariance function.
The mean function $m(x)$ represents the expected value of the function at the input point x, as expressed in Equation (2):
$$m(x) = \mathbb{E}\left[f(x)\right] \tag{2}$$
The covariance function $k(x, x')$ is a measure of the confidence level for $m(x)$, as represented in Equation (3):
$$k(x, x') = \mathbb{E}\left[\left(f(x) - m(x)\right)\left(f(x') - m(x')\right)\right] \tag{3}$$
The covariance function takes any two input arguments and must generate a non-negative definite (positive semidefinite) covariance matrix K.
The covariance function implicitly specifies certain aspects of the model such as smoothness, periodicity, and stationarity. The basic and most widely used Gaussian process regression (GPR) model is composed of a zero mean and the squared exponential covariance function, as expressed in Equation (4):
$$k_{SE}(x, x') = \sigma_f^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2 l^2}\right) \tag{4}$$
where $\sigma_f$ and $l$ are the hyperparameters that influence the performance of the GP: $\sigma_f$ is the signal standard deviation and $l$ is the length scale. The covariance is high (close to $\sigma_f^2$) for inputs that are in close proximity and decreases exponentially as the distance between inputs increases.

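4. Rational quadratic (standard form, with $\alpha > 0$ the scale-mixture parameter):
$$k_{RQ}(x, x') = \sigma_f^2 \left(1 + \frac{\lVert x - x' \rVert^2}{2 \alpha l^2}\right)^{-\alpha}$$
To estimate the expected function value $\bar{f}_*$ at the test inputs $X_*$ under the joint Gaussian prior, Equation (10) can be used:
$$f_* \mid X, y, X_* \sim \mathcal{N}\left(\bar{f}_*,\, \operatorname{cov}(f_*)\right) \tag{10}$$
$\bar{f}_*$ is the mean value of the prediction and gives the best estimate for $y_*$. The variance $\operatorname{cov}(f_*)$ indicates the uncertainty of the prediction. The mean prediction $\bar{f}_*$ in Equation (11) is a linear combination of the targets y, while the variance $\operatorname{cov}(f_*)$ does not depend on the targets but only on the inputs:
$$\bar{f}_* = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y \tag{11}$$
$$\operatorname{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*)$$
where $K(X, X)$ is the covariance matrix of the training dataset, $K(X_*, X_*)$ is the covariance matrix of the testing data, $K(X, X_*)$ is the $N \times N_*$ covariance matrix obtained from the training and testing data, $K(X_*, X) = K(X, X_*)^T$, and $\sigma_n^2$ is the noise variance. The marginal likelihood, obtained by integrating over the latent function values f, is expressed in Equation (12) as:
$$p(y \mid X, \theta) = \int p(y \mid f, X)\, p(f \mid X, \theta)\, df \tag{12}$$
Taking the logarithm to simplify the integral expression of Equation (12) gives the log marginal likelihood in Equation (13):
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T \left[K(X, X) + \sigma_n^2 I\right]^{-1} y - \frac{1}{2} \log\left|K(X, X) + \sigma_n^2 I\right| - \frac{N}{2} \log 2\pi \tag{13}$$
where $\theta$ is the set of hyperparameters needed for a given covariance function. The optimal hyperparameters of the covariance function can be obtained by maximizing the marginal likelihood $p(y \mid X, \theta)$ or, equivalently, minimizing the negative log marginal likelihood. The output of the GPR model is presented in terms of its mean and variance.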
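The prediction and likelihood expressions above can be sketched in a few lines of code. The following is a minimal illustration only, not the implementation used in the study (which was done in MATLAB with a constant mean function): it assumes a zero mean, fixed hyperparameters, and an exponential kernel, and all function names (`exp_kernel`, `gp_fit`, `gp_predict`, and the helpers) are hypothetical. A Cholesky factorization replaces the explicit matrix inverse for numerical stability.

```python
import math

def exp_kernel(a, b, sigma_f=1.0, length=1.0):
    """Exponential covariance between input vectors a and b (assumed form)."""
    r = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sigma_f ** 2 * math.exp(-r / length)

def cholesky(A):
    """Lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L z = b."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def solve_lower_transposed(L, z):
    """Back substitution: solve L^T x = z."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(X, y, kernel, sigma_n):
    """Factorize K(X, X) + sigma_n^2 I and return (L, alpha, log marginal likelihood)."""
    K = [[kernel(xi, xj) + (sigma_n ** 2 if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y))  # [K + sigma_n^2 I]^{-1} y
    n = len(y)
    # log|K + sigma_n^2 I| = 2 * sum(log L_ii), so the middle term is -sum(log L_ii)
    lml = (-0.5 * sum(yi * ai for yi, ai in zip(y, alpha))
           - sum(math.log(L[i][i]) for i in range(n))
           - 0.5 * n * math.log(2.0 * math.pi))
    return L, alpha, lml

def gp_predict(X, L, alpha, kernel, x_star):
    """Posterior mean and variance of f* at a single test input x_star."""
    k_star = [kernel(xi, x_star) for xi in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = kernel(x_star, x_star) - sum(vi * vi for vi in v)
    return mean, var
```

In practice the hyperparameters would be chosen by maximizing the returned log marginal likelihood rather than being fixed by hand, and each input vector would hold the four well log values (GR, DT, RT, SP) for one depth sample.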

Developing GPR Models
Different base kernel (covariance) functions, namely squared exponential (covSE), rational quadratic (covRQ), Matérn 5/2 (covMatern5), Matérn 3/2 (covMatern3), and exponential (covExp), were assessed when developing the GPR models for porosity and permeability prediction. The aim was to determine the optimal covariance function that could produce reliable predictions of porosity and permeability that were closely related to the actual values of Well B. For a simple GPR, the constant mean function is known to generate better results than both zero and linear mean functions [16]. The study, therefore, adopted the constant mean function for all the GPR models created.
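For orientation, the five covariance functions named above have standard closed forms in common GP toolboxes. The sketch below writes them as functions of the distance r between two inputs, assuming isotropic kernels with a single length scale; the exact parameterization used in the study is not stated, so the function names and the parameters `sigma_f`, `length`, and `alpha` are illustrative assumptions.

```python
import math

def cov_se(r, sigma_f=1.0, length=1.0):
    """Squared exponential: very smooth sample functions."""
    return sigma_f ** 2 * math.exp(-0.5 * (r / length) ** 2)

def cov_exp(r, sigma_f=1.0, length=1.0):
    """Exponential: the roughest of the five (continuous but not differentiable)."""
    return sigma_f ** 2 * math.exp(-r / length)

def cov_matern3(r, sigma_f=1.0, length=1.0):
    """Matern 3/2."""
    s = math.sqrt(3.0) * r / length
    return sigma_f ** 2 * (1.0 + s) * math.exp(-s)

def cov_matern5(r, sigma_f=1.0, length=1.0):
    """Matern 5/2."""
    s = math.sqrt(5.0) * r / length
    return sigma_f ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)

def cov_rq(r, sigma_f=1.0, length=1.0, alpha=1.0):
    """Rational quadratic: a scale mixture of squared exponential kernels."""
    return sigma_f ** 2 * (1.0 + r * r / (2.0 * alpha * length ** 2)) ** (-alpha)
```

All five kernels equal sigma_f squared at r = 0 and decay with distance; they differ in how quickly correlation drops for nearby inputs, which is one plausible reason a rougher kernel such as the exponential may suit heterogeneous well log data.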
Correlation coefficient (R) and root mean square error (RMSE) were the statistical indices employed to quantify the performance of the developed models. Here, R examined whether there existed a linear relationship between the predicted and the measured porosity and permeability values. RMSE showed the difference between the predicted outcome and the measured porosity and permeability values. The mathematical expressions for R and RMSE are given in Equations (14) and (15), respectively. The GPR models were coded and implemented in MATLAB software (R2016a version, MathWorks Inc., Natick, MA, USA).
Correlation coefficient (R):
$$R = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{n} (t_i - \bar{t})^2 \sum_{i=1}^{n} (p_i - \bar{p})^2}} \tag{14}$$
Root mean square error (RMSE):
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (t_i - p_i)^2} \tag{15}$$
where $t$ is the measured parameter value, $p$ is the predicted parameter value, $\bar{t}$ is the mean measured parameter value, $\bar{p}$ is the mean predicted parameter value, and $n$ is the total number of data points [27].
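The two indices can be computed directly from paired lists of measured and predicted values. The following is a small sketch of the standard Pearson correlation and RMSE definitions; the function names are hypothetical and the toy numbers in the usage note are not data from the study.

```python
import math

def correlation_coefficient(t, p):
    """Pearson correlation R between measured values t and predicted values p."""
    n = len(t)
    t_bar = sum(t) / n
    p_bar = sum(p) / n
    num = sum((ti - t_bar) * (pi - p_bar) for ti, pi in zip(t, p))
    den = math.sqrt(sum((ti - t_bar) ** 2 for ti in t)
                    * sum((pi - p_bar) ** 2 for pi in p))
    return num / den

def rmse(t, p):
    """Root mean square error between measured values t and predicted values p."""
    return math.sqrt(sum((ti - pi) ** 2 for ti, pi in zip(t, p)) / len(t))
```

Note that the two indices capture different things: predictions `[2, 4, 6, 8]` against measurements `[1, 2, 3, 4]` are perfectly correlated (R = 1) yet have a non-zero RMSE, which is why both are reported together.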

Performance of Porosity Models
The prediction results of the developed GPR porosity models based on different covariance functions are presented in Table 3. For this evaluation, the completely unseen porosity data of Well B was introduced into the GPR porosity models. We can observe that covSE performed worst among the covariance functions. This is confirmed by Figure 3A, where the correlation between measured porosity and the covSE-GPR prediction was 0.8334. Improved porosity results relative to covSE were obtained with covRQ, covMatern5, and covMatern3, which produced R-values of 0.8415, 0.8412, and 0.8379, respectively (see Figure 3B-D). However, the highest R-value (0.85) was from the covExp porosity model. From Figure 3E, covExp was identified as the best-ranked GPR porosity model, with the lowest prediction RMSE of 0.0374 compared to 0.0387 for covSE, covRQ, covMatern5, and covMatern3 (Table 3).

Performance of Permeability Models
The various GPR permeability models based on the different covariance functions were assessed using the RMSE and R criteria. In Figure 4A, it can be observed that covSE generalized most poorly among the candidate models, with an R-value of 0.8402. Improved R-values of 0.8428, 0.8429, and 0.8439 were produced by covRQ, covMatern5, and covMatern3, respectively. The correlation plots of the predictions from covRQ, covMatern5, and covMatern3 against the measured permeability data are shown in Figure 4B-D. Figure 4E shows covExp scoring the highest R-value of 0.85.
We can also see from Table 4 that covExp performed slightly better than the rest of the kernel functions in predicting the permeability data of Well B. The prediction from the covExp kernel function generated an error of 6.4717, which was marginally lower than the error obtained from covMatern3 (6.5101), covMatern5 (6.5295), and covRQ (6.5309), as listed in Table 4. covSE attained the highest prediction RMSE of 6.5780 (Table 4). Therefore, among the covariance functions assessed for predicting permeability, covExp was identified as the best-performing kernel function.

Comparative Analysis with Artificial Neural Network (ANN)
The covExp-GPR model, which had the best performance in predicting porosity and permeability, was further compared with the ANN methods of back propagation, generalized regression, and radial basis function, which have been widely used for this purpose. To ensure a fair comparison, the same petrophysical well log data of Well A was used to train the ANN methods and the data of Well B was used to test the optimal ANN models.
A back propagation neural network (BPNN) model with a single hidden layer was trained using the Levenberg-Marquardt learning algorithm. The optimal model structure, which is highly dependent on the number of hidden neurons, was obtained using a sequential trial and error method based on the lowest RMSE and highest R criteria. The training process was run for 1000 epochs with a learning rate of 0.03 and a momentum coefficient of 0.7. For the generalized regression neural network (GRNN) and radial basis function neural network (RBFNN), the output depends strongly on the value of the width parameter. Therefore, the optimal width parameter values for GRNN and RBFNN were also obtained through a sequential trial and error approach. Also, a gradient descent rule was adopted to train the GRNN and RBFNN models. The ANN models were coded and implemented in MATLAB R2016a software.
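To make the role of the width (spread) parameter concrete, the sketch below implements the textbook GRNN form, in which the prediction is a Gaussian-weighted average of the training targets, together with a simple trial and error search over candidate spreads. This is an illustration under stated assumptions, not the study's MATLAB implementation: the textbook form shown here needs no iterative training, the use of a separate validation set for the search is assumed, and the names `grnn_predict` and `select_spread` are hypothetical.

```python
import math

def grnn_predict(X_train, y_train, x, spread):
    """Textbook GRNN output: Gaussian-weighted average of the training targets."""
    d2 = [sum((a - b) ** 2 for a, b in zip(xi, x)) for xi in X_train]
    d2_min = min(d2)  # shift exponents so the nearest sample always has weight 1
    w = [math.exp(-(d - d2_min) / (2.0 * spread ** 2)) for d in d2]
    return sum(wi * yi for wi, yi in zip(w, y_train)) / sum(w)

def select_spread(X_train, y_train, X_val, y_val, candidates):
    """Trial and error: return (spread, rmse) with the lowest validation RMSE."""
    best = None
    for s in candidates:
        preds = [grnn_predict(X_train, y_train, x, s) for x in X_val]
        err = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_val, preds)) / len(y_val))
        if best is None or err < best[1]:
            best = (s, err)
    return best
```

A very small spread makes the network reproduce the nearest training target, while a very large spread drives every prediction toward the mean of the training targets, which is why the chosen value has such a strong effect on the results.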

Comparing CovExp-GPR Porosity Results with ANN
After numerous trials, the BPNN model structure that gave the highest correlation (R) and the lowest porosity prediction RMSE had 13 hidden neurons. Hence the optimal BPNN porosity model was 4 inputs, 7 hidden neurons, and 1 output variable. On the other hand, the optimal GRNN and RBFNN porosity models with the lowest RMSE and highest R values had spread (width) parameters of 0.03 and 0.3, respectively. That is, the best-performing GRNN porosity model had 4 inputs, a spread parameter of 0.03, and 1 output, while the optimal RBFNN model for predicting porosity comprised 4 inputs, a hidden layer with a maximum of 50 hidden neurons, and a width parameter of 0.3.
The best-performing BPNN model took 265.79 s on a Windows machine with an AMD Ryzen 5 @ 2 GHz processor (Pavilion 15, Hewlett-Packard, Palo Alto, CA, USA) and had a correlation of 0.84 and a prediction error of 0.038, as listed in Table 5. Table 5 also shows that the CovExp-GPR model produced porosity prediction results comparable to the widely used BPNN, GRNN, and RBFNN. This can be seen from the R and RMSE results produced by each approach. However, in terms of computational speed, the proposed CovExp-GPR model was the fastest. The CovExp-GPR results were produced in approximately 22.01 s, while BPNN, GRNN, and RBFNN took 265.79, 29.66, and 96.58 s, respectively, on the same machine (see Table 5).

Comparing Permeability Results with ANN
The best-performing BPNN permeability model had a structure of 4 inputs, a hidden layer with 8 neurons, and 1 output, while the optimal GRNN permeability model had 4 inputs, a width parameter of 0.03, and 1 output. Similarly, the optimal RBFNN architecture for permeability was achieved with 4 inputs, a hidden layer with a maximum of 50 hidden neurons, a width parameter of 0.3, and 1 output.
From Table 6, it is seen that the proposed CovExp-GPR model produced results similar to the other methods, as indicated by the reported statistical measures (R and RMSE). Analysis of Table 6 indicates that the CovExp-GPR predictions were closely related to the observed permeability, with a prediction accuracy of 85%. The same was observed for RBFNN, while 86% prediction accuracy was obtained for BPNN and GRNN. This means that a marginal 15% (GPR and RBFNN) and 14% (BPNN and GRNN) could not be explained by the respective models. In terms of computational speed, GPR produced its results in 20.72 s, while BPNN, GRNN, and RBFNN took 190.04, 28.21, and 100.14 s, respectively (Table 6), on a Windows machine with an AMD Ryzen 5 @ 2 GHz processor.
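The reported run times can be turned into relative speed-up factors with simple arithmetic. The short sketch below uses only the timings quoted in the text for Tables 5 and 6; the dictionary layout and the function name `speedups` are illustrative choices.

```python
# Run times in seconds as quoted in the text (Tables 5 and 6).
gpr_time = {"porosity": 22.01, "permeability": 20.72}
ann_time = {
    "porosity": {"BPNN": 265.79, "GRNN": 29.66, "RBFNN": 96.58},
    "permeability": {"BPNN": 190.04, "GRNN": 28.21, "RBFNN": 100.14},
}

def speedups(target):
    """Ratio of each ANN run time to the GPR run time for the given target."""
    return {name: t / gpr_time[target] for name, t in ann_time[target].items()}

for target in ("porosity", "permeability"):
    for name, ratio in speedups(target).items():
        print(f"{target}: GPR is {ratio:.2f}x faster than {name}")
```

On these figures the gain is largest against BPNN (roughly 12 times for porosity and 9 times for permeability) and smallest against GRNN (roughly 1.3 to 1.4 times), so the speed advantage depends strongly on which ANN method is used as the baseline.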

Conclusions
In the present research, the suitability of the GPR technique was evaluated for the prediction of reservoir porosity and permeability. To this end, the southern basin of the South Yellow Sea, situated between China and the Korean peninsula, was selected as a case study. GPR models were developed using five kernel (covariance) functions: covSE, covRQ, covMatern5, covMatern3, and covExp. The motivation was to identify the best-performing (optimal) kernel function for accurately predicting porosity and permeability. The statistical analysis confirmed that the exponential covariance function (covExp) achieved the highest correlation coefficient and the lowest prediction root mean square error when estimating both porosity and permeability.
The performance of the optimal GPR was verified against the benchmark ANN methods of back propagation, generalized regression, and radial basis function. It was further revealed that GPR porosity and permeability prediction models generated faster and comparable results to the widely used ANN methods. The results of the comparative study strongly suggest that the proposed GPR is an efficient approach in obtaining reliable porosity and permeability values and can therefore be used for other wells of the southern basin of the South Yellow Sea for which there are no available porosity and permeability data.