A New Model for Estimation of Bubble Point Pressure Using a Bayesian Optimized Least Square Gradient Boosting Ensemble

Abstract: Accurate estimation of crude oil Bubble Point Pressure (Pb) plays a vital role in the development cycle of an oil field. Bubble point pressure is required in many petroleum engineering calculations such as reserves estimation, material balance, reservoir simulation, production equipment design, and optimization of well performance. Additionally, bubble point pressure is a key input parameter in most oil property correlations. Thus, an error in a bubble point pressure estimate propagates into the prediction of other oil properties. Accordingly, many bubble point pressure correlations have been developed in the literature. However, they often lack accuracy, especially when applied to global crude oil data, because they were either developed using a limited range of independent variables or developed for a specific geographic location (i.e., specific crude oil composition). This research presents a utilization of the state-of-the-art Bayesian-optimized Least Square Gradient Boosting Ensemble (LS-Boost) to predict bubble point pressure as a function of readily available field data. The proposed model was trained on a global crude oil database containing 4800 experimentally measured Pressure–Volume–Temperature (PVT) data sets of a diverse collection of crude oil mixtures from different oil fields in the North Sea, Africa, Asia, Middle East, and South and North America. Furthermore, an independent set of 775 PVT data sets, collected from open literature, was used to investigate the effectiveness of the proposed model in predicting the bubble point pressure from data that were not used during the model development process. The accuracy of the proposed model was compared to several published correlations (13 in total, covering both parametric and non-parametric models) as well as two other machine learning techniques, Multi-Layer Perceptron Neural Networks (MLP-ANN) and Support Vector Machines (SVM).
The proposed LS-Boost model showed superior performance and remarkably outperformed all bubble point pressure models considered in this study.


Introduction
Determination of reservoir fluid bubble point pressure is a key element in the oil field development process. Bubble point pressure is required in many petroleum engineering calculations such as reserves estimation, material balance, reservoir simulation, production equipment design, and optimization of well performance. Bubble point pressure is an input parameter in other Pressure-Volume-Temperature (PVT) properties such as density, formation volume factor (Bo), and viscosity of reservoir fluids. Therefore, an inaccurate estimate of bubble point pressure will definitely propagate error in other oil PVT properties.
Ideally, the most accurate way to estimate PVT properties, including bubble point pressure, is through laboratory experiments on collected bottom-hole reservoir fluid samples.
Ensemble algorithms have been reported to handle missing data and have the ability to model nonlinear patterns. On the other hand, the tuning of hyperparameters to achieve optimal regression performance may require the integration of optimization algorithms that would require large computational power for large data sets [26-28].

Global Database
The main aim of this study was to utilize a large and global database of experimentally measured PVT data to develop a general and accurate bubble point pressure (Pb) model in order to overcome the limitations usually associated with existing correlations. These limitations mostly fall into two categories: (a) the use of a limited data range, and/or (b) the use of a specific geographic crude type (e.g., a specific crude oil composition).
Similar to existing correlations, the proposed model predicts bubble point pressure as a function of readily available field data. A total of 4800 PVT data sets were collected from major oil fields in different regions all over the world. Each PVT data set contained the following independent parameters: temperature, oil relative density, gas specific gravity, and initial solution gas-oil ratio.
The collected data sets cover a wide range of variation for the dependent and independent parameters, as shown in Table 1, which presents the range of statistical parameters of the studied global database. The global database was used to train and validate the developed machine learning models (LS-Boost, MLP-ANN, and SVM), using a five-fold cross validation technique in order to avoid overfitting and selection bias issues. Furthermore, the global database was used to critically evaluate commonly used bubble point correlations.
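The five-fold cross validation mentioned above can be illustrated with a minimal sketch. The function name `k_fold_indices`, the shuffling seed, and the way samples are assigned to folds are assumptions made for illustration only; the text does not describe how the folds were actually built.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Each of the 4800 data sets is used for validation exactly once.
for train_idx, val_idx in k_fold_indices(4800, k=5):
    print(len(train_idx), len(val_idx))
```

Each model is then fitted on the training indices and scored on the held-out fold, and the scores of the five folds are averaged.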

Literature Database
Although the global database was sufficient to develop a general predictive model and draw solid conclusions on its performance compared to commonly available correlations, we aimed to take an extra step of model verification by introducing an independent PVT data set, which was not used in the model development and validation process, in order to test the generalization ability of the proposed model.
Accordingly, an additional database of 775 PVT sets was collected from open literature [1,29-34]. This database consists of 5 sub-data sets representing different geographic crude types and a diverse range of pertinent parameters. The sub-data sets are divided based on their geographic origin as follows:
Data Set L-5: Worldwide Crude (425 data sets of worldwide crudes, references [1,33,34])
The range of statistical parameters of the literature database is presented in Table 2.

Methodology
As mentioned earlier, this study was intended to develop a general intelligent model and to critically review available bubble point pressure correlations. Thus, this section will give a brief introduction of all correlations used in this study as well as a description of the developed intelligent model.

Bubble Point Pressure Correlations
In this study, thirteen bubble point correlations were evaluated using our global database. It should be noted that the main advantage of these correlations is that they have a simple mathematical form and are easy to interpret. On the other hand, they usually need tuning whenever they are introduced to new PVT data sets or new crude types. These correlations can be divided into four categories as follows:
1. Standing-Type Models
2. Glasø-Type Models
3. Al-Marhoun-Type Models
4. Non-Parametric Regression Models

Standing-Type Models
The Standing Correlation of 1947 [2] was one of the first attempts to predict bubble point pressure using readily available field data. It was developed based on 105 experimentally measured PVT data sets from California, USA. The ranges of pertinent parameters are as follows: bubble point pressure from 130 to 7000 psi, solution gas-oil ratio from 20 to 1425 SCF/STB, gas specific gravity from 0.59 to 0.95, oil relative density from 16.5 to 63.8 API, and reservoir temperature from 100 to 258 °F.
The original form of the Standing Correlation is shown in Equation (1). Mathematical forms of the above correlations can be found in Appendix A. The type of crude used in these correlations and the range of input parameters are presented in Table 3.
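For reference, Standing's correlation is commonly quoted in the petroleum engineering literature as Pb = 18.2 [(Rs/γg)^0.83 · 10^(0.00091 T − 0.0125 API) − 1.4], with Rs in SCF/STB, T in °F, and Pb in psi. The sketch below implements that commonly quoted form; the coefficient values and the function name `standing_pb` come from that general form and are not taken from the text above.

```python
def standing_pb(rs, gas_sg, api, temp_f):
    """Bubble point pressure (psi) from the commonly quoted Standing form.

    rs      solution gas-oil ratio, SCF/STB
    gas_sg  gas specific gravity (air = 1)
    api     oil gravity, degrees API
    temp_f  reservoir temperature, degrees F
    """
    exponent = 0.00091 * temp_f - 0.0125 * api
    return 18.2 * ((rs / gas_sg) ** 0.83 * 10.0 ** exponent - 1.4)

# Illustrative inputs inside the parameter ranges quoted above.
print(round(standing_pb(rs=500.0, gas_sg=0.8, api=35.0, temp_f=200.0)))
```

The inputs are the same four field measurements used as predictors throughout this study.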

Glasø-Type Models
Glasø 1980 [8] extended Standing's [2] work by taking into account the effect of non-hydrocarbon impurities (e.g., CO2, N2, and H2S) on crude oil bubble point pressure as well as the effect of oil paraffinicity. The Glasø correlation was developed based on 46 experimentally measured PVT data sets from the North Sea. The ranges of pertinent parameters are as follows: bubble point pressure from 165 to 7142 psi, solution gas-oil ratio from 90 to 2637 SCF/STB, gas specific gravity from 0.65 to 1.28, oil relative density from 22.3 to 48.1 API, and reservoir temperature from 80 to 280 °F.
The Glasø Correlation is shown in Equation (2):
$$\log_{10} P_b = a_1 + a_2 \log_{10} P_b^{*} - a_3 \left(\log_{10} P_b^{*}\right)^{2}, \qquad P_b^{*} = \left(\frac{R_s}{\gamma_g}\right)^{a_4} T^{a_5}\, \mathrm{API}^{a_6} \qquad (2)$$
where $R_s$ is the solution gas-oil ratio, $\gamma_g$ is the gas specific gravity, $T$ is the reservoir temperature, and [$a_1$ = 1.7669, $a_2$ = 1.7447, $a_3$ = 0.30218, $a_4$ = 0.816, $a_5$ = 0.172, $a_6$ = −0.989]. Farshad et al. in 1992 [5] made the only published attempt to modify the Glasø Correlation [8]. This modification was based on new PVT data sets of crude oil from Colombia, South America. The range of input parameters used in this modification is presented in Table 3. The mathematical form of Farshad et al.'s correlation can be found in Appendix A.
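As an illustration, the Glasø correlation can be evaluated with the six coefficients listed above. The arrangement of the coefficients (a quadratic in the logarithm of a correlating number, with a negative second-order term) follows the commonly quoted form of the correlation and is an assumption here, as is the function name `glaso_pb`.

```python
import math

# Coefficients a1..a6 as listed in the text.
A1, A2, A3, A4, A5, A6 = 1.7669, 1.7447, 0.30218, 0.816, 0.172, -0.989

def glaso_pb(rs, gas_sg, api, temp_f):
    """Bubble point pressure (psi) from the commonly quoted Glaso form.

    rs in SCF/STB, gas_sg relative to air, api in degrees API, temp_f in F.
    """
    # Correlating number Pb* built from the four field inputs.
    pb_star = (rs / gas_sg) ** A4 * temp_f ** A5 * api ** A6
    log_star = math.log10(pb_star)
    return 10.0 ** (A1 + A2 * log_star - A3 * log_star ** 2)

print(round(glaso_pb(rs=500.0, gas_sg=0.8, api=35.0, temp_f=200.0)))
```

For these illustrative inputs the result is of the same order as the Standing estimate, which is the expected behavior for two correlations built on the same predictors.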

Al-Marhoun-Type Models
Al-Marhoun in 1988 [9] developed his correlation based on 160 experimentally measured PVT data sets from 69 Middle East reservoirs. The Average Absolute Relative Error (AARE) of this correlation was 3.66% based on the Middle East data used in correlation development, while the Standing and Glasø correlations failed to give accurate results for the same data, with an AARE of 12.08% and 25.22%, respectively. The ranges of the Al-Marhoun correlation parameters are as follows: bubble point pressure from 130 to 3573 psi, solution gas-oil ratio from 26 to 1602 SCF/STB, gas specific gravity from 0.75 to 1.37, oil relative density from 19.4 to 44.6 API, and reservoir temperature from 74 to 240 °F.
The Al-Marhoun correlation is shown in Equation (3). The oil relative density used in this equation is dimensionless and not in API units. Later correlations of this type include that of Alshammasi, 1999 [11]. Mathematical forms of the above correlations can be found in Appendix A. The type of crude oil used in these correlations and the range of input parameters are presented in Table 3.
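Because the Al-Marhoun form takes a dimensionless oil relative density while the other correlations take API gravity, a unit conversion is needed when the same data set is fed to both. The sketch below uses the standard API gravity definition; the function names are illustrative.

```python
def api_to_relative_density(api):
    """Convert API gravity (degrees) to oil relative density (water = 1)."""
    return 141.5 / (api + 131.5)

def relative_density_to_api(gamma_o):
    """Convert oil relative density (water = 1) to API gravity (degrees)."""
    return 141.5 / gamma_o - 131.5

# The API limits quoted above for the Al-Marhoun data, as relative densities.
for api in (19.4, 44.6):
    print(api, round(api_to_relative_density(api), 4))
```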

Non-Parametric Regression-Type Models
Non-parametric regression is a powerful statistical tool which provides a non-biased, data-driven way of providing the minimum error relationship between dependent and independent variables. Hence, unlike parametric regressions, it does not assume any predetermined functional form between dependent and independent variables.
McCain et al. in 1998 [12] used a non-parametric regression technique called Alternating Conditional Expectation (ACE), developed by Breiman and Friedman [14], to predict bubble point pressure using a total of 728 PVT data sets from different regions around the world.
Later, Malallah et al. in 2006 [13] used the same technique (ACE) but with a larger global PVT data set than the one used in McCain et al.'s [12] study. The range of input parameters used in these ACE models is presented in Table 3. Their mathematical form can be found in Appendix A.

Machine Learning Methods
Ensemble learning is a type of supervised machine learning method that combines a finite set of regression machine learning methods into a single meta learner that assigns weights to each individual learner based on its performance. Various methods can be selected as individual learners, such as regression trees, support vector machines, and multilayer perceptron neural networks. The diversity of the individual methods results in different regression performances, which improves the overall performance of the ensemble. In this research, a Bayesian-optimized least squares-boosting ensemble was utilized to predict the bubble point pressure given the inputs of temperature, oil relative density, gas specific gravity, and the initial solution gas-oil ratio.
The least squares-boosting (LS-Boost) ensemble combines individual regression trees, known as weak learners, to minimize the mean square error. The LS-Boost algorithm trains the weak learners sequentially on the training data set, each one fitting the residual errors of the ensemble built so far. At each iteration, LS-Boost fits a new learner to the difference between the response value and the aggregated predicted value in order to improve the prediction accuracy. The LS-Boost algorithm is presented in Algorithm 1, as reported by Friedman in [35].

Algorithm 1: LS-Boost
Given the training data $\{(x_i, y_i)\}_{i=1}^{N}$, with $h(x; a)$ as the weak learner and $F_m(x)$ as the regression function.
Initialization: $F_0(x) = \bar{y}$
For $m = 1$ to $M$ do
$\tilde{y}_i = y_i - F_{m-1}(x_i)$, $i = 1, \ldots, N$
$(\rho_m, a_m) = \arg\min_{a, \rho} \sum_{i=1}^{N} \left[\tilde{y}_i - \rho\, h(x_i; a)\right]^2$
$F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$
End.

The Bayesian optimization method is utilized for tuning the hyperparameters of the LS-Boost ensemble to yield better cross-validation scores and thus improve the model's prediction accuracy. Moreover, Bayesian optimization is most useful for computationally expensive function evaluations, where it reduces the time needed to reach the global minimum within the space of solutions. The exploration and sampling of the search space is based on prior belief about the problem as in Bayes' theorem, which states that the posterior probability of a model M given the evidence E is proportional to the likelihood of E given M multiplied by the prior probability of M, and can be mathematically expressed as:
$$P(M \mid E) \propto P(E \mid M)\, P(M) \qquad (4)$$
A surrogate model, such as the Gaussian process, is used to approximate the objective function, and the selection of the samples from the search space is directed via acquisition functions, such as expected improvement and maximum probability of improvement [36]. The Bayesian optimization algorithm is presented in Algorithm 2.
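The residual-fitting loop of LS-Boost can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this study: the weak learner is a one-split regression tree (a stump), the `learning_rate` shrinkage factor is an added assumption that does not appear in Algorithm 1, and all function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    n, d = len(X), len(X[0])
    total = sum(residuals)
    mean = total / n
    # Fallback: a constant prediction if no feature can be split.
    best_gain, best = float("-inf"), (0, float("inf"), mean, mean)
    for j in range(d):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for pos in range(1, n):
            left_sum += residuals[order[pos - 1]]
            lo, hi = X[order[pos - 1]][j], X[order[pos]][j]
            if lo == hi:
                continue  # no threshold separates equal feature values
            left_mean = left_sum / pos
            right_mean = (total - left_sum) / (n - pos)
            # Minimizing the squared error is equivalent to maximizing this sum.
            gain = pos * left_mean ** 2 + (n - pos) * right_mean ** 2
            if gain > best_gain:
                best_gain = gain
                best = (j, (lo + hi) / 2.0, left_mean, right_mean)
    return best

def stump_predict(stump, x):
    feature, threshold, left_value, right_value = stump
    return left_value if x[feature] <= threshold else right_value

def ls_boost(X, y, n_learners=100, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    f0 = sum(y) / len(y)  # F_0(x) is the mean response
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_learners):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump_predict(stump, x) for p, x in zip(pred, X)]
    return f0, stumps, learning_rate

def predict(model, x):
    f0, stumps, learning_rate = model
    return f0 + learning_rate * sum(stump_predict(s, x) for s in stumps)

# Toy one-feature example: learn y = x^2 from 40 points.
X = [[i / 10.0] for i in range(40)]
y = [row[0] ** 2 for row in X]
model = ls_boost(X, y, n_learners=300)
print(round(predict(model, [2.0]), 2))
```

For a regression tree whose leaf values are the mean residuals, the least-squares step size in Algorithm 1 equals one, which is why the sketch only carries the optional shrinkage factor.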

Algorithm 2: Bayesian optimization
For $t = 1, 2, \ldots$ do
Find $x_t$ by optimizing the acquisition function over the Gaussian Process (GP): $x_t = \arg\max_{x} u(x \mid D_{1:t-1})$
Sample the objective function: $y_t = f(x_t) + \varepsilon_t$
Augment the data $D_{1:t} = \{D_{1:t-1}, (x_t, y_t)\}$ and update the GP
End.
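The loop of Algorithm 2 can be sketched for a single hyperparameter as follows. This is a simplified illustration under several assumptions that are not taken from this study: a Gaussian process with a fixed squared-exponential kernel, the expected improvement acquisition function written for minimization, a fixed candidate grid in place of a continuous optimizer, and a toy objective standing in for the cross-validation loss. All function names are hypothetical.

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential kernel with unit variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of the GP surrogate at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    y_mean = sum(ys) / len(ys)
    alpha = solve(K, [y - y_mean for y in ys])
    v = solve(K, k_star)
    mu = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * vi for k, vi in zip(k_star, v)), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best (lowest) value observed so far."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf

def bayes_opt(f, lo, hi, n_init=3, n_iter=12, n_grid=51):
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    xs = [lo + (hi - lo) * (i + 0.5) / n_init for i in range(n_init)]
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        best = min(ys)
        candidates = [x for x in grid if x not in xs]
        # x_t = argmax of the acquisition function given the data so far.
        x_t = max(candidates,
                  key=lambda x: expected_improvement(*gp_posterior(xs, ys, x), best))
        xs.append(x_t)       # augment the data ...
        ys.append(f(x_t))    # ... with the new objective sample
    i = ys.index(min(ys))
    return xs[i], ys[i]

# Toy objective standing in for a cross-validation loss.
print(bayes_opt(lambda x: (x - 0.3) ** 2, 0.0, 1.0))
```

In practice the objective would be the cross-validated error of the LS-Boost ensemble as a function of its hyperparameters, and the surrogate would be refitted with learned kernel parameters at each step.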
Finally, it should be stated that, to the best of the authors' knowledge, most of the published machine learning predictive models for bubble point pressure are based on either Neural Network or Support Vector Machine methods [15-26]. Accordingly, both models (MLP-ANN and SVM) have been used for comparison with the proposed LS-Boost model. For more information on the theory and application of MLP-ANN and SVM, readers are referred to [37-39].

Performance Indicators
To evaluate the performance of the studied models in predicting the bubble point pressure, various statistical indicators were utilized, such as the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Coefficient of Variation of Root Mean Square Error (CVRMSE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination $R^2$. These indicators are presented by Equations (5)-(10), where $\hat{y}_i$ is the predicted response and $\bar{y}$ is the average experimental bubble point pressure.
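The indicators named above can be computed as in the following sketch, which uses their standard textbook definitions. The exact forms used in this study are assumed rather than known; in particular, CVRMSE is taken here as the RMSE divided by the mean measured value, expressed in percent.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, CVRMSE (%), MAPE (%), and R^2 for paired values."""
    n = len(y_true)
    y_bar = sum(y_true) / n  # average experimental value
    errors = [yp - yt for yt, yp in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    rmse = math.sqrt(ss_res / n)
    return {
        "RMSE": rmse,
        "MAE": sum(abs(e) for e in errors) / n,
        "CVRMSE": 100.0 * rmse / y_bar,
        "MAPE": 100.0 * sum(abs(e / yt) for e, yt in zip(errors, y_true)) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }

# Three illustrative bubble point pressures (psi), measured and predicted.
print(regression_metrics([1000.0, 2000.0, 3000.0], [1100.0, 1900.0, 3000.0]))
```

RMSE and MAE carry the units of pressure, while CVRMSE and MAPE are relative measures, which makes them easier to compare across databases with different pressure ranges.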
In summary, the flow of the proposed work can be divided into two phases as follows:
1. Phase 1:
a. Critically evaluate available bubble point pressure correlations based on the global database; the best correlation in terms of accuracy then proceeds to Phase 2.
b. Build three machine learning models (LS-Boost, MLP-ANN, SVM) based on the global database; the best model in terms of accuracy then proceeds to Phase 2.
2. Phase 2: Present a detailed comparison between the two best models extracted from Phase 1 based on an independent literature database which has not been used in the development and validation process of the machine learning models in Phase 1.

Evaluation of Empirical Bubble Point Correlations
A total of 13 bubble point correlations were evaluated using a large and global PVT database with a wide range of variation of pertinent parameters and crude oil types. Table 4 provides the statistical performance indicators of each correlation in terms of MAPE, MAE, RMSE, CVRMSE, and $R^2$ values. The tested correlations resulted in a MAPE of 21% and higher, with some correlations reaching values as high as 45%. Standing's correlation [2] gave the lowest error of all, with an RMSE of 401, MAPE of 21.6%, CVRMSE of 36%, and $R^2$ value of 0.88. The second best was Alshammasi's correlation with a MAPE of 25.0%, followed by McCain's correlation with a MAPE of 27.0%.

A randomly selected unpublished sample of our global database, including the outcome of the Standing, Alshammasi, and McCain correlations, is presented in Table 5. The cross plots of predicted versus measured bubble point pressure show that the values predicted by these correlations deviate from the line of unity. This deviation gradually increases with pressure, especially for bubble point pressure Pb > 4000 psi, where the predicted values are well off the line of unity. It is worth noting that both the McCain and Standing correlations yield a similar mean absolute percentage error (MAPE) for high pressure values (i.e., Pb > 4000 psi), while the Alshammasi correlation was third in line for this range. However, the prediction accuracy of McCain's model decreases as pressure decreases (especially for values lower than 2000 psi) compared to that of both the Standing and Alshammasi models. Such behavior clearly highlights the main limitation of existing Pb models: when they are mapped on a diverse global database, they tend to perform well for specific ranges of the database and fail in others, because they have been developed for a certain range of pertinent parameters and/or specific types of crude oil composition. A deeper look at the performance of McCain's model compared to that of the best performer (Standing's model) in terms of relative error for the whole range of bubble point pressures is presented in the next paragraph.
An in-depth analysis has been conducted based on API gravity groups. API gravity was used as it is more closely related to crude oil composition (i.e., crude type) than the other independent input parameters, and it is also common practice in the literature to compare different bubble point pressure (Pb) correlations in terms of API group analysis [9-12,40,41]. That is, the global database was divided into different subsets based on API gravity, and the MAPE of each group was calculated for the top three correlations (i.e., Standing, Alshammasi, and McCain correlations). Such an analysis gives a closer look at the performance of each correlation at different API gravity subsets across the entire database. Accordingly, Figure 5 presents an API gravity group analysis for the Standing, Alshammasi, and McCain correlations. It can be noted that for high API subsets (API > 45), the three correlations yield a high mean absolute percentage error (a MAPE of 30% and above) compared to other API ranges. In general, the Standing correlation gave the lowest MAPE among the three correlations for all API ranges. For the lowest API range (API < 20), the Alshammasi correlation performance was poor, while both the Standing and McCain correlations gave almost the same performance with a MAPE of 15%.
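The API group analysis described above amounts to binning the data by API gravity and computing the MAPE within each bin, as in the following sketch. The bin edges used here (20, 30, 40, 45) are assumptions chosen only to match the two limits quoted in the text (API < 20 and API > 45); the function name and sample values are hypothetical.

```python
from bisect import bisect_right

def mape_by_api_group(api, pb_measured, pb_predicted, edges=(20, 30, 40, 45)):
    """Return {group label: MAPE in %} for API gravity bins split at `edges`."""
    totals = {}
    for a, m, p in zip(api, pb_measured, pb_predicted):
        g = bisect_right(edges, a)  # index of the bin that contains this API value
        s, c = totals.get(g, (0.0, 0))
        totals[g] = (s + abs((p - m) / m), c + 1)
    result = {}
    for g in sorted(totals):
        if g == 0:
            label = f"API < {edges[0]}"
        elif g == len(edges):
            label = f"API >= {edges[-1]}"
        else:
            label = f"{edges[g - 1]} <= API < {edges[g]}"
        s, c = totals[g]
        result[label] = 100.0 * s / c
    return result

# Four made-up samples: API gravity, measured Pb, and predicted Pb (psi).
print(mape_by_api_group([15, 25, 25, 50],
                        [1000.0, 2000.0, 2000.0, 4000.0],
                        [1100.0, 2200.0, 1800.0, 2000.0]))
```

Running the same function once per correlation gives the per-group bars of a chart such as Figure 5.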

Bayesian-Optimized Least Squares-Boosting Ensemble
In this study, a state-of-the-art Bayesian-optimized least squares-boosting ensemble (LS-Boost) was utilized to predict the bubble point pressure given the inputs of temperature, oil gravity, gas specific gravity, and the initial solution gas-oil ratio, using a large and global PVT database which has a wide range of variation of pertinent parameters and crude oil types. Bayesian optimization was utilized to find the optimum hyperparameters that yield the highest prediction accuracy. The Bayesian-optimization algorithm was simulated with 300 learners and found the optimized hyperparameters within 300 iterations based on the expected improvement acquisition function; the optimized hyperparameters and their search-space ranges are presented in Table 6. Furthermore, the same data were used to build two other predictive models using Multi-Layer Perceptron Neural Network and Support Vector Machine (MLP-ANN and SVM) techniques (Table 7). Table 8 presents a sample of predicted bubble point pressure using the LS-Boost, MLP-ANN, and SVM models and the Standing correlation. The input data used in Table 8 are taken from the PVT data set presented earlier in Table 5.

LS-Boost Generalization Test
In this section, an independent database of 775 PVT data sets collected from open literature was utilized to test the effectiveness and generalization ability of the LS-Boost model when introduced to new real field cases which have not been used during its development. Table 9 presents a randomly selected PVT data set from the collected literature database, including the outcome of the LS-Boost model and Standing correlation for such data sets.

Figure 10 presents a bar chart of the mean absolute percentage error (MAPE) for different crude types (i.e., different crude geographic locations); this figure presents the MAPE of each crude type for LS-Boost and Standing correlation. It can be noted that the LS-Boost model was superior to the Standing correlation for all crude types used in the literature database. It should also be stated that the difference in MAPE between both models is highest for Africa and Middle East crudes, where the Standing correlation gave a MAPE of 28.5% for Africa crude and 21% for Middle East crude, while the LS-Boost gave a MAPE of 10% and 8.5% for the same crudes, respectively.

Conclusions
This paper presented a state-of-the-art Bayesian-optimized Least Square Gradient Boosting Ensemble (LS-Boost), built on a large and global crude oil database, for the prediction of bubble point pressure. The global database used in building the LS-Boost model consisted of 4800 experimentally measured PVT data sets of a diverse collection of crude oil mixtures from different oil fields in the North Sea, Asia, Africa, Middle East, and South and North America. The accuracy of the developed model was compared to commonly used bubble point pressure correlations and two other machine learning techniques (Multi-Layer Perceptron Neural Network, MLP-ANN, and Support Vector Machine, SVM). Furthermore, an independent set of 775 PVT data sets, collected from open literature (the literature database), was used to investigate the effectiveness of the proposed model in predicting the bubble point pressure from data that were not used during the model development process.
The accuracy of the developed models was assessed based on different performance indicators (RMSE, MAPE, MAE, CVRMSE, and $R^2$). LS-Boost outperformed all existing bubble point correlations as well as the MLP-ANN and SVM models, with a CVRMSE of 10.63%, MAPE of 7.57%, and $R^2$ of 0.98 for the global database. LS-Boost also achieved a remarkably high accuracy when introduced to new real field data (i.e., the literature database), with a CVRMSE of 20%, MAPE of 9.3%, and $R^2$ of 0.96.
The presented results clearly highlight the potential of the LS-Boost model as an accurate, quick, and easy-to-use tool for the prediction of reservoir fluid bubble point pressure. Furthermore, the developed LS-Boost can be easily utilized in reservoir simulators and production optimization packages commonly used within the industry.

Conflicts of Interest:
The authors declare no conflict of interest.