Prediction of Dead Oil Viscosity: Machine Learning vs. Classical Correlations

Abstract: Dead oil viscosity is a critical parameter for solving numerous reservoir engineering problems and one of the most unreliable properties to predict with classical black oil correlations. Determining dead oil viscosity experimentally is expensive and time-consuming, so an accurate and quick prediction model is needed. This paper implements six machine learning models to predict dead oil viscosity: random forest (RF), lightgbm, XGBoost, multilayer perceptron (MLP) neural network, support vector regression (SVR) and SuperLearner. More than 2000 pressure–volume–temperature (PVT) data points were used for developing and testing these models. The viscosity data cover a wide range, from light and intermediate to heavy oil. In this study, we give insight into the performance of the different functional forms that have been used in the literature to formulate dead oil viscosity. The results show that the functional form f(γAPI, T) has the best performance and that additional correlating parameters might be unnecessary. Furthermore, based on the metric analysis, SuperLearner outperformed the other machine learning (ML) algorithms as well as common correlations. The SuperLearner model can potentially replace the empirical models for viscosity prediction over a wide range of viscosities (any oil type). Ultimately, the proposed model is capable of simulating the true physical trend of dead oil viscosity with variations of oil API gravity, temperature and shear rate.


Introduction
Reservoir pressure-volume-temperature (PVT) properties are among the most important data for petroleum engineers and are essential for many reservoir calculations. The precision of other reservoir engineering calculations (e.g., material balance, reserve estimation, well test analysis, advanced data analysis, nodal analysis for a surface network, surface separation and numerical reservoir simulation) also relies primarily on the correctness of PVT data [1]. Ideally, PVT data are obtained from representative fluid samples collected from the wellhead, surface or wellbore [2]. PVT reports usually give the results of PVT tests at reservoir pressure (P) and temperature (T), whereas routine assessments in oil and gas field monitoring programs are typically conducted at other pressures and temperatures. These include stock tank oil API gravity, dead oil viscosity at ambient or other temperatures, and gas gravity and composition. Crude oil viscosity is commonly divided into three types:
(1) Dead oil viscosity, µod, which is the viscosity of crude oil that is free of gas at atmospheric pressure.
(2) Saturated oil viscosity, µob, which is the oil viscosity at reservoir temperature and saturation pressure.
(3) Undersaturated oil viscosity, µoa, which is the oil viscosity at pressures above the saturation pressure.
Traditionally, crude oil viscosity is determined experimentally at reservoir temperature and pressure on subsurface or surface samples, but this is often costly and time-consuming and requires strong technical expertise [40–45]. For this reason, a large number of empirical and semi-empirical relationships, mainly derived from the corresponding equation of state, have been developed in past decades to predict crude oil viscosity. Most of these correlations were established for a given region, so they produce erroneous results when used for other areas [34]. Empirical correlations are used to estimate dead oil, saturated and undersaturated viscosities from field data, but their output is typically unsatisfactory, and improvement is still sought [46]. The best-known correlations are those developed by Beal (1946) [47], Beggs and Robinson (1975) [48], Glaso (1980) [13], Kaye (1985) [49], Al-Khafaji et al. (1987) [50], Petrosky (1990) [24], Egbogah and Ng (1990) [51], Labedi (1992) [52], Kartoatmodjo and Schmidt (1994) [53], De Ghetto (1994) [54], Bennison (1998) [55], Elsharkawy and Alikhan (1999) [56], Hossain et al. (2005) [57], Naseri et al. (2005) [58], Alomair et al. (2011) [59] and Hemmati et al. (2013) [38]. Table 1 summarizes correlations for dead oil viscosity. Some authors correlate crude oil viscosity with properties that are typically difficult to measure, such as molar mass, critical temperature and acentric factor [60–62]. Furthermore, the lack of reservoir fluid samples may be another obstacle to reliable measurements; thus, reservoir engineers are encouraged to use established correlations to estimate crude oil viscosity. Most of the published correlations were developed from a limited number of data points and limited ranges [39].
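To make the form of such correlations concrete, the following minimal sketch evaluates one classical two-parameter correlation, Beggs and Robinson (1975) [48], which uses only API gravity and temperature. The coefficients are the ones commonly quoted for this correlation and should be checked against the original publication before any practical use; the function names are illustrative and not taken from this paper.

```python
import math


def api_from_specific_gravity(sg: float) -> float:
    """API gravity from oil specific gravity at 60 °F (standard definition)."""
    return 141.5 / sg - 131.5


def beggs_robinson_dead_oil_viscosity(api: float, temp_f: float) -> float:
    """Dead oil viscosity in cp from the Beggs and Robinson (1975) form.

    api    : oil API gravity
    temp_f : temperature in °F
    Coefficients as commonly quoted in the literature (assumption: verify
    against the original paper before relying on the numbers).
    """
    z = 3.0324 - 0.02023 * api
    y = 10.0 ** z
    x = y * temp_f ** (-1.163)
    return 10.0 ** x - 1.0


if __name__ == "__main__":
    # Viscosity of a 30 °API oil at three temperatures.
    for temp_f in (100.0, 150.0, 200.0):
        mu = beggs_robinson_dead_oil_viscosity(30.0, temp_f)
        print(f"API = 30, T = {temp_f:.0f} °F -> mu_od = {mu:.2f} cp")
```

Like most correlations of this family, the expression was fitted to a specific data set, which is the source of the regional dependence discussed above.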
Evaluations of most of the correlations developed in the literature have shown that high errors are generated when a correlation is applied to data sets other than the one it was developed from [63–65]. As a result, there is a need for methods that can estimate dead oil viscosity over wider temperature intervals when no physical sample is available for laboratory measurements, or when fluid samples are available but time constraints require rapid results. Even when neither constraint applies, a method that is cheaper than laboratory studies is desirable. This has paved the way for the modification and adoption of existing empirical correlations over time. Furthermore, to overcome these challenges, machine learning (ML) and artificial intelligence (AI) techniques have also been used to improve the prediction of oil viscosity, including radial basis function neural networks (RBFNN) [30], artificial neural networks (ANN) [68–73], functional networks (FN) [46,71], genetic algorithms (GA) [74], support vector machines (SVM) [75], the group method of data handling (GMDH) [76] and ensemble models [77]. The literature reports that AI models achieve the lowest average absolute relative error and the highest correlation coefficient compared with existing empirical correlations [70].
For these reasons, the industry is interested in developing models that can predict viscosity versus shear rate and temperature for a wide range of oils, both Newtonian and non-Newtonian. With this in mind, the main goal of this study is to develop a model that predicts dead oil viscosity vs. temperature using recently established ML methods, which removes the need to create new correlations and is applicable to all types of oils and regions.

Experiments
The viscosity of the samples was measured on an Anton Paar MCR 302 rheometer located at the Center for Hydrocarbon Recovery of the Skolkovo Institute of Science and Technology. Based on the manufacturer's recommendations and the nature of the substance, viscosity was measured in different measuring systems (coaxial cylinders and cone-plate) according to ASTM D4402-06 (2006). A large number of oil samples were tested in the C-ETD 300 system equipped with CC-17 coaxial cylinders, which is suited to low-viscosity specimens. For high-viscosity samples, viscosity values were obtained in the P-ETD 400 system with the CP50-2 cone-plate measuring system.
The elemental composition of the oil samples was determined using a LECO Corporation CHN628 analyzer with an attachment for sulfur determination. The optional sulfur module for the 628 Series analyzers is designed to determine the sulfur content in a wide range of organic matrices and extends the analytical capabilities of the CHN628, CN628 and FP628 devices. The module can be used with samples of coal, coke, various types of fuels and some inorganic materials, for example, soil, cement and limestone. A pre-weighed sample is placed in a ceramic boat and then in a horizontal furnace, where it is burned in an oxygen stream and the sulfur it contains is converted to SO2. Owing to the concentric design of the furnace tubes, the gases pass through the hot zone twice, which keeps them in the high-temperature zone long enough for complete oxidation. After leaving the furnace, the gases pass through anhydrone to remove moisture and through the flow controller, where the flow is stabilized. The sulfur content is determined in the SO2 infrared measuring cell of the 628 series analyzer to which the sulfur module is connected.
Figure 1A displays dead oil viscosity as a function of temperature for 10 representative samples. As seen from the figure, the viscosity of all samples decreases exponentially as the temperature increases. The typical behavior of one oil sample vs. shear rate at various temperatures is shown in Figure 1B, where the y-axis is logarithmic for better data presentation.
Density measurements were performed on an Anton Paar DMA™ 4200M density meter in the temperature range 10-90 and at atmospheric pressure. Measurements were performed in accordance with ASTM D4052 (ASTM D4052-09, Standard Test Method for Density, Relative Density, and API Gravity of Liquids by Digital Density Meter, ASTM International, West Conshohocken, PA, U.S., 2009).
The typical behavior of one of those 10 representative oil samples at various temperatures is shown in Figure 1C.
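The exponential decrease of viscosity with temperature noted for Figure 1A can be quantified by a straight-line fit of ln(µ) against T. The sketch below shows one way to do this with ordinary least squares. The data are synthetic and purely illustrative (they are not measurements from this study), and the model µ = a·exp(−b·T) is an assumed simple form rather than the authors' fitting procedure.

```python
import math


def fit_exponential(temps, viscosities):
    """Least-squares fit of ln(mu) = ln(a) - b*T; returns (a, b)."""
    n = len(temps)
    logs = [math.log(v) for v in viscosities]
    t_mean = sum(temps) / n
    l_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in temps)
    sxy = sum((t - t_mean) * (l - l_mean) for t, l in zip(temps, logs))
    slope = sxy / sxx
    intercept = l_mean - slope * t_mean
    return math.exp(intercept), -slope


# Synthetic, illustrative data only (hypothetical values, arbitrary units).
temps = [10.0, 30.0, 50.0, 70.0, 90.0]
mu = [500.0 * math.exp(-0.04 * t) for t in temps]

a, b = fit_exponential(temps, mu)
print(f"a = {a:.1f}, b = {b:.4f}")
```

A nearly straight line on a semi-log plot, as in Figure 1B, corresponds to a good fit of this form.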

Input Features and Data
A total of 2247 data points were collected from the literature [27,28,39,52,66,67,71,77–81] and added to the authors' dataset for the viscosity simulations. A statistical summary of the data is presented in Table 2.

In this study, in addition to the common correlating properties γAPI and T, other important variables were incorporated in the formulations. It is important to note that not all of these approaches use hydrocarbon components; the functional forms considered are given in Equations (1)-(4). In these equations, µod is the dead oil viscosity in cp, API is the oil API gravity, T is the temperature in °F, Rs is the gas-oil ratio in scf/STB and Pb is the bubble point pressure in psia. Based on these functional forms (Equations (1)-(4)), six ML models were developed in Python: random forest (RF), lightgbm, XGBoost, MLP neural network, SVR and SuperLearner. The code relies on the Python packages listed in Table 3 and is available to readers upon request. To develop the dead oil viscosity models, approximately 60% of the data was used for training, while the remainder was used for blind tests.
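The 60/40 division into training and blind-test data can be sketched as follows. This is a minimal illustration that assumes a simple random shuffle; the text does not state how the split was actually drawn, and the record layout and the function name are hypothetical.

```python
import random


def train_test_split_60_40(rows, train_fraction=0.6, seed=0):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical records: (api_gravity, temperature_f, viscosity_cp).
records = [(20.0 + i % 25, 60.0 + 5.0 * (i % 30), 1.0 + i) for i in range(100)]

train, test = train_test_split_60_40(records)
print(len(train), len(test))
```

Fixing the seed keeps the blind-test set identical across all models, so their metrics are compared on the same samples.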

ML Model Development
In recent decades, different machine learning models have been used to estimate PVT properties. Among these, the artificial neural network has attracted significant attention since the late 1990s. When employing ML algorithms, one should keep in mind that, since the dead oil viscosity of each type of oil differs, various ML models should be tested, and a single algorithm will not suffice [77]. One ML model might perform well on one dataset while others fail, so it would be wrong to conclude from a single dataset that the others are inferior. To address this, Van der Laan [78] developed SuperLearner, in which base learners are stacked in an ensemble to reduce forecast errors. SuperLearner is superior to the individual base learners because their systematic errors are identified and accounted for in the final prediction [79], which has made it applicable in a variety of fields: the biology domain including medicine and healthcare [79], biostatistics and genetics [80–84], and epidemiology [85]. It is important to note that each base learner and the meta learner in a SuperLearner algorithm still need to be trained to obtain a precise prediction. In this study, the following five machine learning algorithms are implemented for dead oil viscosity prediction and their results are compared: XGBoost [86], LightGBM [87], random forest [88], an artificial neural network algorithm [89] and SVR [90]. In addition, because of the superiority and robustness of SuperLearner demonstrated in other fields [78,80,91–93], this method is applied here for the first time, and its output is compared with that of the five other algorithms.
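The stacking idea behind SuperLearner can be illustrated with a small NumPy-only sketch: each base learner produces out-of-fold predictions, and a meta learner is fitted on those predictions. Everything here is an assumption made for illustration. The two base learners (a linear model and a k-nearest-neighbour regressor) stand in for the five algorithms used in the study, the meta learner is an unconstrained least-squares combination (SuperLearner implementations often constrain the weights to be non-negative), and the data are synthetic.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear model with intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef


def fit_knn(X, y, k=5):
    """k-nearest-neighbour regressor; returns a predict function."""
    def predict(Xn):
        dist = np.linalg.norm(Xn[:, None, :] - X[None, :, :], axis=2)
        nearest = np.argsort(dist, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict


def super_learner(X, y, base_fitters, n_folds=6, seed=0):
    """Stack base learners using out-of-fold predictions (simplified sketch)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    oof = np.zeros((len(X), len(base_fitters)))
    for fold in folds:
        train = np.setdiff1d(np.arange(len(X)), fold)
        for j, fitter in enumerate(base_fitters):
            # Predict the held-out fold with a model that never saw it.
            oof[fold, j] = fitter(X[train], y[train])(X[fold])
    # Meta learner: least-squares weights on the out-of-fold predictions.
    weights, *_ = np.linalg.lstsq(oof, y, rcond=None)
    # Refit every base learner on all data for the final predictor.
    models = [fitter(X, y) for fitter in base_fitters]

    def predict(Xn):
        return np.column_stack([m(Xn) for m in models]) @ weights

    return predict, weights


# Synthetic data: features (API gravity, T in °F), target log10(viscosity).
rng = np.random.default_rng(1)
api = rng.uniform(10.0, 45.0, 200)
temp_f = rng.uniform(40.0, 230.0, 200)
X = np.column_stack([api, temp_f])
y = 3.0 - 0.05 * api - 0.01 * temp_f + 2e-5 * temp_f**2 + rng.normal(0.0, 0.05, 200)

predict, weights = super_learner(X, y, [fit_linear, fit_knn])
stacked_rmse = float(np.sqrt(np.mean((predict(X) - y) ** 2)))
print("meta-learner weights:", weights, "RMSE:", stacked_rmse)
```

Using out-of-fold rather than in-sample predictions to fit the meta learner keeps an overfitted base learner from receiving an inflated weight.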
To estimate model performance reliably, 6-fold cross-validation was performed on both sets of inputs. Each advanced machine learning algorithm was evaluated on each data set using cross-plots of measured versus predicted values of µod. The robustness and accuracy of the models in this study were evaluated using several statistical quality measures [94,95]: the coefficient of determination (R²), mean squared error (MSE), mean absolute error (MAE), the percentage of accuracy-precision (PAP) [96] and root mean squared error (RMSE).
The root mean squared error (RMSE) is given by Equation (5):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Actual}_i - \mathrm{Predicted}_i\right)^2} \qquad (5)$$
where N is the total number of observations.
The mean squared error (MSE) is given by Equation (6):
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Actual}_i - \mathrm{Predicted}_i\right)^2 \qquad (6)$$
The mean absolute error (MAE) is given by Equation (7):
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{Actual}_i - \mathrm{Predicted}_i\right| \qquad (7)$$
The coefficient of determination (R²) is given by Equation (8):
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(\mathrm{Actual}_i - \mathrm{Predicted}_i\right)^2}{\sum_{i=1}^{N}\left(\mathrm{Actual}_i - \overline{\mathrm{Actual}}\right)^2} \qquad (8)$$
where $\overline{\mathrm{Actual}}$ is the mean of the observed data.
Finally, the percentage of accuracy-precision (PAP) [96] is given by Equation (9).
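The metrics in Equations (5)-(8) translate directly into a few lines of Python. The sketch below uses the standard definitions with plain lists; PAP is omitted because its formula is not reproduced in the text, and the sample values are hypothetical.

```python
import math


def mse(actual, predicted):
    """Mean squared error, Equation (6)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, Equation (5)."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error, Equation (7)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination, Equation (8)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted viscosities (cp).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(mse(actual, predicted), rmse(actual, predicted),
      mae(actual, predicted), r2(actual, predicted))
```

Because dead oil viscosity spans several orders of magnitude, absolute metrics such as RMSE are dominated by the heavy-oil samples, which is one reason to report R² and a relative measure alongside them.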

Results and Discussion
Recall that most of the correlation methods (Table 4) are based on data from different origins, which is why some correlations cannot give good accuracy. The main aim of this study was to develop advanced ML methods to predict dead oil viscosity in the temperature interval of interest, T = 40 °F to 233.6 °F, for any oil type. Using the various functional forms, we developed models that can predict the viscosity of heavy as well as lighter oils at any temperature in this interval. All of the advanced ML algorithms gave better results than all of the empirical correlations in Table 4. Figures 2 and 3 give a graphical comparison of the machine learning algorithms for µod prediction. Figure 2 shows the cross-validation results based on the testing dataset. Based on the functional forms of µod (Equations (1)-(4)), different models were developed for each of the machine learning algorithms considered. Figures 2 and 3 show the results of these approaches, where experimental data on about 200 samples are plotted against predictions. The accuracy of our methods is determined from the metrics (Equations (5)-(8)) and from cross-plots of experimental data vs. predictions. Here, the experimental data are assumed to be the most reliable, so the closeness of predicted values to them measures the performance of an algorithm or correlation. Comparing the metric and statistical error analysis results, it can be seen that SuperLearner might be a promising tool for viscosity prediction. Table 4 shows the performance of some common empirical correlations applied to our dataset. The best results among these empirical correlations are given by Naseri [57], followed by the correlation of Beal [47].
The correlation developed by Dindoruk [66] (Table 4) has the functional form µod = f(γAPI, T, Rsb, Pb). Although it includes the additional parameters bubble point pressure (Pb) and solution gas-oil ratio (Rsb), it does not improve the prediction results compared with the other methods. All other correlations have the functional form µod = f(γAPI, T).
Considering the results for the functional form µod = f(γAPI, T), random forest (RF), lightgbm, XGBoost, MLP neural network, SVR and SuperLearner all exhibited very satisfactory patterns and behavior, giving precise results that match the experimental data. This becomes more important when we consider that the majority of the empirical correlations were built on these two variables only, with the functional form µod = f(γAPI, T). SuperLearner has the lowest root mean squared error (RMSE) of all methods according to Figure 3, and it also has the best coefficient of determination (R²). With the aid of this model, accurate dead oil viscosity can be calculated as one of the critical inputs for calculations and for the generation of live oil viscosity for fluids under different wellbore conditions.
To check whether the proposed SuperLearner model captures the physically expected patterns as the input parameters vary, we plotted the values predicted by classical correlations and by SuperLearner and compared them with experimental values. Figure 4 demonstrates the general effect of temperature on dead oil viscosity, and it can be seen that this effect is correctly predicted by the SuperLearner. As shown in Figure 4, dead oil viscosity decreases as temperature increases, and the effect is more noticeable at lower temperatures. Figure 4 also shows the most common dead oil viscosity correlations, with annotations. The results show that the Bennison method overpredicted dead oil viscosity at temperatures lower than 120 °F, while the Petrosky method underpredicted it over the whole temperature range. The developed SuperLearner method predicted dead oil viscosity with little difference between predicted and experimental values. Figure 5 shows that the proposed SuperLearner model follows the same trend of viscosity vs. API gravity as other well-known correlations: dead oil viscosity decreases as oil API gravity increases. The figure shows the trend of the developed model and some other correlations at 120 °F.
From this figure, it can be seen that the results predicted by the SuperLearner match the experimental data closely.
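The trend checks shown in Figures 4 and 5 can also be automated: on a grid of inputs, the predicted viscosity should decrease with increasing temperature at fixed API gravity, and with increasing API gravity at fixed temperature. The sketch below is one possible form of such a check. The function names are hypothetical, and the stand-in model is the commonly quoted Beggs and Robinson expression, used here only because the trained SuperLearner is not available in this text.

```python
def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


def check_physical_trends(predict, apis, temps_f):
    """Check monotonic trends of predict(api, temp_f) on an ascending grid.

    Returns (decreases_with_temperature, decreases_with_api).
    """
    temp_ok = all(
        is_strictly_decreasing([predict(api, t) for t in temps_f]) for api in apis
    )
    api_ok = all(
        is_strictly_decreasing([predict(api, t) for api in apis]) for t in temps_f
    )
    return temp_ok, api_ok


def stand_in_model(api, temp_f):
    """Stand-in predictor (Beggs and Robinson form, commonly quoted coefficients)."""
    x = 10.0 ** (3.0324 - 0.02023 * api) * temp_f ** (-1.163)
    return 10.0 ** x - 1.0


apis = [15.0, 25.0, 35.0, 45.0]
temps_f = [60.0, 120.0, 180.0, 230.0]
print(check_physical_trends(stand_in_model, apis, temps_f))
```

Any callable with the same signature, for example a thin wrapper around a trained model's prediction method, could be passed in place of the stand-in.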
The results in Figure 5 show that the method proposed in this study is suited to all API ranges and that the developed SuperLearner model predicts dead oil viscosity with the highest accuracy (R² = 0.96). The Al-Khafaji method is unfit for crude oil gravities below 15 °API, while the method developed by Bennison, which was based on North Sea crude oils of low API gravity, is not appropriate for gravities greater than 30 °API.
In order to validate the objective of this study and evaluate the performance of the developed models, the ML results were compared with the results obtained from the existing classical correlations on the basis of various metrics. The results in Table 4 show that the developed SuperLearner outperforms all pre-existing models. In terms of statistical parameters, the models proposed in this study are more robust, reliable and accurate than the other published correlations.
The most common way to obtain dead oil viscosity data is through viscosity correlations, which are useful and effective for estimating dead oil viscosity at different temperatures for different oil types. The critical point in applying common correlations is that they are limited to the parameter ranges from which they were derived. This means that these correlations are regionally dependent and cannot be applied universally. In this study, we developed a model based on the newly introduced ML SuperLearner algorithm to estimate viscosity regardless of temperature, oil type and other critical parameters that are hard to obtain experimentally. As noted in the literature, measuring heavy oil viscosity at low temperatures can be challenging because the expected viscosity values often exceed the upper limit of the device.
The developed model is applicable for a wide range of dead oil viscosity and variety of oils and temperatures. One main advantage of the proposed model is that there is no need for compositional analysis of the oil or asphalt content for the proposed dead oil viscosity correlation. The developed model has better predictive potential than the leading correlation in the whole considered range of viscosity. The proposed SuperLearner model can be used as a fast tool to verify the quality of the experimental data and/or the validity and accuracy of different viscosity models, especially where there are discrepancies and uncertainties in the datasets. The proposed SuperLearner model, which has improved crude oil viscosity accuracy and efficiency compared to previously published ones, can be applied in any reservoir simulator program.

Summary and Conclusions
This article demonstrates the application of several machine learning algorithms (random forest (RF), lightgbm, XGBoost, MLP neural network, SVR and SuperLearner) to predict dead oil viscosity based on different functional forms. The study revealed that, among these methods, XGBoost and SuperLearner might be promising tools for dead oil viscosity prediction and can be applied to reduce the cost of laboratory measurements. Because SuperLearner integrates the merits of the base machine learning algorithms, its capability to forecast dead oil viscosity accurately is improved significantly. In addition, it has the potential to increase the accuracy of crude oil viscosity modeling where some data are not available.
The main advantage of using machine learning and intelligent methods to estimate dead oil viscosity is that, given examples of previous patterns, they can be easily trained and then applied effectively to unknown or untrained instances of the problem, as was the case here. The results confirm the performance of the proposed SuperLearner model in estimating dead oil viscosity. The results also show that the simplest functional form for viscosity is adequate to provide statistically valid and acceptable results. This indicates that additional correlating variables might not be necessary to improve the performance of a model. The empirical correlations show higher errors than the ensemble model and the other ML techniques (XGBoost, lightgbm, MLP ANN and SVR). The SuperLearner model developed in this study can be used when dead oil viscosity must be predicted from limited input data, when it is hard to perform an experiment with high accuracy, or when there is no physical sample for additional experiments, especially for enhanced oil recovery (EOR) processes in the industry, where viscosity at elevated temperatures is needed. Ultimately,