Prediction of Water Saturation from Well Log Data by Machine Learning Algorithms: Boosting and Super Learner

Abstract: Intelligent predictive methods can estimate water saturation (Sw) more reliably than the conventional experimental methods commonly performed by petrophysicists. However, due to nonlinearity and uncertainty in the dataset, the prediction might not be accurate. New machine learning (ML) algorithms, such as gradient boosting techniques, have shown significant success in other disciplines yet have not been examined for Sw prediction or for other reservoir and rock properties in the petroleum industry. To bridge this literature gap, in this study, for the first time, a total of five ML programs, the Super Learner together with the boosting algorithms XGBoost, LightGBM, CatBoost, and AdaBoost, were developed to predict water saturation without relying on resistivity log data. This is important because conventional methods of water saturation prediction that rely on the resistivity log can become problematic in particular formations such as shale or tight carbonates. To do so, two datasets were constructed by collecting several types of well logs (gamma ray, density, neutron, sonic, with and without PEF) to evaluate the robustness and accuracy of the models by comparing the results with laboratory-measured data. It was found that the Super Learner and XGBoost produced the most accurate output (R2 of 0.999 and 0.993, respectively), and at a considerable distance, CatBoost and LightGBM ranked third and fourth, respectively. Ultimately, both XGBoost and the Super Learner produced negligible errors, but the latter is considered the best among all.


Introduction
Fluid saturation, in particular water saturation, is a critical parameter for formation evaluation, reserve estimation, and future field planning. The estimated values of water saturation are fed into both static and dynamic reservoir models that are used to estimate original oil/gas in place (OOIP, OGIP) and consequently form the basis for future production forecasts and for determining the economic viability of the discovered reservoir. Owing to its great significance, water saturation determination has always been an active area of research in petrophysics, and a variety of methods have been developed over the past decades, with development still ongoing. We grouped all these methods into two main categories: (1) direct methods and (2) indirect methods.
Direct water saturation methods are designed and implemented specifically to produce water saturation data. The most typical forms are (a) laboratory analysis of rock samples and (b) empirical estimation using resistivity log data. Laboratory-based methods such as the Retort method and Dean-Stark and Soxhlet extraction are assumed to be accurate; however, these methods assume the rock sample retains representative fluid saturation, which is practically impossible except with very expensive, uncommon sampling methods such as the sponge core barrel or pressure core barrel [1,2]. In addition, they are labor-intensive and time-consuming and provide only discrete data points. For example, the Retort method heats crushed core samples up to 650 °C and measures the water and oil volumes driven off. Dean-Stark extraction measures water and oil saturation by distillation extraction: the water in the core is vaporized by boiling solvent, then condensed and collected in a calibrated trap [1]. Such methods are destructive, and considering the importance of the samples, the samples cannot be used in other experiments. Other laboratory approaches, such as using a rock centrifuge, displacing fluid, or the mercury-air capillary pressure curve for a water-air system, are expensive and time-consuming, although non-destructive.
Despite their accuracy, laboratory measurements are usually not available for all wells, while log data are acquired in the majority of wells and provide continuous information along the well. Determination of fluid saturation is one of the main objectives of well logging, and resistivity is the key to water saturation. To utilize resistivity log data for water saturation determination, empirical relations are applied. The most universally practical one is the equation developed by Archie (Archie, 1942) [3][4][5][6], which employs log-derived resistivity and porosity values to compute water saturation. This equation was later modified to include a tortuosity factor to account for the pore throats in the reservoir, which was shown to be a function of porosity in determining the resistivity factor, given by Equation (1) [1][2][3][4][5]:

Sw = [(a Rw) / (ϕ^m Rt)]^(1/n), (1)

where a is the tortuosity factor, m the cementation factor, n the saturation exponent, ϕ the matrix porosity, Rw the formation water resistivity, and Rt the observed bulk resistivity. What is more, log data represent only the initial state and cannot be used for saturation forecasts; this means that numerical model prediction might be a promising tool for optimizing expenses and filling the gaps in well log data. Exact computation of water saturation using Archie's formula relies on accurate values of Archie's parameters a, m, and n. These parameters are ideally assessed from laboratory data; however, they are usually taken as constant values for clean, clastic quartz reservoirs [1], which is not commonly the case in practice.
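For illustration, Archie's relation solved for Sw can be sketched in Python; the function name and the default a, m, and n values below are illustrative placeholders, not the formation-specific constants used later in this paper:

```python
def archie_sw(rw, rt, phi, a=1.0, m=2.0, n=2.0):
    """Water saturation from Archie's equation:
    Sw = ((a * Rw) / (phi**m * Rt)) ** (1/n)."""
    return ((a * rw) / (phi ** m * rt)) ** (1.0 / n)

# Illustrative inputs: Rw = 0.35 ohm-m, Rt = 20 ohm-m, porosity = 0.20
sw = archie_sw(rw=0.35, rt=20.0, phi=0.20)
```

With these illustrative inputs, the computed Sw is roughly 0.66; the sensitivity of the result to a, m, and n is exactly why laboratory calibration of these parameters matters.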
Well logging, although very common, can be expensive, so specific tools may be skipped in a well. There are also cases when a resistivity log is not available due to tool size limitations. Apart from that, the data obtained from the resistivity log might be unreliable owing to unfavorable geology. In recent years, due to its huge potential, the application of machine learning (ML) in geosciences has generated considerable research interest. ML techniques can help improve petrophysical property prediction and log interpretation, optimize core analysis planning and logging services, and reduce the cost of laboratory measurements. As a result, there has been extensive research on the application of artificial intelligence (AI) techniques in well log interpretation [7], shear sonic log prediction [8], and prediction of various reservoir properties such as porosity, permeability, water saturation, lithofacies, and wellbore stability [9][10][11][12][13][14][15].
Numerous studies have shown accurate prediction of petrophysical parameters from well log data in uncored wells. For example, total organic carbon (TOC) prediction in an unconventional well and permeability estimation in a conventional well were conducted using support-vector regression (SVR) [16]. Different combinations of gamma ray, formation resistivity, neutron porosity, and bulk density logs were used in training to predict core measurements. The results demonstrated the accuracy and reliability of the SVR method in estimating TOC and permeability from well-log data. Another ML technique, the artificial neural network (ANN), was successfully utilized for permeability prediction from log data in uncored wells [17] and for water saturation prediction from wireline logs in sandstone formations [18,19].
Some models revealed a capability in fluids saturation forecasts, cutting expenses on well logging. For example, an unsupervised class-based ML algorithm was utilized to classify the input petrophysical data, such as gamma ray (GR), total porosity (PHIT), effective porosity (PHIE), formation sigma (SIGM), and open hole oil saturation [20]. In this study, seven classes were selected following time series modeling with multiple time-lapse runs on each of the seven selected classes. Next, analyses were conducted using Facebook's open-source forecasting tool Prophet. Results showed a good validation of oil saturation measurements during a natural depletion period [20].
Some studies proved that there is no need to calculate complex coefficients from Archie's equation, presenting a good estimation of the water saturation. The authors of [21] demonstrated a new approach based on a radial basis function neural network (RBFNN) for estimating water saturation from four conventional logs, including sonic (DT), deep resistivity (LLD), density (RHOB), and neutron porosity (NPHI), in a carbonate reservoir. It was concluded that the performance of the proposed model is considerably superior to that of the empirical models, a finding also repeated in a similar study [21]. In addition, it was shown by L. Aliouane et al. [22] that the RBF neural network architecture is able to predict formation permeability, porosity, and water saturation using laboratory measurements of the cores and well-log data.
Various machine learning algorithms have been utilized to predict water saturation. For example, clustering algorithms were proposed and tested on resistivity well logs in the work of [23], namely fuzzy C-means clustering, the Gustafson-Kessel algorithm, and Gath-Geva clustering. The authors chose unsupervised methods because they do not require real labeled data. Study [24] presented an application of the local linear neuro-fuzzy (LLNF) model in estimating reservoir water saturation from well logs. This was followed by [25], which aimed to evaluate fluid saturation in oil sands by means of ensemble tree-based algorithms. Later on, in [26], support vector machine, decision tree forest, and tree boost methods were employed to predict the water saturation of the Mesaverde Tight Gas Sandstones located in the Uinta Basin. In paper [27], multilayer perceptron (MLP) and kernel function-based least-squares support vector machine (LS-SVM) techniques were utilized to develop predictive models for water saturation. Furthermore, an intelligent structure named the robust committee machine (RCM) was introduced for water saturation prediction [28].
Various well log data and petrophysical properties measured in the lab are used as input parameters in different algorithms to predict water saturation in sandstone and carbonate reservoirs. For example, a water saturation model was constructed based on the lithofacies identified in different wells [29], and an ANN was developed to predict water saturation using porosity, permeability, and height above the free water level as input data [30].
Among the machine learning methods used in practice, gradient tree boosting [31] is a technique that stands out in many applications. This method is also known as the gradient boosting machine (GBM) or gradient boosted regression tree (GBRT). Over the last two decades, boosting algorithms have been among the most widely used in data science to achieve state-of-the-art results [32][33][34]. Boosting is a meta-algorithm based on the idea of gradually aggregating a number of simple algorithms, called weak learners, to obtain a final strong learner. More specifically, each weak learner is optimized to minimize the error on the training data using the sum of the previous weak learners' predictions as an additional input [35,36]. This is conducted by dividing the training data and using each part to train different models, or one model with different settings, and then combining the results by majority vote. AdaBoost was the first successful boosting method, developed for binary classification [37,38]. Based on the work of Friedman [31], who introduced gradient boosting of decision trees, several implementations have recently been developed. Three effective methods of gradient boosting based on decision trees have been proposed, namely XGBoost [39], CatBoost [40], and LightGBM [41].
Following the above, AdaBoost can be useful for comparatively small datasets, while scalable algorithms are required for much larger datasets. To meet this requirement, XGBoost, LightGBM, and CatBoost are intended to be employed. XGBoost is a parallel tree boosting system designed to be flexible, portable, and efficient. XGBoost adds a regularization term to the loss function, which helps construct more generalizable models; it has also recently been used to build proxy models of reservoir simulations. Many researchers have recently reported the successful implementation of boosting algorithms in petrophysics and reservoir characterization, supporting their reliability for water saturation prediction [42][43][44][45][46]. LightGBM, like XGBoost, is another powerful and flexible implementation of tree-based gradient boosting. To maximize concurrent learning, it leverages network connectivity algorithms. To speed up the training process and decrease memory usage, it utilizes a histogram-based algorithm. Additionally, LightGBM grows trees leaf-wise instead of level-wise. In ensemble learning, the tree growth technique is usually level-wise, which can be inefficient. Although XGBoost and LightGBM provide several benefits, CatBoost may provide a more effective approach, while remaining scalable, when a large number of categorical features are present in the dataset [41].
These new approaches have been successfully employed in business, academia, and competitive machine learning [47]. While built on structurally similar ideas, these libraries differ slightly in how decision trees are grown and how categorical variables are handled, and only investigation can determine which performs best. Ensemble modeling is a robust method to enhance a model's performance. The Super Learner [48,49] is known as a stacking ensemble. Van der Laan et al. [48] were the first to propose this algorithm in biostatistics, while Polley and van der Laan [49] developed the detailed algorithm.
In the petroleum industry, a single machine learning model may produce a significant prediction error while another performs well. The authors believe Super Learners can further enhance predictive ability by stacking the prediction results of the base algorithms. Therefore, although extensive research has been carried out in recent years, the accuracy of prediction should be improved by testing more machine learning techniques and their combinations, e.g., in petrophysics. Hence, the purpose of this study is to describe and examine boosting methods for water saturation prediction in comparison with the Super Learner, which, to the best of our knowledge, has never been examined before due to its newness. Moreover, we aimed to remove the dependency of such predictions on resistivity logs, which are otherwise necessary for water saturation determination. Finding an accurate prediction model will reduce the cost of core-based measurements and well-logging services. Ultimately, it will increase the accuracy of well log interpretation where a certain type of data could be missing.

Dataset Preparation
Well log data (gamma ray, density, neutron, sonic, PEF) from 11 wells drilled in a sandstone reservoir in the Russian Federation were used for this study; the entire log dataset was treated as the feature set to develop the model, with 7 of the 11 wells used for training and the remaining 4 for final validation.
On the input dataset, we used a 4-fold cross-validation scheme. To begin, each dataset was divided equally into 4 folds at random, and the distribution of each fold was tested to ensure there were no substantial deviations from the distribution of the entire dataset. One fold of samples was used as test or evaluation data for each iteration, while the remaining 3 folds were used as training data to fit the model. The fitted model was then used to predict water saturation on the validation data. The iteration process proceeded until all folds had been predicted. In this way, all samples in the dataset were predicted by a model fitted without using them as training data. This cross-validation scheme eliminated the possibility of overfitting to a fixed training dataset while ensuring that the whole dataset was used to full capacity. For better representation, a schematic of K-fold cross-validation is shown in Figure 1.

Two separate approaches, one including and one excluding the PEF log, along with their corresponding datasets, were created and labeled A and B, respectively. This was conducted to examine whether the tested ML methods can produce results with the least dependency on input data, especially logs that might not always be available in the petroleum industry, here the PEF (photoelectric factor) log. To create the datasets, 5 readily available well logs, GR, sonic (DT), neutron porosity (NPHI), formation density (RHOB), and photoelectric factor (PEF), were used as input to the algorithms. The two approaches involved developing the ML techniques using datasets A and B, respectively. Both datasets had 7 training wells and 4 blind-test wells; the input features for dataset A were GR, DT, NPHI, RHOB, and PEF, and for dataset B, only GR, DT, NPHI, and RHOB.
This was conducted because, first, we wanted to examine the hypothesis of removing the dependency of our prediction on the resistivity log (although it was available in all wells and was the basis for water saturation estimation) and, second, to test whether some less common logs, here PEF, can also be omitted. In Figure 2, the process of constructing the ML models is illustrated.
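The 4-fold scheme described above can be sketched with scikit-learn's `KFold`; the arrays below are synthetic stand-ins for the well-log features and the lab-measured saturation target:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # stand-in columns for GR, DT, NPHI, RHOB, PEF
y = rng.uniform(0.0, 1.0, 100)   # stand-in for lab-measured water saturation

kf = KFold(n_splits=4, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # fit the model on X[train_idx], y[train_idx] and predict
    # water saturation on the held-out X[test_idx]
    fold_sizes.append(len(test_idx))
```

Each sample appears in exactly one held-out fold, so every prediction comes from a model that never saw that sample during training.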

Calculating Water Saturation
The Archie equation was used to calculate water saturation, and Dean-Stark data were available for the formation under study in 2 wells. The laboratory-obtained water saturation was set as the target value, while water saturation was also calculated from the resistivity log. As observed in Figure 3, the interval was clean, with low GR readings reflecting a sandstone formation. Hence, the Archie equation should work well without any need for complex empirical relationships to estimate water saturation. Furthermore, water saturation was also measured in 2 of these wells from plugs that were preserved and tested in the lab. As can be seen in Figure 3, there was a good match between the experimental values of Sw and the predictions by the Archie equation, shown on the fourth track. In this track, the blue dots represent experimental data and the red curve is the water saturation calculated through Archie's equation with m = 1.49 (cementation factor), n = 1.82 (saturation exponent), a = 1.532 (tortuosity factor), and Rw = 0.35 (formation water resistivity). Accurate utilization of the Archie equation relies on knowing or measuring these parameters, which were determined in laboratory studies for the formation under study. Determination of Rw is the main problem in applying Archie's method when there are not enough production tests and water samples, which fortunately was not the case here. Another disadvantage of Archie's method is that an unknown rock matrix type can impose significant error on the results. This issue can be addressed with the PEF log, which provides direct knowledge about the (dominant) lithology of the formation. Additionally, since there is inherent uncertainty in the estimation of m, a, and n values, extensive experimental analyses are required.
Ultimately, when these formation-specific parameters are not available, assumptions must be made, which introduces errors into the final calculated water saturation. Please note that, to avoid a lengthy manuscript and deviating from the main idea of the study, we decided not to include the steps taken to estimate Sw by the conventional Archie equation, since this is a routine process that can easily be found in petrophysics and reservoir engineering textbooks [50].

Methods
Boosting is an extremely effective machine learning technique for learning from input data and generating final outputs [42]. In this paper, we developed code programs based on 4 boosting methods, XGBoost, LightGBM, CatBoost, and AdaBoost, in addition to the Super Learner, a total of 5 different algorithms, to compare the accuracy of water saturation predictions across the board. All calculations were carried out using the Python programming language; the respective Python packages and their versions are listed in Table 1 (the developed Python code programs can be provided upon request). The Super Learner is distinct from other machine learning algorithms in that it is simply an algorithm for combining base learners. In this context, any other machine learning algorithm, such as XGBoost, LightGBM, AdaBoost, or CatBoost, can be a base learner of a Super Learner algorithm. In our study, 2 machine learning algorithms, XGBoost and LightGBM, were utilized as base learners in the Super Learner ensemble.
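The combining step can be illustrated with scikit-learn's `StackingRegressor`. This is a sketch, not the study's configuration: two `GradientBoostingRegressor` instances stand in for the XGBoost and LightGBM base learners, a linear model acts as the meta-learner, and the data are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 0.5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Two boosted-tree base learners; the meta-learner is fit on their
# out-of-fold predictions (cv=4), which is the essence of stacking.
stack = StackingRegressor(
    estimators=[
        ("gbm_a", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("gbm_b", GradientBoostingRegressor(n_estimators=50, max_depth=2,
                                            random_state=1)),
    ],
    final_estimator=LinearRegression(),
    cv=4,
)
stack.fit(X, y)
r2 = stack.score(X, y)  # R^2 on the training data
```

Because the meta-learner only sees out-of-fold base predictions during fitting, the stack learns how much to trust each base model rather than simply memorizing their training fits.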

Models
Boosting, like bagging, is a common method for regression or classification that can be extended to a large number of base learners. In boosting, base learners (in our case, trees) are trained iteratively to increase focus on observations that the current aggregation of base learners fails to model well. Different boosting algorithms calculate misclassification differently and choose various settings for the next iteration. The primary goal of boosting is to minimize bias: because of the increased emphasis on misclassified cases, the bias component of the error is reduced. AdaBoost is the most widely used boosting process [51,52]; it trains models such that misclassified examples found at the end of each iteration have their importance increased in a new training set, which is then fed into the beginning of the next iteration.
The gradient boosting method, unlike AdaBoost, fits the base learner to the negative gradient of the loss function measured in the previous iteration rather than to re-weighted cases. Although AdaBoost and GBMs are effective for small datasets, scalable algorithms are needed for far larger datasets. This need is addressed by XGBoost, LightGBM, and CatBoost.
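For squared-error loss, the negative gradient is simply the residual, so the core of gradient boosting can be sketched in a few lines (a toy regression problem, with shallow scikit-learn trees as weak learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from the constant (mean) model
trees = []
for _ in range(100):
    residual = y - pred                       # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)   # shrink and add the new tree
    trees.append(tree)

mse = float(np.mean((y - pred) ** 2))
```

Each tree corrects what the running ensemble still gets wrong, which is exactly the "fit the base learner to the negative gradient" step described above.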
The main steps of the AdaBoost method are represented below:
• Initialize the weights: w_j = 1/n, j = 1, 2, . . . , n;
• For each iteration i, fit a weak learner Wl_i(x) to the training data using the weights and determine the weighted error Err_i;
• For each i, estimate the weight of the predictor as: β_i = log((1 − Err_i)/Err_i);
• Update the data weights for each i up to N (N is the number of learners);
• Combine the adjusted weak learners to produce the output for the test data x.
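These steps can be sketched directly for the binary-classification case, using synthetic data and decision stumps as weak learners, with β_i in the log((1 − Err)/Err) form:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1, -1)  # synthetic binary labels

n = len(y)
w = np.full(n, 1.0 / n)          # uniform initial weights w_j = 1/n
learners, betas = [], []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = float(np.sum(w * (pred != y)) / np.sum(w))  # weighted error Err_i
    err = min(max(err, 1e-10), 1 - 1e-10)
    beta = np.log((1.0 - err) / err)                  # predictor weight beta_i
    w = w * np.exp(beta * (pred != y))                # up-weight the mistakes
    w = w / w.sum()
    learners.append(stump)
    betas.append(beta)

# Weighted vote of the weak learners
agg = sum(b * l.predict(X) for b, l in zip(betas, learners))
accuracy = float(np.mean(np.sign(agg) == y))
```

A single axis-aligned stump cannot represent the diagonal class boundary here, but the weighted vote of ten stumps approximates it well, illustrating how re-weighting drives down bias.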
XGBoost is a parallel tree boosting framework that is available in large, distributed environments such as Hadoop and is designed to be highly powerful, scalable, and portable. Thanks to improvements over GBMs, such as the introduction of split-finding algorithms for sparse data with default directions at the nodes and fast enumeration of all feasible splits to maximize the splitting threshold, it can solve problems with billions of instances. XGBoost also has a regularization term in the loss function, which aids in the development of more generalizable models. In terms of applications, XGBoost has recently been used to model steel fatigue resistance and for surrogate reservoir simulation [53]. To model the output y for a given dataset, an ensemble of n trees is trained according to the following expression, depicted in Figure 4:

ŷ_i = Σ_{k=1}^{n} f_k(x_i), f_k ∈ F,

where each example x is mapped by the decision rules q(x) to a leaf index, F denotes the space of regression trees, ω is the vector of leaf weights, f_k is the k-th independent tree, and T is the number of leaves on the tree. LightGBM, like XGBoost, is an effective and scalable tree-based gradient boosting implementation. It optimizes parallel learning by using network connectivity algorithms. It uses a histogram-based algorithm to reduce memory usage and speed up the training process. Furthermore, LightGBM grows trees leaf-by-leaf rather than level-by-level. In most cases, the tree growth technique used in ensemble learning is level-wise, which is inefficient [54].
While XGBoost and LightGBM have several advantages, when a dataset contains a large number of categorical attributes, CatBoost can be a more effective and scalable solution. CatBoost employs oblivious decision trees, a form of level-wise expansion. It implements a vectorized representation of the tree in this expansion, which can be evaluated quickly. CatBoost also improves algorithmic performance by using ordered boosting, a permutation-driven alternative to the standard boosting algorithm, and a variety of target statistics for categorical feature processing.
Polley and van der Laan [49] created the Super Learner algorithm. A diagram of the Super Learner algorithm is shown in Figure 5 below to clarify the algorithm. Since it is a hybrid algorithm built from base learners, the Super Learner is unlike any other machine learning algorithm. Any machine learning algorithm, such as XGBoost or LightGBM, may be used as a base learner of a Super Learner algorithm. The base learners are then selected and configured separately using the training data.

Model Evaluation
In Table 2, min-samples-split, n-estimators, max-depth, min-samples-leaf, learning rate, colsample-bytree, subsample, num-leaves, depth, and iterations are, respectively, the minimum number of samples required to split an internal node, the number of trees, the maximum tree depth, the minimum number of samples needed at a leaf node, the learning rate, the column subsample ratio when building a tree, the fraction of observations sampled when building a tree, the number of leaves, the tree depth, and the number of boosting iterations.

The prediction performance of AI models highly depends on the quality of the input data. Before feeding any data to the AI system, data analysis and pre-processing steps were performed. The pre-processing step involved statistical methods to remove outliers and unrealistic values, which is highly recommended when applying ML methods and taking advantage of AI techniques [55,56]. To estimate the accuracy of the models, 4-fold cross-validation was performed. In this study, the robustness and accuracy of the models were evaluated using several popular evaluation metrics: the coefficient of determination (R2), mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE) [56][57][58], each defined by the following equations. The root mean squared error (RMSE) metric is given by Equation (2):

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2 ), (2)

where N is the total number of observations, y_i the measured value, and ŷ_i the predicted value. The mean squared error (MSE) metric is given by Equation (3):

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2. (3)

The mean absolute error (MAE) metric is given by Equation (4):

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|. (4)

The lower the value of the RMSE, MSE, and MAE metrics, the better the model. Finally, the coefficient of determination (R2) metric is given by Equation (5):

R2 = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)^2 / Σ_{i=1}^{N} (y_i − ȳ)^2. (5)

The best possible R2 score is 1.0. A constant model that always predicts the expected value of y, disregarding the input data, would reach a score of 0.0.
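The four metrics can be written directly in NumPy; the saturation values below are illustrative, not data from the study:

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Illustrative saturation values (fractions)
y_true = np.array([0.40, 0.55, 0.62, 0.71])
y_pred = np.array([0.42, 0.53, 0.60, 0.73])
```

Here every prediction is off by 0.02, so MAE and RMSE both equal 0.02 and MSE equals 0.0004, while predicting the mean of y_true everywhere would give R2 = 0, matching the constant-model baseline described above.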

Results and Discussion
To evaluate the performance of each ML method, the evaluation metrics were applied. Table 3 summarizes the evaluation metrics. As can be seen from these values, the Super Learner algorithm's prediction was the most accurate, while AdaBoost showed the least favorable results. Figure 6 is the cross-plot of actual water saturation measurements vs. the predictions for the various algorithms used here, while Figure 7 provides bar plots of the evaluation metrics for better visualization. We can see that the Super Learner and XGBoost have the lowest error values, while AdaBoost has the highest error. Based on the obtained results, it is apparent that there is no need to calculate complex coefficients, such as the cementation factor, tortuosity factor, and saturation exponent, that are required when using Archie's equation to estimate water saturation. Collectively, comparing the data, it can be seen that XGBoost and the Super Learner could be promising tools for water saturation prediction from well log data. The performances of the ML techniques for the prediction of water saturation (dataset A) are summarized in Table 3, and it can be noticed that the leading techniques in each category can be used with confidence. It should be noted that in the following figures, the Archie-calculated water saturation is based on parameters that were measured in the lab and is verified when the experimental data are plotted vs. the AI-estimated values.
Based on Figure 6, which is the cross-plot of target values vs. prediction outcomes, it is observed that there is an excellent correlation between these two values for the Super Learner method. The metric evaluation of the different machine learning algorithms is also shown in Figure 7. Additionally, Figure 8 shows that the Super Learner is the best choice for estimating water saturation without resistivity logs being used as an input parameter, with XGBoost delivering comparable performance. Although the discrepancies between the metric values of the utilized algorithms are not significant, the Super Learner still performs better than the rest. On the contrary, AdaBoost and CatBoost exhibited the highest estimation errors. As shown in Figures 6-8, the Super Learner ranked first in all performance criteria and XGBoost ranked second. Again, it should be recalled that performance is the degree to which the experimentally obtained water saturation values match the predictions. Please note that the results for all test wells, shown as cross-plots of predictions vs. measured water saturation, are depicted in Appendix A for dataset A for Wells 1 to 3.
In dataset B, the PEF log was excluded from the inputs for prediction. It should be noted that the PEF log is not as common as the other well logs used here, although it provides direct clues about the formation lithology. Knowing the formation lithology is vital for estimating porosity based on rock physics models from the neutron and density logs. The performances of the ML techniques for the prediction of water saturation (dataset B) are summarized in Table 4, where the most leading techniques in each category are revealed based on the error values they generated. In dataset B, as mentioned, four well log suites were used as input features for the estimation of water saturation, since the PEF log might not be available and is not as commonly acquired in wells as the rest. The results illustrate that if the input features are reduced from five well log suites to four, the overall performance of the ML techniques is not affected notably across the board. Hence, water saturation can still be predicted with high accuracy. Based on the metric analysis (Figure 9) and cross-validation (Figure 10), the Super Learner and XGBoost have the lowest error and the highest accuracy among all algorithms. Moreover, Figure 11 presents the results of the five algorithms in a well log format (continuous estimation of water saturation) for the entire interval. Please note that the results for all test wells, shown as cross-plots of predictions vs. measured water saturation, are depicted in Appendix B for dataset B for Wells 1 to 3. From a more detailed analysis of these graphs (Figures 9-11), we can collectively conclude the reliability and precision of these predictive models.
Various statistical parameters, including the correlation coefficient and relative errors, were used as a comparison basis to judge whether the predictions and the experimentally measured values match. The predictive model for dataset B, in the absence of the PEF log variable, achieves high performance because the PEF log has a lower impact than the other variables. Its low impact was evident in the results and was also confirmed through a sensitivity analysis. Consequently, it might be unnecessary to include this additional correlating parameter as an input feature for the prediction of water saturation. According to the statistical analysis of feature importance shown in Figure 12, conducted for the Super Learner approach, the decreasing order of importance of the input variables for predicting water saturation is as follows: gamma ray, sonic log, porosity, bulk density, and photoelectric factor log, the last of which was found to be optional.

Figure 12. The estimated score for potential input parameters. It can be seen that PEF has the lowest importance compared to all other well logs.
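The importance measure behind such a ranking is not detailed here; permutation importance is one common way to obtain it, sketched below on synthetic stand-in columns ordered GR, DT, NPHI, RHOB, PEF:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
# Synthetic stand-in features, ordered GR, DT, NPHI, RHOB, PEF
X = rng.normal(size=(300, 5))
# Target depends strongly on the first column and barely on the last,
# mimicking a case where the PEF-like feature is nearly optional
y = (2.0 * X[:, 0] + 1.5 * X[:, 1] + 1.0 * X[:, 2]
     + 0.5 * X[:, 3] + 0.01 * X[:, 4])

model = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Shuffling an influential column degrades the model's score sharply, while shuffling a near-irrelevant one barely changes it, which is why the last column ends up ranked lowest.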
Boosting approaches iteratively train a series of weak learners, where the weights of the records are modified according to the loss function of the previous learners. We compared three state-of-the-art gradient boosting methods in this study in terms of CPU runtime and precision. Using the same hyper-parameter optimization time budget, XGBoost is more accurate in predicting water saturation for the studied dataset, while LightGBM appears to be considerably faster than the other gradient boosting strategies (Table 5). The Super Learner designed for this study uses two base learners, XGBoost and LightGBM. The idea of a Super Learner is appealing: it makes it possible to train and test a multitude of machine learning models on a single collection of data, allowing the ensemble to optimally combine the individual models and produce better overall predictions. This article investigated only a small aspect of the potential of combining ML algorithms. The XGBoost technique leads to nearly the same findings and is highly comparable to Super Learner based on the overall error analysis and the match between predictions and true values. In the literature, it is claimed that resistivity (RT) and porosity logs (neutron or density) are essential for determining water saturation with petrophysical models. In the current study, a minimum set of log variables was used for the estimation of water saturation, which had not been attempted previously. Engineers and operators in the oil and gas industry can use the developed approaches with data from the most contributing and commonly available well logs listed here to predict water saturation, saving exploration expenses and time in an effective manner.

Conclusions
This article demonstrates the application of machine learning algorithms, namely XGBoost, LightGBM, AdaBoost, CatBoost, and Super Learner, to predict water saturation from well-logging data. The study revealed that XGBoost and Super Learner are promising tools for water saturation prediction from the well log data collected by the authors, without relying on a resistivity log. These methods can be applied to reduce the cost of core measurements and well-logging services. In all combinations of predictors considered, Super Learner proved useful for combining the merits of the base machine learning algorithms and enhancing predictive robustness for water saturation.
In addition, these methods have the potential to increase the accuracy of well log interpretation in wells where some data are not available. Two different datasets were used in this study to observe the effect of different variables. The additional correlating parameter (the PEF log) did not convincingly improve the performance of the ML techniques. The main advantage of using machine learning and intelligent methods for estimating water saturation is that, given examples of previous patterns, they can easily be trained and applied to effectively solve unknown or untrained instances of the problem. In addition, there is no need to calculate complex coefficients such as the cementation factor, tortuosity factor, saturation exponent, etc. The results confirm the performance of the proposed ML models in estimating water saturation, particularly the Super Learner, which had not been applied to this problem before. Moreover, in this study, water saturation was estimated without relying on a resistivity log, which can be problematic in certain geological structures. The proposed models can be employed in several applications of static reservoir modeling, such as porosity and permeability prediction, in the future.

Acknowledgments: The authors would like to sincerely thank Gazporomneft for providing the data and allowing us to publish them. Furthermore, we appreciate the input of the anonymous reviewers and the respected editor, whose constructive comments significantly improved this manuscript.