Well-Logging-Based Lithology Classification Using Machine Learning Methods for High-Quality Reservoir Identification: A Case Study of Baikouquan Formation in Mahu Area of Junggar Basin, NW China

Abstract: The identification of underground formation lithology is fundamental in reservoir characterization during petroleum exploration. With the increasing availability and diversity of well-logging data, automated interpretation of well-logging data is in great demand for more efficient and reliable decision making for geologists and geophysicists. This study benchmarked the performances of an array of machine learning models, from linear and nonlinear individual classifiers to ensemble methods, on the task of lithology identification. Cross-validation and Bayesian optimization were utilized to optimize the hyperparameters of different models, and performances were evaluated based on the metrics of accuracy, the area under the receiver operating characteristic curve (AUC), precision, recall, and F1-score. The dataset of the study consists of well-logging data acquired from the Baikouquan formation in the Mahu Sag of the Junggar Basin, China, including 4156 labeled data points with 9 well-logging variables. Results show that ensemble methods (XGBoost and RF) outperform the other two categories of machine learning methods by a material margin. Within the ensemble methods, XGBoost has the best performance, achieving an overall accuracy of 0.882 and AUC of 0.947 in classifying mudstone, sandstone, and sandy conglomerate. Among the three lithology classes, sandy conglomerate, which hosts the potential reservoirs in the study area, can be best distinguished with an accuracy of 97%, precision of 0.888, and recall of 0.969, suggesting the XGBoost model as a strong candidate machine learning model for more efficient and accurate lithology identification and reservoir quantification for geologists.


Introduction
Lithology identification is a task of great significance in reservoir characterization for petroleum exploration and engineering [1]. It is the basis for reservoir quality assessment (e.g., porosity and permeability) and supports related geological research and drilling activities (e.g., sedimentary modeling, favorable zone prediction, and well planning) [2,3]. Well-logging has been utilized as an effective remote sensing measurement to predict underground formation lithology from a surface geophysical survey. Well-logging data contains rich geological information, which is a synthesized reflection of formation lithology and physical properties [4].
The idea of lithology identification from well-logging is to establish the relationship between petrological characteristics and logging curves. Typical lithologies are supposed to have their own specific logging responses. For example, the GR-RT (gamma ray versus true formation resistivity) crossplot is effective for recognizing sandstone and mudstone in conventional sand and shale reservoirs because sandstone has relatively low GR log values and correspondingly high RT, whereas mudstone behaves oppositely on GR and RT logs [2].
However, traditional logging interpretation depends heavily on expertise and human experience, which is labor intensive and time consuming, and often suffers from subjectiveness and inconsistency of expert experience [5]. Due to the complexity of the geological condition in unconventional reservoirs (e.g., carbonate, tight sandstone, or sandy conglomerate reservoir [6,7]) and the increasing diversity and amount of logging data, the traditional logging interpretation methods show great limitations. As a result, researchers are turning to more advanced methods for breakthroughs in lithology identification.
Machine learning techniques have been embraced by the oil and gas industry as alternative methods in addressing the complex and challenging problems it faces to enable automation, lift performance, or explore new solution paradigms [8]. With advances in algorithms, computational theories, and hardware such as graphic processing units, machine learning shows great advantages in learning complex patterns and relationships from large amounts of data [9]. Two primary classes of machine learning algorithms, namely, unsupervised and supervised learning methods, have been applied to lithology identification. Supervised learning methods use a set of training data to learn relationships between features and corresponding labels and build models that are predictive for previously unseen data. In lithology classification using well-logging data, supervised learning algorithms have been reported to outperform nearly all unsupervised learning algorithms by a substantial margin [10,11].
A wide variety of supervised learning methods have been reported in the task of lithology identification, including but not limited to Naïve Bayes [12], linear regression [13], k-Nearest Neighbor (kNN) [14], support vector machine (SVM) [15,16], decision tree and its variants (e.g., random forests and boosting trees) [17-19], and artificial neural networks (ANN) [20,21]. However, as different experiments were carried out using different datasets with their own lithology classification schemes, it is hard to make a parallel comparison of those machine learning models. Recently, more studies have attempted to compare the performance of machine learning methods for lithology identification. Xin et al. [22] compared the performance of five machine learning methods for formation lithology identification using well-logging data from the Daniudui gas field (DGF) and the Hangjinqi gas field (HGF) and concluded that the Gradient Tree Boosting classifier (GTB) and Random Forest had better accuracy than the other three methods, namely, Naïve Bayes, SVM, and ANN. Dev et al. [23] tested three models from the family of gradient-boosted decision tree (GBDT) methods using data from DGF and HGF, and identified LightGBM and CatBoost as the preferred algorithms for lithology classification using well-logging data. Merembayev et al. [24] evaluated five machine learning algorithms including kNN, Decision Tree, Random Forest, XGBoost, and LightGBM on well-logging data from Norway and Kazakhstan for lithofacies classification. The results showed that Random Forest had the best score among the considered algorithms.
In this study, we intend to make a more systematic and comprehensive comparison of machine learning methods for lithology identification using well-logging data. We categorize supervised machine learning methods into three groups, namely, linear individual classifiers, nonlinear individual classifiers, and ensemble methods, with increasing model complexity. We select several typical machine learning models within each group to evaluate their performance using well-logging data collected from 17 wells in our study area and try to answer three key questions:
• Do nonlinear individual classifiers always show better performance in terms of accuracy than linear individual classifiers for well-logging-based lithology classification?
• Do ensemble methods consistently outperform individual classification models and by what margin? Which (if any) is the superior ensemble method?
• How well can different lithology classes in our study be distinguished by the best-performing models?
The rest of the paper is organized as follows: Section 2 introduces the study area and the well-logging dataset. Section 3 describes the machine learning methods included in our study for lithology identification, as well as the metrics used to evaluate their performance. Section 4 presents quantitative results of the experiments in terms of hyperparameter optimization, overall performance, and lithology classification results. Feature importance is also evaluated by the end of the section. The conclusions are summarized in Section 5.

Study Area and Dataset
The study area is located in the sandy conglomerate reservoir of the Baikouquan formation in the Mahu Sag of Xinjiang oilfield in the Junggar Basin (Figure 1), which is the main oil and gas exploration area in northwestern China. It was chosen for the availability of high-quality well-logging data and corresponding core images. The dataset for the study consists of well-logging data with 9 properties acquired from 17 wells in close proximity to each other. Lithologic labels were interpreted from 520 m of core images, yielding 4156 data points for machine learning workflow development. The nine log properties are gamma ray (GR), self potential (SP), caliper log (CALI), shallow, medium-deep, and deep reading resistivity measurements (RESS, RESM, and RESD), neutron porosity log (PHIN), bulk density log (RHOB), and interval transit time (DT). The description of the well-logging dataset is shown in Table 1.
The lithofacies identified from core images in Baikouquan Formation contain 3 classes: mudstone (M), sandstone (S), and sandy conglomerate (SC). The labeling scheme was designed to reduce the subjectivity that exists in core photograph interpretation and produce consistent and reliable labeling for the dataset to benchmark the performance of different machine learning algorithms. The prepared dataset consists of nine predictor variables and a lithology class as target variable.

Machine Learning Models for Lithology Classification
Machine learning has been increasingly used in data-driven discovery in geoscience to perform complex prediction tasks by learning patterns from large amounts of data, which cannot be easily done by a set of explicit rules [9]. There are four major machine learning paradigms: supervised learning, semisupervised learning, unsupervised learning, and reinforcement learning [26].
In supervised learning, the model attempts to predict a target value using a set of variables or features after learning the relationship between the predictors (the features) and the output in training. When the target variable is a categorical variable (also called a label), the problem is said to be a classification problem and the model is called a classifier.
This study explores an array of machine learning models and determines their performance in lithology classification using the well-logging data. These individual machine learning models can be broadly categorized as linear and nonlinear models, which will be detailed in Sections 3.1 and 3.2. Ensemble models are models combining individual models, which will be covered in Section 3.3.

Linear Models for Classification
Linear classification models refer to the class of classifiers that result in linear decision boundaries [27]. Linear models remain a popular choice in applications, especially when they can achieve adequate accuracy, for their straightforward implementation and better interpretability.

Logistic Regression
Logistic Regression (LR) is one of the most popular linear models for classification in the industry [27,28]. In the binary case, the model expresses the posterior probabilities of the two classes (0 and 1) through a linear function of the input variables or features, while ensuring that they sum to one:

$$P(y=1 \mid x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad P(y=0 \mid x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)} \tag{1}$$

Applying the logit transformation, one obtains the log-odds ratio as

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \beta_0 + \beta^T x \tag{2}$$

The input space is divided by the decision boundary, the hyperplane defined by $\{x \mid \beta_0 + \beta^T x = 0\}$, on which the log-odds ratio is zero, meaning that the posterior probability of being in one class or the other is equal.
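As a minimal sketch of the binary logistic model described above, the following Python snippet evaluates the posterior probability and the log-odds for a given input. The coefficient values and the helper names (`posterior_positive`, `log_odds`) are illustrative assumptions and are not taken from the study.

```python
import math

def posterior_positive(x, beta0, beta):
    """P(y = 1 | x) under a binary logistic regression model."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(x, beta0, beta):
    """log[P(y=1|x) / P(y=0|x)], which equals beta0 + beta^T x."""
    p1 = posterior_positive(x, beta0, beta)
    return math.log(p1 / (1.0 - p1))

# Hypothetical coefficients for two standardized features (not from the paper).
beta0, beta = 0.5, [1.0, -2.0]

# A point on the decision boundary: 0.5 + 1.5 * 1.0 - 2.0 * 1.0 = 0.
x_on_boundary = [1.5, 1.0]
p = posterior_positive(x_on_boundary, beta0, beta)
```

On the boundary the two posteriors are equal (`p` is 0.5), and away from it the log-odds grows linearly with the features.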

Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is another popular model that leads to a linear decision boundary [28,29]. The LDA model separates two classes based on a set of observed characteristics $x$ by modeling the class densities $f_1(x)$ and $f_0(x)$ as multivariate normal distributions with means $\mu_1$ and $\mu_0$ and the same covariance matrix $\Sigma$.
Again, we compute and investigate the log-ratio

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log \frac{\pi_1}{\pi_0} - \frac{1}{2} (\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) + x^T \Sigma^{-1} (\mu_1 - \mu_0) \tag{3}$$

where $\pi_1$ and $\pi_0$ are the prior probabilities of the two classes. The decision boundary is the hyperplane on which Equation (3) equals 0, and it is linear in $x$. The hyperplane drawn by LDA aims to maximize the ratio of the between-group variance to the within-group variance, so the two classes can be best separated [30].
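To illustrate how the priors and the shared covariance shape the LDA boundary, the sketch below specializes the log-ratio to a single feature with a common variance. The function names and all numeric values are assumptions chosen for illustration only.

```python
import math

def lda_log_ratio(x, mu1, mu0, sigma2, pi1, pi0):
    """Univariate LDA log-ratio log[P(y=1|x) / P(y=0|x)] with shared variance."""
    return (math.log(pi1 / pi0)
            - (mu1 + mu0) * (mu1 - mu0) / (2.0 * sigma2)
            + x * (mu1 - mu0) / sigma2)

def decision_threshold(mu1, mu0, sigma2, pi1, pi0):
    """Value of x at which the log-ratio is zero (the decision boundary)."""
    return (mu1 + mu0) / 2.0 - sigma2 * math.log(pi1 / pi0) / (mu1 - mu0)

# Hypothetical class means and variance (not from the paper).
equal_prior_boundary = decision_threshold(2.0, 0.0, 1.0, 0.5, 0.5)
skewed_prior_boundary = decision_threshold(2.0, 0.0, 1.0, 0.7, 0.3)
```

With equal priors the boundary sits at the midpoint of the two means; a larger prior for class 1 moves the boundary toward the mean of class 0, enlarging the region assigned to class 1.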

Nonlinear Models for Classification
More advanced machine learning techniques have been developed to model complex patterns in data, which often result in nonlinear decision boundaries.

k-Nearest Neighbor
k-Nearest Neighbor (kNN) is a simple but effective classification method [31]. The approach consists of calculating the Euclidean distance between a new instance and each labeled instance in the training sample. Then, the class label of the new instance is assigned according to the majority class of the k nearest neighbors in the training set. kNN has the advantage of being nonparametric, but one has to carefully select k to achieve optimal classification results. The method is also sensitive to the scale of different features in multidimensional space, so data standardization is required to eliminate the effect of scale differences in both training and test sets [32].
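The procedure can be sketched in a few lines of Python: standardize the features using training-set statistics, compute Euclidean distances, and take a majority vote among the k nearest neighbors. The two-feature toy values and labels below are invented for illustration and do not come from the study dataset.

```python
import math
from collections import Counter

def standardize(train, test):
    """Scale each column using the training mean and standard deviation only."""
    n_features = len(train[0])
    means = [sum(r[j] for r in train) / len(train) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns

    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(n_features)] for r in rows]

    return scale(train), scale(test)

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Invented toy data: two features on very different scales.
train = [[120, 2.55], [110, 2.60], [115, 2.50], [60, 2.40], [55, 2.45], [65, 2.35]]
labels = ["M", "M", "M", "S", "S", "S"]
train_s, test_s = standardize(train, [[62, 2.42]])
predicted = knn_predict(train_s, labels, test_s[0], k=3)
```

Scaling the test point with the training statistics, rather than its own, keeps the two sets in the same coordinate system.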

Support Vector Machine
Support Vector Machine (SVM) is one of the most widely applicable machine learning models developed by Vapnik [33]. The idea of the method is to transform the input space into a high-dimensional feature space using a nonlinear function, where two classes can be separated linearly. The goal of SVM is to find the hyperplane that maximizes the minimum distance between the hyperplane and the support vectors. Like LR and LDA, SVM was originally developed for two-class classification, then further extended to multiclass problems [34].
SVM is reported to perform well in cases where the sample size is small or the number of features exceeds the number of data points. It generalizes well in practice and thus has a relatively low risk of overfitting. Despite its advantages, choosing the optimal kernel for SVM is a difficult task. SVM also does not directly provide probability estimates and is harder to interpret than decision-tree-based methods [35,36].

Decision Trees
Decision Trees (DT) are one of the most commonly used models in supervised classification and serve as the building blocks for several more-sophisticated ensemble models. DT constructs decision rules organized in a treelike structure to map input values to their target labels. In a tree structure, leaves represent labels and nonleaf nodes are features. Each branch represents a rule that leads to the final classification. The challenge lies in how to build the smallest decision trees: the best split should result in a classification with the lowest entropy or with the highest information gain. A realization of such a heuristic is C4.5 developed by Quinlan [37].
DT has several advantages that explain its popularity in many applications: (1) it is easy to interpret and explain; (2) it requires relatively little effort from users for data preparation; (3) it implicitly performs variable screening or feature selection. However, one key disadvantage of decision trees is that they tend to overfit. Without proper pruning or limits on tree growth, they can become poor predictors [38].
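The splitting criterion mentioned above can be made concrete with a short sketch that computes the entropy of a label set and the information gain of a candidate threshold on one feature. The toy values and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

# Invented toy feature and labels.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["A", "A", "B", "B"]
gain_clean = information_gain(values, labels, 2.5)   # separates the classes
gain_partial = information_gain(values, labels, 1.5) # leaves a mixed branch
```

A tree builder would evaluate such gains for every candidate feature and threshold and keep the split with the highest value.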

Ensemble Models for Classification
To improve the performance of individual classifiers, ensemble models have been introduced. The idea of ensemble methods is to combine multiple weak learners to obtain a strong learner resulting in more accurate or robust predictions [39].
Ensemble models can be split into homogeneous and heterogeneous. Homogeneous ensemble models use only one type of classifier whereas heterogeneous ones combine different types of classifiers [40]. Two popular techniques in building homogeneous ensemble models are bagging and boosting. In bagging, k independent base classifiers are generated using bootstrapping; then, results are aggregated through majority voting. In boosting, base classifiers are built sequentially to improve the prediction of the previous outcomes [41].
In both cases, the base classifier can be any type of model, but decision tree methods are usually applied. Two such examples of bagging and boosting are Random Forest (RF) and Gradient-Boosted Decision Trees (GBDT).

Random Forest
Random Forest is one of the most popular bagging algorithms, introduced by Breiman [42]. The algorithm starts with the generation of bootstrapped samples from the data; then, a collection of decision trees is fitted to those samples. At inference, predictions from all trees are aggregated to form the final decision, via majority voting in the case of classification [43]. Benefiting from the randomization, RF reduces variance and is less likely to overfit than individual decision trees.
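The bootstrap-and-vote mechanism can be sketched as follows. This is a simplified illustration only: it uses one-feature decision stumps as base learners and omits the random feature subsampling at each split that Random Forest adds on top of bagging. The data and function names are assumptions.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the single-feature threshold that minimizes training errors."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def fit_bagging(xs, ys, n_trees, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_majority(trees, x):
    """Aggregate the individual predictions by majority voting."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Invented toy data: class "A" for small values, "B" for large ones.
xs = list(range(1, 11))
ys = ["A"] * 5 + ["B"] * 5
trees = fit_bagging(xs, ys, n_trees=25)
```

Because each stump sees a different resample, individual thresholds vary, and the vote averages out that variability.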

Extreme Gradient Boosting Trees
Extreme Gradient Boosting Trees (XGBoost) belongs to the family of gradient-boosted decision trees (GBDT). It was developed by Chen [44], whose implementation made multiple enhancements to improve the efficiency and scalability of the original GBDT methods.
Plain gradient boosting trains each subsequent model using the residuals (the difference between the predicted and true values) or gradient, which is the reason why it is called "gradient boosting". By correcting the mistakes of the previous models, it gradually rectifies the results and improves the accuracy of predictions. XGBoost takes this one step further. It exploits the second-order derivative in the loss function formulation to accelerate the convergence of the model. XGBoost also introduces more regularization in the model formulation to control overfitting, which further improves its performance.
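The role of the second-order derivative and the regularization term can be illustrated with the closed-form leaf value used in second-order boosting, $w^* = -G / (H + \lambda)$, where $G$ and $H$ are the sums of first and second derivatives of the loss over the samples in a leaf. The sketch below evaluates it for the logistic loss; the inputs are invented, and the parameter name `reg_lambda` is chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_weight(scores, labels, reg_lambda=1.0):
    """Second-order optimal leaf value for logistic loss: -G / (H + lambda)."""
    # First derivative of the logistic loss w.r.t. the raw score: p - y.
    g = [sigmoid(s) - y for s, y in zip(scores, labels)]
    # Second derivative: p * (1 - p).
    h = [sigmoid(s) * (1.0 - sigmoid(s)) for s in scores]
    return -sum(g) / (sum(h) + reg_lambda)

# Four samples in one leaf, all with a current raw score of 0 (p = 0.5).
scores = [0.0, 0.0, 0.0, 0.0]
labels = [1, 1, 1, 0]
w_regularized = leaf_weight(scores, labels, reg_lambda=1.0)
w_unregularized = leaf_weight(scores, labels, reg_lambda=0.0)
```

Here G = -1 and H = 1, so the unregularized update is 1.0 while the regularized one shrinks to 0.5, which shows how the penalty damps each correction step.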
Built and developed for the sole purpose of model performance and computational speed, it quickly gained popularity and became the algorithm of choice for many winning solutions of machine learning competitions [45].

Experiment Setting and Parameter Tuning
To obtain steady and reliable results, we split the dataset into train, validation, and test subsets with a stratified random sampling method. This ensures that the distribution of different classes in the training and testing datasets is consistent. The testing set, consisting of 10% of the total data points in our case, is critical to evaluate the generalizability of the machine learning model resulting from training. The remaining data are further divided into train and validation subsets through cross-validation, which is also used to tune the hyperparameters for parametric models. The higher the model complexity, the more hyperparameters there are to tune and the larger the search space to explore for the optimal hyperparameters. For hyperparameter tuning, Bayesian optimization is utilized to make the exploration of a large search space more efficient [46]. Once the hyperparameters are determined, we train the model on the full training set and make inference on the testing set.
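A minimal sketch of this splitting scheme is given below: a stratified hold-out of 10% for testing, followed by k-fold partitioning of the remainder. The class proportions in the toy label list, the choice of k = 5, and the use of plain (non-stratified) folds are assumptions made for illustration; the study does not specify these details here.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Hold out the same fraction of every class for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def k_folds(indices, k):
    """Yield (train, validation) index lists for plain k-fold cross-validation."""
    for f in range(k):
        val = indices[f::k]
        val_set = set(val)
        yield [i for i in indices if i not in val_set], val

# Invented label list with imbalanced classes.
labels = ["SC"] * 70 + ["S"] * 20 + ["M"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)
test_counts = Counter(labels[i] for i in test_idx)
folds = list(k_folds(train_idx, k=5))
```

Each candidate hyperparameter setting would be scored by averaging the validation metric over the folds, and the test indices would stay untouched until the final evaluation.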

Model Evaluation
The performance of each model is evaluated using the following metrics: Accuracy, Recall, Precision, F1-score, and the area under the receiver operating characteristics curve (AUC).
According to the combination of actual data labels and predicted classes, the classification results can be divided into four cases: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The accuracy, defined by Equation (4), measures the percentage of correctly classified samples.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{4}$$

The recall is defined in Equation (5), indicating the percentage of real positive samples that are classified as positive.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

The precision is defined in Equation (6), measuring the proportion of actual positive samples within the samples that are predicted to be positive.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

The F1 score is the harmonic mean of Recall and Precision and provides a single balanced measure of the two. It is calculated as Equation (7):

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$

AUC is the area under the receiver operating characteristics (ROC) curve, which represents the trade-off between the true positive rate (TPR, identical to Recall) and the false positive rate (FPR, equal to one minus Specificity), given by Equations (8) and (9). As it is independent of a cutoff value, AUC is considered a better overall performance indicator than accuracy.

$$TPR = \frac{TP}{TP + FN} \tag{8}$$

$$FPR = \frac{FP}{FP + TN} \tag{9}$$

For a classifier that performs at least as well as random guessing, the AUC value ranges from 0.5 to 1, with 0.5 as the expected value of random prediction. A model with better overall performance has an AUC value close to 1.
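The metrics above can be computed directly from the four confusion counts, and AUC can be obtained without drawing the ROC curve by using its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one). The counts and scores in this sketch are invented for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, recall (TPR), precision, F1, and FPR from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }

def auc_from_scores(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented counts and scores.
metrics = binary_metrics(tp=8, fp=2, tn=9, fn=1)
auc = auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

For a multiclass problem such as the three lithology classes, these quantities are typically computed per class in a one-versus-rest fashion and then averaged.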

Results and Discussion
This section comprehensively evaluates the performances of different machine learning models with increased complexity for lithology classification. For each machine learning model, we compare its prediction results based on the optimal hyperparameters tuned with respect to the metrics listed in Section 3.5. The results are presented in Sections 4.1 and 4.2. In Section 4.3, we further investigate how well, and to what extent, each lithological class can be distinguished by the best-performing machine learning methods among all models tested, and discuss how this can contribute to the identification of high-quality reservoirs in the study area. Lastly, in Section 4.4 we explore the feature importance and how the number of features affects the classification performances, studying the potential for designing more effective lithology classification systems in the future.

Table 2 presents the optimal hyperparameter settings tuned for different machine learning models in descending order of model complexity. The more complex the model, the more hyperparameters need to be tuned for the model to avoid overfitting and achieve optimal performance. When the number of hyperparameters in the model increases, it becomes less feasible to use grid search to find the optimal hyperparameters, as the search space grows exponentially. Take the XGBoost model as an example: with five levels for each of the six parameters, the grid search needs to explore in total $5^6$ = 15,625 hyperparameter settings, whereas Bayesian optimization takes around 200 evaluations, equivalent to 1.3% of the workload of the grid search. A similar comparison between grid search and Bayesian optimization can be found in [18].
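The workload comparison quoted above follows from simple arithmetic, reproduced here as a quick check. The variable names are chosen for illustration.

```python
levels_per_parameter = 5
n_parameters = 6

# An exhaustive grid evaluates every combination of parameter levels.
grid_size = levels_per_parameter ** n_parameters

# Approximate number of evaluations reported for Bayesian optimization.
bayes_evaluations = 200
fraction = bayes_evaluations / grid_size

print(grid_size, f"{fraction:.1%}")
```

Adding a seventh parameter at five levels would multiply the grid size by five again, which is why exhaustive search scales poorly with the number of hyperparameters.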

Overall Performances
One of the main objectives of our study is to determine how different categories of machine learning methods perform in the task of lithology classification using well-logging data. Table 3 presents the overall performances of different machine learning models, with the better performer ranking higher in the list. It shows clearly that ensemble models perform best, followed by individual nonlinear models, while linear models rank last. The results live up to the common expectation that, in general, the classification performance improves with the increase in the model complexity.
Further, performance results are consistent between training and testing for all machine learning models, indicating that overfitting is minimal in our trained models.
Values of AUC for ensemble models in both training and testing are well above 0.90, indicating very good discriminant and generalization abilities of the ensemble models. Within the ensemble models, XGBoost outperforms RF on all metrics. This is expected, as boosting methods are capable of reducing both bias and variance by increasing the expressive power of the base learner, while RF as bagging method is devised to reduce variance by subsampling the training data.
It is also noted that individual nonlinear models perform better than linear models. Linear models generally work better in situations where instances of different classes have clear boundaries and can be separated linearly; their weaker performance here indicates that the lithology classes cannot be easily distinguished by linear boundaries in the feature space formed by the well-logging data.

Table 3. Overall performance of different machine learning models in lithology classification on the training and testing sets. The best metric achieved in each column is highlighted in bold.

Lithology Classification Evaluation
ROC curves and confusion matrices are produced with the optimized XGBoost and RF classifiers on the test dataset to inspect in greater detail how well each lithology class can be distinguished from well-logging data.
In Figure 2, ROC curves exhibit great generalization performances for both XGBoost and RF in classifying the three lithology classes (M refers to mudstone, S refers to sandstone, and SC refers to sandy conglomerate). Among them, mudstone is the lithology class receiving the highest AUCs of 0.97 and 0.96 for XGBoost and RF, respectively. Meanwhile, sandstone gets the lowest AUCs for both classifiers, which are still above 0.92. The AUCs for sandy conglomerate are between those of mudstone and sandstone, with 0.95 and 0.94 for XGBoost and RF, respectively.

Figure 3 highlights the confusion matrices of predictions generated by XGBoost and RF, from which we can interpret the actual classification accuracy of each lithology class. The normalized confusion matrices (Figure 3b,d) demonstrate that sandy conglomerate has the highest classification accuracy, 97%, for both XGBoost and RF. This could be attributed to the fact that sandy conglomerate accounts for 70% of the dataset and the evaluation metric for model optimization is classification accuracy. XGBoost outperforms RF in the classification of mudstone and sandstone, with 5% and 11% higher classification accuracies for the two lithology classes, respectively. Classification performance of sandstone is the worst in both XGBoost and RF among the three lithology classes, which is consistent with the lowest AUCs observed in Figure 2. Mistakes are mainly concentrated in the misclassification of sandstone, as well as mudstone, into sandy conglomerate for both classifiers. This is likely due to the fact that sandy conglomerate has a mixed nature and contains samples with well-logging signatures resembling those of mudstone and, especially, sandstone. This could result in overlaps between sandy conglomerate and the other two lithology classes in the feature space, thus leading to misclassification of sandstone and mudstone into sandy conglomerate.
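For readers less familiar with normalized confusion matrices, the sketch below shows how per-class accuracy (recall) and precision are read off a count matrix whose rows are true classes and whose columns are predicted classes. The counts are invented toy numbers and are not the values shown in Figure 3.

```python
def normalize_rows(matrix):
    """Divide each row by its total so that entries are per-class rates."""
    return [[c / sum(row) for c in row] for row in matrix]

def per_class_recall(matrix):
    """Diagonal of the row-normalized matrix: correct / actual per class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def per_class_precision(matrix):
    """Correct / predicted per class, using column totals."""
    n = len(matrix)
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [matrix[i][i] / col_totals[i] for i in range(n)]

# Invented counts; rows and columns are ordered M, S, SC.
counts = [
    [8, 1, 1],    # true M
    [1, 6, 3],    # true S
    [1, 1, 18],   # true SC
]
normalized = normalize_rows(counts)
recall = per_class_recall(counts)
precision = per_class_precision(counts)
```

In this toy matrix most errors for the S row fall into the SC column, the same kind of pattern the text describes for sandstone being absorbed by the majority class.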
Classification reports (Figure 4) are further derived from the confusion matrices. In our study, XGBoost model achieves both high precision of 0.888 and high recall of 0.969 for sandy conglomerate, which shows great potential to identify high-quality reservoirs in Mahu Sag [7]. In Mahu Sag, the sandy conglomerate accounts for over 90% of total oil-producing layers' thickness. It is also necessary to separate the lithofacies of tractive current sandy conglomerates (TCSC) and gravity flow sandy conglomerates (GFSC) from the lithology class of sandy conglomerate. Tractive current sandy conglomerates account for around 60% in thickness and about 90% in oil production of sandy conglomerate layers. To make the automated identification of TCSC possible, more rigorous lab work and interpretation are required to further label sandy conglomerate into sublithofacies of TCSC and GFSC, which is our next step.

Feature Importance
Feature selection in machine learning models is a relevant consideration for many applications. Reducing irrelevant features can reduce model complexity and increase the generalization performance of the model. It also helps in designing more cost-efficient models by reducing the number of features in data collection.
In this study, we examine the model performances with reduced features using backward elimination based on importance measures. Figure 5 shows the feature importance of the nine well-logging variables extracted from the XGBoost model. We then remove the features one by one, starting with the least important, and retrain the model with cross-validation at each step. Figure 6 shows the accuracy and AUC obtained by the XGBoost model as the number of features decreases. As can be seen from the figure, AUC decreases slowly at first but takes a sharp downturn when the number of features drops below 3. To keep the classification accuracy in testing above 0.9, we need to keep the top 6 features in the model.
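The elimination loop can be sketched as follows. This is an illustration under stated assumptions: the importance ranking is computed once and kept fixed, the feature names and importance values are placeholders, and `toy_score` is a hypothetical stand-in for "retrain with cross-validation and return the metric".

```python
def backward_elimination(importances, score_fn, min_features=1):
    """Drop the least important feature one at a time, recording the score."""
    remaining = sorted(importances, key=importances.get, reverse=True)
    history = [(list(remaining), score_fn(remaining))]
    while len(remaining) > min_features:
        remaining = remaining[:-1]  # remove the current least important feature
        history.append((list(remaining), score_fn(remaining)))
    return history

# Placeholder importances (not the values in Figure 5).
importances = {"f1": 0.40, "f2": 0.25, "f3": 0.20, "f4": 0.10, "f5": 0.05}

def toy_score(features):
    """Hypothetical scorer: total importance kept, standing in for CV accuracy."""
    return sum(importances[f] for f in features)

history = backward_elimination(importances, toy_score)
feature_counts = [len(features) for features, _ in history]
```

Plotting the recorded scores against `feature_counts` gives a curve of the kind described for Figure 6, from which a minimum viable feature set can be read off.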

Conclusions
The identification of lithology from well-logging data is an important task in reservoir characterization for petroleum exploration. Many different machine learning methods have been reported for this application. In this study, we comprehensively evaluated the performances of an array of supervised machine learning methods, from linear and nonlinear individual classifiers to ensemble methods, on lithology identification using well-logging data acquired from the sandy conglomerate reservoir of the Baikouquan formation in the Mahu Sag of the Junggar Basin, China. Cross-validation and Bayesian optimization were applied to optimize the hyperparameters of different models, and their performances were evaluated on a separate test dataset.
Results show that ensemble methods (XGBoost and RF) perform best among the three categories of machine learning models, followed by nonlinear individual classifiers (kNN, SVM, and DT). Linear individual classifiers (LR and LDA) produce the least favorable results, indicating their disadvantages in solving the nonlinear lithology classification problem using well-logging data. Within the ensemble methods under testing, XGBoost has the best performance, with an accuracy of 0.882 and AUC of 0.947. It outperforms RF especially in the classification of sandstone, with an increase in accuracy of 11%. Among the three lithology classes, sandy conglomerate, which hosts the potential reservoirs in the study area, can be best distinguished with an accuracy of 97%, precision of 0.888, and recall of 0.969, suggesting the XGBoost model as a strong candidate machine learning model for more efficient and accurate lithology identification and reservoir quantification for geologists. Furthermore, we investigated the importance of well-logging variables and the impact of the number of well-logging variables as input on the classification performance. Experiments showed that at least the top three features are required for the XGBoost model to maintain comparable performance.
The study suggests ensemble methods as the more accurate and efficient machine learning models that can assist geologists in reservoir identification and lithology classification in general. The machine learning workflow established is transferable and can be applied in other geological environments. Future work will include further separation of subclasses of tractive current sandy conglomerates (TCSC) and gravity flow sandy conglomerates (GFSC) within sandy conglomerates from core images and more labeling on well-logging data. More machine learning methods within the category of boosting and beyond, such as neural networks, will be explored to find the best-performing model for distinguishing TCSC and GFSC, thus achieving better reservoir quality (e.g., permeability and porosity) assessment.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data sharing is not applicable.

Acknowledgments:
We would like to thank the reviewers whose constructive comments improved the quality of this paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: