Logging Lithology Discrimination with Enhanced Sampling Methods for Imbalance Sample Conditions

Abstract: In the process of lithology discrimination from a conventional well logging dataset, the imbalance in sample distribution restricts the accuracy of log identification, especially in fine-scale reservoir intervals. Enhanced sampling balances the distribution of well logging samples across multiple lithologies, which is of great significance for precise fine-scale reservoir characterization. This study employed over-sampling and under-sampling algorithms, represented by the synthetic minority over-sampling technique (SMOTE), adaptive synthetic sampling (ADASYN), and edited nearest neighbors (ENN), to process the well logging dataset. To achieve automatic and precise lithology discrimination on the enhanced-sampled dataset, support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT) models were trained using cross-validation and grid search. To evaluate the performance of different models on different sampling results objectively and from multiple perspectives, the lithology discrimination results were compared based on the Jaccard index and F1 score. Among the eighteen lithology discrimination workflows compared, a new workflow combining ADASYN, ENN, and RF produced the most precise lithology discrimination result. This workflow improves the discrimination accuracy of fine-scale reservoir interval lithology, has great generalization ability, and is feasible in a variety of different geological environments.


Introduction
Deep and precise navigation technology directs drilling instruments to a specific location in the oil reservoir to achieve the optimum recovery ratio, which is of great significance in improving the economic benefits of oil fields. The quality of lithology discrimination, especially in fine-scale reservoir sections, directly affects the development of precise downhole navigation [1][2][3]. Traditional lithology discrimination methods rely on empirical rules and domain expertise, which are not only time-consuming but also often suffer from subjectivity and inconsistency [4][5][6][7][8]. Therefore, researchers are increasingly focusing on developing faster and more reliable identification tools [9][10][11][12][13]. Currently, artificial intelligence technology is gradually being adopted as an important alternative for addressing such complex problems [14][15][16][17][18][19]. Constructing lithology discrimination models with intelligent methods such as machine learning is highly effective. However, addressing issues in the data preprocessing stage is also important [20][21][22][23].
Nowadays, various machine learning methods and related technologies are being applied in lithology discrimination tasks. These applications mainly include basic models such as the naive Bayes classifier, support vector machine (SVM), and decision trees, as well as ensemble methods like random forest (RF) and gradient boosting decision trees (GBDT) [24][25][26][27]. Practical data testing and quantitative comparisons show that ensemble methods perform outstandingly in classifying sandstone-related lithologies that are difficult to distinguish accurately with other algorithms. Compared to weak classifiers, ensemble methods have broader application prospects [28,29]. Building on these algorithmic models, optimizing the model structure according to the specific requirements of practical problems helps address more specialized and complex situations [30,31]. Improving the random forest model with a probability-based fuzzy representation method provides more information about rhythm, heterogeneity, and geological characteristics and improves the fineness of formation evaluation and reservoir characterization [32]. An enhanced multi-kernel Fisher discriminant model, built on a single Fisher discriminant model, effectively extracts features of carbonate reservoirs and enables more precise lithology identification [33]. Artificial intelligence methods hold a comprehensive advantage in lithology discrimination tasks. The combination of various machine learning models with optimization algorithms forms an efficient and stable workflow for lithology discrimination. For instance, the fusion of principal component analysis, fuzzy decision tree models, and particle swarm optimization algorithms into lithology discrimination processing systems exemplifies this approach [34]. Additionally, there are well logging lithology classification workflows that consist of dimensionality reduction, k-means sample clustering, and regression analysis [35][36][37]. These studies demonstrate the significant benefits of machine learning methods in lithology discrimination and formation evaluation. However, most of this research has not taken into account the impact of imbalanced well logging data distribution.
Enhanced sampling algorithms balance the dataset so that classes with fewer samples in the original dataset can be adequately learned and accurately identified. These methods improve data quality by considering both data distribution characteristics and model computation requirements [38]. The Gaussian mixture model-based over-sampling method with Jensen-Shannon divergence (GJ-RSMOTE) can effectively improve the prediction accuracy of C4.5 decision trees and SVM models [39]. Under a multi-model fusion strategy, synthetic data obtained by interpolating between real sample points show more consistent probability distribution characteristics than data from traditional sample point interpolation methods, which makes this approach more advantageous [40]. At present, in the field of well logging data mining, there are also studies that use stacking models combined with the SMOTE algorithm to obtain lithology identification results [41]. These studies all focus on over-sampling the minority class to enrich the information content of the dataset.
The above review indicates that data balancing is crucial for enhancing the performance of classifiers and improving the predictive capability for sparsely represented classes. Well logging data often suffer from imbalanced sample distributions. Data on fine-scale reservoir intervals are relatively scarce, and models tend to misclassify the corresponding samples into more abundant lithology classes in order to maintain high overall accuracy. This can ultimately lead to missing highly economical reservoir intervals during industrial exploitation [42][43][44]. To enhance the accuracy of identifying lithology classes with scarce samples, it is necessary to construct balanced datasets with an equal number of samples for each class. At the same time, when sample balancing methods are used to even out the frequency distribution of data across classes, both the information loss of under-sampling methods and the overfitting risk of over-sampling methods must be considered.
Based on the analysis of the above issues, this study proposes a novel approach that first uses over-sampling algorithms, represented by the synthetic minority over-sampling technique (SMOTE) and adaptive synthetic sampling (ADASYN), to generate synthetic samples for lithology classes with scarce samples. Then, the edited nearest neighbors (ENN) process is applied to remove marginalized samples and maintain the balance in sample counts between lithology classes. After enhanced sampling, several machine learning models, including support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT), were applied to test how much this sampling workflow improves the lithology discrimination results. Based on the experimental results, the optimal combination of enhanced sampling and model application was determined, with the aim of efficient and balanced enhanced sampling and intelligent and accurate lithology identification. This study provides a method for synthesizing logging data for scarce-sample lithologies and explores an efficient workflow for the accurate identification of fine-scale reservoirs.

Methodology
The intelligent logging lithology discrimination using enhanced sampling methods is divided into four steps: data preparation, data balancing processing, intelligent discrimination model construction, and lithology prediction analysis (Figure 1). Firstly, the well logging dataset was prepared, including organizing data obtained from logging instruments, data cleaning, and standardization. Secondly, the dataset was subjected to enhanced sampling processing, starting with over-sampling of classes with fewer samples to generate synthetic samples, followed by effective sample selection from the over-sampled dataset. The third step involved establishing an intelligent lithology discrimination model. This step included initializing machine learning models, model training, and determination of the best-performing model. The final step was to analyze the lithology discrimination results of the model. This included quantifying the performance metrics of the model, comparing and evaluating the results of various models, and presenting the effectiveness of lithology discrimination in practical applications.

Data Preprocessing
The Z score method determines the position of each sample in a distribution by measuring how far the data point lies from the mean of the entire dataset, in units of the standard deviation. Samples that deviate significantly from the mean are marked as outliers. The Z score can be calculated using the following equation:

$$Z = \frac{x - \mu}{\sigma} \tag{1}$$

where $Z$ is the Z score of sample $x$, $\mu$ represents the mean of the sample set, and $\sigma$ represents the standard deviation of the sample set.
Based on the principle that formations of the same period and phase exhibit similar logging responses, the logging curve data corresponding to the same-period strata in each well were adjusted to have similar frequency distributions by logging curve standardization. First, we selected a reference well and calculated the mean and standard deviation of its logging data. Based on this, the data from other wells were transformed. The standardized values of the logging curves can be calculated as:

$$V_{norm} = \frac{V_{raw} - \mu_{raw}}{\sigma_{raw}} \, \sigma_{std} + \mu_{std} \tag{2}$$

where $V_{norm}$ represents the standardized logging curve value, $V_{raw}$ represents the logging curve value before standardization, $\mu_{raw}$ and $\sigma_{raw}$ represent the mean and standard deviation, respectively, of the logging curve before standardization, and $\mu_{std}$ and $\sigma_{std}$ represent the mean and standard deviation of the logging curve corresponding to the reference well.
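The two preprocessing steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the population standard deviation is an assumption (the text does not say which estimator was used), and the cutoff of 3 follows the range [−3, 3] reported later in the paper.

```python
from statistics import mean, pstdev

def z_scores(values):
    # Equation (1): distance from the mean in units of standard deviation.
    mu, sigma = mean(values), pstdev(values)  # population std is an assumption
    return [(v - mu) / sigma for v in values]

def remove_outliers(values, threshold=3.0):
    # Keep only samples whose Z score lies inside [-threshold, threshold].
    return [v for v, z in zip(values, z_scores(values)) if abs(z) <= threshold]

def standardize_to_reference(raw, reference):
    # Equation (2): rescale a curve to the mean and spread of the reference well.
    mu_raw, sigma_raw = mean(raw), pstdev(raw)
    mu_std, sigma_std = mean(reference), pstdev(reference)
    return [(v - mu_raw) / sigma_raw * sigma_std + mu_std for v in raw]

# Hypothetical curve values, for illustration only.
cleaned = remove_outliers([10.0] * 20 + [1000.0])
aligned = standardize_to_reference([10, 20, 30, 40], [60, 80, 100, 120])
```

After the transform, the curve from the other well has the same mean and standard deviation as the reference curve, which is the property the standardization step relies on.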
Enhanced Sampling Algorithms

Over-Sampling Algorithms
SMOTE (synthetic minority over-sampling technique) is an algorithm designed to address the issue of imbalanced sample distributions in datasets [45]. Data balancing is a crucial step in machine learning tasks, and SMOTE is a widely recognized over-sampling method that can effectively increase the number of samples of the minority classes. It has obvious advantages in maintaining data reliability and has become the basis for various other over-sampling algorithms. Its basic concept is to generate synthetic samples for minority classes to enhance their representation. The specific process of the SMOTE algorithm is as follows (Figure 2): Firstly, the number of samples for each class in the training set is counted, with the class having the highest sample count designated as the majority class and the remaining classes designated as minority classes. Next, the Euclidean distance between samples of minority classes is calculated to measure feature disparities, expressed by the following equation:

$$d(x_i, x_j) = \sqrt{\sum_{n=1}^{N} \left(x_{i,n} - x_{j,n}\right)^2} \tag{3}$$

where $N$ represents the number of features and $x_{i,n}$ is the $n$-th feature of sample $x_i$. Then, for each minority class sample, another sample is randomly selected from its $K$ nearest neighbors, and a synthetic sample is generated using interpolation. For the sample $x_i$ and its selected neighbor $x_j$, the synthetic sample can be calculated as:

$$x_{new} = x_i + \lambda \left(x_j - x_i\right) \tag{4}$$

where $\lambda$ is a random number between 0 and 1.
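The interpolation step of SMOTE can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `smote`, the round-robin choice of base samples, and the fixed random seed are all hypothetical choices made here for reproducibility.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for t in range(n_new):
        i = t % len(minority)  # visit the minority samples in turn
        # Rank the other minority samples by Euclidean distance to sample i.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(minority[i], minority[j]),
        )[:k]
        j = rng.choice(neighbors)  # random neighbor among the K nearest
        lam = rng.random()         # lambda in [0, 1)
        # Linear interpolation between x_i and x_j in feature space.
        synthetic.append([a + lam * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic

# Hypothetical two-feature minority class, for illustration only.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(corners, n_new=10, k=2)
```

Every synthetic point lies on a line segment between two real minority samples, so it stays inside the region already occupied by that class.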
ADASYN (adaptive synthetic sampling) is a further improved data balancing algorithm that introduces adaptiveness based on SMOTE [46]. It evaluates the local neighborhood of each minority class sample and generates more synthetic data for samples that have fewer same-class neighbors. The advantages of this method are reflected in aspects such as dataset distribution and model prediction results. The specific process of the ADASYN algorithm is as follows (Figure 3): The first step is the same as in the SMOTE algorithm and involves finding the majority and minority classes in the training set. Next, for each minority class sample, the density of other-class samples among its $K$ nearest neighbors is computed. For sample $x_i$ in the current minority class, this density can be represented by the following equation:

$$r_i = \frac{K - \Delta_i}{K}$$

where $\Delta_i$ represents the number of samples belonging to the same class as $x_i$ among the $K$ nearest neighbors of $x_i$. The larger the value of $r_i$, the fewer same-class samples there are in the neighborhood of $x_i$, and therefore the more synthetic samples need to be generated around $x_i$.
The next step involves normalizing the density of each minority class sample to obtain the corresponding sample synthesis weights. In the following data synthesis process, more synthetic samples are generated around samples with larger weights. For the minority class sample $x_i$, the number of synthesized samples $g_i$ can be calculated as:

$$g_i = \frac{r_i}{\sum_{j=1}^{M} r_j} \left(N - M\right)$$

where $N$ represents the number of majority class samples in the dataset, and $M$ represents the number of samples in the minority class to which $x_i$ belongs. The final step of the algorithm is the generation of the synthetic samples, following the same interpolation method in the sample feature space, which can be represented by Equation (4). Compared to the SMOTE algorithm, the adaptiveness of sample synthesis enables ADASYN to perform better when dealing with complex imbalanced datasets, making it an important step for further optimizing the handling of imbalanced problems.
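The allocation of synthetic samples in ADASYN can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: the function name `adasyn_allocation` is hypothetical, the ratio is taken as the share of the K nearest neighbors that do not share the sample's class, and the rounding to whole samples is an assumption.

```python
import math

def adasyn_allocation(samples, labels, minority_label, n_majority, k=5):
    """Return g_i, the number of synthetic samples to create around each minority sample."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    ratios = []
    for i in minority_idx:
        # K nearest neighbors of sample i among all other samples.
        neighbors = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(samples[i], samples[j]),
        )[:k]
        delta = sum(1 for j in neighbors if labels[j] == minority_label)
        ratios.append((k - delta) / k)  # share of neighbors from other classes
    total = sum(ratios)
    gap = n_majority - len(minority_idx)  # N - M
    if total == 0:
        return [0] * len(minority_idx)  # every minority sample sits among its own class
    # Normalized weights times the class gap, rounded to whole samples.
    return [round(r / total * gap) for r in ratios]

# Hypothetical one-feature dataset: one minority sample is surrounded by majority samples.
xs = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [0.25], [10.0], [10.1], [10.2]]
ys = ["M"] * 6 + ["CS"] * 4
counts = adasyn_allocation(xs, ys, "CS", n_majority=6, k=2)
```

In this toy case the whole synthesis budget goes to the isolated minority sample at 0.25, while the three well-clustered minority samples receive none. Because of rounding, the counts need not sum exactly to N − M in general.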

Under-Sampling Algorithm
ENN (edited nearest neighbors) is an under-sampling algorithm used for handling imbalanced datasets [47]. It improves the quality of the dataset and enhances the predictive accuracy of classifiers by removing samples whose class disagrees with the majority class among their nearest neighbors. It can effectively reduce the risk of overfitting and purify the dataset by removing possible noise samples (especially potentially erroneous samples synthesized by over-sampling). By combining it with different over-sampling algorithms, the benefits of the 'over-sampling + under-sampling' processing flow can be comprehensively discussed. The specific process of the ENN algorithm is as follows (Figure 4): Firstly, for each sample, its Euclidean distance to the other samples is calculated with Equation (3). Then, if the class of the sample differs from the most frequently occurring class among its $K$ nearest neighbors, it is marked as noise or a potentially misclassified sample and removed. This process is iterated until no more noise samples are found. After processing with the ENN algorithm, the resulting dataset can be represented as:

$$S_{ENN} = \left\{ (x_i, y_i) \mid y_i = \mathrm{mode}(K_i) \right\}$$

where $S_{ENN}$ represents the dataset after under-sampling processing, $x_i$ represents the $i$-th sample, $y_i$ represents the true label of the $i$-th sample, $K_i$ represents the set of classes of the $K$ nearest neighbor samples for the $i$-th sample, and $\mathrm{mode}(K_i)$ represents the most frequently occurring class in $K_i$.
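The editing rule can be illustrated with a small sketch. It is a simplified version under assumptions: the function name is hypothetical, distances are computed by brute force, and ties in the neighbor vote are resolved by whichever class is encountered first.

```python
import math
from collections import Counter

def edited_nearest_neighbors(samples, labels, k=3):
    """Repeatedly drop samples whose label disagrees with the mode of their K nearest neighbors."""
    samples, labels = list(samples), list(labels)
    while len(samples) > k:
        keep = []
        for i in range(len(samples)):
            neighbors = sorted(
                (j for j in range(len(samples)) if j != i),
                key=lambda j: math.dist(samples[i], samples[j]),
            )[:k]
            mode = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
            keep.append(labels[i] == mode)
        if all(keep):
            break  # no noise samples left
        # Remove all flagged samples at once, then re-check the reduced dataset.
        samples = [s for s, ok in zip(samples, keep) if ok]
        labels = [y for y, ok in zip(labels, keep) if ok]
    return samples, labels

# Hypothetical one-feature data: a single "B" sample sits inside the "A" cluster.
xs = [[0.0], [1.0], [2.0], [3.0], [1.5], [10.0], [11.0], [12.0], [13.0]]
ys = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
clean_xs, clean_ys = edited_nearest_neighbors(xs, ys, k=3)
```

Only the stray "B" sample at 1.5 is removed here; the class proportions otherwise stay as they were, which matches the observation that ENN on its own does not balance the class counts.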

Machine Learning Models
Well logging curve data are input into the intelligent lithology discrimination model, which automatically differentiates the various lithologies in the formation and outputs lithology prediction results. In this study, given the limited size of the dataset, the risk of overfitting a deep learning model is very high. In addition, to better verify the effect of the enhanced sampling algorithms on widely used machine learning models, three representative machine learning methods, namely support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT), were selected to evaluate the improvement brought by the enhanced sampling methods (Figure 5).

Support Vector Machine (SVM)
The support vector machine is a widely used machine learning algorithm for classification tasks. It is suitable for datasets with a small sample size. Its core idea is to find a classification hyperplane that maximizes the margin between different classes. Support vectors are the data points closest to the hyperplane, and they determine the position of the hyperplane during model construction. For linearly separable problems, classification decisions can be made by evaluating the distance between data points and the hyperplane. However, in practical applications of well logging data analysis, data are often not completely linearly separable and may contain some noise [48]. To address these situations, kernel functions can be employed in SVM to map data features into a higher-dimensional space, allowing nonlinear relationships to be separated by a linear hyperplane in this higher-dimensional space. The SVM model used in this study employed the radial basis function (RBF) kernel, which can be represented by the following equation:

$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$

where $x_i$ and $x_j$ are data points and $\sigma$ is a parameter controlling the kernel width.
The soft margin of SVM allows a small number of samples to be misclassified within a certain range, which helps maintain the model's generalization ability and prevent overfitting. During model training, parameters related to the kernel function and soft margin have the most direct impact on model performance.
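As a small illustration of how the kernel width shapes similarity, the RBF kernel can be evaluated directly. The function name `rbf_kernel` and the sample points are hypothetical; many libraries, scikit-learn among them, express the same kernel through gamma = 1 / (2σ²) instead of σ.

```python
import math

def rbf_kernel(x_i, x_j, sigma=1.0):
    # Squared Euclidean distance between the two feature vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

same = rbf_kernel([0.0, 0.0], [0.0, 0.0])            # identical points give 1.0
near = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)  # exp(-1)
wide = rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=3.0)  # wider kernel, higher similarity
```

A larger σ makes distant samples look more similar and yields a smoother decision boundary, which is why the kernel parameter is a natural target for grid search alongside the soft-margin penalty.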

Random Forest (RF)
Random forest is an ensemble learning method. It performs well when processing high-dimensional data. It constructs multiple decision trees during the training process and improves overall accuracy by combining their classification results while controlling model overfitting [49]. The random forest model employs random sampling with replacement from the training dataset to construct different training datasets. This random sampling enhances the diversity and robustness of the model while reducing the influence of individual samples. Then, during the construction of each decision tree, the random forest model adopts a random feature selection strategy. When splitting each node of the decision tree, a subset of features is randomly selected from the total feature set as candidate split attributes. This random feature selection helps reduce the correlation between the trees and improves the model's generalization ability. Finally, the random forest model makes classification decisions through voting. For classification tasks, each decision tree provides a prediction, and the final classification result is determined by the majority vote. Assuming the set of classes is $\{c_1, c_2, \ldots, c_N\}$ and $h_i^j(x)$ represents the prediction result of the decision tree $h_i$ for class $c_j$, the voting result can be expressed by:

$$H(x) = c_{\arg\max_{j} \sum_{i=1}^{T} h_i^j(x)}$$

where $T$ is the number of decision trees in the forest.
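The voting rule reduces to counting the class predicted by each tree. A minimal sketch, with a hypothetical function name and made-up tree outputs:

```python
from collections import Counter

def majority_vote(tree_predictions):
    # Each entry is the class predicted by one decision tree for the same sample.
    votes = Counter(tree_predictions)
    # most_common(1) returns the class with the highest vote count;
    # on a tie, the class that was encountered first wins.
    return votes.most_common(1)[0][0]

# Hypothetical predictions from five trees for one depth sample.
winner = majority_vote(["FS", "S", "FS", "M", "FS"])
```

Here three of the five trees predict fine sandstone, so the forest outputs "FS" even though two trees disagree.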

GBDT
Gradient boosting decision tree (GBDT) is also an ensemble learning algorithm, and its core idea is to build decision trees iteratively, with each tree correcting the errors of the previous ones, ultimately forming a robust predictive model [50]. The specific process is as follows: Firstly, the model initializes a basic decision tree. By comparing the lithology prediction results of the decision tree on the well logging dataset with the actual lithology labels, the logarithmic loss function is obtained, represented by the following equation:

$$L = -\sum_{i} \sum_{k=1}^{K} y_i^k \log p_i^k$$

where $K$ represents the total number of classes, $y_i^k = 1$ if the $i$-th sample belongs to class $k$ (and 0 otherwise), and $p_i^k$ represents the predicted probability of the $i$-th sample belonging to class $k$. Next, the negative gradient (residual) of the multinomial logarithmic loss function is calculated, which can be represented by:

$$r_i^k = y_i^k - p_i^k$$

where $r_i^k$ represents the residual of class $k$ for the $i$-th sample, and the updating process of the decision tree involves learning these residuals. Through repeated iterations, the residuals of the model on the training data gradually decrease. After the iterations stop, the final model combines the outputs of all decision trees to obtain the classification results.
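The loss and residual for a single sample can be computed by hand to see what the next tree is asked to fit. In this sketch the class probabilities are obtained from raw scores with a softmax, which is the usual choice for multinomial boosting but is an assumption here, since the text does not state how the probabilities are produced; all names are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(one_hot, probs):
    # Multinomial logarithmic loss for one sample.
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def residuals(one_hot, probs):
    # Negative gradient of the loss with respect to the raw scores: y - p.
    return [y - p for y, p in zip(one_hot, probs)]

# One sample whose true class is the first of three, with uninformative initial scores.
y = [1, 0, 0]
p = softmax([0.0, 0.0, 0.0])
loss = log_loss(y, p)
res = residuals(y, p)
```

With equal initial scores each class receives probability 1/3, so the residual is positive for the true class and negative for the others; fitting the next tree to these values pushes the prediction toward the correct lithology.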

Model Evaluation Framework
To assess the effectiveness of the methods, it is essential to establish evaluation metrics for the experimental results. The Jaccard index serves as a metric for evaluating the classification performance of the model from the perspective of the well logging sample dataset. It is defined as the size of the intersection between the predicted label set and the true label set divided by the size of their union. The Jaccard index for the $k$-th class can be calculated using the following equation:

$$J_k = \frac{\lvert Y_k \cap \hat{Y}_k \rvert}{\lvert Y_k \cup \hat{Y}_k \rvert}$$

where $Y_k$ represents the true label set for the $k$-th class and $\hat{Y}_k$ represents the corresponding predicted label set. A higher Jaccard index indicates that the model's predicted label set is closer to the true label set, reflecting better lithology prediction performance. We let $N$ denote the total number of lithology classes. The overall Jaccard index of the model is the average of the scores for each class:

$$J = \frac{1}{N} \sum_{k=1}^{N} J_k$$

Well logging lithology discrimination is a multi-class classification problem, and the F1 score is a commonly used evaluation metric for such problems. The F1 score reflects the classification performance of each class, and the F1 score for the $k$-th class can be calculated using the following equation:

$$F1_k = \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k}$$

where $TP_k$ represents the number of true positives for the $k$-th class, $FP_k$ represents the number of false positives for the $k$-th class, and $FN_k$ represents the number of false negatives for the $k$-th class. The overall F1 score of the model is the average of the scores for each class:

$$F1 = \frac{1}{N} \sum_{k=1}^{N} F1_k$$

Here, $N$ still represents the total number of lithology classes. A higher F1 score indicates better lithology prediction performance.
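Both metrics can be computed from the same per-class counts, since the intersection of the true and predicted label sets for a class is its true positives and the union adds the false positives and false negatives. The sketch below uses hypothetical function names and a made-up six-sample prediction.

```python
def class_counts(y_true, y_pred, label):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn

def jaccard_k(tp, fp, fn):
    # |intersection| / |union| = TP / (TP + FP + FN)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def f1_k(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_scores(y_true, y_pred):
    # Unweighted average over classes, so rare lithologies count as much as common ones.
    classes = sorted(set(y_true))
    counts = [class_counts(y_true, y_pred, c) for c in classes]
    jaccard = sum(jaccard_k(*c) for c in counts) / len(classes)
    f1 = sum(f1_k(*c) for c in counts) / len(classes)
    return jaccard, f1

# Hypothetical labels for six depth samples.
truth = ["M", "M", "S", "S", "FS", "FS"]
preds = ["M", "S", "S", "S", "FS", "M"]
jaccard, f1 = macro_scores(truth, preds)
```

Because both overall scores are unweighted class averages, a model that ignores a scarce lithology is penalized even if its overall accuracy is high, which suits the imbalanced setting studied here.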

Dataset Description
The dataset consists of well logging data from eight wells (CB32, CB82, CB85, CB89, CB323, CB327, CB832, and CBX395) in the Chengbei Operation Area of Shengli Oilfield, covering the Ed1 to Ed4 sections. Based on the logging interpretation results and following the Chinese petroleum industry's classification standard for clastic grain size [51], the lithology labels for the dataset are classified into the following five classes: (1) mudstone, (2) siltstone, (3) fine sandstone, (4) coarse sandstone, and (5) pebbled sandstone. The dataset includes seven well logging curves for lithology discrimination:
① GR (gamma ray): The gamma ray logging curve measures the intensity of gamma rays in the formation, is usually related to the content of radioactive materials in the rock, and can help identify lithologies.
② CAL (caliper): The caliper logging curve measures the diameter of the borehole. Changes in the borehole diameter can reflect the stability of the borehole and possible changes in geological structure.
③ RD (deep investigation dual laterolog resistivity): RD measures formation resistivity at a greater depth of investigation away from the borehole, which helps to identify the type and content of fluids in the deeper part of the formation.
④ RS (shallow investigation dual laterolog resistivity): The detection depth of RS is shallower than that of RD. It is usually used to evaluate the resistivity of strata near the borehole and plays an important role in identifying the distribution of fluids near the borehole.
⑤ AC (acoustic log): The acoustic logging curve measures the propagation time of sound waves in the formation and is used to estimate the porosity and fluid properties of the formation.
⑥ CNL (compensated neutron log): The compensated neutron logging curve measures neutron absorption in the formation and is related to the porosity and fluid type of the formation.
⑦ DEN (density log): The density logging curve measures the density of the formation and can reflect the porosity and mineral composition of the rock.

Data Preprocessing
The missing values in the original data were filled based on the core data and mud logging information. For dataset cleaning, we calculated the Z score for all samples in the original well logging data. We identified data points with a Z score outside the range [−3, 3] as outliers and removed them. This process resulted in a cleaned well logging dataset of 28,763 samples. The original dataset comprised five lithology labels: M for mudstone, S for siltstone, FS for fine sandstone, CS for coarse sandstone, and PS for pebbled sandstone. Table 1 displays the statistical information for each lithology in the original dataset. The M label has the highest number of samples (13,307), followed by the FS label (6830). The numbers of samples for S (4716) and PS (2855) are lower, while the CS label has the fewest samples (1055). Based on the logging curves of well CB32, which served as the reference well, the well logging data for each well were standardized to eliminate the influence of environmental factors and logging instrument variations on data quality. By considering the maximum and minimum values of each logging curve from CB32 after cleaning, optimal estimates for standardization were set. Equation (2) was then applied to standardize the data from the other wells, resulting in the detailed information for the seven curves shown in Table 2.
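The degree of imbalance follows directly from the class counts reported above. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Sample counts per lithology label as reported for the cleaned dataset.
class_counts = {"M": 13307, "FS": 6830, "S": 4716, "PS": 2855, "CS": 1055}

total = sum(class_counts.values())
# Share of each class in the cleaned dataset.
shares = {label: n / total for label, n in class_counts.items()}
# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
```

The counts sum to the 28,763 samples of the cleaned dataset, and the mudstone class is roughly 12.6 times larger than the coarse sandstone class, which is the gap the enhanced sampling step is meant to close.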

Data Balancing: Application of Enhanced Sampling Methods
In reservoir exploration, lithologies such as siltstone (S), fine sandstone (FS), and coarse sandstone (CS) represent crucial oil-bearing reservoirs. However, based on the preprocessing results above, these three lithology types together account for less than half of the dataset (12,601 of 28,763 samples). To obtain more comprehensive information at the data level and train models that predict these reservoirs more accurately, it is essential to synthesize abundant and reliable simulated data for these lithology classes through data balancing methods.

Over-Sampling Results
After applying the SMOTE method to the original dataset, the number of samples for each label was balanced to 10,666 (Figure 6). To examine the impact of the data balancing results on the consistency of sample distribution, principal component analysis (PCA) was used to visualize the main features before and after balancing, and frequency density curves of the two principal components were plotted (Figure 7b,e). SMOTE increased the sample capacity of the dataset while essentially maintaining the original distribution shape of the samples along the principal components.
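The interpolation at the core of SMOTE can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `smote`, the default `k=5`, and the uniform choice of base sample are all illustrative. Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region already occupied by that class, which is consistent with the preserved distribution shape reported above.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new points for one minority class.

    Each new point is placed at a random position on the segment between a
    minority sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

To balance a class up to a target size, `n_new` would be the target count minus the current count of that class.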

After applying the ADASYN over-sampling algorithm to the dataset, the number of samples per label ranges from 10,411 (label FS) to 10,733 (label PS). Through ADASYN over-sampling, the sample count of each class approaches that of the original majority class (label M). The frequency distribution of the principal components (Figure 7c,f) shows that ADASYN maintains the consistency of the distribution between the over-sampled dataset and the original dataset.
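ADASYN differs from SMOTE mainly in how it decides where to synthesize: minority samples whose neighbourhoods contain more samples of other classes seed more synthetic points. The sketch below shows only this allocation step, under illustrative assumptions (the function name `adasyn_allocation`, `k=5`, and the even split used when no minority sample borders another class). The density weighting and rounding are one plausible reason why the per-class counts after ADASYN are close to, but not exactly equal to, each other.

```python
def adasyn_allocation(X, y, minority_label, n_total, k=5):
    """Return {sample index: number of synthetic points it should seed}."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    ratios = []
    for i in minority_idx:
        # k nearest neighbours of X[i] among all other samples, any class.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )[:k]
        # Share of those neighbours that belong to a different class.
        ratios.append(sum(y[j] != minority_label for j in neighbours) / k)
    total = sum(ratios)
    if total == 0:
        # No minority sample borders another class: spread evenly.
        return {i: n_total // len(minority_idx) for i in minority_idx}
    return {i: round(n_total * r / total) for i, r in zip(minority_idx, ratios)}
```

The synthetic points themselves would then be generated by the same segment interpolation that SMOTE uses.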

Under-Sampling Results
Using the ENN under-sampling algorithm to filter out confusing samples from the original dataset does not control the balance of sample quantities for each class. Therefore, the dataset processed by ENN remains unbalanced, with little change in the relative quantities and proportions of samples for each class compared to the original dataset (Figure 7g,j). The ENN algorithm can remove redundant majority class samples and construct a more lightweight and reliable well logging dataset without losing critical information. However, the under-sampling algorithm does not increase the quantity of minority class samples, resulting in limited improvement in the predictive performance for the relevant lithologies.
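The ENN filtering rule can be sketched as follows. This is a minimal illustration that assumes the rule is applied to every class and that `k=3` neighbours are used; the study may restrict removal to particular classes or use a different `k`, and the function name is hypothetical. A sample is kept only when the majority label among its nearest neighbours matches its own label, so the procedure removes ambiguous samples but never adds any, which is why class proportions change little.

```python
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Drop every sample whose label disagrees with its k-neighbour majority."""
    keep = []
    for i in range(len(X)):
        # k nearest neighbours of X[i] among all other samples.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )[:k]
        majority_label = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if majority_label == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```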
In summary, combining the sample filtering capability of the ENN under-sampling algorithm with the data synthesis function of over-sampling algorithms can maximize the enrichment of lithological information content in well logging data. This approach maintains adaptability and flexibility in handling actual well logging datasets, ultimately leading to the training of optimal intelligent discrimination models.

Integrated Balancing Processing Results
The integrated methods SMOTE+ENN and ADASYN+ENN were employed to process the dataset, aiming to control the relative balance of sample quantities for each label (Figure 6). Initially, datasets processed by the SMOTE or ADASYN over-sampling algorithms exhibited rich and fairly balanced sample quantities for each class. Subsequently, ENN filtering removed samples with ambiguous class attributions. At this stage, the sample quantities for each class were comparable but no longer identical, and both datasets reached a relatively balanced state. Comparing the principal component frequency densities shows that both integrated methods maintained the original distribution shape of the samples along the principal components. Specifically, the results of the SMOTE+ENN approach resemble the principal component frequency distribution of the original dataset more closely (Figure 7h,k). These findings demonstrate that integrated methods can uphold the overall consistency of key features in the data distribution, making them applicable in intelligent lithology discrimination models.

Application of Intelligent Discrimination Models
Regarding the division of the dataset, 80% of the original dataset was used for model training and validation, which is the main basis for updating model parameters. This part of the dataset was named the 'training and validation set'. The remaining 20% was named the 'testing set'; it was used only for the final model test and never for updating model parameters. The different data balancing methods were applied to the training and validation set to obtain the corresponding over-sampled or under-sampled datasets.
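The hold-out step can be sketched as below. The source does not state whether the 80/20 split was stratified by lithology, so the per-class split here is an assumption, as is the function name `stratified_holdout`. The point of the sketch is the order of operations: the testing indices are fixed first, and any balancing is applied afterwards to the training and validation indices only, so that no synthetic or filtered sample influences the test scores.

```python
import random
from collections import defaultdict

def stratified_holdout(y, test_fraction=0.2, seed=0):
    """Split sample indices into (train_val, test), class by class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(y):
        by_label[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```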
For the training and validation set, five data balancing methods were used, namely SMOTE, ADASYN, ENN, SMOTE+ENN, and ADASYN+ENN. Together with the training and validation set without data balancing, this gave a total of six datasets, each of which was divided into training and validation subsets in a ratio of 4:1. The 5-fold cross-validation method was employed to enhance the models' generalization ability, and grid search was used to find the optimal parameter combination for each model. To avoid randomness in the results, the training process was repeated 10 times, and the average values were recorded as the final results. Line plots were generated to illustrate the training progress of each model parameter on the different datasets, facilitating comparison of the effects of the balancing methods on model training performance.
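The combination of 5-fold cross-validation and grid search can be sketched as follows. To keep the example self-contained, a small k-nearest-neighbour classifier stands in for the SVM, RF, and GBDT models of the study, and plain accuracy stands in for the evaluation metrics; both substitutions, and all function names, are illustrative assumptions. Each fold serves once as the validation subset while the other four are used for training, which corresponds to the 4:1 ratio described above.

```python
import itertools
import random
from collections import Counter

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def knn_predict(train_X, train_y, x, k):
    """Stand-in classifier: majority label among the k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def cross_val_accuracy(X, y, params, folds):
    """Mean validation accuracy over the folds for one parameter setting."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X, train_y = [X[i] for i in train], [y[i] for i in train]
        hits = sum(knn_predict(train_X, train_y, X[i], params["k"]) == y[i]
                   for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

def grid_search(X, y, grid, n_folds=5):
    """Try every combination in `grid`; return (best_params, best_score)."""
    folds = k_fold_indices(len(X), n_folds)
    best_params, best_score = None, -1.0
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = cross_val_accuracy(X, y, params, folds)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Repeating the whole search with different seeds and averaging the scores would mirror the 10 repetitions described in the text.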

Model Training
The tuned parameters and their optimal values for each model are documented in Table 3. For the SVM model, the tuned parameters are the regularization penalty coefficient C and the coefficient γ of the RBF kernel function. The random forest (RF) and gradient boosting decision tree (GBDT) models both use decision trees as base classifiers, so parameter tuning must attend to tree-structure parameters, the most important of which is the minimum number of samples required to form a leaf node. In addition, the construction of a random forest involves sample and feature selection, which makes the maximum number of features another important parameter. GBDT iteratively updates the model to fit residuals; thus, alongside tree-structure tuning, it was also necessary to control the learning rate of the model iteration.
During each parameter tuning process, the improvement of the model with the ENN-treated dataset was not significant, while over-sampling methods enhanced the training performance of the models. Moreover, the overall training performance of the models on the ADASYN-treated dataset was better than that on the SMOTE dataset, with the training results being the best on the dataset treated by ADASYN+ENN (Figure 8).

Model Testing
To avoid randomness in the test results, the support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT) models were each tested on the test set for 10 runs, and the average Jaccard index and F1 score of each model were obtained (Figure 9). Different methods of handling the datasets have varying effects on different models, with a general trend that the more balanced the sample distribution of the dataset, the higher the model's lithology recognition scores.

Specifically, for the SVM model, the improvement on the balanced datasets is the most significant. However, the highest scores of both evaluation metrics for SVM were lower than those of RF and GBDT, indicating that the overall lithology recognition performance of the other two models is superior. RF is the highest-scoring model, and it also performs better on more balanced datasets; balancing further unlocks the potential of the RF model. The GBDT model does not perform as well as RF, but its performance also improves on balanced datasets.
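The two evaluation metrics can be computed per lithology class from the counts of true positives (TP), false positives (FP), and false negatives (FN): the Jaccard index is TP / (TP + FP + FN) and the F1 score is 2TP / (2TP + FP + FN). The sketch below uses these standard definitions; averaging the per-class values with equal weight (macro averaging) is an assumption, since the averaging scheme is not stated here, and the function names are hypothetical.

```python
def jaccard_and_f1(y_true, y_pred, label):
    """Return (Jaccard index, F1 score) for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp + fp + fn == 0:
        return 0.0, 0.0  # the class never occurs in either sequence
    return tp / (tp + fp + fn), 2 * tp / (2 * tp + fp + fn)

def macro_scores(y_true, y_pred):
    """Equal-weight average of the per-class scores over the true labels."""
    labels = sorted(set(y_true))
    pairs = [jaccard_and_f1(y_true, y_pred, label) for label in labels]
    n = len(pairs)
    return sum(j for j, _ in pairs) / n, sum(f for _, f in pairs) / n
```

For a single class the two metrics are linked by J = F1 / (2 − F1), so they always rank models in the same order on that class, although their macro averages can differ in how strongly they penalize a poorly predicted minority class.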

Optimal Combination of Algorithms and Models
The average values of the two evaluation metrics for predicting all five lithology classes using the different balanced datasets were recorded over 10 test runs for each model, and the corresponding box plots were drawn (Figure 10). Based on the results of all models on the six training datasets, the following important findings were observed: (1) Models performed poorly on the original dataset and the ENN under-sampled dataset, showing a large range and low average scores. This is because the original dataset has very few samples for the S and CS labels, and the ENN under-sampling algorithm further exacerbates the imbalance by removing samples. Accuracy on the ENN dataset therefore did not improve significantly and, in some cases, even fell below that on the original dataset; (2) ADASYN and SMOTE are over-sampling algorithms for classes with fewer samples. They generally have a positive impact on the lithology recognition models, primarily reflected in a higher median accuracy, and the improved results are relatively consistent across models; (3) The comprehensive balancing method ADASYN+ENN has significant advantages over all of the previous results, with the highest average scores, the smallest range, and relatively stable overall results. These findings hold for all three models, indicating that the impact of data balancing algorithms on models has a certain degree of universality.
The random forest model demonstrated the best overall performance, and it was most stable on the dataset processed using the ADASYN+ENN method. Therefore, ADASYN+ENN+RF is the best method combination in this study.
According to the test scores of the two metrics for predicting all five lithology classes on the test set using random forest (Table 4), the model trained on the original imbalanced dataset predicts the S and CS lithology classes relatively poorly, while it achieves higher scores on the more abundant M and FS classes. Random forest, when trained on the balanced dataset obtained through the ADASYN+ENN method, learned from the added synthetic samples, leading to a significant improvement in prediction accuracy for the originally under-represented S and CS classes. The prediction performance for the PS class also benefited from the ADASYN+ENN method, with scores showing some improvement. For the M and FS classes, the sampling method has almost no negative impact, and the model's prediction scores remain at a high level.
Figure 11 visualizes the lithology discrimination performance of the random forest model on the test data. The figure includes seven well log curves and three bar charts: lithology labels from mud logging and core data, predictions from the random forest model trained on the original dataset, and predictions from the random forest model trained on the dataset balanced using the ADASYN+ENN method. From a global perspective, the results shown in "No Balancing" exhibit more noise, while the results shown in "ADASYN+ENN" demonstrate a noticeable noise reduction. This reflects the significant role of sample balancing methods in improving the model's lithology discrimination result. Further observation reveals that in the "No Balancing" results, noise is mainly concentrated in the siltstone (S) and coarse sandstone (CS) reservoirs. This is because the original data lack samples for these classes, leading to unstable prediction results. After balancing, the model's predictions for these classes become more stable and less noisy, and therefore more accurate.

The Effectiveness of Enhanced Sampling Algorithms
When using well logging data to identify important oil-bearing reservoirs represented by sandstone, the corresponding number of samples is often far smaller than that of non-oil-bearing formations represented by shale. The direct purpose of adopting enhanced sampling methods in this study is to synthesize simulated data for the lithology labels corresponding to oil-bearing reservoirs in order to increase the sample capacity of the dataset.
From the results of the dataset processing with enhanced sampling methods (Figure 6), the ADASYN method synthesized a large number of simulated samples, thereby enriching the data for the high-quality reservoir classes. Subsequently, the ENN algorithm filtered both the real logging data and the simulated samples, ensuring the reliability of the samples in the balanced dataset to a certain extent. The feature distributions before and after data balancing (Figure 7) show that the balanced dataset maintains the consistency of the overall distribution of features and samples. The results of data balancing therefore meet expectations and enrich the logging data samples of high-quality oil-bearing reservoirs.

Improvement of Lithology Discrimination Results
The enhanced sampling methods significantly improve the lithology recognition performance of each model. According to the results shown in Figure 10, the improvement from balancing is most significant for the SVM model. However, due to the limitations of the model itself, the lithology prediction score of SVM is consistently lower than that of random forest and GBDT. The random forest model performs the best after enhancement, especially in predicting siltstone and coarse sandstone, for which the original data are scarce. The improvement in the GBDT model is relatively small, and the effect of the balanced dataset on GBDT is limited.
Compared with the other data sampling methods, the proposed ADASYN+ENN method yields the most significant improvement in the models. From the results shown in Figure 7, the comprehensive method processes the dataset more effectively than the ADASYN or SMOTE over-sampling algorithms alone. Over-sampling algorithms cannot determine whether the baseline samples used to synthesize simulated data are themselves errors generated during data collection, so they may increase noise in the dataset after synthesizing a certain number of samples. The key role of the ENN algorithm is to detect and remove potentially erroneous samples from the dataset. Therefore, the dataset processed by the comprehensive method combining ADASYN and ENN is more reliable.

Analysis of the Lithology Discrimination Effectiveness of the Optimal Method Combination
From the perspective of exploration and development applications, the value of machine learning models in lithology discrimination mainly depends on whether the model can accurately identify important reservoirs represented by various types of sandstone. A model with practical value must do more than maintain high accuracy on lithology labels with low oil content, such as mudstone. According to Figure 11, after supplementing simulated samples with the ADASYN algorithm and removing erroneous samples with the ENN algorithm, the random forest model learned the characteristics of siltstone and coarse sandstone more comprehensively, resulting in more accurate predictions for these lithology classes. In summary, after balancing the data, the discrimination accuracy of labels with fewer original samples improved significantly. Data balancing algorithms have a significant positive impact on these classes while maintaining high discrimination accuracy for classes with more samples in the original dataset.

Conclusions
Based on the need to address the relative scarcity of high-quality reservoir logging data in practical production, we propose a workflow that combines enhanced sampling algorithms with machine learning models to achieve efficient and accurate lithology discrimination from well logging data.Applying a data synthesis balancing method that combines the ADASYN over-sampling algorithm with the ENN under-sampling algorithm to the problem of intelligent lithology discrimination can assist machine learning models in accurately predicting and identifying reservoirs with insufficient logging curve samples, thereby providing reliable references for production practices such as reservoir interpretation.
The experimental results demonstrate the effectiveness of the proposed method. By comparing the changes in the dataset before and after data balancing, it was found that the comprehensive balancing method better maintains the original data's feature distribution compared to other data over-sampling or under-sampling algorithms, thereby increasing the number of learnable samples for high-quality reservoir lithology classes. Through testing and comparing the lithology discrimination performance of the SVM, RF, and GBDT models on the comprehensive balanced dataset and the original imbalanced dataset, evaluation metrics such as the Jaccard index and F1 score indicate that the comprehensive balancing method effectively improves the lithology discrimination performance of various machine learning models. Among them, the RF model performs the best and can accurately identify lithology in oil-bearing reservoirs.
The enhanced sampling algorithms used in this study demonstrate strong versatility, as they not only enhance the performance of the well logging lithology discrimination model but also provide important methodological references for identifying shale reservoirs and predicting sweet spots and other related tasks.Future research should conduct a more comprehensive long-term evaluation of this workflow with more data and time resources.

Figure 1. Schematic diagram of logging lithology discrimination with enhanced sampling methods.

Figure 2. The workflow of the SMOTE algorithm.

Figure 3. The workflow of the ADASYN algorithm.

Figure 4. The workflow of the ENN algorithm.
Appl. Sci. 2024, 14
overfitting. During model training, parameters related to the kernel function and soft margin have the most direct impact on model performance.

Figure 6. The number of samples before and after data balancing processing.

Figure 10. Box plots of the evaluation metrics for models on datasets processed with different data-balancing methods.

Figure 11. Visualization of lithology discrimination results of RF.

Table 1. Distribution of samples for each lithology label in the dataset.

Table 2. Description of standardized well logging curves.

Table 3. The tuned parameters for SVM, RF, and GBDT models trained on different datasets.

Table 4. Comparison of lithology discrimination scores for random forest model: original training set vs. ADASYN+ENN balanced dataset.