Article

Optimization and Application of XGBoost Logging Prediction Model for Porosity and Permeability Based on K-means Method

1
Institute of Petroleum Engineering, Xi’an Shiyou University, Xi’an 710016, China
2
PetroChina Research Institute of Petroleum Exploration and Development, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 3956; https://doi.org/10.3390/app14103956
Submission received: 29 February 2024 / Revised: 29 April 2024 / Accepted: 4 May 2024 / Published: 7 May 2024
(This article belongs to the Special Issue Applications of Machine Learning in Earth Sciences—2nd Edition)

Abstract
The prediction and distribution of reservoir porosity and permeability are of paramount importance for the exploration and development of regional oil and gas resources, so identifying the most effective prediction methods is necessary to better guide gas field development. Based on the extreme gradient boosting (XGBoost) algorithm, laboratory test data of the porosity and permeability of cores from the southern margin of the Ordos Basin were selected as the target labels, conventional logging curves were used as the input feature variables, and the mean absolute error (MAE) and the coefficient of determination (R2) were used as the evaluation indicators. Following the selection of the optimal feature variables and optimization of the hyper-parameters, an XGBoost porosity and permeability prediction model was established. Subsequently, homogeneous clustering (K-means) data preprocessing was innovatively applied to enhance the XGBoost model's performance. The results show that logarithmically preprocessed (LOG(PERM)) target labels enhanced the performance of the XGBoost permeability prediction model, increasing its test set R2 by 0.26. Furthermore, the application of K-means improved the performance of the XGBoost prediction model, increasing the model's R2 by 0.15 and decreasing the MAE by 0.017. Finally, the POR_0/POR_1 grouped model was selected as the final predictive model for porosity in the study area, and the Arctan(PERM)_0/Arctan(PERM)_1 grouped model was selected as the final predictive model for permeability; both achieve better prediction accuracy than conventional logging-curve interpretation. The combination of K-means and the XGBoost modeling method provides a new approach and reference for the efficient and relatively accurate evaluation of porosity and permeability in the study area.

1. Introduction

In the oil and gas industry, the porosity (POR) parameter characterizes the reservoir’s storage capacity, and the permeability (PERM) characterizes the reservoir’s ability to pass fluids. Porosity and permeability are important indicators for the evaluation of the reservoir during the development and production of oil and gas fields [1,2]. Consequently, the prediction and distribution of reservoir porosity and permeability are crucial for the exploration and development of regional oil and gas resources. The determination methods for porosity and permeability of unconventional oil and gas reservoirs can be broadly classified into two categories: laboratory measurement and indirect interpretation.
The porosity experimental measurement method is selected based on the main distribution range of porosity in regional core samples. In sandstone reservoirs dominated by micropores (<2 nm), some scholars have applied the low-temperature carbon dioxide (CO2) adsorption method to measure porosity. This approach is based on the condensation of carbon dioxide in capillaries under critical conditions, which enables estimation of the pore size distribution range of the core sample, and it characterizes micropores more accurately than other techniques [3]. In sandstone reservoirs dominated by mesopores (2–50 nm), the low-temperature nitrogen (N2) adsorption method is the more common approach for porosity determination [4,5,6,7]. Some scholars have applied constant-rate mercury injection technology to measure porosity in tight porous sandstone reservoirs; this method has the unique advantage of an extremely low and constant mercury injection rate [8,9], and the variation of the injection rate with pressure supports further research on the pore structure parameters of reservoir rocks [10]. Cheng et al. (2020) and Andrew et al. (2012) [11,12] applied nuclear magnetic resonance to characterize pore diameters from 2 nm to 1000 nm. Processing and interpreting relaxation time data and indirectly converting them to porosity values is non-destructive to the core, fast, accurate, and operationally practical, making the method one of the important tools for reservoir evaluation and exploration engineering in oil and gas reservoirs. Because the measurable porosity range varies between methods, no single method can fully capture the pore characteristics of a reservoir, and multiple methods must be combined to quantitatively characterize reservoir porosity.
In contrast, laboratory permeability measurements are mainly conducted by gas measurement: according to Darcy's law, once the gas flow reaches a steady state, the permeability of the rock core is calculated, and the test results are relatively accurate [13,14]. At oil and gas production sites, permeability is mainly obtained through methods such as well testing and production testing, which provide true formation permeability but at a relatively higher cost [15,16]. In summary, these experimental methods for determining porosity and permeability are expensive and yield a limited number of test points, which is not conducive to analyzing the lateral and vertical porosity and permeability characteristics of regional reservoirs.
Indirect porosity interpretation requires a combination of multiple logging curves, such as acoustic time-difference logging (AC) and neutron logging (CNL), corrected with laboratory porosity test data to obtain a more accurate representation of logging porosity; permeability is then calculated from the indirect porosity by analyzing the pore–permeability relationships of the core, yielding a regional vertical and planar logging evaluation of porosity and permeability [17,18,19,20]. In addition, many scholars have analyzed the linear or nonlinear relationships between experimentally measured porosity and permeability data and geophysical logging parameters to realize multi-angle, multi-method quantitative prediction of formation porosity and permeability parameters, which can then be applied to the actual production process of oilfields [21,22,23,24]. Among these approaches, machine learning, with its sophisticated algorithms, feature engineering, and unique advantage in processing large-scale data to discover trends and correlation patterns, has achieved numerous applications in the oil and gas industry. Andrei Erofeev et al. (2018) comparatively investigated the applicability of support vector machine, linear regression, and neural network methods in predicting rock properties, which showed good predictive and generalization abilities [25]. Yile Ao et al. (2019) compared the predictive ability of eight algorithms in predicting formation property parameters, such as porosity and permeability, and found that linear random forest showed clear superiority [26]. Daniel Asante Otchere et al. (2021) concluded that when the dataset is limited, SVM has better reservoir prediction ability than artificial neural networks [27].
In view of this, random forest regression, support vector machine, and ensemble learning algorithms, combined with a variety of hyper-parameter optimization methods as well as neural network-based and deep learning methods, have achieved significant results in predicting porosity and permeability [28,29,30,31]. Compared with traditional linear relationship models, machine learning methods can better solve complex multidimensional nonlinear problems, have strong fault tolerance and reliability, and provide new ideas for predicting porosity and permeability. Accordingly, optimization methods and model selection are an indispensable part of modeling.
The XGBoost algorithm has demonstrated superior performance in prediction and optimization and has been utilized across numerous applications, where it consistently outperformed other existing algorithms [32,33,34]. This paper combined logging data with core porosity and permeability data from the Qingyang Gas Field in the Ordos Basin, trained on the experimental test data, established an XGBoost porosity and permeability prediction model, and analyzed its applicability to the study area. The homogeneous clustering (K-means) method was then innovatively applied to optimize the XGBoost models and to comprehensively determine the XGBoost porosity and permeability prediction models suited to the stratigraphy of the study area. Finally, the final prediction models were applied in plan view to analyze the distribution of porosity and permeability, providing a reference basis for the exploration and development of favorable zones.

2. Methodology

2.1. Overview of Regional Geology

The Ordos Basin is rich in tight gas resources, and the porosity and permeability of its reservoirs vary widely across different regions of the basin [35]. The Qingyang Gas Field lies at the southwestern edge of the basin and spans two first-order regional tectonic units, the Yishan Slope and the Tianhuan Depression. The tectonic subdivision is shown in Figure 1a: the field is located in the southern part of the Ordos Basin, is influenced by the fault zone along the basin's western margin, and is overall a westward-dipping monocline. The reservoir is dominated by deltaic deposits characterized by deep burial, thin sands, rapid vertical changes, and tight, low permeability. Therefore, further analysis of the porosity and permeability of these deep, thin, low-permeability sandstone reservoirs is an important reference for the exploration and development of the Qingyang Gas Field and the evaluation of favorable zones.
The Qingyang Gas Field developed sedimentary strata, from bottom to top, in the Middle–Late Proterozoic, Paleozoic, Mesozoic, and Cenozoic, with the Upper Paleozoic Permian Shihezi Formation and Shanxi Formation being the main target strata for gas exploration and development. The region has a single target stratigraphic system vertically and a large distribution area in plan view. After uplift and denudation in the late Early Paleozoic, the study area began to subside in the Carboniferous, and the Upper Paleozoic stratigraphy consists, from bottom to top, of the Carboniferous Benxi Formation, the Permian Taiyuan Formation, the Shanxi Formation, the Lower Shihezi Formation, the Upper Shihezi Formation, and the Shiqianfeng Formation, as shown in Figure 1c [36,37]. Among them, the H8 section of the Lower Shihezi Formation and the S1 section of the Shanxi Formation are the main gas-bearing segments in the study area. The sandstone at the bottom boundary of H8 is “camel neck” shaped and topped with high-gamma mudstone. The third-order sequence of the H8 section can be divided into two fourth-order sequences according to the sequence cycles. The sandstone in the lower part of H8 is dominated by grayish-white and light-gray coarse sandstone and conglomerate-bearing coarse sandstone, with horizontal bedding developed and fissures in part of the sandstone; the pore types are dominated by intergranular dissolution pores and lithic dissolution pores; and a braided-river delta sedimentary system mainly developed during the H8 stage. The S1 section is dominated by gray medium-coarse sandstone and conglomerate-bearing coarse sandstone, with parallel, wedge, and oblique bedding developed; the pore type is dominated by intergranular dissolution pores; and the S1 stage is mainly dominated by a meandering-river delta sedimentary system. The single-well lithological profile is shown in Figure 1d [38,39].

2.2. Method Process

Applying the laboratory porosity and permeability test data of cores and the geophysical logging data, an XGBoost regression prediction model of porosity and permeability was established based on the extreme gradient boosting tree (XGBoost) method in machine learning and then optimized with the K-means method; the model was applied to predict porosity and permeability in the study area both vertically and in plan view. The main steps were as follows, and the flowchart is shown in Figure 2:
  • Data preparation;
  • XGBoost model establishment: (1) Logging feature and label extraction: feature variables suitable for machine learning methods were selected and extracted from the raw data to establish the dataset; (2) Dataset division: the dataset was randomly divided, with 80% as the training set and the remaining 20% as the test set; (3) Feature combination optimization: an exhaustive search over feature variable combinations was used to weaken multicollinearity between feature variables and to select the combination that best represents the inputs to the XGBoost model; (4) Hyper-parameter optimization: random search was used to cover a wide hyper-parameter range, combined with grid search and manual adjustment to achieve a fine search and improve the stability and performance of the model; (5) Model establishment: the porosity and permeability XGBoost models were established through the feature combination and hyper-parameter optimization steps; (6) Model evaluation: the model performance was evaluated based on MAE and R2;
  • K-means optimization model: Based on K-means clustering, similar input data were grouped to minimize intra-cluster differences and maximize inter-cluster differences, and grouped porosity and permeability XGBoost prediction models were established and evaluated;
  • Determine the model: The performance of the models from the previous two steps was compared, and the porosity and permeability prediction models best suited to the study area were comprehensively determined;
  • Model application: The final models were applied to the S13 layer in the study area to analyze the planar porosity and permeability distribution.
Figure 2. Flow chart of XGBoost prediction model for porosity and permeability.

2.3. Data Description

The core porosity and permeability experimental test data were used as the target labels. The cored intervals are the H8 and S1 sections in the study area, comprising 94 wells and 3389 experimental points at logging depths of 3441.43~5119.06 m. We averaged the curve values at the same depths as the single-well test points and removed some outliers; the distributions of RLLD, RLLS, and RT were more discrete, as shown in Table 1. The distribution of wells is shown in Figure 1b. The laboratory-determined porosity of the H8 section was 5.0~10.0% with an average of 7.0%, and its permeability was 0.1~1.0 × 10−3 μm2 with an average of 0.6 × 10−3 μm2; the laboratory-determined porosity of the S1 section was 4.4~10.0% with an average of 6.4%, and its permeability was 0.1~1.2 × 10−3 μm2 with an average of 0.51 × 10−3 μm2.
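The depth-averaging and outlier-screening step described above can be sketched as follows. The well names, depths, curve values, and the z-score threshold are all illustrative assumptions, not the study's actual data or screening criteria.

```python
import pandas as pd

# Hypothetical core-test/logging records: columns (WELL, DEPTH, GR, POR)
# are illustrative only; 900.0 is a planted outlier.
df = pd.DataFrame({
    "WELL":  ["M1", "M1", "M1", "M2", "M2"],
    "DEPTH": [3500.0, 3500.0, 3501.0, 3600.0, 3600.0],
    "GR":    [60.0, 62.0, 58.0, 900.0, 70.0],
    "POR":   [7.1, 7.3, 6.8, 6.0, 6.2],
})

# Average curve readings taken at identical single-well depths.
avg = df.groupby(["WELL", "DEPTH"], as_index=False).mean()

# Simple z-score style outlier screen on one curve (threshold k is arbitrary).
def drop_outliers(frame, col, k=1.0):
    mu, sd = frame[col].mean(), frame[col].std()
    return frame[(frame[col] - mu).abs() <= k * sd]

clean = drop_outliers(avg, "GR")
print(len(avg), len(clean))
```

With the planted outlier, one of the three depth-averaged rows is screened out.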
The maximum–minimum normalization method was applied to the input feature variables and target label values to unify the data range from 0 to 1. This eliminates the influence of the units of measure of different logs and improves the performance of the model operation, as shown in Equation (1).
x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}
where:
x_i is the input feature variable data;
x_i' is the normalized feature variable;
x_{\max} is the maximum value of the feature variable data;
x_{\min} is the minimum value of the feature variable data.
When the order-of-magnitude differences in the data are large, transforming the data with a logarithmic function effectively compresses the data magnitude and improves the stability of the modeling, as shown in Equation (2).
x_i' = \log x_i
Normalization of the data can also be achieved using the arctangent function, as shown in Equation (3); note that mapping onto the interval [0, 1] requires all data to be greater than or equal to 0, since data less than 0 are mapped to the interval [−1, 0).
x_i' = \frac{2}{\pi} \arctan x_i
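The three preprocessing transforms, Equations (1)–(3), can be sketched in a few lines. The log base (base 10) and the sample values are assumptions for illustration only.

```python
import math

def min_max(values):
    """Max-min normalization, Eq. (1): maps data onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Logarithmic preprocessing, Eq. (2); base 10 is an assumption."""
    return [math.log10(v) for v in values]   # requires v > 0

def arctan_transform(values):
    """Arctangent preprocessing, Eq. (3): maps v >= 0 into [0, 1)."""
    return [2.0 * math.atan(v) / math.pi for v in values]

perm = [0.1, 0.5, 1.0, 10.0]   # e.g. permeability values, made up
print(min_max(perm))
print(arctan_transform([0.0, 1.0]))
```

Note that `arctan_transform(1.0)` gives 0.5, so small permeability values occupy most of the output range, which is the point of the transform.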

Target Labels and Feature Variables

By organizing the logging and recording data in the study area, the target labels were the porosity and permeability data tested in the laboratory, and the feature variables were derived from the average of the logging series data at the same depth as the target labels. After applying max–min normalization to the feature variables and target labels, the medians of the normalized data for acoustic time-difference logging (AC), compensated neutron logging (CNL), natural gamma logging (GR), shallow resistivity logging (RLLS), deep resistivity logging (RLLD), resistivity logging (RT), thorium logging (Th), uranium logging (U), permeability logging (PERM_L), arctangent-preprocessed permeability (Arctan(PERM)), and measured permeability (PERM) are distributed between 0 and 0.2; the medians for caliper logging (CAL), potassium logging (K), porosity logging (POR_L), measured porosity (POR), and logarithmically preprocessed permeability (LOG(PERM)) range from 0.2 to 0.4; and the medians for photoelectric factor logging (PE), density logging (DEN), and spontaneous potential logging (SP) range from 0.4 to 0.8, as shown in Figure 3. A total of 80% (2711) of the feature variable and target label dataset was used as the training set and 20% (678) as the test set for training the predictive models.
The Pearson correlation coefficients between the feature variables and the target labels are shown in Figure 4. The CNL, DEN, GR, PE, RLLS, RLLD, RT, SP, K, Th, and U logging curves are negatively correlated with the target labels POR and PERM, whereas the AC, CAL, POR_L, and PERM_L curves are positively correlated with them. When feature variables are strongly correlated with one another, as with RLLS, RLLD, and RT, only one of them should be selected to avoid multicollinearity. Feature variables that are strongly correlated with the target label, and can therefore better represent it, are retained. The most suitable combination of feature variables for the XGBoost algorithm is then selected through feature combination search.
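The multicollinearity screen described above can be illustrated with synthetic data; the curve names mirror the paper's, but the values and thresholds below are made up.

```python
import numpy as np

# Synthetic stand-ins for logging curves; RLLS is built to be nearly
# collinear with RLLD, and POR depends on AC and RLLD plus noise.
rng = np.random.default_rng(0)
ac = rng.normal(size=200)
rlld = rng.normal(size=200)
rlls = rlld + 0.01 * rng.normal(size=200)     # near-duplicate of RLLD
por = 0.8 * ac - 0.3 * rlld + 0.1 * rng.normal(size=200)

def pearson(a, b):
    """Pearson correlation coefficient via numpy."""
    return float(np.corrcoef(a, b)[0, 1])

# RLLS and RLLD are strongly inter-correlated: keep only one of them.
print(pearson(rlls, rlld))
# Curves correlated with the target are candidate features.
print(pearson(ac, por), pearson(rlld, por))
```

In the real workflow these coefficients would be computed against the measured POR/PERM labels, then the exhaustive combination search would decide the final feature set.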

2.4. Evaluation Indicators

This paper employed the mean absolute error (MAE) and the R-squared score (R2) to assess the performance of the porosity and permeability models based on the XGBoost algorithm. The objective was to guide the selection of the feature variable combinations and the hyper-parameters. Additionally, the impact of the K-means method on the XGBoost porosity and permeability models was evaluated.
The mean absolute error (MAE) is the average of the absolute differences between the predicted and true values, as shown in Equation (4). Smaller values indicate a more accurate prediction.
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|
The R-squared score (R2), also known as the coefficient of determination, measures the extent to which the model explains the variability of the dependent variable, as shown in Equation (5). R2 takes values between 0 and 1: the closer R2 is to 1, the better the fit between the predicted and true values; the closer R2 is to 0, the worse the fit.
R^2 = 1 - \frac{\sum_{i} \left( y_i - \hat{y}_i \right)^2}{\sum_{i} \left( y_i - \bar{y} \right)^2}
In the above equations, the variables are defined as follows:
y_i is the actual data value;
\bar{y} is the mean of the actual data;
\hat{y}_i is the regression-predicted value.
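The two evaluation indicators can be implemented directly from Equations (4) and (5); the sample values below are illustrative only.

```python
def mae(y_true, y_pred):
    """Mean absolute error, Eq. (4)."""
    n = len(y_true)
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n

def r2_score(y_true, y_pred):
    """Coefficient of determination, Eq. (5)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up porosity values (%) and predictions:
y_true = [7.0, 6.5, 8.2, 5.9]
y_pred = [6.8, 6.7, 8.0, 6.0]
print(mae(y_true, y_pred))       # 0.175
print(r2_score(y_true, y_pred))  # close to 1 for a good fit
```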

2.5. Principle of Extreme Gradient Boosting Tree

Tianqi Chen designed the extreme gradient boosting tree (XGBoost); the core of the XGBoost algorithm is the integration of multiple weak learners into a single strong learner by progressively optimizing the loss function [40,41]. Each decision tree is trained on the outputs of its predecessor trees, and the predictions of all decision trees are summed to give the final prediction. Assume the model consists of t decision trees and the training dataset is D = \{(x_i, y_i)\}, i = 1, 2, \ldots, n, with x_i \in \mathbb{R}^J and y_i \in \mathbb{R}, containing n samples and J features, where x_i denotes the input feature vector, y_i denotes the label, and \hat{y}_i is the predicted value for the i-th sample. The model objective function is Equation (6):
obj^{(t)} = \sum_{i=1}^{n} L\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{j=1}^{t} \Omega(f_j),
The first term of the objective function is the loss of the strong learner's predictions with respect to the target values; the second term is the total complexity of the t weak learners that make up the strong learner.
The optimization objective of the model is to minimize the loss function until the predefined stopping condition is reached. This objective is expressed in Equation (7):
\arg\min_{f} \; \sum_{i=1}^{n} L\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{j=1}^{t} \Omega(f_j),
If the first t − 1 regression trees have already been trained, the objective function at the t-th iteration becomes Equation (8):
obj^{(t)} = \sum_{i=1}^{n} L\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \sum_{j=1}^{t-1} \Omega(f_j) + \Omega(f_t),
Based on the second-order Taylor expansion, Equation (9), the objective function can be transformed into Equation (10):
f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2
obj^{(t)} \approx \sum_{i=1}^{n} \Big[ L\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \sum_{j=1}^{t-1} \Omega(f_j) + \Omega(f_t)
where g_i and h_i are the first-order derivative, Equation (11), and the second-order derivative, Equation (12), of the loss function, respectively:
g_i = \frac{\partial L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}}
h_i = \frac{\partial^2 L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}
The complexity of each weak learner in XGBoost is determined by two factors, \gamma T and \frac{1}{2}\lambda \sum_j w_j^2, as indicated in Equation (13). T denotes the number of leaf nodes in a tree; in general, the more leaf nodes, the more complex the tree model. w_j denotes the weight of leaf node j and penalizes the magnitude of the weak learner's predicted values. \gamma and \lambda are hyper-parameters of the model:
\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
Dropping the terms that are constant with respect to f_t, the objective function becomes Equation (14):
obj^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + C
Rewriting the sum over samples as a sum over the T leaf nodes gives the final objective function, Equation (15):
obj^{(t)} \approx \sum_{j=1}^{T} \Big[ G_j w_j + \frac{1}{2} \big( H_j + \lambda \big) w_j^2 \Big] + \gamma T
where G_j and H_j are given by Equations (16) and (17), with I_j the set of samples assigned to leaf j:
G_j = \sum_{i \in I_j} g_i
H_j = \sum_{i \in I_j} h_i
Setting the derivative of this quadratic in the leaf weight w_j to zero yields the optimal value of w_j, as shown in Equation (18):
w_j^{*} = -\frac{G_j}{H_j + \lambda}
In the above equation, the variables are defined as follows:
t is the t-th regression tree;
Ω is the complexity of the regression tree to prevent overfitting;
L is the loss function.
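As a small illustration of Equation (18): under squared-error loss, g_i = −2(y_i − ŷ_i) and h_i = 2, so the optimal leaf weight is a regularized mean of the residuals falling in the leaf. The residual values and λ below are arbitrary, chosen only to show the shrinkage effect.

```python
def leaf_weight(residuals, lam=1.0):
    """Optimal leaf weight w* = -G/(H + lam) under squared-error loss."""
    g_sum = sum(-2.0 * r for r in residuals)  # g_i = -2 * (y_i - y_hat_i)
    h_sum = 2.0 * len(residuals)              # h_i = 2
    return -g_sum / (h_sum + lam)

res = [0.5, 0.7, 0.6]              # residuals in one leaf (made up)
print(leaf_weight(res, lam=0.0))   # unregularized: the plain mean, 0.6
print(leaf_weight(res, lam=1.0))   # shrunk toward zero by lambda
```

This is why λ acts as a regularizer: larger λ pulls every leaf's prediction toward zero, limiting the contribution of any single tree.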

2.6. Principles of the K-means Method

The K-means algorithm is a clustering algorithm proposed by MacQueen in 1967. It divides the n data objects into K distinct groups so as to minimize the variance of the data objects about their group means. The K-means clustering method thus splits the data into clusters that minimize intra-cluster differences and maximize inter-cluster differences [42]. The K-means optimization objective is given by Equation (19):
E = \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - u_i \right\|_2^2
where u_i is the centroid of cluster C_i, Equation (20):
u_i = \frac{1}{\left| C_i \right|} \sum_{x \in C_i} x
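A minimal sketch of this grouping step using scikit-learn's KMeans. The two synthetic clusters stand in for groups in logging feature space; all numeric values are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic 2-D groups (e.g. a pair of logging features).
rng = np.random.default_rng(42)
group0 = rng.normal(loc=[40.0, 2.0], scale=0.5, size=(50, 2))
group1 = rng.normal(loc=[80.0, 4.0], scale=0.5, size=(50, 2))
X = np.vstack([group0, group1])

# Fit K=2 clusters; labels_ assigns each point to group 0 or 1.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))   # cluster sizes
```

With clusters this well separated, each K-means group recovers one synthetic group exactly; the per-group data can then be fed to separate XGBoost models.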

3. Results

To construct the K-means-optimized XGBoost model structure, Python 3.8 and the Scikit-Learn library were utilized in this article. The train_test_split module was imported to partition the training and test sets. The GridSearchCV and RandomizedSearchCV modules were employed to perform 5-fold cross-validation and hyper-parameter optimization. The XGBRegressor model was used to implement the XGBoost algorithm.
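A runnable skeleton of this workflow is sketched below. To keep the sketch self-contained, scikit-learn's GradientBoostingRegressor stands in for XGBRegressor, and the synthetic data, parameter ranges, and search sizes are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

# Synthetic stand-in data: 8 "logging curve" features, one "porosity" target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=400)

# 80/20 train/test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Coarse random search over a wide range with 5-fold cross-validation.
coarse = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [50, 100, 200], "max_depth": [2, 3, 4],
     "learning_rate": [0.05, 0.1, 0.2]},
    n_iter=5, cv=5, random_state=0,
).fit(X_tr, y_tr)

# Finer grid search around the best coarse learning rate.
best = coarse.best_params_
fine = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [best["n_estimators"]],
     "max_depth": [best["max_depth"]],
     "learning_rate": [best["learning_rate"] * f for f in (0.5, 1.0, 1.5)]},
    cv=5,
).fit(X_tr, y_tr)

print(round(fine.score(X_te, y_te), 2))   # test-set R2
```

Swapping in `xgboost.XGBRegressor` requires only changing the estimator; the search scaffolding is identical.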

3.1. XGBoost Model Evaluation

The optimal combination of feature variables can provide a more accurate representation of the input features. Based on the correlation between feature variables and target labels, AC, CAL, CNL, DEN, SP, RLLS, PE, GR, RLLD, and RT were selected as candidate input feature variables. An exhaustive search was applied to enumerate all possible combinations, with the minimum combination size set to 1 feature and the maximum set to all 10 candidate curves. Each combination of feature variables was then input into the XGBoost model, and the mean absolute error (MAE) and coefficient of determination (R2) were used as the evaluation indices to select the best feature combination.
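The exhaustive combination search can be sketched with itertools. Note that 10 candidate curves yield 1023 non-empty subsets (1024 when the empty set is counted, matching the combination count reported below); the scoring step is omitted here.

```python
from itertools import combinations

curves = ["AC", "CAL", "CNL", "DEN", "SP", "RLLS", "PE", "GR", "RLLD", "RT"]

# All non-empty subsets of the 10 candidate curves.
subsets = [list(c) for k in range(1, len(curves) + 1)
           for c in combinations(curves, k)]
print(len(subsets))   # 1023

# In the real workflow, each subset would be used to train an XGBoost
# model and the subset with the best MAE/R2 would be retained.
```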
The results of the 1024 feature combination scores are shown in Table 2. The optimal feature combination for the POR model is [AC, CAL, CNL, DEN, SP, RLLS, PE, GR], with a MAE of 0.0075 for the training set and 0.0195 for the test set and an R2 of 0.97 for the training set and 0.68 for the test set. The optimal feature combination for PERM is [AC, CAL, SP, RLLS, PE, GR, RLLD], with a training set MAE of 0.0003, a test set MAE of 0.0006, a training set R2 of 0.99, and a test set R2 of 0.31. The optimal feature combination for LOG(PERM) regression is [AC, CAL, CNL, DEN, SP, RLLS, PE, GR, RLLD], with a training set MAE of 0.3757, a test set MAE of 0.0920, a training set R2 of 0.97, and a test set R2 of 0.57. The optimal feature combination for Arctan(PERM) regression is [CAL, CNL, DEN, SP, PE, GR, RLLD], with a training set MAE of 0.0932, a test set MAE of 0.0244, a training set R2 of 0.96, and a test set R2 of 0.50.
The RandomizedSearchCV function was used for large-scale and efficient searches based on random search results. Subsequently, the GridSearchCV function was utilized for more precise searches, ensuring a wide search range and accurate search results while reducing computational model runtime. Additionally, the functions were combined with 5-fold cross-validation to evaluate different parameter configurations with the objective of maximizing the model performance. The main hyper-parameters that must be configured to build the regression prediction model using the XGBoost algorithm are shown in Table 3.
With the combination of random search and grid search, and based on the optimal feature combinations, the hyper-parameter selection results are shown in Table 4. The results show that the optimal feature combination of the POR model is [AC, CAL, CNL, DEN, SP, RLLS, PE, GR]; following the hyper-parameter optimization step, the training set MAE is 0.0008, the test set MAE is 0.0671, the training set R2 is 0.99, and the test set R2 is 0.72. The optimal feature combination of the PERM model is [AC, CAL, SP, RLLS, PE, GR, RLLD]; after hyper-parameter optimization, the training set MAE is 0.0017, the test set MAE is 0.0084, the training set R2 is 0.99, and the test set R2 is 0.32. The optimal feature combination of the LOG(PERM) model is [AC, CAL, CNL, DEN, SP, RLLS, PE, GR, RLLD]; after hyper-parameter optimization, the training set MAE is 0.0205, the test set MAE is 0.3543, the training set R2 is 0.99, and the test set R2 is 0.62. The optimal feature combination of the Arctan(PERM) model is [CAL, CNL, DEN, SP, PE, GR, RLLD]; after hyper-parameter optimization, the training set MAE is 0.0225, the test set MAE is 0.0893, the training set R2 is 0.96, and the test set R2 is 0.52. The fitting of the model training and test sets is shown in Figure 5, where the POR and LOG(PERM) models are only slightly overfitted and the PERM and Arctan(PERM) models are significantly overfitted.

3.2. K-means Optimized XGBoost Model Evaluation

The K-means clustering method was applied to divide the experimental porosity and permeability data into groups 0 and 1 based on the values of the logging feature variables GR and PE in the study area. The distribution of data points is shown in Figure 6, and the feature combination and hyper-parameter optimization steps were carried out for each group separately. The model evaluation results are given in Table 5 and Table 6. The optimal feature combination for the POR_0 group is [AC, CAL, CNL, DEN, SP, RLLS, PE, GR], with a training set MAE of 0.0744, a test set MAE of 0.0544, a training set R2 of 0.98, and a test set R2 of 0.68; the optimal feature combination for the POR_1 group is [CAL, CNL, SP, RLLS, PE, GR, RT], with a training set MAE of 0.0007, a test set MAE of 0.0563, a training set R2 of 0.99, and a test set R2 of 0.84. For the permeability model, the arctangent permeability model (Arctan(PERM)) performs better and was chosen; the optimal feature combination of the Arctan(PERM)_0 group is [CAL, CNL, DEN, SP, RLLS, PE, GR, RT], with a training set MAE of 0.0246, a test set MAE of 0.0915, a training set R2 of 0.96, and a test set R2 of 0.54. The optimal feature combination for the Arctan(PERM)_1 group is [CNL, SP, PE, GR, RT], with a training set MAE of 0.0009, a test set MAE of 0.0530, a training set R2 of 0.99, and a test set R2 of 0.85. In comparison to the ungrouped models, the grouped models improve the coefficient of determination and reduce the mean absolute error.

4. Discussion

1. Data preprocessing methods for the XGBoost permeability prediction model
In this paper, two preprocessing methods for the experimental permeability data were explored, and three XGBoost permeability prediction models were established: raw permeability (PERM), logarithmically preprocessed permeability (LOG(PERM)), and arctangent-preprocessed permeability (Arctan(PERM)). Among the ungrouped models, the final LOG(PERM) model had a training set R2 of 0.98 and a test set R2 of 0.68; despite a certain degree of overfitting, its performance was better than that of the PERM and Arctan(PERM) models. Among the grouped models, the permeability model constructed using arctangent preprocessing demonstrated the most favorable performance.
2. Optimization effect of the K-means method on the XGBoost model
Grouping the data with the K-means method proved beneficial to the performance of the XGBoost model. After the experimental porosity and permeability data were grouped by K-means, the grouped XGBoost models reduced the MAE by 0.017 and improved R2 by 0.15 relative to the ungrouped models, as shown in Figure 7. It is worth noting that K-means reduces the amount of data available to each model, which increases the risk of overfitting; nevertheless, the overall prediction performance improves.
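The group-then-fit workflow can be sketched with scikit-learn's KMeans applied to the GR/PE columns, followed by one regressor per group. GradientBoostingRegressor stands in here for XGBoost (xgboost.XGBRegressor exposes the same fit/predict interface); the features and labels are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                          # stand-in logging features; columns 0-1 play GR, PE
y = X[:, 0] * 0.1 + rng.normal(scale=0.01, size=200)   # stand-in porosity label

# Cluster the samples into groups 0 and 1 on the GR/PE columns
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, :2])

# Train one regressor per group, mirroring the POR_0 / POR_1 split
models = {}
for g in (0, 1):
    mask = groups == g
    models[g] = GradientBoostingRegressor(random_state=0).fit(X[mask], y[mask])

# Predict each sample with the model of its own group
y_pred = np.empty_like(y)
for g, m in models.items():
    y_pred[groups == g] = m.predict(X[groups == g])
```

At prediction time, a new depth point is first assigned to a cluster with `KMeans.predict` on its GR/PE values, and the corresponding group model is then applied.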
3. Effectiveness of the XGBoost porosity and permeability model application
Taking well M42 as an example, the vertical application effect of the established XGBoost porosity and permeability models was analyzed. Comparing the porosity calculated by the grouped porosity model with that calculated from the logging curves, and likewise for permeability, the grouped XGBoost models are closer to the measured values and can provide predictions over the whole well section, as shown in Figure 8. Based on this analysis, the grouped porosity model was chosen as the final prediction model for porosity in the study area, and the grouped permeability model as the final prediction model for permeability. The grouped models were then used to predict single-well porosity and permeability, and the polynomial interpolation method was applied to obtain the planar distribution of porosity and permeability in layer S13, as shown in Figure 9.
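Mapping single-well predictions onto the S13 plane amounts to 2-D interpolation of scattered well values. A sketch with SciPy's griddata, using made-up well coordinates and values; the paper's exact polynomial interpolation scheme is not specified, so cubic griddata serves as an illustrative substitute:

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical well locations (x, y) and predicted porosity at each well
wells = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
por = np.array([0.04, 0.06, 0.05, 0.08, 0.07])

# Regular grid over the mapped area
gx, gy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))

# Interpolate the scattered well values onto the grid for contouring
por_grid = griddata(wells, por, (gx, gy), method="cubic")

print(por_grid.shape)  # grid matches the 50 x 50 mesh
```

Denser well control reduces interpolation artifacts; values outside the convex hull of the wells are returned as NaN by griddata and must be handled before mapping.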
4. Model limitations
The data volume for the K-means-based XGBoost porosity and permeability models should be as large as possible to prevent overfitting and improve the generalization ability of the models. Built with Python 3.8 and the Scikit-Learn library, the XGBoost algorithm supports parallel execution; running on eight threads simultaneously, model building and running took about 10 h. The models have good applicability in the study area, and the ideas and methods can be transferred to other regions.

5. Conclusions

In this study, preprocessing the permeability data with logarithmic and arctangent transforms improved the accuracy of the XGBoost permeability prediction model. LOG(PERM) improved the test set R2 by 0.26 over PERM, and Arctan(PERM) improved it by 0.19. In addition, the degree of training set fitting was reduced by 0.02 and 0.03, respectively, which attenuates overfitting to some extent.
The K-means method has positive implications for the optimization of XGBoost porosity and permeability logging prediction models. The K-means grouping method was applied to the input data of the machine learning model, dividing it into two groups. Compared with the ungrouped model, R2 improved by 0.15 on average and the MAE was reduced by 0.017 on average, improving the model's predictive ability and bringing the predictions into closer alignment with the laboratory test data.
The test set R2 of the K-means grouped porosity models POR_0/POR_1 was 0.73 and 0.85, with optimal feature combinations [AC, CAL, CNL, DEN, SP, RLLS, PE, GR] and [CAL, CNL, SP, RLLS, PE, GR, RT]; the test set R2 of the grouped permeability models Arctan(PERM)_0/Arctan(PERM)_1 was 0.58 and 0.85, with optimal feature combinations [CAL, CNL, DEN, SP, RLLS, PE, GR, RT] and [CNL, SP, PE, GR, RT]. The single-well application accuracy was good relative to the porosity and permeability logging curves, and prediction over the full well section can be realized. Consequently, the K-means-optimized XGBoost porosity and permeability model offers a novel approach and a reference point for the efficient and relatively accurate evaluation of porosity and permeability in the study area at a reduced cost. It also provides a reference for combining K-means with other machine learning methods to predict important parameters in oil and gas field development, thereby expanding the scope of applicability of the method.

Author Contributions

Conceptualization, R.W.; methodology, J.Z.; software, J.Z.; validation, N.F.; formal analysis, N.F.; investigation, J.Z.; resources, A.J.; data curation, A.J.; writing—original draft preparation, J.Z.; writing—review and editing, R.W.; visualization, J.Z.; supervision, A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Key R & D Plan of Shaanxi Province (key industrial innovation chain (Group)) “Research, development and industrialization promotion of small-molecular recyclable self-cleaning fracturing fluid (Project No.2022ZDLSF07-04)”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AC: Acoustic time difference logging
CAL: Caliper logging
CNL: Compensated neutron logging
DEN: Density logging
GR: Gamma ray logging
PE: Photoelectric factor logging
RLLD: Resistivity deep logging
RLLS: Resistivity shallow logging
RT: Resistivity logging
SP: Spontaneous potential logging
K: Potassium logging
TH: Thorium logging
U: Uranium logging
POR_L: Porosity logging
PERM_L: Permeability logging
POR: Measured porosity
PERM: Measured permeability
SH: Shale logging
SAND: Sand logging
XGBoost: Extreme gradient boosting

Figure 1. (a) Tectonic unit division, (b) well distribution, (c) stratigraphic composite histogram, and (d) single-well lithologic profile in Qingyang Gas Field, Ordos Basin.
Figure 3. Normalized distribution of feature variables and target labels.
Figure 4. Plot of Pearson’s correlation coefficient of feature variables.
Figure 5. Fitting effect of the model on the test set and prediction set.
Figure 6. Data grouping diagram of K-means method.
Figure 7. Comparison of Optimization Effects of K-means Method on XGBoost Model.
Figure 8. Comparison of the effect of XGBoost porosity and permeability modeling application and logging curves for well M42.
Figure 9. (a) Planar distribution of porosity in layer S13 in the study area; (b) planar distribution of permeability in layer S13 in the study area.
Table 1. Statistics of logging curves for experimental data points.

| Logging Curve | Mean | First Quartile | Median | Third Quartile | Min | Max | Std Deviation | Variance |
|---|---|---|---|---|---|---|---|---|
| AC (μs/m) | 216.20 | 203.68 | 212.68 | 225.65 | 181.86 | 379.57 | 19.17 | 367.71 |
| CAL (in) | 25.04 | 23.14 | 24.89 | 26.07 | 21.30 | 39.68 | 2.48 | 6.14 |
| CNL (%) | 11.12 | 6.17 | 9.27 | 13.51 | 0.13 | 50.59 | 7.05 | 49.73 |
| DEN (g/cm3) | 2.53 | 2.49 | 2.56 | 2.62 | 1.29 | 2.85 | 0.16 | 0.03 |
| GR (API) | 66.96 | 38.78 | 59.17 | 85.31 | 15.64 | 437.66 | 35.80 | 1282.35 |
| PE | 2.91 | 2.21 | 2.55 | 3.00 | 0.79 | 13.70 | 1.45 | 2.11 |
| RLLD (Ω·m) | 67.07 | 28.25 | 45.88 | 73.25 | 5.37 | 679.66 | 75.83 | 5751.95 |
| RLLS (Ω·m) | 60.68 | 25.65 | 42.45 | 68.13 | 5.51 | 666.82 | 66.98 | 4487.29 |
| RT (Ω·m) | 66.59 | 28.16 | 45.96 | 72.95 | 0.00 | 676.38 | 75.80 | 5747.06 |
| SP (mV) | 56.52 | 47.81 | 57.45 | 65.93 | −7.18 | 113.31 | 14.33 | 205.47 |
| K (%) | 1.32 | 0.69 | 1.18 | 1.79 | 0.11 | 4.32 | 0.77 | 0.60 |
| TH (mg/L) | 8.91 | 4.94 | 7.60 | 11.62 | 1.51 | 42.88 | 5.15 | 26.53 |
| U (mg/L) | 2.36 | 1.36 | 1.88 | 2.88 | 0.22 | 15.78 | 1.52 | 2.31 |
| POR_L (%) | 4.55 | 2.16 | 4.69 | 6.74 | 0.00 | 15.20 | 3.16 | 9.97 |
| PERM_L (10−3 μm2) | 0.33 | 0.01 | 0.11 | 0.40 | 0.00 | 11.37 | 0.72 | 0.52 |
Table 2. XGBoost feature variable combination preference results.

| Model | Feature Combination | MAE (Training Set) | MAE (Test Set) | R2 (Training Set) | R2 (Test Set) |
|---|---|---|---|---|---|
| POR | [AC, CAL, CNL, DEN, SP, RLLS, PE, GR] | 0.0075 | 0.0195 | 0.97 | 0.68 |
| PERM | [AC, CAL, SP, RLLS, PE, GR, RLLD] | 0.0003 | 0.0006 | 0.99 | 0.31 |
| LOG(PERM) | [AC, CAL, CNL, DEN, SP, RLLS, PE, GR, RLLD] | 0.3757 | 0.0920 | 0.97 | 0.57 |
| Arctan(PERM) | [CAL, CNL, DEN, SP, PE, GR, RLLD] | 0.0932 | 0.0244 | 0.96 | 0.50 |
Table 3. Description of hyper-parameter optimization of the XGBoost model.

| Algorithm | Hyper-Parameter | Description |
|---|---|---|
| XGBoost | learning_rate | The smaller the learning rate, the smaller the impact of each tree and the more stable the model training |
| | max_depth | Controls the maximum depth of each tree; small values make it harder for the model to overfit |
| | min_child_weight | Prevents the model from overfitting the training set |
| | n_estimators | The number of decision trees; more trees tend to improve model performance |
Table 4. XGBoost model hyper-parameter preference values.

| Model | Feature Combination | Hyper-Parameter Values | MAE (Training Set) | MAE (Test Set) | R2 (Training Set) | R2 (Test Set) |
|---|---|---|---|---|---|---|
| POR | [AC, CAL, CNL, DEN, SP, RLLS, PE, GR] | learning_rate: 0.1; max_depth: 50; min_child_weight: 9; n_estimators: 300 | 0.0008 | 0.0671 | 0.98 | 0.72 |
| PERM | [AC, CAL, SP, RLLS, PE, GR, RLLD] | n_estimators: 100 | 0.0017 | 0.0084 | 0.99 | 0.32 |
| LOG(PERM) | [AC, CAL, CNL, DEN, SP, RLLS, PE, GR, RLLD] | learning_rate: 0.02; max_depth: 90; min_child_weight: 9; n_estimators: 600 | 0.0205 | 0.3543 | 0.97 | 0.62 |
| Arctan(PERM) | [CAL, CNL, DEN, SP, PE, GR, RLLD] | learning_rate: 0.02; max_depth: 50; min_child_weight: 9; n_estimators: 300 | 0.0225 | 0.0893 | 0.96 | 0.52 |
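Hyper-parameter values of this kind can be selected with an exhaustive grid search. A sketch using scikit-learn's GridSearchCV with GradientBoostingRegressor as a stand-in (xgboost.XGBRegressor accepts the same learning_rate, max_depth, and n_estimators names, plus min_child_weight); the grid is deliberately tiny and the data synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))                                  # stand-in logging features
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=120)   # stand-in label

# Candidate values for three of the hyper-parameters described in Table 3
param_grid = {
    "learning_rate": [0.02, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # mirrors the paper's MAE criterion
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

The best parameter set found by cross-validation is then refit on the full training split, which is what GridSearchCV does by default.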
Table 5. Grouped machine learning feature variable combination preference results.

| Model | Group | Feature Combination | MAE (Training Set) | MAE (Test Set) | R2 (Training Set) | R2 (Test Set) |
|---|---|---|---|---|---|---|
| POR | 0 | [AC, CAL, CNL, DEN, SP, RLLS, PE, GR] | 0.0744 | 0.0544 | 0.98 | 0.68 |
| POR | 1 | [CAL, CNL, SP, RLLS, PE, GR, RT] | 0.0007 | 0.0563 | 0.99 | 0.84 |
| PERM | 0 | [AC, CNL, RLLS, RLLD] | 0.0023 | 0.0085 | 0.98 | 0.32 |
| PERM | 1 | [AC, CAL, DEN, RT] | 0.0008 | 0.0131 | 0.99 | 0.79 |
| LOG(PERM) | 0 | [CAL, CNL, DEN, SP, RLLS, PE, GR, RLLD] | 0.0969 | 0.3761 | 0.96 | 0.58 |
| LOG(PERM) | 1 | [CNL, SP, RLLS, RLLD, RT] | 0.0010 | 0.2493 | 0.99 | 0.83 |
| Arctan(PERM) | 0 | [CAL, CNL, DEN, SP, RLLS, PE, GR, RT] | 0.0246 | 0.0915 | 0.96 | 0.54 |
| Arctan(PERM) | 1 | [CNL, SP, PE, GR, RT] | 0.0009 | 0.0530 | 0.99 | 0.85 |
Table 6. Preferred values of hyper-parameters for the grouped XGBoost model.

| Model | Group | Feature Combination | Hyper-Parameter Values | MAE (Training Set) | MAE (Test Set) | R2 (Training Set) | R2 (Test Set) |
|---|---|---|---|---|---|---|---|
| POR | 0 | [AC, CAL, CNL, DEN, SP, RLLS, PE, GR] | learning_rate: 0.07; max_depth: 100; min_child_weight: 20; n_estimators: 500 | 0.0006 | 0.0685 | 0.99 | 0.73 |
| POR | 1 | [CAL, CNL, SP, RLLS, PE, GR, RT] | n_estimators: 17 | 0.0135 | 0.0552 | 0.99 | 0.85 |
| PERM | 0 | [AC, CNL, RLLS, RLLD] | n_estimators: 100 | 0.0023 | 0.0085 | 0.98 | 0.32 |
| PERM | 1 | [AC, CAL, DEN, RT] | n_estimators: 500 | 0.0008 | 0.0131 | 0.99 | 0.79 |
| LOG(PERM) | 0 | [CAL, CNL, DEN, SP, RLLS, PE, GR, RLLD] | learning_rate: 0.01; max_depth: 40; min_child_weight: 9; n_estimators: 800 | 0.0570 | 0.3520 | 0.98 | 0.61 |
| LOG(PERM) | 1 | [CNL, SP, RLLS, RLLD, RT] | n_estimators: 31 | 0.0482 | 0.2494 | 0.99 | 0.83 |
| Arctan(PERM) | 0 | [CAL, CNL, DEN, SP, RLLS, PE, GR, RT] | learning_rate: 0.1; max_depth: 80; min_child_weight: 15; n_estimators: 100 | 0.0211 | 0.0842 | 0.96 | 0.58 |
| Arctan(PERM) | 1 | [CNL, SP, PE, GR, RT] | n_estimators: 37 | 0.0009 | 0.0530 | 0.99 | 0.85 |

Share and Cite

MDPI and ACS Style

Zhang, J.; Wang, R.; Jia, A.; Feng, N. Optimization and Application of XGBoost Logging Prediction Model for Porosity and Permeability Based on K-means Method. Appl. Sci. 2024, 14, 3956. https://doi.org/10.3390/app14103956

