Article

An Application Study of Machine Learning Methods for Lithological Classification Based on Logging Data in the Permafrost Zones of the Qilian Mountains

1
Jiangxi Engineering Laboratory on Radioactive Geoscience and Big Data Technology, East China University of Technology, Nanchang 330013, China
2
Shaanxi Key Laboratory of Petroleum Accumulation Geology, Xi’an Shiyou University, Xi’an 710065, China
3
Key Laboratory of Tectonics and Petroleum Resources, Ministry of Education, China University of Geosciences, Wuhan 430074, China
4
Hubei Subsurface Multi-Scale Imaging Key Laboratory, Institute of Geophysics and Geomatics, China University of Geosciences, Wuhan 430074, China
5
Jiangxi Bureau of Geology Non-Ferrous Geological Brigade, Ganzhou 341000, China
*
Author to whom correspondence should be addressed.
Processes 2025, 13(5), 1475; https://doi.org/10.3390/pr13051475
Submission received: 25 March 2025 / Revised: 30 April 2025 / Accepted: 5 May 2025 / Published: 12 May 2025

Abstract

Lithology identification is fundamental for the logging evaluation of natural gas hydrate reservoirs. The Sanlutian field, located in the permafrost zones of the Qilian Mountains (PZQM), presents unique challenges for lithology identification due to its complex geological features, including fault development, missing and duplicated stratigraphy, and a diverse array of rock types. Conventional methods frequently encounter difficulties in precisely discerning these rock types. This study employs well logging and core data from hydrate boreholes in the region to evaluate the performance of four data-driven machine learning (ML) algorithms for lithological classification: random forest (RF), multi-layer perceptron (MLP), logistic regression (LR), and decision tree (DT). The results indicate that seven principal lithologies—sandstone, siltstone, argillaceous siltstone, silty mudstone, mudstone, oil shale, and coal—can be effectively distinguished through the analysis of logging data. Among the tested models, the random forest algorithm demonstrated superior performance, achieving optimal precision, recall, F1-score, and Jaccard coefficient values of 0.941, 0.941, 0.940, and 0.889, respectively. The models were ranked in the following order based on evaluation criteria: RF > MLP > DT > LR. This research highlights the potential of integrating artificial intelligence with logging data to enhance lithological classification in complex geological settings, providing valuable technical support for the exploration and development of gas hydrate resources.

1. Introduction

Gas hydrates are ice-like crystalline substances with a rigid cage structure that form when gas and water molecules combine under conditions of low temperature and high pressure. They are found mainly in seabed sediments and in inland permafrost areas [1]. The exploration of gas hydrates in China commenced in 1999 [2]. In November 2008, gas hydrate samples were first obtained from the depth range of 133.5 to 135.5 m in the DK–1 borehole situated in the Muli permafrost zones of the Qilian Mountains (PZQM) on the Tibetan Plateau, establishing China as the first nation to successfully discover gas hydrates in the mid–low-latitude permafrost zone, which is of great scientific and economic importance [3,4,5]. Subsurface geological stratigraphy and lithology are pivotal for reservoir characterization and resource assessment [6,7]. These factors impact rock physical properties like porosity and permeability, which affect gas hydrate saturation [8,9,10]. Lithology classification is a crucial aspect of the hydrate exploration and development process, as it informs the classification of hydrate reservoir types. Reservoir delineation represents a pivotal element of hydrate fine description [11], which is of paramount importance for the accurate quantitative evaluation of gas hydrate reserves. Therefore, reliable lithology identification results help improve the accuracy of reservoir physical property prediction and reduce uncertainty during the exploration, development, and stabilization of gas hydrate deposits [12,13]. Conventional identification techniques, including cross-plot [14], statistical methods [15,16], and imaging logging [17], entail the manual classification of lithologies by the interpreter, a process that is both laborious and time-consuming.
Compared with conventional lithological identification techniques (for example, coring), formation identification from logging data is rapid and inexpensive. Prior research has demonstrated that geophysical logging curves are responsive to diverse lithologies, enabling the effective identification and delineation of formation lithologies [18,19].
Over the past few decades, many researchers have conducted extensive research on lithological classification based on logging data and have proposed several conventional methods, including cross-plot, statistical methods, and imaging logging. However, these conventional methods have several shortcomings, including difficulties in identification, low accuracy, low efficiency, and a high susceptibility to human factors. Furthermore, the cost of stratigraphic imaging logging is a significant barrier to its implementation in a broad range of practical applications.
In recent years, machine learning (ML) techniques have facilitated the processing of geophysical datasets in novel ways [20]. Researchers have begun using ML techniques to explore the correlation between logging data and rock types and to develop methods for predicting rock types. For instance, supervised ML algorithms, including support vector machine (SVM) [21], decision tree (DT) [22], multi-layer perceptron (MLP) [23], and random forest (RF) [24], have been effectively utilized. Unsupervised ML algorithms, such as clustering [25,26] and principal component analysis [27], have also been applied. As the volume of data and computational capabilities have expanded, sophisticated deep neural network algorithms have been developed and implemented for lithology identification [28,29,30,31]. These ML algorithms, along with deep neural network algorithms, have taken lithology identification to the stage of automatic identification. For example, Delavar [32] used a multiple-kernel-function SVM combined with three nature-inspired heuristic optimizers, namely particle swarm optimization (PSO), grasshopper optimization algorithm (GOA), and grey wolf optimizer (GWO), to classify carbonate reservoir fractures within the Asmari reservoir in the Middle East. A comparison with alternative ML methods demonstrated that the hybrid SVM(RBF)-GWO offers superior accuracy. Bressan et al. [33] classified lithology using four ML algorithms, namely MLP, DT, RF, and SVM, on multivariate logging data from offshore wells of the International Ocean Discovery Program (IODP), achieving good classification results. Zhao et al. [34] proposed a classification enhancement semi-supervised generative adversarial network (CE-SGAN), which employs a classification separation architecture and a pseudo-label processing mechanism to reduce the impact of data imbalance, marking its inaugural application in lithology identification. Their results show significant improvements in lithology identification for small, unbalanced datasets, along with good generalization performance and competitive advantages in data augmentation. Alzubaidi et al. [6] introduced a convolutional neural network (CNN) model based on the ResNeXt-50 architecture for the automatic prediction of core tray images, which outperformed CNN models based on the ResNet-18 and Inception-v3 architectures in terms of prediction accuracy. In summary, a considerable number of researchers have applied lithological analyses in the pursuit of conventional oil, gas, and mineral resources. Nevertheless, few studies have examined lithological classification in gas hydrate boreholes, and the existing methods for lithological classification are imperfect. In the PZQM, geological logging data contain abundant lithological information. However, the permafrost zones exhibit well-developed faults, missing and duplicated stratigraphy, and a variety of rock types [35,36], and reservoirs abundant in gas hydrates are characterized by lithologies such as fine sandstone, oil shale, siltstone, and mudstone [5,37], which makes these lithologies difficult to recognize. Accordingly, an accurate delineation of lithology in gas hydrate boreholes within the specified area would facilitate subsequent exploration and development of hydrate resources.
In this study, we integrated the logging data from hydrate boreholes within the PZQM with lithology data and selected the logging curves that are most responsive to lithology. We then applied four ML algorithms, namely RF, DT, MLP, and LR, to the lithology classification of gas hydrate boreholes in the research zone. This provides a novel approach and methodology for lithology classification in permafrost zones and establishes a foundation for the future identification and exploitation of gas hydrates in permafrost zones.
The rest of the paper is organized as follows. Section 2 describes the geological background of the study. Section 3 presents an overview of the fundamental principles of the selected algorithms and the metrics used for model evaluation. Section 4 presents the results of the lithological classification and the evaluation of the model. Then, a discussion of the results is given in Section 5, followed by conclusions in Section 6.

2. Geological Background

In this study, the target zone is situated in the PZQM in Western China, within the Juhugeng mining area of the Muli Coalfield, Tianjun County, Qinghai Province (Figure 1). The PZQM is situated in the northern part of the Tibetan Plateau. The internal topography is characterized by a general elevation gradient from west to east and south to north, with an altitude of 4100–4300 m. Permafrost is present throughout the year, with a thickness of 60 to 120 m. The area encompasses approximately 100,000 km², and the yearly average air temperature stands at −5.1 °C [38]. The Qilian Mountains comprise three tectonic units: the Northern Qilian Tectonic Belt, the Central Qilian Tectonic Belt, and the Southern Qilian Tectonic Belt [35,39]. The aforementioned tectonic units are separated by four ruptures, including the northern margin of the North Qilian–Central Qilian Fracture, the southern margin of the Central Qilian Fracture, and the Tuergen Daban Mountain–Zongwunong Mountain–Qinghai Lake Fracture.
In the permafrost region, the entire area is developed with Jurassic coal seams, with the upper part being the Jiangcang Formation and the lower part being the Muli Formation. From bottom to top, the sedimentary environments have transitioned from braided rivers, floodplain swamps, deltas, shallow lakes, and semi-deep lakes to deep lakes. The Juhugeng mining area, shaped by tectonic processes and evolutionary outcomes, features an anticline of Triassic strata in its central zone, with the northern and southern flanks consisting of synclines formed by coal-bearing Jurassic layers. Overall, it consists of a major anticline and two minor synclines [40]. Within the mine, northwest-trending reverse faults are significantly developed, while northeast-trending large-scale shear fractures cut it into intermittent blocks of different sizes, presenting the tectonic characteristics of north–south zoning and east–west zoning. It divides the Juhugeng coal mining area into a planar pattern of three open pits and four well fields [41].
The drilling area for gas hydrate is found within the Sanlutian field, where 14 gas hydrate drilling boreholes have been completed. The period of 2008–2009 saw a collaborative effort between the China Geological Survey and the Qinghai Coal Geological Exploration Team 105, resulting in the drilling of seven boreholes in the research region, from which gas hydrate physical samples were acquired from DK–1, DK–2, and DK–3. In 2013, four boreholes were drilled in the northwest part of the study area, yielding gas hydrate physical samples from the wells of the DK11–14, DK13–11, and DK12–13. In 2014, an additional 10 holes were drilled in the central to the eastern section of the research zone, with gas hydrate physical samples obtained only from the DK8–19 hole [41,42].
In the Muli region, all of the gas hydrates found are located beneath the permafrost layer, primarily within the Jiangcang Formation, and the reservoir depths are mostly in the range of 100–400 m. Gas hydrate usually aggregates in both porous and fractured forms. Porous hydrate is mostly in the form of dots, layers, and ripples that fill argillaceous siltstone and siltstone [5]. Fractured hydrate is mostly in the form of thin layers, flakes, and blocks that fill dense rock types such as siltstone, oil shale, and mudstone [37].

3. Methods

In this study, a range of algorithms were employed to develop and assess models for lithology classification based on logging data. Figure 2 shows the basic process of this study.

3.1. ML Algorithms

The selected algorithms included LR, DT, RF, and MLP. The following section presents an overview of the fundamental principles of the selected algorithms and the metrics used for model evaluation.

3.1.1. DT

DT is a supervised learning algorithm that makes decisions using a tree structure. At the top of the model is a root node that links to the internal nodes of the tree, which in turn link to the leaf nodes. The internal nodes are associated with features or attributes, branches represent the values of these attributes, and leaf nodes represent the value of a target variable [43]. Starting from the root node, branches are followed successively according to the feature values until a leaf node is reached, where a decision or prediction is made [44]. The structure diagram is shown in Figure 3. There are two criteria for splitting nodes, entropy (Equation (1)) and the Gini coefficient (Equation (2)) [33]:
$$H(X) = -\sum_{i=1}^{n} P(X = i)\,\log_2 P(X = i), \tag{1}$$
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{n} P^2(X = i), \tag{2}$$
where P(X = i) is the probability of the random input variable X taking the value i.
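As a minimal sketch of Equations (1) and (2), the two impurity measures can be computed from the class labels that fall into a node. The function names and the toy label lists below are illustrative assumptions, not part of the study.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the empirical probability P(X = i) of each class in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Shannon entropy in bits, following Equation (1)."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini impurity, following Equation (2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


# Hypothetical node holding two lithologies in equal proportion.
mixed_node = ["sandstone", "coal", "sandstone", "coal"]
# Hypothetical pure node holding a single lithology.
pure_node = ["coal", "coal", "coal"]

print(entropy(mixed_node), gini(mixed_node))
print(entropy(pure_node), gini(pure_node))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a split that lowers them is preferred.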
DT offers several advantages over alternative ML algorithms. It is simple and intuitive, allowing for the straightforward comprehension and interpretation of the underlying concepts. Additionally, it is adept at handling a diverse range of features, including numerical and categorical data, while exhibiting relatively fast training times when working with large datasets.

3.1.2. RF

RF is an ensemble learning model based on the bootstrap aggregating (Bagging) strategy. In this approach, the decision tree serves as the base estimator: numerous decision trees are trained and then merged into a unified forest. For each tree, samples are randomly selected from the dataset with replacement, and a subset of features is randomly chosen as input. The final prediction is obtained through a vote across all decision trees [45]. The structure is shown in Figure 4. RF can effectively handle nonlinear problems and is adept at managing large numbers of samples and high-dimensional feature spaces. Compared to a single decision tree, it enhances the stability and robustness of the model while reducing the likelihood of overfitting. Due to the randomness and combinatorial effects among different trees, it often provides more accurate predictive results [23]. The algorithm has demonstrated superior performance in numerous practical applications, exhibiting characteristics such as straightforward implementation, rapid computation, and effective error correction.
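The three ingredients named above (sampling with replacement, a random feature subset, and voting) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the helper names, the seed, and the toy votes are hypothetical, and tree training itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices out of n_features."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training samples
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=6, k=4, rng=rng)

# Hypothetical votes from five trees for one depth sample.
votes = ["sandstone", "siltstone", "sandstone", "coal", "sandstone"]
print(len(sample), features, majority_vote(votes))
```

Because each tree sees a different resampled dataset and a different feature subset, the errors of individual trees are partly decorrelated, and the vote averages them out.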

3.1.3. MLP

MLP is a basic architecture of the artificial neural network [33]. The basic structure is shown in Figure 5. The network comprises layers of neurons, including one or more hidden layers, where each layer is fully connected to the next layer. Each neuron applies a nonlinear activation function to a linear combination of its inputs. Through the learning of weights, the network performs both feature extraction and classification [46]. MLP calculates the output value through forward propagation and adjusts the weights through the back-propagation algorithm to maximize the accuracy of predictions [47]. The output of each layer serves as the input of the next layer until the output layer is reached, and the calculation formula for each layer is shown in Equation (3):
$$y = T(wx + b), \tag{3}$$
where x is the input of this layer; w is the weight of this layer; b is the bias (threshold) of this layer; T is the activation function of this layer.
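A minimal sketch of Equation (3) for a single fully connected layer is given below. The ReLU activation is chosen only for illustration, and the weights, offsets, and inputs are made-up numbers rather than values from the trained model.

```python
def relu(values):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense_layer(x, weights, bias, activation):
    """Compute y = T(w x + b) for one fully connected layer.

    weights[j] holds the weights of output neuron j, one per input.
    """
    pre_activation = [
        sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
        for row, b_j in zip(weights, bias)
    ]
    return activation(pre_activation)


# Hypothetical numbers: 3 inputs feeding 3 neurons.
x = [1.0, 2.0, 3.0]
w = [[0.5, -1.0, 0.5],
     [1.0, 1.0, -1.0],
     [-1.0, 0.0, 0.0]]
b = [0.5, 1.0, 0.0]

hidden = dense_layer(x, w, b, relu)
print(hidden)
```

Stacking several such calls, with the output of one layer passed as the input of the next, gives the forward propagation described in the text.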
MLP can comprise several hidden layers, thus enhancing the expressive capacity of the network. This approach is exemplified by a deep neural network, which is capable of processing intricate nonlinear relationships.

3.1.4. LR

LR is a linear model for binary classification tasks. Although its name contains the term regression, logistic regression is a classification algorithm used to predict the probability of a binary output variable. LR applies a logistic (sigmoid) function to a linear combination of the features, z, so that the output values stay between 0 and 1, assuming a linear association between the features and the target output, as shown in Equation (4):
$$\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}. \tag{4}$$
This function is used to represent the probability that the observed sample belongs to a given category. The objective of LR model training is to maximize the likelihood function or minimize the loss function, thereby ensuring that the predicted probability of the model approximates the true label [48]. Furthermore, LR is capable of providing estimates of categorical probabilities and can accommodate both linear and nonlinear relationships. However, it is more susceptible to correlations between features than other techniques.
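The sigmoid mapping in Equation (4) can be sketched as below, where z is taken to be the linear combination of the features. The weights, offset, and feature values are hypothetical. The source does not state how the binary model is extended to seven lithologies; a one-vs-rest scheme would be one common option.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights and normalized log readings.
p = predict_probability([0.2, 0.8], weights=[1.5, -2.0], bias=0.3)
print(sigmoid(0.0), p)
```

The two branches in `sigmoid` are algebraically identical; the split only keeps `math.exp` from overflowing for large negative z.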

3.2. Model Evaluation Criteria

To assess the predictive accuracy and generalization capability of the lithology classification models more accurately, this study used the confusion matrix, precision, recall, F1-score, and Jaccard coefficient to measure model performance.
The confusion matrix illustrates the performance and errors of the implemented models, showing how samples of each lithological category were classified and misclassified. In general, T (true) signifies a correct prediction (one that matches the actual value), and F (false) signifies an incorrect prediction (one that does not match the actual value). P (positive) denotes a sample predicted as positive, while N (negative) denotes a sample predicted as negative. Precision and recall can be calculated directly from the confusion matrix.
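A confusion matrix of this kind can be tallied directly from paired actual and predicted labels, as in the sketch below. The orientation (rows for actual classes, columns for predicted classes) is an assumption, since the source does not state the layout used in its figures, and the labels are toy data.

```python
def confusion_matrix(actual, predicted, classes):
    """Count samples; rows are actual classes, columns are predicted."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical labels for six depth samples.
classes = ["sandstone", "mudstone", "coal"]
actual = ["sandstone", "sandstone", "mudstone", "mudstone", "coal", "coal"]
predicted = ["sandstone", "mudstone", "mudstone", "mudstone", "coal", "sandstone"]

cm = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, cm):
    print(name, row)
```

Diagonal entries count correct predictions; every off-diagonal entry is a specific confusion between two lithologies.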
Precision refers to the likelihood that a sample classified into a particular category indeed pertains to that category. Recall denotes the probability of accurately classifying a specimen within a specified class. The F1-score simultaneously considers the precision and recall, functioning as the harmonic mean of these two metrics. The equations for precision, recall, and F1-score are provided in Equations (5)–(7) [33]:
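$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{5}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{6}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{7}$$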
where TP refers to the count of samples that are true positives (actual positives and predicted positives), FP stands for the count of false positives (actual negatives but predicted positives), FN signifies the count of false negatives (actual positives but predicted negatives), TN indicates the count of true negatives (actual negatives and predicted negatives).
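For a multi-class problem, Equations (5)–(7) are usually evaluated one class at a time, treating that class as positive and all others as negative. The sketch below does this for a made-up three-class matrix; it assumes rows are actual classes and columns are predicted classes, and the counts are not taken from the study.

```python
def per_class_metrics(matrix, k):
    """Precision, recall and F1 for class k of a square confusion matrix.

    Assumes rows are actual classes and columns are predicted classes.
    """
    tp = matrix[k][k]
    fp = sum(matrix[r][k] for r in range(len(matrix))) - tp  # column minus TP
    fn = sum(matrix[k]) - tp                                 # row minus TP
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical 3-class matrix (not taken from the paper's figures).
example_cm = [[8, 2, 0],
              [1, 6, 3],
              [1, 0, 9]]

for k in range(len(example_cm)):
    print(k, per_class_metrics(example_cm, k))
```

A single overall figure is then obtained by averaging the per-class values; the source does not state which averaging scheme was used.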
The Jaccard similarity coefficient can intuitively reflect the accuracy of model predictions and has a certain robustness for imbalanced datasets. It represents the size of the intersection of the true values and the predicted values, divided by the union size of the two labels. The equation for the coefficient is provided in Equation (8):
$$J(y, y_p) = \frac{|y \cap y_p|}{|y \cup y_p|}, \tag{8}$$
where y is the set of actual label values and yp is the set of predicted values.
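One way to read Equation (8) for a single lithology is to take y as the set of sample positions that truly carry that label and y_p as the set of positions predicted to carry it. The sketch below follows that per-class reading, which is an assumption; the labels are toy data.

```python
def jaccard_for_class(actual, predicted, label):
    """Intersection over union of the positions labelled `label`."""
    true_idx = {i for i, a in enumerate(actual) if a == label}
    pred_idx = {i for i, p in enumerate(predicted) if p == label}
    union = true_idx | pred_idx
    return len(true_idx & pred_idx) / len(union) if union else 0.0


# Hypothetical labels for five depth samples.
actual = ["coal", "coal", "coal", "mudstone", "mudstone"]
predicted = ["coal", "coal", "mudstone", "mudstone", "coal"]

j_coal = jaccard_for_class(actual, predicted, "coal")
j_mudstone = jaccard_for_class(actual, predicted, "mudstone")
print(j_coal, j_mudstone)
```

In this per-class form the coefficient equals TP / (TP + FP + FN), so it is tied to the F1-score by J = F1 / (2 − F1) and is never larger than F1 for the same class.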

4. Application

4.1. Data

In this study, the logging data employed were sourced from the PZQM, Qinghai Muli Sanlutian field. This dataset comprises the logging data from three boreholes, DK10–16, DK13–11, and DK12–13, along with the lithological information of these boreholes, as depicted in Table 1 and Figure 6. Different logging data evinced distinctive responses to the same lithology. According to the correlation study (Figure 7), six types of logging data with high correlation and sensitivity to lithology in the study area were selected from conventional logging data: natural gamma (GR), sonic velocity (VP), neutron (CNL), resistivity (RT), borehole diameter (CAL), and density (DEN).
The dataset comprised a total of 9214 samples, of which 3755 were siltstone samples, 339 were mudstone samples, 1553 were oil shale samples, 306 were coal samples, 2147 were sandstone samples, 995 were silty mudstone samples, and 119 were argillaceous siltstone samples. These are illustrated in Figure 8, with a sampling interval of 0.1 m.

4.1.1. Histogram Analysis of Logging Data

The use of histograms can serve to reflect the distribution of values of logging data, in addition to the attribute characteristics of each lithology. Furthermore, it can be observed how effectively the logging attributes differentiate between different lithologies. The histograms were constructed for the various logging datasets of the seven identified lithologies, as Figure 9 illustrates.
The data presented in the histograms indicate that sandstone has the properties of low GR and CNL and high DEN; coal is marked by low GR and DEN and high RT. Oil shale demonstrates high CNL properties, and mudstone exhibits high GR properties. It can be concluded from these properties that sandstone is better distinguished in the CNL histogram (Figure 9d), and coal is better distinguished in the DEN histogram (Figure 9e). Nevertheless, the majority of the lithologies represented in the logging attribute histograms exhibit such substantial overlap that effective identification becomes a significant challenge. Moreover, it can be perceived from Figure 9b that oil shale experienced expansion during the exploration process.

4.1.2. Cross-Plot Analysis of Logging Curve

Cross-plot is a frequently utilized data visualization tool designed to analyze and interpret the relationships among various attributes. Its fundamental concept involves plotting data points of two or more variables onto a two-dimensional or three-dimensional coordinate system. By observing the distribution patterns of these data points, one can deduce correlations between variables, recognize trends and anomalies within the data, and proceed with classification and prediction tasks.
The six logging curves were subjected to a cross-plot analysis, with the outcomes described in Figure 10. As seen from the diagram, the curve cross-plot of density and natural gamma can effectively identify sandstone and coal. However, the remaining lithologies exhibit a more concentrated distribution and greater data overlap, making identification more challenging. The curve cross-plot of neutron and density can effectively identify sandstone but lacks the same effectiveness for identifying the other lithologies. In conclusion, the cross-plot method is applicable when the objective is to identify a specific lithology. However, this method encounters certain challenges in delineating lithologies such as siltstone, oil shale, and mudstone in areas where hydrates are present. For example, a sonic velocity and density cross-plot can reveal significant lithological overlap, which can impede the effective identification of mudstone and oil shale. Similarly, natural gamma and neutron cross-plots also face challenges in identifying siltstone and mudstone. Consequently, this study employs ML algorithms to discern lithologies based on conventional logging curves.

4.2. Preprocessing

For ML model training, the well logging data were divided by stratified random sampling into a training set (70%) and a test set (30%). The training set comprised 6775 lithology samples, while the test set comprised 2765 lithology samples. The training dataset is used for model training, and the test dataset is used to assess the generalization capacity of the models. The selected conventional logging curves were used as input feature vectors for the lithology identification model, with the seven lithologies serving as the outputs of the model.
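A stratified 70/30 split can be sketched as below: the samples of each lithology are shuffled and divided separately, so that rare classes appear in both subsets in roughly the same proportion. The function name, the seed, and the toy labels are hypothetical, and the rounding rule is one of several reasonable choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within a class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 20 siltstone and 10 coal samples.
labels = ["siltstone"] * 20 + ["coal"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
print(len(train_idx), len(test_idx))
```

Splitting within each class matters most for the smallest lithologies, which a purely random split could leave almost absent from the test set.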
Given the disparate sizes and units of each feature quantity of the logging raw data, normalization is typically required. It serves to eliminate the influence of the scale, as well as the discrepancies in the degree of variability exhibited by each variable. This facilitates an enhancement in the accuracy of the model’s predictive capabilities. Concurrently, it eliminates the considerable outliers and gaps in the data caused by the initial measurement of the instrument, thus ensuring the integrity and coherence of the dataset. The equation for performing normalization is provided in Equation (9) [32]:
$$\text{normalization}: \quad \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}, \tag{9}$$
where Xi is the feature value of the ith sample in the original dataset, Xmin is the minimum value of that feature in the dataset, and Xmax is the maximum value of that feature in the dataset.
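Equation (9) applied to one logging curve can be sketched as follows. The readings are made-up values, and the handling of a constant curve is an added safeguard that the source does not discuss.

```python
def min_max_normalize(values):
    """Scale a list of readings to [0, 1] following Equation (9)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant curve: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical natural gamma readings.
gr = [40.0, 55.0, 70.0, 100.0]
gr_scaled = min_max_normalize(gr)
print(gr_scaled)
```

In practice the minimum and maximum are commonly taken from the training set and reused for the test set, so that no information from the test data enters the scaling; whether the study did so is not stated.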

4.3. Results

4.3.1. The Classification Behavior of DT

During training, a decision tree makes few assumptions about the training data. Consequently, without constraints, the tree may grow very deep and fit the training data so closely that it overfits. To avoid overfitting, the flexibility of the decision tree must be reduced, a process known as regularization. The hyperparameter values were determined using the grid search method. As shown in Table 2, this study selects entropy as the criterion for feature splitting during tree building, sets the maximum tree depth to 18 layers, and establishes a minimum sample requirement of 2 for splitting internal nodes. Due to the imbalanced nature of the dataset in this study, the class weight parameter ‘balanced’ was chosen to balance the weights of the majority and minority class samples, to achieve more accurate results.
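To illustrate what a ‘balanced’ class weight does, the sketch below uses the inverse-frequency rule weight = n_samples / (n_classes × class count). This is the rule scikit-learn applies for that setting; that the study used scikit-learn is an assumption, and the weights shown are computed from the full-dataset counts reported earlier rather than from the training subset.

```python
# Sample counts per lithology as reported for the full dataset.
counts = {
    "siltstone": 3755,
    "sandstone": 2147,
    "oil shale": 1553,
    "silty mudstone": 995,
    "mudstone": 339,
    "coal": 306,
    "argillaceous siltstone": 119,
}


def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * n_samples_in_class_c)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        name: n_samples / (n_classes * count)
        for name, count in class_counts.items()
    }


weights = balanced_class_weights(counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Under this rule the rarest lithology receives the largest weight, so an error on an argillaceous siltstone sample costs far more during training than an error on a siltstone sample.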
A series of evaluation metrics was obtained by applying the decision tree with the tuned hyperparameters to the test dataset. From the confusion matrix of the DT model, the precision is 0.898, the recall is 0.898, the F1-score is 0.898, and the Jaccard coefficient is 0.814. Figure 11a shows the confusion matrix of the DT, and Table 3 displays the predicted evaluation values of various lithologies in this model. Overall, the model demonstrates good classification accuracy, with an F1-score exceeding 0.850 for every lithology except mudstone. It performed well in the lithology prediction of the study area. However, it is susceptible to overfitting during parameter tuning, so the range of hyperparameter values should be strictly controlled to achieve a model with strong generalization ability.

4.3.2. The Classification Behavior of RF

The RF model is an assembled system combining multiple decision trees. This method randomly selects distinct datasets to generate a range of model outputs, which are then integrated to produce a unified result set. In this study, the optimal hyperparameter values were determined via grid search. As shown in Table 2, 160 decision trees were selected for integrated learning, and overfitting was avoided by limiting the maximum tree depth, setting it to 27. When constructing each decision tree, entropy was selected as the criterion for feature splitting, a maximum of 4 features were considered at each split, and a minimum sample requirement of 3 was considered for splitting internal nodes. At the same time, the class weight parameter “balanced” was chosen to adjust the weight of the data.
The random forest model with the tuned hyperparameters was applied to the test dataset to obtain a set of evaluation metrics. The confusion matrix of the RF shows that the model achieves high precision and recall, with respective values of 0.941 and 0.941, an F1-score of 0.940, and a Jaccard coefficient of 0.889. Figure 11b describes the model confusion matrix, while Table 4 illustrates the predicted evaluation values for each lithology type within this model. These results demonstrate that the model exhibits superior classification accuracy for all seven lithologies in the region, with each lithology achieving a score above 0.800. Furthermore, its comprehensive evaluation is the highest among the models. Therefore, the algorithm exhibits the best performance in lithology classification across the region. It demonstrates consistent and reliable outcomes, along with robust anti-interference and anti-overfitting capabilities.

4.3.3. The Classification Behavior of MLP

MLP comprises three types of layers: an input layer, one or more hidden layers, and an output layer, with adjacent layers fully connected, forming a complex network of information processing. During the training process of the MLP, the selection of hyperparameters for the model is essential to attain the optimal predictive outcomes. The hyperparameter values are given in Table 2. The L2 regularization parameter is specified, the activation function is “Relu”, the maximum iteration count is set to 200, and the initial learning rate is 0.001. Furthermore, we select “Adam” as the optimization algorithm and set 2 hidden layers, with 300 neurons in the first layer and 200 neurons in the second layer. The batch size for each training iteration is set to 15. A class-weighted loss function is defined to reduce the impact of the imbalanced dataset.
To obtain the evaluation metrics, the MLP configured as above was applied to the test dataset. From the confusion matrix of the MLP, the precision of the model is 0.923, the recall is 0.922, and the F1-score is 0.921. The Jaccard coefficient is 0.854. Figure 11c describes the confusion matrix of the MLP model, while Table 5 illustrates the predicted evaluation values for each lithology within the model. The model demonstrates good classification accuracy for six lithologies: siltstone, oil shale, coal, sandstone, silty mudstone, and argillaceous siltstone. The F1-score for each lithology exceeds 0.860, except for mudstone, for which the model misclassifies 24 samples as siltstone.

4.3.4. The Classification Behavior of LR

LR is a generalized linear model that estimates the probability of an event from a set of independent explanatory variables. To prevent overfitting, the principal hyperparameters were set as listed in Table 2. L2 regularization was selected, the strength was set to 1, and “Liblinear” was chosen as the optimization method; the class weight parameter was set to “balanced” to mitigate the impact of data imbalance.
The model configured as above was applied to the test data to obtain the evaluation metrics. From the confusion matrix of the LR model, the precision is 0.723, the recall is 0.702, the F1-score is 0.707, and the Jaccard coefficient is 0.541. Figure 11d shows the confusion matrix, and Table 6 lists the evaluation values for each lithology category. The model exhibits mediocre prediction performance and tends to misclassify samples from two lithologies, mudstone and silty mudstone, which are predominantly mislabeled as siltstone and oil shale. For argillaceous siltstone, the LR model showed a high recall but a very low F1-score. This indicates that the model recovers most true argillaceous siltstone samples but also incorrectly predicts many samples of other lithologies as argillaceous siltstone, so its precision for this class is low. The model therefore does not effectively capture the relationship between features and lithology.
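The combination of high recall and low F1-score noted above follows from the F1-score being the harmonic mean of precision and recall, which is pulled toward the smaller of the two. The short sketch below uses hypothetical values, not the entries of Table 6, to show the effect.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values chosen only to illustrate the effect.
high_recall_low_precision = f1_from(precision=0.20, recall=0.90)
balanced = f1_from(precision=0.55, recall=0.55)

print(round(high_recall_low_precision, 3))  # 0.327
print(round(balanced, 3))                   # 0.55
```

Both cases have the same arithmetic mean of precision and recall (0.55), yet the unbalanced case scores far lower.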

4.4. Case

To verify the precision of the forecast, the samples from two well sections in DK10–16 and DK13–11, which were not incorporated into the modeling process, were designated as blind test data. These data were then predicted based on the model that was previously constructed.
As illustrated in Figure 12a, all four algorithms achieved high predictive accuracy for two lithologies, sandstone and coal. In addition, the MLP model performed best in the prediction of mudstone. Overall, the MLP model shows the strongest generalization ability for these three lithologies in this section. As seen in Figure 12b, the RF and MLP models showed comparable generalization ability. Visualizing the lithologies predicted by the ML models shows that they give comparable predictions within the same depth interval. This verifies that each ML model can predict the lithology within the area.
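Holding out whole depth intervals, as in the blind test above, can be sketched as follows. The interval bounds are those given in the Figure 12 caption; the column names `well` and `depth`, the ASCII-hyphen well names, and the synthetic depth samples are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# (well, top depth in m, bottom depth in m), from the Figure 12 caption.
BLIND_INTERVALS = [("DK10-16", 518.6, 551.6), ("DK13-11", 119.7, 146.8)]

def split_blind(logs: pd.DataFrame, intervals=BLIND_INTERVALS):
    """Return (modeling, blind); blind rows lie inside any listed interval."""
    blind_mask = pd.Series(False, index=logs.index)
    for well, top, bottom in intervals:
        blind_mask |= (logs["well"] == well) & logs["depth"].between(top, bottom)
    return logs[~blind_mask], logs[blind_mask]

# Hypothetical depth samples for three wells; not the study data.
logs = pd.DataFrame({
    "well": np.repeat(["DK10-16", "DK13-11", "DK12-13"], 200),
    "depth": np.tile(np.linspace(100.0, 600.0, 200), 3),
})
modeling, blind = split_blind(logs)
print(len(modeling), len(blind))
```

Splitting by contiguous interval rather than by random rows keeps neighbouring, highly correlated depth samples out of the modeling set, which is what makes the test blind.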

5. Discussion

5.1. Discussion and Analysis of Lithology Identification Results

In this research, we apply different ML methods to classify the lithologies of three gas hydrate-bearing wells in the Muli PZQM, Qinghai. Classification performance is measured using the Jaccard coefficient, precision, recall, and F1-score. Table 7 lists the evaluation metrics obtained on the test dataset for the various ML methods. The results demonstrate that (1) RF outperforms the other ML techniques in lithology classification, with the best Jaccard coefficient, precision, recall, and F1-score; and (2) the algorithms other than LR also predict lithology with high accuracy, as evidenced by F1-scores exceeding 0.890.
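The four metrics named above can all be derived from a confusion matrix. The sketch below computes them per class and as support-weighted averages; the row/column convention, the weighted averaging, and the toy 2 × 2 matrix are assumptions for illustration and are not taken from Figure 11 or Table 7.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Per-class and support-weighted metrics from a confusion matrix.

    Rows are true classes and columns are predicted classes (an assumed
    convention; transpose the matrix if yours is the other way round).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class, but truly another
    fn = cm.sum(axis=1) - tp   # truly the class, but predicted as another
    support = cm.sum(axis=1)

    def safe_div(num, den):
        # Yield 0 where the denominator is 0 instead of dividing by zero.
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    jaccard = safe_div(tp, tp + fp + fn)

    per_class = {"precision": precision, "recall": recall,
                 "f1": f1, "jaccard": jaccard}
    weights = support / support.sum()
    weighted = {name: float(np.sum(weights * values))
                for name, values in per_class.items()}
    return per_class, weighted

# Toy two-class matrix, for illustration only.
toy_cm = [[8, 2],
          [1, 9]]
per_class, weighted = metrics_from_confusion(toy_cm)
print(per_class["jaccard"])
print(weighted)
```

For a single class the Jaccard coefficient equals F1 / (2 − F1), so it is never larger than the F1-score.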
To facilitate a comparison between the prediction results and the actual lithologies, this study presents the prediction confusion matrices for all models (Figure 11a–d). Additionally, the prediction precision, recall, and F1-score for various lithologies, provided by all ML models, are calculated (Table 3, Table 4, Table 5 and Table 6). It can be observed that the predictive performance of the models is satisfactory for the majority of the test set samples, including the lithological types of siltstone, oil shale, and sandstone. This suggests that all of these algorithms may be suitable for the prediction of these three lithologies with sufficient samples in the dataset. For the four lithologies with relatively small sample sizes, each model demonstrates robust predictions for coal, which is characterized by a markedly higher resistivity compared to the other lithologies. In the prediction results for mudstone, only one model, RF, exceeded an F1-score of 0.800. Notably, the score of the LR model is less than 0.600. Furthermore, an examination of the confusion matrices for all models reveals that the majority of mudstone samples that were incorrectly predicted were classified as siltstone, silty mudstone, and oil shale. Incorrect predictions may be partially attributed to the significant disparity in the quantity of siltstone versus mudstone samples (Figure 8, Table 1). Another potential contributing factor may be the occurrence of well-wall collapse in the oil shale, which has the effect of dilating the rock and resulting in errors in the data. From Figure 9b, it can be seen that there is indeed an expansion phenomenon in the oil shale, and its wellbore diameter exceeds the general maximum wellbore diameter by 150 mm. In addition, the expansion phenomenon of shale was also observed in the shaded area of Figure 12b. Due to the expansion problem caused during the survey process, the lithology was incorrectly predicted. 
In the case of silty mudstone, the prediction scores of all models except the LR model exceed 0.850, indicating a high degree of classification accuracy. In particular, the RF classifier performs best on this lithology, with an F1-score of 0.914, which supports the superior capability of the model. For argillaceous siltstone, LR was deficient: it identified most true argillaceous siltstone samples, but it also incorrectly predicted many samples of other rock types as argillaceous siltstone, resulting in a lower F1-score for this rock type. In contrast, the MLP model showed the best performance for this rock type, with an F1-score of 0.935. These observations suggest that LR may not be an optimal choice for distinguishing this particular lithology and that the alternative models are more effective in predicting it.
In addition, the LR model performed worst in predicting the lithology of the study area. We attribute this to three factors. (1) This study poses a multi-class, multi-feature problem, which is difficult for LR. LR is a relatively simple model that may not be sufficiently flexible for complex datasets. Compared with more intricate models (such as MLP and RF), LR may be less capable of discerning the latent patterns in the data. (2) The dataset is imbalanced. If the number of samples in each class is disproportionate, LR tends to predict the majority class, which lowers the identification rate for the minority classes. Although class weights were set to increase the model’s focus on minority classes, the imbalance may still have a significant impact on the LR model. (3) LR makes certain assumptions about the probability distribution of the data, which limits its ability to address complex nonlinear problems. If the true relationship in the data is not linear, LR cannot capture it, resulting in suboptimal classification outcomes. The data relationships in this study are nonlinear. However, in terms of the individual predictions for each lithology (Figure 11d, Table 6), the model gives favorable results for coal, siltstone, oil shale, and sandstone. The model therefore retains discernible classification performance in the Muli PZQM.
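To give a sense of the scale of the imbalance discussed in point (2), the sketch below applies the common “balanced” heuristic, weight = n_samples / (n_classes × class count), to the sample sizes in Table 1. That this is the exact weighting used in the study is an assumption, and the Table 1 counts describe the whole dataset rather than the training split.

```python
# Sample sizes per lithology, from Table 1.
counts = {
    "siltstone": 3755,
    "mudstone": 339,
    "oil shale": 1553,
    "coal": 306,
    "sandstone": 2147,
    "silty mudstone": 995,
    "argillaceous siltstone": 119,
}

def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: n_samples / (n_classes * n) for name, n in class_counts.items()}

weights = balanced_class_weights(counts)
for name, w in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name:24s} {w:7.3f}")
```

Under this heuristic an argillaceous siltstone sample is weighted roughly 30 times more heavily than a siltstone sample, which may help explain why a simple model over-predicts the rarest class.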
Ultimately, following the application of the models to logging data not included in the training and testing phases, it can be observed from Figure 12 that all models demonstrate robust predictive capabilities for both coal and sandstone. However, the generalization abilities of these models require further enhancement to achieve optimal performance in predicting the lithologies of other rock types. Generally, the implementation of ML techniques has demonstrated the potential for lithology identification in the Muli PZQM. However, there is still scope for further optimization in terms of model performance and the identification of application effects.

5.2. Feature Importance Analysis

Feature importance analysis is a method used to evaluate and interpret the impact of various features in ML models on prediction results, which can help us understand model decisions, optimize feature selection, and improve model performance. Table 8 presents the feature importance reports of the four models in this study.
For the DT model, the feature importance report shows that the top features are DEN and CAL, so the model relies more on these two features to make the final recognition decision. However, other features besides VP can also affect the final decision. The DEN histogram in Figure 9e shows that the most distinguishable lithology is coal, which has the lowest density. The CAL histogram in Figure 9b shows that the borehole diameters of all rock types are almost the same, except for the oil shale, where the diameter increases. The CNL histogram in Figure 9d shows that the neutron values of sandstone are lower than those of other rock types. In the GR histogram in Figure 9a, the GR values of coal and sandstone are visibly lower, so these lithologies can be well distinguished. In the RT histogram in Figure 9c, the resistivity of coal is far greater than that of the other lithologies, so it can be well identified. As for the recognition results of the DT model (Table 3), the F1-scores of coal, oil shale, and sandstone are indeed at the forefront, which is consistent with the results of feature importance. The top features of the RF model are also DEN and CNL, which have similar results to the DT model. Next are RT, CNL, and GR, which improve the recognition accuracy of coal and sandstone. The test results of the model (Table 4) also correspond to the results of feature importance, and the recognition results of coal, oil shale, and sandstone are the best. The feature importance ranking of the MLP model is shown in Table 8, indicating that the model takes all features into account more fully. Therefore, in addition to coal, oil shale, and sandstone, other rock types have also obtained good identification results (Table 5). The feature importance of the LR model is ranked as DEN, GR, CAL, RT, CNL, and VP, and the importance of DEN is 1.000. The model is therefore highly dependent on this feature, and GR also has an importance of 0.693. According to the DEN histogram, the lithology that can be best distinguished is coal, whereas the densities of the other lithologies are not significantly different, making them difficult to distinguish. This may also be a factor that leads to the low prediction accuracy of the model. In the GR histogram, coal and sandstone can be well distinguished, which enables sandstone to also obtain good recognition scores. The most recognizable lithology in the CAL histogram is oil shale, which also affects the recognition accuracy of oil shale to a certain extent. The recognition results of LR (Table 6) confirm this, with coal, sandstone, and oil shale having the highest recognition scores, while the recognition performance for the other rock types is clearly insufficient.
This evaluation of feature importance provides a more comprehensive explanation of the model prediction results. We conclude that all four models learn the complex mapping between logging curve features and coal, oil shale, and sandstone well. However, Table 3, Table 4, Table 5 and Table 6 show that DT, MLP, and LR did not capture the relationship between mudstone and the features well, resulting in F1-scores below 0.800. This motivates the use of more complex models in future work to explore the relationship between features and lithology and to obtain better results.
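This section does not state how the importances in Table 8 were computed. One model-agnostic option is permutation importance, sketched below with scikit-learn and rescaled so that the top feature equals 1.000; the rescaling is suggested only by the reported DEN value of 1.000 for LR. The library, the method, the synthetic data, and the mapping of curve names onto synthetic columns are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Curve names are attached to synthetic columns arbitrarily.
FEATURES = ["GR", "CAL", "RT", "CNL", "DEN", "VP"]

# Hypothetical data: four informative columns and two pure-noise columns.
X, y = make_classification(
    n_samples=700, n_features=6, n_informative=4, n_redundant=0,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean drop in test accuracy when each column is shuffled in turn.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = result.importances_mean

# Rescale so that the most important feature has importance 1.000.
scaled = importances / importances.max()
ranking = sorted(zip(FEATURES, scaled), key=lambda item: item[1], reverse=True)
for name, value in ranking:
    print(f"{name:4s} {value:6.3f}")
```

Because the same procedure can be applied to DT, RF, MLP, and LR, it gives rankings that are comparable across models, unlike tree-specific impurity importances or raw LR coefficients.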

6. Conclusions

In this study, four classical ML algorithms, namely, DT, RF, MLP, and LR, are applied to the problem of classifying seven lithologies within the Sanlutian field of the Muli PZQM in Qinghai. Six types of logging data from gas hydrate boreholes in the study area, namely GR, VP, CNL, RT, CAL, and DEN, were used as input features. The major lithologies in the permafrost area were used as the outputs: siltstone, mudstone, oil shale, coal, sandstone, silty mudstone, and argillaceous siltstone. The four ML models were fitted on the training set and evaluated on the test set. To evaluate each classification model, precision, recall, F1-score, and Jaccard coefficient were calculated. The key findings are as follows:
(1) In comparison with alternative ML models, RF has been demonstrated to be the most effective for the classification of lithological characteristics in logging data. The model achieved the highest evaluation score for precision, recall, F1-score, and Jaccard coefficient, with values of 0.941, 0.941, 0.940, and 0.889, respectively. The evaluation scores for the remaining models, in descending order, were as follows: MLP, DT, and LR.
(2) Among the seven major lithologies in this research area, most of the mudstone samples are misclassified as siltstone, silty mudstone, and oil shale, which presents a significant challenge concerning classification. To improve the classification accuracy of mudstone, it is necessary to obtain a larger number of additional samples and mine more characteristic curves in the study region.
(3) The ML technique utilizing logging data can facilitate a more accurate method of classifying the lithology type of hydrate boreholes within the PZQM. It has the potential to provide novel technical support for prospective searches for gas hydrate, as well as offer valuable references for the identification and exploration of gas hydrate reserves within the PZQM.

Author Contributions

Conceptualization, X.H.; methodology, X.H. and G.S.; software, G.S.; validation, K.X.; formal analysis, H.Y.; investigation, G.S.; resources, K.X. and X.H.; data curation, K.X.; writing—original draft preparation, G.S.; writing—review and editing, X.H. and G.S.; visualization, W.L. and Y.W.; supervision, C.W.; project administration, X.H. and C.W.; funding acquisition, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 42404150), Open Fund of Shaanxi Key Laboratory of Petroleum Accumulation Geology (No. PAG-202403), Key Laboratory of Tectonics and Petroleum Resources (China University of Geosciences), Ministry of Education (No. TPR-2021-12), Jiangxi Engineering Laboratory on Radioactive Geoscience and Big Data Technology, East China University of Technology (No. JELRGBDT202201), Hubei Subsurface Multi-scale Imaging Key Laboratory, China University of Geosciences (No. SMIL-2021-03), the Academic and Technical Leader Training Program of Jiangxi Province (No. 20204BCJ23027), and the Postgraduate Innovation Fund from the East China University of Technology (No. DHYC-202432 and DHYC-202317).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data underlying this article will be shared upon reasonable request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CAL: Borehole diameter
CE-SGAN: Classification enhancement semi-supervised generative adversarial network
CNL: Neutron
CNN: Convolutional neural network
DEN: Density
DT: Decision tree
GOA: Grasshopper optimization algorithm
GR: Natural gamma
GWO: Grey wolf optimizer
IODP: International Ocean Discovery Program
LR: Logistic regression
ML: Machine learning
MLP: Multi-layer perceptron
PSO: Particle swarm optimization
PZQM: Permafrost zones of the Qilian Mountains
RBF: Radial basis function
RF: Random forest
RT: Resistivity
SVM: Support vector machine
VP: Sonic velocity

References

  1. Xing, L.C.; Gao, L.; Ma, Z.S.; Lao, L.Y.; Wei, W.; Han, W.F.; Wang, B.; Gao, M.Z.; Xing, D.H.; Ge, X.M. A permittivity-conductivity joint model for hydrate saturation quantification in clayey sediments based on measurements of time domain reflectometry. Geoenergy Sci. Eng. 2024, 237, 212798.
  2. Zuo, Y.H.; Wang, Q.F.; Lu, Z.Q.; Chen, H. Tectono-thermal evolution and gas source potential for natural gas hydrates in the Qilian Mountain permafrost, China. J. Nat. Gas Sci. Eng. 2016, 36, 32–41.
  3. Lu, Z.Q.; Zhu, Y.H.; Liu, H.; Zhang, Y.Q.; Jin, C.S.; Huang, X.; Wang, P.K. Gas source for gas hydrate and its significance in the Qilian Mountain permafrost, Qinghai. Mar. Petrol. Geol. 2013, 43, 341–348.
  4. Riley, D.; Schaafsma, M.; Marin-Moreno, H.; Minshull, T.A. A social, environmental and economic evaluation protocol for potential gas hydrate exploitation projects. Appl. Energy 2020, 263, 114651.
  5. Hu, X.D.; Zou, C.C.; Qin, Z.; Yuan, H.; Song, G.; Xiao, K. Numerical simulation of resistivity and saturation estimation of pore-type gas hydrate reservoirs in the permafrost region of the Qilian Mountains. J. Geophys. Eng. 2024, 21, 599–613.
  6. Alzubaidi, F.; Mostaghimi, P.; Swietojanski, P.; Clark, S.R.; Armstrong, R.T. Automated lithology classification from drill core images using convolutional neural networks. J. Petrol. Sci. Eng. 2021, 197, 107933.
  7. Li, Y.; Xin, X.; Xu, T.; Zhu, H.; Wang, H.; Chen, Q.; Yang, B. Comparative analysis on the evolution of seepage parameters in methane hydrate production under depressurization of clayey silt reservoir and sandy reservoir. J. Mar. Sci. Eng. 2022, 10, 653.
  8. Bu, Q.T.; Xing, T.J.; Li, C.F.; Zhao, J.H.; Liu, C.L.; Wang, Z.H.; Zhao, W.G.; Kang, J.L.; Meng, Q.G.; Hu, G.W. Effect of hydrate microscopic distribution on acoustic characteristics during hydrate dissociation: An insight from combined acoustic-CT detection study. J. Mar. Sci. Eng. 2022, 10, 1089.
  9. Xing, L.C.; Wang, S.; Wu, X.F.; Lao, L.Y.; Salehi, S.M.; Wei, W.; Han, W.F.; Xing, D.H.; Ge, X.M. Evaluating hydraulic permeability of hydrate-bearing porous media based on broadband electrical parameters: A numerical study. Gas Sci. Eng. 2025, 134, 205526.
  10. Bei, K.Q.; Xu, T.F.; Shang, S.H.; Wei, Z.L.; Yuan, Y.L.; Tian, H.L. Numerical modeling of gas migration and hydrate formation in heterogeneous marine sediments. J. Mar. Sci. Eng. 2019, 7, 348.
  11. Zhu, L.Q.; Wu, S.G.; Zhou, X.Q.; Cai, J.C. Saturation evaluation for fine-grained sediments. Geosci. Front. 2023, 14, 101540.
  12. Liu, X.Y.; Li, J.Y.; Chen, X.H.; Zhou, L.; Guo, K.K. Bayesian discriminant analysis of lithofacies integrate the Fisher transformation and the kernel function estimation. Interpretation 2017, 5, SE1–SE10.
  13. Kianoush, P.; Mohammadi, G.; Hosseini, S.A.; Khah, N.K.F.; Afzal, P. Inversion of seismic data to modeling the Interval Velocity in an Oilfield of SW Iran. Results Geophys. Sci. 2023, 13, 100051.
  14. Gu, Y.F.; Zhang, D.Y.; Lin, Y.B.; Ruan, J.F.; Bao, Z.D. Data-driven lithology prediction for tight sandstone reservoirs based on new ensemble learning of conventional logs: A demonstration of a Yanchang member, Ordos Basin. J. Petrol. Sci. Eng. 2021, 207, 109292.
  15. Tian, Y.K.; Zhou, H.; Yuan, S.Y. Lithologic discrimination method based on Markov random-field. Chin. J. Geophys. 2013, 56, 1360–1368. (In Chinese)
  16. Grana, D.; Lang, X.Z.; Wu, W.T. Statistical facies classification from multiple seismic attributes: Comparison between Bayesian classification and expectation–maximization method and application in petrophysical inversion. Geophys. Prospect. 2017, 65, 544–562.
  17. Zhang, J.; Nie, X.; Xiao, S.Y.; Zhang, C.; Zhang, C.M.; Zhang, Z.S. Generating porosity spectrum of carbonate reservoirs using ultrasonic imaging log. Acta Geophys. 2018, 66, 191–201.
  18. Collett, T.S. Detailed evaluation of gas hydrate reservoir properties using JAPEX /JNOC /GSC Mallik 2L-38 gas hydrate research well downhole well-log displays. Bull. Geol. Surv. Can. 1999, 544, 295–311.
  19. Boswell, R.; Collett, T.; McConnell, D.; Frye, M.; Shedd, B.; Mrozewski, S.; Guerin, G.; Cook, A.; Godfriaux, P.; Dufrene, R.; et al. Joint Industry Project Leg II discovers rich gas hydrate accumulations in sand reservoirs in the Gulf of Mexico. Nat. Gas Oil 2009, 304, 285–4541.
  20. Abdollahian, A.; Wang, H.; Liu, H.; Zheng, X.M. Transfer learning for acoustic cement bond evaluation: An image classification approach using acoustic variable Density log. Geoenergy Sci. Eng. 2024, 239, 212960.
  21. Mandal, P.P.; Rezaee, R. Facies classification with different machine learning algorithm-An efficient artificial intelligence technique for improved classification. ASEG Ext. Abstr. 2019, 2019, 1–6.
  22. Kumar, T.; Seelam, N.K.; Rao, G.S. Lithology prediction from well log data using machine learning techniques: A case study from Talcher coalfield, Eastern India. J. Appl. Geophys. 2022, 199, 104605.
  23. Ullah, J.; Li, H.; Ashraf, U.; Ehsan, M.; Asad, M. A multidisciplinary approach to facies evaluation at regional level using well log analysis, machine learning, and statistical methods. Geomech. Geophys. Geo-Energy Geo-Resour. 2023, 9, 152.
  24. Rathore, P.W.S.; Hussain, M.; Malik, M.B.; Amin, Y. Well log analysis and comparison of supervised machine learning algorithms for lithofacies identification in pab formation, lower indus basin. J. Appl. Geophys. 2023, 219, 105199.
  25. Ameur-Zaimeche, O.; Zeddouri, A.; Heddam, S.; Kechiched, R. Lithofacies prediction in non-cored wells from the Sif Fatima oil field (Berkine basin, southern Algeria): A comparative study of multilayer perceptron neural network and cluster analysis-based approaches. J. Afr. Earth Sci. 2020, 166, 103826.
  26. Eftekhari, S.H.; Memariani, M.; Maleki, Z.; Aleali, M.; Kianoush, P.; Shirazy, A.; Shirazi, A.; Pour, A.B. Employing Statistical Algorithms and Clustering Techniques to Assess Lithological Facies for Identifying Optimal Reservoir Rocks: A Case Study of the Mansouri Oilfields, SW Iran. Minerals 2024, 14, 233.
  27. Gao, P.Y.; Jiang, C.; Huang, Q.; Cai, H.; Luo, Z.F.; Liu, M.J. Fluvial facies reservoir productivity prediction method based on principal component analysis and artificial neural network. Petroleum 2016, 2, 49–53.
  28. Imamverdiyev, Y.; Sukhostat, L. Lithological facies classification using deep convolutional neural network. J. Petrol. Sci. Eng. 2019, 174, 216–228.
  29. Tian, M.; More, H.; Xu, H.M. Inversion of well logs into lithology classes accounting for spatial dependencies by using hidden markov models and recurrent neural networks. J. Petrol. Sci. Eng. 2021, 196, 107598.
  30. Wang, J.; Cao, J.X.; Yuan, S. Deep learning reservoir porosity prediction method based on a spatiotemporal convolution bi-directional long short-term memory neural network model. Geomech. Energy Environ. 2022, 32, 100282.
  31. Eftekhari, S.H.; Memariani, M.; Maleki, Z.; Aleali, M.; Kianoush, P. Electrical facies of the Asmari Formation in the Mansouri oilfield, an application of multi-resolution graph-based and artificial neural network clustering methods. Sci. Rep. 2024, 14, 5198.
  32. Delavar, M.R. Hybrid machine learning approaches for classification and detection of fractures in carbonate reservoir. J. Petrol. Sci. Eng. 2022, 208, 109327.
  33. Bressan, T.S.; Souza, M.K.D.; Girelli, T.J.; Junior, F.C. Evaluation of machine learning methods for lithology classification using geophysical data. Comput. Geosci. 2020, 139, 104475.
  34. Zhao, F.D.; Yang, Y.; Kang, J.W.; Li, X.S. CE-SGAN: Classification enhancement semi-supervised generative adversarial network for lithology identification. Geoenergy Sci. Eng. 2023, 223, 211562.
  35. Li, B.; Sun, Y.H.; Guo, W.; Shan, X.L.; Wang, P.K.; Pang, S.J.; Jia, R.; Zhang, G.B. The mechanism and verification analysis of permafrost-associated gas hydrate formation in the Qilian Mountain, Northwest China. Mar. Petrol. Geol. 2017, 86, 787–797.
  36. Dong, H.M.; Sun, J.M.; Arif, M.; Liu, X.F.; Golsanami, N.; Yan, W.C.; Cui, L.K.; Zhang, Y.H. A method for well logging identification and evaluation of low-resistivity gas hydrate layers. Pure Appl. Geophys. 2022, 179, 3357–3376.
  37. Sun, Z.J.; Yang, Z.B.; Mei, H.; Qin, A.H.; Zhang, F.G.; Zhou, Y.L.; Zhang, S.Y.; Mei, B.W. Geochemical characteristics of the shallow soil above the Muli gas hydrate reservoir in the permafrost region of the Qilian Mountains, China. J. Geochem. Explor. 2014, 139, 160–169.
  38. Fang, H.; Xu, M.C.; Lin, Z.Z.; Zhong, Q.; Bai, D.W.; Liu, J.X.; Pei, F.G.; He, M.X. Geophysical characteristics of gas hydrate in the Muli area, Qinghai province. J. Nat. Gas Sci. Eng. 2017, 37, 539–550.
  39. Dong, H.M.; Sun, J.M.; Arif, M.; Zhang, Y.H.; Yan, W.C.; Iglauer, S.; Golsanami, N. Digital rock-based investigation of conductivity mechanism in low-resistivity gas hydrate reservoirs: Insights from the Muli area’s gas hydrates. J. Petrol. Sci. Eng. 2022, 218, 110988.
  40. Zhang, F.G.; Yang, Z.B.; Zhou, Y.L.; Zhang, S.Y.; Yu, L.S. Accumulation mechanism of natural gas hydrate in the Qilian Mountain permafrost, Qinghai, China. Front. Energy Res. 2022, 10, 1006421.
  41. Qu, L.; Zou, C.C.; Lu, Z.Q.; Yu, C.Q.; Li, N.; Zhu, J.C.; Zhang, X.H.; Yue, X.Y.; Gao, M.Z. Elastic-wave velocity characterization of gas hydrate-bearing fractured reservoirs in a permafrost area of the Qilian Mountain, Northwest China. Mar. Petrol. Geol. 2017, 88, 1047–1058.
  42. Hu, X.D.; Zou, C.C.; Lu, Z.Q.; Yu, C.Q.; Peng, C.; Li, W.; Tang, Y.Y.; Liu, A.Q.; Kouamelan, K.S. Evaluation of gas hydrate saturation by effective medium theory in shaly sands: A case study from the Qilian Mountain permafrost, China. J. Geophys. Eng. 2019, 16, 215–228.
  43. Tian, D.M.; Yang, S.X.; Gong, Y.H.; Geng, M.H.; Li, Y.H.; Hu, G. A comparative study of machine learning methods for gas hydrate identification. Geoenergy Sci. Eng. 2023, 223, 211564.
  44. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674.
  45. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  46. Juna, A.; Umer, M.; Sadiq, S.; Karamti, H.; Eshmawi, A.A.; Mohamed, A.; Ashraf, I. Water quality prediction using KNN imputer and multilayer perceptron. Water 2022, 14, 2592.
  47. Murtagh, F. Multilayer perceptrons for classification and regression. Neurocomputing 1991, 2, 183–197.
  48. Al-Abadi, A.M.; Al-Najar, N.A. Comparative assessment of bivariate, multivariate, and machine learning models for mapping flood proneness. Nat. Hazards 2020, 100, 461–491.
Figure 1. A map showing the locations of hydrate boreholes in the PZQM (modified from Hu [5]).
Figure 2. Research flowchart.
Figure 3. Structural diagram of DT.
Figure 4. Random forest structure diagram.
Figure 5. MLP structural schematic diagram.
Figure 6. The logging curve lithology map of the gas hydrate boreholes in the PZQM. The selected wells are represented by logging data based on geological core information, including GR, CAL, RT, CNL, DEN, and VP, corresponding to their respective lithologies. (a) DK10–16, (b) DK13–11, and (c) DK12–13.
Figure 7. Heatmap of correlation between logging curve features.
Figure 8. Histogram of the sample counts for the lithologies included in the dataset.
Figure 9. Histogram of logging data distribution for different lithologies: (a) GR, (b) CAL, (c) RT, (d) CNL, (e) DEN, and (f) VP.
Figure 10. Cross-plots showing the relationships between all features (GR, CAL, RT, CNL, DEN, and VP) collected from DK10−16, DK13−11, and DK12−13. Color key: bright yellow, siltstone; saffron orange, mudstone; burnt orange, oil shale; dark brown, coal; dark blue, sandstone; cerulean blue, silty mudstone; light blue, argillaceous siltstone.
Figure 11. The confusion matrix plots for the four classification models studied in the paper are presented for the test set. The color variations show the number and intensity of correctly classified and misclassified instances: (a) DT, (b) RF, (c) MLP, and (d) LR.
Figure 12. The logging plots of the blind-test well sections, which include GR, CAL, RT, CNL, DEN, and VP, with a comparison of the lithology classifications from all four ML models against the actual lithology from geological core information: (a) 518.6–551.6 m of DK10–16 and (b) 119.7–146.8 m of DK13–11. The shaded area represents the depth range where the oil shale exhibits a dilation phenomenon.
Table 1. Lithological composition of hydrate boreholes in Sanlutian field, PZQM.
Lithology | Label | Number | Sample Size
Siltstone | SiS | 1 | 3755
Mudstone | MS | 2 | 339
Oil shale | SH | 3 | 1553
Coal | C | 4 | 306
Sandstone | SS | 5 | 2147
Silty mudstone | SiMS | 6 | 995
Argillaceous siltstone | ArSS | 7 | 119
Table 2. Optimal hyperparameters of all models.
Model | Hyperparameter | Value
DT | Determines the criteria for feature splitting during tree building (criterion) | entropy
DT | Minimum number of samples required to split an internal node (min sample split) | 2
DT | Max depth | 18
DT | Class weight | balanced
RFC | Number of features to consider when looking for the best split (max features) | 4
RFC | Minimum number of samples required to split an internal node (min sample split) | 3
RFC | Number of trees in the forest (n-estimators) | 160
RFC | Max depth | 27
RFC | Class weight | balanced
RFC | Determines the criteria for feature splitting during tree building (criterion) | entropy
MLP | Activation function (activation) | Relu
MLP | Regularization parameters (alpha) | 0.001
MLP | Maximum number of iterations of a neural network (max iter) | 200
MLP | Initialization learning rate (learning rate init) | 0.001
MLP | Optimization algorithms (solver) | Adam
MLP | Number of samples required for each training iteration (batch size) | 15
MLP | Number of hidden layers and number of neurons in a layer (hidden layer sizes) | 2, (300,200)
LR | Optimization algorithms (solver) | Liblinear
LR | Class weight | balanced
LR | Penalty (c) | 1
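The parameter names in Table 2 resemble scikit-learn estimator arguments, so one possible way to reproduce the configuration is sketched below. That the models were built with scikit-learn is an assumption, as is the mapping of "Penalty (c)" to the `C` argument; `TABLE2_PARAMS` and `build_models` are illustrative names, and every argument not listed in Table 2 is left at the library default.

```python
# Hyperparameters transcribed from Table 2.
# Keys follow scikit-learn argument names (an assumption, not stated in the table).
TABLE2_PARAMS = {
    "DT": {
        "criterion": "entropy",
        "min_samples_split": 2,
        "max_depth": 18,
        "class_weight": "balanced",
    },
    "RFC": {
        "max_features": 4,
        "min_samples_split": 3,
        "n_estimators": 160,
        "max_depth": 27,
        "class_weight": "balanced",
        "criterion": "entropy",
    },
    "MLP": {
        "activation": "relu",
        "alpha": 0.001,
        "max_iter": 200,
        "learning_rate_init": 0.001,
        "solver": "adam",
        "batch_size": 15,
        "hidden_layer_sizes": (300, 200),  # two hidden layers
    },
    "LR": {
        "solver": "liblinear",
        "class_weight": "balanced",
        "C": 1.0,  # "Penalty (c)" in Table 2, assumed to map to C
    },
}


def build_models(params=TABLE2_PARAMS):
    """Instantiate one estimator per model (hypothetical helper; needs scikit-learn)."""
    # Imported lazily so the settings above can be inspected without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    classes = {
        "DT": DecisionTreeClassifier,
        "RFC": RandomForestClassifier,
        "MLP": MLPClassifier,
        "LR": LogisticRegression,
    }
    return {name: classes[name](**kwargs) for name, kwargs in params.items()}
```

Calling `build_models()` would return unfitted estimators keyed by the model labels used in the tables; the training data and any preprocessing are outside the scope of this sketch.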
Table 3. The precision, recall, and F1-score for various lithologies in the test set using the DT model.
Lithology | SiS | MS | SH | C | SS | SiMS | ArSS
Precision | 0.903 | 0.713 | 0.908 | 1.000 | 0.907 | 0.871 | 0.938
Recall | 0.897 | 0.755 | 0.948 | 0.957 | 0.908 | 0.836 | 0.833
F1-Score | 0.900 | 0.733 | 0.928 | 0.978 | 0.908 | 0.853 | 0.882
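As a consistency check on Table 3, each F1-score should equal the harmonic mean of the precision and recall reported for the same lithology. The minimal sketch below recomputes it from the rounded table values; the variable and function names are illustrative, and differences in the third decimal place are expected from rounding.

```python
# Per-class DT metrics transcribed from Table 3.
labels = ["SiS", "MS", "SH", "C", "SS", "SiMS", "ArSS"]
precision = [0.903, 0.713, 0.908, 1.000, 0.907, 0.871, 0.938]
recall = [0.897, 0.755, 0.948, 0.957, 0.908, 0.836, 0.833]
reported_f1 = [0.900, 0.733, 0.928, 0.978, 0.908, 0.853, 0.882]


def f1_from(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


recomputed_f1 = [f1_from(p, r) for p, r in zip(precision, recall)]

# Print the recomputed value next to the reported one for each lithology.
for name, new, old in zip(labels, recomputed_f1, reported_f1):
    print(f"{name:5s} recomputed={new:.3f} reported={old:.3f}")
```

The same check applies unchanged to the per-class tables of the other three models.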
Table 4. The precision, recall, and F1-score for various lithologies in the test set using the RFC model.
Lithology | SiS | MS | SH | C | SS | SiMS | ArSS
Precision | 0.921 | 0.902 | 0.972 | 1.000 | 0.953 | 0.937 | 0.968
Recall | 0.958 | 0.725 | 0.981 | 0.967 | 0.941 | 0.893 | 0.833
F1-Score | 0.939 | 0.804 | 0.976 | 0.983 | 0.947 | 0.914 | 0.896
Table 5. The precision, recall, and F1-score for various lithologies in the test set using the MLP model.
Lithology | SiS | MS | SH | C | SS | SiMS | ArSS
Precision | 0.884 | 0.896 | 0.980 | 1.000 | 0.963 | 0.886 | 0.878
Recall | 0.952 | 0.676 | 0.968 | 0.989 | 0.899 | 0.836 | 1.000
F1-Score | 0.917 | 0.771 | 0.974 | 0.995 | 0.930 | 0.860 | 0.935
Table 6. The precision, recall, and F1-score for various lithologies in the test set using the LR model.
Lithology | SiS | MS | SH | C | SS | SiMS | ArSS
Precision | 0.766 | 0.287 | 0.729 | 0.899 | 0.840 | 0.453 | 0.255
Recall | 0.660 | 0.451 | 0.773 | 0.967 | 0.871 | 0.359 | 0.972
F1-Score | 0.709 | 0.351 | 0.750 | 0.932 | 0.855 | 0.401 | 0.405
Table 7. The precision, recall, F1-score, and Jaccard coefficient for all models in the test set.
Model Type | Jaccard Index | Precision | Recall | F1-Score
DT | 0.814 | 0.898 | 0.898 | 0.898
RFC | 0.889 | 0.941 | 0.941 | 0.940
MLP | 0.854 | 0.923 | 0.922 | 0.921
LR | 0.541 | 0.723 | 0.702 | 0.707
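The overall scores in Table 7 can be related to the per-class tables through a sample-weighted average. The sketch below does this for the DT model, using the per-class DT precision and recall from Table 3 and the sample sizes from Table 1 as weights. Two assumptions are made: that the test set has the same class proportions as the full dataset, and that the overall scores are support-weighted averages; neither is stated in the tables themselves. The names are illustrative.

```python
# Class sample sizes from Table 1, used here as stand-in weights
# (assumes the test split preserves the class proportions).
sample_size = {"SiS": 3755, "MS": 339, "SH": 1553, "C": 306,
               "SS": 2147, "SiMS": 995, "ArSS": 119}

# Per-class DT scores from Table 3.
dt_precision = {"SiS": 0.903, "MS": 0.713, "SH": 0.908, "C": 1.000,
                "SS": 0.907, "SiMS": 0.871, "ArSS": 0.938}
dt_recall = {"SiS": 0.897, "MS": 0.755, "SH": 0.948, "C": 0.957,
             "SS": 0.908, "SiMS": 0.836, "ArSS": 0.833}


def weighted_average(scores, weights):
    """Average the per-class scores, weighting each class by its sample count."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


dt_weighted_precision = weighted_average(dt_precision, sample_size)
dt_weighted_recall = weighted_average(dt_recall, sample_size)

print(f"DT weighted precision: {dt_weighted_precision:.3f}")
print(f"DT weighted recall:    {dt_weighted_recall:.3f}")
```

Under these assumptions the results fall within rounding distance of the DT row of Table 7, which is consistent with, but does not confirm, a support-weighted averaging scheme.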
Table 8. Model feature importance report.
DT Feature | DT Importance | RF Feature | RF Importance | MLP Feature | MLP Importance | LR Feature | LR Importance
DEN | 0.254 | DEN | 0.238 | CNL | 0.325 | DEN | 1.000
CAL | 0.248 | CAL | 0.230 | GR | 0.292 | GR | 0.693
CNL | 0.176 | RT | 0.180 | CAL | 0.279 | CAL | 0.203
GR | 0.153 | CNL | 0.176 | RT | 0.246 | RT | 0.164
RT | 0.139 | GR | 0.146 | DEN | 0.179 | CNL | 0.056
VP | 0.030 | VP | 0.030 | VP | 0.117 | VP | 0.009
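The four columns of Table 8 do not appear to share a common scale (the DT and RF values each sum to 1.000, whereas the LR values peak at 1.000), so the raw numbers are hard to compare across models. One scale-free reading, sketched below as an illustration rather than as part of the original analysis, is to convert each column to ranks and average the ranks per feature. The names `importances`, `ranks`, and `mean_rank` are hypothetical.

```python
# Feature importances transcribed from Table 8.
importances = {
    "DT": {"DEN": 0.254, "CAL": 0.248, "CNL": 0.176, "GR": 0.153, "RT": 0.139, "VP": 0.030},
    "RF": {"DEN": 0.238, "CAL": 0.230, "RT": 0.180, "CNL": 0.176, "GR": 0.146, "VP": 0.030},
    "MLP": {"CNL": 0.325, "GR": 0.292, "CAL": 0.279, "RT": 0.246, "DEN": 0.179, "VP": 0.117},
    "LR": {"DEN": 1.000, "GR": 0.693, "CAL": 0.203, "RT": 0.164, "CNL": 0.056, "VP": 0.009},
}


def ranks(scores):
    """Map each feature to its rank within one model (1 = most important)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {feature: position for position, feature in enumerate(ordered, start=1)}


per_model_rank = {model: ranks(scores) for model, scores in importances.items()}

# Average rank of each feature across the four models.
features = list(importances["DT"])
mean_rank = {
    feature: sum(per_model_rank[model][feature] for model in per_model_rank) / len(per_model_rank)
    for feature in features
}

# List features from most to least important by mean rank.
for feature in sorted(mean_rank, key=mean_rank.get):
    print(f"{feature}: mean rank {mean_rank[feature]:.2f}")
```

Ranks discard the size of the gaps between features, so this view complements the table rather than replacing it.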
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Hu, X.; Song, G.; Wang, C.; Xiao, K.; Yuan, H.; Leng, W.; Wei, Y. An Application Study of Machine Learning Methods for Lithological Classification Based on Logging Data in the Permafrost Zones of the Qilian Mountains. Processes 2025, 13, 1475. https://doi.org/10.3390/pr13051475
