Real-Time Intelligent Recognition and Precise Drilling in Strongly Heterogeneous Formations Based on Multi-Parameter Logging While Drilling and Drilling Engineering

Zhao, Aosai; Yu, Yang; Wang, Bin; Liu, Yewen; Liu, Jingyue; Fu, Xubiao; Zheng, Wenhao; Tian, Fei

doi:10.3390/app15105536


by Aosai Zhao ^1,2,3, Yang Yu ^4, Bin Wang ^4, Yewen Liu ^4, Jingyue Liu ^1,2,3, Xubiao Fu ^1,2,3, Wenhao Zheng ^1,2,3,5 and Fei Tian ^1,2,3,*

^1 CAS Engineering Laboratory for Deep Resources Equipment and Technology, Institute of Geology and Geophysics, Beijing 100029, China

^2 Innovation Academy for Earth Science, Chinese Academy of Sciences, Beijing 100029, China

^3 College of Earth and Planetary Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

^4 Sinopec Shengli Petroleum Engineering Corporation, Dongying 257000, China

^5 Department of Earth Science and Engineering, Imperial College London, London SW7 2BP, UK

^* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(10), 5536; https://doi.org/10.3390/app15105536

Submission received: 14 April 2025 / Revised: 6 May 2025 / Accepted: 12 May 2025 / Published: 15 May 2025

(This article belongs to the Special Issue Advances in Reservoir Geology and Exploration and Exploitation)


Abstract

Real-time, high-precision recognition of strongly heterogeneous formations during directional drilling is an engineering challenge that requires addressing two issues: sparse geological lithology labels and lithology identification from multi-source logging-while-drilling (LWD) data. This paper proposes a real-time intelligent recognition method for strongly heterogeneous formations based on multi-parameter LWD and drilling engineering data, which can effectively guide directional drilling operations. Traditional supervised learning methods rely heavily on extensive lithology labels and rich wireline logging data. However, in LWD applications, challenges such as limited sample sizes and stringent real-time requirements make it difficult to achieve the accuracy needed for effective geosteering in strongly heterogeneous reservoirs, thereby impacting the reservoir penetration rate. In this study, we comprehensively utilize LWD parameters (six types, including natural gamma ray and electrical resistivity) and drilling engineering parameters (four types, including rate of penetration and weight on bit) from offset wells. The UMAP algorithm is employed for nonlinear dimensionality reduction, which not only integrates the dynamic response characteristics of drilling parameters but also preserves the sensitivity of logging data to lithological variations. The K-means clustering algorithm is employed to extract the deep geo-engineering characteristics from multi-source LWD data, thereby constructing a lithology label library and categorizing the training and testing datasets. The optimized CatBoost machine learning model is subsequently utilized for lithology classification, enabling real-time and high-precision geological evaluation during directional drilling. In the Hugin Formation of the Volve field in the Norwegian North Sea, the application of UMAP demonstrates superior data separability compared with PCA and t-SNE, effectively distinguishing thin reservoirs with strong heterogeneity. The CatBoost model achieves a balanced accuracy of 92.7% and an F1-score of 89.3% in six lithology classifications. This approach delivers high-precision geo-engineering decision support for the real-time control of horizontal well trajectories, which holds significant implications for the precision drilling of strongly heterogeneous reservoirs.

Keywords:

heterogeneous formations; lithology recognition; Logging While Drilling; precise drilling; CatBoost

1. Introduction

With the ongoing advancement in the exploration and development of oil and gas resources, geological targets have transitioned from traditional structural and integral reservoirs to complex reservoirs characterized by intricate structures, significant stratigraphic heterogeneity, and thin effective reservoir layers. This geological complexity heightens the challenges in reservoir prediction and introduces greater technical challenges for drilling operations. Consequently, this evolution underscores the growing need for advanced intelligent, real-time formation prediction and geosteering technologies to ensure precise well placement within structurally and lithologically complex environments [1,2,3]. Logging While Drilling (LWD) has emerged as a pivotal technology in contemporary geophysical workflows, offering real-time petrophysical measurements during drilling operations. These data enable timely updates to geological models and enhance decision making for directional drilling by providing critical insights [4,5,6,7,8]. However, a significant challenge in this context remains the real-time identification of reservoir lithology, a crucial component for achieving effective model correction and precise drilling trajectory adjustment [9,10,11]. While traditional empirical methods frequently fail to adequately address the complexity and heterogeneity of subsurface formations, recent advancements have integrated machine learning techniques that leverage multi-source LWD and drilling engineering data [12,13,14,15,16]. These data-driven models have demonstrated exceptional capabilities for real-time lithology classification, thus significantly improving formation interpretation and geosteering accuracy in complex geological settings [17,18,19].

Compared with conventional wireline logging techniques, Logging While Drilling (LWD) provides substantial advantages for subsurface characterization [20,21]. LWD tools enable real-time measurements during the drilling process, such as azimuthal gamma ray, azimuthal resistivity, photoelectric factor, nuclear magnetic resonance, acoustic velocity, neutron porosity, and formation density [22,23,24,25,26]. However, the quality and stability of these measurements are often compromised by downhole vibration phenomena, including stick-slip, bit-bounce, and whirl, which can shorten tool lifespan and degrade data quality. These vibration loads not only interfere with accurate signal acquisition but also increase the uncertainty in lithology interpretation, highlighting the importance of robust data cleaning, dimensionality reduction, and modeling strategies for reliable formation identification [27,28,29]. Machine learning techniques have demonstrated significant advantages over traditional methods for lithology identification using LWD data. In contrast with empirical or rule-based approaches, machine learning algorithms are capable of processing vast amounts of heterogeneous data, capturing intricate nonlinear relationships, and revealing parameter interactions that are typically challenging to model explicitly. Such capabilities result in more accurate and robust lithology recognition outcomes [30,31,32]. Zhong et al. (2020) applied XGBoost, artificial neural networks (ANNs), and Random Forest (RF) models to classify lithologies in coal-bearing formations using both drilling engineering and LWD data, combined with data balancing techniques such as NearMiss under-sampling (NROS) and SMOTE [33]. 
Measurement-while-drilling (MWD) variables such as rate of penetration (ROP), weight on bit (WOB), and other real-time surface parameters have been utilized to predict electrofacies using supervised classification algorithms that link surface measurements to subsurface lithologies. Sun et al. developed a bit-level lithology prediction model based on LWD parameters using several machine learning algorithms, including one-vs-one Support Vector Machines (OVO-SVMs), RF, neural networks, and XGBoost [34]. Xu et al. introduced a contrastive domain-difference adversarial autoencoder (CDD-AAE) framework, achieving balanced multi-lithology classification performance through extensive experimentation on LWD datasets [35]. Siddig et al. proposed a real-time lithology identification approach using functional networks, RF, and SVM models based on drilling parameters such as ROP, WOB, and torque [36]. Existing lithology identification studies predominantly utilize supervised algorithms such as XGBoost, SVM, and RF, which achieve strong performance in scenarios with sufficient labeled data. XGBoost excels in structured logging data classification through gradient-boosted trees, while SVM leverages kernel-based optimization for high-dimensional feature separation, and RF provides noise-resistant interpretations via ensemble decision trees. However, these methods face inherent limitations in real-time LWD applications, particularly under conditions of label scarcity and dynamic downhole heterogeneity.

To address these challenges, an increasing number of studies have focused on semi-supervised learning approaches [37,38]. Semi-supervised learning extracts patterns and structures from data using a limited set of labeled samples, thereby mitigating the label dependency inherent in traditional supervised learning methods [39,40]. Clustering algorithms, such as K-means and DBSCAN, effectively manage unlabeled data by capturing the intrinsic distribution and structure of the dataset, thus uncovering potential lithological information [41]. For example, Singh et al. (2020) introduced a clustering-based lithology identification technique that achieved automatic classification of complex formation lithologies through the cluster analysis of logging data, eliminating the need for costly and time-intensive core sampling [42]. By combining clustering outcomes with auxiliary calibration, it is possible to integrate semi-quantitative interpretation of geophysical data with qualitative geological information. Nevertheless, numerous challenges remain in applying multi-source logging-while-drilling data within complex geological settings. Achieving efficient semi-supervised learning in multi-source logging-while-drilling data and effectively extracting the dynamic characteristics of reservoir geology and engineering during directional drilling continues to be a critical area for future research.

LWD and drilling engineering data contain abundant information about formation characteristics but are inherently high dimensional. By integrating gamma ray, resistivity, and other petrophysical responses with dynamic drilling parameters—such as rate of penetration (ROP), weight on bit (WOB), and torque, which reflect rock-breaking efficiency—a composite feature space can be constructed that incorporates both geomechanical and petrophysical properties. To effectively extract meaningful patterns from this high-dimensional, multi-source dataset, dimensionality reduction techniques are essential. Both linear and nonlinear dimensionality reduction algorithms are widely employed in geoscientific data analysis. Among linear methods, Principal Component Analysis (PCA) has been extensively applied to well logging data to reduce the dimensionality of highly correlated lithological variables. PCA transforms the data into a set of uncorrelated principal components, which can then be directly or indirectly used as input for lithology identification models [43]. Nonlinear methods, such as t-Distributed Stochastic Neighbor Embedding (t-SNE), have also been utilized to explore spatial distributions within geological data and to visualize complex subsurface structures [44]. Uniform Manifold Approximation and Projection (UMAP), a more recent nonlinear technique, offers distinct advantages over earlier algorithms. UMAP preserves both local and global data structure, captures nonlinear relationships, and enhances class separability in feature space. It is particularly sensitive to subtle variations in drilling dynamics, such as those observed at the interfaces of thin reservoirs. From a data visualization perspective, UMAP enables clearer clustering and differentiation of lithological units by integrating diverse LWD data sources. 
This capability is highly valuable for real-time lithology identification in strongly heterogeneous formations and supports improved geosteering accuracy in thin reservoirs encountered during horizontal drilling [45].

To satisfy the engineering requirements for the real-time and high-precision evaluation of strongly heterogeneous formations during directional drilling, this study develops an intelligent lithology identification framework leveraging Logging While Drilling (LWD) data and drilling engineering parameters. This approach effectively tackles critical challenges, including sparse lithology labels and the integration of multi-source data (Figure 1). In the proposed workflow, six logging parameters—including natural gamma ray, electrical resistivity, and neutron porosity—alongside four drilling engineering parameters, such as rate of penetration (ROP) and weight on bit (WOB), are first subjected to data cleaning and standardization. These inputs are then processed simultaneously through the Uniform Manifold Approximation and Projection (UMAP) algorithm, which performs nonlinear dimensionality reduction. This step enables fusion of the high-dimensional feature space, capturing the real-time dynamics of drilling parameters while preserving the lithological sensitivity inherent in the logging data. Next, the low-dimensional representations are subjected to unsupervised clustering using the K-means algorithm. This clustering step reveals correlations between log-derived petrophysical patterns and variations in drilling responses, capturing the characteristic co-occurrence of gamma-resistivity anomalies and abrupt changes in ROP and torque at interfaces of strongly heterogeneous formations. Finally, an optimized CatBoost (1.2.7) ensemble learning model is trained using the clustered data to establish a real-time lithology identification and precision drilling framework. This model integrates the complementary strengths of geological interpretation derived from logging curves with the engineering sensitivity of drilling dynamics, thereby enhancing classification accuracy and providing robust drilling guidance. 
The proposed methodology achieves efficient dimensionality reduction and semi-supervised clustering of multi-source LWD data by extracting latent lithological features from fused LWD-drilling parameters, thereby reducing dependency on labeled data. Combined with CatBoost’s ordered boosting and categorical encoding, the framework enhances computational efficiency for real-time streaming, enabling accurate formation characterization under complex geological conditions. Comparative evaluations of various machine learning models are performed, and the optimal algorithm combination is determined. When applied to field data, the proposed method exhibits enhanced lithology identification accuracy in heterogeneous formations, offering valuable geological and engineering insights for precise horizontal well trajectory control.

2. Study Area and Data

The Volve field is located on the Theta Vest structure within the Sleipner area of the Norwegian North Sea. The predominant tectonic elements in this region, extending from east to west, comprise the Gamma High, the Sleipner Terrace, and the intervening main fault zone (Figure 2a). Theta Vest, situated within the Gamma High domain, represents a dagger-shaped structural high that has undergone substantial tectonic deformation. This deformation is attributed to two major episodes of continental rifting that impacted the Sleipner area and the South Viking Graben (Figure 2b). The most prominent structural feature in the region is a fault system originating from the collapse of a continuous salt ridge during the Jurassic Period. This fault system exhibits particular complexity and extensive fracturing in the western part of the area due to significant regional extension, thereby introducing uncertainties regarding fault geometry and continuity. Hydrocarbons in the Sleipner area, including the Sleipner Vest and Volve structures, are contained within sand-rich strata of the Triassic, Jurassic, and Paleocene age. Reservoirs are typically encountered at depths ranging from approximately 2700 to 3100 m [46].

The pre-rift evolution of the Theta Vest structure commenced during the Permian–Triassic transition and was predominantly marked by continental deposition. Throughout the Triassic and into the Jurassic, the depositional environment gradually evolved from continental to marginal marine settings, culminating in fully marine facies. A second significant rifting phase transpired from the Middle Jurassic to the Early Cretaceous, characterized by deltaic clastic sedimentation. The Upper Triassic to Lower Jurassic interval represents the inner rift sequence of this extensional phase (Figure 2c). During the Middle Jurassic, extensive peat accumulation occurred on the coastal plains, which constituted the primary source material for the Hugin Formation. Concurrently, significant volumes of conglomerates and gravel were deposited along the western margin of the basin, forming multiple stacked sequences characterized by transgressive degradation and vertical accretion. These deposits resulted in the formation of sandstone bodies with thicknesses reaching up to 200 m. The lithology is predominantly composed of fine to coarse-grained, poorly sorted subfeldsarenite sandstones. These units exhibit excellent reservoir quality, featuring favorable porosity and permeability that are highly conducive to hydrocarbon storage and flow [47,48].

3. Methodology

3.1. Data Preprocessing Methods

An improved Z-score threshold method was used to identify outliers caused by drilling vibration or instrument noise during directional drilling. By calculating the statistical distribution of the while-drilling parameters, the method quantifies how far each data point deviates from the normal response range:

z = \frac{x - μ}{σ}    (1)

where x is the measured value, z is the standardized anomaly score, and μ and σ are the mean and standard deviation of samples of the same lithology in the current well section, respectively. A data point is flagged as an outlier when it exceeds the dynamic threshold |z| > 3, and discrete noise is distinguished from real geological anomalies by cross-checking against the drilling condition log. This method eliminates mechanical interference signals while retaining the subtle variations at thin-reservoir lithological interfaces.
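As a minimal sketch of the thresholding in Eq. (1), the snippet below flags samples with |z| > 3 using a plain global Z-score. The per-lithology statistics, the "improved" dynamic threshold, and the cross-check against the drilling condition log are not reproduced here; the function name and the gamma-ray values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return indices of samples whose |z| exceeds the threshold (Eq. 1)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # constant signal: no sample can be flagged
    return [i for i, x in enumerate(values)
            if abs((x - mu) / sigma) > threshold]

# Hypothetical gamma-ray readings with one vibration-induced spike at the end
gamma_ray = [60.0 + 0.5 * (i % 5) for i in range(30)] + [250.0]
outlier_idx = zscore_outliers(gamma_ray)
print(outlier_idx)
```

In practice the flagged indices would be reviewed rather than dropped automatically, since a genuine thin-bed response can also produce a large |z|.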

According to the spatial stability principle of the lithology-log response relationship in the same sedimentary period, the instrument system errors and regional background field differences are eliminated through the standardization of logging curves so as to ensure that the logging values of the same lithology strata at different well locations have the same order of magnitude and distribution characteristics. Standard wells with typical lithology characteristics and excellent data quality are selected, and the mean and standard deviation of logging parameters of the target stratum are extracted as reference values, and the cross-well data standardization is realized by linear transformation, as follows:

V_{norm} = \frac{V_{raw} - μ_{raw}}{σ_{raw}} \cdot σ_{std} + μ_{std}    (2)

where V_{norm} represents the normalized curve value; V_{raw} is the original curve value; μ_{raw} and σ_{raw} represent the local mean and standard deviation of the well to be corrected, respectively; and μ_{std} and σ_{std} represent the mean and standard deviation of the corresponding curve of the standard well.
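The linear transformation in Eq. (2) can be illustrated with a short sketch. It assumes the reference-well statistics are already known; the function name, the density values, and the reference mean and standard deviation are hypothetical.

```python
from statistics import mean, pstdev

def standardize_to_reference(raw, mu_std, sigma_std):
    """Map a raw curve onto the reference-well distribution (Eq. 2)."""
    mu_raw = mean(raw)
    sigma_raw = pstdev(raw)
    if sigma_raw == 0:
        raise ValueError("raw curve has zero variance")
    return [(v - mu_raw) / sigma_raw * sigma_std + mu_std for v in raw]

# Hypothetical density readings from a well with a systematic offset
raw_density = [2.50, 2.55, 2.60, 2.65, 2.70]
normalized = standardize_to_reference(raw_density, mu_std=2.40, sigma_std=0.05)
print(normalized)
```

After the transformation, the corrected curve has the mean and standard deviation of the standard well, which is what makes values of the same lithology comparable across wells.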

In this study, we assume a stationary relationship between LWD parameters and engineering drilling parameters within the same geostratigraphic interval. Within each interval, the LWD tools and drilling mechanics exhibit relatively stable responses due to controlled drilling practices and limited interval lengths. Although dynamic effects such as thermal disturbance, borehole enlargement, and tool-rock interaction may introduce local non-stationarity, their impact is mitigated through the outlier detection and curve normalization methods described above. Therefore, the assumption of local stationarity within stratigraphic units serves as a practical and reasonable approximation for model training and interpretation.

3.2. Dimensionality Reduction Algorithm

(1): PCA

Principal Component Analysis (PCA) is an unsupervised multivariate statistical technique that reduces data dimensionality while preserving the main variation patterns. It transforms the original high-dimensional data into a new coordinate system composed of orthogonal principal components, which are ranked according to the amount of variance they capture [49]. This process simplifies the dataset, reduces redundancy, and helps extract essential features for tasks such as lithology recognition.

In lithology analysis, PCA can integrate multiple conventional logging parameters into a few principal components that retain most of the geological information. For example, the first two components can often preserve more than 90% of the original variance, providing an efficient and interpretable representation of formation characteristics (Figure 3).
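To make the notion of variance captured by principal components concrete, the following toy sketch computes the explained-variance ratios for just two features, using the closed-form eigenvalues of a 2x2 covariance matrix. It is an illustration under simplifying assumptions, not the workflow used in the study; the function name and the two correlated curves are hypothetical.

```python
import math

def explained_variance_ratio_2d(xs, ys):
    """Explained-variance ratios of the two principal components of 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# Two hypothetical, strongly correlated log curves
curve_a = [1.0, 2.0, 3.0, 4.0, 5.0]
curve_b = [2.1, 3.9, 6.2, 7.8, 10.1]
ratio_pc1, ratio_pc2 = explained_variance_ratio_2d(curve_a, curve_b)
print(ratio_pc1, ratio_pc2)
```

When the input curves are highly correlated, almost all variance falls on the first component, which is why a few components can summarize many logging parameters.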

(2): t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data in two or three dimensions [50]. Unlike PCA, which preserves global variance, t-SNE focuses on maintaining the local neighborhood structure of data points. It is particularly effective at revealing clusters, hidden patterns, or manifold structures that are difficult to detect in high-dimensional space.

The core idea is to convert the similarities between data points into probability distributions in both high- and low-dimensional spaces and then minimize the difference between these distributions using Kullback–Leibler (KL) divergence. This process results in a low-dimensional embedding where similar points remain close together (Figure 4).

The optimization objective of t-SNE is defined by minimizing the KL divergence between the high-dimensional and low-dimensional probability distributions, as follows:

C = \mathrm{KL}(P \| Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}    (3)

where p_{ij} and q_{ij} represent the similarity of point pairs in the high- and low-dimensional spaces, respectively. In lithology recognition, t-SNE serves as a powerful visualization tool to assess data separability among lithofacies, evaluate clustering effectiveness, and guide the design of classification models.
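The cost in Eq. (3) can be evaluated directly once the two similarity matrices are available. The sketch below only computes the KL divergence; it does not build p_{ij} and q_{ij} from distances or perform the gradient-based optimization. The matrices are hypothetical and each sums to 1.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) summed over all pairs (Eq. 3); terms with p_ij = 0 contribute 0."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row)
               if pij > 0)

# Hypothetical pairwise similarities for three points
P = [[0.0, 0.3, 0.2], [0.3, 0.0, 0.0], [0.2, 0.0, 0.0]]
Q_good = [[0.0, 0.28, 0.22], [0.28, 0.0, 0.0], [0.22, 0.0, 0.0]]
Q_poor = [[0.0, 0.1, 0.3], [0.1, 0.0, 0.1], [0.3, 0.1, 0.0]]

print(kl_divergence(P, Q_good), kl_divergence(P, Q_poor))
```

An embedding whose similarities stay close to the high-dimensional ones (Q_good) gives a smaller cost than one that distorts them (Q_poor).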

(3): UMAP

Uniform Manifold Approximation and Projection (UMAP) is a nonlinear dimensionality reduction technique designed to preserve both local and global data structures. It reduces high-dimensional data to low-dimensional space by modeling its manifold structure through a weighted nearest neighbor graph and then optimizing its layout to reflect the intrinsic geometry of the data [51].

UMAP operates in two main steps. First, it constructs a weighted graph in the high-dimensional space based on the local relationships between data points. Then, it maps this structure into a low-dimensional space using a cross-entropy loss function to maintain the topological similarity. Compared with t-SNE, UMAP tends to preserve more of the global structure, scales better with larger datasets, and offers faster computation (Figure 5).

The similarity (edge weight) between two high-dimensional data points x_{i} and x_{j} is computed as

ω(x_{i}, x_{j}) = \exp\left(\frac{-\max(0, d(x_{i}, x_{j}) - ρ_{i})}{σ_{i}}\right)    (4)

where d(x_{i}, x_{j}) is the distance between the points, ρ_{i} adjusts for local connectivity, and σ_{i} is a normalization factor derived from the distribution of neighbors around x_{i}.

In lithology recognition, UMAP enables effective visualization and clustering of complex geological logging data, helping to identify distinct lithological patterns and improving the interpretability of classification models.
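The edge weight in Eq. (4) is simple to evaluate for a single pair of points. The sketch below assumes ρ_{i} and σ_{i} are given; in UMAP, ρ_{i} is commonly taken as the distance from x_{i} to its nearest neighbor, and σ_{i} is found by a search over the neighborhood, which is not shown. The numeric values are hypothetical.

```python
import math

def umap_edge_weight(d_ij, rho_i, sigma_i):
    """Directed edge weight from x_i to x_j (Eq. 4)."""
    return math.exp(-max(0.0, d_ij - rho_i) / sigma_i)

rho_i, sigma_i = 0.5, 1.0          # hypothetical local parameters for x_i
distances = (0.5, 1.0, 2.0)        # hypothetical distances to three neighbors
weights = [umap_edge_weight(d, rho_i, sigma_i) for d in distances]
print(weights)
```

The neighbor at distance ρ_{i} receives weight 1, and the weight decays exponentially for more distant neighbors, so every point stays connected to at least one neighbor.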

3.3. Intelligent Recognition Algorithm

(1): K-means

K-means is a clustering method based on an iterative algorithm. Due to its high computational efficiency and fast processing speed, K-means is widely used in clustering and data vectorization tasks [52]. Suppose that N data points with p feature parameters are to be classified into k clusters. The clustering process is as follows:

(1) Select k initial cluster centers as the starting point of clustering.

(2) For each data point, calculate its distance from each of the k cluster centers, and assign the data point to the cluster C(l) whose center is closest to it. If the cluster assignment of a data point changes, the center of each cluster is recalculated.

d(x_{i}, x_{j}) = \left[\sum_{r=1}^{p} |x_{ir} - x_{jr}|^{2}\right]^{\frac{1}{2}}    (5)

C(l) = \arg\min_{1 \leq l \leq k} d(x_{i}, v_{l}), \quad i = 1, 2, \cdots, N    (6)

where d(x_{i}, x_{j}) is the Euclidean distance selected for this clustering, x_{i} is the i-th data point, x_{ir} is the r-th feature parameter of the i-th data point, C_{l} is the set of data points contained in class l, and v_{l} is the center of gravity (centroid) of class l.

(3) Repeat step (2) until the preset number of iterations is reached or the result becomes stable, at which point the calculation terminates (Figure 6).

Through this iterative process, K-means gradually adjusts the positions of the cluster centers until a stable state is reached. This partitions the data points into k clusters, where the data points within each cluster are highly similar and the data points in different clusters are less similar.
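The three steps above can be sketched as a plain K-means loop. This is a minimal illustration with caller-supplied initial centers, not the configuration used in the study; the function names and the two-dimensional points (standing in for a low-dimensional embedding) are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors (Eq. 5)."""
    return math.sqrt(sum((ar - br) ** 2 for ar, br in zip(a, b)))

def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center (Eq. 6), then recompute centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda l: euclidean(x, centers[l]))
            for x in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers no longer move
        labels = new_labels
        for l in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == l]
            if members:  # keep the previous center if a cluster is empty
                centers[l] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical 2-D embedding with two well-separated groups
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels, centers)
```

Because the result depends on the initial centers, practical implementations usually repeat the procedure from several initializations and keep the best partition.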

(2): CatBoost

Gradient Boosting Decision Tree (GBDT) is an ensemble learning algorithm that iteratively trains multiple weak classifiers. By leveraging the residual information from the preceding model, it optimizes the new model in each iteration to gradually enhance overall prediction performance. Grounded in the core concept of gradient descent, GBDT fits a new model along the negative gradient direction of the loss function. This approach exhibits strong nonlinear modeling capabilities and has been widely applied to classification and regression tasks. Nevertheless, traditional GBDT faces certain limitations when handling categorical features, feature interactions, and gradient estimation, which may lead to overfitting or performance degradation.

CatBoost is an improved algorithm based on the Gradient Boosted Decision Tree (GBDT) framework, which enhances the model performance by optimizing the feature processing and gradient estimation mechanism [53]. The core innovation of CatBoost is reflected in the following three aspects:

(1) Categorical feature coding

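For mixed datasets (including numeric and categorical features), CatBoost adopts an ordered target encoding strategy. First, the samples are randomly permuted to generate multiple sequences, and then the target statistics of samples sharing the same category are calculated dynamically based on the permutation order. Specifically, for the categorical feature k of sample x_{i}, the encoded value is as follows:

\hat{x}_{k} = \frac{\sum_{j=1}^{p-1} [x_{σ_{j},k} = x_{σ_{p},k}] Y_{σ_{j}} + aP}{\sum_{j=1}^{p-1} [x_{σ_{j},k} = x_{σ_{p},k}] + a}    (7)

where σ = (σ_{1}, …, σ_{n}) is the random permutation, p is the position of the sample in that permutation, [·] equals 1 when the condition holds and 0 otherwise, Y_{σ_{j}} is the target value of the j-th sample in the permutation, P is a prior value, and a > 0 is the weight of the prior.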
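A small sketch may help to read Eq. (7): each sample is encoded using only the samples that precede it in the permutation. Here the list order itself is treated as the permutation, and a single permutation is used; the function name, the categorical values, and the binary targets are hypothetical, and this is not CatBoost's internal implementation.

```python
def ordered_target_statistic(categories, targets, prior, a=1.0):
    """Encode a categorical column from preceding samples only (Eq. 7)."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # Numerator: targets of earlier samples in the same category plus a * P
        # Denominator: number of such earlier samples plus a
        encoded.append((s + a * prior) / (n + a))
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

# Hypothetical categorical feature with a binary target
cats = ["A", "B", "A", "A", "B"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_statistic(cats, ys, prior=0.5, a=1.0)
print(enc)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category falls back to the prior, and a sample's own target never enters its encoding, which is the point of the ordering.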

(2) High-order feature interactions

The algorithm automatically generates feature combinations during tree construction: only original features are used in the first split level, and all features on the current path are dynamically combined in subsequent splits. In particular, the combination of categorical features is converted numerically in real time to generate interpretable combination features. This mechanism significantly improves the ability of the model to represent complex nonlinear relationships by explicitly modeling the interactions between features.

(3) Ordered gradient boosting

Traditional Gradient Boosting Decision Trees (GBDTs) often suffer from bias accumulation due to the use of the same dataset for both gradient estimation and model updating. To address this limitation, CatBoost introduces the ordered boosting algorithm. For each sample x_{i}, an auxiliary model M_{i} is trained on a subset of data that excludes x_{i}. The unbiased gradient estimated from M_{i} is then used to update the main model. This approach decouples the gradient estimation from model training, effectively mitigating the risk of overfitting and enhancing the generalization ability of the model.

By leveraging this mechanism, CatBoost maintains high computational efficiency while significantly improving its modeling performance on high-dimensional, heterogeneous datasets. It is particularly well suited for geophysical classification tasks such as lithology identification, where strong nonlinear feature interactions are present. Furthermore, the combination of ordered target encoding and gradient correction strategies provides robust solutions for handling class imbalance and noise interference commonly observed in geological and well logging datasets (Figure 7).

3.4. Model Evaluation System

(1): Clustering evaluation index

To evaluate the performance of dimensionality reduction and clustering, the Davies–Bouldin Index (DBI) is adopted as a quantitative evaluation metric [54]. DBI is widely used to assess clustering quality by simultaneously considering intra-cluster compactness and inter-cluster separation. Specifically, it quantifies the average similarity between each cluster and its most similar counterpart, where similarity is a function of the ratio between within-cluster scatter and between-cluster distance. A lower DBI value indicates more compact clusters with better separation, suggesting improved clustering performance and more effective feature space transformation following dimensionality reduction. The compactness C_{k} of cluster k is defined as the average Euclidean distance between all points within the cluster and the cluster centroid, and is expressed as

C_{k} = \frac{1}{n_{k}} \sum_{i=1}^{n_{k}} \| x_{i} - μ_{k} \|    (8)

Here, x_{i} represents the i-th sample within cluster k, μ_{k} denotes the centroid of cluster k, and n_{k} is the number of samples in cluster k. The similarity S_{kj} between cluster k and cluster j is defined as the ratio of the sum of their within-cluster compactness to the distance between their centroids, as expressed by

S_{kj} = \frac{C_{k} + C_{j}}{\| μ_{k} - μ_{j} \|}    (9)

where μ_{k} and μ_{j} are the centroid vectors of clusters k and j, respectively, and C_{k} and C_{j} represent the compactness of clusters k and j as defined in Equation (8). Based on the similarity metric S_{kj}, the Davies–Bouldin Index (DBI) is calculated as the average of the maximum similarity between each cluster and all other clusters, as follows:

DBI = \frac{1}{N} \sum_{k=1}^{N} \max_{j \neq k} S_{kj}    (10)

where N is the total number of clusters, and S_{kj} denotes the similarity (inverse separability) between clusters k and j. A low DBI therefore corresponds to high intra-cluster compactness and strong inter-cluster separation, whereas a high DBI indicates overlapping or poorly separated clusters. By leveraging the DBI metric, the quality of clustering outcomes under different dimensionality reduction or clustering algorithms can be quantitatively assessed, thereby guiding the selection of appropriate methods and parameter configurations.
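The index can be computed in a few lines. The sketch below uses the standard Davies–Bouldin similarity, that is, the sum of the two compactness values divided by the distance between the two centroids, so that lower values mean better clustering. It assumes at least two non-empty clusters with distinct centroids; the function names, points, and labelings are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    """Average, over clusters, of the worst-case similarity to another cluster."""
    clusters = sorted(set(labels))
    centroids, compact = {}, {}
    for k in clusters:
        members = [p for p, lab in zip(points, labels) if lab == k]
        centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        # Compactness: mean distance of members to their centroid (Eq. 8)
        compact[k] = sum(euclidean(p, centroids[k]) for p in members) / len(members)
    worst = [
        max((compact[k] + compact[j]) / euclidean(centroids[k], centroids[j])
            for j in clusters if j != k)
        for k in clusters
    ]
    return sum(worst) / len(clusters)

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
dbi_good = davies_bouldin(points, [0, 0, 0, 1, 1, 1])  # labels follow the groups
dbi_poor = davies_bouldin(points, [0, 1, 0, 1, 0, 1])  # labels mix the groups
print(dbi_good, dbi_poor)
```

The labeling that follows the two natural groups yields a much smaller index than the one that mixes them, which is how the index can rank competing embeddings or cluster counts.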

(2): Recognition model evaluation index

To evaluate the performance of lithology classification models, it is essential to select appropriate evaluation metrics. Balanced Accuracy and F1-score are widely used indicators, particularly effective for datasets with class imbalance.

Balanced Accuracy accounts for imbalanced class distributions by computing the sensitivity (recall) for each class and then averaging the results. It is defined as

$$\text{Balanced Accuracy} = \frac{1}{N} \sum_{k = 1}^{N} \frac{TP_{k}}{TP_{k} + FN_{k}} \tag{11}$$

where $TP_{k}$ denotes the number of correctly classified samples of class $k$, $FN_{k}$ is the number of samples of class $k$ that were incorrectly classified, and $N$ is the total number of classes. A higher Balanced Accuracy indicates that the model performs well across all classes, regardless of sample distribution.
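The following sketch illustrates Equation (11) on a deliberately imbalanced toy example. The function name `balanced_accuracy` and the label sequences are hypothetical; the class codes F and SI are borrowed from the lithology abbreviations used in this paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, computed over the classes present in y_true."""
    recalls = []
    for k in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)

# Eight fine-sand samples (all correct) and two silt samples (one missed).
y_true = ["F"] * 8 + ["SI"] * 2
y_pred = ["F"] * 8 + ["F", "SI"]
print(balanced_accuracy(y_true, y_pred))
```

Plain accuracy on this example is 0.9, whereas the balanced accuracy is 0.75 because the minority class (recall 0.5) carries the same weight as the majority class (recall 1.0).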

The F1-score is the harmonic mean of precision and recall, and is particularly effective for evaluating models under imbalanced class conditions. For the $k$-th class, Precision, Recall, and the F1-score are calculated as follows:

$$\mathrm{Precision}_{k} = \frac{TP_{k}}{TP_{k} + FP_{k}} \tag{12}$$

$$\mathrm{Recall}_{k} = \frac{TP_{k}}{TP_{k} + FN_{k}} \tag{13}$$

$$F1_{k} = \frac{2 \cdot \mathrm{Precision}_{k} \cdot \mathrm{Recall}_{k}}{\mathrm{Precision}_{k} + \mathrm{Recall}_{k}} \tag{14}$$

where $FP_{k}$ denotes the number of samples incorrectly predicted as class $k$. The overall F1-score of the model is computed as the average F1-score across all classes, as follows:

$$F1 = \frac{1}{N} \sum_{k = 1}^{N} F1_{k} \tag{15}$$

A higher F1-score indicates a better balance between precision and recall, reflecting a model’s effectiveness in handling both false positives and false negatives. This is particularly valuable in lithology prediction tasks, where misclassifications may lead to significant geological and operational consequences.
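Equations (12) to (15) can be sketched as a macro-averaged F1 computation. The function name `macro_f1`, the convention of returning 0 when a denominator vanishes, and the toy labels are assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (Eqs. 12-15)."""
    scores = []
    for k in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        # Assumed convention: a zero denominator yields a score of 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

f1_true = ["F"] * 8 + ["SI"] * 2
f1_pred = ["F"] * 8 + ["F", "SI"]
print(macro_f1(f1_true, f1_pred))
```

Here the fine-sand class has F1 = 16/17 and the silt class has F1 = 2/3, giving a macro F1 of 41/51 (about 0.804).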

4. Results

4.1. Dataset Description

The dataset used in this study was acquired from the Volve oilfield located in the Theta anticline, encompassing data from eight wells: F-1, F-5, F-11-A, F-11-T2, F-14, F-15, F-15-B, and F-15-C. The data include Logging While Drilling (LWD) measurements and corresponding drilling engineering parameters for the Hugin Formation. Based on the lithological interpretation results, the lithology labels in the dataset are classified into six categories according to the Udden–Wentworth classification of detrital grain size:

(1) Clay shale: Fine-grained sedimentary rock primarily composed of clay minerals with grain sizes smaller than 0.0039 mm. It typically exhibits foliation, is rich in organic matter, has low permeability, and is commonly deposited in still-water environments.
(2) Silt: Clastic sediments with particle sizes ranging from 0.0039 mm to 0.0625 mm, predominantly composed of quartz and feldspar. These sediments exhibit weak permeability and are often deposited from suspension under low- to medium-energy hydrodynamic conditions.
(3) Fine sand: Terrigenous detrital particles with grain sizes between 0.125 mm and 0.25 mm. These are well sorted and sub-angular to sub-rounded, typically deposited in high-energy environments such as beach bars and delta fronts.
(4) Coarse sand: Detrital material with grain sizes ranging from 0.5 mm to 1 mm, usually poorly to moderately sorted, often containing gravel and showing cross-bedding. This lithology indicates high-energy depositional settings, such as fluvial channels or eolian systems.
(5) Fine sandstone: Transitional clastic rocks with grain sizes between 0.0625 mm and 0.25 mm, characterized by mixed grain-size distributions. These reflect sedimentary processes influenced by fluctuating hydrodynamic conditions.
(6) Limestone: Chemical or biochemical sedimentary rocks containing more than 50% calcite (CaCO₃).

To facilitate lithology identification, a total of 10 parameters are utilized in the dataset. These comprise 6 LWD parameters, namely GR (natural gamma), DT (acoustic travel time), RD (deep resistivity), RS (shallow resistivity), NPHI (compensated neutron), and RHOB (density), and 4 drilling engineering parameters, namely ROP (rate of penetration), HKLA (average hook load), TQA (torque), and WOB (weight on bit).

4.2. Data Preprocessing

The original logging dataset underwent Z-score outlier detection to eliminate noise caused by instrument drift and drilling vibrations; samples with $|z| > 3$ were discarded. After preprocessing, 9628 valid samples were retained. According to lithological interpretation, the dataset includes six types of reservoir lithology: clay shale (CS) accounting for 13.48%, silt (SI) 4.67%, fine sand (F) 25.45%, coarse sand (C) 23.84%, fine sandstone (FS) 21.07%, and limestone (LS) 11.49% (Table 1). The statistical results show that the proportions of silt and limestone samples are relatively low, reflecting the limited development of high-energy sedimentary environments in the study area and indicating that these lithologies are not widely distributed in the reservoir formations.
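A minimal sketch of Z-score outlier removal is given below. It assumes that the score is computed per feature with the population standard deviation and that a sample is dropped if any feature exceeds the threshold; the paper does not specify these details, and the function name `zscore_filter` and the synthetic readings are hypothetical.

```python
import statistics

def zscore_filter(rows, threshold=3.0):
    """Keep only rows in which every feature satisfies |z| <= threshold."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    kept = []
    for row in rows:
        # A constant column (std == 0) cannot flag outliers, so it is skipped.
        if all(s == 0 or abs((v - m) / s) <= threshold
               for v, m, s in zip(row, means, stds)):
            kept.append(row)
    return kept

# Nineteen plausible (GR, RHOB)-like readings plus one spike in the first channel.
raw_rows = [(50.0 + i % 3, 2.4) for i in range(19)] + [(500.0, 2.4)]
clean_rows = zscore_filter(raw_rows)
print(len(raw_rows), len(clean_rows))
```

Note that with n samples the largest attainable population z-score is sqrt(n - 1), so a threshold of 3 can only remove points from datasets with more than ten samples.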

Well F-14 was selected as the standard well due to its representative lithology and high-quality data. In the Hugin Formation, the baseline GR value for shale was determined to be 82.6 API with a standard deviation of ±9.8 API, while the mean RT value for fine sandstone was 12.4 Ω·m with a standard deviation of ±3.2 Ω·m, both reflecting typical regional log response characteristics. Using Equation (3), the data from all eight wells in the study area were standardized and transformed based on these reference values to eliminate instrument scale differences and formation background field offsets (Table 2), ensuring the consistency and comparability of logging responses across wells.

4.3. Lithology Classification Feature Extraction

(1): Dimension reduction

To facilitate data visualization and reduce computational complexity, this study projects the high-dimensional data onto two dimensions. The low-dimensional embeddings are then visualized to assess the effectiveness of different dimensionality reduction techniques. Figure 8 presents the two-dimensional projections of the training samples obtained with Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). UMAP yields markedly different results from the other two methods. PCA, a linear dimensionality reduction technique, produces a projection in which lithological classes appear as overlapping clusters distributed along linear directions in the 2D space. In contrast, both t-SNE and UMAP capture nonlinear relationships, resulting in more dispersed and irregular cluster distributions. However, t-SNE has notable limitations: it is memory-intensive and poorly suited to large datasets, and it does not preserve the global structure of the data, so only intra-cluster distances are meaningful while inter-cluster relationships are distorted and unreliable. PCA is also constrained, primarily by its assumption of Gaussian-distributed data, which reduces its effectiveness for complex geological datasets. UMAP, by comparison, is better suited to large non-Gaussian datasets: it preserves local neighborhood relationships and maintains the global structure of the data more effectively, ensuring overall data integrity during dimensionality reduction. Given these advantages, UMAP outperforms t-SNE and PCA in reducing the dimensionality of Logging While Drilling (LWD) data, making it the more appropriate choice for the subsequent visualization and analysis tasks.

Based on a comprehensive evaluation of intra-class compactness, inter-class separability, and the Davies–Bouldin Index (DBI), UMAP exhibits a distinct advantage over PCA and t-SNE in the task of lithology identification (Table 3). UMAP achieves an average within-class compactness of 7.86, comprising values of 7.10 for silt and 11.36 for coarse sand, along with an average inter-class discrimination score of 6.91. These results demonstrate UMAP’s ability to effectively capture both local details and global structures, particularly in geologically complex transition zones such as the interface between silt and fine sandstone. As a nonlinear dimensionality reduction technique, UMAP alleviates the lithological overlap caused by PCA’s linear compression (e.g., DBI of 3.33 for UMAP versus 3.06 for PCA in the case of shale and limestone) and avoids the excessive dispersion and cluster fragmentation associated with t-SNE’s tendency to overfit local structures. Furthermore, UMAP projections exhibit strong consistency with known sedimentary patterns, such as the energy gradient across silt, fine-grained, and coarse sand, as well as lithological provenance, such as the isolated distribution of limestone. Overall, UMAP provides a high-fidelity, interpretable low-dimensional feature space that facilitates accurate real-time identification of highly heterogeneous formations during drilling operations.

A comprehensive evaluation based on intra-class compactness, inter-class discrimination, and the Davies–Bouldin Index (DBI) reveals that UMAP provides substantial advantages in lithology identification tasks (Figure 9). Specifically, UMAP achieves an average intra-class compactness of 5.67–6.19 for coarse sand and 4.55 for clay shale, along with an average inter-class discrimination score of 5.03. These metrics underscore UMAP’s capacity to effectively capture both internal physical variability (e.g., heterogeneity in coarse sand containing gravel inclusions) and the depositional logic between lithologies (e.g., gradual energy transitions between silt and fine sandstone). As a nonlinear dimensionality reduction technique, UMAP successfully addresses lithological overlap issues caused by the linear projections of PCA—most notably enhancing the separability between silt and fine sandstone by 214% compared with PCA—and mitigates the abnormal dispersion and cluster fragmentation associated with t-SNE’s local overfitting. The DBI value of 2.83 further underscores UMAP’s capability for fine-grained boundary segmentation, especially in complex interbedded formations such as thin layers of clay shale and silt, surpassing both t-SNE (DBI 7.22) and PCA (DBI 3.12). Visualization outcomes demonstrate that limestone forms a distinct and compact cluster (compactness 5.37) in the UMAP-reduced space, with negligible mixing with clastic lithologies. The spatial separation between coarse- and fine-grained sandstone is geologically rational, closely aligning with the grain-size variation patterns typical of fluvial sedimentary environments. In summary, UMAP delivers a high-fidelity, interpretable low-dimensional representation of LWD data via topological manifold learning, facilitating real-time recognition of highly heterogeneous subsurface formations during drilling operations.

(2): Cluster processing

Clustering algorithms leverage the inherent structure of data to define rules for grouping observations with similar characteristics. Among these algorithms, K-means is widely employed in machine learning cluster analysis due to its ability to identify natural groupings or clusters in dimensionality-reduced data. The algorithm assigns data points to clusters with equal variance, minimizing the within-cluster sum of squared deviations, also known as inertia. The accuracy of the clustering outcome is influenced by the selection of the optimal number of clusters and the relevant variables that capture the essential patterns and information in the dataset. In this study, we applied the K-means algorithm to a dataset reduced in dimensionality using UMAP. To assess the clustering quality, we plotted the Davies–Bouldin Index (DBI) as a function of the number of clusters ranging from 2 to 24 (Figure 10). The experimental results revealed that the DBI reached its minimum value of 0.71 when K = 15, indicating the optimal clustering configuration. This result suggests that, at K = 15, there is an ideal balance between cluster compactness and separation, leading to the best overall clustering quality.
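The selection of K by scanning the DBI can be sketched as follows. This is a simplified, dependency-free illustration and not the authors' implementation: it uses a basic Lloyd-style K-means with deterministic farthest-point seeding (an assumption made for reproducibility), and the three synthetic blobs stand in for a UMAP embedding. All function names are hypothetical.

```python
import math

def farthest_point_init(points, k):
    """Seed with points[0], then repeatedly add the point farthest from all seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, n_iter=100):
    """Basic Lloyd iterations; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(a) / len(members) for a in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels

def davies_bouldin(points, labels):
    """DBI over non-empty clusters; assumes distinct centroids."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    mu = {g: tuple(sum(a) / len(m) for a in zip(*m)) for g, m in groups.items()}
    comp = {g: sum(math.dist(p, mu[g]) for p in m) / len(m) for g, m in groups.items()}
    return sum(max((comp[k] + comp[j]) / math.dist(mu[k], mu[j])
                   for j in groups if j != k) for k in groups) / len(groups)

def best_k(points, k_values):
    """Return the K with the lowest DBI, plus the full score table."""
    scores = {k: davies_bouldin(points, kmeans(points, k)) for k in k_values}
    return min(scores, key=scores.get), scores

def blob(cx, cy):
    return [(cx + dx, cy + dy) for dx in (-0.5, 0.5) for dy in (-0.5, 0.5)]

embedding = blob(0, 0) + blob(10, 0) + blob(0, 10)   # three synthetic groups
chosen_k, dbi_scores = best_k(embedding, range(2, 6))
print(chosen_k, dbi_scores)
```

On this synthetic embedding the scan selects K = 3, the number of blobs. In the study the same kind of scan over K = 2 to 24 on the UMAP-reduced LWD data gave the minimum at K = 15.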

The K-means algorithm was applied with K = 15 to the UMAP-reduced dataset (Figure 11), and the resulting clusters were visualized on top of the UMAP dimensionality reduction projections (Figure 12). In cluster 0, fine sand predominates, making up 86.8% of the cluster, clearly identifying it as a fine sand-dominated group. Cluster 1 contains 47.3% coarse sand and 41.4% clay shale, which likely represents a mixed layer of coarse sand and clay shale. In cluster 2, fine sandstone and limestone constitute 56.3% and 23.7%, respectively, suggesting an interaction zone between limestone and fine sandstone. Cluster 4 is composed entirely of fine sandstone, representing a homogeneous lithology, while cluster 6 contains 44.8% clay shale and 27.1% limestone, indicating a transitional or mixed deposit between clay shale and limestone. Clusters with high proportions of coarse sand, such as cluster 7 (74.1% coarse sand), likely indicate high-energy environments, such as fluvial or littoral settings. Conversely, clusters with a high percentage of clay shale, like cluster 1 (41.4% clay shale), may reflect low-energy environments, such as lake or deep-sea deposits. Clusters containing a large proportion of limestone, such as cluster 14 (85.8% limestone), likely correspond to carbonate platform or reef environments. Notably, some clusters exhibit unusual distributions that require further consideration. For instance, cluster 13 is dominated by coarse sand (79.1%), but also contains small amounts of other lithologies, which may suggest local variations in a coarse sand-dominated environment. Similarly, cluster 5 consists of 80.9% fine sand, with 12.4% limestone, possibly representing a sandy deposit with limestone fragments or cementing. Additionally, certain clusters may represent transitional phases within the sedimentary sequence, such as the mixture of fine sand (81.6%) and silt (15.5%) in cluster 9, which could reflect an environment with fluctuating hydrodynamic conditions.
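The per-cluster lithology percentages quoted above amount to a cross-tabulation of cluster IDs against interpreted lithology labels. A minimal sketch is shown below; the function name `cluster_composition` and the six-sample input are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict

def cluster_composition(cluster_ids, lithologies):
    """Percentage of each lithology label within each cluster."""
    counts = defaultdict(Counter)
    for cluster, lithology in zip(cluster_ids, lithologies):
        counts[cluster][lithology] += 1
    return {
        cluster: {lith: 100.0 * n / sum(counter.values())
                  for lith, n in counter.items()}
        for cluster, counter in counts.items()
    }

demo_clusters = [0, 0, 0, 0, 1, 1]
demo_labels = ["F", "F", "F", "SI", "LS", "LS"]
composition = cluster_composition(demo_clusters, demo_labels)
print(composition)
```

The dominant lithology of a cluster can then be read off as the label with the largest percentage, for example `max(composition[0], key=composition[0].get)`.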

Although the original lithology classification consists of six categories, actual formations often exhibit multi-scale sub-category structures due to variations in sedimentary microfacies, diagenesis, or engineering responses. For instance, clay shale can be further divided into deep-water organic-rich facies, coastal calcareous-cemented facies, and tidal-flat bioturbated facies, each displaying distinct logging responses (e.g., GR, resistivity) and engineering parameters (e.g., drilling time, torque). Similarly, coarse sands can be differentiated into loose channel sand and tightly cemented alluvial fan sand based on gravel content and degree of cementation, and these are naturally separated in the UMAP-reduced space. Furthermore, thin reservoirs, such as sand–claystone transition zones, are classified as a single lithological unit in the original labeling, whereas the actual data distribution reveals a more gradual transition. By increasing the number of clusters, the algorithm effectively captured this continuous change in sedimentary energy. Thus, the clustering result for K = 15 aligns with geological understanding, revealing lithological sub-classes and microfacies information that manual classification fails to capture. This provides a higher-resolution approach for describing reservoir heterogeneity and significantly enhances the geological and engineering applicability of the lithology classification model.

Based on the distribution characteristics of acoustic travel time (DT) and torque (TQA), both of which are sensitive to lithology, the clustering results reveal the differentiation patterns of sedimentary environments for various lithology assemblages (Figure 13). Clusters with high DT values (e.g., cluster 1, median DT 91.92, IQR = 8.57) and low DT values (e.g., cluster 13, median DT 69.13, IQR = 5.19) correspond to high-porosity sand bodies and tightly cemented calcareous formations, respectively. The elevated DT value in cluster 1 reflects the gradual hydrodynamic changes characteristic of a prodelta silty sand transition zone, while its low TQA fluctuations indicate homogeneous suspension deposition. Conversely, the very low DT value of cluster 13 is consistent with coarse-grained sand, suggesting compact, low-porosity sand bodies formed by wave sorting in a coastal beach setting. Transitional facies clusters exhibit moderate DT values and high TQA fluctuations, indicative of the heterogeneous nature of sand–limestone interbeds in a tidal flat environment. The variation in TQA is further amplified by cementation differences resulting from periodic exposure.
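Per-cluster medians and interquartile ranges of the kind quoted for DT can be computed as sketched below. The quartile convention (Python's default "exclusive" method) is an assumption, since the paper does not state which one was used, and the function name `cluster_median_iqr` and the input values are hypothetical.

```python
import statistics
from collections import defaultdict

def cluster_median_iqr(cluster_ids, values):
    """Return {cluster: (median, IQR)}; each cluster needs at least two values."""
    grouped = defaultdict(list)
    for cluster, value in zip(cluster_ids, values):
        grouped[cluster].append(value)
    summary = {}
    for cluster, vals in grouped.items():
        q1, _, q3 = statistics.quantiles(vals, n=4)   # quartile cut points
        summary[cluster] = (statistics.median(vals), q3 - q1)
    return summary

demo_ids = [1] * 7
demo_dt = [1, 2, 3, 4, 5, 6, 7]   # placeholder values, not real DT readings
dt_summary = cluster_median_iqr(demo_ids, demo_dt)
print(dt_summary)
```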

To address potential discrepancies between statistical clustering and geological reality, the lithology assignments of each cluster were rigorously validated against physical measurements. The lithology corresponding to each cluster was calibrated by integrating LWD logging parameters with high-resolution borehole imaging data (Figure 14). For example, cluster 7, which is statistically dominated by coarse sand (74.1%), exhibited clear gravel laminations and high-angle cross-bedding in imaging logs, confirming its interpretation as a high-energy fluvial deposit. Conversely, cluster 6, containing 44.8% clay shale and 27.1% limestone, displayed alternating thin-bedded shale and micritic limestone layers in imaging data, aligning with its transitional facies classification. Such cross-validation ensures that the UMAP + K-means-derived clusters not only reflect statistical similarity but also correspond to measurable geological features, thereby mitigating the risk of overinterpreting purely algorithmic patterns.

The 15 cluster labels generated by K-means clustering were incorporated into the lithology classification model as categorical features. Together with the original logging features and the UMAP-reduced features, these labels formed a multimodal input space. The cluster labels were converted using One-Hot encoding, thereby capturing the topological correlations and sedimentary facies information of the samples within the low-dimensional manifold space. By leveraging unsupervised clustering, hidden geological structure patterns were extracted, reducing the strong reliance of traditional supervised learning on manually assigned labels. This approach enhances the model’s generalization ability, particularly in small-sample scenarios. The result is a feature engineering framework that combines physical interpretability with data-driven generalization, facilitating the intelligent recognition of highly heterogeneous thin reservoirs.
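The assembly of the multimodal input space can be sketched as a simple concatenation of the three feature groups. The layout (10 original features, 2 UMAP coordinates, 15 one-hot columns) follows the counts given in this paper, but the column ordering, the function names, and the sample values are assumptions made for illustration.

```python
def one_hot(cluster_id, n_clusters):
    """Binary indicator vector with a single 1.0 at position cluster_id."""
    if not 0 <= cluster_id < n_clusters:
        raise ValueError("cluster_id out of range")
    vector = [0.0] * n_clusters
    vector[cluster_id] = 1.0
    return vector

def build_features(log_features, umap_coords, cluster_ids, n_clusters=15):
    """Concatenate original features, UMAP coordinates, and one-hot cluster labels."""
    return [
        list(logs) + list(coords) + one_hot(cluster, n_clusters)
        for logs, coords, cluster in zip(log_features, umap_coords, cluster_ids)
    ]

# One hypothetical sample: 10 LWD/drilling values, a 2D embedding, cluster 3.
feature_rows = build_features([[0.0] * 10], [(1.5, -2.0)], [3])
print(len(feature_rows[0]))
```

With these counts each sample becomes a 27-dimensional vector in which exactly one of the last 15 entries is set.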

4.4. Application of Lithology Identification Model

Based on the workflow for real-time intelligent identification and precision drilling in strongly heterogeneous formations using multi-parameter LWD and drilling engineering data, the original dataset was split into a training set and a test set with a ratio of 80% to 20%. The training set was then processed using various dimensionality reduction methods to obtain the corresponding datasets. These included the original training data, PCA, t-SNE, UMAP, t-SNE + K-means, and UMAP + K-means datasets, which were further divided into training and validation sets with a ratio of 4:1. To enhance the model’s generalization ability, a five-fold cross-validation method was applied, and a grid search was used to identify the optimal parameter combination for each model. To minimize the influence of random factors, the training process was repeated 10 times, and the average results were recorded. A line chart was generated to visualize the training performance of each model across the different datasets, allowing for a comparative analysis of how each dimensionality reduction method affected the model’s performance.
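The data partitioning described above can be sketched with index lists. This is a simplified illustration that assumes a random, unstratified shuffle and interleaved folds; whether the study stratified by lithology or grouped by well is not stated, and the function names and seed are hypothetical.

```python
import random

def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_folds(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train_idx, test_idx = train_test_split(9628)   # sample count reported in Section 4.2
cv_splits = list(k_folds(train_idx, k=5))
print(len(train_idx), len(test_idx), [len(val) for _, val in cv_splits])
```

Each fold holds out roughly one fifth of the training indices, which corresponds to the 4:1 training-to-validation ratio within the training set.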

(1): Model training

Four different classification models were developed using Support Vector Machine (SVM), Random Forest (RF), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost), a gradient boosting machine built on symmetric decision trees. The tuning parameters and their optimal values for each model are detailed in Table 4. For the SVM model, the tuning parameters include the regularization penalty coefficient C and the coefficient γ of the RBF kernel function. Since Random Forest, LightGBM, and CatBoost all use decision trees as base classifiers, the tuning process focuses on parameters related to the tree structure. The most important of these is the minimum number of samples required to create a leaf node. Additionally, Random Forest involves sampling both samples and features, making the maximum number of features another critical parameter. The CatBoost model, which is based on ordered boosting, requires careful control of the learning rate during model iteration, as the model is trained on one subset of the data while residuals are calculated on another subset.

During the parameter tuning process, datasets that underwent only dimensionality reduction did not significantly improve the model. However, applying both dimensionality reduction and clustering processing enhanced the training performance. Among the various dimensionality reduction methods, the dataset processed with UMAP generally produced better model performance. Furthermore, the best training results were achieved using the dataset processed with UMAP + K-means.

The optimal structural parameters with the highest score and strongest generalization ability for each model were determined through grid search and cross-validation. To assess the classification ability of the model for well logging lithology and evaluate the impact of the data feature enhancement method, the model with the optimal parameters was applied to the test dataset. The Balanced Accuracy and F1-score were then calculated to evaluate the model’s prediction performance.

(2): Model testing

To eliminate the influence of random chance on the test results, the SVM, RF, LightGBM, and CatBoost models were each evaluated 10 times on the test set. The average Balanced Accuracy and F1-score of each model were plotted in bar charts (Figure 15), revealing significant differences in performance across the combinations of data processing methods and models. The combination of UMAP + K-means + CatBoost achieved the best performance in both metrics (Balanced Accuracy = 0.897, F1-Score = 0.872). This indicates that UMAP’s nonlinear dimensionality reduction, combined with K-means clustering, effectively enhances CatBoost’s ability to capture complex lithological features. The advantage arises from UMAP preserving both global and local data structure, while clustering reflects lithology sub-classes. In contrast, the SVM model performed weakly overall, particularly with PCA and other linear dimensionality reduction methods, highlighting its limited adaptability to high-dimensional and nonlinear data. LightGBM showed strong performance under UMAP and UMAP + K-means due to the efficiency of its histogram algorithm and its ability to capture nonlinear features, though it lagged slightly behind CatBoost, possibly due to less effective gradient bias optimization. Random Forest demonstrated robust performance but with a lower upper limit; while its ensemble learning mechanism reduces noise, it does not capture complex interactions as effectively as CatBoost. The improvements from combinations such as t-SNE + K-means and PCA + K-means were marginal, and sometimes even weaker than dimensionality reduction alone, potentially because the clustering labels were affected by t-SNE’s local overfitting or PCA’s linear distortion. This comparison underscores UMAP’s superior ability to retain geological information and CatBoost’s strong adaptability in integrating multi-parameter features.

In the study of intelligent lithology classification, the combination of UMAP + K-means and CatBoost demonstrated significant advantages (Figure 16). According to the confusion matrix analysis, CatBoost outperformed other models, particularly in recognizing less common categories like silt and limestone. The recall rate for silt reached 48.6%, significantly higher than SVM’s complete failure and LightGBM’s 48.6%, with a more balanced distribution of misclassifications. For limestone, the classification accuracy reached 69.0%, which was substantially better than SVM and Random Forest, with a 32% reduction in misclassification as shale. For more common lithologies, CatBoost’s accuracy in identifying fine sand and coarse sand matched that of the ensemble model, but with a lower cross-class misclassification rate. This advantage is attributed to the combined effects of CatBoost’s gradient unbiased estimation mechanism and UMAP + K-means. On one hand, CatBoost encodes the cluster labels as interactive features through ordered boosting, capturing the nonlinear relationships between lithology sub-classes and drilling parameters. On the other hand, UMAP’s dimensionality reduction preserves the global structure of the sedimentary sequence (e.g., the independent clustering of limestone and the energy gradient distribution of coarse and fine sand), providing a high-fidelity feature space for the model. The experiments show that this combination achieves the highest Balanced Accuracy and F1-Score, 10.2% and 4.5% higher than the next best model (LightGBM), respectively, underscoring its engineering applicability for real-time recognition of heterogeneous thin reservoirs. Table 5 and Table 6 show the performance of different models on various lithologies in this study. From the perspective of accuracy, the prediction results of each model for silt are slightly lower than the overall level, which may be due to the uneven distribution of data samples of this label. 
In addition, although the accuracy scores of CatBoost and Random Forest are very close, from the F1-score table, CatBoost performs significantly better than other models in multiple lithologies (limestone, clay shale, and fine sand, etc.).

To verify the practical effectiveness of the UMAP + K-means + CatBoost model, this study conducted an in-depth comparative analysis of lithology prediction results against real lithology labels using standard well F-14. Figure 17 shows a comparison between the real lithology labels and the model’s prediction results along the depth profile of the Hugin section in well F-14. Overall, the model’s predicted lithology is highly consistent with the real labels. Particularly in the primary lithology sections (depth range 2700–2850 m), such as fine sand (F) and coarse sand (C), the prediction accuracy reached 89.2% and 86.5%, respectively, aligning with the continuity characteristics of high-energy river sand body deposits. In the limestone (LS) section (depth 2900–2950 m), the model successfully identified the independent clusters of limestone with an accuracy of 72.3%, with only a small portion misclassified as clay shale (8.1%). This misclassification could be attributed to local thin clay intercalations or logging response anomalies. Notably, in the silt (SI) and fine sandstone (FS) transition zone (depth 2850–2880 m), some cross misjudgment was observed (silt recall of 45.7%), reflecting the classification ambiguity caused by the gradual change in physical properties within the transition lithology. CatBoost’s feature combination optimization significantly enhanced recognition ability, especially in these transition zones. Additionally, the prediction accuracy of clay shale (CS) in low-energy sedimentary segments (depth range 2950–3000 m) was 78.4%, demonstrating the model’s strong discrimination ability for thin reservoir segments. In general, the UMAP + K-means feature enhancement strategy facilitated high-precision lithology identification for well F-14, with an overall Balanced Accuracy of 0.927 (Figure 17). 
This was achieved by fusing logging physical characteristics with drilling dynamic parameter response mutations, leveraging the unbiased gradient estimation mechanism of CatBoost.

5. Discussion

5.1. Comparing the Performances of Different Machine Learning Algorithms

The experimental results demonstrate that the UMAP algorithm has significant advantages in dimensionality reduction for multi-source and multi-parameter LWD data. Compared with the linear dimensionality reduction method PCA and the nonlinear method t-SNE, UMAP outperforms in key indicators such as the Davies–Bouldin Index, intra-class compactness, and inter-class separation. The core advantage of UMAP lies in its ability to preserve both the global structure and local details of the data through topological manifold learning. This effectively mitigates the lithological overlap caused by the linear projection of PCA (e.g., the overlap between shale and limestone) and the abnormal dispersion caused by the local overfitting inherent in t-SNE. Moreover, the dimensionality reduction results of UMAP align well with geological sedimentary sequences (such as the energy gradient between silt, fine-grained sand, and coarse-grained sand) and lithological origins (for example, limestone being independently clustered). This consistency further verifies UMAP’s high-fidelity characteristics in complex geological environments. On the other hand, t-SNE’s engineering applicability is limited by its high memory consumption and lack of global structure, while PCA struggles to capture nonlinear geological responses due to its linear assumptions. In contrast, UMAP offers a more interpretable low-dimensional feature space for identifying strongly heterogeneous strata, making it highly suitable for nonlinear dimensionality reduction and adapting to high-dimensional data.

By comparing the performances of the SVM, Random Forest (RF), LightGBM, and CatBoost models, it is evident that the combination of CatBoost with the UMAP + K-means method achieves the best performance in the lithological classification task, with a Balanced Accuracy of 0.897 and an F1-Score of 0.872. SVM struggles with high-dimensional data and nonlinear relationships, resulting in the weakest performance, with a Balanced Accuracy of only 0.505 under PCA processing. While RF is robust, it has a low upper limit, and its ensemble mechanism is insufficient for capturing complex interactive features. Although LightGBM is efficient, it does not perform as well as CatBoost in terms of gradient bias optimization. The advantage of CatBoost lies in its ordered boosting mechanism, which effectively integrates cluster labels through unbiased gradient estimation and enhances the model’s ability to capture lithology sub-classes using categorical feature encoding. This enables CatBoost to significantly outperform other models, especially in recognizing scarce classes and thin reservoirs, thereby demonstrating its engineering suitability for heterogeneous reservoir recognition.

5.2. Performance Analysis of the Optimal Method for Identifying Highly Heterogeneous Formations

The combination of UMAP + K-means and CatBoost achieves multi-dimensional optimization in lithology identification. UMAP dimensionality reduction provides high-fidelity input to the model by preserving the global structure and lithologic genetic characteristics of the sedimentary sequence. The 15 cluster labels generated by K-means clustering uncover lithology sub-classes not captured by manual classification, such as the microfacies differences in clay shale and the gradual transition zones between sand and claystone. One-Hot encoding was employed to enhance the category information, enriching the lithology labels. Experiments demonstrate that this method significantly improves the model’s ability to delineate complex stratigraphic boundaries and enhances the recognition accuracy of thin reservoirs. For instance, the misclassification rate between coarse-grained and fine-grained sands is reduced by 21%, and the misclassification from limestone to shale decreases by 32%. Furthermore, confusion matrix analysis reveals that CatBoost delivers a more balanced misclassification distribution across classes, validating its enhanced generalization performance through unbiased gradient estimation. This approach not only addresses the challenge of sparse lithology and geological labels in traditional supervised learning but also provides a solution that is both data-driven and geologically interpretable for real-time recognition of strongly heterogeneous formations in LWD applications. Compared with traditional petrophysical interpretation methods, which often rely on manual analysis and heuristic rules, the proposed approach offers significant advantages in terms of accuracy, efficiency, and adaptability to complex geological conditions.

5.3. Real-Time Optimization of Drilling Parameters Based on Strongly Heterogeneous Formation Identification

In this study, a collaborative optimization mechanism between LWD logging and drilling engineering parameters was developed. The drilling rate was dynamically adjusted through real-time identification of strongly heterogeneous formations, and the drilling time was reduced by modifying surface parameters (such as WOB, TQA, and HKLA) to maximize rock breakage efficiency and accurately target reservoirs. Generally, the ROP of conglomerate facies was the highest, followed by clay-rich facies and carbonate facies. The optimization process begins by predicting formation lithology using the CatBoost machine learning model, followed by the selection of the optimal drilling parameter combination based on the lithofacies characteristics. For instance, to increase the ROP of carbonate facies from 19.1 m/h to 34.3 m/h, the WOB, TQA, and HKLA should be adjusted to 16.5 kN, 22.6 kN·m, and 135.4 kN, respectively. The optimization system integrates predictive confidence thresholds from the CatBoost model to manage operational risks. High-confidence predictions (probability > 0.85) trigger automated parameter adjustments, while moderate confidence (0.6–0.85) activates real-time cross-validation with adjacent data points. Low-confidence scenarios (<0.6) require manual verification before parameter changes. This tiered uncertainty control prevents automated decisions in ambiguous lithological transitions or thin-bedded sequences. During critical reservoir entry phases, all automated adjustments mandate engineer authorization regardless of confidence level, ensuring operational safety against tool sticking or wellbore instability risks. This optimization algorithm improves the ROP, reduces drilling time, provides a quantitative control path for efficient drilling of strongly heterogeneous reservoirs, and ultimately saves drilling costs while enabling precise drilling of thin reservoirs.
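The tiered uncertainty control described above can be expressed as a small decision function. This is a minimal sketch of the stated thresholds only, not the deployed system: the names `Action` and `decide_action` are hypothetical, and treating the boundary values 0.6 and 0.85 as part of the moderate tier is an assumption, since the text gives the moderate range as 0.6–0.85 without specifying inclusivity.

```python
from enum import Enum


class Action(Enum):
    AUTO_ADJUST = "automated parameter adjustment"
    CROSS_VALIDATE = "cross-validate with adjacent data points"
    MANUAL_VERIFY = "manual verification before any change"
    ENGINEER_AUTHORIZATION = "engineer authorization required"


def decide_action(confidence, critical_phase=False, high=0.85, low=0.6):
    """Map a classifier probability to a control action.

    confidence     : predicted class probability in [0, 1]
    critical_phase : True during critical reservoir entry, where engineer
                     authorization is required regardless of confidence
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if critical_phase:
        return Action.ENGINEER_AUTHORIZATION
    if confidence > high:
        return Action.AUTO_ADJUST
    if confidence >= low:  # assumed inclusive bounds for the moderate tier
        return Action.CROSS_VALIDATE
    return Action.MANUAL_VERIFY


for p in (0.92, 0.70, 0.41):
    print(p, decide_action(p).name)
```

Checking the critical-phase flag first keeps the safety override independent of the thresholds, so later tuning of `high` and `low` cannot bypass it.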

5.4. Model Generalizability and Adaptation Strategies

While the model demonstrates robust performance in the Volve oilfield, its generalizability to other geological formations remains challenged by domain shifts in petrophysical patterns and drilling responses. The framework supports transfer learning by fine-tuning the CatBoost classifier with limited target-domain data, leveraging the UMAP+K-means feature space that captures core lithological trends. However, its effectiveness may diminish when encountering subtle lithological transitions, irregular class distributions, or highly imbalanced datasets, and performance is sensitive to clustering parameters. To enhance adaptability, incremental learning allows real-time updates to clustering centroids and classifier weights using streaming field data. To address data scarcity, synthetic logging curves generated via forward petrophysical simulations can augment the training set. In addition, while UMAP shows excellent dimensionality reduction performance, its computational cost may become a bottleneck in real-time applications. Although offline pretraining and centroid caching help reduce time consumption during deployment, further optimization or approximations (e.g., mini-batch UMAP) may be required for large-scale operations.
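One way to picture the incremental update of clustering centroids is the sequential (running-mean) K-means rule sketched below. The text does not specify the update rule, so this choice, together with the names `nearest_centroid` and `update_centroids`, is an assumption for illustration; classifier weight updates are not covered here.

```python
import math


def nearest_centroid(centroids, x):
    """Index of the centroid closest to x in Euclidean distance."""
    return min(range(len(centroids)), key=lambda k: math.dist(centroids[k], x))


def update_centroids(centroids, counts, x):
    """Assign x to its nearest centroid and move that centroid toward x.

    counts[k] holds the number of samples already absorbed by centroid k,
    so the step size 1 / counts[k] keeps each centroid equal to the running
    mean of its assigned samples. Both lists are modified in place.
    """
    k = nearest_centroid(centroids, x)
    counts[k] += 1
    step = 1.0 / counts[k]
    centroids[k] = [c + step * (xi - c) for c, xi in zip(centroids[k], x)]
    return k


# Two cached centroids in a 2D embedding; one new streaming sample arrives.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
assigned = update_centroids(centroids, counts, [2.0, 0.0])
print(assigned, centroids[assigned])
```

A fixed step size instead of `1 / counts[k]` would weight recent samples more heavily, which may suit drifting formations better, at the cost of noisier centroids.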

Future research will explore semi-supervised learning for improved label efficiency, alternative clustering techniques such as DBSCAN to better delineate complex boundaries, and adversarial domain adaptation to align source and target feature distributions. Preliminary validation in analogous North Sea reservoirs shows that the framework maintains >85% accuracy when calibrated with only 15% of local data, indicating promising transferability under limited-data conditions.

6. Conclusions

This study addresses the technical challenges of real-time identification and precision drilling in strongly heterogeneous formations by proposing a method that jointly applies UMAP-based dimensionality reduction and unsupervised clustering to multi-source, multi-parameter data. By integrating LWD logging parameters (such as natural gamma and electrical resistivity) and dynamic drilling engineering parameters (such as drilling rate and torque), a high-dimensional feature space is created that captures both the physical properties of formations and their rock mechanics characteristics. The UMAP algorithm is employed for nonlinear dimensionality reduction, significantly enhancing the resolution of thin reservoir recognition while retaining lithology sensitivity and engineering response characteristics. A CatBoost lithology classification model is developed based on K-means clustering results, effectively addressing the challenge of sparse lithology labels in traditional supervised learning. This approach overcomes the difficulties associated with complex lithology and thin reservoir identification in heterogeneous formations and provides a geology–engineering integration framework for the real-time regulation of horizontal well trajectories. This is particularly significant for precision drilling in unconventional oil and gas reservoirs.

Based on the measured data from eight wells in the Hugin Group of the Volve oilfield, the results indicate that the UMAP algorithm outperforms traditional PCA and t-SNE methods in dimensionality reduction. The Davies–Bouldin Index of UMAP is 61% lower than that of t-SNE, with particularly notable improvements in fine segmentation, especially for the silt–fine sandstone transition zone and the limestone independent cluster. After integrating the dimensionality reduction features, the CatBoost model achieves a balanced accuracy of 92.7% and an F1-score of 89.3%, representing an 18.6% improvement over the single logging parameter model. The recognition accuracy of coarse sand (F1 = 93.1%) and clay shale (F1 = 88.5%) shows the most significant improvement. These results demonstrate that the multi-source and multi-parameter collaborative analysis captures the coupling response between drilling rate variations and gamma-resistivity anomalies, with higher recognition accuracy for thin reservoirs (thickness < 0.3 m) compared with traditional methods. This validates the adaptability of the proposed method to strongly heterogeneous strata.
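For reference, the Davies–Bouldin Index cited above can be computed as in the following sketch, which uses the standard definition (lower values indicate more compact, better separated clusters): the average, over clusters, of the worst-case ratio of summed within-cluster scatter to between-centroid distance. The function names and the toy points are hypothetical and do not reproduce the study's data.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def davies_bouldin(clusters):
    """Davies-Bouldin Index for a list of clusters (each a list of points).

    scatter_i = mean distance of the points in cluster i to its centroid
    DBI       = mean over i of max over j != i of
                (scatter_i + scatter_j) / distance(centroid_i, centroid_j)
    Requires at least two clusters with distinct centroids.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


loose = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
tight = [[[0.0, 0.5], [0.0, 1.5]], [[10.0, 0.5], [10.0, 1.5]]]
print(davies_bouldin(loose), davies_bouldin(tight))
```

In the toy example the tighter clusters halve the within-cluster scatter while the centroids stay fixed, so the index halves as well.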

Future research can focus on three key areas for further breakthroughs. First, improving the real-time performance of the optimization algorithm is crucial. By leveraging lightweight model architectures and edge computing technology, the data processing delay can be reduced to mere seconds, meeting the timing requirements for ultra-deep well drilling decision making. Second, expanding the dimensionality of multi-modal data fusion will be important. Integrating data sources such as geosteering, downhole imaging, and other relevant measurements will help build a multi-dimensional geo-engineering feature space, enhancing the robustness of recognition in complex structural areas like fault fracture zones. Third, exploring the fusion of self-supervised learning with physical constraints offers significant potential. Incorporating prior knowledge of stratigraphic deposition and rock mechanics could guide the unsupervised clustering process, reducing the algorithm’s reliance on the number of samples. Additionally, strengthening the cross-block generalization ability and developing adaptive models suitable for different basin types will be essential to expand the application of intelligent identification technology in various geological settings.

Author Contributions

Conceptualization, A.Z. and F.T.; Data curation, A.Z., Y.Y., B.W., Y.L. and J.L.; Funding acquisition, Y.Y. and F.T.; Investigation, B.W.; Methodology, A.Z., W.Z. and F.T.; Project administration, Y.Y. and F.T.; Resources, Y.Y.; Software, X.F.; Supervision, Y.Y.; Visualization, A.Z., B.W., Y.L., X.F. and W.Z.; Writing—original draft, A.Z.; Writing—review & editing, Y.Y., J.L. and F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Chinese National Key Research and Development Program, grant no. 2019YFA0708301 and no. 2023YFB3905005; the Youth Innovation Promotion Association Foundation of the Chinese Academy of Sciences (2021063); and the Strategic Priority Research Program of the Chinese Academy of Sciences, grant no. XDA14050101. The APC was funded by the Institute of Geology and Geophysics, Chinese Academy of Sciences.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to express our gratitude to Jiongyu Wei for their valuable assistance and contributions to this work.

Conflicts of Interest

Authors Yang Yu, Bin Wang, and Yewen Liu were employed by the company Sinopec Shengli Petroleum Engineering Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The abbreviations used in this article are summarized in the table below:

PCA	Principal Component Analysis
t-SNE	t-Distributed Stochastic Neighbor Embedding
UMAP	Uniform Manifold Approximation and Projection
DBI	Davies–Bouldin Index
SVM	Support Vector Machine
RF	Random Forest
LightGBM	Light Gradient Boosting Machine
CatBoost	Categorical Boosting

References

Li, G.S.; Song, X.Z.; Zhu, Z.P.; Tian, S.Z.; Sheng, M. Research progress and the prospect of intelligent drilling and completion technologies. Pet. Drill. Tech. 2023, 51, 35–47.
Xiong, Q.P.; Cheng, Z.J.; Wen, Y.W. Petroleum geology features and research developments of hydrocarbon accumulation in deep petroliferous basins. Pet. Sci. 2015, 12, 1–53.
Sun, L.D.; Zou, C.N.; Jia, A.L.; Wei, Y.S.; Zhu, R.K.; Wu, S.T.; Guo, Z. Development characteristics and orientation of tight oil and gas in China. Pet. Explor. Dev. 2019, 46, 1073–1087.
Holenka, J.; Best, D.; Evans, M.; Kurkoski, P.; Sloan, W. Azimuthal porosity while drilling. In Proceedings of the SPWLA Annual Logging Symposium, Paris, France, 12–15 June 1995; p. SPWLA-1995-BB.
Bornemann, E.; Bourgeois, T.; Bramlett, K.; Hodenfield, K.; Maggs, D. The application and accuracy of geological information from a logging-while-drilling density tool. In Proceedings of the SPWLA Annual Logging Symposium, Keystone, CO, USA, 26–29 May 1998; p. SPWLA-1998-L.
Carpenter, W.W.; Best, D.; Evans, M. Applications and interpretation of azimuthally sensitive density measurements acquired while drilling. In Proceedings of the SPWLA Annual Logging Symposium, Houston, TX, USA, 15–18 June 1997; p. SPWLA-1997-EE.
Qin, Z.; Tang, B.; Wu, D.; Luo, S.C.; Ma, X.G.; Huang, K.; Deng, C.X.; Yang, H.J.; Pan, H.P.; Wang, Z.H. A qualitative characteristic scheme and a fast distance prediction method of multi-probe azimuthal gamma-ray logging in geosteering. J. Pet. Sci. Eng. 2021, 199, 108244.
Tian, F.; Di, Q.Y.; Zheng, W.H.; Ge, X.M.; Zhang, W.X.; Zhang, J.Y.; Yang, C.C. A formation intelligent evaluation solution for geosteering. Chin. J. Geophys. 2023, 66, 3975–3989.
Datta, D.; Singh, G.; Routray, A.; Mohanty, W.K.; Mahadik, R. Automatic Classification of Lithofacies with Highly Imbalanced Dataset Using Multistage SVM Classifier. In Proceedings of the IECON 2021–47th Annual Conference of the IEEE Industrial Electronics Society, Toronto, ON, Canada, 13–16 October 2021; pp. 1–6.
Li, Y.P.; Luo, M.L.; Ma, S.X.; Lu, P. Massive Spatial Well Clustering Based on Conventional Well Log Feature Extraction for Fast Formation Heterogeneity Characterization. Lithosphere 2022, 2022, 7260254.
Martins Saporetti, C.; Da Fonseca, L.G.; Pereira, E.; Costa De Oliveira, L. Machine learning approaches for petrographic classification of carbonate-siliciclastic rocks using well logs and textural information. J. Appl. Geophys. 2018, 155, 217–225.
Tian, F.; Jin, Q.; Lu, X.B.; Lei, Y.H.; Zhang, L.K.; Zheng, S.Q.; Zhang, H.F.; Rong, Y.S.; Liu, N.G. Multi-layered ordovician paleokarst reservoir detection and spatial delineation: A case study in the Tahe Oilfield, Tarim Basin, Western China. Mar. Pet. Geol. 2016, 69, 53–73.
Tian, F.; Di, Q.Y.; Jin, Q.; Cheng, F.Q.; Zhang, W.; Lin, L.M.; Wang, Y.; Yang, D.B.; Niu, C.K.; Li, Y.X. Multiscale geological-geophysical characterization of the epigenic origin and deeply buried paleokarst system in Tahe Oilfield, Tarim Basin. Mar. Pet. Geol. 2019, 102, 16–32.
Tian, F.; Luo, X.R.; Zhang, W. Integrated geological-geophysical characterizations of deeply buried fractured-vuggy carbonate reservoirs in Ordovician strata, Tarim Basin. Mar. Pet. Geol. 2019, 99, 292–309.
Xing, Y.H.; Yang, H.T.; Yu, W. An Approach for the Classification of Rock Types Using Machine Learning of Core and Log Data. Sustainability 2023, 15, 8868.
Zhang, J.L.; He, Y.B.; Zhang, Y.; Li, W.F.; Zhang, J.J. Well-Logging-Based Lithology Classification Using Machine Learning Methods for High-Quality Reservoir Identification: A Case Study of Baikouquan Formation in Mahu Area of Junggar Basin, NW China. Energies 2022, 15, 3675.
Baudzis, S.; Karłowska-Pik, J.; Puskarczyk, E. Electrofacies as a Tool for the Prediction of True Resistivity Using Advanced Statistical Methods—Case Study. Energies 2021, 14, 6228.
Zheng, W.H.; Tian, F.; Di, Q.Y.; Xin, W.; Cheng, F.Q.; Shan, X.C. Electrofacies classification of deeply buried carbonate strata using machine learning methods: A case study on ordovician paleokarst reservoirs in Tarim Basin. Mar. Pet. Geol. 2021, 123, 104720.
Liu, J.Y.; Tian, F.; Zhao, A.S.; Zheng, W.H.; Cao, W.J. Logging Lithology Discrimination with Enhanced Sampling Methods for Imbalance Sample Conditions. Appl. Sci. 2024, 14, 6534.
Mcdonald, W.J.; Ward, C.E. Logging while drilling: A survey of methods and priorities. In Proceedings of the SPWLA Annual Logging Symposium, Denver, CO, USA, 9–12 June 1976; p. SPWLA-1976-U.
De André, C.A.; Da Cunha, A.M.V.; Boonen, P.; Valant-Spaight, B.; Lefors, S.; Schultz, W. A comparison of logging-while-drilling and wireline nuclear porosity logs in shales from wells in Brazil. Petrophysics 2005, 46, 295–301.
Meador, R.A. Logging while drilling: A Story of Dreams, Accomplishments, And Bright Futures. In Proceedings of the SPWLA Annual Logging Symposium, The Woodlands, TX, USA, 15–19 June 2009; p. SPWLA-2009-43458.
Tollefsen, E.; Weber, A.; Kramer, A.; Sirkin, G.; Hartman, D.; Grant, L. Logging While Drilling Measurements: From Correlation to Evaluation. In Proceedings of the SPE International Oil Conference and Exhibition in Mexico, Veracruz, Mexico, 27 June 2007.
Neville, T.J.; Weller, G.; Faivre, O.; Sun, H. A new-generation LWD tool with colocated sensors opens new opportunities for formation evaluation. SPE Reserv. Eval. Eng. 2007, 10, 132–139.
Akinsanmi, O.B.; Aibangbe, O.; Kienitz, C. Application of azimuthal density while drilling images for dips, facies and reservoir characterization—Niger/delta experience. In Proceedings of the SPE/CIM International Conference on Horizontal Well Technology, Calgary, AB, Canada, 6 November 2000; SPE: Calgary, AB, Canada, 2000.
Ijasan, O.; Torres-Verdín, C.; Preeg, W.E. Inversion-based petrophysical interpretation of logging-while-drilling nuclear and resistivity measurements. Geophysics 2013, 78, D473–D489.
Landar, S.; Velychkovych, A.; Ropyak, L.; Andrusyak, A. A method for applying the use of a smart 4 controller for the assessment of drill string bottom-part vibrations and shock loads. Vibration 2024, 7, 802–828.
Velychkovych, A.; Mykhailiuk, V.; Andrusyak, A. Numerical Model for Studying the Properties of a New Friction Damper Developed Based on the Shell with a Helical Cut. Appl. Mech. 2025, 6, 1.
Velichkovich, A.; Dalyak, T.; Petryk, I. Slotted shell resilient elements for drilling shock absorbers. Oil Gas Sci. Technol.–Rev. D’ifp Energ. Nouv. 2018, 73, 34.
Vukadin, D.; Čogelja, Z.; Vidaček, R.; Brkić, V. Lithology and Porosity Distribution of High-Porosity Sandstone Reservoir in North Adriatic Using Machine Learning Synthetic Well Catalogue. Appl. Sci. 2023, 13, 7671.
Ren, Q.; Zhang, H.B.; Zhang, D.L.; Zhao, X.; Yan, L.Z.; Rui, J.W. A novel hybrid method of lithology identification based on k-means++ algorithm and fuzzy decision tree. J. Pet. Sci. Eng. 2022, 208, 109681.
Zheng, D.; Liu, S.; Chen, Y.; Gu, B. A Lithology Recognition Network Based on Attention and Feature Brownian Distance Covariance. Appl. Sci. 2024, 14, 1501.
Zhong, R.Z.; Johnson, R.L., Jr.; Chen, Z.W. Using machine learning methods to identify coal pay zones from drilling and logging-while-drilling (LWD) data. SPE J. 2020, 25, 1241–1258.
Sun, J.; Li, Q.; Chen, M.Q.; Ren, L.; Huang, G.H.; Li, C.Y.; Zhang, Z.X. Optimization of models for a rapid identification of lithology while drilling—A win-win strategy based on machine learning. J. Pet. Sci. Eng. 2019, 176, 321–341.
Xu, T.; Zhang, W.T.; Li, J.; Liu, H.N.; Kang, Y.; Lv, W.J. Domain generalization using contrastive domain discrepancy optimization for interpretation-while-drilling. J. Nat. Gas Sci. Eng. 2022, 105, 104685.
Mutrif Siddig, O.; Elkatatny, S. Utilizing Drilling Data and Machine Learning in Real-Time Prediction of Poisson’s Ratio. In Proceedings of the SPE Middle East Oil and Gas Show and Conference, Manama, Bahrain, 19–21 February 2023; p. 021.
Oloruntobi, O.; Butt, S. Application of specific energy for lithology identification. J. Pet. Sci. Eng. 2020, 184, 106402.
Mishra, A.; Sharma, A.; Patidar, A.K. Evaluation and Development of a Predictive Model for Geophysical Well Log Data Analysis and Reservoir Characterization: Machine Learning Applications to Lithology Prediction. Nat. Resour. Res. 2022, 31, 3195–3222.
Smillie, Z.; Demyanov, V.; Mckinley, J.; Cooper, M. Unsupervised classification applications in enhancing lithological mapping and geological understanding: A case study from Northern Ireland. J. Geol. Soc. 2023, 180, jgs2022-136.
Iraji, S.; Soltanmohammadi, R.; Matheus, G.F.; Basso, M.; Vidal, A.C. Application of unsupervised learning and deep learning for rock type prediction and petrophysical characterization using multi-scale data. Geoenergy Sci. Eng. 2023, 230, 212241.
Yang, X.; Chen, J.G.; Chen, Z.J. Classification of Alteration Zones Based on Drill Core Hyperspectral Data Using Semi-Supervised Adversarial Autoencoder: A Case Study in Pulang Porphyry Copper Deposit, China. Remote Sens. 2023, 15, 1059.
Singh, H.; Seol, Y.; Myshakin, E.M. Automated Well-Log Processing and Lithology Classification by Identifying Optimal Features Through Unsupervised and Supervised Machine-Learning Algorithms. SPE J. 2020, 25, 2778–2800.
Ren, Q.; Zhang, H.B.; Zhang, D.L.; Zhao, X. Lithology identification using principal component analysis and particle swarm optimization fuzzy decision tree. J. Pet. Sci. Eng. 2023, 220, 111233.
Loriato Potratz, G.; Canchumuni, S.W.; Bermudez Castro, J.D.; Potratz, J.; Pacheco, M.a.C. Automatic Lithofacies Classification with t-SNE and K-Nearest Neighbors Algorithm. Anu. Inst. Geocienc. 2021, 44.
Temizel, C.; Odi, U.; Balaji, K.; Aydin, H.; Santos, J.E. Classifying Facies in 3D Digital Rock Images Using Supervised and Unsupervised Approaches. Energies 2022, 15, 7660.
Isaksen, G.H.; Patience, R.; Graas, G.V.; Jenssen, A.I. Hydrocarbon system analysis in a rift basin with mixed marine and nonmarine source rocks: The South Viking Graben, North Sea. AAPG Bull. 2002, 86, 557–591.
Ravasi, M.; Vasconcelos, I.; Curtis, A.; Kritski, A. Vector-acoustic reverse time migration of Volve ocean-bottom cable data set without up/down decomposed wavefields. Geophysics 2015, 80, S137–S150.
Saha, S.; Vishal, V.; Mahanta, B.; Pradhan, S.P. Geomechanical model construction to resolve field stress profile and reservoir rock properties of Jurassic Hugin Formation, Volve field, North Sea. Geomech. Geophys. Geo-Energy Geo-Resour. 2022, 8, 68.
Maćkiewicz, A.; Ratajczak, W. Principal components analysis (PCA). Comput. Geosci. 1993, 19, 303–342.
Van Der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Mcinnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461.
Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2018; Volume 31.
Petrovic, S. A comparison between the silhouette index and the davies-bouldin index in labelling ids clusters. In Proceedings of the 11th Nordic Workshop of Secure IT Systems, Linköping, Sweden, 19–20 October 2006; pp. 53–64.

Figure 1. The workflow of real-time intelligent identification and precision drilling for strongly heterogeneous formations based on multi-parameters of LWD and drilling engineering.

Figure 2. Geological background of Volve oilfield (modified from references [44,45]): (a) location of the Volve field, (b) contour map and well distribution of Hugin Formation, and (c) geological time column of Volve oilfield reservoir.

Figure 3. Schematic diagram of the Principal Component Analysis (PCA) algorithm.

Figure 4. Schematic diagram of t-SNE algorithm.

Figure 5. Schematic diagram of UMAP algorithm.

Figure 6. Schematic of the K-means algorithm.

Figure 7. Schematic diagram of the CatBoost algorithm.

Figure 8. Visualization of the dataset after dimensionality reduction: (a) PCA dimensionality reduction results, (b) t-SNE dimensionality reduction results, and (c) UMAP dimensionality reduction result.

Figure 9. Separation quality of each lithology class after using different dimensionality reduction algorithms.

Figure 10. Davies–Bouldin Index for each class with a different number of clusters.

Figure 11. Clustering results under the optimal number of clusters (K = 15).

Figure 12. The number of lithological labels of each class contained in each cluster.

Figure 13. Distribution of parametric features across clusters: (a) feature distribution of DT for each cluster and (b) feature distribution of TQA for each cluster.

Figure 14. The lithology corresponding to each class of data points is calibrated according to the LWD logging parameters and imaging.

Figure 15. Model test score.

Figure 16. Performance comparison of UMAP + K-means feature enhancement methods on SVM, RF, LightGBM, and CatBoost models.

Figure 17. Visualization of lithology while drilling intelligent recognition effect.

Table 1. Sample distribution of each lithology category in the dataset.

Lithology	Label	Count	Proportion (%)
Clay shale	CS	1298	13.48
Silt	SI	450	4.67
Fine sand	F	2450	25.45
Coarse sand	C	2295	23.84
Fine sandstone	FS	2029	21.07
Limestone	LS	1106	11.49

Table 2. Detailed information of the standardized dataset.

	GR	DT	RD	RS	NPHI	RHOB	ROP	HKLA	TQA	WOB
mean	35.91	82.16	23.40	25.51	0.19	2.34	24.58	124.97	19.15	6.33
std	17.78	8.47	38.36	39.47	0.05	0.15	12.25	18.38	6.21	2.85
min	7.82	50.45	0.25	0.38	0.02	1.57	2.72	82.86	6.28	0.40
25%	21.97	76.23	1.91	1.96	0.16	2.23	14.88	117.27	13.23	4.22
50%	33.78	83.42	5.94	6.04	0.19	2.30	21.07	124.97	17.71	5.82
75%	46.08	88.30	28.17	31.57	0.21	2.44	30.78	132.67	25.89	7.38
max	153.67	120.02	30.14	32.48	0.71	2.98	61.01	151.36	30.80	13.80

Table 3. Within-cluster scatter for each lithology class after using different dimensionality reduction algorithms.

Lithologic Type	Silt	Fine Sandstone	Limestone	Clay Shale	Fine Sand	Coarse Sand
PCA	2.06	1.39	2.03	2.05	1.40	1.85
t-SNE	9.36	9.51	11.31	9.56	10.15	13.28
UMAP	5.85	6.57	5.37	4.55	7.49	6.19

Table 4. Tuning parameters for SVM, RF, LightGBM, and CatBoost models.

Model	Parameter	Searching Range	Raw	PCA	UMAP	PCA + K-Means	t-SNE + K-Means	UMAP + K-Means
SVM	C	1–50	5	5	5	10	50	50
SVM	γ	0.01–1.0	0.02	0.05	0.5	0.02	0.02	0.02
RF	min_samples_leaf	2–10	2	3	3	3	2	3
RF	max_features	1–6	3	3	3	3	3	3
LightGBM	min_samples_leaf	2–10	3	2	3	2	2	3
LightGBM	learning_rate	0.02–0.5	0.2	0.2	0.5	0.25	0.2	0.2
CatBoost	depth	2–10	5	6	6	5	6	6
CatBoost	learning_rate	0.02–0.5	0.2	0.2	0.25	0.3	0.2	0.2

Table 5. Comparison of different lithology Balanced Accuracies under four models.

Lithology Label	Random Forest	SVM	LightGBM	CatBoost
Silt	0.736	0.700	0.728	0.736
Fine sandstone	0.943	0.879	0.921	0.943
Limestone	0.812	0.718	0.812	0.821
Clay shale	0.855	0.792	0.844	0.855
Fine sand	0.897	0.866	0.880	0.897
Coarse sand	0.930	0.918	0.926	0.930

Table 6. Comparison of different lithology F1-scores under four models.

Lithology Label	Random Forest	SVM	LightGBM	CatBoost
Silt	0.811	0.789	0.850	0.821
Fine sandstone	0.878	0.780	0.893	0.903
Limestone	0.692	0.562	0.775	0.792
Clay shale	0.732	0.649	0.791	0.833
Fine sand	0.823	0.793	0.837	0.846
Coarse sand	0.839	0.850	0.878	0.893

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
