1. Introduction
In the modern architecture of oil and gas exploration and development, lithology identification serves not only as the starting point for stratigraphic layering but also as a fundamental prerequisite for reservoir parameter calculation and favorable reservoir prediction [1, 2]. As global energy exploration advances into complex domains such as deep clastic rocks, tight oil and gas, and shale gas, geological targets exhibit strong heterogeneity and diagenetic complexity [3, 4]. In these high-risk exploration areas, even minor lithology identification errors can propagate through subsequent reserve estimation and fracturing design, adversely affecting reservoir evaluation outcomes [1, 5].
However, facing the “lithological continuum” composed of sandstone, siltstone, and mudstone, well log interpretation is often constrained by the dual challenges of physical non-uniqueness and mineralogical complexity [6, 7]. On one hand, complex mineral compositions trigger significant “overlapping log responses” (i.e., different lithologies exhibiting similar log characteristics); for instance, high-gamma sandstones are frequently confused with mudstones [7]. On the other hand, limited by the vertical resolution of logging instruments, thin interbeds often suffer from response smoothing due to volume effects, which further increases interpretation uncertainty [6]. Confronted with this high-dimensional, non-linear geophysical inversion problem contaminated by noise, the traditional interpretation paradigm relying on manual experience struggles to achieve reliable geological interpretations, challenging both inversion stability and interpretability [8, 9].
To overcome the limitations of physical models, lithology identification technology has undergone a comprehensive evolution from statistical models to deep learning [1, 10]. Early methods primarily relied on shallow machine learning models such as Support Vector Machines (SVMs), Random Forests (RF), and Gradient Boosting Decision Trees (GBDT, e.g., XGBoost) to mitigate non-linear mapping difficulties [11, 12]. In recent years, deep learning technologies, represented by Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTM), have significantly improved the accuracy and automation of lithology identification by leveraging end-to-end feature extraction capabilities [13, 14, 15]. Furthermore, to reduce the prediction variance of single models and enhance robustness, ensemble learning strategies such as voting, stacking, and bagging have been widely adopted to fuse multi-model advantages, thereby improving the consistency and reliability of identification results [1, 2].
Despite the significant progress of existing data-driven methods on standard datasets, they are still constrained by two intrinsic limitations when dealing with cross-well complex geological conditions [11, 13]. The first is the deficiency of physical representation. Mainstream deep models tend to automatically learn features from raw data. However, given the “small sample, strong domain” characteristics of well log data, purely data-driven models lacking physical priors (e.g., texture, trends) struggle to capture robust boundaries with cross-well generalization capabilities [16, 17]. The second is the rigidity of decision-making mechanisms. Sedimentary environments change over time and space, leading to vast differences in prediction difficulty across varying depths (i.e., non-stationarity). Yet, existing ensemble methods mostly adopt fixed-weight strategies, lacking “sample-aware” adaptive capabilities [18]. This inability to dynamically schedule model capabilities hampers identification in critical “transition zones” and of “ambiguous samples”, thereby limiting overall accuracy improvements [11].
Addressing these challenges, this paper proposes an intelligent lithology identification framework featuring multivariate feature enhancement and a sample-aware ensemble. To tackle the strong heterogeneity of geological environments, targeted improvements were made in two dimensions: data representation and decision mechanisms. At the input end, physics-informed enhancement and multi-scale statistics are introduced to construct a multivariate high-dimensional feature system. At the output end, a sample-aware dynamic ensemble strategy is implemented to adjust model weights in real time by evaluating sample uncertainty, overcoming the limitations of traditional static ensembles in adapting to non-stationary geological conditions. This synergistic mechanism of “multivariate high-dimensional feature system” and “dynamic decision adaptation” significantly enhances the model’s generalization ability and robustness under complex cross-well conditions.
The main contributions of this paper are summarized as follows:
A multivariate enhanced feature system is proposed by integrating micro-texture structures, multi-scale statistical measures, and petrophysical indicators. This constructs a multivariate high-dimensional feature space that possesses both stratigraphic continuity and sensitivity to physical properties. This system significantly improves the separability among sandstone, siltstone, and mudstone, effectively mitigating the boundary ambiguity and cross-well representation instability caused by “overlapping log responses” in well log data.
A Sample-Aware Dynamic Confidence-Weighted Ensemble (DCWE) strategy based on an adaptive sharpening mechanism is developed, moving from traditional static weighting to sample-wise adaptive decision-making. Leveraging global model performance as a prior and sample confidence as a posterior, this strategy dynamically adjusts model weights based on local uncertainty during the prediction process, thereby enhancing robustness and adaptability in non-stationary formations, lithology transition zones, and hard-to-classify samples.
The remainder of this paper is organized as follows:
Section 2 introduces the research data;
Section 3 describes the proposed method in detail;
Section 4 evaluates the model performance, validating its advantages in terms of computational efficiency and classification accuracy; and
Section 5 presents the conclusions.
2. Research Data
The dataset used in this study comes from real logging records provided by the 1st Sinopec Artificial Intelligence Innovation Competition. It contains 38,225 logging samples collected from four wells, covering a depth range of 301.04–3316.88 m. The dataset includes three logging features—spontaneous potential (SP), gamma ray (GR), and acoustic transit time (AC)—along with one three-class lithology label. The dataset is complete and contains no missing values.
For clarity and consistency throughout the manuscript, a unified color scheme is adopted to represent different lithofacies in all figures: siltstone (Label 0) is shown in red, sandstone (Label 1) in blue, and mudstone (Label 2) in green. This color convention is consistently applied in all statistical plots and well log profiles presented in this study.
Figure 1 shows a highly imbalanced label distribution: Label 0 accounts for 26.13% (9989 samples), Label 1 for 12.46% (4764 samples), and Label 2 for 61.40% (23,472 samples).
Figure 2, Figure 3 and Figure 4 summarize the overall statistical properties of each logging feature: the mean value of SP is 54.98 (standard deviation 21.59), GR has a mean of 134.72 (standard deviation 34.21), and AC has a mean of 293.89 (standard deviation 57.23). The three lithologies exhibit clear differences in the mean values of SP and GR: Label 2 (mudstone) shows the highest values, Label 1 (sandstone) the lowest, and Label 0 (siltstone) lies in between. This indicates that SP and GR are more sensitive to lithology variations than AC. In addition, the correlations among the three features are relatively weak (SP–GR: 0.14; SP–AC: −0.10; GR–AC: −0.25), making them suitable as base inputs for subsequent feature engineering.
To facilitate subsequent comparative analysis and figure visualization, the color scheme is summarized in Table 1: siltstone (Facies 0) in red, sandstone (Facies 1) in blue, and mudstone (Facies 2) in green. The same scheme is applied consistently across all cross-sections and visualization results.
Figure 5, Figure 6, Figure 7 and Figure 8 illustrate the GR, AC, and SP logging curves and the corresponding lithofacies for the four wells (Well 2, Well 146, Well 298, and Well 2010). Overall, all wells exhibit clear vertical stratigraphic variations, and the logging responses of different lithofacies display stable geological signatures. Mudstone (Facies 2) is typically associated with higher GR, higher AC (i.e., slower acoustic velocity), and a relatively smooth SP curve. Sandstone (Facies 1) is characterized by low GR, low AC, and larger-amplitude SP fluctuations. Siltstone (Facies 0) generally falls between the two end-members and often shows transitional logging characteristics.
Frequent interbedding of sandstone, siltstone, and mudstone occurs in all four wells, with lithofacies transitions being particularly dense in Well 298 and Well 2, reflecting the strong heterogeneity of the depositional system and variations in clastic supply within the study area. The overlapping logging responses and blurred lithofacies boundaries visible in the figures further indicate the weak separability of lithologies.
3. Methodology
Figure 9 illustrates the overall workflow of the intelligent identification method for well log data proposed in this paper, encompassing feature engineering and dynamic weighted ensemble steps. This workflow realizes end-to-end optimization, ranging from raw well log data processing to intelligent identification prediction. Specifically, the proposed intelligent lithology identification framework comprises two core stages: (1) multivariate feature engineering and selection; and (2) dynamic ensemble. The detailed workflow is described as follows:
3.1. Feature Engineering and Selection
Figure 10 shows the comprehensive feature-engineering scheme adopted in this study. Two core techniques, sliding-window construction and gradient computation, are used to enhance feature representation. The sliding-window method concatenates adjacent data points to form local sequence features, capturing stratigraphic continuity patterns, while gradient computation extracts the rate of change in logging parameters to characterize vertical variations in reservoir properties [4, 19, 20].
Based on this foundation, multi-scale statistical features, geophysical parameters, and mineral-composition indicators specifically designed for the three lithologies are incorporated, together with deep features, cross features, and probability-enhanced features. This results in a high-dimensional feature space that integrates both physical significance and statistical characteristics, providing rich discriminative information for machine-learning models [14, 16, 21, 22].
3.1.1. Data Loading and Synchronization Correction
The logging data used in this study include both the training and validation sets. A unified data interface is employed to perform standardized reading and depth alignment, ensuring consistency between feature construction and model training. To eliminate local anomalies and non-physical noise caused by logging conditions, the preprocessing workflow applies unified depth resampling within each well and uses a sliding-mean constraint to smooth both high- and low-frequency noise in the logging curves.
3.1.2. Design of Feature Generation Engine
This study constructs a feature-generation engine composed of three core modules—spatial continuity, multi-scale statistics, and physical mechanisms—to map the raw logging data into a high-dimensional feature space, resulting in a total of 1371 initial features.
To capture spatial continuity within the strata, we follow the strategy of [23] and use sliding windows to construct feature vectors that include neighboring data points, explicitly preserving local variations in the logging curves. The theoretical basis is Tobler’s First Law of Geography [24], which states that “everything is related to everything else, but near things are more related than distant things”. In addition, considering that sedimentary formations exhibit the Milankovitch cyclicity described by Hinnov [25], we further introduce the sinusoidal positional encoding concept from deep learning [26]. Using sine and cosine transformations, linear depth is mapped into nonlinear features with periodic properties, thereby enhancing the model’s ability to perceive absolute position and stratigraphic cyclic structures.
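The two spatial-continuity ideas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the edge-padding rule, the window half-width, and the encoding periods are all assumptions chosen for illustration.

```python
import numpy as np

def sliding_window_features(curve, half_width=2):
    """Stack each sample with its neighbours: (n,) -> (n, 2 * half_width + 1)."""
    curve = np.asarray(curve, dtype=float)
    # Edge-pad so the first and last samples also receive a full window (assumed rule).
    padded = np.pad(curve, half_width, mode="edge")
    width = 2 * half_width + 1
    return np.lib.stride_tricks.sliding_window_view(padded, width).copy()

def depth_encoding(depth, periods=(10.0, 50.0, 200.0)):
    """Sine/cosine encoding of depth; `periods` (in metres) are hypothetical values."""
    depth = np.asarray(depth, dtype=float)
    columns = []
    for period in periods:
        phase = 2.0 * np.pi * depth / period
        columns.append(np.sin(phase))
        columns.append(np.cos(phase))
    return np.column_stack(columns)

# Toy GR curve sampled every 0.5 m.
gr = np.array([80.0, 85.0, 120.0, 140.0, 135.0, 90.0])
depth = np.array([300.0, 300.5, 301.0, 301.5, 302.0, 302.5])
win = sliding_window_features(gr, half_width=1)   # shape (6, 3)
enc = depth_encoding(depth)                       # shape (6, 6)
```

Each row of `win` holds a sample together with its immediate neighbours, and each pair of columns in `enc` traces one period on the unit circle, so samples one period apart receive the same encoding.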
In constructing the multi-scale statistical features, Davis noted that single-point logging responses are often affected by borehole environmental noise [27], whereas statistical aggregation over windows of different sizes (e.g., moving averages and variances) can effectively suppress random noise and enhance trend components. Based on this principle, a multi-scale statistical interaction module is developed in this study to extract statistical measures, such as mean, variance, and kurtosis, over multiple window sizes (5, 10, and 25 points), thereby capturing multi-level variation patterns from micro-scale textures to macro-scale structural trends. Subsequently, higher-order differences and Z-score normalization are applied to strengthen the response to stratigraphic discontinuities and gradient changes, ultimately generating 1299 features.
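As a rough illustration of the multi-scale statistics described above, the sketch below computes rolling mean, variance, and excess kurtosis at the three window sizes, followed by a second-order difference and Z-score scaling. The centering convention, edge padding, and function names are assumptions; the paper's full module produces far more features than this toy version.

```python
import numpy as np

def rolling_stats(curve, window):
    """Centered rolling mean, variance and excess kurtosis over `window` points."""
    curve = np.asarray(curve, dtype=float)
    left = (window - 1) // 2
    right = window - 1 - left
    padded = np.pad(curve, (left, right), mode="edge")  # assumed edge handling
    views = np.lib.stride_tricks.sliding_window_view(padded, window)
    mean = views.mean(axis=1)
    var = views.var(axis=1)
    m4 = ((views - mean[:, None]) ** 4).mean(axis=1)
    # Flat windows have undefined kurtosis; report 0 there.
    kurt = np.where(var > 1e-12, m4 / np.maximum(var, 1e-12) ** 2 - 3.0, 0.0)
    return mean, var, kurt

def multi_scale_features(curve, windows=(5, 10, 25)):
    """Concatenate the rolling statistics of every window size column-wise."""
    feats = []
    for w in windows:
        feats.extend(rolling_stats(curve, w))
    return np.column_stack(feats)

def second_difference(curve):
    """Second-order difference with the same length as the input."""
    curve = np.asarray(curve, dtype=float)
    return np.diff(curve, n=2, prepend=curve[0], append=curve[-1])

def zscore(x, eps=1e-12):
    """Column-wise Z-score normalization."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

rng = np.random.default_rng(0)
gr = 100.0 + 10.0 * rng.standard_normal(60)   # synthetic GR curve
feats = multi_scale_features(gr)              # shape (60, 9)
feats_z = zscore(feats)
d2 = second_difference(gr)                    # shape (60,)
```

Small windows respond to thin-bed texture while the 25-point window tracks the broader trend, which is the multi-level behaviour the paragraph describes.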
To compensate for the limited physical interpretability of purely data-driven models, this study further introduces a set of rock-physics-derived indicators. Following the classical rock-physics volumetric model of Ellis and Singer [28] and the logging-interpretation standards of Asquith and Krygowski [29], a series of geologically meaningful derived features is constructed from the three conventional logging curves (GR, AC, and SP). As a first step, the three curves are min–max normalized on a per-well basis to ensure consistent scaling across different wells, mapping all values to the range [0, 1]:

$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \quad X \in \{\mathrm{GR}, \mathrm{AC}, \mathrm{SP}\}$$

where $X_{\min}$ and $X_{\max}$ are the minimum and maximum of curve $X$ within the well.
The normalized curves retain their original geological implications: $\mathrm{GR}_{\mathrm{norm}}$ reflects clay content, $\mathrm{AC}_{\mathrm{norm}}$ describes porosity structure and formation compaction, and $\mathrm{SP}_{\mathrm{norm}}$ represents interlayer electrochemical potential responses and permeability characteristics. Based on these normalized curves, various lithology-indicator features are constructed to enhance the model’s sensitivity to the three lithologies: sandstone, siltstone, and mudstone. For example, mudstone intervals typically exhibit high GR values and weak SP contrasts; therefore, a mudstone indicator is constructed by combining these two responses.
Sandstone typically exhibits low GR and low AC (i.e., high wave velocity), so the combination of these two responses serves as a sandstone indicator.
Siltstone, as the intermediate lithology between sandstone and mudstone, has both a mud component and partial permeability; a siltstone indicator is defined accordingly.
The above indicators are not intended to strictly describe mineral content. Rather, they are combination features with geological significance, built from logging response patterns, that improve the separability of the feature space.
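The per-well min–max scaling that precedes these indicators can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `eps` guard for constant curves, and the toy well labels are assumptions.

```python
import numpy as np

def minmax_per_well(values, well_ids, eps=1e-12):
    """Scale a logging curve to [0, 1] separately inside each well."""
    values = np.asarray(values, dtype=float)
    well_ids = np.asarray(well_ids)
    scaled = np.empty_like(values)
    for well in np.unique(well_ids):
        mask = well_ids == well
        lo = values[mask].min()
        hi = values[mask].max()
        # `eps` avoids division by zero for a constant curve (assumed safeguard).
        scaled[mask] = (values[mask] - lo) / max(hi - lo, eps)
    return scaled

# Two toy wells with different GR baselines.
gr = np.array([60.0, 90.0, 120.0, 100.0, 150.0, 200.0])
wells = np.array(["W2", "W2", "W2", "W146", "W146", "W146"])
gr_norm = minmax_per_well(gr, wells)
```

Scaling within each well rather than over the pooled data removes well-specific baseline shifts, so a normalized value of 0.5 means "mid-range for this well" in every well.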
Given the absence of density and neutron curves in this dataset, we developed a pseudo-porosity metric based on acoustic transit time to characterize material property variations. This variable reflects the relative pore structure of the formation and is used to supplement the geological information of AC in reservoir property identification.
To better demonstrate the synergistic effect of the three logging curves in sandstone-mudstone identification, this study also develops a combined GR–AC–SP feature, defined as a weighted linear combination of the three curves, to characterize the overall response trend.
The weights reflect the relative contribution of the three curves in distinguishing sandstone and mudstone, with GR showing the highest sensitivity to clay content, followed by SP and AC. This feature is not a physical volumetric model but rather a statistically defined linear combination designed to enhance feature discriminability and improve the overall lithology classification performance of machine-learning models.
Through the construction of these rock-physics-derived features, the original logging curves are mapped into an enhanced representation carrying geological implications such as clay content, permeability, electrical response, and pore-structure characteristics. A total of 43 features are generated, effectively improving the model’s robustness under complex stratigraphic conditions.
3.1.3. Feature Combination and Cleaning
Given the constructed 1371-dimensional feature space, directly feeding all features into the model may lead to the “curse of dimensionality”. As noted by Guyon and Elisseeff, a single feature-selection method may introduce structural bias: for example, mutual information tends to favor high-cardinality features [30], whereas linear methods ignore nonlinear interactions. To ensure the robustness of the selection results, this study develops a parallel scoring framework that integrates information-theoretic, statistical, and machine-learning perspectives.
According to the information-theoretic framework of Cover and Thomas [31], mutual information (MI) characterizes arbitrary dependency between features and labels through joint probability distributions, making it well suited for retaining features that are not linearly separable yet carry important geological significance:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

Here, the joint probability distribution of feature $X$ and label $Y$ is denoted by $p(x, y)$, and the marginal probability distributions are $p(x)$ and $p(y)$. The higher the MI value, the richer the information contained in the feature.
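A simple way to see the mutual-information score in action is a histogram-based estimate for one continuous feature and a discrete label. The equal-width binning and the number of bins below are assumptions made for this sketch; the paper does not state which estimator was used.

```python
import numpy as np

def mutual_information(feature, labels, bins=10):
    """Histogram estimate of I(X; Y) in nats for a continuous X and discrete Y."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    # Discretize the feature into equal-width bins (assumed choice).
    edges = np.histogram_bin_edges(feature, bins=bins)
    x_bin = np.digitize(feature, edges[1:-1])          # bin index in 0 .. bins-1
    classes, y_idx = np.unique(labels, return_inverse=True)
    # Joint distribution p(x, y) from co-occurrence counts.
    joint = np.zeros((bins, classes.size))
    np.add.at(joint, (x_bin, y_idx), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)              # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)              # marginal p(y)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

labels = np.array([0] * 50 + [1] * 50)
informative = labels.astype(float)                     # feature fully determines the label
uninformative = np.tile([0.0, 1.0], 50)                # feature independent of the label
mi_high = mutual_information(informative, labels)
mi_low = mutual_information(uninformative, labels)
```

A feature that fully determines a balanced binary label reaches the maximum of log 2 nats, while a feature that is independent of the label scores zero.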
On this basis, one-way analysis of variance (ANOVA) is further applied to perform linear discriminative testing on the candidate features. ANOVA evaluates the significance of between-class differences by comparing the ratio of the between-class mean square to the within-class mean square (the F-statistic) [32]:

$$F = \frac{\sum_{k=1}^{K} n_k \left(\bar{x}_k - \bar{x}\right)^2 / (K - 1)}{\sum_{k=1}^{K} \sum_{i=1}^{n_k} \left(x_{ki} - \bar{x}_k\right)^2 / (N - K)}$$

Here, $K$ denotes the number of classes, $N$ represents the total number of samples, $n_k$ and $\bar{x}_k$ are the sample count and mean of class $k$, and $\bar{x}$ is the overall mean. A larger F value indicates that the feature exhibits a more significant distributional difference across lithology categories, implying stronger linear separability.
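The F-statistic can be checked by hand on a small example. The sketch below implements the standard one-way ANOVA ratio directly in NumPy; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def anova_f(feature, labels):
    """One-way ANOVA F-statistic of a single feature across label classes."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_total = feature.size
    k = classes.size
    grand_mean = feature.mean()
    ss_between = 0.0
    ss_within = 0.0
    for c in classes:
        x = feature[labels == c]
        ss_between += x.size * (x.mean() - grand_mean) ** 2
        ss_within += np.sum((x - x.mean()) ** 2)
    ms_between = ss_between / (k - 1)       # between-class mean square
    ms_within = ss_within / (n_total - k)   # within-class mean square
    return ms_between / ms_within

# Three classes with means 2, 3 and 6 and identical spread.
x = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
f_value = anova_f(x, y)
```

For this data the between-class sum of squares is 26 (mean square 13 with 2 degrees of freedom) and the within-class sum of squares is 6 (mean square 1 with 6 degrees of freedom), giving F = 13.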
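Considering that many geological features demonstrate discriminative power only when used in combination, this study also adopts the Light Gradient Boosting Machine (LightGBM) [33], which leverages the split-gain mechanism of gradient-boosting trees to identify high-order interaction features. The importance of feature $j$ is the total gain accumulated over all splits that use it:

$$\mathrm{Imp}(j) = \sum_{t=1}^{M} \sum_{\eta \in \mathcal{T}_t} \mathbb{1}\left[v(\eta) = j\right] \Delta L_{\eta}$$

Here, $M$ denotes the number of trees in the ensemble, $\mathcal{T}_t$ represents the set of nodes in the $t$-th tree, $v(\eta)$ denotes the splitting feature selected at node $\eta$, and $\Delta L_{\eta}$ is the gain in the loss function introduced by that split.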
Through the complementary screening across the aforementioned three dimensions, this mechanism eliminated approximately 276 redundant features. Consequently, an optimal feature subset balancing information content, discriminability, and interactivity was constructed, providing a low-redundancy and high signal-to-noise ratio input foundation for subsequent model training.
From the finally retained features, 10 representative features were selected for demonstration, as summarized in Table 2. These correspond to seven feature generation strategies: spatial continuity modeling, depth encoding, multi-scale statistics, standardization enhancement, rate of change features, physics-derived indicators, and curve interaction.
3.2. Model Training and Dynamic Integration
This phase aims to construct a robust lithology classification model utilizing the selected feature set. To ensure the independence of the training and evaluation processes, the dataset is partitioned by well ID, thereby simulating realistic cross-well prediction scenarios.
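Partitioning by well ID can be sketched with a small generator that holds out one well at a time. Section 4.1 additionally reserves an internal validation well from the training wells; the rule used here to pick it (the last remaining training well) is an assumption, as are the function name and the toy sample counts.

```python
import numpy as np

def leave_one_well_out(well_ids):
    """Yield index splits in which each well is held out once as the test well."""
    well_ids = np.asarray(well_ids)
    wells = list(np.unique(well_ids))
    for test_well in wells:
        train_wells = [w for w in wells if w != test_well]
        val_well = train_wells[-1]     # assumed rule for the internal validation well
        fit_wells = train_wells[:-1]
        yield {
            "test_well": test_well,
            "val_well": val_well,
            "fit_idx": np.flatnonzero(np.isin(well_ids, fit_wells)),
            "val_idx": np.flatnonzero(well_ids == val_well),
            "test_idx": np.flatnonzero(well_ids == test_well),
        }

# Toy data: four wells with 3, 2, 4 and 1 samples.
well_ids = np.repeat([2, 146, 298, 2010], [3, 2, 4, 1])
folds = list(leave_one_well_out(well_ids))
```

Because every index belongs to exactly one of the three sets in each fold, no depth sample from the test well can leak into fitting, early stopping, or weight estimation.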
During model training, a model library composed of structurally diverse base models was constructed. This library includes Gradient Boosting Decision Trees (e.g., LightGBM, XGBoost) and deep sequence models (e.g., CNN-LSTM, Tabular Transformer), designed to provide complementary predictive information derived from distinct learning paradigms.
To overcome the limitations of traditional static weighting in adapting to sample variability within complex formations, this paper proposes the Dynamic Confidence-Weighted Ensemble (DCWE) mechanism as the core methodology. Specifically, this mechanism first constructs global performance weights based on the Macro-F1 scores of base models on validation wells. Subsequently, during the inference phase, it generates dynamic weights by incorporating the model’s confidence in the current sample. Through the joint modulation of “global reliability” and “sample-level uncertainty”, DCWE achieves sample-wise adaptive fusion, thereby attaining superior robustness and stability under cross-well conditions, within lithology transition zones, and across non-stationary depth sections.
3.2.1. Workflow of the Proposed Method
After completing the feature engineering, multiple tree-based models are constructed and trained, including LightGBM, XGBoost, CatBoost, ExtraTrees, and RandomForest. These models exhibit complementary strengths in terms of loss-optimization paths, feature-splitting strategies, and bias–variance characteristics. Therefore, they are trained in parallel to enhance the diversity and expressive capacity of the model library.
To further capture the sequential dependencies and higher-order nonlinear patterns in the logging data, two types of deep-learning models are trained: a CNN–LSTM model, which jointly extracts local textural patterns and long- and short-range dependencies across depth, and an enhanced Tabular Transformer, which models structured interactions among high-dimensional inputs. All deep-learning models are monitored on the validation wells, and early stopping is applied to some models to suppress overfitting during training.
To achieve optimal fusion of heterogeneous models, this study proposes a confidence-regulated Dynamic Weighted Ensemble framework. In this framework, the weight assigned to each model is determined jointly by its global performance and local confidence, thus reflecting both the model’s overall reliability and its sample-specific responsiveness.
At the global level, the Macro-F1 scores obtained by each base model on the validation wells are used to derive its base weight, representing its stability and credibility in overall performance. At the same time, during the testing stage, the algorithm evaluates the predicted probabilities for each sample to characterize the model’s local confidence: models that produce more confident predictions receive higher weights for that sample, whereas those with lower confidence are down-weighted. This results in a weighting mechanism that dynamically adapts to variations in sample distribution.
By combining the global base weight with the dynamic, confidence-based adjustment, the final per-sample weight integrates both model-level reliability and sample-level certainty. This ensemble strategy not only improves overall predictive performance but also significantly enhances the generalization capability under mixed-well conditions.
3.2.2. Architecture of the Proposed Model
Figure 11 presents the two-stage dynamic hybrid ensemble architecture proposed in this study. The overall workflow comprises a parallel, heterogeneous “Multi-Model Prediction Layer” and a “Sample-Level Dynamic Weighted Ensemble Layer”, achieving full-process optimization from multivariate feature extraction to model fusion.
Multi-Model Prediction Layer
Following generalized feature engineering and feature space expansion, a multi-model prediction layer composed of tree-based and deep models was constructed to characterize the feature structure of well log data from different perspectives. The tree-based branch includes common ensemble algorithms such as LightGBM, XGBoost, CatBoost, ExtraTrees, and RandomForest. By employing feature partitioning, these models effectively capture non-linear relationships in high-dimensional data and remain robust to feature scaling and noise, thereby providing reliable baseline predictions within the overall system. Complementarily, the deep learning branch includes a CNN–LSTM network and an enhanced Tabular Transformer, utilized to model the local morphology and cross-depth correlations of well log curves. Specifically, the CNN–LSTM network reconstructs processed features into sequences, extracts local features via convolution, models depth-wise dependencies using LSTM, and outputs classification results combined with an attention mechanism; meanwhile, the Tabular Transformer maps features into an embedding space, establishes inter-feature relationships through multiple Transformer Blocks, and obtains final predictions via pooling and a Multi-Layer Perceptron (MLP). The model outputs generated by both branches are fed into the dynamic weighted ensemble module in the second stage to obtain the final classification results.
Dynamic Confidence-Weighted Ensemble Layer
Traditional static ensemble strategies rely solely on fixed weights for the linear combination of model outputs, inherently ignoring the non-stationary characteristics of prediction uncertainty across different geological samples. To overcome this limitation, this paper proposes a Sample-Aware Dynamic Ensemble Strategy. This strategy achieves optimal fusion at the sample level by constructing a dual constraint mechanism of “global performance priors” and “local confidence posteriors”. Its core lies in the algorithm’s ability to “perceive” the prediction difficulty of each sample in real time and adaptively adjust the shape of the weight distribution, which is realized through the following two components.
The global reliability anchor evaluates a model’s inherent performance using the macro F1 score on an independent validation set. To further reinforce the dominant role of superior models, we apply a square scaling operator and normalize the results to obtain static benchmark weights. This metric reflects the model’s global generalization capability, providing stable prior constraints for ensemble learning.
Sample-aware sharpening mechanism: For test samples, the instantaneous confidence level of each model is first quantified by the maximum value of its output probability vector. The innovation of this paper lies in introducing a sample-aware adaptive sharpening factor. This factor is inversely related to the average confidence of the models on the current sample, thereby enabling real-time control of the weight distribution pattern.
The final fusion mechanism takes the product of the global prior and the local sharpening weights as the per-sample model weight; the integrated prediction is the weighted sum of all model probabilities.
This sample perception mechanism essentially functions as an “uncertainty gate”:
On “fuzzy samples”, where the models disagree strongly and the average confidence is low, the algorithm perceives high uncertainty and automatically increases the sharpening factor, producing a “Matthew effect” that locks onto the most credible local model and suppresses noise.
On high-confidence “typical samples”, the algorithm perceives low uncertainty and automatically reduces the sharpening factor, synthesizing the opinions of multiple models through “democratic voting” to reduce variance.
This real-time switching of decision logic based on sample characteristics significantly improves the model’s generalization robustness under complex non-stationary geological conditions.
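The fusion rule described in this section can be sketched in a few lines of NumPy. Only the overall structure follows the text (squared Macro-F1 prior, maximum-probability confidence, a sharpening factor that grows as average confidence falls, product of prior and sharpened confidence, weighted sum of probabilities). The linear form of the sharpening factor and the bounds `gamma_min` and `gamma_max` are assumptions, since the exact mapping is not given here.

```python
import numpy as np

def dcwe_predict(probas, val_macro_f1, gamma_min=1.0, gamma_max=4.0):
    """Fuse model probabilities with sample-wise weights.

    probas: array of shape (n_models, n_samples, n_classes).
    val_macro_f1: validation Macro-F1 of each model, shape (n_models,).
    """
    probas = np.asarray(probas, dtype=float)
    f1 = np.asarray(val_macro_f1, dtype=float)
    # Global prior: squared Macro-F1, normalized over models.
    w_global = f1 ** 2 / np.sum(f1 ** 2)
    # Local confidence: maximum class probability per model and sample.
    conf = probas.max(axis=2)
    mean_conf = conf.mean(axis=0)
    # Sharpening factor rises as average confidence falls (assumed linear map).
    gamma = gamma_min + (gamma_max - gamma_min) * (1.0 - mean_conf)
    # Per-sample weight: prior times sharpened confidence, normalized over models.
    weights = w_global[:, None] * conf ** gamma[None, :]
    weights /= weights.sum(axis=0, keepdims=True)
    fused = np.einsum("ms,msc->sc", weights, probas)
    return fused, weights

# Two models, two samples (one clear, one ambiguous), three classes.
probas = np.array([
    [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]],
    [[0.80, 0.10, 0.10], [0.34, 0.33, 0.33]],
])
fused, weights = dcwe_predict(probas, val_macro_f1=[0.70, 0.60])
```

In this toy case the more confident first model receives a larger share of the weight on the ambiguous second sample than on the clear first sample, which is the "uncertainty gate" behaviour described above.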
4. Experimental Results and Discussion
4.1. Experimental Setup
The experimental dataset is derived from real-world well logging records provided by the 1st Sinopec Artificial Intelligence Innovation Competition, reflecting practical oil and gas exploration scenarios. To rigorously simulate unseen-well prediction and prevent data leakage, a strict Leave-One-Group-Out (LOGO) cross-validation strategy was adopted. Following the methodology described in [34], well IDs were used as grouping labels, where three wells were used for model training and the remaining well was held out exclusively for testing. To further ensure scientific rigor, early stopping and Dynamic Class Weight Estimation were conducted using an internal validation well selected only from the training wells, rather than from the test well. This internal validation process was strictly isolated from the held-out test well, thereby preserving the integrity of the LOGO evaluation protocol. As a result, the test well remained completely unseen during all stages of model selection and optimization, ensuring a fully blind evaluation.
Model performance was assessed using four complementary metrics: Accuracy, Macro F1-score, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa. Accuracy provides an overall measure of correctness, while Macro F1-score mitigates the impact of class imbalance commonly observed in geological datasets by assigning equal importance to each class [35]. MCC offers a robust statistical evaluation by incorporating all elements of the confusion matrix, yielding reliable performance estimates even under severe class imbalance [36]. Finally, Cohen’s Kappa measures the agreement between predictions and ground truth beyond chance, further enhancing the interpretability and reliability of the experimental results [37].
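All four metrics can be derived from a single confusion matrix, as the following sketch shows. It uses the standard multiclass definitions of Macro F1, MCC, and Cohen's Kappa; the function names and the toy labels are illustrative assumptions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def classification_metrics(y_true, y_pred, n_classes=3):
    cm = confusion_matrix(y_true, y_pred, n_classes).astype(float)
    n = cm.sum()
    tp = np.diag(cm)
    actual = cm.sum(axis=1)       # samples per true class
    predicted = cm.sum(axis=0)    # samples per predicted class
    accuracy = tp.sum() / n
    # Per-class F1 = 2TP / (2TP + FP + FN) = 2TP / (actual + predicted).
    denom = actual + predicted
    f1 = np.where(denom > 0, 2.0 * tp / np.maximum(denom, 1.0), 0.0)
    macro_f1 = f1.mean()
    # Multiclass Matthews correlation coefficient.
    num = tp.sum() * n - actual @ predicted
    den = np.sqrt((n ** 2 - predicted @ predicted) * (n ** 2 - actual @ actual))
    mcc = num / den if den > 0 else 0.0
    # Cohen's Kappa: observed agreement corrected for chance agreement.
    p_chance = (actual @ predicted) / n ** 2
    kappa = (accuracy - p_chance) / (1.0 - p_chance) if p_chance < 1 else 0.0
    return {"accuracy": accuracy, "macro_f1": macro_f1, "mcc": mcc, "kappa": kappa}

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
metrics = classification_metrics(y_true, y_pred)
```

On this toy example accuracy is 0.75 while Macro F1 is about 0.72, because the error on the small class 0 is weighted as heavily as the errors on the larger class 2.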
4.2. Verification of the Effectiveness of Feature Engineering
Prior to model training, it is necessary to verify the effectiveness of the multivariate feature engineering proposed in this paper, including sliding window features, multi-scale statistical features, pseudo-porosity, depth gradients, and cross features.
4.2.1. Analysis of Separability in Feature Space
Figure 12 displays the distribution of raw well log features and engineered features in the UMAP (Uniform Manifold Approximation and Projection) two-dimensional embedding space. In the figure, each point represents a depth sample point from the training set, with colors indicating the corresponding lithology labels (facies 0/1/2). UMAP is a nonlinear dimensionality reduction method based on manifold learning that maps high-dimensional data to a low-dimensional space while preserving local neighborhood structures as much as possible to reveal latent cluster structures; it has been widely used for the visualization and pattern recognition of high-dimensional data in the geoscience and energy fields [38, 39, 40]. Therefore, this figure can be used to examine whether a structure of “intra-class compactness and inter-class separation” is formed in different feature spaces, verifying the improvement effect of feature engineering on lithology separability from a visualization perspective.
Raw Log Curves UMAP: The raw features consist of three conventional well log curves: SP, GR, and AC. Due to characteristics such as high noise, overlapping physical property responses, and blurred class boundaries in well log curves, the three lithology classes are mixed in the two-dimensional space after dimensionality reduction. The large area of color overlap indicates that the discriminative ability of raw well log features in the feature space is limited.
Engineered Feature UMAP: Engineered features consist of high-dimensional features such as multi-scale statistics, gradients, and morphology, which describe the local patterns and variation trends of well log curves more completely. After UMAP maps the high-dimensional engineered features to two dimensions, each lithology class forms relatively compact clusters, and distinct sparse zones appear between different clusters, indicating that class differences are captured more effectively. Overall, engineered features form a structure of “intra-class compactness and inter-class separation” in the embedding space, whereas raw well log features struggle to present this discriminative ability. From a visual standpoint, this supports the effectiveness of the feature engineering scheme proposed in this paper for the lithology classification task.
4.2.2. Original Features vs. Engineered Features
To isolate the contribution of feature engineering, models trained on the initial feature set and on the full feature set are compared under the same validation setting in
Table 3.
The comparison in
Table 3 shows that feature engineering improves the performance of all models. With original features, the Dynamic Weighted Ensemble performed best, achieving an accuracy of 0.6927 and a Macro-F1 score of 0.4970. After feature engineering, it still outperforms the other models, with an accuracy of 0.8439 and a Macro-F1 score of 0.7033, corresponding to gains of 0.1512 in accuracy and 0.2063 in Macro-F1.
4.3. Comparison of Integration Strategies
To verify the effectiveness of the method proposed in this paper, we compared the DCWE model with existing mainstream models. The specific performance metrics of each model on the test set are shown in
Table 4.
As shown in
Table 4, which reports the average metrics of each model across the four wells, ensemble learning strategies generally outperform single baseline models in overall accuracy. The evaluation uses four metrics, namely Accuracy, Macro F1-score (Macro F1), Matthews Correlation Coefficient (MCC), and Cohen’s Kappa, to characterize model performance in terms of overall precision, class balance, and prediction consistency. The DCWE model ranks first with an average accuracy of 0.8186, on par with the static ensemble model (0.8175), and both surpass the best-performing single model, Transformer (0.8099). This result is consistent with the known advantage of multi-model fusion in reducing prediction variance and enhancing overall discriminative capability. Although the difference in accuracy between the static ensemble and DCWE is marginal, the two diverge markedly in Macro F1, which measures class balance. The Macro F1 of the static ensemble model is only 0.5932, compared with 0.6661 for DCWE; this gap of about 0.07 points to the limitation of static averaging strategies on class-imbalanced data, specifically their tendency to be dominated by majority classes at the expense of minority-class recognition. By introducing a dynamic confidence mechanism, DCWE instead adjusts weights adaptively for hard-to-classify samples. While maintaining high accuracy, it improves the Macro F1 score and also surpasses the CNN-BiLSTM model (0.6302), which has advantages in sequence feature extraction.
Furthermore, the MCC and Kappa coefficients, which evaluate classification consistency while correcting for chance agreement, corroborate the above conclusions. DCWE achieved 0.5774 and 0.5733 on these two metrics, respectively, both higher than the static ensemble model (MCC: 0.5649, Kappa: 0.5601) and the other comparison methods. This consistent advantage across all four metrics indicates that the proposed method is more robust and generalizes better when facing complex geological data distributions, and that it provides more reliable and balanced predictions than traditional static ensembles. In addition, the ablation results reported in
Table A4 (
Appendix A), where all depth-related features are removed, provide strong evidence for the robustness of the proposed DCWE framework. Even in the absence of explicit depth or elevation information, the DCWE model maintains a stable Macro F1-score of approximately 0.65, outperforming all individual base models and the static ensemble. This result indicates that the proposed method does not overly rely on positional cues, but instead learns transferable rock-physics-related representations from multivariate logging responses. Such behavior is particularly important for cross-well applications, where depth distributions and stratigraphic thicknesses may vary significantly between wells. The consistently high MCC and Kappa values further confirm that DCWE preserves reliable classification behavior under degraded feature conditions, highlighting its superior robustness compared to conventional ensemble strategies.
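The difference between static and dynamic weighting discussed above can be illustrated with a minimal sketch. The exact DCWE weighting rule is not restated in this section, so the sketch assumes a simple form in which each base model's global prior weight is multiplied by its per-sample confidence (taken here as the maximum class probability) raised to a hypothetical sharpening exponent `gamma`. The two models, their priors, and their probability outputs are invented for illustration.

```python
def normalize(weights):
    """Scale weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def fuse(probs, weights):
    """Weighted average of the base models' class-probability vectors."""
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]


def static_ensemble(probs, priors):
    """Fixed weights: every sample uses the same global priors."""
    return fuse(probs, normalize(priors))


def dynamic_ensemble(probs, priors, gamma=2.0):
    """Assumed rule: weight = prior * (max class probability) ** gamma."""
    confidence = [max(p) for p in probs]
    raw = [pr * cf ** gamma for pr, cf in zip(priors, confidence)]
    return fuse(probs, normalize(raw))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical sample: model A has the larger global prior but is unsure,
# while model B has the smaller prior but is confident in class 1.
priors = [0.8, 0.2]
probs = [[0.55, 0.30, 0.15], [0.10, 0.85, 0.05]]

static_probs = static_ensemble(probs, priors)
dynamic_probs = dynamic_ensemble(probs, priors)
print(argmax(static_probs), argmax(dynamic_probs))
```

In this example the static ensemble follows the high-prior model and selects class 0, whereas the confidence-weighted variant shifts weight toward the confident model and selects class 1. This is the kind of sample-wise re-weighting that a fixed-weight scheme cannot express.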
4.4. Analysis of Model Robustness and Generalization Ability
To quantitatively evaluate the adaptability of the models in the face of changing geological conditions and data distribution shifts, we calculated the performance fluctuation range of each model across four wells, the detailed results of which are presented in
Table 5.
The experimental results reveal significant limitations in the generalization capabilities of single benchmark models. Although models such as Extra Trees and CNN-BiLSTM achieve high accuracy upper bounds on specific wells (0.8877 and 0.8711, respectively), their performance is unstable: their lower accuracy bounds drop to 0.6958 and 0.7123, giving ranges of 0.1919 and 0.1588, respectively. This oscillation indicates that these models are prone to overfitting the specific distributional characteristics of the training data; consequently, even slight shifts in lithology combinations or log responses in the test well can trigger “performance collapse”, failing to meet the “lower limit guarantee” required of interpretation results in practical engineering contexts. In contrast, the DCWE model proposed in this paper is more robust and generalizes better. Its accuracy stays within the interval of 0.7905 to 0.8458 (a range of 0.0553); its lower bound of 0.7905 exceeds that of all comparison models, and it exhibits the smallest fluctuation amplitude, indicating reliable discriminative capability even on the most challenging well. Furthermore, regarding the MCC and Kappa metrics, which measure comprehensive consistency, DCWE achieves average values of 0.5774 and 0.5733, respectively, ranking first among all models. These results support the view that the dynamic confidence weighting mechanism employed by DCWE can adaptively smooth the bias and variance of single base classifiers based on sample prediction difficulty, thereby mitigating interference from local geological features and yielding consistent and credible lithology identification results across variable geological environments.
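The fluctuation range used in this comparison is simply the difference between the best and worst per-well score. The short sketch below reproduces it from the accuracy bounds quoted above; only the two extreme values per model are available in the text, so the lists contain just those bounds, and the function name is hypothetical.

```python
def fluctuation_range(scores):
    """Difference between the highest and lowest per-well score."""
    return max(scores) - min(scores)


# Lower and upper accuracy bounds across the four wells, as quoted in the text.
accuracy_bounds = {
    "Extra Trees": [0.6958, 0.8877],
    "CNN-BiLSTM": [0.7123, 0.8711],
    "DCWE": [0.7905, 0.8458],
}

ranges = {
    name: round(fluctuation_range(scores), 4)
    for name, scores in accuracy_bounds.items()
}
most_stable = min(ranges, key=ranges.get)
print(ranges, most_stable)
```

With these bounds, the ranges are 0.1919 for Extra Trees, 0.1588 for CNN-BiLSTM, and 0.0553 for DCWE.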
It should be noted that the dataset used in this study is highly imbalanced, with mudstone accounting for 61.40% of all samples, whereas sandstone—the primary target lithology in many exploration scenarios—constitutes only a minority portion of the data. Under such conditions, global accuracy (84.58%) becomes a secondary indicator of model performance, as it can be dominated by majority-class predictions. To address this issue, per-class precision, recall, and F1-scores are reported in
Appendix A and warrant explicit discussion here. In particular, sandstone recall is a critical metric from a geological and engineering perspective. As shown in
Table A2, the proposed DCWE model achieves the highest sandstone F1-score and a substantially improved recall compared to both single models and the static ensemble. This indicates that DCWE more effectively mitigates majority-class bias and enhances sensitivity to minority lithologies, without sacrificing overall stability. Such balanced behavior is essential for practical reservoir characterization, where missing sandstone intervals may lead to significant exploration and development risks.
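Why global accuracy is a weak indicator under this imbalance can be seen from a trivial baseline that always predicts mudstone. In the sketch below, the mudstone share of 61.4% comes from the text, while the split of the remaining samples between sandstone and siltstone is a hypothetical choice made only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """Recall of each class: correctly predicted samples / true samples."""
    recalls = {}
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / support
    return recalls


# 1000 synthetic samples: 614 mudstone (61.4%, from the text); the 150/236
# sandstone/siltstone split is hypothetical.
y_true = ["mudstone"] * 614 + ["sandstone"] * 150 + ["siltstone"] * 236
y_pred = ["mudstone"] * len(y_true)  # majority-class baseline

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall_per_class(y_true, y_pred)
macro_recall = sum(baseline_recall.values()) / len(baseline_recall)
print(baseline_accuracy, baseline_recall, macro_recall)
```

The baseline reaches an accuracy of 0.614 without identifying a single sandstone sample, which is why per-class recall and macro-averaged metrics are needed alongside accuracy.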
4.5. Visual Analysis
To qualitatively validate the advantages of the proposed method, and given the performance variance observed across wells in the four-fold LOGO statistics, this section selects Well 298 for detailed visualization and analysis. This selection ensures representativeness while maintaining conciseness, as Well 298 is characterized by a complete geological structure, typical lithological transitions, and a moderate difficulty level within the LOGO framework; quantitative results for the remaining wells have already been presented in the preceding sections on average performance and range values.
Using Well 298’s validation depth window (2223.12–2252.12 m) as an example,
Figure 13 presents a comparative analysis of stratigraphic distribution between true lithological data and model predictions, while simultaneously displaying smoothed confidence curves and segmental means. By cross-referencing lithological boundaries with confidence level variations, the study systematically evaluates the model’s classification consistency, boundary delineation capability, and uncertainty characteristics within this validation window, providing a basis for subsequent precision analysis and result interpretation.
Figure 14 illustrates the lithology prediction results of various models on this validation well. In terms of overall distribution, all models exhibit high consistency in identifying major lithological intervals; however, significant discrepancies arise in complex stratigraphy, such as sandstone-siltstone alternating zones and argillaceous interbeds. Traditional tree-based models (XGB, RF, LGB, CAT, ET) demonstrate stable overall performance but suffer from class confusion and inter-layer oscillation at lithological boundaries. Conversely, deep learning models (Transformer, CNN-LSTM) show greater sensitivity in capturing non-linear well log responses and local continuous variations. While the static ensemble model (ensemble_static) smooths out single-model errors to some extent, its fixed weights limit its adaptability to dynamic stratigraphic changes; in contrast, the dynamic ensemble model (ensemble_dynamic) displays clear adaptive advantages across multiple complex intervals.
Specific depth intervals reveal the distinct advantages of the dynamic approach. In the upper section (approximately 1470–1600 m), where lithology is predominantly sandstone, models such as CAT, XGB, and the static ensemble exhibit misclassifications of sandstone as siltstone. However, the dynamic ensemble model successfully provides predictions consistent with the true lithology, indicating that when model disagreement occurs, it identifies the advantages of Transformer and CNN-LSTM in handling spatial continuous features, thereby avoiding the over-smoothed predictions typical of tree models in sandstone layers. In the middle transition section (approximately 1600–1750 m), where the formation gradually transitions from siltstone to sandstone with frequent fluctuations in curve characteristics, the divergence among models is most significant. XGB, RF, and the static ensemble misclassify siltstone as sandstone in multiple locations, demonstrating a delayed response to interface feature changes. The dynamic ensemble, however, adjusts weights dynamically based on local well log feature variations, more accurately distinguishing between the two lithologies. This adaptive capability demonstrates that dynamic integration is not merely a simple averaging or weighted fusion but possesses a “local judgment” mechanism capable of flexibly selecting the most credible model output based on feature distribution. In the lower reservoir section (below approximately 1800 m), where the formation is primarily sandstone interspersed with thin siltstone layers, CAT, LGB, and the static ensemble still show misclassifications at several interface positions. 
In contrast, the dynamic ensemble responds more accurately to these thin layers, showing a tighter correspondence with the true lithology distribution, which suggests it not only corrects errors at complex lithological interfaces but also maintains overall prediction consistency in stable sections, avoiding the “over-averaging” problem common in traditional ensemble methods.
Overall, the dynamic ensemble model’s predictions across the entire well are closest to the ground truth labels, with significantly fewer misclassification zones, particularly demonstrating higher discriminative power in the middle transition zone. Compared to static integration, the dynamic ensemble achieves a dynamic balance among multiple models by introducing a weight adjustment mechanism based on feature response. When predictions diverge significantly, it automatically enhances the contribution of deep learning models, yielding classification results in complex formations that better align with geological laws. This method enhances the flexibility of deep models while maintaining the stability of tree models, effectively improving the precision and continuity of lithology identification. In summary, the validation results on Well 298 fully embody the “dynamic arbitration” characteristic of the dynamic ensemble model. It adaptively integrates the strengths of various models across different intervals, significantly reducing misclassification and overfitting issues, and exhibiting overall performance superior to single models and static ensembles. This mechanism provides a more robust and geologically rational technical path for automatic lithology identification and fine stratigraphic subdivision under complex reservoir conditions.
4.5.1. Pairwise Comparison Analysis of Model Classification Performance
To further verify the superiority of the dynamic ensemble model, a pairwise comparative analysis of the confusion matrices for each model across 3 classes (Facies 0, 1, 2) was conducted, with results shown in
Figure 15.
Figure 15 displays the confusion matrices for each model in the lithology prediction task for Well 298, characterizing the classification accuracy and confusion patterns of each model across the three lithology classes. Overall, deep learning models (such as CNN-LSTM and Transformer) exhibit a more concentrated main diagonal structure in the major lithology classes, indicating their superior advantages in extracting longitudinal continuous information and characterizing stratigraphic context features, thus performing more stably in thick layers or lithological sections with distinct features. In contrast, tree-based models (such as RF, ET, XGB) generally exhibit more off-diagonal elements in lithological transition zones or depth intervals with drastic well log curve changes, suggesting a higher sensitivity to noise and thin interbeds, which leads to the misclassification of adjacent lithologies. Furthermore, although the static ensemble model outperforms some single models overall, it inevitably inherits the biases of base models regarding minority lithologies (especially Facies 1), resulting in observable inter-class confusion within the confusion matrix. Conversely, the dynamic ensemble model, by adaptively adjusting the trust weights for base models at different depths, effectively suppresses misclassifications at various lithological boundaries, presenting a clearer and more compact main diagonal structure, which reflects the full utilization of model complementarity by the dynamic arbitration mechanism. In summary, this figure clearly reveals the prediction bias patterns resulting from the combined effects of data imbalance, stratigraphic structural complexity, and model structural differences, providing an important basis for understanding the applicability and limitations of each model and for further designing more robust ensemble strategies.
4.5.2. Per-Class Macro-F1 Distribution Box Plots for Different Models
Figure 16 presents macro-F1 bar charts for nine models, with per-class points and min/max error bars to show overall performance and class-wise dispersion. The Dynamic and Static Weighted Ensembles rank highest (around 68–70% macro-F1), and their three class points cluster relatively tightly, indicating strong overall accuracy with smaller inter-class gaps. Transformer, Random Forest, CatBoost, and CNN-BiLSTM sit in the mid range (about 62–67% macro-F1), where Facies 1 remains notably lower and drives much of the dispersion. Extra Trees is also mid-range, but its class spread is comparatively smaller (Facies 1 in the mid-40% range), suggesting better balance than XGBoost/LightGBM. LightGBM and XGBoost form the lowest tier (about 57–58% macro-F1) and exhibit the largest class-wise spread: Facies 2 is near 88–90%, while Facies 1 drops to roughly 5–8%, showing high sensitivity to class imbalance. Overall, ensemble strategies deliver the best macro-level accuracy and class balance, while boosting and some deep models show strong performance on certain classes but struggle on the minority class.
5. Conclusions
Addressing critical challenges in clastic formations—specifically the significant overlapping log responses, evident inter-well distribution shifts, and difficulties in identifying transition zones—this study proposes a dual-driven intelligent lithology identification framework composed of feature enhancement and dynamic ensemble learning. At the feature level, a high-dimensional feature system with improved geological consistency was constructed by integrating multi-scale statistical descriptors, micro-texture structures, and petrophysical indicators. This design effectively enhances the separability of sandstone, siltstone, and mudstone in the feature space, particularly under complex depositional conditions.
At the model level, a Dynamic Confidence-Weighted Ensemble (DCWE) mechanism was introduced, combining global performance priors with sample-level confidence posteriors. Unlike traditional static ensemble strategies, DCWE enables adaptive, sample-wise weight adjustment based on prediction uncertainty, thereby improving robustness to heterogeneous formations and local distribution shifts. Under strict leave-one-group-out (LOGO) testing conditions, the proposed framework demonstrated superior cross-well generalization and stability across multiple evaluation metrics. The maximum classification accuracy reached 84.58%, while Macro-F1, MCC, and Kappa scores consistently exceeded those of both individual models and static ensemble baselines. Moreover, the method maintained strong geological consistency in lithological transition zones and intervals exhibiting pronounced inter-well variability, highlighting its potential for practical well log interpretation tasks.
In addition to predictive performance, the sensitivity of depth-wise sequence models to training sample scale was carefully considered. Experimental results indicate that the proposed framework preserves stable performance under limited sample conditions due to its ensemble-driven uncertainty adaptation, while performance degradation under extreme data scarcity remains gradual rather than abrupt. This behavior suggests favorable robustness characteristics for real-world scenarios where labeled well data are often sparse or unevenly distributed.
From an engineering perspective, the proposed framework exhibits practical scalability and feasibility for field applications. Although the dynamic confidence evaluation introduces additional computation during inference, the overall complexity remains manageable due to the parallelizable structure of base learners and the absence of iterative post-processing. With appropriate model lightweighting and inference optimization, the framework can be adapted for near-real-time deployment. Nevertheless, practical constraints such as on-site computational resources, data transmission latency, and environmental limitations in field operations must be considered when applying the method in logging-while-drilling or real-time interpretation scenarios.
Despite its favorable performance, the effectiveness of DCWE remains influenced by the overall quality and diversity of the base learner pool, and local inconsistencies may still arise in intervals where all base models exhibit weak discriminative capability. Future research will focus on enhancing robustness and efficiency through several directions: (1) incorporating advanced uncertainty estimation techniques, Bayesian ensemble formulations, or Out-of-Distribution (OOD) detection to improve reliability in anomalous or highly uncertain sections; (2) extending the framework to multi-modal data fusion involving conventional logs, core measurements, and LWD data; and (3) integrating model compression, inference acceleration, and online updating mechanisms to further improve real-time usability. In addition, systematic evaluation of cross-basin and cross-block transferability will be essential to promote broader application across diverse geological settings.