Article

A Feature Engineering and XGBoost Framework for Prediction of TOC from Conventional Logs in the Dongying Depression, Bohai Bay Basin

1 School of Artificial Intelligence and Information Engineering, East China University of Technology, Nanchang 330013, China
2 Jiangxi Engineering Technology Research Center of Nuclear Geoscience Data Science and System, East China University of Technology, Nanchang 330013, China
* Author to whom correspondence should be addressed.
Geosciences 2026, 16(1), 44; https://doi.org/10.3390/geosciences16010044
Submission received: 26 November 2025 / Revised: 26 December 2025 / Accepted: 30 December 2025 / Published: 19 January 2026

Abstract

Total organic carbon (TOC) is a critical parameter for evaluating shale source rock quality and hydrocarbon generation potential. However, accurate TOC estimation from conventional well logs remains challenging, especially in data-limited geological settings. This study proposes an optimized XGBoost model for TOC prediction using conventional logging data from the Shahejie Formation in the Dongying Depression, Bohai Bay Basin, China. We systematically transform four standard logs—resistivity, acoustic transit time, density, and neutron porosity—into 165 candidate features through multi-scale smoothing, statistical derivation, interaction term creation, and spectral transformation. A two-stage feature selection process, combining univariate filtering with recursive feature elimination and further refined by principal component analysis, identifies ten optimal predictors. The model hyperparameters are optimized via Bayesian search within the Optuna framework to minimize cross-validation error. The optimized model achieves an R² of 0.9395, with a Mean Absolute Error (MAE) of 0.3392, a Root Mean Squared Error (RMSE) of 0.4259, and a Normalized Root Mean Squared Error (NRMSE) of 0.0604 on the test set, demonstrating excellent predictive accuracy and generalization capability. This study provides a reliable and interpretable methodology for TOC characterization, offering a valuable reference for source rock evaluation in analogous shale formations and sedimentary basins.

1. Introduction

The successful exploitation of shale resources in North America has stimulated global interest in unconventional hydrocarbon development, particularly within the terrestrial basins of China [1,2,3]. These shale oil and gas reservoirs are characterized by organic-rich source rocks in which hydrocarbons are generated and retained without significant secondary migration [4,5,6]. Total organic carbon (TOC) is a fundamental indicator of source-rock quality and hydrocarbon generation potential, and its accurate estimation along wellbores is essential for evaluating shale productivity and sweet-spot distribution [7,8,9,10,11].
Currently, two main approaches are used to determine TOC in petroleum exploration: direct laboratory measurement and indirect estimation [12,13]. The direct approach involves analyzing core samples in laboratories, providing accurate first-hand data for reservoir characterization. Although widely applied, this method is costly, time-consuming, and spatially limited [14,15,16]. Moreover, measurements are usually performed under ambient rather than in situ reservoir conditions, and their accuracy depends strongly on stress restoration procedures and instrumental precision [17,18].
The indirect approach relies on geophysical data, particularly conventional well logs, to provide continuous TOC estimation [19,20]. Traditional empirical methods such as the ΔlogR technique are valued for simplicity and low cost but depend heavily on subjective calibration and fixed empirical parameters, which limits their transferability across formations with different lithologies and maturities [21]. Classical machine-learning methods, including linear regression and support vector machines, are easy to implement and require relatively small datasets; however, they lack systematic optimization strategies, often leading to suboptimal parameter configurations and reduced prediction accuracy [22,23]. Ensemble methods such as Random Forest (RF) combine multiple decision trees to enhance prediction robustness and mitigate overfitting, yet their reliance on random feature selection may overlook significant nonlinear patterns in well-log data [24,25]. Deep-learning models such as the Convolutional Neural Network (CNN) can capture complex nonlinear relationships between well logs and TOC but typically demand large numbers of training samples. In many oilfields, the number of cored wells is limited, constraining the practical applicability of deep-learning approaches [26,27,28,29,30].
Machine learning has therefore become a key data-driven tool for petroleum studies. Among various algorithms, the Extreme Gradient Boosting (XGBoost) model has shown outstanding predictive capability and robustness [31,32]. The XGBoost algorithm is a scalable tree-boosting method developed from gradient-boosted decision trees [33]. Through improvements in sparsity awareness, a theoretically justified weighted-quantile sketch for approximate learning, and computational optimizations including cache handling, data sharding, and compression, XGBoost achieves higher prediction accuracy and stronger generalization performance than traditional gradient-boosting algorithms and tree-based ensembles like Random Forest, owing to its gradient-guided boosting mechanism and more sophisticated regularization techniques [34,35]. These advantages have led to successful applications in production forecasting, lithology identification, and reservoir characterization [36,37]. Nevertheless, XGBoost models constructed with default or suboptimal hyperparameters often show limited adaptability to specific datasets, highlighting the need for systematic optimization strategies [38,39,40].
Recent advances in automated optimization frameworks such as Optuna allow efficient exploration of hyperparameter spaces through Bayesian optimization [41]. Optuna’s tree-structured Parzen estimator (TPE) enables adaptive and efficient parameter tuning while maintaining low computational cost [42]. Although the XGBoost model and Bayesian optimization have been applied in various geoscientific contexts, their combined use for TOC estimation from conventional well logs—particularly when integrated with systematic feature engineering and multi-stage feature selection—has not yet been comprehensively explored [43].
To address this need, this study develops an optimized XGBoost model for TOC prediction from conventional well logs. A Random Forest model is implemented as a baseline to benchmark the performance and highlight the efficacy of the proposed gradient-boosting framework. The proposed approach integrates systematic feature engineering, multi-stage feature selection, and Bayesian hyperparameter optimization using the Optuna framework to enhance model performance [44]. In this workflow, four conventional logs are expanded into a large set of derived attributes through statistical, interaction, and spectral transformations to better capture the nonlinear relationships between petrophysical parameters and TOC. A data-driven selection strategy is then employed to identify the most informative predictors, while Bayesian optimization efficiently explores the hyperparameter space to achieve an optimal model configuration. The model is trained and validated using core–log paired samples from the Shahejie Formation in the Dongying Depression, Bohai Bay Basin, China. Overall, this study provides a practical and interpretable machine-learning approach for TOC estimation, offering methodological insights and a transferable reference for similar shale and sedimentary basins.

2. Background

2.1. Passey Method

The ΔlogR technique is a well-established method for estimating TOC from conventional well logs [7]. This approach is based on the observation that in organic-rich source rocks, the separation between resistivity and sonic porosity logs increases due to the presence of hydrocarbon-generating organic matter.
Mathematically, the ΔlogR method is defined as:
\[ \Delta \log R = \log_{10}\!\left(\frac{R}{R_0}\right) + K \times (\Delta t - \Delta t_0) \]
where $R$ and $\Delta t$ represent measured resistivity (Ω·m) and sonic transit time (μs/ft), respectively; $R_0$ and $\Delta t_0$ are their baseline values in water-saturated, organic-lean shale intervals; and $K$ is a scaling coefficient, typically set to 0.02. In practice, a porosity log (commonly sonic) is overlain on a resistivity curve using predefined scales, with the baseline defined by parallel trends in clean intervals. The vertical separation in organic-rich zones is then quantified and empirically calibrated to TOC using:
\[ \mathrm{TOC} = \Delta \log R \times 10^{\,(2.297 - 0.1688 \times \mathrm{LOM})} \]
where LOM represents the Level of Organic Maturation. Gamma ray logs often assist in identifying clean shale intervals and constraining baseline selection [16].
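For orientation, the overlay calculation above can be reproduced in a few lines of Python. The snippet below is a minimal sketch assuming resistivity in Ω·m, transit time in μs/ft, and user-picked baseline values; it is not the calibrated workflow used later in this study.

```python
import numpy as np

def passey_toc(rt, dt, rt_baseline, dt_baseline, lom, k=0.02):
    """DeltaLogR overlay estimate of TOC (Passey-style calculation).

    rt, dt       : resistivity (ohm*m) and sonic transit time (us/ft) arrays
    rt_baseline  : baseline resistivity picked in an organic-lean shale interval
    dt_baseline  : baseline transit time in the same interval
    lom          : level of organic maturation (scalar)
    k            : scaling coefficient, typically 0.02
    """
    rt = np.asarray(rt, dtype=float)
    dt = np.asarray(dt, dtype=float)
    dlog_r = np.log10(rt / rt_baseline) + k * (dt - dt_baseline)
    toc = dlog_r * 10.0 ** (2.297 - 0.1688 * lom)
    return dlog_r, toc

# Illustrative call with synthetic readings and an assumed baseline of 2 ohm*m / 90 us/ft
dlog_r, toc = passey_toc([2.0, 8.0, 20.0], [90.0, 105.0, 115.0],
                         rt_baseline=2.0, dt_baseline=90.0, lom=8.5)
```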
Despite practical advantages—such as straightforward implementation and the provision of continuous TOC profiles where core data are unavailable—the ΔlogR method possesses significant limitations. Its reliance on fixed scaling coefficients (K ≈ 0.02) may not be universally applicable across diverse geological settings [19,45]. Accuracy is highly sensitive to subjective baseline selection, and the method does not systematically integrate complementary information from other logs (e.g., density or neutron porosity), constraining its predictive performance and generalizability to new depositional environments [8,20].

2.2. Extreme Gradient Boosting

XGBoost is a scalable and highly efficient implementation of the gradient boosting framework, renowned for its computational speed and model performance [32]. As an ensemble learning method, it constructs a strong predictive model by sequentially integrating multiple weak learners, typically decision trees, with each new tree designed to correct the residuals from the previous ensemble.
The algorithm operates on the principle of gradient descent, where the model is optimized by minimizing a predefined loss function. The distinctive feature of XGBoost lies in its regularization-enhanced objective function, which incorporates both a loss term and a complexity control term. This formulation effectively balances model accuracy and generalization capability, significantly reducing the risk of overfitting [32,40]. The objective function at the t-th iteration can be expressed as:
\[ \mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{k=1}^{t} \Omega(f_k) \]
where $l$ denotes the differentiable loss function, $f_t$ represents the tree model at iteration $t$, and $\Omega$ is the regularization term that penalizes model complexity.
Hyperparameters in XGBoost are critical for controlling model complexity and preventing overfitting. Specifically, max_depth controls tree depth; learning_rate scales tree contributions; subsample and colsample_bytree govern the random sampling ratios for training instances and features, respectively. reg_alpha and reg_lambda correspond to L1 and L2 regularization terms on weights, while min_child_weight defines the minimum sum of instance weight needed in a child node [32,38]. The strategic configuration of these parameters enables optimal model performance across diverse datasets and problem domains.
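For illustration, the sketch below shows where each of these hyperparameters enters the scikit-learn-style XGBoost interface. The numerical values are placeholders chosen only to demonstrate the parameters' roles; they are not the tuned configuration reported later in this study.

```python
from xgboost import XGBRegressor

# Placeholder values for illustration only; see Section 4.3.3 for the tuned configuration.
model = XGBRegressor(
    n_estimators=300,       # number of boosted trees
    max_depth=4,            # maximum depth of each tree
    learning_rate=0.05,     # shrinkage applied to each tree's contribution
    subsample=0.8,          # fraction of training instances sampled per tree
    colsample_bytree=0.8,   # fraction of features sampled per tree
    reg_alpha=0.1,          # L1 regularization on leaf weights
    reg_lambda=5.0,         # L2 regularization on leaf weights
    min_child_weight=5,     # minimum sum of instance weight required in a child node
    objective="reg:squarederror",
    random_state=42,
)
```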

3. Geological Settings

This study focuses on the Dongying Depression, a significant sedimentary unit situated within the Jiyang Depression of the Bohai Bay Basin. This basin, formed on a Paleozoic continental basement, is a prominent Mesozoic–Cenozoic continental rift system on China’s eastern margin [46,47], and its architectural framework is defined by six primary depressions, including the Jiyang Depression. As a third-order negative tectonic unit, the Dongying Depression spans approximately 90 km in an east–west direction and 65 km north–south, covering a total area of about 5700 km² [48].
Structurally, the depression exhibits a complex semi-graben configuration, characterized by boundary normal faults, deep subsags, internal fault blocks, and transitional zones [49]. It is bounded by the Chenjiazhuang and Binxian uplifts to the north, the Gaoqing Uplift to the west, the Luxi Uplift to the south, and the Jiaodong Uplift to the east. East–west extensional tectonics have exerted a strong influence on the evolution of sedimentary systems in the region [46,50]. The depression experienced active rifting during the Late Mesozoic to Early Tertiary, followed by a phase of post-rift subsidence throughout the Cenozoic era. Internally, the depression can be subdivided into four subsags—Niuzhuang, Boxing, Minfeng, and Lijin—and three structural belts: the central anticlinal belt, the southern slope belt, and the northern slope belt. The Liye 1 well, analyzed in this study, is situated on the eastern slope of the Lijin subsag, within an open and structurally segmented belt of the Dongying Depression (Figure 1) [51,52,53,54,55,56].
The Cenozoic stratigraphic framework of the Dongying Depression consists of, in ascending order, the Kongdian (Ek), Shahejie (Es), Dongying (Ed), Guantao (Ng), Minghuazhen (Nm) formations, and Quaternary (Q) deposits (Figure 2) [46]. The Shahejie Formation is further divided into four members, namely Es4, Es3, Es2, and Es1 from base to top, with the Es3 and Es4 members recognized as the primary source rock intervals in the region [57,58].
Depositional environment analysis indicates that the Es3 member accumulated in a deep lacustrine setting under freshwater conditions, while the Es4 member was influenced by saline lake phases [59,60]. The combination of anoxic bottom waters, enhanced organic productivity, and reduced sedimentation rates promoted the development of substantial source rock thicknesses in these units. The lithological column shows that the basal Es3 primarily comprises oil shale, calcareous mudstone, and dark mudstone, while its middle-upper sections are marked by interbedded gray mudstone and sandstones. For the Es4 member, the lower part features red sandstone and mudstone with saline layers, and the upper part is characterized by gray mudstone, ocher mudstone, thin limestone, and oil shale [61]. Within this sequence, the lower Es3 (Es3L) hosts an approximately 100 m thick oil shale, representing a major source rock in the region [62].
Figure 1. (A) Schematic diagram depicting the geographic location of the Bohai Bay Basin. (B) Map showing the regional position of the Dongying Sag (modified after Hu et al. [54]). (C) Schematic illustrating the structural framework and sampling well distribution in the Dongying Sag (modified after Zeng et al. [55]).
Figure 2. Composite stratigraphic column of the Paleogene succession in the Dongying Sag (modified after Chen et al. [62]). The column illustrates the following units: Dongying Formation (Ed); Shahejie Formation (Es), including its members 1 through 4 (Es1, Es2, Es3, Es4) with their lower (L), middle (M), and upper (U) subdivisions; and Kongdian Formation (Ek) with members 1 and 2 (Ek1, Ek2).

4. Methodology

4.1. Samples

The dataset used in this paper was derived from the rigorously quality-controlled sample set established by Wang et al. [63]. As documented in their original work, to ensure high data quality, the original 146 core samples from the Es3L and Es4U members underwent a strict refinement process. First, systematic depth errors were corrected by correlating core gamma ray measurements with well logs. Subsequently, to eliminate potential anomalies caused by experimental errors or contamination, a statistical screening based on the ±3.0 standard deviation criterion was applied. Data points falling outside this range were identified as outliers and removed. This pre-processing, as performed by Wang et al., refined the initial 146 samples to the final set of 125 high-quality samples, which we adopted as the foundation for developing the XGBoost model in this paper.

4.2. Input Selection

The dataset utilized in this study, derived from the Liye 1 well [63], consisted of five available conventional logging curves for the analyzed interval: natural gamma ray (GR), resistivity (RT), acoustic transit time (AC), bulk density (DEN), and neutron porosity (CNL). These logs represent the full suite of primary organic-sensitive indicators provided in the source dataset.
The response of these key well logs provides diagnostic criteria for identifying organic-rich intervals [10]. Specifically, increases in TOC content amplify the physical contrast between the organic matter and the matrix, resulting in more pronounced log anomalies characterized by significantly higher RT and AC, as well as lower DEN. This mechanism enables effective source rock characterization and TOC estimation. Given that single-log responses are susceptible to mineralogical and environmental influences, a multi-parameter integration is crucial for obtaining robust TOC predictions.
Crossplot analysis combined with Pearson correlation coefficient (r) analysis was employed to evaluate the relationships between different logging parameters and TOC content [13,64]. The Pearson correlation coefficient is calculated as:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
where the subscript $i$ denotes the sample index, $x_i$ and $y_i$ represent the well-logging parameter and the core-measured TOC value of the $i$-th sample, $\bar{x}$ and $\bar{y}$ are their arithmetic means, and $n$ is the total number of samples.
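A minimal way to reproduce this statistic, assuming the paired log and core-TOC values are held in NumPy arrays, is shown below (the values are synthetic and purely illustrative).

```python
import numpy as np

ac = np.array([230.0, 245.0, 260.0, 270.0, 290.0])   # synthetic acoustic transit times (us/m)
toc = np.array([1.2, 1.8, 2.5, 3.0, 4.1])            # synthetic core-measured TOC (wt%)

r = np.corrcoef(ac, toc)[0, 1]   # Pearson correlation coefficient
```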
(1) Resistivity Logging (RT)
Mudstone intervals are characterized by low resistivity, associated with the high conductivity of their constituent matrix and pore fluids. However, the presence of non-conductive hydrocarbons (resistivity: 10⁵–10⁹ Ω·m) generates significant positive resistivity anomalies. This explains the moderate positive correlation between RT and TOC (Figure 3), with r = 0.39 for the training set and r = 0.37 for the test set, confirming that elevated resistivity is associated with higher organic content [7].
(2) Density Logging (DEN)
The bulk density of organic matter and hydrocarbons (1.1–1.4 g/cm³) is substantially lower than that of common rock-forming minerals (e.g., quartz: 2.65 g/cm³). Consequently, bulk density decreases as organic matter concentration increases. This robust inverse relationship is evidenced by the notable negative correlation between DEN and TOC (Figure 4), with r = −0.51 (training set) and r = −0.57 (test set) [13].
(3) Acoustic Logging (AC)
While acoustic transit time in mudstones normally decreases with compaction, the presence of organic matter (with high acoustic slowness ~ 524.9 μs/m) increases it. This principle accounts for the observed strong positive correlation between AC and TOC (Figure 5), with r = 0.48 (training set) and r = 0.59 (test set), indicating that higher transit times signify organic enrichment [7,13].
(4) Neutron Logging (CNL)
In principle, the hydrogen-richness of organic materials should lead to elevated neutron porosity readings. However, the observed correlation between CNL and TOC was weak to moderate, with considerable scatter in the data (Figure 6). The notable discrepancy between the training (r = 0.28) and test (r = 0.63) sets strongly suggests that the neutron log response is significantly influenced by other hydrogen-bearing components (e.g., clay-bound water) [8,13]. This variability indicates that CNL provides useful but non-unique information for TOC prediction.
(5) Gamma Ray Logging (GR)
The relationship between GR and TOC remains contentious. In this study, GR exhibited a negligible correlation with TOC, as shown by the random scatter in the crossplot (Figure 7) and extremely low r-values (0.04 for training; 0.19 for test) [7,16]. This indicates that gamma ray activity here is controlled by factors unrelated to organic matter (e.g., clay mineralogy, detrital minerals) [45]. It is important to note that this weak correlation is recognized as a specific characteristic of the lacustrine source rocks in the Shahejie Formation. Given this weak and inconsistent relationship, which was deemed likely to introduce noise and undermine model robustness, GR was consequently excluded from the input feature set.

4.3. Model Development Workflow

Considering the limited availability of well-log parameters (RT, DEN, CNL, and AC) and the complex nonlinear relationships between petrophysical properties and TOC, XGBoost was adopted as the core learning algorithm [32]. The model construction process encompasses three integrated components: feature engineering, feature selection, and hyperparameter optimization, aiming to achieve a balance between predictive accuracy and model interpretability. As illustrated in Figure 8, the overall framework begins with well log data input and preprocessing, and proceeds through a systematic feature engineering process that expands the four raw parameters into 165 candidate features [44]. Specifically, this expansion was achieved by applying to the raw logs a structured set of mathematical transformations, including multi-scale Gaussian smoothing, gradient calculation, statistical aggregation, and frequency-domain analysis. This procedure aimed to construct a comprehensive candidate feature pool capable of capturing the complex non-linear petrophysical responses and stratigraphic trends that raw logs alone cannot fully represent. Crucially, this expanded set serves primarily as a search space for the subsequent feature selection process, ensuring that only the most relevant predictors are retained to prevent overfitting given the limited dataset size. This is followed by a two-stage feature selection procedure utilizing univariate F-test filtering and Recursive Feature Elimination (RFE) to identify an optimal subset of 10 features, and automated hyperparameter tuning of the XGBoost model using Optuna, culminating in the training of the final model with the optimal configuration [41]. The following subsections detail each component of this process.

4.3.1. Feature Engineering

To address the limitation of having only four conventional well logs (RT, DEN, CNL, and AC) and to enable the model to capture complex nonlinear relationships associated with organic richness, a systematic feature engineering scheme was developed to expand the original parameters into 165 derived features (Figure 9). The purpose of generating these 165 features is to establish a comprehensive candidate pool. This strategy creates a broad search space for the subsequent feature selection process, allowing the model to distill effective predictors from high-dimensional data while ensuring statistical robustness despite the limited sample size [65].
Temporal features were generated through Gaussian smoothing with σ = 1.2 and multi-scale rolling window analysis:
\[ x_i^{\mathrm{smooth}} = \sum_{j} x_j \exp\!\left(-\frac{(i-j)^2}{2\sigma^2}\right) \]
where $x_j$ is the original log value at depth $j$, and $i$ is the target depth. Rolling statistics including mean, median, and quartiles (e.g., DEN_smooth, AC_smooth, CNL_smooth) were computed using window sizes of 3, 5, 7, and 10 samples to suppress noise while preserving formation boundaries.
The significance of this is twofold. First, regarding the window size, considering the standard logging sampling interval of 0.125 m, a 10-sample window corresponds to a vertical resolution of approximately 1.25 m. While this might smooth out high-frequency noise, it is physically appropriate for capturing the macroscopic background trends of thick source rock intervals. Second, the multi-scale rolling window analysis enables the model to simultaneously capture features of both thick beds (via larger windows) and thin interbeds (via smaller windows), thereby enhancing the model’s robustness and its adaptability to geological variations at different scales [66].
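A compact sketch of this step is given below, assuming the logs are held in a pandas DataFrame with depth-sorted columns RT, AC, DEN, and CNL. The window sizes and the σ = 1.2 smoothing follow the description above; the column-naming scheme mirrors feature names used later (e.g., AC_ma7, RT_median7, AC_q25_3), although implementation details of the original workflow may differ.

```python
import pandas as pd
from scipy.ndimage import gaussian_filter1d

def add_temporal_features(logs: pd.DataFrame, sigma: float = 1.2,
                          windows=(3, 5, 7, 10)) -> pd.DataFrame:
    """Gaussian-smoothed logs plus multi-scale rolling-window statistics."""
    out = logs.copy()
    for col in ("RT", "AC", "DEN", "CNL"):
        out[f"{col}_smooth"] = gaussian_filter1d(logs[col].to_numpy(dtype=float), sigma)
        for w in windows:
            roll = logs[col].rolling(window=w, center=True, min_periods=1)
            out[f"{col}_ma{w}"] = roll.mean()            # rolling mean, e.g. AC_ma7
            out[f"{col}_median{w}"] = roll.median()      # rolling median, e.g. RT_median7
            out[f"{col}_q25_{w}"] = roll.quantile(0.25)  # lower quartile, e.g. AC_q25_3
            out[f"{col}_q75_{w}"] = roll.quantile(0.75)  # upper quartile, e.g. AC_q75_7
    return out
```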
Dynamic features were derived to capture gradient strength and directional changes in petrophysical responses:
\[ \Delta x_i = x_i - x_{i-1} \]
\[ \Delta^2 x_i = \Delta x_i - \Delta x_{i-1} \]
where $\Delta x_i$ represents the rate of change between successive depths and $\Delta^2 x_i$ captures acceleration. Metrics such as RT_median7 (median resistivity over a 7-sample window) detect stratigraphic transitions.
Their significance lies in their high sensitivity to “inflection points” or “abrupt changes” in the logs. These gradient features effectively delineate lithological boundaries, sharp changes in petrophysical properties, or stratigraphic sequence boundaries, which are often critical locations for the development or termination of source rocks.
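Under the same DataFrame assumption as above, these first- and second-order differences can be added with pandas' diff; the _d1 and _d2 column names are hypothetical labels for illustration.

```python
import pandas as pd

def add_gradient_features(logs: pd.DataFrame) -> pd.DataFrame:
    """First differences (rate of change) and second differences (acceleration)."""
    out = logs.copy()
    for col in ("RT", "AC", "DEN", "CNL"):
        out[f"{col}_d1"] = logs[col].diff()          # delta x_i = x_i - x_{i-1}
        out[f"{col}_d2"] = logs[col].diff().diff()   # delta^2 x_i = delta x_i - delta x_{i-1}
    return out
```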
Nonlinear transforms including polynomial terms (squares, cubes, square roots) and transcendental functions (logarithmic) were applied to represent complex dependencies. Logarithmic transformation of acoustic transit time (AC_log) enhances sensitivity to weak variations and enables recognition of nonlinear petrophysical relationships.
The rationale is that the relationship between log responses and TOC is often inherently nonlinear. For instance, the logarithmic transformation of acoustic transit time can enhance the model’s sensitivity to subtle variations and better fit the nonlinear compaction trends, thus empowering the model to learn these complex underlying petrophysical relationships.
Statistical features characterize local heterogeneity through quartile-based attributes and skewness:
\[ \mathrm{IQR}_w = Q_{75,w} - Q_{25,w} \]
\[ \mathrm{Skew}_w = \frac{1}{w}\sum_{j=1}^{w}\left(\frac{x_j - \bar{x}}{\sigma_x}\right)^3 \]
where $Q_{75,w}$ and $Q_{25,w}$ denote the 75th and 25th percentiles, $\bar{x}$ is the window mean, and $\sigma_x$ is the window standard deviation. Attributes such as AC_q25_5 and AC_q25_3 quantify thin-layer variability in organic-rich facies.
These statistics are significant because they serve as powerful indicators of formation heterogeneity. For example, in organic-rich laminated shales, the local variability of log responses, as represented by the interquartile range (IQR), might be greater, and skewness can reflect distributional asymmetry, providing the model with crucial clues for identifying specific sedimentary microfacies.
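These window-wise statistics extend the rolling pattern shown earlier; the short sketch below computes the rolling IQR and skewness under the same column-name assumptions, with hypothetical output names.

```python
import pandas as pd

def add_local_statistics(logs: pd.DataFrame, windows=(3, 5, 7)) -> pd.DataFrame:
    """Rolling interquartile range and skewness as local-heterogeneity attributes."""
    out = logs.copy()
    for col in ("RT", "AC", "DEN", "CNL"):
        for w in windows:
            roll = logs[col].rolling(window=w, center=True, min_periods=2)
            out[f"{col}_iqr{w}"] = roll.quantile(0.75) - roll.quantile(0.25)
            out[f"{col}_skew{w}"] = logs[col].rolling(window=w, center=True).skew()
    return out
```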
Interaction features including cross-log ratios (DEN/AC) and multiplicative terms (RT times DEN, RT times DEN times AC) encode combined effects of density reduction, acoustic slowness increase, and resistivity variation characterizing organic-rich intervals.
Their importance stems from their ability to capture the synergistic effects of multiple petrophysical properties. For instance, source rocks typically exhibit a combination of low density from organic matter, high acoustic transit time from under-compaction, and high resistivity from hydrocarbons. While an anomaly in a single log may be ambiguous, a combined feature integrating all three provides a much stronger and more definitive indicator of organic-rich intervals.
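A few of these combined attributes, under the same column-name assumption, might be constructed as follows; the output names are illustrative.

```python
import pandas as pd

def add_interaction_features(logs: pd.DataFrame) -> pd.DataFrame:
    """Cross-log ratios and products that jointly flag organic-rich intervals."""
    out = logs.copy()
    out["DEN_over_AC"] = logs["DEN"] / logs["AC"]                 # density-to-slowness ratio
    out["RT_x_DEN"] = logs["RT"] * logs["DEN"]                    # resistivity x density
    out["RT_x_DEN_x_AC"] = logs["RT"] * logs["DEN"] * logs["AC"]  # three-way product
    return out
```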
Frequency features were extracted via Fast Fourier Transform:
\[ \mathrm{FFT}(x)_j = \sum_{k=0}^{n-1} x_k \exp\!\left(-\frac{2\pi i k j}{n}\right) \]
where $x_k$ is the log value at sample $k$, $n$ is the total number of samples, and $i$ is the imaginary unit. Maximum and mean FFT amplitudes reflect cyclic depositional patterns in lacustrine environments.
The value of these features is their ability to identify potential periodicity or cyclicity in sedimentation. In lacustrine environments, factors like climate cycles often lead to rhythmic depositional patterns, which manifest as specific frequency components in well logs. This provides the model with valuable sedimentological context.
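These spectral attributes can be sketched with NumPy's FFT, again assuming depth-sorted log columns. Only the maximum and mean amplitudes of the non-zero-frequency components are kept here, mirroring the description above; computing them over the whole interval (and broadcasting as constant columns) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def add_frequency_features(logs: pd.DataFrame) -> pd.DataFrame:
    """Per-log FFT amplitude summaries as proxies for depositional cyclicity."""
    out = logs.copy()
    for col in ("RT", "AC", "DEN", "CNL"):
        spectrum = np.abs(np.fft.rfft(logs[col].to_numpy(dtype=float)))
        amplitudes = spectrum[1:]                 # drop the zero-frequency (mean) term
        out[f"{col}_fft_max"] = amplitudes.max()  # broadcast as constant columns
        out[f"{col}_fft_mean"] = amplitudes.mean()
    return out
```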
Distance features standardize measurements and identify anomalous intervals using Z-scores:
\[ Z_i = \frac{x_i - \bar{x}}{s_x} \]
\[ D_i = \left| x_i - \bar{x} \right| \]
where $x_i$ is the log value at depth $i$, $\bar{x}$ is the column mean, and $s_x$ is the standard deviation.
The purpose of these features is not only to eliminate the influence of different log scales but also to measure the deviation of a data point from the global mean. This allows the model to effectively identify “anomalous” intervals that may correspond to high TOC values.
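A minimal version of these standardization-based attributes, under the same assumptions as the previous sketches:

```python
import pandas as pd

def add_distance_features(logs: pd.DataFrame) -> pd.DataFrame:
    """Z-scores and absolute deviations from each log's column mean."""
    out = logs.copy()
    for col in ("RT", "AC", "DEN", "CNL"):
        mean, std = logs[col].mean(), logs[col].std()
        out[f"{col}_z"] = (logs[col] - mean) / std     # standardized value
        out[f"{col}_dist"] = (logs[col] - mean).abs()  # distance from the column mean
    return out
```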
Overall, the systematic approach expanded the four original logs into 165 derived attributes providing comprehensive coverage of temporal patterns, gradients, nonlinear behavior, local variability, and cross-log interactions [66]. This extensive feature engineering lays a necessary foundation for the subsequent feature selection process.

4.3.2. Feature Selection

Given the high dimensionality of the engineered feature space comprising 165 variables, a two-stage selection pipeline was implemented to distill the most informative predictors [67].
In the first stage, univariate F-test filtering evaluated linear correlations with TOC, retaining the 60 most significant features. This preliminary screening effectively eliminated noisy variables and streamlined subsequent processing [68].
The second stage employed RFE with a Random Forest regressor, iteratively removing the least important features based on ensemble rankings. To determine the optimal number of retained features, principal component analysis (PCA) was subsequently applied to assess the cumulative explained variance and intrinsic dimensionality of the candidate feature subsets. As shown in Figure 10, the analysis revealed that 10 features captured the primary modes of variation while preserving over 91% of the cumulative variance, providing an effective trade-off between information retention and model parsimony. This final configuration ensured the preservation of meaningful multivariate and nonlinear relationships in the logging measurements while effectively mitigating overfitting risk.
This methodology led to substantial dimensionality reduction—from 165 to 10 features—while preserving essential geophysical characteristics, thus improving computational efficiency and robustness for subsequent model development.
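The two-stage pipeline can be approximated with scikit-learn as in the sketch below. The cutoffs (60 features after the F-test filter, 10 after RFE) follow the text, while the Random Forest settings and function signature are illustrative assumptions; PCA is used only as a diagnostic of retained variance, as described above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA

def select_features(X, y, feature_names, k_filter=60, k_final=10):
    """Stage 1: univariate F-test filter; Stage 2: RFE with a Random Forest ranker."""
    # Stage 1: keep the features with the strongest linear association to TOC
    filt = SelectKBest(score_func=f_regression, k=k_filter).fit(X, y)
    X_filt = filt.transform(X)
    names_filt = np.asarray(feature_names)[filt.get_support()]

    # Stage 2: recursively eliminate features using Random Forest importance rankings
    rfe = RFE(RandomForestRegressor(n_estimators=200, random_state=42),
              n_features_to_select=k_final).fit(X_filt, y)
    X_sel = rfe.transform(X_filt)
    names_sel = names_filt[rfe.get_support()]

    # Diagnostic: cumulative explained variance of the retained subset
    explained = PCA().fit(X_sel).explained_variance_ratio_.cumsum()
    return X_sel, names_sel, explained
```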

4.3.3. Hyperparameter Optimization

The hyperparameters of the XGBoost model were optimized using a Bayesian approach implemented with the Optuna framework and a TPE sampler [41,42]. The optimization process comprised 80 trials exploring a predefined search space: n_estimators (100–500), max_depth (3–10), learning_rate (0.01–0.3), subsample (0.6–1.0), colsample_bytree (0.6–1.0), reg_alpha (0–15), reg_lambda (0–15), and min_child_weight (1–15). The objective was to minimize the fivefold cross-validation root mean squared error (RMSE) on the full training set consisting of 100 samples and 10 selected features.
RMSE was selected as the optimization metric over R² due to its stronger penalty on large prediction errors, which is particularly critical for TOC estimation where outlier predictions can significantly impact reservoir evaluation decisions [69,70]. This metric choice ensures the model prioritizes reducing absolute prediction errors rather than merely maximizing variance explained.
Prior to optimization, all features were standardized using quantile transformation to a normal distribution to ensure stable convergence and comparable scales. The optimization identified an optimal configuration with n_estimators = 384, max_depth = 3, learning_rate = 0.035, subsample = 0.870, colsample_bytree = 0.739, reg_alpha = 0.002, reg_lambda = 13.011, and min_child_weight = 6, which achieved a minimum cross-validation RMSE of 0.8352 (Figure 11).
Using this optimal configuration, the final XGBoost model was trained on the training dataset.
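A condensed sketch of this search is shown below, assuming X_train and y_train hold the 100 quantile-transformed training samples with the 10 selected features; the search ranges mirror those listed above, while the seeds are illustrative.

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def objective(trial, X_train, y_train):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 15.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 15.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 15),
    }
    model = XGBRegressor(objective="reg:squarederror", random_state=42, **params)
    # Fivefold cross-validated RMSE (scikit-learn reports it as a negative score)
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(lambda t: objective(t, X_train, y_train), n_trials=80)
best_params = study.best_params
```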

4.3.4. Model Evaluation Metrics

To assess the predictive accuracy and robustness of the developed model, five quantitative metrics were defined and applied to both training and test sets. Models with higher R² and lower MAE, RMSE, and variance of absolute error indicate better predictive performance [70].
The coefficient of determination (R²), mean absolute error (MAE), and root mean squared error (RMSE) are defined as:
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]
\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
The variance of absolute error (σ²) is computed as:
\[ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (e_i - \bar{e})^2 \]
To facilitate comparison across different TOC value ranges, the normalized RMSE (NRMSE) was computed as:
\[ \mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}} \]
In the above equations, $y_i$ represents measured TOC values, $\hat{y}_i$ represents predicted TOC values, $\bar{y}$ is the mean of measured TOC values, $n$ is the number of samples, $e_i$ represents the absolute prediction error for each sample, $\bar{e}$ is the mean absolute error, and $y_{\max}$ and $y_{\min}$ represent the maximum and minimum measured TOC values in the dataset.
R² measures the proportion of variance explained by the model, ranging from 0 to 1. MAE provides a linear measure of average prediction error magnitude and is less sensitive to outliers than quadratic error measures. RMSE penalizes larger prediction errors more heavily, offering greater sensitivity to extreme deviations [69,70,71]. The variance of absolute error quantifies the consistency and stability of model predictions across the dataset—lower variance indicates more uniform prediction errors, suggesting greater model robustness and reliability. NRMSE is dimensionless and enables direct comparison across datasets with different TOC value ranges.
All metrics were computed for both training and test sets to comprehensively assess model accuracy and generalization capability.
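For completeness, all five metrics can be computed directly from the measured and predicted TOC vectors as in the sketch below; the function and key names are ours.

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """R2, MAE, RMSE, variance of absolute error, and NRMSE for TOC predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    abs_err = np.abs(residuals)

    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mae = abs_err.mean()
    rmse = np.sqrt(np.mean(residuals ** 2))
    var_abs_err = np.mean((abs_err - abs_err.mean()) ** 2)
    nrmse = rmse / (y_true.max() - y_true.min())
    return {"R2": r2, "MAE": mae, "RMSE": rmse, "VarAE": var_abs_err, "NRMSE": nrmse}
```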

5. Results and Discussion

5.1. Hyperparameter Optimization Convergence

The hyperparameter space was systematically explored through 80 trials using the Optuna framework with TPE sampling. The convergence behavior of the cross-validation RMSE during this process is illustrated in Figure 11. Characteristic fluctuations in RMSE values were observed as the algorithm probed diverse parameter configurations, reflecting the expected exploration–exploitation dynamics of the Bayesian optimization strategy.
The optimal parameter set was identified during the optimization process, achieving a minimum cross-validation RMSE of 0.8352, which corresponds to approximately 12% of the total TOC value range in this dataset. Such an error level indicates the model’s capability to deliver predictions sufficiently precise for practical TOC profiling.
This convergence pattern confirms the efficacy of the Bayesian optimization approach. The TPE sampler adaptively balanced exploration of uncharted hyperparameter regions with focused exploitation of promising combinations, efficiently guiding computational resources toward optimal solutions. Compared to conventional grid or random search methods, this adaptive strategy enables more rapid identification of optimal parameter sets. The systematic RMSE minimization ensured the final model was tuned for minimal prediction error while maintaining robust generalization capability, as evidenced in the subsequent model prediction performance.

5.2. Feature Importance Analysis

Feature importance analysis based on XGBoost’s intrinsic gain metric revealed hierarchical contributions of the 10 selected engineered features for TOC prediction (Figure 12). AC_smooth (Gaussian-smoothed acoustic log) dominated with approximately 43.9% of total feature importance, reflecting the fundamental petrophysical principle that organic-rich intervals exhibit elevated acoustic transit times due to the high acoustic slowness of kerogen (~525 μs/m), making acoustic response a primary indicator of organic matter abundance [7].
DEN_smooth (Gaussian-smoothed density log) ranked second with 32.7% importance, underscoring that organic-rich shale intervals characteristically exhibit lower matrix densities due to the presence of organic matter (density typically <1.2 g/cm³, compared to quartz at 2.65 g/cm³) [12,18]. Together, AC_smooth and DEN_smooth jointly account for approximately 76.6% of total model gain, demonstrating that multi-scale noise suppression via Gaussian smoothing is critical for reliable TOC estimation and constitutes the primary signal pathway in the learned predictive model.
The remaining eight predictors comprise a hierarchy of statistical aggregations and nonlinear transformations: AC_q25_3 (7.2%), RT_median7 (5.7%), RT_q25_5 (3.1%), RT_ma10 (1.9%), DEN_q25_7 (1.8%), AC_q75_7 (1.7%), AC_ma7 (1.3%), and AC_median3 (0.8%). The recurrent appearance of quartile-based attributes (AC_q25_3, RT_q25_5, DEN_q25_7, AC_q75_7, RT_median7) reflects the model’s reliance on distribution and central tendency characteristics within rolling windows, capturing thin-layer heterogeneity and subtle variations in organic matter distribution. Resistivity-derived features (RT_median7, RT_q25_5, RT_ma10) collectively contribute approximately 8.7%, providing complementary information on hydrocarbon saturation and pore fluid variations associated with organic-rich intervals [7,14]. The moving average feature (AC_ma7) and median-based aggregation (AC_median3) further demonstrate the value of multi-scale temporal smoothing in isolating meaningful geological signals from high-frequency noise.
Through systematic feature engineering that expanded the original four-log dataset into 165 candidate attributes, followed by two-stage feature selection refined via PCA, the model distilled 10 core predictors that capture the essential multi-scale patterns governing TOC variability. The overwhelming dominance of smoothed acoustic and density logs, supplemented by statistically aggregated rolling windows and resistivity attributes, underscores that effective TOC estimation in shale formations depends primarily on capturing noise-robust density and acoustic responses, with additional insights provided by resistivity-based constraints on pore fluid properties.

5.3. Model Performance

To systematically assess the predictive capability of machine learning and empirical approaches for TOC estimation, we developed and evaluated multiple prediction models using a standardized dataset (100 training and 25 test samples). The comparative analysis encompassed the empirical Passey method, RF, and two XGBoost variants (baseline and optimized), with a CNN model reported by Wang et al. [63] included for reference. All models were evaluated using identical metrics (R², NRMSE, and variance of absolute error) on the test set to ensure fair comparison.
As shown in Table 1, the optimized XGBoost model demonstrates superior predictive performance across all evaluation metrics compared to benchmark methods. The model achieved R² = 0.9395 and NRMSE = 0.0604 on the test set, with improvements of 127.3% over the Passey method, 32.8% over RF, 18.9% over baseline XGBoost, and 13.4% over CNN. Notably, the optimized XGBoost attained the lowest variance of absolute error at 0.109, substantially lower than competing methods: Passey method 0.634, RF 0.298, CNN 0.176, and baseline XGBoost 0.240. This indicates consistent prediction accuracy across the sample range.
The Passey method exhibited the most limited performance with R² = 0.4136 and NRMSE = 0.1881, as shown in Figure 13, attributable to reliance on universal empirical relationships that lack dataset-specific calibration. Random Forest achieved moderate performance with R² = 0.7075 and a notable generalization gap of 0.2057, indicative of overfitting. As shown in Figure 14A,B, the model demonstrated high training accuracy of 0.9132 but a substantial performance decline on the test set. The CNN model yielded intermediate results with R² = 0.8283 and NRMSE = 0.1010 on the test set, as presented in Figure 14C,D. Although outperforming Random Forest, its performance remained below the optimized XGBoost approach. The baseline XGBoost model developed without feature engineering attained R² = 0.7889 with a generalization gap of 0.1037, as depicted in Figure 14E,F.
The optimized XGBoost demonstrated superior generalization with a minimal gap of 0.0331, comparing train R² of 0.9726 to test R² of 0.9395, as shown in Figure 14G,H. This indicates effective model stability and robust performance across training and evaluation phases. On the test set, additional performance metrics include MAE of 0.3392 and RMSE of 0.4259, representing 46.3% and 46.4% reductions compared to the baseline XGBoost model, respectively. Also, the reduced prediction variance of 0.109 demonstrates that comprehensive feature engineering and hyperparameter optimization enhanced both prediction accuracy and consistency across different TOC value ranges. Such consistency is essential for reliable subsurface estimation in practical applications. Nevertheless, it is crucial to interpret these high-performance metrics within the context of the dataset size. While the minimal generalization gap implies that overfitting has been effectively mitigated, the evaluation on a relatively small independent test set could potentially limit the statistical certainty of the model’s transferability to broader geological settings. Therefore, these results primarily serve as a successful validation of the proposed feature engineering and optimization framework, while further verification on larger, multi-well datasets remains a necessary step for establishing universal applicability.

6. Conclusions

This study successfully developed and validated an optimized XGBoost model for predicting TOC from conventional well logs in the Shahejie Formation. The model achieved R² = 0.9395 and NRMSE = 0.0604 on an independent test set, demonstrating substantial improvement over established benchmarks including the empirical Passey method, Random Forest baseline, and a previously published CNN model. The minimal generalization gap of 0.0331 confirms model stability and effective mitigation of overfitting, attributable to integrated systematic feature engineering, rigorous feature selection, and Bayesian hyperparameter optimization.
However, the current evaluation is constrained by a limited independent test set of 25 samples, which restricts the statistical robustness of findings. Future work should prioritize external validation using larger, geologically diverse datasets to assess model robustness across different wells and formations. Additional investigation of stratigraphic constraints to refine predictions in complex lithological sequences and advanced ensemble techniques to enhance predictive consistency would further advance the framework.

Author Contributions

Conceptualization, G.Z.; methodology, Z.Z., G.Z. and F.D.; software, Z.Z. and F.D.; validation, P.D. and J.H.; writing—original draft preparation, Z.Z.; writing—review and editing, F.D. and P.D.; supervision, G.Z. and J.H.; funding acquisition, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Laboratory of Uranium Resources Exploration–Mining and Nuclear Remote Sensing, East China University of Technology (Grant No. 2024QZ-TD-10), under the project “Prospecting Information Extraction and Intelligent Mineralization Prediction for Sandstone-Type Uranium Deposits in the Southern Songliao Basin”, and by the Natural Science Foundation of Jiangxi Province (Grant No. 20253BAC260013). The APC was funded by the same project.

Data Availability Statement

All data and materials generated or analyzed during this study are included in this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jarvie, D.M.; Hill, R.J.; Ruble, T.E.; Pollastro, R.M. Unconventional shale-gas systems: The Mississippian Barnett Shale of north-central Texas as one model for thermogenic shale-gas generation. AAPG Bull. 2007, 91, 475–499. [Google Scholar] [CrossRef]
  2. Liu, C.; Wang, Z.; Guo, Z.; Hong, W.; Dun, C.; Zhang, X.; Li, B.; Wu, L. Enrichment and distribution of shale oil in the Cretaceous Qingshankou Formation, Songliao Basin, Northeast China. Mar. Pet. Geol. 2017, 86, 751–770. [Google Scholar] [CrossRef]
  3. Zou, C.; Dong, D.; Wang, S.; Li, J.; Li, X.; Wang, Y.; Li, D.; Cheng, K. Geological characteristics and resource potential of shale gas in China. Pet. Explor. Dev. 2010, 37, 641–653. [Google Scholar] [CrossRef]
  4. Peters, K.E.; Cassa, M.R. Applied source rock geochemistry. In The Petroleum System—From Source to Trap; Magoon, L.B., Dow, W.G., Eds.; American Association of Petroleum Geologists: Tulsa, OK, USA, 1994; pp. 93–120. [Google Scholar] [CrossRef]
  5. Gao, Z.; Bai, L.; Hu, Q.; Yang, Z.; Jiang, Z.; Wang, Z.; Xin, H.; Zhang, L.; Yang, A.; Jia, L.; et al. Shale oil migration across multiple scales: A review of characterization methods and different patterns. Earth Sci. Rev. 2024, 254, 104819. [Google Scholar] [CrossRef]
  6. Rudra, A.; Wood, J.M.; Biersteker, V.; Sanei, H. Oil migration from internal and external source rocks in an unconventional hybrid petroleum system, Montney Formation, western Canada. Int. J. Coal Geol. 2024, 285, 104473. [Google Scholar] [CrossRef]
  7. Passey, Q.R.; Creaney, S.; Kulla, J.B.; Moretti, F.J.; Stroud, J.D. A practical model for organic richness from porosity and resistivity logs. AAPG Bull. 1990, 74, 1777–1794. [Google Scholar] [CrossRef]
  8. Alshakhs, M.; Rezaee, R. A new method to estimate total organic carbon (TOC) content, an example from Goldwyer Shale Formation, the Canning Basin. Open Pet. Eng. J. 2017, 10, 118–133. [Google Scholar] [CrossRef]
  9. Elsaqqa, M.A.; El Din, M.Y.Z.; Afify, W. Unconventional shale gas sweet spot identification and characterization of the Middle Jurassic Upper Safa sediments, Amoun field, Shushan Basin, Western Desert, Egypt. J. Geol. Geophys. 2023, 12, 1103. [Google Scholar]
  10. Vergara, R.V. Well-log based TOC estimation using linear approximation methods. Geosci. Eng. 2020, 8, 116–130. [Google Scholar]
  11. Nyakilla, E.E.; Silingi, S.N.; Shen, C.; Jun, G.; Mulashani, A.K.; Chibura, P.E. Evaluation of source rock potentiality and prediction of total organic carbon using well log data and integrated methods of multivariate analysis, machine learning, and geochemical analysis. Nat. Resour. Res. 2022, 31, 619–641. [Google Scholar] [CrossRef]
  12. Tissot, B.P.; Welte, D.H. Petroleum Formation and Occurrence, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1984; 702p. [Google Scholar] [CrossRef]
  13. Lai, J.; Zhao, F.; Xia, Z.; Su, Y.; Zhang, C.; Tian, Y.; Wang, G.; Qin, Z. Well log prediction of total organic carbon: A comprehensive review. Earth Sci. Rev. 2024, 258, 104913. [Google Scholar] [CrossRef]
  14. Zhu, L.; Zhang, C.; Zhang, C.; Zhang, Z.; Zhou, X.; Liu, W.; Zhu, B. A new and reliable dual model- and data-driven TOC prediction concept: A TOC logging evaluation method using multiple overlapping methods integrated with semi-supervised deep learning. J. Pet. Sci. Eng. 2020, 188, 106944. [Google Scholar] [CrossRef]
  15. McCarthy, K.; Rojas, K.; Niemann, M.; Palmowski, D.; Peters, K.; Stankiewicz, A. Basic petroleum geochemistry for source rock evaluation. Oilfield Rev. 2011, 23, 32–43. [Google Scholar]
  16. Schmoker, J.W. Determination of organic-matter content of Appalachian Devonian shales from gamma-ray logs. AAPG Bull. 1981, 65, 1285–1298. [Google Scholar] [CrossRef]
  17. Cui, X.; Liu, J.; Sun, Z.; Wang, H. Rock mechanical properties of immature, organic-rich source rocks and their relationships to rock composition and lithofacies. Pet. Geosci. 2023, 29, petgeo2022-021. [Google Scholar] [CrossRef]
  18. Hunt, J. Petroleum Geochemistry and Geology, 2nd ed.; W.H. Freeman: San Francisco, CA, USA, 1995; 743p. [Google Scholar]
  19. Du, J.; Zhang, X.; Zhong, G.; Feng, C.; Guo, L.; Zhang, X.; Luo, W. Analysis on the optimization and application of well log identification methods for organic carbon content in source rocks of the tight oil—Illustrated by the example of the source rocks of Chang 7 member of Yanchang Formation in Ordos Basin. Prog. Geophys. 2016, 31, 2526–2533. [Google Scholar] [CrossRef]
  20. Zhao, P.; Ma, H.; Rasouli, V.; Liu, W.; Cai, J.; Huang, Z. An improved model for estimating the TOC in shale formations. Mar. Pet. Geol. 2017, 83, 174–183. [Google Scholar] [CrossRef]
  21. Polat, C.; Eren, T. Modification of ΔlogR method and nonlinear regression application for total organic carbon content estimation from well logs. Hittite J. Sci. Eng. 2021, 8, 161–169. [Google Scholar] [CrossRef]
  22. Feurer, M.; Hutter, F. Hyperparameter optimization. In Automated Machine Learning: Methods, Systems, Challenges; Hutter, F., Kotthoff, L., Vanschoren, J., Eds.; Springer: Cham, Switzerland, 2019; pp. 3–33. [Google Scholar] [CrossRef]
  23. Syarif, I.; Prugel-Bennett, A.; Wills, G. SVM parameter optimization using grid search and genetic algorithm to improve classification performance. TELKOMNIKA 2016, 14, 1502–1509. [Google Scholar] [CrossRef]
  24. Robnik-Šikonja, M. Improving random forests. In Proceedings of the 15th European Conference on Machine Learning (ECML 2004), Pisa, Italy, 20–24 September 2004; pp. 359–370. [Google Scholar] [CrossRef]
  25. Khan, S.; Liu, Z.; Lu, Z.; Hussain, W.; Ahmed, S.; Muhammad, M.; Umar, M.U. Comparative analysis of machine learning and empirical approaches for total organic carbon prediction in the J1d formation, Sichuan Basin, China. Phys. Fluids 2025, 37, 086644. [Google Scholar] [CrossRef]
  26. Mahmoud, A.A.A.; Elkatatny, S.; Mahmoud, M.; Abouelresh, M.; Abdulraheem, A.; Ali, A. Determination of the total organic carbon (TOC) based on conventional well logs using artificial neural network. Int. J. Coal Geol. 2017, 179, 72–80. [Google Scholar] [CrossRef]
  27. He, Y.; Zhang, Z.; Wang, X.; Zhao, Z.; Qiao, W. Estimating the total organic carbon in complex lithology from well logs based on convolutional neural networks. Front. Earth Sci. 2022, 10, 871561. [Google Scholar] [CrossRef]
  28. Goliatt, L.; Saporetti, C.M.; Pereira, E. Super learner approach to predict total organic carbon using stacking machine learning models based on well logs. Fuel 2023, 353, 128682. [Google Scholar] [CrossRef]
  29. Liu, Y.; Li, N.; Li, C.; Jiang, J.; Wu, X.; Liang, H.; Zhang, D.; Hu, X. Prediction of total organic carbon content in deep marine shale reservoirs based on a super hybrid machine learning model. Energy Fuels 2024, 38, 17483–17498. [Google Scholar] [CrossRef]
  30. Barham, A.; Ismail, M.S.; Hermana, M.; Padmanabhan, E.; Baashar, Y.; Sabir, O. Predicting the maturity and organic richness using artificial neural networks (ANNs): A case study of Montney Formation, NE British Columbia, Canada. Alexandria Eng. J. 2021, 60, 3253–3264. [Google Scholar] [CrossRef]
  31. Wu, Q.; Pang, H.; Zhang, B.; Jiang, F.; Wu, L.; Chen, J.; Ma, K.; Huo, X. Application of shale TOC prediction model using the XGBoost machine learning algorithm: A case study of the Qiongzhusi Formation in central Sichuan Basin. Carbonates Evaporites 2025, 40, 8. [Google Scholar] [CrossRef]
  32. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  33. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4768–4777. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf (accessed on 27 December 2025).
  34. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967. [Google Scholar] [CrossRef]
  35. Cheng, L.; Yang, Z.; Costa, F. Insights on source lithology and pressure-temperature conditions of basalt generation using machine learning. Earth Space Sci. 2024, 11, e2024EA003732. [Google Scholar] [CrossRef]
  36. Shuvo, M.A.I.; Hossain Joy, S.M. A data driven approach to assess the petrophysical parametric sensitivity for lithology identification based on ensemble learning. J. Appl. Geophys. 2024, 222, 105330. [Google Scholar] [CrossRef]
  37. Abe, J.; Adekanye, D. Advancing reservoir characterization: A comparative analysis of XG boost and ANN for accurate porosity prediction. J. Data Anal. 2024, 3, 47–60. [Google Scholar] [CrossRef]
  38. Putatunda, S.; Rama, K. A Comparative analysis of hyperopt as against other approaches for hyper-parameter optimization of XGBoost. In Proceedings of the 2018 International Conference on Signal Processing and Machine Learning, Shanghai, China, 28–30 November 2018; pp. 6–10. [Google Scholar] [CrossRef]
  39. Verma, V. Exploring key XGBoost hyperparameters: A study on optimal search spaces and practical recommendations for regression and classification. Int. J. All Res. Educ. Sci. Methods 2024, 12, 3259–3266. [Google Scholar] [CrossRef]
  40. Zhang, P.; Jia, Y.; Shang, Y. Research and application of XGBoost in imbalanced data. Int. J. Distrib. Sens. Netw. 2022, 18, 15501329221106935. [Google Scholar] [CrossRef]
  41. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar] [CrossRef]
  42. Watanabe, S. Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv 2023. [Google Scholar] [CrossRef]
  43. Liu, X.; Tian, Z.; Chen, C. Total organic carbon content prediction in lacustrine shale using extreme gradient boosting machine learning based on Bayesian optimization. Geofluids 2021, 2021, 6155663. [Google Scholar] [CrossRef]
  44. Yang, K.; Liu, L.; Wen, Y. The impact of Bayesian optimization on feature selection. Sci. Rep. 2024, 14, 3948. [Google Scholar] [CrossRef] [PubMed]
  45. Wang, X.; Ma, J.; Zhang, X.; Wang, Z.; Wang, F.; Wang, H.; Li, L. Prediction of total organic carbon content by a generalized ΔlogR method considering density factors: Illustrated by the example of deep continental source rocks in the southwestern part of the Bozhong sag. Prog. Geophys. 2020, 35, 1471–1480. [Google Scholar] [CrossRef]
  46. Wu, S.; Yu, Z.; Zhang, R.; Han, W.; Zou, D. Mesozoic–Cenozoic tectonic evolution of the Zhuanghai area, Bohai-Bay Basin, east China: The application of balanced cross-sections. J. Geophys. Eng. 2005, 2, 158–168. [Google Scholar] [CrossRef]
  47. Qi, J.; Yang, Q. Cenozoic structural deformation and dynamic processes of the Bohai Bay basin province, China. Mar. Pet. Geol. 2010, 27, 757–771. [Google Scholar] [CrossRef]
  48. Zhang, L.; Liu, Q.; Zhu, R.; Li, Z.; Lu, X. Source rocks in Mesozoic–Cenozoic continental rift basins, east China: A case from Dongying Depression, Bohai Bay Basin. Org. Geochem. 2009, 40, 229–242. [Google Scholar] [CrossRef]
  49. Allen, M.B.; Macdonald, D.I.M.; Zhao, X.; Vincent, S.J.; Brouet-Menzies, C. Early Cenozoic two-phase extension and late Cenozoic thermal subsidence and inversion of the Bohai Basin, northern China. Mar. Pet. Geol. 1997, 14, 951–972. [Google Scholar] [CrossRef]
  50. Zou, Y.; Sun, J.; Li, Z.; Xu, X.; Li, M.; Peng, P. Evaluating shale oil in the Dongying Depression, Bohai Bay Basin, China, using the oversaturation zone method. J. Pet. Sci. Eng. 2018, 161, 291–301. [Google Scholar] [CrossRef]
  51. Zhang, S.; Liu, H.; Wang, M.; Liu, X.; Liu, H.; Bao, Y.; Wang, W.; Li, R.; Luo, X.; Fang, Z. Shale pore characteristics of Shahejie Formation: Implication for pore evolution of shale oil reservoirs in Dongying sag, North China. Pet. Res. 2019, 4, 113–124. [Google Scholar] [CrossRef]
  52. Zhang, L.; Bao, Y.; Li, J.; Li, Z.; Zhu, R.; Zhang, J. Movability of lacustrine shale oil: A case study of Dongying Sag, Jiyang Depression, Bohai Bay Basin. Pet. Explor. Dev. 2014, 41, 703–711. [Google Scholar] [CrossRef]
  53. Guo, X.; Shi, Z.; Wang, Z.; Zhao, X.; Lu, J.; Zhang, Y. Geochemistry and mineralogy of Quaternary sediments in the northern Bohai Bay Basin, North China: Implications for provenance and climate change. Can. J. Earth Sci. 2019, 57, 396–406. [Google Scholar] [CrossRef]
  54. Hu, Q.; Zhang, Y.; Meng, X.; Zheng, L.; Xie, Z.; Li, M. Characterization of micro-nano pore networks in shale oil reservoirs of Paleogene Shahejie Formation in Dongying Sag of Bohai Bay Basin, East China. Pet. Explor. Dev. 2017, 44, 720–730. [Google Scholar] [CrossRef]
  55. Zeng, X.; Cai, J.; Dong, Z.; Bian, L.; Li, Y. Relationship between mineral and organic matter in shales: The case of Shahejie Formation, Dongying Sag, China. Minerals 2018, 8, 222. [Google Scholar] [CrossRef]
  56. Yang, Y.; Khan, D.; Qiu, L.; Du, Y.; Long, J.; Li, W.; Zafar, T.; Ali, F.; Shaikh, A. Microscopic reservoir characteristics of the lacustrine calcareous shale: An example from the Es4s shale of the Paleogene Shahejie Formation in Boxing Sag, Dongying Depression. ACS Omega 2022, 7, 36748–36761. [Google Scholar] [CrossRef]
  57. Fang, X.; Ma, C.; Qin, F.; An, T.; Liu, R.; Song, H.; Zhang, C.; Wang, T.; Gao, B.; Hao, P. The control of astronomical cycles on lacustrine mixed sedimentation and hydrocarbon occurrence: A case study of the Paleogene Shahejie Formation in the Dongying Sag, Bohai Bay Basin. Pet. Sci. 2025; in press. [Google Scholar] [CrossRef]
  58. Li, S.; Pang, X.; Li, M.; Jin, Z. Geochemistry of petroleum systems in the Niuzhuang South Slope of Bohai Bay Basin—Part 1: Source rock characterization. Org. Geochem. 2003, 34, 389–412. [Google Scholar] [CrossRef]
  59. Liu, Q.; Zeng, X.; Wang, X.; Cai, J. Lithofacies of mudstone and shale deposits of the Es3z-Es4s formation in Dongying sag and their depositional environment. Mar. Geol. Quat. Geol. 2017, 37, 147–156. [Google Scholar] [CrossRef]
  60. Yang, H.; Liu, C.; Wang, F.; Tang, G.; Li, G.; Zeng, X.; Wu, Y. Geochemical characteristics and environmental implications of source rocks of the Dongying Formation in southwest subsag of Bozhong Sag. Bull. Geol. Sci. Technol. 2023, 42, 339–349. [Google Scholar] [CrossRef]
  61. Li, C.; Wu, Y.; Ding, X.; Xie, X.; Luo, T.; Zhang, J.; Sun, Y.; Xia, C. Formation of carbonate laminae in shale and their impact on organic matter in Dongying depression. Sci. Rep. 2025, 15, 22093. [Google Scholar] [CrossRef]
  62. Chen, Z.; Jiang, W.; Zhang, L.; Zha, M. Organic matter, mineral composition, pore size, and gas sorption capacity of lacustrine mudstones: Implications for the shale oil and gas exploration in the Dongying depression, eastern China. AAPG Bull. 2018, 102, 1565–1600. [Google Scholar] [CrossRef]
  63. Wang, H.; Wu, W.; Chen, T.; Dong, X.; Wang, G. An improved neural network for TOC, S1 and S2 estimation based on conventional well logs. J. Pet. Sci. Eng. 2019, 176, 664–678. [Google Scholar] [CrossRef]
  64. Verma, S.; Zhao, T.; Marfurt, K.J.; Devegowda, D. Estimation of total organic carbon and brittleness volume. Interpretation 2016, 4, T373–T385. [Google Scholar] [CrossRef]
  65. Zheng, W.; Tian, F.; Di, Q.; Zhang, J.; Zhou, H.; Zhang, W.; Wang, Z. A “data-feature-policy” solution for multiscale geological–geophysical intelligent reservoir characterization. Second Int. Meet. Appl. Geosci. Energy 2022, 41, 3272–3276. [Google Scholar] [CrossRef]
  66. Peng, Z.; Cao, D.; Xu, H.; Zhu, D.; Wang, P. Multi-scale information fusion of well-logging data-based deep learning 3D modeling method. J. Geophys. Eng. 2025, 22, 1671–1686. [Google Scholar] [CrossRef]
  67. Macêdo, B.S.; Wayo, D.D.K.; Campos, D.; Santis, R.B.; Martinho, A.D.; Yaseen, Z.M.; Saporetti, C.M.; Goliatt, L. Data-driven total organic carbon prediction using feature selection methods incorporated in an automated machine learning framework. Sci. Rep. 2025, 15, 10658. [Google Scholar] [CrossRef] [PubMed]
  68. Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517. [Google Scholar] [CrossRef] [PubMed]
  69. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
  70. Botchkarev, A. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. Interdiscip. J. Inf. Knowl. Manag. 2019, 14, 45–76. [Google Scholar] [CrossRef]
  71. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
Figure 3. Crossplot of resistivity (RT) versus total organic carbon (TOC) for training (blue dots) and test (red dots) datasets, showing a moderate positive correlation (with r = 0.39 for training set; r = 0.37 for test set).
Figure 4. Crossplot of bulk density (DEN) versus total organic carbon (TOC) for training (blue dots) and test (red dots) datasets, showing a moderate negative correlation (with r = −0.51 for training set; r = −0.57 for test set).
Figure 5. Crossplot of acoustic transit time (AC) versus total organic carbon (TOC) for training (blue dots) and test (red dots) datasets, showing a moderate to strong positive correlation (with r = 0.48 for training set; r = 0.59 for test set).
Figure 6. Crossplot of neutron porosity (CNL) versus total organic carbon (TOC) for training (blue dots) and test (red dots) datasets, showing a weak to moderate positive correlation with considerable scatter (with r = 0.28 for training set and r = 0.63 for test set).
Figure 7. Crossplot of natural gamma ray (GR) versus total organic carbon (TOC) for training (blue dots) and test (red dots) datasets, showing negligible correlation (with r = 0.04 for training set and r = 0.19 for test set), which justifies its exclusion from model inputs.
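For readers reproducing the univariate screening behind Figures 3–7, the correlation coefficients can be obtained with a short Python sketch such as the one below. The DataFrame layout and the column names RT, DEN, AC, CNL, GR, and TOC are assumptions for illustration, not the exact data handling used in this study.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical column names for the conventional logs and the core-measured TOC.
LOG_COLUMNS = ["RT", "DEN", "AC", "CNL", "GR"]

def correlation_screening(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Compute Pearson r between each log and TOC on the training and test sets."""
    rows = []
    for col in LOG_COLUMNS:
        r_train, _ = pearsonr(train_df[col], train_df["TOC"])
        r_test, _ = pearsonr(test_df[col], test_df["TOC"])
        rows.append({"log": col, "r_train": round(r_train, 2), "r_test": round(r_test, 2)})
    return pd.DataFrame(rows)
```

Logs with negligible correlation on both subsets (e.g., GR in Figure 7) can then be dropped before feature engineering.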
Figure 8. Overall framework of the XGBoost-based TOC prediction model.
Figure 9. Feature engineering workflow, transforming 4 original parameters into 165 derived features across multiple categories.
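The feature expansion summarized in Figure 9 can be illustrated with pandas. The window lengths, transforms, and feature names below are illustrative assumptions; the full pipeline described in the text (including spectral transforms) yields 165 candidate features.

```python
import numpy as np
import pandas as pd
from itertools import combinations

BASE_LOGS = ["RT", "AC", "DEN", "CNL"]   # the four input curves
WINDOWS = [3, 5, 11]                     # illustrative multi-scale window lengths (samples)

def expand_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive smoothed, statistical, and interaction features from the base logs."""
    out = df[BASE_LOGS].copy()
    for col in BASE_LOGS:
        for w in WINDOWS:
            roll = df[col].rolling(window=w, center=True, min_periods=1)
            out[f"{col}_mean_{w}"] = roll.mean()   # multi-scale smoothing
            out[f"{col}_std_{w}"] = roll.std()     # local variability
        out[f"{col}_grad"] = np.gradient(df[col].to_numpy())      # depth derivative
        out[f"log_{col}"] = np.log10(df[col].clip(lower=1e-6))    # nonlinear transform
    for a, b in combinations(BASE_LOGS, 2):                       # pairwise interaction terms
        out[f"{a}_x_{b}"] = df[a] * df[b]
        out[f"{a}_over_{b}"] = df[a] / df[b].replace(0, np.nan)
    return out
```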
Figure 10. Principal component analysis of the 40 selected features.
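The principal component analysis in Figure 10 follows the standard scikit-learn pattern sketched below (standardize, fit, inspect explained variance); this is a generic illustration rather than the authors' exact configuration.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def run_pca(X, n_components=None):
    """Standardize the selected features and report explained variance per component."""
    X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(X_scaled)
    return scores, pca.explained_variance_ratio_
```

For example, run_pca(X_selected) could be applied to the matrix of the 40 features retained after filtering, with the cumulative explained-variance curve guiding the final reduction.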
Figure 11. Convergence of the Optuna hyperparameter optimization showing cross-validation RMSE versus trial number. The optimal configuration (Trial 78, RMSE = 0.8352) is highlighted.
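A minimal sketch of the kind of Optuna objective summarized in Figure 11 is given below. The search-space bounds, the 5-fold split, and the trial count are illustrative assumptions, not the exact configuration used in this study; only the general pattern (a study minimizing cross-validated RMSE of an XGBoost regressor) follows the text.

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def make_objective(X, y):
    def objective(trial):
        # Illustrative search space; actual ranges should be tuned to the dataset.
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        }
        model = xgb.XGBRegressor(objective="reg:squarederror", **params)
        # scikit-learn returns negative RMSE, so flip the sign to minimize RMSE.
        scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
        return -scores.mean()
    return objective

# Usage (assuming X_train and y_train hold the selected features and core TOC):
# study = optuna.create_study(direction="minimize")
# study.optimize(make_objective(X_train, y_train), n_trials=100)
```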
Figure 12. Feature importance bar chart displaying the predictors ranked by XGBoost gain.
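The gain-based ranking shown in Figure 12 can be read directly from a fitted booster. The sketch below assumes a trained xgboost.XGBRegressor named model and a list of feature names; both are hypothetical placeholders.

```python
import pandas as pd

def gain_importance(model, feature_names):
    """Rank features by total gain from a fitted xgboost.XGBRegressor."""
    booster = model.get_booster()
    gains = booster.get_score(importance_type="gain")   # dict keyed by feature name
    ranked = (pd.Series(gains)
                .reindex(feature_names)                  # keep a stable feature order
                .fillna(0.0)                             # features never used get zero gain
                .sort_values(ascending=False))
    return ranked
```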
Figure 13. Crossplot of measured versus Passey ΔlogR-predicted TOC values for the combined dataset, showing limited correlation (R2 = 0.4537). The diagonal line represents the ideal 1:1 correspondence.
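The ΔlogR baseline evaluated in Figure 13 follows Passey's classical formulation; in the sketch below, the baseline resistivity and sonic values and the level of organic metamorphism (LOM) are interval-specific calibration inputs supplied by the interpreter, and the coefficient 0.02 assumes sonic transit time in µs/ft.

```python
import numpy as np

def passey_delta_log_r(rt, ac, rt_baseline, ac_baseline, lom):
    """Passey ΔlogR TOC estimate from resistivity (ohm·m) and sonic (µs/ft) logs."""
    delta_log_r = np.log10(rt / rt_baseline) + 0.02 * (ac - ac_baseline)
    toc = delta_log_r * 10 ** (2.297 - 0.1688 * lom)
    return toc
```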
Figure 14. Performance comparison of machine learning models on Training (left column) and Test (right column) sets: (A,B) RF; (C,D) CNN [63]; (E,F) Baseline XGBoost; and (G,H) Optimized XGBoost. The diagonal line represents the ideal 1:1 fit.
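The measured-versus-predicted crossplots of Figure 14 follow a standard layout; a minimal matplotlib sketch, with hypothetical y_true and y_pred arrays, is shown below.

```python
import matplotlib.pyplot as plt
import numpy as np

def crossplot(y_true, y_pred, title="Optimized XGBoost (test set)"):
    """Scatter measured vs. predicted TOC with the ideal 1:1 reference line."""
    lims = [min(np.min(y_true), np.min(y_pred)), max(np.max(y_true), np.max(y_pred))]
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(y_true, y_pred, s=12, alpha=0.7)
    ax.plot(lims, lims, "k--", linewidth=1)   # 1:1 correspondence
    ax.set_xlabel("Measured TOC (wt%)")
    ax.set_ylabel("Predicted TOC (wt%)")
    ax.set_title(title)
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    return fig
```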
Table 1. Performance comparison of TOC prediction methods.

Method                    Test R2    NRMSE     σ²
Passey Method             0.4136     0.1881    0.634
RF                        0.7075     0.1329    0.298
CNN (Wang et al. [63])    0.8283     0.1010    0.176
XGBoost (Baseline)        0.7889     0.1129    0.240
XGBoost (Optimized)       0.9395     0.0604    0.109
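The metrics reported in Table 1 can be computed as in the sketch below; normalizing RMSE by the range of the measured TOC is one common convention and is stated here as an assumption rather than necessarily the definition used in the paper.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """R2, MAE, RMSE, and range-normalized RMSE for a set of TOC predictions."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": rmse,
        # Normalization by the observed TOC range is an assumption here; other
        # conventions (mean or standard deviation) are also in use.
        "NRMSE": rmse / (np.max(y_true) - np.min(y_true)),
    }
```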