Prediction of Water Saturation in Lacustrine Tight Reservoirs of Chang8 in the Central Ordos Basin—Based on the PSO+LightGBM Model

Li, Lusheng; Tan, Chengqian; Xiao, Ling; Wei, Qinlian; Dang, Hailong; Kang, Shengsong; Liang, Weiwei; Dong, Xu; Liu, Ling

doi:10.3390/pr14010042


by Lusheng Li ¹,², Chengqian Tan ¹,*, Ling Xiao ¹,³, Qinlian Wei ¹,³,*, Hailong Dang ², Shengsong Kang ², Weiwei Liang ², Xu Dong ⁴ and Ling Liu ²

¹ School of Petroleum Engineering, Xi’an Shiyou University, Xi’an 710065, China

² Research Institute of Shaanxi Yanchang Petroleum (Group) Co., Ltd., Xi’an 710065, China

³ Shaanxi Key Laboratory of Petroleum Accumulation Geology, Xi’an Shiyou University, Xi’an 710065, China

⁴ Zhidan Oil Production Plant, Yanchang Oilfield Co., Ltd., Yan’an 717500, China

* Authors to whom correspondence should be addressed.

Processes 2026, 14(1), 42; https://doi.org/10.3390/pr14010042

Submission received: 2 December 2025 / Revised: 16 December 2025 / Accepted: 18 December 2025 / Published: 22 December 2025

(This article belongs to the Topic Petroleum and Gas Engineering, 2nd edition)


Abstract

Tight reservoirs are highly heterogeneous, with complex pore-throat structures and varying fluid occurrences. In such reservoirs the relationship described by the Archie equation becomes nonlinear, making traditional logging interpretation methods unreliable for predicting water saturation. This paper combines Pearson correlation coefficient-based feature selection with particle swarm optimization (PSO) of model hyperparameters to compare the accuracy of three machine learning algorithms (XGBoost, LightGBM, and MERF) in predicting water saturation in tight reservoirs. It also applies the SHAP value algorithm to provide a visual and interpretive analysis of the PSO–LightGBM model. The results indicate that the coefficient of determination (R²), root mean square error (RMSE), and water saturation accuracy (Swa) of the PSO–LightGBM model are 0.955, 3.087, and 91.8% on the training set and 0.89, 5.132, and 85.2% on the test set, respectively. Interpretability analysis using SHAP values reveals that five normalized logging parameters (SP, M2R3, DEN, DT, and CN) are the most influential features in the water saturation prediction model. In application examples involving water saturation prediction across eight sections of tight reservoirs in the study area, the PSO–LightGBM, PSO–XGBoost, and PSO–MERF models achieved Swa of 88.9%, 80.3%, and 87.8%, respectively. The results demonstrate that the PSO–LightGBM model is a reliable and efficient method for predicting water saturation, with significant practical potential.

Keywords:

tight reservoir; prediction of water saturation; LightGBM; SHAP; Ordos Basin

1. Introduction

With the ongoing intensification of global oil and gas exploration and development, the reserves and production growth of high-quality conventional oil and gas have been sluggish, making unconventional tight oil and gas a key area of exploration and development today [1,2,3,4]. China is endowed with abundant tight oil resources, offering broad exploration and development prospects, with a total resource volume of approximately 243.04 billion tons [5]. The efficient development of tight oil is crucial for alleviating the contradiction between China’s energy supply and demand.

Water saturation, defined as the ratio of water volume in a reservoir to the total pore volume, is a key parameter in tight reservoir evaluation and reserve estimation, and has a significant influence on reservoir productivity [6,7]. At present, there are mainly two methods for determining water saturation: (1) Direct laboratory measurement method: Water saturation is directly obtained through methods such as distillation extraction, drying and weighing, and nuclear magnetic resonance (NMR) in the laboratory using pressure-maintained coring. This method is currently the most accurate one for determining water saturation. (2) Model-based interpretation methods using well log data: An appropriate interpretation model is established by integrating geophysical logging data with experimental measurements, primarily through techniques such as multiple regression, linear fitting, and empirical formulae. Due to their theoretical simplicity and ease of implementation, these logging interpretation methods have been extensively applied in reservoir evaluation [8,9].

The lacustrine tight oil reservoirs in the Chang8 interval of the central Ordos Basin represent a vital target sequence for tight oil exploration in China [10]. However, sedimentation in the lacustrine delta front environment results in severe reservoir heterogeneity, the development of micro- to nano-scale pore throats, and poor oil-water differentiation. Consequently, traditional logging interpretation models, such as the Archie equation, the Doll model, and the Poupon–Leveaux equation, cannot accurately calculate water saturation in such formations.

Machine learning algorithms possess unique advantages in handling nonlinear relationships. With the rapid advancement of data-driven machine learning technologies, their application in predicting reservoir parameters—based on core analysis data and well log data—has become increasingly widespread in practical production settings [11,12,13,14,15].

In the study of Mohammed Gad et al. [16], an artificial neural network (ANN) model was applied to predict water saturation in a Greek oilfield, achieving 90% accuracy. Sadegh Baziar et al. [17] employed support vector machines (SVMs), multi-layer perceptron neural networks (MLPs), decision trees, and tree-enhancement methods to estimate water saturation in Mesaverde tight-gas sandstone intervals in the Uinta Basin. The results showed that the support vector machine model outperformed other models on the dataset. Andrian Sutiadi et al. [18] utilized an artificial neural network (ANN) model to predict porosity and water saturation in the South Structure of Field X, achieving correlation coefficients of 0.93 (training set) and 0.82 (test set). Amer A. Shehata et al. [19] combined gradient boosting machines (GBM), distributed random forests (DRF), generalized linear models (GLM), and deep neural networks (DNN) to estimate the porosity and permeability of the Late Cretaceous reservoirs in the Gulf of Suez, Egypt. Harpreet Singh et al. [20] predicted gas hydrate saturation using a diverse set of machine learning algorithms.

However, several critical research gaps persist:

(1): Traditional water saturation determination methods (laboratory measurements and log-based models) fail to meet the requirements of full-area and full-interval coverage and high-precision prediction for lacustrine tight reservoirs with strong heterogeneity and micro-nano pore throats (e.g., Chang8 of the Ordos Basin).
(2): Existing machine learning-based water saturation prediction studies lack integration of hyperparameter optimization algorithms (e.g., PSO), leading to suboptimal computational efficiency and vulnerability to overfitting.
(3): Most machine learning models for reservoir parameter prediction are opaque ‘black-box’ models without interpretability analysis, making it difficult for geological researchers to validate and accept the prediction results.
(4): Few studies have targeted the specific geological characteristics of the Chang 8 lacustrine tight reservoirs in the central Ordos Basin, resulting in a lack of tailored high-precision prediction models for this key exploration target.

Particle Swarm Optimization (PSO) can reduce the computational cost of complex nonlinear optimization problems and improve efficiency [21]. In this study, PSO is used to optimize the hyperparameters of LightGBM, XGBoost, and MERF, addressing the overfitting and slow convergence commonly encountered when training models on tight reservoir data [22].

To bridge these gaps, this study is motivated by the following objectives:

(1): To address the limitations of traditional methods, leverage the inherent advantages of machine learning in handling complex nonlinear relationships between geological/engineering parameters and water saturation, and develop a high-precision prediction model suitable for lacustrine tight reservoirs.
(2): To improve model performance by introducing PSO for hyperparameter optimization of LightGBM, XGBoost, and MERF, thereby solving overfitting and slow convergence issues in tight reservoir data training.
(3): To enhance the credibility and geological acceptability of the model by integrating the SHAP value method for visual interpretability analysis of the optimized model.
(4): To provide reliable technical support for the efficient exploration and development of tight oil resources in the Chang 8 Member of the central Ordos Basin, and alleviate China’s energy supply-demand contradiction.

2. Geological Setting

The study area is situated in the central part of the Yishan Slope within the Ordos Basin, which is the second-largest Mesozoic basin in China [23] (Figure 1). The regional structure is gentle, with a slope gradient of 7–10 m per kilometer. Within the Chang8 member, low-amplitude nose-like uplift structures are locally developed and are attributed to uneven differential compaction, serving as key zones for tight oil accumulation.

The Chang8 member belongs to a lacustrine delta-front sedimentary system, where the sandbody thickness ranges from 10 to 25 m, and the mud-to-sand ratio is relatively high [24,25]. As a typical component of lacustrine delta depositional environments, this front subfacies is characterized by frequent alternations of sandstone and mudstone, which directly contributes to the elevated mud content observed. The primary rock type is feldspathic sandstone, with a minor amount of lithic sandstone. Specifically, the feldspar content is 52.5%, quartz content is 25.9%, and lithic content is 7.9%. Petrographic analysis reveals that the sandstone is predominantly fine-grained clastic sandstone (90.6%), with good sorting and subangular–subrounded roundness. The support mode is mainly grain-supported, and the cementation types are primarily film-type and pore-type.

The pore types are dominated by primary pores and secondary pores, with an average surface porosity of 4.8%. The pore-throat combination is mainly of the medium-pore and fine-to-micro throat type, with an average pore-throat radius of 57.1 μm. Core analysis reveals that porosity ranges from 0.35% to 13.46%, with permeability ranging from 0.03 to 0.38 × 10⁻³ μm², indicating a typical tight reservoir.

3. Experiments and Methods (Shaanxi, Xi’an, China)

3.1. Workflow

The experimental hardware environment was a custom-built desktop computer equipped with an Intel Core i9-12900K processor and an NVIDIA GeForce GTX 1660 SUPER graphics card, running the Windows 11 Pro 64-bit operating system. On the software side, the model was implemented and trained using Python 3.12.0 within the Visual Studio Code development environment.

The workflow of the machine learning algorithm is illustrated in Figure 2. (1) Sealed coring data and logging curves were collated, the core analysis data were depth-calibrated, and preprocessing steps, including outlier elimination and normalization, were performed to establish the sample dataset. (2) Feature parameters were screened via correlation analysis, and the dataset was subsequently divided into training and testing sets. (3) To improve water saturation prediction accuracy, the hyperparameters of the three models were optimized iteratively with PSO, and the resulting optimal hyperparameters were used to initialize each model. (4) The LightGBM algorithm, which demonstrated the best generalization capability, was selected for final interpretation via SHAP values and for practical application.

3.2. Data Source and Analysis

A total of 319 sealed core samples were collected from the Chang8 member in the Yizheng-Wubu area of the central Northern Yishan Slope, Ordos Basin, and used for analysis in this study.

Thirteen types of logging curves were included, namely caliper (CAL), natural gamma ray (GR), natural gamma ray spectroscopy (uranium component), spontaneous potential (SP), acoustic travel time (DT), bulk density (DEN), compensated neutron porosity (CN), photoelectric absorption cross-section index (PE), and array induction resistivity (with logarithms taken for M2R2–M2RX). The vertical resolution of these curves is 0.125 m. Each sample forms a 14-dimensional vector, with the 13 logging curves serving as feature values and the core-analyzed water saturation (SW) as the label.

Since the core analysis data were derived from different wells, to eliminate systematic errors between logging instruments, depth corrections were applied to the cored intervals based on the cored lithology and the logging curves. Subsequently, the logging curve was standardized using the histogram method in Gxplorer x64 2024. On this basis, the Z-score method was applied for normalization to ensure the unbiasedness of all machine learning models [26].

Z' = \frac{Z - μ}{σ} .

(1)

where Z' is the Z-normalized value, and μ and σ are the mean and standard deviation of the original dataset, respectively.
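Equation (1) can be illustrated with a minimal sketch. The density readings below are hypothetical values chosen for illustration, and the use of the population standard deviation (`ddof=0`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def zscore(curve):
    """Apply Equation (1): subtract the mean and divide by the standard deviation."""
    curve = np.asarray(curve, dtype=float)
    mu = curve.mean()
    sigma = curve.std()  # population standard deviation (ddof=0), an assumption
    if sigma == 0.0:
        raise ValueError("constant curve: standard deviation is zero")
    return (curve - mu) / sigma

# Hypothetical bulk-density readings in g/cm^3, for illustration only
den = np.array([2.44, 2.48, 2.51, 2.55, 2.46])
den_norm = zscore(den)
```

After this transformation every curve has zero mean and unit standard deviation, so curves recorded in different units (for example g/cm³ and μs/m) enter the models on a comparable scale.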

The Pearson Correlation Coefficient is a statistical index used to estimate the strength and direction of the linear relationship between two variables [27]. It can rapidly identify correlations with target variables from a large volume of geological and engineering parameters, making it well-suited for preliminary feature screening. Its value ranges from −1 to 1, where 1 indicates a perfect positive linear relationship, −1 indicates a perfect negative linear relationship, and 0 indicates no correlation. Its calculation formula is [26]:

r = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}} \sqrt{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}} .

(2)

where $x_{i}$ and $y_{i}$ represent the individual data points of the two variables (logging parameters and water saturation, respectively); $\bar{x}$ and $\bar{y}$ denote their respective means; and n is the number of observations.

The results show that the lithological parameters caliper (CAL), photoelectric absorption cross-section index (PE), thorium (TH), and spontaneous potential (SP) have significant negative correlations with water saturation (SW), with correlation coefficients ranging from −0.29 to −0.51. In contrast, the correlation between SW and the natural gamma ray (GR) curve is weak, with a correlation coefficient of only 0.09 (Figure 3).

Among the porosity parameters, neutron porosity (CN) and bulk density (DEN) have relatively strong impacts on water saturation (SW), with correlation coefficients of −0.5 and 0.25, respectively. In contrast, the correlation between SW and acoustic travel time (DT) is weak. Water saturation shows a negative correlation with the array induction resistivity curves, with the strongest correlation with M2R3 (r = −0.38).

Regarding the correlations between the feature parameters themselves: GR is positively correlated with CN; the photoelectric absorption cross-section index (PE) shows significant positive correlations with TH, CN, and M2R3–M2RX; SP is positively correlated with CAL, CN, and PE, but shows a low correlation with the array induction curves.

Based on the correlation analysis between logging parameters and water saturation, feature parameter screening was conducted. Eight logging curves—CAL, CN, DEN, DT, M2R3, PE, SP, and TH—which have a relatively strong impact on water saturation, were selected for water saturation prediction.
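The screening step can be sketched as follows. This is an illustrative implementation of Equation (2), not the authors' code: the function names, the synthetic curves, and the cut-off of 0.25 are all assumptions, because the text does not state the threshold that separated the eight retained curves from the rest.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, written term by term as in Equation (2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())))

def screen_features(features, target, threshold):
    """Keep the curves whose absolute correlation with the target reaches the threshold."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) >= threshold]

# Synthetic data, for illustration only
sw = np.linspace(26.0, 84.0, 50)                 # stand-in for core water saturation (%)
features = {
    "linear_curve": -2.0 * sw + 3.0,             # perfectly negatively correlated with sw
    "symmetric_curve": (sw - sw.mean()) ** 2,    # no linear relationship with sw
}
selected = screen_features(features, sw, threshold=0.25)  # hypothetical cut-off
```

The second synthetic curve depends strongly on the target but has a correlation coefficient of zero, which shows why Pearson screening is best treated as a preliminary filter: it only detects linear relationships.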

3.3. XGBoost Algorithm

The XGBoost (Extreme Gradient Boosting) algorithm significantly enhances the accuracy and generalization performance of traditional gradient boosting methods by incorporating techniques such as second-order derivatives and regularization [28]. First proposed by Chen in 2016, it has been widely applied in fields such as data mining and recommendation systems. The XGBoost algorithm comprises a loss function and a regularization term, with its objective function defined as [29]:

L (Φ) = \sum_{i} l (y_{i}, {\hat{y}}_{i}) + \sum_{k} Ω (f_{k}) .

(3)

{\hat{y}}_{i} = \sum_{k = 1}^{K} f_{k} (x_{i}) = {\hat{y}}_{i}^{t - 1} + f_{t} (x_{i}) .

(4)

where ${\hat{y}}_{i}$ is the predicted value for the i-th sample $x_{i}$; $l (y_{i}, {\hat{y}}_{i})$ is the loss function; and $Ω (f_{k})$ is the regularization component, with $\sum_{k} Ω (f_{k})$ denoting the total complexity of the K trees. The regularization term primarily reduces the model’s complexity, thereby preventing overfitting and enhancing robustness.

Ω (f) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} ω_{j}^{2} .

(5)

where T is the number of leaf nodes, $ω_{j}$ is the weight of the j-th leaf node, γ is the complexity cost of adding a new leaf node, and λ is the L2 regularization coefficient applied to the leaf weights. Unlike traditional gradient boosting algorithms, XGBoost employs a second-order Taylor expansion to approximate its loss function:

L^{t} = \sum_{i = 1}^{n} [l (y_{i}, {\hat{y}}_{i}^{t - 1}) + g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})] + \sum_{k} Ω (f_{k}) .

(6)

where $l (y_{i}, {\hat{y}}_{i}^{t - 1})$ is a constant at iteration t; $g_{i} = \partial_{{\hat{y}}_{i}^{t - 1}} l (y_{i}, {\hat{y}}_{i}^{t - 1})$ and $h_{i} = \partial^{2}_{{\hat{y}}_{i}^{t - 1}} l (y_{i}, {\hat{y}}_{i}^{t - 1})$ are the first- and second-order derivatives of the loss function with respect to the previous prediction; and $\sum_{k} Ω (f_{k}) = Ω (f_{t}) + constant$. After removing the constant terms, the loss function can be simplified as:

L^{t} = \sum_{i = 1}^{n} [g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} {f_{t}}^{2} (x_{i})] + Ω (f_{t}) .

(7)

Define the set of samples assigned to the j-th leaf node as I_j, let $G_{j} = \sum_{i \in I_{j}} g_{i}$ and $H_{j} = \sum_{i \in I_{j}} h_{i}$, and substitute them, together with Equation (5), into Equation (7); then

L^{t} = \sum_{j = 1}^{T} [G_{j} ω_{j} + \frac{1}{2} (H_{j} + λ) ω_{j}^{2}] + γ T .

(8)

Setting the derivative of Equation (8) with respect to $ω_{j}$ to zero gives the optimal leaf weight $ω_{j}^{*} = - \frac{G_{j}}{H_{j} + λ}$; substituting it back, Equation (8) can be written as:

{\tilde{L}}^{(t)} = - \frac{1}{2} \sum_{j = 1}^{T} (\frac{G_{j}^{2}}{H_{j} + λ}) + γ T .

(9)
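As a worked illustration of Equations (8) and (9), the sketch below evaluates the optimal weight and objective value of a single leaf (T = 1). It assumes the squared-error loss l = ½(y − ŷ)², for which g_i = ŷ_i − y_i and h_i = 1; the sample values, λ = 1, and γ = 0 are hypothetical, and the function names are not part of the XGBoost library.

```python
import numpy as np

def squared_error_grad_hess(y, y_prev):
    """Derivatives of 0.5 * (y - y_hat)^2 with respect to y_hat, evaluated at y_prev."""
    g = y_prev - y
    h = np.ones_like(y)
    return g, h

def leaf_objective(w, G, H, lam, gamma):
    """One-leaf term of Equation (8): G*w + 0.5*(H + lambda)*w^2 + gamma."""
    return G * w + 0.5 * (H + lam) * w ** 2 + gamma

def optimal_leaf(G, H, lam, gamma):
    """Optimal leaf weight and the matching one-leaf value of Equation (9)."""
    w_star = -G / (H + lam)
    score = -0.5 * G ** 2 / (H + lam) + gamma
    return w_star, score

# Hypothetical measured saturations (%) and previous-round predictions
y = np.array([50.0, 54.0, 46.0])
y_prev = np.full(3, 40.0)

g, h = squared_error_grad_hess(y, y_prev)
G, H = g.sum(), h.sum()                          # G = -30, H = 3
w_star, score = optimal_leaf(G, H, lam=1.0, gamma=0.0)
```

Here the unregularized correction would be the mean residual of 10, while λ = 1 shrinks the leaf weight to 7.5, which is how the regularization term limits the step taken by each new tree.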

3.4. LightGBM

Like XGBoost, LightGBM is a gradient boosting decision tree framework; it introduces the histogram algorithm and the leaf-wise growth strategy, which significantly reduce memory usage and computational complexity [30]. LightGBM constructs decision trees sequentially, with each new tree fitted to correct the prediction residuals of the trees built before it. Instead of the traditional level-wise approach, in each iteration it selects the leaf node with the highest split gain among all current leaf nodes for expansion. For a given number of leaves, this yields the largest reduction in the loss function, which can improve training efficiency and accuracy. Compared with XGBoost, LightGBM's higher computational efficiency can significantly improve training speed on large-scale datasets (such as high-dimensional, large-sample well logging data). Therefore, it has distinct advantages in reservoir inversion.

3.5. Mixed Effects Random Forest Algorithm (MERF)

The MERF algorithm is an ensemble learning algorithm, whose core lies in constructing multiple decision trees to perform classification or regression tasks through random sampling of data and features [31,32]. The MERF combines the advantages of regression forests with the capability to model hierarchical dependencies, effectively accounting for the differences between individuals and groups in the data. Its mathematical expression is:

y_{i} = X_{i} β + Z_{i} b_{i} + f (X_{i}) + ε_{i} .

(10)

where $y_{i}$ represents the measured water saturation of the i-th sample, $X_{i}$ denotes the fixed-effect covariates, β is the fixed-effect coefficient, $Z_{i}$ stands for the random-effect covariates, $b_{i}$ is the random-effect coefficient of the i-th group, $f (X_{i})$ is the random forest (RF) function based on input features and the target variable, and $ε_{i}$ is the residual.

3.6. Particle Swarm Optimization Algorithm (PSO)

PSO is a heuristic global optimization algorithm that simulates the foraging behavior of animal groups. By incorporating individual and social learning, the algorithm iteratively updates particle positions in the search space, guiding the swarm toward convergence at the global optimum [33]. The algorithm first conducts a random search, and then iteratively updates its optimal position by adjusting the velocity based on the best local and global positions. Its mathematical expression is as follows:

ω_{i} (t + 1) = w \cdot ω_{i} (t) + c_{1} \cdot \mathrm{rand} () \cdot (p_{i} - x_{i} (t)) + c_{2} \cdot \mathrm{rand} () \cdot (g (t) - x_{i} (t)) .

(11)

x_{i} (t + 1) = x_{i} (t) + ω_{i} (t + 1) .

(12)

In the equations, $x_{i} (t)$ represents the position of the i-th particle at the t-th iteration, $ω_{i} (t)$ denotes the velocity of the i-th particle at the t-th iteration, $p_{i}$ indicates the individual best position of the i-th particle, $g (t)$ represents the global best position at iteration t, $w$ is the inertia weight, $c_{1}$ and $c_{2}$ represent the cognitive and social learning coefficients, respectively, and $\mathrm{rand} ()$ denotes random numbers drawn between 0 and 1.
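The update rules in Equations (11) and (12) can be sketched as below. This is a minimal illustration rather than the configuration used in the study: the swarm size, iteration count, coefficient values, and bound clipping are assumptions, and a simple quadratic function stands in for the objective, which in the hyperparameter search would be a validation error of the model being tuned. The code writes the velocity as `v` to avoid confusion with the inertia weight `w`.

```python
import numpy as np

def pso_minimize(objective, lower, upper, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` inside box bounds with a basic particle swarm."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    x = rng.uniform(lower, upper, size=(n_particles, dim))  # random initial positions
    v = np.zeros((n_particles, dim))                        # initial velocities
    p_best = x.copy()                                       # individual best positions
    p_best_val = np.array([objective(p) for p in x])
    best = int(p_best_val.argmin())
    g_best, g_best_val = p_best[best].copy(), float(p_best_val[best])

    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # Equation (11)
        x = np.clip(x + v, lower, upper)  # Equation (12); clipping to bounds is an addition

        vals = np.array([objective(p) for p in x])
        improved = vals < p_best_val
        p_best[improved] = x[improved]
        p_best_val[improved] = vals[improved]

        best = int(p_best_val.argmin())
        if p_best_val[best] < g_best_val:
            g_best, g_best_val = p_best[best].copy(), float(p_best_val[best])

    return g_best, g_best_val

def toy_objective(p):
    """Stand-in objective with its minimum at (1, -2)."""
    return float(np.sum((p - np.array([1.0, -2.0])) ** 2))

best_position, best_value = pso_minimize(toy_objective, lower=[-5.0, -5.0], upper=[5.0, 5.0])
```

For hyperparameter tuning, each dimension of a particle would correspond to one hyperparameter in Table 1, with integer-valued parameters rounded before the model is trained.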

3.7. SHAP Algorithm

SHAP (SHapley Additive exPlanations) is a game theory-based method for interpreting model outputs. It explains a model’s predictions by quantifying the marginal contribution of each feature to the model output and averaging these marginal contributions over different feature orderings, thereby providing interpretability for the “black-box” models of machine learning [34]. After the water saturation model is constructed with a machine learning algorithm, SHAP is applied to interpret it: SHAP quantifies the degree of influence and contribution of the different well logging curves to the water saturation prediction model, and further evaluates the relationship between well logging parameters and the model output, as well as the range over which each parameter exerts its influence. Its calculation formula is as follows [35,36]:

φ_{i} = \sum_{S \subseteq F \setminus \{x_{i}\}} \frac{|S|! (N - |S| - 1)!}{N!} (ν (S \cup \{x_{i}\}) - ν (S)) .

(13)

where $φ_{i}$ is the SHAP value of feature i; F is the set of all input features and N is their total number; S is a subset of the features used in the model that does not contain feature i; $x_{i}$ represents the sample data of feature i; $ν (S)$ denotes the prediction obtained with feature subset S; and $ν (S \cup \{x_{i}\}) - ν (S)$ represents the marginal contribution of feature $x_{i}$ to subset S.
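Equation (13) can be checked on a small example by enumerating every subset directly. The value function below is hypothetical (three features with additive contributions plus one interaction) and merely stands in for a model prediction; practical SHAP implementations for tree models use faster tree-specific algorithms instead of this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: weighted marginal contributions over all subsets S without feature i."""
    phi = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # |S|! (N - |S| - 1)! / N!  from Equation (13)
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi.append(total)
    return phi

def toy_value(subset):
    """Hypothetical value function: additive effects plus an interaction of features 0 and 1."""
    base = {0: 4.0, 1: 2.0, 2: -1.0}
    value = sum(base[j] for j in subset)
    if 0 in subset and 1 in subset:
        value += 2.0
    return value

phi = shapley_values(toy_value, 3)
```

The interaction term of 2.0 is shared equally between features 0 and 1, and the three values sum to the difference between the prediction with all features and the prediction with none, which is the additivity property that makes the SHAP plots interpretable.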

3.8. Hyperparameter Optimization and Evaluation Metrics

To enhance the predictive performance of the models, the Particle Swarm Optimization (PSO) algorithm was applied to search automatically for the hyperparameters of the three models. The core parameters of each model and their value ranges are presented in Table 1. By efficiently searching the parameter space, PSO can determine a well-performing hyperparameter combination without manual tuning, thereby improving the prediction accuracy and generalization ability of the machine learning models [36]. To evaluate model stability, five-fold cross-validation was adopted in the experiments, and the optimal hyperparameter configuration of each model was determined from the results of each round. To allow comparison with the correlation analysis between variables, this study selected the root mean square error (RMSE), coefficient of determination (R²), and water saturation accuracy (Swa) as the evaluation metrics for model performance, aiming to assess both the magnitude of prediction errors and the goodness of fit of the models.

RMSE (Root Mean Square Error) measures the magnitude of the error between predicted and actual values. R² (Coefficient of Determination) characterizes how closely the predicted values agree with the actual values [37], and Swa represents the proportion of predictions that fall within the accepted error tolerance. RMSE is a loss-type metric, whereas R² and Swa are gain-type metrics: the smaller the RMSE and the larger the R² and Swa, the better the model’s fit and the more accurate its predictions. The definitions of these evaluation metrics are as follows:

RMSE = \sqrt{\frac{1}{m} \sum_{i = 1}^{m} (y_{i} - {\hat{y}}_{i})^{2}} .

(14)

R^{2} = 1 - \frac{\sum_{i = 1}^{m} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 1}^{m} (y_{i} - \bar{y})^{2}} .

(15)

S_{wa} = \frac{k}{m} \times 100 \% .

(16)

where m is the number of samples; $y_{i}$ and ${\hat{y}}_{i}$ are the measured value and model-predicted value of water saturation for the i-th sample, respectively; $\bar{y}$ is the average of the measured water saturation values; and k is the number of samples for which the absolute error of the predicted water saturation is less than 5%.
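The three metrics in Equations (14) to (16) can be computed as in the sketch below. It assumes that water saturation is expressed in percent, so that the 5% tolerance means five saturation percentage points; the sample values are hypothetical.

```python
import numpy as np

def evaluate(y_true, y_pred, tol=5.0):
    """Return RMSE, R^2, and S_wa (share of samples with absolute error below `tol`, in %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))                                        # Equation (14)
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))      # Equation (15)
    swa = float(100.0 * np.mean(np.abs(err) < tol))                                 # Equation (16)
    return rmse, r2, swa

# Hypothetical measured and predicted water saturations (%)
y_true = np.array([40.0, 50.0, 60.0, 70.0])
y_pred = np.array([42.0, 49.0, 66.0, 70.0])
rmse, r2, swa = evaluate(y_true, y_pred)
```

In this example the squared errors sum to 41 and the total sum of squares is 500, giving RMSE ≈ 3.20 and R² = 0.918; three of the four predictions fall within the tolerance, so S_wa = 75%.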

4. Results and Discussion

4.1. Analysis of Core Water Saturation

Sealed coring results indicate that water saturation in this area ranges from 26.3% to 83.8%, with an average of 50.8%. Water saturation is mainly distributed in the 35%~55% range, accounting for 48%, and the peak value occurs in the 35%~45% range (Figure 4). This indicates that the reservoir of the Chang 8 Member in this area is characterized by oil-water coexistence.

4.2. Model Prediction Results

To comprehensively and fairly compare the advantages of various models, multiple algorithm sets were designed for comparative validation, and the PSO algorithm was used to optimize hyperparameters for each model. The datasets for all algorithms were randomly split into training and testing sets in a 7:3 ratio. The predictions of water saturation from the XGBoost, LightGBM, and MERF models showed a good fit with the measured water saturation from sealed core samples, with all R² values exceeding 0.82.
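The random 7:3 split and the five-fold cross-validation mentioned in Section 3.8 can be sketched with index arrays. Whether the folds were drawn from the training portion only is an assumption made here, as are the random seed and the function names.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

def kfold_indices(indices, n_folds=5):
    """Yield (train, validation) index pairs; each fold serves as validation once."""
    folds = np.array_split(indices, n_folds)
    for k in range(n_folds):
        validation = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, validation

# 319 sealed core samples, as reported in Section 3.2
train_idx, test_idx = split_indices(319)
cv_pairs = list(kfold_indices(train_idx))
```

With 319 samples this gives 223 training and 96 testing samples. Keeping the test indices out of the cross-validation loop means the hyperparameters are never tuned on data later used to report test-set accuracy.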

On the training set, the PSO+MERF algorithm achieved the best fit, with an R² of 0.972, an RMSE of 2.460, and a Swa of 94.9% (Figure 5a, Table 2). This performance was followed by the PSO+LightGBM algorithm, which attained an R² of 0.946, an RMSE of 3.367, and a Swa of 91.8% (Figure 6a, Table 2). The PSO+XGBoost algorithm exhibited the poorest performance, with an R² of 0.927, an RMSE of 3.933, and a Swa of 89.4% (Figure 7a). On the test set, the PSO+LightGBM algorithm demonstrated the strongest generalization, achieving an R² of 0.89, an RMSE of 5.114, and a Swa of 85.2% (Figure 6a,b; Table 2). The PSO+MERF and PSO+XGBoost algorithms yielded R² values of 0.845 (Figure 5b) and 0.840 (Figure 7b), RMSE values of 5.906 (Figure 5b) and 6.058 (Figure 7b), and Swa values of 83.3% and 81.5% (Table 2), respectively.

The MERF model captures the overall trends and group-specific variations in the data through its fixed effects and random effects, respectively. However, the random grouping variables used in this study may not align with the actual hierarchical structure of the data, causing the random effects component to overfit noise in the training set and thereby weakening the model’s generalization capability. In contrast, LightGBM does not rely on grouping assumptions, making it more robust with the current dataset.

In terms of computational efficiency, the training time of the LightGBM algorithm is 0.034 s, which is 0.026 s shorter than that of XGBoost and 2.15 s shorter than that of MERF. LightGBM and XGBoost have significantly higher computational efficiency than MERF due to their mechanism of iteratively adding weak learners.

The violin plot, which combines features of box plots and kernel density plots, provides insights into the distribution, variability, and probability density of the data. The comparative violin plots of the water saturation models (Figure 8) reveal that the medians (white dots) of the PSO+LightGBM and PSO+MERF models align almost perfectly with the median of the core analysis data. Their IQR (the 25th–75th percentiles, represented by the thicker black line) is also close to that of the core analysis data, indicating the strong ability of the two algorithms to capture central tendency and data dispersion. In contrast, the PSO+XGBoost algorithm shows significant discrepancies in its median, IQR, and 5th–95th percentiles (thin black lines) compared to the core analysis data.

In summary, although the PSO+LightGBM algorithm’s fit on the training set was slightly lower than that of the PSO+MERF algorithm, its predictive performance on the test set was significantly better than the other two algorithms, demonstrating superior generalization and robustness to unseen data. Therefore, the PSO+LightGBM algorithm represents the most accurate and consistent model for predicting water saturation.

4.3. Interpretability Evaluation

The SHAP (Shapley Additive EXplanations) value algorithm was employed to evaluate the interpretability of the LightGBM model, which demonstrated the best predictive performance. This evaluation aimed to analyze the impact of various logging parameters on the water-saturation model.

Figure 9 presents the global SHAP interpretation plot of logging parameters. The horizontal axis shows the distribution of the features’ impact on the model output, while the vertical axis sorts the features by the sum of SHAP values across all samples. Each point denotes a single sample, and the color indicates the feature value (red corresponds to high values, and blue corresponds to low values). It can be observed from the plot that:

The normalized SP (spontaneous potential) parameter contributes most to the model, followed by M2R3 (array induction resistivity), DEN (bulk density), DT (acoustic travel time), and CN (compensated neutron porosity).

SP is the most influential factor, owing to its indirect effect on water saturation via lithology, permeability, and formation water resistivity.

M2R3 is crucial for identifying fluids in the near-wellbore zone.

DEN and DT are closely related to reservoir porosity, which in turn affects water saturation.

Additionally, higher values of the normalized SP, M2R3, and neutron porosity (CN) are negatively correlated with water saturation. In contrast, those of bulk density (DEN), acoustic travel time (DT), and the photoelectric factor (PE) are positively associated, reflecting their respective roles in the model.

SHAP dependence plots further reveal the magnitude and direction of the impact of feature values on the model’s output, clarifying how individual features influence the model’s predictions [38]. Overall, the impact of each feature is relatively independent, with minimal interaction effects from other features:

For the normalized SP: When SP < 0, there is a positive correlation between SP and its SHAP values; when SP > 0, no correlation is observed (Figure 10a).

For M2R3: There is a negative correlation between M2R3 and its SHAP values. From the perspective of the interaction between M2R3 and SP, when the SP value is low, it has little impact on the SHAP values of M2R3 (blue points in Figure 10b).

For DEN and DT: There is an overall positive correlation between DEN and its SHAP values, as well as between DT and its SHAP values.

For the normalized DEN (corresponding to original values of 2.435 g/cm³–2.546 g/cm³) in the range of −1.5–1: As the parameter increases, the SHAP values rise sharply. When DEN exceeds 2.546 g/cm³, a threshold effect occurs, and SHAP values increase slowly thereafter.

For the normalized DT (corresponding to original values of 220.0 μs/m–233.6 μs/m) in the range of −0.4–1.0: As the parameter increases, the SHAP values rise sharply. When DT exceeds 233.6 μs/m, a threshold effect occurs, and the SHAP values increase slowly (Figure 10c,d). The study area is characterized by relatively high shale content, which significantly influences acoustic travel time (DT). When DT and bulk density (DEN) exceed their respective threshold values, shale content and rock matrix density become the dominant controlling factors, while the contribution of pore fluids is suppressed. In this case, even if water saturation varies, the resulting changes in overall DT and DEN are negligible. Consequently, the growth rate of SHAP values slows down, exhibiting a trend of “diminishing marginal contribution”.

5. Model Application

To further verify the effectiveness of the models, the three algorithms (PSO+XGBoost, PSO+LightGBM, and PSO+MERF) were applied to predict the water saturation of the tight oil reservoir in the Chang8 Member of Well F5085, located in the southern part of the Zhidan Area, Ordos Basin (Figure 11). The predicted water saturation is broadly consistent with that obtained from core analysis, and the PSO+LightGBM prediction is the closest to the core data. The coefficients of determination (R²) between the predicted and core-measured water saturation are 80.3%, 88.9%, and 87.8% for PSO+XGBoost, PSO+LightGBM, and PSO+MERF, respectively, and the corresponding S_wa values are 76.8%, 82.3%, and 81.4%. All of these exceed the values for the traditional Archie model (R² of 72.8% and S_wa of 67.5%; Table 3).
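As a reference for how two of these metrics are commonly computed, the sketch below evaluates RMSE and R² for a pair of core-measured and predicted saturation lists. The sample values are hypothetical and are not data from Well F5085. R² is returned as a fraction here, whereas the text and Table 3 quote it as a percentage. S_wa is not reproduced because its definition lies outside this section.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical water saturation values in percent (illustration only).
core_sw = [30.0, 40.0, 50.0, 60.0, 70.0]
pred_sw = [32.0, 39.0, 52.0, 58.0, 69.0]

print(f"RMSE = {rmse(core_sw, pred_sw):.3f}")
print(f"R2   = {r_squared(core_sw, pred_sw):.3f}")
```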

The PSO+XGBoost model failed to fit some data points with high water saturation in the 1986–1988 m interval, whereas the PSO+LightGBM and PSO+MERF models provided a better fit for water saturation in tight reservoirs.

Although the models successfully predicted wells outside the training dataset, they have a limitation: their performance depends on the similarity between the training and new datasets. If the test well has significantly different geological characteristics or its logging parameters fall outside the range covered by the training data, the models’ predictive performance may decline. Therefore, there is room for further optimization of the models’ accuracy and generalization ability.
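One simple way to screen a new well for this limitation is to check how many of its samples fall outside the value range seen during training for each logging parameter. The sketch below is a hypothetical illustration rather than part of the study's workflow; the function names and the sample values are assumptions, and a per-feature range check does not detect unusual combinations of otherwise in-range values.

```python
def training_ranges(rows):
    """Per-feature (min, max) over a list of {feature: value} samples."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_range_fraction(rows, ranges):
    """Fraction of samples, per feature, lying outside the training range."""
    counts = {name: 0 for name in ranges}
    for row in rows:
        for name, (lo, hi) in ranges.items():
            if not lo <= row[name] <= hi:
                counts[name] += 1
    return {name: count / len(rows) for name, count in counts.items()}


# Hypothetical log samples (DEN in g/cm3, DT in us/m); not data from the study.
train = [
    {"DEN": 2.40, "DT": 215.0},
    {"DEN": 2.55, "DT": 240.0},
    {"DEN": 2.48, "DT": 228.0},
]
new_well = [
    {"DEN": 2.45, "DT": 230.0},
    {"DEN": 2.60, "DT": 236.0},
    {"DEN": 2.50, "DT": 250.0},
    {"DEN": 2.52, "DT": 221.0},
]

ranges = training_ranges(train)
flags = out_of_range_fraction(new_well, ranges)
print(flags)
```

A high out-of-range fraction for any parameter suggests that predictions for that well rely on extrapolation and should be treated with more caution.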

6. Conclusions

(1): The PSO algorithm can quickly determine the optimal hyperparameter combination for the LightGBM model and establish a nonlinear mapping between water saturation and logging parameters. Evaluation metrics based on RMSE, R², and S_wa indicate that the PSO+LightGBM algorithm outperforms PSO+XGBoost and PSO+MERF in predicting water saturation in tight reservoirs.
(2): The SHAP (SHapley Additive exPlanations) algorithm was used to conduct an interpretability analysis of the constructed PSO+LightGBM model. The results indicate that five logging parameters, namely SP (Self-Potential), M2R3, DEN (Density Log), DT (Acoustic Travel Time Log), and CN (Neutron Porosity Log), have the most significant impact on the water saturation prediction model.
(3): Application results demonstrate that the PSO+LightGBM water saturation prediction model constructed in this study exhibits excellent generalization performance in low-permeability tight reservoirs, making it a reliable and efficient prediction method. This model has considerable practical potential and offers a novel technical approach for evaluating tight sandstone reservoirs.

Author Contributions

C.T.: Conceptualization, Formal analysis, Investigation. L.L. (Lusheng Li): Writing (original draft). L.X.: Writing (review). Q.W.: Writing (review and editing). H.D.: Resources. S.K.: Resources. W.L.: Programming. X.D.: Programming. L.L. (Ling Liu): Writing (review and editing). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Lusheng Li, Hailong Dang, Shengsong Kang, Weiwei Liang and Ling Liu were employed by Research Institute of Shaanxi Yanchang Petroleum (Group) Co., Ltd. Author Xu Dong was employed by Zhidan Oil Production Plant, Yanchang Oilfield Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Wang, Z.; Fan, Z.; Zhang, X.; Liu, B.; Chen, X. Status, Trends and Enlightenment of Global Oil and Gas Development in 2021. Pet. Explor. Dev. 2022, 49, 1210–1228.
2. Qu, J.; Ding, X.; Zha, M.; Chen, H.; Gao, C.; Wang, Z. Geochemical Characterization of Lucaogou Formation and Its Correlation of Tight Oil Accumulation in Jimsar Sag of Junggar Basin, Northwestern China. J. Pet. Explor. Prod. Technol. 2017, 7, 699–706.
3. Fu, J.; Wang, L.; Chen, X.; Liu, J.; Hui, X.; Cheng, D. Progress and prospects of shale oil exploration and development in the seventh member of Yanchang Formation in Ordos Basin. China Pet. Explor. 2023, 28, 1–14. (In Chinese)
4. Fu, S.; Fu, J.; Niu, X.; Li, S.; Wu, Z.; Zhou, X.; Liu, J. Accumulation conditions and key exploration and development technologies of Qingcheng Oilfield. Acta Pet. Sin. 2020, 41, 777–795. (In Chinese)
5. Tao, S.; Hu, S.; Wang, J.; Bai, B.; Pang, Z.; Wang, M.; Chen, Y.; Chen, Y.; Yang, Y.; Jin, X.; et al. Formation conditions, enrichment regularities and resource potentials of continental tight oil in China. Acta Pet. Sin. 2023, 44, 1222–1239. (In Chinese)
6. Zhou, X.; Zhang, C.; Zhang, Z.; Zhang, R.; Zhu, L.; Zhang, C. A Saturation Evaluation Method in Tight Gas Sandstones Based on Diagenetic Facies. Mar. Pet. Geol. 2019, 107, 310–325.
7. Pan, H.-J.; Wei, C.; Yan, X.-F.; Li, X.-M.; Yang, Z.-F.; Gui, Z.-X.; Liu, S.-X. 3D Rock Physics Template-Based Probabilistic Estimation of Tight Sandstone Reservoir Properties. Pet. Sci. 2024, 21, 3090–3101.
8. Wu, J.; Luo, R.; Lei, C.; Yin, J.; Chen, X. Prediction of water saturation in tight sandstone reservoirs from well log data based on the large language models (LLMs). Nat. Gas Ind. 2024, 44, 77–87. (In Chinese)
9. Ding, S.; Yang, S.; Lu, W.; Luo, R.; Zhu, L.; Gu, Y.; Chen, X. Robust prediction for water saturation based on strategy of light gradient boosting machine. Prog. Geophys. 2023, 38, 185–200. (In Chinese)
10. Yang, G.; Ren, Z.; Qi, K. Research on Diagenetic Evolution and Hydrocarbon Accumulation Periods of Chang 8 Reservoir in Zhenjing Area of Ordos Basin. Energies 2022, 15, 3846.
11. Wang, X.; Yang, S.; Zhao, Y.; Wang, Y. Improved Pore Structure Prediction Based on MICP with a Data Mining and Machine Learning System Approach in Mesozoic Strata of Gaoqing Field, Jiyang Depression. J. Pet. Sci. Eng. 2018, 171, 362–393.
12. Lai, F.; Li, Z.; Zhang, W.; Dong, H.; Kong, F.; Jiang, Z. Investigation of Pore Characteristics and Irreducible Water Saturation of Tight Reservoir Using Experimental and Theoretical Methods. Energy Fuels 2018, 32, 3368–3379.
13. Fu, J.; Chen, M.; Chen, L.; Shao, R.; Li, Y.; Chen, Z.; Xin, J.; Pan, Y. Reservoir Permeability Prediction Method Based on Fuzzy Clustering and Machine Learning. Chem. Technol. Fuels Oils 2025, 60, 1518–1527.
14. Behdad, A.; Cuddy, S. Water Saturation Modeling in Carbonate Reservoirs Using the Bulk Volume Water Approach. J. Pet. Explor. Prod. Technol. 2025, 15, 105.
15. Okon, A.N.; Adewole, S.E.; Uguma, E.M. Artificial Neural Network Model for Reservoir Petrophysical Properties: Porosity, Permeability and Water Saturation Prediction. Model. Earth Syst. Environ. 2021, 7, 2373–2390.
16. Gad, M.; Mahmoud, A.A.; Panagopoulos, G.; Kiomourtzi, P.; Kirmizakis, P.; Elkatatny, S.; bin Waheed, U.; Soupios, P. Predicting Water Saturation in a Greek Oilfield with the Power of Artificial Neural Networks. ACS Omega 2025, 10, 557–566.
17. Baziar, S.; Shahripour, H.B.; Tadayoni, M.; Nabi-Bidhendi, M. Prediction of Water Saturation in a Tight Gas Sandstone Reservoir by Using Four Intelligent Methods: A Comparative Study. Neural Comput. Appl. 2018, 30, 1171–1185.
18. Sutiadi, A.; Taufiq Fathaddin, M. Estimating the Porosity and Initial Water Saturation in South Structure of X Field Using Artificial Neural Network. IOP Conf. Ser. Earth Environ. Sci. 2025, 1451, 012032.
19. Shehata, A.A.; Ahmed, M.; Kassem, A.A.; Abdelrehim, R.; Tsuji, T.; Ismail, A. Optimizing Permeability and Porosity Prediction with Advanced Machine Learning: A Case Study Unlocking the Complexities of Late Cretaceous Reservoirs, Gulf of Suez, Egypt. J. Afr. Earth Sci. 2025, 228, 105670.
20. Singh, H.; Seol, Y.; Myshakin, E.M. Prediction of Gas Hydrate Saturation Using Machine Learning and Optimal Set of Well-Logs. Comput. Geosci. 2021, 25, 267–283.
21. Akbari, A.; Rahimi, M. Estimation of water saturation in an oil reservoir using nine different machine learning techniques: A case study. Geosystem Eng. 2025, 1–27.
22. Wang, Y.; Zheng, L.; Chen, G.; Kong, M.; Yuan, L.; Wang, B.; Hu, L.; Jiang, T.; Zhou, F. A Genetic Particle Swarm Optimization with Policy Gradient for Hydraulic Fracturing Optimization. SPE J. 2024, 30, 560–572.
23. Li, M.; Li, W.; Gu, M.; Wu, S.; Wang, P.; Wang, Y.; Cao, Q.; Xu, Z.; Hao, Y. Reservoir Characteristics and Shale Oil Enrichment of Shale Laminae in the Chang 7 Member, Ordos Basin. Energies 2025, 18, 5342.
24. Pang, Q.; Hu, G.; Hu, C.; Meng, F.; Wang, B.; Zhang, J. The Lithofacies of Sandstones Interbedded with Shales: Implication for Organic Matter Accumulation of Triassic Deep Lacustrine Setting, Southern Ordos Basin. ACS Omega 2024, 9, 23266–23282.
25. Yang, Z.; Wu, S.; Zhang, J.; Zhang, K.; Xu, Z. Diagenetic Controls on the Reservoir Quality of Tight Reservoirs in Digitate Shallow-Water Lacustrine Delta Deposits: An Example from the Triassic Yanchang Formation, Southwestern Ordos Basin, China. Mar. Pet. Geol. 2022, 144, 105839.
26. Wang, W.; Dang, H.; Kang, S.; Xiao, Q.; Ding, L.; Shi, L. Porosity Prediction of Tight Oil Reservoirs Based on LightGBM and SHAP Algorithms. Oil Gas Geol. Recovery 2025, 32, 90–99. (In Chinese)
27. Zhang, Y.; Shi, B.; Zhang, Y.; Shi, H.; Wen, W.; Zhang, Y. Application of Machine Learning for Porosity Estimation of Beach and Bar Sand Bodies in a Lacustrine Basin: A case study of the Lower Cretaceous strata in Chepaizi area, Junggar Basin, NW China. Acta Sedimentol. Sin. 2023, 41, 1559–1567.
28. Davoudi, A.; Kalantariasl, A.; Parsaei, R.; Parsaei, H. Estimating Permeability Impairment Due to Asphaltene Deposition during the Natural Oil Depletion Process Using Machine Learning Techniques. Geoenergy Sci. Eng. 2023, 230, 212225.
29. Chen, T. XGBoost: A Scalable Tree Boosting System; Cornell University: Ithaca, NY, USA, 2016.
30. Mahayana, D. Data-Driven LightGBM Controller for Robotic Manipulator. IEEE Access 2024, 12, 40883–40893.
31. Mwakipunda, G.C.; Komba, N.A.; Kouassi, A.K.F.; Ayimadu, E.T.; Mgimba, M.M.; Ngata, M.R.; Yu, L. Prediction of Hydrogen Solubility in Aqueous Solution Using Modified Mixed Effects Random Forest Based on Particle Swarm Optimization for Underground Hydrogen Storage. Int. J. Hydrog. Energy 2024, 87, 373–388.
32. Krennmair, P.; Schmid, T. Flexible Domain Prediction Using Mixed Effects Random Forests. J. R. Stat. Soc. Ser. C Appl. Stat. 2022, 71, 1865–1894.
33. Wang, D.; Tan, D.; Liu, L. Particle Swarm Optimization Algorithm: An Overview. Soft Comput. 2018, 22, 387–408.
34. Chu, C.C.F.; Chan, D.P.K. Feature Selection Using Approximated High-Order Interaction Components of the Shapley Value for Boosted Tree Classifier. IEEE Access 2020, 8, 112742–112750.
35. Dai, Z.; Li, S.; Hu, B.; Kong, X.; Zhang, J.; Zhu, B.; Wei, Q. Machine Learning-Based Prediction of the Migration Range of Dissolved CO₂ in Deep Saline Aquifers: SHAP Interpretation and Engineering Insights. Energy Fuels 2025, 39, 18924–18934.
36. He, Z.; Yang, Y.; Fang, R.; Zhou, S.; Zhao, W.; Bai, Y.; Li, J.; Wang, B. Integration of Shapley Additive Explanations with Random Forest Model for Quantitative Precipitation Estimation of Mesoscale Convective Systems. Front. Environ. Sci. 2023, 10, 1057081.
37. Nazari, H.; Hajizadeh, F. Prediction of Oil Reservoir Porosity Using Petrophysical Data and a New Intelligent Hybrid Method. Pure Appl. Geophys. 2023, 180, 4261–4274.
38. Jialong, L.; Yuanku, M. Machine learning applications in distinguishing granite genesis types. China J. Geol. 2025, 60, 1509–1529.

Figure 1. Structural location of the study area.

Figure 2. Workflow for water saturation prediction using machine learning.

Figure 3. Correlation matrix of water saturation and logging parameters.

Figure 4. Histogram of water saturation from core analysis.

Figure 5. Predicted and core water saturation for PSO+MERF. (a) Train set; (b) Test set.

Figure 6. Predicted and core water saturation for PSO+LightGBM. (a) Train set; (b) Test set.

Figure 7. Predicted and core water saturation for PSO+XGBoost. (a) Train set; (b) Test set.

Figure 8. Violin plots comparing the different water saturation models.

Figure 9. Global explanation diagram of the water saturation prediction model.

Figure 10. SHAP dependence plots for selected features of the water saturation prediction model.

Figure 11. Comparison of different water saturation prediction models for well F5085 in southern Zhidan.

Table 1. Hyperparameter optimization and computational efficiency of the three models.

Model	Core Parameter	Search Range	Optimal Value	Training Duration (s)
PSO+XGBoost	Learning rate	0.01–0.5	0.1013	0.070
	Max depth	2–256	57
	Minimum samples per leaf	1–100	49.3
	Number of trees	50–500	345
	Gamma	0–10	9.035
PSO+LightGBM	Learning rate	0.01–0.5	0.1466	0.034
	Max depth	2–256	169
	Minimum samples per leaf	1–100	8
	L1 regularization term	0–20	17.818
	L2 regularization term	0–20	13.46
PSO+MERF	Max depth	2–256	241	2.184
	Minimum samples per leaf	1–100	24
	Number of trees	50–500	241
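The way a particle swarm explores bounded search ranges such as those in Table 1 can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the swarm size, iteration count, inertia and acceleration coefficients, and the function names are all assumptions, and `toy_validation_error` is a hypothetical smooth stand-in for the validation error of a trained model, with an arbitrarily chosen minimum.

```python
import random


def pso_minimize(objective, bounds, n_particles=30, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimize objective over box bounds with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_cost = [objective(p) for p in pos]     # personal best costs
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]  # global best so far
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and clip it to the search range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            cost = objective(pos[i])
            if cost < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], cost
                if cost < g_cost:
                    g_pos, g_cost = pos[i][:], cost
    return g_pos, g_cost


# Hypothetical stand-in objective with its minimum at (0.2, 30).
def toy_validation_error(params):
    learning_rate, min_leaf = params
    return ((learning_rate - 0.2) / 0.49) ** 2 + ((min_leaf - 30.0) / 99.0) ** 2


# Search ranges for learning rate and minimum samples per leaf (Table 1).
bounds = [(0.01, 0.5), (1.0, 100.0)]
best_params, best_error = pso_minimize(toy_validation_error, bounds)
print(best_params, best_error)
```

In a real workflow the objective would train a model with the candidate hyperparameters and return a validation error such as RMSE, and integer-valued parameters such as tree depth or number of trees would need to be rounded before being passed to the model.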

Table 2. S_wa of different water saturation models.

Model	Training Set S_wa (%)	Testing Set S_wa (%)
PSO+XGBoost	89.4	81.5
PSO+LightGBM	91.8	85.2
PSO+MERF	94.9	83.3

Table 3. Comparison of Different Water Saturation Models and Their Prediction Effects.

Model	Core Plugs (pieces)	Core Analysis Water Saturation Range (%)	Core Analysis Water Saturation Mean (%)	Predicted Water Saturation Range (%)	Predicted Water Saturation Mean (%)	R² (%)	S_wa (%)
PSO+XGBoost	118	26.65–83.85	50.43	33.9–74.5	56.54	80.3	76.8
PSO+LightGBM				28.7–80.7	50.64	88.9	82.3
PSO+MERF				33.4–78.3	50.51	87.8	81.4
Archie				22.4–89.6	55.22	72.8	67.5

